Infrastructure-Aided Localization and State Estimation for Autonomous Mobile Robots

: A slip-aware localization framework is proposed for mobile robots experiencing wheel slip in dynamic environments. The framework fuses infrastructure-aided visual tracking data (via ﬁsheye lenses) and proprioceptive sensory data from a skid-steer mobile robot to enhance accuracy and reduce variance of the estimated states. The slip-aware localization framework includes: the visual thread to detect and track the robot in the stereo image through computationally efﬁcient 3D point cloud generation using a region of interest; and the ego motion thread which uses a slip-aware odometry mechanism to estimate the robot pose utilizing a motion model considering wheel slip. Covariance intersection is used to fuse the pose prediction (using proprioceptive data) and the visual thread, such that the updated estimate remains consistent. As conﬁrmed by experiments on a skid-steer mobile robot, the designed localization framework addresses state estimation challenges for indoor/outdoor autonomous mobile robots which experience high-slip, uneven torque distribution at each wheel (by the motion planner), or occlusion when observed by an infrastructure-mounted camera. The proposed system is real-time capable and scalable to multiple robots and multiple environmental cameras.


Introduction
Navigating mobile robots in dynamic environments with human presence makes visual odometry challenging due to occlusion and dynamic features. This necessitates multi-modal (e.g., camera, LiDAR, inertial) data fusion to identify and remove the dynamic features for feature-based localization [1,2], address disturbance and model mismatch challenges for LiDAR based localization [3,4], or tackle perceptually degraded conditions through distributed estimation [5,6]. In this regard, multi-modal state estimation approaches for mobile robots [7,8] are revolutionizing accurate navigation for indoor applications (e.g., warehouse robotics or service robots using on-board sensors) where the loss of reception and low bandwidth of commercial Global Navigation Satellite Systems (GNSS), inhibit reliable robot state measurements.
One of main challenges for the the existing multi-modal state estimators that utilize on-board inertial measurement unit (IMU) data and visual odometry through monocular/stereo cameras is the wheel slip in the longitudinal and lateral directions. This is due to: (i) Model uncertainties caused by the wheel force saturation in the robot dynamical model (by various robot payloads, changing surface conditions, or harsh cornering scenarios) impacting estimation error and update frequency in real-time [9][10][11]; and (ii) The real-time performance of state estimators for safe motion planning and controls in a scene with dynamic features [12,13]. Infrastructure-aided state estimation approaches which leverage visual/radar data measured by fixed sensors and communication with the robot are proposed in the literature to deal with perceptually degraded conditions and dynamic features for navigation of mobile autonomous systems [14][15][16]. This is cost effective as it reduces the number of on-board sensors, specially for large-scale networked robotic systems. In [17], cameras installed on the ceiling detect multiple robots with unique markers and determine their position and heading states based on the distance to fixed markers on the ground and known marker sizes. A stationary fisheye camera installed on the ceiling is used in [14] for indoor robot localization, in which the pose is determined based on the azimuth and elevation of the line of view (to the center of the segmentation). Multiple fixed surveillance cameras are used in [18] to detect the robot and static objects to construct a 2D map. Pose data from low-cost cameras mounted on ceiling is fused with on-board LiDAR odometry data for robot state estimation in [19] where the fusion of camera and odometry is performed in a map with an adaptive Monte Carlo approach. The existing infrastructure-aided localization approaches require visual markers or free line of sight to the robot [17,19], heavily rely on robot model, and are challenged by occluded scenes and model uncertainties due to the wheel slip.
In order to compensate for the wheel longitudinal/lateral slip in robots with nonholonomic constraints, kinematic-or dynamic-based slip estimation/compensation methods have been adopted in the literature [20,21] using on-board sensory data. The dynamic-based approaches require wheel stiffness properties and vertical forces that may change due to various payloads and road surface conditions [22]. Kinematic-based methods, on the other hand, use wheel odometry and inertial data to estimate the slip with upper bounded mean square estimation error (MSE) through nonlinear or stochastic observers [12,23,24]. A highgain observer is designed in [25] to deal with unknown model parameters. To avoid model complexities due to tire force nonlinearities (and the combined-slip effect), an empirical parameterized kinematic model is proposed in [26] for robot state estimation. An event-based Kalman observer is designed in [27] to fuse IMU data and wheel odometry for heading and speed estimation. However, the information from on-board state observers has not been used for fusion with infrastructure sensing units to enhance reliability of the pose estimation. In addition, the computational efficiency and accuracy are main challenges for the existing infrastructure-mounted visual tracking and localization methods that use low-cost wide-angle lenses.
To address computational time and accuracy challenges of the existing visual and kinematic/dynamic model based localization methods (to be executed on embedded systems and robot's on-board processing units), this paper develops and experimentally verifies a cooperative state estimator using: (i) Proprioceptive data from low-cost odometry sensors of a skid-steer mobile robot; and (ii) Region of Interest (ROI)-based processing and visual tracking on the 3D point clouds obtained from fixed sensing units. The main contributions of the paper are summarized as:

•
Design of a computationally efficient ROI-based pose estimator using 3D point clouds from a stationary stereo camera with a wide-angle (fisheye) lens. • Developing an infrastructure-aided localization framework which is scalable for large systems with multiple robots using communication between a slip-aware onboard observer and the stationary sensing unit.

Background and System Overview
The localization framework includes visual tracking through forming an ROI for computationally enhanced processing at the edge (e.g., embedded Jetson Xavier) and a slip-aware state observer at the robot using proprioceptive data. The visual tracking is through a fixed low-cost stereo camera, Intel Realsense T265. As illustrated in Figure 1, the system has independent visual tracking thread and ego motion thread. The state vector is defined by for the proposed framework, where the longitudinal position, lateral position, and heading of the robot in the reference fixed frame {W} is denoted by x, y, and θ, respectively. The local robot body frame is denoted by {b}, which is at the geometrical center of the robot and is depicted in Figure 2. The reference coordinate system {W} is derived from {b} at time zero t 0 . The visual tracking thread estimates the robot poseξ v based on the captured images of the stationary stereo camera in the environment. The occlusion cases, in which visual-based pose estimates are intermittent (or not available), will be addressed by the Covariance Intersection (CI) fusion with the estimated statesξ p from the slip-aware motion model. The updated pose by CI is then used as a corrected pose for the relative motion prediction in the next sample time. The robot pose is a time-varying transformation is about the Z-axis of the {W}, and the position vector W p b = [x, y, 0] with x, y is the longitudinal and lateral robot position in the reference frame {W}.

Visual Tracking Thread
The visual tracking thread includes frontend and backend modules as illustrated in Figure 3. The frontend performs image processing and object detection. In the image processing step, the stereo image pair is undistorted and rectified. The object detection generates a boundingbox for the robot within the rectified stereo images. The area in the images enclosed with the boundingbox is termed as region of interest. The undistorted and rectified images, and image coordinates of the corresponding bounding box are used in the backend to localize the robot using the 3D position of points on the robot.  With the assumption of the pinhole model and known extrinsic parameters, the constraints for the projection of point clouds in {W} onto the two image planes are derived. These constraints are described with epipolar geometry, and determine the area in the image planes where the same point in {W} is mapped on. Figure 4 illustrates the epipolar geometry for two non rectified images. The projection of the point m into the camera centers C 1 and C 2 defines the epipolar plane which intersects the image plane P 1 and P 2 forming epipoles e 1 and e 2 for the left/right images. The homogeneous transformation T = [R, p] ∈ SO(3) with the rotation matrix R and translation vector p between the camera centers describes the extrinsic parameters [28].
are the extended camera matrix for the left and right image planes. The images have the same focal length in X and Y direction as well as the same principal point in Y direction; they are geometrically shifted with the baseline b in X direction. The radial distance r for perspective pinhole projection between the principal point and image coordinates of incoming ray of the point m is r = √ u 2 + v 2 and the angle Ψ between the principal axis and the ray is Ψ = tan −1 (r). The radial fisheye distortion factor Ψ d is modeled [29] as which are then converted into undistorted image coordinates Subsequent to this, a Yolov4 object detector [30] is used for 2D detection of the robot in the undistorted left image. The Yolov4 model is trained on a custom collected dataset of the robot for identification of the robot as a class label since the state-of-the-art COCO class labels have no training data corresponding to the robot. of the image. This enables an ROI which will be used to extract a frustum of the point cloud representing the robot. Point cloud processing will then be applied exclusively to the ROI-based frustum, i.e., interior Int(B d (k)). This bounding box-informed frustum significantly reduces the computational cost compared to processing the point cloud as a whole.

Point Cloud Computation and Post Processing
The feature extraction is restricted to the ROI, Int(B d (k)), and is scalable for visual tracking in multi-robot settings. The robot is depicted inside the ROI in the left and right image plane of the undistorted and rectified images. The aim is to find the image coordinates u l and v (in the left image) and u r (in the right image plane) of the world point m, as see in Figures 5 and 6.  For feature extraction, ORB features [31,32] were used, where the extracted features are matched within the stereo image pair and between subsequent captured image pairs. It is assumed that the remaining image coordinates represent the same point on the robot platform, then, these points' 3D coordinates are reconstructed. Based on the epipolar geometry, the depth z = b f u l −u r is computed for each match with the horizontal image coordinates u l and u r of the left and right stereo image and the baseline b, as well as the focal length f of the camera, then the depth is used for x = u l f z and y = v l f z with the vertical image coordinate v l of the left stereo image plane as illustrated in Figure 7. The coordinates are computed for every match and transformed into {W}. All points lead to a point cloud assumed to be derived from the surface of the robot. The point cloud is processed with the PCL library [33,34] and a statistical outlier filter. The filter rejects points that are further away from their neighbors compared to the average of the point cloud. The input parameters are the number of neighbors to calculate the average distance for a given point and a ratio to set the threshold based on the standard deviation across the point cloud.
The 2D projection of the point cloud is used to enhance the reliability of the 3D point clouds for navigating the robot far from the stationary sensing unit (i.e., the stereo visual node). The Euclidean center of the 2D points (which is less sensitive to outliers) is considered as an estimate of the position, i.e.,ξ p (k) at time step t k and will be corrected using the slip-aware motion model, which is described in the next section.  visual node). The Euclidean center of the 2D points (which is less sensitive to outliers) is 150 considered as an estimate of the position, i.e.,ξ p (k) at time step t k and will be corrected 151 using the slip-aware motion model, which is described in the next section.

152
For orientation estimation, a linear regressor is used for a moving horizon N h of the 153 estimated states. The angle between the estimated linear function and the world frame's 154 longitudinal axis is then considered as the orientation of the robot. To cope with situations 155 when the robot is not driving or turning with zero radius, a plausibility check is applied. 156 The plausibility check rejects estimates if the linear regression is too short or the distance 157 between the position estimates and the line is greater than a threshold.

159
A kinematic model is introduced and parametrized to predict the motion in presence 160 of wheel skidding and slipping. A covariance intersection (CI) method is then used to 161 update the prediction. The autonomous mobile robot used to evaluate the localization approach is the skidsteer Clearpath's Jackal robot, which is subject to the large wheel longitudinal slip in various cornering scenarios. The kinematic motion model in the following predicts the robot states using the heading and wheel rotational speed in the robot body frame {b}. The robots motion is defined based on the instantaneous center of rotation (IC) as shown in Fig. 8, assuming that the robot is a rigid body and has a planar motion with nonholonomic constraints. The longitudinal velocity, lateral velocity, and yaw rate are denoted by ν x , ν y , andθ in the body frame {b} and are expressed in terms of the left/right wheel rotational speeds ω l , ω r as For orientation estimation, a linear regressor is used for a moving horizon N h of the estimated states. The angle between the estimated linear function and the world frame's longitudinal axis is then considered as the orientation of the robot. To cope with situations when the robot is not driving or turning with zero radius, a plausibility check is applied. The plausibility check rejects estimates if the linear regression is too short or the distance between the position estimates and the line is greater than a threshold.

Infrastructure-Aided State Estimation
A kinematic model is introduced and parametrized to predict the motion in presence of wheel skidding and slipping. A covariance intersection (CI) method is then used to update the prediction.

Slip-Aware Motion Model
The autonomous mobile robot used to evaluate the localization approach is the skid-steer Clearpath's Jackal robot, which is subject to the large wheel longitudinal slip in various cornering scenarios. The kinematic motion model in the following predicts the robot states using the heading and wheel rotational speed in the robot body frame {b}. The robot's motion is defined based on the instantaneous center of rotation (IC) as shown in Figure 8, assuming that the robot is a rigid body and has a planar motion with nonholonomic constraints.
The longitudinal velocity, lateral velocity, and yaw rate are denoted by ν x , ν y , andθ in the body frame {b} and are expressed in terms of the left/right wheel rotational speeds where v(t) = [ν x (t), ν y (t),θ(t)] , the wheel rolling radius is denoted by R e , and G(Λ) includes the model parameter vector Λ = [x IC , y IC,l , y IC,r , α l , α r ] as follows where IC, l is the instantaneous center of the front-left and rear-left tires of the robot and IC, r denotes the instantaneous center of the front-right and rear-right tires of the robot.
In the schematic provided in Figure 8, due to nonholonomic constraints and since the longitudinal speed on the right side (i.e., rotational speed multiplied by the effective rolling radius R e ) is larger than the robot speed v x , the instantaneous center IC, r is located on the right side (i.e., y IC,r < 0 in the body frame).  [26]. The IC locations for the left and right wheels are expressed in {b} as (x IC,l , y IC,l ) and (x IC,r , y IC,r ), respectively. It is assumed that the longitudinal position of ICs along the x-axis lie all on a parallel line to the Y-axis, i.e., x IC = x IC,v = x IC,j = ν ẏ θ , j ∈ {l, r} and have the same angular velocity. The lateral IC locations, which are bounded variables, are expressed as [21]: where α l and α r are parameters accounting for model uncertainties (tire inflation and longitudinal slip ratios at each corner of the robot) and R e is the tire rolling radius. The location of IC is bounded, i.e., |x IC,v | <x and |y IC,v | <ȳ even reached in the proximity of straight trajectories where the numerator and denominator in (6) are of the same infinitesimal order which leads to finite values for x IC , y IC,j . The boundedness of y IC,v need to be guaranteed for lateral stability and minimizing the robot's sideslip angle in harsh turning. Using the transformation between {b} and the world frame, the robot states in {W} are expressed aṡ where θ(t) is the robot heading and ∈ R 3 represents model uncertainties. Then, the parameter identification process consists of two steps: gathering representative data from on-board and infrastructure-mounted sensory data; and developing an optimization program to find the optimal parameter vector Λ * through data set. The data collection consists of typically fast maneuvers on various surfaces in different trajectories based on the operational envelope of the mobile robot maintaining the lateral stability. The lateral stability is defined by a bounded sideslip angle |β| <β where β tan −1 ( training horizon d t . The measured wheel speeds of each segment are used to predict the robot speeds in the body frame using (4) and determine the robot pose in {W} using (7). The predicted poseξ ∈ R 3 and the ground truth at the end of each segment are included in the cost function whereξ(Λ) is the ground truth and ξ(k) is the predicted state based on the linearized slip-aware motion model in discrete times. Minimizing J results in the optimal parameter vector Λ * . The trained model is evaluated over different data sets with the evaluation horizon d e . In this context, the evaluation horizon represents the prediction horizon for specific applications. The evaluation horizon is the indication of the prediction horizon of the model in the application. Assessing variable evaluation horizons with respect to variable training horizons is reveals the impact of different prediction horizons in the application compared to the parameter identification process. To analyse the impact of different training and evaluation horizons, the mean relative translation/rotation errors are provided in Figures 9 and 10. The analysis reveals that the best performance is achieved if the evaluation horizon is equal to (or less than) 0.5 m. The error increases for larger deviation but remains bounded and lower than 5%.

Pose Prediction
The prediction model in (7), with elements from (4)- (6), is linearized around the operating point (ξ p (k), w(k)) at each time step k in discrete times, where ξ p (k) = [x(k), y(k), θ(k)] is the robot's pose by the ego motion thread. The linear affine prediction model can be written as: whereas the zero-mean term is due to model uncertainties. The discrete-time realization is approximated by and whereas A c , B c are the continuous-time system and input matrices of the linearized prediction model, and φ A c t i ,t j for t i > t j is the continuous-time state transition matrix expressed by the Peano-Baker series; the realization is assumed to not vary a lot in each interval [t k , t k+1 ], which is valid for the proposed cooperative mobile robot localization model with the sampling time T s = 25 ms. As a result, the bound of uncertainty due to the sampling time for discretization (in the slip-aware motion model) at the maximum speed of 1 m/s, at which the robot may experience wheel longitudinal slip, is 25 mm. Then, the expected state prediction from the ego motion thread is whereasξ p (k) = E{ξ p (k)} andw(k) = E{w(k)}; the joint covariance for x = [ξ p (k), w(k)] is then given by The predicted covariance is in which Then, by using cov(x), the predicted covariance from the slip-aware motion model yields:

Augmented Localization
The visual thread and the ego motion thread communicate within the ROS framework through WiFi for the specific mobile robot test platform. To ensure proper data synchronization, time stamps are used to associate the visual-based localization (i.e., state estimation ofξ v (k)) to the corresponding pose estimationξ p k by the slip-aware model description. Delay in the communication, which is less than 20 ms for the tests conducted within 10 m of the stationary visual node (i.e., infrastructure-mounted stereo camera with the fisheye lens), is ignored in this section for the CI fusion. This is a valid assumption considering the sampling time T s = 25 ms for the pose prediction in the slip-aware motion model, the fusion part's sampling time (i.e., 50 ms), and the maximum robot speed of 1 m/s at which the robot may experience wheel's longitudinal slip. Denoting the estimation error in the slip-aware motion model at time step k byξ p (k) = ξ p (k) −ξ p (k), and the visual thread byξ v (k) = ξ v (k) −ξ v (k), we utilize the covariance intersection method having the upper bound of the mean square estimation error and the consistency condition.

Remark 2.
The asymptotic stable state transition matrix of the error dynamicsξ p in the motion model (9), and the geometrical filters for the visual-based depth estimation guarantee that the mean square estimation error (MSE) for the pose prediction model and the visual localization are upper bounded, i.e.,Q p (k) As a result, the error covarianceQ v (k) andQ p (k) of the estimated states from the two threads are consistent.
The estimated states from the ego motion thread and the visual thread are then fused using CI which is a convex combination of the covariances of the estimated states and guarantees a consistent error covariance (i.e.,Q f ≤ Q f ). The CI is a geometric interpretation ofQ in which for all choices ofQ pv , the covariance ellipses of the bound Q f at level c, lies within the intersection of covarinace ellipses of Q p and Q v , i.e., E c The weights W p , W v will be obtained by minimizing a performance index on the bound Q f , e.g., tr(Q f ) or det(Q f ), and consequently the covarianceQ f . The CI update strategy finds Q f which encloses the intersection area E c Q p ∩ E c Q v and is consistent, although no knowledge about Q pv is available. The upper bounds of the covariance matrix elements for visual pose estimates is set to constant values derived from the error analysis (discussed in the next section) For the case whereQ pv = 0, the covariance Q f can be given by in which the optimal W p , W v that minimize tr(Q f ) is obtained from the following constrained optimization program where I is the identity matrix with the proper dimension. The trace minimization program in (20) yields As a result, the fusion of the estimated states from the ego motion and the visual threads iŝ where W p ∈ [0, 1] adjusts the assigned weights toξ p andξ v minimizing the performance index tr(Q f ) of the updated covariance.
According to the consistency in Remark 2 and the property of CI, it holds that The heading of the robot is fused once the robot is close to the camera, thus, measurements are more accurate and reliable. The slip-aware observer and fusion is described in Algorithm 1.

Algorithm 1: Augmented Slip-Aware Localization
Input : Stereo image (with fisheye lens distortion), robot's wheel speed, and initial estimateξ f (0) Output : Robot position and heading statesξ f (k) Recheck for occlusion in long distances z i ≥ z th end

Experiments and Discussion
The proposed infrastructure-aided localization framework is experimentally evaluated in this section through tests with harsh turning, cornering with acceleration/deceleration, and accelerated straight maneuvers which all include longitudinal slip at each wheel. The reference measurement and system setup is first discussed, then the experimental evaluations are provided. The wheel slip during harsh cornering, with nonholonomic constraints, results is reduced pose estimation accuracy for the existing odometry-based motion models which rely on wheel rotational speed. This has been addressed in this paper by the proposed slip-aware motion model (considering instantaneous centers of rotation) and the a multi-modal data fusion with the visual thread (even with distortion challenges imposed by low-cost fisheye lens).
The ground truth trajectory is recorded with the optical motion capture system Vicon Vantage V5. The test setup is composed of the Vicon system, the autonomous mobile robot (Jackal AGV), and the stationary stereo camera T265. The T265 is fixed mounted on a tripod at a height of 2 m and capturing the whole area where the tests are conducted. The robot is operating under the normal path-tracking mode and starting in front of the tripod of T265, with the speed between 0.4 and 1 m/s, and mild and harsh cornering in tight and wide trajectories. In the proposed motion model, the wheel slip is indirectly quantified as a kinematic model parameter.
To detect the robot and initial setup of the stereo camera in the environment, passive markers are mounted on top of the robot and the stationary stereo camera, as shown in Figure 11, having sufficient distance for a rotation invariant geometry which is essential to ensure a unique pose and proper localization results using the Vicon system. Figure 12 shows the visual point cloud of the robot detected under occlusion (by a human/user) in a long distance.   The ROI-based point cloud processing, which generates point cloud within the 2D bounding box of the detected robot, reduced computation time up to 67% as has experimentally been tested with the robot in dynamic indoor environments with human presence. The processing time for the pose prediction based on the slip-aware motion model is almost <5 ms. There is no exhaustive recursive algorithm associate with the motion model part. The visual thread with the ROI-based processing takes up to 16 ms in various harsh turn and random cornering maneuvers. The fusion part with the trace optimization program on the visual and motion threads take up to 35 ms on the utilized embedded system in dynamic environments with human presence.
The position estimation error by the stereo visual thread is shown in Figure 13 for a maneuver with several tight cornering. The largest error of 21 cm is for the situation in which the robot is occluded (by a human/user in a shared working indoor environment) in a far (i.e., 7.8 m) distance. The slip-aware motion model helps CI to recover the robot pose guaranteeing consistency of the estimation error covariance, i.e., E{(  The heading fusion result is depicted in Figure 14, where the heading prediction by the ego motion thread (without visual thread updates) is shown in dotted lines; this heading has large estimation error due to the harsh cornering scenarios and inaccuracies in the position of instantaneous center for the slip-aware ego motion model. The prediction fused with pose update from the visual thread in Figure 14 confirms better performance even with occlusion in this perceptually degraded test. This is due to the fact that the heading estimator (by the visual thread) employs multiple geometrical and nonholonomic constraints for the robot motion. Figure 13. Position estimation results based on Euclidean center of the point cloud from the mobile robot. The T265 camera is located at position (0,0) facing the longitudinal x-direction. The largest error occurs at the maximum relative position (indicated with a black ellipse) between the robot and infrastructure-mounted stereo camera.

274
An augmented state estimation framework was proposed for localization of au-275 tonomous mobile robots in dynamic environments using infrastructure-mounted visual 276 sensors and on-board data. The proposed system is composed of a visual tracking thread 277 based on a stationary low-cost fisheye stereo camera mounted in the environment and a 278 slip-aware ego motion thread that uses proprioceptive sensory data from a skid-steer mo-279 bile robot to enhance accuracy and reduce variance of the estimated states. The position and 280 heading of the robot was estimated using the visual thread with a region of interest-based 281 3D point cloud processing which reduced the computation up to 67% in dynamic indoor 282 environments with human presence. This significantly enhances the real-time processing 283 capability of the infrastructure-mounted sensing unit for localization and tracking of multi 284 robots in indoor settings. A slip-aware kinematic model was developed for the ego motion 285 The position fusion results are illustrated in Figure 15 which confirms improvements in the estimation error when CI is applied using the visual thread to address uncertainties in the slip-aware ego motion thread in such arduous scenarios. The position prediction is fused with visual thread data, intentionally at each 200 ms to evaluate the effectiveness in large sample time updates or possible packet drop. Once the heading estimates are corrected by CI, the localization data is accurate with the root mean square error (RMSE) ≤17% for several tests even with intermittent CI updates. The triangular shapes show the effect of the fusion process in which the kinematic motion model has been corrected and fused with the visual thread data. The kinematic model, a dead reckoning system, suffers from fault propagation and has an higher uncertainty as well as biased position prediction. Once the position is corrected with the visual localization, the corrected position and new initial position for the dead reckoning system moves close to the ground truth. Increasing the frequency of the update by the CI fusion will smooth the final estimates.

Conclusions
An augmented state estimation framework was proposed for localization of autonomous mobile robots in dynamic environments using infrastructure-mounted visual sensors and on-board data. The proposed system is composed of a visual tracking thread based on a stationary low-cost fisheye stereo camera mounted in the environment and a slip-aware ego motion thread that uses proprioceptive sensory data from a skid-steer mobile robot to enhance accuracy and reduce variance of the estimated states. The position and heading of the robot was estimated using the visual thread with a region of interest-based 3D point cloud processing which reduced the computation up to 67% in dynamic indoor environments with human presence. This significantly enhances the realtime processing capability of the infrastructure-mounted sensing unit for localization and tracking of multi robots in indoor settings. A slip-aware kinematic model was developed for the ego motion thread to predict the robot pose, then, covariance intersection with guaranteed consistency was used to update the pose prediction with visual estimates, addressing slippage and occlusion for wheel odometry based state estimators and visual based localization in dynamic environments. The experimental results confirmed RMSE ≤17% and an average position accuracy of 7 cm for various tests even with intermittent (e.g., 0.2 s) CI updates. The real time capability of the state estimation framework was confirmed by the computation time 35 ms for ROI-based visual processing and the fusion (through trace minimization). The future avenues include: (i) Using a motion model in the visual thread to enhance the consistency of the pose estimation; (ii) Integrating the IMU data into the ego motion thread and developing a motion model connecting wheel speeds, longitudinal slips, and robot dynamics within an optimization problem constrained to the robot kinematic/dynamic constrains to enhance orientation estimation.