Integration of a Multi-Camera Vision System and Strapdown Inertial Navigation System (SDINS) with a Modified Kalman Filter

This paper describes the development of a modified Kalman filter to integrate a multi-camera vision system and strapdown inertial navigation system (SDINS) for tracking a hand-held moving device for slow or nearly static applications over extended periods of time. In this algorithm, the magnitude of the changes in position and velocity are estimated and then added to the previous estimation of the position and velocity, respectively. The experimental results of the hybrid vision/SDINS design show that the position error of the tool tip in all directions is about one millimeter RMS. The proposed Kalman filter removes the effect of the gravitational force in the state-space model. As a result, the resulting error is eliminated and the resulting position is smoother and ripple-free.


Introduction
It is well known that inertial navigation sensors have drifts. There are two components in the inertial sensor drift: bias stability and bias variability. These components are involved in double integration in position calculation; so after a while, the output of the Inertial Navigation System (INS) is not reliable. Since these factors are involved in the inertial navigation computing task, they cause unavoidable drift in orientation and position estimation. Removing the drift of inertial navigation systems requires that OPEN ACCESS the sensors be assisted with other resources or technologies such as Global Positioning Systems (GPS) [1,2], vision systems [3][4][5], or odometers [6,7].
The use of Kalman filters is a common method used in the data fusion technique. The Kalman filter is a powerful method for improving the output estimation and reducing the effect of sensor drift. However, sensor integration is based on Kalman filtering, but different types of Kalman filters are being developed in this area [8][9][10][11][12][13][14].
In the past, the three-dimensional attitude representations were applied, but these representations are singular or discontinuous for certain attitudes [15]. As a result, the quaternion parameterization was proposed, which has the lowest dimensional possibility for a globally non-singular attitude representation [16,17].
In aided inertial motion tracking applications, the state variables of a Kalman filter usually take one of two forms: first, the sensed engineering quantities, that is acceleration, velocity, and attitude, etc.; and second, the errors of these quantities. The first form is used by Centralized Kalman Filter [14], Unscented Kalman Filter [18][19][20], Adaptive Kalman Filter [10,21], and Sigma-point Extended Kalman Filter [22], while the second is used by Indirect Kalman Filter [23][24][25].
A Kalman filter that operates on the error states is called an indirect or a complementary Kalman filter. The optimal estimates of the errors are then subtracted from the sensed quantities to obtain the optimal estimates. Since the 1960s, the complementary Kalman filter has become the standard method of integrating non-inertial with inertial measurements in aviation and missile navigation. This method requires dynamic models for both the navigation variable states and the error states [26].
This research develops an EKF which offers the estimation of the changes in the state variables. Then the current estimated values of changes in the variables are added to the previous estimation values of the position and velocity, respectively. According to the general equations of the SDINS, the constant value of the gravitational force is removed from the resulted equations and the resulting error from the uncertainty value of the gravitational force is eliminated.

Kinematics of Strapdown Inertial Navigation Systems Using Quaternions
Inertial navigation systems are typically employed to provide the present position and heading of a moving vehicle with respect to the known and fixed reference frame. An inertial navigation system localizes the vehicle by measuring the linear and angular components of its motion using inertial sensors and with knowing the initial value of its position and attitude.
In SDINS, MEMS-based inertial sensors are mounted rigidly on the body of a moving object [34] to provide the applied forces to and the turning rates of the object, while accelerometers and angular rate gyros are parallel to the axis of the body. Because of the measuring of inertial components in the body frame, a set of equations must be derived to compute the position and attitude with respect to the known navigation reference frame.
Motion analysis of a moving rigid body provides a set of equations determining its trajectory, speed, and attitude. Since the Inertial Measurement Unit (IMU) measures the inertial components of the connected point, the acceleration and velocity of other points on the body can be computed relatively. As shown in Figure 1, the IMU is attached to the bottom of the tool when the position of the tool tip is desired. Figure 1. Hand-held tool and assigned reference frames.
As the tool is rotating and translating, the body reference frame shown in Figure 1 is relocating with respect to the fixed navigation frame. The relative acceleration of the point B [35] is computed as: where and represent acceleration of the points A and B; / , ( / ) , and ( / ) denote the relative position, velocity, and acceleration of point B with respect to point A measured in the body frame; and ω̇ and are angular acceleration and angular velocity of the body frame. Since both point A and point B are located on the tool and moving along the tool, the relative acceleration and velocity of point B with respect to A is zero, and Equation (1) is rewritten as: In order to transform the acceleration of the tool tip from the body frame into the North-East-Down (NED) frame [34], the cosine direction matrix must be computed from Equation (3): where and Ω denote the cosine direction matrix and the skew symmetric form of the body rate with respect to the navigation frame. The three-dimensional Euler angles representations were applied for attitude estimation in the SDINS, but these representations are singular or discontinuous for certain attitudes [15]. Since the quaternion parameterization has the lowest dimensional possibility for a globally non-singular attitude representation [16,17], the quaternion is generally used for attitude estimation in the SDINS.
The transformation matrix is related to the corresponding quaternion = [ 1 2 3 4 ] : Therefore Equation (3) can be changed to Equation (5) as: Moreover, the gravity compensation is required since the accelerometers measure the local gravitational force. As a result, the acceleration of point B with respect to the navigation frame is calculated as: where represents the applied forces measured in the body frame by accelerometers, and denotes the gravity vector expressed in the navigation frame, [0 0 9.81] .
As a result, the state space equations of the system can be finalized as: where and stand for the position and velocity of the tool tip with respect to the navigation frame, and Q(q) is the 4 × 4 real matrix representation of a quaternion vector.
The navigation frame and the body frame shown in Figure 1 are rotating with the Earth as well. According to relative motion equations, the acceleration of point A in the Earth frame is: where point O is the origin of the navigation frame. Since the navigation frame is fixed to the ground, then the relative acceleration of the navigation frame with respect to the Earth, , is zero. The angular velocity of the Earth is constant and nearly 7.3 × 10 −5 rad/s [37], as a result the angular acceleration of the Earth is zero [36]. Since the relative position and velocity of point A with respect to point O is too small because of description of on-hand application, the effect of Coriolis and the centripetal acceleration terms in Equation (8) is too small to be detected by available accelerometers; therefore: which means the acceleration of point A with respect to the Earth reference frame is its acceleration with respect to the navigation frame.

Vision System
In this research, a vision system is proposed which includes four CCD cameras located on an arc to expand the domain of the field of view, see Figure 2. In order to find the Cartesian mapping grid for transforming 2D positions in the cameras' image plane to the corresponding 3D position in the navigation frame, the single camera calibration for each camera and the stereo camera calibration for each two adjacent cameras are required.
The calibration of the vision system provides the intrinsic and extrinsic parameters of the cameras [38] in order to map a 2D point on the image planes to the 3D point in the world coordinate system. The estimation of camera parameters requires a single camera imaging model, as shown in Figure 3.
The camera lens distortion causes two radial and tangential displacements [39]. The longer distance from the center of the image plane initiates the larger displacement, when the distance of a point  Considering two vectors α and β as the radial and tangential distortion factors of a camera, the distortions can be calculated as [40]: Consequently, the projection of each point in the world coordinate system into the image plane is: where the vector N is a zero-mean Gaussian random measurement noise, and 1 and 2 denote the focal length factors of the lens. In fact, 1 and 2 are related to the focal length and the dimension of the pixels: where f is the focal length; and refer to center-to-center distance between adjacent sensor elements in x and y directions, respectively; and represents the image scale factor [41], therefore: where 1 and 1 denote the dimension of a pixel on the image plane.
According to the camera model obtained in Equation (13), the geometric parameters f, , α, and β can be estimated by capturing enough images while the coordinate of each 3D point P and its 2D projected point p are known in calibration grids: Applying the parameter estimation method [34,42] to Equation (11) gives the geometric parameters of a camera. Furthermore, the transformation matrix for each two adjacent cameras is computed by substituting the equations of the coordinate system transformation into Equation (11) for each corresponding projected point.
In order to localize the tool tip, the edge detection and boundary extraction must be applied to every single frame from each camera. Obtaining the edge of the tool tip requires applying a thresholding technique. Each pixel is detected as an edge if its gradient is greater than the threshold. In this paper, the threshold is chosen as the boundary pixels of the tool tip are detected as the edge positions. Since the size of the tool tip is about a few pixels, then an adaptive thresholding technique is applied to remove the noise pixels around the tool tip as much as possible. For this purpose, a masking window is chosen around the initial guess of the position of the tool tip. Then, a fixed threshold is chosen which select pixels that their value is above the 80% of the value of all pixels of the image. If the boundary detection technique can identify the boundary of the tool tip, then it shows that the threshold selection is appropriate. Otherwise, the previous threshold is reduced by 5%, and this procedure is run recursively to find the proper threshold. Afterwards, the opening morphologic operation followed by closing operation is applied to simplify and smooth the shape of the tool tip. Finally, the boundary of the tool tip can be detected and extracted by using the eight-connected neighbors' technique.

Modified Kalman Filter
The integrated navigation technique employs two or more independent sources of navigation information with complementary characteristics to achieve an accurate, reliable, and low-cost navigation system. Figure 4 shows a block diagram of the integration of the multi-camera vision system and the inertial navigation system: Typically, Extended Kalman Filter (EKF) is applied by combining two independent estimates of a nonlinear variable [43]. The continuous form of a nonlinear system is described as: Since the measurements are practically provided at discrete intervals of time, it is appropriate to express the system modeling in the form of discrete differential equations: where: Therefore the two set of equations involving the prediction and updating of the state of the system are defined as: where the system noise and the measurement noise are zero mean with known covariance R and S, respectively. According to Equations (7), (17), and (18), the discrete form of the system is developed as: where is the sampling rate of the inertial sensors. In this research, instead of estimating the actual value of these quantities, we propose to estimate how much the position and the velocity will be changed; that is: As a consequence, the computation of the velocity is independent of the gravitational force in the new state-space model. In fact, the error caused by inaccurate value of the gravitational force in the new state-space model is completely eliminated.
The inertial sensor noise is theoretically modeled with a zero-mean Gaussian random process. In practice, the average of the noise is not absolutely zero. Due to the inherent characteristic of the Gaussian random process, the discrete difference of a zero-mean Gaussian random process is also a zero-mean Gaussian random process with very lower actual mean while its variance is twice of the variance of the original process. As a result, the drift resulting from the input noise is reduced and a smooth positioning is expected.
The equation of the INS with the state vector = [Δ Δ ] can be reformulated as: or: Subsequently, the transition matrix [44] can be calculated as: By considering Δ = [Δ 1 Δ 2 Δ 3 ] : (3) leads to the following Equation: Therefore: As a result of Equation (106): Because the vision system as the measurement system provides the position of the tool tip, velocity can be computed by knowing the present and the previous position at each time step: where is the sampling rate of the cameras. Accordingly, the observation matrix would be:

Experimental Results
This section presents the experimental hardware setup and the result of applying the proposed EKF. The experimental hardware includes a 3DX-GX1 IMU from Microstrain, an IDS Falcon Quattro PCIe frame grabber from IDS Imaging Development Systems, and four surveillance IR-CCD cameras. The IMU contains three rate gyros and three accelerometers with a sampling rate of 100 Hz and with a noise density of 3.5 °/√ℎ and 0.4 / √ , respectively [45].
All cameras are connected through the frame grabber to a PC, which includes four parallel video channels able to capture images from four cameras simultaneously with a sampling rate of 20 fps. Since the multi-camera vision system is used as a measurement system, the camera calibration procedure must be performed primary. The intrinsic and extrinsic parameters of each camera are listed in Table 1. Once the calibration is completed, the vision system is ready to track the tool and measure the position of the tool tip by applying image processing techniques. Figure 5 demonstrates the result of the video tracking by one of the cameras.
It should be mentioned that a predesigned path is printed on the 2D plane and it is tried to be traced by the tool tip during its movement on the plane in order to compare the performance of proposed EKF and with the performance of the conventional EKF reported in [5].
The sensor fusion techniques allow us estimating the states variables of the system at the sampling rate of the sensor with the highest measurement rate. In this experiment, the sampling rate of cameras and inertial sensors are 20 fps and 100 Hz. As a result of sensor fusion, the measurement rate of the proposed integrated system is 100 Hz. The classical EKF is applied in both switch and continues modes. In the switch mode, the estimation of the states variables is corrected whenever the measurement of the vision system is available. Otherwise, the states are estimated only based on the SDINS. In order to reduce the computational complexity of image processing algorithms, sensor fusion allows that the sampling rate of the vision system can be reduced to 10 fps and 5 fps. As illustrated in Table 2, the positioning error is increased by reducing the sampling rate of the cameras. In addition, the error in proposed EKF grows faster than the other methods; since this technique assumes that the rate of the changes in state variables is constant from one frame to another frame. So, this assumption cannot be valid in lower measurement rates. Although, it is shown in Table 2 that the position error of the continuous EKF is less than the others; it should be mentioned that the position obtained by the multi-camera vision system still has errors compared with the predesigned path. Figure 6 and Figure 7 compare the position resulting from each method at two different parts of the trajectory of the tool tip at two sampling rate of 16 fps and 5 fps. As shown, the camera path is traced smoothly by applying continuous EKF. Since the position is estimated in real-time, it is not possible to fit a curve between each two camera measurement without sensor fusion techniques.  The position resulting from switch EKF is crinkly due to the drift position in the SDINS and the wrinkles are amplified by decreasing the measurement rate of the cameras. The position estimated by the proposed EKF is smooth and ripple-free and this method tries to reduce the errors of the entire system compared with the predesigned path. As a result, the proposed EKF is suitable for the higher measurement rate; while the continuous EKF is recommended for the lower sampling rate. However, the error of inertial sensors resulting from noise and the common motion-dependent errors are compensated, but the remaining errors cause the position error estimation in the integrated system. In addition, the video tracking errors lead to the position estimation error as well.

Conclusions
This paper describes the use of the EKF to develop integration of the multi-camera vision system and inertial sensors. The sensor fusion techniques allow estimation of the state variables at the sampling rate of the sensor with the highest measurement rate. This helps to reduce the sampling rate of the sensors with high computational load.
The classical EKF is designed for nonlinear dynamic systems such as the strapdown inertial navigation system. The performance of the classical EKF is reduced by lowering the sampling rate of the cameras. When the sampling rate of the cameras is reduced, the rate of updating decreases and the system must rely more on the inertial sensors output for estimating the position. Because of the drift in the SDINS, the position error increases.
The modified EKF is proposed to obtain position estimation with less error. Furthermore, it removes the effect of the gravitational force in the state-space model. In fact, the error resulting from inaccuracy in the evaluation of the gravitational force is eliminated in the state-space model. In addition, the estimated position is smooth and ripple-free. However; the proposed EKF is not convincing at the lower measurement rate. The error of the estimated position results from inertial sensor errors, uncompensated common motion-dependent errors, attitude errors, video tracking errors, and unsynchronized data.