Enhancing Positioning Accuracy in Urban Terrain by Fusing Data from a GPS Receiver, Inertial Sensors, Stereo-Camera and Digital Maps for Pedestrian Navigation

The paper presents an algorithm for estimating a pedestrian location in an urban environment. The algorithm is based on the particle filter and uses different data sources: a GPS receiver, inertial sensors, probability maps and a stereo camera. Inertial sensors are used to estimate a relative displacement of a pedestrian. A gyroscope estimates a change in the heading direction. An accelerometer is used to count a pedestrian's steps and their lengths. The so-called probability maps help to limit GPS inaccuracy by imposing constraints on pedestrian kinematics, e.g., it is assumed that a pedestrian cannot cross buildings, fences etc. This limits position inaccuracy to ca. 10 m. Incorporation of depth estimates derived from a stereo camera that are compared to the 3D model of an environment has enabled further reduction of positioning errors. As a result, for 90% of the time, the algorithm is able to estimate a pedestrian location with an error smaller than 2 m, compared to an error of 6.5 m for a navigation based solely on GPS.


Introduction
GPS-NAVSTAR (Global Positioning System-NAVigation Signal Timing And Ranging), popularly known as the GPS system, has considerably gained in civilian interests, since May 2000. Earlier, the system was practically reserved for military purposes. The positioning accuracy for the civilian sector was ca. 100 m due to an intentional error, called Selective Availability. According to report [1], the horizontal positioning error is less than 17 m for 99% of the time in average conditions or 17 m for 90% of the time in worse outdoor conditions. The error depends on many factors, like atmospheric conditions, sun activity, geographical location, terrain type, satellites' constellation, etc. In an open space, positioning errors are of ca. 2-3 m. However, in dense built-up areas, the location error may reach 100 m [2,3] or even more [4]. The error is introduced due to multipath propagation of signals transmitted by the satellites when there is no line-of-sight. A satellite signal is bounced off the walls of a building before finding its way to a GPS receiver. The propagation time of the signal is delayed and the GPS receiver miscalculates its location with a reference to the satellites.
There are many techniques to improve the location accuracy. Along coasts, for marine purposes, special ground DGPS (Differential GPS) reference stations broadcast differential corrections that allow a GPS receiver to eliminate tropospheric, ionospheric, ephemeris and clock errors. The overall error is reduced to 10 m with accuracy decreasing by 1 m with each 150 km increase in distance from the reference station. The corrections are transmitted on a ca. 300 kHz carrier frequency. A receiver must be, however, equipped with an additional antenna [5].
A-GPS (Assisted GPS) is a technique that downloads from the Internet the data concerning GPS satellite constellation [6]. Otherwise a GPS receiver may require up to 12.5 minutes to receive data about satellite constellation. By employing the A-GPS, the first fix is provided within few seconds. Additionally, a GSM modem can support a GPS receiver with a rough location which is obtained by measuring the strengths of signals from GSM base stations.
Apart from L1 = 1,227 MHz, a second frequency called L2C = 1,575 MHz, has been made available to the civilian sector with the aim of reducing the ionospheric and tropospheric errors which are a well defined function of frequency. Also, the so-called augmentation systems effectively reduce tropospheric and ionospheric errors by sending differential corrections from geostationary satellites directly to GPS receivers. The geostationary satellites imitate also GPS satellites and improve mainly the vertical accuracy. EGNOS (European Geostationary Navigation Overlay Service) works in Europe, WAAS (Wide Area Augmentation System) in the US, MSAS (Multi-functional Satellite Augmentation System) in Japan and GAGAN (GPS aided Geo-Augmented Navigation) in India.
Real Time Kinematics technique achieves the accuracy of millimetres in an open space and is designated for cartographic measurements. However, two GPS receiver are necessary. The base receiver is placed in a known position and calculates tropospheric, ionospheric, ephemerids and clock errors of satellites. The corrections are sent via a radio link to a mobile GPS receiver. In the trials reported in [4] RMS error of 2 cm on a mountain high-way was noted whereas trials in urban canyons yielded a 50 m RMS error.
All the above mentioned techniques are helpless against multipath propagation errors. Algorithms harnessed in car navigation alleviate these errors by taking advantage of car kinematics and a comparatively sparse network of roads, as described, e.g., in [7] or [8]. An error of 20 m is of no bigger importance to a driver. In case of navigating pedestrians, especially blind ones, the target accuracy should ideally not exceed the pavement width, i.e., ca. 2 m. This work presents a navigation scheme using GPS readouts, digital maps, inertial sensors and stereovision images with an aim of navigating a blind pedestrian. An accelerometer is used to detect pedestrian's strides and estimate their length. A gyroscope serves for estimating the heading direction. Those data sources help eliminate gross positioning errors and outliers. The digital maps are used twofold. Firstly, the so-called probability map is built to eliminate improbable user transitions, like traversing water ponds, crossing walls and buildings, etc. Secondly, a 3D model of the environment is built. The model is compared to stereoscopic images recorded by a mobile stereo camera. Interestingly, this comparison of 3D geometry of the environment provides good positioning accuracy in the surrounding of buildings, where GPS readouts are compromised. The proposed scheme employs the particle filter, also known as a sequential Monte Carlo method.

Related Work
The topic of pedestrian navigation including navigation aids for blind pedestrians has been described in many publications [9][10][11]. The first step to correct GPS readouts is to apply inertial sensors. Inertial sensors are mainly used in aviation to compute the orientation and position of an aircraft. Calculating the location requires double integration of the acceleration vector. Prior to these calculations the gravity acceleration must be removed, as accelerometers cannot distinguish gravity from an aircraft's accelerations. Therefore, the orientation of an aircraft with respect to the Earth's surface must be calculated from gyroscopes. The technique is known as INS (Inertial Navigation Systems) and warrants very precise and expensive laser sensors using the Sagnac effect. However, the strict implementation of INS using MEMS sensors (Micro Electro-Mechanical Systems) is useless after few seconds due to errors growing quadratically with time [12].
A travelled distance of a pedestrian can be estimated with surprisingly good accuracy by measuring the length of steps. This is done by analysing the acceleration in the gravity axis [10,13]. As a person walks, the body undulates according to the strides. The technique is accurate from 0.5% to 10%, depending on the gait style. ZUPT (Zero Velocity Update) technique exploits the fact that a foot is at rest for some short period. An accelerometer must be mounted to a foot which is an inconvenience, offset however by better accuracy compared to the previous method [14].
A heading direction can simply be read out from a magnetic compass, optionally supported by a gyroscope. A compass is sensitive to local distortions of a magnetic field due to cars, power lines etc. [10]. An electric tram can compromise a compass readouts within the radius of up to 100 m. A gyroscope, coupled by the Kalman filter, can reduce erroneous readouts [15].
The combination of GPS and inertial sensors readouts provides continuous estimates during GPS outages in harsh environments like tunnels, underground passages, dense urban areas etc. Positioning data from these two sources are usually integrated by the Extended Kalman filter or particle filter, which perform well when errors can be modelled by white noise which has the property of being uncorrelated with itself. However, the GPS errors are characterized by coloured noise [2,16]. This is because when a GPS receiver loses track of satellites its position is estimated by using the history of previous locations. Secondly, signals from satellites occluded by the same building are equally delayed, which introduces a bias in a given direction. This feature of errors corrupting GPS readouts was reported in earlier studies [14,17]. Jirawimut et al. [18] present an interesting concept where the height of buildings were utilized to check if a given satellite is occluded. The authors carried out simulation which yielded good results.
To improve the accuracy of dead-reckoning techniques, an additional source of the absolute pedestrian location should be introduced. Experiments in [4] showed that a combination of GPS and GLONASS navigation systems reduces root-mean-square (RMS) error by half and outliers by several times. Another solution is introducing the so-called probability map as a new source of data, which defines regions of a terrain that the user is most likely to enter or cross. A pedestrian is more likely to walk along pavements, parking places, paths, etc., rather than crossing walls, water ponds, etc. This cuts down the positioning errors as they start to aggregate. This map-based navigation concept proved its usefulness in outdoor pedestrian applications [19,20] as well as indoor positioning [21][22][23]. In built-up areas, a probability map limits inaccuracy to a width of an urban canyon, usually delineated by buildings on both sides.
A further improvement of positioning accuracy requires positioning with respect to known landmarks, i.e., the so-called exteroception. The accuracy of landmarks' location should be approximately an order of magnitude better than the target accuracy of ca. 2 m. A vision-based technique called SLAM (Simultaneous Localization and Mapping) enables to build a map of the surrounding space and provide localization at the same time without employing any global navigation method. The SLAM based technique can be applied to build a 3D map of the environment. A review of SLAM techniques, modifications, results, etc., are given in book [24]. An application of SLAM for pedestrian navigation with results is given in [25]. Also a navigation system for the blind using a stereo-camera for tracing landmarks is presented in [26].
On the other hand, precise maps of outdoor or indoor environments may already be available, e.g., a plan of a building or plans of a city which are accurate to single centimetres, accuracy hardly achievable by any SLAM techniques. There are different sensors used to compare the environmental map with a robot's or pedestrian's location. Ultrasound sensors measure distance with 1% accuracy, being at the same time very cheap and compact. The maximum achievable distance is ca. 10 m. The angular resolution, however, is poor, ±15 • . Laser scanners provide 0.1% distance measurement accuracy with a maximum distance of ca. 50 m. The angular resolution is of a fraction of a degree. The scanning is omnidirectional. The power consumption is ca. 10 W for longer distances. The size of a laser sensor is around 5 cm by 5 cm by 5 cm. The cost is rather high (ca. $4,000 USD depending on accuracy, maximum distance etc.). These sensors are commonly used in robotics. Time-of-Flight (ToF) cameras offer similar parameters as laser sensors providing 3D reconstruction of the scanned environment. The angle of view is ca. 30 • . Application of laser scanners and ToF cameras is limited to indoor environments due sunlight interference.
In [27] ultrasound and laser sensing techniques are compared for indoor positioning of a robot. Work [23] presents an indoor localization system based on the particle filter. The user wears ultrasound sensors which measure a distance to adjacent walls. Therefore, the user location can be compared against the alignment of building walls. Application of an ultrasound sensor improved the positioning accuracy by 7 times in good trial conditions, i.e., with no additional objects in the building's corridor, closed doors etc. A similar technique using a laser sensor mounted on a white cane is reported in [22]. A laser sensor detects corners of corridors what provides a comparison with a building's plan.
Finally, stereovision is a passive imaging technique that can be used indoors and outdoors. The cameras can be compact and inexpensive. However, stereovision imaging is limited to the environments featuring good lighting conditions, requires calibration of the stereovision optics and offers worse depth estimation accuracy than the earlier outlined active methods [28]. Moreover, demanding computations are required for calculating depth maps. In stereovision, depth accuracy can range from a few centimetres for objects located in a few metres range to a few metres for more distant objects. The so-called subpixel interpolation methods were proposed to improve depth estimation accuracy [29].
The work presented here combines global and local positioning techniques. GPS location estimates are augmented by dead reckoning techniques derived from inertial sensors, i.e., gyroscopes and accelerometers. A stereo-camera is used for distance measurements. Finally, a digital map of the outdoor terrain is incorporated into the algorithm. Each of the listed source of positioning data features different error characteristic, e.g., in open spaces GPS readouts are most accurate whereas stereovision may yield large depth estimation errors. Conversely, for city canyons where GPS accuracy falters, stereovision and the digital map offer good positioning. A particle filtering algorithm is proposed to optimally fuse all data source.

A Prototype System for Pedestrian Positioning
A block diagram of the proposed system is presented in Figure 1. From the hardware point of view, the system is made up of three parts: a PC platform, stereo camera connected through FireWire interface and an electronic module housing a GPS receiver and 6DOF sensor. The data was acquired by walking through the University Campus with the stereovision camera attached to the chest of an experimenter. The laptop stored data from the electronic module and stereo camera. Then the data was processed off-line on a PC.

Electronic Module
The built electronic module, see Figure 2, is a dedicated PCB with a microcontroller, 6DOF sensor and GPS receiver. The microcontroller reads out data from the 6DOF sensor, ADIS16355 at a rate of 820 Hz. Downsampling is carried out to limit data stream to the PC. An 8th order, lowpass Chebyshev filter is used to avoid aliasing. Samples are sent to the computer at a rate of 205 Hz. The module sends data through the USB interface.

6DOF Sensor
The 6DOF sensor, ADIS16355 from Analog Devices, comprises a 3-axial accelerometer and 3-axial gyro. The data from the sensor is read out through a digital interface, SPI. The samples are of 14-bit resolution whereby practically the three least significant bits are random. The ADIS16355 has been superseded by much less noisy ADIS16375. The static parameters of the former were investigated by the Allan variance [30].
Gyroscope. The gyroscope is used to estimate the pedestrian orientation. Angular velocity ω(t), returned by the gyroscope, is corrupted mainly by white and flicker noise. The relative direction change ∆φ is calculated by integrating the angular velocity according to Equation (1).
whereω(t) denotes the true angular velocity, ω w (t) white noise and ω f (t) flicker noise in angular velocity readouts. ∆φ(t) is the true change in the heading direction. Integrated white noise ω w (t) results in a first-order random walk, called in this case angular random walk (ARW), and is denoted by φ ARW (t) in Equations (2) and (3). Angular random walk φ ARW (t) is described by the Gaussian distribution but it is not a stationary process as its standard deviation is growing with the square root of the integration time [31]. For the sensor used in the project the standard deviation of φ ARW grows at a rate of 2.77 • / √ h.  Similarly, integrated flicker noise ω f (t) causes an angular error ∆φ F (t), according to Equation (4). The values of ∆φ F are described by the Gaussian distribution whose standard deviation grows proportionally with the time and therefore it is a non-stationary process. In case of the gyroscope used in the project, the standard deviation of ∆φ F grows at a rate of 48 • /h. Therefore, confidence intervals for estimating ∆φ(t) widens with the time t. After an hour, a motionless gyro can drift by ±144 • for ±3σ confidence interval. Hence, dead reckoning is justified for short periods only, e.g., during GPS outages.
Inertial sensors also suffer from other errors like non-linearity or temperature random bias. The former error is ±0.3 • /s. The experienced random bias due to temperature was much higher than the manufacturer claimed and it reads 0.5 • /s for ±3σ confidence interval. This is, however, easy to compensate by leaving the sensor motionless and measuring the constant offset to be later subtracted from the readouts.
Accelerometer. The accelerometer is used in the project to estimate the user's steps and their lengths. This method is more precise than integrating twice the acceleration readouts to obtain the displacement. The accelerations readouts are also used to estimate the gravity direction which is necessary for correct calculation of the orientation change. By analogy, velocity random walk is a product of integration of white noise in acceleration. On average, the velocity random walk was 0.5 m/s/ √ h. Flicker noise introduces an error of 9.3 m/s/h. The displacement errors are much larger since they are products of integrating velocity errors. Any offset in acceleration readouts grows quadratically with the time. This quickly leads to aggregation of errors and renders the analytical approach ineffective after a dozen of seconds.

GPS Receiver
The PCB board houses a standard GPS receiver with an integrated ceramic antenna, measuring 15 mm by 15 mm. The receiver is built on the MTK MT3329 chipset and supports WAAS, EGNOS, MSAS, GAGAN corrections. GPS readouts are based on the WGS84 coordinate system aligned in the centre of the Earth, being represented by an ellipsoid. The conversion to a local coordinate system follows in two steps. Firstly, the polar coordinates, i.e., latitude, longitude and height above the ground, are transformed to the WGS84 Cartesian coordinates (x WGS84 , y WGS84 , z WGS84 ). This step, along with associated errors and corrections, is described in detail in [32]. Secondly, the local coordinates (x, y, z) are obtained by simplified Helmert's transformation (5).
The rotation R and translation T matrices are structured in such a way that the centre of the local coordinate system is on the Earth's surface. Versor 1 x is directed along with longitude, 1 y with latitude (towards the pole), 1 z is determined by 1 x and 1 y . The conversion accuracy is in an order of 10 cm for 20 km from the centre of the local coordinate system.
A GPS receiver provides the HDOP parameter (Horizontal Dilution of Precision) informing about the accuracy of the estimated coordinates, i.e., latitude and longitude (6). Higher values of HDOP should inform a dead reckoning filter to trust to other sources of positioning (e.g., inertial sensors) and disregard GPS readouts.
The standard deviations of estimating x and y on the Earth surface are denoted by σ x and σ y respectively. Providing that x and y have the Gaussian distribution and are independent, the probability density function of distance error r e is given by the Rayleigh distribution [33], which is not a monotone function of its argument. The function has a maximum for r e > 0. Tests showed that the function defined by Equations (7), (8) and (9) yields definitively superior results. This can be explained by a poor relationship between the HDOP parameter and the error r e . Since r e can assume only positive values, the Gaussian function p GPS (r e , σ GPS ) is accordingly scaled by a factor of 2. The coefficient β GPS in Equation (9) is chosen by trial and error.
r e = x 2 +ỹ 2 (8) If a GPS receiver loses track of satellites, prediction of its position is based on the previous velocities and position. The first order autocorrelation coefficients forx andỹ were calculated for a 2.6 km path-see Equation (10). They had considerable values of a x ≈ a y ≈ 0.86. Thusx andỹ can be modelled by Equation (11), which is known as exponentially correlated noise.
where v x (t) and v y (t) are random realizations of white noise, whereby y . This is a very important point. There are periods of e.g., 30 s when GPS readouts stray away by 40 m in one direction. A dead reckoning system, weighting GPS coordinates with inertial sensors' readouts to produce a better position estimate, will succumb to this bias error sooner or later. When the GPS receiver regains good visibility with satellites, the estimated position will be all of a sudden in a quite different location. The dead reckoning filter will have to take some time to adapt to a new, accurate GPS position. During this time the filter output will be encumbered with large errors. A way out is to recognize the situation and reinitialize the filter. To account for these random bias errors another source of absolute positioning should be introduced. Maps, video data are examples thereof. Another problem associated with GPS readouts is the correspondence of the HDOP parameter with the actual accuracy of latitude and longitude. Equation (6) shows that the value of HDOP is interpreted as the distance error. Figure 3 sheds light on the problem. It was noticed that the interpretation of the HDOP parameter depends on a GPS receiver at hand. The correlation coefficient (12) shows a poor linear dependence between distance errors r e (t) and the values of HDOP. The number of observed satellites and position error is not correlated. This confirms the statement about the necessity of additional positioning data.

Pedometer
When a person walks, the body undulates in unison with the steps. The step-detection algorithm examines the acceleration of the body only in the vertical direction (parallel to the gravity axis). The gravity axis is obtained from the algorithm presented in Section 3.3. The acceleration signal also reflects knee movements, body swings, sensor jerks and so forth. A 3 Hz 4-order low-pass Butterworth filter was applied to eliminate undesired constituents. The step detection algorithm is presented in Figure 4. The minimum A min and maximum A max acceleration in the filtered signal are detected. The difference between the peaks is raised to the empirical power of 0.25. The result is multiplied by the scale factor K, depending on the individual. Equation (13) expresses an estimated step length d. Related work can be found in [10,13,34].
Step length accuracy measurements are between 1% to 10% depending on an individual and gait style.

Tilt Estimation
Tilt estimation, i.e., estimation of the direction of the gravity axis, is important for two reasons: 1. Calculation of the relative change in the heading direction. When the sensor is not levelled, the gyroscope's axis does not coincide with the pivotal axis of the pedestrian. When the user turns, the relative change in the heading direction is underestimated. For example, when the gyroscope axis is tilted by 25 • from the pivotal axis, the angular velocity is underestimated by ≈ 10%. Consequently, the relative direction change might be 81 • instead of 90 • -c.f. Figure 5.
2. Estimation of the camera orientation with respect to the ground. The stereo camera is used to measure distance to nearest objects like buildings. When the camera is not levelled the measured distance will be overestimated.
A human rotates along the gravity axis when changing the walking direction. Thus, the task of estimating the correct change in direction comes down to finding the gravity axis. The problem is known in aeronautics as Inertial Navigation Systems. The estimation of six coordinates of an aeroplane (three for orientation and three for position) is not a trivial task. A standard approach based on off-the-shelf sensors and cosine direction matrices is very inaccurate after few dozens of seconds [12]. However, humans move with much smaller accelerations. The dominant acceleration comes from the gravity. When a person moves, the body is usually aligned vertically. The pitch and roll are limited to ≈ ±30 • . These assumptions help greatly to design an algorithm estimating the gravity axis. The algorithm uses Kalman filter to integrate the readouts from the gyroscopes and accelerometers. The angles between the sensor's axes and the gravity axis (see Figure 5) can be calculated from Equations (14) and (15), given the sensor is not moving.
where a x , a y and a z are accelerations measured by the 3-axial accelerometer, g is the gravity constant, 9.81 m/s 2 . When a user walks, the accelerations will fluctuate by ±0.25 g, depending on the walking style. The acceleration vector due to walking is treated here as measurement noise. Hence, calculating the direction of the gravity axis from Equation (16) would result in an error of 15 • . The accelerations need to meet the constraint Equation (17). Figure 5. The local coordinate system of the 6DOF sensor. g denotes the gravity acceleration. ω x , ω y and ω z are angular velocities which are clock-wise oriented with their rotation axes. α x , α y and α z are angles between the sensor's axes and the gravity.
This reduces the number of degrees of freedom by one. Gyroscopes can be used to improve the estimation of the sensor orientation. The transition equation of the Extended Kalman Filter (EKF) is given by Equations (18) and (19). Appearance of angular velocity ω y causes the X axis of the sensor changing its orientation with respect to the gravity. This change also depends on the orientation of the Y axis and is expressed by the sin α y term.
k denotes time instant, where t = k · dT and 1/dT = 205 Hz. w y (k) and w x (k) are the gyroscope errors. The measurement equation of the EKF is given by Equation (20) y The accelerometers readouts are corrupted by noise v(k). Further implementation of this Extended Kalman Filter is straightforward and follows through Taylor linearisation. Details can be found in [35]. Having estimated α x and α y and complementary angles β x and β y , the angle γ between the Z-axis gyroscope and the inverted gravity axis (see Figure 6) can be calculated from Equation (21). γ = 2 arcsin sin 2 β x + sin 2 β y 2 (21) Figure 6. The relationship between the angles β x and β y and the angle γ which describes the sensor tilt and is measured between the sensor's Z-axis and the −g axis.
The angular velocity can be corrected by Equation (22). The relative heading direction should be calculated from Equation (23).
As mentioned before, this approach is valid for limited range of pitch and roll, due to singularity in Equation (22) when γ approaches π/2. The measurement of γ can be included from Equation (16), however with a small benefit due to considerable fluctuations of the corresponding acceleration a z . The average error of estimating γ was measured to be 0.5 • with a standard deviation of 0.45 • . Figure 7(a,b) shows a trial with the sensor tilted by ≈15 • .

Heading Estimation
A gyroscope provides an angular velocity ω. A direction change can be found by integrating ω (c.f., Equation (23)). Thus a gyroscope can provide only a relative orientation as opposed to a magnetic compass. Experiments with a magnetic compass showed its susceptibility to magnetic field distortions due to tram lines, DC power lines with no return cable, cars changing a local magnetic field, etc. The magnitude of the Earth's magnetic field is ca. 50 µT . By comparison, a departing tram produces 120 m away a magnetic induction of 5 µT . Hence, a magnetic compass was not included. The absolute pedestrian orientation is sorted out run-time by the particle filtering algorithm based on other data. A magnetic compass could give a rough approximation ±90 • until the absolute orientation has been found and then switched off.
w map (x, y) denotes the type of an area at (x, y) coordinates. Each weight corresponds to the probability of the user being in a given area. Since the system is destined for outdoor navigation, it is rather improbable that the user will traverse buildings, walls and fences-c.f., Figure 8. The user is likely to travel along pavements and alleys, however, he or she can also walk across lawns, although less likely. The probability map mitigates the problem of gyroscope drifts and stepmeter inaccuracy. The particle filtering approach to the discussed problem will be explained in Section 3.9. Here the term particles denotes a collection of hypothetical locations of the user and the weights associated with each particle is understood as the probability of the user being at this location. When particles enter a forbidden area, they will be gradually eliminated. Particles that move along a preferred area will be preserved. Therefore, the direction will align itself and the direction drift will be reduced. The weight assigned for probable areas should not be too low, because otherwise when a user crosses a wide street, the particles' weights will be successively reduced and consequently die out. Consequently, the user's location will not be correctly calculated. The project was developed with a view of pedestrians, especially blind ones. Figure 8. An example of a probability map of the University Campus. Forbidden, probable and preferred areas are denoted by black, grey and white respectively. The streets at the campus are paved and have the same texture as pavements. A blind user would rather refrain from walking through lawns.

Introduction
A stereovision camera [28] is comprised of two cameras which are shifted away along the X axis, c.f., Figure 9. The distance between the cameras is called the baseline of a stereo system and is denoted by B. The baseline distance makes the same scene point P to be visible in the left and right camera at coordinates x L and x R respectively. The difference between the x L and x R values is called the disparity, and will be denoted by x d . The z coordinate of the observed object can be calculated from Equation (24), where f is the focal length.
For the legibility sake, the stereovision system can be treated as one camera, whose simplified coordinate system is shown in Figure 9. The Z axis coincides with the optical axis of the camera model. Figure 9. The local coordinate system of the stereovision camera. F is the focus of the camera, and f is the focal length, (x, y, z) are the coordinates of a point in the camera coordinate system. This point has the coordinates (x P , y P , 0) on the projection plane P P of the camera and α denotes the angle from the Z axis at which point (x, y, z) is visible by the camera.
The coordinates of a point (x, y, z) are projected onto the camera projection plane P P according to the transformation (25).
The depth of a point (x, y, z), denoted by d p (x, y, z), is understood as the distance from the point to the projection plane and equals Equation (26).
The distance from the Y axis of the camera to a point with the (x, y, z) coordinates is denoted by d Y (x, y, z) and calculated from Equation (27).
Let us define a function which returns the distance from the Y camera's axis to the nearest point visible at angle α from the optical axis Z, c.f. Figure 9. This function is defined by Equation (28).
The data from the stereovision camera is digitized. The (x p , y p ) coordinates are quantized and converted to units called pixels. The transformed coordinates are denoted by (x pix , y pix ). The conversion follows through Equation (29).
where the Round() operator denotes rounding to a nearest integer number, f pix is the focal length of the camera expressed in pixels and is provided by the manufacturer in the technical specification. The difference between the maximal and minimal value of x pix is called the horizontal resolution of the camera and is denoted by x res . The vertical resolution y res is defined by analogy. A depth map or a depth picture L is a two dimensional map defined by Equation (30).
L(x pix , y pix ) = min(d p (x, y, z)); (x, y) (25), (29) → (x pix , y pix ) The coordinates (x, y) are related with (x pix , y pix ) through transformations (25) and (29). Equation (30) formally says that a stereo camera provides the depth to nearest points that are not occluded. This is so-called 2.5 dimension image. Usually, the depth for a given point (x pix , y pix ) is represented by a floating point-number. Using the identity (24), the disparity map K can be defined by Equation (31).
As a matter of fact, stereovision cameras first determine the disparity map K and then calculate the depth map L. The disparity x d is calculated in digital cameras in pixels, thus K should be a map containing integer numbers. An example of a depth map is presented in Figure 12(b). Advanced stereo cameras perform subpixel interpolation, e.g., the camera used in the project calculates the disparity with the resolution of 0.25 pixel and then performs filtering and interpolation on the depth map.

Error Analysis
As the object recedes from the camera, the disparity x d decreases. An error in calculating the shift x d becomes more significant. Calculating the derivative of z with respect to x d in Equation (24) gives Equation (32). The accuracy of determining the distance z decreases with its squared value. As the stereo-camera is digital, the values of x L , x R are quantized and so is x d . It means that z manifests discontinuities as x d changes to an adjacent quantum. Let denote the size of the pixel in metrical units. For the sake of simplicity, x d will have units of metres or pixels, whereby the appropriate unit will be clear from the context or mentioned implicitly. Let us assume at this stage that the error of determining x d can be ± 2 or ± pixel 2 . Figure 10 exemplifies the effect of such quantization. The error of determining the disparity does not only come from quantizing x d , in which case, the disparity maps presented in Section 3.8 would be devoid of any artifacts or falsely calculated disparities-see [36] for full clarification. The camera used in this work calculates disparity by correlating small blocks (e.g., 7 × 7 pixels) and looking for best match. The probability density function of the disparity error is disputable in this case. Let it be the Gaussian distribution, expressed by Equation (33), where ∆x d denotes the disparity error.
The probability density function of ∆z can be calculated from Equation (34). The function f z is conditioned on z, i.e., the error ∆z depends on the value of z as mentioned before.
Equation (32) holds for small x d . The exact value of ∆z can be computed from Equation (35). The formula necessary for the transformation is given by Equation (36).
Thus, f z can be calculated from Equation (37).
The shape of the function f z (∆z|z) for different values of z is presented in Figure 11. The function is skewed as one should have suspected and decays slower for ∆z greater than 0.

3D Map of an Urban Terrain
A precise 3D map of an urban terrain was provided by the local cartographic office of Lodz (MODGiK). The map is divided in layers containing: roads, buildings, entrances to buildings, steps, fences, lawns, trees, bushes, ponds, lamp posts and many other data not used in the project. The accuracy of building deployment is in an order of single centimetres and the buildings' heights were measured with a bit worse accuracy. The radius of trees' crowns was assumed 3 m and the radius of lamp posts 0.5 m.

3D Matching Algorithm
Estimated location based on the stepmeter, gyroscope and digital maps is accurate to within 15 m (a street with pavements on both sides) [3]. GPS errors in urban canyons being a several dozens of metres, practically both sides of a road are equally probable.  Figures 12(a-c) show a picture from a monocamera, a disparity map K and a part of the 3D model of the University Campus.
Let us use the local coordinate system defined in Section 3.1.2. Let the vector [x S , y S , z S , φ S ] T denote the Cartesian coordinates and the orientation of the stereo camera in the local coordinate system. φ S is called the azimuth of the stereo camera and it has the same interpretation as with a magnetic compass. It is assumed that the stereovision camera is worn by a pedestrian and it is attached to the chest, say at z = h = 1.5 m above the ground. The optical axis of the camera is directed roughly parallel to the ground plane. The positioning algorithm has a precise 3D model of the environment that is defined for the local coordinate system-see Figure 12(c). Let us put a virtual camera in the (x V , y V , z V ) coordinates and set its orientation to φ V . Then, let us render the virtual environment based on the 3D model. The virtual camera should see the picture presented in Figure 12(c). On the other hand, the stereovision camera provides a depth image of the environment. Both images can be compared. Let L S and L V denote the depth image returned by the stereovision camera and retrieved from the virtual environment, correspondingly. f L (L S , L V ) is some cost function comparing the two depth images and returning comparison error. The function is defined by Equation (38).
Formally, the problem comes down to finding the vector where the function f L (L S , L V ) returns a minimum value, that is, the difference ||L S − L V || is minimal according to some criterion. Then the estimated coordinates of the stereo camera (that is the pedestrian location) are given by Equation (39). This task is handled by the particle filtering algorithm.
Assuming z S = z V = h reduces the dimension of the problem. The function f L is still not trivial. The depth picture L S is encumbered with many errors, see Figure 12(b). For example, sun illumination causes some parts of the depth map to have falsely determined values (see Figure 13). The function f L should be immune to these undesirable effects that can cause random bias errors, most troublesome errors to correct by any filter, including a particle filter which is used in the presented project. A random bias error is understood as an error whose expected value differs considerably from 0 for a significant period of time, e.g., 30 seconds. There can be also objects that can compromise the comparison between L S and L V and cannot be filtered by any methods, i.e., parked cars, trucks, bushes, posters, garbage bins, etc. Their influence can be minimized to some extent by excluding from comparison that part of L S which is below the optical axis of the camera. Hence the ground surface is also excluded from comparison. First L S is analysed to seek fragments for comparison. The method of analysing L S depends on many empirical factors that were verified during many trials. The idea is following: 1. Reject the part of the depth map L S that is below the optical axis of the camera.
2. Divide vertically the upper part of L S into n S = 5 equal regions S j , j = 1 . . . n S -see Figure 13(a). This forces to compare the depth maps L S and L V in different regions, which, to some extend, decorrelate random bias errors.
3. for j = 1 to n S (a) From the area S j , pick at random a small square window W j of W W by W W measurements, where W W = 24 pix.
(b) Check if W j is devoid of artifacts like sun illumination, falsely determined disparity etc.-c.f. Figure 13(b).
(c) Calculate the angle α j at which the camera sees the object pointed by W j (compare with the angle α in Figure 9).
(d) Calculate the distance d Sj to an object encompassed by W j .
(e) Check what the virtual camera sees at the angle α j from coordinates Calculate in this direction the distance to the nearest object d V j .
(f) Compare the distances d Sj and d V j by the distance error function f Lj . are denoted by W 1 through W 5 . One can see that for windows W 2 and W 3 the depth map was determined correctly and these windows should be compared with corresponding windows in the L V depth map. (b) The disparity values x d for each window W j as a function of the x pix coordinate in the disparity picture. For W 1 , W 4 and W 5 windows, the disparity x d changes from 0 pix to ≈2 pix which corresponds to distances ∞ to ≈16 m. These windows should be rejected from further analysis.

Return the comparison error between
(a) (b) Ad. 1, 2. The areas S j are shown in Figure 13(a).
Ad. 3a. The window W j is picked at random from the corresponding area S j , according to Figure 13(a).
Ad. 3b. By trial and error, the empirical measure (40) was proposed to check if the disparity map for W j is devoid of artifacts. x max d (41) and x min d (42) are the maximum and minimum values of disparities for a given window W j Window W j is stated to be devoid of artifacts if k j > k thr , where an empirical value of k thr ≈ 0.8. Ad. 3c. The angle α j at which the stereovision camera sees the object encompassed by the W j window can be calculated from Equation (43), where x (c) denotes the x coordinate of the centre of W j .
Ad. 3d. The distance d Sj to the object pointed by W j can be calculated from Equation (44). This is an arithmetical average of depths divided by the cos α j factor to obtain the Euclidean distance-c.f., Equations (26) and (27) that show the difference between the depth of a point and distance to the camera's Y axis.
Ad. 3e. Given the coordinates and orientation of a virtual camera in the 3D environment, the d V j distance from the virtual camera to the nearest object at α j angle can be easily obtained.
Ad. 3f. The corresponding distances d Sj and d V j are compared by the error function f Lj . For the integrity sake of Equation (47), the function f Lj should return 1 if window W j was rejected from comparison, i.e., k j ≤ k thr . The function f Lj reflects the correspondence between d Sj and d V j , whereby the comparison should be more tolerant for greater distances, according to formula (32)-the accuracy of determining d Sj drops with its square value. Following the reasoning in the previous chapter, the function f Lj , expressed by Equation (45), was proposed. The parameter a introduces some tolerance in case of unexpected objects registered by the stereo camera (e.g., cars, humans), c.f. Figure 14. Experimentally, a ∈< 0.95, 0.98 > gives desired results.
Ad. 4. Finally, the function f L comparing the two environments, i.e., real and virtual, can be defined by Equation (47).
The 3D match algorithm is evoked 10 times a second. 3.9. Particle Filtering 3.9.1. Introduction The particle filter is a sequential version of the Monte Carlo method [37] which enables to find solutions of a multidimensional problem by generating a large number of possible solutions, so-called particles, and then verifying these solutions by a given criterion, i.e., an error function. The particle filter can be looked at from the statistical point of view, whereby the solution is represented by a probability density function. The particle filter improves upon Kalman filter in case of multi-modal distributions and non-linear dynamics at the cost of high computation burden. Let i denote the number of a particle and L the total number of particles, thus i ∈< 1, L >. A particle at a given time instant k is represented by vector c i (k) given by Equation (48).
where x i (k) is a possible system state at time instant k and w i (k) is the weight associated with i-th particle. A possible system state is understood as a possible solution of the problem in question. In our case x i (k) represents a hypothetical user location and orientation, x i , y i denote coordinates in the local coordinate system defined in Section 3.1.2. φ i is the azimuth orientation. The aim of this simulation is to to find the most probable user location. Given the series of measurements y(1) . . . y(k) the pedestrian's location is given by the probability density function that can be approximated by the probability mass function (49).
where δ(·) is the Dirac delta function. Weights w i (k) of all generated particles should sum up to unity.

Implementation of the Particle Filtering Algorithm
The implementation of the particle filtering algorithm for processing the positioning data is as follows: 1. Initialization. At the algorithm outset for k=1, all the particle states x i (1) are randomly initialized according to a given distribution. The weights w i (1) are assigned equal values of 1 L . 2. Prediction. Based on the transition Equation (50), new particle states are predicted.
is a noise vector introduced to the state due to the error of u(k − 1), where each particle is perturbed with an individually generated vector is a transition function that calculates a new state based on the previous one, driving signals and errors thereof.
3. Measurement update. Each measurement y(k) updates the weights of the particles by Equation (51).
p(y(k)|x i (k) is a conditional probability density of measuring y(k), given the particle state x i (k).
In other words, p(y(k)|x i (k) characterizes error distribution of a given data source. Particles that diverge in the long run from measurements will assume small weights w i (k).

4.
Weights' normalization. The weights are normalized so that they sum up to 1-Equation (52).
5. State estimation. The most probable system state is estimated, e.g., by the first moment (53): 6. Resampling. After a number of algorithm iterations, all but a few particles have negligible weights and therefore do not participate in the simulation effectively. This situation is detected by calculating the so-called degeneration indicator, expressed by Equation (54), which is an inverse of the second moment reflecting the dispersion of the weights.
If all particles have the same weights of L −1 then d(k) reaches its maximum value of 1. As the weights start to differ, the d(k) indicator decreases. If d(k) falls below a given threshold, then a process called resampling is introduced and a new set of particles is created. The probability of copying a particle to the new set is proportional to its weight w i (k). Therefore particles that poorly approximate the system state are superseded by more "accurate" particles. Then all particles are assigned the same weight L −1 .

Implementation
Ad. 1. The particles can be initialized, e.g., around a first GPS readout. Ad. 2. Based on Figure 15, the transition equation for a particle i reads (55). Figure 15. An explanation of the unperturbed state transition: d(k) is the length of a step estimated by the pedometer. φ(k) is the azimuth orientation. ω z (k) is the angular velocity around the gravity axis. ω z is estimated by the gyroscope and the gravity estimation algorithm. (x(k − 1), y(k − 1)), (x(k), y(k)) are the coordinates in the previous and current time instant.
v i (k) and w i (k) are random realizations of the pedometer and gyroscope noise respectively. Each particle is perturbed individually. k is the time instant, where t = k · dT .
Ad. 3. The following sources are used to update the particle weights: • GPS. The measurement update for the GPS readouts are given by Equations (56) through (58) based on the Equation (7) w i (k) = w i (k − 1) · p GPS (r i , σ GPS ) (56) β GPS is a coefficient which enables to weight GPS readouts with appropriate importance. Larger values of β GPS makes the p GPS () function more tolerant to GPS errors.
• comparison between the 3D virtual environment model and depth picture retrieved from the stereo camera. The measurement update equation is given by Equation (60).
For each particle, a virtual environment is generated individually. The coordinates of the virtual camera are set to [x i , y i , h, φ i ]. h is the height at which the stereo camera is attached. Then the depth map L V,i is generated. The function f L (L S , L V,i ) is evoked for every particle. This is a computation demanding procedure. Note that, if an unexpected object appears in front of the stereo camera for 10 seconds, given particles' weights will be decreased by a 10/s·10s , c.f. Equation (45) (3D matching algorithm is run 10 times a second). This might introduce a large error in the position estimation. Bigger values of a diminish the benefit of using a stereo camera and 3D model of the urban terrain.

Trials
The trials were carried out at the campus of Lodz University of Technology, in an area of ca. 500 m by 500 m. The length of the trial path was ca. 2.6 km. The reference path was restored off-line by analysing images recorded by the camera. The average error of determining the reference path was ca. 0.5 m on average. The maximum error might have been 2-3 m. The delineated path is composed of straight lines. The location error is calculated as the distance between the appropriate path segment and the estimated location by a GPS receiver or the particle filter. Thus, actual positioning errors are larger. On the other hand, GPS errors assume dominant values in the direction perpendicular to a street. In the direction parallel to a street there are no buildings to occlude the sky. Hence, the signal from GPS satellites is not occluded, which results in a small error in the direction along a street [2].

Results
The results are presented by plots of paths, a histogram and a cumulated histogram of errors and a table that summarizes the errors. An interesting question is how much a given data source improves the accuracy of the positioning. Table 1 provides the answer. Therefore, several off-line tests were run with different combination of data sources. It was though assumed that the starting location is known.

Conclusions
The results were calculated based on ca. 1,900 GPS readouts, 3,400 particle filter estimates and 19,000 pictures from the stereo camera. The particle filter using all available data sources provides by far the best accuracy. For 99% of the time the accuracy is better than 4.5 m. This is still too large an error for safe navigation of visually impaired pedestrians. The main source of errors were introduced by misestimated direction, which is visible in Figure 22 at coordinates (410 m, 125 m). The turn is misestimated by ≈20 • . This error is too large for the non-linearity error mentioned in Section 3.1.1 or any other gyroscope error. Possibly the sensors came adrift and did not change its direction accordingly. This error, when not corrected immediately, has a ripple effect. This can be, however, taken care of. Also, more accurate sensors are available on the market. It is surmised that an error of 3 m for 99% of time is achievable.
It is interesting to note that when the initial position is known, the location of a pedestrian can be estimated without having to use a GPS receiver. The setup number 5 (see Table 1) gives best results after the setup number 4 and it is better than sole GPS readouts. Needless to say, this approach can be used in limited areas, e.g., inside buildings where GPS readouts are not available. The probability map concept limits the positioning error to the width of a street, ca. 10 m. Due to the imposed constraints on the pedestrian kinematics, the error dropped 3 times in terms of a root-mean-square error, which poses the best improvement from all other additional sources.
As mentioned before, on the bases of GPS readouts and the probability map, a user's location can be estimated within ca. 10 m. A stereo camera can refine the user's position down to 2 m, providing there are no unexpected objects, or difficult objects like trees, whose crown's radius could differ from the assumed 2 m. GPS readouts complement nicely with the algorithm in which the depth images derived from terrain model and stereoscopy are compared. In the presence of high buildings, GPS readouts are compromised. Under these circumstances the image matching algorithm performs well as the walls of buildings are confidently detectable by a stereo camera. On the other hand, in an open space GPS readouts are accurate down to, ca. 2-3 m. Then in turn, the image matching algorithm has no reference points to perform a viable comparison.
The presented algorithm works off-line. It is quite computationally demanding. The particle filtering algorithm requires generation of ca. 500 particles for good positioning accuracy. However, the task can be handled on-line without any optimization techniques by a dual core, 2.5 GHz standard notebook.