Error Analysis in a Stereo Vision-Based Pedestrian Detection Sensor for Collision Avoidance Applications

This paper presents an analytical study of the depth estimation error of a stereo vision-based pedestrian detection sensor for automotive applications such as pedestrian collision avoidance and/or mitigation. The sensor comprises two synchronized and calibrated low-cost cameras. Pedestrians are detected by combining a 3D clustering method with Support Vector Machine-based (SVM) classification. The influence of the sensor parameters on the stereo quantization errors is analyzed in detail, providing a point of reference for choosing the sensor setup according to the application requirements. The sensor is then validated in real experiments. Collision avoidance maneuvers by steering are carried out by manual driving. A real time kinematic differential global positioning system (RTK-DGPS) is used to provide ground truth data corresponding to both the pedestrian and the host vehicle locations. The field test provided encouraging results and supported the validity of the proposed sensor for use in the automotive sector in applications such as autonomous pedestrian collision avoidance.


Introduction
Pedestrian protection is a key problem in the context of the automotive industry and its applications. Sensor systems onboard the vehicles are required for predicting the host vehicle-to-pedestrian (H2P) distance as well as the time-to-collision (TTC). Cameras are the most commonly used sensors for that purpose. The use of video sensors comes quite naturally for the problem of pedestrian detection since they provide texture information, which enables the use of quite discriminative pattern recognition techniques. The human visual perception system is perhaps the best example of what performance might be possible with such sensors, if only the appropriate algorithm is used.
Pedestrian detection is a difficult task from a computer vision perspective. Large variations in pedestrian appearance (e.g., clothing, pose, etc.) and environmental conditions (e.g., lighting, moving background, etc.) make this problem particularly challenging. The first stage in most systems consists of identifying generic obstacles as regions of interest (ROIs) using prior scene knowledge (camera calibration, ground plane constraint, etc.) and a computationally efficient method. Subsequently, a more expensive pattern recognition step is applied. The lack of explicit models leads to the use of machine learning techniques, where an implicit representation is learned from features obtained from thousands of samples. Over the last decade, a considerable number of vision-based pedestrian detection systems have been proposed. Several remarkable surveys have been presented [1][2][3][4], some of them recently published [5,6]. Most of the work concerning human motion summarized in [1][2][3][4][5] focuses on pedestrian protection applications in the intelligent vehicle domain, covering both passive and active safety techniques. An overview of the current state of the art from both a methodological and an experimental perspective is presented in [6], where a novel benchmark set has been made publicly available. We refer to these surveys for general and detailed background concerning pedestrian detection for automotive applications. Accurate depth information is essential in the area of pedestrian protection applications (e.g., driving assessment, collision avoidance, collision mitigation, etc.). One of the advantages of stereo vision sensors is their ability to compute a detailed 3D representation of the scene around the vehicle, by passive sensing and at low sensor costs (compared with active sensors such as laser or radar). Depth information is obtained by solving the correspondence problem and performing triangulation.
However, the depth reconstruction accuracy has its limitations due to the discrete nature of the stereo vision sensor. These limitations have to be analyzed to validate the sensor as a source of information for automotive applications. In addition, designing a stereo system involves choosing several parameters: the focal length of the cameras, the image size and the distance between the cameras (baseline). Unfortunately, a trade-off must be reached between accurate depth estimation and other parameters such as the computational load and the size of the frontal blind zone. The theoretical characterization of range estimation errors based on system parameters for stereo vision is a well-known topic [7,8], including applications such as robotics [9] and autonomous navigation [10]. However, these issues have been somewhat neglected in the context of pedestrian protection applications.
In this paper we present an analytical study of the depth estimation error of a stereo vision-based pedestrian detection sensor for automotive applications such as pedestrian collision avoidance and/or mitigation. The sensor comprises two synchronized and calibrated low-cost cameras. Pedestrians are detected by combining a 3D clustering method with Support Vector Machine-based (SVM) classification. The influence of the sensor parameters on the stereo quantization errors is analyzed in detail, providing a point of reference for choosing the sensor setup according to the application requirements. The sensor is then validated in real experiments. Collision avoidance maneuvers by steering are carried out by manual driving. A real time kinematic differential global positioning system (RTK-DGPS) is used to provide ground truth data corresponding to both the pedestrian and the host vehicle locations. The field test provided encouraging results and supported the validity of the proposed sensor with respect to the accuracy required in one of the most challenging applications in the context of the automotive industry.
The remainder of this paper is organized as follows: Section 2 provides an overall description of the proposed sensor, covering details of implementation, focusing on the analysis of the depth estimation error and the sensor setup and describing the proposed maneuver for pedestrian collision avoidance. Experimental results that validate the proposed approach are presented and discussed in Section 3. Finally, Section 4 summarizes the conclusions.

System architecture
The experimental vehicle used in this work is a car-like robot (a modified Citroën C4) which can be seen in Figure 1. It has an onboard computer housing the image processing system, an RTK-DGPS connected via an RS232 serial port and a pair of synchronized low-cost digital cameras connected via a FireWire port. The stereo vision sensor uses 320 × 240 pixel greyscale images with a baseline of approximately 300 mm and a focal length of 4.2 mm. These parameters satisfy the application requirements, as we will see in subsequent sections. The cameras are calibrated in a semi-supervised fashion by using a modified version of the Camera Calibration Toolbox for Matlab and a chessboard pattern. Thus we obtain the intrinsic parameters of each camera (focal length and translation vector). We refer to [11] for a detailed description of these parameters. Distortion parameters are used to compensate for both radial and tangential lens distortions.

Stereo vision-based pedestrian detection
Pedestrian detection is carried out using the system described in [12,13] (Figure 2 depicts an overview of the pedestrian detection architecture). Non-dense 3D maps are computed using a robust correlation process that reduces the number of matching errors [14]. The camera pitch angle is dynamically estimated using the so-called virtual disparity map, which provides better performance than other representations such as the v-disparity map or the YOZ map [13]. Two main advantages are achieved by means of pitch compensation. First, the accuracy of the time-to-collision estimation in car-to-pedestrian accidents is increased. Second, the separation between road points and obstacle points is improved, resulting in lower false-positive and false-negative detection rates [13]. 3D maps are filtered assuming a planar road surface (which is acceptable in most cases), i.e., points below the actual road profile and points above the actual road profile plus the maximum pedestrian height are removed since they do not correspond to obstacles (possible pedestrians). The resulting filtered 3D maps are used to obtain the regions of interest (ROIs).
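The planar-road filtering step can be illustrated with a minimal sketch. It assumes that 3D points are given as (x, y, z) tuples in metres with y measured as height above the road plane; the function name, the frame convention and the 2.0 m maximum pedestrian height are assumptions made for illustration and are not values taken from the described system.

```python
def filter_obstacle_points(points, road_height=0.0, max_pedestrian_height=2.0):
    """Keep points lying above the road plane and below the height limit.

    points: iterable of (x, y, z) tuples in metres, y being the height
    above the road (assumed convention). Points on or below the road and
    points above road_height + max_pedestrian_height are discarded.
    """
    upper = road_height + max_pedestrian_height
    return [p for p in points if road_height < p[1] <= upper]


# Hypothetical points: one below the road, two on a pedestrian, one too high.
points = [
    (0.5, -0.1, 8.0),
    (0.4, 0.9, 8.1),
    (0.6, 1.7, 8.0),
    (2.0, 3.5, 12.0),
]
obstacle_points = filter_obstacle_points(points)
```

In practice a small tolerance above the estimated road profile would probably be needed so that noisy road points are not kept as obstacles.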
Based on the idea that obstacles (including pedestrians) have a higher density of 3D points than the road surface, ROI selection can be carried out by determining those positions in the 3D space where there is a high concentration of 3D points. A 3D subtractive clustering method is proposed to cope with the ROI selection stage using sparse data. The idea is to find high-density regions, which are roughly modelled by a single 3D Gaussian distribution, in the Euclidean space. The parameters of each Gaussian distribution are defined according to a minimum and maximum extent of pedestrians. Thus, whereas pedestrians are correctly selected, bigger obstacles such as vehicles or groups of pedestrians are usually split in two or more parts. To cope with the stereo accuracy the method is adapted to the expected depth error [14].
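The density-based ROI selection can be sketched with a generic subtractive clustering routine. This is not the adapted method of [14]: it uses a single isotropic radius rather than Gaussians sized to the pedestrian extent and to the expected depth error, and the radius, the 1.5 factor and the single stopping ratio are assumed values chosen for the example.

```python
import math


def subtractive_clustering(points, radius=0.6, stop_ratio=0.5):
    """Return cluster centres found by a generic subtractive clustering.

    radius and stop_ratio are illustrative assumptions, not paper values.
    """
    if not points:
        return []
    alpha = 4.0 / radius ** 2            # weight used for the point density
    beta = 4.0 / (1.5 * radius) ** 2     # wider kernel used when subtracting

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Density of each point: large when many other points are close to it.
    density = [
        sum(math.exp(-alpha * sqdist(p, q)) for q in points) for p in points
    ]
    first_peak = max(density)
    centres = []
    while True:
        best = max(range(len(points)), key=lambda i: density[i])
        peak = density[best]
        if peak < stop_ratio * first_peak:
            break
        centre = points[best]
        centres.append(centre)
        # Remove the influence of the accepted centre from every point.
        density = [
            d - peak * math.exp(-beta * sqdist(p, centre))
            for p, d in zip(points, density)
        ]
    return centres


# Hypothetical sparse 3D points: two dense groups and one isolated point.
cloud = [
    (0.0, 1.0, 10.0), (0.1, 1.1, 10.0), (-0.1, 0.9, 10.1), (0.0, 1.2, 9.9),
    (3.0, 1.0, 15.0), (3.1, 0.9, 15.1), (2.9, 1.1, 14.9),
    (8.0, 0.5, 25.0),
]
centres = subtractive_clustering(cloud)
```

With these inputs the isolated point never reaches the stopping threshold, so only the two dense groups yield candidates.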
The 2D candidates are then obtained by projecting the 3D points of each resulting cluster and computing their bounding boxes. A Support Vector Machine-based (SVM) classifier is then applied using an optimal combination of feature-extraction methods and a by-components approach [12]. The RBF kernel provides better performance, although the linear kernel is the best solution from a computational point of view. Each candidate (possible pedestrian) is divided into six regions (head, left and right arms, left and right legs and a region between the legs). Each region is independently learnt using different features. The optimal combination is obtained using texture features (Texture Unit Number) for the head and the region between the legs, histograms of grey level differences for the arms, and Histograms of Oriented Gradients (HOG) for the legs [12]. The final classifier is trained with 67,650 samples (22,550 pedestrians and 45,100 non-pedestrians, including mirrored images).
Nonetheless, the 2D bounding box corresponding to a 3D candidate might not perfectly match the actual pedestrian appearance in the image plane. Multiple candidates are generated around each original candidate. The so-called multi-candidate (MC) approach proves to increase the detection rate, the accuracy of depth measurements, as well as the detection range [12]. The resulting pedestrians are tracked by means of a Kalman filter and the data association problem is solved using the Hungarian method.
The last block of Figure 2 is based on the computation of the time-to-collision (TTC) between the host vehicle and the pedestrians ahead and is defined according to the application requirements. For example, in the case of driving alert applications this stage will trigger alarms to the driver depending on the host-to-pedestrian (H2P) TTC. In the case of pedestrian collision mitigation applications [13] an activation signal will be sent to a pedestrian protection airbag and/or an active hood system. For pedestrian collision avoidance applications this stage will trigger the corresponding signals to the brake pedal and/or the steering wheel controllers. The pedestrian detection system runs in real time (25 Hz) with 320 × 240 images and a baseline of approximately $t_x = 300$ mm. The stereo vision-based pedestrian detection system has been tested in real collision-mitigation experiments by active hood triggering and in collision-avoidance tests by braking or decelerating [13]. In this paper, its performance is analyzed in real experiments in the context of pedestrian collision avoidance by steering, including analysis of the depth estimation errors. Note that although depth estimation uncertainties have an effect on almost all blocks of Figure 2 (excluding block number 4, feature selection and classification), we will focus on the last stage by analyzing host-to-pedestrian distance estimation errors.
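As a rough illustration of the decision stage, the sketch below computes the TTC as the H2P distance divided by the closing speed and maps it to an action. The constant-speed model, the function names and both thresholds are assumptions made for the example; the text does not specify the trigger values used by any of the applications.

```python
def time_to_collision(distance_m, closing_speed_mps):
    """TTC in seconds under a constant closing speed (assumed model).

    Returns infinity when the vehicle is not approaching the pedestrian.
    """
    if closing_speed_mps <= 0.0:
        return float("inf")
    return distance_m / closing_speed_mps


def select_action(ttc_s, warn_below_s=4.0, act_below_s=1.5):
    """Map a TTC value to an action; both thresholds are hypothetical."""
    if ttc_s < act_below_s:
        return "act"
    if ttc_s < warn_below_s:
        return "warn"
    return "none"


# Pedestrian 15 m ahead while driving at 30 km/h.
speed_mps = 30.0 / 3.6
ttc_s = time_to_collision(15.0, speed_mps)
action = select_action(ttc_s)
```

Which action is attached to each threshold (alarm, airbag or hood activation, brake or steering command) depends on the application, as described above.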

Collision avoidance maneuver by steering
Pedestrian collision avoidance is one of the most difficult and challenging automatic driving operations for autonomous vehicles and can be carried out by braking or by steering. Before designing autonomous collision avoidance maneuvers, a proper analysis of the sensor errors has to be performed in order to validate the proposed approach. In our case, the stereo vision-based pedestrian detection sensor is evaluated in real scenarios with real drivers and real pedestrians. Since emergency brake maneuvers are risky, a set of experiments has been devised in which drivers were requested to perform pedestrian collision avoidance maneuvers by steering at speeds of 10, 15, 20, 25 and 30 km/h. Higher speeds have not been considered due to the associated risks.
The avoidance maneuver has to fulfil some conditions. First, the vehicle has to be moving along a straight road in the right lane. Second, the pedestrian has to be located in the same lane. Third, the left lane has to be free and long enough for the pedestrian collision avoidance maneuver to be completed at the current speed. As soon as the driver detects a potential pedestrian collision that can be avoided, a lane change to the adjacent left lane is performed. Once the pedestrian has been passed, a second lane change is carried out to go back to the right lane (see Figure 3).

Stereo quantization error
There is a significant amount of published research on characterization of range estimation errors based on system parameters for stereo vision [7,8]. Here, our approach for computing the quantization error covariance for each point and the corresponding host-to-pedestrian distance estimation error is briefly described.
Given a calibrated rig of cameras and a correspondence between two points, one on the left camera, $(u_l, v_l)$, and another one on the right, $(u_r, v_r)$, the 3D position of the point in the world coordinate system is obtained from the linear system given in [11], where A is the matrix containing the equations for the 3D to 2D transformation for each one of the cameras and b is the independent term of the same equations. Matrices A and b are written as a function of the cameras' intrinsic parameters. The intrinsic parameters of each camera, $[M_L, M_R]$, are estimated using an off-line calibration process and describe the 3D to 2D transformation for each one of the cameras. Applying the product rule for matrices and introducing an auxiliary matrix C, the expression for how the inaccuracies in the pixel position affect the 3D reconstruction is obtained [Equation (7)]. The partial derivatives in Equation (7) are then solved using Equations (2) and (3), and the intrinsic matrix values from Equation (4) are substituted. Given the uncertainties in pixels on the measure of $(u_l, v_l, u_r, v_r)^T$, the final expression for the quantization error covariance is obtained [Equation (12)]; note that the errors in the image coordinates are assumed to be independent, so the covariance matrix is diagonal.
As each pedestrian is roughly modelled by a high concentration of 3D points, the final host-to-pedestrian distance estimation error is defined as the mean value of the zz component of the quantization error covariance over all 3D points that lie within the pedestrians detected by the subtractive clustering algorithm.
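The idea of propagating pixel uncertainty through the reconstruction can be sketched numerically. The code below is a generic illustration, not the closed-form derivation summarized above: it assumes 3 × 4 projection matrices, solves the linear system in the least-squares sense via the normal equations, and estimates the partial derivatives by central differences. The projection matrices, the 333 px focal length, the 0.5 px uncertainty and all function names are assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def project(P, X):
    """Project the 3D point X with the 3x4 projection matrix P."""
    h = [sum(P[r][k] * X[k] for k in range(3)) + P[r][3] for r in range(3)]
    return h[0] / h[2], h[1] / h[2]


def triangulate(P_left, P_right, pixels):
    """Least-squares 3D point from pixels = (u_l, v_l, u_r, v_r)."""
    u_l, v_l, u_r, v_r = pixels
    A, b = [], []
    for P, (u, v) in ((P_left, (u_l, v_l)), (P_right, (u_r, v_r))):
        for coord, row in ((u, P[0]), (v, P[1])):
            A.append([coord * P[2][k] - row[k] for k in range(3)])
            b.append(row[3] - coord * P[2][3])
    # Normal equations (A^T A) X = A^T b, solved with Cramer's rule.
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(3)]
    det = det3(AtA)
    solution = []
    for k in range(3):
        M = [row[:] for row in AtA]
        for i in range(3):
            M[i][k] = Atb[i]
        solution.append(det3(M) / det)
    return solution


def triangulation_variances(P_left, P_right, pixels, pixel_sigma=0.5, step=1e-3):
    """First-order variances (var_x, var_y, var_z) of the triangulated point.

    Pixel errors are treated as independent with the same standard deviation.
    """
    variances = [0.0, 0.0, 0.0]
    for k in range(4):
        plus, minus = list(pixels), list(pixels)
        plus[k] += step
        minus[k] -= step
        X_plus = triangulate(P_left, P_right, plus)
        X_minus = triangulate(P_left, P_right, minus)
        for i in range(3):
            derivative = (X_plus[i] - X_minus[i]) / (2.0 * step)
            variances[i] += (derivative * pixel_sigma) ** 2
    return variances


# Assumed parallel rig: 333 px focal length, 320 x 240 image, 0.3 m baseline.
F_PX, CX, CY, T_X = 333.0, 160.0, 120.0, 0.3
P_LEFT = [[F_PX, 0.0, CX, 0.0], [0.0, F_PX, CY, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_RIGHT = [[F_PX, 0.0, CX, -F_PX * T_X], [0.0, F_PX, CY, 0.0], [0.0, 0.0, 1.0, 0.0]]
point = (0.5, 0.2, 10.0)
pixels = project(P_LEFT, point) + project(P_RIGHT, point)
estimate = triangulate(P_LEFT, P_RIGHT, pixels)
var_x, var_y, var_z = triangulation_variances(P_LEFT, P_RIGHT, pixels)
```

Averaging `var_z` over the points of a cluster would give a per-pedestrian figure in the spirit of the definition above; a non-parallel rig only requires different projection matrices.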

System parameters
Designing a stereo imaging system requires knowing how the various system parameters affect the depth estimation error, especially for automotive applications because of their safety implications. The design involves choosing three main parameters: the focal length of the cameras, the distance between the cameras (baseline) and the size of the images. The most important application requirements are the depth estimation error, the runtime and the size of the frontal blind zone, although the range estimation error is usually the deciding factor, and the system parameters are chosen in order to meet an acceptable range error. A considerable number of statistical depth error analyses have been carried out on this topic [7,8], deriving quantitative expressions. Although these expressions describe the relationship between the range error and the system parameters, it is still difficult to obtain a fast method to define those parameters, especially when there are many parameters that affect the depth estimation error.
In order to facilitate the choice of the system parameters we propose the use of pre-computed graphs covering different settings. Whereas the H2P distance estimation error is computed assuming the general stereo case (non-parallel optical axes), the graphs are computed using the ideal case. From the geometry of a parallel stereo pair (ideal case), i.e., two cameras with parallel optical axes and the same intrinsic parameters, the depth is defined in terms of the following quantities: $f$ is the focal length of the cameras, $x_r$ and $x_l$ are the x-projections in metric coordinates on the right and left image planes respectively, $d_x$ is the length of a pixel in the x-axis, $u_r$ and $u_l$ are the x-projections in pixel coordinates on the right and left image planes respectively, and $d_u$ represents the disparity value in the x-axis of the image in pixels. Given the baseline $t_x$ and the focal length in pixels in the x-axis, the range error can be evaluated for each sensor setup. As can be seen in the resulting graphs, the higher the baseline the lower the error. Let us consider that our system (with 320 × 240 images and $f = 4$ mm) requires a relative depth error of no more than 10% up to distances of 20 m. Then the baseline should be greater than 60 cm. If the relative error has to be no more than 5% up to distances of 5 m, then the minimum baseline would be 30 cm (and so on). If the baseline is set to $t_x = 300$ mm and the images have a size of 320 × 240 pixels, the higher the focal length the lower the error, as can be seen in Figure 5. Finally, Figure 6 shows both the absolute and the relative range errors for different image sizes corresponding to a sensor with $f = 4$ mm and $t_x = 300$ mm. As can be observed, the larger the image size the lower the error. These graphs can be used for determining the system parameters according to the depth error requirements. Accordingly, we can conclude that the higher the baseline, the focal length and the size of the images, the lower the depth error.
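The ideal-case relationship behind such graphs can be sketched as follows. The code uses the standard parallel-stereo model, in which the depth is the focal length in pixels times the baseline divided by the disparity, so a disparity step of one pixel produces a relative depth error of roughly Z / (f_x t_x). The 12 µm pixel pitch (giving a focal length of about 333 px) is an assumption chosen for the example, as are the function names and the one-pixel disparity step.

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth of a point in the ideal (parallel axes) stereo case."""
    return f_px * baseline_m / disparity_px


def relative_depth_error(depth_m, f_px, baseline_m, disparity_step_px=1.0):
    """First-order relative error dZ / Z caused by one disparity step."""
    return depth_m * disparity_step_px / (f_px * baseline_m)


def minimum_baseline(max_relative_error, max_depth_m, f_px, disparity_step_px=1.0):
    """Smallest baseline keeping the relative error bounded up to max_depth_m."""
    return max_depth_m * disparity_step_px / (f_px * max_relative_error)


# Assumed sensor: f = 4 mm and a hypothetical 12 um pixel pitch (about 333 px).
F_PX = 4.0 / 0.012

baseline_10pct_20m = minimum_baseline(0.10, 20.0, F_PX)   # about 0.6 m
baseline_5pct_5m = minimum_baseline(0.05, 5.0, F_PX)      # about 0.3 m
error_at_10m = relative_depth_error(10.0, F_PX, 0.3)      # about 0.10
```

Under these assumptions the relative error with a 0.3 m baseline grows by about one percentage point per metre of range.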
However, other parameters have to be taken into account when designing a stereo sensor: the computational load (which is defined by the range of the disparity search space) and the size of the frontal blind zone. As the baseline and the focal length increase, both the size of the frontal blind zone (see Figure 7) and the range of the disparity search space (see Figure 8) also increase. In addition, the larger the images, the larger the disparity search space (and hence the computational load), as can be seen in Figure 9.
Figure 7. Blind frontal range as a function of the focal length, for different baselines. Note that the size of the images has no effect on the size of the blind frontal area.
The proposed stereo vision-based pedestrian detection system uses 320 × 240 images, a baseline of $t_x = 300$ mm and a focal length of f = 4 mm. This sensor setup is mainly defined as a trade-off between accurate range estimation on the one hand and low computational load and a small frontal blind area on the other. We can derive from Figures 4-6 that this sensor setup implies an almost linear relationship (with a slope approximately equal to 1) between the range and the relative range error up to distances of 15 m. As can be observed in Figure 7, the size of the blind frontal area is mainly defined by the focal length of the cameras. In our case, a focal length f = 4 mm defines a blind frontal area shorter than 1.5 m. Finally, our sensor setup implies a disparity search space of 50 px in a range from 2 m to 30 m, as depicted in Figures 8 and 9, which allows real-time stereo computation. Higher resolutions, e.g., 640 × 480, would certainly produce lower relative depth errors (almost half the error, as can be observed in Figure 6). However, the disparity search space would be increased by a factor of 2 (Figure 9). This is also applicable to the focal length: a focal length of f = 8 mm would reduce the relative depth error by a factor of 2 up to distances of 15 m (see Figure 5), but the disparity search space would also be increased by a factor of 2 (see Figure 8).
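The trade-off with the computational load can be illustrated with the ideal parallel-stereo model, in which the disparity at depth Z equals the focal length in pixels times the baseline divided by Z. The sketch below is only an approximation of the behaviour described above; the focal length of about 333 px (4 mm with a hypothetical 12 µm pixel pitch) and the function names are assumptions.

```python
def disparity_px(depth_m, f_px, baseline_m):
    """Disparity in pixels of a point at depth_m (ideal parallel stereo)."""
    return f_px * baseline_m / depth_m


def disparity_search_range(z_min_m, z_max_m, f_px, baseline_m):
    """Width in pixels of the disparity interval covering [z_min_m, z_max_m]."""
    return (disparity_px(z_min_m, f_px, baseline_m)
            - disparity_px(z_max_m, f_px, baseline_m))


F_PX = 4.0 / 0.012  # assumed: f = 4 mm, hypothetical 12 um pixel pitch

base_range = disparity_search_range(2.0, 30.0, F_PX, 0.3)
# Doubling the focal length in pixels (f = 8 mm, or 640 x 480 images with
# the same lens and sensor) doubles the search range.
doubled_range = disparity_search_range(2.0, 30.0, 2.0 * F_PX, 0.3)
```

With these assumed values the search range between 2 m and 30 m is a little under 50 px, which is in line with the figure quoted above.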

Experimental Results
The proposed stereo vision-based pedestrian detection sensor is evaluated in a set of experiments carried out in one of the most challenging tasks in the context of automotive applications: collision avoidance maneuvers. The experimental setup is described in Figure 10. Ground truth data corresponding to both the pedestrian position and the vehicle position are obtained from the RTK-DGPS (after linear interpolation due to its low sample frequency: 5 Hz). First, we place the DGPS at the dummy position for a few seconds and obtain its coordinates. Secondly, we install the DGPS sensor onboard the vehicle. Finally, several manually driven avoidance/mitigation maneuvers at different speeds (10, 15, 20, 25 and 30 km/h) are recorded, saving data from both the RTK-DGPS and the stereo sensor. Thus, we can compare the results provided by the stereo vision sensor and determine its suitability for this important automotive application. In order to support the use of the RTK-DGPS sensor as ground truth we have devised a simple experiment in which we place the sensor at the dummy position and record the global position for 90 s. Figure 11(a) shows the obtained positions. Figure 11(b) depicts the deviation over time with respect to the average value. The maximum deviations in the x- and y-axes are 5 mm and 5.6 mm respectively. The standard deviations in the x- and y-axes are 0.0036 mm and 0.0041 mm respectively. These values are at least two orders of magnitude lower than the errors obtained from the stereo sensor, supporting the use of the RTK-DGPS as ground truth data provider (note that the default coverage factor k in GPS measurements is 2-3 [15]; commonly k = 2 corresponds to a level of confidence of about 95%, whereas k = 3, used when a higher level of confidence is needed, corresponds to a level of confidence of about 99%).
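The linear interpolation of the 5 Hz ground truth to the 25 Hz camera frames can be sketched as below. The function name, the clamping at both ends of the track and the sample values are assumptions made for the example; only the two sampling rates come from the text.

```python
from bisect import bisect_right


def interpolate(times_s, values, t):
    """Piecewise-linear interpolation of a sampled signal at time t.

    times_s must be strictly increasing; values outside the sampled
    interval are clamped to the first or last sample.
    """
    if t <= times_s[0]:
        return values[0]
    if t >= times_s[-1]:
        return values[-1]
    i = bisect_right(times_s, t) - 1
    t0, t1 = times_s[i], times_s[i + 1]
    weight = (t - t0) / (t1 - t0)
    return values[i] + weight * (values[i + 1] - values[i])


# Hypothetical easting samples (metres) from the 5 Hz receiver.
gps_times_s = [0.0, 0.2, 0.4]
gps_easting_m = [100.0, 101.0, 102.0]

# Camera frames at 25 Hz over the same interval.
frame_times_s = [k / 25.0 for k in range(11)]
easting_at_frames = [
    interpolate(gps_times_s, gps_easting_m, t) for t in frame_times_s
]
```

The same routine would be applied independently to each coordinate of the vehicle track.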
In order to compare these trajectories with the ones provided by the stereo sensor, the relative car-to-pedestrian positions with respect to the left camera have to be computed. This transformation is carried out by applying two translations: one from the UTM global reference to the RTK-DGPS onboard the vehicle and another from the DGPS to the left camera. The orientation of both axes is computed using the longitudinal movement of the vehicle. Figures 13(a)-(c) depict the host-to-pedestrian distance supplied by the DGPS, with the reference located on the moving vehicle (left camera), and the host-to-pedestrian distance provided by the stereo sensor as well as their quantization errors [the zz component from Equation (12)], which are drawn with dotted lines, corresponding to the avoidance maneuver at 10, 20 and 30 km/h respectively. Some remarkable conclusions can be drawn from these figures. The maximum range (25-30 m) and the inverse relationship between the depth and the stereo accuracy can be easily appreciated (as demonstrated in Section 2.3). The ground truth measurements are almost always (99%) inside the limits of the stereo measurements plus their corresponding quantization errors, which shows that the stereo sensor provides sufficiently accurate information despite its inherent accuracy constraints. There are some cases where the H2P ground truth measurements are outside the error interval because the stereo quantization errors are computed from the raw measurements rather than from the filtered values (note that the Kalman filter blocks high frequency changes) [e.g., see frame number 30 in Figure 13(c)]. Although stereo depth measurements are not reliable at long distances, their accuracy improves as the collision risk grows, i.e., as the car-to-pedestrian distance decreases. For example, at 15 m the depth error is about ±1.5 m, at 10 m it is about ±0.7 m and at 5 m it is lower than ±0.2 m.
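The change of reference and the comparison against the error bounds can be sketched as follows. This is a simplified 2D illustration under several assumptions: positions are (east, north) UTM pairs, the heading is taken from two consecutive vehicle positions, the antenna-to-camera offset is a made-up mounting value, and the stereo distances in the example are hypothetical (only the three error bounds are the ones quoted above).

```python
import math


def heading_from_motion(prev_en, curr_en):
    """Heading (rad, measured from the east axis) of the vehicle motion."""
    return math.atan2(curr_en[1] - prev_en[1], curr_en[0] - prev_en[0])


def pedestrian_in_camera_frame(ped_en, antenna_en, heading_rad, antenna_to_camera):
    """Return (longitudinal, lateral) pedestrian position from the left camera.

    antenna_to_camera is the (longitudinal, lateral) camera offset from the
    antenna in the vehicle frame; lateral is positive to the left.
    """
    d_east = ped_en[0] - antenna_en[0]
    d_north = ped_en[1] - antenna_en[1]
    # First translation (UTM reference to antenna), expressed along the
    # vehicle axes given by the heading.
    longitudinal = d_east * math.cos(heading_rad) + d_north * math.sin(heading_rad)
    lateral = -d_east * math.sin(heading_rad) + d_north * math.cos(heading_rad)
    # Second translation (antenna to left camera).
    return (longitudinal - antenna_to_camera[0], lateral - antenna_to_camera[1])


def fraction_within_bounds(ground_truth, measured, errors):
    """Fraction of frames whose ground truth lies within measured +/- error."""
    inside = sum(
        1 for g, m, e in zip(ground_truth, measured, errors) if abs(g - m) <= e
    )
    return inside / len(ground_truth)


# Vehicle driving north; pedestrian 20 m ahead of the antenna, 1 m to the east.
heading = heading_from_motion((0.0, 0.0), (0.0, 1.0))
relative = pedestrian_in_camera_frame((1.0, 21.0), (0.0, 1.0), heading, (1.0, 0.0))

coverage = fraction_within_bounds(
    ground_truth=[15.0, 10.0, 5.0],
    measured=[16.2, 10.5, 5.3],   # hypothetical stereo distances
    errors=[1.5, 0.7, 0.2],
)
```

Here the pedestrian ends up 19 m ahead of the camera and 1 m to its right, and two of the three hypothetical frames fall within the bounds.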
As suggested in [16] for braking maneuvers, we deduce that for avoidance maneuvers the decision to start the maneuver may well be based on TTC information as directly available to the driver from the optic flow field. This TTC information is an important cue for the driver in detecting potentially dangerous situations. The error is clearly unacceptable for TTC values above 8 s. However, the accuracy of the measurements increases as the TTC decreases.
In Table 1 we show the root mean square error (RMSE) of the TTC for all the speeds, specifying the error for TTC values lower than 8 s and 4 s. On average, the error for TTC < 8 s is lower than 0.3 s and for TTC < 4 s it is lower than 0.1 s. In addition, we can see that the higher the speed the larger the error, although this relationship is not linear. A possible explanation for this may be the influence of other camera settings such as the exposure time. In addition, the Kalman filter performance can be reduced since the higher the speed, the fewer measurements are available.
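A thresholded RMSE of this kind can be computed as in the sketch below. Whether the TTC threshold is applied to the reference or to the estimated value is not stated, so applying it to the reference TTC is an assumption, as are the function name and the sample values.

```python
import math


def ttc_rmse(estimated_ttc_s, reference_ttc_s, max_ttc_s):
    """RMSE over the samples whose reference TTC is below max_ttc_s.

    Returns NaN when no sample satisfies the threshold.
    """
    errors = [
        est - ref
        for est, ref in zip(estimated_ttc_s, reference_ttc_s)
        if ref < max_ttc_s
    ]
    if not errors:
        return float("nan")
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))


# Hypothetical TTC samples (seconds) along one approach.
reference = [9.0, 6.0, 3.0, 1.0]
estimated = [11.0, 6.4, 3.1, 1.0]

rmse_below_8s = ttc_rmse(estimated, reference, 8.0)
rmse_below_4s = ttc_rmse(estimated, reference, 4.0)
```

With these made-up samples the error shrinks as the threshold is lowered, which mirrors the trend reported in Table 1.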

Conclusions
This paper presents an analytical study of the depth estimation error of a stereo vision-based pedestrian detection sensor for automotive applications such as pedestrian collision avoidance and/or mitigation. The sensor comprises two synchronized and calibrated low-cost cameras, providing information about the relative pedestrian position with respect to the host vehicle (H2P distance) and the TTC. Pedestrians are detected in a six-stage process: non-dense reconstruction, pitch estimation, 3D clustering, 2D classification, tracking and decision making.
The accuracy of the measurements provided by the proposed sensor is obtained by computing the stereo quantization error. The sensor setup is defined according to the application requirements. The relationship between the relative range error and the sensor parameters (focal length, baseline and image size) is analyzed by means of graphs.
The proposed sensor is validated in a set of experiments in which real collision avoidance maneuvers were carried out by real drivers and with real pedestrians at speeds of up to 30 km/h. The experimental results demonstrate that the sensor provides suitable measurements despite its inherent accuracy constraints due to the quantization error. Although the sensor measurements (H2P distance and TTC) are not reliable at long distances, their quantization errors decrease as both the distance and the TTC decrease. In other words, the higher the risk, the better the sensor accuracy.
These statements can be accepted up to speeds of 30 km/h. The risks associated with performing collision avoidance maneuvers at higher speeds are not acceptable with the current experimental setup. However, one main conclusion can be extrapolated from our results: higher speeds will entail higher errors in the estimated TTC, compromising the effectiveness of the proposed approach. In order to increase the accuracy of the measurements provided by the stereo sensor, higher resolution images and a longer baseline can be used. However, that would increase the computational cost.
The field test provided encouraging results and supported the validity of the proposed sensor for use in the automotive sector in applications such as autonomous pedestrian collision avoidance.