3.1. System Setup and Nighttime Stereo Dataset
To evaluate the effectiveness and efficiency of the proposed approach, we built a complete nighttime pedestrian detection system comprising hardware and algorithmic parts. The hardware included a binocular camera, a computer, a network cable and brackets. We built the binocular camera ourselves from two inexpensive near-infrared network cameras, which acquired images of the scene; with its 190-mm baseline, the working range of the system was about 1–10 m. The computer, connected to the camera through the network cable, ran the algorithm that processed the images from the binocular camera. Finally, we fixed the camera on a bracket so that the device could easily be moved between monitoring scenarios for data acquisition and system performance testing.
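The 1–10 m working range follows from the standard stereo depth relation Z = f·B/d on rectified images. The sketch below illustrates this relation; only the 190-mm baseline comes from the text, while the focal length and disparity search range are illustrative assumptions, not the system's actual parameters.

```python
# Sketch of the stereo depth relation Z = f * B / d used to estimate the
# working range of a binocular rig. Focal length and disparity range are
# hypothetical; only the 190-mm baseline is taken from the paper.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (m) of a point given its disparity (px) on a rectified pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

BASELINE_M = 0.19             # 190-mm baseline from the paper
FOCAL_PX = 800.0              # hypothetical focal length in pixels
MIN_DISP, MAX_DISP = 16, 160  # hypothetical disparity search range

far = depth_from_disparity(FOCAL_PX, BASELINE_M, MIN_DISP)   # smallest disparity -> farthest
near = depth_from_disparity(FOCAL_PX, BASELINE_M, MAX_DISP)  # largest disparity -> nearest
print(f"usable range: {near:.2f} m to {far:.2f} m")
```

With these assumed values the computed range (about 0.95–9.5 m) is consistent with the roughly 1–10 m range stated above.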
In the experiment, we tested our system and compared it with other pedestrian detection methods in several challenging nighttime environments. Many public datasets are available for testing pedestrian detection algorithms, but they were generally captured with monocular cameras during the day. A few binocular datasets exist, such as KITTI, but they target mobile platforms, whereas our system operates in fixed monitoring scenes. We therefore collected a new nighttime stereo dataset with the binocular network camera we designed. It covers common outdoor surveillance scenarios, such as building entrances and main streets, and has been expanded over the past year of work on this project. To the best of our knowledge, this is the first nighttime stereo video surveillance dataset for analyzing the performance of state-of-the-art pedestrian detection algorithms. The dataset is available at https://xdfhy.github.io/, and we will continue to expand it as the project progresses.
3.2. System Performance Evaluation
In this work, the performance of our system was evaluated on four widely-encountered nighttime scenes: a campus entrance, a dim footpath, a crowded flyover and a main street. These scenes cover the classic challenges of traditional nighttime pedestrian detection, such as object occlusion and sparse texture information (as shown in Figure 5).
As shown in Figure 5, the detection results of our nighttime pedestrian detection system were effective in complex environments. Figure 5 shows the intermediate stages of our pipeline for the four scenes (original image, disparity map, Zmax map and foreground segmentation on the Zmax map) as well as the three result images (foreground binary image, foreground cluster depth and final pedestrian detection results). Examining the original images in Figure 5, we can analyze the four experimental scenes in detail. Scene 1 is a campus entrance, where surveillance systems are often installed for student safety. The large number of people and cars caused serious occlusion, and the original images contained little texture information; both are challenges for traditional nighttime pedestrian detection. Scenes 2 and 3 are dim, narrow paths with severe occlusion; the dim light complicated the depth calculation, so the disparity maps contained many invalid points. Scene 4 is a main street surrounded by shops with cluttered lighting; here the system could be used to monitor the flow of customers, a promising future application.
In short, these four scenes are representative nighttime monitoring scenes that include the main challenges usually faced by traditional nighttime pedestrian detection systems. The output results in Figure 5 show that our system effectively detects the pedestrians in each scene despite these problems: the number of detected targets is high and the false detection rate is low. The performance of the algorithm can be analyzed more specifically according to
Figure 5. On the one hand, our system handles partial occlusion well, as in Frame 2589 of Scene 1, Frame 3134 of Scene 2 and Frame 1319 of Scene 3. This is because the binocular camera provides the depth of foreground targets, and the foreground pedestrians are segmented in 3D space. Even when two targets overlap in the original image, they are usually at different depths, so their separation in 3D space may still be large. In Frame 2589 of Scene 1, for example, the foreground cluster depth shows that the two overlapping targets lie at different depths, and they are separated on the Zmax map. Note that a target becomes undetectable when it is occluded too heavily, because too few of its valid 3D points remain.
On the other hand, our system is not very sensitive to the lighting in the scene. As Figure 5 shows, the detection results remain robust and effective both under cluttered lights, as in Scene 4, and in dim environments, as in Scenes 2 and 3; the system could even detect pedestrians that were hard to see with the naked eye. This is because our method focuses on pedestrian profile information and therefore does not demand a high-quality disparity map. In Frame 1993 of Scene 2 and Frame 1253 of Scene 3, for example, the color and texture features of the original image are weak and the disparity map contains many invalid points, yet the overall pedestrian contours in the disparity map are relatively complete and the target points cluster together in sufficient numbers on the Zmax map. As a result, our system produces robust, acceptable detection results even on dim nights. Moreover, our system does not impose strict constraints on camera height, which preserves more information about the people: face and posture information is still visible in Figure 5 and can be used for other image-based studies. Overall, the detection results show that our system effectively solves the problems encountered by pedestrian detection algorithms at night, and that its performance is efficient and robust.
3.3. Comparison with State-of-the-Art Methods
For further analysis, the proposed system was evaluated through a comparison with three categories of classic and state-of-the-art pedestrian detection methods. These categories cover the mainstream solutions for pedestrian detection in surveillance scenes:
- (1) Background subtraction methods: widely-used background subtraction algorithms such as ViBe [17] and fast MCD [18]. ViBe [17] has the advantages of a small amount of computation, a small memory footprint, high processing speed and good detection quality. Fast MCD [18] models the background with a dual-mode Single Gaussian Model (SGM) with age, for foreground object detection on non-stationary cameras.
- (2) RGB-D methods such as [30]: the algorithm in [30] uses a stereo vision system called subtraction stereo, which extracts a range image of the foreground regions; the extracted range image is then segmented into individual objects by clustering.
- (3) Learning-based detection methods such as the classic DPM [21] algorithm and the currently popular deep learning algorithm YOLO2 [26]: DPM [21] is a successful object detection algorithm that has become an important component of many classifiers and of segmentation, human pose and behavior classification systems. YOLO2 [26] is a state-of-the-art, real-time object detection algorithm that can detect over 9000 object categories.
We chose the five algorithms above for comparative experiments and selected four sets of scenes for testing (detection results shown in Figure 6). Regarding the implementation, the ViBe [17] code came from [37]; its parameters are the standard parameters of the algorithm, such as the number of samples per pixel and the sub-sampling probability. The fast MCD [18] code came from [38]; we followed the provided steps to run the test and kept the original parameter settings. Because these two methods process a single image, we used the left rectified image as their input. Second, the RGB-D method [30] is a stereo algorithm whose inputs are the left and right rectified images; since its source code is not publicly available, we implemented it ourselves following the algorithm flow described in [30]. Finally, the DPM [21] code came from [39]; it requires only a single image, so the code could be used directly once the input was changed to our left rectified image. The YOLO2 [26] code was downloaded from the official website [40]. In general, this algorithm requires a large amount of training data to generate its weights, but well-trained pedestrian detection weights are available on the official website; in the experiment, we used the mainstream pre-trained YOLOv2 [26] weights directly, with the left rectified image as input, and these weights proved very effective.
Various metrics can be used to assess the output of a pedestrian detection algorithm given a series of final detection images. These metrics usually involve the number of true positives (TP), the number of false positives (FP) and the number of false negatives (FN) [17], from which the precision, recall and F1-measure (the harmonic mean of precision and recall) can be calculated. For the purpose of comparison, we selected four scenes with 400 frames each and obtained these values from the detection results (as shown in Table 2).
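The metric definitions can be written out directly. In the sketch below, the TP/FP/FN counts are made-up illustrative numbers (chosen so that TP + FN matches the 1476 pedestrians reported later), not the actual counts from Table 2.

```python
# Computing the metrics used in Table 2 from raw TP/FP/FN counts.
# The counts below are hypothetical, not the paper's measured results.

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard detection metrics: precision = TP/(TP+FP),
    recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=1200, fp=150, fn=276)  # hypothetical counts
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```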
As presented in Figure 6, first, ViBe [17] and fast MCD [18] easily split one target into multiple detections and merged several overlapping targets into one; moreover, both methods missed many occluded targets because of the poor image quality at night. Second, the detection results of DPM [21] were acceptable, but serious occlusion caused some false detections and missed targets. Third, the method in [30] first segments moving foregrounds in each individual camera by grey-level background subtraction and then applies stereo matching only to the segmented foreground to obtain the depth of each person. The grey-level background subtraction fails easily at night because of the poor color discrimination of infrared images, which severely degrades the subsequent stereo matching and pedestrian detection; accordingly, some targets are missed in Figure 6, such as Frame 3441 in Scene 1 and Frame 952 in Scene 4. Finally, the detection results of YOLO2 [26] were robust and effective, but it still cannot overcome the inherent limitation of a single camera without 3D information: its performance drops in scenes with severe occlusion, such as Frame 2029 in Scene 2. Furthermore, the infrared image has less texture information, blurred contours and poor color discrimination at night, so some foreground objects resemble the background, leading to false detections and missed targets, as in Frame 952 of Scene 4.
After that, we compiled the statistics of the detection results and the performance metrics of all algorithms. From Table 2, the total number of people in the experimental scenes was 1476. First of all, the precision of our system was the highest among these methods, and the precision of DPM [21] and RGB-D [30] was also good. By contrast, the precision of YOLO2 [26] was not as high, because YOLO2 [26] produced many erroneously-detected targets; indeed, its FP count was the highest among these methods. The precision of ViBe [17] and fast MCD [18] was the lowest. Turning to the recall values in the table, YOLO2 [26] achieved the highest recall; our algorithm took second place; and the recall of ViBe [17], DPM [21] and fast MCD [18] was noticeably lower. Since larger precision and recall both indicate better performance, we compared the F1-measure of all algorithms: our algorithm and YOLO2 [26] were the most effective, at about 0.8; the RGB-D method [30] and DPM [21] were also acceptable; and ViBe [17] and fast MCD [18] were the lowest, at less than 0.45.
Note that the F1-measure gap between our approach and background subtraction methods such as ViBe [17] was confirmed by comprehensive experiments in many nighttime scenes. Our method differs fundamentally from traditional background subtraction, which explains the large difference in performance.
Many traditional background subtraction methods such as ViBe [17] are color based: they learn a background model from the input image sequence and segment the foreground by comparing the color difference between the input image and the background image. Such color-based methods are obviously sensitive to illumination changes and cluttered backgrounds, and their performance drops significantly in crowded scenes where the background is occluded or even invisible. Different from traditional methods, our algorithm is based on the three-dimensional spatial structure of the scene: once a moving person appears, the height structure in 3D space changes simultaneously. Because our method relies only on the spatial information of the scene, it is unaffected by lighting changes and remains robust in crowded scenes.
Finally, Table 2 also reports the processing speed and computational efficiency of each algorithm. The proposed system ran on a notebook with an Intel i7-6700 CPU and 8 GB of RAM and processed 25.3 fps on this platform, which basically meets the system's real-time requirements. ViBe [17] is fast, processing 33.3 fps on the same platform, but its precision is too low for direct use. DPM [21] processed only 1.7 fps on the same platform, which is too slow. YOLO2 [26] is very slow on a CPU and is therefore usually run on a GPU; it processed 25 fps on an NVIDIA GeForce GTX 1060 with 6 GB of video memory. In summary, our nighttime pedestrian detection system performed well and significantly outperformed several recently published state-of-the-art methods, including commonly-used background subtraction algorithms such as ViBe [17] and fast MCD [18], feature-based detection algorithms such as DPM [21], and RGB-D algorithms such as the method in [30]. Moreover, the performance of our system was comparable to, and sometimes better than, that of the most popular deep learning method, YOLO2 [26].
To compare further with YOLO2 [26], we chose a special scenario with severe specular reflection from mirrors or windows. As shown in Figure 7, the building has many large French windows, which are quite common in shopping malls, libraries and similar places; pedestrians leave a clear mirror image on the windows as they pass through the scene. The detection results in Figure 7 show that YOLO2 [26] readily detects these mirror images as pedestrians and, once again, fails to detect occluded targets, which significantly reduces its detection performance. In contrast, our method acquires 3D information of the scene and can restrict the monitoring range in 3D space to exclude the French windows during detection; as a result, its false detection rate in Figure 7 is low. Moreover, our algorithm handles partial occlusion and improves the detection rate. Comparing all detection results of the two methods, our system clearly works much better than YOLO2 [26] in this kind of scenario.
We summarize the characteristics of the two algorithms in Table 3. First, their computing environments differ: YOLO2 [26] relies on Graphics Processing Units (GPUs) for speed, whereas our system achieves real-time performance without GPU acceleration, making the binocular system easier to deploy in real life. Second, YOLO2 [26] is a feature-based detection method that only outputs the bounding box of each object; our approach, a pedestrian detection method based on the voxel surface model, additionally recovers all moving foreground points in the scene. Finally, because our system uses a binocular camera, we can obtain the depth of all foreground points and the distance of the foreground targets. In other words, although the detection performance of YOLO2 [26] is quite good, our approach has lower system requirements and is more convenient, while also providing richer information (the 3D structure of the scene, the foreground points and the distance of foreground targets) that can support other studies of the scene. In addition to the detection results presented in the paper, we tested several further datasets and produced a video demo demonstrating the effectiveness of our system at night; please see the Supplementary Material.