Pedestrian Flow Tracking and Statistics of Monocular Camera Based on Convolutional Neural Network and Kalman Filter

Abstract: Pedestrian flow statistics and analysis in public places is an important means of ensuring urban safety. However, in recent years, video-based pedestrian flow statistics algorithms have mainly relied on binocular vision or a vertical downward camera, which severely limits the application scene and counting area and cannot make use of the large number of monocular cameras in the city. To solve this problem


Introduction
Pedestrian flow statistics is an important application in the field of computer vision [1]. It is a key technology in intelligent cities, intelligent retail, public place security and many other fields [2]. In recent years, with the continuous development of intelligent cities, pedestrian flow statistics has attracted a growing number of researchers and companies, who have developed a growing number of statistics algorithms [3,4].
The earliest methods for pedestrian traffic statistics depended on manual statistics or bill statistics, which are either costly or intrusive for pedestrians. Pedestrian flow statistics methods based on pressure sensors or photoelectric sensors were then proposed. However, the accuracy of these methods is not good enough for dense pedestrian flow with severe occlusion. With the development of computer vision, pedestrian traffic statistics methods based on a vertical downward stereo camera have emerged [5]. This kind of method is currently the most popular. Because the vertical downward camera effectively avoids pedestrian occlusion, and the binocular-vision-based three-dimensional reconstruction algorithm filters out complex color backgrounds well, the method has high accuracy, but its installation location and field of view are severely limited. It can only be applied to narrow indoor entrances, not to commercial streets with wide outdoor entrances. It also cannot make use of the large number of front-down monocular cameras in the city.
Compared with a vertical downward stereo camera, a front-down monocular camera has a wider field of view, and because it can see pedestrians' faces from the front, it is better suited to security, criminal investigation and abnormal behavior early warning tasks. In front-down cameras, the image plane is at about 45 degrees to the ground. The occlusion between targets in front-down cameras is not as serious as that in front-facing cameras. At the same time, the scale difference between targets is relatively small. The size of a pedestrian target is negatively correlated with the mounting height of the camera.
However, due to challenges such as frequent occlusions, illumination changes, target scale changes and fog concentrations that vary with distance, a pedestrian flow statistics algorithm based on a front-down monocular camera places higher demands on the detection and tracking algorithms. Thanks to the rapid progress of deep learning algorithms [6][7][8][9][10][11], the accuracy of pedestrian detection is constantly improving. More accurate detection results raise the performance of tracking-by-detection methods [12], which makes it possible to develop a high-precision pedestrian flow statistics algorithm based on a front-down monocular camera.
In this paper, we propose a pedestrian flow tracking and statistics method based on a front-down monocular camera. Firstly, we use a convolutional neural network to detect pedestrians appearing in the camera, and refine the detection results based on intersection over union and aspect ratio. Secondly, we use a Kalman filter to build uniform linear motion models for the detected pedestrians. Pedestrian tracking is then accomplished by a data association algorithm. Finally, the virtual block method is used to count the targets. We test the proposed algorithm on real surveillance video of a scenic spot entrance, where the F1 score of pedestrian flow statistics reaches 95%. We also compare the proposed multi-target tracking algorithm with other multi-target tracking algorithms on the 2DMOT2015 dataset. The MOTA of the proposed algorithm reaches 48.1, and the algorithm has an advantage in computing speed.
The rest of this paper is organized as follows. Section 2 reviews the related work. The proposed algorithm is described in detail in Section 3. In Section 4, experiments and comparisons are carried out. Conclusions and analysis are presented in Section 5.

Related Work
The pedestrian flow statistics algorithm mainly includes three steps: pedestrian detection, multi-target tracking and pedestrian counting.
Current fast pedestrian detection methods fall mainly into two categories. The first is based on background modeling: a background model is used to extract foreground moving objects, and a classifier judges whether each moving object is a pedestrian. The GMM algorithm [13] and the ViBe algorithm [14] are the most representative. This kind of algorithm is fast, but it does not cope well with illumination changes or camera jitter. It also has difficulty distinguishing dense objects or objects that occlude each other. The second kind of pedestrian detection algorithm is based on statistical learning. Such algorithms have high accuracy and can cope with occlusion and environmental changes to a certain extent. HOG + SVM [15], proposed by Dalal et al., is the most classical algorithm of this kind. With the popularity of deep learning methods in recent years, the accuracy of pedestrian detection algorithms based on convolutional neural networks has reached an unprecedented level [16]. Detection methods based on deep learning can be divided into two-stage and one-stage methods. Faster R-CNN, proposed by Ren et al. [17], and its subsequent variants [18,19] are two-stage methods, which have advantages in detection and localization accuracy. The SSD detection algorithm of Liu et al. [20] and its subsequent variants [21,22] are one-stage algorithms, which achieve good accuracy and better real-time performance.
Multi-target tracking methods can be divided into DBT (detection-based tracking) and DFT (detection-free tracking) methods according to how they are initialized. DFT methods need the targets to be labeled manually and then track them in subsequent frames [23,24]. DBT methods complete tracking by detecting the targets in each frame and assigning them to tracklets [25][26][27]. DBT methods are more suitable for pedestrian traffic statistics because new targets appear frequently and old ones disappear frequently. According to the processing mode, multi-target tracking methods can also be divided into online and offline algorithms. Online algorithms use only the current frame of the image sequence and several previous frames [28,29], which makes them more suitable for applications that need real-time operation. Offline algorithms need some future frames in the image sequence [30,31], which makes them more suitable for post-analysis and processing of video. Appearance models are widely used in multi-target tracking. The appearance model plays an important role in associating tracklets and detections, and with its help ID switches can be effectively suppressed. Ullah et al. proposed a multi-target tracking method that builds the appearance model with the HOG descriptor [32]. Bae et al. proposed a deep appearance learning method to learn a discriminative appearance model which can distinguish multiple objects with large appearance variations [33]. The motion model is as important as the appearance model. Since the motion of a target in the image is usually relatively smooth, estimating the trend of the target's motion makes it possible to predict its position in the next frame, thus reducing the search area or even directly yielding the tracking result.
The earliest pedestrian counting method was the road marking method. In this method, a mark is set on the road surface, and when the mark is covered, a pedestrian is judged to have passed. Kryjak et al. then proposed a counting method based on virtual lines [34]. The main idea is that when the target center crosses the virtual line, a pedestrian is judged to have passed. However, a target hovering near the virtual line seriously affects the counting accuracy. Later, Xu et al. proposed a counting method based on double virtual lines [2]. In this method, two virtual lines are drawn, and counting is realized by judging the order in which a pedestrian crosses them.

Methodology
The proposed algorithm can be divided into three parts: pedestrian detection, multi-pedestrian tracking and pedestrian counting. The overall flow of the proposed method is shown in Figure 1.

Pedestrian Detection
Pedestrian detection is the first step of the pedestrian flow statistics algorithm. The detector is mainly an improvement of the YOLOv3 detection network [35]. To reduce the computational complexity, the Darknet-53 backbone at the front of the network is replaced by a pruned and compressed VGG network [36], which reduces the computational complexity of the network from 65 BFLOPs to 39 BFLOPs. Because the task is pedestrian traffic statistics, anchors that are too small do not help detection accuracy, so the k-means algorithm is used to re-cluster the network's anchor sizes. Because the targets in the front-down camera are large in scale, the last FPN structure of the YOLOv3 network is removed to further reduce the computational complexity. The final computational complexity of the neural network is 34 BFLOPs, while the mAP (mean Average Precision) of the detector drops by only 1.61, from 74.49 to 72.88. The final network structure is shown in Figure 2.
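The anchor re-clustering step can be illustrated with a minimal sketch. The text only states that k-means is used. The 1 - IoU distance on (width, height) pairs, the function names and the toy box sizes below are assumptions for illustration, not the authors' implementation.

```python
import random

def wh_iou(box, anchor):
    """IoU of two boxes given as (width, height), aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    return inter / (box[0] * box[1] + anchor[0] * anchor[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors, using 1 - IoU as the distance (assumed)."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: wh_iou(b, anchors[j]))
            clusters[best].append(b)
        new_anchors = []
        for j, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[j])  # keep an empty cluster's anchor
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Toy labeled box sizes in pixels (hypothetical values).
boxes = [(10, 25), (12, 30), (11, 27), (40, 100), (42, 105), (38, 95)]
anchors = kmeans_anchors(boxes, k=2)
```

On real data, `boxes` would hold the widths and heights of all labeled pedestrian boxes, and `k` would match the number of anchors the detection layers expect.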
In pedestrian flow statistics, pedestrian targets are very dense, and there is serious occlusion between them. The traditional non-maximum suppression (NMS) method can discard some correct detections while removing redundant bounding boxes, so the soft-NMS method is used instead [37]. Unlike the original NMS method, soft-NMS, as shown in Formula (1), does not directly remove a bounding box whose IOU with a higher-scoring box exceeds the threshold, but reduces its confidence, which makes it less likely that a correct detection is removed by mistake among dense targets. In Formula (1), $d_i$ is a detection result with score $s_i$, $d_m$ is another detection result with a higher score than $d_i$, and $N_i$ represents the threshold of soft-NMS. As shown in Figure 3, soft-NMS not only removes redundant detection results correctly, but also reduces the miss rate.
In front-down cameras, objects close to the camera can easily occlude the lower half of objects farther away. When the targets also move laterally, the bounding box of the farther target changes dramatically in height, which hinders the subsequent tracking. However, the top position and width of the bounding box are not affected in this case. Based on this observation, we adjust the shape of bounding boxes using their width and top position. For bounding boxes whose aspect ratio (width:height) is greater than 1:2.5, the height is increased while the top position and width are kept fixed, so that the aspect ratio becomes 1:2.5. As shown in Figure 4, this scheme effectively reduces the deformation of the bounding box due to occlusion. It also improves the localization of the true center of an occluded target, which benefits the final counting task.
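The score-decay idea can be sketched as follows. Since the text refers to a threshold, this sketch assumes the linear variant of soft-NMS, in which a box overlapping a higher-scoring box by at least the threshold has its score multiplied by (1 - IOU). The threshold values and function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms_linear(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear soft-NMS (assumed variant): decay overlapping scores instead of deleting."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the highest-scoring box still in play.
        m = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(m)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            overlap = iou(best_box, box)
            if overlap >= iou_thresh:
                score *= 1.0 - overlap  # reduce confidence, do not remove
            if score >= score_thresh:   # drop only boxes whose score has collapsed
                decayed.append((box, score))
        remaining = decayed
    return kept

# Three heavily overlapping pedestrians (hypothetical boxes and scores).
boxes = [(0, 0, 10, 20), (1, 0, 11, 20), (5, 0, 15, 20)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms_linear(boxes, scores)
```

With hard NMS at the same threshold, the second and third boxes would be deleted outright. Here they survive with lower scores, and a later confidence cut-off decides whether they are reported.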
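The shape adjustment can be written in a few lines. This is a minimal sketch assuming boxes are stored as (left, top, width, height) with the image y axis pointing down. The function names are hypothetical.

```python
def adjust_box_height(left, top, width, height, target_ratio=2.5):
    """Extend a box downward so that width:height is no greater than 1:target_ratio.

    The top edge and the width stay fixed; only the height grows.
    """
    min_height = target_ratio * width
    if height < min_height:  # box is too short, likely occluded at the bottom
        height = min_height
    return left, top, width, height

def box_center(left, top, width, height):
    """Center of a (left, top, width, height) box."""
    return left + width / 2.0, top + height / 2.0

occluded = adjust_box_height(100, 50, 40, 60)    # lower half hidden
unoccluded = adjust_box_height(100, 50, 40, 120)  # already tall enough
center = box_center(*occluded)
```

Because the center is recomputed from the extended box, an occluded pedestrian's center moves down toward where the full body would be, which is the position the counting stage relies on.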

Multi-Pedestrian Tracking
The tracking algorithm is a multi-target tracking method based on detection results. It can be divided into two parts. The first part builds a motion model for each detected target and creates new tracklets for new targets. The second part is the data association algorithm, which matches the targets detected in each frame with the existing tracklets through a cost function to accomplish tracking. The overall flow of the tracking algorithm is shown in Figure 5. A linear motion model is chosen as the motion model of the tracklet. The model is based on the Kalman filter, and the target state is expressed as $[u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$, where $(u, v)$ is the center position of the bounding box, and $s$ and $r$ are the scale and aspect ratio of the bounding box respectively. $(\dot{u}, \dot{v})$ is the velocity of the target in the horizontal and vertical directions, and $\dot{s}$ is the rate of change of the target's scale. Since the aspect ratio of the bounding box has already been adjusted, the aspect ratio of the target is assumed to be constant here.
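A constant-velocity Kalman filter over this seven-dimensional state can be sketched with NumPy as below. The time step of one frame, the noise covariances and the class name are assumptions chosen for illustration; the text does not specify them.

```python
import numpy as np

class BoxKalmanFilter:
    """Constant-velocity Kalman filter with state [u, v, s, r, du, dv, ds]."""

    def __init__(self, z):
        self.x = np.zeros(7)
        self.x[:4] = z                       # initialize from the first detection
        self.F = np.eye(7)                   # state transition, time step = 1 frame
        self.F[0, 4] = self.F[1, 5] = self.F[2, 6] = 1.0
        self.H = np.eye(4, 7)                # only [u, v, s, r] are measured
        # Hypothetical covariances: velocities start very uncertain.
        self.P = np.diag([10.0, 10.0, 10.0, 10.0, 1e3, 1e3, 1e3])
        self.Q = np.eye(7) * 0.01
        self.R = np.diag([1.0, 1.0, 10.0, 10.0])

    def predict(self):
        """Propagate the state one frame ahead (uniform linear motion)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        """Correct the predicted state with a matched detection [u, v, s, r]."""
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(7) - K @ self.H) @ self.P

# A target moving 5 px per frame to the right (hypothetical measurements).
kf = BoxKalmanFilter([100.0, 50.0, 4000.0, 0.4])
for t in range(1, 11):
    kf.predict()
    kf.update([100.0 + 5.0 * t, 50.0, 4000.0, 0.4])
```

The aspect ratio `r` has no velocity term, which matches the assumption that it stays constant after the shape adjustment. Calling `predict()` without `update()` moves the box in a straight line at the estimated velocity.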
The data association algorithm uses the Hungarian algorithm to match existing tracklets with the detection results in the current frame. The cost function has three parts: an IOU limit, a scale change limit and a standardized distance. The IOU between the tracklet's state and the detection result must be larger than a certain threshold, and the scale change must be lower than a certain threshold; otherwise the tracklet and the detection result cannot be matched. The scale change is computed from $w_1$ and $h_1$, the width and height of the detection result, and $w_2$ and $h_2$, the width and height of the tracklet state. Subject to these two limits, the matching degree between a detection result and a tracklet is described mainly by the standardized distance. The standardized distance normalizes the pixel distance between the detection and the tracklet by the smaller of their widths and heights, which effectively reduces the discrepancy between pixel distance and actual distance caused by depth of field. It is computed from $(x_1, y_1)$, the coordinate of the detection result, and $(x_2, y_2)$, the coordinate of the tracklet state.
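The gated assignment can be sketched in pure Python. The exact scale change and standardized distance formulas are not available here, so the two forms used below are assumptions that only follow the verbal description. The thresholds, the `BIG` gating cost and all function names are likewise hypothetical.

```python
import math

BIG = 1e6  # cost given to pairs that fail a gate, and to padding cells

def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    iw = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    ih = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def scale_change(det, trk):
    """ASSUMED form: relative difference between the two box areas."""
    a1, a2 = det[2] * det[3], trk[2] * trk[3]
    return abs(a1 - a2) / min(a1, a2)

def standardized_distance(det, trk):
    """ASSUMED form: center offset divided by the smaller width and height."""
    dx = (det[0] - trk[0]) / min(det[2], trk[2])
    dy = (det[1] - trk[1]) / min(det[3], trk[3])
    return math.hypot(dx, dy)

def hungarian(cost):
    """Minimum-cost assignment on a square matrix; returns (row, col) pairs."""
    n = len(cost)
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    p, way = [0] * (n + 1), [0] * (n + 1)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [math.inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], math.inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the root
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, n + 1)]

def associate(dets, trks, iou_min=0.3, scale_max=0.5):
    """Match detections to tracklets; returns (matches, unmatched dets, unmatched trks)."""
    n = max(len(dets), len(trks))
    if n == 0:
        return [], [], []
    cost = [[BIG] * n for _ in range(n)]  # padded to square
    for i, d in enumerate(dets):
        for j, t in enumerate(trks):
            if iou(d, t) > iou_min and scale_change(d, t) < scale_max:
                cost[i][j] = standardized_distance(d, t)
    matches = sorted((i, j) for i, j in hungarian(cost) if cost[i][j] < BIG)
    md, mt = {i for i, _ in matches}, {j for _, j in matches}
    return (matches,
            [i for i in range(len(dets)) if i not in md],
            [j for j in range(len(trks)) if j not in mt])

dets = [(100, 100, 40, 100), (200, 100, 40, 100), (400, 300, 40, 100)]
trks = [(198, 102, 40, 100), (103, 101, 40, 100)]
matches, unmatched_dets, unmatched_trks = associate(dets, trks)
```

Pairs that fail either gate receive the cost `BIG` and are filtered out after the assignment, so a detection far from every tracklet (the third one above) ends up unmatched instead of being forced onto a tracklet.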
For successfully matched tracklets and detections, the state of each tracklet is updated with the position of its detection. In this case, the state of the tracklet, including position, scale, velocity and scale change rate, is the optimal estimate obtained by the Kalman filter. Unmatched detections are candidates for new targets, and candidate tracklets are established for them. If a candidate tracklet matches detection results in multiple consecutive frames, it is promoted to a new tracklet for a target that has newly appeared in these frames. Motion models established from detection results in the first frame of the image sequence are used as new tracklets immediately to initialize the sequence. An unmatched tracklet outputs its predicted result directly. In this case, the tracking state is completely determined by the prediction matrix, and the predicted tracklet moves in a straight line at a uniform speed given by the state variables. This reduces the influence of short-term occlusion or detector failure. When a tracklet fails to match any detection result in multiple consecutive frames, the target it tracks is considered to have disappeared and the tracklet is deleted.
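The tracklet life cycle described above amounts to a small state machine. The sketch below is one possible reading. The frame counts `min_hits` and `max_misses` are hypothetical, since the text only says "multiple consecutive frames", and dropping a candidate on its first miss is an assumption that follows from the consecutive-match requirement.

```python
CANDIDATE, CONFIRMED, DELETED = "candidate", "confirmed", "deleted"

class TrackletStatus:
    """Tracks whether a tracklet is a candidate, confirmed, or deleted."""

    def __init__(self, first_frame=False, min_hits=3, max_misses=5):
        self.min_hits = min_hits      # hypothetical: matches needed to confirm
        self.max_misses = max_misses  # hypothetical: misses tolerated before deletion
        self.hits = 1                 # the detection that created the tracklet
        self.misses = 0
        # Detections in the first frame become tracklets immediately.
        self.state = CONFIRMED if first_frame else CANDIDATE

    def on_match(self):
        """Called when the tracklet is matched to a detection in this frame."""
        self.hits += 1
        self.misses = 0
        if self.state == CANDIDATE and self.hits >= self.min_hits:
            self.state = CONFIRMED

    def on_miss(self):
        """Called when no detection is matched; the filter's prediction is output."""
        self.misses += 1
        if self.state == CANDIDATE:
            self.state = DELETED      # assumed: candidates need consecutive matches
        elif self.misses >= self.max_misses:
            self.state = DELETED

track = TrackletStatus()
track.on_match()
track.on_match()
state_after_three_hits = track.state
for _ in range(4):
    track.on_miss()
state_after_four_misses = track.state
track.on_miss()
state_after_five_misses = track.state
```

Between a miss and deletion, a confirmed tracklet keeps coasting on its Kalman prediction, which is what bridges short occlusions.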

Pedestrian Counting
A pedestrian counting algorithm based on virtual blocks is proposed. Similar to the counting method based on double virtual lines, the block-based algorithm counts pedestrians according to the sequence of blocks that the center of the pedestrian detection passes through. This method shortens the period over which a target must be tracked continuously and makes the overall algorithm more robust to occlusion.
The block-based counting algorithm needs a beginning area, a counting area and an end area to be delimited, as shown in Figure 6. If the center of a target is initialized in the beginning area and reaches the end area after passing through the counting area, one count is made. Two-way counting can be realized by delimiting the regions in different orders.
Compared with the counting method based on double virtual lines, the block-based method adapts better to counting areas of different shapes, which makes it more suitable for front-down cameras. In addition, flexible placement of the counting area makes it easier to count only part of the road area, and setting the beginning and end areas makes it easier to count only targets entering from a specific entrance or leaving through a specific exit.
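The block sequence logic can be sketched as a per-track state machine. For brevity, the three areas are assumed to be axis-aligned rectangles given as (x1, y1, x2, y2), although the method allows other shapes. The class name and coordinates are hypothetical.

```python
def in_rect(point, rect):
    """True if point (x, y) lies inside rect (x1, y1, x2, y2)."""
    x, y = point
    return rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3]

class BlockCounter:
    """Counts tracks that go from the beginning area through the counting area to the end area."""

    def __init__(self, begin_area, count_area, end_area):
        self.begin_area = begin_area
        self.count_area = count_area
        self.end_area = end_area
        self.stage = {}  # track id -> "begin", "counting", "done" or "ignored"
        self.total = 0

    def update(self, track_id, center):
        stage = self.stage.get(track_id)
        if stage is None:
            # A track only qualifies if it is initialized in the beginning area.
            stage = "begin" if in_rect(center, self.begin_area) else "ignored"
        elif stage == "begin" and in_rect(center, self.count_area):
            stage = "counting"
        elif stage == "counting" and in_rect(center, self.end_area):
            stage = "done"
            self.total += 1
        self.stage[track_id] = stage
        return self.total

counter = BlockCounter(begin_area=(0, 0, 100, 30),
                       count_area=(0, 30, 100, 70),
                       end_area=(0, 70, 100, 100))
paths = {
    1: [(50, 10), (50, 40), (50, 45), (50, 80), (50, 90)],  # normal pass
    2: [(50, 50), (50, 80)],                                # appears mid-scene
    3: [(50, 10), (50, 40), (50, 20), (50, 40), (50, 80)],  # hovers, then passes
    4: [(50, 90), (50, 50), (50, 10)],                      # opposite direction
}
for track_id, centers in paths.items():
    for c in centers:
        counter.update(track_id, c)
total = counter.total
```

A second `BlockCounter` with the beginning and end areas swapped would count the opposite direction, giving two-way counting. A target that hovers around a boundary is counted at most once because each track only moves forward through the stages.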

The Performance of Pedestrian Flow Statistics Algorithms in Real Scene
Since there is no dataset specifically for pedestrian flow statistics, we use the image sequence captured by front-down cameras at the entrance of crowded scenic spots to verify the detection and counting effect of the proposed algorithm.
To develop an algorithm that is robust to changes in illumination and fog concentration, we manually annotated the images captured by the camera and built a new dataset. For the detection dataset, five half-hour video recordings were collected. The videos contain intense changes in illumination, and some contain dense fog. The frame rate of the recordings is 25 fps, and one image is saved every 12 frames. The total number of annotated images is 18,750, containing 282,674 pedestrian bounding boxes. Of these, 2000 images from the end of two videos with different perspectives are used as the test set, while the remaining 16,750 images are used as the training set. After training, the detector is evaluated with the most popular indicators in the detection field: precision, recall and mAP (mean Average Precision). The detection performance is shown in Table 1. Compared with a detector trained only on the COCO dataset, the detector trained with the new images is much more robust to changes in illumination and fog concentration. To evaluate pedestrian counting, we selected five other videos. The missed counts and false alarms of the counting algorithm are tallied by manually inspecting the counting results. The counting algorithm is then evaluated by its precision, recall and F1-score. The evaluation results on the five videos are shown in Table 2. The mean F1-score of the algorithm over the five videos exceeds 95%, which demonstrates its effectiveness. The algorithm runs at 28.4 FPS on an NVIDIA GTX1060 GPU, which is sufficient for real-time use at the 25 fps frame rate of the recordings. The actual running effect is shown in Figure 7.
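The dataset size and the counting metrics can be checked with a short sketch. The tallies passed to `counting_scores` are hypothetical numbers used only to show the calculation. They are not results from Table 2.

```python
def sampled_images(num_videos, minutes, fps, every_n_frames):
    """Number of images obtained by saving one frame out of every n."""
    return num_videos * minutes * 60 * fps // every_n_frames

def counting_scores(true_count, missed, false_alarms):
    """Precision, recall and F1 from manually tallied counting errors."""
    true_positives = true_count - missed
    precision = true_positives / (true_positives + false_alarms)
    recall = true_positives / true_count
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Five half-hour videos at 25 fps, one image saved every 12 frames.
total_images = sampled_images(num_videos=5, minutes=30, fps=25, every_n_frames=12)

# Hypothetical tallies for one video: 200 real passes, 8 missed, 6 false alarms.
precision, recall, f1 = counting_scores(true_count=200, missed=8, false_alarms=6)
```

The first call reproduces the 18,750 images stated in the text, which splits into 2000 test images and 16,750 training images.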

The Performance of Tracking-by-Detection Algorithm Compared with Other Algorithms
In pedestrian flow statistics algorithms, the performance of the detection and tracking algorithms has an important impact on the counting results. Therefore, we evaluate the proposed algorithm by comparing the combined performance of its detection and tracking stages with other algorithms.
We use the 2DMOT2015 benchmark [38] to evaluate the performance of the algorithm quantitatively. The 2DMOT2015 benchmark is a well-known framework for the fair evaluation of multi-pedestrian tracking algorithms. The dataset includes 22 image sequences, of which 11 form the training set and 11 form the test set. The image sequences come from several influential pedestrian detection and tracking datasets, including KITTI, ADL, ETH, PETS and TUD.
The main evaluation indicators are listed in Table 3. Among them, MOTA is a comprehensive evaluation of FP, FN and IDSW. Its formula is as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_t (\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t)}{\sum_t \mathrm{GT}_t}$$

where $\mathrm{GT}_t$ is the number of ground truth objects in frame $t$. MOTP measures how well each TP is aligned with its corresponding ground truth. The formula is as follows:

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$

where $d_{t,i}$ denotes the overlap of bounding box $i$ with its corresponding ground truth in frame $t$, and $c_t$ denotes the number of bounding boxes in frame $t$ that are successfully matched to ground truths.
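Both metrics follow directly from per-frame tallies, using the standard MOT benchmark definitions. The sketch below uses hypothetical per-frame numbers to show the calculation. The function names are illustrative.

```python
def mota(fn, fp, idsw, gt):
    """MOTA from per-frame false negatives, false positives, ID switches and ground truths."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

def motp(overlaps, matches):
    """MOTP: total overlap of matched boxes divided by the total number of matches.

    overlaps[t] lists d_{t,i} for each matched box in frame t; matches[t] is c_t.
    """
    return sum(sum(frame) for frame in overlaps) / sum(matches)

# Three frames of hypothetical tallies.
mota_value = mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[10, 10, 10])
motp_value = motp(overlaps=[[0.8, 0.9], [0.7], []], matches=[2, 1, 0])
```

Because errors are summed over all frames before dividing, MOTA can become negative when the number of errors exceeds the number of ground truth objects.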
To verify the combined performance of the detector and tracker, we compare the proposed algorithm with other algorithms that use private detectors, including MDP_Subcnn [39], DMT [40] and SORT [41]. The specific comparison results are shown in Table 4. It can be seen that the proposed algorithm has high accuracy. The actual effect of the algorithm on the MOT2015 dataset is shown in Figure 8.

Conclusions
In this paper, we propose a pedestrian flow tracking and statistics algorithm for front-down monocular cameras. The algorithm relies on a convolutional neural network for real-time pedestrian detection, and uses a Kalman filter linear motion model and a data association algorithm to track pedestrian targets. Finally, a counting method based on virtual blocks is proposed to complete pedestrian flow statistics. We use real scene videos to evaluate the counting performance of the algorithm. We also compare the detection and tracking performance of the algorithm with other algorithms on a public dataset, MOT 2015, which demonstrates the effectiveness of the algorithm. The experimental results show that the algorithm has good accuracy and real-time performance, and has high application value. Although the algorithm has achieved good results, some shortcomings remain. Since the accuracy of the algorithm is affected when pedestrians are seriously occluded, future work will improve tracking accuracy under serious occlusion by means of optical flow and a re-identification appearance model.

Figure 1. Overall flow chart of the proposed algorithm.

Figure 2. Network structure of the detection algorithm. The blue layers are convolutional layers, the red layers are max-pooling layers, and the yellow layers are detection layers.

Figure 3. Detection result with soft-NMS (left) and NMS (right). With soft-NMS, the bounding box for the person in yellow is not mistakenly deleted as a redundant detection.

Figure 4. The deformation of the bounding box of a single target without (left) and with (right) shape adjustment.

Figure 5. Overall flow of the tracking algorithm.

Figure 7. The actual operation effect of the algorithm.

Table 1. Detection performance of detectors trained with different data.

Table 2. The evaluation results of the algorithm on the test videos.

Table 4. The performance of the proposed tracking-by-detection algorithm compared with other algorithms on the 2DMOT2015 dataset. The green and yellow colors indicate the best and the second-best algorithm in each measure.