Sensors
  • Article
  • Open Access

6 July 2020

Counting Crowds with Perspective Distortion Correction via Adaptive Learning

School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
* Author to whom correspondence should be addressed.

Abstract

The goal of crowd counting is to estimate the number of people in an image. Presently, regression-based counting has become the mainstream approach. It is worth noting that, with the development of convolutional neural networks (CNNs), CNN-based methods have become a research hotspot. Locating each person in the image is a more interesting problem than simply predicting the number of people. Perspective transformation remains a challenge, because perspective distortion causes large variations in the apparent size of people within an image. To address perspective distortion and locate people more accurately, we design a novel framework named Adaptive Learning Network (CAL). We use VGG as the backbone. After each pooling layer, we collect features at 1/2, 1/4, 1/8, and 1/16 of the original image size and combine them with weights learned by an adaptive learning branch. The adaptive learning branch operates on each image in the dataset individually. By combining the output features of different sizes for each image, the challenge posed by drastic changes in crowd scale due to perspective transformation is reduced. We conducted experiments on four crowd counting datasets (i.e., ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF), and the results show that our model achieves good performance.

1. Introduction

The goal of the crowd counting task is to count the number of people in an image. Crowd counting plays an important role in production, daily life, disaster management, security monitoring, and public space design [,,]. With the improvement of people’s safety awareness, crowd counting has received increasing attention. Recently, crowd counting methods have utilized convolutional neural networks (CNNs) to address the scale variation issue and have achieved good improvements in crowd density estimation [,].
However, perspective distortion in images is still an important challenge for crowd counting; more specifically, models are not particularly accurate at predicting heads whose sizes differ greatly within the same image. Hence, better handling of objects of different sizes is key to improving crowd counting models. Recently, the demand for crowd counting is no longer simply counting the total number of people in an image, but also locating each specific person, so that counting can be performed more accurately. Most current work uses the Visual Geometry Group (VGG) network [] as a backbone. Features of different sizes are then extracted separately after each max-pooling operation and decoded. After each max pooling, we obtain features whose sizes are 1/2, 1/4, 1/8, and 1/16 of the original image size.
Current methods simply superimpose features of different sizes without considering how the combination should change for different images and different scenes. The degree of perspective transformation in each image is not the same; that is, if the branch information is merged according to a single fixed pattern, the learned knowledge cannot cover all samples. On this basis, we envisage using a dynamic mechanism that combines branch information according to the features of each image, achieving dynamic adaptation. Inspired by the adaptive scenario discovery (ASD) framework [], we also propose a dynamic method for combining branches. Different from ASD, our model is not only concerned with the counting task; we also add the localization of specific persons. Moreover, ASD distinguishes between sparse and dense scenes, whereas our model explores the degree of perspective change in the image. In this paper, we propose an adaptive learning framework (CAL) with perspective distortion correction for crowd counting and localization. We employ several VGG-16 convolution layers for crowd feature extraction before the multiple receptive fields instead of utilizing them directly. Additionally, to explore the degree of perspective change in the image, four parallel pathways with counting and localization networks, named main, scale, middle, and lowest, are proposed. The four pathways are designed for people at different scales, respectively. Besides, we also design a branch to learn the degree of perspective change, and its output is then combined with the output branches of the model.
Our contributions are listed, as follows.
  • We propose a novel adaptive framework with perspective distortion correction for crowd counting and localization. Different from previously proposed multi-column frameworks, we use a branch to dynamically characterize the degree of perspective change in each image. We further verify the effect of our CAL network and compare it with the NO-CAL variant in order to explain the improvement brought by our architecture.
  • We design a novel size characterization branch to realize both the crowd counting and the localization task.
  • We use VGG [] for the feature extraction structure, and the network is constructed from four branches (including the main path) that select output features of different sizes. The perspective change in an image can be regarded as a linear combination of our four branches with discrete weights, while the adaptive branch aims to capture a continuous perspective change trend and make corresponding corrections.
  • We apply our framework to four congested multi-scene crowd counting datasets (i.e., ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF) and prove that our method outperforms the state-of-the-art methods.
In the remaining part of the paper, we discuss related work on crowd counting and localization in Section 2, describe the backbone, the CAL network architecture, and the training process in Section 3, verify the proposed framework both qualitatively and quantitatively in Section 4, and finally conclude our work in Section 5.

3. Framework

Our model contains three parts: the backbone, the pathways, and the adaptive branch. We name this framework the Adaptive Learning Network (CAL); its architecture is shown in Figure 1. In the following parts, we introduce the structure and implementation in detail.
Figure 1. The architecture and weight of CAL.

3.1. Backbone

Presently, the mainstream way to extract features for the crowd counting task is to use the VGG network [] as a backbone. Backbone networks can be obtained in two ways: designing a new network from scratch (e.g., []) or migrating a pre-trained subnet from an existing network (e.g., [,,]). Between these two categories, the second has advantages in both time-saving and efficiency, and our network design also follows this principle. We first designed a feature extraction structure with VGG-16 as the backbone; however, we duplicated and fine-tuned some blocks to adapt the feature extraction task to multiple resolutions. More specifically, our backbone removes the fully connected layers of VGG-16, as shown in Table 1. Besides, our VGG model is first pre-trained on the ImageNet dataset [].
Table 1. The structure of the backbone.
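The backbone described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' exact code; the use of torchvision's pre-trained `vgg16` and the pooling-based collection of intermediate feature maps are our assumptions.

```python
import torch.nn as nn
from torchvision import models

class VGGBackbone(nn.Module):
    """VGG-16 convolutional stack (ImageNet pre-trained), FC layers removed."""
    def __init__(self):
        super().__init__()
        self.features = models.vgg16(pretrained=True).features

    def forward(self, x):
        feats = []
        for layer in self.features:
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                feats.append(x)          # feature map after each pooling stage
        # feats[0..3] are at 1/2, 1/4, 1/8, and 1/16 of the input resolution
        return feats[:4]
```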

3.2. The Pathways

Following the general principles of localization network design, our network also uses an FCN structure. Similar to many networks, we set up four different branches to decode features at 1/2, 1/4, 1/8, and 1/16 of the original image size; these four parallel pathways with counting and localization networks are named main, scale, middle, and lowest. Table 2 shows the pathway configuration.
Table 2. The configuration of the pathways.
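As an illustration of how one such decoding pathway could be realized (the exact layer widths of Table 2 are not reproduced here, so the channel sizes below are assumptions), a lightweight FCN-style pathway might look like this:

```python
import torch.nn as nn
import torch.nn.functional as F

class Pathway(nn.Module):
    """Decodes one scale of backbone features into a single-channel map."""
    def __init__(self, in_channels, scale_factor):
        super().__init__()
        self.scale_factor = scale_factor          # 2, 4, 8, or 16
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),                  # density/localization map
        )

    def forward(self, x):
        x = self.decode(x)
        # upsample to a common resolution so the four pathways can be fused
        return F.interpolate(x, scale_factor=self.scale_factor,
                             mode='bilinear', align_corners=False)
```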

3.3. The Adaptive Branch

However, unlike other networks in which branches are connected directly in series, we propose using different weights to combine the output of each branch. We construct a self-learning classification branch, named the adaptive branch. The input of this branch is the feature map extracted by VGG-16. The branch structure is as follows: conv(3, 512, 1)–conv(3, 512, 1)–conv(3, 512, 1)–pool(2)–FC(25088, 4096)–ReLU–FC(4096, 4096)–ReLU–FC(4096, 4), where ‘conv’ represents a convolutional layer, ‘pool’ represents a max-pooling layer, ‘FC’ represents a fully connected layer, and ‘ReLU’ represents a Rectified Linear Unit. For the convolutional layers, the numbers in parentheses are the kernel size, the number of channels, and the dilation rate, respectively; for the fully connected layers, they are the input and output dimensions. Finally, we obtain four channels (CH) and then normalize each channel by the sum of all channels. The weight coefficients are obtained as shown in Equation (1).
The adaptive branch provides different weights that determine how much each scale of the original image contributes to the result. The weight values determine which image details receive more attention. If we focus on the 1/16-scale features, particularly small heads are bound to be ignored; conversely, if we focus on the 1/2-scale features, we are bound to pay more attention to larger heads. Through dynamic learning, we can allocate attention across scales according to the specific scene, which helps eliminate the effects of perspective change.
$$w_i = \frac{CH_i}{\sum_{j=1}^{4} CH_j + 10^{-9}}, \quad i = 1, 2, 3, 4 \qquad (1)$$
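A minimal PyTorch sketch of the adaptive branch is shown below; the ReLU activations between the convolutional layers and the assumption that the pooled feature map flattens to 512 × 7 × 7 = 25,088 values (implied by the first fully connected layer) are ours.

```python
import torch
import torch.nn as nn

class AdaptiveBranch(nn.Module):
    """Maps VGG features to four normalized pathway weights, as in Equation (1)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Linear(25088, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4),                       # raw scores CH_1..CH_4
        )

    def forward(self, x):
        ch = self.fc(torch.flatten(self.conv(x), 1))  # shape (batch, 4)
        # Equation (1): normalize the four scores into pathway weights
        return ch / (ch.sum(dim=1, keepdim=True) + 1e-9)
```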

3.4. Implementation Details

Our perspective distortion correction model is implemented in PyTorch []. To train the model, we set the batch size to four and the momentum to 0.9. We set the initial learning rate to $10^{-3}$ for all datasets and use SGD [] for training. For UCF_CC_50 in particular, we use five-fold cross-validation to make full use of the dataset when testing the effectiveness of the algorithm.
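The optimizer settings above can be sketched as follows; the placeholder network and the synthetic tensors are ours, used only to keep the snippet self-contained and runnable.

```python
import torch
import torch.nn as nn

# placeholder standing in for the full CAL network
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# one illustrative training step on synthetic data (batch size 4, as in the text)
images = torch.randn(4, 3, 224, 224)
target = torch.randn(4, 1, 224, 224)      # ground-truth density maps in practice
loss = ((model(images) - target) ** 2).flatten(1).sum(1).sum() / (2 * 4)  # Equation (2)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```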

3.4.1. Loss Function

Following the design of the loss functions in [,], we propose the loss function in Equation (2), in which $N$, $X_i$, and $\theta$ represent the batch size, the $i$th input image, and the set of trainable parameters, respectively. $\gamma_i^{GT}$ is the ground truth of $X_i$, and $Y(X_i; \theta)$ stands for the estimated density map generated by our proposed model with parameters $\theta$. $L(\theta)$ denotes the loss between the estimated results and the ground truth.
$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| Y(X_i; \theta) - \gamma_i^{GT} \right\|_2^2 \qquad (2)$$
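A direct PyTorch rendering of Equation (2) could look like the sketch below; the tensor layout of the density maps is an assumption.

```python
import torch

def density_loss(pred, gt):
    """Equation (2): L(theta) = 1/(2N) * sum_i || Y(X_i; theta) - gt_i ||_2^2."""
    n = pred.size(0)                                 # batch size N
    sq_err = (pred - gt) ** 2                        # per-pixel squared error
    return sq_err.reshape(n, -1).sum() / (2 * n)
```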

3.4.2. Density Map Generation

For crowd counting tasks, a CNN needs to process continuous data. As a result, we have to convert the discrete point-annotated data (including the ground-truth annotations and the prediction results) into density maps. The conversion is performed at the pixel level, and the idea is to convert the point annotation information into images that carry density information. The details of the operation are shown in Algorithm 1.
Algorithm 1: Ground-truth generation
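As a rough sketch of the point-to-density conversion, the snippet below uses the commonly adopted fixed-Gaussian formulation; it may differ from the exact procedure of Algorithm 1, and the kernel width `sigma` is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_density_map(points, height, width, sigma=4.0):
    """Place a unit impulse at every annotated head position and blur it with
    a Gaussian, so that the resulting map still sums to the person count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:                              # head annotations in pixel coordinates
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        density[row, col] += 1.0
    return gaussian_filter(density, sigma)           # integral ~= len(points)
```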

4. Experiments

In this section, we introduce three popular datasets frequently used for crowd counting and localization tasks, as well as several ways to evaluate the performance of the architectures. Afterwards, we compare previous experimental results with ours and evaluate our method on these datasets.

4.1. Evaluation Metrics

Several metrics are used to evaluate both person detection and counting performance. For counting evaluation, we use the commonly adopted mean absolute error (MAE) and mean square error (MSE) to measure the deviation between the predictions and the ground truth. The MAE and MSE are defined as:
$$MAE = \frac{1}{T} \sum_{t=1}^{T} \left| \mu_t - G_t \right|$$
$$MSE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( \mu_t - G_t \right)^2}$$
where $T$ is the total number of testing frames, and $\mu_t$ and $G_t$ are the predicted count and the ground-truth count of pedestrians for frame $t$, respectively.
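Both metrics can be computed directly from the per-image counts, as in the small NumPy sketch below; the square root in the MSE follows the root form commonly used in crowd counting evaluation.

```python
import numpy as np

def counting_metrics(pred_counts, gt_counts):
    """MAE and MSE over T test frames, as defined above."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.mean(np.abs(pred - gt))
    mse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, mse
```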

4.2. Datasets

Currently, various public datasets for the crowd counting task are available, such as MALL [], UCSD [], ShanghaiTech [], UCF_CC_50 [], and UCF-QNRF []. A comparison of images from these datasets is shown in Figure 2. In our experiments, we evaluate the proposed model on three crowd counting datasets: ShanghaiTech [], UCF_CC_50 [], and UCF-QNRF []. In the following parts, we present the chosen datasets and explain why they were chosen.
Figure 2. Sample images from various datasets. From left to right, the columns show UCSD [], Mall [], ShanghaiTech Part A [], ShanghaiTech Part B [], UCF_CC_50 [], and UCF-QNRF []. It is obvious that the images in the UCSD and Mall datasets provide almost no variation in perspective across images.
ShanghaiTech. ShanghaiTech [] is one of the largest datasets of recent years, consisting of 1198 crowd images with 330,165 annotations in total. The dataset is divided into two sets, named Part A (SHT A) and Part B (SHT B). Part A is composed of images randomly collected from the Internet, in which the count fluctuates between 33 and 3139 people per image with an average of 501.4. In contrast, images in Part B are taken from a busy street in Shanghai, and their crowd distribution is less diverse and sparser (123.6 on average).
UCF_CC_50. UCF_CC_50 [] is the first challenging crowd counting dataset created from Web images. The dataset contains various densities and different perspective distortions over multiple scenes. Being a small set of 50 images with crowd counts ranging from 50 to 4543, the dataset poses a serious problem for deep neural networks.
UCF-QNRF. UCF-QNRF [] was collected from Web search, Flickr, and Hajj footage and was first introduced by Idrees et al. in []. The dataset consists of a training set of 1201 images and a test set of 334 images, with 1.25 million annotations in total, and the density varies from 49 to 12,865 people per image.
Based on these three datasets, the properties of an ideal dataset for examining performance can be summarized as follows:
  • Challenging images. Some challenging images are necessary to evaluate the performance of the model under extreme conditions. With the development of crowd counting methods, most of them perform stably in sparse scenes. As a result, our model focuses on improving the performance in congested crowds and on achieving the localization task. For the crowd counting and localization task, images of extremely congested crowds are the ideal material to evaluate the robustness and accuracy of our model.
  • Proper density distribution. The distribution of the images directly affects the performance of the model in scenes with different levels of congestion. A proper mix of sparse, medium, and congested images can improve training accuracy and make verification more effective.
  • Multiple scenes. A dataset that contains multiple scenes, such as street views, market views, live show views, etc., can improve the robustness of our model. Multiple scenes means not only images taken from different locations, but also different weather conditions (such as rain and fog), light intensities, etc., which can affect the performance of our model, especially in the localization task.
In conclusion, the chosen datasets meet these requirements well, while Mall [] and UCSD [] are insufficient in some respects, which explains why we exclude these two datasets.

4.3. Results and Discussion

ShanghaiTech. Following the introduction of the ShanghaiTech dataset above, we evaluated the proposed framework against several state-of-the-art methods, including RAZNet [], a localization method utilizing an adaptive fusion scheme; LSC-CNN [], which uses different receptive fields; and ASD [], which introduces the adaptive scenario discovery framework. Table 3 summarizes the MAE and MSE of the former approaches and ours on the two parts of ShanghaiTech. On Part A, we achieve an impressive improvement of 2.1 in absolute MAE over ASD [] and 1.6 in MAE over the state-of-the-art RAZNet []. Compared with the state-of-the-art LSC-CNN [] on Part B, our CAL network also achieves the best MAE of 8.1 and MSE of 11.9. As outputs of our crowd counting and localization model, Figure 3 and Figure 4 show the localization performance on some images from Part A and Part B, respectively.
Table 3. Comparison between state-of-the-art methods and our approach on ShanghaiTech (Part A & Part B). The best result is in bold.
Figure 3. Qualitative results on the ShanghaiTech Part A.
Figure 4. Qualitative results on the ShanghaiTech Part B.
UCF_CC_50. We also evaluated CAL on the challenging UCF_CC_50 dataset introduced above. The results are shown in Table 4, and instance results are reported in Figure 5. As on ShanghaiTech, the proposed framework shows better results, improving on the former state-of-the-art by 14.2 in MAE, which indicates lower volatility of the model on high-density images.
Table 4. Comparison between state-of-the-art methods and our approach on UCF_CC_50. The best result is in bold.
Figure 5. Qualitative results on the UCF_CC_50.
UCF-QNRF. Following the process and the idea of the other two datasets, we use MAE as the evaluation metric and keep the training details consistent. Table 5 compares our CAL model with state-of-the-art methods. It is obvious that our model outperforms all of the preceding models; in particular, compared with other localization methods, our network improves MAE by at least 10.2. Additionally, Figure 6 shows the predicted bounding boxes for localization on some images from UCF-QNRF.
Table 5. Comparison between state-of-the-art methods and our approach on UCF-QNRF. The best result is in bold.
Figure 6. Qualitative results on the UCF-QNRF.

4.4. Ablation Studies

In this part, we focus on two issues: the effectiveness of the multi-branch network structure and the performance of the adaptive branch. To study these issues, we adjust our model and remove the adaptive branch to make it similar to typical multi-column models (as shown in Figure 7); we name the adjusted model ‘NO-CAL’. We removed the adaptive branch to explore its contribution to the model and to enable a cleaner comparison of experimental results. To address the first issue, we compare our models (CAL and NO-CAL) with previous multi-branch networks. For the second issue, we compare our CAL model with the NO-CAL model. The results are shown in Table 6 and Table 7.
Figure 7. The architecture and weight of NO-CAL.
Table 6. Comparison between other structures and our approach.
Table 7. The comparison between the NO-CAL structure and our approach.

4.4.1. The Effectiveness of the Multi-Branch Structure

Table 6 compares the former multi-branch structures with our design on ShanghaiTech, UCF_CC_50 and UCF-QNRF. It can be seen that our design outperforms the previous methods (MCNN [], Switch-CNN [], CMTL []). Additionally, even with the adaptive branch removed (NO-CAL), the performance still improves on the former results (Part A: 70.8 vs. 90.4; Part B: 14.2 vs. 21.6; UCF_CC_50: 258.9 vs. 318.1; UCF-QNRF: 163.7 vs. 228 on MAE). Moreover, the full CAL model performs much better than the NO-CAL structure (improvements in MAE of 7.3 on Part A, 6.1 on Part B, 47.5 on UCF_CC_50, and 53.4 on UCF-QNRF). This illustrates the effectiveness of our multi-branch structure.

4.4.2. The Effectiveness of the Adaptive Branch

As discussed in Section 1, previous methods cannot properly handle the perspective distortion challenge. To further validate the contribution of the adaptive branch, we conduct experiments with the CAL model and with the same model with the adaptive branch removed (named NO-CAL). We validate both models on the three datasets, and the results are presented in Table 7. The CAL model visibly outperforms NO-CAL, which we attribute to its better handling of perspective distortion. This experiment proves the effectiveness of the adaptive branch.
To compare the efficiency of our adaptive branch, we also evaluate its runtime performance. Because the image size in ShanghaiTech Part B is fixed, we use ShanghaiTech Part B as the benchmark for time efficiency. During inference on an Nvidia TITAN XP GPU, CAL achieves 12 FPS and NO-CAL achieves 13 FPS. The adaptive branch thus adds a small amount of time, but the cost is acceptable compared with the improvement in accuracy.
As our ablation study shows, our network structure is effective and performs well on the three chosen datasets.

5. Conclusions

In this paper, we have presented a novel architecture for counting crowds with perspective distortion correction via adaptive learning. The focus of our method is to use a dynamic learning network to learn the combination relationship under different samples and to use this relationship to form different scale ratios for each image. Experimental comparisons with state-of-the-art approaches (up to 15 methods) on ShanghaiTech, UCF_CC_50, and UCF-QNRF show the effectiveness and efficiency of our proposed adaptive learning framework for the crowd counting task.

Author Contributions

Y.S. and X.W. contributed to the design and implementation of the research, analyzed the results, and wrote the manuscript. J.J. devised the project and wrote the manuscript. T.M. and J.Y. devised the main conceptual ideas, planned the experiments, and wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Youth Program of National Natural Science Foundation of China No. 61907015, the Science and Technology Commission of Shanghai Municipality of China No. 18511103801, 18511103802, 18511106202.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; Wang, Y. CNN-based Density Estimation and Crowd Counting: A Survey. arXiv 2020, arXiv:2003.12783. [Google Scholar]
  2. Kang, D.; Ma, Z.; Chan, A.B. Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks Counting, Detection, and Tracking. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1408–1422. [Google Scholar] [CrossRef]
  3. Sindagi, V.A.; Patel, V.M. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognit. Lett. 2018, 107, 3–16. [Google Scholar] [CrossRef]
  4. Tong, M.; Fan, L.; Nan, H.; Zhao, Y. Smart Camera Aware Crowd Counting via Multiple Task Fractional Stride Deep Learning. Sensors 2019, 19, 1346. [Google Scholar] [CrossRef] [PubMed]
  5. Yu, Y.; Huang, J.; Du, W.; Xiong, N. Design and analysis of a lightweight context fusion CNN scheme for crowd counting. Sensors 2019, 19, 2013. [Google Scholar] [CrossRef]
  6. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  7. Wu, X.; Zheng, Y.; Ye, H.; Hu, W.; Ma, T.; Yang, J.; He, L. Counting crowds with varying densities via adaptive scenario discovery framework. Neurocomputing 2020, 397, 127–138. [Google Scholar] [CrossRef]
  8. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  9. Leibe, B.; Seemann, E.; Schiele, B. Pedestrian detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 878–885. [Google Scholar]
  10. Tuzel, O.; Porikli, F.; Meer, P. Pedestrian detection via classification on riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1713–1727. [Google Scholar] [CrossRef]
  11. Enzweiler, M.; Gavrila, D.M. Monocular pedestrian detection: Survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 2179–2195. [Google Scholar] [CrossRef]
  12. Li, M.; Zhang, Z.; Huang, K.; Tan, T. Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proceedings of the International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 8–11 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–4. [Google Scholar]
  13. Chan, A.B.; Vasconcelos, N. Bayesian poisson regression for crowd counting. In Proceedings of the International Conference on Computer Vision (ICCV), Kyoto, Japan, 29 September–2 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 545–551. [Google Scholar]
  14. Ryan, D.; Denman, S.; Fookes, C.; Sridharan, S. Crowd counting using multiple local features. In Proceedings of the Digital Image Computing: Techniques and Applications, Melbourne, Australia, 1–3 December 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 81–88. [Google Scholar]
  15. Kong, D.; Gray, D.; Tao, H. A viewpoint invariant approach for crowd counting. In Proceedings of the International Conference on Pattern Recognition (ICPR), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 3, pp. 1187–1190. [Google Scholar]
  16. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. In Proceedings of the British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012; Volume 1, p. 3. [Google Scholar]
  17. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2547–2554. [Google Scholar]
  18. Chen, K.; Gong, S.; Xiang, T.; Change Loy, C. Cumulative attribute space for age and crowd density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2467–2474. [Google Scholar]
  19. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  20. Walach, E.; Wolf, L. Learning to count with cnn boosting. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2016; pp. 660–676. [Google Scholar]
  21. Sam, D.B.; Surya, S.; Babu, R.V. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4031–4039. [Google Scholar]
  22. Stewart, R.; Andriluka, M.; Ng, A.Y. End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2325–2333. [Google Scholar]
  23. Liu, L.; Qiu, Z.; Li, G.; Liu, S.; Ouyang, W.; Lin, L. Crowd counting with deep structured scale integration network. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019; pp. 1774–1783. [Google Scholar]
  24. Guo, D.; Li, K.; Zha, Z.J.; Wang, M. Dadnet: Dilated-attention-deformable convnet for crowd counting. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1823–1832. [Google Scholar]
  25. Zhang, L.; Shi, Z.; Cheng, M.M.; Liu, Y.; Bian, J.W.; Zhou, J.T.; Zheng, G.; Zeng, Z. Nonlinear regression via deep negative correlation learning. IEEE Trans. Pattern Anal. Mach. Intell. 2019. [Google Scholar] [CrossRef]
  26. Lian, D.; Li, J.; Zheng, J.; Luo, W.; Gao, S. Density map regression guided detection network for rgb-d crowd counting and localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1821–1830. [Google Scholar]
  27. Basalamah, S.; Khan, S.D.; Ullah, H. Scale driven convolutional neural network model for people counting and localization in crowd scenes. IEEE Access 2019, 7, 71576–71584. [Google Scholar] [CrossRef]
  28. Liu, C.; Weng, X.; Mu, Y. Recurrent attentive zooming for joint crowd counting and precise localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1217–1226. [Google Scholar]
  29. Wen, L.; Du, D.; Zhu, P.; Hu, Q.; Wang, Q.; Bo, L.; Lyu, S. Drone-based Joint Density Map Estimation, Localization and Tracking with Space-Time Multi-Scale Attention Network. arXiv 2019, arXiv:1912.01811. [Google Scholar]
  30. Li, W.; Mahadevan, V.; Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 18–32. [Google Scholar]
  31. Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–21 June 2018; pp. 1091–1100. [Google Scholar]
  32. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  33. Onoro-Rubio, D.; López-Sastre, R.J. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2016; pp. 615–629. [Google Scholar]
  34. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8198–8207. [Google Scholar]
  35. Sam, D.B.; Sajjan, N.N.; Maurya, H.; Babu, R.V. Almost unsupervised learning for dense crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8868–8875. [Google Scholar]
  36. Liu, X.; van de Weijer, J.; Bagdanov, A.D. Exploiting unlabeled data in cnns by self-supervised learning to rank. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1862–1878. [Google Scholar] [CrossRef] [PubMed]
  37. Chaker, R.; Al Aghbari, Z.; Junejo, I.N. Social network model for crowd anomaly detection and localization. Pattern Recognit. 2017, 61, 266–281. [Google Scholar] [CrossRef]
  38. Chen, S.; Fern, A.; Todorovic, S. Person count localization in videos from noisy foreground and detections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1364–1372. [Google Scholar]
  39. Matan, O.; Burges, C.J.; LeCun, Y.; Denker, J.S. Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems; MIT: Cambridge, MA, USA, 1992; pp. 488–495. [Google Scholar]
  40. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  41. Ning, F.; Delhomme, D.; LeCun, Y.; Piano, F.; Bottou, L.; Barbano, P.E. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 2005, 14, 1360–1371. [Google Scholar] [CrossRef]
  42. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229. [Google Scholar]
  43. Eigen, D.; Krishnan, D.; Fergus, R. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 633–640. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  45. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  46. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  47. Laradji, I.H.; Rostamzadeh, N.; Pinheiro, P.O.; Vazquez, D.; Schmidt, M. Where are the blobs: Counting by localization with point supervision. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 547–562. [Google Scholar]
  48. Sam, D.B.; Peri, S.V.; Kamath, A.; Babu, R.V. Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection. arXiv 2019, arXiv:1906.07538. [Google Scholar]
  49. Wu, X.; Zheng, Y.; Ye, H.; Hu, W.; Yang, J.; He, L. Adaptive scenario discovery for crowd counting. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2382–2386. [Google Scholar]
  50. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  51. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems; MIT: Cambridge, MA, USA, 2019; pp. 8024–8035. [Google Scholar]
  52. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  53. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–7. [Google Scholar]
  54. Sindagi, V.A.; Patel, V.M. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
  55. Sam, D.B.; Babu, R.V. Top-down feedback for crowd counting convolutional neural network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  56. Zhang, L.; Shi, M.; Chen, Q. Crowd counting via scale-adaptive convolutional neural network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1113–1121. [Google Scholar]
  57. Zeng, L.; Xu, X.; Cai, B.; Qiu, S.; Zhang, T. Multi-scale convolutional neural networks for crowd counting. In Proceedings of the International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 465–469. [Google Scholar]
  58. Shen, Z.; Xu, Y.; Ni, B.; Wang, M.; Hu, J.; Yang, X. Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5245–5254. [Google Scholar]
  59. Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1861–1870. [Google Scholar]
  60. Liu, L.; Wang, H.; Li, G.; Ouyang, W.; Lin, L. Crowd counting using deep recurrent spatial-aware network. arXiv 2018, arXiv:1807.00601. [Google Scholar]
  61. Cao, X.; Wang, Z.; Zhao, Y.; Su, F. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  62. Shi, M.; Yang, Z.; Xu, C.; Chen, Q. Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7279–7288. [Google Scholar]
  63. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 833–841. [Google Scholar]
  64. Chen, X.; Bin, Y.; Sang, N.; Gao, C. Scale pyramid network for crowd counting. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1941–1950. [Google Scholar]
  65. Sindagi, V.A.; Patel, V.M. Ha-ccn: Hierarchical attention-based crowd counting network. IEEE Trans. Image Process. 2019, 29, 323–335. [Google Scholar] [CrossRef] [PubMed]
  66. Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; Shao, L. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6133–6142. [Google Scholar]
  67. Zhang, A.; Shen, J.; Xiao, Z.; Zhu, F.; Zhen, X.; Cao, X.; Shao, L. Relational attention network for crowd counting. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6788–6797. [Google Scholar]
