HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
Abstract
1. Introduction
- We design a deeper and denser attention-based CNN for crowd counting (CC) that captures scale variations ranging from abrupt to continuous. Densely organized dense stacked convolution modules (DSCMs) extract local-to-global information and aggregate it into the final density map.
- The proposed network, composed of a backbone, a dense feature extraction module (DFEM), and a channel attention module (CAM), improves the network's ability to capture general, contextual, and perspective-varying features for better CC accuracy.
- Modeling the dependencies among channels yields a rich semantic feature representation in the final layers, combining low- to high-level semantic features for better counting accuracy.
- The proposed approach aggregates task-independent and task-specific features from the lower and middle-lower layers at the higher layers, improving estimation accuracy.
2. Related Work
3. The Proposed Approach
3.1. Backbone Network
3.2. Dense Feature Extraction Module (DFEM)
Dense Stacked Convolution Module (DSCM)
3.3. Task-Independent and Task-Specific Feature Acquisition
3.4. Channel Attention Module (CAM)
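Modeling the dependency among channels can be illustrated with a generic channel-wise self-attention sketch. This is an illustration under stated assumptions, not the CAM as the authors specify it: the function name `channel_attention`, the residual weight `gamma`, and the choice of a softmax over the channel Gram matrix are hypothetical choices made here, and the real module may differ in both structure and learned parameters.

```python
import math


def channel_attention(features, gamma=1.0):
    """Generic channel-wise self-attention (illustrative assumption).

    features: list of C channels, each a flat list of N spatial values.
    Returns (re-weighted features, C x C attention matrix).
    """
    num_channels = len(features)

    # Channel-to-channel similarity: dot product of every pair of channels.
    energy = [
        [sum(a * b for a, b in zip(features[i], features[j]))
         for j in range(num_channels)]
        for i in range(num_channels)
    ]

    # Row-wise softmax turns similarities into attention weights.
    attention = []
    for row in energy:
        row_max = max(row)  # subtract the max for numerical stability
        exps = [math.exp(e - row_max) for e in row]
        total = sum(exps)
        attention.append([e / total for e in exps])

    # Each output channel is an attention-weighted mix of all channels,
    # scaled by gamma and added back to the input (residual connection).
    out = []
    for i in range(num_channels):
        mixed = [
            sum(attention[i][j] * features[j][n] for j in range(num_channels))
            for n in range(len(features[i]))
        ]
        out.append([gamma * m + x for m, x in zip(mixed, features[i])])
    return out, attention


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weighted, attn = channel_attention(feats, gamma=0.5)
```

With `gamma=0.0` the sketch returns its input unchanged, so the attention branch acts as a residual refinement on top of the existing features.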
4. Implementation Details
4.1. Network Configuration
4.2. Training Details
4.3. Data Preparation
5. Performance Evaluation
5.1. Metrics Used for Evaluation
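The MAE and MSE columns in the result tables can be computed from per-image counts as sketched below. The sketch assumes the convention common in the crowd-counting literature, where a predicted count is the sum of the predicted density map and "MSE" denotes the root of the mean squared count error; the helper names `count_from_density`, `mae`, and `mse` are hypothetical.

```python
import math


def count_from_density(density_map):
    """Estimated count = sum of all pixels in a 2D density map."""
    return sum(sum(row) for row in density_map)


def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)


def mse(pred_counts, gt_counts):
    """Root of the mean squared count error (assumed convention for 'MSE')."""
    squared = sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
    return math.sqrt(squared / len(gt_counts))


# Toy example with made-up counts for three test images.
predicted = [105.0, 98.0, 250.0]
ground_truth = [100.0, 100.0, 240.0]
example_mae = mae(predicted, ground_truth)
example_mse = mse(predicted, ground_truth)
```

MAE reflects average counting accuracy, while the squared term in MSE penalizes large per-image errors more heavily, which is why it is often read as a robustness indicator.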
5.2. Testing Data
5.2.1. Venice Dataset
5.2.2. ShanghaiTech (Part-A)
5.2.3. ShanghaiTech (Part-B)
5.3. Ablation Study
- Backbone: The backbone is a VGG-16-based network.
- Backbone + DFEM: Two modules, with the DFEM appended sequentially to the backbone.
- Backbone + DFEM + CAM: All three modules connected sequentially.
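The contribution of each module can be quantified as a relative error reduction between consecutive configurations. The sketch below uses the Venice MAE and MSE values reported in this article's ablation table; the helper `relative_reduction` and the variable names are hypothetical.

```python
# (configuration, MAE, MSE) on the Venice dataset, from the ablation table.
ablation = [
    ("Backbone", 43.0, 60.2),
    ("Backbone + DFEM", 23.8, 34.5),
    ("Backbone + DFEM + CAM", 14.1, 20.1),
]


def relative_reduction(before, after):
    """Percentage by which an error metric drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# MAE reduction contributed by each added module.
dfem_gain = relative_reduction(ablation[0][1], ablation[1][1])
cam_gain = relative_reduction(ablation[1][1], ablation[2][1])
total_gain = relative_reduction(ablation[0][1], ablation[2][1])

for name, gain in [("DFEM", dfem_gain), ("CAM", cam_gain), ("Total", total_gain)]:
    print(f"{name}: {gain:.1f}% lower MAE")
```

On these numbers, adding the DFEM lowers MAE by roughly 45%, adding the CAM lowers it by a further 41% or so, and the full network has about 67% lower MAE than the backbone alone.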
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cohen, J.E. World population in 2050: Assessing the projections. In Conference Series-Federal Reserve Bank of Boston; Federal Reserve Bank of Boston: Boston, MA, USA, 1998; Volume 46, pp. 83–113.
- Ilyas, N.; Ahmad, A.; Kim, K. CASA-Crowd: A Context-Aware Scale Aggregation CNN-Based Crowd Counting Technique. IEEE Access 2019, 7, 182050–182059.
- Ilyas, N.; Shahzad, A.; Kim, K. Convolutional-Neural Network-Based Image Crowd Counting: Review, Categorization, Analysis, and Performance Evaluation. Sensors 2020, 20, 43.
- Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597.
- Sam, D.B.; Surya, S.; Babu, R.V. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 6.
- Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100.
- Liu, W.; Lis, K.; Salzmann, M.; Fua, P. Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos. arXiv 2018, arXiv:1803.08805.
- Huang, S.; Li, X.; Cheng, Z.Q.; Zhang, Z.; Hauptmann, A. Stacked Pooling: Improving Crowd Counting by Boosting Scale Invariance. arXiv 2018, arXiv:1808.07456.
- Shi, Z.; Zhang, L.; Sun, Y.; Ye, Y. Multiscale Multitask Deep NetVLAD for Crowd Counting. IEEE Trans. Ind. Inform. 2018, 14, 4953–4962.
- Wang, L.; Shao, W.; Lu, Y.; Ye, H.; Pu, J.; Zheng, Y. Crowd Counting with Density Adaption Networks. arXiv 2018, arXiv:1806.10040.
- Zhang, Y.; Zhou, C.; Chang, F.; Kot, A.C. Attention to Head Locations for Crowd Counting. arXiv 2018, arXiv:1806.10287.
- Liu, L.; Wang, H.; Li, G.; Ouyang, W.; Lin, L. Crowd Counting using Deep Recurrent Spatial-Aware Network. arXiv 2018, arXiv:1807.00601.
- Huang, S.; Li, X.; Zhang, Z.; Wu, F.; Gao, S.; Ji, R.; Han, J. Body structure aware deep crowd counting. IEEE Trans. Image Process. 2018, 27, 1049–1059.
- Yang, B.; Cao, J.; Wang, N.; Zhang, Y.; Zou, L. Counting challenging crowds robustly using a multi-column multi-task convolutional neural network. Signal Process. Image Commun. 2018, 64, 118–129.
- Liu, X.; van de Weijer, J.; Bagdanov, A.D. Leveraging unlabeled data for crowd counting by learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7661–7669.
- Zhu, J.; Feng, F.; Shen, B. People counting and pedestrian flow statistics based on convolutional neural network and recurrent neural network. In Proceedings of the 2018 33rd Youth Academic Annual Conference of Chinese Association of Automation (YAC), Nanjing, China, 18–20 May 2018.
- Wan, J.; Chan, A. Adaptive density map generation for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1130–1139.
- Qiu, Z.; Liu, L.; Li, G.; Wang, Q.; Xiao, N.; Lin, L. Crowd counting via multi-view scale aggregation networks. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1498–1503.
- Zhang, A.; Yue, L.; Shen, J.; Zhu, F.; Zhen, X.; Cao, X.; Shao, L. Attentional neural fields for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5714–5723.
- Tian, Y.; Lei, Y.; Zhang, J.; Wang, J.Z. PaDNet: Pan-Density Crowd Counting. arXiv 2018, arXiv:1811.02805.
- Sindagi, V.A.; Yasarla, R.; Patel, V.M. Pushing the frontiers of unconstrained crowd counting: New dataset and benchmark method. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1221–1231.
- Sindagi, V.A.; Patel, V.M. Multi-level bottom-top and top-bottom feature fusion for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1002–1012.
- Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; Shao, L. Crowd Counting and Density Estimation by Trellis Encoder-Decoder Network. arXiv 2019, arXiv:1903.00853.
- Zeng, X.; Wu, Y.; Hu, S.; Wang, R.; Ye, Y. DSPNet: Deep scale purifier network for dense crowd counting. Expert Syst. Appl. 2020, 141, 112977.
- Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? arXiv 2014, arXiv:1411.1792.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
- Marsden, M.; McGuinness, K.; Little, S.; O’Connor, N.E. Fully convolutional crowd counting on highly congested scenes. arXiv 2016, arXiv:1612.00220.
- Sindagi, V.A.; Patel, V.M. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
- Zhang, L.; Shi, M.; Chen, Q. Crowd counting via scale-adaptive convolutional neural network. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 May 2018; pp. 1113–1121.
- Wang, Y.; Wang, G.; Chen, C.; Pan, Z. Multi-scale dilated convolution of convolutional neural network for image denoising. Multimed. Tools Appl. 2019, 78, 19945–19960.
- Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1861–1870.
- Shen, Z.; Xu, Y.; Ni, B.; Wang, M.; Hu, J.; Yang, X. Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5245–5254.
- Shi, Z.; Zhang, L.; Liu, Y.; Cao, X.; Ye, Y.; Cheng, M.M.; Zheng, G. Crowd counting with deep negative correlation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5382–5390.
- Babu Sam, D.; Sajjan, N.N.; Venkatesh Babu, R.; Srinivasan, M. Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3618–3626.
- Li, J.; Xue, Y.; Wang, W.; Ouyang, G. Cross-Level Parallel Network for Crowd Counting. IEEE Trans. Ind. Inform. 2019, 16, 566–576.
- Wang, Z.; Xiao, Z.; Xie, K.; Qiu, Q.; Zhen, X.; Cao, X. In defense of single-column networks for crowd counting. arXiv 2018, arXiv:1808.06133.
- Ranjan, V.; Le, H.; Hoai, M. Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 270–285.
- Liu, J.; Gao, C.; Meng, D.; Hauptmann, A.G. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5197–5206.
- Ilyas, N.; Najarro, A.C.; Kim, K. DFE-Crowd: Dense Feature Extraction for Single Image Crowd Counting; Korean Communication Society: Jeju, Korea, 2020; pp. 457–458.
- Sindagi, V.A.; Patel, V.M. Inverse attention guided deep crowd counting network. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8.
- Zhang, A.; Shen, J.; Xiao, Z.; Zhu, F.; Zhen, X.; Cao, X.; Shao, L. Relational attention network for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6788–6797.
- Liu, W.; Salzmann, M.; Fua, P. Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5099–5108.
Modules | Sub-Modules | Channels | Filter | Padding | Dilation | HADF-Crowd |
---|---|---|---|---|---|---|
Backbone Network | Sub-M1 | 64 | 3 × 3 | 1 | 1 | Conv3-64 Conv3-64 Max pooling |
Backbone Network | Sub-M2 | 128 | 3 × 3 | 1 | 1 | Conv3-128 Conv3-128 Max pooling |
Backbone Network | Sub-M3 | 256 | 3 × 3 | 1 | 1 | Conv3-256 Conv3-256 Max pooling |
Backbone Network | Sub-M4 | 512 | 3 × 3 | 1 | 1 | Conv3-512 Conv3-512 Conv3-512 |
DFEM | Sub-M5 | 512, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-512-1 Conv3-256-1 Conv3-128-1 Conv3-64-1 |
DFEM | Sub-M6 | 576, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-576-2 Conv3-256-2 Conv3-128-2 Conv3-64-2 |
DFEM | Sub-M7 | 640, 256, 128, 64 | 3 × 3 | 1 | 1 | Conv3-640-2 Conv3-256-2 Conv3-128-2 Conv3-64-2 |
DFEM | Sub-M8 | 640, 512 | 3 × 3 | 1 | 1 | Conv3-640-3 Conv3-512-3 |
Output | | 512, 128, 64, 1 | 3 × 3 | 1 | 1 | Conv3-512-1 Conv3-128-1 Conv3-64-1 Conv1-1-1 |

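The first-layer widths of Sub-M5 to Sub-M7 in the configuration table (512, 576, 640) are consistent with dense connectivity in which each module's 64-channel output is concatenated onto its input before the next module. The sketch below reproduces that arithmetic; the concatenation rule is an inference from the table rather than something the table states, the sketch does not cover Sub-M8, and `dense_input_widths` is a hypothetical helper.

```python
def dense_input_widths(base_channels, growth, num_modules):
    """Input width of each module when every module's output
    (`growth` channels) is concatenated onto its own input."""
    widths = []
    channels = base_channels
    for _ in range(num_modules):
        widths.append(channels)
        channels += growth  # concatenation adds the module's output channels
    return widths


# Backbone output has 512 channels; each DSCM ends in a 64-channel layer.
widths = dense_input_widths(base_channels=512, growth=64, num_modules=3)
print(widths)  # [512, 576, 640]
```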
Technique | Part-A MAE | Part-A MSE | Part-B MAE | Part-B MSE | Venice MAE | Venice MSE |
---|---|---|---|---|---|---|
Marsden et al. [29] | 126.5 | 173.5 | 23.8 | 33.1 | - | - |
MCNN [4] | 110.2 | 173.2 | 26.4 | 41.3 | 145.4 | 147.3 |
C-MTL [30] | 101.3 | 152.4 | 20.0 | 31.1 | - | - |
SwitchCNN [5] | 90.4 | 135.0 | 21.6 | 33.4 | 52.8 | 59.5 |
SaCNN [31] | 86.8 | 139.2 | 16.2 | 25.8 | - | - |
Mult-S-CNN [32] | 83.7 | 124.5 | 17.9 | 32.4 | - | - |
CP-CNN [33] | 73.6 | 106.4 | 20.1 | 30.1 | - | - |
ACSCP [34] | 75.7 | 102.7 | 17.2 | 27.4 | - | - |
Deep-NCL [35] | 73.5 | 112.3 | 18.7 | 26.0 | - | - |
IG-CNN [36] | 72.5 | 118.2 | 13.6 | 21.1 | - | - |
CLPNet [37] | 71.5 | 108.7 | 12.2 | 20.0 | - | - |
SCNet [38] | 71.9 | 117.9 | 9.3 | 14.4 | - | - |
ic-CNN [39] | 68.5 | 116.2 | 10.7 | 12.2 | - | - |
CSRNet [6] | 68.2 | 115.0 | 10.0 | 16.0 | 35.8 | 50.0 |
DecideNet [40] | - | - | - | - | 21.5 | 31.9 |
DRASAN [12] | 69.3 | 96.4 | 11.1 | 18.2 | - | - |
DFE-Crowd [41] | 71.6 | 110.9 | 9.7 | 16.0 | 23.8 | 34.5 |
IA-DCCN [42] | 66.9 | 108.4 | 10.2 | 16.0 | - | - |
DsNet [41] | 61.2 | 102.6 | 6.7 | 10.5 | - | - |
RANet [43] | 59.4 | 102.0 | 7.9 | 12.9 | - | - |
ECAN [44] | 62.3 | 100.0 | 7.8 | 12.2 | 20.5 | 29.9 |
HADF-Crowd | 71.1 | 111.6 | 9.7 | 15.7 | 14.1 | 20.1 |

Modules | Venice MAE | Venice MSE |
---|---|---|
Backbone | 43.0 | 60.2 |
Backbone + DFEM | 23.8 | 34.5 |
Backbone + DFEM + CAM | 14.1 | 20.1 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ilyas, N.; Lee, B.; Kim, K. HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting. Sensors 2021, 21, 3483. https://doi.org/10.3390/s21103483