1. Introduction
Automated crowd analysis is a challenging problem that has received tremendous attention from the research community over the last decade. Owing to the growing population, large numbers of people attend mass events such as religious festivals, marathons, and concerts. Although these events are organized for entertainment or the fulfillment of religious obligations, such peaceful gatherings sometimes end in a crowd disaster. To predict and prevent crowd disasters, surveillance cameras are mounted at different locations of a venue, where security personnel manually analyze the whole crowd with the naked eye. Studies have shown that such manual analysis is a tedious job and is usually prone to errors [1].
To automatically analyze a crowded scene, researchers have developed different models and methods that understand crowd dynamics [2]. Crowd analysis spans various applications, including crowd counting [3,4], congestion detection [5], crowd tracking [6], crowd behavior understanding [7,8,9], and more. Crowd counting, in particular, has gained significant importance within the research community.
Crowd counting in naturalistic scenes has numerous applications and is significant from both political and geo-political perspectives [10]. The task of crowd counting is to count the number of participants attending an event. Currently, most state-of-the-art crowd-counting methods fall into two groups: (1) regression-based approaches and (2) detection-based approaches.
Regression-based methods regress density information and estimate the count without localizing people. Zhang et al. [11] proposed a network that simultaneously solves the counting and density-estimation problems. This method relies on perspective maps that enhance counting accuracy; however, acquiring a perspective map for every scene increases the computational cost. Similarly, a Multicolumn Convolutional Neural Network (MCNN) is proposed in [12] that consists of three columns, each implementing a small network with a different receptive field in order to handle multi-scale problems. The Switching Convolutional Neural Network proposed in [13] contains multiple Convolutional Neural Network (CNN) regressors with different receptive fields, and a switch classifier is trained to route each patch to the CNN regressor that can best estimate its count. While regression-based approaches excel in high-density scenarios, they tend to overestimate crowd counts in low-density situations.
Detection-based crowd-counting methods not only estimate crowd counts but also localize the people in the scene. Composition loss was introduced in [14] to address the simultaneous challenges of counting, density estimation, and localization. Similarly, the Locate, Size and Count Convolutional Neural Network (LSC-CNN) [15] localizes every person in a crowded scene, estimates the bounding box (size) of each visible head, and finally counts the number of people. The Scale-Driven Convolutional Neural Network (SD-CNN) [16] counts people in high-density crowds by detecting visible heads. These approaches work well in low-density scenes; however, their performance degrades in high-density situations. Therefore, a single "one-model" method is needed that can accurately count people in all kinds of scenes.
To address the above problems, we propose a framework that combines the advantages of regression-based and detection-based models by exploiting the variation of crowd density within an image to accurately predict crowd counts. The framework adopts a routing strategy that sends each image patch to one of two counting modules based on its density level. The input image is first divided into non-overlapping patches of fixed size. Each patch is then classified into one of four classes, i.e., Low, Medium, High, and No Crowd. The patches are then provided as input to the Decision Block (DB), where, based on the classification label, each patch is routed to one of two modules, i.e., the Detection Network or the Regression Network. The network estimates the count in each patch, and the final count is obtained by summing the counts from all patches.
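The first step of this pipeline, dividing the image into non-overlapping fixed-size patches, can be sketched as follows. The patch size (128 pixels) and the crop-based border handling are illustrative assumptions; the paper does not specify them here.

```python
import numpy as np

def extract_patches(image, patch=128):
    """Split an H x W x C image into non-overlapping fixed-size patches.

    The image is cropped to the largest multiple of `patch` in each
    dimension; border handling (e.g., padding) is an implementation
    choice not fixed by the text.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(image[y:y + patch, x:x + patch])
    return patches

img = np.zeros((300, 260, 3), dtype=np.uint8)
tiles = extract_patches(img, patch=128)
# a 300 x 260 image yields a 2 x 2 grid of 128 x 128 patches
```

Each returned patch would then be fed to the Crowd Classifier before routing.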
The proposed framework offers the following contributions:
A unified deep-learning framework is proposed that estimates crowd count in diverse scenes.
We introduce a Crowd Classifier (CC) that classifies the patches into four categories, including Low Crowd, Medium Crowd, High Crowd, and No Crowd.
A novel Head-Detection (HD) network is introduced for the efficient detection of human heads in complex scenes, leveraging iterative deep aggregation (IDA) to extract multi-scale features from various layers of the network.
A novel Crowd-Regression Module (CRM) is introduced, which utilizes an Atrous Convolution Grid (ACG) to densely sample a wide range of scales and contextual information for accurate crowd count estimation.
An effective routing strategy is developed that efficiently routes the patches to either a detection network or regression module based on crowd density variations within an image.
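As background for the Atrous Convolution Grid contribution above: an atrous (dilated) convolution enlarges the receptive field of a small kernel without adding parameters, which is what allows a grid of dilation rates to sample many scales densely. The following minimal NumPy sketch (our own illustration, not the paper's implementation) shows a valid 2-D convolution with a dilated 3 × 3 kernel.

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """Valid 2-D convolution of a single-channel map `x` with a 3x3
    kernel `k` dilated by `rate`.

    With dilation r, the 3x3 kernel covers a (2r+1) x (2r+1) receptive
    field while still using only nine weights.
    """
    r = rate
    h, w = x.shape
    out = np.zeros((h - 2 * r, w - 2 * r))
    for y in range(out.shape[0]):
        for xx in range(out.shape[1]):
            # sample the input at stride `r` inside the dilated window
            window = x[y:y + 2 * r + 1:r, xx:xx + 2 * r + 1:r]
            out[y, xx] = (window * k).sum()
    return out
```

An ACG-style module would apply several such convolutions with different rates in parallel and fuse their outputs.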
The remaining sections of the paper are structured as follows: Section 2 discusses related work, Section 3 outlines the proposed methodology, detailed experimental results along with performance analysis are presented in Section 4, and concluding remarks are provided in Section 7.
3. Proposed Methodology
In this section, we delve into the various components of the proposed framework, whose pipeline is illustrated in Figure 1. The framework comprises four major modules: the Crowd Classifier (CC), the Patch-Routing Module (PRM), the Head Detector (HD), and the Crowd-Regression Module (CRM). The primary objective of the framework is to estimate the number of people within a given image. The initial step divides the input image into non-overlapping patches. These patches then serve as input to the CC, which classifies them into four distinct categories: No Crowd (NC), Low Crowd (LC), Medium Crowd (MC), and High Crowd (HC). Based on the classification outcome, the PRM directs each patch either to the Head-Detection Module or to the Crowd-Regression Module. The counting modules estimate the count in each input patch, and the count accumulator then produces the final count by accumulating the counts over all patches of the input image.
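The routing and accumulation logic can be sketched as follows, assuming (as detailed for the counting modules) that LC/MC patches go to the head detector, HC patches to the regression module, and NC patches contribute zero. The function names are illustrative placeholders, not the paper's code.

```python
def count_image(patches, classify, detect_count, regress_count):
    """Route each patch by its predicted density class and sum the counts.

    `classify` returns one of 'NC', 'LC', 'MC', 'HC'; `detect_count` and
    `regress_count` stand in for the Head-Detection and Crowd-Regression
    modules, respectively.
    """
    total = 0.0
    for p in patches:
        label = classify(p)
        if label == 'NC':            # no crowd: contributes nothing
            continue
        elif label in ('LC', 'MC'):  # low/medium density -> head detector
            total += detect_count(p)
        else:                        # high density -> regression module
            total += regress_count(p)
    return total
```

In the real framework the two counting callbacks would wrap the trained HD and CRM networks.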
The Head-Detection Module is responsible for processing Low-Density Crowd and Medium-Density Crowd patches. It employs a deep-learning model to detect the number of heads in each patch. On the other hand, the Crowd-Regression Module handles high-density crowd patches and estimates the count within each patch. Detailed information on each module is provided as follows:
3.3. Head-Detection Module
Head detection in images and videos has a wide range of applications in crowd analysis and large-scale surveillance. Head detection is a special case of object detection. Although object detection in images has achieved significant progress, head detection presents a distinctive set of challenges. These challenges arise from the substantial variations in head sizes, complex background clutter, and the relatively small size of human heads within images.
Current generic object detectors face the following challenges when detecting human heads in images for counting tasks: (1) Current deep-learning-based object detectors represent objects through bounding boxes that tightly encompass them. This approach is highly effective when precise ground-truth bounding box annotations are available for training; however, such annotations are not available in crowd-counting datasets, which usually contain dot annotations (2-D points) representing the position of each human head in the image. This difference in annotation methodology complicates the training of head-detection models, as these models are primarily designed for bounding box annotations. (2) Current deep-learning models, such as Faster R-CNN, extract deep hierarchical features by passing the input image through successive convolution and pooling layers. These pooling operations downsample the input, leading to the loss of crucial information about small objects.
To precisely detect human heads in complex scenes, we propose a simple yet effective approach that addresses the above-mentioned problems. To tackle the bounding box representation problem, we employ CenterNet. CenterNet adopts a keypoint-centric approach, which demonstrates exceptional performance in situations where bounding boxes are unavailable or that involve small and densely clustered human heads. The network efficiently identifies head locations by predicting the center of each human head, even in crowded scenes or under occlusion.
Although the adoption of CenterNet solves the bounding box representation problem and lets us directly use the dot annotations provided by the dataset, CenterNet in its original form may suffer from a loss of fine-grained information and thus may not address the second problem. This is because CenterNet employs successive pooling operations, which downsample the input image. This downsampling potentially results in the loss of crucial information about small objects and may produce many false positives. To address this problem, we modify the original CenterNet by incorporating an iterative deep layer aggregation strategy, which combines features from both shallow and deep layers of the network. This strategy allows for better context understanding while retaining the spatial details of tiny heads. The integration of shallow and deep feature layers provides the network with more comprehensive and precise information about small heads, mitigating the downsampling problem.
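The shallow/deep fusion idea can be illustrated with a minimal NumPy sketch. Here nearest-neighbour upsampling stands in for the learned deconvolution, and averaging stands in for the learned aggregation node; both substitutions are our own simplifications of the paper's modules.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling (a stand-in for deconvolution)."""
    return f.repeat(2, axis=0).repeat(2, axis=1)

def aggregate(shallow, deep):
    """A minimal aggregation node: upsample the deeper map to the
    shallower map's resolution and fuse the two (here by averaging;
    the paper's node is learned)."""
    return 0.5 * (shallow + upsample2x(deep))

def ida(features):
    """Iteratively aggregate ResNet block outputs F1..F4, ordered
    shallow -> deep, each half the spatial size of the previous one.
    The result keeps the shallowest map's resolution."""
    out = features[-1]
    for f in reversed(features[:-1]):
        out = aggregate(f, out)
    return out
```

Because the fused map retains the shallowest feature's resolution, spatial detail about tiny heads survives alongside the deep semantic context.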
As the distance between human heads in high-density crowds is only a few pixels, the Head-Detection network produces a high-resolution heatmap in order to accurately detect each head. In this heatmap, dark pixels indicate the likelihood of a human head's presence, while bluish pixels represent the background or other objects. The overall architecture of the proposed Head-Detection framework is illustrated in Figure 4. We use ResNet-18 as the backbone of the framework. ResNet-18 consists of four blocks, namely ResNet-1, ResNet-2, ResNet-3, and ResNet-4. The network accepts the input image and applies a 7 × 7 convolutional layer with stride 2, followed by a 3 × 3 max-pooling layer with stride 2. The resultant feature map is then passed through ResNet-1, which employs a stack of two 3 × 3 convolutional layers and reduces the feature map to half its size. The reduced feature map is then passed through ResNet-2. The output of ResNet-2 is up-sampled by a deconvolutional layer and then integrated with the feature map of ResNet-1 using an iterative deep-aggregation module, IDA-1. The feature maps of the subsequent ResNet blocks are integrated through the iterative deep-aggregation function I, which captures the deep semantic information and is formulated in Equation (1):

$$ I(F_1, \ldots, F_n) = \begin{cases} F_1, & n = 1 \\ I\big(N(F_1, F_2), F_3, \ldots, F_n\big), & \text{otherwise} \end{cases} \quad (1) $$

where $F_i$ is the feature map of ResNet block $i$, and $N$ represents an aggregation node.
The final feature map is subsequently subjected to a 1 × 1 convolution layer followed by a SoftMax operation to estimate the probability of human heads. Next, a 3 × 3 filter is applied to mitigate noise and detect peaks based on a specified threshold. In this study, we employ a threshold value of 0.5. Any pixel with a value lower than 0.5 is considered noise and is suppressed, while pixels with values greater than 0.5 are set to 1. We then utilize the coordinates of these peaks to derive the location of human heads.
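The peak-extraction step above can be sketched as follows. A pixel is kept as a head location if its probability exceeds the 0.5 threshold and it is the maximum of its 3 × 3 neighbourhood; this simple local-maximum rule is our stand-in for the paper's 3 × 3 noise-suppression filter.

```python
import numpy as np

def detect_heads(heatmap, thresh=0.5):
    """Return (row, col) coordinates of head peaks in a probability map.

    A pixel counts as a peak if it exceeds `thresh` and dominates its
    3 x 3 neighbourhood.
    """
    # pad with -inf so border pixels compare only against real neighbours
    pad = np.pad(heatmap, 1, mode='constant', constant_values=-np.inf)
    peaks = []
    h, w = heatmap.shape
    for y in range(h):
        for x in range(w):
            window = pad[y:y + 3, x:x + 3]   # 3x3 neighbourhood of (y, x)
            if heatmap[y, x] >= thresh and heatmap[y, x] == window.max():
                peaks.append((y, x))
    return peaks
```

The head count of a patch is then simply the number of returned peaks.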
For training the Head-Detection network, we utilize dot-level annotations, where 1 represents the presence of a human head and 0 represents the background. To supervise the network, we need to generate a ground-truth heatmap; for this purpose, we place a 2D Gaussian kernel at the location of each head. After generating ground-truth heatmaps from the dot-level annotations, we train the Head-Detection network with the focal cross-entropy loss function formulated in Equation (2):

$$ L = -\frac{1}{M} \sum_{p \in G} \begin{cases} (1 - \hat{y}_p)^{\alpha} \log(\hat{y}_p), & y_p = 1 \\ \lambda\,(1 - y_p)^{\beta}\,\hat{y}_p^{\alpha} \log(1 - \hat{y}_p), & \text{otherwise} \end{cases} \quad (2) $$

where $M$ is the number of positive samples (heads) in the image $G$, $\hat{y}_p$ represents the predicted probability of pixel $p$, and $y_p$ is the ground truth, with 1 for head and 0 for background. $\alpha$ is the hyper-parameter of the focal loss [47], and we set its value to 2 in all experiments, as also adopted in [48]; $\beta$ is a hyper-parameter that controls the penalty on negative samples, and we fix its value to 4 in all experiments; $\lambda$ is the balancing parameter that balances the positive and negative points, and its value is kept fixed across all experiments.
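A minimal NumPy implementation of this heatmap focal loss is sketched below, using the stated values α = 2 and β = 4. The balancing parameter λ is exposed as `lam` with a placeholder default of 1.0, since its exact value is not reconstructed here.

```python
import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, lam=1.0, eps=1e-12):
    """Focal cross-entropy loss between a predicted head heatmap `pred`
    (per-pixel probabilities) and a Gaussian-smoothed ground truth `gt`
    (exactly 1 at head centres).

    alpha=2 and beta=4 follow the values stated in the text; lam is a
    placeholder for the unspecified balancing parameter.
    """
    pos = (gt == 1)
    n_pos = max(pos.sum(), 1)                 # number of head points M
    # positive pixels: down-weight easy, already-confident predictions
    pos_loss = ((1 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    # negative pixels: penalty softened near head centres via (1 - gt)^beta
    neg = ~pos
    neg_loss = ((1 - gt[neg]) ** beta * pred[neg] ** alpha
                * np.log(1 - pred[neg] + eps)).sum()
    return -(pos_loss + lam * neg_loss) / n_pos
```

A confident, correct heatmap yields a near-zero loss, while a uniform 0.5 prediction is penalized heavily on the positive pixel.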
Author Contributions
Conceptualization, S.D.K. and F.U.R.; methodology, S.D.K.; software, S.D.K.; validation, F.U.R., A.N.A. and S.D.K.; formal analysis, S.D.K.; investigation, S.D.K.; resources, A.N.A.; data curation, A.N.A.; writing—original draft preparation, S.D.K.; writing—review and editing, A.N.A.; visualization, F.U.R.; supervision, A.N.A.; project administration, A.N.A.; funding acquisition, A.N.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research is funded by Custodian of the Two Holy Mosques Institute for Hajj and Umrah Research, Umm Al-Qura, Makkah, Saudi Arabia under project No. 23/113.
Data Availability Statement
The data used in this research is publicly available.
Acknowledgments
The researchers extend their sincere thanks to the Custodian of the Two Holy Mosques Institute for Hajj and Umrah Research for supporting and financing this project No. 23/113, which significantly contributed to the completion of the project phases.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Khan, S.D.; Tayyab, M.; Amin, M.K.; Nour, A.; Basalamah, A.; Basalamah, S.; Khan, S.A. Towards a crowd analytic framework for crowd management in Majid-al-Haram. arXiv 2017, arXiv:1709.05952. [Google Scholar]
- Gayathri, H.; Aparna, P.; Verma, A. A review of studies on understanding crowd dynamics in the context of crowd safety in mass religious gatherings. Int. J. Disaster Risk Reduct. 2017, 25, 82–91. [Google Scholar] [CrossRef]
- Khan, M.A.; Menouar, H.; Hamila, R. Revisiting crowd counting: State-of-the-art, trends, and future perspectives. Image Vis. Comput. 2023, 129, 104597. [Google Scholar] [CrossRef]
- Wang, M.; Cai, H.; Dai, Y.; Gong, M. Dynamic Mixture of Counter Network for Location-Agnostic Crowd Counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 167–177. [Google Scholar]
- Basalamah, S.; Khan, S.D.; Felemban, E.; Naseer, A.; Rehman, F.U. Deep learning framework for congestion detection at public places via learning from synthetic data. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 102–114. [Google Scholar] [CrossRef]
- Stadler, D.; Beyerer, J. Modelling ambiguous assignments for multi-person tracking in crowds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 133–142. [Google Scholar]
- Li, Y. A deep spatiotemporal perspective for understanding crowd behavior. IEEE Trans. Multimed. 2018, 20, 3289–3297. [Google Scholar] [CrossRef]
- Grant, J.M.; Flynn, P.J. Crowd scene understanding from video: A survey. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2017, 13, 1–23. [Google Scholar] [CrossRef]
- Khan, S.D.; Bandini, S.; Basalamah, S.; Vizzari, G. Analyzing crowd behavior in naturalistic conditions: Identifying sources and sinks and characterizing main flows. Neurocomputing 2016, 177, 543–563. [Google Scholar] [CrossRef]
- Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; Wang, Y. Cnn-based density estimation and crowd counting: A survey. arXiv 2020, arXiv:2003.12783. [Google Scholar]
- Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 833–841. [Google Scholar]
- Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
- Babu Sam, D.; Surya, S.; Venkatesh Babu, R. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5744–5752. [Google Scholar]
- Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
- Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Babu, R.V. Locate, size and count: Accurately resolving people in dense crowds via detection. arXiv 2019, arXiv:1906.07538. [Google Scholar]
- Basalamah, S.; Khan, S.D.; Ullah, H. Scale driven convolutional neural network model for people counting and localization in crowd scenes. IEEE Access 2019, 7, 71576–71584. [Google Scholar] [CrossRef]
- Wang, Y.; Lian, H.; Chen, P.; Lu, Z. Counting people with support vector regression. In Proceedings of the 2014 10th International Conference on Natural Computation (ICNC), Xiamen, China, 19–21 August 2014; pp. 139–143. [Google Scholar]
- Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
- Pham, V.Q.; Kozakaya, T.; Yamaguchi, O.; Okada, R. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3253–3261. [Google Scholar]
- Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2547–2554. [Google Scholar]
- Wan, J.; Chan, A. Adaptive density map generation for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1130–1139. [Google Scholar]
- Dong, L.; Zhang, H.; Ji, Y.; Ding, Y. Crowd counting by using multi-level density-based spatial information: A Multi-scale CNN framework. Inf. Sci. 2020, 528, 79–91. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100. [Google Scholar]
- Xu, Y.; Zhong, Z.; Lian, D.; Li, J.; Li, Z.; Xu, X.; Gao, S. Crowd counting with partial annotations in an image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15570–15579. [Google Scholar]
- Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1861–1870. [Google Scholar]
- Zhai, W.; Gao, M.; Souri, A.; Li, Q.; Guo, X.; Shang, J.; Zou, G. An attentive hierarchy ConvNet for crowd counting in smart city. Clust. Comput. 2023, 26, 1099–1111. [Google Scholar] [CrossRef]
- Zhang, J.; Ye, L.; Wu, J.; Sun, D.; Wu, C. A Fusion-Based Dense Crowd Counting Method for Multi-Imaging Systems. Int. J. Intell. Syst. 2023, 2023, 6677622. [Google Scholar] [CrossRef]
- Zhai, W.; Gao, M.; Li, Q.; Jeon, G.; Anisetti, M. FPANet: Feature pyramid attention network for crowd counting. Appl. Intell. 2023, 53, 19199–19216. [Google Scholar] [CrossRef]
- Guo, X.; Song, K.; Gao, M.; Zhai, W.; Li, Q.; Jeon, G. Crowd counting in smart city via lightweight ghost attention pyramid network. Future Gener. Comput. Syst. 2023, 147, 328–338. [Google Scholar] [CrossRef]
- Gao, M.; Souri, A.; Zaker, M.; Zhai, W.; Guo, X.; Li, Q. A comprehensive analysis for crowd counting methodologies and algorithms in Internet of Things. Clust. Comput. 2024, 27, 859–873. [Google Scholar] [CrossRef]
- Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154. [Google Scholar] [CrossRef]
- Ren, X. Finding people in archive films through tracking. In Proceedings of the Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
- Yan, J.; Lei, Z.; Wen, L.; Li, S.Z. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2497–2504. [Google Scholar]
- Li, H.; Lin, Z.; Shen, X.; Brandt, J.; Hua, G. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5325–5334. [Google Scholar]
- Yang, S.; Luo, P.; Loy, C.C.; Tang, X. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3676–3684. [Google Scholar]
- Zhang, K.; Zhang, Z.; Wang, H.; Li, Z.; Qiao, Y.; Liu, W. Detecting faces using inside cascaded contextual cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3171–3179. [Google Scholar]
- Zhu, C.; Zheng, Y.; Luu, K.; Savvides, M. Cms-rcnn: Contextual multi-scale region-based cnn for unconstrained face detection. In Deep Learning for Biometrics; Springer: Berlin/Heidelberg, Germany, 2017; pp. 57–79. [Google Scholar]
- Hu, P.; Ramanan, D. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 951–959. [Google Scholar]
- Khan, S.D.; Basalamah, S. Scale and density invariant head detection deep model for crowd counting in pedestrian crowds. Vis. Comput. 2021, 37, 2127–2137. [Google Scholar] [CrossRef]
- Shami, M.B.; Maqbool, S.; Sajid, H.; Ayaz, Y.; Cheung, S.C.S. People counting in dense crowd images using sparse head detections. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2627–2636. [Google Scholar] [CrossRef]
- Lian, D.; Chen, X.; Li, J.; Luo, W.; Gao, S. Locating and counting heads in crowds with a depth prior. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9056–9072. [Google Scholar] [CrossRef]
- Zhou, T.; Yang, J.; Loza, A.; Bhaskar, H.; Al-Mualla, M. Crowd modeling framework using fast head detection and shape-aware matching. J. Electron. Imaging 2015, 24, 023019. [Google Scholar] [CrossRef]
- Saqib, M.; Khan, S.D.; Sharma, N.; Blumenstein, M. Crowd counting in low-resolution crowded scenes using region-based deep convolutional neural networks. IEEE Access 2019, 7, 35317–35329. [Google Scholar] [CrossRef]
- Arandjelovic, O. Crowd detection from still images. In Proceedings of the British Machine Vision Conference, Leeds, UK, 1–4 September 2008. [Google Scholar]
- Sirmacek, B.; Reinartz, P. Automatic crowd analysis from airborne images. In Proceedings of the 5th International Conference on Recent Advances in Space Technologies-RAST2011, Istanbul, Turkey, 9–11 June 2011; pp. 116–120. [Google Scholar]
- Saqib, M.; Khan, S.D.; Blumenstein, M. Texture-based feature mining for crowd density estimation: A study. In Proceedings of the 2016 International Conference on Image and Vision Computing New Zealand (IVCNZ), Palmerston North, New Zealand, 21–22 November 2016; pp. 1–6. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Wang, Y.; Hou, J.; Hou, X.; Chau, L.P. A self-training approach for point-supervised object detection and counting in crowds. IEEE Trans. Image Process. 2021, 30, 2876–2887. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, W.; Liu, Y.; Zhu, J. Two-branch fusion network with attention map for crowd counting. Neurocomputing 2020, 411, 1–8. [Google Scholar] [CrossRef]
- Yang, Y.; Li, G.; Du, D.; Huang, Q.; Sebe, N. Embedding perspective analysis into multi-column convolutional neural network for crowd counting. IEEE Trans. Image Process. 2020, 30, 1395–1407. [Google Scholar] [CrossRef]
- Dai, F.; Liu, H.; Ma, Y.; Zhang, X.; Zhao, Q. Dense scale network for crowd counting. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021; pp. 64–72. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 18 March 2024).
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Cheng, Z.Q.; Dai, Q.; Li, H.; Song, J.; Wu, X.; Hauptmann, A.G. Rethinking spatial invariance of convolutional networks for object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–27 June 2022; pp. 19638–19648. [Google Scholar]
- Huang, L.; Zhu, L.; Shen, S.; Zhang, Q.; Zhang, J. SRNet: Scale-aware representation learning network for dense crowd counting. IEEE Access 2021, 9, 136032–136044. [Google Scholar] [CrossRef]
- Zeng, X.; Wu, Y.; Hu, S.; Wang, R.; Ye, Y. DSPNet: Deep scale purifier network for dense crowd counting. Expert Syst. Appl. 2020, 141, 112977. [Google Scholar] [CrossRef]
- Wang, S.; Lu, Y.; Zhou, T.; Di, H.; Lu, L.; Zhang, L. SCLNet: Spatial context learning network for congested crowd counting. Neurocomputing 2020, 404, 227–239. [Google Scholar] [CrossRef]
- Sindagi, V.A.; Patel, V.M. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
- Gao, J.; Wang, Q.; Li, X. Pcc net: Perspective crowd counting via spatial convolutional network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3486–3498. [Google Scholar] [CrossRef]
- Hafeezallah, A.; Al-Dhamari, A.; Abu-Bakar, S.A.R. U-ASD net: Supervised crowd counting based on semantic segmentation and adaptive scenario discovery. IEEE Access 2021, 9, 127444–127459. [Google Scholar] [CrossRef]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).