Advances in Convolution Neural Networks Based Crowd Counting and Density Estimation

: Automatically estimating the number of people in unconstrained scenes is a crucial yet challenging task in different real-world applications, including video surveillance, public safety, urban planning, and trafﬁc monitoring. In addition, methods developed to estimate the number of people can be adapted and applied to related tasks in various ﬁelds, such as plant counting, vehicle counting, and cell microscopy. Many challenges and problems face crowd counting, including cluttered scenes, extreme occlusions, scale variation, and changes in camera perspective. Therefore, in the past few years, tremendous research efforts have been devoted to crowd counting, and numerous excellent techniques have been proposed. The signiﬁcant progress in crowd counting methods in recent years is mostly attributed to advances in deep convolution neural networks (CNNs) as well as to public crowd counting datasets. In this work, we review the papers that have been published in the last decade and provide a comprehensive survey of the recent CNNs based crowd counting techniques. We brieﬂy review detection-based, regression-based, and traditional density estimation based approaches. Then, we delve into detail regarding the deep learning based density estimation approaches and recently published datasets. In addition, we discuss the potential applications of crowd counting and in particular its applications using unmanned aerial vehicle (UAV) images.


Introduction
In recent years, great efforts have been devoted to counting people in crowd unconstrained scenes due to its importance in applications, such as video surveillance [1], traffic monitoring [2], etc. The increasing growth of the world population and the development of urbanization has resulted in frequent crowd gatherings in numerous activities, such as stadium events, political events, and festivals (See Figure 1). In this context, crowd counting and density estimation are crucial for a better control & management and to ensure the security and the safety of the public.
Crowd counting remains a challenging task due to different difficulties related to the unconstrained scenes, such as extreme occlusions, variation in light conditions, changes in scale and camera perspective, and non-uniform density of people (See Figure 2). The aforementioned issues motivated many research communities to consider crowd counting as their main research direction, and attempted to develop more sophisticated techniques to deal with limitations in crowd counting.
In particular, with the recent progress in deep learning, convolution neural networks have been widely used to address crowd counting and made significant progress owing to their capacity of effectively modeling the scale changes of people/heads and the variation in regions' crowd density. The developed people counting techniques can be extended and applied to related tasks. Therefore, a section in this paper is dedicated to reviewing the most important techniques that can be extended to develop potential applications using unmanned aerial vehicle (UAV) images.  This work reviews papers that were published in the last decade and is organized as follows: Section 2 is dedicated to reviewing the traditional crowd counting and density estimation methods. In Section 3, we review the previous related surveys. In Section 4, we review, in detail, the CNN-based density estimation methods. Section 5, we discuss the most important public datasets along with the results of the state-of-the-art methods. We also present crowd counting applications that are based on Unmanned Aerial Vehicles (UAV) images in Section 6. Finally, we conclude our survey in Section 7.

Related Work and Motivation
Different approaches have been proposed to tackle the problem of crowd counting in images and videos. These approaches can be mainly divided into four categories: detection-based, regression-based, traditional density estimation, and recent CNN-based density estimation.
The scope of this survey is to focus on modern CNN-based density estimation and crowd counting approaches and review the most important techniques that can be extended to develop real-world applications using UAV images. However, first, we briefly review the detection and regression approaches using hand-crafted features.

Detection-Based Approaches
Early work on crowd counting adopted a detection framework [3][4][5][6]. Given a crowded situation, these approaches used a sliding window to detect the most visible parts of the body, which are mainly the head and the shoulders. Recently, various CNN-based object detectors have been proposed, which lead to a higher object detection performance as compared to systems based on simpler hand-crafted features [7]. In this context, we note the two-stage detectors, such as RCNN [7], Faster-RCNN [8] and Mask-RCNN [9], and the one-stage detectors, such as YOLO [10] and SDD [11]. Despite their high accuracy detection recorded in a sparse scene, these approaches do not perform well in the presence of the visual occlusions and ambiguities in crowded scenes.

Regression-Based Approaches
To overcome the limitations of the detection-based approaches, researchers attempt to formulate the crowd counting as a regression problem where they learn directly how to map the appearance of the image patches to their corresponding object density maps [12][13][14]. These approaches operate mainly on two steps: feature extraction and regression modeling. A variety of local features , such as SIFT [15], HOG [16], LBP [17,18], and global features, such as texture [19] and gradient [18] have been used to encode the object information. Learning a mapping from low-level features to the crowd count has been carried out using Gaussian process regression [20], linear regression [21], and ridge regression [14]

Traditional Density Estimation Based Approaches
While earlier approaches were successfully dealing with occlusion and scene cluttering, most of them regressed from global features directly to the number of objects and discard any available spatial information. In contrast, Lemptisky et al. [22] first involved spatial information in the learning process by adopting a linear mapping between local patch features and corresponding density maps. Thereby, they avoided the complex task of learning to detect and localize individual object instances and introduced a new approach to estimate an image density whose integral over any image region gives the count of objects within that region.
The learning process to estimate such density is formulated as a minimization of a regularized risk quadratic cost function, where a new appropriate loss function is introduced. Thus, the entire learning process is posed as a convex quadratic program solvable with cutting-plane optimization. To alleviate the difficulty of linear mapping, Pham et al. [23], proposed a non-linear mapping between local patch features and density maps through a random forest regressor. They obtained satisfactory results by introducing a crowdedness prior to tackle the large variation in appearance and shape between crowded image patches and non-crowded ones.
In addition, they proposed an effective forest reduction method to speed up estimation and met the real-time requirement. This method requires relatively less memory to build and store the forest. These methods incorporate the spatial information in the learning process, which improves the counting accuracy compared to the regression and detectionbased approaches. However, they only used traditional hand-crafted features to extract low-level information from local patches, which can lead to estimating a low-quality density map.

CNN-Based Density Estimation
CNN-based approaches have demonstrated a good performance in numerous computer vision problems, thus, motivating more researchers to use their ability to estimate a non-linear function mapping from crowd images to their corresponding density maps. In this context, numerous techniques have been proposed, which can be categorized into five groups according to the network architecture and the inference algorithm process, as depicted in Figure 3.

Related Previous Surveys
Researchers have attempted to review the techniques of density estimation and crowd counting. Notably, Junior et al. [24] were among the first to provide a comprehensive study of the existing techniques for crowd counting. Li et al. [25] reviewed various methods for the crowded scene analysis, which covered different tasks, such as crowd motion, pattern learning, and anomaly detection in crowds. In [26], Zitouni et al. reviewed the existing visual crowd analysis techniques based on different key statistical evidence, which was inferred from the literature, and provided recommendations toward the general aspects of techniques instead of focusing on a specific algorithm.
In [27], Loy et al. provided a study that evaluates and compares the state-of-the-art techniques of visual crowd counting using the same protocol. Saleh et al. [28] presented a survey on crowd counting methods used in video surveillance and categorized the existing algorithms into two main approaches: direct and indirect. While these surveys provide detailed and comprehensive studies of the existing density estimation and crowd counting techniques, they all reviewed the traditional methods based on the hand-crafted features. Recently, Sindagi et al. [29] provided a survey of advances in CNN-based density estimation and crowd counting from a single image, published up to the year 2017. Guangshuai et al. [30] put forward a survey on CNN-based density estimation and crowd counting where over 220 papers have been reviewed up to the year 2020.
Though the last survey [30] covers the most recent CNN-based crowd counting approaches, as with the previous surveys, it focuses only on the statistical evaluation and comparison between different approaches without analyzing the importance of extending the approaches used for counting people in crowds in order to develop real-world applications in various areas. In this paper, we survey various papers that adapt crowd counting approaches to counting different objects from UAV images.

Taxonomy for CNN-Based Density Estimation
In this section, we review different CNN-based density estimation and crowd counting methods in view of the network architectures and the training and inference paradigm of the methods. Table 1 summarizes a categorization of different CNN-based crowd counting and density estimation methods according to the network architecture and the inference process.

Typical CNN Architecture for Density Estimation and Crowd Counting
Based on the type of the network architecture, we classify the approaches into three major categories (see Figure 3): These architectures are among the first deep learning approaches applied to density estimation and crowd counting. They are basically composed of convolution layers, pooling layers, and fully connected layers.
Fu et al. [31] and Wang et al. [32] were among the first researchers that attempted to use a convolution neural network in the context of crowd density estimation. Wang et al. [32] proposed the first end-to-end deep convolution neural network regression model for counting people in images of extremely dense crowds. The proposed architecture is composed of five Conv-layers and two fully connected layers, where its output is the estimated people counts in the input image. In addition, to reduce the false positive errors, which are mainly caused by the existence of trees and buildings in the background, training data are augmented by adding negative samples whose ground truth count is set as zero.
In a different approach, Fu et al. [31] used the multi-stage ConvNet model proposed in [33] to ensure better shift, scale and distortion invariance. To reduce the computation time at both training and detection stages, they optimized the model by discarding all similar features maps. In addition, two optimized multi-stage ConvNets are cascaded as a strong classifier to achieve boosting in which the first classifier is trained to pick out the hard samples, whereas the second one is trained to give them a final determination. Yao et al. [34] fine-tuned several architectures of deep residual network (ResNet [35]) to develop a cell counting framework.
Elad et al. [36] used a basic CNN and incorporated layered boosting and selective sampling to enhance the accuracy and the training computation. The training process is done in stages, where CNNs are iteratively added so that each new CNN is trained on the difference between the estimation of its predecessor and the ground truth. The selective sampling approach is used to speed up the training process by reducing the effect of low-quality samples, such as trivial samples and outlier samples.
As stated by the authors, trivial samples are those that are correctly classified early on. Feeding again these samples to the network tends to introduce a bias toward them, thereby, affecting its generalization performance. On the other hand, the presence of outliers, such as mislabeled samples, can affect the generalization of the model and, in particular, increase the computation of the training time (i.e., due to the boosting technique).
These basic CNNs approaches are simple and easy to implement. However, their performance is often limited due to the quality of the extracted features, which are usually not invariant to perspective effect or image resolution. These network architectures incorporate multiple columns to extract multi-scale features that allow generating high-quality crowd density maps.
Zhang et al. [37] were among the first to introduce the idea of using multiple-column architecture for crowd counting. They proposed Multi-column Convolutional Neural Network (MCNN) architecture to map the image to its crowd density map. As depicted in Figure 4, MCNN incorporates different columns where each one adopts filters with receptive fields of different sizes, so that the extracted features are adaptive to scene variations (i.e., people/head size). In addition, they proposed a new large dataset with around 330,000 head annotations and showed that MCNN is easily transferred for crossscene crowd counting.
In [38], the authors introduced CrowdNet, which combines deep and shallow networks at two different columns, so that the shallow network is used to extract low-level features, whereas the deep network is used to extract high-level features. Both extracted features are crucial for detecting people under large-scale variations and severe occlusion. Hydra-CNN is introduced in [12]. It uses a pyramid of input patches so that each level has a different scale. By doing that, Hydra-CNN extracts multi-scale features, which are combined to generate the crowd density map.
Deepak et al. [39] developed a switching-CNN for crowd counting by training various CNN crowd density regressors on patches from a crowd scene. The regressors were designed to incorporate different receptive fields. In addition, a switch classifier was trained to select the best regressor that estimates the density map corresponding to the crowd scene patch.
In another work, Deepak et al. [40] proposed a top-down feedback architecture that carries high-level features to correct false predictions. This architecture has a bottom-up CNN, which is directly connected to a top-down CNN so that the top-down generated feedback to the bottom-up to help deliver a better crowd density map. In [41], Liu et al. proposed a CNN framework to address the issues of scale variation and rotation variation using the Spatial Transform Network [58]. Zhang et al. [42] exploited the attention mechanism to address the limitations of the pixel-wise regression technique, which is very popular for estimating the crowd density map. This technique assumes the interdependence of pixels, which leads to noisy and inconsistent predictions. Thus, the proposed Relational Attention Network (RANet) incorporates local self-attention (LSA) and global self-attention (GSA) to capture the interdependence of pixels. In addition, a relational model is used to combine LSA and GSA to obtain a more informative aggregated feature representation.
In the same context, Hossain et al. [43] proposed a CNN model based on the attention mechanism to extract global and local scale features appropriate for the image. By combining both local and global features, the model outputs an improved crowd density map. Guo et al. [44] introduced a deep model called Dilated-Attention-Deformable ConvNet (DADNet), which incorporates two modules: multi-scale dilated attention and deformable convolutional DME (Density Map Estimation). The multi-scale dilated attention is based on using various kernel dilation levels to extract different visual features of crowd regions of interest, whereas the deformable convolution is used to generate a high-quality density map.
Observing that most of the existing techniques are susceptible to overestimate or underestimate people counts of regions with different patterns, Jiang et al. [45], introduced a new CNN model based on the attention mechanism, which consists of two components Density Attention Network (DANet) and Attention Scaling Network (ASNet). DANet is used to extract attention masks from regions of different density levels. ASNet, on the other hand, outputs density maps and scaling factors and multiplies them by the attention masks yielding separate attention-based density maps. These maps are then summed to form that final density map.
Liu et al. introduced DecideNet [59], which consists of detection and regression based density maps. These two count modes are used to deal with the crowd density variation in the image regions. An attention module is used to adaptively assess the reliability of the two count modes. Liu et al. [60] proposed a self-supervised method to improve the training of models for crowd counting. This was based on the fact that crops sampled from a crowd image contain the same or fewer persons than the original image.
Thus, crops can be ranked and used to train a model to estimate whether one image contains more persons than another image. Fine-tuning the resulting model on a small labeled dataset achieved state-of-the-art results. In [61], PACNN a perspective-aware CNN was introduced. It is specifically designed to predict multi-scale perspective maps and to deal with the perspective distortion problem.
Dingkang Liang et al. [62] introduced a weakly-supervised crowd counting in images by introducing TransCrowd, which is a sequence-to-count framework based on a Transformer-encoder. In the same context, Sun et al. [63] used transformers to encode features with global receptive fields and proposed two modules: a token-attention module and regression-token module. In [64], Gao et al. proposed a crowd localization network called the Dilated Convolutional Swin Transformer (DCST). It provides the location information of each instance in addition to counting the numbers for a scene.
Despite the significant improvement and the great performance recorded by the multi-column architecture, they are still suffering from various limitations as detailed in [46]. The multi-columns CNNs are difficult to train since they require huge computation times and memory. Such architecture can generate redundant features since the columns implement almost the same network architecture. Moreover, multi-column architecture often requires a density level classifier before feeding the image to the network. However, since the number of objects is varying widely in a congested scene, it is very difficult to estimate the granularity of the crowd density maps. In addition, using crowd density level classifiers leads to the implementation of more columns, which increases the complexity of the architecture and yields more feature redundancy.
These limitations motivated some researchers to adopt single-column CNNs, which are a much simpler yet efficient architecture to overcome these disadvantages and deal with different challenging scenarios in crowd counting.

Single-Column Architecture
The single-column network architectures are based on deeper end-to-end CNNs. This implements more specific features to deal with critical problems in crowd counting and generate high-quality crowd density maps.
In [46], Li et al. proposed CRSNet, a CNN model for crowd counting in highly congested scenes. This model consists mainly of a frond-end CNN used to extract 2D features and a back-end dilated CNN, which uses dilated kernels to provide larger receptive fields and to replace pooling operations. The dilated convolution layers expand the receptive field without losing resolution and, thereby, aggregate the multi-scale contextual information. CSRNet is an easy-trained end-to-end approach that generates high-quality density maps and achieves state-of-the-art performance on four datasets.
Zhang et al. [47] proposed a scale-adaptive convolution neural network for crowd counting. The proposed architecture is composed of a backbone, which is a fully convolution neural network (FCN) with fixed small receptive fields. It extracts feature maps with different dimensions from multiple layers. The extracted feature maps are resized to have the same output dimension and combined to calculate the final crowd density map. Two loss functions have been introduced in the training process, density map loss and count loss, to jointly optimize the model. The loss count is used to reduce the variance of the prediction error and enhance the generalization performance of the model on a very sparse scene. This architecture is illustrated in Figure 5.
Wang et al. [48] proposed SCNet, which is a compact single-column architecture for crowd counting. It consists of three modules: residual fusion modules (RFM) to extract multi-scale features, a pyramid pooling module (PPM) to combine features at different stages, and a sub-pixel convolutional module (SPCM) followed by an upsampling layer to recover the resolution. These three modules allow SCNet to generate multi-scale features, which leads to generating a high-quality density map. In [65], Shi et al. proposed a learning strategy called deep negative correlation learning (NCL), based on learning a pool of decorrelated regressors.
In a different approach, Cao et al. [49] proposed an encoder-decoder based on the inception network [66] called SANet for crowd counting. The encoder is used to extract multi-scale features, whereas the decoder used transposed convolution layers to upscale the extracted features and generate the final crowd density map. In addition, unlike most of the existing approaches, which use only Euclidean loss that ignores the correlation between pixels of the density, they introduced a new training loss that combines the Euclidean loss and local pattern consistency loss.
In [50], Varun et al. extended the U-Net [67] by adding a decoding reinforcement branch to accelerate the training of the network and using Structural Similarity Index to maintain the local correlation of the density map in order to generate a good crowd density map. Xiaolong et al. [51] proposed a new trellis encoder-decoder architecture, which consists of multiple decoding paths and a multi-scale encoder. The multiple decoding paths are used to hierarchically aggregate features at different decoding stages, whereas the multi-scale encoder is incorporated to preserve the localization precision in the encoded feature maps.
In recent years, models based on attention mechanisms have demonstrated significant performance in different computer vision tasks [52]. Instead of extracting features from the entire image, the attention mechanism allows models to focus only on the most relevant regions. In this context, Mnih et al. [52] used the attention mechanism to introduce a scale-aware attention network to address the scale variation in crowd counting images. Due to the attention mechanism, the model can automatically focus on the most important global and local features and combine them to generate a high-quality crowd density map.
Liu et al. also proposed the ADCrowdNet [53], which is an attention-based network for crowd counting. ADCrowdNet consists of two concatenated networks: an Attention Map Generator (AMG), which first estimates crowd regions in images as well as their congestion degree, and a Density Map Estimator (DME), which is a multi-scale deformable network that uses the output of AMG to generate a crowd density map. Due to the simplicity of their architectures and their effective training process, singlecolumn network approaches have received more attention in recent years.

Typical Inference Paradigm
Based on the inference methodology, we can categorize the crowd density estimation techniques into the following two categories:

Patch-Based Inference
The patch-based model is trained on random crops from the original image. During the inference, a sliding window is applied to the test image, and the prediction is obtained for each crop. The total count is obtained by summing the counts over all the crops.
In [54], the model is trained on random patches extracted from the input images so that every crop cover 3 by 3 m square in the actual scene. The patches are then resized to 72 × 72 pixels (see Figure 6) and fed as input to the CNN model to obtain the corresponding crowd density map. The number of objects in every patch is obtained by integrating over the crowd density map. In [55], a new CNN model called PaDNet was proposed.
It consists of three modules as follows: (1) the Density-Aware Network (DAN) incorporates multiple CNN sub-networks, which are pre-trained on images with different crowd density levels, and used to capture the crowd density level information; (2) the Feature Enhancement Layer (FEL) generates weighted local and global contextual features; and (3) the Feature Fusion Network (FFN) is used to combine these contextual features. The network is trained on patches so that nine crops are taken from every input image.
In [56], Sajid et al. proposed a plug-and-play-based patch rescaling module (PRM) to address the problem of crowd diversity in the scene. As shown in Figure 7, the PRM module takes a patch image as input and then rescales it using the appropriate scaler (Up-scaler or Down-scaler) according to its crowd density level, which is computed by the classifier before using PRM. In this approach, the low-crowd and high-crowd regions pass directly through the Up-scaler or Down-scaler, the Medium-crowd bypasses the PRM without rescaling, whereas the no-crowd regions are automatically discarded.

Image-Based Inference
By taking random patches from the input image, patch-based methods ignore the global information and also require a huge computation during the inference due to the use of a sliding window. Training with the whole image help exploit the global information. However, it is still dependant on the resolution of the image, which is often very large in the context of crowd counting.
In [57], Chong et al. proposed an end-to-end CNN model that takes the whole image as input and directly produces the final count. First, the image is fed into a pre-trained CNN to obtain high-level features, which are then mapped to local counting numbers using a recurrent neural network with memory cells. In addition, sharing computation over overlapping regions leads to a reduction in model complexity and allows the model to incorporate contextual information when predicting both local and global counts.

Datasets and Results
With the increasing development of crowd counting approaches, numerous datasets have been proposed over the last decade to drive research on crowd counting and develop models to deal with various limitations including changes in perspective and scale, variation in light conditions, crowd density, cluttering, and severe occlusion. Table 2 summarizes the most popular datasets, which can be categorized into three different groups according to the view type: free view, crowd surveillance view, and drone view.
Some images of these categories are depicted in Figure 8. • UCSD [69]: It was among the first datasets to be collected to count people. It was acquired with a stationary camera mounted at an elevated position, overlooking pedestrian walkways. The dataset contains 2000 frames of size 158 × 512 along with annotations of pedestrians in every 1/5 frames, while the other frames are annotated using linear interpolation. It provides also the bounding box coordinates for every pedestrian. The dataset has 49,885 person instances, which are split into training and test subsets. UCSD has a low-density crowd with an average of 25 pedestrian instances, and the perspective across images does not change greatly since all images are captured from the same location. • Mall [70]: This dataset is collected from a publicly accessible webcam in a shopping mall. The video sequence of the dataset contains over 2000 frames of size 640 × 480 in which 62,325 heads were annotated with an average of 25 heads per image. By comparing to UCSD, the Mall dataset was created with higher crowd densities as well as more significant changes in illumination conditions and different activity patterns (static vs. moving people). The scene has severe perspective distortion along the video sequence, which results in large variations in scale and appearance of objects.
In addition, there exist severe occlusions caused by different objects in the mall. • UCF_CC_50 [13]: It is the first challenging dataset, which was created by directly scraping publicly web images. The dataset presents a wide range of crowd densities along with large varying perspective distortion. It contains only 50 images whose size is 2101 × 2888 pixels. These images contain a total of 241,677 head instances with an average of 1279 heads in each image. Due to its small size, the performance of recent CNN-based models is far from optimal. • WorldExpo'10 [54]: Zhang et al. [54] remarked that most existing crowd counting methods are scene-specific and their performance drops significantly when they are applied to unseen scenes with different layouts. To deal with this, they introduced the WorldExpo'10 dataset to perform a data-driven cross-scene crowd counting. They collected the data from Shanghai 2010 World-Expo, which contains 1132 video sequences captured by 108 cameras with an image resolution of 576 × 720 pixels. The dataset contains 3980 frames that contain a total of 200,000 annotated heads for an average of 50 heads by frame. • AHU-Crowd [71]: It is composed of diverse video sequences representing dense crowds in different public places including stations, stadiums, rallies, marathons, and pilgrimage. The sequences have different perspective views, resolutions, and crowd densities and cover a large multitude of motion behaviors for both obvious and subtle instabilities. The dataset contains 107 frames whose size is 720 × 576 pixels, and 45,000 annotated heads. SmartCity [47]: It consists of 50 images, collected from 10 different cities for outdoor scenes of different places, such as shopping malls, office entrances, sidewalks, and atriums. • Crowd Surveillance [74]: It is composed of 13,945 high-resolution images. It is split into 10,880 images for training and 3065 images for testing for a total of 386,513 head count.
• DroneCrowd [75]: It was captured using a drone-mounted camera and recorded at 25 frames per second with a resolution of 1920 × 1080 pixels. It contains 112 video clips with 33,600 frames. The annotation was performed by over 20 experts for more than two months so that more than 4.8 million heads are annotated on 20,800 people trajectories. • DLR-ACD [76]: It is a collection of 33 aerial images for crowd counting and density estimation. It was captured through 16 different flights and over various urban scenes including sports events, city centers, and festivals. • Fudan-ShanghaiTech [77]: It is a large-scale video crowd counting dataset, and it is the largest dataset for crowd counting and density estimation. It is composed of 100 videos captured from 13 different scenes. It contains 150,000 images for a total of 394,081 annotated head count. • Venice [78]: It is a small dataset acquired in Piazza San Marco in Venice (Italy). It contains four different sequences for a total of 167 annotated images with a resolution of 1280 × 720 pixels. • CityStreet [79]: It was collected from a busy city street using a multiview camera system, which is composed of five synchronized cameras. The dataset contains a total of 500 multi-view images in total. • DISCO [80]: It was collected to jointly utilize ambient sounds and visual contexts for crowd counting. The dataset contains a total of 248 video clips, where each clip was recorded at 25 frames per second with a resolution of 1920 × 1080. • DroneVehicle [81]: It consists of 15,532 pairs of RGB and infrared images for a total of 441,642 annotated objects. The images were acquired by a drone-mounted camera over various urban areas, including different types of urban roads, residential areas, and parking lots from day to night. • NWPU-Crowd [82]: It contains 5109 images for a total of 2,133,375 annotated heads with point and box labels. Compared to existing datasets, it has negative samples and a large appearance variation. • JHU-CROWD++ [83]: It is composed of 4372 images and 1.51 million annotations and acquired under various scenarios and environmental conditions. Labeling is provided in different formats, including dots, approximate bounding boxes, and blur levels.

Results and Discussions
We report results of recent traditional approaches along with the CNN-based methods on the most popular datasets. The count estimation performance is reported directly from the original public work. We compare different methods based on the following metrics: • N: is the number of test samples. • y i is the ground truth result corresponding to sample i. • y i is the estimated result corresponding to sample i.
The comparison results are summarized in Table 3. In general, CNN-based methods highly outperformed the traditional approaches. CNN-based methods showed effective results in a very cluttered scene with large density crowds and under different scene conditions (lighting, scaling, etc.). While the multiple-column techniques achieved stateof-the-art results on three datasets: UCF_CC_50, ShanghaiTech Part A, and Mall, some single-column techniques also achieved a high performance, such as [53], which obtained state-of-the-art results on ShanghaiTech Part B.
In addition, CSRNet [46] and SaNet [49] showed comparable results on almost all datasets. The single-column technique in [53], which involves the attention mechanism, achieved state-of-the-art results on the WorlExpo'10 dataset. Finally, single-column techniques, like [46,48,49], not only presented great performances but are also easy to implement and were applied in a real-time scenario.

Potential Application of Crowd Counting
Crowd counting techniques have been applied to count and estimate the number of persons in crowded and cluttered scenes to develop real-world applications, mostly related to video surveillance and public safety. In addition, these techniques have been adapted and applied to different related problems, such as traffic control, plant/fruits counting. Different applications use different image sources, including a fixed camera, multi-cameras, a moving camera, unmanned aerial vehicles (UAVs), etc.
In this section, we review the crowd counting applications developed using unmanned aerial vehicles (UAV). In [86,87], the authors introduced an automated vehicle detection and counting system in high-resolution aerial images. The proposed method used a convolution neural network to generate a vehicle spatial density map from the aerial image.
Jingyu et al. [88] introduced an efficient convolution neural network called Flounder-Net, which used aerial images captured by a drone (See Figure 9), to count crowd people for a security purpose. Flounder-Net architecture (See Figure 10) involves an interleaved group convolution to eliminate the redundancy of the network, and the rapid shrink of feature maps in order to tackle the high-resolution problem. The model is implemented and integrated into the embedded system of the drone. In the same context, Castellano et al. [89] introduced a light-weight and fast fully-convolutional neural network to regress a crowd density map on aerial images captured by a drone.
Recently, crowd counting techniques have been applied in the agriculture domain. In [90], Jintao et al. developed and implemented an automatic counting of in situ rice seedlings in the field using a basic fully convolution neural network to regress a crowd density map. The system takes, as input, aerial images captured by an UAV equipped with RGB cameras. Sungchan et al. [91] implemented an automatic cotton plant counting by adapting the Yolo3 [92] deep learning algorithm. In the same context, Kitano et al. [93] used a fully convolution neural network to develop an application that captured images using a UAV and returned the number of corn plants.
The technology of crowd counting is being increasingly adapted to agriculture, which facilitates the decision-making of the farmer and the management process of work-labor and products.

Conclusions
In recent years, the need for crowd counting in many areas has greatly boosted research in crowd counting and density estimation. With the development of deep learning, the performance of crowd counting models has been remarkably improved, and the realworld applications scenarios have been expanded. This paper presented a survey on the recent advances in convolution neural network (CNN)-based crowd counting and density estimation. We explored the existing approaches from different perspectives, including the network architecture and the learning paradigm. We presented a description of the most popular datasets that are used to evaluate the crowd counting models. In addition, we conducted a performance evaluation for the most representative crowd counting algorithms. Finally, we reviewed the potential applications of crowd counting in the context of unmanned aerial vehicle (UAV) images.