Crowd Counting Guided by Attention Network

Crowd counting is not simply a matter of counting the number of people; it also requires obtaining the spatial distribution of people in a picture. It remains a challenging task under crowded scenes, occlusion, and scale variation. This paper proposes a global and local attention network (GLANet) for efficient crowd counting, which applies an attention mechanism to enhance the features. Firstly, the feature extractor module (FEM) uses a pretrained VGG-16 to produce a basic feature map. Secondly, the global and local attention module (GLAM) effectively captures local and global attention information to enhance the features. Thirdly, the feature fusing module (FFM) applies a series of convolutions to fuse the various features and generate density maps. Finally, we conduct experiments on mainstream datasets and compare the results with the performance of state-of-the-art methods.


Introduction
The purpose of crowd counting is to obtain the number of individuals and the crowd distribution in a specific scene. Crowd counting has wide applications in video surveillance, security monitoring, urban planning, behaviour analysis, and so on. Crowd counting methods are mainly based on detection, regression, and density maps. However, it is still a highly challenging task due to occlusion, low image quality and resolution, perspective distortion, and the scale variation of objects.
With the development of CNNs, various CNN-based crowd counting methods have been proposed in response to these challenges. Additionally, most current work uses the Visual Geometry Group (VGG) network [1] as a backbone. When convolution and pooling operations with the same kernel size are applied over the whole image, the receptive field is the same everywhere. In the past, this has been addressed by combining density maps extracted from image patches of different resolutions or by fusing feature maps obtained with convolution filters of various sizes. However, by fusing features at all scales, these methods [2][3][4][5] ignore the fact that the scale varies continuously across the image. Later, methods were proposed that use classifiers to predict the receptive-field size for each patch, which makes the network complicated to train and no longer end-to-end. Recent works in crowd counting apply attention mechanisms [6,7] to improve network performance and employ perspective maps to guide accurate density map estimation [8][9][10]. Additionally, most methods [2,4,11] are optimized by comparing the Euclidean distance between the model estimate and the target density map. However, this ignores the connection between pixels and blurs the estimated crowd distribution.
In recent years, attention models have shown great success in various computer vision tasks [12,13]. Instead of extracting features from the entire image, the attention mechanism allows models to focus on the most relevant features as needed. In this paper, we propose a lightweight attention network to alleviate the effects of various noises in the input, ignore irrelevant information, and enhance salient feature extraction for accurate density map estimation. Besides, we combine a mean structural similarity [14] loss and a Euclidean loss to exploit the local correlation in density maps. The main contributions of this paper are summarized as follows:

• The proposed GLANet generates high-quality density maps by enhancing different spatial semantic features using multi-column attention mechanisms.

• GLANet utilizes the mean structural similarity to capture the connections between different pixels and the local correlation in density maps.

Related Work
The previous literature on crowd counting can be divided into three categories according to the method used: detection-based, regression-based, and density-based. Each can solve or avoid phenomena such as uneven crowd distribution and overlapping targets to a different extent, as shown in Table 1.

Table 1. Comparison of the three methods. The √ and × respectively represent the ability and inability to solve or avoid the phenomenon.

Detection-Based
Detection-based counting methods deploy a detector that traverses the image to locate and count targets along the way [15][16][17], so people choose to apply more advanced detectors to crowd counting. For example, Ref. [15] proposed using head and shoulder detection for crowd counting, and [6] trained a Faster R-CNN [18] for crowd counting by manually annotating bounding boxes on part of SHB [2]. However, in crowded scenes the head sizes can be extremely small and bounding box annotation is very difficult, which restricts the exploration of detection-based approaches for crowd counting. Besides, tiny objects are difficult to handle effectively with previous object detection methods, yet such objects are very common in crowd counting.

Regression-Based
Regression-based methods [19][20][21][22][23] directly predict the count value of an image, since the mapping between image features and the crowd count is learnable. By bypassing explicit detection, regression-based approaches can remedy the occlusion problems that are difficult for detection-based methods. More specifically, Ref. [22] introduced a Bayesian model for discrete regression that is suitable for crowd counting. However, the regression function maps image features extracted globally from the entire image to the total people count, so the spatial information is lost. Although these regression-based methods can accurately estimate the number of people in a picture, they cannot reflect the crowd's distribution, and most ignore the spatial distribution in crowd images.

Density-Based
Compared with regression-based approaches, density-based approaches effectively count the targets in crowd scenes while maintaining the spatial distribution of the crowd. In an object density map, the integral over any sub-region is the number of objects within the corresponding region of the image. Therefore, most recent crowd counting methods are based on the density map. Early methods [2,4,8,10,11] advocate a multi-column convolutional architecture in which different columns learn features at different scales. For example, Ref. [2] proposed a three-column network that employs different filters in separate columns to obtain features at various scales, Ref. [11] adds a classifier to decide which column to use for each picture, and [4] introduced a contextual pyramid CNN that utilizes various estimators to capture both global and local contextual information, which is integrated with high-dimensional feature maps extracted from a multi-column CNN by a Fusion-CNN. Although these methods achieved some success, optimization conflicts between the different columns make training difficult, and the information redundancy between columns causes overfitting. Some single-column networks [3,9,[24][25][26] have also been proposed and attain good performance. For instance, Ref. [25] combines a VGG network with dilated convolution layers to generate the density map, and [3] employs the Inception module to extract features and a deconvolution operator to generate high-resolution density maps.

Architecture
The architecture of the proposed network is illustrated in Figure 1.

Feature Extractor Module
Following common practice [9,23,25,33], we chose VGG-16 [1], excluding the fully connected layers and the last two max-pooling layers, as the FEM, since it has strong transfer learning ability and a flexible architecture. The main difficulty in crowd counting is that the background and the scale of the objects vary in real-world scenes. Applying deep learning to such a situation requires a sufficiently large training dataset; however, there are few crowd counting datasets, and most of them are small. As a result, we choose the first ten convolutional layers of a pre-trained VGG-16 with ReLU as the FEM. Given an image I as input, the output I_f produced by the feature extractor can be represented by the following mapping:

I_f = F_FEM(I)

Global and Local Attention Module
The channel attention block was first introduced as a squeeze-and-excitation (SE) block in [29] to exploit the inter-channel relationship of features. It utilizes global average pooling to determine the spatial dependency and produces per-channel descriptors that emphasize the proper channels and recalibrate the feature maps. Additionally, CBAM [28] introduced a spatial attention module to focus on "where" the informative parts are; applying pooling operations along the channel axis is shown to be effective in highlighting informative regions.
In regular attention operations, spatial attention is separate from channel attention and applied separately. Our GLANet introduces an attention mechanism that combines a spatial attention mechanism and a channel attention mechanism [28]. As shown in Figure 2, given an intermediate feature map I_f as input, our attention module derives a 3D attention map M, and the enhanced feature is summarized as:

I_f′ = M ⊗ I_f

where ⊗ denotes element-wise multiplication. In contrast to channel attention, we use a global average pooling and upsampling operation to obtain each channel's background information instead of the inter-channel relationship, and obtain the foreground information we need to attend to by subtracting the background information from the feature map. Crowd counting datasets contain much unimportant background that interferes with detection, and global average pooling collects a great deal of this background information; if we used it directly to obtain spatial semantic information, it would capture much that we are not interested in. So, we subtract the globally average-pooled features from the input features to eliminate the background information introduced by global average pooling and to strengthen the original features.

As shown in Figure 1, we use a set of attention mechanisms, called global and local attention, arranged in the form of a pyramid to deal with the uneven distribution of people in the picture and, at the same time, to make the network more aware of crowded areas. The pyramid pooling network structure [39] uses global average pooling operators of different sizes to obtain global and local semantic information, which is then concatenated directly with the input features. In contrast, our attention module uses global average pooling operators of different sizes to obtain the background information of various regions. Additionally, we first use a 1 × 1 convolution to remove some irrelevant information. The attention module is shown in Figure 2.
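The background-subtraction idea can be sketched in plain Python. This is a minimal, illustrative version only: the actual module operates on convolutional feature tensors and includes learned 1 × 1 convolutions and multi-scale pooling, which are omitted here. The function name and the sigmoid gating are assumptions for the sketch, not the authors' exact formulation.

```python
import math

def background_subtracted_attention(feat):
    """Sketch of the global/local attention idea: per-channel global
    average pooling captures mostly background statistics; subtracting
    it from the feature map emphasizes foreground, and a sigmoid turns
    the result into an attention map that rescales the input features
    element-wise. `feat` is a nested list of shape [C][H][W]."""
    out = []
    for channel in feat:
        # Global average pooling: one scalar per channel ("background")
        n = sum(len(row) for row in channel)
        gap = sum(sum(row) for row in channel) / n
        # Foreground = input - broadcast background; sigmoid -> attention map
        attn = [[1.0 / (1.0 + math.exp(-(v - gap))) for v in row]
                for row in channel]
        # Element-wise re-weighting of the input features
        out.append([[v * a for v, a in zip(row, attn_row)]
                    for row, attn_row in zip(channel, attn)])
    return out
```

Pixels above the channel's mean (likely foreground) receive attention weights above 0.5, while background-level pixels are suppressed toward zero.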

Density Map Generation
We apply a density map as ground truth to optimize and validate the network. Following the procedure of density map generation in [3], one object at pixel x_i can be represented by a delta function δ(x − x_i). As a result, given an image with N instances annotated, the ground truth can be represented as:

H(x) = Σ_{i=1}^{N} δ(x − x_i)

In order to generate the density map F(x), the ground truth H(x) is convolved with a Gaussian kernel, which can be defined as follows:

F(x) = Σ_{i=1}^{N} δ(x − x_i) ∗ G_{σ_i}(x)

where σ_i is the standard deviation at the pixel x_i; empirically, we adopt a fixed kernel with σ = 15 for all the experiments. In a density map, the integral over any sub-region is the number of objects within the corresponding region of the image.
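The fixed-kernel procedure above can be sketched in pure Python. The function name and the per-head window normalization are illustrative choices, not the authors' exact implementation; normalizing each truncated kernel over its in-image window keeps the map's integral equal to the head count even when heads sit near the border.

```python
import math

def gaussian_density_map(points, height, width, sigma=15.0, radius=None):
    """Generate a crowd density map: each annotated head at (cx, cy)
    becomes a normalized 2D Gaussian, so the whole map integrates to
    the head count."""
    if radius is None:
        radius = int(3 * sigma)  # truncate the kernel at 3 sigma
    density = [[0.0] * width for _ in range(height)]
    for (cx, cy) in points:
        # Collect the truncated Gaussian weights that fall inside the image
        window, total = [], 0.0
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = cy + dy, cx + dx
                if 0 <= y < height and 0 <= x < width:
                    w = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
                    window.append((y, x, w))
                    total += w
        # Normalize over the visible window so each head contributes 1
        for (y, x, w) in window:
            density[y][x] += w / total
    return density
```

Summing the returned map recovers the annotated count, which is exactly the property the loss and evaluation rely on.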

Loss Function
The loss comprises two parts: the pixel-wise Euclidean distance and the structural similarity index (SSIM) [14] loss. Training the proposed framework minimizes a weighted combination loss function with respect to the parameters Θ. The final loss function is represented as:

L(Θ) = L_2(Θ) + α L_s(Θ)

where L_2 is the Euclidean distance, L_s is the SSIM loss, and α is the weight balancing the Euclidean distance and the SSIM loss. In our experiments, α is 100. We use Stochastic Gradient Descent (SGD) with a batch size of 1 for the various datasets and a fixed learning rate of 10−7 to minimize the loss, with the momentum parameter set to 0.9.

Euclidean Distance
The crowd counting methods based on density maps mostly treat counting as a regression task, which usually adopts the Euclidean distance as a loss function to measure the difference between the estimated density map and the ground truth. In symbols, it is defined as:

L_2(Θ) = (1 / 2N) Σ_{i=1}^{N} ‖F(I_i; Θ) − D_i‖_2²

where D_i is the ground-truth density map corresponding to the input image I_i, and F(I_i; Θ) is the output density map.
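This loss is straightforward to compute; a minimal pure-Python sketch (the function name is illustrative, and density maps are represented as nested lists rather than tensors):

```python
def euclidean_loss(pred_maps, gt_maps):
    """Pixel-wise Euclidean (L2) loss between estimated density maps
    and ground-truth density maps, averaged over the batch of N maps
    with the conventional 1/(2N) factor."""
    n = len(pred_maps)
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        for p_row, g_row in zip(pred, gt):
            for p, g in zip(p_row, g_row):
                total += (p - g) ** 2
    return total / (2 * n)
```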

SSIM Loss
The Euclidean distance neglects the local coherence and spatial correlation in density maps. So, we use the SSIM loss to enforce local structural similarity between the estimated density map and the ground truth. The SSIM index is usually used in image quality assessment and computes the similarity between two images from three local statistics, i.e., mean, variance, and covariance. Following [14], we use an 11 × 11 normalized Gaussian kernel with a standard deviation of 1.5 to estimate the local statistics. The weight function is defined by W = {W(p) | p ∈ P}, where p is the offset from the center and P = {(i, j) | i, j ∈ {−5, . . . , 5}} contains all positions of the kernel. For each pixel x on the estimated density map Î and the corresponding ground truth I, the local statistics are computed by:

μ_Î(x) = Σ_{p∈P} W(p) · Î(x + p)
σ²_Î(x) = Σ_{p∈P} W(p) · [Î(x + p) − μ_Î(x)]²
σ_IÎ(x) = Σ_{p∈P} W(p) · [I(x + p) − μ_I(x)] · [Î(x + p) − μ_Î(x)]

where μ_Î and σ²_Î are the local mean and variance estimates of Î, σ_IÎ is the local covariance estimate, and μ_I and σ²_I are defined analogously for I. The SSIM index can be calculated as follows:

SSIM(x) = [(2 μ_I μ_Î + C_1)(2 σ_IÎ + C_2)] / [(μ_I² + μ_Î² + C_1)(σ²_I + σ²_Î + C_2)]

where C_1 and C_2 are small constants to avoid division by zero. The SSIM loss is defined as:

L_s = 1 − (1/N) Σ_x SSIM(x)

where N is the number of pixels in the density maps.
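A simplified sketch of the SSIM index in pure Python. Note the deliberate simplification: it uses global (whole-map) statistics instead of the paper's 11 × 11 Gaussian-weighted local windows, so it illustrates the formula rather than reproducing the exact loss; the constant values are also illustrative.

```python
def ssim_index(x, y, c1=1e-4, c2=9e-4):
    """Simplified SSIM between two equal-size maps (nested lists),
    computed from global mean, variance, and covariance via
    SSIM = (2*mu_x*mu_y + c1)(2*cov + c2) /
           ((mu_x^2 + mu_y^2 + c1)(var_x + var_y + c2))."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical maps score exactly 1, and the SSIM loss 1 − SSIM is therefore zero when the estimate matches the ground truth.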

Experiments
The FEM parameters are initialized from a pre-trained VGG-16 with ReLU, and the other parameters are randomly initialized with zero mean and a standard deviation of 0.01. In this section, we introduce the datasets and experimental details. The experimental results are compared with current state-of-the-art methods in Table 2, and qualitative results are shown in Figure 3.

Evaluation Metrics
The mean absolute error (MAE) and root mean squared error (RMSE) are commonly used to evaluate the performance of different methods in previous works. They are defined as:

MAE = (1/N) Σ_{i=1}^{N} |y_i − y_i^t|,  RMSE = √[(1/N) Σ_{i=1}^{N} (y_i − y_i^t)²]

where N is the number of test images, and y_i and y_i^t are the predicted count and the ground truth for the i-th test image, respectively. Generally speaking, the MAE indicates the accuracy of the estimation, and the RMSE reflects its robustness.
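Both metrics can be computed directly from the per-image counts (the function name is illustrative):

```python
import math

def mae_rmse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error over a test set,
    given predicted and ground-truth per-image crowd counts."""
    n = len(pred_counts)
    mae = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n
    rmse = math.sqrt(
        sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)
    return mae, rmse
```

Because RMSE squares each error before averaging, a few badly mis-counted images inflate it more than MAE, which is why it is read as a robustness indicator.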

ShanghaiTech
This dataset consists of 1198 annotated images with a total of 330,165 annotated heads. It is divided into two parts, A and B. The images of Part A are randomly collected from the Internet; it consists of 300 training images and 182 testing images of different resolutions, and the scenes are mostly highly congested, with an average count of 501.4. Part B images are captured from relatively sparse street crowds; it contains 400 training images and 316 testing images with the same resolution (768 × 1024), with an average count of 126. During training, we randomly crop image patches of 1/4 the original image's size at different locations. These patches are further mirrored to quadruple the training set.

UCF QNRF
This dataset is the largest crowd dataset, containing 1535 high-resolution images, with 1201 training images and 334 testing images. It annotates approximately 1.25 million people, and the mean count is 815.4. During training, we crop image patches of 1/9 the original image's size at different locations, with four patches per original image.

UCF-CC 50
This dataset has only 50 images, but 63,974 annotated heads in total. Estimation on this dataset is challenging owing to the small number of images and the considerable variation in crowd counts, which range from 94 to 4543. We follow the standard protocol and use five-fold cross-validation [41] to evaluate the performance of the proposed method. Ground truth density maps are generated with a fixed-spread Gaussian kernel.

Ablation Study on ShanghaiTech Part A
In order to analyze the effectiveness of our GLA module, we conducted an ablation study on the ShanghaiTech Part A [2] dataset. We removed the Global and Local Attention Module, leaving a network consisting of only the feature extractor module and the feature fusing module, and trained it with the same parameters and a fixed learning rate of 10−7. As shown in Table 3, using only the feature extractor module and the feature fusing module we obtained an MAE of 68.2 and an RMSE of 115.0, while adding the Global and Local Attention Module yields an MAE of 63.9 and an RMSE of 104.2. This is 6.3% lower MAE and 9.3% lower RMSE, which demonstrates that the Global and Local Attention Module can significantly decrease the error of the estimated crowd count in congested scenes with varied scales.

Conclusions
In this work, we proposed a novel global and local attention network (GLANet) for density map generation and accurate crowd count estimation. We designed an attention mechanism, drawing on the spatial attention mechanism and the channel attention mechanism, to handle scale variation while keeping the overhead small. Additionally, to exploit the local correlation of density maps, we used the SSIM loss to enforce local structural similarity between density maps. Extensive experiments show that our method achieves performance superior to state-of-the-art methods on four major crowd counting benchmarks.