A High-Density Crowd Counting Method Based on Convolutional Feature Fusion

Abstract: In recent years, trampling incidents caused by overcrowding have occurred frequently, creating a demand for crowd counting in high-density environments. At present, there are few studies on monitoring crowds in large-scale crowded environments, and the existing technology has drawbacks and lacks mature systems. To solve the high-density crowd counting problem in complex environments, this paper proposes FF-CNN (Feature Fusion of Convolutional Neural Network), a deep convolutional neural network method based on feature fusion. FF-CNN maps the crowd image to its crowd density map and then obtains the head count by integration. Geometry-adaptive kernels are adopted to generate high-quality density maps, which are used as ground truths for network training. A deconvolution technique is used to fuse high-level and low-level features to obtain richer features, and two loss functions, i.e., density map loss and absolute count loss, are used for joint optimization. To increase sample diversity, the original images are cropped with a random cropping method in each iteration. Experimental results on the ShanghaiTech public dataset show that fusing low-level and high-level features extracts richer features, which improves the precision of density map estimation and, in turn, the accuracy of crowd counting.


Introduction
Crowd safety in public places has always been a significant but troublesome issue, especially where high-density crowds gather. The higher the crowd density, the more easily the crowd gets out of control [1], which can cause serious casualties. It is therefore important to seek an intelligent method of crowd analysis in public places to assist in prevention and decision-making. As an important part of crowd analysis [2], crowd counting and density estimation can help to quantify the importance of events and provide relevant personnel with information to support decision-making. Crowd counting and density estimation have thus become hot topics in the security field and are widely used in video surveillance, traffic monitoring, public safety and urban planning [3]. There is now a very large demand for crowd monitoring systems. However, existing crowd monitoring products still have many weaknesses; for example, most of them are limited to particular application scenes or have low precision. In particular, research on monitoring the number of pedestrians in large-scale crowded environments (see Figure 1) is insufficient [4].
Crowd counting methods can be divided into detection-based methods and regression-based methods. Detection-based methods usually use a sliding window to detect each pedestrian in the scene, obtain the approximate position of each pedestrian, and then count the pedestrians [5][6][7]. They achieve good results for low-density crowd scenes but are greatly limited for high-density crowd scenes. Early regression-based methods try to learn a direct mapping from low-level features extracted from local image blocks to the head count [8][9][10]. Such direct regression only counts the pedestrians and ignores important spatial information. Refs. [11,12] suggested learning a linear or non-linear mapping between local block features and their corresponding target density maps, which incorporates spatial information into the learning process.
The great success of the Convolutional Neural Network (CNN) in many computer vision tasks motivated researchers to use CNNs to learn nonlinear functions from crowd images to the corresponding density maps or counts. In 2015, Wang et al. [13] were the first to apply a CNN to the crowd counting task, using the AlexNet network structure [14]. The fully connected layer with 4096 neurons was replaced by a layer with only one neuron to predict the number of pedestrians in the crowd image. In the same year, Zhang et al. [4] noticed that the performance of existing methods dropped drastically when they were applied to new scenes that differed from the training dataset. To overcome this problem, they proposed a data-driven method that fine-tunes the pre-trained CNN model with training samples whose density level is similar to that of the new scene, so that the model adapts to unknown application scenes. This method avoids re-training when the model is transferred to a new scene, but it still needs a large amount of training data, and in practical applications it is difficult to know the density level of the new scene in advance. Inspired by the success of multi-column networks [15] in image recognition, in 2016 Zhang et al. [16] proposed a multi-column convolutional neural network (MCNN) consisting of three columns whose filters correspond to receptive fields of different sizes (large, medium, small), in order to adapt to changes in head size caused by perspective effects or image resolution. During training, each column of the MCNN is pre-trained on all image blocks, and the three columns are then combined for fine-tuning. The training process is complex, and there is a large amount of structural redundancy.
In 2017, Sam et al. [17] proposed the switching convolutional neural network for crowd counting (Switching CNN), which trains each regressor on a particular set of training data patches according to the crowd density in the image. The network consists of multiple independent CNN regressors, similar to a multi-column network, together with a Switch classifier based on the VGG-16 [18] architecture that selects the optimal regressor for a particular input block. The Switch classifier and the independent regressors are trained alternately. However, switching between regressors with the Switch classifier is very expensive and often inaccurate. Similar to Refs. [16,17], in 2017 Kumagai et al. [19] considered a single predictor insufficient to predict the number of pedestrians properly across different scene environments, and thus proposed a hybrid neural network, the Mixture of CNNs (MoCNN). The model consists of a mixture of expert CNNs and a gated CNN. The appropriate expert CNN is selected adaptively according to the background of the input image. At prediction time, the expert CNNs estimate the head count in the image, while the gated CNN predicts the appropriate probability of each expert CNN. These probabilities are then used as weighting factors to compute a weighted average of the head counts estimated by all expert CNNs. MoCNN thus not only trains multiple expert CNNs but also learns, through the gated CNN, the probability assigned to each expert's estimate. However, it can only estimate the crowd count and cannot provide crowd density distribution information. Tang et al. [20] proposed a low-rank and sparse-based deep-fusion convolutional neural network for crowd counting (LFCNN), which adopts a regression method based on low-rank and sparse penalties to improve the accuracy of the projection from the density map to the global count and achieves excellent performance. Zhang et al. [21] proposed the scale-adaptive CNN (SaCNN), which extracts feature maps from multiple layers, adapts them to the same output size, estimates the crowd density map, and integrates it to obtain a more accurate head count. Han et al. [22] combined a convolutional neural network with a Markov Random Field (CNN-MRF) to count heads in static images. Their method contains three parts: a pre-trained deep residual network 152 [23] to extract features, a fully connected neural network for count regression, and an MRF to smooth the counting results of the local patches. In this way, the high correlation between adjacent local patches is used to improve count accuracy.
For the crowd counting task in high-density and complex environments, this paper proposes a feature fusion-based deep convolutional neural network method, FF-CNN (Feature Fusion of Convolutional Neural Network), to obtain more accurate crowd counts. FF-CNN maps the crowd image to its crowd density map and then obtains the head count by integration. Following MCNN [16], geometry-adaptive kernels are adopted to generate high-quality density maps, which are used as the training ground truths. The VGG network is used as the trunk network of FF-CNN to obtain richer features. A deconvolution technique [24,25] is used to fuse high-level and low-level features. Two loss functions, i.e., density map loss and absolute count loss, are optimized jointly to obtain a more precise density map and a more accurate crowd count. To increase sample diversity, the original images are randomly cropped to 256 × 256 patches in each iteration.

Density Map Based Crowd Counting
There are two options for estimating the head count of a crowd image with CNN: (1) Input the crowd image and output the estimated head count directly; (2) input the crowd image, and then output the crowd density map which shows the head count of each pixel in the image, and then integrate any area in the density map to get a final head count in the area. Compared with outputting head count directly, the estimated crowd density map may show the spatial distribution information of a given image. In addition, when learning density maps with a CNN model, the learned filters are more suitable for different sizes of human heads, which can improve the robustness to perspective changes. Thus, these filters have more semantic information and can improve the accuracy of crowd counting [16]. Therefore, we will learn the nonlinear mapping between the crowd image and the corresponding density map with CNN, and get the head count by integrating the density map.
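The second option can be illustrated with a minimal sketch in which integration over a region is simply the sum of the density values inside it. The function name `region_count` and the toy 4 × 4 map are hypothetical and chosen only for illustration.

```python
def region_count(density, top, left, height, width):
    """Sum the density values inside a rectangular region (discrete integration)."""
    return sum(
        density[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )

# Toy 4 x 4 density map holding two heads, each spread evenly over a 2 x 2 block.
density = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]

total = region_count(density, 0, 0, 4, 4)       # count over the whole image
upper_left = region_count(density, 0, 0, 2, 2)  # count inside one sub-region
```

Unlike a single regressed number, the same map answers both the global question and the question of how many people are in a chosen area.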

Ground Truth Density Map
To apply supervised training on FF-CNN by means of density map regression from an input image, the image with labeled pedestrian heads should be converted into a ground truth density map. The ground truth density map generation method was used in FF-CNN [14].
Suppose a pedestrian head is located at pixel $x_i$; it can be represented as a delta function $\delta(x - x_i)$. An image with $N$ labeled pedestrian heads can then be expressed as follows:

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i). \tag{1}$$

Convolving it with a Gaussian kernel $G_\sigma$ converts it into a continuous density function [7]:

$$F(x) = H(x) * G_\sigma(x), \tag{2}$$

where $N$ is the head count and $\sigma$ is the spread parameter of the Gaussian kernel. The sum of the density map over all pixels is equivalent to the total number of pedestrians.
Considering the perspective transformation, it is necessary to obtain the Gaussian spread parameter σ according to the size of each pedestrian head in the image. However, in a high-density environment, it is almost impossible to accurately obtain the size of the occluded head manually. It is observed that the distance between two adjacent pedestrian heads in a crowded scene is proportional to the size of the pedestrian head. Therefore, a geometrically adaptive Gaussian kernel can be used to generate a density map [16].
Assuming that the crowd is distributed evenly around each pedestrian head, the average distance between a head and its $k$ nearest neighbors provides a viable estimate of the geometric distortion. For a given head coordinate $x_i$, let its distances to the $k$ nearest neighbors be $d_{i1}, d_{i2}, \cdots, d_{ik}$, so that the average distance is $\bar{d}_i = \frac{1}{k}\sum_{j=1}^{k} d_{ij}$, and let $\beta$ be the coefficient that scales this distance into the kernel spread. The density function is:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i. \tag{3}$$

Although we used the adaptive density map generation method of Ref. [16], the value of β was optimized to obtain a higher-quality ground truth density map, as described in Section 3.2.
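The geometry-adaptive procedure can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the function name, the choice k = 3, the brute-force neighbor search, and the renormalization of each Gaussian on the finite pixel grid are all assumptions, while β = 0.12 and the cap of 100 pixels on the average distance follow the settings reported in the implementation details.

```python
import math

def adaptive_density_map(heads, height, width, k=3, beta=0.12, max_dist=100.0):
    """Place one normalized Gaussian per head, with spread beta * mean k-NN distance."""
    density = [[0.0] * width for _ in range(height)]
    for i, (yi, xi) in enumerate(heads):
        # Mean distance from head i to its k nearest labeled neighbors (brute force).
        dists = sorted(
            math.hypot(yi - yj, xi - xj)
            for j, (yj, xj) in enumerate(heads) if j != i
        )[:k]
        mean_dist = min(sum(dists) / len(dists), max_dist) if dists else max_dist
        sigma = max(beta * mean_dist, 1e-6)  # guard against coincident annotations
        # Evaluate the Gaussian on the pixel grid and renormalize it so that
        # each head contributes exactly one to the sum of the map (assumption).
        weights = [
            [math.exp(-((r - yi) ** 2 + (c - xi) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        norm = sum(map(sum, weights))
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / norm
    return density

heads = [(8, 8), (8, 14), (20, 20), (25, 10)]  # hypothetical (row, col) annotations
density = adaptive_density_map(heads, height=32, width=32)
count = sum(map(sum, density))  # sums to the number of annotated heads
```

Heads with close neighbors receive narrow kernels and isolated heads receive wide ones, which is how the map adapts to perspective without explicit head-size labels.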

Network Architecture
The VGG [18] network has achieved outstanding results in image classification competitions and shows strong extensibility and generalization performance. The network model used in this paper is based on the VGG-16 network, whose backbone is constructed by repeatedly stacking convolutional layers with 3 × 3 kernels and max pooling layers with 2 × 2 filters. Since too much downsampling results in smaller feature maps and lower resolution, only the first four sets of convolutions in the VGG-16 network, which include three downsampling layers, are adopted in order to obtain a higher-quality crowd density map. These four sets of convolutions are shown in Figure 2.
The experiments with CCF (Convolutional Channel Features) [26] demonstrated that, to extract richer features, the suggested typical accumulated downsampling coefficients are four and eight. Inspired by this conclusion, the feature maps with downsampling coefficients of four and eight (the feature maps of Conv3_3 and Conv4_3) were fused by a deconvolution technique to obtain richer features and, finally, a higher-quality crowd density map.
The network architecture constructed in this paper is shown in Figure 3. The convolutional layers from Conv1_1 to Conv4_3 use small 3 × 3 convolutional kernels. The convolution stride is fixed to 1 pixel, and the input of each convolutional layer is zero-padded by one pixel on every side, which preserves the spatial resolution after the convolution operation. A rectified linear unit (ReLU) [27] is applied to each convolutional layer as the activation function. All pooling layers use max pooling with a 2 × 2 filter and a fixed stride of 2, so each pooling layer halves the width and height of the feature map. The number of feature maps of the convolutional layers is doubled after each downsampling. For an input image of size W × H, Conv3_3 follows two max pooling operations and therefore has size W/4 × H/4, while Conv4_3 follows three downsampling operations and has size W/8 × H/8. To fuse the Conv3_3 and Conv4_3 feature maps, a deconvolution layer is added after Conv4_3 to upsample its feature maps to the same size as those of Conv3_3. Two convolutional layers with 3 × 3 filters, Conv5_1 and Conv5_2, are adopted in the upper layers of the network to reduce the number of feature maps and increase the nonlinearity of the model. Finally, a convolutional layer with 1 × 1 filters maps the result to the density map, which can be integrated to obtain the overall count. Because the deconvolution layer upsamples the Conv4_3 feature maps to the size of the Conv3_3 feature maps, the final accumulated downsampling coefficient is four even though the network contains three downsampling operations. Consequently, the ground truth density map only needs to be downsampled to 1/4 of the input image size, and the estimated density map is likewise 1/4 the size of the original image.
In addition, although FF-CNN is a fully convolutional network [28,29] and could in principle accept images of any size, the network contains a concatenation layer, so the Conv3_3 feature maps and the upsampled Conv4_3 feature maps must have identical sizes. The height and width of the input image should therefore be multiples of eight.
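The size bookkeeping behind the multiple-of-eight requirement can be checked with a short sketch. It assumes that each 2 × 2 pooling halves the size with floor rounding (stride 2) and that the deconvolution upsamples by exactly a factor of two; the function names are hypothetical.

```python
def pooled(size, times):
    """Spatial size after `times` 2 x 2 max poolings with stride 2 (floor rounding assumed)."""
    for _ in range(times):
        size //= 2
    return size

def fusion_sizes(height, width):
    """Return the Conv3_3 size and the size of Conv4_3 after 2x deconvolution."""
    conv3_3 = (pooled(height, 2), pooled(width, 2))  # accumulated coefficient four
    conv4_3 = (pooled(height, 3), pooled(width, 3))  # accumulated coefficient eight
    upsampled = (2 * conv4_3[0], 2 * conv4_3[1])
    return conv3_3, upsampled

ok = fusion_sizes(256, 256)   # both branches are 64 x 64, so they can be concatenated
bad = fusion_sizes(260, 256)  # 65 rows versus 64 rows, so concatenation would fail
```

A height of 260 is a multiple of four but not of eight, which is enough to make the two branches disagree by one row; this is why a fixed 256 × 256 crop is a convenient training input.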

Network Loss
Network training uses the Euclidean loss to measure the distance between the estimated density map and the ground truth density map:

$$L_D(\theta) = \frac{1}{2M}\sum_{i=1}^{M} \left\| F_d(X_i;\theta) - D_i \right\|_2^2, \tag{4}$$

where $\theta$ is the set of parameters to be learned in the network, $M$ is the total number of training images, $X_i$ is the input image, $D_i$ is the corresponding ground truth density map, and $F_d(X_i;\theta)$ represents the density map estimated for $X_i$. The squared difference is computed at each pixel and then accumulated. The purpose of Equation (4) is to obtain a high-quality crowd density distribution, from which an accurate crowd count can finally be obtained.
Since the estimated crowd count is expected to be as close as possible to the ground truth head count after training, a count loss function is introduced in addition to the density map loss function. This weighted absolute count loss penalizes the absolute difference $|F_y(X_i;\theta) - Y_i|$ over the training images, where $F_y(X_i;\theta)$ and $Y_i$ represent the estimated head count and the ground truth head count of image $X_i$, respectively.
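How the two terms interact can be sketched for a single training image. The weighted-sum combination and the factor of one half in the density term are assumptions (the latter follows the common Euclidean loss convention); the default weights 1 and 0.00004 are the values reported in the implementation details, and all function names are hypothetical.

```python
def density_loss(estimated, truth):
    """Half the squared Euclidean distance between two density maps (one image)."""
    return 0.5 * sum(
        (e - t) ** 2
        for est_row, true_row in zip(estimated, truth)
        for e, t in zip(est_row, true_row)
    )

def count_loss(estimated, truth):
    """Absolute difference between the integrated head counts."""
    return abs(sum(map(sum, estimated)) - sum(map(sum, truth)))

def joint_loss(estimated, truth, w_density=1.0, w_count=0.00004):
    """Weighted sum of the two terms (assumed form of the joint objective)."""
    return w_density * density_loss(estimated, truth) + w_count * count_loss(estimated, truth)

truth = [[1.0, 0.0], [0.0, 0.0]]
misplaced = [[0.5, 0.5], [0.0, 0.0]]   # right count, wrong spatial distribution
overcount = [[1.0, 0.0], [0.0, 0.5]]   # extra density, so the count is off by 0.5

loss_misplaced = joint_loss(misplaced, truth)
loss_overcount = joint_loss(overcount, truth)
```

The first estimate has zero count loss but a nonzero density loss, while the second is penalized by both terms, which illustrates why the two losses are complementary.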

Experiments
The ShanghaiTech dataset was used to evaluate the proposed FF-CNN model. The network was implemented and trained with the Caffe framework provided by the Berkeley Vision and Learning Center (BVLC). The experiments were run on a computer with an Intel(R) Xeon CPU E5-2683 v3 @ 2.00 GHz and an NVIDIA TESLA K80 GPU (NVIDIA, Santa Clara, CA, USA). The experimental platform ran 64-bit Ubuntu 14.04 with Anaconda 3.4, CUDA Toolkit 8.0 and OpenCV 2.7.0.

Dataset
The experiments were conducted on the challenging ShanghaiTech dataset [16], which covers not only different density levels but also different complex scenes, with varying scales and perspective distortions. As shown in Table 1, the ShanghaiTech dataset consists of two parts, Part A and Part B, with a total of 1198 images and 330,165 labeled heads. Part A includes 482 images randomly selected from the Internet. The images in Part B were taken from street photographs in Shanghai. Compared to Part B, Part A contains images with higher density. For the experiments, both parts were divided into training and test sets: 300 and 182 images in Part A were used for training and testing, respectively, while 400 and 316 images in Part B were used for training and testing, respectively.

Implementation Details
In our experiments, unlike Ref. [14], which set the parameter β in Equation (3) to β = 0.3, we set β = 0.12, which gave the best result. Also, as in Ref. [16], the size of the head was limited to 100 pixels, i.e., when $\bar{d}_i > 100$, we let $\bar{d}_i = 100$ ($\bar{d}_i$ is a parameter in Equation (3)). Other parameters were set following Ref. [16]. During training, all parameters were optimized with batch gradient descent (BGD) and back propagation (BP). The initial learning rate was 0.00001, the momentum was 0.9, and the weight decay was set to 0.0001 to avoid over-fitting during training. The loss weight of the density map was set to 1 and the absolute count loss weight to 0.00004. The density map loss function and the absolute count loss function were trained together. The total number of iterations was set to 2,200,000 (with a batch size of 1) on both the Part A and Part B datasets. For the Part A dataset, 300 iterations were regarded as one epoch, giving a total of 7333 training epochs. For the Part B dataset, 400 iterations were regarded as one epoch, giving a total of 5500 training epochs.
In Ref. [16], the training set was augmented by randomly cropping nine patches from each original image, each patch being 1/4 the size of the original image. In order to further speed up training while reducing memory usage and enhancing the diversity of the data, we instead cropped the training images randomly in each training iteration. Since the height and width of the input image should be multiples of eight, the cropped image block size was fixed to 256 × 256, while the cropping position was random.
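A per-iteration random crop that keeps the image patch and its density map aligned might look like the following sketch. It assumes that the ground truth density map has already been reduced to 1/4 resolution, that the image sides are multiples of four and at least 256, and that the crop corner is snapped to a multiple of four so that both crops cover the same area; the snapping rule and all names are assumptions rather than details given in the text.

```python
import random

def random_crop(image, density, crop=256, stride=4, rng=random):
    """Crop an image patch and the matching patch of its 1/4-resolution density map."""
    height, width = len(image), len(image[0])
    # Snap the corner to the density-map grid so both crops cover the same area.
    top = stride * rng.randrange((height - crop) // stride + 1)
    left = stride * rng.randrange((width - crop) // stride + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    d_top, d_left, d_crop = top // stride, left // stride, crop // stride
    d_patch = [row[d_left:d_left + d_crop] for row in density[d_top:d_top + d_crop]]
    return patch, d_patch

# Hypothetical 384 x 512 grayscale image and its 96 x 128 density map.
image = [[0] * 512 for _ in range(384)]
density = [[0.0] * 128 for _ in range(96)]
patch, d_patch = random_crop(image, density, rng=random.Random(0))
```

Because a fresh position is drawn at every call, the network sees a different 256 × 256 view of each image in each iteration without any cropped copies being stored.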

Evaluation Metrics
The mean absolute error (MAE) and mean square error (MSE) were used to evaluate performance [15][16][17]. The MAE reflects the accuracy of the prediction, and the MSE reflects the robustness of the prediction. They are defined as follows:

$$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M} \left| z_i - \hat{z}_i \right|,$$

$$\mathrm{MSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M} \left( z_i - \hat{z}_i \right)^2},$$

where $M$ is the number of test images, and $z_i$ and $\hat{z}_i$ represent the ground truth head count and the estimated head count in the $i$-th image, respectively. Smaller values of MAE and MSE indicate better results.
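The difference between the two metrics can be seen in a small numerical sketch. It assumes the root form of the MSE that is customary in crowd counting benchmarks; the function names and the toy counts are hypothetical.

```python
import math

def mae(truth, estimated):
    """Mean absolute error between ground truth and estimated head counts."""
    return sum(abs(z - z_hat) for z, z_hat in zip(truth, estimated)) / len(truth)

def mse(truth, estimated):
    """Root of the mean squared error (the form assumed here)."""
    return math.sqrt(
        sum((z - z_hat) ** 2 for z, z_hat in zip(truth, estimated)) / len(truth)
    )

truth = [100, 200]
even = [105, 195]     # both images are off by 5
uneven = [100, 210]   # one image is exact, the other is off by 10

mae_even, mse_even = mae(truth, even), mse(truth, even)
mae_uneven, mse_uneven = mae(truth, uneven), mse(truth, uneven)
```

Both sets of estimates have the same MAE, but the uneven one has the larger MSE, which is why the MSE is read as an indicator of robustness: it penalizes occasional large errors more heavily.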

Results and Analysis
The proposed method was compared with existing methods, and the experimental results are shown in Table 2. "LBP + RR" (local binary pattern + ridge regression) is a traditional regression-based method, which uses LBP features extracted from the original image as input and uses RR to estimate the head count for each image [4]. It can be seen that the method based on traditional hand-crafted feature extraction performs poorly on complex high-density crowd datasets, and that the CNN-based methods are far superior to it.
Table 2 also shows that FF-CNN performs much better than the Cross-scene [4], MCNN [16], FCN [29], Cascaded-MTL [30] and Switching-CNN [17] methods. Compared to LFCNN [20] and SaCNN [21], the proposed FF-CNN shows lower error on the crowd-intensive Part A dataset: relative to LFCNN, MAE and MSE are reduced by 7.45 and 3.1, respectively, and relative to SaCNN, by 5.05 and 0.4, respectively. The difference is not obvious on the sparser Part B dataset, where the MAE and MSE of the proposed method are only 1.75 and 0.79 higher, respectively, than those of LFCNN, and only 0.25 and 0.39 higher, respectively, than those of SaCNN. However, LFCNN requires two phases, deep-fusion density map regression followed by low-rank and sparse-based regression, whereas FF-CNN needs only one phase while achieving comparable accuracy. SaCNN is more complex than FF-CNN and has more parameters to train; in addition, the resolution of its density map is only 1/8 of the original image, while the resolution of the density map obtained by FF-CNN is 1/4 of the original image. Compared to CNN-MRF [22], the proposed FF-CNN shows relatively higher error on the crowd-intensive Part A dataset; in particular, its MSE is more than 8.7 higher, which shows that CNN-MRF is more robust. On the Part B dataset, FF-CNN performs better on MAE. CNN-MRF contains three parts: a pre-trained deep residual network 152 to extract features, a fully connected neural network for count regression, and an MRF to smooth the counts. This results in more parameters (residual network 152 contains more than 40 million parameters, whereas our method contains about 12 million) and more computation, but not much improvement. Figure 4 shows the ground truth head count and the estimated head count for the 182 test images of the Part A dataset.
Figure 5 shows the ground truth head count and the estimated head count in the 316 test images of the Part B dataset. The dashed line indicates the ground truth head count, while the solid line indicates the head count estimated (the ground truth head count with the corresponding estimated head count is arranged in ascending order to get a more intuitive broken line diagram). It can be seen that either for Part A or Part B, the estimation head count is close to the ground truth head count. Note that for Part A, the crowd density ranges from 66 to 2255, while for Part B, the crowd density ranges from 11 to 539, which shows quite different density and varied density distribution of the crowd. However, the FF-CNN can be effectively adaptive to different density levels and the estimation results are very accurate. obtained by FF-CNN is 1/4 of the original image. Compared to CNN-MRF [22], the proposed FF-CNN method shows relatively higher error on the crowd-intensive Part A, especially the error of MSE is higher than 8.7 which shows that CNN-MRF is more robust. On the Part B dataset, the proposed FF-CNN method performs better on MAE. The CNN-MRF contains three parts, a pre-trained deep residual network 152 to extract features, a fully connected neural network for count regress and a MRF to smooth the counting, which resulted in more parameters (residual network 152 contains more than 40 million parameters, our method contains about 12 million parameters) and calculations, but not much improvement.  Figure 4 shows the ground truth head count and the estimated head count in the 182 test images of the Part A dataset. Figure 5 shows the ground truth head count and the estimated head count in the 316 test images of the Part B dataset. 
Figure 6 shows several experimental results on the Part A and Part B test sets. The left side shows the test crowd images, and the right side shows the density maps estimated by the proposed model. The estimated density maps restore the crowd distribution in the images well and adapt effectively to different scenes, lighting, occlusion, sizes, and perspective changes. The estimates are very close to the ground truth.
Figure 5. The ground truth head count and estimated head count with FF-CNN on the Part B dataset (the vertical axis represents the head count, and the horizontal axis represents the test sample).

Conclusions
In this paper, a deep convolutional neural network method based on feature fusion (FF-CNN) was proposed to solve the problem of crowd counting and density distribution estimation in crowded scenes. The proposed FF-CNN maps a crowd image to its corresponding crowd density map and obtains the crowd count by integrating that map. To obtain a higher-quality crowd density map, geometry adaptive kernels were adopted to generate high-quality ground truth density maps for training, and the deconvolution technique was used to fuse high-level and low-level features into richer features. An absolute count loss with weight 0.00004 was added to obtain a more accurate crowd count. To enhance the generalization ability of FF-CNN, the original images were randomly cropped at each iteration to increase sample diversity. The experimental results on the ShanghaiTech dataset showed that fusing low-level and high-level features extracts richer features and yields results comparable with state-of-the-art methods. The method also adapts to different scenes and crowd density levels, and is robust to scale and perspective changes.
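As a compact reminder of how the count and the joint objective fit together, the sketch below integrates a density map into a head count and combines a density map loss with the absolute count loss weighted by 0.00004. It is a minimal per-image illustration under stated assumptions, not the authors' implementation: the density map loss is assumed here to be half the squared Euclidean distance between the estimated and ground truth maps, batch averaging is omitted, and the function names and toy maps are hypothetical.

```python
def count_from_density(density_map):
    """Integrate (sum) a 2D density map, given as a list of rows, into a count."""
    return sum(sum(row) for row in density_map)


def joint_loss(est_map, gt_map, count_weight=0.00004):
    """Density map loss plus weighted absolute count loss for one image.

    Assumption: the density map loss is half the squared Euclidean distance
    between the two maps; the exact form used in the paper may differ.
    """
    density_loss = 0.5 * sum(
        (est - gt) ** 2
        for est_row, gt_row in zip(est_map, gt_map)
        for est, gt in zip(est_row, gt_row)
    )
    count_loss = abs(count_from_density(est_map) - count_from_density(gt_map))
    return density_loss + count_weight * count_loss


# Toy 2x2 maps for illustration only; real density maps are far larger.
gt_map_demo = [[0.0, 0.5], [0.5, 1.0]]
est_map_demo = [[0.0, 0.5], [0.5, 0.5]]
loss_demo = joint_loss(est_map_demo, gt_map_demo)
```

The small weight keeps the count term from dominating the pixel-wise density term, whose magnitude is much smaller because individual density values are small.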
