Smart Camera Aware Crowd Counting via Multiple Task Fractional Stride Deep Learning

Estimating the number of people in highly clustered crowd scenes is an extremely challenging task on account of serious occlusion and non-uniformity distribution in one crowd image. Traditional works on crowd counting take advantage of different CNN like networks to regress crowd density map, and further predict the count. In contrast, we investigate a simple but valid deep learning model that concentrates on accurately predicting the density map and simultaneously training a density level classifier to relax parameters of the network to prevent dangerous stampede with a smart camera. First, a combination of atrous and fractional stride convolutional neural network (CAFN) is proposed to deliver larger receptive fields and reduce the loss of details during down-sampling by using dilated kernels. Second, the expanded architecture is offered to not only precisely regress the density map, but also classify the density level of the crowd in the meantime (MTCAFN, multiple tasks CAFN for both regression and classification). Third, experimental results demonstrated on four datasets (Shanghai Tech A (MAE = 88.1) and B (MAE = 18.8), WorldExpo’10(average MAE = 8.2), NS UCF_CC_50(MAE = 303.2) prove our proposed method can deliver effective performance.


Introduction
The automatic analysis of crowd has been a particular security technique in the current intelligent surveillance literature [1], which will prevent severe accidents by providing crucial information about the number of people and crowd density in a scene. Therefore, crowd counting and analysis has become an active topic in the computer vision literature due to its extensive application video surveillance, traffic monitoring, public safety, and urban planning [2]. However, the current intelligent surveillance system is still incapable of handling a large-scale crowded environment with severe occlusion and non-uniformity [3].
People-Counting technology can be generalized into two kinds of literature: detection methods and regression methods. Traditional approaches for crowd counting from images relied on hand-crafted representations to extract low-level features. These features were then mapped for counting or generating density maps via various regression techniques.
Some of the early methods [4] relied on pedestrian detection indirectly, such as HOG (Histograms of oriented gradients) features, can get a more accurate number of people when the crowd is sparse or there are no apparent overlap between people, but the results will be questionable when the group becomes denser. The detection-based model typically employs sliding window-based detection sending pictures in the MCNN. However, the granularity of density level is hard to define in real-time full scene analysis since the number of objects keeps varying on a large scale. Also, using a classifier means more columns need to be implemented which makes the design more complicated and causes more overabundance. These works spend a large portion of parameters on density level classification to label the input regions instead of feeding parameters into the final density map generation. Since the branch structure in MCNN is not efficient for generating a density map is unfavorable to the ultimate accuracy. Considering the above drawbacks, we propose a novel approach to concentrate on encoding the broader and deeper features in clustered scenes and generating a high-quality density map. The front-end VGG-16 [24] of the model named CSRNet in Reference [25] outputs a picture, which is 1/8 size of the original input. As discussed in CSRNet, the output size will be further shrunken if proceeding to stack more convolution layers and pooling layers (necessary components in VGG- 16), additionally, it is difficult to generate high-quality density maps, so the back-end employs dilated convolutions used to extract more in-depth salient information and improve output resolution.
In this paper, the proposed work is motivated by the idea of multiple task learning and dilated convolution in Cascaded-MTL [18] and CSRNet.
Dilated convolutions make it possible to acquire a larger receptive field, meanwhile applying small convolutions. Yu et al. [26] first attempted to utilize dilated convolutions in semantics segmentation, in which a hybrid dilated convolution (HDC) framework was developed to enlarge the receptive fields effectively. A further discussion on dilated convolution was expounded by Chen et al. [27], who employed atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple dilated rates to handle the problem of segmenting objects at various scales. These researches have a significant effect on choosing convolution types and the formation of model structures in our paper. Furthermore, the problem that some details of the image would be lost in the down-sampling, which is worth taking into consideration.
Multiple task learning can boost the classification and regression performance of a deep network. Eigen et al. [28] presented a multi-scale CNN for simultaneously predicting depth, surface normal and semantic labels from an image. They apply CNN at three different scales where the output of the smaller scale network is fed as input to the larger one. UberNet [29] employ a similar concept of training low, mid, and high-level vision tasks at the same time. It fuses different scales of the image pyramid for multi-task training on diverse sets. Gkiovani et al. [30] model a CNN for pose estimation and action detection, using features only from the last layer. Ranjan et al. [31] present an algorithm for simultaneous face detection, landmarks localization, pose estimation, and gender recognition using deep convolutional neural networks (CNN). The proposed method fuses the intermediate layers of a deep CNN which boosts up their performances.
However, we address designing an easy-trained CNN-based density map manager in this paper. The dilated convolution layers are deployed as the front-end to enlarge receptive fields and fractional stride convolution layers as the back-end to restore its spatial resolution. By making use of such structure, we significantly reduce parameters which make CAFN trained easily. The proposed MTCAFN extends the CAFN to multiple tasks not only in density map regression but also in density-level classification. Moreover, MTCAFN is developed as an extended network, uniting the shallow features extracted from the shallow sub-net with the broad features obtained from deep sub-net contributes to improving the network effectively via updating parameters. Furthermore, the proposed model uses the multi-task learning method to get the ability of density-level classification; finally, the loss of details will be restored by transposed convolutional layers as much as possible. This new improvement could significantly boost the performance of crowd counting. Besides, we outperform the previous crowd counting solution lower MAE, frequently used benchmark datasets, respectively. To summarize, we make the following contributions: (1) A principled way of combining different size dilated convolution layers.
(2) Multiple tasks learning framework not only regress the density map, but classify the density level of the crowd. (3) Some results could outperform state-of-the-art methods on several benchmarking datasets.
(4) The proposed model has been verified in a camera with an embedded system.
The rest of the paper is stated as follows. Section 2 introduces the fabric and configuration of the proposed system and model. Section 3 presents the experimental results on several datasets. In Section 4, we conclude the paper.

Proposed Framework
The framework proposed in this paper is designed to estimate the crowd density if the number of crowd people exceeds a threshold, defined as representing an abnormality.

Smart Camera System Configuration
The proposed system could accurately predict the crowd by using smart systems with a single camera and Nvidia Tx2 boards. First, the crowd scene images from three data sets will be inputted into our proposed model CAFN and MTCAFN on PC for training and testing, where we apply 1/10 of training set as the validation set. The trained model is transferred to the Tx2 board for online computing. After that, the real crowd scene will be captured by a camera and the final result will be predicted by our model. When the output counting is larger than the altering threshold T, the system will send the early warning to the surveillance center. The whole flowchart of the alerting system is shown in Figure 1. The rest of the paper is stated as follows. Section 2 reviews the previous works for crowd counting and density map generation. Section 3 introduces the fabric and configuration of the proposed system and model. Section 4 presents the experimental results on several datasets. In Section 5, we conclude the paper.

Proposed Framework
The framework proposed in this paper is designed to estimate the crowd density if the number of crowd people exceeds a threshold, defined as representing an abnormality.

Smart Camera System Configuration
The proposed system could accurately predict the crowd by using smart systems with a single camera and Nvidia Tx2 boards. First, the crowd scene images from three data sets will be inputted into our proposed model CAFN and MTCAFN on PC for training and testing, where we apply 1/10 of training set as the validation set. The trained model is transferred to the Tx2 board for online computing. After that, the real crowd scene will be captured by a camera and the final result will be predicted by our model. When the output counting is larger than the altering threshold T, the system will send the early warning to the surveillance center. The whole flowchart of the alerting system is shown in Figure 1.

CAFN Architecture
In the framework CAFN, the fundamental idea is to deploy a double-column dilated CNN for recording high-level features with larger receptive fields and producing high-quality density maps without obviously expanding network complexity. In this subsection, we firstly introduce the architecture, a network whose input is the image and the output is a density map of the crowd (say how many people per square meter), and then obtain the headcount by integration, then we present the corresponding training method.
Inspired by the idea of CSRNet, we utilized dilated convolutions as the front-end of CAFN because of its greater receptive fields, unlike adopting dilated convolution to capture more features when the resolution has been dropped off to a shallow level in CSRNet. Atrous convolutions are primarily made use of in this paper, which intent to gain more image information from an original image, then the transposed convolutional layer is to enlarge the size of image and up-sampling the previous layer's output to supplement the loss of details.

CAFN Architecture
In the framework CAFN, the fundamental idea is to deploy a double-column dilated CNN for recording high-level features with larger receptive fields and producing high-quality density maps without obviously expanding network complexity. In this subsection, we firstly introduce the architecture, a network whose input is the image and the output is a density map of the crowd (say how many people per square meter), and then obtain the headcount by integration, then we present the corresponding training method.
Inspired by the idea of CSRNet, we utilized dilated convolutions as the front-end of CAFN because of its greater receptive fields, unlike adopting dilated convolution to capture more features when the resolution has been dropped off to a shallow level in CSRNet. Atrous convolutions are primarily made use of in this paper, which intent to gain more image information from an original image, then the transposed convolutional layer is to enlarge the size of image and up-sampling the previous layer's output to supplement the loss of details.
In this paper, for attaining the training dataset, we crop 9 patches from each image at different locations with 1/4 size of the inputting images. The first four patches contain four quarters of the image without overlapping while the other five pieces are randomly cropped from the input image. Based on three branches of MCNN, we add dilation rate to filters to enlarge its receptive fields. To reduce net parameters, we consider four types of double-column association as experimental objects, which will be discussed in detail in Section 4. After extracting features from filters with different scales, we try to deploy transposed convolutional layers as the back-end for maintaining the output resolution. We choose a relatively better model, taking into account the stability of the model, of which MAE is not the lowest (but MSE is the lowest) by comparing different groups.
The overall structure of our CAFN is illustrated in Figure 2. It contains double parallel columns whose filters are with different dilate rates and local receptive fields of different sizes. Double convolutional columns are merged for fusing features from different scales, here we use the function (torch.cat) to concatenate matrices (feature maps) output from double columns, respectively, on the first dimension. For simplification, we use the same network structures for all columns (i.e., conv-pooling-conv-pooling) except for the sizes and numbers of filters. Max pooling is applied for each 2 × 2 region, and Parametric Rectified linear unit (PReLU) is adopted as the activation function because of its favorable performance for CNN. To reduce the computational complexity (the number of parameters to be optimized), we apply less number of filters for a convolutional layer with larger filters. We stack the output feature maps of all convolutional layers and map them to a density map. To map the feature maps to the density map, we adopt filters whose sizes are 1 × 1. The configuration of our network is shown below in detail (See Table 1). In this paper, for attaining the training dataset, we crop 9 patches from each image at different locations with 1/4 size of the inputting images. The first four patches contain four quarters of the image without overlapping while the other five pieces are randomly cropped from the input image. Based on three branches of MCNN, we add dilation rate to filters to enlarge its receptive fields. To reduce net parameters, we consider four types of double-column association as experimental objects, which will be discussed in detail in Section 4. After extracting features from filters with different scales, we try to deploy transposed convolutional layers as the back-end for maintaining the output resolution. We choose a relatively better model, taking into account the stability of the model, of which MAE is not the lowest (but MSE is the lowest) by comparing different groups.
The overall structure of our CAFN is illustrated in Figure 2. It contains double parallel columns whose filters are with different dilate rates and local receptive fields of different sizes. Double convolutional columns are merged for fusing features from different scales, here we use the function (torch.cat) to concatenate matrices (feature maps) output from double columns, respectively, on the first dimension. For simplification, we use the same network structures for all columns (i.e., convpooling-conv-pooling) except for the sizes and numbers of filters. Max pooling is applied for each 2 × 2 region, and Parametric Rectified linear unit (PReLU) is adopted as the activation function because of its favorable performance for CNN. To reduce the computational complexity (the number of parameters to be optimized), we apply less number of filters for a convolutional layer with larger filters. We stack the output feature maps of all convolutional layers and map them to a density map. To map the feature maps to the density map, we adopt filters whose sizes are 1 × 1. The configuration of our network is shown below in detail (See Table 1). All convolutional layers use padding to keep the previous size. The convolutional layers' parameters are denoted as "conv (kernel size) @ (number of filters)", max-pooling layers are conducted over a 2 × 2 pixel window with stride 2. The fractional stride convolutional layer is denoted as "Conv Transposed (kernel size) @ (number of filters)", and PReLU is used as a non-linear activation layer. All convolutional layers use padding to keep the previous size. The convolutional layers' parameters are denoted as "conv (kernel size) @ (number of filters)", max-pooling layers are conducted over a 2 × 2 pixel window with stride 2. The fractional stride convolutional layer is denoted as "Conv Transposed (kernel size) @ (number of filters)", and PReLU is used as a non-linear activation layer.

Front-End (Double-Column)
Back-End Then Euclidean distance is exploited to measure the difference between the estimated density map and ground truth. The loss function is defined as follows: where θ is a set of trainable parameters in the CAFN. N is the number of the training image. X i is the input image and Y i is the ground truth density map of the image X i . Y(X i ; θ) means the estimated density map generated by CAFN which is parameterized with X i . L is the loss function between estimated density map and the ground truth density map. The network is fetched by an image of arbitrary size and outputs crowd density map. The network has two sections corresponding to the two functions, with the first part learning larger scale features and the second part restoring resolution to perform density map estimation. One of the critical components in our model is the dilated convolutional layer. Systematic dilation supports the exponential expansion of the receptive field without loss of resolution or coverage.
The higher the network layer, the more information the original image contains in the unit pixel, that is, the larger the receptive field, however, which is done by pooling and takes the reduction of the resolution and the loss of the information in the original image as a cost. Due to the existence of the pooling layer, the size of the feature map in the back layer will be smaller and smaller. Dilated convolution is a kind of convolution idea that the downsampling will reduce the image resolution and the loss of information for image semantic segmentation. This feature enlarges the receptive field without increasing the number of parameters or the amount of computation. Examples can be found in Figure 3c where standard convolution (dilation rate = 1) is a 3 × 3 receptive field, and the dilated convolutions (dilation rate = 2) deliver 5 × 5 receptive fields.  Then Euclidean distance is exploited to measure the difference between the estimated density map and ground truth. The loss function is defined as follows: where θ is a set of trainable parameters in the CAFN. N is the number of the training image. i X is the input image and i Y is the ground truth density map of the image means the estimated density map generated by CAFN which is parameterized with i X . L is the loss function between estimated density map and the ground truth density map. The network is fetched by an image of arbitrary size and outputs crowd density map. The network has two sections corresponding to the two functions, with the first part learning larger scale features and the second part restoring resolution to perform density map estimation. One of the critical components in our model is the dilated convolutional layer. Systematic dilation supports the exponential expansion of the receptive field without loss of resolution or coverage.
The higher the network layer, the more information the original image contains in the unit pixel, that is, the larger the receptive field, however, which is done by pooling and takes the reduction of the resolution and the loss of the information in the original image as a cost. Due to the existence of the pooling layer, the size of the feature map in the back layer will be smaller and smaller. Dilated convolution is a kind of convolution idea that the downsampling will reduce the image resolution and the loss of information for image semantic segmentation. This feature enlarges the receptive field without increasing the number of parameters or the amount of computation. Examples can be found in Figure 3c where standard convolution (dilation rate = 1) is a 3 × 3 receptive field, and the dilated convolutions (dilation rate = 2) deliver 5 × 5 receptive fields.  The second component consists of a transposed convolution layer for up-sampling the previous layer's output to explain the loss of details due to earlier pooling layer. Convolution arithmetic [32] is shown in Figure 3. The second component consists of a transposed convolution layer for up-sampling the previous layer's output to explain the loss of details due to earlier pooling layer. Convolution arithmetic [32] is shown in Figure 3.
We apply a simple method to ensure that the improvements obtained are due to the proposed model and are not dependent on the sophisticated methods for calculating the ground truth density maps. Ground truth density map D i corresponding to i, the training patch is calculated by summing a 2D Gaussian kernel centered at every person's location as defined below: ( where σ is the scale parameter of the 2D Gaussian kernel and P is the set of all points where people are located.

MTCAFN Architecture
Since neural networks have complex network structures, training a reliable neural network requires a great deal of training data to update network parameters. However, the public datasets of crowd counting are usually included a few dozens to a few hundred samples, which results in relatively poor robustness and weak generalization of the model in some applications. Multi-task learning is a machine learning method that learns simultaneously from multiple tasks, like learning favorable information about similar tasks while effectively alleviating the problem of sparse data and improving network performance. The application of multi-task learning in deep convolution networks is mainly divided into two categories, one way is hard weight-sharing, and the alternative approach is soft weight-sharing, as seen in Figure 4. Hard weight sharing is a constraint of equal weights, while soft weight sharing means that groups of weights are encouraged to have similar values. In this paper, the hard weight-sharing is adopted to build two tasks covering crowd density estimation, as noted in Equation (3), and crowd density level classification.  In soft weight-sharing, each task has its model with its parameters individually. The distance between the weights of the model is then regularized to encourage the parameters to be similar.
We apply a simple method to ensure that the improvements obtained are due to the proposed model and are not dependent on the sophisticated methods for calculating the ground truth density maps. Ground truth density map i D corresponding to i , the training patch is calculated by summing a 2D Gaussian kernel centered at every person's location as defined below: where σ is the scale parameter of the 2D Gaussian kernel and Ρ is the set of all points where people are located.  Therefore a regression problem is converted into a multi-class classification problem. First, the crowd density was divided into five categories among the training data applied in this paper. Then, image classification is roughly shown as the following steps, (1) the original image is utilized as an input; (2) extracting features via the CNN; and (3) outputting the classification results by employing the Softmax classifier. The classification cross entropy is applied as the loss function in sub-network of density-level classification is demonstrated in Equation (3). where the L class represents the loss of the classification of the crowd density level, is the total number of training samples, M is the total number of categories, i is the sample, j is the category, T i,j is the real classification label, and P i,j is the prediction classification. The crowd density and the classification loss are merged into a loss for the neural network to train and update the parameters. The function definition is shown in Equation (4):

MTCAFN Architecture
where L represents the total loss, λ 1 λ 2 are the weights of the two task loss functions respectively. Due to the difference between the loss function, the loss values obtained from the actual experiments of the two tasks are entirely different. As the crowd count is the principal task, the estimated loss of crowd density should account for a more significant proportion. Therefore, λ 1 , λ 2 are used as hyper-parameters to balance the importance of two tasks. The overall structure of the MTCAFN is shown in Figure 5. It consists of a deep sub-net and a shallow one, which share convolutional layers from Conv1_1 to Conv4_1 (name of convolutional layers' block). Grayscale images are as input to the network for reducing computational costs. The deep sub-net contains a group of convolutional layers and max-pooling to handle various sized images followed by a set of fully connected layers. The shallow sub-net includes a series of convolutional layers followed by fractional stride convolutional layers up-sampling the previous layer's output to lower the loss of details due to earlier pooling layers. However, the loss of the shallow sub-net relies on the output of the broad network. Merge1 is an example to illustrate that Conv1_1 and Conv2_2 are merged for fusing features from different scales. In our structure, we adopted 3 × 3 (kernel size) in each layer.
The crowd density and the classification loss are merged into a loss for the neural netwo and update the parameters. The function definition is shown in Equation (4): e L represents the total loss, 1 λ 2 λ are the weights of the two task loss functions respect to the difference between the loss function, the loss values obtained from the actual experim e two tasks are entirely different. As the crowd count is the principal task, the estimated l d density should account for a more significant proportion. Therefore, 1 λ , 2 λ are used as h meters to balance the importance of two tasks. The overall structure of the MTCAFN is shown in Figure 5. It consists of a deep sub-net ow one, which share convolutional layers from Conv1_1 to Conv4_1 (name of convolu s' block). Grayscale images are as input to the network for reducing computational costs sub-net contains a group of convolutional layers and max-pooling to handle various es followed by a set of fully connected layers. The shallow sub-net includes a seri olutional layers followed by fractional stride convolutional layers up-sampling the pre 's output to lower the loss of details due to earlier pooling layers. However, the loss o ow sub-net relies on the output of the broad network. Merge1 is an example to illustrat 1_1 and Conv2_2 are merged for fusing features from different scales. In our structur ted 3 × 3 (kernel size) in each layer.  Table 2 lists the layers configuration in details. The convolutional layers are denot v2d", specific parameters are demonstrated by the number of channels, for instance, Con channel 16/32/32/32, which means the layers' block named Conv1_1 includes 4 convolu s connected sequentially and outputs 32 feature maps. Max-pooling layers are conducted 2 pixel window with stride 2. The fractional stride convolutional layer is denoted as "Upsam dified linear unit ReLU is used as an activation function. The shallow network aims for c  Table 2 lists the layers configuration in details. The convolutional layers are denoted as "Conv2d", specific parameters are demonstrated by the number of channels, for instance, Conv1_1 with channel 16/32/32/32, which means the layers' block named Conv1_1 includes 4 convolutional layers connected sequentially and outputs 32 feature maps. Max-pooling layers are conducted over a 2 × 2 pixel window with stride 2. The fractional stride convolutional layer is denoted as "Upsample", a modified linear unit ReLU is used as an activation function. The shallow network aims for crowd density estimation, while deep sub-net is for density classification. In shallow sub-net, the Conv4_2 layer block is a combination of convolutional layers, and "Upsample" applied to refine the details of the image further, while the following layer blocks in deep sub-net is with the same principal as it, which is beneficial to gain deep features thereby getting density map accurately. In deep sub-net, the Conv4_1 layer block is a combination of convolutional layers and "Max-pooling" utilized for reducing the dimension which is propitious to get abstract features so then facilitate classification, at the end the global average pooling is used to flatten the features to one dimension and output the classification results through Softmax.

Experimental Results
In this section, we propose the experimental details and evaluation results on four publicly available datasets: Shanghai Tech Part_A and Part_B [16], UCF_CC_50 [18], and WorldExpo'10 [3] dataset. For evaluation, the standard metrics used by many existing methods for crowd counting were used. MAE and MSE are applied to indicate the accuracy and robustness of estimation, respectively, where MAE is the mean absolute error and MSE is the mean squared error.

CAFN and MTCAFN Model Training and Testing
The proposed models are trained in a DELL PC station with a GPU Titan XP using Torch framework [33] equipped on different benchmark datasets, where variants combination to find the best way to perform the system. The Adam optimization performed during the training and evaluation with a learning rate of 0.00001 and momentum of 0.9.
First, four types of different column combinations (see Figure 6) with different dilate rates. Type1 is the combination of column1 (dilation rate = 2) with column2 (dilation rate = 3). Type2 is the fusion of column2 and colum3 (dilation rate = 4). Type3 combines the column1 and 3. Type4 merges all the columns. The experimental results are shown in Table 3, and we choose the Type1 model as the final way to train the CAFN network.
First, four types of different column combinations (see Figure 6) with different dilate rates. Type1 is the combination of column1 (dilation rate = 2) with column2 (dilation rate = 3). Type2 is the fusion of column2 and colum3 (dilation rate = 4). Type3 combines the column1 and 3. Type4 merges all the columns. The experimental results are shown in Table 3, and we choose the Type1 model as the final way to train the CAFN network.

Comparison of Datasets
In Table 4, the parameters of the benchmark are listed for 3 existing datasets: Num means the number of images; Max means the maximal crowd count; Min means the minimum crowd count; Ave is the average crowd count; and Total means a total number of labeled people. Shanghai tech dataset contains 1198 annotated images, in which a total of 330,165 people with centers of their heads interpreted. As far as we know, this dataset is the largest one regarding the number of annotated people. This dataset consists of two parts: there are 482 images in Part A, which are randomly crawled from the Internet, and 716 images in Part B, which are taken from the busy streets of metropolitan areas in Shanghai. The crowd density varies significantly between the two subsets, making an accurate estimation of the crowd more challenging than most existing datasets. Both Part A and Part B are divided into training and testing: 300 images of Part A are used for training and the remaining 182 images for testing, and 400 images of Part B are for training and 316 for testing.
Results from MTCAFN show that the double-column version achieves higher performance on Shanghai Tech Part A dataset with the lowest MAE, shown in Table 5. The UCF_CC_50 dataset includes 50 images with different perspective and resolutions. With arriving at an average number of 1280, the number of persons annotated per image varies from 94 to 4543. Five-fold cross-validation is performed following the standard setting in Reference [16]. Result comparisons of MAE and MSE are listed in Table 6. Zhang et al. firstly introduced WorldExpo'10 crowd counting dataset. This dataset contains 1132 annotated video sequences which are captured by 108 surveillance cameras, all from Shanghai 2010 World Expo. The author provided a total of 199,923 annotated pedestrians at the centers of their heads in 3980 frames. 3380 frames are used in training data. The testing dataset includes five different video sequences, and each video sequence contains 120 labeled frames. We train our model following the instructions are given in Section 3. Results are shown in Table 7. The proposed MTCAFN delivers the best results in Sce2 and Sce4 and promote the average MSE of 5 scenes.

System Verification
To validate our system on Nvidia Tx2 visualizing the experimental results, the predicted density map and the number of estimated crowd count. Compared with the ground truth and the number of labeled people. The quality of density maps is measured using two standard metrics: PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity in Image).
The higher the two indicators are, the more representative of the generated density map is. As seen in Figure 7, due to the different shooting angles and backgrounds, the following pictures have uneven density distribution and significant scale changes. The network can still distinguish the crowd and background under harsh conditions, accurately generating density maps with quite good quality, which can prove the effectiveness of the method. Besides, our method CAFN and MTCAFN could achieve predicting results in an average of 10 frame/sec. MCNN [16] 3.4 20.6 12.9 13.0 8.1 11.6 Switching-CNN [22] 4.4 15.7 10.0 11.0 5.9 9.4 CP-CNN [23] 2.9 14.7 10.5 10.

System Verification
To validate our system on Nvidia Tx2 visualizing the experimental results, the predicted density map and the number of estimated crowd count. Compared with the ground truth and the number of labeled people. The quality of density maps is measured using two standard metrics: PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity in Image).

Conclusions
In this paper, we presented a smart camera-aware crowd counting system for jointly adopting dilated convolutions and fractional stride convolutions. Atrous convolutions are devoted to enlarging receptive fields which is beneficial to incorporate abundant characteristics into the network, which enables the model for learning globally relevant discriminative features thereby accounting for large count variations in the dataset. Additionally, we employed fractional stride convolutional layers as the back-end to restore the loss of details due to max-pooling layers in the earlier stages, therefore allowing us to regress on full resolution density maps. The model structure has moderate complexity and strong generalization ability, which possess satisfactory density estimation performance in densely crowded scenes via the experiments on multiple datasets. MTCAFN is extended where shallow sub-net extracts pixel-level detail features for crowd density map estimation, deep network extracts high-level semantic features for crowd density classification. Multi-task learning is adopted to improve the loss function of crowd counting. Through the experimental comparison, the proposed system achieved better results and verified the feasibility and effectiveness of the method.
The proposed work has some limitation as follows. A perfect alerting system focuses not only on accurate crowd counting but also on crowd behavior prediction. Moreover, local information such as sub-group trajectory analysis of the crowd also needs to be investigated. In the future, we will also focus on the compression of the broad network to fit different real-time embedding systems.