Crowd Counting Using End-to-End Semantic Image Segmentation

Abstract: Crowd counting is an active research area within scene analysis. Over the last 20 years, researchers have proposed various algorithms for crowd counting in real-time scenarios, motivated by applications in disaster management systems, public events, safety monitoring, and so on. In this paper, we proposed an end-to-end semantic segmentation framework for crowd counting in densely crowded images. Our proposed framework was based on semantic scene segmentation using an optimized convolutional neural network. The framework successfully highlighted the foreground and suppressed the background, and encoded high-density maps through a guided attention mechanism. We obtained crowd counts by integrating the density maps. Our proposed algorithm classified the crowd count in each image into groups to adapt to variations in crowd size, and overcame the scale variations of crowded images through multi-scale features extracted from the images. We conducted experiments with four standard crowd-counting datasets, reporting better results as compared to previous work.

Author Contributions: Conceptualization, K.K.; methodology, K.K.; software and validation, K.K., W.A., D.N.; visualization, M.I.


Introduction
The challenging and meaningful task of precisely estimating the number of objects and persons in an image has several applications in the Computer Vision (CV) domain. Among these applications, crowd counting is widely used, and one of the most practical properties of image object counting is that it can be exploited both for security and development purposes. Crowd counting and image object counting also help in areas such as surveys and traffic management, and an accurate crowd count helps in emergency situations such as stampedes and fire events. Considering these factors, many researchers have been inclined to explore image-based object counting and its applications in various fields, and the literature already contains substantial contributions in these fields.
While abundant data are available for crowd counting, the major bottleneck lies in the annotation process [1]. This bottleneck can be reduced using crowd-sourcing, such as Amazon MTurk, or image-level annotations rather than bounding-box-focused ones. However, errors are possible when relying on crowd-sourced annotations, which necessitates models that can deal with noisy labels. Due to vast urbanization and an abrupt increase in the world population, substantial crowd gatherings such as religious and political events, parades, marathons, and concerts make crowd counting an indispensable service for managing and securing the crowd virtually and physically. Furthermore, crowd counting also helps in assessing the political significance of protests; it is not uncommon for different political parties to come up with different numbers for the same gathering. Nevertheless, monitoring crowds from surveillance videos is quite challenging because of the occlusion among people in the crowd. With the advent of effective deep learning algorithms and Convolutional Neural Networks (CNNs) in the computer vision field, applications of object and crowd counting have improved substantially. The structural and distribution patterns of all such applications are in some ways similar to each other; hence, improvement in one application implies improvement in related applications. This also implies that crowd-counting methods can be extended to crowd analysis applications including flow analysis, density estimation, crowd monitoring, and so on.
In this paper, we proposed an end-to-end Semantic Scene Segmentation (SSS) framework, which uses the concept of semantic segmentation for crowd counting. To the best of our knowledge, our proposed framework is the first to use the idea of SSS for the task of crowd counting. Our proposed method highlighted the head region by suppressing the non-head part through a novel optimized loss function. This guided sort of mechanism pays comparatively more attention to the head part and encodes the specific refined density map. We also utilized the classification function, which automatically adapts the changes occurring in crowd counting. We performed extensive experiments on four standard datasets, reporting better results as compared to previous results.

Related Work
Crowd analysis, in general, and counting, in particular, are very mature areas of CV due to their diverse applications. Many excellent works have been reported by researchers to address these fields. Some recent survey papers [41,42] can be explored to learn about crowd analysis and counting.
There are generally four major classes of crowd-counting methods: detection-based, regression-based, density-based, and CNN-based approaches. These four classes are discussed in the following paragraphs.
• Detection-based approaches: Initially, most of the work on crowd counting was performed with detection-based approaches [43][44][45][46][47]. These approaches apply a head detector through a sliding window on an image. Recent methods such as R-CNN [48][49][50], You Only Look Once (YOLO) [51], and the Single-Shot multibox Detector (SSD) [52] have been proposed and exploited; they attain high accuracy in sparse scenes but do not perform well in highly dense environments.
• Regression-based approaches: To address the issues of detection-based methods, regression-based approaches [22,53,54] were proposed that learn a mapping from image patches by extracting global features [55] or local features [56]. The global features include texture, edge, and gradient features; the local features include the Scale-Invariant Feature Transform (SIFT) [57], Local Binary Patterns (LBPs) [58], Histograms of Oriented Gradients (HOGs) [59], and Gray-Level Co-occurrence Matrices (GLCMs) [60]. To learn the mapping function for crowd counting, regression techniques [61] and Gaussian process regression [62] have been exploited. These algorithms solve the occlusion and background clutter issues of detection-based approaches, but spatial information is compromised, and regression-based techniques may overestimate the count of a sparse crowd.
• Density-based approaches: Density-based methods make use of features such as pixels or regions, which helps to retain location information while avoiding the disadvantages of regression-based approaches. Lempitsky et al. [19] exploited a density-based approach with a linear mapping between local features and density maps.
A nonlinear method, namely Random Forest Regression (RFR), was proposed to tackle the linear approach's issues by introducing a crowdedness prior and training two different forests with it [63]. The method outperforms the linear method and requires only a small amount of memory for storing the forests. The issue with this approach is that the standard features used extract low-level information that cannot be counted accurately with a high-quality density map.
• CNN-based approaches: Most current research is carried out with CNN algorithms because of their robust feature representation and improved density estimation. CNNs outperformed traditional models in predicting crowd density in [18,[64][65][66]]. Recently, improved CNN variants, such as the Fully Convolutional Network (FCN), have been proposed with enhanced architectures, density estimation performance, and crowd counting. Besides the FCN, many other CNN approaches have been proposed recently for density estimation and crowd counting [67]. Sang et al. [11] developed an improved crowd-counting approach based on the Scale-adaptive CNN (SaCNN). The CNN was used to obtain the crowd density map, which was further processed to find the approximate head count. The approach was tested on the ShanghaiTech dataset and worked well on both sparse and dense scenes. More recently, Zhang et al. [68] used a CNN to count people on metro platforms, developing a dataset of 627 images and 9243 annotated heads captured during peak and off-peak times on weekdays and weekends. The authors used the first 13 layers of VGG-16. The results on standard datasets such as ShanghaiTech and UCF-QNRF showed a smaller MAE and MSE as compared to state-of-the-art methods.
Accurate annotation of the ground truth is critical for crowd counting. Dot annotation, sometimes called landmarking, places dots in the image to mark the objects of interest. This technique is used in crowd counting, face recognition, and posture alignment. However, it is not only time-consuming, but prone to errors as well. While dot annotation is usually performed by a single annotator, Arteta et al. [1] proposed an approach whereby crowd-sourcing was used to accomplish the annotation. Thirty-five thousand volunteers were available for annotation, and as soon as an image received 20 annotations, it was removed from the system. As opposed to crowd-sourcing, no manual annotation is required with simulated data, since every object and its location are fully known. Lei et al. [69] developed a weak supervision model for crowd counting; weaker annotations only require the total count of objects. They employed a multiple density map estimation technique and obtained superior performance over existing approaches.
Tong et al. [70] developed a simple deep learning-based model for crowd counting using a smart camera. The proposed approach was based on multi-task learning to perform density-level classification. Furthermore, the potential loss of detail was overcome using transposed convolutional layers. The proposed method was used to estimate the crowd density if the number of people was more than a threshold.
Songchenchen's work [71] aimed to find head features using texture feature analysis and crowd image edge detection. The researcher also used a multi-column, multi-feature CNN for crowd counting. The proposed methods outperformed the state-of-the-art methods on datasets such as ShanghaiTech, UCSD, WorldExpo'10, and GCC. Later, he discussed the hardware implementation of the neural network architecture on an FPGA for crowd counting.
Zhang et al. [12] proposed a multi-column CNN to overcome large-scale changes in crowd images. A new dataset of 1198 images with more than 300,000 annotated heads was also developed. Another key benefit of this approach is that, once trained on one dataset, their model can easily work with a new dataset. Nevertheless, the approach is severely limited by the number of columns, i.e., only three branches. Cao et al. [12] developed the Scale Aggregation Network (SANet) for crowd counting based on the encoder-decoder model. Kang and Chan [5] used an image pyramid CNN for crowd counting while handling scale variations. Each scale of the image pyramid was fed to the FCN, which predicted a density map. Lastly, a 1 × 1 convolution combined the density maps at the various scales.
A near-real-time crowd counting approach for both images and videos using a deep CNN was developed by Bhangale et al. [72]. The proposed model required only five seconds to perform a headcount from the provided video. The researchers concluded that the optimal resolution was 300 × 450 pixels. The experiments were conducted on Google Colab using a Tesla K80 GPU while employing the Shanghai Tech dataset. The results on both the dense, as well as sparse datasets were better than the multi-column CNN of Zhang et al. [2], the SANet of Cao et al. [12], and the image pyramid by Kang and Chan [5].

Proposed Method
As compared to Traditional Machine Learning (TML), recent Deep Learning Methods (DLMs) have shown better performance for various visual recognition tasks. In the proposed work, we also employed a DLM for crowd counting. In this section, we discuss our proposed crowd-counting method using the concept of semantic image segmentation and the DLM.
The performance of a DLM relies on many factors, for example the kernel used, the number of convolutional layers, and the specific filters used in each layer. We used various combinations of Convolutional Layers (ConLs), each followed by a Maximum Pooling layer (MaxP). We also performed experiments regarding the size of the ConLs to be used. Details of these parameters are presented in Tables 1 and 2. We used ReLU as the activation function. As usual, a Deep Convolutional Neural Network (DCNN) has three kinds of layers: ConL, MaxP, and FCL; we used the same setting. A kernel was specified as N × M × C, with N the height and M the width of each of the C filters. Similarly, we represented the MaxP filters by P × Q, where P is the height and Q the width of each filter. Lastly, the FCL was the final layer, which performed the classification.
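As a minimal sketch, a ConL/MaxP/FCL stack of this kind can be expressed in Keras as follows. The filter counts, kernel sizes, and input resolution here are illustrative assumptions, not the exact configuration of Tables 1 and 2:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_backbone(input_shape=(224, 224, 3), num_groups=6):
    """Sketch of a ConL + MaxP stack ending in FCLs.

    Hyperparameters are illustrative assumptions, not the values
    reported in Tables 1 and 2 of the paper.
    """
    return models.Sequential([
        layers.Input(shape=input_shape),
        # ConL: N x M kernels with C filters, each followed by ReLU.
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),  # MaxP with P x Q = 2 x 2
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),             # first FCL
        layers.Dense(num_groups, activation="softmax"),  # 6 count groups
    ])
```

The final softmax layer matches the six count groups used by the classification branch described below.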

Model Learning
Our proposed network consisted of three parts: classification, SSS, and Density Estimation (DE). Our proposed crowd-counting model is presented in Figure 1. To extract features from images, we used the Feature Extractor Framework (FEF); the stages in Figure 2 represent the main blocks of the deep feature learning architecture.

Stage 1 handles the initial feature variations. Images contain many scaling variations due to different environmental circumstances; to overcome these, we used four receptive fields, each with sixteen filters. The output of Stage 1 is fed into the Stage 2 FCL. From Stage 2 onwards, to extract multi-scale features, we used 2 × 2 maximum pooling layers, and each ConL was followed by a Rectified Linear Unit (ReLU). We placed Spatial Pyramid (SP) pooling layers between the ConL and the Fully Connected Layer (FCL): the feature map extracted from the input images was fed to the SP pooling layers, whose output was given to the FCL of Stage 3. The shared module block in Stage 2 represents the SP module. Stages 3 and 4 take care of feature extraction at the different scales of the pyramid. Finally, in Stage 4, the third FC layer extracts the final features, which are fed into the classification, SSS, and DE modules. The details of the different parameters can be seen in Tables 1 and 2.

Our proposed classification part automatically learns the crowd's count distribution to adapt to the changes occurring in the crowd. We quantized the crowd count in each crowded image into several groups. We connected the FCLs to the backbone at the end; both FCLs were followed individually by ReLU and had 64 and 6 neurons, respectively, the six neurons corresponding to the count groups. We did not change the input image's size, in order to preserve the crowd distribution of the original image.
In summary, the SPP layers bridge the ConL and the FCL: the feature map extracted from the crowd images is fed to the SPP layers, whose outputs are provided to the FCL. In the classification phase, the counts from each database were classified into six groups, which adapted to changes in the crowd counts.
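The role of the SP pooling stage, producing a fixed-length vector for the FCL regardless of the input resolution, can be illustrated with a small NumPy sketch. The grid levels (1, 2, 4) are an assumption for illustration, not the paper's exact pyramid configuration:

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (H, W, C) feature map over an n x n grid for each
    pyramid level and concatenate the results, yielding a vector of
    length C * sum(n^2) that is independent of H and W."""
    H, W, C = feature_map.shape
    out = []
    for n in levels:
        # Split both axes into n roughly equal bins.
        h_edges = np.linspace(0, H, n + 1, dtype=int)
        w_edges = np.linspace(0, W, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[h_edges[i]:h_edges[i + 1],
                                   w_edges[j]:w_edges[j + 1]]
                out.append(cell.max(axis=(0, 1)))  # one maximum per channel
    return np.concatenate(out)
```

Feature maps of different spatial sizes thus map to vectors of identical length, which is what lets the FCL accept images at their original resolution.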
In the SSS part, the training data along with the Ground Truth (GT) annotations are given to the framework. In the DE part, we predicted the final density map under supervision from the GT density maps. We added the segmentation map to the estimated density maps, then fed the result to a ConL, and ReLU layers encoded the final density maps. We fixed the weight of the head regions higher, so more attention was given to the heads during density estimation. We also introduced a loss based on the Dice coefficient in the segmentation part. Similarly, we introduced a Euclidean distance loss in the density estimation, which further optimized the estimated density map.

CNN Optimization
Our proposed framework included classification, segmentation, and crowd density estimation. To alleviate the overfitting problem, we followed the methodology of [73]. We optimized our framework by minimizing four loss functions, which also included an intermediate supervision loss. In the DE, we utilized the Euclidean distance, which optimized the Estimated Density (ED) map in a better way. The resulting losses are given in Equations (1) and (2):

Loss_int = (1/2M) Σ_j ||d̂_j − D_j||²  (1)

Loss_den = (1/2M) Σ_j ||D̂_j − D_j||²  (2)

In Equation (1), d̂ represents the predicted density in the intermediate supervision process. Similarly, D̂_j shows the final ED, D_j shows the GT density, and M represents the number of pixels in the GT density map.
We introduced a novel loss in the segmentation part, based on the Dice coefficient. In simple words, the Dice coefficient is twice the area of overlap between the predicted segmentation and the ground truth, divided by the total number of pixels in the two maps; its range is between 0 and 1. We optimized this loss to estimate the segmentation map for the head part. We quantized the crowd counts into six groups. For example, if the crowd counts in a densely crowded image ranged from 1 to 600, images with counts from 1 to 100 would lie in the first class, and so on. For the classification we utilized the cross-entropy loss function, which is given in Equation (3).
Loss_X−entropy = −(1/N) Σ_b Σ_c x_b^c log(x̂_b^c)  (3)

where N is the total number of samples used for training and M represents the number of classes, which in our case was six. Similarly, x_b^c represents the GT class label, and x̂_b shows the classification output. We represented the weighted loss function by the following equation:

W.L = Loss_int + Loss_den + λ · Loss_X−entropy  (4)

where we fixed the value of λ at 0.02.
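The loss terms above can be sketched in NumPy as follows. The exact reductions (per-pixel averaging, smoothing constants) are assumptions, since the text does not fully specify them:

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss between a predicted density map and
    the GT density map, averaged over the M pixels (Eqs. (1)-(2))."""
    m = gt.size
    return np.sum((pred - gt) ** 2) / (2 * m)

def dice_loss(pred_mask, gt_mask, eps=1e-7):
    """1 - Dice coefficient: twice the overlap divided by the total
    (soft) foreground mass of both segmentation maps."""
    inter = np.sum(pred_mask * gt_mask)
    denom = np.sum(pred_mask) + np.sum(gt_mask)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Multi-class cross-entropy over N samples (Eq. (3)).
    probs: (N, K) softmax outputs; labels: (N,) integer classes."""
    n = labels.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def weighted_loss(loss_int, loss_den, loss_xent, lam=0.02):
    """Eq. (4): W.L = Loss_int + Loss_den + lambda * Loss_X-entropy."""
    return loss_int + loss_den + lam * loss_xent
```

A perfect segmentation gives a Dice loss of 0, and the λ = 0.02 weighting keeps the classification term from dominating the density losses.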

Data Annotation
Tools: For a machine learning task, GT data are created through annotation. The original data are in the form of audio, images, text, etc. Through a learning process on the GT data, a computer learns to recognize similar patterns in data it has not seen before. The annotation categories vary, including 3D cuboids, lines, bounding boxes, dots, and landmark annotation. In the crowd-counting case, dot annotation is normally the first step in creating the GT and is carried out with tools such as RectLabel and Label Me.
A tool for online annotation was also developed in Java and Python. This specific tool creates data for head points only. Two kinds of labels are supported by this tool: a point and a bounding box. In this method, the image is first zoomed in, and then the head part is labeled with a desired scaling factor. The image is then divided into patches of size 16 × 16. This specific size allows annotators to create the GT at various scales of the original crowded image size. Annotation with this tool is comparatively easy, and the quality is also good. For details, readers are referred to [54].
Pointwise annotation: Here, annotation is divided into two stages: initial labeling, followed by refinement of that labeling. In the first stage, annotators perform the labeling process; however, this method of creating the GT is laborious and time-consuming. After the GT is created, additional individuals review the preliminary annotation, which brings a degree of refinement to the whole labeling process.
Annotation at the box level: This is a more time-consuming task, as annotation is performed in three steps. Initially, ten to twenty percent of the points in an image are typically selected for drawing a bounding box. Secondly, for those points having no box, a linear regression method is adopted to estimate the nearest box along with its size. Third, a manual refinement of the estimated box labels is performed.
In summary, GT labels are produced through a manual process. This labeling is performed without any automatic labeling tool. This labeling depends on a subjective perception of a person who is involved in the labeling task. Hence, giving an accurate GT label in this scenario is complex, and chances for error exist.
Unlike these methods, which involve laborious manual work and time-consuming effort, we adopted a different strategy for data annotation and creating the GT. For all training images, a point located at the center of each head was provided. We encoded the GT density map by placing a normalized Gaussian kernel on every annotated point p:

D(x, y) = Σ_{p ∈ S} M((x, y) − p; µ, σ)  (5)

where (x, y) is the location of a specific pixel in the image and S represents the set of annotated points. M(·; µ, σ) is the normalized Gaussian kernel with mean 0 and variance 4, and we used a window size of 15 × 15. We used this method to generate the GT density maps. As it is impossible to manually label GT segmentation data for larger datasets, we proposed an effective way to make the GT segmentation map share the same background and foreground as the GT density maps.
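GT density map generation in the spirit of Equation (5) can be sketched as follows (window 15 × 15, variance 4, kernel normalized so each head contributes a count of one). The border-clipping behavior is an implementation assumption:

```python
import numpy as np

def gaussian_kernel(size=15, sigma2=4.0):
    """Normalized 2D Gaussian window (mean 0, variance 4, 15 x 15)."""
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma2))
    return k / k.sum()  # each head integrates to a count of 1

def density_map(shape, points, size=15, sigma2=4.0):
    """Place one normalized Gaussian on every annotated head point.

    shape: (H, W) of the image; points: iterable of (row, col) centers.
    Kernels are clipped at the image borders (an assumption)."""
    H, W = shape
    half = size // 2
    dmap = np.zeros((H, W), dtype=np.float64)
    kern = gaussian_kernel(size, sigma2)
    for (r, c) in points:
        r0, r1 = max(r - half, 0), min(r + half + 1, H)
        c0, c1 = max(c - half, 0), min(c + half + 1, W)
        # Matching slice of the kernel for border-clipped windows.
        dmap[r0:r1, c0:c1] += kern[
            r0 - (r - half): size - ((r + half + 1) - r1),
            c0 - (c - half): size - ((c + half + 1) - c1)]
    return dmap
```

Integrating the resulting map (summing all pixels) recovers the head count for interior points, which is exactly how the counts are obtained from the predicted density maps.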

Experimental Setup
We performed our experiments on an Intel i7 CPU with 16 GB of RAM. The graphics processing unit was an NVIDIA 840M. All tests were performed with Google TensorFlow and Keras in a Python environment. We trained the model for 30 epochs with a training batch size of 125. We kept this setting for all four datasets and their experiments.

Databases
We evaluated the performance of our proposed crowd-counting framework on four datasets: NWPU-Crowd, UCF-QNRF, ShanghaiTech, and WorldExpo'10. A summary of the crowd-counting datasets is presented in Table 3. We provide details about these datasets in the following paragraphs.
• UCF-QNRF: This is a recent database introduced for various crowd analysis tasks including crowd counting. The dataset has 1535 images with massive variation in density. Images in UCF-QNRF have higher resolutions (up to 9000 × 6000) and were collected from Hajj footage, web searches, and Flickr. Annotations for the data are also provided. Lighting variations, diverse density conditions, and changes in viewpoint are the main characteristics of the dataset. Images were collected under unconstrained conditions with content such as buildings, roads, sky, and vegetation. Due to all the mentioned conditions, the dataset is challenging and well suited for deep learning-based models.
• ShanghaiTech [2]: This dataset has the particular feature of large-scale counting. It has 1198 images with 330,165 annotated heads. The dataset consists of two parts: Part A has 482 images and Part B 716. Part A images were collected from web sources and Part B images from the streets of Shanghai. The authors defined various splits for experiments. Most of the literature uses 300 Part A images for training and the remaining 182 for testing; similarly, 400 Part B images are used for training and 316 for testing. Diverse scenes and highly varying density conditions were included in the data collection to make the database challenging.
• WorldExpo'10 [66]: All data for WorldExpo'10 were collected from 108 surveillance cameras installed at various places. The dataset is well suited for cross-scene scenarios. It has 3980 frames of size 576 × 720 each, with 199,923 labeled pedestrians.
To ensure diversity in the collected data, disjoint bird's-eye views were used by the creators of the dataset. The reported literature uses 1127 one-minute-long videos as training data. Since it contains less data than the State-Of-The-Art (SOTA) datasets, the database is less suitable for experiments with deep learning-based models. Some sample images from these datasets are shown in Figures 3-6. The datasets are summarized in Table 3.

Quantification of Tasks
We represented the count estimate for image i by C_i. C_i is a single metric and does not provide information about the distribution of people in an image; however, it helps predict the size of a crowd, which may span many kilometers. An idea was presented in [75] that divides the area occupied by the crowd into smaller sections, estimates the average number of participants in each section, and also estimates the mean density of the covered area. However, counting for several crowded images at many locations is comparatively difficult. Because of this complexity, two additional metrics, the Mean Absolute Error (MAE) and Mean Squared Error (MSE), are frequently used by researchers, and we also report our work with these two measures. Mathematically, these measures are defined as:

MAE = (1/M) Σ_{j=1}^{M} |Y_j − Y'_j|  (6)

MSE = sqrt( (1/M) Σ_{j=1}^{M} (Y_j − Y'_j)² )  (7)

In Equations (6) and (7), M represents the number of test samples, Y_j the ground-truth count, and Y'_j the estimated count for the jth sample. We noticed that crowd localization is a less explored area, and researchers have not yet firmly established evaluation metrics for localization problems. We observed that [54] was the only work proposing a one-to-one matching; however, this idea also leads to some overly optimistic issues, and the authors defined no penalization for over-detection cases. This method has not been widely acknowledged. We evaluated our method with precision, recall, and the F-measure, defined in terms of True Positives (TPs), False Positives (FPs), and False Negatives (FNs). The TP count is the number of heads correctly detected; the FN count is the number of heads incorrectly detected as non-heads; and the FP count is the number of non-heads incorrectly detected as heads. Mathematically:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F-measure = 2 · Precision · Recall / (Precision + Recall)
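These evaluation metrics can be sketched directly from their definitions. Note that crowd-counting papers conventionally report the rooted form of the squared error under the name MSE, which is the convention assumed here:

```python
import numpy as np

def mae(gt_counts, est_counts):
    """Mean Absolute Error over M test samples (Eq. (6))."""
    gt = np.asarray(gt_counts, dtype=float)
    est = np.asarray(est_counts, dtype=float)
    return np.mean(np.abs(gt - est))

def mse(gt_counts, est_counts):
    """Rooted mean squared count error (Eq. (7)), as conventionally
    reported under the name MSE in the crowd-counting literature."""
    gt = np.asarray(gt_counts, dtype=float)
    est = np.asarray(est_counts, dtype=float)
    return np.sqrt(np.mean((gt - est) ** 2))

def precision_recall_f(tp, fp, fn):
    """Localization metrics from TP/FP/FN head-detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, ground-truth counts [10, 20] against estimates [12, 18] give an MAE of 2 and an MSE (rooted) of 2.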

Comparative Analysis
The results of the proposed method and their comparison are presented in Table 4. From Table 4, it is clear that we obtained better results in most cases as compared to previous work. We summarize the concluding remarks in the paragraphs that follow.
• We reported our results in the form of precision, recall, and the F-measure, along with the MAE and MSE. All these values are reported in Tables 4 and 5. From both tables, it is clear that our results are much better than previous results.
• In the last ten years, crowd counting has been explored significantly by researchers; a summary of the results can be seen in Table 4. Researchers have introduced several datasets that address the crowd-counting problem. We noticed that less emphasis has been given to crowd behavior analysis and crowd localization. Because of its many applications, crowd counting has been targeted more than other crowd analysis tasks, and for the same reason our work also focused on crowd counting.
• The labeling process for creating GT annotation data has been performed manually. We observed that this is a time-consuming and error-prone process, since it depends entirely on the subjective perception of the person involved in labeling. Automatic labeling is comparatively a better option, but it is not yet mature enough to be used effectively for research. In our work, we introduced an automatic labeling process for creating GT data.
• As discussed earlier, crowd counting is an active area of research due to its diverse applications. Table 4 shows a detailed summary of the research conducted on SOTA datasets. We reported all the metrics, including the MAE, MSE, precision, recall, and F-measure, from the original research papers. It is clear from the table that all these metrics have improved with the passage of time.
Much of this improvement came, in particular, with the introduction of improved deep learning methods.
• Some research papers reported that TML methods showed better performance as compared to DLMs. This comparison does not claim that hand-crafted features are better than deep learning-based methods; rather, we argue that a better understanding of deep learning-based methods is needed for the crowd-counting task. For example, the limited-data scenario is a major drawback faced by deep learning-based methods. We noticed that the performance of TML methods is acceptable on data collected in controlled environmental conditions; however, when these TML methods are applied to data collected in the wild, performance drops by a huge amount. On the other hand, a DLM extracts a higher level of abstraction from the data; as a result, DLMs outperform traditional methods and reduce the need for feature engineering. It is also worth noting that DLMs face some concerns from the research community. For instance, DLMs and their applications are complicated procedures that require various inputs from the practitioner's end, and most researchers rely on a trial-and-error strategy; hence, these methods are time-consuming and heavily engineered. Nevertheless, the DLM remains the definitive choice for the crowd-counting task.
• As can be seen from Table 4, most DLMs for crowd counting use DCNNs. However, most DCNN-based methods employ pooling layers, which results in comparatively low resolution and some feature loss. Deeper layers extract high-level information, whereas comparatively shallower layers extract low-level information and features, including spatial information. We suggest that information from both deeper and shallower layers should be combined for better results. This would yield more reasonable accuracy and also reduce the count error.
• Crowd counting is an active area of research in computer vision, and tremendous progress has been reported in the last couple of years. From the results reported to date, it is evident that most metrics, such as the MAE, MSE, and F-measure, have improved. However, following the trend of computer-vision developments with DLMs in various application scenarios, it is clear that crowd counting is not yet a mature research area.
As the training phase of DLMs faces problems due to limited data, knowledge transfer [76,77] is an option for researchers to explore.