A Systematic Deep Learning Based Overhead Tracking and Counting System Using RGB-D Remote Cameras

Featured Application: The proposed technique is an application for people detection and counting that is evaluated on several challenging benchmark datasets. It can be applied in crowd assistance systems that help to find targeted persons, to track functional movements and to maximize the performance of surveillance security.

Abstract: Automatic head tracking and counting using depth imagery has various practical applications in security, logistics, queue management, space utilization and visitor counting. However, no currently available system can clearly distinguish between a human head and other objects in order to track and count people accurately. For this reason, we propose a novel system that can track people by monitoring their heads and shoulders in complex environments and also count the number of people entering and exiting the scene. Our system is split into six phases. First, preprocessing is done by converting videos of a scene into frames and removing the background from the video frames. Second, heads are detected using the Hough Circular Gradient Transform, and shoulders are detected by HOG-based symmetry methods. Third, three robust features, namely fused joint HOG-LBP, energy-based point clouds and fused intra-inter trajectories, are extracted. Fourth, Apriori-Association is implemented to select the best features. Fifth, deep learning is used for accurate people tracking. Finally, heads are counted using cross-line judgment. The system was tested on three benchmark datasets: the PCDS dataset, the MICC people counting dataset and the GOTPD dataset, on which it achieved counting accuracies of 98.40%, 98% and 99%, respectively.


Introduction
Head and shoulders detection has become a research hotspot; it plays a significant role in people counting [1] and crowd analysis, which can be used in several practical applications such as surveillance, logistics, resource management and public transportation systems [2,3]. Many studies have been carried out on RGB image based head and shoulders counting but, with the development of depth cameras and sensors, researchers are now studying RGB-Depth images for crowd counting using head and shoulders tracking. Compared with RGB images, RGB-D images provide additional and more general depth map information for the detection of heads and shoulders.
Computer vision techniques provide remarkable performance improvements for automatic head and shoulders detection and tracking in complex indoor/outdoor environments [4,5]. However, research and development are usually carried out on whole-body detection and counting using RGB videos, which are challenged by multiple issues such as variations in occlusion, illumination, clutter, shadows, etc. Thus, different RGB-D cameras (e.g., Kinect V1 [6][7][8][9][10], Vzense DCAM and many more) are used to solve these issues by providing depth information. However, head and shoulders counting using depth datasets is still a challenging task for many researchers due to various unsolved problems related to occlusion and noise.
Vision-based head counting is a challenging task that involves different techniques such as object detection, human detection, object and human tracking and recognition [11]. Techniques used in this area fall into three main streams: (1) clustering-based methods, (2) regression-based methods and (3) detection-based methods. Clustering-based methods target certain objects, track their features and cluster the object trajectories to count them [12][13][14][15][16][17]. Regression-based methods learn a regression function from human and non-human object features and utilize it to count people [18,19]. Detection-based methods share a common architecture divided into image/video pre-processing, object or body detection, feature extraction and classification [20][21][22][23][24][25][26][27]. These three main streams are further divided according to the data types they use, e.g., color, depth or hybrid videos. These approaches face some common issues during real-time people counting under practical conditions, e.g., restricted camera angles, computational time and complexity, and failure to handle cluttered scenes and excessively occluded images [28][29][30][31][32][33][34]. In this paper, we describe a novel method of head and shoulders tracking and counting using depth datasets that addresses such issues.
Our proposed workflow is divided into six main phases. First, video preprocessing is done, in which videos are converted into frames and complex backgrounds are removed from the image frames. Second, heads are detected in the frames using the Hough Circular Gradient Transform and shoulders are detected by a joint HOG based symmetry method. Third, robust features, namely fused joint HOG-LBP, energy-based point clouds and fused intra-inter trajectories, are extracted. Fourth, Apriori-Association is implemented to select the best features. Fifth, a Convolutional Neural Network (CNN) is used for accurate people tracking. Finally, heads are counted using the cross-line judgment technique.
The main contributions of our system are listed below:
• Complex backgrounds with excessive occlusions in videos cause mis-detection of individual sets of heads and shoulders. We use novel techniques to mitigate the problems associated with occlusions and to more precisely detect individual sets of heads and shoulders.
• Our salient feature vectors provide far better accuracy than other state-of-the-art techniques.
• The Apriori-Association rule is used for the selection of the ideal sets of features along with the CNN classifier for head tracking.
• Our head and shoulders tracking and counting (HASTAC) system performance is evaluated on three benchmark datasets: (1) the PCDS dataset, (2) the MICC people counting dataset and (3) the GOTPD dataset. Our proposed model was fully validated for its efficacy, outperforming other state-of-the-art methods.
This article is structured as follows: Section 2 describes related work. Section 3 gives a detailed overview of the proposed HASTAC model. In Section 4, the proposed model's performance is assessed on three publicly available benchmark datasets through various experiments. Lastly, in Section 5, we sum up the paper and outline future directions.

Related Work
Over the past few years, various studies on head tracking and counting using RGB and RGB-D datasets have been reported. In this section, we give a broad overview of the latest techniques and methodologies used in these systems. They are divided into two main streams: (1) head tracking and counting using RGB datasets (see Section 2.1) and (2) head tracking and counting using RGB-D datasets (see Section 2.2).

Head Tracking and Counting using RGB Datasets
Many head tracking and counting systems that work on RGB datasets have been developed in recent years. Table 1 gives a detailed account of these systems. For example, a system based on a Gallery-Probe database was developed to count the people who appear simultaneously on two cameras located at different positions with overlapping views. A trained head detector was used to detect heads in the images taken from the two cameras, and these images were passed to a Siamese network. A human re-identification step was then carried out to count the people in the overlapping area. Finally, the median of the total number of individuals counted across the video intervals at a certain time was taken as the class attendance.

Head Tracking and Counting using RGB-D Datasets
Over the past few years, with the arrival of depth cameras and sensors, many researchers have been carrying out studies on head tracking and counting using depth imagery. Table 2 gives a detailed account of these recent systems. Table 2. Detailed explanation of head tracking and counting systems that used RGB-D datasets.

Paper: People counting based on head and shoulder information [40]
Dataset: Depth surveillance videos
Methodology: The nearest neighbor interpolation technique was used for background subtraction, which results in greater computational efficiency. Candidate heads and shoulders were detected using edge detection and circle detection techniques. Head tracking was done by tracking the detected circles using the nearest neighbor interpolation method. Finally, a virtual line was used to count the people crossing it.
Results: Precision was measured in two scenarios: (1) two to five people crossing the line without holding any object, and (2) people with bags and luggage crossing the virtual line. The precision in the two scenarios was 100% and 95%, respectively.

Paper: Benchmark data and method for real-time people counting in cluttered scenes using depth sensors [41]
Dataset: PCDS dataset
Methodology: The system was developed to calculate the number of people entering or exiting a bus. Initially, the background was removed from each frame of a video by the farthest background model. A 3D human model and the seed fill method reliably detect the human heads in frames. The human heads were then tracked to identify their trajectories, which further helps in human counting.
Results: The dataset is divided into four parts. People counting accuracies over these four parts on entering were 85.40%, 83.25%, 77.54% and 75.32%; on exiting they were 93.04%, 92.66%, 93.71% and 91.30%, respectively.

Paper: Real-time people counting from depth imagery of crowded environments [42]
Dataset: MICC dataset
Methodology: This system was developed to count people in crowded environments. To detect the head of an individual, background subtraction was first implemented using selective running average background subtraction. Connected components were used for head tracking, and edges were detected to eliminate the problem of overlapping connected pixels. A multi-target tracker using greedy data association was used to track the entrance and exit of people from a designated area.
Results: The precision rate achieved on the MICC dataset was 97.9%.

Paper: Depth driven people counting using deep region proposal network [43]
Dataset: CBSR dataset
Methodology: The main goal of this system was to count people in crowded environments. The authors used a CNN for head detection, explored the impact of the number and quality of RPN anchors on the Faster RCNN model, and improved its performance by proposing a new solution.
Results: The precision rate was 97.54%.

Paper: 3D head pose estimation through facial features and deep convolutional neural networks [44]
Dataset: Pointing'04, BU, AFLW and ICT-3DHPE
Methodology: The authors introduced an end-to-end face parsing algorithm that addresses the challenging problem of face pose estimation. A face parsing model was trained through DCNNs by extracting useful information from different face parts; it provides a class label for each pixel in a face image. A probabilistic classification technique was used to create PMAPS in the form of grayscale images for each face class.
Results: The accuracy achieved on the Pointing'04 dataset was 96.5%, and the MAE on BU, AFLW and ICT-3DHPE was 3.6, 2.1 and 3.0, respectively.

The Proposed System Methodology
This section discusses the overall methodology used in our HASTAC system. The system framework is divided into six major steps. First, video preprocessing is done, in which the video is converted into frames and complex backgrounds are removed from the image frames using the Kernel Density Estimation (KDE) technique. Second, heads are detected in the frames using the Hough Circular Gradient Transform and shoulders are detected by the joint HOG based symmetry method. Third, robust features, namely fused joint HOG-LBP, energy-based point clouds and fused intra-inter trajectories, are extracted. Fourth, Apriori-Association is implemented to select the best features. Fifth, a Convolutional Neural Network (CNN) is used for accurate people tracking. Finally, heads are counted using the cross-line judgment technique. The overall architecture of our HASTAC system is shown in Figure 1.

Preprocessing Stage
During video pre-processing, the first step is the conversion of videos into image frames. Thus, the depth frames are extracted from static videos V = [F_1, F_2, ..., F_n], where n is the number of frames. The next step is to remove complex backgrounds from these depth frames using the Kernel Density Estimation (KDE) technique. KDE is a non-parametric technique for estimating the densities of the pixels. The main idea of KDE is that each frame's background is modeled by the histogram of the N most recent samples of each pixel, where each sample is smoothed by a kernel (generally a Gaussian kernel). The main objective of using KDE is to detect changes that occur frequently in the background and to accurately detect and identify the target objects in the foreground with high sensitivity. This technique works by selecting the most recent frames of a video and updating them continuously in order to capture changes in the background. These frames are collected as samples and their pixels are further processed to obtain the intensity values of each pixel.
Using the intensity value samples of the pixels, the probability that a pixel has the intensity value i_T at time T can be estimated with the kernel estimator E as [45]:

P(i_T) = (1/n) Σ_{t=1}^{n} E(i_T − i_t) (1)

where i_1, i_2, ..., i_n are the recent intensity value samples of that pixel. Using a Gaussian distribution with bandwidth σ for the depth frames, E can be calculated as in [45]:

E(i_T − i_t) = (1/√(2πσ²)) exp(−(i_T − i_t)²/(2σ²)) (2)

Given the probability estimate, a pixel is considered a foreground pixel if P(i_T) < Th_1, where Th_1 is a global threshold that can be adjusted for every image frame in order to obtain a desired false positive rate. Figure 2a,b show the results obtained after background subtraction using the KDE technique on the PCDS and GOTPD datasets, respectively.
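As a simple illustration of this stage, the following Python sketch estimates the per-pixel density of Equations (1) and (2) over a buffer of recent depth frames and thresholds it to obtain a foreground mask; the bandwidth SIGMA and threshold TH1 are placeholder values, not the settings used in our experiments.

```python
import numpy as np

SIGMA = 5.0   # Gaussian kernel bandwidth in depth units (assumed)
TH1 = 1e-3    # global threshold on the density estimate (assumed)

def kde_foreground(sample_frames, current_frame):
    """sample_frames: (n, H, W) recent depth frames; current_frame: (H, W)."""
    samples = sample_frames.astype(np.float32)
    diff = current_frame[None, :, :] - samples               # i_T - i_t for each sample
    kernel = np.exp(-(diff ** 2) / (2 * SIGMA ** 2)) \
             / np.sqrt(2 * np.pi * SIGMA ** 2)               # Eq. (2), Gaussian kernel
    density = kernel.mean(axis=0)                            # Eq. (1): P(i_T) per pixel
    return density < TH1                                     # foreground mask
```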

Head and Shoulders Detection
Hough Circular Gradient Transform (HCGT) is a technique that detects circles by first passing an imperfect image through an edge detection phase. In our proposed work, the HCGT technique is used to detect heads in an image by first finding the image edges; these edges then help to find the circular heads by considering the local gradient (a parameter-level sketch is given at the end of this subsection). A circle in an image is defined using Equation (3):

(x − p)² + (y − q)² = r² (3)

where p and q are the coordinates of the circle center, r is the radius and x and y are arbitrary edge points. Because the dataset images are taken from top-view cameras, the heads are likely to appear circular. Circular candidates are detected in the Hough parameter space by voting and selecting the local maxima in an accumulator matrix. The location of each non-zero pixel is identified to get the centers of the heads from the points in the accumulator that are above the threshold and larger than their candidate neighbors. These candidate centers are then sorted in descending order by their accumulator values, so that the most supported centers appear first. All the non-zero pixels are considered for every center and sorted according to their distance from the circle centers. Considering the distances from smallest to largest, the circle radius that is less than the given threshold thresh_R is selected. A candidate center is kept if it has sufficient support from the non-zero edge pixels and is sufficiently distant from any previously selected center. Thus, the HCGT technique divides the problem of finding a circular head into two sub-stages: first, the candidate centers are found, and then the appropriate circle radius is found. A 2D array is required to store the votes of each edge point, and the distances between points are accumulated to find the circle radius. Figure 3 shows the final results of head detection using the HCGT technique on the GOTPD dataset.

For the detection of shoulders, the joint HOG symmetry-based detection method is used. This HOG detection method uses two descriptors of the same size, which exploit the symmetry and continuity of shoulder contours to obtain higher discrimination and greater detection accuracy. With d_i and d_j as the two descriptors, the joint HOG features of a sample p are formed by concatenating them as [46]:

H_p = [H(d_i), H(d_j)] (4)

The pixel at position (x, y) has a depth value l, a gradient magnitude g_m and a gradient direction θ. We extracted gradients with a 1D centered operator [−1, 0, 1]. The horizontal and vertical gradients are calculated as [46]:

hg_x(x, y) = l(x + 1, y) − l(x − 1, y) (5)

hg_y(x, y) = l(x, y + 1) − l(x, y − 1) (6)

whereas the gradient magnitude and direction at position (x, y) are calculated as [46]:

hg(x, y) = √(hg_x(x, y)² + hg_y(x, y)²) (7)

θ(x, y) = tan⁻¹(hg_x(x, y)/hg_y(x, y)) (8)

Thus, for each cell of 4 × 4 pixels, the gradient magnitudes are accumulated into bins according to their gradient directions, and a histogram of gradients is obtained. The directional range for the joint HOG features of the shoulders is 0-180°, with 5 bin directions. The joint HOG features of the shoulders are then expressed as the concatenation of the per-block histograms [46]:

H = [h_1, h_2, ..., h_B] (9)

Finally, these HOG features are normalized to get the joint HOG features of the different blocks:

h_b ← h_b / √(‖h_b‖² + ε²) (10)

where h_b is the histogram of block b and ε is a small constant. Figure 4 shows the results obtained for shoulder detection on the GOTPD dataset.
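The head detection step can be sketched with OpenCV's implementation of the Hough gradient method; the file name and all parameter values below (accumulator resolution, minimum center distance, vote threshold and the radius range playing the role of thresh_R) are illustrative assumptions that would need tuning per dataset.

```python
import cv2
import numpy as np

depth8 = cv2.imread("depth_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame path
blurred = cv2.medianBlur(depth8, 5)        # suppress depth noise before edge voting
circles = cv2.HoughCircles(
    blurred, cv2.HOUGH_GRADIENT,
    dp=1,                                  # accumulator at full image resolution
    minDist=40,                            # minimum spacing between head centers
    param1=100,                            # upper Canny edge threshold
    param2=25,                             # accumulator vote threshold for centers
    minRadius=10, maxRadius=60)            # plausible top-view head radii (assumed)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int).tolist():
        cv2.circle(depth8, (x, y), r, 255, 2)   # mark each detected head
```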

Feature Extraction
To track the number of heads in depth frames, local and global feature extraction plays a vital role. Therefore, the fused joint HOG-LBP, energy-based point cloud and fused intra-inter trajectory features are passed to an Apriori-Association algorithm to remove unnecessary redundant features (see Algorithm 1).

Fused Joint HOG-LBP
After the normalization of all the HOG features to obtain the joint HOG features of the different blocks (see Section 3.2), the histograms of all the overlapping blocks are collected over the detection window [46]. The detection window values are then fused with Local Binary Pattern (LBP) features to improve the performance of feature extraction. LBP labels image pixels by comparing each pixel with its neighbors and converting the comparisons into a binary code. The image window is divided into cells of 16 × 16 pixels and, within each cell, every central pixel's value is compared with its 8 neighboring pixels: if the central pixel's value is greater than that of a neighboring pixel, that neighbor is assigned the value 0; otherwise, it is assigned the value 1. Figure 5 shows the fused joint HOG-LBP results on the PCDS dataset.
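A minimal sketch of this fusion, assuming scikit-image's HOG and LBP implementations and illustrative cell and neighborhood sizes, is the concatenation below.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def fused_hog_lbp(window):
    """window: 2D grayscale/depth detection window as a float array."""
    hog_vec = hog(window, orientations=5,            # 5 bins over 0-180 degrees
                  pixels_per_cell=(4, 4),
                  cells_per_block=(2, 2))
    lbp = local_binary_pattern(window, P=8, R=1)     # 8 neighbors per pixel
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(257), density=True)
    return np.concatenate([hog_vec, lbp_hist])       # fused feature vector
```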

Energy Based Point Clouds
The energy-based point cloud method is quite similar to the geodesic distance algorithm. To our knowledge, it is used here for automatic head counting for the first time. It is robust, efficient and simple to implement. In this method, a central point v ∈ V is selected as the anchor point on the head and is given a fixed distance of zero, d(v) = 0. It is inserted into a priority queue S, with priority based on the smallest distance. All the other points s ∈ V are labeled with distance d(s) = ∞. One point v is selected from the priority queue; then, based on the geodesic distance algorithm, which works on the principle of the Dijkstra algorithm, the shortest distance from the central point to the other, varying fiducial points is found. The energy-based point clouds can then be displayed. These point clouds change according to the varying positions of the head, i.e., as the positions of the fiducial points change. The distances from the central point to the other points are used as the optimal features. The change in distance between fiducial points is calculated as [33]:

d_{p,q} = min(D_x, D_y) + 1 (11)

where D_x = min(d_{p+1,q}, d_{p−1,q}) and D_y = min(d_{p,q+1}, d_{p,q−1}). Figure 6 shows the results for the energy-based point clouds obtained on the PCDS dataset.
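A Dijkstra-style computation of these geodesic distances on a depth frame can be sketched as follows; the 4-connected grid and the depth-difference edge cost are our illustrative assumptions, not the exact formulation of [33].

```python
import heapq
import numpy as np

def geodesic_distances(depth, anchor):
    """depth: (H, W) depth frame; anchor: (row, col) of the head anchor point."""
    h, w = depth.shape
    dist = np.full((h, w), np.inf)
    dist[anchor] = 0.0                         # d(v) = 0 at the anchor
    queue = [(0.0, anchor)]                    # priority queue S, smallest distance first
    while queue:
        d, (p, q) = heapq.heappop(queue)
        if d > dist[p, q]:
            continue                           # stale queue entry
        for dp, dq in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            np_, nq = p + dp, q + dq
            if 0 <= np_ < h and 0 <= nq < w:
                # grid step plus depth difference as the edge cost (assumed)
                step = 1.0 + abs(float(depth[p, q]) - float(depth[np_, nq]))
                if d + step < dist[np_, nq]:
                    dist[np_, nq] = d + step
                    heapq.heappush(queue, (d + step, (np_, nq)))
    return dist
```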

Fused Intra-Inter Trajectories
We propose "fused intra-inter depth silhouette localized point trajectories" for the first time for human head tracking and counting. This trajectory method uses a subset X containing a set of human joints, which gives localized points from which to form trajectories. First, a subset X containing five localized points is plotted on the human depth silhouette s, i.e., X = {head, neck, left_shoulder, right_shoulder, chest}. The number of localized points varies according to the number of human depth silhouettes N in each frame, i.e., N = {n_1, n_2, ..., n_∞}. These localized points are then joined to form trajectories, resulting in four trajectories for each human silhouette. These four trajectories are represented as T = {HN, NRS, NLS, NC}, where HN is the trajectory between head and neck, NRS the trajectory between neck and right_shoulder, NLS the trajectory between neck and left_shoulder and NC the trajectory between neck and chest. Figure 7 shows the fused intra-inter silhouette trajectories on the MICC dataset. After the formation of all the trajectories, shape descriptors are extracted which calculate the displacement, i.e., the change in length L of each silhouette trajectory, frame by frame, over time t along the x and y coordinates. This change over time t can be measured using Equation (12), and the normalized displacement vector using Equation (13) [33]:

ΔP_t = P_{t+1} − P_t = (x_{t+1} − x_t, y_{t+1} − y_t) (12)

S = (ΔP_t, ..., ΔP_{t+L−1}) / Σ_{j=t}^{t+L−1} ‖ΔP_j‖ (13)
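The descriptor of Equations (12) and (13) for one localized point can be sketched as below; concatenating the descriptors of the four trajectories in T into one feature vector is an assumed fusion step.

```python
import numpy as np

def trajectory_descriptor(points):
    """points: (T, 2) array of a localized point's (x, y) positions over T frames."""
    disp = np.diff(points, axis=0)                    # Eq. (12): P_{t+1} - P_t
    total = np.linalg.norm(disp, axis=1).sum()        # sum of displacement magnitudes
    return (disp / total).ravel() if total > 0 else disp.ravel()  # Eq. (13)

# Example: one fused vector per silhouette could concatenate the descriptors
# of the four trajectories T = {HN, NRS, NLS, NC}.
```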

Apriori Association
The Apriori-Association technique is used to detect and extract meaningful association relationships between the quantities in a dataset. It is used in several practical applications, such as the study of disease, improvement of production processes and correlation in alarm analysis, and it is used for the first time here in a head tracking and counting system. The Apriori-Association technique identifies the itemsets that occur frequently in a dataset. First, the minimum support of each individual itemset is calculated to identify frequent items in the first pass. In the second pass, the same procedure is repeated by taking each seed set from the previous pass and finding the frequently occurring itemsets. This procedure is repeated until no further frequent itemset is found. The algorithm generates candidate itemsets on each pass in order to improve computational efficiency [47]. Figure 8 illustrates the Apriori-Association graph of the accuracy of all three datasets across the three feature extraction techniques. Algorithm 2 defines the step-by-step working of the Apriori-Association.
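A minimal pure-Python sketch of these Apriori passes follows; representing each frame's extracted features as a transaction of discrete labels is our illustrative reading, not the exact encoding used in the system.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """transactions: list of sets of item labels; min_support in (0, 1]."""
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = []
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = [c for c, n in counts.items()
                     if n / len(transactions) >= min_support]
        frequent.extend(survivors)
        # grow size-k survivors into size-(k+1) candidates for the next pass
        current = {a | b for a, b in combinations(survivors, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

# Example: apriori([{"hog", "lbp"}, {"hog", "traj"}, {"hog", "lbp"}], 0.6)
# keeps {"hog"}, {"lbp"} and {"hog", "lbp"}, which occur often enough.
```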

Head Tracking Using Convolution Neural Network
All the local and global features extracted by the above-mentioned feature extraction methods (see Section 3.3) are then passed through a CNN, resulting in the accurate tracking of heads over the three benchmark datasets. CNNs have consistently given better results than other deep learning techniques on both RGB and RGB-D image and video based features [48][49][50][51][52][53]. Figure 9 illustrates the overall architecture of our proposed 1D CNN on the PCDS dataset. A 1D CNN is proposed for the first time for the accurate tracking of heads in both indoor and outdoor complex environments.

The PCDS dataset contains 10,500 feature sets in 4500 videos. The proposed CNN consists of three convolution layers, three max_pooling layers and one fully connected layer, each with its own specific purpose. The first layer is the convolution layer C_1 and contains the input matrix. This layer is convolved with 32 kernels, each having a size of 1 × 13; as a result, a matrix of 4500 × 10,488 × 32 is produced. The convolution output can be calculated as [54]:

C_n^m(a, b) = ReLU(α_n^m + Σ_{ω∈Ω} Σ_{x} C_ω^{m−1}(a, b + x) W_n^m(x)) (14)

where C_ω^{m−1}(a, b) is the result of the convolution layer at coordinates (a, b) of the (m − 1)th layer with the ωth convolution map, Ω represents the previous layer's maps, x ranges over the kernel of size 1 × 13, W_n^m is the nth convolution kernel of the mth layer and α_n^m is the nth bias of the mth layer. A ReLU activation function is used between each convolution and max_pooling layer, applied to the sum of the weights and bias of the previous layer before it is passed to the next layer:

ReLU(z) = max(0, z) (15)

The result of the first convolution layer is passed through the first max_pooling layer P_1, which downsamples the result of the convolution layer using a sliding window of size 1 × 2. The pooling result of the (p − 1)th layer with kernel q at row x and column y can be calculated as [55][56][57][58]:

P_q^p(x, y) = max_{0≤i<n} C_q^{p−1}(x, y · n + i) (16)

where 1 ≤ p ≤ q and n is the pooling window size. The result of the first pooling layer is passed through the second convolution layer C_2, convolved with 64 kernels, and then through the second pooling layer P_2. The same procedure is repeated for the third convolution layer, convolved with 128 kernels. Finally, the fully connected layer FC result is obtained as:

FC_v^{m+1} = Σ_i W_{iv}^m x_i^m (17)

where W_{iv}^m represents the weight from the ith node of the mth layer to the vth node of the (m + 1)th layer, and x_i^m represents the content of the ith node of the mth layer. Figure 10 shows the convergence plot of the CNN over 250 epochs for all three benchmark datasets.
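A Keras sketch of this architecture is given below; the layer widths and kernel sizes follow the text, while the two-way softmax output, optimizer and loss are assumptions for a head/non-head discrimination task.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv1D(32, 13, activation="relu",
                  input_shape=(10500, 1)),      # C1: 10,500 -> 10,488 x 32
    layers.MaxPooling1D(2),                     # P1: window of size 1 x 2
    layers.Conv1D(64, 13, activation="relu"),   # C2: 64 kernels
    layers.MaxPooling1D(2),                     # P2
    layers.Conv1D(128, 13, activation="relu"),  # C3: 128 kernels
    layers.MaxPooling1D(2),                     # P3
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),      # FC: head vs. non-head (assumed)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```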

Head Counting Using Cross-Line Judgement
First, the counting line and region are specified in each frame of a video. Second, the movements of the heads are detected, i.e., whether the heads are moving from left to right, right to left, upward to downward or downward to upward. Hence, the relative location between heads and crossing lines can be determined, and this information is further used to detect the position of the initial movement of the heads using the horizontal and vertical crossing lines. If the crossing line is aligned horizontally, the directions of the moving heads are calculated as in Equation (18); if the crossing line is aligned vertically, they are calculated using Equation (19):

Direc_head = { up_to_down, if y_head_init < y_u
               down_to_up, if y_head_init > y_d (18)

Direc_head = { left_to_right, if x_head_init < x_l
               right_to_left, if x_head_init > x_r (19)

where Direc_head denotes the initial movement of the head, and x_head_init and y_head_init denote the initial x and y coordinates of the center of the head.

After detecting the direction of the initial head movements, the heads are counted based on the cross-line judgment as in [35]:

Head_count = { left_to_right, if Direc_head = left_to_right and x_head > x_r
               right_to_left, if Direc_head = right_to_left and x_head < x_l
               up_to_down, if Direc_head = up_to_down and y_head > y_d
               down_to_up, if Direc_head = down_to_up and y_head < y_u (20)

where Head_count denotes the direction of the heads being counted, and x_head and y_head are the x and y coordinates of the center of the head. If one of the above conditions is true, Head_count is incremented by 1; otherwise, the head is discarded. Figure 11 shows the results obtained for head counting on the PCDS dataset.
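The horizontal-line case of Equations (18) and (20) can be sketched as follows; representing a track as a list of head-center y-coordinates over consecutive frames is an illustrative assumption.

```python
def count_crossing(track_y, y_u, y_d):
    """track_y: head-center y-coordinates over frames; y_u/y_d: crossing lines."""
    direction = None
    if track_y[0] < y_u:
        direction = "up_to_down"               # Eq. (18), first case
    elif track_y[0] > y_d:
        direction = "down_to_up"               # Eq. (18), second case
    # Eq. (20): count only if the head actually reaches the far side
    if direction == "up_to_down" and track_y[-1] > y_d:
        return direction, 1
    if direction == "down_to_up" and track_y[-1] < y_u:
        return direction, 1
    return direction, 0                        # head discarded

# Example: count_crossing([40, 80, 130], y_u=60, y_d=110) -> ("up_to_down", 1)
```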

Experimental Setup and Performance
In this section, the three benchmark datasets are described in detail (see Section 4.1), and several experiments conducted on them are evaluated. In Experiment 1, head tracking performance is evaluated on all three datasets. In Experiment 2, the computational efficiency of our proposed model is compared with other state-of-the-art methods. In Experiment 3, the mean head counting accuracy on all three benchmark datasets is illustrated in the form of a column chart.

Dataset Descriptions
This section describes the three benchmark datasets in detail.

The People Counting Dataset (PCDS)
The People Counting Dataset (PCDS) is the first RGB-D dataset of its kind; it contains 4500 videos taken at the entrance and exit doors of buses in both normal and cluttered scenes. The videos were captured using a Kinect V1 camera and present large variations in illumination, occlusion, clutter and noise. This dataset is publicly available at [41]. Figure 12 shows example video frames from the PCDS dataset.

The MICC People Counting Dataset
The MICC people counting dataset is another publicly available RGB-D dataset [42]. This dataset is divided into three sequences: the Flow sequence, the Queue sequence and the Groups sequence. In the Flow sequence, the participants move straight from one point to another in the room; this sequence contains 1260 frames with 3542 persons. In the Queue sequence, the participants wait in a line in a room; this sequence contains 918 frames with 5031 persons. In the Groups sequence, the participants are split into two groups that talk to each other without exiting the room; this sequence contains 1180 frames with 9057 persons [42]. Figure 13 shows example video frames from the MICC people counting dataset.

The GOTPD Dataset

The GOTPD is a multimodal dataset containing both depth and infrared video recordings captured using a Kinect V2 camera. The camera was positioned above the heads for people detection. The dataset is composed of 48 video sequences, each having different illumination conditions, various complex environments and various objects like hats, chairs, caps, etc. The dataset frames contain both single and multiple persons. Figure 14 shows some examples from the GOTPD dataset [59,60].

Experiment 1: Evaluation of Tracking Performance
Tables 3-5 show the overall tracking accuracy, sensitivity and specificity performance metrics of the PCDS dataset, the MICC people counting dataset and the GOTPD dataset under various conditions. The accuracy, sensitivity and specificity measures are calculated as:

Accuracy_head_track = (TP + TN) / (TP + TN + FP + FN) (21)

Sensitivity_head_track = TP / (TP + FN) (22)

Specificity_head_track = TN / (TN + FP) (23)

where TP is the number of True Positives, TN the True Negatives, FP the False Positives and FN the False Negatives.

Experiment 2: Evaluation of Computational Efficiency

Results show that our model is more efficient than other state-of-the-art techniques. For the PCDS dataset, the computational time was calculated using the first 7000 video frames. For the MICC dataset and the GOTPD dataset, the numbers of frames taken for testing computational efficiency were 3499 and 2950, respectively.

Experiment 3: Performance Evaluation of Head Counting Over All Three Benchmark Datasets
In this section, the mean head counting accuracy over all three benchmark datasets is shown in Figure 18. The counting accuracy is calculated as:

Count_heads = Predicted Number of Heads / Actual Number of Heads (24)

Figure 18. Accumulative head counting accuracies achieved on the three benchmark datasets.
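For reference, Equations (21)-(24) reduce to a few lines of Python; the helpers below are generic illustrations, not the evaluation script used to produce the reported numbers.

```python
def tracking_metrics(tp, tn, fp, fn):
    """Tracking metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (21)
    sensitivity = tp / (tp + fn)                 # Eq. (22)
    specificity = tn / (tn + fp)                 # Eq. (23)
    return accuracy, sensitivity, specificity

def counting_accuracy(predicted_heads, actual_heads):
    """Head counting accuracy as the predicted-to-actual ratio."""
    return predicted_heads / actual_heads        # Eq. (24)
```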

Discussion
The automatic HASTAC system is primarily built to track human heads and count them in depth imagery. We tested our system on three benchmark datasets, i.e., the PCDS dataset, the MICC dataset and the GOTPD dataset. The results acquired on the PCDS and GOTPD datasets exhibit greater accuracy, i.e., 98.40% and 99%, respectively, for head tracking and counting, compared to the MICC dataset, which produced an accuracy rate of 98%. The reason for the lower head tracking and counting accuracy on the MICC dataset is the overlap between persons in the Queue condition: when persons walk one after the other, our system mis-detects some heads, which lowers tracking and counting accuracy. A second challenge we faced during the processing of the MICC dataset is that it provides noisy depth images with strong illumination variations and complex backgrounds, which hinders the detection of heads in some images. The heads in some of the images of this dataset overlap with the noisy background, which results in mis-tracking and mis-counting of heads.

Conclusions
We have proposed an efficient method for head tracking and counting which works well despite variations in occlusion, illumination and background complexity. The proposed HASTAC model is sub-divided into six domains. First, using Kernel Density Estimation (KDE), preprocessing is done to remove complex backgrounds under varying illumination conditions. Second, head detection is achieved using the Hough Circular Gradient Transform and shoulder detection using the HOG-based symmetry method. Third, robust features, like fused joint HOG-LBP, energy-based point clouds and fused intra-inter trajectories, are extracted, which further help the tracking of the heads and shoulders of each individual detected in the video frames. Fourth, these features are projected to the Apriori-Association technique, which finds frequent itemsets and removes redundant features to obtain ideal results for head tracking. Fifth, the selected features are passed to our proposed 1D CNN model in order to distinguish between heads and non-heads in the video frames. In the last step, the cross-line judgment technique is used to count the number of heads frame by frame. We evaluated the proposed model on three publicly available datasets and found the results very promising with respect to accuracy and computational time.

Theoretical Implications
Our proposed HASTAC model has various practical applications in security, visitor counting, queue management, etc. Our system is efficient and applicable to people counting and crowd analysis.

Research Limitations
Due to the complex backgrounds and varying illumination conditions of the MICC dataset, we faced minor challenges in processing it, and we achieved less accurate results on this dataset in the Queue condition compared to the other conditions, namely the Flow and Groups conditions. Similarly, due to overlaps between persons in the video frames of the Queue condition, our system produced some mis-detections of heads in this dataset. In the future, we will address this problem using different head detection and feature extraction techniques and deep learning models to obtain better results.