An Intelligent Automatic Human Detection and Tracking System Based on Weighted Resampling Particle Filtering

At present, traditional visual-based surveillance systems are becoming impractical, inefficient, and time-consuming. Automation-based surveillance systems appeared to overcome these limitations. However, automatic systems face challenges such as handling occlusion and capturing images smoothly and continuously. This research proposes a weighted resampling particle filter approach for human tracking to handle these challenges. The primary functions of the proposed system are human detection, human monitoring, and camera control. We used the codebook matching algorithm to define the human region as a target, and we used the particle filter algorithm to follow the target and extract its information. The obtained information was then used to configure the camera control. The system was tested in various environments to demonstrate the stability and performance of the proposed approach on an active camera.


Introduction
Recently, security surveillance has applied visual-based tracking and detection techniques for improving convenience and safety for humans. Human tracking and detection are essential topics in a surveillance system. Human recognition and moving object extraction are the two parts of any typical human detection system. Human recognition identifies an object as nonhuman or human, and objects are extracted from the background by means of moving object extraction, which determines the related size and position of the object in an image. The tracking system is essentially able to predict the location during and after occlusion, as the tracked object or human is possibly occluded by other objects while tracked.
Surveillance systems typically use two kinds of cameras: fixed cameras and active cameras. The fixed camera has the benefit of low cost but comes with a limited field of view (FOV), whereas an active camera offers a suitable FOV because it can pan-tilt to retain the target object within the camera scene. In addition, the latter provides better resolution since it can zoom in/out.
Generally, a tracking system on an active camera relies on the temporal difference for extracting a moving object. In this procedure, it is necessary to wait for the camera to be stable enough to process the image. In other words, a moving camera captures blurred images and extracts background pixels along with the moving object, so the active camera operates non-smoothly and discontinuously. Hence, a particle filter tracking algorithm is applied to resolve this problem. The codebook technique is employed initially to spot the human as the target model, and after that the particle filter tracks the human by computing the Bhattacharyya distance between the color histogram of the target model and the color histogram at the sampled particle positions in the next frame. A color histogram offers various advantages, such as efficient computation, the ability to track nonrigid objects, robustness to partial occlusion, and scale and rotation invariance.
In this paper, a real-time human tracking system is constructed with an active camera and has the following characteristics:
• Rapidly detects a human
• Tracks an object without considering background information
• Handles occlusion conditions
• Operates an active camera continuously and smoothly
• Zooms in/out appropriately

Related Work
There are four key parts in our entire system: image source, human detection, human tracking, and camera control, as described in Figure 1. As a quick review of our procedure, we set the initial FOV as the scene we want to capture. Then, we detect and extract an object recognized as a human. We track the human object and use its motion information to pan-tilt-zoom (PTZ) the camera via a proportional-integral-derivative (PID) controller so that the target stays in the center of the FOV. A human detection system finds the position and size of the human in an image. Optical flow [1,2] is considered in order to estimate a moving object independently, at the cost of complex computations. Zhao and Thorpe [3] proposed a stereo-based segmentation technique for extracting objects from the background and then recognized the objects using a neural network. While techniques based on stereo vision are more robust, they need a minimum of two cameras and fail to perform well in long-distance detection. Viola et al. [4] proposed a cascade architecture detector, where adaptive boosting (AdaBoost) iteratively builds a robust classifier guided by performance criteria specified by the user. The cascade method swiftly rejects non-pedestrian samples in the early cascade layers; thus, the processing speed of this approach is high. The templates in a template-based approach [5] are short sequences of 2D silhouettes obtained from motion capture data. This method detects human silhouettes having a particular walking pose. To rapidly spot humans, a shape-based human model is chosen, and codebook matching is used to classify a human. This reduces the time taken to distinguish humans from other objects. Montabone and Soto [6] proposed a novel computer vision technique that can operate on moving cameras and spot a human in various poses in the case of a complete or partial appearance of the human. Pang et al. [7] presented an efficient histogram-of-gradients-based human detection technique.
A human tracking system follows a human target through a sequence of images with respect to changes in scale and position. Among the several tracking methods, we analyzed three to synthesize our research.
First, feature-based tracking, a very common method, tracks features by motion, edge, or color using edge detecting methods such as the Sobel approach, Laplacian approach, and Marr-Hildreth approach [8,9]. These techniques use masks to perform convolution over an image for edge detection. Li et al. [10] proposed a 3D human motion tracking system with a coordinated mixture of factor analyzers. Lopes et al. [11] designed a hierarchical fuzzy logic-based approach for object tracking.
These methods use a complicated and large set of rules, require long computation times, and do not always detect edge pixels continuously. The abovementioned approaches use gray-scale images for edge detection; we chose not to apply them to color images because of the information loss in the color space vector. Moreover, edge detection in a gray-scale image is neither robust nor sufficient.
Big Data Cogn. Comput. 2020, 4, x FOR PEER REVIEW
Figure 1. Overview of the system.
Secondly, pattern recognition methods learn the object as the target and find it in sequential images. Williams et al. [12] extended the method to a relevance vector machine (RVM) that learns a nonlinear translation predictor. Collins et al. [13] proposed an online feature selection mechanism that can be used to evaluate multiple features. The presented approach tracks and adjusts the feature set to improve tracking performance; the feature evaluation mechanism is embedded in a mean-shift tracking system and can adaptively select tracking features. Zhang et al. [14] proposed a robust 3D human pose tracking approach from silhouettes using a likelihood function. Zhao et al. [15] used principal component analysis to extract features from color and used them in a random walker segmentation algorithm to assist human tracking.
Thirdly, there are gradient recognition methods with a focus on pattern recognition, such as the mean-shift algorithm. Fukunaga and Hostetler [16] initially proposed the mean-shift algorithm for clustering data. Comaniciu et al. [17] proposed a kernel-based object tracking method, where the tracked object region is represented by a spatially weighted intensity histogram, and its similarity is computed using the Bhattacharyya distance within an iterative mean-shift technique. Many applications [18][19][20][21] later proposed various mean-shift algorithm variants. Even though the mean-shift object tracking technique performs well on sequences with comparatively slight object displacement, its performance cannot be guaranteed when objects suffer full or partial occlusions. Kalman filter [22,23] and particle filter [24,25] algorithms are combined with mean-shift algorithms to improve tracking performance under partial occlusion. The approach by Bhat et al. [24] uses a fusion of color and KAZE features [26] in the particle filter framework to track the target effectively in different environments. Still, this approach requires a strategy for fast occlusion-failure recovery for post-occlusion target recovery. To track multiple targets by deploying the same color description with cancelation functionality and internal initialization, Nummiaro et al. [25] proposed a color particle filter embedded with a detection algorithm. Our major contribution in this work is a weighted resampling particle filter for human tracking that improves tracking accuracy and computational efficiency. In order to detect humans fast, we chose the shape-based human model to classify humans by codebook matching, which decreases the time of human detection compared to the other objects.
Many tracking systems work with PTZ cameras: to keep the object in the FOV, an active camera can pan-tilt, and it can use zoom in/out to adjust resolution, thus keeping the tracked object at a well-proportioned resolution relative to the FOV. Morphological filtering of motion images was used by Murray et al. [27] to perform background compensation. Using an active camera mounted on a pan/tilt platform, Murray's technique can successfully track a moving object from dynamic images. A kernel-based tracking method was used in that system to overcome the apparent background motion of a moving camera. Karamiani and Farajzadeh [28] considered the direction and magnitude information of feature points to detect camera motion accurately. The method detects multiple moving objects accurately in both active and fixed camera models. Lisanti et al. [29] proposed a method that enables real-time target tracking in world coordinates and offers continuous adaptive calibration of a PTZ camera. Mathivanan and Palaniswamy [30] used optimal feature points and fuzzy feature matching to accomplish human tracking. In the context of human tracking applications using deep learning, Fan et al. [31] proposed human tracking and detection using a convolutional neural network for partial occlusion and for view, scale, and illumination changes. Tyan and Kim [32] proposed a compact convolutional neural network (CNN) based visual tracker in conjunction with a particle filter architecture. A face tracking framework based on convolutional neural networks and the Kalman filter was proposed for real-time detection and tracking of the human face [33,34]. Luo et al. [35] proposed a matching Siamese network and CNN-based method to track pedestrians. The method used a Faster R-CNN to distinguish pedestrians in surveillance videos. However, the method still requires target occlusion to be resolved in order to become a more robust real-time pedestrian tracking tool.
Xia et al. [36] proposed a method that tracks single and multiple objects in long-term, real-time tracking: a CNN is first trained to determine and identify the target bounding box in a traffic scene, and then a particle filter (PF) is used as the tracker to perform preliminary multi-object tracking. A particle filter with neural network learning, evaluated in a person re-identification scenario, was proposed in [37], while a hybrid Kalman particle filter (KPF) for human tracking was proposed in [38]. The KPF is more time-consuming, especially in the case of non-occlusion, and its real-time performance is poor in terms of speed.
Deep learning models are time-inefficient and costly in terms of memory, as they tend to expand into a large number of nodes, which results in heavy computation. Such models mostly fail in real-time applications, and their implementation requires high-end processors. Therefore, the complexity of the network needs to be reduced to decrease computation time and limit the number of computations [37]. The advantage of the proposed method is its simplicity and ease of implementation: the proposed models can be executed on a simple CPU for real-time videos, making it an efficient approach as well.
In this research, we used a wide-angle camera to find the target, and then camera calibration methods gave the active camera pan-tilt commands to keep the target in the center of the FOV and to track a specific object position. If the size of the target was larger than a predefined maximum or smaller than a predefined minimum, the zoom in/out command was used accordingly.
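The PID-based camera control described above can be sketched as follows. This is a minimal illustration, not the paper's controller: the gains kp, ki, kd, the 30 fps time step, and the pixel offset are placeholder values chosen for the example.

```python
# Minimal PID sketch: convert the target's pixel offset from the image
# center into a pan-rate command. Gains are illustrative placeholders.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Pan command driving the target's horizontal offset (pixels) toward zero.
pan = PID(kp=0.05, ki=0.001, kd=0.01)
offset_x = 120.0                 # target is 120 px right of image center
command = pan.step(offset_x, dt=1 / 30)
```

An analogous loop on the vertical offset would drive the tilt axis, and a loop on the target's apparent size would drive the zoom.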

Proposed System
This section describes each algorithm and method used in this paper. Figure 2 shows the three categories of the tracking system. To detect a human, we first extracted moving objects from the image source and then used codebook matching to categorize each one as human or non-human.

Human Detection
In the majority of surveillance systems, the position of the camera is fixed, whether it is a static camera or an active camera. The fixed camera position allows a moving object to be extracted by background subtraction. To make the method computationally efficient, background subtraction uses only gray-level images, which also makes the system more efficient in real-time situations. The background is constructed from the first image frame and adjusted over time using Equation (1), where I_B^(n−1) and I_B^n represent the previous and current background images, respectively.
Scaling factor α ∈ (0, 1) was used to update the background image. Active pixels between frames n and n−1 are represented by I_M(x, y).
To determine the moving object, the current image I_c is subtracted from the background image I_B as described in Equation (2). To obtain the binary moving object M_obj, threshold ths is applied to the result of Equation (2) using Equation (3).

M_obj(x, y) = 1 if I_BS ≥ ths; 0 if I_BS < ths (3)
The details of the moving object extraction and codebook matching are indicated in Figure 3. The binary threshold image M_obj undergoes a dilation process to fill holes in moving objects and to enlarge their boundaries. The step-by-step process is shown in Figure 4.
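A minimal sketch of the background-subtraction pipeline of Equations (1)-(3), assuming a running-average background update; the values of alpha and ths below are illustrative, not the paper's tuned parameters.

```python
import numpy as np

def update_background(bg_prev, frame, alpha=0.05):
    # Equation (1): blend the current gray-level frame into the background
    return (1.0 - alpha) * bg_prev + alpha * frame

def moving_object_mask(frame, bg, ths=30):
    # Equation (2): absolute difference between current image and background
    diff = np.abs(frame.astype(float) - bg)
    # Equation (3): binarize with threshold ths to get M_obj
    return (diff >= ths).astype(np.uint8)

bg = np.zeros((4, 4))                     # empty background model
frame = np.zeros((4, 4))
frame[1:3, 1:3] = 200                     # a bright 2x2 moving blob
mask = moving_object_mask(frame, bg)
```

In a full pipeline the mask would then be dilated (as in Figure 4) before connected components are extracted.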
Human-shape information was used to build our codebook matching algorithm. The extracted moving object was normalized into a 20 × 40 pixel image, and the positions of the shape pixels in the image were extracted by shape feature extraction. These features are indicated by red dots in Figure 5; 10 Y-axis coordinates are chosen from the object's rightmost and leftmost boundaries, and the 20 corresponding X-axis coordinates are arranged as a feature vector, shown by blue blocks in Figure 5. As shown in Figure 5, there are a total of 10 bins in the histogram, represented by green blocks. As a result, a human object is represented by 30 features.
We can conclude by observation that the top and bottom shape pixels along the Y-axis cannot be chosen as feature points, as these pixels are changeable. To select the Y-axis coordinates, we first calculate the standard deviation of each Y-axis boundary value over the training samples and then select the coordinates with the 10 lowest standard deviation values from each side.
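The 30-dimensional shape feature described above can be sketched as follows. This is an illustrative simplification: the fixed row choice stands in for the paper's lowest-standard-deviation selection over training samples, and the histogram here bins row-wise widths, one plausible reading of the 10-bin histogram in Figure 5.

```python
import numpy as np

def shape_features(sil, rows=tuple(range(10, 30, 2))):
    # sil: 40x20 binary silhouette (normalized 20x40-pixel object)
    assert sil.shape == (40, 20)
    left, right, widths = [], [], []
    for y in rows:                        # 10 chosen rows
        xs = np.flatnonzero(sil[y])
        if xs.size == 0:                  # empty row: fall back to center
            left.append(10); right.append(10); widths.append(0)
        else:
            left.append(xs[0])            # leftmost boundary x-coordinate
            right.append(xs[-1])          # rightmost boundary x-coordinate
            widths.append(xs[-1] - xs[0] + 1)
    hist, _ = np.histogram(widths, bins=10, range=(0, 20))
    # 10 left x + 10 right x + 10 histogram bins = 30 features
    return np.concatenate([left, right, hist])

sil = np.zeros((40, 20), dtype=np.uint8)
sil[5:35, 6:14] = 1                       # a crude torso-like silhouette
f = shape_features(sil)
```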
A list of feature vectors is represented by the codebook. The feature vector is matched against the codebook vectors to find the code vector with the minimum distortion relative to the object feature vector. Let X denote a feature vector of M-dimensional data, designated x(0) ... x(M−1). Code words V_j form N sets in codebook C and, like the feature vector, each code word contains M-dimensional data v_j(0) ... v_j(M−1). The distortion between code words and feature vectors is defined by Equation (4).
If the value of Dis min in Equation (5) is less than the threshold, it is assumed that feature vector X and the moving object it represented was of a human, and if the value of Dis min is greater than the threshold we then assume that it is a nonhuman object. The demonstration of comparing X with V j is shown in Figure 6.
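The codebook matching decision of Equations (4)-(5) can be sketched as below, assuming squared Euclidean distance as the distortion measure; the codebook entries and the threshold are illustrative values, not trained ones.

```python
import numpy as np

def classify(x, codebook, threshold):
    # Equation (4): distortion between feature vector X and each code word V_j
    distortions = [np.sum((x - v) ** 2) for v in codebook]
    dis_min = min(distortions)            # Equation (5): minimum distortion
    # Human if the closest code word is closer than the threshold
    return dis_min < threshold, dis_min

codebook = np.array([[1.0, 2.0, 3.0],     # N = 2 code words, M = 3
                     [10.0, 10.0, 10.0]])
x = np.array([1.1, 2.0, 2.9])             # feature vector of a moving object
is_human, dis_min = classify(x, codebook, threshold=0.5)
```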


Human Tracking
A particle filter algorithm was proposed in the study, which is based on a weighted resampling particles method. In this algorithm, high weighted samples were selected for the human tracking system. The basic idea of our particle filter is to approximate the probability distribution by weighted sample sets. One hypothetical state of the object with corresponding discrete sampling probability is represented by each sample [25].
Color information is more accurate than grayscale information when color is used as the feature for object tracking. For our experiments, we chose the HSV (Hue, Saturation, Value) color space over the RGB (Red, Green, Blue) color space because it reduces sensitivity to lightness and illumination, giving better tracking performance. Every color channel is represented by 8 bits, which produces a 256 × 256 × 256-bin color histogram; without loss of generality, the color data are quantized into 6 × 6 × 6, making the entire color histogram 216 bins. A kernel function was used to represent the target object. We selected the Epanechnikov kernel, which is convex and monotonically decreasing, to mask the target's density estimate spatially: it introduces a spatially smooth function that reduces the search to a small neighborhood region. The rationale for using the kernel as a weighted mask is to assign smaller weights to pixels farther from the center of the target, since those pixels are more often affected by occlusion or interference from the background. Figure 7b shows the Epanechnikov kernel; the function takes its highest value at the center of the distribution. In the region of interest (ROI) of the target model in Figure 7a, the pixels closer to the center of the ROI contain more important information, while the background pixels lie mostly near the ROI's boundary. The Epanechnikov kernel is computationally simple, can disregard the boundary information, and performs well in terms of stability, accuracy, and robustness under camera motion and partial occlusion. It is defined by Equation (6), where x represents the normalized pixels in the region defined as the target model.
When the proposed kernel function is applied to the target model, more critical information is contained by pixels closer to the ROI center, as shown in Figure 7.
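The kernel-weighted color model described above can be sketched as follows: HSV pixels are quantized to 6 × 6 × 6 = 216 bins, and each pixel's contribution is weighted by an Epanechnikov profile k(x) = 1 − x² for ‖x‖ ≤ 1, so center pixels count more than boundary pixels. The bin layout and distance normalization below are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def epanechnikov(r):
    # Epanechnikov profile (Equation (6)): 1 - r^2 inside the unit ball
    return np.where(r <= 1.0, 1.0 - r ** 2, 0.0)

def color_model(hsv_roi):
    h, w, _ = hsv_roi.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # normalized distance of each pixel from the ROI center
    r = np.sqrt(((ys - h / 2) / (h / 2)) ** 2 + ((xs - w / 2) / (w / 2)) ** 2)
    weights = epanechnikov(r / np.sqrt(2))   # corners of the ROI get weight 0
    bins = (hsv_roi // 43).clip(0, 5)        # 256 levels -> 6 per channel
    idx = bins[..., 0] * 36 + bins[..., 1] * 6 + bins[..., 2]
    hist = np.bincount(idx.ravel(), weights=weights.ravel(), minlength=216)
    return hist / hist.sum()                 # normalized 216-bin histogram

roi = np.full((40, 20, 3), 128, dtype=np.uint8)  # uniform mid-gray ROI
q = color_model(roi)
```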
A robust tracking framework is provided by the particle filter algorithm, as it represents uncertainty. The algorithm is capable of keeping its options open and, at the same time, of considering multiple state hypotheses. Temporary occlusions can be dealt with by the particle filter, since less likely object states remain part of the tracking process temporarily [25]. The occlusion handler step and weighted resampling are the two basic differences between the original tracking method and our tracking method. Our proposed tracking method is shown in Figure 8.
Figure 8. Step-by-step process of the weighted resampling particle filter.
The first step in the process of the weighted resampling particle filter is to define the target model.
It is defined in Equation (7) at location y as the m-bin histogram q_y = {q_y^(u)}_{u=1...m}. The normalization factor f is represented by Equation (8); δ is the Kronecker delta function, I is the number of pixels in the ROI region, and a = √(w² + h²) is used as the normalization factor for the size of the object region.
The sample model p_y = {p_y^(u)}_{u=1...m} is represented in the same way as the target model.
The Bhattacharyya distance d is used to measure the distance between the sample and target models; from it, the similarity value ρ is derived. If ρ is large, the two models are considered similar, and ρ = 1 implies that the histograms of the sample and the target model are identical.
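The similarity measure can be sketched as follows: the Bhattacharyya coefficient ρ between two normalized m-bin histograms, and the commonly used derived distance d = √(1 − ρ); the exact form of Equations (9)-(11) is assumed here, with ρ = 1 for identical histograms.

```python
import numpy as np

def bhattacharyya(p, q):
    # Similarity value rho: sum of sqrt of elementwise products
    rho = np.sum(np.sqrt(p * q))
    # Bhattacharyya distance d derived from rho
    d = np.sqrt(max(0.0, 1.0 - rho))
    return rho, d

q = np.array([0.5, 0.3, 0.2])                      # target model histogram
rho_same, d_same = bhattacharyya(q, q)             # identical histograms
rho_diff, d_diff = bhattacharyya(np.array([0.2, 0.3, 0.5]), q)
```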
In the particle filter algorithm, the target model can also be represented by the state vector s_target. It is defined in Equation (12), where w and h represent the width and height of the ROI, respectively; (x, y) represents the center of the ROI, and (v_x, v_y) represents the motion of the object. Equation (13) is used to compute the initial sample set S_initial = {s^(n)}_{n=1...N}, where I is an identity matrix, r.v. is a multivariate Gaussian random variable, and N represents the number of samples. A dynamic model, represented by Equation (14), propagates the samples; the deterministic component of the model is represented by A. The target human's size and position can be determined from the estimated vector using the weight of every sample and its state vector, as shown in Equation (15). To update the weight of each sample, the Bhattacharyya distance is used, as shown in Equation (16).
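The dynamic model of Equation (14) and the mean-state estimate of Equation (15) can be sketched as below. The constant-velocity matrix A and the Gaussian noise scale are illustrative assumptions, and uniform weights stand in for the Bhattacharyya-based weights of Equation (16).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
# State vector per sample: [x, y, vx, vy, w, h]
A = np.eye(6)
A[0, 2] = A[1, 3] = 1.0          # deterministic part: x += vx, y += vy

samples = np.zeros((N, 6))
samples[:, :2] = [160.0, 120.0]  # all particles start at the target center
samples[:, 2:4] = [2.0, 0.0]     # target moving right at 2 px/frame

# Equation (14): propagate by A plus multivariate Gaussian noise
noise = rng.normal(scale=1.0, size=(N, 6))
propagated = samples @ A.T + noise

# Equation (15): mean state estimate (uniform weights in this sketch)
mean_state = propagated.mean(axis=0)
```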
The resampling step in the weighted resampling particle filter is used to avoid degeneracy of the algorithm; that is, it prevents the situation where most of the sample weights are close to zero. To determine whether and when to resample, Equations (17) to (19) can be used, where rate ∈ (0, 1), and N_ths and N_eff represent the given sample threshold and the effective number of samples, respectively.
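A short sketch of this degeneracy test (Python; function name ours), using the standard effective-sample-size estimate N_eff = 1 / Σ w² for normalized weights:

```python
import numpy as np

def effective_sample_size(weights):
    """N_eff = 1 / sum(w^2) for normalized weights (Equations (17)-(19))."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

# Uniform weights give N_eff = N; one dominant weight drives N_eff toward 1.
n_uniform = effective_sample_size([0.25, 0.25, 0.25, 0.25])
n_skewed = effective_sample_size([0.97, 0.01, 0.01, 0.01])
resample_needed = n_skewed < 0.5 * 4   # threshold N_ths = rate * N, rate in (0, 1)
```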
N_eff < N_ths (17)

In the process of resampling, sample selection depends on the weights: samples with high weights may be selected several times, which leads to multiple copies of those samples, while samples with relatively low weights may not be selected at all. Given a sample set S_{t-1} and the target model q (for the first iteration, S_{t-1} is set to S_initial), each iteration of the particle filter algorithm proceeds as follows:
1. Propagate each sample of the set S_{t-1} by a linear stochastic differential equation (Equation (14)).
2. Observe the color distributions: (a) calculate the color distribution p for each sample; (b) calculate the Bhattacharyya coefficient for each sample of the set.
3. Estimate the mean state of the set S_t.
4. If N_eff < N_ths, select N samples from the set S_t with probability ω_t^(n): (a) calculate the normalized cumulative probabilities c_t; (b) generate a uniformly distributed random number r ∈ [0, 1]; (c) use binary search to find the smallest j for which c_t^(j) ≥ r; (d) finally, resample by setting S_t to the selected samples.
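The cumulative-probability selection with binary search can be sketched as follows (Python; names are ours):

```python
import bisect
import numpy as np

rng = np.random.default_rng(1)

def resample(samples, weights):
    """Select N samples with probability proportional to their weights.

    Builds the normalized cumulative probabilities c_t, draws r ~ U[0, 1),
    and binary-searches for the smallest j with c_t[j] >= r, mirroring
    selection steps (a)-(c) of the algorithm.
    """
    w = np.asarray(weights, dtype=float)
    c = np.cumsum(w / w.sum())           # normalized cumulative probabilities
    out = np.empty_like(samples)
    for n in range(len(samples)):
        r = rng.random()                 # uniformly distributed random number
        j = bisect.bisect_left(c, r)     # smallest j with c[j] >= r
        out[n] = samples[j]
    return out

samples = np.arange(5, dtype=float).reshape(5, 1)
new_set = resample(samples, [0.0, 0.0, 1.0, 0.0, 0.0])  # only sample 2 survives
```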
In the basic resampling step of the particle filter, samples are selected randomly, so it is possible that a selected sample has a relatively low weight; the process can then end up tracking a different object as the target object, which decreases tracking accuracy, as shown in Figure 9. In Figure 9, the sample points with high weights are in the ROI (green block), and samples with relatively low weights are in the red block. Although the two blocks have nearly the same similarity value, the actual target object is in the green block; consequently, the tracker may follow a different object as the target, decreasing tracking accuracy. Thus, we propose a weighted resampling algorithm to prevent this problem. First, the samples with the N_top highest weights are selected from set S_t and placed in S_t^top, as shown in Equations (20) and (21). The parameter top represents the top rate; for our experiment it is set to 0.2, i.e., only the samples with the top 20% of weights are selected from set S_t.
N samples were then reproduced in S_t according to the weight of each s_top^(n). This step reproduces the samples with high weights a relatively large number of times in S_t, while the kept samples with relatively low weights are produced at least once. Figure 10 shows that the sample points with high weights are in the ROI (green block), and samples with relatively low weights are in the red block. Figure 11 shows the weighted resampling result: most of the sample points lie in the green block, i.e., the target object region.

A Gaussian mixture model (GMM) was applied to update the target model over time. K Gaussian distributions are used to approximate a continuous probability distribution. The GMM [39] is a robust method for dynamic backgrounds, widely used for its robustness to background variations such as multi-modal, quasi-periodic, and gradual illumination changes. A GMM is a semiparametric multimodal density model consisting of a number of components that compactly represent the pixels of an image block in color space under illumination changes. The image can be represented as a set of homogeneous regions modeled by a mixture of Gaussian distributions in color feature space; in comparison, non-Gaussian mixture models [40] represent an image without taking the spatial factor into computation. A Gaussian distribution N(x | µ_k, σ_k) with mean µ_k and standard deviation σ_k was considered here. The weight of each Gaussian distribution is represented by π_k, and the sum of all weights is equal to 1. Equation (22) describes the Gaussian mixture model (GMM).
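A sketch of this top-rate weighted resampling (Python; function and variable names are ours, and the tie-breaking of the extra slots is an implementation choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def weighted_resample(samples, weights, top_rate=0.2):
    """Sketch of the proposed weighted resampling (Equations (20)-(21)).

    Keep only the samples whose weights fall in the top `top_rate`
    fraction, then reproduce N samples from that subset in proportion
    to their weights, so every kept sample appears at least once.
    """
    n = len(samples)
    n_top = max(1, int(round(top_rate * n)))
    top_idx = np.argsort(weights)[::-1][:n_top]   # indices of the top weights
    top_w = np.asarray(weights, dtype=float)[top_idx]
    counts = np.ones(n_top, dtype=int)            # each top sample at least once
    extra = rng.choice(n_top, size=n - n_top, p=top_w / top_w.sum())
    for j in extra:
        counts[j] += 1                            # remaining slots by weight
    return np.repeat(samples[top_idx], counts, axis=0)

samples = np.arange(10, dtype=float).reshape(10, 1)
weights = np.array([0.3, 0.25, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1])
new_set = weighted_resample(samples, weights)     # only samples 0 and 1 survive
```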
The GMM update algorithm is applied to update the color histogram of the target model; K = 3 Gaussian distributions are used to model each bin q^(u). The steps are as follows:
1. The mean µ_k, standard deviation σ_k, and weight π_k were initialized as µ_k = q^(u), σ_k = 1, and π_k = 1/K, respectively, where k = 1 ~ K.
2. We updated each bin's value using Equation (23), where A = 0.6, B = 0.25, C = 0.15, and a, b, c follow the descending order.
3. If the difference between the previous and current frames' q^(u) was smaller than the threshold, we used Equation (24) to find the first matching Gaussian distribution, where k follows the descending order.
4. If we successfully found such a Gaussian distribution by Equation (24), we updated µ_k, σ_k, and π_k by Equations (25) to (27), where α = 0.05 and β = 0.01, and the other weights were updated by π_j = (1 − β) · π_j, where j = 1 ~ K and j ≠ k.
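Equations (23) to (27) are not reproduced in this excerpt, so the following is only a plausible per-bin sketch in the Stauffer-Grimson style using the stated constants K = 3, α = 0.05, and β = 0.01; the matching rule (a 2.5σ window) and the exact update formulas are our assumptions, not the paper's:

```python
import numpy as np

K, ALPHA, BETA = 3, 0.05, 0.01

def init_bin_model(q_u):
    """Step 1: K Gaussians per bin with mu_k = q^(u), sigma_k = 1, pi_k = 1/K."""
    return {"mu": np.full(K, float(q_u)), "sigma": np.ones(K), "pi": np.full(K, 1.0 / K)}

def update_bin_model(model, q_u, n_sigma=2.5):
    """Steps 3-4 (assumed form): match the new bin value to the first Gaussian,
    checked in descending order of weight, within n_sigma standard deviations,
    then update that Gaussian and decay the other weights."""
    order = np.argsort(model["pi"])[::-1]
    for k in order:
        if abs(q_u - model["mu"][k]) < n_sigma * model["sigma"][k]:
            model["mu"][k] = (1 - ALPHA) * model["mu"][k] + ALPHA * q_u
            model["sigma"][k] = np.sqrt((1 - ALPHA) * model["sigma"][k] ** 2
                                        + ALPHA * (q_u - model["mu"][k]) ** 2)
            model["pi"][k] = (1 - BETA) * model["pi"][k] + BETA
            for j in range(K):
                if j != k:
                    model["pi"][j] *= (1 - BETA)
            model["pi"] /= model["pi"].sum()   # keep the weights summing to 1
            return model
    return model

model = init_bin_model(0.5)
model = update_bin_model(model, 0.52)
```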
These steps produced the updated target model q = {q^(u)}_{u=1...m}.

The proposed occlusion handler was color-based: the algorithm compares the similarity between the target model and the candidate model. Figure 12 shows the flowchart of the occlusion handler. The step-by-step process of the proposed occlusion handler is as follows:
1. The candidate model c = {c^(u)}_{u=1...m} of the ROI was created in the current frame.
2. The similarity value between the target model q = {q^(u)}_{u=1...m} and the candidate model was computed.
3. If the similarity was less than ths_sim, resampling was not performed, and it was assumed that the candidate model was occluded by another object.
4. The count was increased using Count = Count + 1.
5. Steps 1-4 were repeated during the tracking process until the similarity value became larger than ths_sim (i.e., the tracked human reappeared) or Count ≥ 10. The termination condition avoids the spreading of the samples out of the image.
Figure 13 shows the images for frames T, T+4, T+9, and T+14 using the proposed occlusion handler.
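One iteration of the handler loop above can be sketched as follows (Python; the value of ths_sim is an assumption, as the text does not give it):

```python
def occlusion_step(similarity, count, ths_sim=0.6, max_count=10):
    """One iteration of the occlusion handler (steps 1-5).

    While the candidate/target similarity stays below ths_sim, resampling
    is suspended and a frame counter runs; the handler terminates when the
    target reappears or the counter reaches max_count (10 in the paper).
    """
    if similarity < ths_sim:
        count += 1                       # step 4: Count = Count + 1
        occluded = count < max_count     # step 5: stop after 10 frames
    else:
        occluded = False                 # target reappeared: resume resampling
        count = 0
    return occluded, count

# A brief occlusion: the similarity drops for two frames, then recovers.
count, states = 0, []
for sim in [0.9, 0.3, 0.2, 0.8]:
    occluded, count = occlusion_step(sim, count)
    states.append(occluded)
```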

Camera Control
The Pelco P-protocol [31] was used to control the active camera through an RS-232 to RS-485 converter. The protocol allows control over the pan (horizontal direction) angle, tilt (vertical direction) angle, and zoom step to achieve effective tracking. The Pelco P-protocol message consists of 8 bytes, with the format shown in Figure 14a. Byte 1 and Byte 7 are the start and stop bytes, respectively; they are always set to 0xA0 for Byte 1 and 0xAF for Byte 7. Byte 2 is the receiver (camera) address; since we only used one camera, Byte 2 is always set to 0x00. Byte 3, Byte 4, Byte 5, and Byte 6 are used to control the pan-tilt-zoom (PTZ), as shown in Table 1. The last byte is an XOR checksum byte.
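A frame builder for this message layout might look like the following (Python sketch; the checksum is assumed to be the XOR over bytes 1-7, and the command bytes in the example are illustrative, not taken from Table 1):

```python
def pelco_p_frame(address, data1, data2, data3, data4):
    """Build an 8-byte Pelco P-protocol message as described above.

    Byte 1 = 0xA0 (start), Byte 2 = camera address, Bytes 3-6 = PTZ command,
    Byte 7 = 0xAF (stop), Byte 8 = XOR checksum over bytes 1-7 (assumed).
    """
    frame = [0xA0, address & 0xFF, data1 & 0xFF, data2 & 0xFF,
             data3 & 0xFF, data4 & 0xFF, 0xAF]
    checksum = 0
    for b in frame:
        checksum ^= b
    frame.append(checksum)
    return bytes(frame)

# Illustrative command for camera address 0x00 (not the paper's Table 1 values).
msg = pelco_p_frame(0x00, 0x00, 0x02, 0x20, 0x00)
```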

Figure 14b demonstrates the scheme used to keep the tracked object in the center of the FOV. The FOV was divided into 9 regions corresponding to the directions of the pan-tilt, and zoom-out and zoom-in were used to make the target object appear smaller or larger. Every region has a specific direction, as shown in Figure 14b. If the target is located in the stop region, the camera is set to stop; in the other regions, the camera speed is determined by the PID controller. Zoom-in and zoom-out are activated when the target's size becomes smaller or larger, respectively, than the user-defined size. The details of the camera control are shown in Figure 15. To control the vertical and horizontal position differences, two independent PID controllers were used.
Equations (28) and (29) are used to estimate the pan and tilt speeds, where C_out is the PID controller output and offset_pan is a defined offset:

Speed_pan = C_out * 0.1 + offset_pan (28)

The pan and tilt speed range of the camera is provided by the manufacturer (0 to 64), and Equations (28) and (29) of the PID controller keep the speed within this limited range. If the speed is too low, the target object could leave the camera frame before the camera moves; on the other hand, if the speed is too high, the camera could overshoot the target object and lose track of it.
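The pan-speed computation can be sketched as follows (Python; the PID gains and the zero offset are illustrative values, not the paper's, and only Equation (28)'s 0.1 scale factor is taken from the text):

```python
class PID:
    """Simple PID on the pixel error between the target center and the
    image center; gains here are illustrative, not the paper's values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def pan_speed(c_out, offset_pan=0.0, max_speed=64):
    """Speed_pan = C_out * 0.1 + offset_pan (Eq. (28)), clamped to 0-64."""
    return max(0.0, min(float(max_speed), abs(c_out) * 0.1 + offset_pan))

pid = PID(kp=1.0, ki=0.05, kd=0.2)
error = 300.0              # target is 300 px right of the FOV center
speed = pan_speed(pid.step(error))
```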
Depending on the size of the ROI, we decided whether to zoom in or out. We applied Equations (32) and (33), where we set rate_big = 1.1 and rate_small = 0.9, and w_initial and h_initial were, respectively, the width and height of the human target object.
upper_w = w_initial × rate_big, upper_h = h_initial × rate_big (32)
lower_w = w_initial × rate_small, lower_h = h_initial × rate_small (33)

Upon zoom-in/out, we updated the size of the target model by the aspect ratio w/h, which Equation (34) defines.
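A minimal sketch of this zoom decision (Python; names ours), using the thresholds from Equations (32) and (33):

```python
def zoom_decision(w, h, w_initial, h_initial, rate_big=1.1, rate_small=0.9):
    """Zoom out when the tracked ROI grows past rate_big times the initial
    size, zoom in when it shrinks below rate_small times it (Eqs. (32)-(33));
    otherwise leave the zoom unchanged."""
    if w > w_initial * rate_big and h > h_initial * rate_big:
        return "zoom_out"     # target appears too large in the frame
    if w < w_initial * rate_small and h < h_initial * rate_small:
        return "zoom_in"      # target appears too small in the frame
    return "hold"

d1 = zoom_decision(50, 110, 40, 90)   # ROI grew past both upper bounds
d2 = zoom_decision(30, 70, 40, 90)    # ROI shrank below both lower bounds
```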
We updated the target model size with Equations (35) and (36) in the case of a zoom-in operation or Equations (37) and (38) in the case of a zoom-out operation. Later, we used these renewed states to update the variables from Equations (32) and (33).

Experimental Results
The proposed method was implemented on a PC platform with an Intel® Core™ i5-650 CPU at 3.20 GHz and 4 GB RAM, and developed in Borland C++ Builder 6.0 on Windows 7. To verify the performance and stability of the system, it was tested in several environments. We tested both image sequences and video files (uncompressed AVI format) from the active camera, with a resolution of 720 × 480 pixels.

Results of Tracking on Video File
To verify the tracking algorithm with the proposed particle filter, we used three video files, with parameters as follows:
• Number of bins in the histogram: m = 6 × 6 × 6 = 216
• Number of samples: N = 30
• State covariance: (σ_x, σ_vx, σ_y, σ_vy, σ_w, σ_h) = (2, 0.5, 2, 0.5, 0.4, 0.8)
1. Video 1 shows our system's occlusion handler in operation. Figure 16 shows the tracking system without the occlusion handler, while Figure 17 shows the same track with our occlusion handler. The full occlusion condition happens in frame 3 of Figures 16 and 17. If the particle filter resamples during the full occlusion, it may resample at incorrect positions, as shown in frame 4, and tracking will be lost, as in frames 5 and 6. Meanwhile, when full occlusion happens in the particle filter with the occlusion handler, the resampling step is not performed immediately; thus, the sample set keeps a widespread range to recover the target after full occlusion.
2. Video 2 is used to verify the tracking feature. Figure 18 shows a human wearing a black jacket while walking near a black chair, which serves as an object with color features similar to the human's. Although the target human has color features similar to the black chair's, the proposed system can still track the target human.
3. Video 3 is used to verify the tracking performance in a complex situation. In Figure 19, the target human is partially occluded by a chair, performs sitting-down and standing-up activities, and is later partially occluded by another person; the system does not lose track of the original target.

Results of Tracking on Active Camera Output
We used an active camera set up in our lab, with an environment complex enough to verify the system operation. We set the particle filter and PTZ parameters as follows:
• Number of bins in the histogram: m = 6 × 6 × 6 = 216
• Number of samples: N = 30
• State covariance: (σ_x, σ_vx, σ_y, σ_vy, σ_w, σ_h) = (10, 1, 10, 1, 1, 2)
Figure 20 shows the tracking system controlling the pan/tilt of the camera; the targeted human was mostly located in the camera's FOV. Figure 21 shows the results of zoom in/out while tracking. Figure 22 shows the tracking system controlling the pan/tilt/zoom of the camera, with the targeted human walking freely in the environment. Figures 23 and 24 show our system tracking a target human with more than one person walking in the same environment: in the test from Figure 23 the target only walks around, whereas in the test from Figure 24 the human target also performs additional actions, such as crouching and intentionally occluding himself.
Figure 21a shows that the target human has been detected and the zoom layer is initialized to 0. The targeted human walks away from or approaches the camera; when a zoom-in happens, the zoom layer is increased by 1, and when a zoom-out happens, it is decreased by 1. The details of the zoom layer are shown in Tables 2 and 3 for Figures 20a-l and 21a-i, respectively.

Table 2. Zoom layer variation in Figure 20.

Table 3. Zoom layer variation in Figure 21.

The experimental results show that the proposed system can track a moving human target with the particle filter algorithm on an active camera. In addition, the tracking system is able to track the target human when more than one person is walking in the same environment. Moreover, the zoom-in/out adjusts the image resolution while tracking the human. There are several contributions in this research:
1. Our system can accurately distinguish humans and nonhumans.
2. The weighted resampling helps the particle filter preserve the samples with high weights.
3. The occlusion handler solves the temporary full occlusion condition.
4. The system tracks the human target smoothly by using the PID controller to determine the motion of the camera.

Conclusions
In this paper, we proposed a new system that smoothly tracks a human target by controlling the camera motion with a PID controller. The experimental results demonstrated that the proposed system was capable of tracking a moving human target using a particle filter on an active camera. It was also able to precisely differentiate humans from nonhumans. When multiple people walked in the same environment, the tracking system still accurately tracked the targeted human, and the image resolution of the tracked human could be adjusted using zoom in/out. The weighted resampling used in this paper helps the particle filter preserve high-weight samples. In addition, the temporary full occlusion condition was solved using the occlusion handler.