Combining Weighted Contour Templates with HOGs for Human Detection Using Biased Boosting

This paper proposes a method for detecting humans in images, which is an important issue for many applications, such as video surveillance in smart homes and driving assistance systems. A kind of local feature called the histogram of oriented gradients (HOGs) has been widely used to describe the human appearance, and its effectiveness has been proven in the literature. A learning framework called boosting is adopted to select a set of classifiers based on HOGs for human detection. However, in the case of a complex background or noise, the use of HOGs leads to false detections. To alleviate this, the proposed method adds a classifier based on weighted contour templates to the boosting framework. The global contour templates are combined with the local HOGs by adjusting the bias of a support vector machine (SVM) for the local classifier. This method of feature combination is referred to as biased boosting. To cover the human appearance in various poses, an expectation maximization (EM) algorithm, which is a kind of iterative algorithm, is used to construct a set of representative weighted contour templates instead of manual annotation. Encoding different weights for the contour points gives the templates more discriminative power in matching. The experiments show the superiority of the proposed method in detection accuracy.


Introduction
Detecting humans is an important topic in many applications, such as intelligent surveillance and intelligent transportation systems (ITSs), and has received considerable attention. However, vision-based human detection is still challenging due to factors including varied illumination conditions, complex backgrounds, various types of clothes, occlusion, and a broad range of human poses and views. Compared to stereo vision, a monocular solution demands less computation and eases the calibration process. Therefore, we present an approach for detecting humans based on monocular vision. Since the camera is mounted on a moving platform, the background is not static, so the background subtraction approaches widely used for identifying the regions of human candidates are inapplicable in our work. The most common method of human detection in the literature is the sliding window strategy, which formulates the detection problem as a binary classification one. It scans an image pyramid with a fixed-size window, and the bounding boxes around humans are then determined by a non-maximum suppression process.
To mitigate the difficulty caused by intra-class variance, shape is an effective feature for representing the human appearance and determining whether a human exists in a single window. A comprehensive survey of the use of the shape feature can be found in [1]. The experiments on three popular datasets are discussed later in the paper. Finally, we conclude this paper in Section 5 with some discussion.

Template-Based Classifier
Using a binary contour template to describe the human shape is popular in the literature. To improve the discriminative ability of contour templates, we assign an importance to each contour point instead of treating all points as equally weighted, as is done in the literature. The weighted contour templates are constructed using EM [25].

Problem Formulation
A weighted contour human template $\theta_j = \{(p_k^{(j)}, \alpha_k^{(j)})\}_{k=1}^{|\theta_j|}$ is a binary contour image, where $p_k^{(j)}$ is the position of the $k$th contour point in $\theta_j$, $\alpha_k^{(j)}$ indicates the matching importance of $p_k^{(j)}$, and $|\theta_j|$ denotes the number of contour points in the template $\theta_j$. Figure 1a shows an example of a weighted template. Every white block denotes a contour point $p_k^{(j)}$ and the number inside the block is its associated weighting factor $\alpha_k^{(j)}$. The matching difference between $\theta_j$ and a binary edge image $y$, composed of a set of edge points, is measured with the Chamfer distance. It is expressed through $DT_y(p)$, the distance transform (DT) [26] of the binary edge image $y$, which is defined as the distance from the pixel $p$ to its closest edge point in $y$:
$$DT_y(p) = \min_{q \in y} \|p - q\| \qquad (2)$$
where $\|\cdot\|$ means the Euclidean distance. To cover a wide range of human postures, we construct a set of representative weighted contour templates $\Theta = \{(\lambda_j, \theta_j)\}_{j=1}^{|\Theta|}$, where $|\Theta|$ denotes the number of representative weighted contour templates, $\lambda_j$ is the weight of the template $\theta_j$, and $\sum_{j=1}^{|\Theta|} \lambda_j = 1$.
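The distance transform in (2) and a weighted Chamfer match can be sketched as follows. This is a minimal illustration, not the authors' implementation: the brute-force `distance_transform`, the helper name `weighted_chamfer`, and in particular the weight-normalized average used as the matching distance are assumptions, since the exact matching formula is not reproduced in the text above.

```python
import math

def distance_transform(edge_points, height, width):
    """Brute-force version of (2): distance from each pixel to its closest edge point."""
    return [[min(math.hypot(r - er, c - ec) for er, ec in edge_points)
             for c in range(width)]
            for r in range(height)]

def weighted_chamfer(dt, template):
    """ASSUMED form: average of the DT values at the contour points, weighted by alpha."""
    total_weight = sum(alpha for _, alpha in template)
    return sum(alpha * dt[r][c] for (r, c), alpha in template) / total_weight

# Toy 3x3 edge image with two edge points (row, column).
edge_points = [(0, 0), (2, 2)]
dt = distance_transform(edge_points, 3, 3)

# Template: list of ((row, column), alpha) pairs.
template = [((0, 0), 0.9), ((0, 2), 0.1)]
score = weighted_chamfer(dt, template)
```

In this toy case the heavily weighted point lies exactly on an edge, so the mismatch of the lightly weighted point contributes little to the score. A production system would use a linear-time distance transform instead of the brute-force loop.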

Expectation Maximization (EM)-Based Formulation
By formulating the template construction problem as a maximum likelihood one, the EM algorithm [25] is adopted to obtain $\Theta$ without human intervention. Let $Y = \{(y_i, t_i)\}_{i=1}^{|Y|}$ be a set of $|Y|$ training samples, where $y_i$ is the binary edge image of the $i$th sample and $t_i \in \{+1, -1\}$ is the ground-truth label of $y_i$. The binary edge images for training are all obtained by applying the Canny edge detector. Figure 1b shows some training examples, including both positive and negative ones. Based on the assumption that the training samples are independent and identically distributed (i.i.d.), the likelihood of $Y$ given $\Theta$ can be defined as $\Pr(Y|\Theta) = \prod_{i=1}^{|Y|} \Pr(y_i|\Theta)$. Accordingly, maximizing $\Pr(Y|\Theta)$ leads to the construction of a set of weighted contour templates, denoted as $\hat{\Theta}$. Since a sum is easier to implement than a product, it is common to calculate the $\hat{\Theta}$ that maximizes the log-likelihood of $Y$, that is,
$$\hat{\Theta} = \arg\max_{\Theta} \sum_{i=1}^{|Y|} \log \Pr(y_i|\Theta) \qquad (3)$$
A latent random variable $Z = \{z_i\}_{i=1}^{|Y|}$ is therefore introduced to model the relation between $Y$ and $\Theta$, where $z_i \in \{1, 2, ..., |\Theta|\}$ is a discrete random variable that defines which template the observed image $y_i$ comes from. Given the observed data $Y$ and the currently estimated $\Theta^{(m)}$, we make a guess about $Z$ and find the $\Theta^{(m+1)}$ that maximizes the expectation of $\log \Pr(Y, Z|\Theta)$, which is called the Q-function in the EM literature:
$$Q(\Theta, \Theta^{(m)}) = \sum_{i=1}^{|Y|} \sum_{j=1}^{|\Theta|} \log \Pr(y_i, z_i = j|\Theta) \, \Pr(z_i = j|y_i, \Theta^{(m)}) \qquad (4)$$
The first term $\Pr(y_i, z_i = j|\Theta)$ on the right-hand side of (4) evaluates the possibility that the training sample $y_i$ comes from the $j$th template. Since the templates are human contours, the similarity between $y_i$ and the $j$th template is used to define this term. Similar to the definition of the normal distribution, $\Pr(y_i, z_i = j|\Theta)$ is defined from the matching distance between $y_i$ and $\theta_j$, where $\beta$ is a parameter that controls the effect of the matching distance and is set to 0.01. Notably, this term is a function of the unknown parameter $\Theta$. The second term $\Pr(z_i = j|y_i, \Theta^{(m)})$ denotes the probability that the $i$th training image $y_i$ belongs to the $j$th template based on the estimated $\Theta^{(m)}$ and can be evaluated as:
$$\Pr(z_i = j|y_i, \Theta^{(m)}) = \frac{\Pr(y_i, z_i = j|\Theta^{(m)})}{\sum_{l=1}^{|\Theta|} \Pr(y_i, z_i = l|\Theta^{(m)})} \qquad (6)$$
For notational simplicity, let $\gamma_{ij}^{(m)} = \Pr(z_i = j|y_i, \Theta^{(m)})$. The Q-function in (4) accordingly becomes:
$$Q(\Theta, \Theta^{(m)}) = \sum_{i=1}^{|Y|} \sum_{j=1}^{|\Theta|} \gamma_{ij}^{(m)} \log \Pr(y_i, z_i = j|\Theta) \qquad (7)$$
Sensors 2019, 19, 1458

Template Construction Algorithm
After introducing the EM framework for formulating the problem of weighted template construction, we elaborate on the implementation in this section. Initially, an incremental clustering similar to [27] is applied to generate a set of good initial templates. In this stage, all contour points in a template are treated as equally important and have the same weight. After obtaining the initial templates, the E-Step and M-Step are performed iteratively until the convergence condition is reached. At each round $m$, the E-Step calculates, for every training sample, the probability that it is derived from each weighted contour template at the current stage, following the definition of $\gamma_{ij}^{(m)}$. According to the estimated $\gamma_{ij}^{(m)}$, the M-Step updates all weighted contour templates at the current stage, denoted as $\Theta^{(m)}$, to obtain a set of new weighted contour templates $\Theta^{(m+1)}$ so that the Q-function is maximized.
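The E-Step described above can be illustrated with a short sketch. The normalization over templates is the usual posterior computation; the joint term, however, is an assumption: it is taken here as the template weight times `exp(-beta * distance)`, which is only one plausible reading of the "similar to the normal distribution" definition, and the function name `e_step` is hypothetical.

```python
import math

def e_step(distances, template_weights, beta=0.01):
    """Return gamma[i][j], the posterior that sample i comes from template j.

    distances[i][j] is the matching distance between sample i and template j.
    ASSUMPTION: the joint term is lambda_j * exp(-beta * distance).
    """
    gammas = []
    for row in distances:
        joint = [lam * math.exp(-beta * d) for lam, d in zip(template_weights, row)]
        total = sum(joint)
        gammas.append([value / total for value in joint])
    return gammas

# Two samples, two templates with equal weights.
distances = [[0.0, 50.0], [30.0, 30.0]]
gammas = e_step(distances, [0.5, 0.5])
```

Each row of `gammas` sums to one, and a sample that matches one template more closely receives a larger responsibility for that template.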
For each $\theta_j^{(m)} \in \Theta^{(m)}$, the associated template weight $\lambda_j$ is updated first. The update of the template $\theta_j^{(m)}$ then starts by determining the number of contour points $|\theta_j^{(m+1)}|$ according to (9), where $|y_i|$ is the number of contour points in the training image $y_i$. The next step is to localize all contour points. Let $c_j(m, n)$ denote the confidence value of the point $(m, n)$ being a contour point, where $(m, n)$ here denotes pixel coordinates rather than the round index. The confidence $c_j(m, n)$ is defined in (10), where $Y^+ = \{y_i | t_i = +1\}$ is the set of positive training images and $b_i(m, n)$ indicates whether the point $(m, n)$ of the training image $y_i$ is a contour point ($b_i(m, n) = 1$) or not ($b_i(m, n) = 0$). By sorting the points in descending order of their confidence values, we label the first $|\theta_j^{(m+1)}|$ points as the contour points.
The last step is to determine the weight (importance) $\alpha_k^{(j)}$ of each contour point according to its power in distinguishing human from non-human. Let $F^+(p_k^{(j)})$ and $F^-(p_k^{(j)})$ be the average matching distances of the positive and negative training sets, $Y^+$ and $Y^-$, respectively, to the contour point $p_k^{(j)}$. Their definition is given in (11), and the weight evaluation is illustrated schematically in Figure 2a.
Here, the contrast between $F^+(p_k^{(j)})$ and $F^-(p_k^{(j)})$ is used to define the weight according to (12). When $\alpha_k^{(j)}$ is less than or equal to 0.5, the point $p_k^{(j)}$ has no matching contribution. Algorithm 1 gives the pseudo code of the detailed implementation.

Algorithm 1: Algorithm for Weighted Template Construction
• Apply the distance transform to all training samples.
• Take the samples in the positive set $Y^+$ to generate a set of $|\Theta|$ initial templates $\{\theta_j\}_{j=1}^{|\Theta|}$ by using incremental clustering.
• Set all template weights to $\lambda_j = 1/|\Theta|$ and $m \leftarrow 0$.
repeat
• E-Step: Form the Q-function defined in (7).
• M-Step: Update the parameters as follows to maximize the Q-function.
for each template $\theta_j^{(m)}$ do
(1) Update the template weight $\lambda_j$.
(2) Determine the number of contour points $|\theta_j^{(m+1)}|$ and their positions $p_k^{(j)}$ according to (9) and (10), respectively.
(3) Assign a weight $\alpha_k^{(j)}$ to each contour point according to (12).
end for
• $m \leftarrow m + 1$
until the convergence condition is reached
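The point localization in the M-Step (sorting pixels by confidence and keeping the top ones) can be sketched as below. The ranking step follows the description in the text; the two formulas feeding it are assumptions, because (9) and (10) are not reproduced here: the number of points is taken as the responsibility-weighted average edge count, and the confidence as the responsibility-weighted frequency with which a pixel is an edge point in the positive samples. The function name `localize_contour_points` is hypothetical.

```python
def localize_contour_points(edge_maps, gammas_j):
    """Pick the contour points of template j from positive samples.

    edge_maps[i][r][c] is 1 if pixel (r, c) of positive sample i is an edge point.
    gammas_j[i] is the responsibility of template j for sample i.
    """
    norm = sum(gammas_j)
    height, width = len(edge_maps[0]), len(edge_maps[0][0])

    # ASSUMED stand-in for (9): weighted average number of edge points.
    n_points = round(
        sum(g * sum(map(sum, b)) for g, b in zip(gammas_j, edge_maps)) / norm
    )

    # ASSUMED stand-in for (10): weighted frequency of each pixel being an edge.
    confidence = {
        (r, c): sum(g * b[r][c] for g, b in zip(gammas_j, edge_maps)) / norm
        for r in range(height)
        for c in range(width)
    }

    # Sort pixels by descending confidence and keep the first n_points.
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    return ranked[:n_points]

edge_maps = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 1]],
]
points = localize_contour_points(edge_maps, [0.75, 0.25])
```

With these toy inputs, the pixel shared by both samples ranks first, followed by the pixel that is an edge only in the sample with the larger responsibility.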

Classifier Formation and Analysis
In this section, we describe how to learn a classifier based on a set of weighted contour templates and analyze the performance improvement obtained by imposing a weight on every contour point. The dataset used consists of 924 positive subjects from the MIT CBCL dataset [28] and 3342 negative ones from the INRIA dataset [29]. First, half of the dataset is taken as the training dataset and used to construct a set of weighted contour templates. The 10 weighted contour templates generated by the EM algorithm are shown in Figure 2b. The high-weight contour points are labelled in red and are located at the salient body parts, such as the head or shoulders. The low-weight contour points, in green, lie on background edges or in the interior of body parts. This shows that the weighted contour templates constructed by the proposed EM algorithm are effective in representing the contour of a human. A classifier $H_G(.)$, called the global classifier, is thus defined on the constructed weighted contour templates $\hat{\Theta}$ to determine the existence of a human, where its threshold $TH_G$ is set to the value that minimizes the training error. The learned classifier $H_G(.)$ is then applied to the other half of the dataset, called the testing dataset, for analysis. To validate the effectiveness of imposing a weight on every contour point, $H_G(.)$ is compared with the approach that only uses binary templates, which treats the contour points as equally weighted. Figure 3 shows the ROC (receiver operating characteristic) curves of the proposed classifier $H_G(.)$ using weighted contour templates and of the one using binary templates. The proposed classifier $H_G(.)$ has superior performance.
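Choosing the threshold that minimizes the training error can be sketched as follows. This is an illustration under assumptions: each sample is summarized by its smallest matching distance over all templates, the global classifier outputs +1 when that distance does not exceed the threshold, and the names `train_threshold` and `global_classifier` are hypothetical.

```python
def global_classifier(min_distance, threshold):
    """ASSUMED decision rule: human (+1) if the best template match is close enough."""
    return +1 if min_distance <= threshold else -1

def train_threshold(min_distances, labels):
    """Try every observed distance as a threshold and keep the one with fewest errors."""
    best_threshold, best_errors = None, float("inf")
    for threshold in sorted(set(min_distances)):
        errors = sum(
            global_classifier(d, threshold) != t
            for d, t in zip(min_distances, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# Toy training set: small distances for humans, large ones for non-humans.
min_distances = [1.0, 2.0, 5.0, 6.0]
labels = [+1, +1, -1, -1]
threshold, errors = train_threshold(min_distances, labels)
```

Only the observed distances need to be tried as candidates, because the training error can change only when the threshold crosses one of them.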

Training Framework
HOGs, proposed in [14], are an effective feature for representing the human appearance in a local patch. The human appearance is described simply by concatenating thousands of local HOGs, and an SVM classifier is trained for human and non-human discrimination in this high-dimensional feature space. To reduce the time complexity of the detection process, the work in [30] learns an SVM classifier for each patch represented by the HOGs feature and uses a boosting algorithm to select a set of SVM classifiers to form a human detector. Boosting approaches the solution by iteratively reducing the training error with a set of additive classifiers. However, HOGs, as a kind of local feature, generally suffer from false detections in the case of a complex background or noise. Motivated by [31], we alleviate this problem by imposing the learned classifier $H_G(.)$ on Zhu's boosting framework [30]. This integrates the global contour and local HOGs features so that the detection accuracy can be improved.

Biased Boosting
First of all, we briefly describe Zhu's boosting framework for learning a human detector. Let $H$ be a set of learned SVM classifiers, each of which, $h \in H$, is referred to as a weak classifier in the boosting literature. Initially, each training sample $y_i$ is assigned a weight $D_i$. At each round $m$, the training error of every weak classifier is evaluated on the weighted samples as defined in (14), where $\mathbb{1}[\cdot]$ is an indicator function, and the selected weak classifier $h^{(m)}$ is the one with the minimal training error. For human and non-human discrimination, $h^{(m)}$ makes its decision with an SVM hyperplane $\phi_{svm}^{(m)}(.)$ based on a specific local HOGs patch of $y_i$. A confidence $\pi^{(m)}$ is then set for the selected weak classifier $h^{(m)}$, and the weight $D_i^{(m)}$ of each training sample $y_i$ is updated accordingly.
The integration of the global contour with the local HOGs features is achieved by adjusting the bias of the SVM classifier at each round $m$. The samples classified as human by $H_G(.)$ generally have a human-like contour and a high possibility that their ground-truth labels are positive (human). To reflect this, we move the hyperplane $\phi_{svm}^{(m)}(.)$ by a bias. In short, the weak classifier $h^{(m)}$ of the original boosting framework is decomposed into $h^{+(m)}$ and $h^{-(m)}$, with the biases $Th_G^{+}$ and $Th_G^{-}$, respectively. The training error over all training samples is re-expressed with the biased classifiers, and each sample weight is updated correspondingly. Finally, we obtain the human detector consisting of $H_G(.)$ and the selected weak classifiers $h^{(m)}$ with their confidences $\pi^{(m)}$. The values of $Th_G^{+}$ and $Th_G^{-}$ significantly affect the detection performance of $h^{(m)}$, and their determination is deferred to the next section.

Input: A set of training samples
• Initialize the positive sample weight to $1/|Y^+|$ and the negative sample weight to $1/|Y^-|$.
for each round $m$ do
(1) Find the classifier $h^{(m)}$ that has the minimal error defined in (14).
(2) Estimate the two bias values $Th_G^{+}$ and $Th_G^{-}$.
(3) Update the weight of each training sample.
end for
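One boosting round (weighted error, classifier confidence, and sample re-weighting) can be sketched as below. The formulas used are those of standard discrete AdaBoost and are an assumption here, since the paper's own expressions for the confidence and the weight update are not reproduced in the text; the function name `boosting_round` and the normalized initial weights are likewise illustrative.

```python
import math

def boosting_round(predictions, labels, weights):
    """One round of discrete AdaBoost (ASSUMED form of the confidence and update).

    predictions and labels hold +1/-1 values; weights must sum to 1 and the
    weighted error must lie strictly between 0 and 1.
    """
    # Weighted training error: total weight of the misclassified samples.
    error = sum(w for p, t, w in zip(predictions, labels, weights) if p != t)
    # Confidence of the selected weak classifier.
    confidence = 0.5 * math.log((1.0 - error) / error)
    # Raise the weight of misclassified samples, lower the others, renormalize.
    updated = [w * math.exp(-confidence * t * p)
               for p, t, w in zip(predictions, labels, weights)]
    norm = sum(updated)
    return error, confidence, [w / norm for w in updated]

labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]   # the second sample is misclassified
weights = [0.25, 0.25, 0.25, 0.25]
error, confidence, new_weights = boosting_round(predictions, labels, weights)
```

After the update, the misclassified sample carries half of the total weight, which is what pushes the next round to focus on it.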

Bias Determination
The main concept of boosting is to choose a weak classifier at each round $m$ so as to maximally reduce the error rate on the weighted training set. To conform to this, a strategy for searching for appropriate bias values is proposed as follows. We adjust $Th_G^{+}$ and $Th_G^{-}$ to lower the total error rate $\xi$ by increasing them with a step of 0.05. If the error of the biased weak classifier exceeds the initial error (obtained with $Th_G^{+} = Th_G^{-} = 0.0$), the search is stopped. The value that yields the lowest error rate within the search interval is taken as the final bias. Figure 4 illustrates the proposed strategy for bias determination.
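The search strategy above can be sketched for a single bias value. This is a simplified illustration under assumptions: the bias is added to the SVM score before taking the sign, only one of the two biases is searched, the `max_steps` cap is added to guarantee termination, and the names `search_bias` and `weighted_error` are hypothetical.

```python
def search_bias(error_at, step=0.05, max_steps=100):
    """Increase the bias in fixed steps and keep the value with the lowest error.

    error_at(bias) returns the weighted error of the biased weak classifier.
    The search stops once the error exceeds the error at bias 0.0.
    """
    initial_error = error_at(0.0)
    best_bias, best_error = 0.0, initial_error
    for k in range(1, max_steps + 1):
        bias = k * step
        error = error_at(bias)
        if error > initial_error:
            break
        if error < best_error:
            best_bias, best_error = bias, error
    return best_bias, best_error

# Toy sample group: SVM scores, ground-truth labels, and equal sample weights.
scores = [-0.12, -0.08, 0.30, -0.20, -0.21, -0.22]
labels = [+1, +1, +1, -1, -1, -1]
weights = [1.0 / 6.0] * 6

def weighted_error(bias):
    # ASSUMPTION: the bias shifts the SVM score before the sign is taken.
    return sum(w for s, t, w in zip(scores, labels, weights)
               if (+1 if s + bias > 0 else -1) != t)

best_bias, best_error = search_bias(weighted_error)
```

In this toy case the two weakly negative human samples are recovered at a bias of about 0.15, and the search stops once the bias starts to turn the non-human samples into false positives.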

Detection Process
In this section, we describe how to determine the existence of a human in a scanning window $y$ of an image using the learned detector $H(.)$. The first step is to check whether the appearance of $y$ has a human-like contour. If so, the set of positive-based weak classifiers is used for further classification; otherwise, the set of negative-based weak classifiers is used. The flow chart of the detection process is illustrated in Figure 5. The final human detector is formally defined along these two branches.
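The two-branch decision described above can be sketched as follows. The branch selection follows the text; everything else is an assumption made for illustration: the window is represented by a dictionary with hypothetical keys `contour_distance` and `hog_score`, each branch holds a single weak classifier, and the bias values and their signs are arbitrary.

```python
def detect(window, global_classifier, positive_branch, negative_branch):
    """Two-branch detection: the contour check selects which weak classifiers vote."""
    if global_classifier(window) == +1:
        branch = positive_branch
    else:
        branch = negative_branch
    # Confidence-weighted vote of the weak classifiers in the chosen branch.
    score = sum(confidence * weak(window) for confidence, weak in branch)
    return +1 if score > 0 else -1

def biased_weak(bias):
    """Hypothetical weak classifier: sign of a precomputed SVM score plus a bias."""
    return lambda window: +1 if window["hog_score"] + bias > 0 else -1

# Assumed stand-ins for the learned components.
global_classifier = lambda window: +1 if window["contour_distance"] <= 2.0 else -1
positive_branch = [(1.0, biased_weak(+0.15))]
negative_branch = [(1.0, biased_weak(-0.15))]

human_like = {"contour_distance": 1.5, "hog_score": -0.05}
cluttered = {"contour_distance": 9.0, "hog_score": -0.05}
result_human_like = detect(human_like, global_classifier, positive_branch, negative_branch)
result_cluttered = detect(cluttered, global_classifier, positive_branch, negative_branch)
```

Both windows have the same local HOGs score, yet the one with a human-like contour is accepted and the other is rejected, which illustrates how the global contour cue can resolve an ambiguous local response.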

Experiment
To validate the effectiveness of the proposed method, called WTM-Boost, we implement three methods for comparison: those proposed in [14] (HOG-SVM), [3] (WTM), and [32] (TM-Boost). We consider these three algorithms because HOG-SVM uses local HOGs features, WTM is based on weighted templates, which are a kind of global feature, and TM-Boost combines local and global features but only uses binary contour templates instead of weighted ones. The templates in our work and in [32] are both learned with the EM algorithm, but those used in [32] are binary contour templates whose points are all considered equally weighted in matching. The templates used in [3] are formed with the k-means clustering algorithm with k = 10 in order to obtain the same number of templates as in our work and [32]. In [14], the human appearance is described by dense HOGs and an SVM classifier is learned for human detection. In our implementation, the parameter settings for the HOGs representation and SVM classifier learning are the same for all of these methods. The sizes of the HOGs blocks used are 16, 24, 36, 48, and 60, and the aspect ratio of each block can be one of the following: (1:1), (1:2), and (2:1).
The cost constant for training the SVM used as the weak classifier of a block is 1.0, and the kernel is the Gaussian radial basis function. The number of weak classifiers used in the boosting framework is 40 in all methods. The white rectangles in Figure 6 show the 40 weak classifiers learned by biased boosting for TM-Boost and WTM-Boost, respectively. The aforementioned methods are implemented in the C programming language with the support of the OpenCV library and are run on a computer with an Intel i7 3.4 GHz CPU and 8 GB of RAM. No GPU is used for acceleration. Table 1 lists the average processing time over all testing samples for the four implemented methods at the different stages. Since HOG-SVM describes the human appearance in a dense manner, it spends more time on HOGs computation than TM-Boost and WTM-Boost. However, HOG-SVM performs SVM classification more efficiently than TM-Boost and WTM-Boost, which have to evaluate 40 weak classifiers.
For performance validation, we use three popular human datasets, MIT CBCL, INRIA, and CVC, in our experiments. The statistics of the images from the three datasets for training and testing are listed in Table 2. Of the training samples, all 924 human images in the CBCL dataset are used as positive samples, while the negative samples are 3342 randomly chosen images from the INRIA dataset because there are no non-human images in the CBCL dataset. The training dataset is used for weighted template construction as well as detector boosting. For validating the trained detector at the testing stage, the positive and negative images come from the INRIA dataset in experiment I and from the CVC dataset in experiment II.
The ROC (receiver operating characteristic) curve, which illustrates the relation between the detection rate and the false positive rate, is used for objective evaluation. Figures 7 and 8 show the ROC curves of the four methods for the INRIA and CVC datasets, respectively. Detectors learned by machine learning algorithms, such as boosting and SVM, are superior to the template-matching algorithm on both datasets. This is because a small number of templates can hardly model the significant appearance variation in human pose. The curves of the proposed WTM-Boost method for both datasets are closer to the top-left corner and exhibit better performance. Imposing the contour template on the boosting framework makes the global contour and local HOGs features complement each other, so that both TM-Boost and WTM-Boost outperform HOG-SVM. Besides, using weighted contour templates to describe the human appearance in various poses is more effective than using binary ones, which is why the proposed WTM-Boost has better accuracy than TM-Boost.
To further validate this point, we replace the training samples from the MIT CBCL dataset with those from the INRIA and CVC datasets, respectively, for experiments I and II, to construct the templates for matching in the TM-Boost method. The resulting templates are used for global classifier learning, followed by boosting the human detector, as mentioned. The ROC curves of the human detector learned using TM-Boost in this way for the INRIA and CVC datasets are shown in Figures 7 and 8, respectively. The accuracy is almost the same as that of WTM-Boost. This indicates that the performance difference between TM-Boost and WTM-Boost comes from the representation ability of the templates used. In other words, the proposed WTM-Boost method can alleviate the overfitting problem because it assigns weights to the contour points.
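For reference, the points of an ROC curve such as those used in the evaluation above can be computed from detector scores as in the following sketch. The function name `roc_points` and the toy scores are illustrative only and are not taken from the reported experiments.

```python
def roc_points(scores, labels):
    """Return (false positive rate, detection rate) pairs, one per score threshold."""
    positives = sum(1 for t in labels if t == +1)
    negatives = len(labels) - positives
    points = []
    # Lower the threshold step by step; a window is accepted if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, t in zip(scores, labels) if s >= threshold and t == +1)
        false_pos = sum(1 for s, t in zip(scores, labels) if s >= threshold and t == -1)
        points.append((false_pos / negatives, true_pos / positives))
    return points

scores = [0.9, 0.8, 0.4, 0.2]
labels = [+1, -1, +1, -1]
curve = roc_points(scores, labels)
```

A curve that approaches the top-left corner reaches a high detection rate at a low false positive rate, which is the criterion used to compare the methods.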

Conclusions
The main contribution of our work lies in two aspects. Firstly, we propose a method based on the EM algorithm to automatically construct a set of representative weighted contour templates. By formulating the problem of template construction as a maximum likelihood one, the contour templates as well as the contour point weights are determined in the M-Step according to the likelihood probabilities of all training samples estimated in the E-Step. The assignment of different weights to the contour points gives the constructed templates more discriminative power.
Secondly, we systematically integrate the global contour and local HOGs features in the proposed biased boosting framework. The bias values, for samples with contours similar to and different from the pedestrian templates, respectively, are determined by finding the values that minimize the error rate. Compared with the other two approaches, the experimental results show that the trained pedestrian detector increases the detection rate and reduces the false positive rate as well. Given the effectiveness and power of deep learning, its use is the trend in the detection area [33,34]. One of the main advantages of deep learning is the extraction of semantic features through the convolution and pooling layers. Since our proposed boosting framework is designed to integrate various features, it could be used to fuse the features extracted by deep learning in our future work.