Robust Pedestrian Classification Based on Hierarchical Kernel Sparse Representation

Vision-based pedestrian detection has become an active topic in computer vision and autonomous vehicles. It aims at detecting pedestrians appearing ahead of the vehicle using a camera so that autonomous vehicles can assess the danger and take action. Due to varied illumination and appearance, complex backgrounds and occlusion, pedestrian detection in outdoor environments is a difficult problem. In this paper, we propose a novel hierarchical feature extraction and weighted kernel sparse representation model for pedestrian classification. First, hierarchical feature extraction based on a CENTRIST descriptor is used to capture discriminative structures, and a max pooling operation is used to enhance invariance to varying appearance. Then, a kernel sparse representation model is proposed to fully exploit the discrimination information embedded in the hierarchical local features, and a Gaussian weight function is used as the measure to effectively handle occlusion in pedestrian images. Extensive experiments are conducted on benchmark databases, including INRIA, Daimler, an artificially generated dataset and a real occluded dataset, demonstrating that the proposed method is more robust than state-of-the-art pedestrian classification methods.


Introduction
Pedestrian safety is an important problem for autonomous vehicles. A World Health Organization report describes road accidents as one of the significant causes of fatalities. About 10 million people become traffic casualties around the world each year, and two to three million of these people are seriously injured. The development of pedestrian protection systems (PPS) dedicated to reducing the number of fatalities and the severity of traffic accidents is an important and active research area. PPS typically use forward vision sensors to detect pedestrians. Notwithstanding years of methodical and technical progress (see, e.g., [1][2][3]), pedestrian detection is still a difficult task from a machine-vision point of view. Pedestrian appearance varies widely with articulated pose, clothing and lighting, and a moving camera in a changing environment and partial occlusions pose additional problems. For different communities to benchmark and verify their pedestrian detection methods, many large-scale pedestrian data sets, including the Caltech [3], ETH [4], TUD-Brussels [5], Daimler [6], and INRIA [7] data sets, have been established and used as evaluation platforms.
Recently, some researchers and automobile manufacturers have tended to utilize advanced and expensive sensors such as infrared cameras [8,9], radar [10], and laser scanners [11] in order to acquire much more information. The PPS of the SAVE-U system contains a variety of sensors to achieve good system-level performance [12]. However, vision-based PPS is still a valuable strategy for onboard [...] HOG features and SRC with holistic features, the proposed HFE-WKSR model shows much greater robustness to various pedestrian image variations (e.g., illumination, appearance and background) and partial occlusion, as demonstrated in our extensive experiments conducted on benchmark databases. This paper is organized as follows. Section 2 briefly reviews some related work. Section 3 presents the proposed HFE-WKSR algorithm. Section 4 presents the experimental results. Section 5 summarizes this paper.

Sensors 2016, 16, 1296
CENTRIST Features
CENTRIST (CENsus TRansform hISTogram) is a histogram vector designed for establishing correspondence between local patches, first proposed for scene categorization [40]. The census transform (CT) compares the intensity value of a pixel with those of its eight neighboring pixels, as illustrated in Equation (1).
$$\mathrm{CT}(c) = \sum_{k=0}^{7} b_k \, 2^{7-k}, \qquad b_k = \begin{cases} 1, & I(c) \ge I(n_k) \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
Here $I(c)$ is the intensity of the center pixel $c$ and $n_0, \ldots, n_7$ are its eight neighbors. If the intensity value of the center pixel is greater than or equal to that of a neighbor, a bit "1" is set in the corresponding location; otherwise a bit "0" is set. The eight bits are collected in left-to-right, top-to-bottom order, and the resulting bit stream is converted to a base-10 number in [0, 255]. This is the CT value of the center pixel. After the pixel values are replaced by their CT values, the corresponding CT image is obtained. The CENTRIST descriptor is a histogram with 256 bins of these CT values, computed over an entire image or a rectangular region of an image.
The CENTRIST feature is robust with regard to illumination changes and gamma variations. It is a powerful tool for capturing local structures as well as global contours beyond the small 3 × 3 range. Figure 1a,b shows a 108 × 36 human image and its contour. We divide this image into 12 × 4 blocks, so each block has 81 pixels. We can find a similar image that has the same pixel intensity histogram and CENTRIST descriptor through a reconstruction algorithm [40]. As shown in Figure 1c, the reconstructed image is similar to the original image. The global characteristics of the human contour are well preserved in spite of errors in the left part of the human. This example shows that CENTRIST not only encodes important local information but also implicitly encodes the global contour, which encourages us to use it as a suitable representation for object detection. The speed of feature extraction is very important, because real-time detection is a prerequisite in PPS. Compared with SIFT and HOG, CENTRIST not only exhibits good performance but is also easy to implement and extremely fast to evaluate.
In order to capture the rough global information of an image, CENTRIST generally uses the spatial pyramid framework, which is an extension of the SPM scheme in [41]. As shown in Figure 2, it rescales the image for the different levels and adds the overlapped regions indicated by dashed lines, so that it contains 31 blocks of the same size in 3 levels. The CENTRISTs extracted from all the blocks are then concatenated to form the final feature vector. Feature pyramid representations have proven effective for visual processing tasks such as denoising, texture analysis and recognition [42].
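The census transform and the CENTRIST histogram described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, border pixels are simply skipped (an assumption, since the text does not say how borders are handled), and the first bit of the stream is taken as the most significant one.

```python
def census_transform(image):
    """Return the CT image (interior pixels only) of a 2-D list of intensities."""
    rows, cols = len(image), len(image[0])
    ct = []
    for r in range(1, rows - 1):
        ct_row = []
        for c in range(1, cols - 1):
            center = image[r][c]
            value = 0
            # Visit the 8 neighbours left to right, top to bottom.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    bit = 1 if center >= image[r + dr][c + dc] else 0
                    value = (value << 1) | bit  # first bit is most significant
            ct_row.append(value)
        ct.append(ct_row)
    return ct


def centrist(image):
    """256-bin histogram of the CT values of an image or image region."""
    hist = [0] * 256
    for ct_row in census_transform(image):
        for value in ct_row:
            hist[value] += 1
    return hist


# A 3 x 3 patch has a single interior pixel, so it yields one CT value.
patch = [[32, 64, 96],
         [32, 64, 96],
         [32, 32, 96]]
ct_value = census_transform(patch)[0][0]  # bits 11010110
```

For this patch the center value 64 is compared with its neighbors in reading order, giving the bit stream 11010110, i.e., the CT value 214.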

Sparse Representation Classifier
The sparse representation classifier (SRC) is a nonparametric learning method similar to nearest neighbor (NN) and nearest subspace (NS). The basic idea is that the training samples form a training matrix that serves as a dictionary, and the test sample can then be sparsely represented over this dictionary. In other words, a test sample is related to only a few columns of the dictionary. SRC has been successfully applied to frontal face recognition in [36], where it was shown experimentally that SRC has better classification performance and can effectively overcome the small-sample and overfitting problems of NN and NS.
Assume that there is a set of training samples $\{(x_i, l_i) \mid x_i \in \mathbb{R}^m,\ l_i \in \{1, 2, \cdots, c\},\ i = 1, 2, \cdots, n\}$, where $c$ is the number of classes, $m$ is the dimensionality of the input samples, and $l_i$ is the label corresponding to $x_i$. Given a test sample $y$, the goal is to predict the label of $y$ from the given $c$-class training samples. We arrange the training samples of the jth class as the columns of a matrix $X_j = [x_{j,1}, \cdots, x_{j,n_j}] \in \mathbb{R}^{m \times n_j}$, $j = 1, 2, \cdots, c$, where $x_{j,i}$ denotes the ith sample belonging to the jth class and $n_j$ is the number of training samples of the jth class. Define a new dictionary matrix $X$ for all training samples:
$$X = [X_1, X_2, \cdots, X_c] \in \mathbb{R}^{m \times n}$$
where $n = \sum_{j=1}^{c} n_j$. The representation model of SRC can be written as
$$\hat{\alpha} = \arg\min_{\alpha} \|y - X\alpha\|_2^2 + \lambda \|\alpha\|_1 \tag{3}$$
where $\alpha$ is the vector of coefficients, which is expected to be sparse, $\lambda$ is the regularization parameter, and $\|\cdot\|_1$ denotes the L1-norm.
The classification of $y$ is done by
$$\mathrm{identity}(y) = \arg\min_{j} \|y - X_j \delta_j(\hat{\alpha})\|_2$$
where $\delta_j(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_j}$ is the characteristic function that selects from $\hat{\alpha}$ the coefficients associated with the jth class. When the L1-norm in Equation (3) is replaced by the L2-norm, we obtain the collaborative representation classifier (CRC). It is shown in [39] that CRC has accuracy comparable to SRC in face recognition without occlusion but is much faster. For occlusion or corruption, Robust-SRC [39] classifies the occluded image $y$ with
$$\mathrm{identity}(y) = \arg\min_{j} \|y - X_e \hat{\alpha}_e - X_j \delta_j(\hat{\alpha})\|_2$$
where
$$[\hat{\alpha}; \hat{\alpha}_e] = \arg\min_{\alpha, \alpha_e} \|y - [X, X_e][\alpha; \alpha_e]\|_2^2 + \lambda \|[\alpha; \alpha_e]\|_1$$
and $X_e$ is an occlusion dictionary that codes the outliers and can be set as the identity matrix.
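The SRC decision rule (sparse coding over the whole dictionary, followed by a class-wise reconstruction residual) can be illustrated with a short, self-contained sketch. It assumes the Lagrangian form $\|y - X\alpha\|_2^2 + \lambda\|\alpha\|_1$ and uses plain coordinate descent as a stand-in solver; this is not the solver used in [36] or in this paper, and all function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used by L1 coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(columns, y, lam, sweeps=200):
    """Minimise ||y - X a||_2^2 + lam * ||a||_1 by coordinate descent.

    `columns` is a list of dictionary atoms (the columns of X).
    """
    m = len(y)
    alpha = [0.0] * len(columns)
    residual = list(y)  # y - X a, kept up to date
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm_sq = sum(v * v for v in col)
            if norm_sq == 0.0:
                continue
            # Correlation of atom j with the residual that excludes atom j.
            rho = sum(col[k] * residual[k] for k in range(m)) + norm_sq * alpha[j]
            new = soft_threshold(rho, lam / 2.0) / norm_sq
            delta = new - alpha[j]
            if delta != 0.0:
                for k in range(m):
                    residual[k] -= delta * col[k]
                alpha[j] = new
    return alpha


def src_classify(columns, labels, y, lam=0.01):
    """Return the label whose atoms reconstruct y with the smallest residual."""
    alpha = lasso_cd(columns, y, lam)
    best_label, best_res = None, float("inf")
    for label in sorted(set(labels)):
        # Reconstruct y using only the coefficients of this class.
        recon = [0.0] * len(y)
        for col, lab, a in zip(columns, labels, alpha):
            if lab == label:
                for k in range(len(y)):
                    recon[k] += a * col[k]
        res = sum((y[k] - recon[k]) ** 2 for k in range(len(y))) ** 0.5
        if res < best_res:
            best_label, best_res = label, res
    return best_label


# Toy dictionary: two atoms per class in a 3-D feature space.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]]
atom_labels = [0, 0, 1, 1]
```

With this toy dictionary, a query such as `[0.7, 0.3, 0.0]` is explained almost entirely by the class-0 atoms, so its class-0 residual is the smallest.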

Hierarchical Features Extraction
The appearance of pedestrians exhibits very high variability since they can change pose, wear different clothes, carry different objects, and have a considerable range of sizes. Pedestrians can also be partially occluded by common urban elements, such as parked vehicles or street furniture. Classical feature extraction methods such as HOG mainly consider the global scatter of samples and may fail to reveal local discriminative structures of the object. In this section, we propose a hierarchical feature extraction (HFE) technique to capture discriminative structures at varying scales.
First, we adopt an $(S + 1)$-level block partition indexed by $s = 0, 1, \ldots, S$. That is, at the sth level the whole image is divided into $p_s \times q_s$ blocks, each of which is further partitioned into $p_s \times q_s$ sub-blocks. Different from the partition of the spatial pyramid, such as 1 × 1, 2 × 2, 4 × 4, we adopt a more flexible partition. As shown in the first row of Figure 3, for example, the sample can be partitioned as 2 × 2, 3 × 2, and 4 × 3, respectively, giving 22 blocks of three different sizes in total. This kind of partition can flexibly set the number of blocks at each scale and is expected to capture more spatial discrimination information than the spatial pyramid. As shown in the second row of Figure 3, in each sub-block we first create a sequence of 3 × 3 sliding boxes (e.g., the red box shown in Figure 3), and then compute the CENTRIST descriptor of each box as a local feature. In this paper, HFE is defined as the one with the following setting: $p_s = 2$ and $q_s = 2$ for partition scales $s = 0$ and $1$; $p_s = 1$ and $q_s = 1$ for $s > 1$.
Pooling techniques are widely used in object and image classification to extract invariant features [43,44]. In this paper, max pooling is applied to the series of local features generated in each partitioned sub-block. Let $f_i$ denote the feature vector extracted from the ith sliding box, and suppose that there are $n$ feature vectors, $f_1, f_2, \ldots, f_n$, extracted from all possible sliding boxes in the sub-block. The final output feature vector after max pooling, denoted by $f$, is
$$f(j) = \max\{f_1(j), f_2(j), \ldots, f_n(j)\}$$
where $f(j)$ and $f_i(j)$ are the jth elements of $f$ and $f_i$, respectively.
Let us suppose that the sample is partitioned into $B$ blocks in total. In each block, after extracting the max pooling (MP) features of every sub-block, we concatenate the MP features of all sub-blocks as the output feature vector. Let $y_i$ denote the output feature vector of the ith block. Then the concatenation of the feature vectors extracted from all blocks, i.e., $y = [y_1, y_2, \ldots, y_B]$, can be taken as the descriptor of the sample image. For example, suppose the size of the original image is 128 × 48. The whole image is divided at three levels as 2 × 2, 3 × 2, and 4 × 3, giving 22 blocks in total. Each block is partitioned into 2 × 2 sub-blocks, for a total of 88 sub-blocks. Each sub-block yields a 16-dimensional feature vector. The final image descriptor, obtained by concatenating all feature vectors, therefore has 1408 dimensions. The proposed HFE method not only introduces more spatial information through its hierarchical structure, but also enhances the robustness to varying illumination and appearance through its use of max pooling.
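The max pooling step and the descriptor-length arithmetic of the example above can be checked with a small sketch. The function names are hypothetical; element-wise maximum over the local feature vectors is assumed, as described in the text.

```python
def max_pool(features):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*features)]


def descriptor_length(partitions, sub_blocks_per_block, dims_per_sub_block):
    """Length of y = [y_1, ..., y_B] for the given block partitions."""
    blocks = sum(p * q for p, q in partitions)
    return blocks * sub_blocks_per_block * dims_per_sub_block


# Three local feature vectors from three sliding boxes of one sub-block.
pooled = max_pool([[0, 3, 1], [2, 1, 1], [1, 0, 4]])

# Partitions 2 x 2, 3 x 2 and 4 x 3 (22 blocks), 2 x 2 sub-blocks, 16 bins.
length = descriptor_length([(2, 2), (3, 2), (4, 3)], 4, 16)
```

Here `pooled` is `[2, 3, 4]` and `length` is 22 × 4 × 16 = 1408, matching the dimensionality stated in the example.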

Robust Kernel Sparse Representation
SRC behaves well in frontal face recognition. However, SRC has poor classification ability even for linearly separable tasks in which the data from different classes lie along the same direction. The main reason is that data in the same direction overlap each other after normalization, so they can no longer be distinguished. To resolve this problem, the kernel trick is introduced into SRC, yielding a kernel sparse representation-based classifier [45].
A kernel satisfying Mercer's condition is called a Mercer kernel and is generally used in kernel methods. In other words, a Mercer kernel is a continuous, symmetric, positive semi-definite kernel function. Usually, a Mercer kernel function $k(\cdot, \cdot)$ can be expressed as
$$k(x, z) = \phi(x)^T \phi(z) \tag{8}$$
where $T$ denotes the transpose of a matrix or vector, and $\phi$ is the implicit nonlinear mapping associated with the kernel function $k(\cdot, \cdot)$, which maps the feature vectors $x$ and $z$ to a higher-dimensional feature space. The kernel function is thus the Euclidean inner product between two image features in the mapped space.
In kernel methods, we do not need to know what $\phi$ is and just adopt the kernel function in Equation (8).
It has been shown that the histogram intersection kernel and the Chi-square kernel are more powerful than other kernel functions in classification [27]. Therefore, more of the discriminant information embedded in HFE can be exploited if the histogram intersection kernel or the Chi-square kernel is adopted in SRC. The histogram intersection kernel $k_{HIK}$ and the Chi-square kernel $k_C$ are defined as follows:
$$k_{HIK}(x, z) = \sum_{d} \min(x_d, z_d), \qquad k_C(x, z) = \sum_{d} \frac{2 x_d z_d}{x_d + z_d}$$
After the HFE-based feature extraction on the query image, $B$ blocks of multiple partitions are obtained, and $B$ sub-feature vectors, denoted by $y_1, y_2, \ldots, y_B$, are extracted. Similarly, for each of the training samples we can extract the sub-feature vectors, and we denote by $X_i$ the matrix formed by the sub-feature vectors of the ith block from all training samples. Taking the ith block as an example, the kernel representation of $y_i$ over the matrix $X_i$ can be formulated as
$$\hat{\alpha}_i = \arg\min_{\alpha_i} \|\phi(y_i) - \phi(X_i)\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1$$
where $\alpha_i$ is the coding coefficient vector in the high-dimensional feature space induced by the mapping $\phi$. For a matrix $X$ with columns $x_1, \ldots, x_n$ and a vector $y$, let $K_{XX}$ be the $n \times n$ matrix with $\{K_{XX}\}_{ij} = k(x_i, x_j)$ and $k_{Xy}$ be the $n$-dimensional vector with $\{k_{Xy}\}_i = k(x_i, y)$. The kernel representation above can then be written as
$$\hat{\alpha}_i = \arg\min_{\alpha_i} k(y_i, y_i) - 2\alpha_i^T k_{X_i y_i} + \alpha_i^T K_{X_i X_i} \alpha_i + \lambda \|\alpha_i\|_1$$
If we enforce $\alpha_i = \alpha_j$ for different blocks $i \neq j$, i.e., we assume that the different blocks $y_i$ extracted from the same test sample have the same representation over their associated matrices $X_i$, then the kernel representation of the query image combining all the block features can be written as
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{B} \|\phi(y_i) - \phi(X_i)\alpha\|_2^2 + \lambda \|\alpha\|_1 \tag{12}$$
where $\alpha$ is the coding coefficient vector of the query sample. The above model seeks a regularized representation of a mapped feature over the mapped basis in the high-dimensional space.
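As a small illustration of how the kernel quantities above are computed without ever forming $\phi$, the following sketch evaluates the two histogram kernels, the kernel matrix and vector, and the kernel-space residual $\|\phi(y) - \phi(X)\alpha\|_2^2 = k(y, y) - 2\alpha^T k_{Xy} + \alpha^T K_{XX}\alpha$. The additive form of the Chi-square kernel is an assumption, and the function names are hypothetical.

```python
def hik(x, z):
    """Histogram intersection kernel."""
    return sum(min(a, b) for a, b in zip(x, z))


def chi_square(x, z):
    """Additive Chi-square kernel (assumed form); empty bins contribute 0."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, z) if a + b > 0)


def kernel_matrix(columns, kernel):
    """K_XX with entries k(x_i, x_j) over the training feature vectors."""
    return [[kernel(a, b) for b in columns] for a in columns]


def kernel_vector(columns, y, kernel):
    """k_Xy with entries k(x_i, y)."""
    return [kernel(a, y) for a in columns]


def kernel_residual(k_yy, k_xy, k_xx, alpha):
    """Squared residual k(y,y) - 2 a^T k_Xy + a^T K_XX a in the mapped space."""
    n = len(alpha)
    cross = sum(alpha[i] * k_xy[i] for i in range(n))
    quad = sum(alpha[i] * k_xx[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return k_yy - 2.0 * cross + quad


x = [0.5, 0.5, 0.0]
z = [0.25, 0.25, 0.5]
K = kernel_matrix([x], hik)
k = kernel_vector([x], x, hik)
# Representing x by itself with coefficient 1 leaves no residual.
zero_residual = kernel_residual(hik(x, x), k, K, [1.0])
```

Only kernel evaluations between histograms are needed, which is why the choice of kernel can be changed without touching the rest of the model.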

Occlusion Solution
In the kernel representation model of Equation (12), the L2-norm is used to measure the representation residual. Such a kernel representation is effective when there are no outliers in the query image. However, partial occlusion or noise often appears in the query pedestrian image. In such a case, the blocks in which occlusion appears will have large representation residuals, reducing the role of the clean blocks in the final classification. In short, the representation model in Equation (12) is very sensitive to partial occlusion.
To make the kernel representation robust to partial occlusion and noise, we propose to adopt a robust fidelity term in the modeling. Denote by $e = [e_1, e_2, \ldots, e_B]$ the representation residual vector, where $e_i$ is the kernel representation residual of the ith block:
$$e_i = \|\phi(y_i) - \phi(X_i)\alpha\|_2$$
We assume that $e_i$ is independent of $e_j$ if $i \neq j$, as they are the representation residuals of different blocks.
The proposed weighted kernel sparse representation (WKSR) is then formulated with a weighted fidelity term $\omega(e) = \sum_{i=1}^{B} \omega(e_i)$, where the weight function $\omega(\cdot)$ is expected to be insensitive to the outliers in the query sample. A good weight function should be robust to outliers, i.e., $\omega(e_i)$ has a large value when $|e_i|$ is small (e.g., blocks without outliers) and a small value when $|e_i|$ is large (e.g., blocks with outliers). The widely used Gaussian function can be chosen as the weight function:
$$\omega(e_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{e_i^2}{2\sigma^2}\right) \tag{15}$$
The above weight function effectively assigns low weights to outliers with large representation residuals and high weights to inliers with small representation residuals (here the weight value is normalized to the range [0, 1]). It should be noted that the weight values of each test sample are estimated online, and there is no training phase for them.
With the above development, Equation (12) can be rewritten as
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{B} \omega_i \|\phi(y_i) - \phi(X_i)\alpha\|_2^2 + \lambda \|\alpha\|_1 \tag{16}$$
where $\omega_i$ is $\omega(e_i)$ computed by Equation (15) from a known coding coefficient vector. Here $\sigma$ is a scalar parameter, which can be set to a constant value or automatically updated. $\sigma$ is usually set as $1/\sqrt{2\pi}$ to make the weight close to 1 when $e_i = 0$. With the kernel matrix $K_{XX}$ and kernel vector $k_{Xy}$ defined above, Equation (16) can be rewritten as
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{B} \omega_i \left( k(y_i, y_i) - 2\alpha^T k_{X_i y_i} + \alpha^T K_{X_i X_i} \alpha \right) + \lambda \|\alpha\|_1 \tag{17}$$
From Equation (17) we can see that the proposed WKSR method exploits the discrimination information in the mapped higher-dimensional feature space; at the same time, the weight $\omega_i$ effectively removes the effect of outliers on the computation of the coefficient vector.
The coefficient vector $\alpha$ is regularized by the L1-norm. The efficient feature-sign search algorithm [46] can be used to solve the sparse coding problem of Equation (17). Solving WKSR is an iterative and alternating process: the weight values are estimated via Equation (15) with the sparse coefficients known, and then the sparse coefficients are computed via Equation (17) with the weight values known. After obtaining the solution $\hat{\alpha}$ after some iterations, the classification of the query sample is done via
$$\mathrm{identity}(y) = \arg\min_{j} \sum_{i=1}^{B} \omega_i \, \varepsilon_{i,j} \tag{18}$$
where $\varepsilon_{i,j} = \|\phi(y_i) - \phi(X_{i,j})\hat{\alpha}_j\|_2^2$ is the ith-block kernel representation residual associated with the jth class, $X_i = [X_{i,1}, X_{i,2}, \cdots, X_{i,c}]$ with $X_{i,j}$ being the sub-matrix of $X_i$ associated with the jth class, and $\hat{\alpha}_j$ is the representation coefficient vector associated with the jth class. From Equation (18) it can be seen that the classification criterion is based on a weighted sum of kernel representation residuals, which utilizes both the discrimination power of the kernel representation in the high-dimensional feature space and the insensitivity of the robust representation to outliers. In addition, the kernel representation residual $\varepsilon_{i,j}$ can be rewritten as
$$\varepsilon_{i,j} = k(y_i, y_i) - 2\hat{\alpha}_j^T k_{X_{i,j} y_i} + \hat{\alpha}_j^T K_{X_{i,j} X_{i,j}} \hat{\alpha}_j$$
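The alternating scheme described above (solve for the coefficients with the weights fixed, then re-estimate the weights from the block residuals) can be sketched as follows. This is only an illustration under stated assumptions: coordinate descent replaces the feature-sign search algorithm of [46], a fixed number of iterations replaces a convergence test, the weight is the Gaussian density with $\sigma = 1/\sqrt{2\pi}$, and all function names are hypothetical.

```python
import math

SIGMA = 1.0 / math.sqrt(2.0 * math.pi)  # makes the weight equal 1 at e = 0


def gaussian_weight(e, sigma=SIGMA):
    """Gaussian weight: large for small residuals, small for large ones."""
    return math.exp(-e * e / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def soft(v, t):
    """Soft-thresholding operator."""
    return v - t if v > t else v + t if v < -t else 0.0


def solve_alpha(K, k, lam, sweeps=100):
    """Minimise a^T K a - 2 a^T k + lam * ||a||_1 by coordinate descent."""
    n = len(k)
    alpha = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            if K[j][j] <= 0.0:
                continue
            rho = k[j] - sum(K[j][l] * alpha[l] for l in range(n) if l != j)
            alpha[j] = soft(rho, lam / 2.0) / K[j][j]
    return alpha


def block_residual(k_yy, k_xy, K, alpha):
    """e_i = sqrt(k(y,y) - 2 a^T k_Xy + a^T K_XX a), clipped at zero."""
    n = len(alpha)
    cross = sum(alpha[i] * k_xy[i] for i in range(n))
    quad = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
    return math.sqrt(max(k_yy - 2.0 * cross + quad, 0.0))


def wksr(blocks, lam=0.005, iterations=10):
    """blocks: list of (k_yy, k_xy, K_xx) triples, one per image block."""
    n = len(blocks[0][1])
    weights = [1.0] * len(blocks)
    alpha = [0.0] * n
    for _ in range(iterations):
        # Weighted sums of the per-block kernel matrices and vectors.
        K = [[sum(w * b[2][i][j] for w, b in zip(weights, blocks))
              for j in range(n)] for i in range(n)]
        k = [sum(w * b[1][i] for w, b in zip(weights, blocks)) for i in range(n)]
        alpha = solve_alpha(K, k, lam)
        weights = [gaussian_weight(block_residual(b[0], b[1], b[2], alpha))
                   for b in blocks]
    return alpha, weights


# One training atom; block 0 matches it exactly, block 1 is "occluded".
clean = (1.0, [1.0], [[1.0]])
occluded = (9.0, [0.0], [[1.0]])
alpha, weights = wksr([clean, occluded])
```

In this toy case the occluded block receives a weight close to zero after the first iteration, so the coefficient is determined almost entirely by the clean block.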

Proposed Classification Algorithm
For pedestrian classification, the goal is to determine a class label for a query image. We consider a two-class problem with classes C0 (pedestrian) and C1 (nonpedestrian). The whole procedure of the proposed pedestrian classification is summarized in Algorithm 1.
The algorithm includes three steps: (1) the first step extracts the discrimination information using the proposed HFE; (2) the second step performs WKSR, iterating until the convergence condition is satisfied, where $\tau$ is a small positive scalar and $\omega_i^{(t)}$ is the weight value of the ith block at iteration $t$; and (3) the last step performs classification via Equation (18), where $X_{i,j}$ is the sub-matrix of $X_i$ associated with the jth class and $\hat{\alpha}_j$ is the representation coefficient vector associated with the jth class. The second step is an iterative process. Through experiments, we found that this process converges fast. For instance, when there is no occlusion, only two or three iterations are needed, and when there is occlusion in the query image, approximately ten iterations lead to a good solution.
Compared with the HOG + SVM and SRC approaches, the proposed WKSR method attenuates the problems caused by query images with corrupted, occluded or largely varied appearances, which may mislead the representation and classification. HFE-WKSR also runs fast. Under the MATLAB R2010a programming environment on a desktop with a 3.07-GHz CPU and 8 GB of RAM, the running times of SRC and HFE-WKSR using the feature-sign search algorithm [46] are compared in Table 1. In the experiment on the INRIA database (refer to Section 4 for the detailed experimental setting), the average running time of HOG + SVM is 0.1806 s, while the average running times of HFE + SRC and HFE-WKSR are 0.1239 s and 0.1372 s, respectively. In the experiment on the Daimler dataset with partial occlusion (refer to Section 4 for the detailed experimental setting), the average running times of HFE + SRC and HFE-WKSR are 0.0403 s and 0.0463 s, respectively, which is much less than that of HOG + SVM (0.0682 s).

Experimental Results
In this section, we present experimental results on benchmark pedestrian databases to illustrate the effectiveness of our method. In Section 4.1, we discuss the parameter setting. In Section 4.2, we present the experimental results on the INRIA database, captured with a high-definition digital camera. In Section 4.3, we present the experimental results on the Daimler dataset, captured with a mobile recording setup, to demonstrate the robustness of HFE-WKSR to varied illumination, background and appearance. Then, in Section 4.4, we test the robustness of HFE-WKSR against partial occlusion on the INRIA random block occlusion and Daimler occlusion datasets.

Parameter Setting
The proposed method consists of two main procedures: hierarchical feature extraction (HFE) and WKSR. Unless otherwise specified, the parameters of HFE-WKSR are set as shown in Table 2. In feature extraction, the histogram of CENTRIST encoded on the raw image is used as the local feature, and the number of histogram bins for each sub-block is set to 16. In the proposed hierarchical feature extraction method, we set $S = 0$, $p_0 = 4$, and $q_0 = 4$ for the INRIA and Daimler datasets with non-occluded images. For the Daimler and INRIA datasets with partially occluded images, we set $S = 2$ and $(p_s, q_s) = \{(4, 4), (3, 2), (2, 2)\}$ for $s = \{0, 1, 2\}$. In the WKSR procedure, the histogram intersection kernel [42] is used as the kernel function. In the Gaussian weight, we set $\sigma = 0.5$ for samples with occlusion and $\sigma = 0.4$ for samples without occlusion. The convergence parameter $\tau$ and the Lagrange multiplier $\lambda$ are empirically set as 0.7 and 0.005, respectively. The other parameters are obtained by cross-validation. We randomly select 100 of the labeled samples as the training set and 500 samples as the test set, and then vary the number of levels from 1 to 4, the number of bins over 8, 16 and 32, and the weight parameter from 0.2 to 0.8. Each experiment is repeated five times using different random samplings. Finally, we determine the parameter setting according to time consumption and classification accuracy.

Pedestrian Classification on INRIA Dataset
We first evaluate the performance of the proposed algorithm on the INRIA database, captured with a static digital camera, which has been widely used for pedestrian/human detection evaluation in recent years. The original SRC and SVM with the HOG feature [7] are used as the baseline methods. We then apply the proposed HFE feature to SRC [36], CRC [39], and a support vector machine that uses the histogram intersection kernel as its similarity measurement (HIKSVM), and compare them with the proposed HFE-WKSR. INRIA consists of 1758 positive and 1685 negative images captured under various view and illumination conditions. Examples of images from the dataset are shown in Figure 4. In our experiment, N samples are randomly chosen as training samples and 500 of the remaining images are randomly chosen as the testing data. Here the images are normalized to 128 × 64 and the experiment for each N runs ten times.
The pedestrian classification results and mean recognition accuracy of all the competing methods are listed in Table 3. The proposed HFE-WKSR achieves the best performance, with more than a 4% improvement over all the others when N is small (e.g., 20 and 50). When 100 training samples are selected, an accuracy of 97.5% is achieved by HFE-WKSR. It can also be seen that the methods based on sparse representation (e.g., HFE-WKSR, HFE + CRC, HFE + SRC, and HOG + SRC) are more powerful than the SVM-based methods.

Pedestrian Classification on Daimler Dataset
In this section, we test the robustness of the proposed method to real traffic scenes on the Daimler database, with complex backgrounds and varied illumination and appearances. The Daimler database consists of 15,659 pedestrian and 6740 nonpedestrian samples captured from a vehicle-mounted camera in an urban environment. As opposed to the INRIA dataset, the nonpedestrian samples were selected by a preprocessing step from the negative samples that match a pedestrian shape template based on the average Chamfer distance score. All samples were scaled to a fixed window size of 96 × 48, and the pedestrian samples include a margin of 2 pixels around the pedestrian. The small size of the windows, combined with the moving background, makes detection on the Daimler dataset extremely challenging. Examples of images from the dataset are shown in Figure 5. In the experiment, all pedestrian samples are divided into three groups according to illumination, background and appearance change. In each group, 1000 samples are randomly chosen as training samples and 9000 of the remaining images are randomly chosen as the testing data. Here the images are normalized to 96 × 48 and the experiment for each group runs ten times.
Table 4 lists the results of all the competing methods. The proposed HFE − WKSR achieves the highest recognition rates, improving on every other method by at least 3%. The original SRC with HOG gets the worst recognition rates, much lower than those of HFE + SRC, which indicates that HFE is robust to misalignment to some extent. Sparse representations (e.g., CRC and SRC) combined with HFE improve by approximately 10% over other kinds of classifiers (e.g., HIKSVM and SVM). To show the effectiveness of max pooling (MP), we also give the recognition rate of HFE − WKSR without the MP step in Table 4. Even without MP, HFE − WKSR still outperforms HFE + SRC by 1.9% on average and HFE + CRC by 2.6%. The improvement introduced by MP is over 5% in each group, which shows the effectiveness of the proposed MP in dealing with varied illumination, background and appearance.

Pedestrian Classification on Partial Occlusion Datasets
Partial occlusion is a very challenging issue in a pedestrian detection system: the subject may be covered by other objects such as trees, cars and other humans. One interesting property of SRC [36] is its robustness to occlusions. In this section, we test the performance of HFE − WKSR under various occlusions, including random block occlusion and real occlusion. In HFE − WKSR, the robustness to occlusion mainly comes from its iteratively reweighted kernel robust representation, in which the weight W of each block is automatically updated.
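To illustrate how automatically updated block weights can suppress occluded regions, the sketch below alternates between a weighted coding step and a Gaussian weight update per block. It is not the HFE − WKSR model: it swaps in plain ridge-regularized linear coding for the kernel sparse representation, and the exact Gaussian form, `sigma`, `lam`, `block_size` and all function names are assumptions made for illustration.

```python
import numpy as np


def gaussian_block_weights(residual, block_size, sigma):
    """One Gaussian weight per block, from that block's mean squared residual."""
    blocks = residual.reshape(-1, block_size)
    block_error = np.mean(blocks ** 2, axis=1)
    return np.exp(-block_error / (2.0 * sigma ** 2))


def reweighted_coding(dictionary, query, block_size, sigma=1.0, lam=1e-2,
                      n_iter=5):
    """Iteratively reweighted ridge coding (a simplified stand-in model).

    `dictionary` has shape (n_features, n_atoms); the feature vector is
    assumed to be a concatenation of equally sized blocks.
    """
    n_features, n_atoms = dictionary.shape
    if n_features % block_size != 0:
        raise ValueError("n_features must be a multiple of block_size")
    weights = np.ones(n_features // block_size)
    coeffs = np.zeros(n_atoms)
    for _ in range(n_iter):
        # Expand block weights to one weight per feature.
        w = np.repeat(weights, block_size)
        weighted_dict = dictionary * w[:, None]
        # Solve (D^T W D + lam I) a = D^T W y.
        gram = dictionary.T @ weighted_dict + lam * np.eye(n_atoms)
        coeffs = np.linalg.solve(gram, weighted_dict.T @ query)
        # Blocks that are poorly reconstructed receive small weights.
        residual = query - dictionary @ coeffs
        weights = gaussian_block_weights(residual, block_size, sigma)
    return coeffs, weights
```

Blocks whose residual stays large across iterations end with weights close to zero, so they contribute little to the final coding coefficients, which is the intuition behind the robustness to occlusion described above.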

(1) Pedestrian classification with random block occlusion. From the INRIA database, we chose 100 non-occluded images with normal-to-moderate lighting conditions for training, and 500 of the remaining images are randomly chosen for testing. Similar to the settings in [36], we simulate various levels of contiguous occlusion, from 0% to 50%, by replacing a randomly located square block of each testing image with an unrelated image, as illustrated in Figure 6, where (a), (b) and (c) show a pedestrian image with 20%, 30% and 40% block occlusion, respectively. The location of the occlusion is randomly chosen for each image and is unknown to each algorithm, and the image size is normalized to 128 × 64.
From Table 5, we can see that almost all methods correctly classify most of the testing samples when the occlusion level is from 10% to 20%. However, when the occlusion percentage is larger than 20%, the advantage of HFE − WKSR over the other methods becomes significant. For instance, at 40% occlusion HFE − WKSR achieves at least 84% recognition accuracy, compared with at most 72.5% for the other methods. With 50% block occlusion, HFE − WKSR still achieves a recognition rate of over 75%. This demonstrates the effectiveness of the proposed HFE − WKSR method in dealing with partial occlusion.
(2) Pedestrian classification with real occlusion. The Daimler dataset is divided into a partially occluded test set and a non-occluded test set. The partially occluded test set contains 11,160 pedestrians and 16,253 non-pedestrians. Examples of images from the dataset are shown in Figure 7, and Figure 8 shows the classification results. The proposed method achieves 84.2% recognition accuracy, much higher than the competing results: 56.8% (HOG + SVM), 68.7% (HOG + SRC), 77.8% (HFE + SRC), 78.0% (HFE + CRC) and 74.6% (HFE + HIKSVM). The improvement of HFE − WKSR over all the other methods is at least 6%, which shows the superior classification ability of HFE − WKSR.
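The random block occlusion used in experiment (1) can be simulated roughly as below. This is a sketch under stated assumptions: the function name is hypothetical, the occluder is assumed to be at least as large as the block and is cropped from its top-left corner, and the occlusion level is interpreted as the fraction of image area covered by the square block.

```python
import numpy as np


def add_block_occlusion(image, occluder, fraction, rng):
    """Replace a randomly located square block of `image` with `occluder`.

    Returns the occluded copy and the block's (top, left, side) in pixels.
    """
    height, width = image.shape[:2]
    # Side length of a square covering roughly `fraction` of the image area,
    # capped so the block always fits inside the image.
    side = int(round(np.sqrt(fraction * height * width)))
    side = min(side, height, width)
    occluded = image.copy()
    if side == 0:
        return occluded, (0, 0, 0)
    # The block position is drawn uniformly and is not given to the classifier.
    top = int(rng.integers(0, height - side + 1))
    left = int(rng.integers(0, width - side + 1))
    occluded[top:top + side, left:left + side] = occluder[:side, :side]
    return occluded, (top, left, side)
```

For a 128 × 64 image, a 50% occlusion level gives a 64 × 64 block, which is the largest square that still fits the image width.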

Conclusions
Because a vision-based pedestrian protection system (PPS) is low in cost and is not influenced by temperature, it has extensive applications in autonomous vehicles. Pedestrian classification is a key technology for PPS. In this paper, we proposed a novel HFE − WKSR model for pedestrian classification. A robust representation model for image outliers (e.g., occlusion and noise) was built in the kernel space, and a hierarchical feature extraction based on the CENTRIST descriptor was proposed to capture the discriminative structures of objects. A max pooling operation was used to enhance the invariance of the local pattern features to varying illumination and appearance. We evaluated the proposed method under different conditions, including variations of illumination, view and appearance, as well as block occlusion. A major advantage of the proposed method is its high recognition rate and robustness against various occlusions. The extensive experimental results demonstrate that HFE − WKSR outperforms state-of-the-art methods and has great potential to be applied in practical pedestrian protection systems.
