DeepGait: A Learning Deep Convolutional Representation for View-Invariant Gait Recognition Using Joint Bayesian

Human gait, as a soft biometric, helps to recognize people through their walking. To further improve recognition performance, we propose a novel video sensor-based gait representation, DeepGait, which uses deep convolutional features, and we introduce Joint Bayesian to model view variance. DeepGait is generated using a pre-trained "very deep" network, "D-Net" (VGG-D), without any fine-tuning. In the non-view setting, DeepGait outperforms hand-crafted representations (e.g., Gait Energy Image, Frequency-Domain Feature, and Gait Flow Image). Furthermore, in the cross-view setting, the 256-dimensional DeepGait obtained after PCA significantly outperforms state-of-the-art methods on the OU-ISIR large population (OULP) dataset. Because the OULP dataset includes 4007 subjects, our results are statistically reliable.


Introduction
Biometrics refers to the use of intrinsic physical or behavioral traits to identify humans. Besides the regular features (face, fingerprint, iris, DNA, and retina), human gait, which can be captured at larger distances and at low resolution without the subject's cooperation, has recently attracted much attention. It also has broad application prospects in crime investigation and wide-area surveillance. For example, criminals usually wear gloves, dark sunglasses, and face masks to defeat fingerprint, eye, and face recognition. In such scenarios, gait recognition may be the only useful and effective identification method. Previous research [1,2] has shown that human gait, specifically the walking pattern, is difficult to disguise and unique to each person.
In general, video sensor-based gait recognition methods are divided into two families: appearance-based [3][4][5][6][7] and model-based [8][9][10]. Appearance-based methods focus on the motion of the human body and usually operate on gait silhouettes, from which they extract gait descriptors. Their general framework usually consists of silhouette extraction, period detection, representation generation, and recognition. Model-based methods focus on extracting stride parameters of the subject, describing the gait through the structure of the human body. They usually require high-resolution images and are computationally expensive, whereas gait recognition needs to run in real time and remain effective at low resolution. Our proposed work falls into the category of appearance-based methods. It differs from the majority of contributions in the field in that a Deep Learning (DL) framework is used to extract the gait representation, rather than well-engineered features such as the widely used average silhouette representations: Gait Energy Image (GEI) [3], Gait Flow Image (GFI) [5], Gait Entropy Image (GEnI), Masked GEI based on GEnI (MGEI) [4], and Frequency-Domain Feature (FDF) [6,7]. However, the performance of gait recognition is often influenced by covariates such as clothing, walking speed, observation view, and carried bags. For appearance-based methods, view change is the most problematic covariate. Therefore, we propose a more discriminative appearance-based representation, DeepGait, and introduce Joint Bayesian to deal with the view change problem. Numerous experiments were conducted for both the non-view and cross-view settings on the OU-ISIR large population (OULP) dataset [11] to validate the effectiveness of our proposed method.

Proposal of Deep Convolutional Gait Representation
DeepGait is inspired by the deep learning breakthroughs in the image domain [12][13][14], where rapid progress in feature learning has been made in the past few years and various pre-trained deep convolutional models [12,13,15] have been made available for extracting image and video features. These features are the activations of the network's last few fully connected layers, which perform well in other vision tasks [14][15][16][17]. Convolutional neural networks (CNNs) have been successfully applied in many research fields, such as face recognition [18][19][20] and human action recognition [15], which are relevant to gait recognition. However, to the best of our knowledge, few studies have applied deep learning features to video sensor-based human gait recognition, except for [21,22]. In this paper, we propose a novel gait representation, DeepGait, based on VGG-D [12] features with max-pooling over each gait cycle. If a gait video sequence contains more than one cycle, we use only the first one. DeepGait differs from [21] in two ways: (1) that method first computes traditional gait representations (GEI, FDF) and uses them as input data, while we use the original silhouette images directly; (2) its network has to be trained on the gait dataset, while ours uses the pre-trained VGG-D model without any fine-tuning.

Joint Bayesian for Modeling View Variance
Several appearance-based approaches have been proposed to deal with view change: (1) view transformation models (VTM) [23,24]; (2) view-invariant feature-based approaches [21,25]; and (3) multiview gallery-based approaches [26,27]. On the OULP dataset, VTM-based methods are widely used: [24] proposed a generative VTM-based approach that makes use of transformation consistency measures (TCM+), and [23] further proposed a quality-dependent VTM (wQVTM). Recently, a view-invariant feature-based approach (GEINet) [21] was proposed and achieved the best performance. Unlike the above approaches, we introduce Joint Bayesian [28] to model the view variance. For comparison, the unsupervised Nearest Neighbor classifier based on Euclidean distance (NN) is adopted as a baseline method. To evaluate the compactness of DeepGait, PCA is used to project the representation into lower dimensions. Furthermore, we choose K = 256 components to strike a balance between recognition performance and computational complexity when using Joint Bayesian.

Overview
Our contributions include: (1) introducing deep learning for gait recognition and proposing a new gait representation that outperforms traditional gait representations when the gallery and probe gait sequences are from the same view (non-view setting); (2) modeling view variance using Joint Bayesian when the gallery and probe gait sequences are from different views (cross-view setting); (3) improving recognition performance on the OULP dataset for both the non-view and cross-view settings; (4) making public the trained Joint Bayesian model, test code, and experimental results for further comparison.
Figure 1 shows an overview of our method. The rest of the paper is organized as follows. Section 2 introduces DeepGait, Joint Bayesian for the identification and verification tasks, and the evaluation criteria. Section 3 presents the experimental results on the OULP dataset. Section 4 offers our conclusion.

Gait Period Estimation
As in other appearance-based gait recognition methods, the first step of DeepGait generation is gait period detection. As in [6,11], we calculated the Normalized Auto Correlation (NAC) of each normalized gait sequence along the temporal axis:

$$\mathrm{NAC}(N) = \frac{\sum_{x,y}\sum_{n=0}^{N_{total}-N-1} S(x,y,n)\,S(x,y,n+N)}{\sqrt{\sum_{x,y}\sum_{n=0}^{N_{total}-N-1} S(x,y,n)^{2}}\;\sqrt{\sum_{x,y}\sum_{n=0}^{N_{total}-N-1} S(x,y,n+N)^{2}}} \quad (1)$$

where NAC(N) is the autocorrelation for a shift of N frames, which quantifies the periodic gait motion, $N_{total}$ is the number of frames in the gait sequence, and S(x, y, n) is the silhouette gray value at position (x, y) in the n-th frame. Empirically, for the natural gait period, the domain of N is set to [20, 40], and the gait period is estimated as:

$$T_{gait} = \arg\max_{N \in [20,40]} \mathrm{NAC}(N) \quad (2)$$

where $T_{gait}$ is the gait period. We have made the code and the results (large deviations were corrected manually) public in the Supplementary Materials.
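The period estimation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released code: the function names are hypothetical, the silhouettes are assumed to be stacked in an array of shape (N_total, H, W), and the NAC is assumed to be the standard normalized autocorrelation over the overlapping frames.

```python
import numpy as np

def normalized_autocorrelation(silhouettes: np.ndarray, shift: int) -> float:
    """NAC for one frame shift; `silhouettes` has shape (N_total, H, W)."""
    s = silhouettes.astype(np.float64)
    head = s[:-shift]  # frames n = 0 .. N_total - shift - 1
    tail = s[shift:]   # the same frames shifted by `shift`
    denom = np.sqrt(np.sum(head ** 2) * np.sum(tail ** 2))
    return float(np.sum(head * tail) / denom) if denom > 0 else 0.0

def estimate_gait_period(silhouettes: np.ndarray, n_min: int = 20, n_max: int = 40) -> int:
    """Return the shift in [n_min, n_max] with the largest NAC."""
    n_total = silhouettes.shape[0]
    last = min(n_max, n_total - 1)
    if last < n_min:
        raise ValueError("sequence is too short for the requested shift range")
    scores = [normalized_autocorrelation(silhouettes, n) for n in range(n_min, last + 1)]
    return n_min + int(np.argmax(scores))
```

On a sequence shorter than the smallest shift the sketch raises an error rather than guessing a period; how such sequences should be handled in practice is left open here.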

Network Structure
In this paper, a state-of-the-art deep convolutional model (VGG-D) [12], which consists of 19 parameterized layers (16 convolutional layers and 3 fully connected layers), was adopted. Figure 1 shows part of its structure. The VGG study [12] evaluated very deep convolutional networks using an architecture with very small (3 × 3) convolution filters, which achieved a significant improvement in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC-2014).

Supervised Pre-Training
When a deep convolutional model is trained on a large auxiliary labeled dataset, the high-level features learned by the pre-trained model have sufficient discrimination ability for some image-based classification tasks [16]. To evaluate the efficacy of the learned features on the gait recognition task, we trained the VGG-D net using the ImageNet dataset (classification annotations only) [13]. The training procedure generally followed Simonyan et al. [12]. Namely, the back-propagation algorithm with mini-batch stochastic gradient descent is used to optimize the softmax-regression objective function [29]. In this paper, we did not fine-tune the model using any gait dataset, because the deep convolutional features from the pre-trained model already showed a significant improvement over traditional hand-crafted gait representations in the non-view setting.

Feature Extraction
To extract deep learned features for gait representation generation, the size of the input gait silhouette images must be compatible with VGG-D's input size of 224 × 224 pixels. We therefore first rescaled each image to this fixed size. Features were then computed by forward propagating a mean-subtracted, size-fixed (224 × 224) gait image through 16 convolutional/pooling layers and 2 fully connected layers using Caffe, an open-source CNN library [30]. According to other vision tasks [14][15][16][17], the features of the first fully connected layer (fc6) outperform the features of the other layers. Unless otherwise specified, we extracted the 4096-dimensional fc6 features as the deep convolutional features for gait representation generation.

Representation Generation and Visualization
Inspired by the Gait Energy Image (GEI), which is obtained by simply averaging the silhouette sequence over one gait period and captures both spatial and temporal information [3,21], we apply max-pooling to the fc6 features of one gait period to combine the spatio-temporal information. A variant that applies average-pooling to the fc6 features was also tested in our experiments and showed inferior performance, which supports the choice of max-pooling. If the i-th gait period contains T silhouette images, we obtain T fc6 feature vectors. The j-th element of the 4096-dimensional deep convolutional gait representation (DeepGait) is then created by taking the maximum of the fc6 features over the period, as in Equation (3):

$$\mathrm{DeepGait}_{i,j} = \max_{1 \le t \le T} fc6^{\,t}_{i,j}, \quad j = 1, \dots, 4096 \quad (3)$$

where $fc6^{\,t}_{i,j}$ is the j-th element of the fc6 feature of the t-th silhouette image in the i-th gait period.
Examples of the 256-dimensional DeepGait from the OULP dataset after dimension reduction (in Section 2.2.3) and L2-normalization are shown in Figure 2.
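The pooling and normalization steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-frame fc6 features are assumed to be already extracted and stacked in an array of shape (T, D), and the function names are hypothetical.

```python
import numpy as np

def deepgait_from_cycle(fc6_features: np.ndarray) -> np.ndarray:
    """Max-pool per-frame features of shape (T, D) into one D-dimensional vector."""
    if fc6_features.ndim != 2 or fc6_features.shape[0] == 0:
        raise ValueError("expected a non-empty array of shape (T, D)")
    return fc6_features.max(axis=0)  # element-wise maximum over the T frames

def l2_normalize(vector: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a vector to unit Euclidean length (eps guards against a zero vector)."""
    return vector / max(float(np.linalg.norm(vector)), eps)
```

Replacing `max(axis=0)` with `mean(axis=0)` gives the average-pooling variant that the text reports as inferior.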

Gait Recognition
As in face recognition [18][19][20], gait recognition can usually be divided into two major tasks: gait verification and gait identification. Gait verification determines whether two input gait sequences (gallery, probe) belong to the same subject. In this paper, we calculated a similarity score (SimScore) using Joint Bayesian to evaluate the similarity of two given sequences. Euclidean distance was also adopted as a baseline method for comparison. In gait identification, a set of subjects is gathered (the gallery), and the aim is to decide which of the gallery identities is most similar to the probe at test time. Under the closed-set identification condition [31], a probe sequence is compared with all the gallery identities, and the identity with the largest SimScore is the final result.

Gait Verification Using Joint Bayesian
The Joint Bayesian technique [28] has been widely and successfully used for face verification [18,19,32]. In this paper, we modeled the extracted DeepGait (after mean subtraction) as the sum of two independent Gaussian variables:

$$x = \mu + \varepsilon \quad (4)$$

where x represents a mean-subtracted DeepGait vector. For better performance, L2-normalization was applied to DeepGait. $\mu$ is the gait identity, following a Gaussian distribution $N(0, S_{\mu})$. $\varepsilon$ stands for the gait variations (e.g., view, clothing, and carried bags), following a Gaussian distribution $N(0, S_{\varepsilon})$. Joint Bayesian models the joint probability of two gait representations under the intra-class variation hypothesis ($H_I$) or the inter-class variation hypothesis ($H_E$), that is, $P(x_1, x_2 | H_I)$ and $P(x_1, x_2 | H_E)$. Given the prior in Equation (4) and the independence assumption between $\mu$ and $\varepsilon$, the covariance matrices of $P(x_1, x_2 | H_I)$ and $P(x_1, x_2 | H_E)$ can be derived as:

$$\Sigma_I = \begin{bmatrix} S_{\mu} + S_{\varepsilon} & S_{\mu} \\ S_{\mu} & S_{\mu} + S_{\varepsilon} \end{bmatrix} \quad (5)$$

$$\Sigma_E = \begin{bmatrix} S_{\mu} + S_{\varepsilon} & 0 \\ 0 & S_{\mu} + S_{\varepsilon} \end{bmatrix} \quad (6)$$

$S_{\mu}$ and $S_{\varepsilon}$ are two unknown covariance matrices that can be learned from the training set using the Expectation Maximization (EM) algorithm. During the testing phase, the likelihood ratio $r(x_1, x_2)$ is regarded as the similarity score (SimScore):

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 | H_I)}{P(x_1, x_2 | H_E)} \quad (7)$$

$r(x_1, x_2)$ is efficiently obtained in the following closed form:

$$r(x_1, x_2) = x_1^{T} A x_1 + x_2^{T} A x_2 - 2 x_1^{T} G x_2 \quad (8)$$

where A and G are the two matrices of the final model, which can be obtained by simple algebraic operations on $S_{\mu}$ and $S_{\varepsilon}$. Please refer to [28] for more details. We also make our trained model (A and G) and testing code public in the Supplementary Materials for further comparison.
Euclidean distance is also adopted as a baseline method for comparison; in this case, the similarity score (SimScore) is calculated from the Euclidean distance $\lVert x_1 - x_2 \rVert_2$ between the two representations (Equation (9)). Finally, SimScore is compared with a threshold value to verify whether $x_1$ and $x_2$ belong to the same subject.
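To make the closed-form score concrete, the sketch below derives A and G from given covariance matrices and evaluates the Joint Bayesian score. It follows the construction in [28] as we understand it: with the inverse of the intra-class covariance written in blocks [[F + G, G], [G, F + G]], A = (S_mu + S_eps)^(-1) - (F + G). The function names are hypothetical, the EM estimation of S_mu and S_eps is not shown, and the score equals the log-likelihood ratio only up to a constant offset and a positive scale factor, which does not change rankings.

```python
import numpy as np

def joint_bayesian_matrices(s_mu: np.ndarray, s_eps: np.ndarray):
    """Derive the matrices A and G of the closed-form score from S_mu and S_eps."""
    d = s_mu.shape[0]
    # Covariance of the stacked pair (x1, x2) under the intra-class hypothesis.
    sigma_i = np.block([[s_mu + s_eps, s_mu], [s_mu, s_mu + s_eps]])
    inv_i = np.linalg.inv(sigma_i)
    f_plus_g = inv_i[:d, :d]  # diagonal block of the inverse
    g = inv_i[:d, d:]         # off-diagonal block of the inverse
    a = np.linalg.inv(s_mu + s_eps) - f_plus_g
    return a, g

def joint_bayesian_score(x1: np.ndarray, x2: np.ndarray, a: np.ndarray, g: np.ndarray) -> float:
    """Closed-form similarity score; larger means more likely the same subject."""
    return float(x1 @ a @ x1 + x2 @ a @ x2 - 2.0 * (x1 @ g @ x2))
```

Because A and G depend only on the learned covariances, they can be computed once after training and reused for every gallery and probe pair.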

Gait Identification
For gait identification, the probe sample $x_p$ is assigned to class $i^{*}$, the gallery identity whose SimScore with the probe is the maximum, as shown in Equation (10):

$$i^{*} = \arg\max_{i \in \{1, \dots, N_{gallery}\}} \mathrm{SimScore}(x_i, x_p) \quad (10)$$

where $x_i$ is the gallery sample of subject i and $N_{gallery}$ is the number of subjects in the gallery. In the experiments, we used only the first period of each gait sequence.
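The identification rule can be sketched as a simple argmax over gallery scores. The function names are hypothetical, and the Euclidean baseline is assumed here to use the negative distance as its score so that a larger value always means a closer match; the source may define the baseline score differently.

```python
import numpy as np

def negative_euclidean(x1: np.ndarray, x2: np.ndarray) -> float:
    """Assumed baseline score: negative Euclidean distance (larger = more similar)."""
    return -float(np.linalg.norm(x1 - x2))

def identify(probe: np.ndarray, gallery: np.ndarray, score_fn=negative_euclidean):
    """Return the index of the best-matching gallery row and all scores."""
    scores = np.array([score_fn(g, probe) for g in gallery])
    return int(np.argmax(scores)), scores
```

Any other score with the same convention, such as a Joint Bayesian score, can be passed as `score_fn` without changing the rule.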

Dimension Reduction by PCA
The dimension of DeepGait is relatively large (4096), which makes the training process of Joint Bayesian computationally expensive. To compute efficiently and to evaluate the compactness of DeepGait, we used PCA to project the representation into lower dimensions. PCA captures the principal components of the original space. Over the whole gallery dataset, we calculated a transformation matrix ($E_{PCA}$) using singular value decomposition of its within-class scatter matrix. The dimension of the transformation matrix is M × K, where M is the original dimension of DeepGait and K is the number of components.
After PCA, both the baseline method (Euclidean distance) and Joint Bayesian calculate the SimScore on the projected representations $E_{PCA}^{T} x_1$ and $E_{PCA}^{T} x_2$ instead of $x_1$ and $x_2$.
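The projection step can be sketched with NumPy as below. This illustration assumes standard PCA, that is, an SVD of the mean-centered gallery matrix; the scatter matrix used in the paper may be built differently, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(gallery: np.ndarray, k: int):
    """Fit PCA on gallery rows of shape (N, M); return the mean and an M x K matrix."""
    mean = gallery.mean(axis=0)
    centered = gallery - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if k > vt.shape[0]:
        raise ValueError("k exceeds the number of available components")
    return mean, vt[:k].T

def project(x: np.ndarray, mean: np.ndarray, e_pca: np.ndarray) -> np.ndarray:
    """Project one vector or a batch of row vectors onto the K retained components."""
    return (x - mean) @ e_pca
```

The projected vectors can then be L2-normalized and scored exactly like the original 4096-dimensional ones.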

Evaluation Criteria
The recognition performance was evaluated using four metrics: (1) the Cumulative Match Characteristics (CMC) curve; (2) rank-1 and rank-5 identification rates; (3) the Receiver Operating Characteristic (ROC) curve of False Acceptance Rates (FAR) and False Rejection Rates (FRR); and (4) Equal Error Rates (EERs). The CMC curve and the rank-1/rank-5 identification rates were used for the identification task, while the ROC curve and EERs were used for the verification task.
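As a rough illustration of how two of these metrics can be computed from similarity scores, consider the sketch below. The function names are hypothetical, scores are assumed to follow the "larger is more similar" convention, and the EER is approximated at the observed threshold where FAR and FRR are closest; official benchmark tools may interpolate differently.

```python
import numpy as np

def rank_k_rate(score_matrix: np.ndarray, probe_labels: np.ndarray,
                gallery_labels: np.ndarray, k: int) -> float:
    """Fraction of probes whose true identity is among the k best-scoring gallery entries.

    score_matrix[p, g] is the similarity of probe p to gallery entry g.
    """
    top_k = np.argsort(-score_matrix, axis=1)[:, :k]
    hits = (gallery_labels[top_k] == probe_labels[:, None]).any(axis=1)
    return float(hits.mean())

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate EER from same-subject (genuine) and different-subject (impostor) scores."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine < t).mean() for t in thresholds])    # genuine pairs rejected
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```

Evaluating `rank_k_rate` for k = 1, 2, ... yields the points of the CMC curve, and the (FAR, FRR) pairs over all thresholds trace the ROC curve used in this paper.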

Experiment
The proposed method was evaluated on the OU-ISIR large population (OULP) dataset, which has over 4000 subjects and contains high-quality silhouette images with view variations [11]. The experiments were conducted under two main settings: the non-view setting and the cross-view setting. For the first setting, all the subjects were used to evaluate the performance of our proposed DeepGait, so that the results are statistically reliable. For the second setting, we used a subset of the OULP dataset following the protocol of [21,23,24] for comparison. For further comparison, the experimental results, learned models, and test code are released in the Supplementary Materials.

Comparisons of Different Gait Representations for the Non-View Setting
In this section, we compare the performance of our proposed DeepGait with state-of-the-art gait representations (e.g., GEI, FDF, MGEI, GEnI, and GFI) in a statistically reliable manner. The unsupervised NN classifier was chosen so that all the subjects in the whole dataset could be used for testing. 2-fold cross validation was performed by exchanging the gallery and the probe. The results for each view recorded by the video sensor (55°, 65°, 75°, 85°) are reported in Table 1. DeepGait, even with the simple NN classifier, retained strong discrimination under the large-population condition and outperformed the other well-known representations. Across the four observation views, the performance of each of DeepGait, GEI, and FDF is nearly the same. This suggests that, in the non-view setting, the performance of DeepGait is largely insensitive to the observation view.

Results for the Cross-View Setting
In the following two subsections, we used 1912 subjects, each with two gait sequences (gallery and probe), and this subset was further divided into two groups with the same number of subjects, one for training and the other for testing. Following the protocol of [21,23,24] (publicly available at http://www.am.sanken.osaka-u.ac.jp/BiometricDB/dataset/GaitLP/Benchmarks.html), five 2-fold cross validations were performed. During each training phase, 956 × 1 = 956 intra-class samples and 956 × (956 - 1) = 912,980 inter-class samples were used for training Joint Bayesian. Due to limited space, the gallery view is fixed at one of three views (55°, 65°, 75°) when we show the CMC and ROC curves.

Number of Components Selection for Joint Bayesian
The dimension of DeepGait is 4096, and a high dimension means that more training data are needed for model learning when Joint Bayesian [28] is used for gait recognition. In practice, the number of training samples is often limited in gait recognition; therefore, the dimension of DeepGait needs to be reduced. Owing to the strong discrimination of our proposed DeepGait, competitive performance can be achieved even at a low dimension after PCA. Experiments with different numbers of components K were performed with Joint Bayesian, so that we could choose a K that strikes a balance between recognition performance and computational complexity. Figure 3 shows the results for different values of K under different combinations of gallery and probe views. We can see that K = 2048 achieved the worst performance due to over-fitting: the training samples are insufficient when Joint Bayesian is used at a high dimension. At the lowest dimension (K = 64), our proposed method still achieved competitive performance for three cross-view combinations (55:65, 65:75, 75:85). Further, we found that K = 256 achieved almost the same results as K = 512 under all the cross-view conditions, while using half the number of components. For the best balance between performance and computing cost, we finally set K = 256 when Joint Bayesian is used in the following experiments.

Comparisons with the State-of-the-Art Methods
The proposed method is further compared with other state-of-the-art methods [21,23,24] in cross-view gait recognition. Muramatsu et al. [23,24] proposed the evaluation criteria, and five 2-fold cross validations were performed to reduce the effect of random grouping in their experiments. Ref. [24] proposed a generative approach, a View Transformation Model (VTM) based on transformation consistency measures (TCM+). Ref. [23] further proposed a quality-dependent VTM (wQVTM). Shiraga et al. [21] designed a convolutional neural network for cross-view gait recognition. They reported two kinds of results, which mainly differ in the input data (GEI, FDF); the two methods are referred to as GEINet and w/FDF, respectively [21].

A. Comparisons for identification task
The performance of our proposed method, 256-dimensional DeepGait with Joint Bayesian (DeepGait + JB), was first evaluated in the identification task. The 4096-dimensional DeepGait with the nearest neighbor classifier based on Euclidean distance (DeepGait + NN) is also adopted as a baseline method. We summarize the rank-1 and rank-5 identification rates in Table 2. CMC curves are shown in Figure 4.
As a result, DeepGait + JB significantly outperformed the three state-of-the-art methods for all the view combinations. Even with the simple NN classifier, DeepGait still achieved competitive performance for combinations with small view differences (65:75, 75:85).

B. Comparisons for verification task
We used the same protocol as in the identification task and summarize the EERs for the verification task in Table 3. For the sake of consistency, we also refer to DeepGait based on Euclidean distance as DeepGait + NN.
We find that our proposed method also achieved the best EERs in all cases, especially in cases with large view variance. More specifically, our proposed method reduced the EER from 2.5% to 1.9% compared with the best previous method (GEINet) when the probe view was 85° and the gallery view was 55°. Under the exchanged view condition, the EER improved from 2.4% to 1.6%. Comparing DeepGait + NN with DeepGait + JB, we can conclude that Joint Bayesian models the view variance well, while simple Euclidean distance cannot deal well with cross-view tests in the verification task. Figure 5 shows more details of the ROC curves.

Figure 2. Examples of the 256-dimensional DeepGait after dimension reduction under four observation views (55°, 65°, 75°, 85°). S1 and S2 denote two different subjects. We rearrange each vector as a 16 × 16 matrix for convenience of visualization. Approximately 25% of the features are non-zero. Different colors stand for different values.

Table 2. Comparison of rank-1 (%) and rank-5 (%) identification rates with other existing methods under different cross-view settings.

Table 3. Comparison of EERs (%) with other existing methods under different cross-view settings.