Cross View Gait Recognition Using Joint-Direct Linear Discriminant Analysis

This paper proposes a view-invariant gait recognition framework built on a single joint model that benefits from the dimensionality reduction provided by Direct Linear Discriminant Analysis (DLDA). The framework, which employs gait energy images (GEIs), creates a single joint model that accurately classifies GEIs captured at different view angles. Moreover, the proposed framework helps to alleviate the under-sampling problem (USP) that usually appears when the number of training samples is much smaller than the dimension of the feature space. Evaluation experiments compare the computational complexity and recognition accuracy of the proposed framework against those of other view-invariant methods. Results show improvements in both computational complexity and recognition accuracy.


Introduction
During the past two decades, the use of biometrics for person identification has been a topic of active research [1]. Several schemes have been proposed that use fingerprint, face, iris, retina and speech features, all of which can provide fairly good performance in several practical applications [2][3][4][5][6][7][8][9][10][11]. However, their performance degrades significantly when they operate in unconstrained environments. Because there are practical applications that operate in such environments, several biometrics have been developed to carry out person identification under these conditions. Among them, gait recognition has received considerable attention [8,9]. In particular, gait recognition methods that do not depend on human walking models [12] have been shown to significantly increase accuracy and reduce computational complexity by using information extracted from simple silhouettes of moving persons [13]. In general, several factors may degrade the performance of gait recognition methods, e.g., clothes, shoes, carried objects, the walking surface, elapsed time, and view angle. Among them, the view angle, which corresponds to the angle between the optical axis of the capturing camera and the walking direction [14], is an important factor because the accuracy of most appearance-based approaches strongly depends on a fixed view angle [15].
Gait recognition approaches aimed at solving problems related to varying view angles can be classified as (a) view invariant approaches; (b) visual hull-based approaches; and (c) view transformation-based approaches. View-invariant approaches transform samples of different views into a common space; while visual hull-based approaches depend on 3-D gait information, and thus usually require the acquisition of sequences by multiple calibrated video cameras. Bodor et al. [11] propose application of images on a 3-D visual hull model to automatically reconstruct gait features. Zhang et al. [16] propose a view-independent gait recognition algorithm using Bayesian rules and a 3-D linear model, while Zhao et al. [17] propose an array of multiple cameras to capture a set of video sequences that are used to reconstruct a 3-D gait model. These methods perform well for fully controlled and cooperative multi-camera environments; however, their computational cost is usually high [13].
The idea behind view transformation approaches is to transform the feature vectors from one domain to another by estimating the relationship between the two domains. These transformed virtual features are then used for recognition [18]. View transformation approaches do not require synchronization of gait data of multiple views of the target subjects. Therefore, these approaches are suitable for cases where the views available in the gallery and probe sets are different [18]. These approaches may employ singular value decomposition (SVD), e.g., [14], or regression algorithms for the matrix factorization process during the training stage [19]. The principal limitation of these approaches is that the number of available images is limited to a discrete set of training views, and recognition accuracy degrades when the target view and the views used for training are significantly different.
View-invariant gait recognition approaches can be classified further into geometry-based approaches [20], subspace learning-based approaches [21] and metric learning-based approaches [18]. In geometry-based approaches, the geometrical properties of gait images are used as features to carry out recognition. Using this approach, Kale et al. proposed to synthesize side-view gait images from any arbitrary view, assuming that the person is represented as a planar object on a sagittal plane [22]. Their method performs well when the angle between the image plane and the sagittal plane of the person is small; however, accuracy degrades significantly when this angle is large [23]. Subspace and metric learning-based approaches do not depend on this angle. Metric learning approaches estimate a weighting vector that sets the relevance of the matching score related to each feature and use this weighting vector to estimate a final recognition score [23]. The pairwise RankSVM [24] is used by Kusakunniran et al. [19] to improve gait recognition performance under view angle variation, and for cases when the person wears extra clothing or accessories and carries objects.
Subspace learning-based approaches project features onto a subspace that is learned from training data and then estimate a set of view-invariant features. Liu et al. [25] propose an uncorrelated discriminant simplex analysis method to estimate the feature subspace, while Liu et al. [18] propose the use of the joint principal component analysis (JPCA) to estimate the joint gait feature pairs subspace with several different view angles. View-invariant gait recognition methods based on subspace learning approaches have been shown to achieve high recognition rates.
Dimensionality reduction is considered a within-class multimodality problem if each class can be divided into several clusters [26]. In this case, during the training stage, the system creates a set of clusters using similarities among view angles. To analyze the subspaces obtained after dimensionality reduction, a preprocessing step is used to manipulate the high-dimensional data. This is especially important when gait energy images (GEIs) are used as features, because the dimensionality of the feature space is usually much larger than the number of training samples. This problem is known as the small sample size (SSS) problem [27] or the under-sampling problem (USP) [28], and results in a singular sample scatter matrix. A common solution to this problem is to use principal component analysis (PCA) [28] to reduce the dimensionality of the feature space. A potential drawback of this approach is that PCA may discard dimensions containing important discriminant information [29]. In other approaches, such as that of Mansur et al. [30], a model for each view angle (MvDA) is constructed independently; however, this approach results in a higher computational cost and requires the use of cross-dataset information.
This paper presents an appearance-based gait recognition framework that helps overcome the limitations associated with different view angles. It extends our work in [31] by providing a more detailed description of the methodology, as well as an extensive analysis and comparison of the framework's performance. The proposed framework, which is based on subspace learning, employs GEIs as features and uses direct linear discriminant analysis (DLDA) to create a single projection model used for classification. This approach differs from previously proposed approaches, such as the View Transformation Model (VTM) cross-view and multi-view gait recognition scheme proposed by Kusakunniran et al. [19], which is based on a view transformation model using a multilayer perceptron and reduces the GEI size. The advantages of the proposed framework, called Joint-DLDA hereinafter, are manifold: (1) It does not require creating independent projection models, one for each distinct view angle, for classification. This is particularly useful in practical situations where the test data may be acquired at a view angle that does not exist in the gallery data; a single projection model for the classification of several angles can handle this situation. (2) It can handle high-dimensional feature spaces. (3) It has a considerably lower computational complexity than other approaches, as it uses a simple classification stage. Evaluation using the CASIA-B gait database [30] shows that the proposed framework outperforms several recently proposed view-invariant approaches in terms of recognition rate and computational time.
The rest of the paper is organized as follows. Section 2 describes the proposed framework in detail, Section 3 provides the evaluation results, and Section 4 concludes the paper.

Proposed Framework
The proposed gait recognition framework consists of three stages: computation of GEIs, joint model estimation, subspace learning using DLDA and person recognition, as shown in Figure 1. Each of these stages is described next.

Computation of GEIs
Several approaches have been developed for gait representation. A suitable approach is the spatio-temporal gait representation called the gait energy image (GEI), proposed by Han and Bhanu [13]. First, the human silhouettes of a walking sequence are extracted. Then, the extracted binary silhouettes are normalized such that each silhouette image has the same height and its upper half is centered with respect to a horizontal centroid [13]. A GEI is obtained as the average of the normalized binary silhouettes, as follows [32,33]:

$$G_{j,k,v}(x, y) = \frac{1}{N_F} \sum_{t=1}^{N_F} B_{j,k,v,t}(x, y) \quad (1)$$

where $G_{j,k,v}(x, y)$ is the (x, y)-th gray value of the GEI of the j-th sequence captured at the v-th view angle, which corresponds to the k-th class; $B_{j,k,v,t}(x, y)$ is the (x, y)-th value of the binary silhouette of the t-th frame of the sequence; K, J and V are the number of classes (persons), sequences per class and view angles per sequence, respectively; and $N_F$ is the total number of frames in the walking cycle. Figure 2 shows a set of normalized binary silhouette images representing a walking cycle of two different persons, and the corresponding GEIs.
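The averaging step that produces a GEI can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the silhouettes have already been extracted, height-normalized and centered; the function name `gait_energy_image` and the toy data are hypothetical and not part of the original work.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average a stack of aligned binary silhouettes into one GEI.

    silhouettes: array of shape (N_F, H, W) with values in {0, 1},
    assumed to be already height-normalized and horizontally centered.
    """
    frames = np.asarray(silhouettes, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (N_F, H, W) stack")
    # Pixel-wise mean over the N_F frames of the walking cycle.
    return frames.mean(axis=0)

# Toy walking cycle: 4 frames of 2 x 3 pixels (synthetic, illustration only).
cycle = np.array([
    [[1, 0, 0], [1, 1, 0]],
    [[1, 0, 0], [1, 1, 0]],
    [[1, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [1, 0, 1]],
])
gei = gait_energy_image(cycle)
print(gei)
```

Pixels that are foreground in every frame take the value 1, while pixels covered only during part of the cycle take intermediate gray values.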

Joint Model Estimation
The proposed framework estimates a joint projection model that avoids creating a model independently for each view angle. Once the GEIs of all sequences with different view angles for each person k are obtained by Equation (1), these GEIs are concatenated to generate the k-th input matrix $X_k$, which has a size of d × $m_k$, where d is the total number of pixels in each GEI and $m_k$ = J × V, where J is the number of sequences per class and V is the number of view angles per class. The training set X is generated by concatenating all input matrices $X_k$, k = 1, 2, ..., K, where K is the number of classes. The size of the training set X is therefore d × M, where M is the total number of GEIs of all classes. Figure 3 shows the generation of the training set X.

Figure 3. Illustration of the joint model constructed by using the training data corresponding to K classes using gait energy images (GEIs) of the CASIA-B database [15]. The class (k) in this figure consists of all different view angles V and samples available for subject k.
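The construction of the joint training set can be sketched as follows. This is an illustrative sketch only: the input layout (one array of shape (J, V, H, W) per class), the function name `build_joint_training_set` and the random stand-in data are assumptions made for the example.

```python
import numpy as np

def build_joint_training_set(geis_by_class):
    """Stack every GEI of every class and view angle into one d x M matrix.

    geis_by_class: list of length K; entry k is an array of shape
    (J, V, H, W) holding the GEIs of class k (assumed layout).
    Returns X (d x M) and a length-M vector of class labels.
    """
    columns, labels = [], []
    for k, geis in enumerate(geis_by_class):
        geis = np.asarray(geis, dtype=np.float64)
        J, V, H, W = geis.shape
        # X_k has size d x m_k, with d = H * W and m_k = J * V.
        X_k = geis.reshape(J * V, H * W).T
        columns.append(X_k)
        labels.extend([k] * (J * V))
    return np.hstack(columns), np.array(labels)

# Random stand-in data: K = 3 classes, J = 2 sequences, V = 4 view angles.
rng = np.random.default_rng(0)
K, J, V, H, W = 3, 2, 4, 8, 8
geis_by_class = [rng.random((J, V, H, W)) for _ in range(K)]
X, y = build_joint_training_set(geis_by_class)
print(X.shape, y.shape)
```

All view angles of a subject share one class label, which is what lets a single projection model be learned for every angle.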
Since the size of X is too large, a dimensionality reduction method must be used. DLDA is a suitable approach, because it is effective in separating classes and reducing the intra-class variance, while reducing the dimensionality. The discriminant properties of the DLDA ensure that the classes defined by different view angles can be discriminated well enough. In other words, when the training set contains several view angles, the discriminant properties of DLDA can effectively separate the classes represented by the different view angles in the projected subspace; thus allowing for the characterization of query view angles even if they are not included in the training set [8]. Thus, the DLDA is used for estimating a joint projection matrix W from the input matrix X.

Direct Linear Discriminant Analysis
To estimate the joint projection model, consider the matrix X of size d × M, whose M d-dimensional column vectors are the GEIs of all view angles of all individuals contained in the training set (see Figure 3). Let the number of GEIs in class k be $m_k$, so that $M = \sum_{k=1}^{K} m_k$ denotes the total number of GEIs in X; then the matrix $X_k \in R^{d \times m_k}$ that contains all GEIs belonging to the k-th class is given by:

$$X_k = [x_{k,1}, x_{k,2}, \ldots, x_{k,m_k}]$$

where each column $x_{k,i}$ is a vectorized GEI and d is the number of pixels or features. Then, the matrix containing all GEIs is given by:

$$X = [X_1, X_2, \ldots, X_K]$$

Next, we employ DLDA for dimensionality reduction, as shown in Figure 4, to project X into a lower dimensional embedding subspace. Let $z_i \in R^r$, 1 ≤ r ≤ d, be a low-dimensional representation of $X_i$, where r is the dimension of the embedding subspace; the embedded samples are then given by $z_i = W^T X_i$, where $W^T$ denotes the transpose of the transformation matrix [29,31]. The purpose of DLDA is to find a projection matrix W that maximizes the ratio of the between-class scatter matrix, $S^{(b)}$, to the within-class scatter matrix, $S^{(w)}$, also known as Fisher's criterion:

$$W = \arg\max_{W} \frac{|W^T S^{(b)} W|}{|W^T S^{(w)} W|} \quad (4)$$

using the procedure described in the block diagram of Figure 4, where:

$$S^{(b)} = \sum_{k=1}^{K} m_k (\mu_k - \mu)(\mu_k - \mu)^T$$

$$S^{(w)} = \sum_{k=1}^{K} \sum_{X_i \in \text{class } k} (X_i - \mu_k)(X_i - \mu_k)^T$$

where $\mu_k$ is the mean of the samples belonging to class k and $\mu$ is the mean of all samples in the dataset. If the number of samples is smaller than their dimension, both $S^{(b)}$ and $S^{(w)}$ may become singular. For example, the within-class scatter matrix $S^{(w)}$ becomes singular if the dimension of the samples is much larger than the number of samples, because its rank is at most M − K. This is a common situation in gait recognition applications, as well as in some face recognition applications. In order to prevent $S^{(w)}$ from becoming singular, Belhumeur et al. [10] propose reducing the dimensionality of the feature space by using PCA, so that the number of retained features is at most M − K, and then applying Linear Discriminant Analysis (LDA), also known as Fisher's criterion, as given by Equation (4). Maximizing Fisher's criterion thus requires reducing the within-class scatter while increasing the between-class scatter. Dimensionality reduction by using PCA is based on data variability; PCA allows discarding those dimensions that do not contain important discriminant information.

In DLDA, the diagonalization of the between-class scatter matrix $S^{(b)}$ is given by (see Figure 4):

$$V^T S^{(b)} V = \Lambda$$

where V and Λ denote the eigenvectors and eigenvalues of the matrix $S^{(b)}$, respectively. Let Y denote a matrix of dimension d × r, where r ≪ d, whose r columns are the columns of V that correspond to the eigenvectors associated with the largest eigenvalues, such that:

$$Y^T S^{(b)} Y = D_b > 0 \quad (10)$$

where the matrix $D_b$ of dimension r × r is a submatrix of Λ. Next, let $Z = Y D_b^{-1/2}$; from Equation (10) it follows that:

$$Z^T S^{(b)} Z = I \quad (11)$$

Thus Z unitizes $S^{(b)}$ and reduces the dimensionality from d to r. Let us now diagonalize the matrix $Z^T S^{(w)} Z$ using PCA as follows:

$$U^T Z^T S^{(w)} Z U = D_w \quad (12)$$

where $U^T U = I$ and $D_w$ is a diagonal matrix. Defining $A = U^T Z^T$, Equation (12) becomes:

$$A S^{(w)} A^T = D_w$$

By multiplying Equation (11) by $U^T$ on the left and by U on the right, and by using $A = U^T Z^T$, it follows that:

$$A S^{(b)} A^T = I$$

Because A diagonalizes $S^{(w)}$, the dimensionally reduced input vector is then given by:

$$z_i = W^T X_i, \qquad W^T = D_w^{-1/2} A \quad (16)$$

The expression in Equation (16) is used to project the gallery and during testing. Figures 5 and 6 show the LDA and DLDA projections, respectively, of GEIs belonging to two different classes of the CASIA-B database, where one class is represented by circles and the other class by crosses. In both classes, two different view angles are used: the 0° view angle is represented by thin circles and thin crosses, while the 90° view angle is represented by thick circles and thick crosses.
It is important to note that, even when they belong to the same class, GEIs from view angle 0° differ from those at 90°; thus, after projection, clusters of crosses and of circles, either thin or thick, should appear. In other words, the projection model must simultaneously reduce the intra-class variability and separate the crosses from the circles, which represent different classes; i.e., increase the distance between classes. Figure 5 shows that LDA tends to cluster the samples according to the view angle instead of by class, while DLDA (Figure 6) tends to separate the samples by class instead of by view angle, thus improving classification even when distinct view angles are used. The projection must therefore allow all classes to be clustered independently of the view angle. To achieve this goal, DLDA first diagonalizes the between-class scatter matrix $S^{(b)}$ in order to discard the null space of $S^{(b)}$, which does not contain useful information, instead of discarding the null space of $S^{(w)}$, which includes the most important information for discrimination purposes [29]. By using DLDA, we obtain a transformation matrix W that projects the data into a low-dimensional subspace with appropriate class separability.
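The DLDA procedure described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scatter matrices use the standard class-size-weighted definitions, the eigenvectors of the between-class scatter are obtained from a small K × K matrix to avoid forming d × d matrices, and the function name `dlda_fit`, the tolerance `eps` and the synthetic data are hypothetical.

```python
import numpy as np

def dlda_fit(X, y, r=None, eps=1e-10):
    """Sketch of Direct LDA. X is d x M (samples as columns), y has M labels.

    Returns W^T of shape (r, d), so that z = W^T x is the reduced vector.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    classes = np.unique(y)
    mu = X.mean(axis=1, keepdims=True)

    # Factor the scatter matrices: S_b = Phi_b Phi_b^T, S_w = Phi_w Phi_w^T.
    Phi_b = np.hstack([
        np.sqrt((y == k).sum()) * (X[:, y == k].mean(axis=1, keepdims=True) - mu)
        for k in classes
    ])
    Phi_w = np.hstack([
        X[:, y == k] - X[:, y == k].mean(axis=1, keepdims=True) for k in classes
    ])

    # Step 1: diagonalize S_b through the small K x K matrix Phi_b^T Phi_b.
    evals, evecs = np.linalg.eigh(Phi_b.T @ Phi_b)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > eps * evals[0]          # discard the null space of S_b
    if r is not None:
        keep[r:] = False
    evals, evecs = evals[keep], evecs[:, keep]
    Y = (Phi_b @ evecs) / np.sqrt(evals)   # unit-norm eigenvectors of S_b
    Z = Y / np.sqrt(evals)                 # Z = Y D_b^{-1/2}, Z^T S_b Z = I

    # Step 2: diagonalize Z^T S_w Z.
    G = Z.T @ Phi_w
    D_w, U = np.linalg.eigh(G @ G.T)
    D_w = np.maximum(D_w, eps)             # guard against division by zero
    A = U.T @ Z.T

    # Step 3: W^T = D_w^{-1/2} A.
    return A / np.sqrt(D_w)[:, None]

# Synthetic under-sampled data: d = 50 features, only M = 18 samples.
rng = np.random.default_rng(0)
d, K, m = 50, 3, 6
centers = rng.normal(scale=3.0, size=(d, K))
X = np.hstack([centers[:, [k]] + rng.normal(size=(d, m)) for k in range(K)])
y = np.repeat(np.arange(K), m)
W_T = dlda_fit(X, y)
Z_low = W_T @ X
print(W_T.shape, Z_low.shape)
```

With K = 3 classes the between-class scatter has rank at most K − 1 = 2, so the sketch returns a 2 × 50 projection even though the number of samples is smaller than the feature dimension.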

Gallery Estimation
After the projection model is estimated using DLDA, each GEI in the gallery used by the KNN classification stage is projected with Equation (16), which yields a projected feature vector $X_G(s)$ for every gallery sample, where s is any of the J GEIs corresponding to any of the K classes from any of the V view angles available in the gallery set. Figure 7 shows the block diagram of the gallery estimation process.

Classification Stage
During classification, the system uses a GEI of the person to be identified, $X_{PG}$, which is projected into the dimensionally reduced space using Equation (16), as follows:

$$X_P = W^T X_{PG}$$

$X_P$ is then fed into the KNN stage, where it is compared with the feature vectors $X_G(s)$ stored in the database. The distance between the input vector and each vector in the gallery is computed, and the k vectors $X_G(j)$ with the smallest distances are kept. Finally, the input GEI is assigned the class label that occurs most often among these k projected vectors. The classification process is illustrated in Figure 8.
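The nearest-neighbor vote described above can be sketched as follows. The Euclidean distance is an assumption, since the text does not name the distance measure; the function name `knn_classify` and the toy gallery are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_classify(x_p, gallery, gallery_labels, k=1):
    """Classify one projected probe vector by majority vote of its k neighbors.

    x_p: projected probe of shape (r,).
    gallery: projected gallery vectors as columns, shape (r, N).
    gallery_labels: length-N sequence of class labels.
    """
    # Euclidean distance from the probe to every gallery vector (assumption).
    dists = np.linalg.norm(gallery - x_p[:, None], axis=0)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(gallery_labels[i] for i in nearest)
    # On ties, the label seen first (the closest neighbor) wins.
    return votes.most_common(1)[0][0]

# Toy projected gallery with two classes in a 2-D subspace.
gallery = np.array([[0.0, 0.1, 1.0, 1.1, 0.9],
                    [0.0, 0.1, 1.0, 0.9, 1.1]])
labels = np.array([0, 0, 1, 1, 1])

label_a = knn_classify(np.array([0.8, 0.9]), gallery, labels, k=3)
label_b = knn_classify(np.array([0.05, 0.0]), gallery, labels, k=1)
print(label_a, label_b)
```

Setting k = 1 gives the simple 1-NN rule, in which the probe takes the label of its single closest gallery vector.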

Evaluation Results
The performance of the proposed framework was evaluated using the CASIA-B gait database [8] with the GEI features obtained using the method proposed by Bashir et al. [8]. The CASIA-B database consists of 124 subjects (classes), each captured at 11 view angles from 0° to 180° in steps of 18°, with 10 walking sequences per angle. These sequences include six normal walking sequences, which are used to perform the experiments. The size of the GEIs used in the proposed framework is 240 × 240.
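The figures quoted above make the under-sampling problem concrete. The short calculation below is only a worked example, assuming one GEI per normal walking sequence and that every such GEI is available; the variable names are illustrative.

```python
# Values quoted in the text for CASIA-B and the GEI size.
subjects, angles, normal_sequences = 124, 11, 6
height, width = 240, 240

d = height * width                         # features (pixels) per GEI
M = subjects * angles * normal_sequences   # normal-walking GEIs in total
max_rank_sw = M - subjects                 # upper bound M - K on rank of S(w)

print(d, M, max_rank_sw)
```

Even if every normal-walking GEI were used for training, the within-class scatter matrix of size d × d would have rank at most M − K, which is far below d, so it would be singular.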
The proposed framework is evaluated using three different configurations. The first configuration is similar to that proposed by Mansur et al. [30], which is used to evaluate their MvDA method. The second configuration is used to evaluate the VTM models proposed by Kusakunniran et al. [34][35][36], besides the configuration proposed by Bashir et al. [8]. Finally, the recognition performance of the proposed framework is also evaluated using the configuration used by Yu et al. [15], which employs a structure for evaluating the effect of the view angle.
Mansur et al. [30] propose to use two different databases, the CASIA-B and the OULP. For the CASIA-B database, they use two non-overlapping groups. The first one comprises 62 classes and is used for training; the second one comprises the remaining 62 classes and is used for testing. The testing group is divided into two subsets, gallery and probe, where the gallery subset consists of the six samples of each class corresponding to the view angle of 90° available in the testing group, while the probe subset is divided into five subsets, each containing the six samples corresponding to one of the view angles 0°, 18°, 36°, 54° and 72°. Only for this configuration, we employ two transformation matrices called JDLDA(1) and JDLDA(2). The first transformation matrix, JDLDA(1), is obtained using only the samples available in the training group, without any modification, to show that the proposed method is able to solve the small sample size problem. The second transformation matrix, JDLDA(2), is obtained by increasing the number of samples in the training group, by rotating the samples of the view angles 180°, 162°, 144°, 126° and 108° of the training group. The evaluation results obtained are shown in Table 1.

Table 1. Recognition rate for each probe view angle (gallery view angle 90°).
Method         0°     18°    36°    54°    72°
(unlabeled)    17%    30%    46%    63%    83%
MvDA [30]      17%    27%    36%    64%    95%
JDLDA(1)       16%    21%    32%    50%    84%
JDLDA(2)       20%    25%    37%    58%    94%

The performance obtained using the second configuration described above is compared with the framework proposed by Yu et al. [15], where only the CASIA-B database is used. In this configuration, four samples for each one of the 11 view angles in each one of the 124 classes are used to build the training subset and to estimate the projection matrix. This procedure is also followed for the gallery. The remaining two samples for each one of the 11 view angles in each class are used for testing.
The testing is performed by fixing each one of the 11 view angles θ G in turn as the gallery, using all samples available in the gallery subset for that angle, and comparing them with all samples available in the probe subset for each one of the probe angles θ P. The evaluation results obtained by [15] are presented in Table 2, where each row corresponds to the results obtained for a gallery view angle θ G, while each column corresponds to a testing view angle θ P. These results are shown in Figure 9a. The evaluation results obtained with the JDLDA are shown in Table 3 and Figure 9b. In this case, the transformation matrix is obtained using the training and gallery subsets as proposed by Yu et al. [15].

In the third configuration, only the CASIA-B database is used, following two rules to divide the data. In the first rule [19,[34][35][36], the database is divided into two groups: the training group, which consists of 24 classes, and the testing group, which has the remaining 100 classes. In the second rule [8], the training group consists of 74 classes and the testing group comprises the remaining 50 classes. In both rules, the training and testing groups do not overlap, and the testing group is divided into the gallery and probe subsets. The gallery subset consists of four samples of each angle of all classes available in the testing group, while the probe subset consists of the remaining two samples of each view angle of each class of the testing group. To evaluate the performance of the proposed framework, all samples of each view angle of the gallery subset are compared with the samples in the probe subset, ordered according to the view angle. The results obtained using rule 1 are shown in Table 4, while the results obtained using rule 2 are shown in Table 5. In both cases, each row corresponds to the results obtained for a given gallery view angle θ G, while each column corresponds to a given probe view angle θ P.
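The gallery-angle by probe-angle evaluation described above can be sketched as a matrix of correct classification rates. This is an illustrative sketch assuming a 1-NN rule with Euclidean distance on already projected feature vectors; the function name `cross_view_ccr`, the argument layout and the synthetic data are hypothetical.

```python
import numpy as np

def cross_view_ccr(gal, gal_ids, gal_angles, prb, prb_ids, prb_angles):
    """Return (angles, ccr), where ccr[i, j] is the 1-NN correct classification
    rate for gallery angle angles[i] and probe angle angles[j].

    gal, prb: projected feature vectors as columns, shapes (r, N_g), (r, N_p).
    *_ids, *_angles: per-column subject labels and view angles.
    """
    angles = np.unique(np.concatenate([gal_angles, prb_angles]))
    ccr = np.full((angles.size, angles.size), np.nan)
    for i, theta_g in enumerate(angles):
        g = gal[:, gal_angles == theta_g]
        g_ids = gal_ids[gal_angles == theta_g]
        if g.shape[1] == 0:
            continue
        for j, theta_p in enumerate(angles):
            p = prb[:, prb_angles == theta_p]
            p_ids = prb_ids[prb_angles == theta_p]
            if p.shape[1] == 0:
                continue
            # Pairwise squared Euclidean distances, shape (N_p, N_g).
            d2 = ((p.T[:, None, :] - g.T[None, :, :]) ** 2).sum(axis=2)
            predicted = g_ids[d2.argmin(axis=1)]
            ccr[i, j] = (predicted == p_ids).mean()
    return angles, ccr

# Synthetic 1-D features: identity dominates, view angle adds a small offset.
gal = np.array([[0.0, 0.1, 10.0, 10.1]])
gal_ids = np.array([0, 0, 1, 1])
gal_angles = np.array([0, 90, 0, 90])
prb = gal + 0.01
angles, ccr = cross_view_ccr(gal, gal_ids, gal_angles, prb, gal_ids, gal_angles)
print(angles)
print(ccr)
```

Rows of the resulting matrix correspond to gallery view angles and columns to probe view angles, which mirrors the row and column layout described for the result tables.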
Tables 1-5 show that the proposed framework provides a very competitive recognition rate. The following are significant features of JDLDA: it does not require the use of two different datasets or modification of the number of samples in X to overcome the USP; it achieves its best performance for the most challenging angles, i.e., 0° and 180°; and finally, it provides very competitive recognition rates when a simple 1-NN classification model is used. The proposed framework, JDLDA(2), achieves a recognition rate close to 100% when the probe view angle is 72° (see Table 1). Figure 9 provides a graphical comparison between the evaluation results obtained by Yu et al. [15] (Figure 9a) and those obtained using the proposed framework (Figure 9b). In both cases, the same experimental setup is used. Figure 9 shows that the proposed framework provides a higher correct classification rate (CCR) than the system reported in [15], even when the view angle of the gallery data and that of the probe data are different. The main drawback of some existing state-of-the-art methods, e.g., VTM and MvDA, is their requirement of building an independent model for each probe view angle to partially overcome the USP. This is an important limitation because these methods require previous knowledge about the view angles to be tested. The proposed framework does not require any previous knowledge about the probe view angles. Other approaches have been proposed that depend on a single transformation matrix, but they usually require increasing the number of samples to overcome the USP [30]. In these approaches, the view angles of the extra samples and those of the test samples must be close; this greatly reduces the ability to transfer the estimated parameters across two different gait datasets if the view angles included in them are not relatively close. Another advantage of our scheme is the time required for classification.
Some methods, such as that proposed in [14], may require up to 6 h for system training [19]. The proposed JDLDA framework is much more efficient, not only because it provides a higher recognition rate, but also because it requires as few as 25 s per test. This means that a complete set of experiments may take approximately 40 min. Figure 10 shows the main time-consuming processes in the proposed framework, as a fraction of the total time consumed. These processes are reading the GEI features from the dataset, creating the joint model, computing the matrix $W^T$, generating the k-NN model and classification. From Figure 10 it can be observed that the most time-consuming step is reading the dataset to create the joint model.

Conclusions
This paper proposed a framework for view-angle invariant gait recognition that is based on the estimation of a single joint model. The proposed framework is capable of classifying GEIs computed from sequences acquired at different view angles. It provides higher accuracy with lower computational complexity than other previously proposed approaches. The estimated joint model used in the framework, which is based on DLDA, helps to reduce the under-sampling problem with remarkable results. Evaluation experiments indicate that it is possible to obtain a projection matrix independently of the gallery subset, which allows, in several practical applications, new classes to be included without the need to recalculate the projection matrix. The evaluation results also show that the proposed scheme improves on the performance of several previously proposed schemes, although its performance still degrades when the incoming angle and the gallery angle are different. Therefore, in the future it would be interesting to analyze the possibility of developing a gait recognition scheme based on a global model that is able to keep the same performance independently of the difference between the incoming and gallery angles.