Gait Recognition with Self-Supervised Learning of Gait Features Based on Vision Transformers

Gait is a unique biometric trait with several useful properties. It can be recognized remotely and without the cooperation of the individual, with low-resolution cameras, and it is difficult to obscure. Therefore, it is suitable for crime investigation, surveillance, and access control. Existing approaches for gait recognition generally belong to the supervised learning domain, where all samples in the dataset are annotated. In the real world, annotation is often expensive and time-consuming. Moreover, convolutional neural networks (CNNs) have dominated the field of gait recognition for many years and have been extensively researched, while other recent methods such as vision transformer (ViT) remain unexplored. In this manuscript, we propose a self-supervised learning (SSL) approach for pretraining the feature extractor using the DINO model to automatically learn useful gait features with the vision transformer architecture. The feature extractor is then used for extracting gait features on which the fully connected neural network classifier is trained using the supervised approach. Experiments on CASIA-B and OU-MVLP gait datasets show the effectiveness of the proposed approach.


Introduction
Gait is a biometric feature that describes the walking pattern of every individual. Compared with other biometric features such as the face, iris, fingerprint, and ears, gait has several unique properties. Gait can be captured from a greater distance than face or iris, which also means that the person does not have to interact with the sensor, i.e., a camera. In addition, gait is difficult to change, making it a reliable biometric feature. The gait of an individual can also be extracted from low-resolution sensors, such as those found in most current surveillance cameras. The range of applications of gait biometric is wide, e.g., in surveillance scenarios, access control, and identification of individuals for crime investigation purposes.
Gait biometric has several limitations when applied in the real world. First, factors such as illumination changes, shadows, and occlusions can significantly alter the appearance of an individual's gait. Second, the cameras that capture an individual's gait often have different viewing angles, resulting in drastically different appearances of the gait, even though the individual's gait signature is the same. Third, the carrying modalities are also commonly present, such as individuals wearing a bag, coat, hat, or another accessory, which visually change the individual's gait from an appearance perspective.
In the literature, there are two general approaches for tackling the task of gait recognition. The first is compressing the silhouettes of a single gait cycle of an individual into a single image, which serves as a gait feature representation [1,2]. Han et al. [1] propose compressing the individual's binary silhouettes of one gait cycle, extracted from video frames by background subtraction, into one compact gait representation, called Gait Energy Image (GEI). The second approach considers the gait as a sequence of silhouettes of an individual, (LDA) [2,23,24], or CNN [25-27]. Finally, the similarity between features is computed, for example, by using the cosine similarity. Some methods also propose to integrate the above steps into an end-to-end network [28]. Wang et al. [29] proposed the feature-distribution-consistent Generative Adversarial Network (GAN) to tackle the problem of cross-view gait recognition.
Video-based methods extract silhouettes from a video sequence in the same way as image-based methods; however, instead of compressing them into a single representation, they use the silhouettes in their raw form as input data. For every individual, all silhouettes are fed into the network and gait features are extracted. Chao et al. [3] proposed a method called GaitSet, which considers the gait as a set consisting of permutable silhouettes. The method is able to learn identity information from the set and proved to be effective in solving the problem of different viewpoints and different carrying conditions. GaitPart [4] improved on GaitSet by considering that the different body parts of an individual carry information of different significance, and thus modeled separate spatio-temporal representations for different body parts. mmGaitSet [30] is another improvement of the GaitSet method, in which information about an individual's body posture was incorporated into the network. Some approaches propose to use a model-based gait feature based on the individual's body pose to solve the problem [5,31,32]. Wolf et al. [33] proposed the use of 3D convolutions to better capture the spatio-temporal features of gait. Although video-based methods produce better results than image-based methods, they are generally more difficult to train.
Regarding the type of backbone network used in the mentioned deep learning approaches, CNNs are used almost exclusively.

Self-Supervised Learning
In recent years, a new learning paradigm in deep learning has piqued the interest of researchers: self-supervised learning (SSL). Self-supervised learning aims at solving the ever-present problem of the lack of labeled data for training deep learning models. By using self-supervision, the model learns without any labels, by means of pretext learning, where one part of the input data is learned from another part of the same input. Many self-supervised methods exist nowadays, such as [17-20,34]. SimCLR [18] used contrastive learning, with a contrastive loss function, by maximizing the similarity between two augmented views of the same image. BYOL [17] used two networks, an online and a target network, which have the same architecture but different weights. The target network trains the online network, and the target network's weights are updated through the exponential moving average of the online network. SwAV [34] used instance-level discrimination, where each image or its transformation is considered as a separate class. The goal of the approach is to learn an embedding in which semantically similar images are grouped closer together in the feature space, by means of contrastive loss and image augmentation.
In DINO [20], knowledge distillation with no labels is used. The DINO framework consists of two networks, a teacher $\Phi_t$ and a student $\Phi_s$, that share the same architecture but have different parameters, $\varphi_t$ and $\varphi_s$, respectively. The goal of the student network is to match the probability distribution of the teacher network. The method uses a multicrop strategy [34] during training, where for every input image two global views are generated (about 50% of the input image), along with several local views of the same image (less than 50% of the input image). The global views pass through the teacher network, while the global and local views both pass through the student network. The cross-entropy loss is used to measure the similarity between the output vectors of the teacher and student networks. The student parameters $\varphi_s$ are learned by minimizing the cross-entropy loss with stochastic gradient descent, while the teacher parameters $\varphi_t$ are defined as an exponential moving average of the student parameters. In that way, the framework is able to gradually learn useful features from input images, learning the global-to-local correspondences from different views of the same image. Further, in contrast to many SSL methods [18,19], DINO does not require negative samples, which greatly simplifies the training procedure.
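The student-teacher scheme described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the DINO implementation: it omits the centering of teacher outputs used in the original method, the temperatures `tau_t` and `tau_s` and the momentum value are assumed illustrative defaults, and network parameters are represented as flat lists of floats.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(teacher_out, student_out, tau_t=0.04, tau_s=0.1):
    """Mean cross-entropy between teacher and student distributions.

    teacher_out: outputs for the global views only.
    student_out: outputs for all views, global views first and in the
                 same order as in teacher_out (an assumption of this sketch).
    Pairs in which both networks see the same global view are skipped.
    """
    total, pairs = 0.0, 0
    for i, t_logits in enumerate(teacher_out):
        p_t = [math.exp(v) for v in log_softmax(t_logits, tau_t)]
        for j, s_logits in enumerate(student_out):
            if i == j:
                continue
            log_p_s = log_softmax(s_logits, tau_s)
            total += -sum(p * lp for p, lp in zip(p_t, log_p_s))
            pairs += 1
    return total / pairs


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher parameters as an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In this sketch only the student would receive gradients from `dino_loss`; the teacher is changed exclusively through `ema_update`, which is what removes the need for negative samples.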

Self-Supervised Gait Recognition
Self-supervised deep learning approaches have only recently attracted the interest of researchers in the area of gait recognition. As a result, there are not many methods that use SSL for gait recognition. Among the first to investigate the application of SSL learning to gait recognition was WildGait [35]. In their manuscript, the authors created the novel Uncooperative Wild Gait (UWG) dataset, in which gait representations are automatically annotated by recognizing skeletal sequences of individuals. In addition, they propose the use of the SSL approach for pretraining the Spatio-Temporal Graph Convolutional Network to utilize a large number of samples for creating useful gait representations. Finally, the model is fine-tuned in a supervised manner to the target datasets and evaluated. Another SSL approach to the gait recognition task is SelfGait [36]. In the aforementioned manuscript, the authors propose using the SSL approach for learning the spatio-temporal gait representation from unlabeled samples. They use the horizontal pyramid mapping (HPM) [3] and micro-motion template builder (MTB) [4] spatio-temporal backbones, which are specifically designed for the gait recognition task. As in WildGait [35], the proposed approaches use CNN as the backbone network.

The Proposed Approach
In this section, we describe our proposed approach, along with a detailed explanation of its key components. The overall processing pipeline is depicted in Figure 1. The first part of our proposed approach uses the DINO self-supervised model to learn gait features from unlabeled training data, as shown in Figure 1a. Next, a simple FCNN is used as a classifier for the features obtained by the DINO feature extractor model, and is trained on gallery samples and tested on query samples, as shown in Figure 1b. Labeled samples are only needed for training the FCNN classifier, as the classifier is trained using a supervised approach.

Preprocessing
The first step in our proposed approach is data preparation. In general, assuming the input data are in the form of raw RGB image sequences taken from a camera, the typical gait data preprocessing steps [12,27] are applied. First, the noise is filtered from the images. Second, the silhouettes are extracted for every subject in binary form, using, e.g., a background subtraction method. Third, images are normalized so that all silhouettes have the same height and are horizontally aligned. Then, a gait cycle estimation is performed in order to construct a final gait representation. In this manuscript, image-based gait features are used in the form of GEI [1]. GEI is able to preserve the static information of a gait sequence, such as the shape of the subject's body, and the subject's dynamic information, such as the variation of frequency and phase during the subject's locomotion. The GEI representation G for a given gait cycle can be calculated with the formula

$$G(i, j) = \frac{1}{N} \sum_{t=1}^{N} I(i, j, t)$$

where N represents the number of silhouette frames in the gait cycle, t represents the frame number in the gait cycle at a moment in time, and I(i, j, t) is the original silhouette image of frame t with (i, j) values in the 2D image coordinate.
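As a minimal sketch of the GEI computation, the pixel-wise average over one gait cycle can be written as follows. The function name and the nested-list image representation are assumptions made for illustration; silhouettes are assumed to be already normalized and aligned to a common size.

```python
def gait_energy_image(silhouettes):
    """Average N aligned silhouette frames pixel by pixel.

    silhouettes: list of N frames, each an H x W nested list whose values
                 are 0 (background) or 255 (foreground).
    Returns an H x W nested list of floats in [0, 255].
    """
    n = len(silhouettes)
    if n == 0:
        raise ValueError("at least one silhouette frame is required")
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    gei = [[0.0] * width for _ in range(height)]
    for frame in silhouettes:
        for i in range(height):
            for j in range(width):
                gei[i][j] += frame[i][j]
    return [[value / n for value in row] for row in gei]
```

Pixels that are foreground in every frame keep the value 255 (static shape), while pixels that are foreground in only some frames take intermediate values (dynamic information).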

Learning Discriminative Gait Features
The second step in our proposed approach is training the feature extractor. In this manuscript, we propose using a self-supervised learning paradigm in order to tackle the problem of learning discriminative gait features. We use the recently proposed method called DINO [20], which showed promising results in various computer vision tasks such as image classification and image retrieval. The DINO architecture is depicted in Figure 2.
Figure 2. The DINO architecture [20]. The goal of the student network is to match the probability distribution of a teacher network using cross-entropy loss, given different views of the same input image.
Originally, DINO constructs a set of eight local views (96 × 96 crops, passed only through $\Phi_s$) and two global views (224 × 224 crops, passed through both $\Phi_t$ and $\Phi_s$). In this work, to adapt to gait-specific data, we use eight local views but with local crops of size 20 × 20, while the two global crops are of size 64 × 64. We change the crop sizes in order to adapt to the sizes of our gait training images while retaining similar ratios of global and local crops as in the original manuscript. Moreover, since DINO was originally trained on ImageNet, we change the augmentations used during training by removing most of the original image augmentations (color jitter, Gaussian blur, solarization, random horizontal flip) and using only the random erasing augmentation, since the removed augmentations do not bring a performance gain when used on gait-specific data.
The DINO method exhibits the ability to segment the foreground objects in an image, i.e., object boundaries, in a self-supervised manner. In natural images, such as those in ImageNet, foreground object segmentation is a difficult problem, considering that many possible variations of the foreground object and the background exist. In a gait recognition scenario, where images are presented in the form of, e.g., GEI, the foreground object, i.e., a subject, is clearly outlined in relation to the background, which could lead to the model focusing its attention on the most significant parts of an image, such as the dynamic features presented as pixels with values in the range (0, 255).
Since gait datasets lack the large amount of data needed to train the ViT model from scratch [13], a fine-tuning strategy is used in this work. The DINO model is pretrained on the ImageNet dataset and then fine-tuned on gait data.
We propose using the DINO method as a feature extractor to produce discriminative features of input images to be used later for classification.

Vision Transformers
DINO uses the vision transformer model [13] as its backbone network, although CNNs also work without modifying the general DINO architecture. The ViT's input consists of patches of resolution p × p that represent non-overlapping sections of the input image. For an image $I \in \mathbb{R}^{H \times W \times C}$, where H represents the height of an image, W represents its width, and C is the number of channels in an image, the resulting image patches are $I_p \in \mathbb{R}^{N \times (p^2 \cdot C)}$, where $N = HW / p^2$ is the number of patches and p is the patch resolution. Patches are linearly projected into an embedding, and a CLS token is added, which serves as a class token, i.e., a representation of the entire input image, and is used for the actual classification. Furthermore, at this step, the positional embeddings are added to help the model retain the positional information of input patches. Then, patch embeddings, positional embeddings, and the CLS token are passed through the standard Transformer Encoder, which consists of self-attention and feed-forward layers, with skip connections. Finally, the output CLS token of the Transformer Encoder is sent to a Multilayer Perceptron (MLP) model for classification.
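The patch arithmetic can be illustrated with a short sketch. It assumes a single-channel image stored as a nested list and hypothetical helper names; the linear projection, CLS token, and positional embeddings are not modeled.

```python
def num_patches(height, width, patch_size):
    """N = H * W / p^2 for non-overlapping p x p patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height * width) // (patch_size ** 2)


def patchify(image, patch_size):
    """Split an H x W nested list into N flattened patches of p * p values.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[top + di][left + dj]
                            for di in range(patch_size)
                            for dj in range(patch_size)])
    return patches
```

For a 64 × 64 global crop, a patch size of 16 gives 16 patches and a patch size of 8 gives 64 patches, so the smaller patch size yields a four times longer token sequence (plus one CLS token in each case).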
We use the small ViT model, as defined by Touvron et al. [37]. Furthermore, we train models with a patch size of 16 and 8 to investigate the influence of patch size on model performance.

Classifier
After the DINO feature extractor model is trained, the gait features for gallery and query images can be extracted and used for classification. In order to classify the features, we propose using a simple FCNN classifier. Accordingly, we set the gait recognition problem as a gait classification problem, where the gallery acts as training data for the FCNN classifier and the query acts as test data. For example, if a gallery contains 100 subjects, we consider that a classification problem with 100 classes. We design a simple FCNN, depicted in Figure 3, that consists of two linear layers, together with batch normalization, the ReLU activation function, and dropout. The hyperparameters of the proposed FCNN are determined empirically. Additionally, we use the center loss [38] to further facilitate learning a more diverse feature representation. The main loss used is the cross-entropy loss, and the combination with center loss is given by the formula

$$L = L_{ce} + \alpha L_{c}$$

where L represents the final loss value; $L_{ce}$ and $L_{c}$ are the values of the cross-entropy loss and center loss functions, respectively; and α is a scalar that balances the influence of the center loss on the overall loss value and is set to α = 0.0001. As in feature extractor training, the images were normalized according to the custom dataset's normalization values. Random erasing is used as a data augmentation technique. Furthermore, in order to boost representation learning, we concatenate the CLS tokens from all 12 blocks of the DINO model as a final input image representation that serves as input to the FCNN classifier. The dimensionality of the CLS token for the small ViT model is 384; thus, the input dimensionality of the FCNN classifier is 12 × 384 = 4608.
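The combined objective and the classifier input size can be sketched as follows. This is a per-sample illustration in plain Python, assuming the common center-loss form of half the squared distance between a feature and its class center; function names and the dictionary of centers are hypothetical, and batching and the learning of the centers are not modeled.

```python
import math

NUM_BLOCKS = 12   # transformer blocks whose CLS tokens are concatenated
CLS_DIM = 384     # CLS token size of the small ViT
FCNN_INPUT_DIM = NUM_BLOCKS * CLS_DIM  # 4608


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def center_loss(feature, center):
    """Half the squared Euclidean distance to the class center (assumed form)."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def total_loss(logits, label, feature, centers, alpha=1e-4):
    """L = L_ce + alpha * L_c for a single labeled sample."""
    return cross_entropy(logits, label) + alpha * center_loss(feature, centers[label])
```

With α = 0.0001 the center term acts as a weak regularizer on top of the cross-entropy loss rather than as a competing objective.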

Experimental Setup
To validate the proposed approach, we conducted experiments to assess the performance of the proposed DINO feature extractor model and the performance of the FCNN classifier trained on features extracted with the feature extractor model. Experiments were conducted in a way that allows for easy comparison with current state-of-the-art models used in gait recognition, following the same dataset splits and comparison metrics. The experimental setup is described next; then, the results are presented and analyzed.

Datasets
In this manuscript, we conducted experiments on two widely used gait recognition datasets: CASIA-B [39] and OU-MVLP [40]. CASIA-B represents a smaller but widely used dataset, while OU-MVLP represents one of the largest gait datasets to date. This allows for analyzing the performance of the proposed approach on a smaller and a larger dataset, to see if the amount of data is critical in training a successful DINO feature extractor.
The CASIA-B dataset [39] is one of the most popular gait datasets in the literature. It consists of 124 subjects, three different walking conditions, and 11 different views (0°-180° in increments of 18°). Walking conditions are normal (NM) with six sequences per subject, walking with a bag (BG) with two sequences per subject, and walking with a coat or a jacket (CL), also with two sequences per subject. In total, 110 sequences are available for each subject in the dataset. Since in this manuscript we use GEI images, this translates to almost 13,600 images in total, with an average of 110 images per subject. We conduct experiments on three partition settings for training and testing, commonly used in the literature. First, the ST (small-sample) setting uses the first 24 subjects for training and the rest (100 subjects) are used for testing. Second, the MT (medium-sample) setting uses the first 62 subjects for training and the rest (62 subjects) are used for testing. Third, the LT (large-sample) setting uses the first 74 subjects for training and the rest (50 subjects) are used for testing. In all three partition settings, the first 4 sequences of the NM modality are used in the gallery, while the remaining 2 sequences of the NM modality are used in the query along with the 2 sequences each of the BG and CL modalities.
The OU-MVLP dataset [40] is one of the largest public gait datasets available today. It consists of 10,307 subjects and 14 different views (0°-90° and 180°-270°, in increments of 15°) per subject. For every view, there are two sequences (#00-01). For training, 5153 subjects are used, while for testing, the remaining 5154 subjects are used. In the test set, sequences with index #01 are used as a gallery, while the ones with index #00 are used as a query. In total, there are over 267,000 GEI images, with approximately 26 GEI images per subject.
Additionally, we resize all images from both datasets to size 64 × 44 as performed in [3,36], to ensure comparison compatibility as well as to lower the computing requirements for training the DINO model. Furthermore, when training the DINO model, the training data are normalized using the mean and standard deviation calculated from the training data used.
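The CASIA-B counts and partition settings can be checked with a few lines of arithmetic. The sketch below assumes subject IDs numbered 1 to 124 and one GEI per sequence with no missing sequences, so the image total is an upper bound; the constant and function names are hypothetical.

```python
NUM_SUBJECTS = 124
VIEWS = list(range(0, 181, 18))              # 0, 18, ..., 180 degrees
SEQUENCES = {"NM": 6, "BG": 2, "CL": 2}      # sequences per subject and view
TRAIN_SUBJECTS = {"ST": 24, "MT": 62, "LT": 74}

SEQUENCES_PER_SUBJECT = sum(SEQUENCES.values()) * len(VIEWS)
MAX_GEI_IMAGES = NUM_SUBJECTS * SEQUENCES_PER_SUBJECT


def split_subjects(setting):
    """Return (train_ids, test_ids) for the ST, MT, or LT setting."""
    n_train = TRAIN_SUBJECTS[setting]
    train = list(range(1, n_train + 1))
    test = list(range(n_train + 1, NUM_SUBJECTS + 1))
    return train, test


# Gallery/query split of the test subjects, as sequence counts per modality.
GALLERY = {"NM": 4}
QUERY = {"NM": SEQUENCES["NM"] - GALLERY["NM"], "BG": 2, "CL": 2}
```

Under these assumptions each subject has 110 sequences, and the test sets contain 100, 62, and 50 subjects for ST, MT, and LT, respectively.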

Experiments
In order to evaluate the performance of our proposed approach, we constructed GEI image representations for each subject in each dataset. Then, we trained DINO feature extraction models on two aforementioned datasets, CASIA-B and OU-MVLP. For each dataset, two models were trained: one with a patch size of 16 and one with a patch size of 8. Next, a simple FCNN classifier was trained on gallery samples, to construct the final model for gait classification. Finally, the trained FCNN classifier was evaluated using the query samples.

DINO Implementation Details
For the implementation of the DINO method, the official GitHub repository was used [41], with slight modifications, as explained in Section 3.2, to account for the different data distribution of gait data in comparison with natural images of the ImageNet dataset, such as adjusted global and local crop sizes and different training data augmentations. In order to fine-tune both the student and the teacher networks, the full ImageNet pretrained DINO model checkpoint was used. In our experiments, we used only small ViT models, which roughly correspond to the standard ResNet-50 [42] architecture in terms of the number of parameters in the network. We trained models with patch sizes 16 and 8 to study the effect of patch size on the model's accuracy. The remaining DINO model parameters, such as the momentum teacher value, teacher temperature, and global and local crop scales, are the same as in the original manuscript [20].

Training Details
We trained the DINO models for 1000 epochs, with a batch size of 32 for all experiments on the CASIA-B and OU-MVLP datasets. The optimizer used was AdamW [43] with a learning rate of 0.0005. The training was performed using one Nvidia 2080Ti 11 GB GPU.
The FCNN classifier was trained for 100 epochs, with a batch size of 128. The Adam optimizer was used for the FCNN classifier with a learning rate of 0.0005; similarly, Adam was used as the optimizer for the center loss, with a learning rate of 0.1.
For both DINO models and the FCNN classifier, the learning rates were determined empirically. The learning rates were searched within the range of 0.1 to 0.000001 using the grid search method. The number of epochs for training the DINO model was set to 1000, as the accuracy did not improve when training the model for longer. Similarly, the number of epochs for training the FCNN classifier was set to 100. The batch size for both models was set by finding the optimal value between the batch sizes of 8 and 128, with steps of the power of 2.
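The search described above can be sketched as a plain grid search. The exact grid points are not stated in the text: the 1-and-5-per-decade spacing of learning rates below is an assumption, chosen so that the reported value of 0.0005 lies on the grid, and searching both hyperparameters jointly is likewise an assumption. `evaluate` is a hypothetical callback that trains a model and returns its validation accuracy.

```python
from itertools import product


def lr_grid(high_exp=-1, low_exp=-6):
    """Learning rates from 1e-1 down to 1e-6 (assumed 1-and-5 spacing)."""
    grid = []
    for exp in range(high_exp, low_exp - 1, -1):
        grid.append(float(f"1e{exp}"))
        if exp > low_exp:
            grid.append(float(f"5e{exp - 1}"))
    return grid


LEARNING_RATES = lr_grid()
BATCH_SIZES = [2 ** k for k in range(3, 8)]   # 8, 16, 32, 64, 128


def grid_search(evaluate, learning_rates=LEARNING_RATES, batch_sizes=BATCH_SIZES):
    """Return the (learning_rate, batch_size) pair with the highest score."""
    return max(product(learning_rates, batch_sizes),
               key=lambda config: evaluate(*config))
```

A call such as `grid_search(train_and_validate)` would then return the best configuration, where `train_and_validate(lr, batch_size)` is supplied by the user.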

Evaluation Protocol
For evaluation of our experimental results, we use rank-1 accuracy, where we look at the percentage of predictions where the top prediction is the correct one, i.e., matches the ground-truth value. Additionally, the identical-view cases are excluded for comparability with other state-of-the-art methods.
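The metric can be sketched as follows. The first function computes rank-1 accuracy from per-class scores; the second averages a table of accuracies over query/gallery view pairs while skipping identical views. How per-view-pair accuracies are obtained from the FCNN classifier is not detailed in the text, so the table layout used here is an assumption, and both function names are hypothetical.

```python
def rank1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label.

    scores: list of per-class score lists, one per query sample.
    labels: list of ground-truth class indices.
    """
    hits = 0
    for sample_scores, label in zip(scores, labels):
        predicted = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        hits += int(predicted == label)
    return hits / len(labels)


def mean_excluding_identical_views(accuracy_by_views):
    """Average accuracy over (query_view, gallery_view) pairs with different views.

    accuracy_by_views: dict mapping (query_view, gallery_view) to an accuracy.
    """
    values = [acc for (query_view, gallery_view), acc in accuracy_by_views.items()
              if query_view != gallery_view]
    return sum(values) / len(values)
```

Excluding identical-view pairs removes the easiest cases from the average, which is why reported cross-view accuracies are lower than a plain average over all pairs.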

Results
In this section, the results of conducted experiments are presented. It is worth noting that, except SelfGait [36], which uses self-supervised learning, every other method compared uses a supervised learning approach. Furthermore, the state-of-the-art methods mentioned in this section use silhouettes as input data, as well as features extracted directly from frames of a subject walking, while the method proposed by Liao et al. [27] uses GEIs, the same as our method.

CASIA-B
For the ST setting, the results are presented in Table 1. Compared with the other state-of-the-art methods, our method achieves the highest accuracy in the NM and BG modalities. However, its accuracy in the CL modality is the lowest among the compared methods. In the MT setting (Table 2), the overall accuracy of our method in the NM modality again outperforms the rest of the methods, while the accuracy in the BG modality is below that of the other methods. Further, the CL modality showed significantly lower results.
Finally, in the LT setting (Table 3), our method again achieved the best accuracy in the NM modality, while the accuracy in the BG modality is comparable with, although lower than, that of the other methods. The CL modality in this setting showed poor accuracy.
Overall, our approach performs best on the NM modality, regardless of the CASIA-B dataset setting. The BG modality performs best in the ST setting, and in the other settings, it is comparable with other methods. The CL modality showed the lowest accuracy in all the settings. The reason for that could be that our model focused its attention primarily on the NM modality, which has the most training data and is easiest to discriminate, without any other covariate condition. The BG modality considers the subject carrying a bag, which alters the subject's appearance slightly; thus, the results for the BG modality are overall comparable with those of other state-of-the-art methods. The CL modality considers the subject wearing a coat, which alters the subject's appearance significantly; as a result, it is the hardest modality available in the dataset, on which our method achieved low accuracy. As such, our proposed method may not be the best choice for the CL modality in practical applications, compared with other methods. Further research into boosting the proposed method's accuracy in this modality will be performed. Considering the presented results, our approach showed the ability to perform well across different modalities, excluding the CL modality. Furthermore, our method discriminates well across the different angles at which subjects are recorded. The best accuracy is obtained for the angles that are closer to 0° and 180°, while the lowest accuracy is obtained around the 90° angle. The models with patch size 16 and patch size 8 performed similarly in the NM modality, without significant differences in accuracy, across all dataset settings. Significant differences in accuracy arise in the BG and CL modalities, where the model with patch size 8 showed a significant improvement in accuracy compared with the model with patch size 16.
This effect could be due to the ability of the model with patch size 8 to focus its attention to smaller parts of the image, hence, building a model that is more robust to the effect of covariate factors such as a bag or a coat.

OU-MVLP
In Table 4, the results for the OU-MVLP dataset are presented. The results show that our approach achieved results comparable with those of the other state-of-the-art methods. Our method performs well across all angles, specifically at the 210° and 225° angles, while the lowest accuracy is at an angle of 180°. The SelfGait method [36] also uses the self-supervised learning approach but with a specialized backbone network that enhances the spatio-temporal ability of the model, and it achieves the state-of-the-art result on this dataset. In contrast, our approach uses a standard unmodified ViT network, with a simple FCNN as a classifier, and achieves comparable accuracy. As the OU-MVLP dataset contains many images, the DINO model was able to learn discriminative features and achieve results comparable with the state of the art. Compared with SelfGait, the advantage of our approach is that it uses a simple general ViT architecture, as opposed to the gait-specific network used in SelfGait. In addition, our method does not explicitly infer temporal features from the data, unlike SelfGait, which uses MTB to learn temporal features from silhouettes, thus making our method more straightforward in terms of learning since only appearance features are learned. The model with patch size 16 performed slightly better on this dataset compared with the model with patch size 8. As there are no covariate conditions such as a bag or coat in this dataset, the model with patch size 8 does not bring the performance improvement observed on the CASIA-B dataset.

Self-Attention Visualization
In order to assess the features learned by the DINO model, we visualize the different attention heads in the last multihead self-attention block. A random image from each of the datasets is chosen, for which the attention is displayed. The model used was the ViT small model, which has n = 6 heads per self-attention block.
In Figures 4 and 5, the random images from the CASIA-B and OU-MVLP datasets are shown, respectively. As depicted in Figures 4a and 5a, each head learns different features from the data, as its attention is focused on different parts of the image. Some attention heads are focused on the subject's head, while others are focused on the legs or on the left or right part of the subject in the image. Figures 4b and 5b show the average of the attention maps across all the heads. This observation is consistent with the ones from the original DINO manuscript, where it is noted that the DINO method successfully segments objects of interest inside the image. In GEI images, the most important area of the image is the outline of the subject, which our proposed approach successfully detects and uses for the classification of subjects, producing good results, as shown in Section 5.

Ablation Experiments
In this section, the effectiveness of the vision transformer backbone network and the proposed classifier is studied.
To evaluate the effectiveness of the vision transformer network, we trained the DINO model with ResNet-50 as a backbone network for comparison. ResNet-50 is chosen because it has a similar number of parameters to the small ViT network, with 23 million and 21 million parameters, respectively. Both models were trained on the CASIA-B dataset's LT setting for 1000 epochs, with a patch size of 16 for the ViT model.
Hyperparameters of the small ViT model were determined as described in Section 4.4, while for the ResNet-50 model the same methodology was used, resulting in a learning rate of 0.005. In both models, the full ImageNet pretrained DINO model checkpoint was used for fine-tuning. For evaluation, the FCNN network proposed in Section 3.4 was used. In Table 5, the comparison of accuracy of ResNet-50 and the small ViT model is shown. The small ViT model significantly outperforms the ResNet-50 model in accuracy across all modalities, which supports the effectiveness of the ViT model for the problem of gait recognition. In Table 6, the comparison of different classifiers is shown. To study the effectiveness of the proposed FCNN classifier, we evaluated the trained ViT feature extractor model using the standard weighted nearest neighbors classifier (k-NN) as in [46]. The feature extractor model used was the small ViT model with a patch size of 16. The FCNN classifier is the same as proposed in Section 3.4. The evaluation is performed on the CASIA-B dataset using the LT setting. The results show that the proposed FCNN classifier significantly outperformed the k-NN classifier in all modalities and angles, especially in the BG modality.
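For reference, a weighted k-NN baseline of the kind used in the classifier comparison can be sketched as follows. This is a plain-Python illustration, not the evaluation code used in the experiments: it assumes cosine similarity between L2-normalized features and votes weighted by exp(similarity / temperature), and the values of `k` and `temperature` are illustrative assumptions.

```python
import math
from collections import defaultdict


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left as is)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def weighted_knn_predict(query, gallery, labels, k=20, temperature=0.07):
    """Predict a label by similarity-weighted voting among the k nearest gallery features."""
    q = l2_normalize(query)
    scored = []
    for feature, label in zip(gallery, labels):
        g = l2_normalize(feature)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, label))
    # Keep the k gallery samples with the highest cosine similarity.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        votes[label] += math.exp(similarity / temperature)
    return max(votes, key=votes.get)
```

Because of the exponential weighting, a single very close neighbor can outweigh several distant neighbors of another class, which distinguishes this classifier from plain majority-vote k-NN.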

Conclusions
In this manuscript, we propose a novel approach that uses self-supervised learning for application in the gait recognition task. Using the DINO self-supervised method, the useful gait features are learned using training samples without any annotations. The obtained model is used as a feature extractor for gallery and query images. The simple FCNN classifier is trained using the features extracted from gallery images, and query images are evaluated using the trained model. Experiments conducted on two widely used gait recognition datasets, CASIA-B and OU-MVLP, showed that our proposed approach achieved good results, outperforming the supervised approaches in some cases. Moreover, the self-supervised feature extractor focused its attention on the outlines of the individuals in the GEI images, deeming the outline as the most meaningful information in the image. Taking into account covariate factors, such as different camera viewpoints and different carrying modalities, our method also produced good results comparable with those of other state-of-the-art methods, considering both supervised and self-supervised approaches. We also note that our approach is one of the first that employs ViTs in the domain of gait recognition. In future work, we will investigate the effect of training the feature extractor on specific parts of an image such as the legs, torso, or head on recognition accuracy. Furthermore, additional work will be conducted to further reduce the gap between poorer BG and CL modality results compared with those of NM modality in CASIA-B dataset. Newly proposed variants of vision transformers will also be tested in conjunction with DINO to further boost the recognition accuracy.
Author Contributions: D.P. and K.L. conceptualized the paper, proposed the methodology, and coordinated the paper preparation; D.P. and D.S. curated the data, performed the formal analysis and visualizations, participated in the investigation, and performed the experimental validation; D.P. wrote the original draft; D.S. and K.L. reviewed and edited the draft; K.L. was responsible for the project administration, funding, and supervision. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: