Two-Path Spatial-Temporal Feature Fusion and View Embedding for Gait Recognition

: Gait recognition is a distinctive biometric technique that can identify pedestrians by their walking patterns from considerable distances. A critical challenge in gait recognition lies in effectively acquiring discriminative spatial-temporal representations from silhouettes that exhibit invariance to disturbances. In this paper, we present a novel gait recognition network by aggregating features in the spatial-temporal and view domains, which consists of two-path spatial-temporal feature fusion module and view embedding module. Speciﬁcally, two-path spatial-temporal feature fusion module ﬁrstly utilizes multi-scale feature extraction (MSFE) to enrich the input features with multiple convolution kernels of various sizes. Then, frame-level spatial feature extraction (FLSFE) and multi-scale temporal feature extraction (MSTFE) are parallelly constructed to capture spatial and temporal gait features of different granularities and these features are fused together to obtain muti-scale spatial-temporal features. FLSFE is designed to extract both global and local gait features by employing a specially designed residual operation. Simultaneously, MSTFE is applied to adaptively interact multi-scale temporal features and produce suitable motion representations in temporal domain. Taking into account the view information, we introduce a view embedding module to reduce the impact of differing viewpoints. Through the extensive experimentation over CASIA-B and OU-MVLP datasets, the proposed method has achieved superior performance to the other state-of-the-art gait recognition approaches.


Introduction
Gait recognition refers to judging individuals by their distinct walking patterns, making it one of the most auspicious biometric technologies for identity recognition.In contrast to other biometric identification methods such as fingerprints, DNA, facial recognition, vein recognition, and so on, gait recognition works by using regular or low-resolution cameras at long ranges and does not require explicit cooperation from the subjects of interest.Therefore, gait recognition has demonstrated huge development potential in the fields of crime prevention, video surveillance, and social security.Invariance to disturbances in gait recognition signifies the capability of achieving high recognition accuracy without being affected by various external factors, such as bag-carrying, cloth-change, cross-view, and speed changes [1][2][3].However, the performance of gait recognition is susceptible to the forementioned external factors in real-word scenarios, which brings substantial challenges to achieve invariance to disturbances.Therefore, a multitude of state-of-art works have focused on achieving invariance to disturbances in gait recognition to ensure the reliability of gait recognition systems.
Benefiting from recent developments in deep learning algorithms, numerous powerful gait recognition algorithms have been developed [4][5][6][7][8][9][10][11], and have been proven effective under challenging conditions.Liao et al. [9] proposed long short-term memory (LSTM) [12] for the extraction of spatial and temporal features based on human skeleton points.PoseGait [10] transformed 2D poses into 3D poses to enhance recognition accuracy by extracting additional gait features from 3D poses.Shiraga et al. [11] utilized the gait energy image (GEI) to train their networks and learn covariate features for gait recognition.GaitSet [4] used 2D convolutional neural networks (CNNs) at the frame level to extract global features and treated gait silhouettes as a set to capture temporal features.GaitPart [7] extracted local gait feature representations by dividing the feature maps horizontally and utilized micro-motion features to focus on the short-term temporal expressions.Huang et al. [8] introduced a 3D local CNN to extract sequence information from specific human body parts.
To the best of our knowledge, the human body possess evidently various visual appearances and movement patterns during walking.Spatial-temporal representations refer to the spatial-temporal features derived from the modeling of silhouettes, which can effectively capture the visual appearances and movement patterns of the pedestrian.Spatial feature representations indicate the visual appearances of the silhouettes, while temporal feature representations reflect the movement patterns of the silhouettes.Spatial feature extraction and temporal modeling can extract rich and discriminative spatial-temporal feature representations, distinctly capturing an individual's walking process.Since there are significant differences among various persons in terms of visual appearance and movement patterns during walking, these spatial-temporal representations can play a pivotal role in enhancing the effectiveness of gait recognition.The combing of spatial and temporal representations can not only describe the visual appearances, but also capture motion patterns of the pedestrians, and utilizing either spatial representations or temporal representations alone would lead to the poor accuracy of gait recognition.Therefore, by jointly investigating motion learning and spatial mining simultaneously, the accuracy of gait recognition can be substantially improved.Despite considerable efforts in gait recognition, the aforementioned methods still encounter the following challenges: (1) there is a need for muti-scale feature extraction to capture more robust spatial-temporal representations, thereby enhancing the accuracy of gait recognition especially in case of appearance camouflage; (2) few methods have taken view angle into consideration explicitly and the detection or estimation of viewpoint has been somewhat overlooked, which can exert an essential impact to improve the recognition ability of existing approaches.
With such considerations, we propose an advanced gait recognition network, namely two-path spatial-temporal feature fusion module and view embedding module.The two-path spatial-temporal feature fusion module consists of multi-scale feature extraction (MSFE), frame-level spatial feature extraction (FLSFE) and multi-scale temporal feature extraction (MSTFE).Firstly, MSFE is deployed to facilitate the effective extraction of shallow features, which can extend the receptive field and enable the extraction of multiple internal features within different regions.Subsequently, we introduce a two-path parallel structure containing FLSFE and MSTFE, which aims to effectively extract the multi-scale spatialtemporal information across various granularities.In FLSFE, we develop an innovative residual convolutional block (R-conv) to capture both global and local gait features by the special design of residual operation.Meanwhile, in MSTFE, we design independent branches of temporal feature extraction with varying scales and integrate the temporal features in an attention-based way.For reasonable refinement of extracted features, a view embedding module is constructed to reduce the negative impact of viewpoint differences, which uses view prediction learning to calculate the best view and embeds the view information into the multi-scale spatial-temporal characteristics to obtain the final features.Extensive experiments conducted on the two public datasets CASIA-B and OU-MVLP have demonstrated that our method has outperformed other state-of-the-art methods, showing its superior performance in gait recognition.

Related Work
Gait recognition: Current deep-learning-based gait recognition methods can be broadly classified into two categories: model-based [13][14][15][16] and appearance-based [4][5][6][7][8][17][18][19][20][21][22][23][24][25][26][27][28][29][30].Modelbased methods leverage the relationships between bone joints and pose information to create models of walking patterns and human body structures [13], such as OpenPose [14], HR-Net [15], and DensePose [16].These methods exhibit stronger robustness to clothing variation and carrying articles.However, model-based methods depend on accurate joint detection and pose estimators, which can significantly increase the computational complexity and may lead to inferior performance in certain scenarios.Conversely, appearance-based methods use gait silhouettes (the binary images shown in Figure 1) as the model's input and capture spatial and temporal characteristics from silhouettes by CNNs [17].Gait silhouettes can describe the body state in a single frame at a low computational cost and more detailed information in each silhouette image can be preserved directly from the original silhouette sequences.The silhouettes are the basis of appearance-based methods, and the quality of silhouettes will directly affect the performance of the gait recognition system.Moreover, spatial feature representations can be obtained from the silhouette of an individual frame, which can represent appearance characteristics, while temporal feature representations can be captured from consecutive silhouettes, in which the relationship between adjacent frames can reflect the temporal characteristics and motion patterns.Therefore, some appearance-based methods have overcome the challenges of pose estimation and achieved competitive performance [4][5][6][7][8]21,22,29].Particularly, the first opensource gait recognition framework, named OpenGait (https://github.com/ShiqiYu/OpenGait,accessed on 29 January 2023) [18], encompassed a series of state-of-the-art appearance-based methods for gait recognition.In this paper, our approach is categorized as appearance-based and we count on binary silhouettes as our input data without the need for pose estimation or joint detection.By focusing on silhouettes, we aim to reduce the influence of variations in subjects' appearance, thereby enhancing the accuracy and robustness of our gait recognition approach.

Related Work
Gait recognition: Current deep-learning-based gait recognition methods can be broadly classified into two categories: model-based [13][14][15][16] and appearance-based [4][5][6][7][8][17][18][19][20][21][22][23][24][25][26][27][28][29][30].Model-based methods leverage the relationships between bone joints and pose information to create models of walking patterns and human body structures [13], such as OpenPose [14], HRNet [15], and DensePose [16].These methods exhibit stronger robustness to clothing variation and carrying articles.However, model-based methods depend on accurate joint detection and pose estimators, which can significantly increase the computational complexity and may lead to inferior performance in certain scenarios.Conversely, appearance-based methods use gait silhouettes (the binary images shown in Figure 1) as the model's input and capture spatial and temporal characteristics from silhouettes by CNNs [17].Gait silhouettes can describe the body state in a single frame at a low computational cost and more detailed information in each silhouette image can be preserved directly from the original silhouette sequences.The silhouettes are the basis of appearance-based methods, and the quality of silhouettes will directly affect the performance of the gait recognition system.Moreover, spatial feature representations can be obtained from the silhouette of an individual frame, which can represent appearance characteristics, while temporal feature representations can be captured from consecutive silhouettes, in which the relationship between adjacent frames can reflect the temporal characteristics and motion patterns.Therefore, some appearance-based methods have overcome the challenges of pose estimation and achieved competitive performance [4][5][6][7][8]21,22,29].Particularly, the first opensource gait recognition framework, named OpenGait (https://github.com/ShiqiYu/OpenGait,accessed on 29 January 2023.)[18], encompassed a series of state-of-the-art appearance-based methods for gait recognition.In this paper, our approach is categorized as appearance-based and we count on binary silhouettes as our input data without the need for pose estimation or joint detection.By focusing on silhouettes, we aim to reduce the influence of variations in subjects' appearance, thereby enhancing the accuracy and robustness of our gait recognition approach.Spatial feature extraction modeling: Regarding the range of feature representations in spatial feature extraction modeling, two main approaches are commonly used: globalbased and local-based methods.Specifically, global-based methods focus on exploring gait silhouettes as a whole to generate global feature representations [4,5,12,17,19,20].For instance, Shiraga et al. [12] proposed the GEI and utilized 2D CNNs to obtain global gait feature representations from the GEIs.Similarly, GaitSet [4] and GLN [5] extracted global gait features at the frame-level by 2D CNNs.Conversely, local-based methods usually segment the silhouettes into multiple parts to establish the local feature representations [7,21,22], and focus more on learning the local information of different body parts.For example, Zhang et al. [21] split human gait into four distinct segments and adopted 2D CNNs to capture more detailed information from each of these segments.Fan et al. [7] introduced the focal convolution layer (FConv), a novel convolution layer that divided the feature map into several parts to obtain part-based gait features.Qin et al. [22] developed RPNet to discover the intricate interconnections between each part of gait silhouettes and then integrated them by a straightforward splicing process.GaitGL [23] and GaitStrip [24] both used 3D CNNs to construct multi-level frameworks to extract more discriminative and richer spatial features.In this paper, we design an innovative residual convolutional block (R-conv) in spatial feature extraction model by 2D CNNs, which combines regular convolution with FConv to extract both global and local gait features and enhance the discriminative capacity of the feature representations.
Temporal feature extraction modeling: As a crucial cue for gait tasks, the current temporal feature extraction model generally employs various approaches such as 1D convolutions, LSTMs and 3D convolutions [25].For example, Fan et al. [7] and Wu et al. [26] utilized 1D convolutions to model the temporal dependencies and aggregated temporal information by concatenation or in a summation.Additionally, LSTM networks were built in [21,27] to preserve the temporal variation of gait sequences and fuse temporal information by accumulation.Moreover, some studies have proposed 3D convolutions [28][29][30][31] to simultaneously extract spatial and temporal information from gait silhouette sequences.Lin et al. [32] introduced MT3D network, which used 3D CNNs to extract spatial-temporal features with multiple temporal scales.However, 3D CNNs often bring complex calculations and encounter challenges during training.In this paper, we present a novel approach for temporal feature extraction modeling by 2D CNNs, which aggregates temporal information with different scales.By incorporating muti-scale temporal branches, our approach can capture rich temporal clues and empower the network to learn more discriminative motion representations adaptively.

View-invariant Modeling:
To the best of our knowledge, viewpoint change poses a formidable challenge in biometrics, particularly in face recognition and gait recognition.In contrast to face recognition, fewer methods in gait recognition have incorporated the aspect of viewpoint into their considerations.He et al. [33] introduced a multitask generative adversarial network (GAN) and trained the GAN by using viewpoint labels as the supervision.Chai et al. [34] adopted a different projection matrix as a perspective embedding method and achieved high growth on multiple backbones.However, these methods often involve a large number of parameters, making them extremely complex for effective cross-view gait recognition.Therefore, we propose a concise view model that applies view prediction learning to calculate the best view and embeds view information into our two-path spatial-temporal feature fusion module, which can significantly enhance the robustness of our network to view changes and improve gait recognition performance across varying viewpoints.

System Overview
The overall framework is depicted in Figure 1.Firstly, MSFE is employed to extract shallow features from the original input silhouettes, which can extend the receptive field and capture the gait spatial and temporal features within different regions.Next, a two-path parallel structure including FLSFE and MSTFE is designed to obtain multi-scale spatialtemporal features.FLSFE is devised to extract combined features by encompassing both global and local gait information in the spatial domain and use a novel convolutional block (R-conv) by the special design of residual operation.Simultaneously, MSTFE is implemented to learn more discriminative motion representation between long-term and short-term features in the temporal domain and integrate the muti-scale temporal features in an attention-based way.Subsequently, both the spatial and temporal features are integrated to generate muti-scale spatial-temporal representations.HPP [4] is applied to complete feature mapping process, ensuring comprehensive and discriminative features.Furthermore, a view embedding module is introduced to use view prediction learning to calculate the best view and incorporate view information explicitly into the spatial-temporal fused feature representations to gain the final features, which can effectively alleviate the effect of viewpoint variations.Finally, joint losses of cross entropy loss and the triplet loss [4,7] are selected to train the proposed network.

Two-Path Spatial-Temporal Feature Fusion Module
The two-path spatial-temporal feature fusion module is composed of three different components, namely multi-scale feature extraction (MSFE), frame-level spatial feature extraction (FLSFE) and multi-scale temporal feature extraction (MSTFE).In the following sections, we will furnish a comprehensive description of the structure of each component individually.

Multi-Scale Feature Extraction (MSFE)
Generally, a convolution layer utilizes multiple convolution kernels of the same size to carry out convolution operation, which leads to the limitation of feature extraction to a certain extent.In the two-path spatial-temporal feature fusion module, we propose multi-scale feature extraction (MSFE) to process the input silhouettes, which can extract more comprehensive and discriminative feature representations with different granularities.MSFE utilizes the multi-scale convolution (MSC) structure with multiple kernels of various sizes to extract gait features during convolution operation and merges the multi-scale features by different single-path convolution branches.The MSC can expand different perceptual fields and learn more fine-grained features.Moreover, multiple branches can fuse the fine-grained features and obtain more abundant and complete gait characteristics.
The structure of MSFE is presented in Figure 2. MSFE comprises a three-layer structure, where the first layer is a MSC structure and the other two layers are the regular convolutions.The MSC structure comprises three parallel convolution branches, each of which is designed with distinct kernel sizes: (1,1), (3,3), and (5,5), respectively.Each branch processes the input gait sequences and transforms them into multi-scale feature maps.Then, the outputs of the parallel convolution operations at different scales are combined together by element-wise addition to obtain more abundant features.Subsequently, another two convolution layers are employed to optimize the comprehensive characteristics to achieve the appropriate number of channels, which can also balance the capabilities of the network more flexibly.Assuming that the input is X ∈ R C×T×H×W , where C means the number of channels, T represents the length of the gait sequence and (H, W) signify the height and width of each frame.The MSC feature F MSC can be formulated below: where F 1×1 , F 3×3 and F 5×5 denote the 2D convolution operations with the kernel sizes of (1,1), (3,3) and (5,5), respectively.

Frame-Level Spatial Feature Extraction (FLSFE)
To further enhance the granularity of spatial-temporal learning and capture contextual information, frame-level spatial feature extraction (FLSFE) is constructed to extract both global and local spatial features for each frame.The structure of FLSFE is shown in Figure 3.The input of FLSFE is demoted as  ∈  / / , and FLSFE is composed of three consecutive R-convs (detailed structure could be found in Figure 4), each of which is designed to extract both the whole-body and part-informed spatial features.Then, set pooling operation (SP) is applied to each R-conv block to extract set-level features, and after point-by-point summation, column vector spatial features  ∈  are obtained by the horizontal pooling operation (HP), where  means the number of channels, and  represents the number of strips cut in HP.
and  can be expressed as:

Frame-Level Spatial Feature Extraction (FLSFE)
To further enhance the granularity of spatial-temporal learning and capture contextual information, frame-level spatial feature extraction (FLSFE) is constructed to extract both global and local spatial features for each frame.The structure of FLSFE is shown in Figure 3.The input of FLSFE is demoted as F MSFE ∈ R C×T×H/2×w/2 , and FLSFE is composed of three consecutive R-convs (detailed structure could be found in Figure 4), each of which is designed to extract both the whole-body and part-informed spatial features.Then, set pooling operation (SP) is applied to each R-conv block to extract set-level features, and after point-by-point summation, column vector spatial features F space ∈ R C 1 ×K are obtained by the horizontal pooling operation (HP), where C 1 means the number of channels, and K represents the number of strips cut in HP.
and  can be expressed as: where  represents the 2D convolution operation with the kernel size of (3,3), and  denotes the concatenation operation on the horizontal dimension.In FLSFE, except for three consecutive R-convs, SP is used to aggregate high-level feature maps from different gait timepoints that are robust to the appearance and observation perspectives into set-level feature maps.The set-level spatial representation is generated as  ∈  and the process can be formulated as Equation ( 5): Figure 4 illustrates the structure of the R-conv, which is a novel residual convolutional block that combines regular convolution with FConv in parallel to extract both global and local features from gait sequences.The R-conv is designed to leverage the benefits of both global and local features by adding the output of FConv with the result of regular convolution.Assuming that the input is denoted as X global ∈ R c 1 ×t×h×w , where c 1 represents the number of channels, t means the length of feature maps and (h, w) indicate the dimensions of each image frame.Initially, the global feature map is horizontally partitioned into n parts to derive local feature maps, represented as , where n denotes the total number of partitions and X i local ∈ R c 1 ×t× h n ×w represents the i-th local gait segment.Subsequently, 2D CNN is used to extract global and local gait information independently.Finally, the combined features F R−conv , which includes global and local information, are fused through element-wise summation, formulated as: F global and F local can be expressed as: where F 3×3 represents the 2D convolution operation with the kernel size of (3,3), and cat denotes the concatenation operation on the horizontal dimension.
In FLSFE, except for three consecutive R-convs, SP is used to aggregate high-level feature maps from different gait timepoints that are robust to the appearance and observation perspectives into set-level feature maps.The set-level spatial representation is generated as F SP ∈ R C×h×w and the process can be formulated as Equation ( 5): Then, set-level features are summed up point by point and HP is designed to obtain the column vector spatial features F space ∈ R C 1 ×K by dividing the feature map into K parts according to the height.The HP process can be formulated as Equation ( 6): where maxpool represents the horizontal maximum pooling and avgpool represents the horizontal average pooling.

Multi-Scale Temporal Feature Extraction (MSTFE)
MSTFE is devised to integrate long-term and short-term features in the temporal dimension, facilitating effective information exchange between different scales.As depicted in Figure 5, firstly, the HP operation is implemented on F MSFE to obtain fine-grained information and then, on the one hand, the long-term features are extracted to represent the motion characteristics of all frames, which reveals the global motion periodicity of different body parts.On the other hand, two sequential 1D convolutions with a kernel size of 3 are used to extract short-term temporal contextual features that are favorable for modeling micromotion patterns.Finally, an adaptive temporal feature fusion approach is employed to acquire the temporal importance weights for each temporal scale at both long-term and short-term scales to adaptively highlight or suppress features to obtain the most discriminative motion characteristics.The input of MSTFE is denoted as  , and the HP operation is used to partition the human body into  segments, resulting in the temporal feature representation  ∈  ,  denotes the -th sample of each frame.Initially, a multilayer perceptron (MLP) followed by a Sigmoid function is applied to each frame to obtain the longterm temporal features.The MLP could be considered as a temporal attention mechanism, learning to assess the importance of different frames based on their relevance to the overall motion characteristics.Subsequently, the long-term temporal feature  is obtained by performing a weighted summation of all frames.The process can be expressed below: where ⊙ denotes dot product and  describes the global motion cues.
To obtain the short-term temporal features, two sequential 1D convolutions with a kernel size of 3 are applied to the temporal feature representation  obtained from the HP operation.After each 1D convolution, the resulting features are summed to capture the short-term temporal feature  , which is formulated as: To capture the varying motion patterns of different temporal scales and their varying importance in gait recognition, an adaptive temporal feature fusion approach is introduced.This approach aims to determine temporal importance weights that adaptively highlight or suppress features to obtain the most discriminative motion characteristics, which is realized through two fully connected layers followed by a Sigmoid function [35,36].These layers are specifically designed to learn the temporal importance weights for both long-term and short-term temporal scales.The process can be expressed as follows: The fused temporal features can be realized by: where  denotes the importance weight of the -th frame of a sample, and the weights of the long-term and short-term temporal feature are denoted as  , ,  , Based on the adaptive temporal feature fusion, we obtain sequence-level representations in a weighted summation way as follows: where  ∈  , and  means the number of channels.The input of MSTFE is denoted as F MSFE , and the HP operation is used to partition the human body into K segments, resulting in the temporal feature representation Temp denotes the i-th sample of each frame.Initially, a multilayer perceptron (MLP) followed by a Sigmoid function is applied to each frame to obtain the long-term temporal features.The MLP could be considered as a temporal attention mechanism, learning to assess the importance of different frames based on their relevance to the overall motion characteristics.Subsequently, the long-term temporal feature F L is obtained by performing a weighted summation of all frames.The process can be expressed below: where denotes dot product and F L describes the global motion cues.
To obtain the short-term temporal features, two sequential 1D convolutions with a kernel size of 3 are applied to the temporal feature representation F Temp obtained from the HP operation.After each 1D convolution, the resulting features are summed to capture the short-term temporal feature F S , which is formulated as: To capture the varying motion patterns of different temporal scales and their varying importance in gait recognition, an adaptive temporal feature fusion approach is introduced.This approach aims to determine temporal importance weights that adaptively highlight or suppress features to obtain the most discriminative motion characteristics, which is realized through two fully connected layers followed by a Sigmoid function [35,36].These layers are specifically designed to learn the temporal importance weights for both long-term and short-term temporal scales.The process can be expressed as follows: The fused temporal features can be realized by: where W n T denotes the importance weight of the n-th frame of a sample, and the weights of the long-term and short-term temporal feature are denoted as W n T,1 , W n T,2 .Based on the adaptive temporal feature fusion, we obtain sequence-level representations in a weighted summation way as follows: Appl.Sci.2023, 13, 12808 where F SL−term ∈ R C 2 ×K , and C 2 means the number of channels.

View Embedding Module
Since view information is an imperative condition in gait recognition, the view embedding module is introduced to build view estimation and view angle is embedded into the previous two-path spatial-temporal feature fusion module, which can not only greatly minimize the intra-class caused by the view differences, but also improve the recognition ability of gait recognition.

View Prediction
As shown in Figure 6, F space and F SL−term in the two-path parallel structure are concatenated along the channel dimension as the multi-scale spatial-temporal features F ST , which can be formulated as: where Θ denotes a merge operation on the channel dimension.

View Prediction
As shown in Figure 6,  and  in the two-path parallel structure are concatenated along the channel dimension as the multi-scale spatial-temporal features  , which can be formulated as: where  denotes a merge operation on the channel dimension.The view classification feature can be realized by: where  means a fully connect layer and  _ indicates the global average pooling.Subsequently, the view probability of the input silhouette, denoted as  and the maximum probability of the predicted view, represented as  are determined by calculating the following formulas, respectively, as: where  ∈  ,  means the number of discrete views,  are the view weight matrix of  layer,  denotes the bias terms of  layer and  ∈ 0,1,2 ⋯  .Regarding the predicted discrete view  , a group of view projection matrix will be trained as   | 1,2, ⋯ ,  , where  ∈  is the projection matrix, and it will be used in Section 3.3.2.

HPP with View Embedding
In this section, horizontal pyramid pooling (HPP) [4] is utilized on the multi-scale spatial-temporal feature maps and then the weights under the best view are connected with these multi-scale spatial-temporal features by matrix multiplication to acquire the final feature.
The multi-scale spatial-temporal feature maps after HPP can be denoted as  , ,  1,2,3, ⋯ , where  means the number of strips into split in HPP,  , ∈  .Supposing the predicted view of  is θ, the embed procedure can be realized by: The view classification feature can be realized by: where FC means a fully connect layer and P Global_Avg indicates the global average pooling.Subsequently, the view probability of the input silhouette, denoted as P view and the maximum probability of the predicted view, represented as ∼ P view are determined by calculating the following formulas, respectively, as: where P view ∈ R V , V means the number of discrete views, W view are the view weight matrix of FC layer, B view denotes the bias terms of FC layer and Regarding the predicted discrete view ∼ P view , a group of view projection matrix will be trained as is the projection matrix, and it will be used in Section 3.3.2.

HPP with View Embedding
In this section, horizontal pyramid pooling (HPP) [4] is utilized on the multi-scale spatial-temporal feature maps and then the weights under the best view are connected with these multi-scale spatial-temporal features by matrix multiplication to acquire the final feature.
The multi-scale spatial-temporal feature maps after HPP can be denoted as F HPP,i , i = 1, 2, 3, • • • n, where n means the number of strips into split in HPP, F HPP,i ∈ R D .Supposing the predicted view of F ST is θ, the embed procedure can be realized by: where

Joint Losses
We employ two joint loss functions, namely the cross entropy (CE) loss [37] and the triplet loss [38] for training our proposed framework.The CE loss is applied for the view prediction task and is calculated as follows: where N represents the total number of silhouette sequences and y n denotes the discrete ground value of the n-th sequence.
Next, the triplet loss is employed to augment the discriminative capability of the features.For each triplet of gait silhouette sequences (Q, P, N), where Q and P belong to the same subject, and Q and N are from the different subjects, the triplet loss can be calculated as follows: where K represents the number of triplets, Finally, to obtain the overall loss function, we joint the CE loss and the triplet loss as follows: where λ ce is a hyper-parameter.

Datasets
CASIA-B: CASIA-B (http://www.cbsr.ia.ac.cn/, accessed on 28 August 2006) [39] is a well-known gait dataset with a total of 124 subjects.Each subject contains 11 views (from 0 • to 180 • ).Each view includes 10 gait sequences captured under three various walking conditions: normal walking (NM) with 6 sequences, carrying bags (BG) with two sequences and wearing coats or jackets (CL) with two sequences.For our experiments, we follow the protocol from a previous work [20] and the dataset is split into a training set with samples from the initial 74 subjects, and a testing set with samples from the remaining subjects.During test, the first 4 sequences of the NM condition (NM #1-4) are kept as the gallery set, and the remaining 6 sequences (NM#5-6, BG#1-2, CL#1-2) are defined as the probe set, which aims to ensure consistency with the division of CASIA-B datasets in the state-of-the-art methods [4,7,12,21,29].

Implementation Details
The batchsize is configured to (8,8) and (32,8) for CASIA-B dataset and OUMVLP dataset separately.The input silhouettes are resized to the size of 64 × 44 and aligned according to the method in [40].Adam optimizer is applied for training with an initial learning rate of 0.0001 and the momentum 0.9 [41].In the triplet loss L trip , the margin is set to 0.2, and the λ ce in Equation ( 20) is set to 0.05 and 0.3 on CASIA-B dataset and OUMVLP dataset separately.In CASIA-B dataset, three-layer CNNs are used in MSFE and three consecutive R-convs are implemented in FLSFE to extract features, while in OU-MVLP dataset, five-layer CNNs are applied in MSFE and an additional R-conv is stacked in FLSFE, and the value of n in each R-conv is set to 1, 1, 3, 3. Furthermore, the CASIA-B dataset is trained 70 K iteration, while the OUMVLP dataset is trained 250 K iteration.The learning rate is adjusted to 1 × 10 −5 after 160 K iterations to ensure the stable convergence.All the experiments are carried out by using the Pytorch framework [42] on NVIDIA GeForce RTX 3090 GPUs [39]. 1 presents the comprehensive comparison results between the proposed method and other state-of-the-art methods on CASIA-B dataset.From Table 1, it can be clearly seen that our method attains outstanding performance across nearly all viewpoints compared to other methods, such as GaitNet [12], GaitSet [4], GaitPart [7], MT3D [29], and RPNet [21].In particular, the proposed method exhibits significantly higher average accuracies than other gait recognition methods, especially under BG and CL conditions.Our proposed method achieves average accuracies of 97.7%, 93.7%, and 83.8% under these conditions, outperforming GaitPart [7] by +1.5%, +2.2%, and +5.1%, respectively.Furthermore, our proposed method demonstrates superior performance in some specific view angles.For example, under the view angles of 90 • and 180 • , the NM accuracy of our method reaches 95.9% and 94.2%, surpassing GaitPart [7] by +3.4% and +3.8% separately.The experimental results illustrate that our framework exhibits substantial robustness and advantages under unfavorable conditions, which can be attributed to the fact that multi-scale convolution can simultaneously observe more detailed gait features in spatial and temporal domain.Furthermore, the effective combination of multi-scale spatialtemporal features can obtain discriminative features even in the presence of occlusion, thus enhancing the recognition ability of the proposed method.

CASIA-B: Table
OUMVLP: Similar experiments are executed on OU-MLVP dataset to prove the generalization efficacy of our approach.As demonstrated in Table 2, our method consistently attains higher accuracies across most camera views than other methods such as GaitSet [4], GaitPart [7] and RPNet [21], which displays a more outstanding recognition ability in the large-scale dataset.Specifically, the recognition ability of our method is superior to the state-of-the-art RPNet [18] by a margin of +4.4% (85.0% vs. 89.4%).Due to the addition of view embedding, the accuracies of our method under the view angles of 0 • and 180 • have been greatly improved.Meanwhile, it can be also observed that our results may not surpass GaitPart [7] under certain view angles, such as 45 • , 75 • and 90 • , this is mainly due to the fact that our training dataset does not include as many samples as other works contained under these view angle.Although our results may not surpass GaitPart [7] under some specific view angles, our method has still achieved substantial overall improvement.In conclusion, our method can perform better recognition ability and effectively improve the recognition accuracy on the large and worldwide dataset for cross-view gait recognition.

Ablation Study
In this paper, the three components (MSFE, FLSFE and MSTFE) of the two-path spatialtemporal feature fusion module and the view embedding module are included in the proposed framework.Hereby, exhaustive ablation experiments of these components are carried out on CASIA-B dataset to assess the effectiveness of each individual component.The experimental results and analysis are presented below.
Impact of Each component: Table 3 illustrates the effect of each component in our method on CASIA-B dataset.The baseline refers to the remaining structure after removing MSFE, FLSFE, MSTFE and view embedding module from the proposed methods, which consists of three-layer CNNs and aggregates temporal information through the max operation.It can be observed from Table 3 that when MSFE is added to baseline alone, the average accuracy under three walking conditions achieves +1.5% improvement than the baseline, which implies that MSFE can perform well on the baseline by using different kernel sizes to handle the input.In addition, with the utilization of FLSFE and MSTFE, the average accuracy reaches 90.5%, which is +1.7% higher than GaitPart [7] and the accuracies under each condition are improved significantly, especially under BG and CL, proving the superiority of the two-path spatial-temporal feature fusion module under occlusion.Particularly, when FLSFE is added to the baseline, the average accuracy increases substantially by +7.9% compared to Baseline + MSFE, which verifies the fact that FLSFE plays a vitally important role in this two-path parallel structure.Finally, we incorporate view embedding into the baseline to increase the recognition accuracy of our method, resulting in a remarkable 91.7% performance, which is +1.2% higher than the previous version.The comparison results confirm that view prediction offers valuable perspective information for cross-view gait recognition and the view embedding module proves to be instrumental in improving the recognition capability of multi-view gait tasks.

Impact of different kernel sizes in MSFE:
To study the impact of various kernel sizes in MSFE, three various kinds of kernel sizes are designed and the ablation studies are conducted by arranging them.The results of these experiments are presented in Table 4. Evidently, increasing the number of multi-scale convolution kernels can attain higher recognition accuracies.Therefore, we set the kernel sizes of the multi-scale convolution to 1, 3, and 5, as it yields the most favorable recognition performance.Impact of the value of n in R-conv: Following the approach of setting the parameter n in R-conv as described in Section 3.2.2,five controlled experiments are performed and the results are presented in Table 5. Notably, when the n value of all the R-conv is set to 1, the R-conv becomes entirely composed of regular layers.From Table 5. it becomes evident that when the n value of three consecutive R-conv is set to 2, 8, 8, the recognition accuracy achieves the most excellent performance compared to other four experiments.There is another thing worth noting that it is not the case that the larger n is, the higher recognition accuracy will be.For example, when the n value of three consecutive R-conv is set to 4, 8, 8, the accuracy under NM increases slightly, while the accuracies under BG and CL decrease dramatically.Impact of multi-scale temporal features: We conduct an investigation into the effects of the multi-scale temporal features in MSTFE and the results are presented in Table 6 Obviously, both the long-term and short-term temporal features produce positive effects on recognition performance, and joining these two muti-scale temporal features achieves the best performance, since the long-term and short-term features can interact with each other, increasing the diversity of temporal representations and further improving the overall recognition accuracy.

Discussion
Based on extensive comparative experimental results, our proposed model exhibits significant improvements in two key aspects: (1) Enhanced accuracy under BG and CL conditions on CASIA-B dataset.This improvement is attributed to the utilization of MSFE, FLSFE and MSTFE.MSFE can expand different perceptual fields and observe more detailed gait features in spatial and temporal domain.Besides, both FLSFE and MSTFE in a two-path parallel structure can extract muti-scale spatial and temporal discriminative features, which can enhance the robustness of the proposed method, particularly under unfavorable conditions.(2) Enhanced accuracy on the OUMVLP dataset.The proposed method demonstrates outstanding performance in the large-scale dataset, this is due to the fact that the view embedding module can predict the best view and embed view angle into the multi-scale spatial-temporal features, which can help mitigate the intra-class variations resulting from view differences and enhance the recognition ability of gait recognition.
In the future, we will further improve the performance of our proposed method in more complex test scenarios and gait in-the-wild datasets, such as the GREW [43] dataset and the Gait3D [44] dataset.Simultaneously, whether our method exists overfitting over CASIA-B and OU-MVLP datasets also needs to be verified in the wild datasets and realword scenarios.Additionally, as silhouettes are easily disturbed by pedestrians' clothes and objects, it is essential to explore multi-modal gait recognition approaches that combines silhouettes, skeletons [45], and pose heatmaps [46].

Conclusions
In this paper, we propose a novel gait recognition framework that combines two-path spatial-temporal feature fusion with view embedding.In the two-path spatial-temporal feature fusion module, MSFE is firstly utilized to extract feature representations with different granularities from the input frames.Then, the two-path parallel structure is designed to obtain the muti-scale spatial-temporal features, where FLSFE extracts both global and local features by R-convs and MSTFE interacts temporal features with multiple scales for achieving the strong ability of muti-scale spatial-temporal modeling.Additionally, a view embedding module is put forward to make the muti-scale spatial-temporal features

Figure 1 .
Figure 1.Overview of the whole framework, which mainly consists of two modules as: two-path spatial-temporal feature fusion module and view embedding module.The two-path spatial-temporal feature fusion module consists of three component including multi-scale feature extraction (MSFE), frame-level spatial feature extraction (FLSFE) and multi-scale temporal feature extraction (MSTFE).View embedding module consists of two components: view prediction and HPP with view embedding.

Figure 1 .
Figure 1.Overview of the whole framework, which mainly consists of two modules as: two-path spatial-temporal feature fusion module and view embedding module.The two-path spatialtemporal feature fusion module consists of three component including multi-scale feature extraction (MSFE), frame-level spatial feature extraction (FLSFE) and multi-scale temporal feature extraction (MSTFE).View embedding module consists of two components: view prediction and HPP with view embedding.

Figure 4
Figure 4 illustrates the structure of the R-conv, which is a novel residual convolutional block that combines regular convolution with FConv in parallel to extract both global and local features from gait sequences.The R-conv is designed to leverage the benefits of both global and local features by adding the output of FConv with the result of regular convolution.Assuming that the input is denoted as  ∈  , where  represents the number of channels,  means the length of feature maps and ℎ,  indicate the dimensions of each image frame.Initially, the global feature map is horizontally partitioned into n parts to derive local feature maps, represented as   | 1, ⋯ ,  , where  denotes the total number of partitions and  ∈  represents the -th local gait segment.Subsequently, 2D CNN is used to extract global and local gait information independently.Finally, the combined features  , which includes global and local information, are fused through element-wise summation, formulated as:

3. 2 . 2 .
Frame-Level Spatial Feature Extraction (FLSFE)To further enhance the granularity of spatial-temporal learning and capture contextual information, frame-level spatial feature extraction (FLSFE) is constructed to extract both global and local spatial features for each frame.The structure of FLSFE is shown in Figure3.The input of FLSFE is demoted as  ∈  / / , and FLSFE is composed of three consecutive R-convs (detailed structure could be found in Figure4), each of which is designed to extract both the whole-body and part-informed spatial features.Then, set pooling operation (SP) is applied to each R-conv block to extract set-level features, and after point-by-point summation, column vector spatial features  ∈  are obtained by the horizontal pooling operation (HP), where  means the number of channels, and  represents the number of strips cut in HP.

Figure 4
Figure 4 illustrates the structure of the R-conv, which is a novel residual convolutional block that combines regular convolution with FConv in parallel to extract both global and local features from gait sequences.The R-conv is designed to leverage the benefits of both global and local features by adding the output of FConv with the result of regular convolution.Assuming that the input is denoted as  ∈  , where  represents the number of channels,  means the length of feature maps and ℎ,  indicate the dimensions of each image frame.Initially, the global feature map is horizontally partitioned into n parts to derive local feature maps, represented as   | 1, ⋯ ,  , where  denotes the total number of partitions and  ∈  represents the -th local gait segment.Subsequently, 2D CNN is used to extract global and local gait information independently.Finally, the combined features  , which includes global and local information, are fused through element-wise summation, formulated as:

Figure 4 .
Figure 4.The detailed structure of R-conv.

Figure 4 .
Figure 4.The detailed structure of R-conv.

Figure 6 .
Figure 6.The detailed structure of view prediction.

Figure 6 .
Figure 6.The detailed structure of view prediction.

Table 3 .
Study of the effectiveness of each component on CASIA-B dataset.

Table 4 .
Study of the effectiveness of different kernel sizes in MSFE on CASIA-B dataset.

Table 5 .
Study of the effectiveness of the value of n in each R-conv on CASIA-B dataset.

Table 6 .
Study of the effectiveness of multi-scale temporal features on CASIA-B dataset.