Gait Recognition Based on Gait Optical Flow Network with Inherent Feature Pyramid

Abstract: Gait is a biological behavioral characteristic that can be recognized from a distance and has gained increased interest in recent years. Many existing silhouette-based methods ignore the instantaneous motion of gait, which is an important factor in distinguishing people with similar shapes. To further emphasize the instantaneous motion factor in human gait, the Gait Optical Flow Image (GOFI) is proposed to add the instantaneous motion direction and intensity to the original gait silhouettes. The GOFI also helps to mitigate both temporal and spatial condition noises. Then, the gait features are extracted by the Gait Optical Flow Network (GOFN), which contains a Set Transition (ST) architecture to aggregate the image-level features into set-level features and an Inherent Feature Pyramid (IFP) to exploit multi-scaled partial features. A combined loss function is used to evaluate the similarity between different gaits. Experiments are conducted on two widely used gait datasets, CASIA-B and CASIA-C, and show that the GOFN outperforms previous methods on both, demonstrating its effectiveness.


Introduction
Gait recognition has attracted much attention because it can identify an individual from a distance without subject cooperation. Compared with other biometrics such as face, fingerprint, and iris recognition, gait recognition identifies a person from their walking style, which is hard to disguise. However, gait recognition suffers from many variations, such as clothing, carrying conditions, and camera viewpoint [1]. A key challenge for gait recognition is effectively mitigating the condition noises and extracting the invariant gait features.
Many approaches have been proposed to extract gait features. A gait energy image (GEI) [2] aggregates the silhouette sequence into one image to represent gait. However, it is sensitive to some variations, such as camera viewpoint and carrying conditions. Further representation methods have been proposed based on GEIs to improve accuracy, such as the gait history image (GHI) [3], the frame difference energy image (FDEI) [4], and the active energy image (AEI) [5]. They can represent not only the moving part but also the non-moving part of the subject, but they all compress the silhouette sequence into one image or gait template and cannot effectively represent the temporal feature. Differently, some methods treat the gait sequence as a video. For example, 3D-CNNs are used to obtain the spatial and temporal features in [6,7], but they require more effort to train and work inefficiently. Inspired by [8], some approaches based on unordered sets have been recently proposed. For example, ref. [9] regards the gait sequence as an unordered set and achieves notable progress, although there is still room for improvement.
However, most of the above works focus on the silhouettes and rely on a data-driven network to learn long-term motion features from the silhouette sequences. These methods ignore the short-term motion features, i.e., the instantaneous motion features. To overcome this issue, the Gait Optical Flow Image (GOFI) is presented in this paper to add the instantaneous motion direction and intensity to the original gait silhouettes. The GOFI also helps to mitigate both temporal and spatial condition noises. Spatial condition noises, such as clothes, change the shape of the body, but the instantaneous motion of the clothes is the same as that of the body. Temporal condition noises, such as the walking speed, change the cycle and phase of the gait but maintain the same instantaneous motion direction. The GOFIs are obtained by cropping the original silhouette sequence and combining the optical flow images with the outline of the corresponding silhouettes in a particular proportion. The GOFIs can avoid the influence of silhouette distortion, which happens frequently in gait sequences. Then, a Gait Optical Flow Network (GOFN) is proposed, which contains the Set Transition (ST) architecture to fuse the image-level and the set-level features and the Inherent Feature Pyramid (IFP) to exploit the multi-scaled partial features from the GOFIs.
In summary, the main contributions of this work include the following: (1) The GOFI representation is proposed to describe the instantaneous motion direction and intensity, and it is robust to temporal and spatial condition noises. (2) A GOFN is proposed, which contains the Set Transition (ST) architecture to aggregate the image-level features into set-level features and the Inherent Feature Pyramid (IFP) to exploit the multi-scaled partial features from the GOFIs. (3) The experiments and comparisons on the CASIA-B and CASIA-C gait datasets show that the GOFN achieves better performance than previous methods under both the cross-view condition and the identical-view condition, which proves the effectiveness of the GOFN.
The rest of the paper is organized as follows: Section 2 elaborates on the related works and the ideas that inspire our work. Section 3 introduces the GOFI and the ST and IFP components of the GOFN. Section 4 presents the datasets, the settings, and the experimental analyses. Section 5 draws the conclusions.

Gait Representation
Most prior works can be divided into appearance-based methods and model-based methods. Appearance-based approaches include GEI [2], GHI [3], the gait entropy image (GEnI) [10][11][12][13], and so on. These approaches directly extract features from the silhouette sequences without modeling the skeleton structure. For example, GEI represents the whole gait sequence with the averaged silhouette image. This type of method can be severely influenced by variations such as clothing, bagging, and camera view. Some appearance-based approaches also take the image sequence as input, which captures more temporal information, such as GaitSet [9], GaitPart [14], GaitGL [15], GaitNet [16], and GaitBase [17]. On the other hand, model-based approaches [18][19][20] extract the skeleton or body structure and then obtain features from a graph model, such as GaitGraph [21]. However, these approaches are usually inefficient and only work when the resolution of the gait videos is high. By comparing neighboring joints, the short-term motion of each joint can be described. Inspired by joint motion, the short-term motion of the silhouette is considered in this paper to describe more temporal features.

Unordered Set
The concept of the unordered set was proposed in [8]. The unordered set is used in point cloud research, and it was introduced into gait recognition in [9,22,23]. The experiments in these methods show that gait sequences can be regarded as unordered sets, which yields higher gait recognition accuracy, especially in cross-view cases. To avoid the influence of multiple views and obtain set-level features, the unordered set of GOFIs is processed by a permutation invariant function. The experiments prove that this method is effective and robust, and the designed permutation invariant function outperforms traditional functions.
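As a minimal illustration of the property these methods rely on, the sketch below (plain NumPy, not the paper's model) shows that a set-pooling function such as element-wise max produces the same set-level feature regardless of frame order:

```python
import numpy as np

def set_pool(frame_features):
    """A permutation invariant set function: element-wise max over
    the set dimension, the classic choice in set-based gait models."""
    return frame_features.max(axis=0)          # (n_frames, d) -> (d,)

rng = np.random.default_rng(0)
frames = rng.random((10, 8))                   # 10 image-level features
shuffled = frames[rng.permutation(10)]         # reorder the set
assert np.allclose(set_pool(frames), set_pool(shuffled))
```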

Instantaneous Motion Description
Optical flow is a method that uses the change in the pixels of an image sequence in the temporal domain and the correlation between adjacent frames to find the correspondence between the previous and current frames, thereby calculating the object's motion between adjacent frames. The authors of [24,25] introduced two kinds of methods to calculate optical flow. Since then, optical flow has been commonly used in action recognition and action detection, such as in [26,27]. However, optical flow has yet to be widely used in gait recognition. It was applied to gait recognition in [28]. Nevertheless, that method simply feeds the optical flow extracted from the gait sequences into a convolutional neural network, and the reported results leave room for improvement.

Gait Optical Flow Image Extraction
To distinguish people based on their gait sequences, the GOFI representation describes the instantaneous motion using the optical flow descriptor. The GOFI mitigates both temporal and spatial condition noises while preserving the invariant gait features.
For a three-channel RGB gait image, first the image is subtracted from the background image and transformed into a binary image. Then, the outlier points and the holes caused by the subtraction are eliminated using morphological operations. The center of the subject is detected using the intensity distribution. For some samples, the person is too close to the camera and the image is filled with foreground, so these samples, in which the sum of pixel intensities is large, are eliminated. Then, the method introduced by [26] is used to calculate the two-channel optical flow matrix [u, v]. The optical flow of each pixel describes the direction and the intensity of body movement at the next step. Assuming the brightness is constant and the movement is tiny, the optical flow satisfies the brightness constancy constraint

I_x · u + I_y · v + I_t = 0,

where I_x and I_y are the image gradients in the horizontal and vertical directions, which can be obtained using the Sobel filter, and I_t represents the temporal gradient, which can be obtained using the frame differential method. Then, the optical flow is transformed into an HSV image. The hue is given by

H_i,j = arctan(v_i,j / u_i,j),

where, for a location (i, j), H_i,j is the hue of the pixel, and u_i,j and v_i,j are the optical flow components in the horizontal and vertical directions of the pixel, respectively. The saturation is given by the flow magnitude,

S_i,j = sqrt(u_i,j^2 + v_i,j^2),

where S = {S_i,j} is the magnitude map and S_i,j is the magnitude at the pixel. The value V_i,j of each pixel is set from the normalized magnitude. As shown in Figure 1, the HSV image I_1 is merged with the silhouette edge image I_2. To obtain I_2, the original image is transformed into a grayscale image, and then the Canny operator is applied to detect the border of the subject. Only the edge is used instead of the whole silhouette because we are more concerned about the instantaneous motion, which is more obvious at the edges but is ambiguous inside the silhouette. The gait optical flow image I_3 is obtained by combining I_1 and I_2 in proportion, weighted by the intensity factor σ. Finally, the original silhouette is cropped into subject-centered images of size 64 × 64 to obtain the GOFI.
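The flow estimation and HSV encoding above can be sketched as follows. This is a minimal NumPy illustration of the brightness-constancy step (a windowed Lucas-Kanade least-squares solve) and the direction/magnitude encoding, not the exact implementation of [26]; the window size, the normalization, and the constant value channel are assumptions, and central differences stand in for the Sobel filter.

```python
import numpy as np

def lucas_kanade_flow(prev, curr, win=5):
    """Per-pixel flow [u, v] solving I_x*u + I_y*v + I_t = 0 in a
    least-squares sense over a local window (brightness constancy,
    small motion).  Central differences approximate the Sobel filter."""
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    Ix = np.gradient(prev, axis=1)        # horizontal spatial gradient
    Iy = np.gradient(prev, axis=0)        # vertical spatial gradient
    It = curr - prev                      # frame-differential temporal gradient
    half = win // 2
    h, w = prev.shape
    u = np.zeros((h, w)); v = np.zeros((h, w))
    for i in range(half, h - half):
        for j in range(half, w - half):
            ix = Ix[i-half:i+half+1, j-half:j+half+1].ravel()
            iy = Iy[i-half:i+half+1, j-half:j+half+1].ravel()
            it = It[i-half:i+half+1, j-half:j+half+1].ravel()
            A = np.stack([ix, iy], axis=1)
            ATA = A.T @ A
            if np.linalg.det(ATA) > 1e-6:     # well-conditioned windows only
                u[i, j], v[i, j] = -np.linalg.inv(ATA) @ A.T @ it
    return u, v

def flow_to_hsv(u, v):
    """Encode flow direction as hue and normalized magnitude as saturation."""
    mag = np.sqrt(u**2 + v**2)
    hue = (np.arctan2(v, u) + np.pi) / (2 * np.pi)   # direction -> [0, 1]
    sat = mag / (mag.max() + 1e-8)                   # normalized magnitude
    val = np.ones_like(mag)                          # constant value (assumption)
    return np.stack([hue, sat, val], axis=-1)
```

On a synthetic image translated by one pixel, the recovered interior flow is (u, v) ≈ (1, 0), matching the imposed motion.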

Set-Level Feature from Unordered Set
Assume a gait dataset contains N subjects, and the corresponding gait sequences are X = {X_i}, where i ∈ {1, 2, 3, . . ., N}. The process of transforming the original gait sequences into the GOFI representation can be formulated as

G = P(X),

where P represents the process of fusing the optical flow and the silhouette edge, and G = {G_i}, where i ∈ {1, 2, 3, . . ., N}, represents the GOFI output. Then, the gait model, named GOFN, is used to extract both the image-level feature f_IL and the set-level feature f_SL for higher recognition accuracy. The set-level feature can be formulated as

f_SL = ST(L(G)),

where L represents a group of CNN layers used to extract the image-level features f_IL = L(G) ∈ R^(d_n × d_c × d_l), d_n is the batch size, d_c is the number of channels, and d_l is the length of the image-level feature. The Set Transition (ST) blocks are used to aggregate the image-level features into the set-level features.
To aggregate the temporal information and obtain set-level features from unordered gait sequences, a permutation invariant function is needed to process the image-level features extracted by the CNN. Regardless of the order of objects in the set, the value of a permutation invariant function remains the same. As shown in Figure 2, the ST block can be represented as

f_SL = FL(LN(f_IL + r · Multihead(f_IL, f_IL, f_IL; ω))),

where ω denotes the parameters in the ST block, LN is the layer normalization, FL is the row-wise feedforward layer, and r is the weight factor. Multihead() is the multi-head attention, which collects the global information to refine the image-level features. First, f_IL is copied and input into multiple linear layers to extract the multi-head subspace features. The length of the subspace features is d_sub. For example, in head 0, the subspace features are denoted as (q_0, k_0, v_0), and the attention is calculated by softmax(q_0 · k_0^T / sqrt(d_sub)) · v_0. The attentions of each head are then concatenated, and the output is resized to the original shape by multiplying it with a linear layer. The output is then fed into the FL module, where each feature is multiplied with a share-weight linear layer and the outputs are aggregated. ST takes the unordered gait sequences and applies self-attention between the image-level features, resulting in set-level features of the same size. The image-level feature is transformed into the set-level feature through ST. This structure stabilizes the network and is beneficial for recognition accuracy.
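A minimal NumPy sketch of one ST block (the parameter shapes and a single untrained layer are assumptions; the real network is trained end-to-end) makes the permutation behavior concrete: permuting the input set permutes the output rows identically, so pooling afterward yields a permutation invariant set-level feature.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Row-wise layer normalization (LN)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multihead_self_attention(f, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over a set of image-level features
    f with shape (set_size, d_l); permutation equivariant by design."""
    n, d = f.shape
    d_sub = d // n_heads
    q, k, v = f @ Wq, f @ Wk, f @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_sub, (h + 1) * d_sub)
        att = q[:, s] @ k[:, s].T / np.sqrt(d_sub)
        att = np.exp(att - att.max(-1, keepdims=True))
        att = att / att.sum(-1, keepdims=True)       # row-wise softmax
        heads.append(att @ v[:, s])
    return np.concatenate(heads, axis=1) @ Wo        # resize to original shape

def set_transition(f, params, r=1.0):
    """ST block sketch: f_SL = FL(LN(f_IL + r * Multihead(f_IL)))."""
    Wq, Wk, Wv, Wo, Wf = params
    x = layer_norm(f + r * multihead_self_attention(f, Wq, Wk, Wv, Wo, n_heads=4))
    return x @ Wf    # row-wise (share-weight) feedforward layer FL
```

Reordering the input rows reorders the output rows in exactly the same way, which is the property the permutation invariant aggregation relies on.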

Inherent Feature Pyramid
Handled by a group of CNN layers and the ST blocks, both the temporal and the spatial information of the GOFI are extracted and transformed into a feature map of size d_n × d_c × d_l, where d_n is the batch size, d_c is the number of channels, and d_l is the length of the features. To capture the discrimination of the feature map from the global to local levels, from coarse to fine at various scales, the feature map is sliced horizontally into multiple-scale sub-spaces. The sub-spaces are pooled via global average pooling (GAP) and global max pooling (GMP) to obtain the feature vectors. GAP calculates the average of all pixels, and GMP calculates the maximum of all pixels for each feature map. As shown in Figure 3, the feature map is equally divided into several spatial bins f_i,j in a horizontal manner according to different scales, where i ∈ {1, 2, . . ., 7} indicates the index of the sub-space and j ∈ {1, 2, . . ., 2^(i−1)} stands for the index of the bin in the i-th sub-space. Then, each bin f_i,j is pooled to obtain the feature vector f'_i,j, formulated as

f'_i,j = GAP(f_i,j) + GMP(f_i,j).

The fully connected (fc) layers are then applied to every feature vector f'_i,j to learn the discrimination, and the outputs are concatenated. To limit the length of the concatenated feature, the feature map is divided into {1, 2, 4, 8, 16, 32, 64} sub-spaces, so the size of the concatenated gait feature is d_n × 127. The IFP maps the feature into discriminative spaces while taking account of both global and local details, so both local and global information is emphasized. Finally, a combined loss function, which contains the triplet loss and the cross-entropy loss, is used to evaluate the similarity of the IFP gait features and train the network.
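The slicing-and-pooling step can be sketched as follows. This is a NumPy illustration for a single sample; summing the GAP and GMP outputs follows common practice in horizontal pyramid pooling and is an assumption about the paper's exact combination, and the per-bin fc layers are omitted.

```python
import numpy as np

def inherent_feature_pyramid(fmap, scales=(1, 2, 4, 8, 16, 32, 64)):
    """IFP sketch: slice a (d_c, d_l) feature map horizontally into
    1 + 2 + ... + 64 = 127 bins and pool each bin with GAP + GMP."""
    feats = []
    for s in scales:
        for b in np.array_split(fmap, s, axis=1):   # s equal-width bins
            feats.append(b.mean(axis=1) + b.max(axis=1))  # GAP + GMP per bin
    return np.stack(feats, axis=0)   # (127, d_c); an fc layer per bin follows
```

With d_l = 64 every scale divides the map evenly, and the 127 pooled vectors match the d_n × 127 concatenated feature size stated above.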


Framework of GOFN
The framework of the GOFN is shown in Figure 4. The GOFI combines the optical flow and the silhouette edge, which describes both the instantaneous motion and the shape. Then, each GOFI is input into a CNN block to obtain the image-level feature; the block contains a 5 × 5 convolutional layer, a 3 × 3 convolutional layer, and a 2 × 2 pooling layer. Then, the ST block aggregates the image-level features into the set-level features. The set-level features are split and concatenated in the IFP block to balance the global and local features. The output features of the IFP are used to compute the similarity between different gaits. The gait features with annotated IDs are used to train the GOFN, and the combined loss function is used for supervision.
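The shapes through the CNN block can be traced with a minimal NumPy sketch (single channel, random weights, no padding; the absence of padding is an assumption, since the paper does not state it):

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D cross-correlation for a single channel."""
    windows = np.lib.stride_tricks.sliding_window_view(x, k.shape)
    return np.einsum('ijkl,kl->ij', windows, k)

def max_pool2d(x, p=2):
    """Non-overlapping p x p max pooling."""
    h, w = x.shape
    return x[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p).max(axis=(1, 3))

# Shape walk-through for one 64 x 64 GOFI channel.
rng = np.random.default_rng(0)
x = conv2d(rng.random((64, 64)), rng.random((5, 5)))  # 5x5 conv -> (60, 60)
x = conv2d(x, rng.random((3, 3)))                     # 3x3 conv -> (58, 58)
x = max_pool2d(x)                                     # 2x2 pool -> (29, 29)
assert x.shape == (29, 29)
```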


Experiments
In this section, the performance of the GOFN is evaluated on two public datasets: the CASIA-B dataset [29] and the CASIA-C dataset [30]. First, the details of the datasets are described, including an overview, the training set, and the test set containing a gallery and a probe. Then, the results of the GOFN are compared with other methods. Finally, the results of the ablation study are used to verify the effectiveness of each component in the GOFN.


Datasets
As shown in Figure 5, the CASIA-B dataset includes 124 subjects under three different walking conditions, namely normal (NM), carrying a bag (BG), and wearing a coat (CL), with 11 views from 0° to 180°. The first 62 subjects are used as the training set, and the remaining 62 subjects are used for testing. Specifically, each subject's first 4 normal walking sequences are used as the gallery set, and the others as the probe set.
CASIA-C is a dataset that contains 153 subjects under different conditions, including normal walking (NW), slow walking (SW), fast walking (FW), and walking with a bag (BW), as shown in Figure 5. Every subject has four NW sequences, two SW sequences, two FW sequences, and two BW sequences. We use the first 24, 62, and 100 subjects as the training set and the last 53 subjects as the testing set. For every subject, the NW sequences are used as the gallery set, and the remaining sequences are used as the probe set.


Comparisons with Other Methods
The performance of the GOFN is compared with other methods on CASIA-B and CASIA-C. As shown in Table 1, the experiments show the effectiveness of the GOFN on CASIA-B under cross-view conditions. Our approach achieves an average rank-1 accuracy of 96.4% under normal walking (NM) conditions, 85.5% under bag-carrying (BG) conditions, and 66.1% under coat-wearing (CL) conditions on CASIA-B, excluding identical-view cases. The accuracy of the GOFN is highest at 144°, which may be because the spatial and motion features are most abundant in this view. The accuracy under the BG condition is lower than under the NM condition, and the accuracy is the lowest under the CL condition. This may be because the appearance changes under the BG condition while the motion information remains the same, and both spatial and motion features are partly concealed under the CL condition, which brings significant challenges to gait recognition. Compared with the SPAE method proposed by [31] and the MGAN proposed by [32], the GOFN is much more accurate. Compared with GaitSet [9], the accuracy is 4.4% higher on average under NM conditions. GaitSet uses only the silhouette, while the GOFI uses the instantaneous motion information as well as the silhouette edge. The edge of the silhouette in the GOFI retains useful spatial information and is beneficial for accuracy under NM conditions. Moreover, the accuracy is 3.6% higher on average under CL conditions. This may be because the GOFI leverages the motion information, which excludes the influence of the outline of the silhouette, and the Set Transition works well. Compared with Gait-D [33] and GaitGraph [21], which are based on skeleton features, the GOFN performs better under NM and BG conditions, although Gait-D obtains a better result under CL conditions. This may be because skeleton features are free from the errors caused by the coat. Compared with methods using local patterns such as
GaitPart [14] and GaitGL [15], the GOFN achieves better results under NM conditions but still encounters issues under BG and CL conditions. GaitPart and GaitGL split the original sequence into partitions. In datasets with aligned subject heights, these methods perform well, but in situations where the heights of subjects vary heavily, they may encode the misplacement of partitions. In contrast, our method extracts features from the global silhouettes and the global instantaneous motion, which is more robust when the model is applied to subjects of different heights.
As shown in Table 2, the performance of the GOFN is compared with other approaches on CASIA-B under identical-view conditions, where the GOFN performs better than the other methods. It achieves an accuracy of 98.2% under NM conditions, 87.5% under BG conditions, and 69.4% under CL conditions. Compared with the SPAE method proposed by [28], PoseGait proposed by [34], and LGSD + PSN proposed by [35], the GOFN obtains better performance because the instantaneous motion is less affected under identical-view conditions. Moreover, the GOFN shows its superiority under different conditions, exceeding the previous methods by at least 2.5% under BG conditions and by at least 5.2% under CL conditions. This shows that the GOFI captures more instantaneous motion using the optical flow, which improves the gait representation. The GOFN uses ST and IFP to aggregate the image-level features into set-level features, which helps to exploit the multi-scaled partial features. The performance of the GOFN is also evaluated on the CASIA-C dataset with different training sets. As shown in Table 3, the GOFN achieves an average accuracy of 51.7% using 24 subjects as the training set, and it achieves accuracies of 56.0% and 64.9% with 62-subject and 100-subject training sets, respectively. This shows that the recognition accuracy increases when the training set is larger, because a larger training set contains more boundary features, which is beneficial for training the set-level boundaries and reduces overfitting. Moreover, the overall results are lower than those obtained on CASIA-B because the quality of the gait sequences in CASIA-C is poorer, introducing outlier data in the process of extracting optical flow. Compared with the PSN proposed by [35], the GOFN increases the accuracy by about 5% under FW conditions and by about 1.8% on average. This may be because the edge of the silhouette in the GOFI provides helpful spatial information for improving recognition accuracy, since the motion information may contain outliers under FW conditions.

Ablation Study
The ablation experiments are conducted on the CASIA-B dataset using different components of the GOFN to verify their effectiveness, as shown in Table 4. The first row shows the baseline using the GEI directly for classification. The second row shows the results using ST and IFP with the GEI as input, where the accuracy increases by about 20% under NM and BG conditions and about 35% under CL conditions. The third row shows the effectiveness of the GOFI: compared with the GEI, using the GOFI increases the accuracy by 2.8% under NM, 0.5% under BG, and 3.9% under CL conditions. Rows 4-8 show the impact of different components of the GOFN, which demonstrates the effectiveness of ST and IFP. For the permutation invariant function, using ST achieves the highest accuracy compared with other common permutation invariant functions such as max, mean, median, and attention. The ablation experiments show that IFP helps accuracy as well: with IFP, the results increase by about 2-3% under each condition. Meanwhile, the ablation experiments for the loss function on CASIA-B are shown in Table 5. The outcomes indicate that the combined loss function performs better than either function alone.
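The combined supervision can be sketched as follows. This is a single-sample NumPy version; the margin, the weight alpha, and the sampling strategy are assumptions, since the paper does not specify them here.

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Triplet term on L2 distances: pull the positive closer than
    the negative by at least the margin."""
    d_ap = np.linalg.norm(anchor - pos)
    d_an = np.linalg.norm(anchor - neg)
    return max(d_ap - d_an + margin, 0.0)

def cross_entropy(logits, label):
    """Numerically stable cross-entropy on the ID-classification logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def combined_loss(anchor, pos, neg, logits, label, alpha=1.0):
    """Combined supervision: triplet + alpha * cross-entropy
    (the weight alpha is an assumption)."""
    return triplet_loss(anchor, pos, neg) + alpha * cross_entropy(logits, label)
```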

Conclusions
In this paper, since many methods concern the long-term motion of gait but ignore the short-term motion, a novel GOFI gait representation is proposed to extract the instantaneous motion as well as the silhouette's edge. Then, the gait features are extracted by the GOFN, which contains the ST architecture to aggregate the image-level features into set-level features and the IFP to exploit the multi-scaled partial features. The experiments on the CASIA-B and CASIA-C datasets verify the effectiveness of the proposed method.


Figure 2 .
Figure 2. The structure of Set Transition (ST). The image-level feature f_IL is fed into the multi-head attention block for global information and is normalized. Then, the output is summed with f_IL and fed into a row-wise feedforward layer.

Figure 3 .
Figure 3. The structure of the IFP. The feature map is first horizontally sliced into bins at various scales, and each bin is pooled by global average pooling and global max pooling. After pooling, the feature vector is fed into fully connected layers and mapped into a discriminative space. Global features and local features are balanced during the process.


Figure 5 .
Figure 5. Examples from CASIA-B (a) and CASIA-C (b). (a) From left to right, the figures show the CL condition, the BG condition, and the NM condition in different views. (b) From left to right, the figures show the BW condition, the SW condition, and the FW condition.


Table 1 .
Comparison with previous methods in various views on CASIA-B by accuracy (%), excluding identical-view cases. The GOFN obtains the best average results under NM.

Table 2 .
Comparison with previous methods in various views on CASIA-B by accuracy (%) under identical-view conditions. The GOFN obtains the best results under BG and CL conditions and remains to be improved under NM conditions.

Table 3 .
Recognition accuracy (%) under different conditions on CASIA-C with different training sets. Compared with the PSN model, the GOFN has better performance under FW and BW conditions.

Table 4 .
Ablation experiments for each structure of the GOFN conducted on CASIA-B. Specifically, GEI and GOFI represent different inputs for the network. ST, Max, Mean, Median, and Attention represent different permutation invariant functions to extract set-level features.

Table 5 .
Ablation experiments for different loss functions on CASIA-B. The combined loss function performs better than either function alone.