Gait Recognition by Combining the Long-Short-Term Attention Network and Personal Physiological Features

Although gait recognition has improved greatly in recent years through the efforts of many researchers, its performance is still unsatisfactory in real scenarios, where gait information is scarce because only one or two images may be available for recognition. In this paper, a new gait recognition framework is proposed that combines long-short-term attention modules, applied to silhouette images over the whole sequence, with real human physiological information calculated from a monocular image. The contributions of this work are as follows: (1) the global long-term attention (GLTA) and local short-term attention (LSTA) are fused over the whole query sequence to improve gait recognition accuracy, so that both short-term gait features (from two or three frames) and long-term features (from the whole sequence) are extracted; (2) a method is presented to calculate real personal static and dynamic physiological features from a single monocular image; (3) to apply the human physiological information efficiently, a new physiological feature extraction (PFE) network is proposed that concatenates the physiological information with the silhouette features for gait recognition. Experiments on the CASIA-B and Multi-state Gait datasets demonstrate the effectiveness and efficiency of the proposed method. Under the three walking conditions of the CASIA-B dataset, the mean rank-1 accuracy of our method reaches 89.6%, and on the Multi-state Gait dataset, with subjects wearing different clothes, the mean rank-1 accuracy of our method is 2.4% higher than that of the other works.


Introduction
Gait recognition has drawn attention from numerous researchers as an important biometric recognition technique. It plays an important role in many tasks, such as surveillance systems, anti-terrorist operations, clinical diagnosis, etc. Unlike biometric features that need to be extracted at close range (such as the face or iris) or by a touch sensor (fingerprint), gait features can be effectively collected even if the target person is at a distance of 20 m from the camera. The extraction of gait features does not need the intentional cooperation of the target person, and can be widely applied in surveillance videos, biometric measurement, dialog, etc. Meanwhile, gait features, as habitual human movement features, are not easily changed. However, conventional gait recognition algorithms still face many challenges, because the extraction of gait features can be affected by different factors, such as the variation of camera view angles and the different walking directions of pedestrians. Some works [1][2][3] have shown that in real scenes, changes in clothing, object occlusion, and pedestrian walking speed can also affect the feature extraction.
In recent years, some methods have proven very effective in extracting gait features. They can be divided into two categories: the discriminative model-based algorithms [4][5][6][7][8][9][10][11] and the generative model-based ones [12,13]. In the discriminative model-based algorithms, identity discrimination is mainly performed by extracting features from gait templates or gait sequences. The gait energy image [14] is a gait template generated from the gait silhouettes by time-averaged pooling. In [4], Shiraga et al. used gait energy images for feature representation to obtain perspective-invariant features through fully connected layers. Wu et al. [8] used deep convolutional neural networks to learn the similarity between gait energy images for identity discrimination. Although the gait energy image reduces the computational cost, it loses the frame-level features. Therefore, in recent years, researchers have tended to extract features directly from the frame sequence [5,6]. Neither spatial features nor temporal features alone provide enough information for gait recognition. Chao et al. [5] considered that a gait sequence is composed of both spatial location and temporal information, and treated the gait sequence as an unordered set for feature extraction. Fan et al. [6] proposed a focal convolutional layer to refine the feature extraction, and introduced a novel local gait feature representation that describes the spatiotemporal features of the human body over a short period of time; such features have been shown to be superior to other ones. Inspired by [15], which proposed that a 3D CNN can efficiently extract both spatial and temporal features, Wolf et al. [9] applied a 3D CNN to extract the spatiotemporal information in gait sequences; to handle the indefinite length of gait sequences, each sequence was cut into several short ones. In [10], Thapar et al. used a 3D CNN for feature extraction from different viewpoints. On the other hand, in [11], Liao used the extracted human key point information for gait recognition, and added three posture features: joint angle, limb length, and joint motion. Liao et al. [7] used human pose information as the input of the model, and mentioned that an RNN [16] or LSTM [17] can be used to extract temporal information from a sequence.
The generative model-based algorithms require encoding and decoding of the gait sequence features. Feng et al. [12] applied the LSTM to process the obtained nodes and reconstruct the gait sequence features from different viewpoints, while Yu et al. [13] used generative adversarial networks to reduce the effects caused by clothing changes, viewpoint changes, etc.
Although the silhouette image of a person may change greatly (due to the variation of view angles among the surveillance cameras), it is well known that a person's real physiological information (such as real height, shoulder width, and step frequency) does not change greatly. In [18,19], such physiological information was reported to improve the accuracy of gait recognition algorithms.
In this paper, a novel gait recognition framework is proposed that combines long-short-term global/local features with real personal physiological information. Since the movement frequency of persons may vary greatly, both the local short-term attention (LSTA, for three continuous frames) and global long-term attention (GLTA, for the whole gait cycle) modules are proposed to collect more effective gait features. Based on the observation that a silhouette image can only provide limited shape or motion features, a novel human physiological information (HPI) module is also proposed for calculating the real personal static and dynamic physiological features from monocular images. To apply the HPI features efficiently, a new physiological feature extraction (PFE) network is proposed to concatenate the physiological information with the silhouette features for gait recognition. Experiments on the CASIA-B [20] and Multi-state Gait (collected by us) datasets demonstrate the effectiveness and efficiency of the proposed method.

System Overview
The main structure of this paper is shown in Figure 1, which mainly includes two parts: the gait-silhouette-based attention module and the real personal physiological feature estimation module.
Figure 1. The overview of this work, which is mainly composed of two parts: the gait-silhouette-based global (local) long (short)-term attention module, and the personal physiological feature estimation module.
In the gait-silhouette-based attention module, the local short-term attention (LSTA) features and global long-term attention (GLTA) features are extracted by different multilayer perceptrons and feature aggregation. Then, the feature aggregation is performed by using multiple features to complement each other. Regarding the global long-term features, four layers of 2D CNN and two layers of max pooling are used to obtain the shallow feature information, and then the global long-term features of the whole sequence are extracted using the global long-term attention module (GLTA). Finally, the features are aggregated by the adaptive temporal feature aggregation module (ATFA). Regarding the local short-term features, GaitPart [6] is selected as the backbone network to extract shallow features, and then the local short-term features are extracted using the local short-term attention module (LSTA), where the extracted features are also aggregated by the ATFA module.
In the personal physiological feature estimation module, after the camera calibration, the personal physiological information (such as shoulder width, step length, step frequency, etc.) is extracted by the human physiological information module (HPI) from the skeletal points detected in the input monocular images. Then, the physiological feature extraction module (PFE) is used to extract and aggregate each of the gait physiological features. Fully connected (FC) layers are used to map the feature vector into the metric space, and the features obtained from the two modules are concatenated for reranking.

Gait-Silhouette-Based Attention Modules
A local convolution network similar to GaitPart [6] is used to extract local features with different receptive fields. As shown in Figure 1, block1 and block2 split the input feature map horizontally into four and eight parts, respectively, by focal convolution layers [6]. Here, the local features are represented by F_local ∈ R^{N×P×S×C×(H/P)×W}, where N is the batch size, S is the length of the time series, C is the number of feature channels, H and W are the height and width of the feature map, and P denotes the number of parts into which the feature map is split. The global features are extracted by a four-layer 2D CNN, and represented as F_global ∈ R^{N×S×C×H×W}. Spatial aggregation (SA) refers to horizontal feature aggregation along the width dimension W of an image. The SA operations are described in Equations (1) and (2) to obtain the local features F_local ∈ R^{N×P×S×C×(H/P)} and global features F_global ∈ R^{N×S×C×H}, respectively. Here, avg_W and max_W denote the average and maximum value over the width W.
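The effect of the SA operation on tensor shapes can be illustrated with a minimal NumPy sketch. Equations (1) and (2) are not reproduced here, so the way the two statistics are combined (a plain sum of avg_W and max_W) is an assumption, as are the function name `spatial_aggregation` and all tensor sizes.

```python
import numpy as np


def spatial_aggregation(features: np.ndarray) -> np.ndarray:
    """Collapse the width axis (the last axis) of a feature map.

    Assumption: the average and the maximum over W are summed. The text only
    states that both statistics are used.
    """
    return features.mean(axis=-1) + features.max(axis=-1)


# Hypothetical sizes: N=2 sequences, S=30 frames, C=8 channels, H=16, W=11.
f_global = np.random.rand(2, 30, 8, 16, 11)
print(spatial_aggregation(f_global).shape)  # (2, 30, 8, 16)

# Local branch with P=4 horizontal parts, each of height H/P = 4.
f_local = np.random.rand(2, 4, 30, 8, 4, 11)
print(spatial_aggregation(f_local).shape)  # (2, 4, 30, 8, 4)
```

In both branches only the width axis disappears, which matches the dimensions stated for F_local and F_global after SA.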

Local Short-Term Attention (LSTA)
As shown in Figure 2, a channel attention module is introduced to enhance the feature representation of the local features. A channel-based attention feature F_AC ∈ R^{N×P×S×C×(H/P)} is described in Equation (3), where the attention distribution is obtained by applying a 1D CNN and a sigmoid function along the channel dimension C. In Equation (3), the channel attention is multiplied element-wise with the local features F_local to obtain the channel excitation features F_AC. After that, a one-dimensional convolution of size 1 is used along the time dimension to obtain the temporal attention features F_AS ∈ R^{N×P×S×C×(H/P)} of each row. Then, average and max pooling with window sizes of 3 and 5 are slid along the time series S to extract short-term features with different receptive fields, as defined in Equations (4) and (5), to obtain the local short-term attention features F_LSTA ∈ R^{N×S×C×H} based on the time series.
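The sliding temporal pooling with windows of 3 and 5 frames can be sketched as follows. This is only an illustration of the pooling step, not of Equations (4) and (5) themselves: the edge padding, the summation of the four pooled results, and the names `sliding_pool` and `short_term_features` are assumptions made for this example.

```python
import numpy as np


def sliding_pool(x: np.ndarray, window: int, axis: int, reduce) -> np.ndarray:
    """Slide an odd-sized window along `axis` while keeping the axis length.

    Assumption: border frames are padded by repeating the first/last frame.
    `axis` must be a non-negative index.
    """
    pad = window // 2
    widths = [(0, 0)] * x.ndim
    widths[axis] = (pad, pad)
    padded = np.pad(x, widths, mode="edge")
    pooled = []
    for t in range(x.shape[axis]):
        frames = np.take(padded, np.arange(t, t + window), axis=axis)
        pooled.append(reduce(frames, axis=axis))
    return np.stack(pooled, axis=axis)


def short_term_features(f_as: np.ndarray, time_axis: int = 2) -> np.ndarray:
    """Sum average and max pooling over 3-frame and 5-frame windows."""
    out = np.zeros_like(f_as, dtype=float)
    for window in (3, 5):
        out += sliding_pool(f_as, window, time_axis, np.mean)
        out += sliding_pool(f_as, window, time_axis, np.max)
    return out


# Hypothetical sizes: N=1, P=4 parts, S=30 frames, C=8 channels, H/P=4 rows.
f_as = np.random.rand(1, 4, 30, 8, 4)
print(short_term_features(f_as).shape)  # (1, 4, 30, 8, 4)
```

Each output frame therefore summarizes its 3-frame and 5-frame neighbourhoods, which is what gives the short-term features two different temporal receptive fields.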

Global Long-Term Attention (GLTA)
Besides the local features, the global features are also useful for describing the holistic information of the target person. As described in Figure 3, after obtaining a gait silhouette sequence of S frames, the shallow features F_global of the whole sequence are obtained by the SA operation (described in Equations (1) and (2)). In order to extract a more discriminative feature representation, the importance of each frame in the whole sequence is calculated along the time dimension by a feedforward network, and is represented as F_AL ∈ R^{N×S×C×H}. As defined in Equation (6), F_AL is the output of a multilayer perceptron (MLP) module consisting of a two-layer 2D CNN, and it is multiplied element-wise with F_global to obtain the temporal excitation features F_GLTA ∈ R^{N×S×C×H}.
Figure 3. The overview of GLTA.

Adaptive Temporal Feature Aggregation (ATFA)
In this module, as shown in Figure 4, an adaptive temporal feature aggregation is proposed. First, max pooling and average pooling are applied to reduce the input features along the temporal dimension S, and the results are concatenated. The max pooling represents the salient information of the sequence, while the average pooling represents the overall information of the sequence. The temporal feature pooling can be formulated as Equations (7) and (8) to obtain the features F_cat−global ∈ R^{N×K×C×H} from F_global and F_GLTA, and F_cat−local ∈ R^{N×K×C×H} from F_local and F_LSTA, where K denotes the number of features after the temporal feature pooling.
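The temporal pooling and concatenation step can be sketched in NumPy as follows. The function name `temporal_pool_concat` and the tensor sizes are hypothetical, and the ordering of the pooled features along K is an assumption; with two inputs and two pooling operators the sketch yields K = 4.

```python
import numpy as np


def temporal_pool_concat(*feature_maps: np.ndarray, time_axis: int = 1) -> np.ndarray:
    """Max-pool and average-pool each input over time, then stack the results.

    Each input has shape (N, S, C, H); the output has shape (N, K, C, H),
    where K = 2 * number of inputs (an assumption of this sketch).
    """
    pooled = []
    for features in feature_maps:
        pooled.append(features.max(axis=time_axis))   # salient information
        pooled.append(features.mean(axis=time_axis))  # overall information
    return np.stack(pooled, axis=1)


# Hypothetical sizes: N=2, S=30, C=8, H=16.
f_global = np.random.rand(2, 30, 8, 16)
f_glta = np.random.rand(2, 30, 8, 16)
f_cat_global = temporal_pool_concat(f_global, f_glta)
print(f_cat_global.shape)  # (2, 4, 8, 16)
```

Because the time axis is removed by pooling, sequences of different lengths S produce outputs of the same shape.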
Then, in order to adaptively select among these feature representations and enhance the discriminative power of the selected features, a multilayer perceptron is introduced to score the splicing dimension of F_cat and perform a weighted summation over the splicing dimension K. This process can be represented by Equations (9) and (10) to obtain the output features F_out−global ∈ R^{N×C×H} and F_out−local ∈ R^{N×C×H}, respectively.
As shown in Figure 5, the real height and width of a person and the person's depth from the camera are expressed as (H_p, W_p, Z_p); the focal length of the monocular camera is defined as f; its optical center is defined as (O_x, O_y); θ_cam and φ_cam denote the tilt and rotation of the camera; and the camera height is expressed as H_cam. In this work, the height and width (H_img, W_img) of a person in the image are estimated by YOLOV5 [21]. Then, the real personal feature information can be described in terms of the camera parameters and the image information.
Figure 5. The real height and width of a person can be estimated from his bounding box in the image as long as the camera viewpoint and setting are known. Detailed illustration and explanation can be found in [22].
As described in [22], (H_p, W_p, Z_p) are calculated by Equation (12) under the assumption of zero roll (or that the image has been rotated to account for roll). Readers are referred to [22] for a more detailed discussion and description of Equation (12).
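To give a feeling for how a bounding box maps to a real size, the following sketch uses the basic pinhole relation (real size = image size × depth / focal length). It is a deliberate simplification and not Equation (12): the camera tilt and height are ignored, the depth Z_p is assumed to be known, and the function name and all numbers are hypothetical.

```python
def real_size_from_bbox(h_img_px: float, w_img_px: float,
                        depth_m: float, focal_px: float) -> tuple[float, float]:
    """Estimate real height and width (in metres) from a bounding box.

    Simplified pinhole model: zero tilt and zero roll are assumed, and the
    person is assumed to stand upright at a known depth from the camera.
    """
    if focal_px <= 0 or depth_m <= 0:
        raise ValueError("focal length and depth must be positive")
    scale = depth_m / focal_px  # metres per pixel at this depth
    return h_img_px * scale, w_img_px * scale


# Hypothetical values: a 340 x 90 px box, 5 m from a camera with f = 1000 px.
height_m, width_m = real_size_from_bbox(340, 90, depth_m=5.0, focal_px=1000.0)
print(f"height = {height_m:.2f} m, width = {width_m:.2f} m")
```

Even in this reduced form, the real size is linear in the bounding-box size for a fixed depth and focal length.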
According to Equation (12), there is a linear relation between the bounding box size (H_img, W_img) and the real human height and width (H_p, W_p). Therefore, as shown in Figure 6, from the skeletal points estimated by OpenPose [23] and the (H_img, W_img) obtained from YOLOV5, it is easy to estimate the following seven pieces of real human physiological information: height, shoulder width, hunchback angle, elbow angle, knee angle, step length, and step frequency.
Figure 6. According to Equation (12), with the skeleton points extracted from OpenPose [23], seven real human physiological information parameters can be obtained: height, shoulder width, hunchback angle, elbow angle, knee angle, step length, and step frequency.
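As an illustration of how an angle feature such as the elbow or knee angle could be derived from three skeleton points, the sketch below computes the angle at the middle joint from 2D coordinates. The function name `joint_angle` and the example coordinates are hypothetical, and the angle is measured in the image plane without any perspective correction, which is an assumption of this example.

```python
import math


def joint_angle(a: tuple[float, float], b: tuple[float, float],
                c: tuple[float, float]) -> float:
    """Return the angle at joint b (in degrees) formed by the points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm1 = math.hypot(*v1)
    norm2 = math.hypot(*v2)
    if norm1 == 0 or norm2 == 0:
        raise ValueError("joint points must be distinct")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / (norm1 * norm2)
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical pixel coordinates for hip, knee, and ankle.
hip, knee, ankle = (100.0, 200.0), (105.0, 260.0), (98.0, 320.0)
print(f"knee angle = {joint_angle(hip, knee, ankle):.1f} degrees")
```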

Physiological Feature Extraction (PFE) Module
After obtaining each piece of physiological information, in order to enhance the discriminative ability and the correlation among these pieces of information, this paper proposes a new network, shown in Figure 7, for physiological feature extraction, where a 1D CNN of size 3 is used to obtain the correlation between the pieces of physiological information. After two such 1D CNN layers, batch normalization (BN) is adopted to accelerate the convergence of the proposed network, and the output of the BN is fed to a fully connected (FC) layer. Then, a dropout layer is used to avoid overfitting. The feature input is defined as P_in ∈ R^{N×C×L}, where N is the batch size, C denotes the number of channels, and L denotes the feature length. This process can be defined as Equation (13), through which the feature output F_PFE ∈ R^{N×L} is obtained. Here, L is the feature length after passing through the fully connected layer.
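The layer order described above (two 1D convolutions of size 3, batch normalization, a fully connected layer, dropout) could be sketched in PyTorch as follows. This is not the authors' implementation: the class name `PFESketch`, the hidden channel count, the output length, the padding, and the ReLU activations are all assumptions, since Equation (13) and Figure 7 are not reproduced here.

```python
import torch
from torch import nn


class PFESketch(nn.Module):
    """Hypothetical PFE-style network: Conv1d x2 -> BN -> FC -> Dropout."""

    def __init__(self, in_channels: int = 1, hidden_channels: int = 16,
                 in_length: int = 7, out_length: int = 64, dropout: float = 0.5):
        super().__init__()
        self.convs = nn.Sequential(
            # Kernel size 3 relates neighbouring physiological values.
            nn.Conv1d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),  # assumption: the activation is not stated in the text
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.bn = nn.BatchNorm1d(hidden_channels)
        self.fc = nn.Linear(hidden_channels * in_length, out_length)
        self.dropout = nn.Dropout(dropout)

    def forward(self, p_in: torch.Tensor) -> torch.Tensor:
        x = self.bn(self.convs(p_in))        # (N, hidden, L)
        x = self.fc(x.flatten(start_dim=1))  # (N, out_length)
        return self.dropout(x)


# One 1 x 7 vector of physiological values per sample, batch of 4.
model = PFESketch().eval()
with torch.no_grad():
    features = model(torch.randn(4, 1, 7))
print(features.shape)  # torch.Size([4, 64])
```

The input length of 7 corresponds to the seven physiological values, and the dropout rate of 0.5 follows the value given in the training details.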

Loss Function
As shown in Figure 1, in order to make the silhouette gait features more distinguishable, the batch-all triplet loss [24] and the cross-entropy loss are selected in this work, where the triplet loss increases the compactness within a class and the cross-entropy loss measures the separability between global classes.
The combined loss function is defined as Equation (14). Within each batch, the triplet loss over all samples is defined as Equation (15), where all_d_{a,p} is the average distance between each anchor and all positive samples, all_d_{a,n} is the average distance between each anchor and all negative samples, and α is the margin value.
L_all = max(all_d_{a,p} − all_d_{a,n} + α, 0)    (15)
The triplet loss of PFE is also measured by Equation (15). The feature vectors of the silhouette-based module and the PFE module are concatenated to create a new feature vector, and the ID reranking process is performed by measuring the length of this new feature vector.
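Equation (15) can be illustrated with a small pure-Python sketch that follows the wording above: for each anchor, the mean distance to all positives and the mean distance to all negatives are compared with the margin. The Euclidean distance, the averaging of the per-anchor losses over the batch, and the function name are assumptions of this example.

```python
import math


def batch_all_triplet_loss(embeddings, labels, margin=0.2):
    """Mean over anchors of max(d_ap - d_an + margin, 0).

    d_ap / d_an are the average Euclidean distances from an anchor to all
    positive / negative samples in the batch. Anchors without at least one
    positive and one negative are skipped.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [math.dist(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == label and j != i]
        neg = [math.dist(anchor, e)
               for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue
        d_ap = sum(pos) / len(pos)
        d_an = sum(neg) / len(neg)
        losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


# Two well-separated identities: every anchor already satisfies the margin.
separated = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
print(batch_all_triplet_loss(separated, [0, 0, 1, 1]))  # 0.0
```

When the identities overlap, the loss becomes positive, which pushes positives closer to the anchor and negatives further away during training.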

Results
To ensure the effectiveness and efficiency of the proposed algorithm, three experiments were carried out: (1) In Section 3.2, we examined the effectiveness of the human physiological information estimation module; (2) in Section 3.3.1, we conducted comparative experiments among the conventional gait recognition algorithms [5,6,25,26] and the proposed method (without HPI and PFE modules) on CASIA-B [20]; (3) we conducted comparative experiments among the proposed method (with HPI and PFE modules) and the baseline methods on the Multi-state Gait dataset, where the real human physiological information could be estimated.

Datasets and Training Details
CASIA-B [20]: This dataset includes 124 persons; each person is captured from 11 views, and each view contains 10 sequences under three walking conditions: normal (NM), carrying a bag or backpack (BG), and wearing coats or jackets (CL). The first six sequences are obtained under the NM condition, the next two sequences under BG, and the last two sequences under CL. This paper follows the popular protocol of [8]: the first 74 persons are used for training and the remaining 50 for testing. During the test, the first four sequences of NM (NM#1-4) are used as the gallery, and the remaining six sequences (regarded as probes) are divided into three subsets according to the walking conditions: the NM subset contains NM#5-6, the BG subset contains BG#1-2, and the CL subset contains CL#1-2.
Multi-state Gait dataset (the approved informed consent was obtained from all the subjects in this dataset. Their personal images were authorized to be used for the academic research): Since the setting and parameters of cameras are not included in the CASIA-B dataset, the new Multi-state Gait dataset was created, where the camera parameters and settings were recorded, through which the HPI module could estimate the real human physiological information.
As shown in Figure 8, all the data in the Multi-state Gait dataset were captured by seven Hikvision cameras (the interval between two adjacent cameras is 15 degrees) at 20 fps with a resolution of 1280 × 720 pixels. There were 60 subjects included in this dataset, and each person was instructed to walk in two directions (forward and backward). Therefore, the viewing angles varied from 0-90° and 180-270°, respectively. All the cameras were set at a height of 2 m from the ground with their pitch angles fixed at 5°. With the help of OpenCV, all the data were collected by a desktop PC with an AMD R9 5950X CPU, 32 GB of memory, and an NVIDIA RTX3090. During training and testing, the software environment was Pytorch1.8 + Cuda10.1 + Pycharm + Ubuntu. Figure 9 shows some collected samples; the gait silhouette sequences were extracted by using the Mask R-CNN [26].
Similar to [27], this dataset contains 60 persons, where each person is captured from 14 angles (0, 15, 30, 45, 60, 75, 90, 180, 195, 210, 225, 240, 255, 270 degrees) and each angle contains 14 sequences: six sequences for NM, four sequences for BG, and four sequences for CL. Here, the first 34 persons are selected for training and the remaining 26 persons for testing. During the test, the first four sequences of NM (NM#1-4) are used as the gallery, and the remaining 10 sequences (regarded as probes) are divided into three subsets: the NM subset contains NM#5-6, the BG subset contains BG#1-4, and the CL subset contains CL#1-4.
Training Details in CASIA-B: The gait silhouette map inputted into the network is set to 64 × 44 pixels, and the images are aligned according to the method of [27]. Each gait cycle contains 30 frames from each view angle (a total of 11 angles). The margin in the triplet loss L_all is set to 0.2, the Adam optimizer is applied in the training process, and the learning rate is set to 1 × 10^−4. After 120K iterations, the learning rate is adjusted to 1 × 10^−5, and the proposed network converged to a local optimum after another 5K iterations.
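The learning-rate schedule described for CASIA-B amounts to a single step decay, which can be written as a small helper. The function name and the zero-based iteration convention (the switch takes effect at iteration index 120,000) are assumptions of this sketch.

```python
def learning_rate(iteration: int, switch_at: int = 120_000,
                  base_lr: float = 1e-4, final_lr: float = 1e-5) -> float:
    """Step schedule: base_lr before `switch_at` iterations, final_lr after."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return base_lr if iteration < switch_at else final_lr


for it in (0, 119_999, 120_000, 124_999):
    print(it, learning_rate(it))
```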
Training Details in the Multi-State Gait Dataset: The resolution of the gait silhouette map inputted into the network is 64 × 44 pixels, and all the images are aligned according to the method of [27]. Each gait cycle contains 30 frames from each view angle (a total of 11 angles). During training, the Adam optimizer is used, the margin in the triplet loss L_all is set to 0.2, and the learning rate is set to 1 × 10^−4. Because this dataset is small, a total of 70K iterations are performed. Since both GaitSet [5] and GaitPart [6] were selected as the comparative baseline methods, they were also trained on this dataset, each with the same parameter settings for 70K iterations.
Regarding the PFE module in the proposed work, the size of a single piece of information inputted to the network is set as 1 × 7, the dropout layer parameter is set to 0.5, the margin in the triplet loss L_hard is set to 0.4, the Adam optimizer is also used, and the learning rate is set to 1 × 10^−4 for 30 iterations.
For either dataset, the training process of the proposed model is implemented in Pytorch1.8 + Cuda10.1 by using one NVIDIA RTX3090 GPU under the Ubuntu conditions.

Efficiency Evaluation of Physiological Information Computing
In order to verify the effectiveness of the HPI module in the proposed work, this paper selected four persons under three angles from the Multi-state Gait dataset. Here, two different experiments were performed: the static measurement for angle evaluation (elbow, knee, and hunchback angles) from static images, and the dynamic measurement for length evaluation (height, shoulder width, step length, and step frequency) from video sequences. Table 1 shows the detailed experimental results of the static estimation error evaluation of the HPI module under the three angles. The error rate is the ratio of the absolute difference |A_EST − A_GT| to A_GT, where A_EST is the angle estimated by the HPI module and A_GT is the ground truth angle. Overall, the estimation error of the HPI module for the elbow and knee angles is around 6%, while the hunchback angle has a larger estimation error (up to 9.4%). This is because skeleton point 0 (shown in Figure 6) is unstable under the variation of view angles. Another important source of experimental error is the displacement between the skeleton points estimated by OpenPose [23] and the position at which the real medical instrument was placed.
Table 2 shows the detailed information of the dynamic measurement experiment. Here, the target persons were required to walk from different angles. Regarding the step length measurement, the ground truth value was collected by measuring the distance between two footprints of a person whose shoe soles were painted with ink. The ground truth (step length) was the mean value of all manually measured step lengths during a walking sequence. As shown in the right image of Figure 10, a Kalman filter is applied to the estimated HPI features (such as height, shoulder width, etc.) to eliminate the effect of random noise.
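The error rate and the noise suppression step can be sketched together in a few lines of Python. The scalar Kalman filter below assumes a constant-value state model (a person's height does not change during a sequence); the noise variances, the function names, and the sample measurements are hypothetical and are not taken from the experiments.

```python
def relative_error(estimate: float, ground_truth: float) -> float:
    """Error rate |estimate - ground truth| / |ground truth|."""
    return abs(estimate - ground_truth) / abs(ground_truth)


def kalman_smooth(measurements, process_var=1e-4, measurement_var=1e-2):
    """Scalar Kalman filter for a (nearly) constant quantity.

    Assumption: the state is modelled as constant with small process noise;
    both variances are illustrative values.
    """
    if not measurements:
        return []
    estimate = measurements[0]
    variance = measurement_var
    smoothed = [estimate]
    for z in measurements[1:]:
        variance += process_var                       # predict
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)             # update with measurement
        variance *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


# Hypothetical per-frame height estimates (metres) for a 1.70 m person.
raw = [1.74, 1.66, 1.73, 1.68, 1.71, 1.69]
filtered = kalman_smooth(raw)
print(f"raw error:      {relative_error(raw[-1], 1.70):.3%}")
print(f"filtered error: {relative_error(filtered[-1], 1.70):.3%}")
```

Because each update is a weighted average of the previous estimate and the new measurement, the filtered value always stays within the range of the observed measurements.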
The estimation error of each HPI feature varies from 1.2% to 8.1%. The person's real height is estimated from the detection result of YOLOV5 [21], where the bounding box of the target person is quite accurate. In contrast, the shoulder width, step length, and step frequency are estimated from the skeleton points from OpenPose [23], whose positions become unstable due to the motion blur in the test images; the estimation error of these features is therefore higher than that of the person's real height.

Comparative Experiments on CASIA-B Dataset
To confirm the effectiveness of the proposed method, the comparative experiment was performed on the CASIA-B dataset among the proposed method and the other four methods: GaitSet [5], GaitPart [6], CNN-LB [8], and GaitNet [25]. Here, as shown in Table 3, 50 persons were selected as the target persons and the detailed training information could be found in the aforementioned section. All the target persons contained 11 view angles under three conditions: normal walking (NM), carrying a bag or backpack (BG), and wearing coats or jackets (CL). It is obvious that the proposed method achieved superior performance to the other methods over most view angles under all conditions (rank-1 under NM, BG, and CL conditions). The mean value of the recognition accuracy in the proposed method was 96.5% under NM, 92.6% under BG, and 79.8% under CL. Such stable ranking over all 11 view angles under three conditions could prove the effectiveness of the proposed methods. The superior performance of the proposed method lies in the fact that, compared with the other works such as GaitPart (extracting features from three or a fixed number of frames), more effective gait features are extracted by the LSTA and GLTA modules, where both the local three continuous frames and the whole gait cycle are processed. This is because the movement frequency of different people may change greatly, and extracting the gait feature at a fixed image interval may not produce enough information for recognition, while the proposed method can obtain more useful feature from the whole gait cycle. Therefore, it is natural that the more powerful gait features (obtained by LSTA and GLTA) could help to improve the gait recognition accuracy.
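For readers unfamiliar with the metric, rank-1 accuracy can be sketched as a nearest-neighbour search of each probe in the gallery. The Euclidean distance, the function name, and the toy features below are assumptions of this example; the last line only re-derives the overall mean from the three per-condition means quoted above.

```python
import math


def rank1_accuracy(gallery, probes) -> float:
    """Fraction of probes whose nearest gallery sample has the same identity.

    `gallery` and `probes` are lists of (subject_id, feature_vector) pairs.
    """
    if not gallery or not probes:
        raise ValueError("gallery and probes must be non-empty")
    correct = 0
    for probe_id, probe_feature in probes:
        nearest_id, _ = min(gallery,
                            key=lambda item: math.dist(item[1], probe_feature))
        correct += nearest_id == probe_id
    return correct / len(probes)


# Toy example with two identities and 2D features.
gallery = [("a", (0.0, 0.0)), ("b", (10.0, 10.0))]
probes = [("a", (1.0, 0.0)), ("b", (9.0, 9.0)), ("a", (8.0, 8.0))]
print(f"rank-1 = {rank1_accuracy(gallery, probes):.3f}")

# Mean over the three walking conditions reported for the proposed method.
print(round((96.5 + 92.6 + 79.8) / 3, 1))  # 89.6
```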

Comparative Experiment on the Multi-State Gait Dataset
Since the CASIA-B [20] dataset does not contain the necessary information such as camera setting and internal parameters, which is required by the HPI module, this paper performed another comparative experiment on the Multi-state Gait dataset to confirm the effectiveness of the proposed method. Here, GaitSet [5] and GaitPart [6] are selected as the compared baseline methods due to their good performance on the CASIA-B dataset (shown in Table 3). Here, 26 persons are selected as the tested targets from 14 view angles under three conditions (NM, BG, and CL). In Table 4, "ours_1" means the LSTA+GLTA+ATFA in the proposed work, while "ours_2" represents the method containing all the proposed modules (LSTA+GLTA+ATFA+HPI+PFE). Among all the 14 view angles, "ours_2" ranked number 1 for nine angles under NM and BG conditions and ranked number 1 for 10 angles under the CL condition. From all 14 angles under three conditions, the average recognition accuracy of "ours_2" is superior to all the compared methods, which implies that introducing the HPI information with the PFE module could efficiently improve the performance in gait recognition tasks. This is because, compared with the silhouette images which may become completely different due to the variation of camera viewing angles, a person's real physiological information could hardly change (despite different viewing angles). Therefore, it is not strange that introducing such stable personal gait features will improve the performance of a gait recognition method. Experiments in the following section prove that such an idea is also suitable for the other compared methods. In addition, the FLOPs of the proposed model and compared baseline works are also calculated to measure their computational complexity. Under the same data input, the FLOPs of the proposed model (145 M) are in between GaitSet [5] and GaitPart [6], which is 20% lower than GaitSet (183 M) and 37% higher than GaitPart (106 M). 
Since GaitSet contains a more complex network structure (including feature pyramid structure) than the proposed work, it is reasonable that the computation cost of the proposed method is less than that of GaitSet. Compared with GaitPart, besides the similar network to extract features for short-term, the proposed work contains more complex structures, such as GLTA, ATFA, and PFE, to compute the long-term gait feature and real human physiological feature. Therefore, the proposed method could achieve better recognition accuracy at the cost of more computational complexity than GaitPart.
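The relative FLOPs figures can be re-derived from the three reported values; the short check below uses only those numbers, and the helper name `relative_change` is hypothetical.

```python
def relative_change(value: float, reference: float) -> float:
    """Signed change of `value` relative to `reference`."""
    return (value - reference) / reference


flops_m = {"GaitSet": 183, "GaitPart": 106, "proposed": 145}

vs_gaitset = relative_change(flops_m["proposed"], flops_m["GaitSet"])
vs_gaitpart = relative_change(flops_m["proposed"], flops_m["GaitPart"])
print(f"vs GaitSet:  {vs_gaitset:+.1%}")   # -20.8%
print(f"vs GaitPart: {vs_gaitpart:+.1%}")  # +36.8%
```

Both results agree with the rounded values quoted in the text (about 20% lower than GaitSet and about 37% higher than GaitPart).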

Ablation Study
Besides the overall performance evaluation, the effectiveness of each module in the proposed method was also investigated.
Firstly, the validity of each module of the gait silhouette part was verified on the CASIA-B [20] dataset. As shown in Table 5, the LSTA module is taken as the baseline, and the GLTA and ATFA modules are then added to verify how their combination improves the performance of the method. These experiments show that directly applying the LSTA module for gait recognition leads to a result similar to that of the well-known GaitPart [6], while the combination of LSTA and GLTA improves the performance under all conditions, with the mean recognition accuracy reaching 89.2%. By combining LSTA with GLTA and ATFA, the performance of the proposed work is further improved to 89.6% in the mean recognition rate. The improvements caused by introducing the GLTA and ATFA modules are 0.8% and 0.4%, respectively, which indicates that, compared with the adaptive adjustment of feature weights (by ATFA), the global gait feature is more useful for improving the performance of the proposed method.
Since the effectiveness of HPI and PFE cannot be verified on CASIA-B [20], the Multi-state Gait dataset was used to examine these two modules. In Table 6, the contribution of the HPI and PFE modules to the gait recognition accuracy was investigated. Here, "Baseline" means the LSTA + GLTA + ATFA modules in the proposed method, "Baseline + HPI" represents directly applying the obtained real human physiological information in the silhouette-based network, and "Baseline + HPI + PFE" denotes the combination of the baseline method with the human physiological features extracted through the PFE network. The performance of the baseline method is quite similar to that of the well-known GaitPart [6] work, and directly applying the HPI information only slightly improves the gait recognition accuracy. In this experiment, the real human physiological features obtained through HPI+PFE achieve the best improvement in gait recognition.

Transplantation Study
Besides the proposed method, as shown in Table 7, the authors also investigated whether the HPI and PFE modules could help to increase the accuracy of other gait recognition methods. Here, according to the network structure of GaitPart [6], a weight parameter γ is introduced for the real personal physiological features, so that the ratio of the gait silhouette feature length to the real personal physiological feature length is set to 32:1. It is interesting to see that, with the help of the HPI and PFE modules proposed in this paper, the average recognition accuracies of GaitSet [5] and GaitPart [6] increased by 0.53% and 0.47%, respectively. This supports the idea that the unique real human physiological information can help to improve the performance of a gait recognition algorithm.

Discussion
In the future, several improvements should be considered: (1) Experiments on other large public datasets (such as the OUMVLP Dataset) should be performed. Due to the limitations of hardware, the authors could not test the proposed method on such large datasets. It is believed that such experiments could be carried out with more powerful GPU hardware. (2) How to extract more accurate HPI features should be investigated. Currently, since only monocular images are used, the skeleton points of a person may be invisible due to the variation of view angles. The 3D skeleton points are considered to be a solution to this problem, and such points could be obtained through an RGB-D camera, stereo vision, or 2D-3D neural networks applied to successive frames. (3) Clearer test images should be used in future research. As motion blur has caused many experimental errors in this work (because the estimation of skeleton points becomes unstable), the proposed method should be applied to test images obtained through high-speed cameras rather than normal ones (such as those with a frame rate of 20 fps).

Conclusions
In this paper, a new gait recognition method was proposed, which is based on the fusion of gait silhouette features and real personal physiological features. To deal with the variation of gait frequency among different people, both the short-term (three frames) and long-term (whole gait cycle) gait features are extracted by the novel LSTA and GLTA modules to improve the recognition accuracy. To cope with the appearance variation of silhouette images under different viewing angles, the real human physiological information calculated from monocular images is used to provide more robust gait features. The final gait recognition is achieved by reranking the feature vectors obtained by concatenating the features from LSTA, GLTA, and the human physiological information. The effectiveness and efficiency of the proposed method were demonstrated through extensive comparative experiments between the proposed method and other well-known algorithms on both the public dataset and the newly collected Multi-state Gait dataset. Since the proposed method is mainly designed for intelligent security monitoring systems, its performance depends on factors such as the image resolution and the camera capture speed. This is because a low image resolution leads to more estimation error for the skeleton points, and a low capture speed causes motion blur, which affects not only the estimation of skeleton points but also the quality of the silhouette image. One direction of our future work is to introduce a high-speed camera and to carry out experiments in more real-life scenes.