Gait Recognition Method of Underground Coal Mine Personnel Based on Densely Connected Convolution Network and Stacked Convolutional Autoencoder

Biometric recognition methods often use biological characteristics such as the human face, iris, fingerprint, and palm print; however, in the complex underground environment, images of these characteristics often become blurred, which leads to low identification rates for underground coal mine personnel. A gait recognition method based on similarity learning, named the Two-Stream neural network (TS-Net), is proposed based on a densely connected convolution network (DenseNet) and a stacked convolutional autoencoder (SCAE). The mainstream network, based on DenseNet, mainly learns the similarity of dynamic deep features containing spatiotemporal information in the gait pattern. The auxiliary stream network, based on SCAE, learns the similarity of static invariant features containing physiological information. Moreover, a novel feature fusion method is adopted to achieve the fusion and representation of dynamic and static features. The extracted features are robust to viewing angle, clothing, miner hats, waterproof shoes, and carrying conditions. The method was evaluated on the challenging CASIA-B gait dataset and on a collected gait dataset of underground coal mine personnel (UCMP-GAIT). Experimental results show that the method is effective and feasible for the gait recognition of underground coal mine personnel and that, compared with other gait recognition methods, the recognition accuracy is significantly improved.


Introduction
Gait recognition is a new biometric recognition technology that can identify people based on their walking posture [1]. Gait recognition has a unique advantage over other biometric technologies, namely its recognition potential at long distances or with low video quality. In addition, gait is difficult to hide or camouflage and does not require the cooperation of the subject [2]. In the dark, in particular, infrared gait recognition technology remains effective. At present, the identification of people in coal mines is mostly based on faces and fingerprints. Although face and fingerprint recognition technologies have matured and achieve high recognition rates in normal environments, faces and fingerprints are often blurred by the limited space, dim light, moist air, and coal dust in the roadway, which seriously affects the recognition rates of these identification methods [3]. The gait recognition method places few requirements on illumination, video quality, or distance, which is well suited to the environmental characteristics of underground coal mines. By identifying and monitoring gait images of personnel, the identity of underground operating personnel can be determined promptly and accurately. This is of great significance for realizing mine safety monitoring and personnel identity positioning. In [4], gait features are extracted in the form of interval-valued symbolic features using the distance relationship between the minimum inertia axis and the outermost contour. In [5], the author proposed a view-invariant gait recognition method based on Kinect skeleton features. The author of [6] uses gait energy image (GEI)-based local multi-scale feature descriptors for human gait recognition. Reference [7] proposed a gait recognition method based on a static energy map and a dynamic group hidden Markov model, which has a certain robustness to angle changes and reduces the impact of noise on recognition.
In [8], the time-frequency domain features of the gait image were extracted using wavelet transform and time-frequency analysis methods to perform gait recognition. The author of [9] proposed a gait recognition method based on tensor discriminant analysis and Gabor feature extraction. This method is a tensor-based extension of linear discriminant analysis (LDA). Its advantage is that it does not need to convert gait images into vectors; therefore, the "small sample" problem is overcome.
With the rise and maturation of deep learning methods such as the residual neural network (ResNet) [10] and the generative adversarial network (GAN) [11], deep learning has become one of the most popular approaches to gait recognition. A cross-view gait recognition method based on a deep convolutional neural network proposed by Huang et al. [12] of the Institute of Automation of the Chinese Academy of Sciences can perform multi-view recognition with improved accuracy. Chen et al. [13] proposed GaitGAN, a gait recognition method based on GAN. This method uses a GAN to transform gait images at any viewing angle and in any state into gait images at a 90° normal walking state, with high recognition accuracy and fast speed. Fudan University proposed the GaitSet algorithm based on gait contour images [14]. Regarding gait contours as a set of images without time-series relations, instead of deliberately modeling the time series of gait contours, the study lets the deep neural network optimize itself to extract and use this relationship. In [15], the author developed efficient spatiotemporal gait features with deep learning. The extracted spatial and temporal gait features are embedded into the null space to obtain the similarity of gait image pairs and thereby achieve gait recognition. In [16], the author used CNNs and multiple loss functions to extract gait features from silhouette sequences and GEIs, respectively. This method can better extract appearance-based spatiotemporal information. Wang et al. [17] proposed a deep-learning-based gait recognition method named the Two-branch Convolution Neural Network. This method uses two kinds of CNN models to extract gait features and then trains an SVM classifier with the output of each CNN model to achieve gait recognition.
In [18], the author proposed an end-to-end system based on a pre-trained DenseNet-201 [19] model for feature extraction to realize gait recognition. This method achieved a high recognition rate.
To address the problems that existing gait recognition methods have low accuracy and cannot simultaneously extract the dynamic and static features of human walking, we propose a Two-Stream neural network model based on similarity learning [20,21]. Our proposed model takes the gait energy image (GEI) as the model input [22]. The GEI contains both human physiological information and spatiotemporal information during walking: the gait image sequence is summed and averaged into one gait picture, as shown in Figure 2.
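The GEI computation described above — averaging an aligned silhouette sequence into a single image — can be sketched as follows. This is a minimal illustration; the silhouette preprocessing and gait-cycle segmentation used in practice are not shown.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Compute a gait energy image (GEI) by averaging a sequence of
    aligned binary silhouette frames over one gait cycle.

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    Returns an (H, W) float image with values in [0, 1].
    """
    frames = np.asarray(silhouettes, dtype=np.float64)
    return frames.mean(axis=0)

# Toy example: three 4x4 "silhouettes".
seq = np.stack([
    np.eye(4),             # frame 1
    np.fliplr(np.eye(4)),  # frame 2
    np.ones((4, 4)),       # frame 3
])
gei = gait_energy_image(seq)
print(gei.shape)  # (4, 4)
```

Pixels that are part of the body in every frame approach 1 (static regions such as the torso), while pixels covered only part of the time take intermediate values (dynamic regions such as the swinging limbs), which is why the GEI carries both static and dynamic information.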

Overview of the Proposed TS-Net Model
The proposed Two-Stream neural network (TS-Net) model simultaneously extracts dynamic deep features and static invariant features from gait images, and it fuses multi-resolution static invariant features with dynamic deep features to obtain the most discriminating spatiotemporal information. In addition, the recognition task is transformed into a binary classification problem through similarity learning, which enables more accurate personnel recognition. As shown in Figure 3, the TS-Net model proposed in this paper mainly consists of three parts: dynamic and static feature extraction, feature fusion, and recognition. Dynamic and static feature extraction consists of two parallel networks: the mainstream network based on DenseNet [19] and the auxiliary stream network based on a stacked convolutional autoencoder (SCAE) [23,24]. In the mainstream network, dynamic deep features are extracted from the gait image, representing the more macroscopic and abstract spatiotemporal information of the gait image. In the auxiliary stream network, static invariant features are extracted from gait image samples, representing physiological information such as the body shape and head shape of the low-dimensional gait image.
In the dynamic and static feature extraction process, the gait features extracted from the auxiliary stream network are integrated into the mainstream network, and the final gait image features are obtained through the mainstream network to learn the gait similarity and then predict whether the input image pair belongs to the same person. Finally, during testing, an image pair consisting of the probe view and the gallery view is input to the network, and the similarity of the image pair is obtained to realize gait recognition. The three parts of the TS-Net model are described in detail next.

Mainstream Network
Gait images are high-dimensional, complex, and changeable non-linear data. To extract discriminative spatiotemporal information from gait images, it is necessary to build a deeper network. Studies show that increasing the number of layers in a network helps extract more hierarchical features: the deeper the network, the better its expressive ability. In practical applications, however, as the number of stacked layers increases, training and convergence become difficult. Although methods such as batch normalization (BN) can alleviate the vanishing and exploding gradient problems [25,26], the performance of the network will still decline. This degradation is not caused by overfitting; rather, the network becomes too deep to train effectively. Therefore, the traditional neural network model cannot build a sufficiently deep network.
A densely connected convolution network (DenseNet) can build deeper networks. Traditional neural networks have only one connection between each layer and the subsequent layer; that is, a convolutional network with L layers has L connections. To improve the information flow between layers, DenseNet instead connects each layer to every subsequent layer in a feed-forward manner [27-29], so the network has L(L + 1)/2 dense connections [19]. For each layer, the inputs are the feature maps of all preceding layers, and its own feature maps are used as inputs to all subsequent layers, as shown in Figure 4. Compared with the shortcut connections of the ResNet model, the special connection structure of DenseNet effectively promotes feature reuse, which achieves better performance than ResNet with fewer parameters and lower computational cost. This connection structure also strengthens feature propagation, thereby alleviating the problem of vanishing gradients.

In DenseNet, the output x_l of the l-th layer can be described by Equation (1):

x_l = H_l(Concat(x_0, x_1, ..., x_{l-1})), (1)

where x_0, x_1, ..., x_{l-1} are the feature maps from layers 0, 1, ..., l - 1, respectively; Concat is a connection function that concatenates feature maps along the channel dimension; and H_l(·) is a non-linear transformation function representing a series of operations, including BN, ReLU, pooling, and convolution with the weights W_l.
The mainstream network mainly learns the similarity of dynamic deep features in gait images; that is, it uses DenseNet to learn the changes in stride, knee bending angle, arm swing amplitude, and body center of gravity during walking [19]. The architecture is shown in Figure 5.
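The dense connectivity of Equation (1) — each layer receiving the channel-wise concatenation of all earlier feature maps and contributing its own output to every later layer — can be illustrated with a toy NumPy sketch. Here `toy_H` merely stands in for the real H_l (BN-ReLU-Conv) and produces one new feature map per layer (growth rate 1); it is an illustrative simplification, not the paper's layer.

```python
import numpy as np

def toy_H(x):
    """Stand-in for the non-linear transformation H_l (really BN-ReLU-Conv);
    here it just produces one new feature map from the concatenated input."""
    return np.maximum(x.mean(axis=0, keepdims=True), 0.0)  # shape (1, H, W)

def dense_block(x0, num_layers):
    """Layer l receives Concat(x_0, ..., x_{l-1}) along the channel axis,
    and its output is appended for all subsequent layers to reuse."""
    features = [x0]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)  # channel-wise concatenation
        features.append(toy_H(inp))
    return np.concatenate(features, axis=0)

x0 = np.ones((2, 4, 4))            # 2 input channels, 4x4 feature maps
out = dense_block(x0, num_layers=4)
print(out.shape[0])                # 2 input channels + 4 new ones = 6
# An L-layer block has L(L + 1)/2 direct connections:
L = 4
print(L * (L + 1) // 2)            # 10
```

The channel count of the output grows linearly with depth (input channels plus one per layer here), which is why DenseNet needs compression layers between blocks to keep the model compact.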
The input of the network is a fixed-size 128 × 128 gait image (in order to better adapt to the network, we resize the 240 × 240 images to 128 × 128). The input layer contains one convolutional layer that applies a 7 × 7 kernel with a stride of 2 pixels and one max-pooling layer that applies a 3 × 3 sliding window with a stride of 1 pixel. The purpose of the input layer is to extract multi-scale basic visual features and to reduce the image size, thereby reducing the number of network parameters. Each DenseBlock layer contains DenseLayers (set to 2, 4, 8, and 6, respectively). Each DenseLayer contains two convolutional layers that apply a 1 × 1 kernel with a stride of 1 pixel and a 3 × 3 kernel with a stride of 1 pixel; that is, BN-ReLU-Conv (1 × 1)-BN-ReLU-Conv (3 × 3). The compression layer uses a 1 × 1 convolution kernel with a stride of 1 pixel and a 2 × 2 avg-pooling kernel with a stride of 2 pixels to reduce the image to half its size. The purpose of the compression layer is to adjust the dimensions and further improve the compactness of the model. In this way, the size of the feature maps passed from the auxiliary stream network to the mainstream network is reduced on the one hand, and the number of feature maps input to the next DenseBlock is reduced on the other. The output layer applies an average pooling layer with an 8 × 8 sliding window and a 62-dimensional fully connected layer to obtain the final gait features. Finally, the similarity of the gait images is obtained through the sigmoid function.
In the training task of gait recognition, because the network inputs a pair of gait images to learn their similarity, the training sample label is 1 (positive sample) or 0 (negative sample). A value of 1 means that the two gait images are from the same person, and a value of 0 means that the two gait images are from different people. Therefore, our network uses a binary cross-entropy loss function to calculate the loss, as in Equation (2):

L = -(1/N) Σ_{i=1}^{N} [y_i · log(p(y_i)) + (1 - y_i) · log(1 - p(y_i))], (2)

where N represents the number of samples, y_i represents the label value of sample i, and p(y_i) represents the predicted probability of the label value of sample i.
In the TS-Net model, the Adam [30] stochastic optimization algorithm is used to update the parameters. Adam is an efficient optimization algorithm because it considers the first moment estimate (mean of the gradient) and the second moment estimate (variance of the gradient) together, which makes the back-propagation algorithm easier to execute.
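The binary cross-entropy loss of Equation (2) over a batch of image-pair labels can be sketched as follows (the function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy: -1/N * sum(y*log(p) + (1-y)*log(1-p)).
    y_true: 0/1 labels (same person or not); y_pred: sigmoid outputs."""
    y_true = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Confident, correct predictions give a small loss;
# confident wrong predictions would give a large one.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
```

Clipping the predicted probabilities away from exactly 0 and 1 is a standard numerical safeguard so that the logarithms stay finite.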

Reducing Overfitting
In general, the deeper the model, the more parameters must be learned, making it easier to overfit. We applied the dropout [31,32] method before the last fully connected layer to address this problem. Dropout means that during training, the input and output neurons are unchanged, while hidden neurons are temporarily dropped from the network with a certain probability; that is, their outputs are set to zero with that probability. The neurons that are "dropped out" participate in neither forward propagation nor backward propagation. The network thus samples a different architecture for each input, but these different architectures share identical weights. Therefore, dropout can effectively prevent overfitting of the training data. In this paper, we set the dropout rate to 50% (usually 30% or 50%, chosen empirically in practical applications).
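The dropout behaviour described above can be sketched as follows. This uses the common "inverted dropout" formulation, which rescales surviving activations at training time so the expected activation is unchanged; whether the authors rescale this way is not stated in the text.

```python
import numpy as np

def dropout(x, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero each hidden activation with probability
    `rate` during training and scale survivors by 1/(1-rate) so the
    expected activation is unchanged; act as the identity at test time."""
    if not training or rate == 0.0:
        return x
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate  # True = neuron kept
    return np.where(mask, x / (1.0 - rate), 0.0)

x = np.ones((4, 8))
y = dropout(x, rate=0.5)  # entries are 0.0 (dropped) or 2.0 (kept)
print(np.array_equal(dropout(x, training=False), x))  # True: identity at test time
```

Because each forward pass samples a fresh mask, the network effectively trains an ensemble of thinned sub-networks that share weights, which is the regularizing effect the paragraph describes.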

Auxiliary Stream Network
After the mainstream network learns the similarity of dynamic deep features in the gait image, the auxiliary stream network is used to learn the similarity of static invariant features, including static information such as body shape, head shape, and shoulder width. The auxiliary stream network is designed based on a stacked convolutional autoencoder (SCAE). The architecture used in the experiments is shown in Figure 6.

Auxiliary Stream Network
After the mainstream network learns the similarity of dynamic deep features in the gait image, the auxiliary stream network is used to learn the similarity of static invariant features, including static information such as body shape, head shape, and shoulder width. The auxiliary stream network is designed based on a stacked convolutional autoencoder (SCAE). The architecture during the experiment is shown in Figure 6. SCAE is composed of multiple convolutional autoencoders (CAE). CAE is a neural network designed to copy input to output. The network is divided into two parts: an encoder and decoder. The encoder compresses the input into a latent spatial representation, and the decoder is used to reconstruct this representation [33,34]. The purpose of CAE is to extract the most representative information to represent the original image-that is, the process of image dimensionality reduction. Compared with traditional dimensionality reduction methods such as PCA, the information extracted by the convolutional neural network is more effective and representative with the better recovery effect. The encoder network can be expressed by a neural network function passed by the activation function. The encoder network is defined as: where denotes the hidden dimension of the encoder. denotes the weight of the encoder network. denotes the bias, and represents the non-linear activation function. Similarly, the decoder network can be expressed in the same way. However, different weights, bias, and activation functions are applied, and it is defined as follows: Here, ′ denotes the hidden dimension of the decoder. W ′ denotes the weight of the decoder network. ′ denotes the bias, ′ represents the non-linear activation function, and represents the hidden dimension of the encoder.
The higher the similarity between the original data and the reconstructed data, the more effective the features extracted by the network. Therefore, we update the network by decreasing the discrepancy between input and output. The auxiliary stream network uses the mean squared error loss function to calculate the loss of the original gait image and the reconstructed gait image, which is expressed as: where and denote the pixel values corresponding to the i-th row and the j-th column of the original gait image and the reconstructed gait image. and denote the number of rows and columns of the input data, respectively. The term 2 || || 2 is used for the weight decay.
In this paper, the auxiliary stream network is composed of three CAEs that have one hidden layer. During the training process, each CAE is individually trained. The output of the previous CAE SCAE is composed of multiple convolutional autoencoders (CAE). CAE is a neural network designed to copy input to output. The network is divided into two parts: an encoder and decoder. The encoder compresses the input into a latent spatial representation, and the decoder is used to reconstruct this representation [33,34]. The purpose of CAE is to extract the most representative information to represent the original image-that is, the process of image dimensionality reduction. Compared with traditional dimensionality reduction methods such as PCA, the information extracted by the convolutional neural network is more effective and representative with the better recovery effect. The encoder network can be expressed by a neural network function passed by the activation function. The encoder network is defined as: where z denotes the hidden dimension of the encoder. W denotes the weight of the encoder network. b denotes the bias, and σ represents the non-linear activation function. Similarly, the decoder network can be expressed in the same way. However, different weights, bias, and activation functions are applied, and it is defined as follows: Here, x denotes the hidden dimension of the decoder. W denotes the weight of the decoder network. b denotes the bias, σ represents the non-linear activation function, and z represents the hidden dimension of the encoder.
The higher the similarity between the original data and the reconstructed data, the more effective the features extracted by the network. Therefore, we update the network by decreasing the discrepancy between input and output. The auxiliary stream network uses the mean squared error loss function to calculate the loss of the original gait image and the reconstructed gait image, which is expressed as: where x ij and y ij denote the pixel values corresponding to the i-th row and the j-th column of the original gait image and the reconstructed gait image. u and v denote the number of rows and columns of the input data, respectively. The term λ 2 ||W|| 2 is used for the weight decay. In this paper, the auxiliary stream network is composed of three CAEs that have one hidden layer. During the training process, each CAE is individually trained. The output of the previous CAE is used as the input of the next CAE to achieve the purpose of "Each layer iterates; only a single layer updates". In this way, the training revenue of the next CAE will be very high, because the input is the mapped features from the previous CAE training.
The auxiliary stream network extracts hierarchical features from gait image samples. During feature extraction, as the number of layers increases, the resolution of the feature map becomes smaller and the blurriness gradually increases; however, the extracted static invariant features become increasingly obvious. The resemblance between the original image and the recovered image indicates that the extracted features retain the most significant information. The reconstruction and visualization process is shown in Figure 7.

Feature Fusion and Recognition
In this paper, a novel feature fusion method is used, as shown in Figure 8. We feed the multi-scale static invariant features extracted by the auxiliary stream network into the mainstream network at the corresponding scales. Compared with the traditional feature fusion method, which adds different features directly, the proposed method achieves feature reuse. As the depth of the model increases, the proportion of distinctive features grows larger and the proportion of indistinguishable features grows smaller, so the similarity of an image pair can be judged more accurately. The mainstream network fuses the features extracted by itself with the features extracted by the auxiliary stream network; the final gait features include both the dynamic and static characteristics of human walking. Experiments show that this feature fusion method is very effective. During the training task, the gait feature vector obtained by the mainstream network represents the similarity of a pair of gait images. We use the sigmoid function to convert the feature vector value into a value between 0 and 1; that is, the recognition task is transformed into a binary classification problem through similarity learning. If the value is greater than 0.5, the pair is judged as a positive example, meaning the pair of gait images comes from the same person; if less than 0.5, it is judged as a negative example, meaning the pair comes from different people. In the test task, we combine the probe view with each gallery view to form image pairs.
Through the TS-Net model, we obtain the similarity of each pair of gait images. The person with the highest number of positive examples, that is, the person with the highest similarity, is the final recognition result. To better describe our model, the pseudocode for training and testing is shown in Algorithm 1.

Algorithm 1 TS-Net Model.
Training: Input data: image pair (X1, X2) randomly selected from the training set.
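The test-time decision rule described above (threshold the sigmoid output at 0.5, then vote across gallery views) can be sketched as follows. This is a hedged outline: `similarity` stands in for the sigmoid output of the trained TS-Net, and the numeric toy scoring below is only a placeholder.

```python
def identify(probe, gallery, similarity):
    """Return the gallery ID with the most positive examples for this probe.

    gallery: {person_id: [gallery_view, ...]}
    similarity(a, b) -> value in (0, 1); a pair is a positive example when > 0.5.
    """
    votes = {}
    for person, views in gallery.items():
        votes[person] = sum(1 for v in views if similarity(probe, v) > 0.5)
    # Tie-breaking policy is an assumption; max() keeps the first best entry
    return max(votes, key=votes.get)

# Toy stand-in: views are numbers; "similar" means numerically close
toy_similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
gallery = {"A": [0.9, 1.1, 1.0, 0.95], "B": [5.0, 5.2, 4.9, 5.1]}
print(identify(1.05, gallery, toy_similarity))  # the probe votes for "A"
```

In the actual pipeline, each probe GEI is paired with the four gallery GEIs of every subject, and the subject accumulating the most positive pairs is the recognition result.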

CASIA-B Dataset
First, we used the CASIA-B dataset [35], one of the largest public gait datasets, created by the Institute of Automation, Chinese Academy of Sciences in 2005, to test the recognition performance of the proposed TS-Net model. The database contains 124 subjects (93 male and 31 female). The viewing angle range from 0° to 180° is divided into 11 different angles at 18° intervals. Each subject was recorded in three walking states, comprising six normal walking sequences (NM), two walking-with-a-bag sequences (BG), and two walking-with-a-coat sequences (CL), as shown in Figure 9.

UCMP-GAIT Dataset
There is currently no public gait dataset for underground coal mine personnel (UCMP-GAIT). Therefore, in order to further verify the feasibility of the model for the gait recognition of coal miners, we collected the gait data of 30 coal miners (all male, as underground workers are usually male). The UCMP-GAIT dataset is constructed as shown in Figure 10. The gait behavior of underground coal mine personnel is related to work content, environment, and dress. Therefore, the dataset contains 10 workers from each of three types of work: coal miners, hydraulic support workers, and shearer drivers. Each subject was recorded from 3 angles (18°, 54°, 90°) and contains 2 walking sequences. One was taken in the coal mine examination room (sufficient light, wide space) and is used for the gallery views; the other was taken in the underground coal mine (dim light, limited space, wet, coal dust) and is used for the probe views.

Experimental Design
The experiments involve the three walking states in the CASIA-B dataset: "NM", "BG", and "CL". We used the six "NM" sequences, two "BG" sequences, and two "CL" sequences of the first 62 subjects (001-062) as the training set; the remaining 62 subjects (063-124) were used as the test set. In the test set, the first four "NM" sequences of each subject are used as the gallery view, and the remaining two "NM" sequences, two "BG" sequences, and two "CL" sequences are used as the probe views to test the performance of the model under different walking states.
In the UCMP-GAIT dataset, all gait images of 30 coal miners were used to test the model. The gait image sequence captured in the coal mine examination room is used as the gallery view, and the sequence captured in the underground coal mine is used as the probe view.


Model Parameters
We set the batch size to 64. In addition, we use the Gaussian distribution with a mean of 0 and a standard deviation of 0.01 to initialize the weights of each layer. All biases are initialized to 0. In order to make the network converge better, we set the learning rate to 0.0001. We determine the number of iterations based on the recognition results on the validation set. The parameters are shown in Table 1.
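The initialization scheme described above (Gaussian weights with mean 0 and standard deviation 0.01, all biases zero) can be sketched as follows; the layer sizes are placeholders, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(fan_in, fan_out):
    # Weights drawn from N(0, 0.01^2), biases initialized to zero
    W = rng.normal(loc=0.0, scale=0.01, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

W, b = init_layer(256, 128)
print(W.std())  # empirical std, close to 0.01
```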

Mainstream Network Parameters
An overly deep network with too many feature maps makes the model too complicated, requires too many parameters, and takes too long at inference. An overly shallow network with too few feature maps prevents the model from learning discriminative features in the gait images and yields poor recognition results. Therefore, the optimal parameter settings, obtained through multiple experiments, are shown in Table 2. Each "conv" in the table corresponds to the BN-ReLU-Conv mode used in the experiments.
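The BN-ReLU-Conv (pre-activation) ordering behind each "conv" in Table 2 can be sketched as a single-channel numpy illustration. The real layers operate on multi-channel feature maps with learned BN parameters; the shapes and the all-ones kernel here are assumptions for the example.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize across the batch axis (gamma = 1, beta = 0 for brevity)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(0.0, x)

def conv2d(img, kernel, stride=1):
    # Naive valid convolution for one channel: img (H, W), kernel (k, k)
    k = kernel.shape[0]
    h = (img.shape[0] - k) // stride + 1
    w = (img.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i*stride:i*stride+k, j*stride:j*stride+k] * kernel)
    return out

def bn_relu_conv(batch, kernel, stride=1):
    # Pre-activation order: normalize, activate, then convolve
    x = relu(batch_norm(batch))
    return np.stack([conv2d(img, kernel, stride) for img in x])

batch = np.random.default_rng(1).random((4, 16, 16))   # 4 images, 16x16
out = bn_relu_conv(batch, np.ones((3, 3)), stride=1)
print(out.shape)  # (4, 14, 14)
```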

Auxiliary Stream Network Parameters
In the experiment, the auxiliary stream network parameters are divided into two parts: an encoder and a decoder. The encoder consists of 3 layers, and the decoder also consists of 3 layers. The detailed parameters are shown in Table 3, which lists, for each convolutional layer, the number of filters, filter size, stride, batch normalization, and activation function.

Experimental Results
Firstly, we conducted experiments on the CASIA-B test set (62 subjects). In order to evaluate the robustness of the TS-Net model, three variations covering view, clothing, and carrying objects were evaluated. There are 11 views in the database, for a total of 121 view pairs. We combined the probe view with the same-angle view of all subjects (62 subjects) in the gallery set (4 per subject) to compose image pairs, which were input into our proposed model to obtain the similarity (a value between 0 and 1; closer to 1 indicates more similar). The experimental results are shown in Tables 4-6. Each row and each column in the tables correspond to the angle of the gallery view and the probe view, respectively. Secondly, in order to better verify the performance of the model, we interchanged the test set and the training set: the first 62 subjects (001-062) were used as the test set, and the remaining 62 subjects (063-124) as the training set. The experimental results are shown in Table 7; the model performs stably and still achieves a high recognition rate. Finally, we conducted experiments on the UCMP-GAIT dataset; the identity recognition rates are shown in Table 8. Gait recognition is little affected by environmental factors such as light and distance. Compared with the subjects in CASIA-B, the personnel in the coal mines wear miner hats on their heads, carry tool bags on their bodies, and wear waterproof shoes on their feet, as shown in Figure 11. Nevertheless, our model still achieves a high recognition rate, indicating that the proposed gait recognition method is also robust to these features unique to underground coal mine personnel. Figure 11. Gait images of underground coal mine personnel.

Compared with State-of-the-Art Methods
We compare the proposed TS-Net model with the latest gait recognition methods on the CASIA-B dataset, including deep convolutional neural networks (CNNs) [12], principal component analysis (GEI + PCA) [22], a generative adversarial network (GaitGAN) [13], and a perspective transformation model (SPAE) [33].
Firstly, we compared the recognition rates without angle changes, that is, when the angle of the probe view and the gallery view are the same. The average recognition rate is obtained from the recognition rates on the diagonals of Tables 4-6; the corresponding averages for CNNs, GaitGAN, GEI + PCA, ResNet, and SCAE are obtained in the same way. The proposed method achieves a high recognition rate, as shown in Figure 12. Especially in the BG and CL cases, the recognition rates are 92.37% and 73.9%, respectively, which are significantly higher than those of the other methods.

Secondly, we compared the recognition accuracy in the cross-view case, that is, when the perspectives of the probe view and the gallery view are different. We selected the three walking conditions when the probe view is 36°, 72°, 108°, and 144°; the comparison results are shown in Figure 13. The results show that our proposed method is significantly better than these methods, regardless of whether the viewing perspective changes.
Thirdly, we compared the overall recognition rate of the model with the SPAE [33] and MGAN [36] methods, as shown in Table 9. The results show that our model has a higher overall recognition rate; especially in the presence of noise, the recognition effect is significantly better than that of the other methods. Figure 13. (a-d) represent the cross-view recognition rates for the "NM", "BG", and "CL" walking conditions when the probe view is 36°, 72°, 108°, and 144°, respectively.

Finally, we compared the recognition rates on UCMP-GAIT, as shown in Table 10. The gait recognition accuracy of the proposed TS-Net model increased by 6.67% compared with the recognition method with the highest previous accuracy.
The performance of the proposed TS-Net model is much better than that of the state-of-the-art gait recognition methods, whether in cross-view or identical-view settings. The GEIs of BG, CL, and UCMP-GAIT contain massive but different noise, yet the model still achieves a high recognition rate in these cases, indicating that it can eliminate the effects of noise and obtain the most discriminative features in the GEIs. This robustness is mainly due to the efficient multi-scale feature extraction and the novel feature fusion technique. At the same time, the TS-Net model based on DenseNet and SCAE performs better than either stream alone, indicating that the multi-scale fusion of static invariant features and dynamic deep features is much better than a single static or dynamic feature. This is why our model has a high recognition rate.

Efficiency
Finally, we analyzed the computational cost of the model. The main time-consuming step is in the mainstream network. During testing, the auxiliary stream network only needs to provide the encoded feature maps, so the most time-consuming decoding operation is not performed. The mainstream network must process both the feature maps it generates itself and the feature maps from the auxiliary stream network. However, we have optimized the model considerably, which greatly reduces the computational cost. Firstly, the input layer applies a 7 × 7 convolution kernel with a stride of 2 pixels to reduce network parameters. Secondly, a compression layer reduces the size of the feature maps by half after they are fed in; this both shrinks the feature map size and reduces the number of feature maps input to the next DenseBlock. Finally, the output layer applies a global pooling layer to reduce the output parameters.
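These size reductions follow the standard convolution output-size formula. A quick check is sketched below; the 128 × 128 input size and the padding values are assumptions for illustration, not figures from the paper.

```python
def conv_out(size, kernel, stride, pad=0):
    # Standard formula: floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

# Input layer: 7x7 kernel, stride 2, padding 3 halves a 128x128 input
print(conv_out(128, kernel=7, stride=2, pad=3))   # 64
# Compression layer: a stride-2, 2x2 pooling halves the map again
print(conv_out(64, kernel=2, stride=2))           # 32
```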
We ran the proposed method on a server with 4 Titan X (12 GB) GPUs; for our experiment, only the inference time was measured, on 1 GPU. The average time to input a gait image pair into our model and obtain the similarity is 2.73 ms. Predicting all 30 people in the UCMP-GAIT dataset takes a total of 22.11 s, and the average prediction time per person is 0.25 s. While ensuring computational efficiency, the accuracy of our model is greatly improved. With batch processing to make full use of GPU capability, or with a GPU of higher computing performance, the efficiency of our model can be further improved.

Conclusions and Outlook
This paper proposes a TS-Net model based on DenseNet and SCAE, which extracts and fuses the dynamic deep features and static invariant features of gait images for the gait recognition of underground coal mine personnel. The mainstream network uses DenseNet to learn dynamic deep features that represent the macroscopic spatiotemporal characteristics of gait images. The auxiliary stream network uses SCAE to learn static invariant features, which provide low-dimensional physiological information from the gait images. Then, the pixel-level dynamic deep features and the hierarchical static invariant features are fused to realize gait identification based on similarity learning. The proposed TS-Net model not only has a high recognition rate but also has good robustness to angle changes, carrying conditions, miner hats, and clothing. The experimental results show that the proposed TS-Net model achieves a gait recognition accuracy of 92.22% on the UCMP-GAIT dataset, which is significantly better than the state-of-the-art gait recognition methods. Moreover, it is effective and feasible for the gait recognition of underground coal mine personnel.
In the underground coal mine, without being restricted by the complicated environment and by the distance, the gait recognition will play a vital role in identifying the personnel in the coal