A Novel Spatial–Temporal Network for Gait Recognition Using Millimeter-Wave Radar Point Cloud Videos

Abstract: Gait recognition is a behavioral biometric technology that aims to identify individuals through their manner of walking. Compared with vision-based and wearable solutions, millimeter-wave (mmWave) radar-based gait recognition has drawn attention because radar sensing is privacy-preserving and non-contact. However, it is challenging to capture the motion dynamics of walking people from mmWave radar signals, even though these dynamics are crucial for robust gait recognition. In this study, a novel spatial–temporal gait recognition network based on mmWave radar is proposed to address this problem. First, a four-dimensional (4D) radar point cloud video (RPCV) is introduced to characterize human walking patterns. Then, a PointNet block is utilized to extract spatial features from the radar point clouds in each frame. Finally, a Transformer layer is applied for the spatial–temporal modeling of the 4D RPCVs, capturing walking motion information, followed by fully connected layers that output the identification results. The experimental results demonstrate the superiority of the proposed network over mainstream networks: it achieved the best human identification performance on a dataset of 15 volunteers.


Introduction
Human gait, which is defined as the manner of walking, is a behavioral biometric trait that is unique to each person and can be used to authenticate individuals [1]. Compared with other biometrics, such as faces, fingerprints, DNA, and irises, gait signatures can be captured from a distance and without cooperation from individuals, and gait characteristics are hard to conceal or disguise [2,3]. These advantages make gait recognition a promising human identification technology for diverse applications, including public security, forensics, and healthcare [4,5]. Vision- and wearable-sensor-based methods are the two main categories of gait recognition techniques used in the community [6,7]. However, vision sensors are limited by illumination conditions, and people may feel constrained by wearable devices and find them inconvenient. More importantly, vision-based devices may raise privacy concerns in non-public scenarios, such as homes and offices, where they can leak private or confidential information (e.g., habits, relationships, and visited places).
mmWave-radar-based human sensing has attracted extensive attention in recent years for its high sensing sensitivity [8]. Radar sensors are non-contact devices, can work under any lighting conditions, and do not infringe on the privacy of monitored individuals [9]. Therefore, mmWave-radar-based gait recognition is a promising option for unobtrusive human identification compared with vision and wearable solutions.
To achieve accurate gait recognition, it is crucial to capture the walking dynamics that produce inter-personal differences in walking patterns [7]. Most existing radar-based gait-recognition methods exploit micro-Doppler signatures from radar echoes to characterize the micro-motion patterns of human gait, combined with machine learning or deep learning technologies for human identification [10][11][12]. Micro-Doppler signatures are induced by the motions of different body parts (e.g., the torso and limbs) and embody the unique kinematic patterns of different walking individuals [13]. However, micro-Doppler signatures are not robust against viewpoint changes [14], which limits their use when people walk across a wide field of view.
Radar point clouds are collections of reflection points representing the target surface. They are generated by applying a series of target-detection algorithms to multiple-input multiple-output (MIMO) radar echoes and contain spatial coordinates and velocity information [8]. Radar point clouds across consecutive frames can be regarded as a radar point cloud video (RPCV). Time-varying radar point clouds reveal the motion dynamics, as well as the physical shape, of a walking human [15], and they are more resilient to viewpoint changes than micro-Doppler signatures. Therefore, RPCVs with spatial-temporal signatures have been considered for gait recognition in several related studies [16][17][18].
Nonetheless, 4D-radar-point-cloud-based solutions still pose challenges for high-performance gait recognition. First, the limited number of antennas on commercial mmWave radars results in sparse radar point clouds that lack appearance and geometric information [19]. Second, due to the specular reflection of mmWave signals, only part of the human body's reflections propagate back to the receiving antennas [20,21]. Consequently, radar point clouds emerge inconsistently across different frames, which makes it difficult to model the spatial-temporal signatures and motion dynamics of human gait.
To address the problems mentioned above, a 4D-RPCV-based spatial-temporal network for gait recognition is proposed in this study. In the proposed network, PointNet [22] is adopted to extract the spatial features from the sparse radar point clouds in each frame. In PointNet, shared multi-layer perceptrons (MLPs) are utilized to extract high-level representations from point clouds, and a max pooling operation is applied to process the unordered set of point features. After being processed by the PointNet block, the 4D RPCVs are transformed into point feature sequences. Transformers [23] have dominated the field of natural language processing in recent years and have been extended to the computer vision community for their capacity to capture global correlations [24]. Inspired by Transformers and self-attention mechanisms, a Transformer layer is employed in the proposed network to perform multi-head attention on the point feature sequences, exploiting the spatial-temporal correlation across the 4D RPCVs and thereby capturing the motion dynamics of human gait. Finally, fully connected layers are utilized to output the identification results.
The gait-recognition method in this study can be summarized as follows: (1) First, mmWave radar signals reflected from walking human subjects are transformed into 4D RPCVs through a series of signal-processing algorithms; the 4D RPCVs are used to characterize walking patterns. (2) Second, a PointNet block is adopted to extract spatial features from the radar point clouds in each frame of the 4D RPCVs, followed by a Transformer layer for the temporal modeling of features from consecutive frames, capturing the motion dynamics of walking individuals. (3) Third, the class token in the output of the Transformer layer is fed into fully connected layers to predict the identities. After training and evaluation, the experimental results demonstrated the effectiveness and robustness of the proposed spatial-temporal gait recognition network, which achieved the best performance when identifying both 10 and 15 subjects.
The main contributions of this study are summarized as follows:
• A 4D-RPCV-based spatial-temporal network is proposed to better capture the motion dynamics of gait for accurate human identification; it is capable of modeling walking motion from time-varying sparse radar point clouds.
• A Transformer encoder architecture is introduced in the proposed network to learn from radar point feature sequences, capturing the spatial-temporal dependencies that contribute to accurate gait recognition.
• A 4D RPCV human gait dataset was built from real mmWave MIMO frequency-modulated continuous wave (FMCW) radar measurements, which involved 15 volunteers walking along different paths. Experiments on this dataset showed that the proposed spatial-temporal network effectively improved the accuracy of gait recognition.
The remainder of this paper is organized as follows. Section 2 reviews the related works on radar-based human sensing and gait recognition research. Section 3 introduces the 77 GHz radar system used in this study. Section 4 describes the proposed radar-based gait-recognition method, including the generation of 4D RPCVs and the spatial-temporal network. In Section 5, the performance of the proposed gait recognition network is evaluated. The conclusion and future works are provided in Section 6.

Related Works
With the rapid development of mmWave radar technology, radar sensors have been widely explored for various human sensing tasks. In 2016, Google designed the mmWave radar sensing module Soli, which supports gesture recognition [25]. Various radar-based human activity recognition methods have been studied in recent years, involving many kinds of radar representations such as range-time maps, range-Doppler maps (RDM), micro-Doppler signatures, and radar point clouds [26]. Radar-based human recovery is also a field that attracts attention. Ref. [27] estimated 25 human skeletal joints from radar point clouds, and Ref. [19] reconstructed a three-dimensional human mesh by combining mmWave sensing and the SMPL model. In summary, mmWave radar is suitable for human sensing tasks, including gait recognition.
In [28], a micro-Doppler signature-based multi-branch convolutional neural network (CNN) for human gait recognition was proposed. However, micro-Doppler signatures are limited by their poor robustness to viewpoint changes, and processing them is also computationally demanding in multi-person scenarios. Ref. [17] designed a sequence radar point network combining PointNet and bidirectional long short-term memory (Bi-LSTM) to learn from 4D radar point cloud sequences. Similarly, the researchers in [29] proposed a gait recognition network combining PointNet and a temporal convolution network (TCN). However, these networks rely on sequential temporal blocks (a recurrent network or a temporal convolution network), which do not explicitly consider the global dependencies related to walking motion dynamics in time-varying sparse radar point clouds.

Frequency-Modulated Continuous Wave-MIMO Radar System
This study utilized a mmWave FMCW-MIMO radar that transmits linear chirp sequences. The frequency of the signal sent through the transmit (TX) antenna increases linearly over time. A single chirp with the carrier frequency $f_c$ can be expressed as [30]

$$s_{TX}(t) = \exp\left[j2\pi\left(f_c t + \frac{B}{2T_c}t^2\right)\right], \quad 0 \le t \le T_c,$$

where $B$ is the bandwidth and $T_c$ is the chirp duration. Denoting $c$ as the speed of light and $R$ and $v$ as the range and velocity of the target, the time delay of the received signal can be expressed as

$$\tau = \frac{2(R + vt)}{c}.$$

The received (RX) signal is mixed with the TX signal and subsequently filtered by a low-pass filter, generating the intermediate frequency (IF) signal.
A radar frame is a sequence of consecutive chirps that can be structured as a two-dimensional waveform across two temporal dimensions. In a frame with $M$ chirps, each chirp is sampled with sampling rate $f_s$ to obtain $N$ points (fast time dimension), while $M$ samples, corresponding to the number of chirps, are obtained with sampling period $T_{rep}$ (slow time dimension). Thus, up to a constant complex amplitude, the IF signal of the target in a frame across these two time dimensions can be approximately expressed as [30]

$$s_{IF}(n, m) \approx \exp\left[j2\pi\left(\frac{f_b n}{f_s} + f_d m T_{rep}\right)\right],$$

where $n$ indicates the index of fast time samples within each chirp and $m$ is the index of slow time samples across successive chirps. The beat frequency $f_b = 2BR/(cT_c)$ and the Doppler frequency $f_d = 2f_c v/c$ reveal the range and velocity of the target, respectively. This information can be extracted by a two-dimensional fast Fourier transform (FFT) along the fast and slow time dimensions.
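To make the role of the two FFTs concrete, the following minimal sketch synthesizes the IF samples of a single ideal point target and recovers its range and velocity from the peak of a two-dimensional DFT, using the relations f_b = 2BR/(cT_c) and f_d = 2f_c v/c. All numeric radar parameters are illustrative assumptions rather than the values of Table 1, the target is deliberately placed on exact bin centers, and a naive DFT stands in for the FFT so that the example has no external dependencies.

```python
import cmath
import math

C = 3.0e8  # speed of light (m/s)

def dft(x):
    """Naive DFT; stands in for an FFT to avoid external dependencies."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * i * k / n) for k in range(n))
            for i in range(n)]

def range_doppler_peak(frame):
    """frame[m][n]: IF sample n (fast time) of chirp m (slow time).
    Returns (doppler_bin, range_bin) of the strongest range-Doppler cell."""
    fast = [dft(chirp) for chirp in frame]              # DFT along fast time
    n_chirps, n_samples = len(fast), len(fast[0])
    rdm = [dft([fast[m][n] for m in range(n_chirps)])   # DFT along slow time
           for n in range(n_samples)]                    # indexed as rdm[n][m]
    _, m_peak, n_peak = max((abs(rdm[n][m]), m, n)
                            for n in range(n_samples) for m in range(n_chirps))
    return m_peak, n_peak

# Assumed, illustrative parameters (not the values of Table 1).
f_c, B, T_c, T_rep = 77e9, 1.0e9, 50e-6, 100e-6
N, M = 32, 16                          # samples per chirp, chirps per frame
f_s = N / T_c                          # sampling rate covering one chirp
range_res = C / (2 * B)                # metres per range bin
vel_res = C / (2 * f_c * M * T_rep)    # m/s per Doppler bin

R_true, v_true = 20 * range_res, 2 * vel_res  # target placed on exact bins
f_b = 2 * B * R_true / (C * T_c)       # beat frequency
f_d = 2 * f_c * v_true / C             # Doppler frequency

frame = [[cmath.exp(2j * math.pi * (f_b * n / f_s + f_d * m * T_rep))
          for n in range(N)] for m in range(M)]

doppler_bin, range_bin = range_doppler_peak(frame)
R_est, v_est = range_bin * range_res, doppler_bin * vel_res
```

With a real target that falls between bins, the peak would spread over neighboring cells, which is one reason a detector such as CFAR is applied to the range-Doppler map instead of a single maximum search.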
Because each antenna's received signal has a different phase, a radar with a linear antenna array can be used to estimate a target's azimuth. Denoting by $d$ the distance between two adjacent antennas and by $\lambda = c/f_c$ the base wavelength of the transmitted chirp, the phase shift between the received signals from these two antennas is [31]

$$\Delta\psi = \frac{2\pi d \sin\theta}{\lambda},$$

where $\theta$ denotes the azimuth of the target. For $Q$ targets, the three-dimensional (3D) FMCW-MIMO radar IF signal can be represented as [30]

$$s_{IF}(n, m, l) \approx \sum_{q=1}^{Q} \alpha_q \exp\left[j2\pi\left(\frac{f_{b,q}\, n}{f_s} + f_{d,q}\, m T_{rep} + \frac{l d \sin\theta_q}{\lambda}\right)\right],$$

where $l$ indicates the index of the receiving antenna, $\alpha_q$ is the complex amplitude of the $q$th target, and $f_{b,q}$, $f_{d,q}$, and $\theta_q$ are its beat frequency, Doppler frequency, and azimuth. The samples of the IF signal can be arranged into a 3D matrix across the fast time, slow time, and channel dimensions, forming the raw data cube. Range, velocity, and angle estimation can be achieved by applying the FFT along these three dimensions, respectively.
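As a small worked example of the phase-shift relation, the sketch below simulates the ideal signals seen by two adjacent receiving antennas and inverts the measured phase difference to recover the azimuth. The half-wavelength antenna spacing and the 20° test angle are assumptions chosen for illustration; with that spacing the phase difference stays within (-π, π), so the inversion is unambiguous.

```python
import cmath
import math

C = 3.0e8  # speed of light (m/s)

def azimuth_from_phase(delta_phase, d, wavelength):
    """Invert delta_phase = 2*pi*d*sin(theta)/lambda for theta (radians)."""
    s = delta_phase * wavelength / (2 * math.pi * d)
    if abs(s) > 1.0:
        raise ValueError("phase shift is inconsistent with this antenna spacing")
    return math.asin(s)

f_c = 77e9
wavelength = C / f_c
d = wavelength / 2                 # assumed half-wavelength spacing
theta_true = math.radians(20.0)    # assumed target azimuth

# Ideal, noise-free signals at two adjacent antennas.
rx0 = cmath.exp(0j)
rx1 = cmath.exp(1j * 2 * math.pi * d * math.sin(theta_true) / wavelength)

# Phase of rx1 relative to rx0, then the estimated azimuth.
delta_phase = cmath.phase(rx1 * rx0.conjugate())
theta_est = azimuth_from_phase(delta_phase, d, wavelength)
```

An FFT along the channel dimension generalizes this two-antenna estimate to a full array: the spatial frequency of the peak plays the role of the phase difference per antenna.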

Method

Four-Dimensional Radar Point Cloud Videos
Four-dimensional RPCVs are introduced to characterize human walking patterns. In comparison to 3D point clouds obtained from LiDAR or depth cameras, 4D radar point clouds include velocity information, which benefits the modeling of human walking patterns. The generation of 4D radar point clouds relies on an FMCW-MIMO radar with antennas placed both horizontally and vertically. This configuration enables the estimation of the azimuth and elevation of scattering points from walking human targets.
The steps for generating 4D radar point clouds from IF signals are shown in Figure 1. First, a 2D-FFT is applied to the raw radar data to obtain a range-Doppler matrix (RDM). Here, the 2D-FFT involves applying the FFT along the fast time and the slow time dimensions sequentially. Subsequently, a moving target indication (MTI) filter is utilized to remove static clutter caused by the environment. Following this, a two-dimensional constant false alarm rate (2D-CFAR) detector is applied to the RDM to select prominent range-Doppler pixels as potential scattering points, using a threshold that varies according to the noise level. For each potential scattering point in the range-Doppler domain, the signal along the channel dimension is arranged into a 2D matrix based on the antenna array positions. The spatial spectrum in the horizontal and vertical directions can be obtained by performing the 2D-FFT on this 2D matrix. The angle of arrival (AoA) in the horizontal and vertical directions of each scattering point can be estimated by applying peak searching to the spatial spectrum. A 4D detected scattering point can be expressed by $p = [r, v, \theta, \phi]$, where $r$ is the range, $v$ is the radial velocity, and $\theta$ and $\phi$ are the azimuth and elevation angles, respectively. After coordinate transformation, the point becomes $p = [x, y, z, v]$, where $x$, $y$, and $z$ are the 3D coordinates in the Cartesian coordinate system and $v$ is the radial velocity. Finally, to remove clutter points and cluster the scattering points belonging to the same target, the density-based spatial clustering of applications with noise (DBSCAN) algorithm [32] is employed. A cluster consisting of $Num$ scattering points in frame $k$ can be expressed as

$$C_k = \{p_k^1, p_k^2, \ldots, p_k^{Num}\},$$

where $p_k^i$ is the $i$th scattering point in frame $k$.

Furthermore, a 4D RPCV can be constructed by combining clusters belonging to the same target from consecutive frames, which can be expressed as

$$C_{1:L} = \{C_1, C_2, \ldots, C_L\}.$$

A short 4D RPCV sample is shown in Figure 2, and the cluster with the most scattering points is regarded as the human target. Radar point clouds emerge inconsistently across multiple frames due to the specular reflection of mmWave signals. The 4D RPCVs were taken as the input of the spatial-temporal network introduced in Section 4.2 for human identification.
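The last stages of this pipeline (coordinate transformation, DBSCAN clustering, and keeping the largest cluster as the human target) can be sketched as follows. The axis convention in `to_cartesian` (y along the radar boresight, x to the right, z up) is an assumption, since conventions differ between radar platforms, and the synthetic points are purely illustrative. The DBSCAN radius of 0.8 m and minimum of 10 points are the values reported in the implementation details.

```python
import math

def to_cartesian(r, v, azimuth, elevation):
    """Convert [r, v, theta, phi] to [x, y, z, v].
    Assumed convention: y along boresight, x to the right, z up."""
    x = r * math.cos(elevation) * math.sin(azimuth)
    y = r * math.cos(elevation) * math.cos(azimuth)
    z = r * math.sin(elevation)
    return (x, y, z, v)

def dbscan(points, eps, min_pts):
    """Plain DBSCAN on the (x, y, z) part of each point.
    Returns one label per point; -1 marks noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i][:3], points[j][:3]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)              # includes the point itself
        if len(seeds) < min_pts:
            labels[i] = -1                # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise becomes a border point
            if labels[j] is not None:
                continue                  # already assigned, do not expand
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:        # core point: keep expanding
                queue.extend(nb)
    return labels

def largest_cluster(points, labels):
    """Return the points of the cluster with the most members."""
    clustered = [l for l in labels if l >= 0]
    if not clustered:
        return []
    best = max(set(clustered), key=clustered.count)
    return [p for p, l in zip(points, labels) if l == best]

# Synthetic frame: 12 points on a walking target plus 3 isolated clutter points.
body = [to_cartesian(4.0 + 0.02 * i, 1.0,
                     math.radians(-3 + 0.5 * i), math.radians(-2 + 0.4 * i))
        for i in range(12)]
clutter = [to_cartesian(8.0, 0.0, math.radians(30), 0.0),
           to_cartesian(2.0, 0.0, math.radians(-40), 0.0),
           to_cartesian(10.0, 0.0, 0.0, math.radians(10))]
points = body + clutter

labels = dbscan(points, eps=0.8, min_pts=10)
human = largest_cluster(points, labels)
```

Repeating this per frame and stacking the selected clusters yields the sequence of clusters that forms one RPCV.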

Spatial-Temporal Network
The key factor for the recognition network is extracting person-specific gait features, which are related to both spatial and motion patterns. After extracting high-level representations related to human walking motion and appearance from the radar point clouds, a max pooling operation is applied to the unordered set of pointwise features to obtain a global spatial feature for each frame. Specifically, the 4D RPCV $C_{1:L} \in \mathbb{R}^{L \times Num \times 4}$ is transformed by the PointNet block into a feature sequence $[f_1, f_2, \ldots, f_L] \in \mathbb{R}^{L \times Dim}$, where $L$ is the length, $Num$ is the number of points in each frame, and $Dim$ is the dimension of the global spatial feature.
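The mapping from an L × Num × 4 video to an L × Dim feature sequence can be illustrated with a deliberately reduced sketch: a single shared linear layer with ReLU replaces the multi-layer shared MLP, and the weights are random rather than learned. These simplifications, and the tiny sizes used here, are assumptions made only for the illustration. The final check shows why max pooling suits unordered point sets: shuffling the points within each frame leaves the frame features unchanged.

```python
import random

def shared_mlp(point, weights, bias):
    """One linear layer followed by ReLU, applied to a single 4D point."""
    return [max(0.0, sum(w * x for w, x in zip(row, point)) + b)
            for row, b in zip(weights, bias)]

def frame_feature(points, weights, bias):
    """Apply the same MLP to every point, then max-pool over the points."""
    pointwise = [shared_mlp(p, weights, bias) for p in points]
    return [max(channel) for channel in zip(*pointwise)]

def encode_video(video, weights, bias):
    """Map an L x Num x 4 video to an L x Dim feature sequence."""
    return [frame_feature(frame, weights, bias) for frame in video]

rng = random.Random(0)
L, NUM, DIM = 3, 5, 8   # small illustrative sizes, not the paper's 50, 64, 512
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(DIM)]
bias = [rng.uniform(-1, 1) for _ in range(DIM)]
video = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(NUM)]
         for _ in range(L)]

features = encode_video(video, weights, bias)

# Reordering the points inside each frame does not change the result.
shuffled = [rng.sample(frame, len(frame)) for frame in video]
features_shuffled = encode_video(shuffled, weights, bias)
```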

Transformer Layer
To model person-specific walking motion dynamics with feature sequences obtained from the PointNet block, a Transformer layer with a multi-head attention mechanism is applied in the spatial-temporal network.
The feature sequences are regarded as gait embeddings in this Transformer layer. A learnable vector, termed the class token, is initialized and concatenated with the gait embeddings, as shown in Figure 3. The class token interacts with the features in all states, avoiding a preference for motion information in specific states. Compared with simply pooling features from all states, using the class token for classification is a better way to aggregate gait information across the entire RPCV. The input of the Transformer layer can be expressed as

$$F = [f_{cls}; f_1; f_2; \ldots; f_L] \in \mathbb{R}^{(L+1) \times Dim},$$

where $f_{cls}$ is the class token.

The architecture of the Transformer layer is shown in Figure 5a. It consists of a multi-head attention block and a position-wise feed-forward block, each followed by a residual connection and layer normalization. First, $F$ is projected to the Query, Key, and Value by linear transformations, which can be expressed as

$$Q = F W_q, \quad K = F W_k, \quad V = F W_v,$$

where $W_q$, $W_k$, and $W_v$ are the weights of the linear transformations. Multi-head attention then divides $Q$, $K$, and $V$ into different representation subspaces by linear projection and aggregates features from all the representation subspaces to capture various dependencies within the 4D RPCVs, as shown in Figure 5b. This transformation can be expressed as

$$Q_h = Q W_h^{Q}, \quad K_h = K W_h^{K}, \quad V_h = V W_h^{V}, \quad h = 1, \ldots, H,$$

where $h$ represents the index of the representation subspace (termed the head), $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$ are the projection weights of head $h$, and $Q_h, K_h, V_h \in \mathbb{R}^{(L+1) \times \frac{Dim}{H}}$. Furthermore, $H$ is the number of heads. The scaled dot-product attention for each head is calculated as

$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^{T}}{\sqrt{Dim/H}}\right) V_h.$$

The attention outputs from all heads are concatenated and processed by a linear projection. Finally, the position-wise feed-forward block applies an identical MLP to each state for further feature extraction. In addition, the use of layer normalization and residual connections facilitates building a deeper architecture.
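The attention computation described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the trained model: the weights are random, the sizes are tiny, the per-head projections are realized by slicing the columns of Q, K, and V (equivalent to folding the head projections into the first linear maps), and the feed-forward block, residual connections, and layer normalization are omitted. The number of heads, H = 4, follows the implementation details.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
              for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(f, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d_h = len(q[0]) // num_heads            # per-head dimension Dim / H
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_h, (h + 1) * d_h)
        out, _ = attention([r[cols] for r in q],
                           [r[cols] for r in k],
                           [r[cols] for r in v])
        heads.append(out)
    # Concatenate the heads state by state, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(f))]
    return matmul(concat, w_o)

rng = random.Random(0)
L, DIM, H = 4, 8, 4   # illustrative sizes; the paper uses Dim = 512, H = 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

f_cls = rand_matrix(1, DIM)        # stands in for the learnable class token
frame_features = rand_matrix(L, DIM)
F = f_cls + frame_features         # (L + 1) x Dim input sequence

out = multi_head_attention(F, rand_matrix(DIM, DIM), rand_matrix(DIM, DIM),
                           rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), H)
cls_out = out[0]                   # state that summarizes the whole sequence
_, attn = attention(F, F, F)       # weights of a single unprojected head
```

Because every row of the attention weights spans all L + 1 states, the class-token output mixes information from every frame in one step, which is the global dependency that the text contrasts with step-by-step temporal blocks.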

Output Layer
In the output layer, the feature vector corresponding to the class token in the Transformer output is extracted, and FC layers are applied to reduce the dimension of the gait feature. A dropout layer is used to prevent overfitting. Afterward, a softmax layer is applied to predict the human identity $\hat{y}$. The categorical cross-entropy loss function compares $\hat{y}$ with the ground-truth label $y$ of the walking human, guiding the optimization of the spatial-temporal network. The loss function can be expressed as

$$\mathcal{L} = -\sum_{i=1}^{p} y_i \log \hat{y}_i,$$

where $p$ is the number of people registered in the gait-recognition system, and $y_i$ and $\hat{y}_i$ are the $i$th entries of the one-hot label and the predicted probability vector, respectively.
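A short numeric illustration of the softmax and categorical cross-entropy steps follows. The logits and the three-person setting are made-up values for the example. A uniform prediction over p identities gives a loss of log p, and the loss drops as the probability assigned to the true identity grows.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, y_hat):
    """Categorical cross-entropy: -sum_i y_i * log(y_hat_i)."""
    return -sum(y * math.log(q) for y, q in zip(y_onehot, y_hat) if y > 0)

num_people = 3
y = [0, 1, 0]   # ground truth: the second registered person

loss_uniform = cross_entropy(y, softmax([0.0, 0.0, 0.0]))
loss_confident = cross_entropy(y, softmax([0.0, 4.0, 0.0]))
```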

Data Collection
A mmWave FMCW-MIMO radar platform developed by Texas Instruments was utilized for evaluation in this study. The radar platform comprises an RF module and a DSP module, implementing a four-device cascaded array of AWR1243 chips, as shown in Figure 6. The radar, equipped with a two-dimensional antenna array, can be used for azimuth and elevation estimation. It employs the time-division multiplexing (TDM) technique to achieve waveform orthogonality. The virtual RX antenna array is shown in Figure 7. The detailed parameters of the radar system can be found in Table 1. The data were collected in an open area, as shown in Figure 8, with a sensing area measuring 10 m × 15 m. The radar platform was placed 3 m away from the sensing area and mounted on a tripod stand at a height of 1 m. We recruited 15 volunteers for the experiment, with heights ranging from 160 cm to 183 cm and weights from 51 kg to 75 kg, as detailed in Table 2. Each volunteer was instructed to walk within the sensing area from six different viewpoints relative to the radar platform. For each viewpoint, we collected five sequences, each lasting 100 frames. Four of these sequences were allocated for training, while the remaining one was used for testing. In total, we collected 45,000 radar frames.
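The dataset size follows directly from the collection protocol, as the short calculation below confirms. The variable names are introduced here for illustration only.

```python
volunteers = 15
viewpoints = 6
sequences_per_viewpoint = 5
frames_per_sequence = 100

total_sequences = volunteers * viewpoints * sequences_per_viewpoint
total_frames = total_sequences * frames_per_sequence

# Four of the five sequences per viewpoint are used for training.
train_sequences = volunteers * viewpoints * 4
test_sequences = volunteers * viewpoints * 1
```

The resulting 45,000 frames match the total reported above, split into 360 training and 90 test sequences.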

Implementation Details
The size of the 4D RPCVs was set to 50 × 64 × 4, i.e., L = 50 frames with Num = 64 points per frame and four values per point. For the implementation of DBSCAN, the radius and the minimum number of points in the neighborhood were set to 0.8 m and 10, respectively. The hidden dimension Dim was set to 512, and the number of heads H in the multi-head attention block was set to four. The data of 10 volunteers were used for the basic evaluation, and the data of all 15 volunteers were used for further stability analysis of the networks. In the training phase, Adam was chosen as the optimizer, with a batch size of 32 and a learning rate of 0.0001. We trained the networks for 250 epochs. All the training and evaluation processes of the networks were implemented using PyTorch with an NVIDIA A40 GPU.
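The text does not state how clusters with a varying number of points are brought to the fixed 50 × 64 × 4 size. One plausible scheme, shown here purely as an assumption, is to resample every frame to 64 points (random subsampling when there are too many, repetition when there are too few, zero-filling for empty frames) and to cut each 100-frame sequence into 50-frame windows. The helper names and the stride of 25 frames are hypothetical.

```python
import random

def resample_frame(points, num, rng):
    """Return exactly `num` points: subsample, or pad by repeating points.
    Hypothetical scheme; empty frames are filled with zero points."""
    if not points:
        return [(0.0, 0.0, 0.0, 0.0)] * num
    if len(points) >= num:
        return rng.sample(points, num)
    return list(points) + [rng.choice(points) for _ in range(num - len(points))]

def to_windows(frames, length, stride):
    """Cut a frame sequence into fixed-length, possibly overlapping windows."""
    return [frames[s:s + length]
            for s in range(0, len(frames) - length + 1, stride)]

rng = random.Random(0)

# Synthetic 100-frame sequence with a varying number of detected points.
sequence = []
for _ in range(100):
    count = rng.randint(0, 120)
    sequence.append([(rng.random(), rng.random(), rng.random(), rng.random())
                     for _ in range(count)])

fixed = [resample_frame(frame, 64, rng) for frame in sequence]
windows = to_windows(fixed, length=50, stride=25)   # stride is an assumption
```

Each element of `windows` then has the 50 × 64 × 4 shape expected by the network input.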

Comparison of Performance
To verify the effectiveness of the proposed spatial-temporal gait-recognition network, we compared it with several gait-recognition benchmarks based on radar point clouds or micro-Doppler signatures. A combination of PointNet and a sequential temporal block is the mainstream approach for radar-point-cloud-video-based gait recognition. In this experiment, we compared the proposed network with "PointNet + BiLSTM" and "PointNet + TCN", both of which use PointNet to extract radar point cloud features and employ a temporal block (a bidirectional LSTM or a temporal convolution network, respectively) to capture the time-varying characteristics. mmGaitNet [16] uses 2D convolutional kernels to extract spatial-temporal features from the RPCVs. The Multi-Channel CNN [28] captures gait Doppler features from the micro-Doppler signature using an Inception and residual-connection-based network.
As shown in Table 3, the proposed spatial-temporal network achieved the best identification performance with an accuracy of 94.44% on the test set, showcasing the capacity of our model to capture human walking dynamics. The spatial-temporal network was 8.33 percentage points more accurate than "PointNet + BiLSTM" and 6.94 percentage points more accurate than "PointNet + TCN", demonstrating the effectiveness of using the Transformer layer to model the temporal correlation in the 4D RPCVs. Among the 4D-RPCV-based baselines, mmGaitNet achieved a higher accuracy than "PointNet + BiLSTM" and "PointNet + TCN", showcasing the potential of the 2D CNN in capturing spatial-temporal human gait features from time-varying radar point clouds. Since the samples in our dataset were collected from different viewpoints relative to the radar, the micro-Doppler signatures were affected, and the Multi-Channel CNN had the lowest accuracy among all the compared networks. The confusion matrices of the proposed network, mmGaitNet, "PointNet + TCN", and the Multi-Channel CNN are shown in Figure 9. All the networks in this experiment exhibited poor recognition accuracy for certain users. However, the spatial-temporal network achieved more than 92% accuracy for all subjects except 'Subject 1', an outcome that the other networks could not achieve. As shown in Figure 10, as the number of subjects increased, the performance of all networks degraded. However, the proposed spatial-temporal network still achieved the highest accuracy, demonstrating the robustness and stability of our network in the gait recognition task. The proposed network captured the motion dynamics of the human gait by combining PointNet and the Transformer encoder, fully exploiting the spatial-temporal structure of the 4D RPCVs.

Impact of Hidden Dimension
The hidden dimension Dim affects both the size and the performance of the network, and we compared different hidden dimensions in this experiment. As shown in Table 4, the size of the network grew as the hidden dimension increased, which is not conducive to deployment. It is worth noting that the performance of the network with a hidden dimension of 1024 degraded slightly compared with that of 512, potentially due to overfitting. After considering both the performance and the computational complexity of the network, the hidden dimension was set to 512.

Discussion
This sub-section discusses the potential applications in real-time scenarios, as well as the limitations of the proposed radar-based gait-recognition method. The designed spatial-temporal gait recognition network, with 1.656 million parameters and 1.158 billion FLOPs, can be deployed on many commercial AI edge computing devices for real-time processing. The gait-recognition method in this study can be applied in many real-world scenarios. It has great potential as an alternative to traditional vision solutions in personalized surveillance systems, such as smart homes and enterprise settings, where the number of individuals involved is a few tens. The sensing device used in the proposed method is a single radar module, making it easy to deploy with edge computing devices in most practical scenarios without requiring extensive additional hardware in the environment.
Although radar-based gait recognition offers a non-invasive way of identifying humans, it is limited in certain cases. As a soft biometric, gait cannot be used to identify subjects within very large groups, since it is hard to separate each subject's gait representation from the radar echoes of a large crowd. In addition, abnormal walking patterns due to injuries may lead to poor gait recognition performance.

Conclusions
In this article, a 4D-RPCV-based spatial-temporal network for gait recognition was proposed. The 4D RPCV was introduced to characterize human gait. In the proposed network, PointNet was adopted to extract spatial features from the sparse radar point clouds in each frame. Furthermore, a Transformer layer was employed to exploit the spatial-temporal correlation across the 4D RPCVs, enabling the capture of motion dynamics in human gait. The experimental results demonstrated the effectiveness and robustness of the spatial-temporal network, which achieved an accuracy of 94.44% in identifying 10 subjects and 90.76% for 15 subjects.
In the future, related research will be continued in two directions. On the one hand, to make the proposed network more general and robust, we will increase the number and diversity of the subjects in the experiment and evaluate the network in various environmental settings. On the other hand, we will study radar-based gait recognition networks that are robust to environmental changes. Radar sensing is easily affected by interference from the surroundings, so it is important to enhance the environmental adaptability of the radar-based gait-recognition model. The potential of meta-learning methods in radar-based gait recognition will be explored, with the goal of rapid adaptation to new environments with minimal observations.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Figure 1. Flowchart of generating radar point clouds.

Figure 3. Architecture of the spatial-temporal network.

PointNet Block
L identical PointNet encoders are applied to process the input 4D RPCV, which consists of radar point clouds in L frames. As shown in Figure 4, for the radar point cloud in each frame, the PointNet block applies MLPs in parallel to extract pointwise features, and all the parallel MLPs share the same weights. In the MLP layer, each MLP extracts high-dimensional features from a single 4D radar point through a linear transformation.

Author Contributions:
Conceptualization, C.M.; methodology, C.M.; software, C.M.; validation, C.M.; formal analysis, C.M.; investigation, C.M.; resources, Z.L.; data curation, C.M.; writing-original draft preparation, C.M.; writing-review and editing, Z.L.; visualization, C.M.; supervision, Z.L.; project administration, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding:
This research was funded in part by the Guangdong Provincial Science and Technology Plan Project under Grant 2021A0505080014, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515012873, and in part by the Guangzhou Key Research and Development Project under Grant 2023B01J0011.

Table 1. Parameters of the FMCW-MIMO radar system.

Table 2. Information about the subjects.

Table 3. Comparison of the proposed spatial-temporal network with different gait recognition networks.

Table 4. Performance of the network with different hidden dimensions.