Continuous Gesture Recognition Based on Time Sequence Fusion Using MIMO Radar Sensor and Deep Learning

Abstract: Gesture recognition based on high-resolution radar has developed steadily in the field of human-computer interaction. In a radar-based recognition system, it is challenging to recognize various gesture types because gesture transversal features are lacking. In this paper, we propose an integrated gesture recognition system based on a frequency modulated continuous wave (FMCW) MIMO radar combined with a deep learning network. First, a pre-processing algorithm, which consists of the windowed fast Fourier transform and an intermediate-frequency signal band-pass filter (IF-BPF), is applied to obtain an improved Range Doppler Map. A range FFT based MUSIC (RFBM) two-dimensional (2D) joint super-resolution estimation algorithm is then proposed to obtain a Range Azimuth Map, which provides the gesture transversal feature. Over the gesture recording duration, the Range Doppler Maps and Range Azimuth Maps form a Range Doppler Map Time Sequence (RDMTS) and a Range Azimuth Map Time Sequence (RAMTS), respectively. Finally, a Dual stream three-dimensional (3D) Convolution Neural Network combined with Long Short Term Memory (DS-3DCNN-LSTM) network is designed to extract and fuse features from both RDMTS and RAMTS, and then classify gestures with radial and transversal change. The experimental results show that the proposed system can distinguish 10 types of gestures containing transversal and radial motions with an average accuracy of 97.66%.


Introduction
Gesture recognition is regarded as an effective means of human-computer interaction (HCI) and is used in a growing number of applications [1][2][3]. Much research on gesture recognition is based on computer vision [4][5][6][7]. Vision-based techniques study the contours, shapes, and textures of gestures. However, vision-based methods consume a large amount of computational resources, and they do not work well in strong light or low light.
In recent years, radar sensor-based gesture recognition has gained a lot of attention. Radar sensors avoid the low recognition accuracy that vision-based systems suffer under poor lighting, which makes them well suited to in-car environments. In addition, a radar system protects the user's privacy better than a vision-based system. Therefore, radar-based gesture recognition has broad application prospects and considerable practical value [8][9][10][11][12][13][14][15][16][17]. Some hand gesture recognition methods are based on Doppler radar [9,10]. However, Doppler radar can only obtain Doppler information, also called velocity information, and cannot obtain the range of the target. Therefore, there are some

Electronics 2020, 9, 869

employed to obtain RDMTS. Meanwhile, the RFBM 2D joint super-resolution estimation algorithm is used to obtain RAMTS. In the recognition part, RDMTS and RAMTS are input to the DS-3DCNN-LSTM network and the classification results are given.

FMCW MIMO Radar
The employed FMCW radar is a millimeter-wave radar sensor with three transmitters and four receivers. We use two transmitters and four receivers to generate a virtual array of eight receiving antennas. The signals are generated by a synthesizer and transmitted by the two transmitters. After being reflected by the target, the signal is received by the four receivers. The received signal is mixed with the transmitted signal to obtain the intermediate-frequency (IF) signal. Figure 2 shows a simplified block diagram that uses one transmitter and two receivers to illustrate the working process of the FMCW MIMO radar sensor.

The employed radar sensor radiates a sawtooth-modulated waveform. The transmitted sawtooth FMCW signal consists of several frames, and each frame contains many chirps; a chirp is a sinusoid whose frequency increases linearly with time. The received IF signal can be expressed as

$$s_{IF}(t) = A_{IF} \exp\left(j\left(2\pi K \tau t + \Delta\phi\right)\right), \tag{1}$$

where $K = B/T$ is the slope of the chirp, $B$ is the bandwidth of the transmitted signal, $T$ is the chirp duration, $A_{IF}$ is the amplitude of the IF signal, and $\tau$ and $\Delta\phi$ denote the time delay and phase shift caused by the hand, respectively.

According to Equation (1), the target range $R$ can be calculated by

$$R = \frac{c\,\tau}{2} = \frac{c\, f_{IF}}{2K}, \tag{2}$$

where $c$ is the speed of light and $f_{IF} = K\tau$ is the principal frequency component of the IF signal. For a frame periodicity, $\Delta R = vT$, so the radial velocity $v$ of the object can be calculated by

$$v = \frac{\lambda f_d}{2}, \tag{3}$$

where $f_d$ is the Doppler frequency and $\lambda$ is the carrier wavelength. The range information can be obtained by performing a Range-FFT along the fast time. The radial velocity information can be obtained by applying a Doppler-FFT to the IF signal along the slow time.
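As a rough numerical illustration of the range and velocity relations, the sketch below assumes the standard FMCW forms R = c·f_IF/(2K) with K = B/T, and v = λ·f_d/2 with λ = c/f_c. The function names and all numerical values (4 GHz bandwidth, 40 µs chirp, 77 GHz carrier) are hypothetical and are not the radar configuration used in the experiments.

```python
# Hypothetical illustration of the FMCW range and velocity relations.
C = 299_792_458.0  # speed of light in m/s


def chirp_slope(bandwidth_hz: float, chirp_duration_s: float) -> float:
    """Chirp slope K = B / T in Hz per second."""
    return bandwidth_hz / chirp_duration_s


def range_from_if(f_if_hz: float, slope: float) -> float:
    """Target range R = c * f_IF / (2 K) in meters."""
    return C * f_if_hz / (2.0 * slope)


def velocity_from_doppler(f_d_hz: float, carrier_hz: float) -> float:
    """Radial velocity v = lambda * f_d / 2 with lambda = c / f_c."""
    wavelength = C / carrier_hz
    return wavelength * f_d_hz / 2.0


# Assumed example values, not the configuration used in the paper.
K = chirp_slope(4e9, 40e-6)             # chirp slope in Hz/s
r = range_from_if(400e3, K)             # range for a 400 kHz IF tone
v = velocity_from_doppler(500.0, 77e9)  # velocity for a 500 Hz Doppler shift
```

Because the IF frequency scales linearly with range, a fixed range interval maps to a fixed IF frequency band for a given chirp slope.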
A single pair of transmit and receive antennas can realize Range-Doppler estimation. However, at least two receiving antennas are needed for azimuth estimation. MIMO radar with multiple TX and multiple RX antennas provides a cost-effective way to improve the radar angle resolution [28]. We used a 2T4R MIMO radar to generate a virtual array of eight RX antennas. Transmit antennas TX1 and TX3 are horizontally spaced at $d = 4d_r$, and the four receivers are horizontally spaced with an interval of $d_r$, as shown in Figure 3a. The phase difference $\omega$ between adjacent antennas is calculated by

$$\omega = \frac{2\pi d_r \sin\theta}{\lambda}, \tag{4}$$

where $\theta$ is the angle of arrival and $\lambda$ is the wavelength. The unambiguous measurement of angle requires $|\omega| \le \pi$, so $d_r = \lambda/2$ gives the largest field of view [28].
A transmission from TX1 results in a phase of [0 ω 2ω 3ω] at the four RX antennas. Because the second TX antenna (TX3) is placed at a distance of $4d_r$ from TX1, any signal emanating from TX3 traverses an additional path of length $4d_r \sin\theta$ compared with TX1. Correspondingly, the signal at each RX antenna sees an additional phase shift of 4ω (with respect to the transmission from TX1). The phase of the signal at the four RX antennas due to a transmission from TX3 is therefore [4ω 5ω 6ω 7ω]. Concatenating the two phase sequences at the four RX antennas yields the sequence [0 ω 2ω 3ω 4ω 5ω 6ω 7ω]. Thus, the 2TX-4RX antenna configuration of Figure 3a synthesizes a virtual array of 8 RX antennas, as shown in Figure 3b. In this work, time division multiplexing (TDM) [29] is employed to separate the different transmit signals.
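The construction of the virtual array can be checked with a few lines of code. The sketch below is only illustrative: the function names are hypothetical, the wavelength is normalized to 1, and the 20° angle of arrival is an arbitrary choice. It builds the per-RX phases for TX1 and TX3 and confirms that their concatenation is the uniform sequence 0, ω, ..., 7ω.

```python
import math


def rx_phases(omega, n_rx=4, tx_offset=0):
    """Phases at the RX antennas for a TX antenna offset by tx_offset * d_r."""
    return [(tx_offset + k) * omega for k in range(n_rx)]


def virtual_array_phases(omega, n_rx=4, tx_offsets=(0, 4)):
    """Concatenate the RX phase sequences of all TX antennas (TDM-separated)."""
    phases = []
    for offset in tx_offsets:
        phases.extend(rx_phases(omega, n_rx, offset))
    return phases


# Assumed geometry: d_r = lambda / 2, arbitrary angle of arrival of 20 degrees.
wavelength = 1.0
d_r = wavelength / 2.0
theta = math.radians(20.0)
omega = 2.0 * math.pi * d_r * math.sin(theta) / wavelength

phases = virtual_array_phases(omega)  # eight phases: 0, omega, ..., 7 * omega
```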
According to Equation (4), $\theta$ can be calculated by

$$\theta = \arcsin\left(\frac{\lambda\,\omega}{2\pi d_r}\right). \tag{5}$$

A virtual array of eight receive antennas is constructed, and the received raw data are rearranged to conform to the data processing model of the virtual array. The angle information can be obtained by using direction-of-arrival (DOA) estimation methods based on Equation (5). There are many DOA estimation methods [30][31][32][33][34][35][36][37], such as MUSIC [30][31][32], ESPRIT [33], and Capon [34,35]. In this paper, we design an RFBM 2D joint super-resolution algorithm to obtain range and azimuth information. The received data must first be rearranged to make them suitable for signal processing. The collected data are reshaped into a cube matrix $s(n, p, l)$, given by Equation (6), where $n = 1, 2, \cdots, N$, $p = 1, 2, \cdots, PF$, and $l = 1, 2, \cdots, L$; $N$ is the number of samples within the time duration $T$, $P$ is the number of consecutive chirps in one frame, $F$ is the total number of frames, $f_c$ is the carrier frequency, and $L$ is the number of virtual receiving antennas. In Equation (6), the three dimensions of the matrix $s$ contain information on range $R$, Doppler $f_d$, and azimuth $\theta$.

Signal Processing
In this section, we describe the signal processing methods of the FMCW MIMO radar, including a pre-processing method for improved RDM generation, which captures the radial information of gestures, and an RFBM algorithm for RAM generation, which captures their lateral information.

Generate RDM
This section introduces the generation process of traditional RDM and a pre-processing method, including window functions and an IF band-pass-filter for improved RDM.

Generate Traditional RDM
Since the matrix $s$ contains Range-Doppler information in all frames, the Range-Doppler FFT is performed in each frame to reveal how range and velocity change over time. Figure 4 shows the calculation process of the Range-Doppler FFT. A range-FFT performed on each column resolves objects in range, and a Doppler-FFT along each row resolves each range bin in velocity. The Doppler-FFT is applied to the output of the fast-time (range) FFT, which yields the traditional RDM. The RDM reflects the range and velocity information of the object. Figure 5a shows the traditional RDM obtained from real data after the Range-Doppler FFT.
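The two-step FFT can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the frame layout (samples along axis 0, chirps along axis 1), the function name, and the synthetic single-target signal are assumptions.

```python
import numpy as np


def range_doppler_map(frame):
    """Magnitude RDM of one frame shaped (N samples, P chirps)."""
    range_fft = np.fft.fft(frame, axis=0)        # fast time -> range bins
    doppler_fft = np.fft.fft(range_fft, axis=1)  # slow time -> Doppler bins
    # Shift zero Doppler to the center of the velocity axis.
    return np.abs(np.fft.fftshift(doppler_fft, axes=1))


# Synthetic single target that falls exactly on range bin 10 and Doppler bin 5.
N, P = 64, 32
n = np.arange(N)[:, None]
p = np.arange(P)[None, :]
frame = np.exp(2j * np.pi * (10 * n / N + 5 * p / P))

rdm = range_doppler_map(frame)
peak = np.unravel_index(np.argmax(rdm), rdm.shape)
```

In this synthetic case the peak lies at range bin 10 and, because of the Doppler-axis shift by P/2, at Doppler index 21.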

Window Functions for Spectrum Leakage Suppression
FFT operations introduce spectrum leakage, which reduces the spectral resolution and makes real objects harder to detect. To reduce it, we apply a window function before each FFT. The Hanning window [38] alleviates spectrum leakage while keeping good frequency resolution, so it is applied to the signals along the range dimension and the Doppler dimension, respectively. For the time-domain signal of the $f$-th frame, the Hanning windowing of the range and Doppler dimensions is

$$s_{rw}(N, P, f) = s(N, P, f) \cdot w_{\mathrm{hann}}(N),$$

$$s_{dw}(N, P, f) = s(N, P, f) \cdot w_{\mathrm{hann}}(P),$$

where $w_{\mathrm{hann}}(M)$ denotes the $M$-point Hanning window applied along the corresponding dimension, $s_{rw}(N, P, f)$ is the result of Hanning windowing of $s$ in the range dimension, and $s_{dw}(N, P, f)$ is the result of Hanning windowing of $s$ in the Doppler dimension. Figure 5b shows the RDM obtained after windowing and the Range-Doppler FFT.
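The effect of windowing can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the authors' code: `np.hanning` stands in for the Hanning window, the frame layout is (samples, chirps), and the helper `far_leakage_db` is a hypothetical metric that reports the strongest spectral component more than a few bins away from the peak.

```python
import numpy as np


def windowed_range_doppler_fft(frame):
    """Apply Hanning windows along range and Doppler, then the 2D FFT."""
    n_samples, n_chirps = frame.shape
    range_window = np.hanning(n_samples)[:, None]
    doppler_window = np.hanning(n_chirps)[None, :]
    range_fft = np.fft.fft(frame * range_window, axis=0)
    return np.fft.fft(range_fft * doppler_window, axis=1)


def far_leakage_db(spectrum, guard=8):
    """Largest magnitude more than `guard` bins from the peak, in dB."""
    mag = np.abs(spectrum)
    peak = int(np.argmax(mag))
    offset = np.abs(np.arange(mag.size) - peak)
    dist = np.minimum(offset, mag.size - offset)  # circular bin distance
    return 20.0 * np.log10(mag[dist > guard].max() / mag[peak])


# A tone halfway between two bins is the worst case for leakage.
n = np.arange(64)
tone = np.exp(2j * np.pi * 10.5 * n / 64)
leak_rect = far_leakage_db(np.fft.fft(tone))                   # no window
leak_hann = far_leakage_db(np.fft.fft(tone * np.hanning(64)))  # Hanning window
```

The Hanning window widens the main lobe slightly, but the leakage far from the peak drops substantially, so `leak_hann` is lower than `leak_rect`.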

Designed IF Band-Pass-Filter (IF-BPF) for Clutter Suppression
Besides hand gesture echoes, a real experimental scene contains background clutter. In addition, there is interference between the antennas. Both effects may cause clutter in the RDM. As can be seen in Figure 5a,b, there are peaks at almost all range bins when the velocity is 0, which are caused by the interference between antennas. There are also peaks at ranges from 1.5 m to 2 m. Based on the analysis of the experimental environment, these peaks are the echo spectrum of the ceiling. The motion range of the gesture is approximately 0.1-0.7 m, so targets beyond this range can be considered interference or clutter.
The Constant False-Alarm Rate (CFAR) detector can be employed to reduce background clutter and detect targets [39]. However, under strong background clutter, CFAR easily detects false targets, and its feature extraction is incomplete. The motion range of the gesture is approximately 0.1-0.7 m, and there are still strong peaks beyond this interval in the spectrum. According to Equation (2), the IF frequency is proportional to range, so clutter can be removed by filtering out the low and high frequencies of the IF signal. Therefore, we designed an IF-BPF to filter background clutter in the RDM. Figure 6 shows the block diagram of the designed IF-BPF.
Here, $x(n) = s_{rw}(N, p, f)$ is the filter input, and $r_l$ and $r_h$ are the minimum and maximum range of the gesture, respectively. For the filter parameters, $f_l$ and $f_h$ are the lower and higher passband cutoff frequencies, respectively, where $f_l \in [f_{pl}, f_{sl}]$, $f_h \in [f_{ph}, f_{sh}]$, and $f_{pl}$, $f_{sl}$, $f_{ph}$, $f_{sh}$ are the lower passband cutoff frequency, the lower stopband cutoff frequency, the higher passband cutoff frequency, and the higher stopband cutoff frequency, respectively. Additionally, $f_s = N/T$ is the sampling frequency ($N$ is the number of samples within the time duration $T$), and $\lceil \cdot \rceil$ denotes the ceiling function, rounding toward positive infinity. For the band-pass filter, $h_d(n)$ is the unit sampling response sequence and $h(n)$ is the system function of the band-pass filter. The output $y(n)$ no longer contains the frequency components with $|f| < f_l$ or $|f| > f_h$, and retains only the components with $|f| \in F_R$.
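The idea of the IF-BPF can be sketched as follows. This is not the authors' filter design: it assumes a windowed-sinc FIR band-pass filter with a Hamming window, maps the range limits to IF frequencies with the standard relation f = 2Kr/c, and uses hypothetical values for the chirp slope, the sampling frequency, and the number of taps.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s


def range_to_if(r_m, slope):
    """IF frequency of a target at range r_m, f = 2 K r / c."""
    return 2.0 * slope * r_m / C


def bandpass_fir(f_low, f_high, fs, num_taps=101):
    """Windowed-sinc band-pass FIR (difference of two ideal low-pass filters)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h_d = (2.0 * f_high / fs) * np.sinc(2.0 * f_high * n / fs) \
        - (2.0 * f_low / fs) * np.sinc(2.0 * f_low * n / fs)
    return h_d * np.hamming(num_taps)


def if_bpf(x, r_low, r_high, slope, fs, num_taps=101):
    """Keep only IF components that correspond to ranges in [r_low, r_high]."""
    h = bandpass_fir(range_to_if(r_low, slope), range_to_if(r_high, slope),
                     fs, num_taps)
    return np.convolve(x, h, mode="same")


# Hypothetical radar parameters and a two-target test signal.
slope = 1e14  # assumed chirp slope K in Hz/s
fs = 5e6      # assumed sampling frequency in Hz
t = np.arange(1024) / fs
gesture = np.exp(2j * np.pi * range_to_if(0.4, slope) * t)   # hand at 0.4 m
ceiling = np.exp(2j * np.pi * range_to_if(1.75, slope) * t)  # clutter at 1.75 m
y = if_bpf(gesture + ceiling, 0.1, 0.7, slope, fs)

gesture_gain = abs(np.vdot(gesture, y)) / len(t)
ceiling_gain = abs(np.vdot(ceiling, y)) / len(t)
```

With these assumed parameters, the component at 0.4 m passes almost unchanged while the component at 1.75 m is strongly attenuated, which mirrors the suppression of the ceiling echo described above.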
With the IF-BPF algorithm, most of the background frequency components are filtered out, and the frequency components corresponding to gestures are enhanced. Figure 5c shows the RDM after the IF-BPF. Figure 5 shows that spectrum leakage is well alleviated by windowing and that the background clutter is suppressed by the IF-BPF. The gesture spectrum is enhanced, which makes it easier to identify.

Generate RAM
This section describes a Range FFT based MUSIC 2D joint super-resolution estimation Algorithm (RFBM). This algorithm can realize range and azimuth joint estimation, so as to obtain the lateral information of gesture.
The generation of RAM requires joint estimation of range and azimuth. The range azimuth dimension data of cube matrix s in Equation (6) are selected for estimation. We adopt the first chirp of each frame in order to generate a RAM for each frame signal. Algorithm 1 introduces the proposed RFBM 2D joint super-resolution estimation algorithm.
Obtain the noise subspace $E_N$: perform the singular value decomposition of the covariance matrix $R_{xx}$ to get $E_N$, where $E_S$ and $E_N$ are the signal subspace and noise subspace, respectively.
(6) Determine the steering vectors and the angle search space, where $\theta_d$ and $\theta_u$ indicate the lower and upper bounds of the angular search space VecA, respectively.
Calculate the MUSIC spatial spectrum.
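The last steps of the algorithm (noise subspace, steering vectors, spatial spectrum) can be sketched with a generic one-dimensional MUSIC estimator. This is not the full RFBM joint range-azimuth algorithm: it assumes a uniform array of L = 8 virtual antennas with half-wavelength spacing, multiple snapshots from a single range bin, the textbook spectrum P(θ) = 1 / (aᴴ E_N E_Nᴴ a), and synthetic data; all names are hypothetical.

```python
import numpy as np


def steering_vector(theta_rad, n_antennas):
    """Steering vector for half-wavelength spacing (omega = pi * sin(theta))."""
    omega = np.pi * np.sin(theta_rad)
    return np.exp(1j * omega * np.arange(n_antennas))


def music_spectrum(snapshots, n_sources, angles_rad):
    """MUSIC spatial spectrum from snapshots shaped (antennas, snapshots)."""
    n_antennas, n_snapshots = snapshots.shape
    r_xx = snapshots @ snapshots.conj().T / n_snapshots  # covariance matrix
    u, _, _ = np.linalg.svd(r_xx)                        # singular vectors
    e_n = u[:, n_sources:]                               # noise subspace
    spectrum = np.empty(len(angles_rad))
    for i, theta in enumerate(angles_rad):
        a = steering_vector(theta, n_antennas)
        proj = e_n.conj().T @ a                          # projection on noise
        spectrum[i] = 1.0 / np.real(np.vdot(proj, proj))
    return spectrum


# Synthetic single target at 20 degrees with weak additive noise.
rng = np.random.default_rng(0)
L, n_snap = 8, 64
amplitude = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
noise = 0.05 * (rng.standard_normal((L, n_snap))
                + 1j * rng.standard_normal((L, n_snap)))
x = np.outer(steering_vector(np.deg2rad(20.0), L), amplitude) + noise

angles = np.deg2rad(np.arange(-60.0, 60.5, 0.5))  # angle search space
spectrum = music_spectrum(x, n_sources=1, angles_rad=angles)
est_deg = np.rad2deg(angles[np.argmax(spectrum)])
```

The spectrum peaks where the steering vector is nearly orthogonal to the noise subspace, so the location of the maximum gives the azimuth estimate.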
The RAM reflects the range and azimuth information of the real target. Figure 7 shows the RAMs of a target at different angular positions. The peaks in Figure 7 are the spectrum of the object, and the corresponding values on the horizontal and vertical axes are the estimated azimuth and range of the target. Figure 7a shows the RAM of a target at about −20°. Figure 7b shows that the estimated azimuth is approximately 0°. Figure 7c shows the RAM of a target at about 20°. The RAM can therefore effectively reflect the range and azimuth information of a real target.

Each frame signal generates an RDM and a RAM with the above signal processing method. The RDM and RAM of each frame then form an RDMTS and a RAMTS over the gesture recording duration, which represent the continuous radial and transverse motion information of the gesture.

Dual Stream 3DCNN-LSTM Networks
A 3DCNN can extract temporal information from a few consecutive pictures, but it is not sufficient for learning long-term information from a long picture sequence. Compared with a 3DCNN, a Long Short-Term Memory (LSTM) network is more suitable for learning long-term temporal information. In this paper, a gesture spans 30 frames, which yields 30 consecutive RDMs and 30 consecutive RAMs. 3DCNNs are employed to extract short-term spatiotemporal features first, and LSTMs are then employed to learn long-term spatiotemporal features of the RDMTS and RAMTS. Deeper spatiotemporal information can be learned in this way.
Inspired by [15], we propose a dual-stream 3DCNN-LSTM network for feature extraction. Our approach differs from [15] in two respects. First, only the RDM was obtained in [15], whereas in this paper we obtain both the RDMTS and, through the proposed RFBM algorithm, the RAMTS. We therefore propose a dual-stream concept: two network branches extract the features of the RDMTS and RAMTS separately, and the features are then merged.
Second, in [15], an I3D network and LSTMs were employed to extract RDM features. In this paper, since both RDMTS and RAMTS contain spatiotemporal features of gestures, 3DCNNs are employed to extract short-term spatiotemporal features first, and LSTMs are then employed to learn long-term spatiotemporal features of the RDMTS and RAMTS. Finally, the features are fused.
The feature extraction process of the proposed network contains two parts. First, two 3DCNNs are employed to learn short-term spatiotemporal features of the RDMTS and RAMTS; the learned features are called $f_{d1}$ and $f_{a1}$. Second, two LSTMs are employed to extract the long-term spatiotemporal features accumulated in $f_{d1}$ and $f_{a1}$; the resulting feature vectors are called $f_{d2}$ and $f_{a2}$. Each of $f_{d2}$ and $f_{a2}$ has size 1600, and together they contain the radial and transversal information of the gesture. After LSTM feature extraction, $f_{d2}$ and $f_{a2}$ are fused to form a fusion feature vector $f_{da}$ of size 3200. The fusion feature $f_{da}$ contains distance, velocity, and azimuth information for continuous gestures. It is input to two fully connected (FC) layers that reduce the dimensionality of the features and output 10 categories. A softmax function is employed in the final FC layer for classification. Figure 8 shows an overview of the proposed deep learning architecture.

3DCNN
Two 3DCNNs are employed to learn short-term spatiotemporal features of the RDMTS and RAMTS. Because they only need to capture short-term features, the 3DCNNs do not need to be particularly deep; unlike [15], only three Conv3D layers are constructed. The kernel size of each Conv3D layer is 7 × 7 × 5 with stride 1 × 1 × 1. Batch normalization [40] is utilized to accelerate the training process. Each batch normalization is followed by a rectified linear unit (ReLU) activation function. The numbers of filters of the three Conv3D layers are set to 32, 64, and 64, respectively. The last Conv3D layer is connected to two stacked pooling layers to reduce the output size of the 3DCNN component. Part A of Figure 8 shows the 3DCNN component.

LSTM
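The output of the 3DCNN is reshaped to match the input format of the LSTM. The LSTM component of the proposed architecture is displayed in Part B of Figure 8. The LSTM network is composed of LSTM cells, each of which contains a memory cell $C_t$, a forget gate $f_t$, an input gate $i_t$, and an output gate $o_t$. The cell stores information from previous steps and determines the output of the current step, so the connection between steps is maintained. The LSTM cell can be formulated as

$$\begin{aligned}
i_t &= \sigma\left(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f\right), \\
o_t &= \sigma\left(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o\right), \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c\right), \\
h_t &= o_t \odot \tanh\left(C_t\right),
\end{aligned}$$

where $x_t$ and $h_t$ are the input and the hidden state at step $t$, $\odot$ denotes the Hadamard product, $*$ denotes convolution, $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, $W_{x\sim}$ and $W_{h\sim}$ are 2D convolution kernels, and $b_i$, $b_f$, $b_o$, $b_c$ are the offsets.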
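A single step of an LSTM cell can be sketched in NumPy as follows. This is a simplified illustration, not the network used in the paper: it uses the standard LSTM gate equations with matrix products in place of the 2D convolution kernels, it includes a bias for the candidate cell state, and the sizes (input 6, hidden 4) and parameter names are hypothetical. Only the sequence length of 30 steps follows the 30 frames per gesture.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds W_x*, W_h* and b_* for the gates i, f, o, c."""
    def pre(gate):
        return p["W_x" + gate] @ x_t + p["W_h" + gate] @ h_prev + p["b_" + gate]

    i_t = sigmoid(pre("i"))                       # input gate
    f_t = sigmoid(pre("f"))                       # forget gate
    o_t = sigmoid(pre("o"))                       # output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre("c"))  # memory cell update
    h_t = o_t * np.tanh(c_t)                      # hidden state / output
    return h_t, c_t


def random_params(input_size, hidden_size, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    p = {}
    for gate in "ifoc":
        p["W_x" + gate] = 0.1 * rng.standard_normal((hidden_size, input_size))
        p["W_h" + gate] = 0.1 * rng.standard_normal((hidden_size, hidden_size))
        p["b_" + gate] = np.zeros(hidden_size)
    return p


rng = np.random.default_rng(0)
params = random_params(input_size=6, hidden_size=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((30, 6)):  # 30 time steps, one per frame
    h, c = lstm_step(x_t, h, c, params)
```

The memory cell `c` carries information across all 30 steps, which is what lets the LSTM capture the long-term structure that a short 3D convolution kernel cannot.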
A deep LSTM structure with two stacked LSTM layers, as illustrated in Figure 8, is constructed to better learn the long-term spatiotemporal features of $f_{d1}$ and $f_{a1}$. Each LSTM layer is composed of 1600 cells, so the size of each learned feature is 1600. The two LSTM structures for $f_{d1}$ and $f_{a1}$ are identical. The two deep spatiotemporal features learned by the LSTMs, $f_{d2}$ and $f_{a2}$, are concatenated into a fusion feature $f_{da}$ of size 3200 × 1. Two FC layers are constructed to reduce dimensionality and map the fusion feature to 10 categories. A softmax function is employed in the final FC layer to output the classification results.
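The fusion and classification stage can be sketched in NumPy. The feature sizes (1600 per stream, 3200 after concatenation) and the 10 output categories follow the text; the hidden FC width of 512, the ReLU activation on the first FC layer, and the random weights are assumptions made only for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def fuse_and_classify(f_d2, f_a2, w1, b1, w2, b2):
    """Concatenate both stream features and apply two FC layers."""
    f_da = np.concatenate([f_d2, f_a2])       # fusion feature, size 3200
    hidden = np.maximum(w1 @ f_da + b1, 0.0)  # first FC layer (assumed ReLU)
    return f_da, softmax(w2 @ hidden + b2)    # second FC layer + softmax


rng = np.random.default_rng(0)
f_d2 = rng.standard_normal(1600)  # stand-in for the RDMTS stream feature
f_a2 = rng.standard_normal(1600)  # stand-in for the RAMTS stream feature

hidden_units = 512                # hypothetical width of the first FC layer
w1 = 0.01 * rng.standard_normal((hidden_units, 3200))
b1 = np.zeros(hidden_units)
w2 = 0.01 * rng.standard_normal((10, hidden_units))
b2 = np.zeros(10)

f_da, probs = fuse_and_classify(f_d2, f_a2, w1, b1, w2, b2)
predicted_class = int(np.argmax(probs))
```

Concatenation keeps the radial and transversal features in separate halves of the fusion vector, so the first FC layer is free to learn how to weight them against each other.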

Experiments and Result Analysis
The devices employed for hand gesture recognition are the IWR1443 millimeter-wave radar sensor [41] and the DCA1000 data capture adapter [42], both made by Texas Instruments. Figure 9 shows the radar signal data acquisition module. Table 1 shows the experimental setup and parameter configuration of the FMCW MIMO radar. The two-stream 3DCNN and LSTM networks are built in TensorFlow for training and testing. The number of epochs, batch size, and learning rate are set to 20, 16, and 5 × 10^−4, respectively. The host for signal processing and for deep learning training and testing is configured with an Intel i7-9700K processor and a GIGABYTE RTX 2080 SUPER graphics card.

Experimental Setup and Data Collection
We designed 10 gestures in pairs that are easily confused when only a single dimension, radial or transversal, is observed. The 10 types of hand gestures are shown in Figure 10. These gestures have potential uses in many HCI applications. For instance, CW and CCW can be used to turn the volume up or down, and SRL and SLR can be used to switch channels.

Gesture data from different experimenters were collected to make the data set more robust. Five volunteers, three men and two women, were recruited to participate in the experiment. Each participant performed every gesture 20 times, so each gesture was performed 100 times and a total of 1000 hand gesture samples were obtained. The data set is divided into a training set and a testing set with a ratio of 8:2, giving 800 samples for training and 200 samples for testing. Each gesture sample contains two sequences (RDMTS and RAMTS), so there are 1600 training sequences and 400 testing sequences. The radar sensor and the data capture adapter are fixed to a table, facing the ceiling, in order to reduce the interference of the human body on the spectrum.
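The data set sizes quoted above follow from simple counting, reproduced here as a short check. The variable names are illustrative only.

```python
PARTICIPANTS = 5          # volunteers
REPETITIONS = 20          # repetitions of each gesture per participant
GESTURE_TYPES = 10        # gesture categories
TRAIN_PARTS, TEST_PARTS = 8, 2  # 8:2 split
SEQUENCES_PER_SAMPLE = 2  # one RDMTS and one RAMTS per gesture sample

samples_per_gesture = PARTICIPANTS * REPETITIONS
total_samples = samples_per_gesture * GESTURE_TYPES
train_samples = total_samples * TRAIN_PARTS // (TRAIN_PARTS + TEST_PARTS)
test_samples = total_samples - train_samples
train_sequences = train_samples * SEQUENCES_PER_SAMPLE
test_sequences = test_samples * SEQUENCES_PER_SAMPLE

print(samples_per_gesture, total_samples, train_samples, test_samples,
      train_sequences, test_sequences)
```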

RDMTS with Windowing and IF-BPF
In this experiment, a gesture lasts 30 frames in total. Each frame of the signal yields one RDM, so 30 RDMs form an RDMTS that represents a gesture. As an example, several RDMs of a push gesture obtained without windowing and IF-BPF are shown in Figure 11, and Figure 12 shows the RDMs of the same push gesture after windowing and IF-BPF. Comparing Figures 11 and 12, the target in the RDM is more distinct after windowing and IF-BPF. The highlighted region in the RDM is the gesture echo spectrum. Frames 1, 7, 12, 22, and 28 show that the range is decreasing, which means that the hand is approaching the radar. Additionally, the velocity changes from zero to negative and then returns to zero, which is consistent with the velocity trend of a real hand gesture.
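The construction of a windowed RDM and of a 30-frame RDMTS can be sketched with NumPy on synthetic data. This is an illustration under stated assumptions, not the authors' pre-processing chain: the Hann window, the frame size of 32 chirps by 64 samples, and the single-tone target are placeholders, and the IF-BPF stage is omitted.

```python
import numpy as np


def range_doppler_map(frame, window=True):
    """Magnitude RDM of one frame shaped (num_chirps, num_samples).

    Rows of the result are Doppler bins (zero Doppler centred),
    columns are range bins.
    """
    num_chirps, num_samples = frame.shape
    if window:
        # Assumption: Hann windows along fast time and slow time.
        frame = frame * np.hanning(num_samples)[np.newaxis, :]
        frame = frame * np.hanning(num_chirps)[:, np.newaxis]
    range_fft = np.fft.fft(frame, axis=1)      # fast time -> range
    doppler_fft = np.fft.fft(range_fft, axis=0)  # slow time -> Doppler
    return np.abs(np.fft.fftshift(doppler_fft, axes=0))


def synthetic_frame(num_chirps=32, num_samples=64, range_bin=10, doppler_bin=5):
    """Hypothetical noise-free beat signal of a single point target."""
    n = np.arange(num_samples)[np.newaxis, :]
    m = np.arange(num_chirps)[:, np.newaxis]
    phase = 2 * np.pi * (range_bin * n / num_samples + doppler_bin * m / num_chirps)
    return np.exp(1j * phase)


rdm = range_doppler_map(synthetic_frame())
peak_doppler, peak_range = np.unravel_index(np.argmax(rdm), rdm.shape)

# Stack 30 per-frame RDMs into one RDMTS; the target range bin shrinks
# over time to mimic a hand approaching the radar.
rdmts = np.stack([
    range_doppler_map(synthetic_frame(range_bin=10 - k // 5))
    for k in range(30)
])
print(peak_doppler, peak_range, rdmts.shape)
```

Because the Doppler axis is shifted so that zero velocity sits in the middle row, a target in Doppler bin 5 of a 32-chirp frame appears at row 21.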

RAMTS with RFBM Algorithm
According to Algorithm 1, each frame of the signal yields one RAM, so 30 RAMs form a RAMTS that represents a gesture. As an example, several RAMs from different frames of a sliding-left-to-right gesture are shown in Figure 13. The RAMTS reflects the change of hand location relative to the radar during the movement of the hand. The RAMTS shows that the azimuth changes from about −40° to 0° and then to about 20°, reflecting the location change of the hand during the motion.
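To illustrate how a subspace method yields an azimuth estimate, the sketch below runs plain one-dimensional MUSIC on simulated snapshots from a single range bin. It is not the authors' RFBM algorithm (Algorithm 1): the 8-element half-wavelength uniform linear array, the noise level, the steering-vector sign convention, and the 1° search grid are assumptions chosen for the example.

```python
import numpy as np


def steering_vector(angle_deg, num_antennas, spacing=0.5):
    """Array response of a uniform linear array (spacing in wavelengths)."""
    k = np.arange(num_antennas)
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(angle_deg)))


def music_spectrum(snapshots, num_sources, angles_deg, spacing=0.5):
    """MUSIC pseudo-spectrum from snapshots shaped (antennas, snapshots)."""
    num_antennas, num_snapshots = snapshots.shape
    covariance = snapshots @ snapshots.conj().T / num_snapshots
    # eigh returns eigenvalues in ascending order, so the first
    # (num_antennas - num_sources) eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(covariance)
    noise_subspace = eigvecs[:, : num_antennas - num_sources]
    spectrum = np.empty(len(angles_deg))
    for i, angle in enumerate(angles_deg):
        a = steering_vector(angle, num_antennas, spacing)
        projection = noise_subspace.conj().T @ a
        spectrum[i] = 1.0 / np.real(np.vdot(projection, projection))
    return spectrum


NUM_ANTENNAS = 8    # assumption: size of the azimuth virtual array
NUM_SNAPSHOTS = 64  # assumption: snapshots taken at one range bin
TRUE_ANGLE = 20.0   # simulated hand azimuth in degrees

rng = np.random.default_rng(0)
source = rng.standard_normal(NUM_SNAPSHOTS) + 1j * rng.standard_normal(NUM_SNAPSHOTS)
noise = 0.1 * (rng.standard_normal((NUM_ANTENNAS, NUM_SNAPSHOTS))
               + 1j * rng.standard_normal((NUM_ANTENNAS, NUM_SNAPSHOTS)))
snapshots = np.outer(steering_vector(TRUE_ANGLE, NUM_ANTENNAS), source) + noise

angles = np.arange(-90, 91)  # 1-degree search grid
spectrum = music_spectrum(snapshots, num_sources=1, angles_deg=angles)
estimated_angle = int(angles[np.argmax(spectrum)])
print(estimated_angle)
```

Repeating such an azimuth search for every range bin of interest gives one range-azimuth image per frame, which is the kind of map that a RAMTS stacks over time.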

Table 2 presents the confusion matrix of the 10 types of hand gestures, which is used to analyze misclassification. The proposed system recognizes the 10 types of hand gestures with high accuracy, ranging from 85.7% to 100%. Seven types of gestures achieved 100% recognition accuracy, and the other three achieved recognition rates of 94.4%, 85.7%, and 88.9%, corresponding to the three complex gestures CW, DV, and PLPS, respectively. The recognition accuracy over the entire data set reached 97.66%.

Impact of RFBM Algorithm
A key contribution of this paper is the addition of RAMTS to represent the lateral change of gestures, which is rarely considered in other papers. To verify the effect of lateral information on recognition accuracy, several training strategies are used to evaluate the combination of RDMTS and RAMTS.
Strategy 1: training 3DCNN-LSTM with only RDMTS. Because only RDMTS is considered, a single 3DCNN-LSTM rather than the DS-3DCNN-LSTM is needed. There is no feature fusion step, and the features of RDMTS are input directly to the fully connected layer for classification.
Strategy 2: training 3DCNN-LSTM with only RAMTS. As with the RDMTS training process, there is no feature fusion step. The features of RAMTS are input directly to the fully connected layer for classification.
Strategy 3: training DS-3DCNN-LSTM with RDMTS and RAMTS. RAMTS and RDMTS are input to the network at the same time, and the features produced by the DS-3DCNN-LSTM are concatenated and passed to the fully connected layer for classification.
Table 3 compares the recognition accuracy of the different training strategies. The accuracies of Strategy 1 and Strategy 2 are lower than that of Strategy 3, which verifies the effectiveness of combining RDMTS and RAMTS.

Impact of Window Function and IF-BPF
To analyze the impact of the windowing and IF-BPF preprocessing, we compared the traditional RDM with the improved RDM with windowing and IF-BPF (WBP-RDM) while keeping all other parts unchanged. The same training and testing processes were carried out. Figure 14 shows the recognition results. As the number of steps increases, both inputs converge to a stable accuracy. However, the traditional RDM converges much more slowly than the WBP-RDM: as Figure 14 shows, the traditional RDM converges at about step 400, while the WBP-RDM converges at about step 200. In terms of final accuracy, the WBP-RDM achieves 97.66%, an improvement of about 3.91 percentage points over the traditional RDM, which reaches 93.75%.
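The reported gain is a difference between two accuracies, so it is measured in percentage points. The short check below reproduces it from the two accuracies quoted above and also gives the corresponding relative gain, which is not a figure reported in the text and is shown only for comparison.

```python
wbp_rdm_accuracy = 97.66          # accuracy with windowing and IF-BPF, in percent
traditional_rdm_accuracy = 93.75  # accuracy with the traditional RDM, in percent

# Absolute gain in percentage points, as quoted in the text.
absolute_gain = round(wbp_rdm_accuracy - traditional_rdm_accuracy, 2)

# Relative gain with respect to the traditional RDM (derived here, not reported).
relative_gain = round(100 * absolute_gain / traditional_rdm_accuracy, 2)

# Ratio of the convergence steps read from the accuracy curves.
convergence_speedup = 400 / 200

print(absolute_gain, relative_gain, convergence_speedup)
```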

Impact of Different Networks on Accuracy
In this work, we combined 3DCNN and LSTM to extract features: the 3DCNNs learn short-term features and the LSTMs learn long-term features. To verify the validity of this combination, we compared the accuracy obtained when extracting features with 3DCNN-LSTM against that obtained with only 3DCNN or only LSTM. For a fair comparison, the structure and parameters of the 3DCNN and LSTM are kept consistent, and the experiments are conducted on the same training and testing sets. Figure 15 shows the recognition results of the different networks. The recognition accuracy of all three networks improves as the number of steps increases, and the accuracy of DS-3DCNN-LSTM is the highest, above that of 3DCNN and LSTM. The final recognition accuracies of DS-3DCNN-LSTM, LSTM, and 3DCNN are 97.66%, 91.41%, and 88.28%, respectively, which suggests that combining 3DCNN and LSTM to extract both short-term and long-term spatiotemporal features is effective for hand gesture recognition.

Conclusions
This work proposed a DS-3DCNN-LSTM gesture recognition system based on the fusion of RDMTS and RAMTS from an FMCW MIMO radar. First, a windowed RDM with IF-BPF was presented for hand range and velocity estimation. Second, an RFBM 2D joint super-resolution algorithm was proposed to generate the RAM for range and azimuth estimation. Finally, a DS-3DCNN-LSTM network was presented for the feature extraction and fusion of RDMTS and RAMTS, preserving both the radial and the transversal information of gestures. Several comparative experiments were conducted on 10 complex gestures. The windowed RDM with IF-BPF obtains a 3.91% improvement over the traditional RDM, which verifies the effectiveness of the presented signal preprocessing method. The dual-stream 3DCNN-LSTM network based on the feature fusion of RDMTS and RAMTS performs better than a single-stream 3DCNN-LSTM: accuracy improves by 15.63% over RDMTS-only input and by 3.69% over RAMTS-only input. The average recognition accuracy of the proposed method reached 97.66%, showing that the method can effectively distinguish different gestures.
Future work will consider suppressing human-body interference in more complex scenarios and will focus on state-of-the-art deep learning networks to extract the features of complex gestures.

Conflicts of Interest:
The authors declare no conflict of interest.