Article

Micro Gesture Recognition with Multi-Dimensional Feature Fusion and CQ-MobileNetV3 Using FMCW Radar

1 School of Automation, China University of Geosciences, Wuhan 430074, China
2 Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
3 Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(22), 6949; https://doi.org/10.3390/s25226949
Submission received: 13 October 2025 / Revised: 10 November 2025 / Accepted: 11 November 2025 / Published: 13 November 2025
(This article belongs to the Special Issue Sensor Systems for Gesture Recognition (3rd Edition))

Abstract

Radar-based gesture recognition technology has gained increasing attention in the context of contactless human–computer interaction (HCI). Micro gestures have smaller motion amplitudes and shorter duration compared with traditional gestures, which increases the difficulty of motion feature extraction. In addition, improving recognition accuracy while maintaining low computational and storage costs is also a challenge. In this paper, a micro gesture recognition method combining multi-dimensional feature fusion and a lightweight CQ-MobileNetV3 network is proposed. For feature extraction, the range–time map, velocity–time map, and angle–time map of gestures are first constructed. Then, normalization and adaptive filtering are performed to refine the three maps. Finally, the three refined maps are fused to form a range–velocity–angle–time map, which can accurately describe the motion characteristics of gestures. For recognition, a lightweight CQ-MobileNetV3 network is designed. First, the network structure of MobileNetV3 is optimized to reduce computational complexity. Then, the improved convolutional block attention module (CBAM) and the improved self-attention (SA) module are constructed and integrated into different bottleneck blocks to improve recognition accuracy. A series of experiments are conducted with a 77 GHz frequency-modulated continuous wave (FMCW) radar. The results indicate that CQ-MobileNetV3 achieves a recognition accuracy of 97.16% for 14 micro gestures, with a parameter count of 0.207 M and a computational complexity of 0.027 GFLOPs, surpassing several other deep neural networks.

1. Introduction

Gestures are an important form of human body language, with the advantages of simplicity, efficiency and universality [1]. At present, gesture recognition has been applied in various fields of human–computer interaction (HCI), such as smart home [2], virtual reality [3], intelligent driving [4], medical assistance [5,6], and sign language translation [7].
To date, different types of sensors have been applied to gesture recognition, including wearable sensors [8], visual sensors [9,10], ultrasonic sensors [11], Wi-Fi devices [12] and radar sensors [13,14]. Wearable sensors can precisely acquire gesture motion information, demonstrating good recognition performance. However, they require the user to wear devices and can be inconvenient, which limits their application scenarios. Visual sensors extract gesture feature information through cameras or depth cameras, but their recognition performance may decrease under insufficient lighting conditions, and they may also lead to user privacy leakage. Ultrasonic sensors use ultrasonic waves to detect gesture motion information, but they are susceptible to diffraction and limited by the slow propagation speed of sound. Wi-Fi devices utilize the channel state information of Wi-Fi signals to perceive gesture movements. They are not affected by lighting conditions but are susceptible to the multipath effect. Radar sensors are not influenced by weather or lighting conditions and do not capture identifiable images, which helps prevent user privacy leakage. Among various types of radars, a frequency-modulated continuous wave (FMCW) radar operating in the millimeter wave band has the merits of low cost, small size and high measurement accuracy and is considered the most promising non-contact gesture recognition technology.
Extensive studies have been conducted on gesture recognition based on radar. These studies have mainly focused on gesture feature extraction methods and recognition algorithms. Gesture features generally include the distance, velocity (Doppler), angle, and point clouds. Recognition algorithms mainly consist of machine learning algorithms and deep learning algorithms. Common machine learning algorithms include support vector machines (SVMs), artificial neural networks (ANNs), and K-nearest neighbors (KNNs). Li et al. [15] extracted six features of gestures, including azimuth, elevation, and Doppler information, and employed a shallow ANN for classification, achieving an average recognition accuracy of 93.3% for six gestures. Zhang et al. [16] extracted two Doppler features and used SVM for recognition, obtaining a recognition accuracy of 93.6% for four gestures. Rashid et al. [17] extracted Doppler spectra, used principal component analysis (PCA) to decrease the spectral feature dimension, and performed classification using a KNN classifier, achieving an accuracy of nearly 100% for four gestures.
Deep learning algorithms mainly include convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and Transformer networks. Wang et al. [18] employed range-Doppler images (RDIs) as features and utilized a CNN with LSTM units for recognition, obtaining a recognition rate of 87% for 11 gestures. Wang et al. [19] extracted the projection features from the range-Doppler map (RDM) and used a CNN to recognize six micro gestures, achieving a recognition rate of 98.06%. Ali et al. [20] extracted time–velocity and time–angle diagrams and utilized a CNN to recognize six micro gestures, obtaining an average accuracy of 95%. Wang et al. [21] extracted the range–Doppler map (RDM) and range–angle map (RAM) and employed a 3D-CNN for recognizing eight gestures, achieving an accuracy of 93.12%. Yu et al. [22] extracted RDIs and range–angle images (RAIs) and employed a CNN-LSTM network to classify 12 gestures, obtaining an accuracy of 94.67%. Wu et al. [23] extracted range–time maps (RTMs) and angle–time maps (ATMs) and designed a lightweight CNN model for feature fusion and classification for 11 gestures, achieving an accuracy of 97.53%. Yang et al. [24] obtained the motion trajectory of gestures from point clouds and used a single-layer LSTM network for recognition, achieving an average recognition accuracy of 98.2% for 10 air-digit-writing gestures. Song et al. [25] extracted the range–time map (RTM), Doppler–time map (DTM), azimuth–time map (ATM), and elevation–time map (ETM) and used DenseNet with a convolutional block attention module (CBAM) for recognition, obtaining a recognition accuracy of 99.03% for 12 micro gestures. Li et al. [26] extracted the RDM and used ResNet50 for recognition, achieving a recognition accuracy of 98.02% for four gestures. Li et al. [27] obtained trajectory points of gestures and converted them into images, then used Xception for recognition, achieving a recognition accuracy of 99.6% for 10 digital gestures. Jin et al. [28] constructed sparse time–Doppler–range features from range–Doppler maps and utilized a dual-flow Transformer network for recognition, obtaining an accuracy of 99.17% for six letter gestures.
From the aforementioned research, it can be seen that machine learning algorithms are suitable for processing low-dimensional features and are generally applied in recognition scenarios with fewer gesture types. Deep learning algorithms are suitable for processing high-dimensional features and have stronger feature extraction abilities, making them widely used in gesture recognition. In deep learning algorithms, the gesture features mainly consist of two types: temporal features and sequential features. Temporal features such as RDIs and RAIs reflect the motion information of gestures at a certain moment. Generally, an LSTM network is required to process the temporal features of multiple moments and extract time-sequential information for classification. However, the computational complexity of the LSTM network increases linearly with sequence length. Sequential features such as RTMs, DTMs, and ATMs describe the motion information over the entire gesture duration and can be recognized using a CNN. Therefore, sequential features are adopted in this work. Different from traditional macro gestures involving hand-level movements, micro gestures mainly involve finger-level movements, with smaller motion variations and shorter duration [29]. The similarity between different micro gestures increases the difficulty of feature extraction. To improve gesture recognition performance, existing methods generally extract multiple features to characterize the motion of gestures. However, multiple features increase the dimensionality of the information and may lead to information redundancy. Meanwhile, multiple features need to be fused in the network, increasing the computational complexity. Therefore, how to accurately recognize multiple micro gestures with fewer features and lightweight networks is also a challenge. To address this challenge, this study proposes a method combining the fusion of multi-dimensional features and a lightweight network. The main contributions of this work are as follows.
(1)
A feature extraction algorithm that integrates multi-dimensional features is proposed. First, the range–time map (RTM), velocity–time map (VTM), and angle–time map (ATM) are constructed from the raw data. Then, the three maps are refined using normalization and adaptive filtering. Finally, the refined RTM, VTM, and ATM are fused to construct the range–velocity–angle–time map (RVATM). RVATM can fully describe the motion information of micro gestures, with low feature dimensionality and less information redundancy.
(2)
A lightweight micro gesture recognition network CQ-MobileNetV3 is proposed based on the MobileNetV3 network. First, several convolution layers and bottleneck blocks are deleted, and the expansion size of bottleneck blocks is reduced, which can reduce computational and storage requirements. Then, an improved convolutional block attention module (CBAM) is constructed, in which the convolution operations of some modules are simplified, and the simpler activation function is adopted to reduce computational complexity. Meanwhile, an improved self-attention (SA) module is designed, in which grouped query technology is employed to reduce computational burden. Finally, the two improved attention modules are integrated into different bottleneck blocks to enhance feature extraction capabilities with little addition in computational complexity. The combination of three improvements enables the network to remain lightweight while maintaining high recognition accuracy.
(3)
An experimental platform is designed, and a micro gesture dataset is constructed to evaluate the proposed method. A dataset containing 14 micro gestures in three scenarios is constructed using TI’s IWR1443 FMCW radar sensor. The experimental results verify the robustness and superiority of the proposed method.
The rest of the paper is organized as follows. In Section 2, the principle of the FMCW radar is presented. In Section 3, the details of multi-dimensional feature fusion are provided. Section 4 describes the lightweight CQ-MobileNetV3 network. Section 5 gives experimental verification results. In Section 6, the summary of this work is provided.

2. Principle of FMCW Radar

2.1. System Architecture and Signal Model

Figure 1 illustrates the architecture of the FMCW radar system [30]. The FMCW signal is generated by the signal synthesizer and transmitted by the transmitting antenna (Tx). The receiving antenna (Rx) is used to receive the target echo. The received signal is mixed with the transmitted signal through a mixer, and after low-pass filtering (LPF), the intermediate frequency (IF) signal is obtained. Finally, an analog-to-digital converter (ADC) is used to sample the IF signal, which is then processed by a digital signal processing (DSP) module.
Generally, FMCW radar transmits multiple chirps to detect the target. The frequency domain waveform of the transmitted and received signals of the FMCW radar is illustrated in Figure 2, where f_c is the starting frequency, B is the modulation bandwidth, T_f is the sweep duration, T_i is the chirp period, and τ is the time delay.
The transmitted FMCW signal is given by
s_T(t) = A_T \cos\left[ 2\pi \left( f_c t + \frac{\mu t^2}{2} \right) + \varphi_0 \right]
where A_T is the amplitude of the transmitted FMCW signal, μ = B/T_f is the modulation slope, and φ_0 is the initial phase.
The time delay of the echo can be given by
\tau = \frac{2(R + vt)}{c}
where c is the speed of light, and R and v are the range and velocity of the target, respectively.
Then, the received signal is written as
s_R(t) = A_R \cos\left[ 2\pi \left( f_c (t - \tau) + \frac{\mu}{2} (t - \tau)^2 \right) + \varphi_0 \right]
where A_R is the amplitude of the received signal.
The transmitted and the received signals are mixed and low-pass-filtered to obtain the IF signal, which can be given by
s_{IF} = A_{IF} \cos\left[ 2\pi \left( f_c \tau + \mu t \tau - \frac{\mu}{2} \tau^2 \right) \right]
Substituting (2) into (4), the IF signal can be rewritten as
s_{IF} = A_{IF} \cos\left[ 2\pi \left( \frac{2 f_c (R + vt)}{c} + \frac{2\mu R t}{c} + \frac{2\mu v t^2}{c} - \frac{4\mu (R + vt)^2}{2c^2} \right) \right]
As 2μvt^2/c and 4μ(R + vt)^2/(2c^2) can be neglected, the IF signal can be simplified to
s_{IF} \approx A_{IF} \cos\left[ 2\pi \left( \frac{2\mu R t}{c} + \frac{2 f_c (R + vt)}{c} \right) \right] = A_{IF} \cos\left[ 2\pi \left( \frac{2\mu R}{c} + \frac{2v}{\lambda} \right) t + \frac{4\pi R}{\lambda} \right]
where λ = c/f_c is the wavelength.
Then, the frequency and phase of the IF signal are represented as
f_{IF} = \frac{2\mu R}{c} + \frac{2v}{\lambda}
\varphi_{IF} = \frac{4\pi R}{\lambda}

2.2. Range Measurement

Due to the slow velocity of gestures, the second term is much smaller than the first term and can be neglected in (7), so the IF frequency can be given by
f_{IF} = \frac{2\mu R}{c}
Then, the target range can be given by
R = \frac{f_{IF}\, c}{2\mu}
Because the duration of a chirp is T_f, the resolution of the IF frequency can be defined as
\Delta f_{IF} = \frac{1}{T_f}
Therefore, the range resolution is represented as
R_{res} = \frac{\Delta f_{IF}\, c}{2\mu} = \frac{c}{2B}

2.3. Velocity Measurement

In (7), the variation in velocity has little effect on the IF frequency, thus the velocity cannot be obtained from the IF frequency. However, the target motion can lead to obvious changes in the phase between adjacent chirps, especially for millimeter wave signals.
The phase difference between the IF signals corresponding to two adjacent chirps can be expressed as
\Delta \varphi_{IF} = \frac{4\pi v T_i}{\lambda}
Then, the target velocity is given by
v = \frac{\lambda\, \Delta \varphi_{IF}}{4\pi T_i}
To obtain the maximum unambiguous velocity, |Δφ_IF| ≤ π should be satisfied. Therefore, the maximum velocity is
v_{max} = \frac{\lambda}{4 T_i}
When using N chirps for velocity measurement, the minimum value of the phase difference is 2π/N, and the velocity resolution can be obtained as
v_{res} = \frac{\lambda}{2 N T_i}

2.4. Angle Measurement

Angle measurement requires at least two receiving antennas. Due to the far-field conditions in the experimental environment, the paths of the echoes to different receiving antennas can be considered as parallel incidence. The principle of angle measurement using antenna arrays is shown in Figure 3.
The range difference for signals of two adjacent receiving antennas can be represented as
\Delta R = d \sin\theta
where d is the spacing of adjacent receiving antennas, and θ is the incident angle.
Then, the phase difference for signals of two adjacent receiving antennas can be written as
\Delta \varphi = \frac{2\pi\, \Delta R}{\lambda} = \frac{2\pi d \sin\theta}{\lambda}
The incident angle is given by
\theta = \sin^{-1}\left( \frac{\lambda\, \Delta \varphi}{2\pi d} \right)
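As a concrete illustration of the range, velocity, and angle equations above, the following Python sketch converts an IF frequency, an inter-chirp phase difference, and an inter-antenna phase difference into range, velocity, and angle. The numeric parameter values are illustrative assumptions only and do not correspond to the radar configuration in Table 2.

```python
import numpy as np

# Illustrative radar parameters (assumed values, not the Table 2 configuration)
c = 3e8              # speed of light (m/s)
fc = 77e9            # starting frequency (Hz)
B = 4e9              # sweep bandwidth (Hz)
Tf = 50e-6           # sweep duration (s)
Ti = 60e-6           # chirp period (s)
lam = c / fc         # wavelength
mu = B / Tf          # modulation slope
d = lam / 2          # antenna spacing of half a wavelength (m)

def range_from_if(f_if):
    """Target range from the IF frequency: R = f_IF * c / (2 * mu)."""
    return f_if * c / (2 * mu)

def velocity_from_phase(dphi):
    """Radial velocity from the phase difference between adjacent chirps: v = lam * dphi / (4 * pi * Ti)."""
    return lam * dphi / (4 * np.pi * Ti)

def angle_from_phase(dphi):
    """Incident angle from the phase difference between adjacent antennas: theta = arcsin(lam * dphi / (2 * pi * d))."""
    return np.arcsin(lam * dphi / (2 * np.pi * d))

print(range_from_if(1e6))                  # range for a 1 MHz IF tone
print(velocity_from_phase(0.3))            # velocity for a 0.3 rad inter-chirp phase step
print(np.degrees(angle_from_phase(0.5)))   # angle for a 0.5 rad inter-antenna phase step
```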

3. Multi-Dimensional Feature Fusion

Figure 4 shows the flow chart of multi-dimensional feature fusion. First, the sampling data of the IF signal is rearranged into the format of Samples × Chirps × Antennas. Subsequently, RTM and VTM are extracted from the multi-frame data of one antenna, where one frame contains multiple chirps. Simultaneously, ATM is obtained from the multi-frame data of multiple antennas. Then, the three maps RTM, VTM, and ATM are refined using feature preprocessing. Finally, the three refined maps are fused to construct the range–velocity–angle–time map (RVATM).

3.1. RTM and VTM Extraction

The extraction of RTM and VTM needs multi-frame data of one antenna. First, the two-dimensional data in one frame is processed using 2-D fast Fourier transform (2-D FFT) to obtain the range-Doppler map (RDM). Then, the coherent accumulation is performed separately in the range and Doppler dimensions of the RDM to obtain one-dimensional range information and one-dimensional velocity information of this frame. Finally, RTM and VTM are generated through stacking one-dimensional range information and velocity information from multiple frames.
Assume the two-dimensional data matrix in one frame is T[m, n], where 1 ≤ m ≤ M, 1 ≤ n ≤ N, one chirp contains M samples, and one frame contains N chirps. Figure 5 shows the schematic of RDM generation, where the sample dimension is also called the fast time dimension, and the chirp dimension is also called the slow time dimension.
First, the range spectrum is obtained by performing FFT in the fast time dimension, which is represented as
R(k, n) = \mathrm{FFT}\left[ T(m, n) \right] = \sum_{m=1}^{K} T(m, n) \exp\left( -j \frac{2\pi}{K} m (k - 1) \right)
where 1 ≤ k ≤ K, and K is the number of FFT points.
Then, the range–Doppler map is generated by performing FFT in the slow time dimension, which is given by
RD(k, l) = \mathrm{FFT}\left[ R(k, n) \right] = \mathrm{FFT2}\left[ T(m, n) \right] = \sum_{n=1}^{L} \sum_{m=1}^{K} T(m, n) \exp\left( -j \frac{2\pi}{K} m (k - 1) \right) \exp\left( -j \frac{2\pi}{L} n (l - 1) \right)
where 1 ≤ l ≤ L, and L is the number of FFT points.
Subsequently, one-dimensional range information is obtained by coherent accumulation in the range dimension of RDM, which is given by
F_R(k) = \sum_{l=1}^{L} RD(k, l)
Similarly, one-dimensional velocity information is obtained by coherent accumulation in the Doppler dimension of the RDM, which is given by
F_V(l) = \sum_{k=1}^{K} RD(k, l)
Finally, the one-dimensional range and velocity information of multiple frames are stacked to generate RTM and VTM, which can be represented as
F_{RTM} = \left[ F_{R1} \ \ F_{R2} \ \ \cdots \ \ F_{RN_F} \right]
F_{VTM} = \left[ F_{V1} \ \ F_{V2} \ \ \cdots \ \ F_{VN_F} \right]
where F_Ri and F_Vi represent the range and velocity information of the ith frame, respectively, 1 ≤ i ≤ N_F, and N_F is the number of frames.
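The RTM/VTM construction described above can be summarized by a short NumPy sketch. The array layout (frames × chirps × samples for one antenna), FFT sizes, and variable names are assumptions for illustration rather than the exact implementation used in this work.

```python
import numpy as np

def extract_rtm_vtm(frames, K=128, L=128):
    """Build RTM and VTM from multi-frame IF data of one antenna.

    frames: complex array of shape (NF, N_chirps, M_samples)
    Returns RTM and VTM, each of shape (K or L, NF).
    """
    rtm_cols, vtm_cols = [], []
    for frame in frames:
        # 2-D FFT: range FFT along samples (fast time), Doppler FFT along chirps (slow time)
        rdm = np.fft.fft(frame, n=K, axis=1)
        rdm = np.fft.fft(rdm, n=L, axis=0)
        rdm = np.fft.fftshift(rdm, axes=0)          # center zero Doppler
        # Coherent accumulation over the Doppler and range dimensions of the RDM
        rtm_cols.append(rdm.sum(axis=0))            # 1-D range profile (K,)
        vtm_cols.append(rdm.sum(axis=1))            # 1-D velocity profile (L,)
    # Stack the per-frame profiles along the time dimension
    return np.abs(np.stack(rtm_cols, axis=1)), np.abs(np.stack(vtm_cols, axis=1))

# Example: 50 frames, 128 chirps of 64 samples each (as in Section 5.2.2)
frames = np.random.randn(50, 128, 64) + 1j * np.random.randn(50, 128, 64)
rtm, vtm = extract_rtm_vtm(frames)
print(rtm.shape, vtm.shape)  # (128, 50) (128, 50)
```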

3.2. ATM Extraction

The extraction of ATM requires multi-frame data from multiple antennas. Here, the multiple signal classification (MUSIC) algorithm [31] is utilized for angle extraction to obtain higher angular resolution.
Assume the number of receiving antennas is W and the incident angles of D targets are θ 1 , θ 2 , , θ D , the direction matrix can be written as
A = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ e^{j 2\pi d \sin\theta_1 / \lambda} & e^{j 2\pi d \sin\theta_2 / \lambda} & \cdots & e^{j 2\pi d \sin\theta_D / \lambda} \\ \vdots & \vdots & \ddots & \vdots \\ e^{j (W-1) 2\pi d \sin\theta_1 / \lambda} & e^{j (W-1) 2\pi d \sin\theta_2 / \lambda} & \cdots & e^{j (W-1) 2\pi d \sin\theta_D / \lambda} \end{bmatrix} = \left[ a(\theta_1) \ \ a(\theta_2) \ \ \cdots \ \ a(\theta_D) \right]
where a(θ_i) = [1 \ \ e^{j 2\pi d \sin\theta_i / \lambda} \ \ \cdots \ \ e^{j (W-1) 2\pi d \sin\theta_i / \lambda}]^T is the direction vector of the angle θ_i, which represents the response of the antennas to the ith incident signal.
Then, the received signal of W antennas is represented as
X(t) = A S(t) + N(t)
where S(t) is the vector of the incident signal, and N(t) is the vector of noise.
The covariance matrix of the received signal is written as
R_{xx} = E\left[ X(t) X^H(t) \right] = A\, E\left[ S S^H \right] A^H + E\left[ N N^H \right] = \sum_{i=1}^{D} \sigma_i^2\, a(\theta_i) a^H(\theta_i) + \sigma_n^2 I
where (·)^H represents the conjugate transpose operation, and σ_i^2 and σ_n^2 are the powers of the ith incident signal and the noise, respectively.
Generally, the estimation of covariance matrix is given by
\hat{R}_{xx} = \frac{1}{P} \sum_{i=1}^{P} X(i) X^H(i)
where P is the number of snapshots.
The eigenvalue decomposition of the covariance matrix is given by
\hat{R}_{xx} = U_S \Lambda_S U_S^H + U_N \Lambda_N U_N^H = U_S \Lambda_S U_S^H + \sigma_n^2 U_N U_N^H
where Λ_S and Λ_N are the diagonal eigenvalue matrices corresponding to the signal and noise, respectively, and U_S and U_N are the eigenvector matrices of the signal subspace and noise subspace, respectively.
Ideally, U_S and U_N are orthogonal, and the direction vectors of the incident signals are orthogonal to the noise subspace, which can be written as
a^H(\theta) U_N = 0
Therefore, a spatial spectrum is constructed to estimate the angle, which is represented as
P(\theta) = \frac{1}{a^H(\theta)\, U_N U_N^H\, a(\theta)}
In the MUSIC algorithm, the angle corresponding to the maximum value of P(θ) is the signal incident direction.
In one frame, the data of one chirp from multiple antennas are used to estimate the covariance matrix and obtain the spatial spectrum. In practical calculation, assuming the step of the spatial spectrum search is Δθ, the discrete one-dimensional angle information can be defined as
F_A(p) = P(p\, \Delta\theta)
where p is an integer and −π/(2Δθ) ≤ p ≤ π/(2Δθ).
Finally, by stacking the one-dimensional angle information of multiple frames, the ATM can be obtained as
F_{ATM} = \left[ F_{A1} \ \ F_{A2} \ \ \cdots \ \ F_{AN_F} \right]
where F_Ai represents the angle information of the ith frame.
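The following sketch outlines the MUSIC spatial-spectrum search described above for a uniform linear array with half-wavelength spacing. The single-target assumption (one source), the grid size, and the variable names are illustrative assumptions, not the exact settings of this work.

```python
import numpy as np

def music_spectrum(x, n_sources=1, n_angles=128, d_over_lambda=0.5):
    """One-dimensional angle profile via MUSIC.

    x: complex snapshot matrix of shape (W antennas, P snapshots)
    Returns the normalized spatial spectrum evaluated on a grid of n_angles angles.
    """
    W, P = x.shape
    # Sample covariance matrix estimated from P snapshots
    Rxx = (x @ x.conj().T) / P
    # Eigen-decomposition; the W - D eigenvectors with the smallest eigenvalues span the noise subspace
    eigvals, eigvecs = np.linalg.eigh(Rxx)
    Un = eigvecs[:, : W - n_sources]
    # Spatial spectrum search over the angle grid
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_angles)
    spectrum = np.empty(n_angles)
    for i, theta in enumerate(thetas):
        a = np.exp(1j * 2 * np.pi * d_over_lambda * np.arange(W) * np.sin(theta))
        denom = a.conj() @ Un @ Un.conj().T @ a
        spectrum[i] = 1.0 / np.abs(denom)
    return spectrum / spectrum.max()

# Example: 8 virtual antennas (2Tx-4Rx), a single snapshot from one chirp
x = np.random.randn(8, 1) + 1j * np.random.randn(8, 1)
print(music_spectrum(x).shape)  # (128,)
```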

3.3. Feature Preprocessing

In the measurement, the range variation in the gesture causes the variation in the strength of the IF signal, which further leads to fluctuations in the amplitude of range, velocity, and angle information in each frame. In addition, echoes from other parts of the human body and environmental clutter also result in interference in RTM, VTM, and ATM. To reduce the impact of the above factors, normalization and adaptive filtering are used to process the three maps to improve the quality of feature extraction.
First, normalization is used to reduce the difference in the strength of features between different frames, which can be given by
F_{R1}(k) = \frac{F_R(k)}{\max\left[ F_R(k) \right]}
F_{V1}(l) = \frac{F_V(l)}{\max\left[ F_V(l) \right]}
F_{A1}(p) = \frac{F_A(p)}{\max\left[ F_A(p) \right]}
Subsequently, adaptive filtering is applied to suppress interference, which is calculated as
F_{R2}(k) = F_{R1}(k) \exp\left( k_r \left| k - k_0 \right| \right)
F_{V2}(l) = F_{V1}(l) \exp\left( k_v \left| l - l_0 \right| \right)
F_{A2}(p) = F_{A1}(p) \exp\left( k_a \left| p - p_0 \right| \right)
where k_r, k_v, and k_a are the filter coefficients for range, velocity, and angle, respectively, with k_r < 0, k_v < 0, and k_a < 0; k_0, l_0, and p_0 are the positions corresponding to the maximum amplitudes of the range, velocity, and angle features in each frame, respectively.
Finally, the preprocessed three maps can be expressed as
F_{PRTM} = \left[ F_{R2,1} \ \ F_{R2,2} \ \ \cdots \ \ F_{R2,N_F} \right]
F_{PVTM} = \left[ F_{V2,1} \ \ F_{V2,2} \ \ \cdots \ \ F_{V2,N_F} \right]
F_{PATM} = \left[ F_{A2,1} \ \ F_{A2,2} \ \ \cdots \ \ F_{A2,N_F} \right]
where F_R2,i, F_V2,i, and F_A2,i represent the preprocessed range, velocity, and angle information of the ith frame, respectively.
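A minimal sketch of this per-frame normalization and adaptive exponential filtering is given below. The filter coefficient of −0.2 follows the experimental setting reported in Section 5.2.2; everything else (array shapes, names) is an assumed illustration.

```python
import numpy as np

def preprocess_map(feature_map, k_filter=-0.2):
    """Normalize each frame (column) and attenuate bins away from its peak.

    feature_map: non-negative array of shape (bins, NF frames)
    """
    out = np.empty_like(feature_map, dtype=float)
    bins = np.arange(feature_map.shape[0])
    for i in range(feature_map.shape[1]):
        col = feature_map[:, i]
        col = col / (col.max() + 1e-12)                       # per-frame normalization
        peak = np.argmax(col)                                 # position of the maximum amplitude
        col = col * np.exp(k_filter * np.abs(bins - peak))    # adaptive exponential filter
        out[:, i] = col
    return out

# Example: preprocess an RTM of size 128 x 50
rtm = np.abs(np.random.randn(128, 50))
print(preprocess_map(rtm).shape)  # (128, 50)
```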

3.4. Feature Fusion

The preprocessed RTM, VTM, and ATM are fused in the color space. Three feature maps are encoded separately into the RGB channels of the image and concatenated to obtain the range–velocity–angle–time map (RVATM), which can be expressed as
F_{RVATM} = \mathrm{cat}\left( F_{PRTM},\ F_{PVTM},\ F_{PATM} \right)
where cat(·) is the concatenation function. The red, green, and blue channels represent the range, velocity, and angle features, respectively.
RVATM integrates multiple motion features into one feature map, reducing feature dimensionality and information redundancy. Using RVATM as the input feature map does not require feature fusion in the network, resulting in lower computational complexity.
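As a sketch, the fusion step amounts to stacking the three preprocessed maps along a new channel axis (red = range, green = velocity, blue = angle); the function name is an assumption for illustration.

```python
import numpy as np

def fuse_rvatm(p_rtm, p_vtm, p_atm):
    """Fuse preprocessed RTM, VTM, and ATM into an RVATM of shape (bins, NF, 3)."""
    return np.stack([p_rtm, p_vtm, p_atm], axis=-1)

rvatm = fuse_rvatm(np.random.rand(128, 50), np.random.rand(128, 50), np.random.rand(128, 50))
print(rvatm.shape)  # (128, 50, 3)
```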

4. CQ-MobileNetV3 Network

4.1. MobileNetV3 Network

MobileNetV3 is a classic lightweight network [32], which integrates depthwise separable convolution, inverted residual structures, and squeeze-and-excitation (SE) attention modules. Depthwise separable convolution divides the standard convolution operation into depthwise (DW) convolution and pointwise (PW) convolution, reducing the demand for computational resources. The inverted residual structure enhances the feature representation capability while reducing the network complexity. The SE attention module extracts important information of channels to improve the feature expression ability.
The basic unit of MobileNetV3 is the bottleneck block, as shown in Figure 6. First, the dimensionality of the input features is expanded using a 1 × 1 PW convolution. Then, a 3 × 3 DW convolution is performed on the expanded features. Next, the features are fed into the SE attention module to obtain attention-weighted features. Finally, a 1 × 1 PW convolution is used to reduce the dimensionality of the features, and the output features and the initial input features are added as the final output.
In terms of the number of basic units, MobileNetV3 includes two versions, namely MobileNetV3-Large and MobileNetV3-Small. Among them, MobileNetV3-Small is applicable in resource-constrained scenarios and therefore is used as the baseline network. Figure 7 illustrates the architecture of MobileNetV3-Small. Bottleneck_SE represents the bottleneck block with the SE attention module, and Bottleneck represents the bottleneck block without the SE attention module.
In MobileNetV3-Small, a 3 × 3 convolutional layer is first employed for preliminary feature extraction. Subsequently, 11 stacked bottleneck blocks are employed for feature extraction. Next, a 1 × 1 convolution layer is applied to increase the feature dimension, and then global average pooling is performed to obtain the one-dimensional vector feature. Finally, two 1 × 1 convolutional layers map features to probability values for K classes.

4.2. Improved CBAM

The SE attention mechanism [33] in MobileNetV3 only considers the importance of each channel and ignores spatial information, which limits its ability to capture the spatial features of scattered regions of interest in gesture feature maps. To overcome the drawback, an improved convolutional block attention module (CBAM) is introduced into MobileNetV3.
The structure of CBAM is shown in Figure 8. The CBAM [34] is composed of the channel attention module (CAM) and the spatial attention module (SAM). The input features are first input into CAM to obtain channel-refined features; then, the features are input into SAM to obtain the spatial-refined features.
In CAM, first the input feature F undergoes max pooling and average pooling to obtain two feature descriptors. Then, the two feature descriptors are mapped into two weight vectors by a multi-layer perceptron (MLP). Finally, the two weight vectors are merged and normalized to generate the channel attention weights. The calculation can be given by
M_C(F) = \sigma\left( \mathrm{MLP}\left( \mathrm{MaxPool}(F) \right) + \mathrm{MLP}\left( \mathrm{AvgPool}(F) \right) \right) = \sigma\left( W_1\left( W_0\left( F^c_{max} \right) \right) + W_1\left( W_0\left( F^c_{avg} \right) \right) \right)
where σ is the Sigmoid activation function, and W0 and W1 represent the two-layer convolution operations of MLP, respectively.
In SAM, first the channel-refined features are also processed through max pooling and average pooling along the channel dimension to obtain two weight vectors. Subsequently, the two weight vectors pass through a convolutional layer and are normalized to generate spatial attention weights. The calculation is given by
M_S(F) = \sigma\left( f^{7 \times 7}\left( \left[ \mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F) \right] \right) \right)
where f^{7×7} denotes the 7 × 7 convolution operation.
Here, two improvements are proposed for CAM to reduce the computational complexity. Figure 9a shows the structure of the improved CAM. First, a two-layer linear transformation is used to replace the convolution operations in the MLP. Then, the Hardsigmoid function [35] is used to replace the Sigmoid function. The improved channel attention can be calculated as
M_C(F) = \mathrm{Hardsigmoid}\left( W_1 \cdot \mathrm{ReLU}\left( W_0 \cdot \left( \mathrm{MaxPool}(F) + \mathrm{AvgPool}(F) \right) \right) \right)
where W_0 and W_1 represent the two linear-transformation layers, respectively.
The linear transformation avoids the unnecessary dimension transformations in MLP and can reduce the computational cost.
The Hardsigmoid function is defined as
\mathrm{Hardsigmoid}(x) = \begin{cases} 1, & x > 3 \\ x/6 + 1/2, & -3 \le x \le 3 \\ 0, & x < -3 \end{cases}
The Hardsigmoid function is a piecewise linear approximation of the Sigmoid function, avoiding exponential operations, which can improve computational efficiency and reduce resource consumption.
Similarly, two improvements are also proposed for SAM. Figure 9b shows the structure of the improved SAM. First, a 3 × 3 convolution is used to replace the original 7 × 7 convolution. Then, the original Sigmoid function is replaced by the Hardsigmoid function. The calculation of improved spatial attention is given by
M_S(F) = \mathrm{Hardsigmoid}\left( f^{3 \times 3}\left( \left[ \mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F) \right] \right) \right)
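A possible PyTorch realization of the improved CAM and SAM described above is sketched below. The reduction ratio, module names, and pooling implementation are assumptions for illustration and not necessarily the exact configuration used in CQ-MobileNetV3.

```python
import torch
import torch.nn as nn

class ImprovedCAM(nn.Module):
    """Channel attention with a shared two-layer linear mapping and Hardsigmoid."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.act = nn.Hardsigmoid()

    def forward(self, x):
        pooled = x.amax(dim=(2, 3)) + x.mean(dim=(2, 3))   # MaxPool(F) + AvgPool(F)
        w = self.act(self.mlp(pooled))                      # channel attention weights
        return x * w.view(x.size(0), -1, 1, 1)

class ImprovedSAM(nn.Module):
    """Spatial attention with a 3x3 convolution and Hardsigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.act = nn.Hardsigmoid()

    def forward(self, x):
        desc = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
        return x * self.act(self.conv(desc))

class ImprovedCBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.cam, self.sam = ImprovedCAM(channels), ImprovedSAM()

    def forward(self, x):
        return self.sam(self.cam(x))

print(ImprovedCBAM(16)(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```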

4.3. Improved SA Module

The motion features in the gesture feature map have strong time correlation. The SA mechanism [36] determines the weight of value (V) by calculating the similarity between query (Q) and key (K), which can effectively capture temporal dependence and correlation in the sequence. Therefore, the SA mechanism is also introduced into MobileNetV3.
In gesture recognition, the SA mechanism has two advantages. First, it can capture the correlation between cross-frame gestures by calculating the feature similarity between any two positions. Second, it performs a weighted sum of the value based on the weights obtained from the similarity, enhancing the feature representation of key frames. Figure 10 shows the structure of the SA module.
In Figure 10, Q, K, and V are the query, key, and value matrices, respectively. The SA mechanism calculates the attention weights from the scaled dot product of Q and K and uses them to weight V, which can be given by
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
where d_k represents the dimension of K.
Although the SA mechanism has excellent feature expression ability, its computational cost increases quadratically with the input feature size. To solve this problem, the grouped query technique is introduced into the calculation of attention. First, group convolution is used to divide the input channels into G groups, and then the attention is calculated separately within each group. Finally, the attention outputs of all groups are concatenated. The calculation can be given by
\mathrm{GroupAttn}(Q, K, V) = \big\Vert_{g=1}^{G}\ \mathrm{softmax}\left( \frac{Q_g K_g^T}{\sqrt{d_k / G}} \right) V_g
where g is the index of the group, G is set to 4, and ∥ represents the concatenation operation.
However, the grouped query technique only attends to the keys and values within each group, which may limit the network’s ability to capture global information in some cases.
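The grouped self-attention can be sketched in PyTorch as follows, treating the flattened spatial positions of the feature map as the sequence and splitting the channels into G groups. The 1 × 1 projection layers and tensor shapes are illustrative assumptions rather than the exact module used in this work.

```python
import torch
import torch.nn as nn

class GroupedSelfAttention(nn.Module):
    """Self-attention computed independently in G channel groups, then concatenated."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1)  # Q, K, V projections
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def split(t):
            # Reshape to (batch, groups, positions, channels per group)
            return t.view(b, self.groups, c // self.groups, h * w).transpose(2, 3)

        q, k, v = split(q), split(k), split(v)
        scale = (c // self.groups) ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        out = (attn @ v).transpose(2, 3).reshape(b, c, h, w)   # concatenate the groups
        return self.proj(out)

print(GroupedSelfAttention(32)(torch.randn(2, 32, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```

Because the attention is restricted to each group, the cost of the softmax matrices is reduced roughly in proportion to the number of groups, which is the motivation stated above.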

4.4. Optimization of Network Structure

The MobileNetV3-Small network contains 11 bottleneck blocks. Due to the small size of gesture feature maps, excessively deep network layers may lead to unstable gradients. In addition, some bottleneck blocks have a high expansion size, which can also result in redundant calculations.
To address these issues, two optimization strategies are proposed for the network structure. First, two convolution layers and four bottleneck blocks are removed, which can greatly decrease the number of parameters and computational complexity. Second, the expansion size of some bottleneck blocks is reduced to decrease memory consumption. Table 1 lists the optimized network structure.

4.5. Overall Network Structure

Based on the two improved attention modules, two improved bottleneck blocks are designed, as shown in Figure 11.
Bottleneck_CBAM uses the improved CBAM to replace the SE attention module, which can better extract spatial features in gesture feature maps. Bottleneck_SA utilizes depthwise separable convolution to extract local features, while utilizing the improved SA module to extract global features. Subsequently, channel concatenation is performed on the two features to better preserve local details and global semantics.
According to the optimized network structure and two improved bottleneck blocks, a lightweight CQ-MobileNetV3 network is proposed, the structure of which is shown in Figure 12. The CQ-MobileNetV3 network consists of five Bottleneck_CBAMs, two Bottleneck_SA modules, two 1 × 1 convolutional layers, and one global average pooling module. Due to the high correlation between the computational complexity of the SA module and the size of the input feature map, the Bottleneck_SA module is placed at the deeper layer of the network, which can further reduce the computational cost.
The optimized network structure can significantly reduce the number of parameters and computational complexity. In addition, two types of improved attention modules can enhance recognition accuracy at a low computational cost. Through these three improvements, the proposed network can achieve high recognition accuracy with low computational complexity.

5. Experiments and Analysis

5.1. Experimental Setup and Parameter Configuration

The FMCW radar system for collecting gesture data consists of an IWR1443 radar sensor and a DCA-1000 data acquisition card developed by TI, which is shown in Figure 13. Table 2 lists the parameter configuration of the radar system.
In the experiment, all programs are run on a computer with an Intel Core i7-10700K CPU, 64 GB of RAM, and an RTX 4060 GPU with 8 GB of graphics memory. Feature extraction, preprocessing, and feature fusion of the radar data are implemented using MATLAB 2017. Deep neural networks are implemented on the PyTorch 1.10 platform using Python 3.10.

5.2. Dataset

5.2.1. Data Collection

In this study, 14 micro gestures are designed. The detailed movements of the 14 micro gestures are shown in Figure 14. The 14 micro gestures include click, double click, beckoning, wave, sliding left, sliding right, drawing tick, drawing fork, palm clench, palm open, forefinger–thumb open, forefinger–thumb close, rotating clockwise, and rotating counterclockwise.
Figure 15 shows three different data collection scenarios. In all three scenarios, the radar was placed horizontally on a table in a relatively spacious hall. In Scenario 1, the gesture was performed above the radar, at a distance of approximately 0.2 m from the radar. In Scenario 2, the gesture was also performed above the radar, at a distance of about 0.5 m from the radar. In Scenario 3, the gesture was performed at a certain angle from the direction perpendicular to the radar, approximately 0.25 m away from the radar.
A total of eight volunteers participated in the gesture data collection. These volunteers consisted of five males and three females. For each gesture, 25 samples were collected from each volunteer in each scenario, and the total sample size was 8400.

5.2.2. Feature Extraction and Preprocessing Results

For RTM and VTM extraction, the RDM is first obtained by performing 2-D FFT on one frame of data collected from one RX antenna. As shown in Table 2, one frame contains 128 chirps, with each chirp containing 64 samples. When performing the 2-D FFT, the number of sample points in each chirp is increased to 128 by zero padding, resulting in an RDM with a size of 128 × 128. Then, coherent accumulation is performed in the range and Doppler dimensions of the RDM to obtain one-dimensional range and velocity information with a size of 128 × 1. Finally, the one-dimensional range and velocity information of 50 frames are stacked along the time dimension to obtain the RTM and VTM with a size of 128 × 50. The range resolution is 0.0234 m, and the maximum range is 3 m. The velocity resolution is 0.0401 m/s, and the maximum velocity is 2.56 m/s.
The 2Tx-4Rx antenna configuration is equivalent to a virtual array of 1Tx-8Rx antenna units. Therefore, for ATM extraction, the data of one chirp in one frame collected from the eight RX antenna channels is processed using the MUSIC algorithm to obtain the one-dimensional angle information. Here, the step of the spatial spectrum search Δθ is set to π/127, and the size of the one-dimensional angle information is also 128 × 1. Finally, the one-dimensional angle information of 50 frames is stacked along the time dimension to obtain the ATM with a size of 128 × 50.
In preprocessing, for each frame in RTM, VTM, and ATM, normalization is first performed to reduce the intensity differences between different frames. Then, adaptive filtering is used to suppress interference within each frame of the three feature maps, where the filter coefficients kr, kv, and ka are all set to −0.2.
The click gesture in Scenario 1 is taken as an example to analyze the influence of preprocessing on feature extraction. Figure 16 and Figure 17 show the three feature maps before and after preprocessing, respectively.
As shown in Figure 16, the gesture echoes in the original RTM and VTM are weak at some moments, and breakpoints occur in the motion trajectories. In addition, some clutter appears in the original ATM. As shown in Figure 17, after preprocessing, the gesture echo distribution in RTM and VTM is more uniform, and the clutter in ATM is well suppressed, which can more accurately describe the motion characteristics of the click gesture.
For comparison, Figure 18 and Figure 19 present the three feature maps of the click gesture after preprocessing in Scenarios 2 and 3, respectively.
As shown in Figure 17, Figure 18 and Figure 19, the range variation for the click gesture is minimal in the three scenarios, especially for Scenario 3, which has a longer range. The velocity of the click gesture varies significantly and similarly across the three scenarios. The angle changes in the click gesture across the three scenarios are also obvious but different. The differences in the feature maps of the same gesture in different scenarios can effectively increase the diversity of the dataset.
Here, taking Scenario 1 as an example, the feature maps of 14 micro gestures are analyzed. Figure 20, Figure 21 and Figure 22 show the RTMs, VTMs, and ATMs of 14 gestures after preprocessing, respectively.
As shown in Figure 20, the range of the click gesture decreases first and then increases, presenting a V-shaped trajectory. The range of the double-click gesture shows two V-shaped changes. The ranges of the wave and palm clench gestures decrease, while the ranges of the beckoning and palm open gestures increase.
As shown in Figure 21, the velocity of the click gesture first becomes negative, then positive. The velocity of the double-click gesture exhibits two similar changes. The velocities of the beckoning and palm open gestures are positive during the duration of the gesture, while the velocities of the wave, drawing fork, and palm clench gestures are negative during the duration of the gesture.
As shown in Figure 22, the angle of the click gesture is positive during the duration, and the double click gesture shows two similar angle changes. The angle of the sliding left gesture varies from negative to positive, while the angle of the sliding right gesture changes from positive to negative. The angle of the forefinger–thumb open gesture changes from zero to negative, while the angle of the forefinger–thumb close gesture changes from zero to positive. The angle of the rotating clockwise gesture first becomes negative and then positive, while the angle of the counterclockwise rotation gesture changes in the opposite direction.

5.2.3. Feature Fusion Results

By fusing the RTM, VTM, and ATM in the color space, an RVATM with a size of 128 × 50 × 3 is obtained. Figure 23 shows the RVATMs of 14 gestures. The red, green, and blue lines represent the trajectories of range, velocity, and angle, respectively.
Compared with the RTM, VTM, and ATM, RVATM contains three-dimensional features, providing a more comprehensive description of motion characteristics, which is beneficial for distinguishing different micro gestures.
After feature extraction, preprocessing, and feature fusion of the collected samples, a dataset containing 8400 RVATMs is obtained, with 600 RVATMs for each gesture.

5.3. Recognition Results and Analysis

5.3.1. Evaluation Metrics

In this study, five metrics are used to measure the performance of the proposed network, including accuracy, parameters, computational complexity, model size, and frames per second (FPS).
Accuracy represents the overall recognition performance for all types of gestures, which is given by
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{1}{N_{total}} \sum_{i=1}^{K} TP_i
where TP represents the number of correctly classified positive samples, FP represents the number of incorrectly classified negative samples, FN represents the number of incorrectly classified positive samples, TN represents the number of correctly classified negative samples, K represents the number of types, and Ntotal represents the number of samples.
The parameters represent the sum of all trainable parameters in the network, reflecting the complexity of the network structure. Computational complexity is assessed through the number of floating-point operations (FLOPs), which largely determines the power consumption. Model size indicates the storage requirement of the network. FPS indicates the number of frames processed by the network per second, reflecting the inference speed of the network.
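For reference, a small sketch of how the accuracy and the parameter-related metrics can be computed is given below. The confusion matrix and the model are placeholders, and the FLOP count would in practice be obtained with a separate profiling tool (not shown here).

```python
import torch
import torch.nn as nn

def overall_accuracy(confusion):
    """Accuracy as the sum of the confusion-matrix diagonal over the total number of samples."""
    return confusion.diagonal().sum() / confusion.sum()

def count_parameters(model: nn.Module):
    """Total number of trainable parameters (reported in millions in Table 4)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def model_size_mb(model: nn.Module):
    """Approximate storage requirement, assuming 32-bit floating-point weights."""
    return count_parameters(model) * 4 / 1024 ** 2

# Placeholder examples
print(overall_accuracy(torch.tensor([[50.0, 2.0], [3.0, 45.0]])))
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 14, 1))
print(count_parameters(model), f"{model_size_mb(model):.3f} MB")
```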

5.3.2. Network Training

The constructed dataset is divided into a training set, validation set, and test set at a ratio of 6:2:2. The input RVATM is resized to 128 × 128 × 3, the batch size is 16, and the maximum number of training epochs is 1000. The Adam optimizer is employed for training, and the initial learning rate is 0.001. In addition, early stopping is used to monitor the accuracy on the validation set and terminate training when generalization performance begins to decline.
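A minimal training-loop sketch with these settings (Adam, learning rate 0.001, batch size 16, early stopping on validation accuracy) is shown below; the dataset tensors, the placeholder model, and the patience value are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; in practice these would be the RVATM dataset and CQ-MobileNetV3
train_set = TensorDataset(torch.randn(64, 3, 128, 128), torch.randint(0, 14, (64,)))
val_set = TensorDataset(torch.randn(16, 3, 128, 128), torch.randint(0, 14, (16,)))
model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 14))

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

best_acc, patience, bad_epochs = 0.0, 20, 0          # patience is an assumed value
for epoch in range(1000):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Validation accuracy for early stopping
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    acc = correct / total
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # stop when validation accuracy stops improving
            break
```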
Figure 24 shows the training curves of MobileNetV3 and CQ-MobileNetV3. As shown in Figure 24, compared with MobileNetV3, CQ-MobileNetV3 achieves faster convergence speed in the first 100 epochs and has slightly higher stable accuracy and slightly lower stable loss.

5.3.3. Ablation Experiments

Ablation experiments are conducted to assess the performance of the three improvements. Table 3 lists the experimental results.
MobileNetV3 is used as the baseline network in the experiments. Network A optimizes the network structure, resulting in a 79.4% decrease in parameter count and a 43.3% decrease in computational complexity. However, the accuracy also decreases by 2.71%. Network B optimizes the network structure and introduces the improved CBAM into the bottleneck blocks, reducing the parameter count by 79.1% and computational complexity by 41.1%, while maintaining almost the same accuracy. After optimizing the network structure, Network C introduces the improved SA module into the bottleneck blocks, reducing the parameter count by 77.9%, the computational complexity by 38.1%, and the accuracy also decreases by 0.62%. The proposed network contains three improvements, which reduce the parameter count by 77.8%, the computational complexity by 37.6%, and increase the accuracy by 0.41%. The results indicate that the three proposed improvements can greatly decrease the parameter count and computational complexity, while enhancing the accuracy.

5.3.4. Comparison with Other Networks

To further evaluate the performance of CQ-MobileNetV3, several state-of-the-art networks were selected for comparative experiments, including DenseNet + CBAM [25], ResNet50 [26], Xception [27], ResNet18 + CBAM [37], GhostNetV3 [38], MobileNetV4 [39], and Swin-Transformer-Small [40].
Figure 25 shows the confusion matrices of the eight networks. The recognition accuracy for every single gesture exceeds 85% for all networks. Except for Swin-Transformer-Small, the other seven networks achieve 100% accuracy for at least three gestures. These results indicate that the RVATM can effectively depict the motion characteristics of micro gestures and is suitable for different networks.
Table 4 presents a comprehensive performance comparison of the eight networks. As shown in Table 4, ResNet50 and Xception have high accuracy but a high parameter count and computational cost. Swin-Transformer-Small has the highest parameter count but relatively low accuracy. Compared with ResNet18 + CBAM, GhostNetV3 has similar accuracy, with a lower parameter count and computational complexity. MobileNetV4 has a much lower parameter count and computational complexity than GhostNetV3, but its accuracy is also 1.61% lower than that of GhostNetV3.
The proposed CQ-MobileNetV3 achieves an accuracy of 97.16%, which is less than 0.5% lower than the highest accuracy. The proposed network has the lowest parameter count and computational complexity, which are only 8.24% and 21.8% of those of the second lowest network (MobileNetV4), respectively. In addition, CQ-MobileNetV3 achieves the highest inference speed of 309 FPS. The results show that CQ-MobileNetV3 maintains high recognition accuracy while implementing lightweight design, obtaining a balance between lightweight design and recognition accuracy through efficient feature extraction networks and attention mechanisms.

6. Discussion

Feature extraction and network design are two key stages of radar-based gesture recognition.
In the feature extraction stage, first the three traditional feature maps RTM, VTM, and ATM are extracted by stacking the data of multiple frames. Then, the original RTM, VTM, and ATM are preprocessed to reduce interference and highlight gesture features. Finally, the preprocessed RTM, VTM, and ATM are fused in the color space to construct the RVATM. Compared with the multiple single-dimensional feature maps used in previous studies, the RVATM contains refined multi-dimensional features, which can fully describe the motion characteristics of micro gestures with less information redundancy and avoid further feature fusion processing in the network, thereby reducing the computational complexity of the network.
In the network design stage, achieving a lightweight structure is the main design goal, and the classic lightweight network MobileNetV3-Small is selected as the baseline network. First, based on the size of the RVATM feature map, the network structure is optimized by reducing the number of bottleneck blocks and decreasing the expansion size, which can greatly reduce the number of parameters. Subsequently, two lightweight attention modules, the improved CBAM and the improved SA module, are proposed and integrated into the bottleneck block, constructing the Bottleneck_CBAM and Bottleneck_SA module. Finally, the Bottleneck_CBAM and Bottleneck_SA module are used to replace bottleneck blocks in the structurally optimized network to obtain the CQ-MobileNetV3 network. The two lightweight attention modules can effectively improve the recognition accuracy, while increasing little computation complexity and parameter count.
In the experiment, using the RVATM as the input feature map, different networks achieve recognition accuracy above 90% for 14 types of micro gestures, demonstrating the good adaptability of the RVATM. Compared with several other mainstream networks, CQ-MobileNetV3 significantly reduces the parameter count and computational complexity, and achieves the highest inference speed while maintaining high recognition accuracy. These results indicate that CQ-MobileNetV3 effectively balances network lightweight design and recognition accuracy, making it suitable for deployment on mobile devices with limited computational and storage resources.
This study has some limitations. First, only scenarios where the hand is at a relatively close distance above the radar are studied. The performance of the proposed network may degrade in scenarios where the hand is at longer distances and different directions relative to the radar. Second, feature extraction in complex backgrounds with more interference also needs to be considered.

7. Conclusions

In this study, a micro gesture recognition method using multi-dimensional feature fusion and a lightweight network is presented. In feature extraction, the RTM, VTM, and ATM are first extracted from raw data and refined through preprocessing. Then, the three feature maps are fused in color space to obtain the RVATM, which can well express the motion information of gestures. For recognition, a lightweight CQ-MobileNetV3 is presented. First, the redundant parameters and computation are reduced by optimizing the network structure. Then, the recognition accuracy is improved by integrating the improved CBAM and the improved SA module. The experimental results based on the 77 GHz FMCW radar show that the CQ-MobileNetV3 network obtains a high accuracy of 97.16% for 14 micro gestures, with a parameter count of 0.207 M, a computational complexity of 0.027 GFLOPs, and a model size of 0.895 MB. The results validate that the proposed network is superior to seven other networks in terms of comprehensive performance.
In future work, interference suppression techniques in feature extraction against complex backgrounds will be studied. In addition, the improvement of the network will be investigated to enhance recognition performance in different scenarios.

Author Contributions

Conceptualization, W.X. and R.W.; methodology, W.X. and R.W.; software, R.W.; validation, R.W., J.W. and L.L.; formal analysis, W.X.; investigation, W.X.; resources, W.X.; data curation, R.W.; writing—original draft preparation, R.W.; writing—review and editing, W.X.; visualization, R.W.; supervision, W.X.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62175220.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to acknowledge the support from editors and comments from all the reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rastgoo, R.; Kiani, K.; Escalera, S. Sign language recognition: A deep survey. Expert Syst. Appl. 2021, 164, 113794. [Google Scholar] [CrossRef]
  2. Li, A.; Bodanese, E.; Poslad, S.; Hou, T.; Wu, K.; Luo, F. A Trajectory-Based Gesture Recognition in Smart Homes Based on the Ultrawideband Communication System. IEEE Internet Things J. 2022, 9, 22861–22873. [Google Scholar] [CrossRef]
  3. Zeng, M. Gesture Recognition Technology of VR Piano Playing Teaching Game based on Hidden Markov Model. Int. Arab J. Inf. Technol. 2024, 21, 760–772. [Google Scholar] [CrossRef]
  4. Zheng, L.; Bai, J.; Zhu, X.; Huang, L.; Shan, C.; Wu, Q.; Zhang, L. Dynamic Hand Gesture Recognition in In-Vehicle Environment Based on FMCW Radar and Transformer. Sensors 2021, 21, 6368. [Google Scholar] [CrossRef]
  5. Mahmoud, N.M.; Fouad, H.; Soliman, A.M. Smart healthcare solutions using the internet of medical things for hand gesture recognition system. Complex Intell. Syst. 2021, 7, 1253–1264. [Google Scholar] [CrossRef]
  6. Wang, H.; Zhang, M.; Zhang, L.; Zhu, X.; Cao, Q. Real-Time Hand Gesture Recognition in Clinical Settings: A Low-Power FMCW Radar Integrated Sensor System with Multiple Feature Fusion. Sensors 2025, 25, 4169. [Google Scholar] [CrossRef]
  7. Li, B.; Yang, J.; Yang, Y.; Li, C.; Zhang, Y. Sign Language/Gesture Recognition Based on Cumulative Distribution Density Features Using UWB Radar. IEEE Trans. Instrum. Meas. 2021, 70, 2511113. [Google Scholar] [CrossRef]
  8. Zhang, Z.; Cai, J.; Dai, X.; Xiao, H. Multi-Scale Attention Fusion Gesture-Recognition Algorithm Based on Strain Sensors. Sensors 2025, 25, 4200. [Google Scholar] [CrossRef]
  9. Sharma, S.; Singh, S. Vision-based hand gesture recognition using deep learning for the interpretation of sign language. Expert Syst. Appl. 2021, 182, 115657. [Google Scholar] [CrossRef]
  10. Leon, D.G.; Gröli, J.; Yeduri, S.R.; Mosqueron, R.; Pandey, O.J.; Cenkeramaddi, L.R. Video Hand Gestures Recognition Using Depth Camera and Lightweight CNN. IEEE Sens. J. 2022, 22, 14610–14619. [Google Scholar] [CrossRef]
  11. Kong, F.; Deng, J.; Fan, Z. Gesture recognition system based on ultrasonic FMCW and ConvLSTM model. Measurement 2022, 190, 110743. [Google Scholar] [CrossRef]
  12. Ding, X.; Yu, X.; Zhong, Y.; Xie, W.; Cai, B.; You, M.; Jiang, T. Robust gesture recognition method toward intelligent environment using Wi-Fi signals. Measurement 2024, 231, 114525. [Google Scholar] [CrossRef]
  13. Ahmed, S.; Kallu, K.D.; Ahmed, S.; Cho, S.H. Hand Gestures Recognition Using Radar Sensors For Human-Computer-Interaction: A review. Remote Sens. 2021, 13, 527. [Google Scholar] [CrossRef]
  14. Mao, Y.; Zhao, L.; Liu, C.; Ling, M. A Low-Complexity Hand Gesture Recognition Framework via Dual mmWave FMCW Radar System. Sensors 2023, 23, 8551. [Google Scholar] [CrossRef]
  15. Li, Q.; Liu, L.; Hao, S.; Wan, G. Dynamic Gesture Recognition Method Based on Millimeter-Wave Radar. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 63–67. [Google Scholar]
  16. Zhang, S.; Li, G.; Ritchie, M.; Fioranelli, F.; Griffiths, H. Dynamic Hand Gesture Classification Based on Radar Micro-Doppler Signatures. In Proceedings of the 2016 CIE International Conference on Radar (RADAR), Guangzhou, China, 10–13 October 2016; pp. 1–4. [Google Scholar]
  17. Rashid, N.E.A.; Nor, Y.A.I.M.; Sharif, K.K.M.; Khan, Z.I.; Zakaria, N.A. Hand Gesture Recognition using Continuous Wave (CW) Radar based on Hybrid PCA-KNN. In Proceedings of the 2021 IEEE Symposium on Wireless Technology & Applications (ISWTA), Shah Alam, Malaysia, 17 August 2021; pp. 88–92. [Google Scholar]
  18. Wang, S.; Song, J.; Lien, J.; Poupyrev, I.; Hilliges, O. Interacting with Soli: Exploring Fine-Grained Dynamic Gesture Recognition in the Radio-Frequency Spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; pp. 851–860. [Google Scholar]
  19. Wang, X.; Min, R.; Cui, Z.; Cao, Z. Micro Gesture Recognition with Terahertz Radar Based on Diagonal Profile of Range-Doppler Map. In Proceedings of the 2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 770–773. [Google Scholar]
  20. Ali, A.; Parida, P.; Va, V.; Ni, S.; Nguyen, K.N.; Ng, B.L.; Zhang, J.C. End-to-End Dynamic Gesture Recognition Using MmWave Radar. IEEE Access 2022, 10, 88692–88706. [Google Scholar] [CrossRef]
  21. Wang, Y.; Wang, D.; Fu, Y.; Yao, D.; Xie, L.; Zhou, M. Multi-Hand Gesture Recognition Using Automotive FMCW Radar Sensor. Remote Sens. 2022, 14, 2374. [Google Scholar] [CrossRef]
  22. Yu, J.T.; Tseng, Y.H.; Tseng, P.H. A mmWave MIMO Radar-Based Gesture Recognition Using Fusion of Range, Velocity, and Angular Information. IEEE Sens. J. 2024, 24, 9124–9134. [Google Scholar] [CrossRef]
  23. Wu, Y.; Wang, X.; Guo, S.; Zhang, B.; Cui, G. A Lightweight Network With Multifeature Fusion for mmWave Radar-Based Hand Gesture Recognition. IEEE Sens. J. 2024, 24, 19553–19561. [Google Scholar] [CrossRef]
  24. Yang, Z.; Zhuang, L.; Chu, P.; Zhou, J. A Low-Complexity Air-Digit-Writing Recognition Method Based on Adaptive Trajectory Learning Using MIMO Radar. IEEE Sens. J. 2024, 24, 4992–5003. [Google Scholar] [CrossRef]
  25. Song, Y.; Wu, L.; Zhao, Y.; Liu, P.; Lv, R.; Ullah, H. High-Accuracy Gesture Recognition using Mm-Wave Radar Based on Convolutional Block Attention Module. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 1485–1489. [Google Scholar]
  26. Li, Y.; Li, B.; Zhang, A.; Xue, P. Research on Gesture Recognition Based on Millimeter Wave Radar. In Proceedings of the 2024 10th International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2024; pp. 157–162. [Google Scholar]
  27. Li, W.; Jiang, J.; Liu, D.; Gao, Y.; Li, Q. Digital Gesture Recognition Based on Millimeter Wave Radar. In Proceedings of the 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xi’an, China, 17–19 August 2021; pp. 1–6. [Google Scholar]
  28. Jin, B.; Wu, H.; Zhang, Z.; Lian, Z.; Zhang, X.; Du, G. SRDST: Effective Dynamic Gesture Recognition With Sparse Representation and Dual-Stream Transformers in mmWave Radar. IEEE Trans. Ind. Inform. 2025, 21, 604–612. [Google Scholar] [CrossRef]
  29. Lien, J.; Gillian, N.; Karagozler, M.E.; Amihood, P.; Schwesig, C.; Olson, E.; Raja, H.; Poupyrev, I. Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar. ACM Trans. Graph. 2016, 35, 1–9. [Google Scholar] [CrossRef]
  30. Tang, G.; Wu, T.; Li, C. Dynamic Gesture Recognition Based on FMCW Millimeter Wave Radar: Review of Methodologies and Results. Sensors 2023, 23, 7478. [Google Scholar] [CrossRef]
  31. Ascione, M.; Buonanno, A.; Urso, M.D.; Angrisani, L.; Lo Moriello, R.S. A New Measurement Method Based on Music Algorithm for Through-the-Wall Detection of Life Signs. IEEE Trans. Instrum. Meas. 2013, 62, 13–26. [Google Scholar] [CrossRef]
  32. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  33. Huang, X.; Yang, R.; Wang, Q.; Yu, F.; He, B. A novel method for real-time ATR system of AUV based on Attention-MobileNetV3 network and pixel correction algorithm. Ocean Eng. 2023, 270, 113403. [Google Scholar] [CrossRef]
  34. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 2022, 503, 92–108. [Google Scholar] [CrossRef]
  36. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  37. Zhang, Y.; Tang, H.; Wu, Y.; Wang, B.; Yang, D. FMCW Radar Human Action Recognition Based on Asymmetric Convolutional Residual Blocks. Sensors 2024, 24, 4570. [Google Scholar] [CrossRef]
  38. Liu, Z.; Hao, Z.; Han, K.; Tang, Y.; Wang, Y. GhostNetV3: Exploring the Training Strategies for Compact Models. arXiv 2024, arXiv:2404.11202. [Google Scholar]
  39. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. arXiv 2024, arXiv:2404.10518. [Google Scholar]
  40. Gu, M.; Chen, Z.; Chen, K.; Pan, H. IR-ST: A Lightweight Transformer Network for Human Fall Detection Based on FMCW Radar. IEEE Sens. J. 2023, 23, 25128–25135. [Google Scholar] [CrossRef]
Figure 1. FMCW radar system architecture.
Figure 2. Frequency domain waveform of the transmitted and received signals of FMCW radar.
Figure 3. Angle measurement using antenna arrays.
Figure 4. Multi-dimensional feature fusion flow chart.
Figure 5. Schematic of RDM generation.
Figure 6. Bottleneck block of MobileNetV3.
Figure 7. Architecture of MobileNetV3-Small.
Figure 8. Structure of CBAM: (a) CAM; (b) SAM.
Figure 9. Structure of the improved CBAM: (a) improved CAM; (b) improved SAM.
Figure 10. Structure of the SA module.
Figure 11. Improved bottleneck blocks: (a) Bottleneck_CBAM; (b) Bottleneck_SA.
Figure 12. Structure of the CQ-MobileNetV3 network.
Figure 13. FMCW radar system.
Figure 14. Fourteen gestures: (a) click; (b) double click; (c) beckoning; (d) wave; (e) sliding left; (f) sliding right; (g) drawing tick; (h) drawing fork; (i) palm clench; (j) palm open; (k) forefinger–thumb open; (l) forefinger–thumb close; (m) rotating clockwise; (n) rotating counterclockwise.
Figure 15. Data collection scenarios: (a) Scenario 1; (b) Scenario 2; (c) Scenario 3.
Figure 16. Feature maps of the click gesture before preprocessing in Scenario 1: (a) RTM; (b) VTM; (c) ATM.
Figure 17. Feature maps of the click gesture after preprocessing in Scenario 1: (a) RTM; (b) VTM; (c) ATM.
Figure 18. Feature maps of the click gesture after preprocessing in Scenario 2: (a) RTM; (b) VTM; (c) ATM.
Figure 19. Feature maps of the click gesture after preprocessing in Scenario 3: (a) RTM; (b) VTM; (c) ATM.
Figure 20. RTMs of 14 gestures after preprocessing: (a) click; (b) double click; (c) beckoning; (d) wave; (e) sliding left; (f) sliding right; (g) drawing tick; (h) drawing fork; (i) palm clench; (j) palm open; (k) forefinger–thumb open; (l) forefinger–thumb close; (m) rotating clockwise; (n) rotating counterclockwise.
Figure 21. VTMs of 14 gestures after preprocessing: (a) click; (b) double click; (c) beckoning; (d) wave; (e) sliding left; (f) sliding right; (g) drawing tick; (h) drawing fork; (i) palm clench; (j) palm open; (k) forefinger–thumb open; (l) forefinger–thumb close; (m) rotating clockwise; (n) rotating counterclockwise.
Figure 22. ATMs of 14 gestures after preprocessing: (a) click; (b) double click; (c) beckoning; (d) wave; (e) sliding left; (f) sliding right; (g) drawing tick; (h) drawing fork; (i) palm clench; (j) palm open; (k) forefinger–thumb open; (l) forefinger–thumb close; (m) rotating clockwise; (n) rotating counterclockwise.
Figure 23. RVATMs of 14 gestures after preprocessing: (a) click; (b) double click; (c) beckoning; (d) wave; (e) sliding left; (f) sliding right; (g) drawing tick; (h) drawing fork; (i) palm clench; (j) palm open; (k) forefinger–thumb open; (l) forefinger–thumb close; (m) rotating clockwise; (n) rotating counterclockwise.
Figure 24. Training curves of MobileNetV3 and CQ-MobileNetV3: (a) accuracy curves; (b) loss curves.
Figure 25. Confusion matrices of eight networks: (a) DenseNet + CBAM; (b) ResNet50; (c) Xception; (d) ResNet18 + CBAM; (e) GhostNetV3; (f) MobileNetV4; (g) Swin-Transformer-Small; (h) CQ-MobileNetV3.
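For readers who wish to relate Figure 4, Figure 5, and Figures 20–23 to the underlying signal processing, the following is a rough, self-contained sketch of how range–time, velocity–time, and angle–time profiles can be obtained from an FMCW data cube with FFTs and stacked into a three-channel map. The array sizes follow Table 2, the data are random placeholders, and the normalization, adaptive filtering, and fusion steps described in the paper are not reproduced; this is an illustrative assumption, not the authors' pipeline.

```python
# Rough illustration (assumptions throughout, not the authors' pipeline):
# building RTM/VTM/ATM-like profiles from a raw FMCW data cube and stacking
# them into a three-channel RVATM-like tensor.
import numpy as np

frames, chirps, rx, samples = 50, 128, 4, 64              # dimensions from Table 2
cube = (np.random.randn(frames, chirps, rx, samples)
        + 1j * np.random.randn(frames, chirps, rx, samples))   # placeholder ADC data

rtm, vtm, atm = [], [], []
for frame in cube:
    rfft = np.fft.fft(frame, axis=-1)                                   # range FFT per chirp
    dfft = np.fft.fftshift(np.fft.fft(rfft, n=64, axis=0), axes=0)      # Doppler FFT (64 bins to match map sizes)
    afft = np.fft.fftshift(np.fft.fft(rfft, n=64, axis=1), axes=1)      # angle FFT over the 4 RX only
                                                                        # (2-TX virtual-array formation omitted)
    rtm.append(np.abs(rfft).mean(axis=(0, 1)))    # range profile of this frame
    vtm.append(np.abs(dfft).mean(axis=(1, 2)))    # velocity profile of this frame
    atm.append(np.abs(afft).mean(axis=(0, 2)))    # angle profile of this frame

rvatm = np.stack([np.array(m).T for m in (rtm, vtm, atm)])  # channels x bins x frames
print(rvatm.shape)                                          # (3, 64, 50) with these assumed sizes
```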
Table 1. Optimized network structure.
Module | Expansion Size (Original/Optimized) | Output Size
Bottleneck_SE | 16/16 | 64 × 64 × 16
Bottleneck | 72/16 | 32 × 32 × 16
Bottleneck_SE | 88/72 | 16 × 16 × 24
Bottleneck_SE | 144/96 | 8 × 8 × 48
Bottleneck_SE | 288/144 | 8 × 8 × 48
Bottleneck_SE | 576/96 | 4 × 4 × 96
Bottleneck_SE | 576/192 | 4 × 4 × 96
Conv2d, 1 × 1 | – | 4 × 4 × 576
Pool | – | 1 × 1 × 576
Conv2d, 1 × 1 | – | 1 × 1 × 14
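The optimized structure in Table 1 can be expressed directly in code. The snippet below is a minimal PyTorch sketch built from the table's optimized expansion sizes, output sizes, and SE flags; the stem, stride placement, activation choices, and the assumed 3 × 256 × 256 RVATM input are illustrative assumptions rather than the authors' implementation, and the improved CBAM/SA modules are omitted.

```python
# Hypothetical sketch (not the authors' code): the optimized backbone of Table 1
# with standard MobileNetV3-style squeeze-and-excitation bottleneck blocks.
import torch
import torch.nn as nn

class SE(nn.Module):
    def __init__(self, c, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c // r, c, 1), nn.Hardsigmoid())
    def forward(self, x):
        return x * self.fc(self.pool(x))

class Bottleneck(nn.Module):
    """Inverted residual: 1x1 expand -> depthwise 3x3 -> (SE) -> 1x1 project."""
    def __init__(self, cin, cexp, cout, stride, use_se):
        super().__init__()
        self.use_res = stride == 1 and cin == cout
        layers = [nn.Conv2d(cin, cexp, 1, bias=False), nn.BatchNorm2d(cexp), nn.Hardswish(),
                  nn.Conv2d(cexp, cexp, 3, stride, 1, groups=cexp, bias=False),
                  nn.BatchNorm2d(cexp), nn.Hardswish()]
        if use_se:
            layers.append(SE(cexp))
        layers += [nn.Conv2d(cexp, cout, 1, bias=False), nn.BatchNorm2d(cout)]
        self.block = nn.Sequential(*layers)
    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out

# (cin, optimized expansion, cout, stride, SE) read off Table 1; strides assumed
cfg = [(16, 16, 16, 2, True), (16, 16, 16, 2, False), (16, 72, 24, 2, True),
       (24, 96, 48, 2, True), (48, 144, 48, 1, True), (48, 96, 96, 2, True),
       (96, 192, 96, 1, True)]
stem = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1, bias=False), nn.BatchNorm2d(16), nn.Hardswish())
body = nn.Sequential(*[Bottleneck(*c) for c in cfg])
head = nn.Sequential(nn.Conv2d(96, 576, 1), nn.Hardswish(),
                     nn.AdaptiveAvgPool2d(1), nn.Conv2d(576, 14, 1), nn.Flatten())

x = torch.randn(1, 3, 256, 256)          # assumed three-channel RVATM input
print(head(body(stem(x))).shape)          # torch.Size([1, 14])
```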
Table 2. Parameter configuration of the radar system.
Parameter | Value
Number of transmitting antennas | 2
Number of receiving antennas | 4
Starting frequency | 77 GHz
Modulation slope | 100 MHz/µs
Chirp period | 380 µs
Chirp duration | 40 µs
Number of chirps per frame | 128
Frame period | 50 ms
Sampling rate | 2 MHz
Samples per chirp | 64
Number of frames | 50
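For orientation, applying the standard FMCW relations to the Table 2 settings gives the approximate sensing limits below. This is a back-of-the-envelope sketch that assumes complex (IQ) sampling and a single-transmitter chirp sequence; TDM interleaving of the two transmit antennas would halve the effective chirp rate per virtual channel.

```python
# Sketch (not from the paper): deriving common FMCW figures from Table 2.
c = 3e8                      # speed of light (m/s)
f0 = 77e9                    # starting frequency (Hz)
slope = 100e6 / 1e-6         # modulation slope, 100 MHz/us -> Hz/s
t_chirp = 40e-6              # chirp duration (s)
t_period = 380e-6            # chirp period (s)
n_chirps = 128               # chirps per frame
fs = 2e6                     # ADC sampling rate (Hz)

bw = slope * t_chirp         # sweep bandwidth: 4 GHz
wavelength = c / f0
print(f"range resolution    {c / (2 * bw) * 100:.2f} cm")                         # ~3.75 cm
print(f"maximum range       {fs * c / (2 * slope):.2f} m")                        # ~3.0 m
print(f"maximum velocity    {wavelength / (4 * t_period):.2f} m/s")               # ~2.6 m/s
print(f"velocity resolution {wavelength / (2 * n_chirps * t_period) * 100:.2f} cm/s")  # ~4.0 cm/s
```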
Table 3. Results of ablation experiments.
Network | Network Structure Optimization | Improved CBAM | Improved SA | Accuracy (%) | Params (K) | FLOPs (M) | Size (KB) | FPS
MobileNetV3 | | | | 96.75 | 932.96 | 42.99 | 3786 | 270
A | | | | 94.04 | 192.49 | 24.36 | 814 | 331
B | | | | 96.32 | 194.63 | 25.31 | 823 | 320
C | | | | 96.13 | 206.64 | 26.59 | 864 | 315
Proposed | | | | 97.16 | 207.24 | 26.81 | 895 | 309
Table 4. Comprehensive performance comparison of different networks.
Network | Accuracy (%) | Params (M) | FLOPs (G) | Size (MB) | FPS
DenseNet + CBAM | 96.98 | 7.012 | 1.924 | 28.025 | 182
ResNet50 | 97.44 | 23.537 | 2.698 | 89.785 | 199
Xception | 97.54 | 20.836 | 2.982 | 78.482 | 281
ResNet18 + CBAM | 97.50 | 11.346 | 1.193 | 44.496 | 124
GhostNetV3 | 97.20 | 8.129 | 0.300 | 31.010 | 158
MobileNetV4 | 95.59 | 2.511 | 0.124 | 9.579 | 381
Swin-Transformer-Small | 91.70 | 49.800 | 17.089 | 191.399 | 35
CQ-MobileNetV3 | 97.16 | 0.207 | 0.027 | 0.895 | 309
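As a reference for how figures like the Params, Size, and FPS columns of Tables 3 and 4 are typically obtained, the sketch below measures a stock torchvision MobileNetV3-Small as a stand-in model. The batch size, input size, warm-up schedule, and hardware are assumptions and do not reproduce the benchmarking protocol used in the paper.

```python
# Illustrative sketch (not the authors' benchmarking code): parameter count,
# float32 weight size, and throughput of a stand-in model.
import time
import torch
from torchvision.models import mobilenet_v3_small   # stand-in, not CQ-MobileNetV3

model = mobilenet_v3_small(num_classes=14).eval()
x = torch.randn(1, 3, 256, 256)                      # assumed RVATM input size

params = sum(p.numel() for p in model.parameters())
print(f"Params: {params / 1e6:.3f} M")
print(f"Size:   {params * 4 / 1e6:.3f} MB (float32 weights only, buffers excluded)")

with torch.no_grad():
    for _ in range(10):                               # warm-up runs
        model(x)
    n, t0 = 100, time.perf_counter()
    for _ in range(n):
        model(x)
    print(f"FPS:    {n / (time.perf_counter() - t0):.1f}")
```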
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
