Fine-Grained Radar Hand Gesture Recognition Method Based on Variable-Channel DRSN

Chen, Penghui; Li, Siben; Yuan, Chenchen; Bai, Yujing; Wang, Jun

doi:10.3390/electronics15020437


Fine-Grained Radar Hand Gesture Recognition Method Based on Variable-Channel DRSN^†

by Penghui Chen ¹,*, Siben Li ¹, Chenchen Yuan ², Yujing Bai ¹ and Jun Wang ¹,³

¹ School of Electronic and Information Engineering, Beihang University, Beijing 100191, China

² Beijing Institute of Radio Metrology and Measurement, Beijing 100143, China

³ Key Laboratory of Intelligent Sensing Materials and Chip Integration Technology of Zhejiang Province, Hangzhou Innovation Institute of Beihang University, Hangzhou 310056, China

* Author to whom correspondence should be addressed.

† This paper is an extended version of our paper published in Yuan, C.; Chen, Z.; Chen, P.; Tian, R.; Xiong, D.; Guo, W. Fine-Grained Gesture Recognition by Using FMCW Millimeter-Wave Radar. In Proceedings of the 2023 Cross Strait Radio Science and Wireless Technology Conference (CSRSWTC), Guilin, China, 10–13 November 2023; pp. 1–3.

Electronics 2026, 15(2), 437; https://doi.org/10.3390/electronics15020437

Submission received: 11 December 2025 / Revised: 8 January 2026 / Accepted: 14 January 2026 / Published: 19 January 2026

(This article belongs to the Special Issue Advancements in Signal Processing: Communications, Sensing and Imaging)


Abstract

With the ongoing miniaturization of smart devices, fine-grained hand gesture recognition using millimeter-wave radar has attracted increasing attention, yet practical deployment remains challenging in continuous-gesture segmentation, robust feature extraction, and reliable classification. This paper presents an end-to-end fine-grained gesture recognition framework based on frequency-modulated continuous-wave (FMCW) millimeter-wave radar, including gesture design, data acquisition, feature construction, and neural network-based classification. Ten gesture types are recorded (eight valid gestures and two return-to-neutral gestures); for classification, the two return-to-neutral gesture types are merged into a single invalid class, yielding a nine-class task. A sliding-window segmentation method is developed using short-time Fourier transform (STFT)-based Doppler–time representations, and a dataset of 4050 labeled samples is collected. Multiple signal classification (MUSIC)-based super-resolution estimation is adopted to construct range–time and angle–time representations, and instance-wise normalization is applied to Doppler and range features to mitigate inter-individual variability without test leakage. For recognition, a variable-channel deep residual shrinkage network (DRSN) is employed to improve robustness to noise, supporting single-, dual-, and triple-channel feature inputs. Results under both subject-dependent evaluation with repeated random splits and subject-independent leave-one-subject-out (LOSO) cross-validation show that the DRSN architecture consistently outperforms the RefineNet-based baseline, and the triple-channel configuration achieves the best performance (98.88% accuracy). Overall, the variable-channel design enables flexible feature selection to meet diverse application requirements.

Keywords:

fine-grained gestures; millimeter-wave radar; feature extraction; deep residual shrinkage network

1. Introduction

Hand gesture recognition is a key component of human-computer interaction (HCI) and shows strong potential in 5G sensing, sign language translation, motion-sensing games, smart homes, and smart healthcare [1,2]. Owing to its high operating frequency, millimeter-wave radar enables precise measurements and can detect gestures with small motion amplitudes [3]. In addition, radar signals are less susceptible to blockage by certain materials during propagation [4], and advances in integrated-circuit technology have further reduced chip size [5]. Overall, millimeter-wave radar provides strong privacy, high environmental adaptability, ease of integration, and low cost, making it a promising solution for gesture recognition [6,7].

In recent years, research on gesture recognition has shifted from traditional signal processing to deep learning methods [8,9], particularly convolutional neural networks (CNNs), which have advanced radar-based gesture recognition [10,11]. The Soli project released by Google [12] represents a breakthrough in frequency-modulated continuous-wave (FMCW) radar-based micro-motion gesture recognition and can identify fine-grained gestures, such as finger movements and thumb motions across other fingers.

However, most FMCW radar-based gesture recognition studies have primarily focused on algorithm design [13,14]. Malysa et al. [15] used a hidden Markov model (HMM) to classify five gesture signals collected by FMCW radar and reported an accuracy of 82%. The engineering team at TI [16] designed and developed a multifunctional FMCW radar for gesture recognition and demonstrated its use in car door opening and closing. Ryu et al. [17] applied short-time Fourier transform (STFT) to extract gesture features, and combined it with a quantum-inspired evolutionary algorithm (QEA) for classification, achieving 85% accuracy. Suh et al. [18] designed FMCW hardware with one transmitting and four receiving antennas, extracted the Doppler–range map of gestures as features, and used a long short-term memory (LSTM) network to classify seven gesture types. Zhang et al. [19] adopted connectionist temporal classification (CTC) and utilized unsegmented gesture data streams to predict gesture categories. Choi et al. [20] developed a real-time gesture recognition algorithm based on an LSTM encoder and incorporated a Gaussian mixture model for clutter removal. The constant false alarm rate (CFAR) algorithm was employed for gesture detection for the first time. After classification using the LSTM network, the system was able to recognize ten different gestures and remained effective under low signal-to-noise ratio (SNR) conditions.

At present, millimeter-wave radar-based gesture recognition has advanced in both hardware and algorithms [21,22]. Wang et al. [23] proposed a DCS-CTN (Millimeter-Wave Radar Data Cube Sequence and Time-Distributed CNN-Transformer Network) framework that uses a time-distributed wrapper and a convolutional neural network (CNN) to extract local features from the data-cube sequence, a positional encoder to preserve temporal information, and a Transformer network to capture global features. This framework achieved a gesture recognition accuracy of 99.75%. Huang et al. [24] developed a gesture recognition algorithm that combines an error-correction code (ECC) module with multiple signal classification (MUSIC) preprocessing, improving decision logic and robustness while significantly enhancing real-time detection performance. Jin et al. [25] introduced a dynamic gesture recognition method based on multi-layer feature fusion (MLFF) and a Transformer, characterized by a small model size and high accuracy. On a dataset with 10% random interference, this method achieved an average recognition accuracy of 99.11%, indicating strong potential for embedded device applications.

Accurate recognition of fine-grained gestures has long been a major challenge in gesture recognition [26]. The deployment of gesture recognition systems in real-world scenarios still faces several difficulties: (a) the system must operate reliably across diverse environments; (b) individuals perform gestures differently, and even for the same gesture, motion amplitude and velocity can vary substantially across users; (c) continuous gestures introduce segmentation challenges [27]; and (d) fine-grained gestures typically involve small motion amplitudes and low velocities, which complicates feature extraction [28]. Consequently, further investigation is needed for radar-based measurement and recognition of fine-grained gestures.

To address these challenges, this paper proposes an improved deep residual shrinkage network (DRSN) for fine-grained gesture recognition. Specifically, the network extends the multi-path refinement architecture of deep residual networks by integrating low- and high-resolution features to produce high-resolution representations. In addition, a shrinkage module is incorporated to suppress noise interference effectively. To assess the representational capacity of different feature sets, single-, dual-, and triple-channel neural network architectures are designed. The proposed model achieves an accuracy of 98.88% on a nine-class fine-grained gesture recognition task and provides recommendations for flexible channel-mode selection across different application scenarios. The main contributions of this paper are summarized as follows:

This paper constructs a dataset comprising 4050 samples collected using an AWR1843 mm-wave radar, covering nine gesture categories in total: eight fine-grained command gestures commonly encountered in daily life and one return-to-neutral category. To improve robustness in continuous gesture recognition, two return-to-neutral gesture types (following “pat left” and “pat right” actions, respectively) are defined and collected, but both are merged into the same return-to-neutral category during labeling and classification. The dataset helps mitigate the lack of publicly available datasets for fine-grained gesture recognition, and the return-to-neutral category helps reduce interference and false positives between consecutive command gestures.
This paper applies a Deep Residual Shrinkage Network (DRSN) to fine-grained gesture recognition by embedding a learnable residual shrinkage (soft-thresholding) mechanism into residual units to adaptively suppress noise-like responses, thereby improving robustness under noisy radar measurements and enhancing recognition accuracy.
This paper designs three DRSN configurations—single-channel, dual-channel, and triple-channel—to evaluate the representational capacity of different feature sets for fine-grained gestures. This design enables flexible selection of the number and combination of input features to meet diverse application requirements.

The remainder of this paper is organized as follows. Section 2 presents the fine-grained gesture data collection and feature extraction methods based on FMCW millimeter-wave radar, including the FMCW radar signal model, gesture design and acquisition, feature extraction, feature normalization, and continuous-gesture segmentation. Section 3 describes the experimental methodology based on the improved DRSN architecture. Section 4 reports the experimental setup, results, and analysis, including parameter settings, dataset construction, and comparative experiments. Section 5 concludes the paper with a summary and final remarks.

2. Data Collection and Feature Extraction

2.1. Millimeter-Wave Radar Signal Model

This paper uses TI’s AWR1843 mm-wave radar to generate FMCW signals, which provide high range resolution, short measurement time, and low computational complexity, making them well-suited to gesture recognition [29].

The core of the AWR1843 mm-wave radar is the RF/analog subsystem, which comprises three components: a clock subsystem, a transmit subsystem, and a receive subsystem. The clock subsystem provides a stable frequency range of 76–81 GHz. The transmit subsystem supports three channels with independent amplitude and phase control, and the receive subsystem provides four parallel channels, each equipped with a low-noise amplifier, a filter, and a mixer.

The radar’s RF front-end integrates three transmit antennas. An FMCW waveform generated by the signal generator is amplified by a power amplifier and radiated by the transmit antennas. The transmitted signal is reflected by the target, producing echoes that are captured by four receive antennas. Each received signal is amplified by a low-noise amplifier and mixed with the transmit signal to generate an intermediate-frequency (IF) signal, which is then low-pass filtered (LPF) and digitized by an analog-to-digital converter (ADC). Finally, the digitized data are sent to the signal processor, yielding four-channel digital outputs.

An FMCW radar transmits a frequency-modulated continuous wave; the frequency difference between the received echo and the transmitted signal reflects target characteristics. The working principle of the FMCW radar is shown in Figure 1.

The frequency of an FMCW signal varies linearly with time. Let the starting frequency be $f_{c}$ and the slope be $S$. The transmitted signal can be expressed as

$$s(t) = \exp\left\{ j 2\pi \left( f_{c} t + \frac{1}{2} S t^{2} \right) \right\}. \tag{1}$$

The transmitted signal $s(t)$ is reflected by the target, generating an echo signal $r(t)$, which is received by the antenna. Ignoring amplitude variations, the received signal can be approximated as a time-delayed version of the transmitted signal:

$$r(t) = s(t - t_{d}) = \exp\left\{ j 2\pi \left[ f_{c} (t - t_{d}) + \frac{1}{2} S (t - t_{d})^{2} \right] \right\}, \tag{2}$$

where the time delay $t_{d}$ is related to the target range $R$ by

$$t_{d} = \frac{2R}{c}, \tag{3}$$

and $c$ denotes the speed of light.

To obtain the mixed signal, the transmitted signal $s(t)$ and the received signal $r(t)$ are combined in the mixer as

$$x_{MIX}(t) = s(t)\, r^{*}(t) = \exp\left[ j 2\pi \left( f_{c} t_{d} + S t\, t_{d} - \frac{1}{2} S t_{d}^{2} \right) \right], \tag{4}$$

where $(\cdot)^{*}$ denotes complex conjugation.

After low-pass filtering (LPF), the intermediate-frequency (IF) signal can be written as

$$x(t) = \exp\left\{ j \left( 2\pi f_{IF} t + \phi_{0} \right) \right\}, \tag{5}$$

where the beat frequency is $f_{IF} = S t_{d} = S \cdot 2R / c$ and $\phi_{0}$ is a constant phase term that absorbs components independent of $t$ (e.g., $2\pi \left( f_{c} t_{d} - \frac{1}{2} S t_{d}^{2} \right)$).

Therefore, the relationship between the target range $R$ and the IF frequency $f_{IF}$ is

$$R = \frac{c\, f_{IF}}{2S}. \tag{6}$$
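The chain from Equations (3) to (6) can be checked numerically. The following is a minimal sketch, not the authors' processing code: it synthesizes the IF signal of Equation (5) for one point target and recovers the range from the FFT peak via Equation (6). The chirp slope, sampling rate, sample count, and target range are all assumed values chosen for illustration (they give a swept bandwidth of about 2.56 GHz, i.e. a range resolution close to 5.9 cm), not the radar configuration used in this paper.

```python
import numpy as np

# All numeric values below are illustrative assumptions, not the paper's settings.
c = 3e8          # speed of light (m/s)
fc = 77e9        # assumed chirp start frequency (Hz)
S = 50e12        # assumed chirp slope (Hz/s), i.e. 50 MHz/us
fs = 5e6         # assumed ADC sampling rate (Hz)
N = 256          # assumed number of samples per chirp
R_true = 0.60    # assumed target range (m)

t = np.arange(N) / fs
t_d = 2 * R_true / c                               # Eq. (3): round-trip delay
f_if = S * t_d                                     # beat frequency f_IF = S * t_d
phi0 = 2 * np.pi * (fc * t_d - 0.5 * S * t_d**2)   # constant phase term
x = np.exp(1j * (2 * np.pi * f_if * t + phi0))     # Eq. (5): IF signal

# Zero-padded FFT: locate the beat frequency, then map it to range with Eq. (6).
n_fft = 4096
spectrum = np.abs(np.fft.fft(x, n_fft))
f_est = np.argmax(spectrum[: n_fft // 2]) * fs / n_fft
R_est = c * f_est / (2 * S)

print(f"beat frequency: {f_if / 1e3:.1f} kHz, estimated range: {R_est:.3f} m")
```

Zero-padding only refines the location of a single peak; it does not improve the ability to separate two closely spaced targets, which is why a super-resolution method is used later for range features.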

2.2. Fine-Grained Gesture Design and Data Acquisition

This paper defines a fine-grained gesture set consisting of eight valid command gestures [30] and two return-to-neutral gesture types. The return-to-neutral gestures are introduced to reduce interference and false positives in continuous gesture recognition. For classification, the two return-to-neutral gesture types are merged into a single return-to-neutral category, resulting in nine gesture categories in total (eight command categories plus one return-to-neutral category).

To ensure practicality and feasibility, the eight command gestures are designed according to two principles: (a) they follow natural daily life gesture habits; and (b) they are distinguishable in radar signals in terms of velocity, range, and angle characteristics. The eight command gestures are illustrated in Figure 2.

To avoid ad hoc gesture definitions, the eight command gestures are aligned with commonly used radar/HCI gesture primitives reported in prior studies and public datasets [31,32]. Specifically, “pat left” and “pat right” correspond to lateral swipe/pat-like motions; “move forward” and “move backward” correspond to radial push/pull motions along the range direction; “palm open” and “palm pinch” represent hand-aperture changes (open vs. pinch) that are widely used as interaction commands; “rub fingers” captures fine finger micro-motions often used in subtle gesture sensing; and “hook hand” represents a distinct hand-shape/posture change that complements the above motion-based primitives. Since there is currently no universally standardized gesture vocabulary across radar platforms and application scenarios, this paper provides explicit gesture definitions and illustrations (Figure 2 and Figure 3) to support reproducibility and facilitate cross-paper comparisons.

When multiple consecutive “pat left” or “pat right” gestures are performed, a return-to-neutral gesture naturally occurs between successive actions, during which the palm is reoriented to face vertically toward the radar. To prevent such return-to-neutral gestures from being misclassified as command gestures, two return-to-neutral gesture types are explicitly defined: one following a left pat (return-to-neutral gesture 1) and one following a right pat (return-to-neutral gesture 2), as illustrated in Figure 3. As these return-to-neutral gestures do not correspond to functional commands, they are labeled as the return-to-neutral category during gesture classification rather than as command gestures.

Gesture data were collected from nine volunteers. The radar was placed upright on the ground, and participants were seated in front of it with their hands positioned directly in front of the radar. The data acquisition scenario is shown in Figure 4.

2.3. Feature Extraction

2.3.1. Static Clutter Suppression

During the acquisition of fine-grained gesture signals, FMCW radar measurements are affected by stationary clutter. Moving targets induce Doppler frequency shifts, whereas the spectral spread of stationary clutter around zero Doppler is typically limited. The moving target indicator (MTI) exploits this difference to separate moving targets from stationary clutter.

In this paper, an MTI filter is employed to suppress static clutter in fine-grained gesture signals. Compared with a double-delay-line canceller, a single-delay-line canceller provides a wider effective passband but offers weaker suppression of stationary clutter. Because the fine-grained gesture data in this paper are acquired in a relatively controlled environment, and the single-delay-line canceller is structurally simple and easy to implement, it is adopted for clutter suppression. The filter primarily attenuates near-zero-Doppler stationary components while preserving the dominant dynamic patterns of the gestures. Using two consecutive gestures (“move forward” and “move backward”) as examples, the one-dimensional range profiles before and after static clutter suppression are shown in Figure 5.

Compared with the post-suppression result, the range profiles before clutter suppression contain more redundant components, including returns from stationary (non-moving) objects. After suppression, the energy becomes more concentrated in the time frames associated with gesture activity, and the transition between the two consecutive gestures becomes more distinct. This property facilitates subsequent segmentation in continuous gesture recognition.
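A single-delay-line canceller amounts to subtracting consecutive chirps along slow time. The sketch below illustrates the idea on synthetic range profiles; the array layout (chirps by range bins), the bin positions, and the Doppler value are assumptions made for this example only.

```python
import numpy as np

def single_delay_canceller(profiles):
    """Chirp-to-chirp difference y[n] = x[n] - x[n-1] along slow time.

    profiles: complex array of shape (num_chirps, num_range_bins).
    Returns an array with one fewer chirp.
    """
    return profiles[1:] - profiles[:-1]

# Synthetic example (assumed values): a stationary reflector at range bin 5
# and a moving target at range bin 12 with normalized Doppler 0.1 cycles/chirp.
num_chirps, num_bins = 64, 32
n = np.arange(num_chirps)
profiles = np.zeros((num_chirps, num_bins), dtype=complex)
profiles[:, 5] = 10.0                                  # static clutter, constant phase
profiles[:, 12] = np.exp(2j * np.pi * 0.1 * n)         # moving target, rotating phase

filtered = single_delay_canceller(profiles)
static_residual = np.abs(filtered[:, 5]).max()
moving_level = np.abs(filtered[:, 12]).mean()
print(static_residual, moving_level)
```

The stationary return cancels exactly because its phase does not change between chirps, whereas the moving target keeps a gain of $2\sin(\pi f_{D})$ for a normalized Doppler $f_{D}$, which also shows why very slow motion is partially attenuated by this filter.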

2.3.2. Range Feature Extraction

Since the configured range resolution of the radar is 5.9 cm, while the motion displacement of the fine-grained gestures designed in this paper is approximately 3.0–10.0 cm, conventional FFT-based range estimation does not provide sufficient resolution to capture subtle range variations. Therefore, a super-resolution method is required for range estimation.

In this paper, the multiple signal classification (MUSIC) algorithm [33] is applied to extract range features for fine-grained gestures. MUSIC exploits the orthogonality between the signal and noise subspaces to construct a pseudo-spectrum, and range estimates are obtained by searching for peaks in this spectrum. A detailed procedure is provided in [30]. The range–time spectrograms of the ten fine-grained gesture types are shown in Figure 6.
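To illustrate why a subspace method helps here, the following sketch applies MUSIC to a synthetic beat signal containing two targets 3 cm apart, which is below the FFT resolution of the assumed chirp. It is a generic textbook-style formulation (sliding sub-windows to build the covariance matrix), not the detailed procedure of [30]; the function name `music_spectrum`, the chirp parameters, the sub-window length, and the noise level are all assumptions.

```python
import numpy as np

def music_spectrum(x, num_sources, m, freqs):
    """MUSIC pseudo-spectrum of a 1-D complex sequence.

    x: complex samples of one chirp; m: sub-window length;
    freqs: candidate normalized frequencies (cycles/sample).
    """
    k_snap = len(x) - m + 1
    # Overlapping sub-windows act as snapshots for the covariance estimate.
    snaps = np.stack([x[i:i + m] for i in range(k_snap)], axis=1)
    cov = snaps @ snaps.conj().T / k_snap
    _, vecs = np.linalg.eigh(cov)                  # eigenvalues in ascending order
    noise_sub = vecs[:, : m - num_sources]         # noise-subspace eigenvectors
    steer = np.exp(2j * np.pi * np.outer(np.arange(m), freqs))
    return 1.0 / np.sum(np.abs(noise_sub.conj().T @ steer) ** 2, axis=0)

# Assumed chirp parameters (illustrative only).
c, S, fs, N = 3e8, 50e12, 5e6, 256
ranges_true = np.array([0.60, 0.63])               # two targets 3 cm apart
f_norm = S * (2 * ranges_true / c) / fs            # normalized beat frequencies
n = np.arange(N)
rng = np.random.default_rng(0)
x = np.exp(2j * np.pi * f_norm[0] * n) + np.exp(1j * (2 * np.pi * f_norm[1] * n + 0.7))
x = x + 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

freqs = np.linspace(0.03, 0.05, 2001)
p = music_spectrum(x, num_sources=2, m=128, freqs=freqs)
# Keep the two strongest local maxima of the pseudo-spectrum.
peaks = [i for i in range(1, len(p) - 1) if p[i] > p[i - 1] and p[i] >= p[i + 1]]
top2 = sorted(sorted(peaks, key=lambda i: p[i])[-2:])
ranges_est = c * freqs[top2] * fs / (2 * S)        # Eq. (6) applied to each peak
print(ranges_est)
```

Repeating this estimate frame by frame and stacking the pseudo-spectra would give a range–time map of the kind shown in Figure 6. The number of sources must be supplied to MUSIC, so in practice it has to be fixed or estimated separately.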

2.3.3. Doppler Feature Extraction

Compared with the conventional Fourier transform, the short-time Fourier transform (STFT) is better suited to non-stationary signals. By partitioning a non-stationary signal into short-time segments, each segment can be approximated as locally stationary. The Fourier transform is then applied to each segment, enabling time-localized frequency analysis.

Wavelet-based transforms, such as the wavelet transform (WT) and continuous wavelet transform (CWT), provide multi-resolution time-frequency analysis and can be advantageous for strongly non-stationary or multi-scale transient signals. Nevertheless, this paper adopts STFT as a deliberate trade-off for the following reasons: (a) in FMCW radar processing, Doppler is naturally estimated via an FFT along the slow-time (chirp-to-chirp) dimension, and STFT can be interpreted as a windowed FFT that yields a physically interpretable Doppler/velocity axis; (b) STFT produces a fixed-size spectrogram with controllable time/frequency binning, which integrates well with CNN-based models and can stabilize training [34]; and (c) WT/CWT requires selecting a mother wavelet and scale parameters, and the resulting representation may be sensitive to these hyperparameters, introducing additional tuning overhead. Therefore, STFT is adopted as a simple and robust representation in this study, while a systematic comparison with WT/CWT under the same protocol is left for future work.

This paper uses STFT to extract Doppler features for fine-grained gestures. Here, each frame corresponds to a sliding window over multiple chirps in the slow-time sequence. First, for each frame, the signal is transformed along the sampling-point (fast-time) dimension to obtain range information. The data are then aggregated along the range dimension, and STFT is applied along the chirp (slow-time) dimension to estimate Doppler information for the frame. Finally, the resulting Doppler representations are concatenated frame by frame to capture the temporal evolution of Doppler variations. The overall Doppler feature extraction process is illustrated in Figure 7.

The Doppler–time spectrograms of the ten fine-grained gesture types are shown in Figure 8. The horizontal axis denotes the frame index, the vertical axis represents Doppler, and the color indicates the intensity of motion-related energy. Because the time–frequency distributions differ across gesture types, the Doppler–time spectrum provides an effective basis for distinguishing fine-grained gestures.
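The three steps described above (range FFT, aggregation along range, windowed FFT along slow time) can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `doppler_time_map`, the window length, the hop size, the Hann window, and the optional range gate are hypothetical choices, not the paper's settings.

```python
import numpy as np

def doppler_time_map(cube, win_len=32, hop=8, range_gate=None):
    """Build a Doppler-time magnitude map from raw chirps.

    cube: complex array (num_chirps, num_samples).
    Returns an array (win_len, num_frames); rows are Doppler bins with
    zero Doppler at index win_len // 2.
    """
    profiles = np.fft.fft(cube, axis=1)                   # range FFT along fast time
    if range_gate is not None:                            # optional gate (assumption)
        profiles = profiles[:, range_gate[0]:range_gate[1]]
    slow = profiles.sum(axis=1)                           # aggregate along range
    window = np.hanning(win_len)
    starts = range(0, len(slow) - win_len + 1, hop)
    cols = [np.fft.fftshift(np.fft.fft(slow[s:s + win_len] * window)) for s in starts]
    return np.abs(np.stack(cols, axis=1))                 # concatenate frame by frame

# Synthetic target at range bin 10 that approaches, then recedes (assumed values).
num_chirps, num_samples = 256, 64
t = np.arange(num_samples)
n = np.arange(num_chirps)
doppler = np.where(n < 128, 0.25, -0.25)                  # cycles per chirp
cube = np.exp(2j * np.pi * doppler * n)[:, None] * np.exp(2j * np.pi * 10 * t / num_samples)[None, :]

dt_map = doppler_time_map(cube, range_gate=(5, 15))
print(dt_map.shape, dt_map[:, 0].argmax(), dt_map[:, -1].argmax())
```

In the resulting map the energy ridge moves from a positive Doppler bin to the mirrored negative bin halfway through the sequence, which is the kind of time–frequency signature that distinguishes opposite motions.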

2.3.4. Angle Feature Extraction

The gestures “pat left” and “pat right” exhibit similar velocity and range variations with respect to the radar, leading to similar Doppler–time and range–time characteristics. The key difference between these two gestures lies in the lateral direction of motion, making angle features important for discriminating between them.

The multiple signal classification (MUSIC) algorithm is a spectral estimation method that can be used to estimate both range and angle information. Accordingly, MUSIC is employed to extract angle features for fine-grained gestures. The angle–time spectrograms of the ten fine-grained gesture types are shown in Figure 9.

Compared with Doppler and range, the radar provides lower angular resolution under the antenna configuration used in this paper. Even when a gesture is performed directly in front of the radar, the angular resolution is approximately 10°. As shown in Figure 9, although the radar cannot accurately estimate the instantaneous angle for each fine-grained gesture, it can capture the overall trend of angle variation.
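For the angle dimension, the same subspace idea is applied across receive channels instead of across time samples. The sketch below is a minimal illustration assuming a uniform linear array with half-wavelength spacing and eight (virtual) elements; the array size, snapshot count, noise level, and the function name `music_angle_spectrum` are assumptions, and the antenna configuration actually used in this paper may differ.

```python
import numpy as np

def music_angle_spectrum(snapshots, num_sources, angles_deg):
    """MUSIC pseudo-spectrum over angle for a half-wavelength ULA.

    snapshots: complex array (num_antennas, num_snapshots).
    """
    n_ant = snapshots.shape[0]
    cov = snapshots @ snapshots.conj().T / snapshots.shape[1]
    _, vecs = np.linalg.eigh(cov)                       # ascending eigenvalues
    noise_sub = vecs[:, : n_ant - num_sources]
    theta = np.deg2rad(angles_deg)
    steer = np.exp(1j * np.pi * np.outer(np.arange(n_ant), np.sin(theta)))
    return 1.0 / np.sum(np.abs(noise_sub.conj().T @ steer) ** 2, axis=0)

# One source at 20 degrees (assumed), 8 elements, 64 snapshots.
rng = np.random.default_rng(1)
n_ant, n_snap, true_deg = 8, 64, 20.0
a = np.exp(1j * np.pi * np.arange(n_ant) * np.sin(np.deg2rad(true_deg)))
s = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
noise = 0.1 * (rng.standard_normal((n_ant, n_snap)) + 1j * rng.standard_normal((n_ant, n_snap)))
snapshots = np.outer(a, s) + noise

angles = np.arange(-60.0, 60.1, 0.1)
est_deg = angles[np.argmax(music_angle_spectrum(snapshots, 1, angles))]
print(est_deg)
```

Evaluating the pseudo-spectrum once per frame and stacking the results over time yields an angle–time map; even when the instantaneous estimate is coarse, the direction of the trend (leftward or rightward) remains visible.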

2.4. Normalization of Features

Different individuals exhibit varying motion extents along the range dimension when performing the same gesture. To mitigate this variability, the range extent, expressed in range–bin indices on the extracted range–time map, is normalized. Moving targets in the range–time map are detected using the Ordered-Statistics Constant False Alarm Rate (OS-CFAR) algorithm. For each gesture sample (instance), the minimum and maximum range bins of the detected target region are determined from that sample only and are then used for min–max normalization, which linearly maps the original range–bin indices to a predefined interval $[r_{1}, r_{2}]$. Let $r_{i1}$ and $r_{i2}$ denote the minimum and maximum range positions (range–bin indices) of the $i$-th gesture sample (instance), respectively, obtained from OS-CFAR detections on that sample’s range–time map. Here, $r_{1}$ and $r_{2}$ are predefined constants that specify the target interval after normalization (e.g., a fixed range–bin interval used by the network input), and they do not depend on any dataset-level statistics. Given an original range–bin index $r_{i}$ of the $i$-th sample, its normalized value $\tilde{r}_{i}$ is computed as follows:

$$\tilde{r}_{i} = r_{1} + \frac{r_{i} - r_{i1}}{r_{i2} - r_{i1}} (r_{2} - r_{1}). \tag{7}$$

The overall process of gesture range–bin normalization is illustrated in Figure 10.

As with range normalization, Doppler features are normalized instance-wise to mitigate user variability in motion velocity. Let $v_{i1}$ and $v_{i2}$ denote the minimum and maximum Doppler–bin indices of the detected target region for the $i$-th gesture sample (instance), respectively, obtained from OS-CFAR detections on that sample’s Doppler–time map. The original Doppler–bin indices are linearly mapped to a predefined interval $[v_{1}, v_{2}]$ (min–max scaling), where $v_{1}$ and $v_{2}$ are predefined constants that specify the target interval after Doppler–bin normalization and are fixed across all experiments. Given an original Doppler–bin index $v_{i}$ of the $i$-th sample, its normalized value $\tilde{v}_{i}$ is computed as follows:

$$\tilde{v}_{i} = v_{1} + \frac{v_{i} - v_{i1}}{v_{i2} - v_{i1}} (v_{2} - v_{1}). \tag{8}$$

The overall process of gesture Doppler–bin normalization is illustrated in Figure 11.

All normalization parameters in Equations (7) and (8) are computed per gesture sample (instance). Specifically, $r_{i1}, r_{i2}$ and $v_{i1}, v_{i2}$ are derived from the OS-CFAR detection results on each sample’s own range–time and Doppler–time maps, respectively. Therefore, no statistics (e.g., mean/std, min/max scaling factors, or any other dataset-level normalization statistics) are computed using the whole dataset or using any held-out test subject data. This design ensures that the reported subject-independent evaluation does not involve test leakage and avoids biased performance estimation. In particular, under the subject-independent protocol, the held-out test subject is never used to compute any normalization statistics.
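The instance-wise mapping of Equations (7) and (8) can be sketched in a few lines. In the example below, the detection mask, the target interval $[8, 24]$, and the handling of a degenerate extent (a single active bin) are assumptions made for illustration; the helper names `instance_extent` and `minmax_map` are hypothetical.

```python
def instance_extent(mask):
    """Return (min_bin, max_bin) of detections in one sample's binary mask.

    mask: list of rows (bins), each a list of 0/1 values over frames.
    Only this sample's own detections are used, so no dataset-level
    statistics enter the normalization.
    """
    active = [i for i, row in enumerate(mask) if any(row)]
    if not active:
        raise ValueError("no detections in this sample")
    return active[0], active[-1]

def minmax_map(idx, lo, hi, t1, t2):
    """Eq. (7)/(8): map a bin index from [lo, hi] linearly onto [t1, t2]."""
    if hi == lo:                      # degenerate extent: assumed fallback to the midpoint
        return (t1 + t2) / 2.0
    return t1 + (idx - lo) / (hi - lo) * (t2 - t1)

# Toy mask with detections in bins 10..20 (assumed), target interval [8, 24] (assumed).
mask = [[0, 0, 0] for _ in range(32)]
for b in range(10, 21):
    mask[b][1] = 1
lo, hi = instance_extent(mask)
mapped = [minmax_map(b, lo, hi, 8, 24) for b in (10, 15, 20)]
print(lo, hi, mapped)
```

Because `lo` and `hi` come from the sample itself, two users who perform the same gesture with different motion extents are mapped onto the same bin interval, and a held-out subject can be normalized without touching training data.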

2.5. Continuous Gesture Splitting

Based on the Doppler features of fine-grained gestures, this paper employs a boundary detection method for gesture segmentation. During feature extraction, the palm and fingers may yield multiple Doppler–bin activations within a frame; therefore, OS-CFAR is adopted to detect motion-related activations for subsequent segmentation [35,36]. The core idea is to sort the samples in the reference window in ascending order and select the $k$-th order statistic as the background estimate.

OS-CFAR consists of the cell under test (CUT), guard cells, and $n$ reference cells. The reference cells are distributed on both sides of the guard region (i.e., $n/2$ on each side). To avoid target-signal leakage into the reference window, samples in the guard region are excluded from background estimation.

When processing the input data, all values in the reference cells are sorted in ascending order according to their power levels, i.e., $X_{(1)} \leq \dots \leq X_{(n)}$. The $k$-th order statistic $Z = X_{(k)}$ is used as an estimate of the background clutter power, where $1 \leq k \leq n$. The detection threshold is set to $TZ$, and a target is declared if $X_{CUT} > TZ$. Let $\mu$ denote the mean (scale parameter) of the exponentially distributed clutter/noise power; then the probability density function of $Z$ under a uniform clutter background is given by

$$f_{Z(k)}(z) = \frac{k}{\mu} \binom{n}{k} \exp\left( -\frac{(n - k + 1) z}{\mu} \right) \left( 1 - \exp\left( -\frac{z}{\mu} \right) \right)^{k - 1}, \quad z \geq 0. \tag{9}$$

The false alarm probability $P_{fa}$ of the OS-CFAR detector under a uniform clutter background is given by

$$P_{fa} = k \binom{n}{k} \frac{\Gamma(n - k + 1 + T)\, \Gamma(k)}{\Gamma(n + T + 1)}, \tag{10}$$

where $T$ represents the threshold coefficient. For completeness, the probability density function of the Gamma distribution is

$$f_{X}(x) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha - 1} \exp\left( -\frac{x}{\beta} \right), \quad x \geq 0, \ \alpha > 0, \ \beta > 0, \tag{11}$$

where $\alpha > 0$ and $\beta > 0$ are the shape and scale parameters of the Gamma distribution, respectively, and $\Gamma(\cdot)$ denotes the Gamma function. The exponential distribution is a special case of the Gamma distribution with $\alpha = 1$ and $\beta = \mu$.
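A compact way to see how Equation (10) is used in practice is to solve it for the threshold coefficient $T$ given a desired false alarm probability, and then run the detector along one frame. The sketch below does both for a one-dimensional power profile; the values $n = 16$, $k = 12$, two guard cells per side, the target $P_{fa} = 10^{-3}$, and the choice to skip edge cells with too few reference cells are all assumptions for illustration, not the parameters used in this paper.

```python
import math

def os_cfar_pfa(n, k, T):
    """Eq. (10), evaluated in the log domain for numerical stability."""
    log_p = (math.log(k) + math.log(math.comb(n, k))
             + math.lgamma(n - k + 1 + T) + math.lgamma(k) - math.lgamma(n + T + 1))
    return math.exp(log_p)

def solve_threshold(n, k, pfa_target, hi=1e4):
    """Bisection on T; P_fa decreases monotonically as T grows."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if os_cfar_pfa(n, k, mid) > pfa_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def os_cfar_1d(power, n_ref, n_guard, k, T):
    """Return a 0/1 detection list for a 1-D list of power values."""
    half = n_ref // 2
    det = [0] * len(power)
    for i in range(len(power)):
        left = power[max(0, i - n_guard - half): max(0, i - n_guard)]
        right = power[i + n_guard + 1: i + n_guard + 1 + half]
        ref = sorted(left + right)            # guard cells and the CUT are excluded
        if len(ref) < k:                      # edge cells: skipped (assumption)
            continue
        z = ref[k - 1]                        # k-th order statistic as background level
        det[i] = 1 if power[i] > T * z else 0
    return det

T = solve_threshold(16, 12, 1e-3)
power = [1.0] * 40
power[20] = 50.0                              # one strong Doppler bin (toy data)
det = os_cfar_1d(power, n_ref=16, n_guard=2, k=12, T=T)
print(round(T, 3), det.index(1), sum(det))
```

Because the background level is an order statistic rather than a mean, a single strong cell inside a neighbor's reference window does not raise that neighbor's threshold, which suits frames where the palm and fingers produce several active Doppler bins at once.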

As shown in Figure 12a, this fine-grained gesture sequence contains three gesture actions. OS-CFAR is applied to the Doppler–time representation to detect motion-related Doppler bins in each frame, which are then used for gesture segmentation. The detection result is shown in Figure 12b.

Let the binary matrix obtained after OS-CFAR detection be $C \in \{0, 1\}^{M \times N}$, where rows correspond to Doppler bins and columns correspond to frames. Here, $M$ denotes the number of Doppler bins and $N$ denotes the number of frames. An entry $C(i, j) = 1$ indicates that a target is detected at Doppler bin $i$ in frame $j$, whereas $C(i, j) = 0$ indicates no detection at that bin and frame.

The next step is to determine whether a moving target exists in a given frame. If no targets are detected in any Doppler bin of a frame (i.e., $\sum_{i = 1}^{M} C(i, j) = 0$), then the frame is considered to contain no moving target. In addition, if detections occur only at the zero-Doppler bin (or within a small neighborhood around zero Doppler), the frame is also treated as containing no moving target. The procedure for determining target presence across frames is illustrated in Figure 12c, where blue markers on the horizontal axis indicate frames with detected moving targets.

Next, the boundaries of each fine-grained gesture are detected. A frame is considered a gesture boundary if two conditions are satisfied: (a) no moving target is present in the frame; and (b) at least one moving-target frame exists between this frame and the preceding boundary. Condition (a) ensures that boundaries occur only in frames without moving targets, while condition (b) ensures that a gesture is present between two adjacent boundaries.

To suppress false alarms caused by OS-CFAR detection, a sliding-window strategy is employed for boundary detection. Let $l$ denote the window length. If all frames within the window contain no moving targets, the last frame in the window is confirmed as an idle (non-target) frame. If a gesture segment exists between the current frame and the previous boundary, the current frame is identified as a boundary. The overall process of fine-grained gesture boundary detection is illustrated in Figure 13.

3. Variable-Channel DRSN

To examine the contribution of different feature types to gesture recognition and to improve recognition accuracy, this paper adopts an improved Deep Residual Shrinkage Network (DRSN). DRSN incorporates a shrinkage mechanism that learns adaptive thresholds to suppress noise, thereby enhancing feature learning under noisy conditions. Three input configurations are considered—single-channel, dual-channel, and triple-channel—so that different feature combinations can be selected according to application requirements.

3.1. Basic Components

As an improved deep learning method, DRSN shares the same fundamental building blocks as conventional convolutional neural networks (ConvNets), including convolutional layers, the rectified linear unit (ReLU) activation function, batch normalization (BN), global average pooling (GAP), and the cross-entropy loss function. These components are briefly introduced below.

3.1.1. Convolutional Layers

Convolutional layers are the core building blocks of convolutional neural networks (CNNs). Compared with fully connected layers, convolution uses local receptive fields and weight sharing, which substantially reduces the number of trainable parameters. This property can mitigate overfitting and improve generalization, especially when training data are limited.

Given an input feature map, the convolutional output at the $j$-th output channel (including a bias term) can be written as

$$y_{j} = \sum_{i \in M_{j}} x_{i} * k_{ij} + b_{j}, \tag{12}$$

where $x_{i}$ denotes the $i$-th input channel, $y_{j}$ denotes the $j$-th output channel, $*$ denotes the convolution operation, $k_{ij}$ is the convolution kernel that connects input channel $i$ to output channel $j$, $b_{j}$ is the bias term, and $M_{j}$ is the set of input channels used to compute the $j$-th output channel [37]. By applying multiple kernels, a convolutional layer produces an output feature map with multiple channels. Stacking convolutional layers further enables hierarchical feature extraction.
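Equation (12) can be illustrated with a small pure-Python sketch. Several simplifications are assumed here and are not taken from the paper: $M_{j}$ is taken to be all input channels, no padding or stride is used ("valid" output size), and the kernel is applied without flipping (the cross-correlation form that most deep-learning frameworks call convolution). The function names are hypothetical.

```python
def conv2d_valid(x, k):
    """Single-channel 'valid' 2-D cross-correlation of map x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + a][c + b] * k[a][b] for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(x, kernels, bias):
    """Evaluate y_j = sum_i x_i * k_ij + b_j for every output channel j.

    x: list of input channels (each a 2-D list).
    kernels[i][j]: kernel connecting input channel i to output channel j.
    bias[j]: bias of output channel j.
    """
    outputs = []
    for j in range(len(bias)):
        maps = [conv2d_valid(x[i], kernels[i][j]) for i in range(len(x))]
        out_h, out_w = len(maps[0]), len(maps[0][0])
        outputs.append([[sum(m[r][c] for m in maps) + bias[j]
                         for c in range(out_w)]
                        for r in range(out_h)])
    return outputs
```

Each output channel shares one kernel per input channel across all spatial positions, which is the weight sharing referred to above.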

3.1.2. Rectified Linear Unit (ReLU) Activation Function

The activation function is an essential component of neural networks, enabling nonlinear transformations. Among commonly used activation functions, the rectified linear unit (ReLU) alleviates the vanishing-gradient issue by providing non-saturating gradients for positive inputs. In addition, ReLU yields sparse activations by setting negative responses to zero, which can facilitate optimization in practice. The ReLU function is defined as

$$y = \max(x, 0), \tag{13}$$

where $x$ and $y$ denote the input and output of the ReLU activation function, respectively.

3.1.3. Batch Normalization (BN)

Batch normalization (BN) is a widely used normalization technique in deep learning that standardizes intermediate features using mini-batch statistics and introduces learnable scale and shift parameters [38]. By reducing the sensitivity of layer inputs to parameter updates and improving numerical conditioning, BN often stabilizes and accelerates training.

Given a mini-batch $\{x_{n}\}_{n=1}^{N_{\mathrm{batch}}}$, BN computes the mini-batch mean and variance as

$$\mu = \frac{1}{N_{\mathrm{batch}}} \sum_{n=1}^{N_{\mathrm{batch}}} x_{n}, \tag{14}$$

$$\sigma^{2} = \frac{1}{N_{\mathrm{batch}}} \sum_{n=1}^{N_{\mathrm{batch}}} (x_{n} - \mu)^{2}. \tag{15}$$

The normalized feature is then obtained by

$$\hat{x}_{n} = \frac{x_{n} - \mu}{\sqrt{\sigma^{2} + \varepsilon}}, \tag{16}$$

followed by an affine transformation

$$y_{n} = \gamma \hat{x}_{n} + \beta, \tag{17}$$

where $x_{n}$ and $y_{n}$ denote the input and output features of the $n$-th sample in the mini-batch, respectively; $\gamma$ and $\beta$ are learnable scale and shift parameters; and $\varepsilon$ is a small constant added for numerical stability.
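Equations (14) to (17) translate directly into a few lines of code. The sketch below treats a single scalar feature across the mini-batch and uses training-mode statistics only; at inference time, frameworks typically substitute running averages of the mean and variance, which is not shown. The function name and default values are illustrative assumptions.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch normalization of one feature over a mini-batch."""
    n = len(xs)
    mu = sum(xs) / n                           # Eq. (14): mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / n   # Eq. (15): biased mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # Eq. (16)
    return [gamma * v + beta for v in x_hat]   # Eq. (17): learnable scale and shift
```

With the default $\gamma = 1$ and $\beta = 0$, the output has zero mean and a variance marginally below one because of $\varepsilon$.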

3.1.4. Global Average Pooling (GAP)

Global average pooling (GAP) is typically applied before the final classification layer to aggregate each feature map by taking the spatial average of its activations [39]. Compared with flattening the feature maps and feeding them into a fully connected layer, GAP substantially reduces the number of trainable parameters in the classifier, thereby mitigating overfitting. Moreover, by summarizing spatial responses, GAP encourages the network to focus on the overall presence of discriminative patterns rather than their exact locations, which improves robustness to small spatial shifts.
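The parameter saving can be made concrete with a short sketch. The feature-map shape used in the usage note (64 channels of 8 × 8) is a hypothetical example, not a value from the paper; only the nine-class output follows the task definition. The function names are illustrative.

```python
def global_average_pool(feature_maps):
    """Reduce each [rows][cols] map to the mean of its activations."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def classifier_params(channels, height, width, n_classes, use_gap):
    """Weights plus biases of one fully connected classification layer."""
    inputs = channels if use_gap else channels * height * width
    return inputs * n_classes + n_classes
```

For the hypothetical 64 × 8 × 8 feature map and nine classes, the classifier needs 36,873 parameters on flattened features but only 585 after GAP.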

3.1.5. Cross-Entropy Loss Function

The cross-entropy loss is widely used as the training objective for multi-class classification tasks [38]. When combined with the softmax function, it quantifies the discrepancy between the predicted class distribution and the ground-truth label distribution and typically provides stable gradients for optimizing deep neural networks.

Given the logit $x_{j}$ for class $j$, the softmax function converts logits into a normalized probability distribution whose components sum to one:

$$y_{j} = \frac{e^{x_{j}}}{\sum_{i=1}^{N_{\mathrm{class}}} e^{x_{i}}}, \tag{18}$$

where $y_{j}$ denotes the predicted probability that an observation belongs to class $j$, and $N_{\mathrm{class}}$ is the total number of classes.

For an observation with target label $t_{j}$, the cross-entropy loss is defined as

$$E = - \sum_{j=1}^{N_{\mathrm{class}}} t_{j} \log(y_{j}), \tag{19}$$

where $t_{j}$ is the target probability for class $j$ (e.g., a one-hot vector in standard supervised classification). The network parameters are then optimized by minimizing $E$ using gradient-based methods.
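A minimal sketch of Equations (18) and (19) follows. Subtracting the maximum logit before exponentiation is a common numerical safeguard that leaves the softmax output unchanged; it is an implementation choice assumed here, not something stated in the paper.

```python
import math


def softmax(logits):
    """Eq. (18), computed with the max-shift trick to avoid overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Eq. (19); terms with zero target probability contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)
```

For a one-hot target the loss reduces to the negative log-probability assigned to the true class, so uniform predictions over nine classes would give a loss of $\log 9$.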

3.2. Classical ResNets

Vanishing and exploding gradients arise from repeated multiplication of Jacobians during backpropagation. As network depth increases, these effects can make optimization difficult and may lead to unstable training or ineffective learning in early layers. Moreover, deeper networks can exhibit a degradation phenomenon in which adding more layers increases training difficulty and does not necessarily improve accuracy, and may even reduce it, despite the model having greater representational capacity.

Deep residual networks (ResNets) were proposed to address these optimization challenges in deep architectures. The key idea is to learn a residual function with respect to the identity mapping, which is implemented through shortcut connections that bypass one or more layers. These shortcuts provide a direct path for both forward activations and backward gradients, thereby facilitating optimization.

The fundamental module of a ResNet is the residual building unit (RBU). In its basic form, an RBU consists of two convolutional layers with batch normalization (BN) and rectified linear unit (ReLU) activations, together with an identity shortcut. Let $F(x)$ denote the residual mapping learned by the stacked layers. The output of the unit is $H(x) = F(x) + x$, as illustrated in Figure 14. If the desired mapping is close to the identity, the unit can approximate it by driving $F(x)$ toward zero (i.e., $F(x) \approx 0$), so that $H(x) \approx x$. This property helps mitigate degradation and enables effective training as depth increases.

The overall architecture of ResNets is shown in Figure 15. Here, “Conv” denotes a convolutional layer; “/2” indicates a stride of 2, which halves the spatial resolution of the feature map; $K$ denotes the number of convolution kernels (filters) in a convolutional layer; $C$ denotes the number of channels; and “FC” denotes the final fully connected classification layer.

3.3. Soft Thresholding

Soft thresholding is a fundamental operation in classical signal denoising (e.g., transform-domain shrinkage), where small-magnitude coefficients are treated as noise and are suppressed while informative components are preserved [40,41]. In typical denoising pipelines, the signal is first mapped to a domain in which noise tends to concentrate near zero, such as wavelet or frequency coefficients, and soft thresholding is then applied to drive near-zero responses to exactly zero.

In deep learning, soft thresholding can be integrated as a differentiable shrinkage nonlinearity, enabling the network to learn noise-suppressing representations without manually designing filters. By combining learned feature extraction with shrinkage, the network can attenuate noise-related components and retain discriminative responses.

The soft-thresholding function is defined as

$$y = \operatorname{sign}(x) \max(|x| - \tau, 0) = \begin{cases} x - \tau, & x > \tau, \\ 0, & -\tau \leq x \leq \tau, \\ x + \tau, & x < -\tau, \end{cases} \tag{20}$$

where $x$ denotes the input feature, $y$ the output feature, and $\tau > 0$ denotes the threshold. This operation sets features with small magnitude (i.e., within $[-\tau, \tau]$) to zero and shrinks larger-magnitude features toward zero by $\tau$ while preserving their sign. Compared with the ReLU activation, which discards all negative responses, soft thresholding suppresses only small-magnitude responses of both signs, thereby retaining potentially informative negative features.

The soft-thresholding process is illustrated in Figure 16a. Its derivative with respect to the input is piecewise constant,

$$\frac{\partial y}{\partial x} = \begin{cases} 1, & x > \tau, \\ 0, & -\tau < x < \tau, \\ 1, & x < -\tau, \end{cases} \tag{21}$$

and is not defined at $x = \pm\tau$; in practice, a subgradient is used. The bounded slope (0 or 1) provides numerically stable gradient propagation outside the dead zone, while the zero-gradient interval enforces sparsity by removing near-zero (noise-like) responses, as shown in Figure 16b.
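Equations (20) and (21) can be checked with a scalar sketch. The choice of 0 as the subgradient at $|x| = \tau$ is one valid option assumed for this example; the function names are illustrative.

```python
def soft_threshold(x, tau):
    """Eq. (20): zero the dead zone [-tau, tau], shrink the rest toward zero by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_grad(x, tau):
    """Eq. (21), taking 0 as the subgradient at |x| == tau (an assumed choice)."""
    return 1.0 if abs(x) > tau else 0.0


def relu(x):
    """Eq. (13), included only for comparison."""
    return max(x, 0.0)
```

For an input of −3 with a threshold of 1, soft thresholding returns −2 whereas ReLU returns 0, which illustrates how large negative responses survive shrinkage.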

3.4. Deep Residual Shrinkage Network

Compared with classical ResNets, the deep residual shrinkage network (DRSN) integrates soft thresholding into the residual building unit to form a residual shrinkage building unit (RSBU). The RSBU combines residual learning, a threshold estimation module, and soft thresholding, enabling the network to suppress noise-like responses while preserving informative components.

For the RSBU with channel-shared thresholds (RSBU-CS), a lightweight module is introduced to estimate a single shared threshold for all channels, as shown in Figure 17a. Specifically, the absolute feature map $|x|$ is first summarized to a compact descriptor, which is then fed into a two-layer fully connected (FC) network. A sigmoid function is applied to the FC output to produce a scaling parameter in the range (0, 1) [42], expressed as follows:

$$\alpha = \frac{1}{1 + e^{-z}}, \tag{22}$$

where $z$ denotes the output of the two-layer FC network and $\alpha$ is the corresponding scaling parameter.

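The soft-thresholding value must be positive and should not be excessively large; otherwise, when the threshold exceeds the maximum magnitude of the feature map, the soft-thresholding output can collapse to all zeros. To keep the threshold within a reasonable range, the scaling parameter $\alpha$ is multiplied by the mean magnitude of the feature map to obtain the shared threshold:

$$\varsigma = \alpha \cdot \operatorname{average}_{i,j,c} |x_{i,j,c}|, \tag{23}$$

where $\varsigma$ is the threshold (it plays the role of $\tau$ in Equation (20)), and $i$, $j$, and $c$ denote the width index, height index, and channel index of the feature map $x$, respectively. Because $\alpha < 1$ and the mean magnitude never exceeds the maximum magnitude, the threshold stays below the largest response of any feature map that is not identically zero, so the output cannot collapse to all zeros.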
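The threshold computation of Equations (22) and (23), followed by the shrinkage of Equation (20), can be sketched as below. The scalar `z` is passed in directly as a hypothetical stand-in for the output of the two-layer FC network, whose weights are not modeled here; the function names are also illustrative.

```python
import math


def shared_threshold(feature_map, z):
    """Channel-shared threshold from a [channel][row][col] nested list.

    z is a placeholder for the FC-network output (an assumption of this sketch).
    """
    alpha = 1.0 / (1.0 + math.exp(-z))                     # Eq. (22)
    mags = [abs(v) for ch in feature_map for row in ch for v in row]
    return alpha * sum(mags) / len(mags)                   # Eq. (23)


def shrink(feature_map, threshold):
    """Apply the soft thresholding of Eq. (20) element-wise."""
    def soft(v):
        if v > threshold:
            return v - threshold
        if v < -threshold:
            return v + threshold
        return 0.0
    return [[[soft(v) for v in row] for row in ch] for ch in feature_map]
```

With `z = 0` the scaling parameter is 0.5, so the threshold equals half the mean magnitude; the largest response is reduced but never zeroed.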

If separate thresholds are desired for different channels, the RSBU with channel-wise thresholds (RSBU-CW) can be adopted, as shown in Figure 17b. In RSBU-CW, thresholds are estimated and applied per channel, enabling more flexible shrinkage when channel characteristics differ significantly.

DRSN-CS follows a ResNet-style backbone, as illustrated in Figure 18a, while replacing the standard residual building unit (RBU) with RSBU-CS. Stacking multiple RSBU-CS units progressively suppresses noise-related components and facilitates discriminative feature learning. Similarly, DRSN-CW uses RSBU-CW as its building block (Figure 18b), applying channel-adaptive shrinkage through repeated nonlinear transformations.

To evaluate the contribution of different input features, three variable-channel configurations are constructed in this paper: single-channel, dual-channel, and triple-channel, corresponding to the use of one, two, and three feature types as network inputs, respectively. These configurations provide flexible feature-combination choices under different deployment constraints. The overall variable-channel DRSN architecture is illustrated in Figure 19.

4. Experiments

4.1. Millimeter-Wave Radar Parameter Settings

Radar parameter design is critical for reliable target detection and subsequent gesture recognition. When configuring the FMCW radar, multiple factors must be considered jointly, including velocity (Doppler) resolution, range resolution, angular resolution, frame rate, and the onboard processing and data-throughput constraints. In practice, these requirements often impose trade-offs. For example, increasing bandwidth improves range resolution, whereas improving velocity resolution typically requires a longer coherent processing interval (i.e., more chirps), which can reduce the achievable frame rate and increase computational load. Therefore, the parameter configuration is selected to balance temporal responsiveness (frame rate) with sufficient range and velocity resolution for fine-grained gesture characterization under the available hardware constraints. The radar configuration used in this paper is summarized in Table 1. The experiments were carried out with a TI AWR1843 radar (Texas Instruments, Dallas, TX, USA).
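The trade-off described above follows from the standard FMCW resolution relations, $\Delta R = c / (2B)$ and $\Delta v = \lambda / (2 N_{c} T_{c})$, where $B$ is the swept bandwidth, $N_{c}$ the number of chirps per frame, and $T_{c}$ the chirp repetition period. The sketch below evaluates them for purely hypothetical values (4 GHz bandwidth, 77 GHz carrier, 128 chirps, 100 µs chirp period); these are not the settings of Table 1, and a single-transmitter chirp sequence is assumed.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz):
    """Delta R = c / (2 B)."""
    return C / (2.0 * bandwidth_hz)


def velocity_resolution(carrier_hz, n_chirps, chirp_period_s):
    """Delta v = lambda / (2 * N_c * T_c)."""
    wavelength = C / carrier_hz
    return wavelength / (2.0 * n_chirps * chirp_period_s)


def coherent_interval(n_chirps, chirp_period_s):
    """Time spent on one frame's chirp burst, which bounds the frame rate."""
    return n_chirps * chirp_period_s
```

Under these assumed values the range resolution is roughly 3.7 cm and the velocity resolution roughly 0.15 m/s. Doubling the number of chirps halves the velocity resolution cell but also doubles the coherent interval, which is the frame-rate cost mentioned above.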

4.2. Dataset Construction and Input Representation

A total of 4050 fine-grained gesture samples were collected from nine volunteers. For each sample, three time–frequency/spatial representations were extracted, namely the Doppler–time map, range–time map, and angle–time map, which jointly characterize motion velocity, radial displacement, and azimuthal variation, respectively.

During data collection, ten fine-grained gesture types were recorded, including eight valid command gestures and two return-to-neutral gestures. Because return-to-neutral gestures are introduced only to prevent transitional movements from being misclassified as valid commands, the two return-to-neutral gesture types are not distinguished in the final label space and are merged into a single invalid class. Therefore, the gesture classification task is formulated as a nine-class problem, comprising eight valid gesture classes and one invalid (return-to-neutral) class.

To evaluate the representational capability of Doppler, range, and angle features for fine-grained gesture recognition, this paper follows the feature-combination setting in prior work [30] and constructs single-channel, dual-channel, and triple-channel network inputs using the three extracted representations. For consistency across feature types and models, the Doppler–time, range–time, and angle–time maps are uniformly resized to 64 × 64 before being fed into the neural network.
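Choosing one, two, or three of the three representations gives seven possible input configurations (three single-channel, three dual-channel, and one triple-channel). A minimal sketch of how such inputs might be enumerated and assembled is given below; the dictionary keys and function names are hypothetical and do not come from the released code.

```python
from itertools import combinations

FEATURES = ("doppler", "range", "angle")  # hypothetical keys for the three maps


def channel_configurations():
    """All non-empty feature subsets, ordered by number of channels."""
    return [combo for size in (1, 2, 3) for combo in combinations(FEATURES, size)]


def build_input(sample, config):
    """Stack the selected maps into a [channels][rows][cols] nested list."""
    return [sample[name] for name in config]
```

Because the first network layer only sees the channel dimension, the same backbone can accept any of these configurations once its input-channel count is set to `len(config)`.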

4.3. Comparison Model: Multi-Path Refinement Network

Compared with backbones such as ResNet and VGG, the multi-path refinement network (RefineNet) is designed to recursively aggregate multi-level features. It jointly leverages high-level, low-resolution semantic features and low-level, high-resolution detailed features to construct refined high-resolution representations. RefineNet is built upon residual connections and identity mappings, which facilitate gradient propagation and stable optimization. In addition, its chained residual pooling (CRP) mechanism aggregates context at multiple pooling scales and integrates these pooled features through residual connections, enriching the representation with multi-scale information.

Following the stage-wise structure of ResNet, the backbone feature maps can be grouped into four consecutive blocks with decreasing spatial resolution. A corresponding four-level cascade of RefineNet modules is attached to these blocks to fuse features across stages and progressively refine the representation. Let ResNet block-m denote the m-th stage of the backbone, and let RefineNet-m denote the refinement module connected to ResNet block-m. Each RefineNet module comprises three components: a residual convolutional unit (RCU) for local feature adaptation, a multi-resolution fusion (MRF) unit for aligning and merging features from different resolutions, and a CRP unit for multi-scale context aggregation.

In this cascade, RefineNet-4 operates on the deepest backbone features to produce an initial refined representation at low resolution. RefineNet-3 takes as inputs the output of RefineNet-4 and the features from ResNet block-3, fuses these two sources, and refines the representation by injecting higher-resolution details. RefineNet-2 and RefineNet-1 follow the same top-down refinement strategy: each module combines the refined low-resolution features from the preceding (deeper) module in the cascade with the higher-resolution features from the corresponding backbone block, yielding progressively higher-resolution feature maps. Finally, the refined feature map is pooled and fed into the classification head (a fully connected layer) to predict fine-grained gesture categories.

4.4. Experimental Protocol Design

To avoid ambiguity caused by mixing subject-dependent and subject-independent evaluations, this paper reports results under two explicit benchmarks on the collected dataset (nine volunteers). The first benchmark evaluates subject-dependent performance under sample-level random splits, whereas the second benchmark evaluates subject-independent generalization using leave-one-subject-out (LOSO) cross-validation.

4.4.1. Subject-Dependent Evaluation (SD)

In the subject-dependent (SD) benchmark, all samples are pooled and randomly partitioned into training/validation/test sets with a ratio of 60%/20%/20%. To mitigate randomness introduced by data splitting and optimization, the SD benchmark is repeated five times (five random splits) with different random seeds. The reported SD performance is summarized as mean ± standard deviation over repeats.

4.4.2. Subject-Independent Evaluation (LOSO)

In the subject-independent benchmark, LOSO cross-validation is conducted across all volunteers. For each fold, one volunteer is held out as the test subject, and the remaining volunteers form the training pool. A validation set is sampled only from the training pool (i.e., excluding the held-out subject) for model selection and hyperparameter tuning. Performance is computed on the held-out subject for each fold and then aggregated across folds to report mean ± standard deviation.
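The two protocols differ only in how sample indices are partitioned. The sketch below illustrates both under stated assumptions: the function names are hypothetical, the seeds are arbitrary, and the validation subset of each LOSO fold (drawn from the training pool) is left to the caller.

```python
import random


def sd_split(n_samples, seed, ratios=(0.6, 0.2, 0.2)):
    """Subject-dependent split: shuffle all samples, then cut 60/20/20."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def loso_folds(subject_ids):
    """Yield (held_out_subject, training_pool_indices, test_indices) per fold."""
    for held_out in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        pool = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, pool, test
```

Calling `sd_split(4050, seed)` with five different seeds reproduces the structure of the five repeated splits, while `loso_folds` produces nine folds for nine volunteers. In the LOSO case no index of the held-out subject ever appears in the training pool, so any validation subset sampled from the pool inherits the same guarantee.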

4.4.3. Metrics and Experimental Configuration

Performance is reported using classification accuracy and macro-averaged F1 score (macro-F1). Macro-F1 is included to better reflect per-class performance under potential class imbalance. For completeness, the implementation also computes additional summary statistics (e.g., macro-precision/recall and weighted-F1), while accuracy and macro-F1 are used as the primary metrics throughout this paper.
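For reference, accuracy and macro-F1 can be computed from predicted and true labels as sketched below. The convention of scoring a class with no true or predicted samples as 0 is an assumption of this sketch; the paper's script may handle that case differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumed convention for empty classes
    return sum(scores) / n_classes
```

Because every class contributes equally to the average, a classifier that neglects a rare class is penalized more heavily by macro-F1 than by accuracy.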

To ensure fair comparison across models and feature combinations, all experiments adopt the same input resolution (64 × 64) and identical training hyperparameters. Two backbones (RefineNet and DRSN) are evaluated under seven input-channel configurations constructed from the Doppler–time, range–time, and angle–time representations, including three single-channel settings, three dual-channel settings, and one triple-channel setting. A unified experimental script is used to run the full configuration sweep and to generate structured result reports (per-repeat/per-fold outputs and aggregated summaries), enabling reproducibility and facilitating independent auditing of the evaluation protocol.

4.4.4. Preprocessing and Leakage Avoidance

To prevent test leakage, all normalization operations are performed instance-wise. Specifically, the normalization parameters for range and Doppler features are computed from OS-CFAR detections on each sample’s own feature maps, rather than from dataset-level statistics. Consequently, no global statistics (e.g., mean/standard deviation or min/max values computed over the full dataset) are derived using any held-out test subject data, and under LOSO evaluation, the held-out subject is never used to compute normalization parameters.
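The leakage-free property comes from the fact that the normalization function receives nothing but the sample itself. The sketch below illustrates this with min-max scaling over the sample's own detected cells; min-max scaling is used here only as a placeholder, since the actual normalization rule is defined in the feature-construction part of the paper, and the function and argument names are hypothetical.

```python
def normalize_instance(feature_map, detections):
    """Scale one sample using statistics from its own detected cells only.

    feature_map: 2-D list for a single sample.
    detections: (row, col) cells flagged by the detector for THIS sample.
    No dataset-level statistic enters the computation.
    """
    values = [feature_map[r][c] for r, c in detections]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a zero span
    return [[(v - lo) / span for v in row] for row in feature_map]
```

Since the function signature has no access to other samples, the result for a test sample is identical whether or not that sample's subject was part of the training pool.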

4.5. Results and Analysis

This section reports fine-grained gesture recognition performance under two evaluation protocols: subject-dependent (SD) evaluation with repeated random splits and subject-independent evaluation using leave-one-subject-out (LOSO) cross-validation. For each model and input-feature configuration, classification accuracy and macro-F1 are summarized as mean ± standard deviation.

Under the RefineNet-based baseline (Table 2), Doppler–time features provide the strongest single-channel representation, whereas range–time features are substantially weaker, and angle–time features are the least effective among the three single-channel inputs. This observation is consistent with the physical characteristics of the designed gestures and the sensing constraints of the radar: Doppler captures dominant motion dynamics, while angle estimation is limited by coarse angular resolution, and range variations can be subtle for small-amplitude fine-grained motions. For dual-channel inputs, combining Doppler–time with angle–time yields the best baseline performance, indicating that angle cues can complement Doppler signatures by resolving lateral-motion tendencies. By contrast, Doppler–range fusion does not yield a clear gain over Doppler alone, which aligns with the coupling between range evolution and radial velocity in FMCW motion patterns. Using all three features jointly achieves the highest baseline accuracy; however, the improvement over the best dual-channel setting is modest, and LOSO performance remains comparable to the Doppler–angle setting.

Table 3 reports the results of the improved DRSN under the same set of input-feature configurations. Across both SD and LOSO, DRSN consistently outperforms the RefineNet baseline, suggesting that integrating shrinkage-based denoising into residual learning is beneficial for fine-grained radar gesture recognition. For single-channel inputs, DRSN yields clear gains for Doppler–time and particularly for range–time, implying improved robustness when the input representation is less discriminative. For dual-channel inputs, Doppler–range achieves the highest SD accuracy, while Doppler–angle remains strong and competitive under both SD and LOSO. When all three features are fused, the triple-channel DRSN achieves the best overall performance (98.88% under SD and 92.20% under LOSO), and macro-F1 closely tracks accuracy, indicating that the gains are broadly consistent across classes under the nine-class formulation.

Beyond aggregate metrics, Figure 20 provides a class-level interpretation of the remaining confusions under different input-feature settings. With single-channel range inputs, major errors arise between gesture pairs that exhibit similar range–time trajectories but differ primarily in motion direction (e.g., the two lateral pat gestures), reflecting the limited discriminability of range-only cues under small displacement. Incorporating angle cues in the Doppler–angle and triple-channel settings substantially reduces such confusions by providing complementary lateral-motion information. For angle-only inputs, misclassifications increase among classes whose discriminative cues are encoded mainly in velocity profiles or hand-aperture dynamics rather than azimuth variation, which is consistent with the radar’s limited angular resolution. For the range–angle setting, removing Doppler information makes it more difficult to separate gestures with similar spatial traces but different velocity patterns, whereas triple-channel fusion yields the strongest diagonal dominance and the fewest off-diagonal errors overall.

A direct SD accuracy comparison between RefineNet and DRSN is summarized in Table 4. Performance gains are observed for all input settings, with the largest improvements appearing in configurations that are relatively weak under the baseline (e.g., range-only and range–angle), while Doppler-dominant configurations also improve consistently. Notably, under SD, DRSN improves over RefineNet by 8.52 percentage points for range–angle fusion and by 4.39 percentage points for triple-channel fusion, indicating that the shrinkage-based design provides consistent benefits beyond feature refinement alone. Under the subject-independent LOSO protocol, the advantage of DRSN is particularly evident for the triple-channel setting, supporting that multi-feature fusion combined with shrinkage-based denoising improves cross-subject generalization. The larger gain for the range–angle setting likely reflects a higher degree of noise sensitivity and weaker separability in the baseline: without Doppler, the model relies on comparatively noisier range/angle estimates (limited range span and coarse angular resolution). The shrinkage-based residual units in DRSN provide adaptive suppression of noise-like responses and implicit feature selection, which is expected to benefit such low-SNR inputs more strongly, whereas Doppler-inclusive settings have less room for improvement due to their stronger baseline separability.

In addition to recognition performance, Table 5 reports computational complexity and runtime behavior of RefineNet and DRSN under two numerical precisions (FP16/FP32) on the same experimental platform. The results highlight different accuracy-efficiency trade-offs: DRSN uses fewer parameters but incurs higher arithmetic cost (FLOPs per batch), leading to higher inference latency under the same batch setting. Switching from FP32 to FP16 reduces latency and increases throughput (FPS) for both models. Notably, the recorded inference peak memory is higher under FP16 in our measurements; this peak is obtained from the framework-reported maximum GPU memory allocation and thus includes temporary workspaces and allocator caching. In our stack, FP16 may invoke different precision-specific kernel paths (e.g., Tensor Core-optimized implementations) that allocate larger intermediate workspaces, whereas training peak memory still decreases under mixed precision because activations/gradients are stored in FP16 (with only FP32 master weights retained).

Finally, Table 6 summarizes representative radar-based hand gesture recognition methods and lists the datasets and evaluation protocols reported in the corresponding papers. Since these results are obtained under different data sources, label definitions, and experimental settings, they should be interpreted as a contextual reference rather than a strictly controlled benchmark. Nevertheless, the proposed variable-channel DRSN achieves strong performance under both subject-dependent and subject-independent (LOSO) evaluations on our dataset, while Table 5 further characterizes its computational and runtime costs. Overall, these results indicate that integrating multi-feature fusion with shrinkage-based residual learning provides a robust solution for fine-grained radar gesture recognition, with explicit accuracy–efficiency trade-offs that can be selected according to practical constraints.

5. Conclusions

This paper presents an end-to-end framework for fine-grained gesture recognition using FMCW millimeter-wave radar, encompassing gesture design, radar data acquisition, feature extraction, continuous-gesture detection/segmentation, and neural network-based classification. Ten gesture types were recorded, including eight valid command gestures and two return-to-neutral gestures; for classification, the two return-to-neutral gesture types were merged into a single invalid class, yielding a nine-class recognition task (eight valid classes plus one invalid class). Doppler–time representations extracted via STFT are used to support sliding-window-based automatic detection and segmentation in continuous streams, while MUSIC-based super-resolution estimation is adopted to construct range–time and angle–time representations for enhanced range/angle characterization. To mitigate inter-subject variability without introducing test leakage, instance-wise normalization is applied to Doppler and range features based solely on each sample’s own detections. For recognition, this paper develops a multi-path refinement network (RefineNet) built on a deep residual backbone to fuse low- and high-resolution features into high-resolution representations, and further integrates a shrinkage mechanism into residual building units to suppress noise-like responses, resulting in a deep residual shrinkage network (DRSN). The discriminative contributions of Doppler, range, and angle features are examined through single-channel, dual-channel, and triple-channel input configurations. To remove ambiguity in evaluation, performance is reported under two explicit protocols—subject-dependent evaluation with repeated random splits and subject-independent evaluation with leave-one-subject-out (LOSO) cross-validation—demonstrating that the proposed variable-channel DRSN consistently outperforms the RefineNet-based baseline and that the triple-channel configuration achieves the best overall performance. 
The triple-channel DRSN achieves an accuracy of 98.88% on the fine-grained gesture recognition task under subject-dependent evaluation, an improvement of 4.39 percentage points over the triple-channel RefineNet. Overall, the experimental results demonstrate the adaptability of the proposed system, enabling flexible selection of channel configurations to suit diverse application scenarios.

Author Contributions

Conceptualization, P.C. and J.W.; methodology, P.C.; software, C.Y. and Y.B.; validation, S.L., C.Y. and Y.B.; formal analysis, S.L. and C.Y.; investigation, P.C.; resources, S.L. and C.Y.; data curation, S.L. and C.Y.; writing—original draft preparation, S.L. and C.Y.; writing—review and editing, S.L. and P.C.; visualization, S.L.; supervision, P.C. and J.W.; project administration, P.C.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jianbing and Lingyan Key Research and Development Program of Zhejiang Province of China under Grant No. 2023C01148.

Data Availability Statement

The original contributions presented in the study are included in the article; the corresponding codes are available at https://github.com/Summer-Hometown/Hand-Gesture (accessed on 10 December 2025); due to the size of the dataset, further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank the anonymous reviewers and editors for their valuable comments and constructive suggestions, which helped improve the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Liu, X.; Tang, S.; Zhang, B.; Wu, J.; Ma, X.; Wang, J. WiVi-GR: Wireless-Visual Joint Representation-Based Accurate Gesture Recognition. IEEE Internet Things J. 2024, 11, 2701–2711. [Google Scholar] [CrossRef]
Liu, H.; Zhou, A.; Dong, Z.; Sun, Y.; Zhang, J.; Liu, L.; Ma, H.; Liu, J.; Yang, N. M-Gesture: Person-Independent Real-Time In-Air Gesture Recognition Using Commodity Millimeter Wave Radar. IEEE Internet Things J. 2022, 9, 3397–3415. [Google Scholar] [CrossRef]
Bai, Y.; Wang, J.; Chen, P.; Gong, Z.; Xiong, Q. Hand Trajectory Recognition by Radar with a Finite-State Machine and a Bi-LSTM. Appl. Sci. 2024, 14, 6782. [Google Scholar] [CrossRef]
Chen, Q.; Cui, Z.; Zhou, Z.; Tian, Y.; Cao, Z. MMHTSR: In-Air Handwriting Trajectory Sensing and Reconstruction Based on mmWave Radar. IEEE Internet Things J. 2024, 11, 10069–10083.
Chen, Q.; Cui, Z.; Tian, Y.; Chen, Y.; Cao, Z. Joint Position Estimation for Hand Motion Using MIMO FMCW mmWave Radar. IEEE Internet Things J. 2025, 12, 2838–2853.
Wang, Y.; Ren, A.; Zhou, M.; Wang, W.; Yang, X. A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar. IEEE Access 2020, 8, 167264–167275.
Jin, B.; Ma, X.; Zhang, Z.; Lian, Z.; Wang, B. Interference-Robust Millimeter-Wave Radar-Based Dynamic Hand Gesture Recognition Using 2-D CNN-Transformer Networks. IEEE Internet Things J. 2024, 11, 2741–2752.
Ahmed, S.; Kim, W.; Park, J.; Cho, S.H. Radar-Based Air-Writing Gesture Recognition Using a Novel Multistream CNN Approach. IEEE Internet Things J. 2022, 9, 23869–23880.
Eom, J.Y.; Jeon, W.S.; Jeong, D.G. UWB Impulse Radar-Based Open-Set Gesture Recognition Using Transformer and One-Versus-Rest Classifier. IEEE Internet Things J. 2025, 12, 24803–24818.
Ahmed, S.; Kallu, K.D.; Ahmed, S.; Cho, S.H. Hand Gestures Recognition Using Radar Sensors for Human-Computer-Interaction: A Review. Remote Sens. 2021, 13, 527.
Tuan Trinh, T.; Vu Pham, H.; Dat Le, T.; Le, M. Hand Gesture Recognition with Uncertainty Awareness via FMCW Radar Sensing and Deep Learning. IEEE Sens. J. 2025, 25, 24517–24524.
Lien, J.; Gillian, N.; Karagozler, M.E.; Amihood, P.; Schwesig, C.; Olson, E.; Raja, H.; Poupyrev, I. Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar. ACM Trans. Graph. 2016, 35, 1–19.
Dong, X.; Zhao, Z.; Wang, Y.; Zeng, T.; Wang, J.; Sui, Y. FMCW Radar-Based Hand Gesture Recognition Using Spatiotemporal Deformable and Context-Aware Convolutional 5-D Feature Representation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
Kim, W.; Byung Park, J.; Ahmed, S.; Ho Cho, S. FMCW Radar-Based In-Air Alphanumeric Gesture Recognition with Machine Learning. IEEE Trans. Instrum. Meas. 2025, 74, 1–12.
Malysa, G.; Wang, D.; Netsch, L.; Ali, M. Hidden Markov Model-Based Gesture Recognition with FMCW Radar. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA, 7–9 December 2016; pp. 1017–1021.
Rao, S.; Ahmad, A.; Roh, J.C.; Bharadwaj, S. 77GHz Single Chip Radar Sensor Enables Automotive Body and Chassis Applications. Available online: https://www.ti.com/lit/wp/spry315/spry315.pdf (accessed on 15 June 2020).
Ryu, S.-J.; Suh, J.-S.; Baek, S.-H.; Hong, S.; Kim, J.-H. Feature-Based Hand Gesture Recognition Using an FMCW Radar and Its Temporal Feature Analysis. IEEE Sens. J. 2018, 18, 7593–7602.
Suh, J.S.; Ryu, S.; Han, B.; Choi, J.; Kim, J.-H.; Hong, S. 24 GHz FMCW Radar System for Real-Time Hand Gesture Recognition Using LSTM. In Proceedings of the 2018 Asia-Pacific Microwave Conference (APMC), Kyoto, Japan, 6–9 November 2018; pp. 860–862.
Zhang, Z.; Tian, Z.; Mu, Z.; Liu, Y. Application of FMCW Radar for Dynamic Continuous Hand Gesture Recognition. In Proceedings of the 11th EAI International Conference on Mobile Multimedia Communications, Qingdao, China, 21–22 June 2018; ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering): Brussels, Belgium, 2018; pp. 298–303.
Choi, J.-W.; Ryu, S.-J.; Kim, J.-H. Short-Range Radar Based Real-Time Hand Gesture Recognition Using LSTM Encoder. IEEE Access 2019, 7, 33610–33618.
Li, Q.; Liu, L.; Hao, S.; Wan, G. Dynamic Gesture Recognition Method Based on Millimeter-Wave Radar. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 63–67.
Chen, J.; Wen, P.; Chen, G.; Wang, Y.; Wang, Y.; Zheng, J. Hand Gesture Recognition Based on Millimeter-Wave Radar Using iFormer. In Proceedings of the 2024 9th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 12–14 July 2024; pp. 22–26.
Wang, C.; Zhao, X.; Li, Z. DCS-CTN: Subtle Gesture Recognition Based on TD-CNN-Transformer via Millimeter-Wave Radar. IEEE Internet Things J. 2023, 10, 17680–17693.
Huang, J.; Jiao, L.; Zhou, C.; Li, H. Millimeter Wave Radar Gesture Recognition Based on ECC. In Proceedings of the IGARSS 2024–2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 9623–9626.
Jin, B.; Ma, X.; Hu, B.; Zhang, Z.; Lian, Z.; Wang, B. Gesture-mmWAVE: Compact and Accurate Millimeter-Wave Radar-Based Dynamic Gesture Recognition for Embedded Devices. IEEE Trans. Hum. Mach. Syst. 2024, 54, 337–347.
Behera, A.; Wharton, Z.; Liu, Y.; Ghahremani, M.; Kumar, S.; Bessis, N. Regional Attention Network (RAN) for Head Pose and Fine-Grained Gesture Recognition. IEEE Trans. Affect. Comput. 2023, 14, 549–562.
Liu, D.; Zhang, L.; Wu, Y. LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 3304–3312.
Fang, J.; Xu, Y.; Liu, W.; Li, C.; Fang, Q.; He, X. Real-Time Fine-Grained Gesture Recognition Based on Millimeter Wave Radar. In Proceedings of the 2024 IEEE Smart World Congress (SWC), Denarau Island, Fiji, 2–7 December 2024; pp. 394–401.
Wang, Y.; Wang, D.; Fu, Y.; Yao, D.; Xie, L.; Zhou, M. Multi-Hand Gesture Recognition Using Automotive FMCW Radar Sensor. Remote Sens. 2022, 14, 2374.
Yuan, C.; Chen, Z.; Chen, P.; Tian, R.; Xiong, D.; Guo, W. Fine-Grained Gesture Recognition by Using FMCW Millimeter-Wave Radar. In Proceedings of the 2023 Cross Strait Radio Science and Wireless Technology Conference (CSRSWTC), Guilin, China, 10–13 November 2023; pp. 1–3.
Wang, S.; Song, J.; Lien, J.; Poupyrev, I.; Hilliges, O. Interacting with Soli: Exploring Fine-Grained Dynamic Gesture Recognition in the Radio-Frequency Spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; ACM: Tokyo, Japan, 2016; pp. 851–860.
Ahmed, S.; Wang, D.; Park, J.; Cho, S.H. UWB-Gestures, a Public Dataset of Dynamic Hand Gestures Acquired Using Impulse Radar Sensors. Sci. Data 2021, 8, 102.
Schmidt, R. Multiple Emitter Location and Signal Parameter Estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280.
Tan, T.-H.; Tian, J.-H.; Sharma, A.K.; Liu, S.-H.; Huang, Y.-F. Human Activity Recognition Based on Deep Learning and Micro-Doppler Radar Data. Sensors 2024, 24, 2530.
Rohling, H. Radar CFAR Thresholding in Clutter and Multiple Target Situations. IEEE Trans. Aerosp. Electron. Syst. 1983, AES-19, 608–621.
Gandhi, P.P.; Kassam, S.A. Analysis of CFAR Processors in Nonhomogeneous Background. IEEE Trans. Aerosp. Electron. Syst. 1988, 24, 427–445.
Zhao, M.; Kang, M.; Tang, B.; Pecht, M. Deep Residual Networks with Dynamically Weighted Wavelet Coefficients for Fault Diagnosis of Planetary Gearboxes. IEEE Trans. Ind. Electron. 2018, 65, 4290–4300.
Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 448–456.
Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2014, arXiv:1312.4400.
Donoho, D.L. De-Noising by Soft-Thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627.
Isogawa, K.; Ida, T.; Shiodera, T.; Takeguchi, T. Deep Shrinkage Convolutional Neural Network for Adaptive Noise Reduction. IEEE Signal Process. Lett. 2018, 25, 224–228.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Moon, J.; Kim, B.-K.; Kang, J. Fixed Point Cloud Normalization and None-Sequential Modeling for Hand Gesture Recognition Based on Short-Range mmWave Radar Sensor’s Sparse Time-Series Point Cloud. IEEE Sens. J. 2024, 24, 10656–10668.
Khan, I.; Kwon, Y.-W. Radar-Based Hand Gesture Recognition with Feature Fusion Using Robust CNN-LSTM and Attention Architecture. IEEE Access 2025, 13, 69281–69291.

Figure 1. Working principle of FMCW radar.

Figure 2. Eight fine-grained command gestures. (a) Pat left. (b) Pat right. (c) Palm open. (d) Palm pinch. (e) Rub fingers. (f) Hook hand. (g) Move forward. (h) Move backward.

Figure 3. Two return-to-neutral gesture types. (a) Return-to-neutral gesture 1 (after “pat left”). (b) Return-to-neutral gesture 2 (after “pat right”).

Figure 4. Fine-grained gesture data acquisition scenario.

Figure 5. One-dimensional range profiles before and after static clutter suppression: (a) before suppression; (b) after suppression.

Figure 6. Range–time spectrograms of fine-grained gesture types. (a) Pat left. (b) Pat right. (c) Palm open. (d) Palm pinch. (e) Rub fingers. (f) Hook hand. (g) Move forward. (h) Move backward. (i) Return-to-neutral 1. (j) Return-to-neutral 2.

Figure 7. Doppler feature extraction process for fine-grained gestures. Two of the block types in the legend, located in the same columns, indicate the range-bin data obtained by applying an FFT along the fast-time (sampling-point) dimension within each frame; the third block type indicates the range-superposed slow-time sequence, on which STFT is applied frame-by-frame and then stitched across frames to form the Doppler–time diagram.

Figure 8. Doppler–time spectrograms of fine-grained gesture types. (a) Pat left. (b) Pat right. (c) Palm open. (d) Palm pinch. (e) Rub fingers. (f) Hook hand. (g) Move forward. (h) Move backward. (i) Return-to-neutral 1. (j) Return-to-neutral 2.

Figure 9. Angle–time spectrograms of fine-grained gesture types. (a) Pat left. (b) Pat right. (c) Palm open. (d) Palm pinch. (e) Rub fingers. (f) Hook hand. (g) Move forward. (h) Move backward. (i) Return-to-neutral 1. (j) Return-to-neutral 2.

Figure 10. Range feature normalization process for fine-grained gestures. The red horizontal lines indicate the instance-wise lower/upper range-bin bounds $(r_{i1}, r_{i2})$ of the OS-CFAR-detected target region, and the corresponding normalized bounds after min–max mapping to the predefined interval $[r_{1}, r_{2}]$.

Figure 11. Doppler feature normalization process for fine-grained gestures. The red horizontal lines indicate the instance-wise lower/upper Doppler-bin bounds $(v_{i1}, v_{i2})$ of the OS-CFAR-detected target region, and the corresponding normalized bounds after min–max mapping to the predefined interval $[v_{1}, v_{2}]$.

Figure 12. Illustration of OS-CFAR-based target-frame determination on the Doppler–time representation. (a) Doppler spectrum for dynamic gestures. (b) OS-CFAR detection result on the Doppler–time representation. (c) Target frame determination.

Figure 13. Workflow of boundary detection for fine-grained gestures.

Figure 14. Schematic diagram of RBU structure.

Figure 15. Architecture of ResNets.

Figure 16. Soft thresholding function and its derivative. (a) Soft thresholding function. (b) Derivative of the soft thresholding function.

Figure 17. Architecture of RSBU. (a) RSBU-CS. (b) RSBU-CW.

Figure 18. Architecture of DRSN. (a) DRSN-CS. (b) DRSN-CW.

Figure 19. Architecture of the proposed variable-channel DRSN. Orange circles denote the input feature vector (after GAP), green circles denote hidden neurons in the fully connected classifier, and blue circles denote the output units producing class scores for final classification.

Figure 20. Normalized confusion matrices of DRSN. (a) Single-channel Doppler features. (b) Single-channel range features. (c) Single-channel angle features. (d) Doppler–range joint estimation. (e) Doppler–angle joint estimation. (f) Range–angle joint estimation. (g) Doppler–range–angle joint estimation.

Table 1. Radar parameter design.

Parameter Type	Parameter Value
Radar Model	AWR1843
Transmitted Signal	FMCW Signal
Antenna Channels	3 Transmitters, 4 Receivers
Effective Bandwidth	2.5 GHz
Range Resolution	5.9 cm
Velocity Resolution	10 cm/s
Sampling Points per Chirp	128
Number of Chirps per Frame	384
Frames per Sample	512
Frame Duration	20 ms
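
As a rough consistency check on Table 1, the standard FMCW relations ΔR = c/(2B) and Δv = λ/(2T) can be evaluated from the listed values. The sketch below rests on two assumptions that are not stated in the table: a 77 GHz carrier (the AWR1843 operates in the 76–81 GHz band, but the exact centre frequency is not given here) and a coherent Doppler observation time T equal to the 20 ms frame duration. The function and variable names are illustrative only.

```python
# Sanity check of the resolutions in Table 1 using textbook FMCW formulas.
C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz: float) -> float:
    """Range resolution in metres: c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)


def velocity_resolution(carrier_hz: float, observation_s: float) -> float:
    """Velocity resolution in m/s: wavelength / (2 * observation time)."""
    wavelength = C / carrier_hz
    return wavelength / (2.0 * observation_s)


bandwidth_hz = 2.5e9       # "Effective Bandwidth" from Table 1
frame_duration_s = 20e-3   # "Frame Duration" from Table 1
carrier_hz = 77e9          # ASSUMPTION: not given in Table 1

range_res_cm = 100.0 * range_resolution(bandwidth_hz)
velocity_res_cm_s = 100.0 * velocity_resolution(carrier_hz, frame_duration_s)

print(f"range resolution    ~ {range_res_cm:.2f} cm")
print(f"velocity resolution ~ {velocity_res_cm_s:.2f} cm/s")
```

Under these assumptions both values land close to the 5.9 cm and 10 cm/s reported in Table 1; the small gaps may reflect rounding or the exact chirp configuration, which this sketch does not model.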

Table 2. Accuracy and macro-F1 of the RefineNet-based baseline under different input-feature configurations (SD and LOSO).

Number of Channels	Input Features	Accuracy, SD (Mean ± Std %)	Accuracy, LOSO (Mean ± Std %)	Macro-F1, SD (Mean ± Std %)	Macro-F1, LOSO (Mean ± Std %)
Single-Channel	Doppler	93.65 ± 1.10	87.56 ± 1.24	93.65 ± 1.07	86.80 ± 1.27
	range	74.34 ± 1.82	70.76 ± 2.08	74.12 ± 1.69	69.98 ± 2.13
	angle	67.80 ± 3.26	63.62 ± 3.02	67.79 ± 3.25	62.89 ± 3.31
Dual-Channel	Doppler–range	93.74 ± 1.56	87.64 ± 1.47	93.73 ± 1.55	87.22 ± 1.38
	Doppler–angle	94.41 ± 1.07	87.89 ± 1.13	94.44 ± 1.05	87.38 ± 0.99
	range–angle	74.69 ± 1.91	72.29 ± 2.03	74.67 ± 1.93	71.84 ± 2.34
Triple-Channel	Doppler–range–angle	94.49 ± 0.85	87.76 ± 0.92	94.51 ± 0.97	87.40 ± 1.03

Table 3. Accuracy and macro-F1 of the improved DRSN under different input-feature configurations (SD and LOSO).

Number of Channels	Input Features	Accuracy, SD (Mean ± Std %)	Accuracy, LOSO (Mean ± Std %)	Macro-F1, SD (Mean ± Std %)	Macro-F1, LOSO (Mean ± Std %)
Single-Channel	Doppler	97.16 ± 1.03	90.62 ± 1.52	97.15 ± 1.10	90.14 ± 1.23
	range	89.75 ± 1.93	80.94 ± 2.03	89.82 ± 2.16	79.73 ± 2.36
	angle	71.72 ± 3.34	62.62 ± 3.12	71.10 ± 3.79	59.53 ± 3.46
Dual-Channel	Doppler–range	97.65 ± 0.98	89.26 ± 1.26	97.66 ± 1.05	88.60 ± 0.97
	Doppler–angle	97.28 ± 0.82	90.52 ± 0.97	97.28 ± 0.79	89.91 ± 0.89
	range–angle	83.21 ± 2.24	74.47 ± 2.37	81.92 ± 2.24	72.23 ± 2.63
Triple-Channel	Doppler–range–angle	98.88 ± 0.74	92.20 ± 0.82	98.68 ± 0.77	91.69 ± 0.89

Table 4. Subject-dependent (SD) accuracy comparison between RefineNet and DRSN.

Number of Channels	Input Features	RefineNet Accuracy (Mean %)	DRSN Accuracy (Mean %)	Improvement (Percentage Points)
Single-Channel	Doppler	93.65	97.16	3.51
	range	74.34	89.75	15.41
	angle	67.80	71.72	3.92
Dual-Channel	Doppler–range	93.74	97.65	3.91
	Doppler–angle	94.41	97.28	2.87
	range–angle	74.69	83.21	8.52
Triple-Channel	Doppler–range–angle	94.49	98.88	4.39
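
The improvement column of Table 4 is the plain difference between the two mean SD accuracies, i.e., a gain in percentage points rather than a relative change. The short sketch below recomputes it from the SD accuracy columns of Tables 2 and 3; the dictionary layout and names are illustrative and not part of the original work.

```python
# Recompute the "Improvement" column of Table 4 as DRSN minus RefineNet
# (mean subject-dependent accuracy, in percentage points).
sd_accuracy = {
    # feature configuration: (RefineNet mean %, DRSN mean %)
    "Doppler": (93.65, 97.16),
    "range": (74.34, 89.75),
    "angle": (67.80, 71.72),
    "Doppler–range": (93.74, 97.65),
    "Doppler–angle": (94.41, 97.28),
    "range–angle": (74.69, 83.21),
    "Doppler–range–angle": (94.49, 98.88),
}

improvement = {
    features: round(drsn - refinenet, 2)
    for features, (refinenet, drsn) in sd_accuracy.items()
}

for features, gain in improvement.items():
    print(f"{features:<22s} +{gain:.2f} points")
```

The largest gain appears for the single-channel range input and the smallest for the Doppler–angle pair, matching the values listed in the table.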

Table 5. Computational complexity and runtime performance of RefineNet and DRSN under different numerical precisions on the same experimental platform.

Model	DType	Params	FLOPs/Forward (Batch)	GFLOPs/Sample	Latency (ms/Batch)	FPS (Samples/s)	Infer Peak Mem (MB)	Train (ms/Iter)	Train Peak Mem (MB)
RefineNet	FP16	82.55M	0.214T	1.669	23.64	5413.5	660.7	70.26	1286.3
RefineNet	FP32	82.55M	0.214T	1.669	31.51	4061.9	422.4	86.71	2102.9
DRSN	FP16	23.81M	1.188T	9.280	99.60	1285.1	1292.4	365.63	4537.7
DRSN	FP32	23.81M	1.188T	9.280	193.16	662.7	1259.0	5912.25	8650.5
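
The columns of Table 5 are linked by simple arithmetic, which the sketch below makes explicit. The batch size is not stated in the table; it is inferred here as the ratio of FLOPs per batch to FLOPs per sample (about 128 for both models), and that inferred value is an assumption. Throughput is then estimated as batch size divided by batch latency and compared with the reported FPS.

```python
# Cross-check Table 5: infer the batch size and re-derive throughput.
rows = {
    # (model, dtype): (TFLOPs per batch, GFLOPs per sample, latency ms/batch, reported FPS)
    ("RefineNet", "FP16"): (0.214, 1.669, 23.64, 5413.5),
    ("RefineNet", "FP32"): (0.214, 1.669, 31.51, 4061.9),
    ("DRSN", "FP16"): (1.188, 9.280, 99.60, 1285.1),
    ("DRSN", "FP32"): (1.188, 9.280, 193.16, 662.7),
}

derived = {}
for key, (tflops_batch, gflops_sample, latency_ms, fps_reported) in rows.items():
    # ASSUMPTION: batch size = FLOPs per batch / FLOPs per sample, rounded.
    batch_size = round(tflops_batch * 1000.0 / gflops_sample)
    fps_estimated = batch_size * 1000.0 / latency_ms
    derived[key] = (batch_size, fps_estimated, fps_reported)
    print(f"{key[0]:<9s} {key[1]}: batch ~ {batch_size}, "
          f"FPS ~ {fps_estimated:.1f} (reported {fps_reported})")
```

With the inferred batch size, the estimated throughput agrees with the reported FPS to well within one percent for all four rows, which suggests that the FPS column was obtained from the batch latency in this way.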

Table 6. Comparison of radar-based hand gesture recognition methods. (C, S, and N denote the number of classes, subjects, and samples reported in the corresponding paper, respectively; “Prot” denotes the evaluation protocol. Results are reported on different datasets and protocols and should be interpreted as a contextual reference rather than a strictly controlled benchmark.)

Method	Dataset (Sensor) and Protocol	Accuracy (%)
ShuffleNet [14]	FMCW (TI IWR6843ISK-ODS); C = 43; S = 14; N = 4730 (+860 test-only); Prot: split + unseen-subject test	93.10
Fixed Point Cloud Normalization with Non-sequential PointNet [43]	mmWave point cloud (TI IWR6843AOP EVM); C = 10; S = 3; N = 50 min recordings; Prot: 7:3 split	96.64
Dual-branch Transformer and One-Versus-Rest Classifier [9]	UWB impulse radar (XeThru X4); C = 12; S = 8; N = 100/gesture/subject; Prot: public dataset (3 placements)	97.00
Multi-radar CNN-LSTM-Attention Architecture [44]	UWB-Gestures (3× XeThru X4); C = 1; S = 8; N = 100/gesture/subject/radar; Prot: 20% test split	98.33
Triple-Channel Deep Residual Shrinkage Network (Ours)	FMCW fine-grained gestures (TI AWR1843); C = 9 (10 → 9 merged); S = 9; N = 4050; Prot: SD (repeats) + LOSO	98.88

