1. Introduction
Hand gesture recognition is a key component of human-computer interaction (HCI) and shows strong potential in 5G sensing, sign language translation, motion-sensing games, smart homes, and smart healthcare [1,2]. Owing to its high operating frequency, millimeter-wave radar enables precise measurements and can detect gestures with small motion amplitudes [3]. In addition, radar signals are less susceptible to blockage by certain materials during propagation [4], and advances in integrated-circuit technology have further reduced chip size [5]. Overall, millimeter-wave radar provides strong privacy, high environmental adaptability, ease of integration, and low cost, making it a promising solution for gesture recognition [6,7].
In recent years, research on gesture recognition has shifted from traditional signal processing to deep learning methods [8,9], particularly convolutional neural networks (CNNs), which have advanced radar-based gesture recognition [10,11]. The Soli project released by Google [12] represents a breakthrough in frequency-modulated continuous-wave (FMCW) radar-based micro-motion gesture recognition and can identify fine-grained gestures, such as finger movements and thumb motions across other fingers.
However, most FMCW radar-based gesture recognition studies have primarily focused on algorithm design [
13,
14]. Malysa et al. [
15] used a hidden Markov model (HMM) to classify five gesture signals collected by FMCW radar and reported an accuracy of 82%. The engineering team at TI [
16] designed and developed a multifunctional FMCW radar for gesture recognition and demonstrated its use in car door opening and closing. Ryu et al. [
17] applied short-time Fourier transform (STFT) to extract gesture features, and combined it with a quantum-inspired evolutionary algorithm (QEA) for classification, achieving 85% accuracy. Suh et al. [
18] designed FMCW hardware with one transmitting and four receiving antennas, extracted the Doppler–range map of gestures as features, and used a long short-term memory (LSTM) network to classify seven gesture types. Zhang et al. [
19] adopted connectionist temporal classification (CTC) and utilized unsegmented gesture data streams to predict gesture categories. Choi et al. [
20] developed a real-time gesture recognition algorithm based on an LSTM encoder and incorporated a Gaussian mixture model for clutter removal. In that work, the constant false alarm rate (CFAR) algorithm was employed for gesture detection for the first time. After classification with the LSTM network, the system recognized ten different gestures and remained effective under low signal-to-noise ratio (SNR) conditions.
At present, millimeter-wave radar-based gesture recognition has advanced in both hardware and algorithms [
21,
22]. Wang et al. [
23] proposed a DCS-CTN (Millimeter-Wave Radar Data Cube Sequence and Time-Distributed CNN-Transformer Network) framework that uses a time-distributed wrapper and a convolutional neural network (CNN) to extract local features from the data-cube sequence, a positional encoder to preserve temporal information, and a Transformer network to capture global features. This framework achieved a gesture recognition accuracy of 99.75%. Huang et al. [
24] developed a gesture recognition algorithm that combines an error-correction code (ECC) module with multiple signal classification (MUSIC) preprocessing, improving decision logic and robustness while significantly enhancing real-time detection performance. Jin et al. [
25] introduced a dynamic gesture recognition method based on multi-layer feature fusion (MLFF) and a Transformer, characterized by a small model size and high accuracy. On a dataset with 10% random interference, this method achieved an average recognition accuracy of 99.11%, indicating strong potential for embedded device applications.
Accurate recognition of fine-grained gestures has long been a major challenge in gesture recognition [
26]. The deployment of gesture recognition systems in real-world scenarios still faces several difficulties: (a) the system must operate reliably across diverse environments; (b) individuals perform gestures differently, and even for the same gesture, motion amplitude and velocity can vary substantially across users; (c) continuous gestures introduce segmentation challenges [
27]; and (d) fine-grained gestures typically involve small motion amplitudes and low velocities, which complicates feature extraction [
28]. Consequently, further investigation is needed for radar-based measurement and recognition of fine-grained gestures.
To address these challenges, this paper proposes an improved deep residual shrinkage network (DRSN) for fine-grained gesture recognition. Specifically, the network extends the multi-path refinement architecture of deep residual networks by integrating low- and high-resolution features to produce high-resolution representations. In addition, a shrinkage module is incorporated to suppress noise interference effectively. To assess the representational capacity of different feature sets, single-, dual-, and triple-channel neural network architectures are designed. The proposed model achieves an accuracy of 98.88% on a nine-class fine-grained gesture recognition task and provides recommendations for flexible channel-mode selection across different application scenarios. The main contributions of this paper are summarized as follows:
This paper constructs a dataset comprising 4050 samples collected using an AWR1843 mm-wave radar, covering nine gesture categories in total: eight fine-grained command gestures commonly encountered in daily life and one return-to-neutral category. To improve robustness in continuous gesture recognition, two return-to-neutral gesture types (following “pat left” and “pat right” actions, respectively) are defined and collected separately, and are then merged into a single return-to-neutral category during labeling and classification. This dataset helps mitigate the lack of publicly available datasets for fine-grained gesture recognition and reduces interference and false positives between consecutive command gestures.
This paper applies a Deep Residual Shrinkage Network (DRSN) to fine-grained gesture recognition by embedding a learnable residual shrinkage (soft-thresholding) mechanism into residual units to adaptively suppress noise-like responses, thereby improving robustness under noisy radar measurements and enhancing recognition accuracy.
This paper designs three DRSN configurations—single-channel, dual-channel, and triple-channel—to evaluate the representational capacity of different feature sets for fine-grained gestures. This design enables flexible selection of the number and combination of input features to meet diverse application requirements.
The remainder of this paper is organized as follows.
Section 2 presents the fine-grained gesture data collection and feature extraction methods based on FMCW millimeter-wave radar, including the FMCW radar signal model, gesture design and acquisition, feature extraction, feature normalization, and continuous-gesture segmentation.
Section 3 describes the experimental methodology based on the improved DRSN architecture.
Section 4 reports the experimental setup, results, and analysis, including parameter settings, dataset construction, and comparative experiments.
Section 5 concludes the paper with a summary and final remarks.
2. Data Collection and Feature Extraction
2.1. Millimeter-Wave Radar Signal Model
This paper uses TI’s AWR1843 mm-wave radar to generate FMCW signals, which provide high range resolution, short measurement time, and low computational complexity, making them well-suited to gesture recognition [
29].
The core of the AWR1843 mm-wave radar is the RF/analog subsystem, which comprises three components: a clock subsystem, a transmit subsystem, and a receive subsystem. The clock subsystem provides stable operation over the 76–81 GHz frequency range. The transmit subsystem supports three channels with independent amplitude and phase control, and the receive subsystem provides four parallel channels, each equipped with a low-noise amplifier, a filter, and a mixer.
The radar’s RF front-end integrates three transmit antennas. An FMCW waveform generated by the signal generator is amplified by a power amplifier and radiated by the transmit antennas. The transmitted signal is reflected by the target, producing echoes that are captured by four receive antennas. Each received signal is amplified by a low-noise amplifier and mixed with the transmit signal to generate an intermediate-frequency (IF) signal, which is then low-pass filtered (LPF) and digitized by an analog-to-digital converter (ADC). Finally, the digitized data are sent to the signal processor, yielding four-channel digital outputs.
An FMCW radar transmits a frequency-modulated continuous wave; the frequency difference between the received echo and the transmitted signal reflects target characteristics. The working principle of the FMCW radar is shown in
Figure 1.
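The frequency of an FMCW signal varies linearly with time. Let the starting frequency be $f_0$ and the slope be $K$. The transmitted signal can be expressed as
$$s_T(t) = \exp\!\left[j2\pi\left(f_0 t + \tfrac{1}{2}Kt^2\right)\right].$$
The transmitted signal $s_T(t)$ is reflected by the target, generating an echo signal $s_R(t)$, which is received by the antenna. Ignoring amplitude variations, the received signal can be approximated as a time-delayed version of the transmitted signal:
$$s_R(t) = s_T(t-\tau) = \exp\!\left[j2\pi\left(f_0 (t-\tau) + \tfrac{1}{2}K(t-\tau)^2\right)\right],$$
where the time delay $\tau$ is related to the target range $R$ by
$$\tau = \frac{2R}{c},$$
and $c$ denotes the speed of light.
To obtain the mixed signal, the transmitted signal $s_T(t)$ and the received signal $s_R(t)$ are mixed in the mixer as
$$s_M(t) = s_T(t)\, s_R^{*}(t),$$
where $(\cdot)^{*}$ denotes complex conjugation.
After low-pass filtering (LPF), the intermediate-frequency (IF) signal can be written as
$$s_{IF}(t) = \exp\!\left[j\left(2\pi f_b t + \varphi_0\right)\right],$$
where the beat frequency is $f_b = K\tau$ and $\varphi_0$ is a constant phase term that absorbs components independent of $t$ (e.g., $2\pi f_0 \tau$).
Therefore, the relationship between the target range $R$ and the IF frequency $f_b$ is
$$R = \frac{c f_b}{2K}.$$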
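As a rough numerical illustration of the relation between range and IF (beat) frequency, the following Python sketch simulates an ideal single-target IF tone and recovers the range from the strongest bin of a naive DFT. It relies on the standard FMCW relations (round-trip delay equals twice the range divided by the speed of light; beat frequency equals chirp slope times delay). The chirp slope, sampling rate, number of samples, and target range are hypothetical values chosen for illustration only; they are not the AWR1843 configuration used in this paper.

```python
import cmath
import math

C = 299_792_458.0  # speed of light (m/s)


def beat_frequency(target_range_m, slope_hz_per_s):
    """Beat frequency of an ideal point target: slope times round-trip delay."""
    delay_s = 2.0 * target_range_m / C
    return slope_hz_per_s * delay_s


def range_from_beat(beat_hz, slope_hz_per_s):
    """Invert the relation above: range = c * f_b / (2 * slope)."""
    return C * beat_hz / (2.0 * slope_hz_per_s)


def estimate_range(if_samples, slope_hz_per_s, fs_hz):
    """Locate the strongest DFT bin of the IF samples and convert it to range."""
    n = len(if_samples)
    magnitudes = []
    for k in range(n // 2):  # positive beat frequencies only
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(if_samples))
        magnitudes.append(abs(acc))
    k_peak = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return range_from_beat(k_peak * fs_hz / n, slope_hz_per_s)


# Hypothetical chirp parameters (assumptions for illustration only).
SLOPE = 50e12      # chirp slope: 50 MHz per microsecond, in Hz/s
FS = 5e6           # ADC sampling rate (Hz)
N_SAMPLES = 128    # samples per chirp
TRUE_RANGE = 0.5   # target range (m)

f_b = beat_frequency(TRUE_RANGE, SLOPE)
if_tone = [cmath.exp(2j * math.pi * f_b * i / FS) for i in range(N_SAMPLES)]
estimated = estimate_range(if_tone, SLOPE, FS)
bin_width_m = range_from_beat(FS / N_SAMPLES, SLOPE)  # range spacing of one DFT bin
print(f"beat = {f_b / 1e3:.1f} kHz, estimated range = {estimated:.3f} m, "
      f"bin width = {bin_width_m:.3f} m")
```

The estimate can only land on a multiple of the DFT bin spacing, so its error is bounded by one bin width; this quantization is what limits DFT-based range estimation for small displacements.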
2.2. Fine-Grained Gesture Design and Data Acquisition
This paper defines a fine-grained gesture set consisting of eight valid command gestures [
30] and two return-to-neutral gesture types. The return-to-neutral gestures are introduced to reduce interference and false positives in continuous gesture recognition. For classification, the two return-to-neutral gesture types are merged into a single return-to-neutral category, resulting in nine gesture categories in total (eight command categories plus one return-to-neutral category).
To ensure practicality and feasibility, the eight command gestures are designed according to two principles: (a) they follow natural daily life gesture habits; and (b) they are distinguishable in radar signals in terms of velocity, range, and angle characteristics. The eight command gestures are illustrated in
Figure 2.
To avoid ad hoc gesture definitions, the eight command gestures are aligned with commonly used radar/HCI gesture primitives reported in prior studies and public datasets [
31,
32]. Specifically, “pat left” and “pat right” correspond to lateral swipe/pat-like motions; “move forward” and “move backward” correspond to radial push/pull motions along the range direction; “palm open” and “palm pinch” represent hand-aperture changes (open vs. pinch) that are widely used as interaction commands; “rub fingers” captures fine finger micro-motions often used in subtle gesture sensing; and “hook hand” represents a distinct hand-shape/posture change that complements the above motion-based primitives. Since there is currently no universally standardized gesture vocabulary across radar platforms and application scenarios, this paper provides explicit gesture definitions and illustrations (
Figure 2 and
Figure 3) to support reproducibility and facilitate cross-paper comparisons.
When multiple consecutive “pat left” or “pat right” gestures are performed, a return-to-neutral gesture naturally occurs between successive actions, during which the palm is reoriented to face vertically toward the radar. To prevent such return-to-neutral gestures from being misclassified as command gestures, two return-to-neutral gesture types are explicitly defined: one following a left pat (return-to-neutral gesture 1) and one following a right pat (return-to-neutral gesture 2), as illustrated in
Figure 3. As these return-to-neutral gestures do not correspond to functional commands, they are labeled as the return-to-neutral category during gesture classification rather than as command gestures.
Gesture data were collected from nine volunteers. The radar was placed upright on the ground, and participants were seated in front of it with their hands positioned directly in front of the radar. The data acquisition scenario is shown in
Figure 4.
2.3. Feature Extraction
2.3.1. Static Clutter Suppression
During the acquisition of fine-grained gesture signals, FMCW radar measurements are affected by stationary clutter. Moving targets induce Doppler frequency shifts, whereas the spectral spread of stationary clutter around zero Doppler is typically limited. The moving target indicator (MTI) exploits this difference to separate moving targets from stationary clutter.
In this paper, an MTI filter is employed to suppress static clutter in fine-grained gesture signals. Compared with a double-delay-line canceller, a single-delay-line canceller provides a wider effective passband but offers weaker suppression of stationary clutter. Because the fine-grained gesture data in this paper are acquired in a relatively controlled environment, and the single-delay-line canceller is structurally simple and easy to implement, it is adopted for clutter suppression. The filter primarily attenuates near-zero-Doppler stationary components while preserving the dominant dynamic patterns of the gestures. Using two consecutive gestures (“move forward” and “move backward”) as examples, the one-dimensional range profiles before and after static clutter suppression are shown in
Figure 5.
Compared with the post-suppression result, the range profiles before clutter suppression contain more redundant components, including returns from stationary (non-moving) objects. After suppression, the energy becomes more concentrated in the time frames associated with gesture activity, and the transition between the two consecutive gestures becomes more distinct. This property facilitates subsequent segmentation in continuous gesture recognition.
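The clutter-suppression step can be sketched as follows. This is a minimal illustration of a generic single-delay-line canceller (each output is the current range profile minus the previous one along slow time), not the exact implementation used in this paper. The function name and the toy data are assumptions made for this example.

```python
import cmath


def single_delay_canceller(range_profiles):
    """Subtract the previous range profile from the current one along slow time.

    range_profiles[n][r] is the complex value of range bin r in frame (or chirp) n.
    Returns len(range_profiles) - 1 filtered profiles.
    """
    return [
        [cur - prev for cur, prev in zip(range_profiles[n], range_profiles[n - 1])]
        for n in range(1, len(range_profiles))
    ]


# Toy data: bin 0 holds static clutter, bin 1 holds a target whose phase
# advances by 0.5 rad per frame (a stand-in for Doppler-induced rotation).
profiles = [[5.0 + 0.0j, cmath.exp(0.5j * n)] for n in range(6)]
filtered = single_delay_canceller(profiles)

clutter_residual = max(abs(row[0]) for row in filtered)  # static bin after filtering
target_level = min(abs(row[1]) for row in filtered)      # moving bin after filtering
print(clutter_residual, target_level)
```

In this toy case the static bin is cancelled exactly, while the moving bin keeps a nonzero magnitude that grows with the per-frame phase change. This also shows why very slow motions are attenuated by the same filter.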
2.3.2. Range Feature Extraction
Since the configured range resolution of the radar is 5.9 cm, while the motion displacement of the fine-grained gestures designed in this paper is approximately 3.0–10.0 cm, conventional FFT-based range estimation does not provide sufficient resolution to capture subtle range variations. Therefore, a super-resolution method is required for range estimation.
In this paper, the multiple signal classification (MUSIC) algorithm [
33] is applied to extract range features for fine-grained gestures. MUSIC exploits the orthogonality between the signal and noise subspaces to construct a pseudo-spectrum, and range estimates are obtained by searching for peaks in this spectrum. A detailed procedure is provided in [
30]. The range–time spectrograms of the ten fine-grained gesture types are shown in
Figure 6.
2.3.3. Doppler Feature Extraction
Compared with the conventional Fourier transform, the short-time Fourier transform (STFT) is better suited to non-stationary signals. By partitioning a non-stationary signal into short-time segments, each segment can be approximated as locally stationary. The Fourier transform is then applied to each segment, enabling time-localized frequency analysis.
Wavelet-based transforms, such as the wavelet transform (WT) and continuous wavelet transform (CWT), provide multi-resolution time-frequency analysis and can be advantageous for strongly non-stationary or multi-scale transient signals. Nevertheless, this paper adopts STFT as a deliberate trade-off for the following reasons: (a) in FMCW radar processing, Doppler is naturally estimated via an FFT along the slow-time (chirp-to-chirp) dimension, and STFT can be interpreted as a windowed FFT that yields a physically interpretable Doppler/velocity axis; (b) STFT produces a fixed-size spectrogram with controllable time/frequency binning, which integrates well with CNN-based models and can stabilize training [
34]; and (c) WT/CWT requires selecting a mother wavelet and scale parameters, and the resulting representation may be sensitive to these hyperparameters, introducing additional tuning overhead. Therefore, STFT is adopted as a simple and robust representation in this study, while a systematic comparison with WT/CWT under the same protocol is left for future work.
This paper uses STFT to extract Doppler features for fine-grained gestures. Here, each frame corresponds to a sliding window over multiple chirps in the slow-time sequence. First, for each frame, the signal is transformed along the sampling-point (fast-time) dimension to obtain range information. The data are then aggregated along the range dimension, and STFT is applied along the chirp (slow-time) dimension to estimate Doppler information for the frame. Finally, the resulting Doppler representations are concatenated frame by frame to capture the temporal evolution of Doppler variations. The overall Doppler feature extraction process is illustrated in
Figure 7.
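The slow-time STFT step can be sketched as below. This is a simplified illustration under stated assumptions: the input is a single complex slow-time sequence (i.e., the data are assumed to be already aggregated along range), the window is a Hann window, and the window length and hop size are hypothetical. A naive DFT is used to keep the example dependency-free; a practical implementation would use an FFT.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft_magnitude(x):
    """Magnitude of the naive DFT (O(n^2), acceptable for short windows)."""
    n = len(x)
    return [abs(sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)))
            for k in range(n)]


def center_zero_doppler(spectrum):
    """Reorder bins so that zero Doppler sits at index len(spectrum) // 2."""
    split = (len(spectrum) + 1) // 2
    return spectrum[split:] + spectrum[:split]


def doppler_time_map(slow_time, win_len=16, hop=8):
    """Return one centred Doppler magnitude spectrum per sliding window."""
    window = hann(win_len)
    columns = []
    for start in range(0, len(slow_time) - win_len + 1, hop):
        segment = [slow_time[start + i] * window[i] for i in range(win_len)]
        columns.append(center_zero_doppler(dft_magnitude(segment)))
    return columns


# Toy slow-time signal: constant Doppler of +3 bins (out of 16) over 64 chirps.
toy_signal = [cmath.exp(2j * math.pi * 3 * n / 16) for n in range(64)]
spectrogram = doppler_time_map(toy_signal)
peak_bins = [max(range(len(col)), key=col.__getitem__) for col in spectrogram]
print(len(spectrogram), peak_bins)
```

Each element of `spectrogram` is one time column of the Doppler–time map; stacking the columns frame by frame gives the temporal evolution described above.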
The Doppler–time spectrograms of the ten fine-grained gesture types are shown in
Figure 8. The horizontal axis denotes the frame index, the vertical axis represents Doppler, and the color indicates the intensity of motion-related energy. Because the time–frequency distributions differ across gesture types, the Doppler–time spectrum provides an effective basis for distinguishing fine-grained gestures.
2.3.4. Angle Feature Extraction
The gestures “pat left” and “pat right” exhibit similar velocity and range variations with respect to the radar, leading to similar Doppler–time and range–time characteristics. The key difference between these two gestures lies in the lateral direction of motion, making angle features important for discriminating between them.
The multiple signal classification (MUSIC) algorithm is a spectral estimation method that can be used to estimate both range and angle information. Accordingly, MUSIC is employed to extract angle features for fine-grained gestures. The angle–time spectrograms of the ten fine-grained gesture types are shown in
Figure 9.
Compared with its range and Doppler resolution, the radar's angular resolution is coarse under the antenna configuration used in this paper. Even at boresight, where a gesture is performed directly in front of the radar, the angular resolution is only approximately 10°. As shown in
Figure 9, although the radar cannot accurately estimate the instantaneous angle for each fine-grained gesture, it can capture the overall trend of angle variation.
2.4. Normalization of Features
Different individuals exhibit varying motion extents along the range dimension when performing the same gesture. To mitigate this variability, the range extent in terms of range–bin indices on the extracted range–time map is normalized. Moving targets in the range–time map are detected using the Ordered-Statistics Constant False Alarm Rate (OS-CFAR) algorithm. For each gesture sample (instance), this paper determines the minimum and maximum range bins of the detected target region from that sample only, which are then used for min–max normalization. Specifically, this paper linearly maps the original range–bin indices to a predefined interval $[R_{\min}, R_{\max}]$ (min–max scaling). Let $r_{\min}^{(i)}$ and $r_{\max}^{(i)}$ denote the minimum and maximum range positions (range–bin indices) of the $i$-th gesture sample (instance), respectively, obtained from OS-CFAR detections on that sample’s range–time map. The range–bin indices of all fine-grained gestures are normalized to the range of $[R_{\min}, R_{\max}]$, where $R_{\min}$ and $R_{\max}$ are predefined constants that specify the target interval after normalization (e.g., mapping to a fixed range–bin interval used by the network input), and they do not depend on any dataset-level statistics. Given the original range–bin index $r$, its normalized value $\tilde{r}$ is computed as follows:
$$\tilde{r} = R_{\min} + \frac{\left(r - r_{\min}^{(i)}\right)\left(R_{\max} - R_{\min}\right)}{r_{\max}^{(i)} - r_{\min}^{(i)}}. \tag{7}$$
The overall process of gesture range–bin normalization is illustrated in
Figure 10.
Similarly to range normalization, Doppler features are also normalized instance-wise to mitigate user variability in motion velocity. Let $d_{\min}^{(i)}$ and $d_{\max}^{(i)}$ denote the minimum and maximum Doppler–bin indices of the detected target region for the $i$-th gesture sample (instance), respectively, obtained from OS-CFAR detections on that sample’s Doppler–time map. The Doppler–bin indices of all fine-grained gestures are normalized to the range of $[D_{\min}, D_{\max}]$, where $D_{\min}$ and $D_{\max}$ are predefined constants specifying the target interval after Doppler–bin normalization, and they are fixed across all experiments. Similarly, this paper linearly maps the original Doppler–bin indices to a predefined interval $[D_{\min}, D_{\max}]$ (min–max scaling). Given the original Doppler–bin index $d$, its normalized value $\tilde{d}$ is computed as follows:
$$\tilde{d} = D_{\min} + \frac{\left(d - d_{\min}^{(i)}\right)\left(D_{\max} - D_{\min}\right)}{d_{\max}^{(i)} - d_{\min}^{(i)}}. \tag{8}$$
The overall process of gesture Doppler–bin normalization is illustrated in
Figure 11.
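The instance-wise min–max mapping used for both the range and Doppler dimensions can be sketched as follows. The function names, the toy detection map, and the target interval of 10 to 40 are hypothetical. The handling of a degenerate span (minimum equal to maximum) is an assumption of this sketch and is not specified in this paper.

```python
def active_bin_bounds(detection_map):
    """Return (min_bin, max_bin) over rows that contain at least one detection.

    detection_map[b][n] is truthy when bin b is detected in frame n.
    Only this sample's own detections are used (no dataset-level statistics).
    """
    active = [b for b, row in enumerate(detection_map) if any(row)]
    if not active:
        raise ValueError("no detections in this sample")
    return active[0], active[-1]


def min_max_map(index, src_min, src_max, dst_min, dst_max):
    """Linearly map index from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max == src_min:  # degenerate span: place at the interval centre (assumption)
        return 0.5 * (dst_min + dst_max)
    scale = (dst_max - dst_min) / (src_max - src_min)
    return dst_min + (index - src_min) * scale


# Toy detection map with 8 bins and 4 frames; bins 3..6 contain the gesture.
detections = (
    [[0, 0, 0, 0]] * 3
    + [[0, 1, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 0]]
    + [[0, 0, 0, 0]]
)
lo, hi = active_bin_bounds(detections)
normalized = [min_max_map(b, lo, hi, 10, 40) for b in range(lo, hi + 1)]
print(lo, hi, normalized)
```

Because the bounds come from the sample itself, the same call can be applied unchanged to a held-out subject without touching any training-set statistics.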
All normalization parameters in Equations (7) and (8) are computed per gesture sample (instance). Specifically, the per-sample minimum and maximum range–bin indices and Doppler–bin indices are derived from the OS-CFAR detection results on each sample’s own range–time and Doppler–time maps, respectively. Therefore, no statistics (e.g., mean/std, min/max scaling factors, or any other dataset-level normalization statistics) are computed using the whole dataset or using any held-out test subject data. This design ensures that the reported subject-independent evaluation does not involve test leakage and avoids biased performance estimation. In particular, under the subject-independent protocol, the held-out test subject is never used to compute any normalization statistics.
2.5. Continuous Gesture Splitting
Based on the Doppler features of fine-grained gestures, this paper employs a boundary detection method for gesture segmentation. During feature extraction, the palm and fingers may yield multiple Doppler–bin activations within a frame; therefore, OS-CFAR is adopted to detect motion-related activations for subsequent segmentation [
35,
36]. The core idea is to sort the sample data in the reference window in ascending order and select the $k$-th order statistic as the background estimate.
OS-CFAR consists of the cell under test (CUT), guard cells, and reference cells. The $N$ reference cells are distributed on both sides of the guard region (i.e., $N/2$ on each side). To avoid target-signal leakage into the reference window, samples in the guard region are excluded from background estimation.
When processing the input data, all values in the reference cells are sorted in ascending order according to their power levels, i.e., $x_{(1)} \le x_{(2)} \le \cdots \le x_{(N)}$. The $k$-th order statistic $x_{(k)}$ is used as an estimate of the background clutter power, where $1 \le k \le N$. The detection threshold is set to $T = \alpha x_{(k)}$, and a target is declared if $x_{\mathrm{CUT}} > T$. Let $\mu$ denote the mean (scale parameter) of the exponential clutter/noise power; then the probability density function of $x_{(k)}$ under a uniform clutter background is given by
$$f_{(k)}(x) = \frac{k}{\mu}\binom{N}{k}\left(e^{-x/\mu}\right)^{N-k+1}\left(1 - e^{-x/\mu}\right)^{k-1}, \quad x \ge 0.$$
The false alarm probability $P_{fa}$ of the OS-CFAR detector under a uniform clutter background is given by
$$P_{fa} = k\binom{N}{k}\frac{\Gamma(N-k+1+\alpha)\,\Gamma(k)}{\Gamma(N+1+\alpha)},$$
where $\alpha$ represents the threshold coefficient. For completeness, the probability density function of the Gamma distribution is
$$f(x;\kappa,\theta) = \frac{x^{\kappa-1}e^{-x/\theta}}{\Gamma(\kappa)\,\theta^{\kappa}}, \quad x > 0,$$
where $\kappa$ and $\theta$ are the shape and scale parameters of the Gamma distribution, respectively, and $\Gamma(\cdot)$ denotes the Gamma function. The exponential distribution is a special case of the Gamma distribution with $\kappa = 1$ and $\theta = \mu$.
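The OS-CFAR procedure described above can be sketched in one dimension as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the parameter names (`n_ref_side`, `n_guard_side`, `k`, `alpha`) and their values are hypothetical, and cells near the edges that cannot collect at least `k` reference samples are simply skipped. The false-alarm function uses the product form of the standard closed-form OS-CFAR result for exponentially distributed noise power.

```python
def os_cfar_pfa(alpha, n_ref, k):
    """False-alarm probability of OS-CFAR for exponential noise power.

    Product form: prod_{i=0}^{k-1} (N - i) / (N - i + alpha).
    """
    p = 1.0
    for i in range(k):
        p *= (n_ref - i) / (n_ref - i + alpha)
    return p


def os_cfar_1d(power, n_ref_side, n_guard_side, k, alpha):
    """Return a list of booleans marking cells declared as targets."""
    hits = [False] * len(power)
    for cut in range(len(power)):
        # Reference cells on both sides, excluding the guard cells and the CUT.
        left = power[max(0, cut - n_guard_side - n_ref_side):max(0, cut - n_guard_side)]
        right = power[cut + n_guard_side + 1:cut + n_guard_side + 1 + n_ref_side]
        reference = sorted(left + right)
        if len(reference) < k:
            continue  # not enough reference cells near the edges (assumption)
        threshold = alpha * reference[k - 1]  # k-th order statistic (1-based)
        hits[cut] = power[cut] > threshold
    return hits


# Toy power profile: flat background with one strong cell at index 20.
power = [1.0] * 40
power[20] = 50.0
hits = os_cfar_1d(power, n_ref_side=8, n_guard_side=2, k=12, alpha=5.0)
print([i for i, h in enumerate(hits) if h], os_cfar_pfa(5.0, 16, 12))
```

Because the background estimate is an order statistic rather than a mean, a single strong neighbour inside the reference window (such as the cell at index 20 when testing nearby cells) does not raise the threshold, which is the usual motivation for OS-CFAR when multiple activations are close together.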
As shown in
Figure 12a, this fine-grained gesture sequence contains three gesture actions. OS-CFAR is applied to the Doppler–time representation to detect motion-related Doppler bins in each frame, which are then used for gesture segmentation. The detection result is shown in
Figure 12b.
Let the binary matrix obtained after OS-CFAR detection be $\mathbf{B} \in \{0,1\}^{M \times L}$, where rows correspond to Doppler bins and columns correspond to frames. Here, $M$ denotes the number of Doppler bins and $L$ denotes the number of frames. An entry $B(m,l) = 1$ indicates that a target is detected at Doppler bin $m$ in frame $l$, whereas $B(m,l) = 0$ indicates no detection at that bin and frame.
The next step is to determine whether a moving target exists in a given frame. If no targets are detected in any Doppler bin of a frame (i.e., $\sum_{m=1}^{M} B(m,l) = 0$), then the frame is considered to contain no moving target. In addition, if detections occur only at the zero-Doppler bin (or within a small neighborhood around zero Doppler), the frame is also treated as containing no moving target. The procedure for determining target presence across frames is illustrated in Figure 12c, where blue markers on the horizontal axis indicate frames with detected moving targets.
Next, the boundaries of each fine-grained gesture are detected. A frame is considered a gesture boundary if two conditions are satisfied: (a) no moving target is present in the frame; and (b) at least one moving-target frame exists between this frame and the preceding boundary. Condition (a) ensures that boundaries occur only in frames without moving targets, while condition (b) ensures that a gesture is present between two adjacent boundaries.
To suppress false alarms caused by OS-CFAR detection, a sliding-window strategy is employed for boundary detection. Let $W$ denote the window length. If all frames within the window contain no moving targets, the last frame in the window is confirmed as an idle (non-target) frame. If a gesture segment exists between the current frame and the previous boundary, the current frame is identified as a boundary. The overall process of fine-grained gesture boundary detection is illustrated in
Figure 13.
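The segmentation rules above can be sketched as follows. This is a minimal illustration, assuming the detections are stored as a list of rows (one per Doppler bin) of 0/1 values per frame. The function names, the `zero_halfwidth` parameter that defines the ignored neighborhood around zero Doppler, and the toy inputs are hypothetical.

```python
def frame_has_motion(detections, frame, zero_bin, zero_halfwidth=0):
    """True if any detection in this frame lies outside the zero-Doppler neighborhood."""
    return any(
        detections[d][frame]
        for d in range(len(detections))
        if abs(d - zero_bin) > zero_halfwidth
    )


def find_boundaries(motion_flags, window):
    """Return frame indices that close a gesture segment.

    A frame becomes a boundary when it ends a run of `window` consecutive idle
    frames and at least one moving-target frame has occurred since the previous
    boundary.
    """
    boundaries = []
    seen_motion = False
    idle_run = 0
    for n, moving in enumerate(motion_flags):
        if moving:
            seen_motion = True
            idle_run = 0
        else:
            idle_run += 1
            if idle_run >= window and seen_motion:
                boundaries.append(n)
                seen_motion = False
    return boundaries


# Toy detection matrix: 5 Doppler bins (zero Doppler at bin 2) and 4 frames.
toy_detections = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],  # zero-Doppler bin
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
toy_flags = [frame_has_motion(toy_detections, f, zero_bin=2) for f in range(4)]

# Longer toy sequence: the single idle frame at index 11 is a false gap.
flags = [bool(v) for v in [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]]
boundaries = find_boundaries(flags, window=3)
print(toy_flags, boundaries)
```

With a window of three frames, the isolated idle frame inside the second gesture does not trigger a boundary, which is the intended effect of the sliding window.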
3. Variable-Channel DRSN
To examine the contribution of different feature types to gesture recognition and to improve recognition accuracy, this paper adopts an improved Deep Residual Shrinkage Network (DRSN). DRSN incorporates a shrinkage mechanism that learns adaptive thresholds to suppress noise, thereby enhancing feature learning under noisy conditions. Three input configurations are considered—single-channel, dual-channel, and triple-channel—so that different feature combinations can be selected according to application requirements.
3.1. Basic Components
As an improved deep learning method, DRSN shares the same fundamental building blocks as conventional convolutional neural networks (ConvNets), including convolutional layers, the rectified linear unit (ReLU) activation function, batch normalization (BN), global average pooling (GAP), and the cross-entropy loss function. These components are briefly introduced below.
3.1.1. Convolutional Layers
Convolutional layers are the core building blocks of convolutional neural networks (CNNs). Compared with fully connected layers, convolution uses local receptive fields and weight sharing, which substantially reduces the number of trainable parameters. This property can mitigate overfitting and improve generalization, especially when training data are limited.
Given an input feature map, the convolutional output at the $j$-th output channel (including a bias term) can be written as
$$y_j = \sum_{i \in M_j} x_i * k_{ij} + b_j,$$
where $x_i$ denotes the $i$-th input channel, $y_j$ denotes the $j$-th output channel, $*$ denotes the convolution operation, $k_{ij}$ is the convolution kernel that connects input channel $i$ to output channel $j$, $b_j$ is the bias term, and $M_j$ is the set of input channels used to compute the $j$-th output channel [37]. By applying multiple kernels, a convolutional layer produces an output feature map with multiple channels. Stacking convolutional layers further enables hierarchical feature extraction.
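The per-channel summation described above can be illustrated with a small one-dimensional example. This sketch assumes "valid" padding and stride 1, and, as in most deep learning frameworks, computes a cross-correlation (the kernel is not flipped). The function name and the toy values are hypothetical.

```python
def conv1d_output_channel(inputs, kernels, bias):
    """Compute one output channel from several input channels.

    inputs[i] is the 1-D signal of input channel i; kernels[i] is the kernel
    connecting input channel i to this output channel. Uses 'valid' padding,
    stride 1, and no kernel flipping (cross-correlation).
    """
    k_len = len(kernels[0])
    out_len = len(inputs[0]) - k_len + 1
    out = [bias] * out_len
    for x, k in zip(inputs, kernels):
        for t in range(out_len):
            out[t] += sum(x[t + m] * k[m] for m in range(k_len))
    return out


# Two input channels, one output channel.
channels = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [[1.0, 0.0], [0.0, 1.0]]
output = conv1d_output_channel(channels, kernels, bias=0.5)
print(output)
```

The same two kernels are reused at every position, which is the weight sharing that keeps the parameter count small compared with a fully connected layer.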
3.1.2. Rectified Linear Unit (ReLU) Activation Function
The activation function is an essential component of neural networks, enabling nonlinear transformations. Among commonly used activation functions, the rectified linear unit (ReLU) alleviates the vanishing-gradient issue by providing non-saturating gradients for positive inputs. In addition, ReLU yields sparse activations by setting negative responses to zero, which can facilitate optimization in practice. The ReLU function is defined as
$$y = \max(x, 0),$$
where $x$ and $y$ denote the input and output of the ReLU activation function, respectively.
3.1.3. Batch Normalization (BN)
Batch normalization (BN) is a widely used normalization technique in deep learning that standardizes intermediate features using mini-batch statistics and introduces learnable scale and shift parameters [
38]. By reducing the sensitivity of layer inputs to parameter updates and improving numerical conditioning, BN often stabilizes and accelerates training.
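Given a mini-batch $\{x_n\}_{n=1}^{N_b}$, BN computes the mini-batch mean and variance as
$$\mu = \frac{1}{N_b}\sum_{n=1}^{N_b} x_n, \qquad \sigma^2 = \frac{1}{N_b}\sum_{n=1}^{N_b}\left(x_n - \mu\right)^2.$$
The normalized feature is then obtained by
$$\hat{x}_n = \frac{x_n - \mu}{\sqrt{\sigma^2 + \epsilon}},$$
followed by an affine transformation
$$y_n = \gamma \hat{x}_n + \beta,$$
where $x_n$ and $y_n$ denote the input and output features of the $n$-th sample in the mini-batch, respectively; $\gamma$ and $\beta$ are learnable scale and shift parameters; and $\epsilon$ is a small constant added for numerical stability.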
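The training-time BN computation can be sketched for a single scalar feature as follows. This is a minimal illustration; the function name, default values, and toy batch are assumptions, and the running statistics used at inference time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar features, then scale and shift.

    Uses the biased (1/N) variance of the mini-batch, as in training-mode BN.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=1.0)
print(normalized, shifted)
```

With the default scale and shift, the output has approximately zero mean and unit variance; setting `gamma` and `beta` moves the output to a different scale and offset, which is what the learnable parameters allow the network to do.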
3.1.4. Global Average Pooling (GAP)
Global average pooling (GAP) is typically applied before the final classification layer to aggregate each feature map by taking the spatial average of its activations [
39]. Compared with a fully connected layer applied to flattened feature maps, GAP substantially reduces the number of trainable parameters in the classifier, thereby mitigating overfitting. Moreover, by summarizing spatial responses, GAP encourages the network to focus on the overall presence of discriminative patterns rather than their exact locations, which improves robustness to small spatial shifts.
3.1.5. Cross-Entropy Loss Function
The cross-entropy loss is widely used as the training objective for multi-class classification tasks [
38]. When combined with the softmax function, it quantifies the discrepancy between the predicted class distribution and the ground-truth label distribution and typically provides stable gradients for optimizing deep neural networks.
Given the logit $x_j$ for class $j$, the softmax function converts logits into a normalized probability distribution whose components sum to one:
$$y_j = \frac{e^{x_j}}{\sum_{i=1}^{N_c} e^{x_i}},$$
where $y_j$ denotes the predicted probability that an observation belongs to class $j$, and $N_c$ is the total number of classes.
For an observation with target label $\mathbf{t}$, the cross-entropy loss is defined as
$$E = -\sum_{j=1}^{N_c} t_j \log y_j,$$
where $t_j$ is the target probability for class $j$ (e.g., a one-hot vector in standard supervised classification). The network parameters are then optimized by minimizing $E$ using gradient-based methods.
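The softmax and cross-entropy computations can be sketched as follows. This is a minimal illustration; the function names are assumptions, and the subtraction of the maximum logit is a common numerical-stability step that does not change the result.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Nine classes, as in the gesture task; all-equal logits give a uniform prediction.
n_classes = 9
uniform_probs = softmax([0.0] * n_classes)
one_hot = [1.0 if j == 2 else 0.0 for j in range(n_classes)]
uniform_loss = cross_entropy(uniform_probs, one_hot)

confident_probs = softmax([0.0, 0.0, 8.0] + [0.0] * 6)
confident_loss = cross_entropy(confident_probs, one_hot)
print(uniform_loss, confident_loss)
```

For a uniform prediction over nine classes the loss equals the natural logarithm of nine, which is a useful sanity check at the start of training; a confident correct prediction drives the loss toward zero.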
3.2. Classical ResNets
Vanishing and exploding gradients arise from repeated multiplication of Jacobians during backpropagation. As network depth increases, these effects can make optimization difficult and may lead to unstable training or ineffective learning in early layers. Moreover, deeper networks can exhibit a degradation phenomenon in which adding more layers increases training difficulty and does not necessarily improve accuracy, and may even reduce it, despite the model having greater representational capacity.
Deep residual networks (ResNets) were proposed to address these optimization challenges in deep architectures. The key idea is to learn a residual function with respect to the identity mapping, which is implemented through shortcut connections that bypass one or more layers. These shortcuts provide a direct path for both forward activations and backward gradients, thereby facilitating optimization.
The fundamental module of a ResNet is the residual building unit (RBU). In its basic form, an RBU consists of two convolutional layers with batch normalization (BN) and rectified linear unit (ReLU) activations, together with an identity shortcut. Let $F(x)$ denote the residual mapping learned by the stacked layers. The output of the unit is $y = F(x) + x$, as illustrated in Figure 14. If the learned residual is small (i.e., $F(x) \approx 0$), the unit can approximate the identity mapping by driving $F(x)$ toward zero, so that $y \approx x$. This property helps mitigate degradation and enables effective training as depth increases.
The overall architecture of ResNets is shown in
Figure 15. Here, “Conv” denotes a convolutional layer; “/2” indicates a stride of 2, which halves the spatial resolution of the feature map; $K$ denotes the number of convolution kernels (filters) in a convolutional layer; $C$ denotes the number of channels; and “FC” denotes the final fully connected classification layer.
3.3. Soft Thresholding
Soft thresholding is a fundamental operation in classical signal denoising (e.g., transform-domain shrinkage), where small-magnitude coefficients are treated as noise and are suppressed while informative components are preserved [40, 41]. In typical denoising pipelines, the signal is first mapped to a domain in which noise tends to concentrate near zero, such as wavelet or frequency coefficients, and soft thresholding is then applied to drive near-zero responses to exactly zero.
In deep learning, soft thresholding can be integrated as a differentiable shrinkage nonlinearity, enabling the network to learn noise-suppressing representations without manually designing filters. By combining learned feature extraction with shrinkage, the network can attenuate noise-related components and retain discriminative responses.
The soft-thresholding function is defined as
$$y = \begin{cases} x - \tau, & x > \tau \\ 0, & -\tau \le x \le \tau \\ x + \tau, & x < -\tau, \end{cases}$$
where $x$ denotes the input feature, $y$ the output feature, and $\tau$ the threshold. This operation sets features with small magnitude (i.e., within $[-\tau, \tau]$) to zero and shrinks larger-magnitude features toward zero by $\tau$ while preserving their sign. Compared with the ReLU activation, which discards all negative responses, soft thresholding suppresses only small-magnitude responses of both signs, thereby retaining potentially informative negative features.
The soft-thresholding process is illustrated in Figure 16a. Its derivative with respect to the input is piecewise constant,
$$\frac{\partial y}{\partial x} = \begin{cases} 1, & |x| > \tau \\ 0, & |x| < \tau, \end{cases}$$
and is not defined at $x = \pm\tau$; in practice, a subgradient is used. The bounded slope (0 or 1) provides numerically stable gradient propagation outside the dead zone, while the zero-gradient interval enforces sparsity by removing near-zero (noise-like) responses, as shown in Figure 16b.
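The shrinkage operation and its subgradient can be sketched in a few lines of NumPy. The function names and the input values are illustrative assumptions; the subgradient at the two threshold points is set to 0 here, which is one common choice.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values with |x| <= tau become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def soft_threshold_grad(x, tau):
    """Subgradient with respect to x: 1 outside the dead zone, 0 inside (0 chosen at |x| == tau)."""
    return (np.abs(x) > tau).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
tau = 1.0

y = soft_threshold(x, tau)
grad = soft_threshold_grad(x, tau)
relu_out = np.maximum(x, 0.0)  # ReLU for comparison: every negative input is discarded

print("soft threshold:", y)
print("subgradient:   ", grad)
print("ReLU:          ", relu_out)
```

In this example the input −2.0 survives soft thresholding as −1.0 but is zeroed by ReLU, whereas the small positive input 0.3 is kept by ReLU and removed by soft thresholding.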
3.4. Deep Residual Shrinkage Network
Compared with classical ResNets, the deep residual shrinkage network (DRSN) integrates soft thresholding into the residual building unit to form a residual shrinkage building unit (RSBU). The RSBU combines residual learning, a threshold estimation module, and soft thresholding, enabling the network to suppress noise-like responses while preserving informative components.
For the RSBU with channel-shared thresholds (RSBU-CS), a lightweight module is introduced to estimate a single shared threshold for all channels, as shown in Figure 17a. Specifically, the absolute feature map $|x|$ is first summarized to a compact descriptor, which is then fed into a two-layer fully connected (FC) network. A sigmoid function is applied to the FC output to produce a scaling parameter in the range (0, 1) [42], expressed as follows:
$$\alpha = \frac{1}{1 + e^{-z}},$$
where $z$ denotes the output of the two-layer FC network and $\alpha$ is the corresponding scaling parameter.
The soft-thresholding value must be positive and should not be excessively large; otherwise, when the threshold exceeds the maximum magnitude of the feature map, the soft-thresholding output can collapse to all zeros. To keep the threshold within a reasonable range, the scaling parameter $\alpha$ is multiplied by the mean magnitude of the feature map to obtain the shared threshold:
$$\tau = \alpha \cdot \underset{i,j,c}{\operatorname{average}} \, |x_{i,j,c}|,$$
where $\tau$ is the threshold, and $i$, $j$, and $c$ denote the width index, height index, and channel index of the feature map $x$, respectively.
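The channel-shared threshold estimation described above can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: the FC weights are random stand-ins for learned parameters, a ReLU between the two FC layers is assumed, and any normalization inside the module is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_shared_threshold(x, w1, b1, w2, b2):
    """Estimate one threshold for a feature map x of shape (W, H, C).

    The FC weights (w1, b1, w2, b2) are hypothetical stand-ins for the
    learned threshold-estimation module.
    """
    descriptor = np.abs(x).mean(axis=(0, 1))          # compact descriptor, shape (C,)
    hidden = np.maximum(descriptor @ w1 + b1, 0.0)    # first FC layer + assumed ReLU
    z = hidden @ w2 + b2                              # second FC layer, scalar output
    alpha = sigmoid(z)                                # scaling parameter in (0, 1)
    tau = alpha * np.abs(x).mean()                    # mean magnitude over i, j, c
    return float(alpha), float(tau)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 4))                      # hypothetical W x H x C feature map
w1, b1 = rng.normal(size=(4, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0

alpha, tau = channel_shared_threshold(x, w1, b1, w2, b2)
y = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)     # soft thresholding with the shared tau
print(f"alpha={alpha:.3f}, tau={tau:.3f}, zeroed fraction={np.mean(y == 0):.2f}")
```

Because the scaling parameter lies in (0, 1), the threshold is always positive and below the mean magnitude of the feature map, so the output cannot collapse to all zeros.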
If separate thresholds are desired for different channels, the RSBU with channel-wise thresholds (RSBU-CW) can be adopted, as shown in Figure 17b. In RSBU-CW, thresholds are estimated and applied per channel, enabling more flexible shrinkage when channel characteristics differ significantly.
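A per-channel variant can be sketched by analogy with the channel-shared case. The formulation below (one scaling parameter per channel, multiplied by that channel's mean magnitude) is an assumption made for illustration, and the scaling parameters are supplied directly rather than produced by a learned module.

```python
import numpy as np

def channel_wise_thresholds(x, alpha):
    """Return one threshold per channel for x of shape (W, H, C).

    alpha has shape (C,) with entries in (0, 1); in RSBU-CW it would come
    from the learned estimation module, here it is supplied directly.
    """
    return alpha * np.abs(x).mean(axis=(0, 1))  # shape (C,)

rng = np.random.default_rng(1)
# Hypothetical feature map whose channels have very different magnitudes.
x = rng.normal(size=(16, 16, 3)) * np.array([0.1, 1.0, 10.0])
alpha = np.array([0.5, 0.5, 0.5])  # assumed scaling parameters

tau = channel_wise_thresholds(x, alpha)
y = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)  # tau broadcasts over the channel axis
print(np.round(tau, 3))
```

With a single shared threshold, the low-magnitude channel in this example would be zeroed almost entirely; per-channel thresholds scale the shrinkage to each channel's own magnitude.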
DRSN-CS follows a ResNet-style backbone, as illustrated in Figure 18a, while replacing the standard residual building unit (RBU) with RSBU-CS. Stacking multiple RSBU-CS units progressively suppresses noise-related components and facilitates discriminative feature learning. Similarly, DRSN-CW uses RSBU-CW as its building block (Figure 18b), applying channel-adaptive shrinkage through repeated nonlinear transformations.
To evaluate the contribution of different input features, three variable-channel configurations are constructed in this paper: single-channel, dual-channel, and triple-channel, corresponding to the use of one, two, and three feature types as network inputs, respectively. These configurations provide flexible feature-combination choices under different deployment constraints. The overall variable-channel DRSN architecture is illustrated in Figure 19.
4. Experiments
4.1. Millimeter-Wave Radar Parameter Settings
Radar parameter design is critical for reliable target detection and subsequent gesture recognition. When configuring the FMCW radar, multiple factors must be considered jointly, including velocity (Doppler) resolution, range resolution, angular resolution, frame rate, and the onboard processing and data-throughput constraints. In practice, these requirements often impose trade-offs. For example, increasing bandwidth improves range resolution, whereas improving velocity resolution typically requires a longer coherent processing interval (i.e., more chirps), which can reduce the achievable frame rate and increase computational load. Therefore, the parameter configuration is selected to balance temporal responsiveness (frame rate) with sufficient range and velocity resolution for fine-grained gesture characterization under the available hardware constraints. The radar configuration used in this paper is summarized in
Table 1. The experiments were carried out with a TI AWR1843 radar (Texas Instruments, Dallas, TX, USA).
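The trade-off described above can be made concrete with the standard FMCW relations: range resolution $c/(2B)$ for sweep bandwidth $B$, and velocity resolution $\lambda/(2 N T_c)$ for $N$ chirps of period $T_c$ at wavelength $\lambda$. The sketch below evaluates them for illustrative parameter values, which are assumptions and not the settings listed in Table 1.

```python
C = 299_792_458.0  # speed of light in m/s

def range_resolution(bandwidth_hz):
    """Range resolution of an FMCW chirp with the given sweep bandwidth."""
    return C / (2.0 * bandwidth_hz)

def velocity_resolution(carrier_hz, num_chirps, chirp_period_s):
    """Velocity resolution for a coherent burst of num_chirps chirps."""
    wavelength = C / carrier_hz
    return wavelength / (2.0 * num_chirps * chirp_period_s)

# Illustrative values only; they are NOT the settings in Table 1.
bandwidth = 4e9        # 4 GHz sweep
carrier = 77e9         # 77 GHz start frequency
chirp_period = 100e-6  # 100 us per chirp

print(f"range resolution: {range_resolution(bandwidth) * 100:.2f} cm")
for n in (64, 128):
    dv = velocity_resolution(carrier, n, chirp_period)
    active_ms = n * chirp_period * 1e3
    print(f"{n} chirps: velocity resolution {dv:.3f} m/s, active frame time {active_ms:.1f} ms")
```

Doubling the number of chirps halves the velocity resolution cell but doubles the active time per frame, which bounds the achievable frame rate.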
4.2. Dataset Construction and Input Representation
A total of 4050 fine-grained gesture samples were collected from nine volunteers. For each sample, three time–frequency/spatial representations were extracted, namely the Doppler–time map, range–time map, and angle–time map, which characterize motion velocity, radial displacement, and azimuthal variation, respectively.
During data collection, ten fine-grained gesture types were recorded, including eight valid command gestures and two return-to-neutral gestures. Because return-to-neutral gestures are introduced only to prevent transitional movements from being misclassified as valid commands, the two return-to-neutral gesture types are not distinguished in the final label space and are merged into a single invalid class. Therefore, the gesture classification task is formulated as a nine-class problem, comprising eight valid gesture classes and one invalid (return-to-neutral) class.
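The label merging can be sketched as a simple mapping. The raw label indices and their ordering below are hypothetical; only the merge of the two return-to-neutral types into one invalid class follows the text.

```python
# Hypothetical raw label indices: 0-7 are the valid command gestures,
# 8 and 9 are the two return-to-neutral gestures (the ordering is an assumption).
INVALID_CLASS = 8

def merge_label(raw_label):
    """Map a ten-type raw label to the nine-class label space."""
    if not 0 <= raw_label <= 9:
        raise ValueError(f"unexpected raw label: {raw_label}")
    return raw_label if raw_label < 8 else INVALID_CLASS

raw_labels = [0, 3, 8, 9, 7]
merged = [merge_label(r) for r in raw_labels]
print(merged)  # [0, 3, 8, 8, 7]
```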
To evaluate the representational capability of Doppler, range, and angle features for fine-grained gesture recognition, this paper follows the feature-combination setting in prior work [
30] and constructs single-channel, dual-channel, and triple-channel network inputs using the three extracted representations. For consistency across feature types and models, the Doppler–time, range–time, and angle–time maps are uniformly resized to 64 × 64 before being fed into the neural network.
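Choosing one, two, or three of the three representations gives 3 + 3 + 1 = 7 possible input configurations. The sketch below enumerates them and stacks the selected maps into a multi-channel input; the feature names, the channel-first layout, and the random placeholder maps are assumptions for illustration.

```python
from itertools import combinations
import numpy as np

FEATURES = ("doppler", "range", "angle")

def channel_configurations():
    """Enumerate every non-empty subset of the three feature types."""
    return [combo for k in (1, 2, 3) for combo in combinations(FEATURES, k)]

def build_input(maps, config):
    """Stack the selected 64 x 64 maps into a (channels, 64, 64) array."""
    return np.stack([maps[name] for name in config], axis=0)

# Placeholder maps standing in for one resized sample.
rng = np.random.default_rng(0)
maps = {name: rng.random((64, 64)) for name in FEATURES}

configs = channel_configurations()
for config in configs:
    print(config, build_input(maps, config).shape)
```

Only the number of input channels changes between configurations, so the same backbone can be reused by adjusting the first convolutional layer.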
4.3. Comparison Model: Multi-Path Refinement Network
Compared with backbones such as ResNet and VGG, the multi-path refinement network (RefineNet) is designed to recursively aggregate multi-level features. It jointly leverages high-level, low-resolution semantic features and low-level, high-resolution detailed features to construct refined high-resolution representations. RefineNet is built upon residual connections and identity mappings, which facilitate gradient propagation and stable optimization. In addition, its chained residual pooling (CRP) mechanism aggregates context at multiple pooling scales and integrates these pooled features through residual connections, enriching the representation with multi-scale information.
Following the stage-wise structure of ResNet, the backbone feature maps can be grouped into four consecutive blocks with decreasing spatial resolution. A corresponding four-level cascade of RefineNet modules is attached to these blocks to fuse features across stages and progressively refine the representation. Let ResNet block-m denote the m-th stage of the backbone, and let RefineNet-m denote the refinement module connected to ResNet block-m. Each RefineNet module comprises three components: a residual convolutional unit (RCU) for local feature adaptation, a multi-resolution fusion (MRF) unit for aligning and merging features from different resolutions, and a CRP unit for multi-scale context aggregation.
In this cascade, RefineNet-4 operates on the deepest backbone features to produce an initial refined representation at low resolution. RefineNet-3 takes as inputs the output of RefineNet-4 and the features from ResNet block-3, fuses these two sources, and refines the representation by injecting higher-resolution details. RefineNet-2 and RefineNet-1 follow the same top-down refinement strategy: each module combines the refined lower-resolution output of the preceding refinement module (RefineNet-3 and RefineNet-2, respectively) with the higher-resolution features from the corresponding backbone block, yielding progressively higher-resolution feature maps. Finally, the refined feature map is pooled and fed into the classification head (a fully connected layer) to predict fine-grained gesture categories.
4.4. Experimental Protocol Design
To avoid ambiguity caused by mixing subject-dependent and subject-independent evaluations, this paper reports results under two explicit benchmarks on the collected dataset (nine volunteers). The first benchmark evaluates subject-dependent performance under sample-level random splits, whereas the second benchmark evaluates subject-independent generalization using leave-one-subject-out (LOSO) cross-validation.
4.4.1. Subject-Dependent Evaluation (SD)
In the subject-dependent (SD) benchmark, all samples are pooled and randomly partitioned into training/validation/test sets with a ratio of 60%/20%/20%. To mitigate randomness introduced by data splitting and optimization, the SD benchmark is repeated five times (five random splits) with different random seeds. The reported SD performance is summarized as mean ± standard deviation over repeats.
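The repeated random-split procedure can be sketched as follows. The seed values, the helper names, and the use of the sample standard deviation are assumptions; the split ratio, the number of repeats, and the dataset size follow the text.

```python
import numpy as np

def split_indices(n_samples, seed, ratios=(0.6, 0.2, 0.2)):
    """Randomly partition sample indices into train/validation/test subsets."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def summarize(scores):
    """Mean and sample standard deviation over repeats."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

N_SAMPLES = 4050
splits = [split_indices(N_SAMPLES, seed) for seed in range(5)]  # seeds 0-4 are assumed
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 2430 810 810
```

Each repeat trains and evaluates a model on its own split, and `summarize` would then be applied to the five resulting test scores.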
4.4.2. Subject-Independent Evaluation (LOSO)
In the subject-independent benchmark, LOSO cross-validation is conducted across all volunteers. For each fold, one volunteer is held out as the test subject, and the remaining volunteers form the training pool. A validation set is sampled only from the training pool (i.e., excluding the held-out subject) for model selection and hyperparameter tuning. Performance is computed on the held-out subject for each fold and then aggregated across folds to report mean ± standard deviation.
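A minimal sketch of the LOSO fold construction is given below. The validation fraction, the seed, and the assumption of equal sample counts per volunteer are illustrative choices that are not specified in the text.

```python
import numpy as np

def loso_folds(subject_ids, val_fraction=0.2, seed=0):
    """Yield (held_out, train_idx, val_idx, test_idx) for each subject."""
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    for held_out in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == held_out)
        # The validation set is drawn only from the training pool.
        pool = rng.permutation(np.flatnonzero(subject_ids != held_out))
        n_val = int(round(val_fraction * len(pool)))
        yield held_out, pool[n_val:], pool[:n_val], test_idx

# Nine volunteers with 450 samples each (4050 / 9); equal counts are an assumption.
subject_ids = np.repeat(np.arange(9), 450)
folds = list(loso_folds(subject_ids))
held_out, train_idx, val_idx, test_idx = folds[0]
print(len(folds), len(train_idx), len(val_idx), len(test_idx))  # 9 2880 720 450
```

Because the held-out subject's indices never enter the training pool, neither model fitting nor model selection sees any sample from the test subject.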
4.4.3. Metrics and Experimental Configuration
Performance is reported using classification accuracy and macro-averaged F1 score (macro-F1). Macro-F1 is included to better reflect per-class performance under potential class imbalance. For completeness, the implementation also computes additional summary statistics (e.g., macro-precision/recall and weighted-F1), while accuracy and macro-F1 are used as the primary metrics throughout this paper.
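The difference between accuracy and macro-F1 can be seen in a small example. The function below is an illustrative implementation (classes with no true or predicted samples are scored 0 by convention), and the toy labels are made up for demonstration.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (empty classes score 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy three-class example with an under-represented class 2.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 2]
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
score = macro_f1(y_true, y_pred, 3)
print(f"accuracy={accuracy:.3f}, macro-F1={score:.3f}")
```

A single error on the minority class lowers macro-F1 more than accuracy, because every class contributes equally to the average regardless of its sample count.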
To ensure fair comparison across models and feature combinations, all experiments adopt the same input resolution (64 × 64) and identical training hyperparameters. Two backbones (RefineNet and DRSN) are evaluated under seven input-channel configurations constructed from the Doppler–time, range–time, and angle–time representations, including three single-channel settings, three dual-channel settings, and one triple-channel setting. A unified experimental script is used to run the full configuration sweep and to generate structured result reports (per-repeat/per-fold outputs and aggregated summaries), enabling reproducibility and facilitating independent auditing of the evaluation protocol.
4.4.4. Preprocessing and Leakage Avoidance
To prevent test leakage, all normalization operations are performed instance-wise. Specifically, the normalization parameters for range and Doppler features are computed from OS-CFAR detections on each sample’s own feature maps, rather than from dataset-level statistics. Consequently, no global statistics (e.g., mean/standard deviation or min/max values computed over the full dataset) are derived using any held-out test subject data, and under LOSO evaluation, the held-out subject is never used to compute normalization parameters.
4.5. Results and Analysis
This section reports fine-grained gesture recognition performance under two evaluation protocols: subject-dependent (SD) evaluation with repeated random splits and subject-independent evaluation using leave-one-subject-out (LOSO) cross-validation. For each model and input-feature configuration, classification accuracy and macro-F1 are summarized as mean ± standard deviation.
Under the RefineNet-based baseline (
Table 2), Doppler–time features provide the strongest single-channel representation, whereas range–time features are substantially weaker, and angle–time features are the least effective among the three single-channel inputs. This observation is consistent with the physical characteristics of the designed gestures and the sensing constraints of the radar: Doppler captures dominant motion dynamics, while angle estimation is limited by coarse angular resolution, and range variations can be subtle for small-amplitude fine-grained motions. For dual-channel inputs, combining Doppler–time with angle–time yields the best baseline performance, indicating that angle cues can complement Doppler signatures by resolving lateral-motion tendencies. By contrast, Doppler–range fusion does not yield a clear gain over Doppler alone, which aligns with the coupling between range evolution and radial velocity in FMCW motion patterns. Using all three features jointly achieves the highest baseline accuracy; however, the improvement over the best dual-channel setting is modest, and LOSO performance remains comparable to the Doppler–angle setting.
Table 3 reports the results of the improved DRSN under the same set of input-feature configurations. Across both SD and LOSO, DRSN consistently outperforms the RefineNet baseline, suggesting that integrating shrinkage-based denoising into residual learning is beneficial for fine-grained radar gesture recognition. For single-channel inputs, DRSN yields clear gains for Doppler–time and particularly for range–time, implying improved robustness when the input representation is less discriminative. For dual-channel inputs, Doppler–range achieves the highest SD accuracy, while Doppler–angle remains strong and competitive under both SD and LOSO. When all three features are fused, the triple-channel DRSN achieves the best overall performance (98.88% under SD and 92.20% under LOSO), and macro-F1 closely tracks accuracy, indicating that the gains are broadly consistent across classes under the nine-class formulation.
Beyond aggregate metrics,
Figure 20 provides a class-level interpretation of the remaining confusions under different input-feature settings. With single-channel range inputs, major errors arise between gesture pairs that exhibit similar range–time trajectories but differ primarily in motion direction (e.g., the two lateral pat gestures), reflecting the limited discriminability of range-only cues under small displacement. Incorporating angle cues in the Doppler–angle and triple-channel settings substantially reduces such confusions by providing complementary lateral-motion information. For angle-only inputs, misclassifications increase among classes whose discriminative cues are encoded mainly in velocity profiles or hand-aperture dynamics rather than azimuth variation, which is consistent with the radar’s limited angular resolution. For the range–angle setting, removing Doppler information makes it more difficult to separate gestures with similar spatial traces but different velocity patterns, whereas triple-channel fusion yields the strongest diagonal dominance and the fewest off-diagonal errors overall.
A direct SD accuracy comparison between RefineNet and DRSN is summarized in Table 4. Performance gains are observed for all input settings, with the largest improvements appearing in configurations that are relatively weak under the baseline (e.g., range-only and range–angle), while Doppler-dominant configurations also improve consistently. Notably, under SD, DRSN improves over RefineNet by 8.52 percentage points for range–angle fusion and by 4.39 percentage points for triple-channel fusion, indicating that the shrinkage-based design provides consistent benefits beyond feature refinement alone. Under the subject-independent LOSO protocol, the advantage of DRSN is particularly evident for the triple-channel setting, supporting the view that multi-feature fusion combined with shrinkage-based denoising improves cross-subject generalization. The larger gain for the range–angle setting likely reflects a higher degree of noise sensitivity and weaker separability in the baseline: without Doppler, the model relies on comparatively noisier range/angle estimates (limited range span and coarse angular resolution). The shrinkage-based residual units in DRSN provide adaptive suppression of noise-like responses and implicit feature selection, which is expected to benefit such low-SNR inputs more strongly, whereas Doppler-inclusive settings have less room for improvement due to their stronger baseline separability.
In addition to recognition performance,
Table 5 reports computational complexity and runtime behavior of RefineNet and DRSN under two numerical precisions (FP16/FP32) on the same experimental platform. The results highlight different accuracy-efficiency trade-offs: DRSN uses fewer parameters but incurs higher arithmetic cost (FLOPs per batch), leading to higher inference latency under the same batch setting. Switching from FP32 to FP16 reduces latency and increases throughput (FPS) for both models. Notably, the recorded inference peak memory is higher under FP16 in our measurements; this peak is obtained from the framework-reported maximum GPU memory allocation and thus includes temporary workspaces and allocator caching. In our stack, FP16 may invoke different precision-specific kernel paths (e.g., Tensor Core-optimized implementations) that allocate larger intermediate workspaces, whereas training peak memory still decreases under mixed precision because activations/gradients are stored in FP16 (with only FP32 master weights retained).
Finally,
Table 6 summarizes representative radar-based hand gesture recognition methods and lists the datasets and evaluation protocols reported in the corresponding papers. Since these results are obtained under different data sources, label definitions, and experimental settings, they should be interpreted as a contextual reference rather than a strictly controlled benchmark. Nevertheless, the proposed variable-channel DRSN achieves strong performance under both subject-dependent and subject-independent (LOSO) evaluations on our dataset, while
Table 5 further characterizes its computational and runtime costs. Overall, these results indicate that integrating multi-feature fusion with shrinkage-based residual learning provides a robust solution for fine-grained radar gesture recognition, with explicit accuracy–efficiency trade-offs that can be selected according to practical constraints.
5. Conclusions
This paper presents an end-to-end framework for fine-grained gesture recognition using FMCW millimeter-wave radar, encompassing gesture design, radar data acquisition, feature extraction, continuous-gesture detection/segmentation, and neural network-based classification. Ten gesture types were recorded, including eight valid command gestures and two return-to-neutral gestures; for classification, the two return-to-neutral gesture types were merged into a single invalid class, yielding a nine-class recognition task (eight valid classes plus one invalid class). Doppler–time representations extracted via STFT are used to support sliding-window-based automatic detection and segmentation in continuous streams, while MUSIC-based super-resolution estimation is adopted to construct range–time and angle–time representations for enhanced range/angle characterization. To mitigate inter-subject variability without introducing test leakage, instance-wise normalization is applied to Doppler and range features based solely on each sample’s own detections. For recognition, this paper develops a multi-path refinement network (RefineNet) built on a deep residual backbone to fuse low- and high-resolution features into high-resolution representations, and further integrates a shrinkage mechanism into residual building units to suppress noise-like responses, resulting in a deep residual shrinkage network (DRSN). The discriminative contributions of Doppler, range, and angle features are examined through single-channel, dual-channel, and triple-channel input configurations. To remove ambiguity in evaluation, performance is reported under two explicit protocols—subject-dependent evaluation with repeated random splits and subject-independent evaluation with leave-one-subject-out (LOSO) cross-validation—demonstrating that the proposed variable-channel DRSN consistently outperforms the RefineNet-based baseline and that the triple-channel configuration achieves the best overall performance. 
The triple-channel DRSN achieves an accuracy of 98.88% under subject-dependent evaluation on the fine-grained gesture recognition task, an improvement of 4.39 percentage points over the triple-channel RefineNet. Overall, the experimental results demonstrate the adaptability of the proposed system, enabling flexible selection of channel configurations to suit diverse application scenarios.