4.3.1. Performance of Different Classification Algorithms
(1) Comparison of the accuracy of VGG16, SE-VGG16, ResNet18, ResNet34, ResNet50, MobileNetV3, and EfficientNetB0 for different gestures.
To evaluate the performance of the SE-VGG16 WiFi gesture recognition method, this study compares and analyzes the recognition accuracy of seven models: VGG16, SE-VGG16, ResNet18, ResNet34, ResNet50, MobileNetV3, and EfficientNetB0. The results are shown in Figure 8.
Figure 8 compares the gesture recognition accuracy of seven deep-learning models. Among these models, SE-VGG16 has the highest overall accuracy, with ResNet variants ranking closely behind. SE-VGG16’s superiority stems from its squeeze–excitation (SE) mechanism, which recalibrates channel-wise features, improving accuracy by 4.13% over VGG16. ResNet models use residual connections for better performance, but trail SE-VGG16 in some challenging scenarios. Lightweight models MobileNetV3 and EfficientNetB0 trade accuracy for computational efficiency; their depthwise separable convolutions and reduced capacity limit performance on intricate gestures. Further analysis of per-gesture accuracy reveals that SE-VGG16’s advantage is most pronounced for Enlarge/Narrow gestures, where its attention mechanism effectively isolates subtle finger movements from noise. The results collectively affirm SE-VGG16 as a state-of-the-art solution, particularly for environments requiring high precision across diverse gesture types.
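To make the channel recalibration concrete, the following is a minimal PyTorch-style sketch of a squeeze–excitation block; the layer arrangement follows the standard SE design, but the reduction ratio and names are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight feature channels by learned importance."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial average
        self.excite = nn.Sequential(                      # excitation: two-layer bottleneck
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                    # (B, C) channel descriptors
        w = self.excite(w).view(b, c, 1, 1)               # learned channel weights
        return x * w                                      # recalibrated feature maps
```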
(2) Confusion Matrix of Different Gestures in SE-VGG16 Model
To further assess the SE-VGG16 model’s gesture-recognition performance, confusion matrices for office and conference room environments are analyzed. These matrices detail misclassifications among Enlarge, Narrow, Up, Down, Left, and Right gestures.
In the office environment (Figure 9a), the SE-VGG16 model shows excellent recognition performance. The confusion matrix shows that the recognition rate of the "Up" gesture reaches 94.31%, while the "Enlarge" and "Narrow" gestures, which involve more complex motion patterns, also maintain high accuracies of 93.69% and 93.56%, respectively. The few misclassifications that do occur keep the overall error rate very low, indicating that the model can effectively distinguish the signal characteristics unique to each gesture and remains stable even in the presence of environmental interference. In the conference room scenario (Figure 9b), the model also performs well, with a recognition rate of 94.29% for the "Down" gesture, and the other gesture types likewise maintain high recognition accuracy. The consistently low misclassification rate demonstrates the robustness of the model.
The confusion matrix results clearly demonstrate the effectiveness of the SE-VGG16 model. It is worth noting that the relatively uniform signal conditions of the open conference room and the more variable signals caused by the office's complex layout provide a useful contrast, yet the model maintains a high accuracy rate in both environments. This performance demonstrates the model's strong environmental adaptability and robustness.
(3) Comparison of VGG16 and SE-VGG16 performance
To evaluate the performance enhancement achieved by incorporating the squeeze–excitation mechanism into VGG16, we compared the training and validation accuracy trajectories of both VGG16 and SE-VGG16 across successive epochs. The results are presented in Figure 10.
Figure 10 demonstrates that both models exhibit progressively improving accuracy with increasing epoch count until reaching stabilization. Notably, the validation accuracy marginally exceeds the training accuracy in both architectures.
(4) Comparison of SE-WiGR with other gesture recognition systems
To comprehensively verify the actual performance of the proposed method, we selected two typical indoor scenarios (a conference room and an office) and conducted systematic comparisons with mainstream gesture recognition algorithms under identical experimental conditions; the experimental results are shown in Table 5. In gesture recognition tasks, class imbalance (e.g., limited samples for fine-grained gestures like "Enlarge/Narrow") and inter-class feature similarity (e.g., the directional gestures "Left/Right" and "Up/Down") pose challenges for a single Accuracy metric. As the harmonic mean of Precision (TP/(TP + FP)) and Recall (TP/(TP + FN)), the F1 score mitigates bias from dominant gesture categories and provides a balanced assessment of minority classes, while also sensitively capturing misclassifications between similar gestures with subtle feature differences. Additionally, in practical applications such as smart homes, the F1 score's penalization of both false positives and false negatives aligns with real-world reliability requirements.
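For reference, the standard textbook definitions underlying the F1 scores reported in Table 5 are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]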
Our experimental results demonstrate significant performance differences among the evaluated approaches. Shallow learning models relying on hand-crafted features (WiGest and HMM) achieve relatively low accuracy (approximately 70%), as manual feature extraction frequently fails to capture critical discriminative characteristics. In comparison, deep learning-based approaches (LSTM and GRU) show superior performance by automatically learning optimal feature representations. Notably, the ABLSTM method outperforms standard LSTM through its bidirectional architecture and attention mechanism. Among all benchmarked methods, our proposed SE-WiGR technique achieves the highest recognition accuracy across all gesture categories. This performance advantage stems from the effective integration of the SE module with the VGG16 network, which synergistically combines their complementary strengths for robust feature extraction and classification.
4.3.2. Robustness Experiments
(1) Evaluation of User Diversity and Cross-User Generalization
We evaluate the robustness of the proposed method through a multi-faceted validation framework. First, a dual analysis assesses:
Inter-User Variability: the recognition accuracy of the same gesture executed by different users (Figure 11a), where users 1–6 show minimal accuracy differences despite variations in finger width and execution posture. This indicates the method is insensitive to physiological traits such as gender, weight, and height.
Intra-User Consistency: the recognition stability of different gestures performed by the same user (Figure 11b), where accuracy fluctuations (within 3.2%) caused by gesture amplitude variations do not compromise overall performance.
To quantitatively validate generalization to unseen users, a leave-one-user-out cross-validation strategy was adopted: data from 9 users were used for training, with the remaining user serving as the test set, repeated 10 times so that each user is held out once. The results (Table 6) show average accuracies of 93.95% (office) and 94.26% (conference room), with most users deviating less than ±1.0% from the mean. Notable exceptions (e.g., U7's 91.57% in the conference room) likely reflect environmental interference rather than user diversity, demonstrating the model's resilience to outliers.
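A minimal sketch of this leave-one-user-out protocol is shown below, assuming scikit-learn's LeaveOneGroupOut splitter; the synthetic placeholder data and the simple classifier standing in for SE-VGG16 are illustrative assumptions, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression    # stand-in for SE-VGG16
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 64))            # placeholder CSI-derived features
y = rng.integers(0, 6, size=600)          # six gesture classes
users = np.repeat(np.arange(10), 60)      # ten users, sixty samples each

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=users):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))   # one held-out user per fold
print(f"mean cross-user accuracy: {np.mean(scores):.2%}")
```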
The above experiments confirm that SE-WiGR is robust to both inter-user physiological differences and intra-user gesture variations, making it suitable for real-world deployments with diverse user groups.
(2) Impact of speed diversity
Execution speed is a crucial dimension when evaluating WiFi-based gesture recognition, because the same gesture produces different signal dynamics at different execution speeds. To verify the stability and reliability of the gesture recognition system under different speed conditions, this study adopts the gesture execution duration of 4 s, as defined in the training dataset, as the baseline. To assess the system's robustness across temporal variations, a set of controlled experiments is conducted at three distinct gesture speeds: rapid (2 s), standard (4 s), and slow (6 s). The results of these experiments are shown in Table 7.
As shown in Table 7, gesture speed significantly affects recognition accuracy. The average accuracy is 89.66% for fast gestures, 93.95% for normal-speed gestures, and 92.80% for slow gestures. Normal-speed gestures yield the highest accuracy, with Up and Down gestures reaching 94.31% and 94.65%, respectively. In contrast, fast-speed Enlarge and Narrow gestures have the lowest accuracy at 88.57% and 87.72%, while the Down gesture performs best at this speed (91.75%). For slow gestures, Left and Right gestures achieve the highest accuracy (93.37% and 93.46%, respectively), whereas Enlarge accuracy drops to 91.36%. At normal speeds, gestures exhibit stable and easily detectable signal variations, resulting in higher accuracy. Faster gestures, however, introduce pronounced signal changes with additional noise and interference, complicating recognition. Slow gestures produce smoother signal transitions, but their subtlety may also reduce recognizability. The Enlarge and Narrow gestures involve more intricate multi-finger coordination compared to simple linear motions. This complexity leads to lower accuracy, especially at extreme speeds where signal variability and uncertainty further increase recognition difficulty.
(3) Effect of positional diversity
The experiments investigated the attenuation and multipath reflection of electromagnetic signals as the gesture's position relative to the device varied. These variations cause differences in the amplitude and phase of the received signals, leading to discrepancies in gesture features related to distance and angle. The impact of these feature variations on gesture recognition accuracy requires further validation. We tested 28 different positions, covering distances (D1 + D2) ranging from 1 m to 4 m and angles (θ) of 0°, 30°, 60°, and 90°. At each position, 10 datasets were collected for each gesture in both the office and conference room environments, yielding a total of 560 experimental samples. Using these data, we evaluate the effectiveness of gesture recognition at varying distances and angles to analyze how feature differences affect recognition accuracy. The experimental setup is illustrated in Figure 12, and the results are shown in Table 8.
Based on the data in Table 8, it is evident that recognition accuracy varies significantly with changes in position and angle, with overall recognition accuracy ranging from 74.36% to 94.25%. Specifically, recognition accuracy reaches 92.17% when the gesture is performed at a 0° angle and a distance of 1 m. In contrast, when the gesture is performed at a 90° angle and the distance increases to 4 m, recognition accuracy falls to 78.75%. Further examination indicates that, under a fixed angle, recognition accuracy tends to decline progressively as the distance increases. Likewise, when the distance remains constant, variations in the angle also lead to a slight decrease in accuracy, though the extent of this reduction is relatively limited. It is worth emphasizing that although recognition accuracy declines as the distance increases, in practical applications such as smart home interaction or virtual reality systems, the user typically operates within a moderate range of the device. Therefore, the observed reduction in accuracy has a limited impact on the overall user experience.
In the experimental design, when the angle θ between the person and the transceiver device remains constant, the geometry of the signal path stays stable, with the only change being an increase in the total distance (D1 + D2). This alteration results in an elongated signal propagation path, increasing the probability of signal attenuation and interference.
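As a first-order illustration only (free space is an idealization of the indoor multipath channel), the standard Friis free-space path-loss relation shows why a longer total path D1 + D2 weakens the received signal:

\[
\mathrm{FSPL}\,(\mathrm{dB}) = 20\log_{10}(d) + 20\log_{10}(f) + 20\log_{10}\!\left(\frac{4\pi}{c}\right),
\]

where d = D1 + D2 is the path length in meters, f is the carrier frequency in Hz, and c is the speed of light; doubling the path length adds roughly 6 dB of loss.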
Conversely, when D1 + D2 is held constant, the accuracy decreases slightly as the angle θ increases from 0° to 90°. This occurs because, at θ = 0°, the signal travels along a direct line, the path is simplest, and the multipath effect has little influence, resulting in the highest recognition accuracy. As θ increases, the signal must take a longer, less direct route, increasing the diversity and complexity of the propagation path and strengthening the multipath effect. Although D1 + D2 is constant, path variation caused by the angle change may produce minor signal strength fluctuations, but these are less significant than those caused by changes in distance.
For scenarios where a secondary person performs random movements behind the primary user during gesture execution, the proposed SE-WiGR framework is well-equipped to maintain accurate recognition. The squeeze–excitation (SE) module embedded within the network adaptively recalibrates channel-wise feature responses in the CSI data, enabling the model to prioritize the signal patterns associated with the primary user’s intended gestures. Meanwhile, the hierarchical feature extraction capabilities of the VGG16 network effectively filter out interference from background activities. The robustness demonstrated in previous experiments against environmental noise and occlusions further supports the framework’s ability to handle such complex, real-world interactions.
In addition, combining the data in Table 8, we can further analyze how distance and the multipath effect influence recognition accuracy. At closer ranges (e.g., D1 + D2 = 1 m, θ = 0°), the direct signal path dominates and multipath interference is weak, so the CSI clearly reflects gesture dynamics; "Up/Down" recognition accuracy reaches 94.25%. At 4 m, reflection off walls and furniture causes roughly 20 dB of CSI amplitude attenuation, and multipath phase superposition blurs the features, reducing "Enlarge/Narrow" accuracy to 78.75%, which is consistent with wireless channel theory.
4.3.3. Model Ablation Experiments
(1) Squeezing operator
In the SE module, this study compares the effects of two pooling strategies, max pooling and average pooling, used in the squeeze operation, with the specific results presented in Table 9.
From Table 9, we can see that both pooling methods perform well. However, average pooling performs slightly better, which may be due to its greater robustness to anomalous or outlier values compared with max pooling; it also better captures the overall statistical characteristics of the feature maps. In addition, this experiment confirms the robustness of the SE block: regardless of the pooling method used, incorporating the SE block yields consistently high performance, with no significant difference between the two variants.
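A minimal sketch of the two squeeze variants compared in Table 9 is given below; only the global pooling operator changes, while the excitation path is left untouched (illustrative PyTorch code, not the paper's implementation).

```python
import torch.nn as nn

def make_squeeze(kind: str = "avg") -> nn.Module:
    """Global pooling used as the SE squeeze operator."""
    if kind == "avg":
        return nn.AdaptiveAvgPool2d(1)   # average pooling: per-channel mean statistic
    if kind == "max":
        return nn.AdaptiveMaxPool2d(1)   # max pooling: strongest activation per channel
    raise ValueError(f"unknown squeeze operator: {kind}")
```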
(2) Different activation functions
Activation functions play a crucial role in neural networks. Their primary purpose is to introduce nonlinearity, allowing the network to learn and model complex data patterns. Without them, a network, regardless of its depth, would be severely limited in expressive power, capable only of representing linear functions or their combinations, and unable to capture the intricate nonlinear relationships within the data. Therefore, this study investigates the effect of the activation function used for the excitation operator in SE-VGG16, with the results presented in Table 10. The experimental outcomes in Table 10 indicate that employing the ReLU activation function yields greater accuracy, making SE-VGG16 more effective.
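One way to parameterize this ablation, assuming the compared activation is the hidden non-linearity of the excitation MLP (the output gate is kept as a sigmoid), is sketched below; the factory function and defaults are illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

def make_excitation(channels: int, reduction: int = 16, act=nn.ReLU) -> nn.Sequential:
    """Excitation MLP with a configurable hidden activation (e.g., nn.ReLU, nn.Tanh)."""
    return nn.Sequential(
        nn.Linear(channels, channels // reduction),
        act(),                                    # activation under comparison in Table 10
        nn.Linear(channels // reduction, channels),
        nn.Sigmoid(),                             # gating kept fixed
    )
```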
(3) VGG16 integrates SE blocks at different stages
We integrated SE blocks into each stage of VGG16 sequentially to investigate their impact at different levels of the network. As illustrated in Table 11, integrating SE blocks at different stages within the VGG16 architecture leads to varying degrees of performance enhancement. This indicates that the benefits produced by SE blocks at the various stages are complementary and can collectively enhance the network's overall performance. Additionally, it is worth noting that while the number of model parameters increases with the addition of SE blocks, this growth remains relatively limited and does not significantly increase model training costs.
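A minimal sketch of one way to attach an SE block after selected VGG16 stages, reusing the SEBlock sketch above, is shown below; the placement after each stage's pooling layer and the torchvision-based construction are illustrative assumptions, not the paper's implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16

STAGE_CHANNELS = [64, 128, 256, 512, 512]         # output width of VGG16's five stages

def vgg16_with_se(stages=(1, 2, 3, 4, 5)) -> nn.Module:
    """Insert an SEBlock (see earlier sketch) after the chosen VGG16 stages."""
    base = vgg16(weights=None)
    layers, stage = [], 0
    for layer in base.features:
        layers.append(layer)
        if isinstance(layer, nn.MaxPool2d):        # each MaxPool2d closes a stage
            stage += 1
            if stage in stages:
                layers.append(SEBlock(STAGE_CHANNELS[stage - 1]))
    base.features = nn.Sequential(*layers)
    return base
```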
(4) Impact of SE: SE-VGG16, NoSqueeze, VGG16
To verify the effectiveness of the SE module, we first remove the pooling layer. The original fully connected (FC) layer is replaced with a 1 × 1 convolutional layer. We refer to the structure after this replacement as NoSqueeze, and the experimental results are shown in Table 12.
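A minimal sketch of how we read the NoSqueeze variant is given below: the global pooling (squeeze) is dropped and the FC layers are replaced with 1 × 1 convolutions that compute weights at every spatial position (illustrative PyTorch code, not the paper's implementation).

```python
import torch
import torch.nn as nn

class NoSqueeze(nn.Module):
    """SE variant without global pooling: 1 x 1 convs compute position-wise weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.excite(x)     # no global context: weights vary per location
```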
From the detailed data in Table 12, we can see that the F1 score of the model improves by 4.13% when the SE module is incorporated into the VGG16 network architecture. This improvement is reflected in higher recognition accuracy and an enhanced ability to learn complex features. Notably, although the SE module was introduced to boost the network's representational capability, the number of model parameters did not increase substantially. This indicates that the performance of VGG16 was improved while maintaining computational efficiency, which is significant for real-world applications: the model can be strengthened by introducing an advanced network module without sacrificing computational efficiency.
In summary, the data in Table 12 confirm the significant effect of the SE module on the performance of the VGG16 network, and show that this improvement is achieved while keeping the model parameters streamlined and ensuring efficient utilization of computational resources. This provides new ideas and directions for further exploring and optimizing deep learning models.