WiPID: An End-to-End Deep Learning Framework for Passive Person Identification Using WiFi Signals

Wang, Chenlu; Deng, Ya; Li, Yuke; Wang, Shenhujing; Wang, Shubin

doi:10.3390/sym18050878


by Chenlu Wang, Ya Deng, Yuke Li, Shenhujing Wang and Shubin Wang *

Guang’an Institute of Technology, Guang’an 638000, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(5), 878; https://doi.org/10.3390/sym18050878

Submission received: 26 March 2026 / Revised: 24 April 2026 / Accepted: 4 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue Symmetry in Computational Intelligence and Data Science)


Abstract

WiFi sensing has attracted widespread attention because it is non-intrusive, privacy-preserving, and inexpensive to deploy, which enables diverse application scenarios. In addition, the stable spatial characteristics and symmetry-related patterns that human body postures exhibit in WiFi signal propagation open new possibilities for robust person identification. Among traditional WiFi-based person identification techniques, gait recognition has achieved some success, but it is complex to operate and limited in its application scenarios. These constraints become more pronounced in large-scale user scenarios, where system performance tends to degrade and become unstable. To overcome these challenges, we introduce a new person identification system called WiPID. The WiFi signals captured while users hold static postures are treated as a “biometric fingerprint” for identity verification. WiPID processes WiFi signals with an end-to-end deep learning framework: a convolutional autoencoder preprocesses the signals directly, reducing redundant information and greatly simplifying WiFi data processing, and a multi-scale feature extraction module improves the system’s ability to capture discriminative features. The proposed system reduces operational complexity and extends applicability to a wider range of scenarios while improving recognition performance. In an experiment involving 50 volunteers, WiPID achieved an average recognition accuracy of up to 98%, indicating that the method is suitable for large-scale person identification. In addition, a real-time identification experiment was conducted on PCs and commercial WiFi devices. The results show that WiPID can achieve real-time person identification on Internet of Things devices, further supporting its feasibility and stability in practical applications.

Keywords:

WiFi-based sensing; channel state information; convolutional neural networks; convolutional autoencoder; multi-scale feature fusion

1. Introduction

Person Identification (PID) is an essential technology in modern information society and is widely applied in areas including security monitoring, access control, smart homes, and attendance management [1,2,3]. PID involves capturing and analyzing biological or behavioral characteristics that can uniquely identify individuals to confirm and verify their identities. This typically involves collecting individuals’ characteristic data, such as facial images, fingerprints, gait, and heart rate, and then using algorithms to compare these characteristics with stored information in a database to confirm identities. With the continuous advancement of technology, accurate identity identification and verification affect both the security and reliability of systems as well as user experience and privacy protection. In addition, the spatial distribution characteristics and symmetry-related patterns exhibited by human behaviors and body postures provide valuable information for improving the robustness and discriminability of person identification systems.

The primary types of PID currently include vision-based [4,5,6,7,8,9], wearable device-based [10,11,12,13], and wireless sensing-based methods [14,15,16,17,18]. Vision-based systems use cameras to capture and analyze visual features such as faces and gaits for identity recognition. Although highly accurate in good lighting conditions, they are susceptible to changes in ambient light, occlusions, and spoofing attacks such as photo and video replays. Moreover, such systems require high-resolution cameras, which increases deployment costs and the risk of privacy breaches. Wearable device-based systems use sensors built into devices, including accelerometers and gyroscopes, to collect biometric data from users, such as gait and heart rate. These systems can operate stably in various environments, but they require users to wear devices, which adds inconvenience and cost.

In contrast, wireless sensing-based systems offer a non-intrusive, low-cost method of identity recognition [16,17,18]. WiFi-based sensing has been extensively adopted in wireless sensing systems and is typically realized using Channel State Information (CSI) [19]. CSI provides fine-grained physical-layer information across subcarrier channels that reflects complex multipath effects caused by human movements. It captures the multipath propagation of WiFi signals from the transmitter to the receiver across multiple subcarriers, depicting the various effects of human activities on wireless signals. Compared to other methods, it has significant advantages. Users do not need to actively cooperate or wear any devices. It can achieve passive person identification simply by deploying on existing WiFi networks, making it ideal for attendance tracking, home security, and medical monitoring scenarios. In addition, wireless sensing also includes technologies such as Bluetooth [20], radar [21] and Radio Frequency Identification (RFID) [22], each of which has its own specific applications and technical requirements.

The most commonly used technique for WiFi-based person identification is gait recognition, which is achieved by analyzing the sequence of CSI generated by the user during walking. This process generally includes signal denoising techniques, such as wavelet-based denoising and Butterworth filtering, to reduce noise and interference. The processed data is subsequently input into machine learning models, such as Support Vector Machines (SVM) and Deep Neural Networks (DNN). Although gait recognition technology can effectively distinguish individuals to a certain extent, its application also has obvious limitations. First, it requires users to follow predefined walking paths, which demands strong user cooperation and increases both system complexity and time consumption. Second, gait recognition is highly dependent on the environment, and its performance significantly degrades in scenarios with a large number of users or complex environments. These limitations make it difficult for gait recognition to meet the requirements of efficient and accurate identity recognition in practical applications, especially in dynamic and variable environments.

To overcome the limitations of gait recognition, this study adopts a WiFi-based fingerprint identification approach. The system exploits the biometric information carried by WiFi signals when users are stationary, treating static posture as a “biometric fingerprint” for identity verification. Even when users are relatively static, the human body still produces subtle movements. These subtle movements, together with individual physiological characteristics such as body shape, body fat ratio, and muscle composition, introduce distinctive variations into WiFi signals, thereby generating features that can be used for identity recognition. By capturing these unique changes in WiFi signals, it is possible to distinguish different individuals and thus achieve efficient and accurate human identification. WiFi fingerprint-based identification does not require explicit user actions, which lowers usage barriers and improves system practicality and user experience, making the approach suitable for a wide range of real-world applications.

According to previous research, the feasibility of static posture recognition has already been preliminarily verified. It has been shown that in environments with human presence, the multipath effect of WiFi signals is significantly disturbed by the human body. This interference can notably affect signal propagation and be utilized for person identification. In our initial exploration, we discovered significant differences in the amplitude mean and variance among different individuals when standing still in a WiFi environment, indicating substantial research potential.

Building on this idea, we introduce WiPID, a WiFi-based person identification framework that employs CNN for fully end-to-end signal processing. Data preprocessing is conducted using a Convolutional Autoencoder [23,24] to reduce redundant information and decrease computational complexity. This is combined with a multi-scale feature extraction network [25,26,27,28] to enhance recognition accuracy. The main contributions of this work are outlined below:

A new high-speed human identification system is proposed, which utilizes WiFi fingerprint information for person identification without the need for complex signal processing operations.
A deep learning framework is designed, which uses a Convolutional Autoencoder instead of preprocessing operations and proposes a multi-scale feature extraction method that combines local and global feature extraction.
In an experiment with 50 volunteers, WiPID achieved an average recognition accuracy of 98%. Additionally, a real-time identification experiment using only a PC and a router was implemented, demonstrating its application potential on Internet of Things devices.

2. Related Work

2.1. Vision-Based Person Identification

Vision-based PID technology plays a significant role in identity verification, including facial recognition and gait recognition. These techniques perform identity verification by capturing and analyzing human biometric features through cameras.

In the early 1990s, M. Turk et al. proposed Eigenfaces, which uses Principal Component Analysis (PCA) to extract facial features and performs recognition through feature vectors [4]. This representation was very effective and laid the foundation for subsequent facial recognition technologies. With the onset of the 21st century, increased computing capabilities and the emergence of big data facilitated the rapid growth of deep learning [29,30], which began to emerge in the field of image recognition. In 2015, FaceNet [5], proposed by Google, used deep convolutional neural networks (DCNN) to achieve high-precision face recognition. Subsequently, models such as ArcFace [6] were introduced. By learning from massive collections of images, these models can generate robust feature representations, leading to notable improvements in face recognition accuracy.

In addition to facial features that can be used for PID, a lot of research has also begun on gait-based identity recognition. As a behavioral biological characteristic, gait has certain uniqueness and stability. Gait recognition technology identifies individuals by extracting gait features from videos, such as leg swing, stride length, and gait cycle. In the development history of gait recognition, Gait Energy Image (GEI) is a representative technology [7]. GEI captures individual gait features by averaging each frame image within a gait cycle to form an energy map. With the advancement of deep learning, gait recognition methods based on DCNN, such as GaitSet [8] and GaitPart [9], have further improved recognition accuracy and robustness.

2.2. Wearable Device-Based Person Identification

Wearable device-based identification technology offers many mature solutions [10,11,12,13]. Early research primarily focused on using accelerometers and gyroscopes in wearable devices for gait recognition. Gafurov et al. [10] installed accelerometers in the pocket position to collect acceleration data while users walked, successfully identifying different users’ walking patterns. Zhao and colleagues [11] proposed a method for gait recognition employing Angle Embedded Gait Dynamic Image (AE-GDI) together with CNN, enhancing recognition accuracy by embedding angle information. They achieved good results on two datasets.

With the advancement of sensing technology, researchers began to explore more complex biometric identification methods. Electrocardiograms (ECG), as an inherent physiological feature of the human body, have also been widely used in PID. Biel and colleagues [12] were the first to develop a method for identity recognition using ECG signals, but it required users to carry multiple electrodes, which was inconvenient. Liu et al. [13] designed a multi-source data model and developed an identification system employing hierarchical two-level feature fusion to improve recognition results. Wearable device-based identification has made significant progress in accuracy and application range, but there are still some challenges, such as the inconvenience of wearing the device and the complexity of data collection.

2.3. WiFi-Based Sensing Person Identification

Since the late 19th century, researchers have used X-rays for target imaging and, later, radar and sonar systems for target tracking. These sensing systems [31,32] were powerful, but they required expensive hardware and professional operation, making them unsuitable for everyday applications. As society advanced into the era of the Internet of Things (IoT), the proliferation of WiFi technology brought new opportunities. Recent studies [14,15,16,17,18] indicate that low-cost commercial WiFi hardware, such as smartphones and routers, can be used for sensing without requiring the individual to carry any additional devices.

WiFi-based identification methods use WiFi signals, mainly CSI, for identity authentication. Halperin et al. modified the firmware of commercial WiFi hardware so that standard IEEE 802.11n network cards could report CSI data, which made CSI-based sensing a new research focus [19]. CSI enables the acquisition of more fine-grained data.

Over the past few years, WiFi sensing has gradually been applied to the field of identity recognition, primarily focusing on gait recognition research. WiWho [14] uses machine learning methods to extract gait features for classification. FreeSense [15] proposes a sliding window-based image segmentation algorithm for gait recognition, achieving high accuracy. Nevertheless, these approaches depend on active user cooperation, leading to higher recognition complexity and longer processing times, which restricts their potential application. NeuralWave [16] learns salient features from CSI data using DCNN, achieving an accuracy rate close to 90% in identifying 27 individuals.

Identity recognition based on static posture is an emerging field in WiFi sensing technology. WiPIN [17] first proposed using static posture for identity recognition. The study shows that, in a static state, different individuals disturb the multipath propagation of WiFi signals in significantly different ways, and these differences can be used as the basis for identity recognition. WiDFF-ID [18] proposed a device-free fast person identification system. By combining signal processing and deep learning techniques, it significantly improved recognition accuracy. Static recognition methods offer significant advantages because they do not require the user to actively cooperate or perform specific actions. This lowers the barrier to use and thus improves user experience.

3. Preliminaries

3.1. Channel State Information

CSI describes the detailed environmental effects experienced by WiFi signals as they travel from the transmitter to the receiver, encompassing aspects such as delay, amplitude attenuation, and phase change. CSI [19] provides more granular physical information, offering more detailed and rich data compared to Received Signal Strength (RSS) [33] for complex environmental sensing applications such as gesture recognition, motion detection, and identity verification. The representation of the channel can be expressed as:

Y = H X + N, \quad (1)

Here, Y is the received signal vector, X is the transmitted signal vector, N represents the channel noise vector, and H is the channel matrix. Estimating H is the core of obtaining CSI; it reflects the various impacts on the signal as it propagates through a multipath environment.
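To make Equation (1) concrete, the following minimal sketch assumes the common single-antenna OFDM simplification in which each subcarrier k obeys a scalar relation y_k = h_k x_k + n_k, so that a known pilot symbol x_k gives the least-squares estimate h_k ≈ y_k / x_k. The channel gains, pilot symbols, and noise level are hypothetical values chosen only for illustration; they are not taken from the paper or from any CSI tool.

```python
import random

def estimate_channel(received, pilots):
    """Per-subcarrier least-squares estimate h_k = y_k / x_k for known pilots."""
    return [y / x for y, x in zip(received, pilots)]

random.seed(0)
true_h = [0.9 + 0.2j, 0.5 - 0.4j, 0.3 + 0.7j]   # hypothetical channel gains
pilots = [1 + 0j, -1 + 0j, 1 + 0j]               # hypothetical known pilot symbols
noise = [complex(random.gauss(0, 0.01), random.gauss(0, 0.01)) for _ in true_h]

# Scalar form of Y = HX + N on each subcarrier.
received = [h * x + n for h, x, n in zip(true_h, pilots, noise)]

h_hat = estimate_channel(received, pilots)
amplitude = [abs(h) for h in h_hat]              # CSI amplitude, as used later in the paper
```

The amplitude list corresponds to the quantity that the rest of the paper works with; the phase of each estimate is discarded.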

Specifically, in a static scenario, assuming the equivalent baseband signal of the transmitted signal is x(t), the signal may arrive at the receiver through N paths. Without considering the effect of noise, the equivalent baseband signal model at the receiver can be expressed as:

y(t) = \sum_{n=1}^{N} a_{n}(t) x(t - T_{n}(t)) e^{-j 2 \pi f_{c} T_{n}(t)}, \quad (2)

where a_{n}(t) represents the signal amplitude on the n-th path at time t, T_{n}(t) is the propagation delay of the n-th path, f_{c} is the carrier frequency, and x(t - T_{n}(t)) represents the transmitted signal considering the delay. Each path’s signal also undergoes a phase shift e^{-j 2 \pi f_{c} T_{n}(t)} caused by its propagation delay. This formula indicates that the received signal y(t) is composed of the superposition of signals from multiple paths.
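Equation (2) can be illustrated numerically. The sketch below assumes a static scene (constant a_n and T_n) and a subcarrier tone x(t) = e^{j 2π f_k t} at baseband offset f_k, for which Equation (2) reduces to y(t) = x(t) · Σ_n a_n e^{-j 2π (f_c + f_k) T_n}; the sum is the channel response seen on that subcarrier. The carrier frequency, path amplitudes, and delays are hypothetical; the 312.5 kHz spacing is the 802.11n subcarrier spacing.

```python
import cmath
import math

def channel_response(paths, carrier_hz, offsets_hz):
    """Static multipath response H(f_k) = sum_n a_n * exp(-j*2*pi*(f_c + f_k)*T_n).

    `paths` is a list of (amplitude, delay_seconds) pairs.
    """
    response = []
    for offset in offsets_hz:
        freq = carrier_hz + offset
        response.append(
            sum(a * cmath.exp(-2j * math.pi * freq * delay) for a, delay in paths)
        )
    return response

CARRIER_HZ = 5.0e9        # assumed carrier frequency
SPACING_HZ = 312.5e3      # 802.11n subcarrier spacing
offsets = [k * SPACING_HZ for k in range(-2, 3)]

empty_room = [(1.0, 10e-9), (0.4, 25e-9)]    # hypothetical direct path and wall reflection
with_person = empty_room + [(0.2, 18e-9)]    # hypothetical extra path scattered by a body

amp_empty = [abs(h) for h in channel_response(empty_room, CARRIER_HZ, offsets)]
amp_person = [abs(h) for h in channel_response(with_person, CARRIER_HZ, offsets)]
```

Adding a single weak path changes the amplitude on every subcarrier, which is the effect that static-posture identification relies on.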

This complex interaction process explains why CSI can provide richer information than traditional RSS. By analyzing the variations in CSI, it is possible to accurately monitor and interpret physical changes in the environment, such as human movement. This fine-grained information makes CSI an ideal choice for precise environmental sensing and human identification.

3.2. WiFi Fingerprint Recognition

In WiFi sensing systems, Channel State Information is used to characterize the frequency-domain response of the wireless channel, whose variations fundamentally arise from multipath propagation effects during signal transmission. In relatively static environments, the signal propagation paths are mainly determined by fixed objects such as walls and furniture, resulting in a stable channel response. When a human subject enters the monitored area, the human body, as a complex medium, alters the signal propagation paths and energy distribution through reflection, scattering, and absorption, thereby introducing observable perturbations in CSI. Furthermore, differences in body shape, posture, and physiological structure among individuals lead to distinct modulation patterns of the wireless signal, causing the channel response to exhibit stable inter-subject variations in its statistical characteristics. Therefore, even under static postures, different individuals can still be distinguished through features such as CSI amplitude distribution, mean, and variance. This provides the theoretical foundation for WiFi-based static fingerprint identification.

In an unoccupied environment, as shown in Figure 1, the amplitude of CSI waveform is relatively stable, primarily reflecting the influence of fixed objects such as walls and furniture on the signal. In this case, the signal mainly depicts the static characteristics of the environment, providing a baseline state for observing the signal without external dynamic influences. When a person enters the monitored area, the human body, acting as a dynamic obstacle, significantly changes the signal amplitude and propagation path through absorption, reflection, and scattering processes, thus causing noticeable changes in the CSI data. The amplitude of CSI waveform of this subcarrier in the presence of a person exhibits significant volatility and irregularity, in stark contrast to the stable state when there is no person. This signal interference caused by the human body can not only be observed through the change of signal amplitude, but also can be analyzed in detail, providing an experimental basis for static posture recognition. The visualization of this interference demonstrates how the human body directly affects WiFi signals, emphasizing the potential application of signal characteristic analysis in identity recognition technology.

Through detailed analysis of the CSI data collected from each individual, Figure 2 reveals significant differences in the mean and variance of wireless signal amplitude among individuals. These differences essentially reflect the individual’s “WiFi fingerprint,” that is, the unique pattern with which each person influences the signal. Figure 2 shows the mean CSI amplitude values of different individuals in the same environment, indicating that each person has a distinct impact on the wireless signal amplitude. These differences in mean values can be regarded as an effective feature for distinguishing individuals. Figure 2 also presents the standard deviation of CSI amplitude values for each individual, which reflects how much the signal fluctuates. A higher standard deviation implies that the individual’s impact on the signal is more complex and significant, and this variability itself can serve as an important feature for identity recognition and verification.
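The mean and standard deviation statistics discussed above can be computed as in the following sketch. The recordings are synthetic Gaussian samples with hypothetical per-subject levels, not measured CSI, and the nearest-neighbour matching is only a toy stand-in used to show that such statistics carry identity information; it is not the classifier used by WiPID.

```python
import random
import statistics

def amplitude_stats(amplitudes):
    """Return (mean, population standard deviation) of one subject's CSI amplitudes."""
    return statistics.fmean(amplitudes), statistics.pstdev(amplitudes)

def nearest_subject(amplitudes, fingerprints):
    """Match a recording to the enrolled subject with the closest (mean, std) pair."""
    mean, std = amplitude_stats(amplitudes)
    return min(
        fingerprints,
        key=lambda name: (fingerprints[name][0] - mean) ** 2
        + (fingerprints[name][1] - std) ** 2,
    )

random.seed(1)
# Hypothetical (base level, jitter) per subject; synthetic, not measured data.
profiles = {"subject_a": (12.0, 0.3), "subject_b": (15.0, 0.8), "subject_c": (9.5, 0.5)}
recordings = {
    name: [random.gauss(level, jitter) for _ in range(500)]
    for name, (level, jitter) in profiles.items()
}
fingerprints = {name: amplitude_stats(samples) for name, samples in recordings.items()}

probe = [random.gauss(15.0, 0.8) for _ in range(500)]   # new recording of subject_b
match = nearest_subject(probe, fingerprints)
```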

To effectively utilize these “WiFi fingerprints”, it is crucial to develop precise feature extraction algorithms. These algorithms can identify and learn each person’s unique signal patterns from CSI data. Using machine learning techniques, especially deep learning models, complex patterns can be learned from subtle signal differences, enabling fast and accurate individual identification. As demonstrated by the research on the WiPIN [17] and WiDFF-ID [18] systems, these technologies can achieve precise identity verification based on static postures. Our work builds on these findings.

4. System Architecture

4.1. Overview of WiPID

The WiPID system consists of three parts: Data Collection, Preprocessing, and Feature Extraction using CNN. During the data reception phase, the system consists of two devices: a laptop as the receiver to capture Channel State Information (CSI) data samples and a router as the transmitter to send WiFi signal waves. The receiver processes the captured data by trimming it to create input samples. For more detailed experimental environment settings, please refer to Section 5.1. The received data is first preprocessed using a convolutional autoencoder, maintaining the original dimensions of the data, and then fed into the multi-scale feature extraction network. The multi-scale feature extraction network further enhances the model’s performance. To effectively integrate the local and global information extracted from the two branches, we improved the Efficient Channel Attention Network (ECA-Net) channel attention mechanism for feature fusion. Finally, these fused features are classified through a fully connected layer to accurately identify the users. We will discuss these key structures in detail in the following sections.

4.2. Preprocessing

Data conversion: The raw CSI data are stored in .dat format and parsed using the 802.11n CSI Tool in a Linux environment to extract amplitude information from each subcarrier and construct the initial time series. The data are then further processed in MATLAB R2023b, including:

Removing incomplete or corrupted packets to mitigate the impact of packet loss on temporal continuity;
Applying smoothing to CSI amplitude sequences to reduce random fluctuations caused by environmental noise;
Unifying sample lengths by trimming sequences to align the temporal dimension.
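The three steps above can be sketched as follows. The paper performs them in MATLAB and does not name the smoothing method, so the centered moving average, the window size, and all function names here are assumptions made for illustration.

```python
import math

def drop_bad_packets(packets, n_subcarriers):
    """Keep packets that have the expected length and only finite amplitudes."""
    return [
        p for p in packets
        if len(p) == n_subcarriers and all(math.isfinite(v) for v in p)
    ]

def moving_average(series, window=5):
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def preprocess(packets, n_subcarriers, target_len, window=5):
    """Return n_subcarriers smoothed amplitude sequences, each target_len long."""
    clean = drop_bad_packets(packets, n_subcarriers)
    if len(clean) < target_len:
        raise ValueError("not enough valid packets for one sample")
    per_subcarrier = [list(col) for col in zip(*clean)]   # packets x S -> S x packets
    smoothed = [moving_average(seq, window) for seq in per_subcarrier]
    return [seq[:target_len] for seq in smoothed]         # trim to a common length

# Six hypothetical packets with two subcarriers; one has a NaN and one is truncated.
packets = [[1.0, 2.0], [1.2, float("nan")], [0.8, 2.2], [1.0], [1.1, 1.9], [0.9, 2.1]]
sample = preprocess(packets, n_subcarriers=2, target_len=3, window=3)
```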

Remark 1.

Conventional explicit signal filtering methods (e.g., frequency-domain filtering or high-order filters) are not adopted in this study for two main reasons. First, under relatively static experimental conditions, noise is primarily introduced by device instability and minor environmental perturbations, and thus remains limited in magnitude; complex filtering yields only marginal benefits. Second, excessive filtering may suppress fine-grained signal variations induced by the human body, which are critical for identity discrimination. In addition, incorporating dedicated filtering modules often requires specialized signal processing toolchains, increasing system complexity and inference latency. Instead, this work employs a lightweight preprocessing strategy combined with subsequent deep learning models, enabling end-to-end data-driven feature learning while preserving the intrinsic characteristics of the original signals and simplifying the overall processing pipeline.

Sample construction: The input data received by the system consists of a set of CSI samples, and the experiment in this paper uses the amplitude information of the CSI samples. CSI samples can be represented as:

X = \{x_{i}\}_{i=1}^{N}, \quad x_{i} \in \mathbb{R}^{A \times S \times T}, \quad (3)

where A represents the number of antennas, S represents the number of subcarriers, and T represents the duration. Each sample records the changes in signal amplitude over time for multiple subcarriers, providing information about the WiFi channel state. This format of data facilitates subsequent processing by deep learning models and can be adapted to meet the input requirements of CNN.
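As an illustration of the A × S × T layout in Equation (3), the sketch below cuts a stream of packets into fixed-length samples. Non-overlapping windows and the sizes used (3 antennas, 30 subcarriers, 100 packets per sample) are assumptions made for the example; the paper's actual settings are given in its experimental section.

```python
def build_samples(stream, window_len):
    """Cut a packet stream into non-overlapping samples shaped A x S x T.

    `stream` is a list of packets; each packet holds A lists of S amplitudes.
    """
    samples = []
    for start in range(0, len(stream) - window_len + 1, window_len):
        window = stream[start:start + window_len]          # T packets, each A x S
        n_ant, n_sub = len(window[0]), len(window[0][0])
        samples.append([
            [[window[t][a][s] for t in range(window_len)] for s in range(n_sub)]
            for a in range(n_ant)
        ])
    return samples

def shape(sample):
    """Return (A, S, T) for one nested-list sample."""
    return len(sample), len(sample[0]), len(sample[0][0])

# Hypothetical sizes and synthetic amplitudes (value = t + a + s for easy checking).
A, S, T = 3, 30, 100
stream = [[[float(t + a + s) for s in range(S)] for a in range(A)] for t in range(250)]
samples = build_samples(stream, T)
```

With 250 packets and T = 100, two complete samples are produced and the 50 trailing packets are discarded.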

Remark 2.

The use of a three-dimensional representation (antenna–subcarrier–time), rather than flattening the data, is intended to preserve structural dependencies across spatial, frequency, and temporal dimensions. Premature dimensionality reduction or reshaping may disrupt these correlations, thereby weakening the model’s ability to capture individual-specific characteristics. Maintaining this structured representation ensures information completeness while providing more discriminative inputs for downstream models, and is widely adopted in WiFi sensing applications.

Convolutional autoencoder module (CAE): The Convolutional Autoencoder consists of three parts: the encoder, the middle layer, and the decoder. The specific details are shown in Table 1. The encoder uses two convolutional layers, each followed by Batch Normalization (BN) and ReLU activation. The middle layer employs two residual blocks and a channel attention mechanism to further extract and enhance features. The decoder includes two deconvolutional layers, each also followed by batch normalization and ReLU activation. Finally, the decoded output is restored to its original dimensions through a reshape operation. This design ensures that the integrity and consistency of the data are maintained while features are extracted. By using the Convolutional Autoencoder, redundant information in the CSI data is effectively reduced, enhancing the system’s ability to differentiate between users.
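To show how an encoder with two convolutions and a decoder with two deconvolutions can return to the input size, the sketch below tracks only the temporal length through the layers using the standard output-length formulas for 1D convolution and transposed convolution. The kernel sizes, strides, and paddings are hypothetical, since Table 1 is the authoritative source for the real configuration, and the sketch does not claim that the authors' CAE changes the temporal length at all.

```python
def conv_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

def deconv_out_len(length, kernel, stride, padding, output_padding=0):
    """Output length of a 1D transposed convolution (no dilation)."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding

# Hypothetical layer settings; the paper's Table 1 holds the real ones.
ENCODER = [dict(kernel=3, stride=2, padding=1), dict(kernel=3, stride=2, padding=1)]
DECODER = [dict(kernel=3, stride=2, padding=1, output_padding=1)] * 2

def trace_lengths(length):
    """Follow the temporal length through encoder, middle layer, and decoder."""
    trace = [length]
    for layer in ENCODER:
        length = conv_out_len(length, **layer)
        trace.append(length)
    # Residual blocks and channel attention are assumed to keep the length unchanged.
    for layer in DECODER:
        length = deconv_out_len(length, **layer)
        trace.append(length)
    return trace

trace = trace_lengths(100)   # [100, 50, 25, 50, 100]
```

Under these assumed settings a length of 100 is recovered exactly, whereas a length of 101 comes back as 104, which suggests one reason a final reshape or crop step may be needed to restore the original dimensions.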

Remark 3.

It is important to note that the CAE is not treated as a conventional preprocessing step in this work, but rather as an unsupervised feature learning module embedded within the data processing pipeline. Instead of explicitly performing normalization or denoising, the CAE operates in a manner analogous to a learnable filter, enabling feature compression and reconstruction in a data-driven fashion. This allows the model to implicitly capture denoising and redundancy reduction effects, rather than relying on hand-crafted signal processing techniques. Compared with manually designed filters, this approach adapts more effectively to variations across individuals and environments, thereby enhancing the robustness of feature representations. Furthermore, by preserving the decoding structure and restoring the original data dimensions, the method maintains structural consistency, facilitating seamless integration with subsequent feature extraction networks and improving overall system stability and scalability.

4.3. Feature Extraction Network

The preprocessed three-dimensional CSI samples are fed into the feature extraction module for representation learning and classification. As illustrated in Figure 3, the module consists of multiple components. Overall, it follows a progressive pipeline of “feature upscaling → multi-scale feature extraction → feature fusion → downsampling → classification”, enabling hierarchical extraction of discriminative information from CSI signals. Across different stages, feature representations are transformed through coordinated variations along the channel (feature) and temporal dimensions, facilitating the mapping from raw signal representations to high-level semantic features.

First, a pointwise one-dimensional convolution (PWConv) is applied to upscale the input features. This operation performs a linear projection along the channel dimension while preserving the temporal dimension, thereby increasing the feature dimensionality from its original space to a higher-dimensional representation and enhancing feature expressiveness. In CSI-based identity recognition, the original input features are relatively low-dimensional and primarily reflect raw signal variations, which are insufficient to capture subtle subject-specific differences. By projecting the input into a higher-dimensional feature space, the model is able to disentangle complex patterns and amplify discriminative cues introduced by human presence. This dimensional upscaling is therefore essential for improving the separability of different identities before subsequent multi-scale feature extraction. The corresponding formulation is given in Equation (4):

X_{PW} = PWConv(X), \quad (4)

Remark 4.

The pointwise convolution (PWConv), also known as a 1 × 1 convolution, essentially performs a linear combination across channels at each time step. Unlike standard convolutions, it does not model local temporal neighborhoods but instead focuses on channel-wise feature reparameterization, resulting in low computational overhead. In deep learning, PWConv is commonly used for channel expansion or feature projection, enabling enhanced representational capacity without introducing additional temporal dependencies. For CSI signals, the raw amplitude sequences primarily reflect physical-layer variations, whereas the upscaled representations encode richer discriminative patterns in a higher-dimensional space, such as subtle subject-specific perturbations. Compared with large-kernel convolutions for dimensional expansion, PWConv offers fewer parameters and more stable optimization, which has led to its widespread adoption in lightweight architectures such as MobileNet. In this work, it is adopted to improve feature expressiveness while maintaining computational efficiency and to provide a more discriminative representation for subsequent multi-scale feature extraction.
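A pointwise convolution can be written out directly as a per-time-step linear map across channels. The sketch below uses plain Python lists with a hypothetical 2-to-4 channel expansion and hand-picked weights; in a trained network the weights would be learned.

```python
def pointwise_conv(x, weight, bias):
    """1 x 1 convolution: x is C_in x T, weight is C_out x C_in, bias has C_out entries."""
    n_steps = len(x[0])
    out = []
    for w_row, b in zip(weight, bias):
        # Each output channel is a weighted sum of the input channels at the same time step.
        out.append([
            sum(w * x[c][t] for c, w in enumerate(w_row)) + b
            for t in range(n_steps)
        ])
    return out

x = [[1.0, 2.0, 3.0],      # input channel 0 over three time steps
     [0.0, 1.0, 0.0]]      # input channel 1
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]   # hypothetical weights
bias = [0.0, 0.0, 0.0, 1.0]

x_pw = pointwise_conv(x, weight, bias)   # 4 channels x 3 time steps
```

The temporal length is unchanged and no neighbouring time steps are mixed, which is why the operation adds channels at low cost.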

After upscaling, the data is fed into a multi-scale feature extraction module. This module includes two parallel branches: PConv [25] and MixConv [26]. PConv is used for extracting local features, while MixConv is used for extracting global features. Specifically, both branches use Star Operation to extract high-dimensional nonlinear features. Star Operation [28] is a newly proposed operation that introduces more nonlinear transformations during feature extraction, further enhancing feature representation capacity. The calculation formula for this process is shown in Equations (5) and (6):

X_{local} = PConv(X_{PW}), \quad (5)

X_{global} = MixConv(X_{PW}), \quad (6)
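The text does not spell out the Star Operation. The sketch below assumes it denotes the element-wise product of two linear projections of the same input, which is one way such an operation introduces second-order (multiplicative) terms. This reading of [28], as well as the function names and weights, is an assumption; the code is not the authors' PConv or MixConv branch.

```python
def linear(x, weight, bias):
    """Apply the same channel-mixing linear map at every time step (x is C_in x T)."""
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) + b for t in range(len(x[0]))]
        for row, b in zip(weight, bias)
    ]

def star(x, w1, b1, w2, b2):
    """Element-wise product of two linear projections of the same input."""
    left, right = linear(x, w1, b1), linear(x, w2, b2)
    return [[l * r for l, r in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]

x = [[1.0, 2.0], [3.0, 4.0]]             # 2 channels x 2 time steps
identity = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical projection weights
swap = [[0.0, 1.0], [1.0, 0.0]]
zeros = [0.0, 0.0]

y = star(x, identity, zeros, swap, zeros)   # every entry is channel0 * channel1
```

With these weights each output entry is the product of the two input channels at that time step, a feature that no single linear layer can produce.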

To better fuse features from different branches, we introduced an improved ECA-Net [27] attention mechanism. ECA-Net enhances feature representation effectiveness by adaptively learning the dependencies between different channels. In multi-branch architectures, features extracted from different branches are often complementary but not equally important. Direct concatenation or summation may therefore suppress critical information. The lightweight channel attention mechanism in ECA-Net adaptively assigns weights to each channel, improving fusion effectiveness without significantly increasing computational complexity. This design maintains model efficiency while strengthening feature expressiveness. The implementation details are provided in Section 4.4. The calculation formula for this process is shown in Equations (7) and (8):

X_{fused} = X_{local} + X_{global}

(7)

X_{ECA} = ECA-Net (X_{fused})

(8)

After feature fusion, the data is downsampled using a downsampling layer to further simplify the input data. This downsampling operation helps reduce computational complexity while retaining key information. In the experiments, we set the multi-scale feature extraction module to iterate N times (N = 3). Through multiple iterations of feature extraction, the model can fully capture deep features in the CSI signals. The calculation formula for this process is shown in Equation (9):

X_{down} = Downsample (X_{ECA})

(9)

Remark 5.

Progressive downsampling and multi-layer stacking are standard feature extraction strategies in deep learning, whose core idea is to progressively enhance semantic abstraction through “temporal compression and channel expansion”. In this model, the temporal dimension is gradually reduced through successive downsampling operations, thereby lowering computational cost, while the channel dimension is progressively increased to enable richer feature representations in higher-dimensional spaces. This design is consistent with standard CNN architectures such as ResNet and is particularly effective for capturing subject-specific variations in CSI signals, while maintaining a balance between performance and computational efficiency.
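The "temporal compression and channel expansion" pattern can be traced with a short shape calculation. In the sketch below, only the temporal length of 500 and the N = 3 stages come from this paper; the starting width of 32 channels, the channel factor of 2, and the stride of 2 are hypothetical values chosen purely for illustration.

```python
def stage_shapes(channels, length, n_stages=3, channel_factor=2, stride=2):
    """Return the (channels, length) pair before and after each stage."""
    shapes = [(channels, length)]
    for _ in range(n_stages):
        channels *= channel_factor   # channel expansion (assumed factor)
        length //= stride            # temporal compression (assumed stride)
        shapes.append((channels, length))
    return shapes


# 32 input channels is a hypothetical width; 500 is the temporal length of a sample.
shapes = stage_shapes(channels=32, length=500, n_stages=3)
for stage, (c, t) in enumerate(shapes):
    print(f"after stage {stage}: channels={c}, length={t}")
```

Under these assumed settings the temporal axis shrinks from 500 to 62 samples while the channel axis grows from 32 to 256, which is the trade-off the remark describes.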

In the final step, classification is performed by a fully connected layer based on the extracted features, resulting in the output labels.

4.4. Multi-Scale Feature Fusion Strategy

As shown in Figure 4, the multi-scale feature fusion module mainly includes local and global feature extraction and their feature fusion strategy. Overall, it is organized into three stages: local and global feature extraction, nonlinear feature enhancement, and attention-based fusion. The primary objective is to capture complementary representations at different scales and integrate them through an adaptive mechanism, thereby improving the discriminative capability of the model for CSI signals.

Remark 6.

To improve the interpretability of the feature fusion process, the proposed strategy is organized into sequential stages, namely feature extraction, preliminary fusion, and attention refinement. This design does not introduce additional complexity; instead, it explicitly decomposes the data flow, ensuring that each functional module remains logically independent and structurally coherent. Specifically, the multi-branch architecture first captures complementary local and global representations, followed by a simple yet stable element-wise fusion to establish a unified feature basis, and finally applies a lightweight attention mechanism to adaptively enhance informative channels. This progression from “separate modeling” to “progressive integration” helps avoid instability caused by direct operations in high-dimensional spaces while improving model interpretability.

Local and Global Features: Local feature extraction uses PConv [25], which restricts convolution operations to part of the channels within the input feature map, effectively minimizing redundant computation and memory access. The advantage of PConv is that it can maintain efficient feature extraction capability while reducing Floating Point Operations (FLOPs). Global feature extraction uses MixConv [26], which captures various patterns in the input feature map by mixing convolutional kernels of different sizes within a single depthwise convolution operation. MixConv can easily utilize convolutional kernels of different sizes without changing the network structure, thereby enhancing the diversity and effectiveness of feature extraction. To further enhance feature representation, a Star Operation is introduced in both branches. This operation constructs nonlinear feature interactions through element-wise multiplication, enabling higher-order combinations across channels. Specifically, features are first extracted via convolution to obtain base representations, followed by nonlinear enhancement through the Star Operation, resulting in richer implicit feature representations. This process can be interpreted as introducing more complex mappings within the original feature space, thereby improving the model’s ability to capture intricate signal patterns.
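A simplified pure-Python sketch of the three operations follows. It is a conceptual illustration rather than the WiPID code: all function names and toy values are hypothetical, each channel is filtered independently with a fixed kernel (the published PConv mixes the selected channels with learned weights, and MixConv learns one kernel per channel), and the Star Operation is reduced to the element-wise product of two feature maps.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation that keeps the input length (odd kernels)."""
    half = len(kernel) // 2
    n = len(signal)
    return [sum(kernel[j] * signal[t + j - half]
                for j in range(len(kernel)) if 0 <= t + j - half < n)
            for t in range(n)]


def partial_conv(x, kernel, n_conv):
    """PConv-style step: filter the first n_conv channels, pass the rest through."""
    return [conv1d_same(ch, kernel) if i < n_conv else list(ch)
            for i, ch in enumerate(x)]


def mix_conv(x, kernels):
    """MixConv-style step: channel group g is filtered with kernels[g]."""
    group = max(1, len(x) // len(kernels))
    return [conv1d_same(ch, kernels[min(i // group, len(kernels) - 1)])
            for i, ch in enumerate(x)]


def star(a, b):
    """Star Operation: element-wise product of two feature maps of equal shape."""
    return [[u * v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy feature map (illustrative values): 4 channels, 4 time steps.
x = [[1.0, 2.0, 3.0, 4.0],
     [1.0, 1.0, 1.0, 1.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 2.0, 2.0, 2.0]]
k3 = [1.0, 1.0, 1.0]
k5 = [1.0, 1.0, 1.0, 1.0, 1.0]

x_local = partial_conv(x, k3, n_conv=1)   # only channel 0 is filtered
x_global = mix_conv(x, [k3, k5])          # channels 0-1 use k3, channels 2-3 use k5
enhanced = star(x_local, x)               # second view is the identity, for brevity
```

The product in `star` is what introduces the higher-order interaction: each output value is a quadratic function of the input rather than a linear one.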

Feature fusion strategies: Local and global features are clearly complementary: the former capture fine-grained variations, while the latter encode overall structural information. Relying on a single type of feature is therefore insufficient to fully characterize individual differences in CSI signals, and the two types of features must be fused to construct a more comprehensive representation. During fusion, the outputs from the two branches are first aligned and combined via element-wise addition to obtain a unified feature representation. Subsequently, an improved ECA-Net attention mechanism is applied to further refine the fused features. ECA-Net models inter-channel dependencies and adaptively assigns weights to different channels, thereby emphasizing informative features while suppressing redundant ones. Compared with conventional fusion strategies such as direct concatenation or simple weighting, ECA-Net dynamically adjusts feature importance based on data characteristics, leading to more effective feature integration.

Remark 7.

The integration of the components within the feature fusion module is guided by the structural characteristics of CSI signals and the requirements of feature representation, rather than being a simple aggregation of techniques. CSI data exhibit both local fluctuations and global variations across temporal and frequency dimensions, and the influence of different individuals is typically reflected in multi-scale perturbation patterns. As a result, a single feature extraction approach is insufficient to fully capture these differences. In this context, PConv is employed to model local-sensitive features, MixConv captures global structures across scales, the Star Operation enhances representation through nonlinear feature interactions, and ECA-Net performs channel-wise adaptive selection during fusion. This combination corresponds to a hierarchical process from “local modeling” to “global perception”, followed by “nonlinear enhancement” and “discriminative selection”.

Improved ECA-Net [27] Attention Mechanism: To better integrate features from different branches, we have introduced an improved ECA-Net attention mechanism, as shown in Figure 5.

ECA-Net enhances feature representation by adaptively learning dependencies between channels, and it is normally applied as attention within a single feature extraction branch. In this study, we extend it to integrate the important parts of two different feature branches. The improved ECA-Net replaces the original Global Average Pooling (GAP) with Standard Deviation Pooling (StdPool) to further enhance the accuracy of feature fusion. Compared with GAP, StdPool better captures variation information in feature maps, which improves sensitivity and accuracy in feature selection. ECA-Net also selects its convolution kernel size adaptively: the kernel size is derived from the channel dimension, so the receptive field of the attention mechanism adjusts automatically to feature maps of different scales. The formula is as follows:

k = ψ (C) = {⌈\frac{\log_{2} (C)}{γ} + \frac{b}{γ}⌉}_{odd}

(10)

Here, ψ (C) denotes a mapping used to determine the convolution kernel size. The variable C refers to the channel dimension, while γ and b are predefined hyperparameters. The operator {⌈ \cdot ⌉}_{odd} indicates that the result is rounded up to the nearest odd integer.
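Equation (10) can be evaluated with a few lines of Python. The sketch below follows the rounding described in this section (round up, then move to the next odd integer if the result is even). The values γ = 2 and b = 1 are the defaults commonly used with ECA-Net and are an assumption here, since this section does not state the values used in WiPID; the function name is likewise hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Kernel size k = psi(C): round up, then force the result to be odd."""
    value = (math.log2(channels) + b) / gamma
    k = math.ceil(value)
    return k if k % 2 == 1 else k + 1


# Kernel sizes for a few channel widths (gamma=2, b=1 are assumed values).
for c in (16, 64, 256):
    print(c, eca_kernel_size(c))
```

With these assumed hyperparameters the kernel grows slowly with the channel count (3 for 16 channels, 5 for 64 and 256 channels), so wider feature maps receive a larger cross-channel receptive field at negligible cost.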

During the feature fusion process, the improved ECA-Net can allocate appropriate weights to blend two types of features. After feature fusion, data undergoes weight allocation through a Softmax layer, automatically selecting the most useful features from the two types for fusion. The advantage of this fusion strategy lies in its ability to fully utilize the complementary nature of local and global features. By improving the attention mechanism and multi-scale feature extraction, it enhances the model’s ability to recognize and process complex Channel State Information (CSI) signals. Experimental validation shows that this method significantly improves the classification performance of the model, particularly in handling CSI signals with high noise and complex backgrounds.
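The fusion path of Equations (7) and (8) with standard deviation pooling can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the channel-wise kernel is fixed by hand rather than learned, and a sigmoid gate is used as in the original ECA-Net, whereas the text above describes a Softmax layer for the final weight allocation, which this sketch does not reproduce.

```python
import math
import statistics


def std_pool(x):
    """Per-channel population standard deviation over time: [C][T] -> [C]."""
    return [statistics.pstdev(ch) for ch in x]


def channel_conv(desc, kernel):
    """Zero-padded 1D convolution across the channel axis (odd kernel length)."""
    half = len(kernel) // 2
    n = len(desc)
    return [sum(kernel[j] * desc[c + j - half]
                for j in range(len(kernel)) if 0 <= c + j - half < n)
            for c in range(n)]


def fuse_with_std_eca(x_local, x_global, kernel):
    # Eq. (7): element-wise addition of the two branches.
    fused = [[a + b for a, b in zip(rl, rg)] for rl, rg in zip(x_local, x_global)]
    # Channel descriptor from the standard deviation instead of the mean.
    desc = std_pool(fused)
    # Local cross-channel interaction, then a sigmoid gate in (0, 1) (assumed gate).
    weights = [1.0 / (1.0 + math.exp(-s)) for s in channel_conv(desc, kernel)]
    # Eq. (8): rescale every channel by its attention weight.
    out = [[w * v for v in row] for w, row in zip(weights, fused)]
    return out, weights


# Toy branches (illustrative values): 3 channels, 4 time steps.
x_local = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0], [0.0, 4.0, 0.0, 4.0]]
x_global = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
out, weights = fuse_with_std_eca(x_local, x_global, kernel=[0.0, 1.0, 0.0])
```

In this toy case the constant channel receives the neutral weight 0.5, while channels with larger fluctuations receive larger weights, which is the behaviour the standard deviation descriptor is intended to provide.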

5. Experimental Evaluation

We evaluate the WiPID system for authentication through a range of experiments. We first describe the experimental configuration, then evaluate the overall performance and examine the impact of different methods, systems, and sample sizes. Finally, we conduct real-time testing experiments.

5.1. Experiment Setup

The experimental setting of WiPID follows the configurations used for WiPIN [17] and WiDFF [18], and the experimental environment is designed in accordance with practical conditions to ensure data accuracy and scientific validity. In terms of hardware, a Tenda F3 router was used as the WiFi signal transmitter, while a laptop equipped with an Intel 5300 wireless network card served as the receiver. The receiver was further connected to three external antennas to enhance signal reception, thereby improving the accuracy and stability of CSI data acquisition (as shown in Figure 6).

The experiments were conducted in an indoor office environment with dimensions of approximately 3 m × 2 m. The transmitter and receiver were placed on separate tables at opposite sides of the room, with a separation distance of 2 m and a height of 1.2 m, to emulate typical indoor signal propagation conditions. Participants were instructed to stand along the perpendicular bisector of the line connecting the transmitter and receiver, ensuring consistent influence of the human body on the signal propagation path. In addition, the experimental setup was arranged near a doorway to simulate a practical “walk-through” identity authentication scenario without explicit user interaction. The experimental environment of WiPID is shown in Figure 7.

A total of 50 volunteers were recruited for the experiment, with an equal distribution of male and female participants and an age range of 18 to 30 years, ensuring diversity and representativeness of the dataset. During data collection, all participants were required to remain stationary and refrain from carrying electronic devices that could interfere with wireless signals, thereby minimizing noise in the acquired data.

During data acquisition, the receiver continuously recorded CSI data at a sampling rate of 500 Hz (in practice, higher sampling rates yield more data and potentially higher accuracy) and stored it in .dat format. Each sample contains channel state information across multiple subcarriers, including key features such as amplitude and phase, which capture the impact of the human body on WiFi signal propagation. The duration of each sample was 1–2 s, corresponding to approximately 500–1000 packets, ensuring temporal continuity. To improve data reliability, 100 valid samples were collected from each participant, along with additional redundant samples to compensate for packet loss or anomalies. Each antenna pair captured signals from 30 subcarriers, resulting in a total of 90 independent CSI subcarriers. The raw data were first preprocessed using the 802.11n CSI Tool under a Linux environment, and subsequently processed in MATLAB R2023b to ensure dataset consistency and quality. MATLAB was mainly used for data organization, format conversion, and sample construction. All data were standardized into tensors of size 3 × 30 × 500, where 3 denotes the antennas, 30 the subcarriers, and 500 the temporal length (i.e., number of packets). This standardized representation ensures consistency in model input. Detailed system settings are summarized in Table 2.
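The standardization step can be sketched in pure Python. The function name `packets_to_tensor`, the antenna-major ordering of the 90 values inside a packet, and the choice to keep the first 500 packets of a longer recording are assumptions for illustration; the actual pipeline described above uses the 802.11n CSI Tool and MATLAB.

```python
def packets_to_tensor(packets, n_antennas=3, n_subcarriers=30, n_packets=500):
    """Arrange the first n_packets CSI packets as [antenna][subcarrier][time]."""
    if len(packets) < n_packets:
        raise ValueError("sample is shorter than the required temporal length")
    width = n_antennas * n_subcarriers
    tensor = [[[0.0] * n_packets for _ in range(n_subcarriers)]
              for _ in range(n_antennas)]
    for t, packet in enumerate(packets[:n_packets]):
        if len(packet) != width:
            raise ValueError("each packet needs one value per antenna-subcarrier pair")
        for a in range(n_antennas):
            for s in range(n_subcarriers):
                # Assumed layout: antenna-major, i.e. index = a * n_subcarriers + s.
                tensor[a][s][t] = packet[a * n_subcarriers + s]
    return tensor


# Synthetic stand-in for one recording: 600 packets (about 1.2 s at 500 Hz),
# each holding 90 amplitude values. The values themselves are arbitrary.
packets = [[float(t + i) for i in range(90)] for t in range(600)]
tensor = packets_to_tensor(packets)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)
```

A recording of 1 to 2 s at 500 Hz always contains at least 500 packets, so every valid sample can be cut to the fixed 3 × 30 × 500 shape expected by the model.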

The complete dataset consists of 5394 samples, including 4249 samples for training and 1145 for testing (as shown in Table 3). During model training, data normalization was applied to improve convergence speed and generalization performance. The model was trained on an NVIDIA GeForce RTX 4060 Laptop GPU using Python 3.10 and PyTorch 2.1.1. Model training consisted of 80 epochs, optimized using the Adam optimizer with an initial learning rate of 0.001, and beta parameters of 0.9 and 0.999, respectively, ensuring stable convergence and effective optimization.
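To make the optimizer settings concrete, the following sketch performs a single Adam update on one scalar parameter using the learning rate and beta values reported above. The function name, the starting value, the gradient, and ε = 1e-8 (a common default) are assumptions for illustration; in practice the update is carried out by the PyTorch optimizer over all model parameters.

```python
import math


def adam_step(param, grad, state, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a scalar parameter; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * grad          # first-moment estimate
    v = betas[1] * v + (1 - betas[1]) * grad * grad   # second-moment estimate
    m_hat = m / (1 - betas[0] ** t)                   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, (m, v, t)


# Hypothetical parameter and gradient, used only to show the size of one step.
theta, state = 1.0, (0.0, 0.0, 0)
theta, state = adam_step(theta, grad=2.0, state=state)
print(theta)
```

After bias correction, the first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason a rate of 0.001 tends to give stable early training.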

Additionally, we conducted real-time performance testing of the model. The real-time collected raw .dat files were processed and sent in fixed sizes, then converted to .mat format and visualized. These data were input into the model in real time for predictive analysis. This process ensured the immediacy and continuity of data processing and model prediction, thereby also testing the overall system’s response speed and accuracy.

5.2. Performance Evaluation

Overall Evaluation: To evaluate the overall performance of WiPID, four metrics—accuracy, precision, recall, and F1-score—are adopted. All metrics are computed using macro-averaging to provide a balanced assessment of the classifier across different users. Accuracy reflects the ratio of correctly predicted samples to the total number of predictions, serving as a key indicator of overall model performance. Precision, in contrast, focuses on the proportion of true positive samples among those predicted as positive. Figure 8 illustrates the accuracy and precision results of WiPID. Most users achieve high accuracy and precision values, both close to 1. However, the accuracy of some users decreases, which may be due to the similarity of their signal features, causing confusion. From the analysis of accuracy and precision, it can be seen that the WiPID system has good classification performance in most cases, but the performance of the model may fluctuate in certain users or environments. To further improve the system, we can consider increasing the diversity of the dataset or improving the model to reduce the impact of these fluctuations.

Recall reflects the fraction of true positives that are successfully predicted among all real positive samples. As shown in Figure 9, most users have a good recall rate, almost close to 1, but some users have a lower recall rate, indicating that the model has a higher miss rate for these users. For example, the recall rate of the 47th user is significantly lower than that of other users, possibly due to the user’s unique signal characteristics or significant noise, making it difficult for the model to correctly identify positive class samples for this user. By further analyzing the signal characteristics of these abnormal users, we can improve the model or data preprocessing methods specifically to enhance the overall performance of the system.

The F1-score, calculated as the harmonic mean of precision and recall, reflects the overall performance of the model on positive samples. Figure 10 shows the F1-score for each person in our proposed system. In our experiments, the overall F1-score is high, but it is low for some users, possibly because the complexity of their feature data causes the model to perform poorly in both precision and recall. This indicates that the recognition performance of the model decreases for specific users or scenarios and that there is room for improvement in these cases.
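The four metrics and their macro-averaging can be written out in a few lines of Python. The sketch below uses hypothetical labels for three users and computes the macro F1-score as the mean of the per-class F1-scores, which is one common convention and an assumption about how the reported values were obtained.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,   # every user counts equally
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical predictions for three users with two samples each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
scores = macro_metrics(y_true, y_pred)
```

In this toy case one sample of user 1 is mistaken for user 2, which lowers the recall of user 1 and the precision of user 2; macro-averaging weights each user equally, so a single poorly recognized user is visible in the averaged scores.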

Comparison with other systems: In this study, we compared WiPID with several mainstream WiFi identity recognition systems, including WiWho [14], FreeSense [15], NeuralWave [16], WiPIN [17], and WiDFF [18]. Table 4 compares the processing methods of these systems. The comparison shows that most WiFi identity recognition systems adopt signal preprocessing, which complicates the processing flow. WiPID, in contrast, does not use conventional data preprocessing. Instead, it directly employs a convolutional autoencoder to replace this step, which achieves excellent results, streamlines the processing flow, and enables end-to-end model training. The feature extraction methods used by different systems also affect the final recognition performance. WiWho and WiPIN utilize statistical features, FreeSense employs the wavelet transform, while WiDFF and WiPID adopt neural-network-based feature extraction. In particular, WiPID integrates multi-scale features, enabling the system to capture more critical features and thus maintain a higher accuracy rate. By optimizing data preprocessing, feature extraction, and classification together, WiPID surpasses the other systems in multiple aspects.

Impact of other deep models: We compared the WiFi fingerprint recognition performance of several deep learning models (Transformer [34], ResNet [30], and FasterNet [25]) on the 50-person dataset, and WiPID demonstrated significant advantages in handling large-scale user data. The average recognition performance of these models is shown in Table 5, including accuracy, precision, recall, and F1-score. WiPID achieved an accuracy of 0.981, demonstrating its ability to predict correctly in the vast majority of cases. This performance is significantly better than that of Transformer and FasterNet, possibly because these two models are relatively weak in terms of feature processing and model stability. WiPID also achieved a precision of 0.983, meaning that it made almost no false positive predictions, which is crucial for applications where the cost of false alarms is high, such as security monitoring and identity verification. Furthermore, WiPID achieved a recall of 0.982, highlighting its ability to cover positive samples and greatly reducing the possibility of missed detections. This is crucial for improving user experience and system reliability, especially in scenarios requiring precise identification of each user. WiPID also achieved an F1-score of 0.982, demonstrating its balance between precision and recall, which allows the model to maintain efficient recognition performance even in complex environments. These results not only demonstrate the advantages of WiPID across the metrics but also suggest that the dataset is of high quality, since good performance is obtained even when different models are used.

Comparison of Training Sample Sizes: We trained multiple deep learning models with different numbers of training samples on the 50-person WiFi fingerprint recognition dataset. Table 6 shows the effect of training sample size and model choice on the results, comparing Transformer, ResNet, FasterNet, and our proposed WiPID, each trained with 20, 40, 60, and all samples. The performance of every model improves as the number of training samples increases. However, even with a small number of training samples, WiPID shows the slowest decrease in recognition rate and exhibits better stability. Specifically, when the sample size is 20, WiPID achieves an accuracy of 0.718, significantly higher than the other models, indicating its ability to maintain a high recognition rate with few samples. As the sample size increases to 40 and 60, the accuracy of WiPID improves to 0.759 and 0.891, respectively, reflecting its robustness under medium-sized data. When all samples are used, WiPID achieves an accuracy of 0.981, outperforming the other models.

Impact of Pooling Strategies in ECA: To validate the effectiveness of the proposed standard deviation pooling-based attention mechanism, ablation experiments are conducted by adopting different channel descriptor strategies within the ECA (Efficient Channel Attention) module. Specifically, under the same network architecture and training settings, global average pooling (GAP), global max pooling (GMP), and the proposed standard deviation pooling (STD) are respectively employed to generate channel attention weights. The model performance is then compared, and the results are presented in Table 7. The experimental results demonstrate that the pooling strategy has a significant impact on model performance. When GAP is used, the model achieves an accuracy of 0.969, showing relatively stable performance but still leaving room for improvement. Using GMP results in a slight performance degradation (accuracy of 0.961), indicating that relying solely on the maximum response is insufficient to fully characterize the feature distribution. In comparison, the model with standard deviation pooling (WiPID) achieves the best performance across all metrics, with an accuracy of 0.981 and an F1-score of 0.982, an improvement in accuracy of approximately 1.2 percentage points over GAP. This result indicates that standard deviation pooling provides stronger discriminative capability in channel attention modeling. The improvement can be attributed to the fact that, unlike traditional mean or max statistics, the standard deviation captures the dispersion and variability of feature distributions. In WiFi-based identity recognition tasks, the influence of different individuals on wireless signals is typically reflected in subtle yet consistent dynamic variations, which are better characterized by feature fluctuations.

By modeling the magnitude of such variations, standard deviation pooling enables the attention mechanism to focus more on discriminative dynamic features, thereby enhancing the overall recognition performance of the model.
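The difference between the three channel descriptors can be seen on a toy example. The two channels below are hypothetical values constructed so that their mean and maximum coincide while their dispersion differs; they are not CSI measurements, and the function name is an assumption.

```python
import statistics


def channel_descriptors(channel):
    """Summaries used by the three pooling strategies for one channel."""
    return {
        "gap": statistics.fmean(channel),   # global average pooling
        "gmp": max(channel),                # global max pooling
        "std": statistics.pstdev(channel),  # standard deviation pooling
    }


# Two hypothetical channels with the same mean (1.0) and the same maximum (2.0).
channel_a = [0.0, 2.0, 2.0, 0.0]
channel_b = [0.5, 0.5, 1.0, 2.0]

desc_a = channel_descriptors(channel_a)
desc_b = channel_descriptors(channel_b)
print(desc_a)
print(desc_b)
```

GAP and GMP assign identical descriptors to both channels, so an attention module built on them cannot tell the channels apart, whereas the standard deviation separates them. This is only an illustration of the mechanism; the measured gains are those reported in Table 7.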

Impact of Iteration Number N on Model Performance: To further investigate the impact of the iteration number N in the multi-scale feature extraction module on model performance, ablation experiments are conducted on the dataset of 50 subjects. Under the same experimental settings (including training strategies and hyperparameters), the iteration number is varied over N = 1, 2, 3, 4, 5, and the corresponding recognition performance is compared in Table 8. The experimental results demonstrate that the iteration number has a significant impact on model performance. When N = 1 and N = 2, the model achieves accuracies of 0.949 and 0.958, respectively, indicating relatively lower performance. This suggests that a small number of iterations is insufficient to fully extract multi-scale features, leading to limited capability in capturing individual differences in WiFi signals. When N = 3, the model achieves the best performance. At this point, multi-scale features are sufficiently fused, allowing the model to strike a good balance between feature representation capability and model complexity, thereby significantly improving recognition performance. However, when the iteration number further increases to N = 4 and N = 5, the model performance drops significantly, with accuracies decreasing to 0.866 and 0.812, respectively. This degradation can be attributed to two factors. On the one hand, increasing the number of iterations leads to higher model complexity and computational cost, making training more difficult. On the other hand, excessive iterations may introduce redundant features and even amplify noise, thereby weakening the discriminative ability of the learned representations. Moreover, under the same number of training epochs, the deeper model is less likely to converge sufficiently, which further degrades the final performance.

In summary, the iteration number N = 3 balances multi-scale feature extraction capability and model complexity in this study and gives the best result among the tested values.

5.3. Real-Time Person Identification Experiment

A real-time identity authentication test was carried out to examine the practical applicability of the WiPID system. The experimental configuration and environment were kept the same as those used during data collection. The data received at the receiver were sent in fixed-size chunks through the cloud to another laptop, where they were processed in real time by the Intel 5300 CSI Tool [19] and converted into inputs for the trained WiPID model. Recognition was then performed by a Python 3.10 script. During testing, we evaluated the recognition effectiveness and response speed for each user and displayed the results in a window, as shown in Figure 11. The system predicted the input samples and displayed the predicted probability and processing time in the window. Because our main focus in this experiment was the model’s recognition effectiveness, we did not undertake complex system design. We conducted real-time tests on three users selected from the dataset, and the results indicated that real-time recognition was essentially achievable.
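The fixed-size streaming loop described above can be outlined as follows. This is a simplified sketch under stated assumptions: the window of 500 packets matches the temporal length of the training samples, while `fixed_size_windows`, `run_realtime`, and `dummy_predict` are hypothetical names, and the placeholder predictor stands in for the trained WiPID model, the cloud transfer, and the CSI Tool parsing, none of which are reproduced here.

```python
import time


def fixed_size_windows(stream, window=500):
    """Yield consecutive, non-overlapping windows of `window` packets."""
    buffer = []
    for packet in stream:
        buffer.append(packet)
        if len(buffer) == window:
            yield buffer
            buffer = []   # leftover packets shorter than a window are dropped


def run_realtime(stream, predict, window=500):
    """Classify each window and record (label, probability, latency in seconds)."""
    results = []
    for chunk in fixed_size_windows(stream, window):
        start = time.perf_counter()
        label, probability = predict(chunk)
        results.append((label, probability, time.perf_counter() - start))
    return results


def dummy_predict(chunk):
    # Hypothetical stand-in for the trained model: always reports user 0.
    return 0, 1.0


# Simulated stream of 1250 packets, which yields two complete windows.
results = run_realtime(range(1250), dummy_predict)
for label, probability, latency in results:
    print(f"user={label} probability={probability:.2f} latency={latency * 1000:.3f} ms")
```

Recording the latency per window mirrors the processing time shown in the display window, and it gives a direct check on whether inference keeps pace with the packet rate.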

This experiment verified the feasibility of real-time identity authentication through WiFi fingerprint extraction.

However, it should be noted that this test was conducted under ideal dataset collection conditions. More exploration and optimization are needed to extend the application of the method. For example, the recognition effectiveness of the system may be affected when the number of recognized users increases, when there are more environmental disturbances, and when the weight or clothing of the users changes. This experiment provides an initial validation of the potential of the WiPID system for real-time recognition, but further research and improvement are necessary before widespread deployment in practical applications.

6. Conclusions

This paper proposes a static fingerprint identification system, WiPID, based on WiFi Channel State Information, aiming to achieve a device-free, non-intrusive, and highly secure identity recognition approach. The system employs a convolutional autoencoder to perform feature reconstruction and preprocessing on CSI data, and integrates a multi-scale feature extraction module with an improved ECA-Net attention mechanism. This design enables effective modeling and enhancement of fine-grained individual-specific characteristics embedded in CSI signals, thereby improving the discriminative capability of identity recognition.

In experimental evaluations, WiPID demonstrates stable and superior identification performance on a large-scale multi-user dataset. Comparative results with several mainstream deep learning methods show that the proposed approach achieves consistent improvements in accuracy, precision, recall, and F1-score, validating the effectiveness of the multi-scale feature extraction and attention mechanism design. In addition, real-time testing further indicates that the system maintains high recognition accuracy during online inference, demonstrating its practical feasibility and application potential.

Nevertheless, this study still has certain limitations. When environmental conditions change significantly (e.g., scene transitions, variations in body posture, or differences in clothing), the system performance may degrade to some extent, indicating that the current model remains sensitive to cross-domain variations. Therefore, future work will focus on exploring WiFi fingerprint modeling methods based on transfer learning and domain adaptation to improve the model’s generalization ability across different scenarios. Meanwhile, further investigation into the long-term stability of CSI-based fingerprints is required to evaluate their reliability in real-world continuous deployment settings.

Author Contributions

Conceptualization, C.W. and S.W. (Shubin Wang); Methodology, C.W.; Software, Y.L. and S.W. (Shenhujing Wang); Validation, Y.D.; Formal analysis, Y.D.; Resources, Y.D.; Data curation, Y.L.; Writing—original draft, C.W.; Writing—review and editing, S.W. (Shenhujing Wang) and S.W. (Shubin Wang); Supervision, S.W. (Shubin Wang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Startup Foundation of Guang’an Institute of Technology, grant numbers KYQD-2026-095 and KYQD-2026-070. The APC was also funded by the same foundation.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Qiu, J.; Tian, Z.; Du, C.; Zuo, Q.; Su, S.; Fang, B. A survey on access control in the age of internet of things. IEEE Internet Things J. 2020, 7, 4682–4696.
Kong, H.; Lu, L.; Yu, J.; Chen, Y.; Tang, F. Continuous authentication through finger gesture interaction for smart homes using wifi. IEEE Trans. Mob. Comput. 2020, 20, 3148–3162.
Ula, M.; Pratama, A.; Asbar, Y.; Fuadi, W.; Fajri, R.; Hardi, R. A new model of the student attendance monitoring system using rfid technology. J. Phys. Conf. Ser. 2021, 1807, 012026.
Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86.
Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699.
Han, J.; Bhanu, B. Individual recognition using gait energy image. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 28, 316–322.
Chao, H.; He, Y.; Zhang, J.; Feng, J. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8126–8133.
Fan, C.; Peng, Y.; Cao, C.; Liu, X.; Hou, S.; Chi, J.; Huang, Y.; Li, Q.; He, Z. Gaitpart: Temporal part-based model for gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14225–14233.
Gafurov, D.; Helkala, K.; Søndrol, T. Biometric gait authentication using accelerometer sensor. J. Comput. 2006, 1, 51–59.
Zhao, Y.; Zhou, S. Wearable device-based gait recognition using angle embedded gait dynamic images and a convolutional neural network. Sensors 2017, 17, 478.
Biel, L.; Pettersson, O.; Philipson, L.; Wide, P. Ecg analysis: A new approach in human identification. IEEE Trans. Instrum. Meas. 2001, 50, 808–812.
Liu, X.; Si, Y.; Yang, W. A novel two-level fusion feature for mixed ecg identity recognition. Electronics 2021, 10, 2052.
Zeng, Y.; Pathak, P.H.; Mohapatra, P. Wiwho: Wifi-based person identification in smart spaces. In Proceedings of the 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Vienna, Austria, 11–14 April 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–12.
Xin, T.; Guo, B.; Wang, Z.; Wang, P.; Lam, J.C.K.; Li, V.; Yu, Z. Freesense: A robust approach for indoor human detection using wi-fi signals. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–23.
Pokkunuru, A.; Jakkala, K.; Bhuyan, A.; Wang, P.; Sun, Z. Neuralwave: Gait-based user identification through commodity wifi and deep learning. In Proceedings of the IECON 2018-44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 758–765.
Wang, F.; Han, J.; Lin, F.; Ren, K. Wipin: Operation-free passive person identification using wi-fi signals. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
Wu, Z.; Xiao, X.; Lin, C.; Gong, S.; Fang, L. Widff-id: Device-free fast person identification using commodity wifi. IEEE Trans. Cogn. Commun. Netw. 2022, 9, 198–210.
Halperin, D.; Hu, W.; Sheth, A.; Wetherall, D. Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Comput. Commun. Rev. 2011, 41, 53.
Iannizzotto, G.; Milici, M.; Nucita, A.; Bello, L.L. A perspective on passive human sensing with bluetooth. Sensors 2022, 22, 3523.
Siean, A.-I.; Pamparău, C.; Sluÿters, A.; Vatavu, R.-D.; Vanderdonckt, J. Flexible gesture input with radars: Systematic literature review and taxonomy of radar sensing integration in ambient intelligence environments. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 7967–7981.
Xu, J.; Li, Z.; Zhang, K.; Yang, J.; Gao, N.; Zhang, Z.; Meng, Z. The principle, methods and recent progress in rfid positioning techniques: A review. IEEE J. Radio Freq. Identif. 2023, 7, 50–63.
Thill, M.; Konen, W.; Wang, H.; Bäck, T. Temporal convolutional autoencoder for unsupervised anomaly detection in time series. Appl. Soft Comput. 2021, 112, 107751.
Kargar-Barzi, A.; Farahmand, E.; Mahani, A.; Shafique, M. Cae-cnnloc: An edge-based wifi fingerprinting indoor localization using convolutional neural network and convolutional auto-encoder. arXiv 2023, arXiv:2303.03699.
Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.-H.G. Run, don’t walk: Chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
Tan, M.; Le, Q.V. Mixconv: Mixed depthwise convolutional kernels. arXiv 2019, arXiv:1907.09595. [Google Scholar] [CrossRef]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5694–5703. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Seattle, WA, USA, 17–21 June 2012; Volume 25. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Pfeiffer, D.; Pfeiffer, F.; Rummeny, E. Advanced X-ray imaging technology. In Molecular Imaging in Oncology; Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–30. [Google Scholar]
Neupane, D.; Seok, J. A review on deep learning-based approaches for automatic sonar target recognition. Electronics 2020, 9, 1972. [Google Scholar] [CrossRef]
Feng, C.; Au, W.S.A.; Valaee, S.; Tan, Z. Received-signal-strength-based indoor positioning using compressive sensing. IEEE Trans. Mob. Comput. 2011, 11, 1983–1993. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]

Figure 1. Comparison of CSI amplitude in different environments.

Figure 2. Comparison of CSI amplitude in different environments.

Figure 3. System Architecture.

Figure 4. Feature Extraction Module.

Figure 5. Improved ECA-Net Attention Mechanism.

Figure 6. Receiving and transmitting ends of the experiment.

Figure 7. Experimental Scene.

Figure 8. Overall performance.

Figure 9. Recall of each person.

Figure 10. F1 score of each person.

Figure 11. Real-Time Display Window.

Table 1. Configuration of the Convolutional Autoencoder.

No.	Layer	Dimension	Filter	Filter Size	Stride	Padding
0	Input	$3 \times 30 \times 500$
1	Flatten	$90 \times 500$
2	Conv1d-1	$90 \times 500$	128	7	2	3
3	BatchNorm1d	$128 \times 250$
4	ReLU	$128 \times 250$
5	Conv1d-2	$128 \times 250$	256	5	2	2
6	BatchNorm1d	$256 \times 125$
7	ReLU	$256 \times 125$
8	ResidualBlock-1	$256 \times 125$
9	ChannelAttention	$256 \times 125$	256	3	1	1
10	ResidualBlock-2	$256 \times 125$
11	ConvTranspose1d-1	$256 \times 125$	128	5	2	2
12	BatchNorm1d	$128 \times 250$
13	ReLU	$128 \times 250$
14	ConvTranspose1d-2	$128 \times 250$	90	7	2	3
15	BatchNorm1d	$90 \times 500$
16	ReLU	$90 \times 500$
17	Reshape	$90 \times 500$
18	Output	$3 \times 30 \times 500$
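
The dimensions in Table 1 can be sanity-checked with the standard output-length formulas for 1-D convolution and transposed convolution. The sketch below is only an arithmetic check, not the authors' implementation. It assumes that each filter count, filter size, stride, and padding entry belongs to the convolutional layer at that stage (Conv1d-1, Conv1d-2, ConvTranspose1d-1, ConvTranspose1d-2), that dilation is 1, and that the transposed convolutions use an output padding of 1, which Table 1 does not state but which is needed to recover lengths 250 and 500 exactly.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel, stride, padding, output_padding=0):
    """Output length of a 1-D transposed convolution with dilation 1."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding


# The 3 x 30 x 500 input is flattened to 90 channels of length 500.
length = 500

# Encoder stages (hyperparameters read from Table 1).
enc1 = conv1d_out_len(length, kernel=7, stride=2, padding=3)  # Conv1d-1
enc2 = conv1d_out_len(enc1, kernel=5, stride=2, padding=2)    # Conv1d-2

# Decoder stages. output_padding=1 is an assumption, not a value from Table 1.
dec1 = conv_transpose1d_out_len(enc2, kernel=5, stride=2, padding=2, output_padding=1)
dec2 = conv_transpose1d_out_len(dec1, kernel=7, stride=2, padding=3, output_padding=1)

# (channels, length) after each convolutional stage.
shapes = [(90, length), (128, enc1), (256, enc2), (128, dec1), (90, dec2)]
print(shapes)
```

Under these assumptions the lengths follow the 500, 250, 125, 250, 500 progression listed in the Dimension column, so the decoder output can be reshaped back to $3 \times 30 \times 500$.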

Table 2. Details of the System Setup.

Setup	WiPID
Subjects	50
Transmitter	Tenda F3
Receiver	Intel 5300 802.11n NIC
Sampling frequency	500 Hz
Sample size (antenna, subcarrier, packet)	(3, 30, 500)
Sample Duration	1–2 s

Table 3. Division of the Dataset.

Sample Size	Training Dataset Size	Test Dataset Size
5394	4249	1145

Table 4. Comparison of Different Systems.

System	Activity	Preprocessing	Feature Extraction	Classification	Performance (number of subjects: accuracy)
WiWho [14]	Gait	Bandpass filter	Statistical Characteristic	Dynamic Time Warping (DTW)	2–6: 92–80%
FreeSense [15]	Gait	Principal Component Analysis (PCA)	Wavelet Transform (WT)	DTW	2–6: 94–88%
NeuralWave [16]	Gait	PCA	Neural Networks	CNN	24: 87.7% ± 2.144
WiPIN [17]	Static Posture	Butterworth filter	Statistical Characteristic	SVM	2–30: 100–92%
WiDFF-ID [18]	Static Posture	Butterworth filter, PCA	DenseNet	CNN	2–42: 100–98%
WiPID	Static Posture	Convolutional Autoencoder	Multi-scale Feature Fusion	CNN	50: 98%

Table 5. Model Performance Comparison.

Model	Accuracy	Precision	Recall	F1 Score
Transformer [34]	0.952	0.957	0.952	0.952
ResNet [30]	0.979	0.980	0.979	0.979
FasterNet [25]	0.962	0.965	0.962	0.961
WiPID	0.981	0.983	0.982	0.982
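
For readers who want to reproduce metrics of the kind reported in Table 5, the following sketch computes accuracy, precision, recall, and F1 score from a multi-class confusion matrix. It assumes macro averaging (an unweighted mean over classes); the averaging scheme used for Table 5 is not stated here, and the 3-class confusion matrix is a hypothetical toy example, not data from the experiments.

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical confusion matrix for three people, ten samples each.
toy_confusion = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
metrics = macro_metrics(toy_confusion)
print({name: round(value, 3) for name, value in metrics.items()})
```

With macro averaging, the F1 score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.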

Table 6. Model Performance with Different Sample Sizes.

Samples	Transformer [34]	ResNet [30]	FasterNet [25]	WiPID
20	0.469	0.634	0.502	0.718
40	0.649	0.668	0.728	0.759
60	0.765	0.867	0.848	0.891
All	0.952	0.979	0.962	0.981

Table 7. Impact of Pooling Strategies in ECA.

Pooling Type	Accuracy	Precision	Recall	F1-Score
GAP	0.969	0.969	0.968	0.968
GMP	0.961	0.966	0.961	0.961
STD (Ours)	0.981	0.983	0.982	0.982
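
The three pooling types compared in Table 7 each reduce every channel of a feature map to a single descriptor before channel attention is applied. The sketch below illustrates only that reduction step, under stated assumptions: the feature map is a plain list of channels, the STD descriptor is taken to be the population standard deviation, and the two-channel input is a hypothetical toy example. It is not the authors' attention module.

```python
import statistics


def channel_descriptors(feature_map):
    """Per-channel descriptors for a feature map given as a list of channels.

    gap: global average pooling (mean of each channel)
    gmp: global max pooling (maximum of each channel)
    std: standard-deviation pooling (population standard deviation, assumed)
    """
    return {
        "gap": [statistics.fmean(channel) for channel in feature_map],
        "gmp": [max(channel) for channel in feature_map],
        "std": [statistics.pstdev(channel) for channel in feature_map],
    }


# Hypothetical toy input: two channels with the same mean but different spread.
toy_feature_map = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]
descriptors = channel_descriptors(toy_feature_map)
print(descriptors)
```

In this toy case GAP gives both channels the same descriptor, while STD separates them, which shows that the two statistics carry different information about a channel.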

Table 8. Impact of Iteration Number N on Model Performance.

Iteration N	Accuracy	Precision	Recall	F1-Score
N = 1	0.949	0.952	0.949	0.949
N = 2	0.958	0.960	0.958	0.958
N = 3 (Ours)	0.981	0.983	0.982	0.982
N = 4	0.866	0.884	0.866	0.863
N = 5	0.812	0.838	0.812	0.813
