Why Partitioning Matters: Revealing Overestimated Performance in WiFi-CSI-Based Human Action Recognition

Varga, Domonkos; Cao, An Quynh

doi:10.3390/signals6040059

by Domonkos Varga * and An Quynh Cao

Nokia Bell Labs, 1082 Budapest, Hungary

* Author to whom correspondence should be addressed.

Signals 2025, 6(4), 59; https://doi.org/10.3390/signals6040059 (registering DOI)

Submission received: 23 September 2025 / Revised: 21 October 2025 / Accepted: 23 October 2025 / Published: 26 October 2025

Abstract

Human action recognition (HAR) based on WiFi channel state information (CSI) has attracted growing attention due to its contactless, privacy-preserving, and cost-effective nature. Recent studies have reported promising results by leveraging deep learning and image-based representations of CSI. However, methodological flaws in experimental protocols, particularly improper dataset partitioning, can lead to data leakage and significantly overestimate model performance. In this paper, we critically analyze a recently published WiFi-CSI-based HAR approach that converts CSI measurements into images and applies deep learning for classification. We show that the original evaluation relied on random data splitting without subject separation, causing substantial data leakage and inflated results. To address this, we reimplemented the method using subject-independent partitioning, which provides a realistic assessment of generalization ability. Furthermore, we conduct a quantitative study of post-training quantization under both correct and flawed partitioning strategies, revealing that methodological errors can conceal the true performance degradation of compressed models. Our findings demonstrate that evaluation protocols strongly influence reported outcomes, not only for baseline models but also for engineering decisions regarding model optimization and deployment. Based on these insights, we provide guidelines for designing robust experimental protocols in WiFi-CSI-based HAR to ensure methodological integrity and reproducibility.

Keywords: WiFi CSI; human action recognition; machine learning integrity

1. Introduction

Human action recognition (HAR) has gained significant attention across domains such as healthcare [1], smart homes [2], and human–computer interaction [3] due to its potential for contactless and privacy-preserving sensing. Among the various modalities, WiFi-based HAR stands out because it leverages existing wireless infrastructure and does not rely on wearable sensors or cameras. The fine-grained channel state information (CSI) extracted from WiFi signals captures subtle motion-induced variations in the wireless channel, enabling the recognition of complex human activities using commodity hardware.

Recent studies have demonstrated impressive performance by transforming CSI data into image-like representations and applying deep learning architectures. However, many of these works suffer from methodological flaws—most critically, improper dataset partitioning that mixes data from the same subjects across training and testing sets. This practice leads to data leakage, allowing the model to memorize subject-specific patterns rather than learning to generalize to unseen individuals. As a result, reported accuracies are often unrealistically high and fail to reflect performance in real-world deployments.

This paper reexamines a representative WiFi CSI-based HAR method—the paper by Shahverdi et al. [4]—and shows that such flawed partitioning can distort not only baseline accuracy but also downstream evaluations such as model compression through post-training quantization. By reimplementing the method with subject-independent partitioning, we reveal the true generalization capability of the model and quantify how leakage can conceal the degradation introduced by quantization. Our findings underline that evaluation protocols—especially dataset partitioning—play a decisive role in determining the credibility and reproducibility of CSI-based HAR research.

1.1. Contributions

The main contributions of this paper are summarized as follows:

Critical analysis of an existing WiFi-CSI-based HAR method that applies image-based processing and deep learning. We show that the original data partitioning strategy used in that work introduces significant data leakage.
Reimplementation of the original approach with a correct, subject-independent partitioning strategy and rigorous evaluation, demonstrating the actual generalization performance.
Quantitative analysis of post-training quantization under both correct and incorrect data partitioning, showing that methodological flaws can mask substantial performance degradation—in our case, resulting in an overestimation of about 32% in F1-score when data from the same subjects appeared in both training and test sets.
Comprehensive discussion and guidelines for designing robust evaluation protocols in WiFi-CSI-based HAR, helping future studies avoid similar methodological pitfalls.

While the vast majority of existing WiFi CSI-based HAR studies appear to employ non-subject-independent evaluation protocols, providing a precise quantitative estimate of this prevalence would require an exhaustive reimplementation and verification of numerous prior works, which lies beyond the scope of this study. Our intent here is to highlight the methodological risks and demonstrate their tangible impact through a representative example. We hope this work encourages the community to undertake a broader quantitative assessment of this issue in future research.

1.2. Structure of the Paper

The remainder of this paper is organized as follows. Section 2 reviews related work with a focus on previous studies in WiFi-based human activity recognition and issues of data leakage. Section 3 introduces the necessary preliminaries, including definitions. Section 4 describes the materials and methods, with particular emphasis on the dataset preparation and the experimental setup. Section 5 presents the experimental results and analysis, where we compare the outcomes obtained under different data partitioning strategies and examine the impact of additional factors such as quantization. Section 6 discusses the broader implications of our findings, highlighting methodological limitations and lessons for future research. Finally, Section 7 concludes the paper by summarizing the main contributions and outlining potential directions for further work.

2. Related Work

This section reviews research relevant to our study from two perspectives. First, we summarize the main approaches and recent advancements in WiFi-CSI-based human activity recognition. Second, we discuss studies on data leakage in machine learning research, emphasizing its impact on model evaluation and the importance of rigorous dataset partitioning strategies.

2.1. WiFi-CSI-Based HAR

WiFi CSI has increasingly been leveraged for HAR in recent years due to its non-invasive capabilities and the growing ubiquity of WiFi networks. The ability of CSI to encapsulate detailed spatial and temporal data arising from human movements makes it a promising tool for accurately discerning different human actions [5].

Pioneering WiFi-CSI-based HAR frameworks utilized statistical values, e.g., average, variance, or median, as features of absolute CSI time-series and trained classifiers to predict a human action based on a newly given CSI signal. For instance, Wang et al. [6] aggregated 30 CSI subcarrier vectors into one vector by applying a weighted moving average. Since the authors implemented a WiFi-based human fall detection system, a one-class support vector machine (SVM) [7] was used to classify the aggregated feature vector into “fall” and “not-fall” categories. A similar approach was followed by Al-qaness [8], but six features were selected both from CSI amplitude and CSI phase, and random forest [9] was utilized as a classification module. In contrast, Wang et al. [10] analyzed the moving variance of the CSI amplitude to identify human actions. In [11], discrete wavelet transform was used for feature extraction from CSI, and similarly to Wang et al. [6], an SVM was applied for HAR based on the extracted features.

In recent years, deep learning methods have become the dominant approach for WiFi-CSI-based human activity recognition, offering automated feature extraction. Deep-learning-based approaches for WiFi-CSI-based HAR can generally be categorized into two groups. The first group directly maps raw or preprocessed CSI time-series to activity classes using models such as RNNs, LSTMs, or 1D CNNs, while the second group transforms CSI sequences into time–frequency or spatial representations, effectively converting them into images and leveraging architectures originally developed for image classification. For instance, Yousefi et al. [12] applied short-time Fourier transform on a certain number of denoised CSI samples to extract features. On top of these features, a long short-term memory (LSTM) [13] was trained to predict human actions. In [14], the authors considered an advanced version of LSTM, i.e., bi-directional LSTM. Schäfer et al. [15] combined multiple LSTMs with an SVM to carry out CSI-based HAR. In contrast, Shen et al. [16] trained a 1D-CNN on CSI data to implement a keystroke recognition system. Zhang and Jiao [17] proposed a novel approach in which denoised CSI signals are transformed into images using Gramian angular field [18] representation, and these images are subsequently classified with a 2D CNN. In addition, the authors compared the performance of several pretrained architectures with that of a custom architecture trained from scratch. Moshiri et al. [19] also employed a 2D CNN but, instead of using a Gramian transform, constructed a resized pseudo-image directly from the CSI signals using MATLAB’s imagesc function. Similarly, Jawad et al. [20] generated MATLAB-based pseudo-images, but only after applying Hampel filtering to the CSI signals, and adapted pretrained networks (VGG19 [21], AlexNet [22], SqueezeNet [23]) for the classification task.

For readers seeking a broad overview of HAR across different sensing modalities—including wearable sensors, vision-based systems, radar, and WiFi-based approaches—we refer to comprehensive survey papers that cover the entire field [24,25,26,27]. In contrast, other references [28,29] focus on the categorization of WiFi-CSI-based HAR methods.

2.2. Data Leakage in Machine Learning Research

Data leakage in machine learning research refers to the inadvertent use of information in the training process that would not be available at the time of prediction, leading to overly optimistic and unreliable model performance. This phenomenon critically undermines the ability of models to generalize to unseen data, causes biased decision-making, and produces misleading analytical insights, significantly affecting the validity and applicability of machine learning results.

Data leakage occurs when a machine learning model gains access to external or future information during training that inflates its predictive accuracy artificially. This can happen due to technical errors such as applying data preprocessing steps (like scaling, normalization, or imputation) to the entire dataset before splitting into training and testing sets, thereby exposing data from the test set to the training process. Other common causes include merging external datasets that contain direct or indirect information about the target variable, improper feature selection from the full dataset rather than only the training portion, and repeated instances of the same subject or samples across training and test sets. Additionally, vulnerabilities in data integration pipelines—such as insecure data transmission, misconfigured access controls, or legacy system interactions—can introduce subtle leakage points, especially in enterprise or production environments where data flows through various systems before model training [30,31].
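
The preprocessing pitfall described above can be illustrated with a minimal sketch. The synthetic data, the 80/20 split, and the helper name `standardize_with_train_stats` are assumptions made for illustration and do not come from any of the cited works. The point is only that normalization statistics are estimated on the training portion and then reused unchanged on the test portion.

```python
import numpy as np

def standardize_with_train_stats(train, test):
    """Standardize both splits using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-12  # small constant avoids division by zero
    return (train - mean) / std, (test - mean) / std

# Synthetic feature matrix (assumption): 100 samples, 4 features.
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# Split first, then normalize. Computing mean/std on `features` as a whole
# would let test-set statistics influence the training data.
train, test = features[:80], features[80:]
train_z, test_z = standardize_with_train_stats(train, test)
```

The test split is deliberately not re-centered on its own statistics, so its standardized mean will generally differ slightly from zero.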

Kaufman et al. [32] addressed the problem of data leakage in data mining, where illegitimate information about the prediction target is unintentionally introduced into the modeling process. Further, the authors provided a formal definition of leakage, distinguishing between feature leakage (e.g., using future or target-derived features) and training sample leakage (e.g., dependencies between training and test sets). The authors presented methods to avoid leakage, such as tagging data for legitimacy and enforcing a strict learn–predict separation, and discussed detection approaches based on exploratory data analysis and unexpected model performance. They also showed how causal modeling concepts help interpret complex leakage scenarios and emphasized that fixing leakage after it occurs is often difficult. Kapoor and Narayanan [33] examined how data leakage contributes to widespread reproducibility issues across scientific fields that apply machine learning. Based on a review of 22 studies covering 17 disciplines, the authors found data leakage in at least 294 papers, often leading to overly optimistic results. They introduced a taxonomy of eight types of leakage (i.e., [L1] a lack of a clean separation of training and test datasets, [L1.1] no test set, [L1.2] pre-processing on training and test sets, [L1.3] feature selection on training and test sets, [L1.4] duplicates in datasets, [L2] models using features that are not legitimate, [L3] a test set that is not drawn from the distribution of scientific interest, [L3.1] temporal leakage, [L3.2] non-independence between training and test samples, and [L3.3] a sampling bias in the test distribution) and proposed model info sheets as a practical tool for researchers to detect and prevent such issues.
A case study on civil war prediction demonstrated that, once leakage is corrected, complex machine learning models perform no better than traditional statistical methods, highlighting the need for stricter methodological practices to ensure reliable scientific findings.

3. Preliminaries

This section provides the background knowledge necessary to understand the proposed method. First, we introduce WiFi channel state information, which represents the physical-layer characteristics of wireless communication channels and is widely used for contactless human activity recognition. Next, we present Canny edge detection, a classical image processing technique that plays an important role in enhancing feature extraction from CSI-derived representations. Finally, we describe post-training quantization, which reduces the numerical precision of a trained model for deployment on resource-constrained devices.

3.1. WiFi Channel State Information

In WiFi-based HAR, the most commonly used physical-layer feature is the channel state information (CSI), which characterizes the wireless channel at the time of transmission. Modern WiFi standards based on orthogonal frequency division multiplexing (OFDM) estimate a complex channel frequency response for each subcarrier [34]. In multiple-input multiple-output (MIMO) systems, the CSI describes how the transmitted signal is affected by propagation phenomena such as reflection, scattering, and fading. Knowing the CSI enables adaptive transmission strategies to achieve higher data rates and reliable communication [35]. If $T (f_{n}, t)$ and $R (f_{n}, t)$ represent the transmitted and received signals at the $n$th subcarrier frequency $f_{n}$ and time $t$, respectively, the received signal can be expressed as

$R (f_{n}, t) = T (f_{n}, t) \times \mathrm{CSI} (f_{n}, t) + W,$

(1)

where $W$ denotes the channel noise. The CSI is represented as a complex matrix:

$H_{k} = \begin{pmatrix} h_{1, 1} & \dots & h_{1, r} \\ \vdots & \ddots & \vdots \\ h_{t, 1} & \dots & h_{t, r} \end{pmatrix}$

(2)

where $t$ and $r$ are the numbers of transmit and receive antennas, respectively. Each element is a complex value:

$h_{t, r} = | h_{t, r} | e^{j \angle h_{t, r}}$

(3)

containing both the amplitude $| h_{t, r} |$ and the phase $\angle h_{t, r}$ of the propagation path between the $t$th transmit antenna and $r$th receive antenna.

Compared with coarse-grained metrics, such as the received signal strength indicator (RSSI), CSI provides fine-grained information by measuring the channel frequency response at each subcarrier and antenna pair. This enables CSI to capture subtle multi-path variations caused by human motion, such as arm swings, walking, or even chest movements during breathing. These small changes induce measurable perturbations in both the amplitude and phase of CSI, which can be exploited to distinguish different activities without the need for wearable sensors or cameras. Because CSI can be obtained using commodity WiFi hardware (e.g., Intel 5300 NICs or Raspberry Pi boards with Nexmon firmware), it offers a low-cost and privacy-preserving solution for contactless human activity recognition [36]. Each captured WiFi frame yields a set of CSI values across subcarriers, antennas, and time. These values are typically preprocessed (e.g., noise filtering, normalization, dimensionality reduction) and then fed into machine learning or deep learning models to classify human activities accurately.
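
As a small illustration of the amplitude and phase decomposition in Equation (3), the following sketch operates on a synthetic complex matrix. The random values and the array shape (600 time samples by 52 subcarriers, chosen to resemble one segment of the dataset used later) are assumptions; real CSI would come from a capture tool rather than a random generator.

```python
import numpy as np

# Synthetic stand-in for one CSI segment (assumption): 600 time samples
# by 52 subcarriers, each entry a complex channel coefficient.
rng = np.random.default_rng(0)
csi = rng.normal(size=(600, 52)) + 1j * rng.normal(size=(600, 52))

amplitude = np.abs(csi)   # |h|
phase = np.angle(csi)     # angle of h, in radians

# Recombining amplitude and phase recovers the complex values (Equation (3)).
reconstructed = amplitude * np.exp(1j * phase)
```

Many CSI-based pipelines, including the image-based ones discussed in this paper, work mainly with the amplitude matrix, since raw phase measured on commodity hardware is typically affected by hardware offsets.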

3.2. Canny Edge Detection

Edge detection is a fundamental technique in computer vision, aiming to identify significant local changes in intensity within an image, which often correspond to object boundaries or salient structural information [37]. Among various edge detection algorithms, the Canny edge detector, proposed by John F. Canny in 1986, remains one of the most widely used due to its optimal performance under a set of criteria [38]. The Canny edge detection algorithm involves the following sequential steps.

Noise reduction. The input image I is smoothed using a Gaussian filter $G_{σ}$ to suppress noise:

$I_{s} (x, y) = I (x, y) * G_{σ} (x, y),$

(4)

where ∗ denotes the convolution and

$G_{σ} (x, y) = \frac{1}{2 π σ^{2}} e^{- \frac{x^{2} + y^{2}}{2 σ^{2}}}$

(5)

with $σ$ being the standard deviation of the Gaussian kernel.
Gradient calculation. The gradients in the x and y directions— $G_{x}$ and $G_{y}$ —are computed, typically using Sobel operators. The gradient magnitude M and direction $Θ$ at each pixel are then calculated as

$M = \sqrt{G_{x}^{2} + G_{y}^{2}},$

(6)

$Θ = arctan (\frac{G_{y}}{G_{x}}) .$

(7)
Non-maximum suppression. The algorithm suppresses all gradient magnitudes that are not local maxima along the gradient direction, resulting in thin edges.
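Double thresholding. Each remaining pixel is labeled as a strong edge, a weak edge, or a non-edge by comparing its gradient magnitude with two thresholds:

$M (x, y) = \begin{cases} \text{strong}, & M (x, y) \geq T_{H} \\ \text{weak}, & T_{L} \leq M (x, y) < T_{H} \\ \text{zero}, & M (x, y) < T_{L}, \end{cases}$

(8)

where $T_{H}$ and $T_{L}$ are the high and low thresholds, respectively.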
Edge tracking by hysteresis. Weak edges are retained only if they are connected to strong edges, ensuring edge continuity and discarding isolated responses.
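
The last two steps of the list above (double thresholding and edge tracking by hysteresis) can be sketched in a few lines. This is a simplified illustration, not a full Canny implementation: smoothing, gradient computation, and non-maximum suppression are omitted, the function names are hypothetical, and the toy magnitude map and threshold values are assumptions.

```python
import numpy as np

def double_threshold(magnitude, t_low, t_high):
    """Label pixels as strong or weak according to the two thresholds."""
    strong = magnitude >= t_high
    weak = (magnitude >= t_low) & ~strong
    return strong, weak

def hysteresis(strong, weak):
    """Keep weak pixels that are 8-connected to a strong pixel,
    directly or through a chain of other weak pixels."""
    edges = strong.copy()
    height, width = edges.shape
    while True:
        padded = np.pad(edges, 1)  # pads with False
        neighbor = np.zeros_like(edges)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                if dy == 1 and dx == 1:
                    continue  # skip the pixel itself
                neighbor |= padded[dy:dy + height, dx:dx + width]
        grown = edges | (weak & neighbor)
        if np.array_equal(grown, edges):
            return edges
        edges = grown

# Toy gradient-magnitude map (assumption).
magnitude = np.array([
    [0.9, 0.5, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
])
strong, weak = double_threshold(magnitude, t_low=0.3, t_high=0.8)
edges = hysteresis(strong, weak)
```

In this toy map, the weak pixel next to the strong one is retained, while the isolated weak pixel in the bottom-right corner is discarded.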

3.3. Post-Training Quantization

Modern deep neural networks often require millions of parameters and substantial computational resources, making their deployment challenging on embedded systems and edge devices with limited memory and processing power. Quantization addresses this issue by representing model parameters and, in some cases, intermediate activations with reduced numerical precision [39].

In post-training quantization (PTQ), quantization is applied to a fully trained model without modifying the training process. Let $W \in \mathbb{R}^{n}$ denote a vector of trained model weights. In PTQ, each weight $w \in W$ is mapped to a discrete set of quantized values $\hat{W}$, typically represented by integers of lower bit-width, such as 8-bit signed integers:

$\hat{w} = \mathrm{round} \left(\frac{w}{s}\right),$

(9)

$s = \frac{\max (W) - \min (W)}{2^{b} - 1},$

(10)

where $b$ is the number of bits (e.g., $b = 8$), and $s$ is a scale factor used for dequantization during inference:

$w \approx s \cdot \hat{w} .$

(11)

This transformation reduces the model size approximately by a factor of $\frac{32}{b}$ compared to standard 32-bit floating-point representations.

Quantization can be applied to different parts of the network:

Weights only (static quantization): All model parameters are quantized, while activations remain in floating point.
Weights and activations (full quantization): Both parameters and intermediate values are quantized, which further reduces latency and memory requirements.

However, quantization introduces quantization error, defined as

$E_{q} = \| W - s \cdot \hat{W} \|_{2},$

(12)

which can degrade the model’s prediction accuracy. The severity of this degradation depends on model structure, input data distribution, and quantization parameters.
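
Equations (9) to (12) can be tried out on a synthetic weight vector. The sketch below is a minimal illustration under stated assumptions: the weights are random numbers rather than weights of the evaluated CNN, the function names are hypothetical, and no zero-point offset or clipping is applied because the equations above do not include one.

```python
import numpy as np

def quantize(weights, bits=8):
    """Uniform quantization following Equations (9) and (10)."""
    scale = (weights.max() - weights.min()) / (2 ** bits - 1)
    w_hat = np.round(weights / scale).astype(np.int32)
    return w_hat, scale

def dequantize(w_hat, scale):
    """Approximate reconstruction of the weights, Equation (11)."""
    return scale * w_hat

# Synthetic weights (assumption): 1000 values with a standard deviation of 0.1.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=1000)

# Quantization error E_q from Equation (12) at several bit-widths.
errors = {}
for bits in (8, 4, 2):
    w_hat, scale = quantize(weights, bits)
    errors[bits] = float(np.linalg.norm(weights - dequantize(w_hat, scale)))
```

The error grows as the bit-width shrinks, because the scale factor (the spacing between representable values) becomes larger. How much of this weight-level error translates into lost accuracy can only be judged on a properly partitioned test set.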

PTQ is attractive because it does not require retraining and can be applied to any pre-trained network, but its effectiveness must be evaluated on realistic benchmarks. Inaccurate evaluation protocols (e.g., those with data leakage between training and test sets) can underestimate the actual performance loss caused by quantization, leading to overly optimistic conclusions about model deployability on constrained devices.

4. Materials and Methods

This section details the resources and methodologies used in this study. First, we introduce the dataset employed in our experiments, including its collection process and structure. Next, we describe the proposed method for processing raw CSI data and training the classification model. Finally, we present the detected data leakage issue in the original experimental setup, highlighting its implications for model evaluation and generalization.

4.1. Applied Database

The CSI-HAR dataset was collected by Moshiri et al. [19] for WiFi-based HAR using CSI. Data acquisition was performed in an indoor environment with a Raspberry Pi 4 running the Nexmon CSI tool [40] and a TP-Link Archer C20 access point operating at 5 GHz and 20 MHz bandwidth. Seven daily activities (walking, running, falling, lying down, sitting down, standing up, and bending) were carried out by three volunteers of different ages, each repeated 20 times, resulting in 420 activity samples. CSI data were captured in monitor mode as raw packets and segmented based on activity duration, yielding matrices with 52 subcarrier columns and 600–1100 time samples depending on the action performed. No pre-processing or filtering was applied to preserve signal characteristics; instead, the matrices were normalized. The dataset is publicly available on GitHub and is intended to support research on low-cost, real-time human activity recognition for healthcare and smart home applications. Figure 1 illustrates the actions of the CSI-HAR dataset [19] with pseudocolor images.

4.2. Proposed Method

Figure 2 illustrates Shahverdi et al.’s [4] CSI data preprocessing steps. First, principal component analysis (PCA) is applied to the raw CSI data to suppress noise by retaining only the most informative components. This step helps highlight dominant patterns in the CSI variations while discarding redundant or irrelevant information. The normalized, reduced-dimension data are then transformed into pseudocolor RGB images using MATLAB R2022b’s imagesc function, after which the Canny edge detection algorithm is applied to emphasize structural patterns in the signal representation. Finally, the edge-detected images are resized to the required input dimensions ($64 \times 64$) for the deep learning model and stored with their corresponding activity labels.
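
The first stage of this pipeline (PCA followed by normalization) can be sketched as follows. This is an illustration under assumptions rather than the original implementation: the number of retained components, the min-max normalization, the function names, and the synthetic input are all hypothetical choices, and the later stages (pseudocolor rendering, Canny edge detection, resizing) are not shown.

```python
import numpy as np

def pca_reduce(csi_amplitude, n_components):
    """Project a (time x subcarrier) matrix onto its top principal components."""
    centered = csi_amplitude - csi_amplitude.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    variances = singular_values ** 2
    explained = variances[:n_components].sum() / variances.sum()
    return scores, explained

def min_max_normalize(x):
    """Rescale values to the range [0, 1] before rendering them as an image."""
    return (x - x.min()) / (x.max() - x.min())

# Synthetic amplitude matrix (assumption): 600 time samples by 52 subcarriers.
rng = np.random.default_rng(0)
csi_amplitude = rng.normal(size=(600, 52))

scores, explained = pca_reduce(csi_amplitude, n_components=10)
image_data = min_max_normalize(scores)
```

In a leakage-free protocol, any statistic that is shared across samples would have to be estimated on training subjects only. The per-sample PCA shown here avoids that concern because it uses no information from other recordings.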

Figure 3 depicts the structure of the CNN implemented by Shahverdi et al. [4]. It consists of two convolutional layers, each followed by leaky rectified linear unit (ReLU) activations to introduce non-linearity and mitigate vanishing gradient problems. Max pooling layers (kernel size of $2 \times 2$ and stride of 1) follow the convolutional stages to reduce spatial dimensions and introduce translation invariance, helping the model focus on feature presence rather than precise location. After convolution and pooling, the resulting feature maps are flattened into a one-dimensional vector and passed through two fully connected dense layers, which synthesize the extracted features. The final output layer has one neuron per activity class and delivers the probabilistic classification of observed human activities. Additional regularization techniques such as dropout (the dropout rate was set to 0.2 for the convolutional layers and 0.1 for the fully connected layers) and batch normalization are integrated to prevent overfitting and ensure stable learning, which is particularly important given the limited size and noisy nature of CSI datasets.

As the original publication [4] did not provide detailed specifications of the training procedure, the Adam optimizer was adopted in our implementation. The remaining training parameters are summarized in Table 1. The implementation of Shahverdi et al.’s method [4] was carried out using Python 3.11.5 and PyTorch 2.5.1 [42]. The hardware configuration is provided separately in Table 2.

4.3. Detected Data Leakage

The presence of data leakage was first suspected when we saw in the paper of Shahverdi et al. [4] that the model achieved an unusually high test accuracy, comparable to the performance typically observed on well-structured natural image datasets such as CIFAR-10 [44,45]. Given that CSI-derived images are inherently noisy, low-contrast, and contain subtle, overlapping motion patterns rather than clearly separable object categories, such performance levels appeared implausible. This discrepancy suggested that the model might have exploited subject-specific cues rather than learning generalizable activity features. Subsequent inspection of the dataset confirmed that samples from the same individuals appeared in both the training and test sets, validating the presence of data leakage.

A critical methodological issue was identified in the way the dataset was partitioned into training and testing subsets. The authors did not perform the split on a per-subject basis; instead, samples from the same individual appear in both the training and the test sets—illustrated in Figure 4. Since the raw CSI signals inherently contain person-specific characteristics—such as body shape, walking style, or other idiosyncratic movement patterns—the model is able to learn and exploit these features rather than focusing solely on the activity patterns it is intended to recognize. This results in a substantial overestimation of classification performance, as the model is indirectly “seeing” information during training that it should not have access to at inference time.

Subject-independent partitioning (shown in Figure 5), in which all samples from a given individual are assigned to either the training set or the test set but never both, is a well-established requirement in human activity recognition research, precisely to avoid this form of leakage. By not enforcing it, the reported accuracy reflects the model’s ability to re-identify participants rather than to generalize to unseen individuals. Consequently, the results presented in the paper are likely inflated and may not be reproducible in real-world scenarios where the system must classify activities of new users. This issue represents a form of feature leakage, where unintended identity cues are leaked from the training set into the test set, undermining the validity of the experimental evaluation.
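
A subject-independent split can be expressed in a few lines once each sample carries a subject identifier. The sketch below mirrors the dataset dimensions given in Section 4.1 (three subjects, seven activities, 20 repetitions). The subject IDs, the activity label strings, the function name, and the choice of which subject to hold out are assumptions for illustration.

```python
from itertools import product

ACTIVITIES = ["walk", "run", "fall", "lie down", "sit down", "stand up", "bend"]
SUBJECTS = [1, 2, 3]   # hypothetical subject identifiers
REPETITIONS = 20

# One record per sample: (subject_id, activity, repetition). 3 * 7 * 20 = 420.
samples = list(product(SUBJECTS, ACTIVITIES, range(REPETITIONS)))

def subject_independent_split(samples, test_subjects):
    """Assign every sample of a subject to exactly one side of the split."""
    test_subjects = set(test_subjects)
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test

# Hold out one subject entirely (which one is an arbitrary choice here).
train, test = subject_independent_split(samples, test_subjects=[3])

train_subjects = {s[0] for s in train}
held_out_subjects = {s[0] for s in test}
```

A random split over `samples`, by contrast, would almost certainly place recordings of every subject on both sides. Repeating the split with each subject held out in turn gives a leave-one-subject-out protocol.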

5. Experimental Results and Analysis

This section presents the experimental evaluation of the proposed approach. We first describe the metrics used to assess classification performance, ensuring a comprehensive and interpretable comparison across different experimental settings. Next, we report and analyze the numerical results, focusing on the impact of data partitioning strategies and other relevant factors, such as quantization, on model accuracy and generalization.

5.1. Evaluation Metrics

To assess the performance of the classification models, standard evaluation metrics are employed. Let $TP$, $TN$, $FP$, and $FN$ denote the number of true positives, true negatives, false positives, and false negatives, respectively. The metrics are defined as follows.

Accuracy is the ratio of correctly classified instances to the total number of instances:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} .$

(13)
Precision is the proportion of correctly predicted positive samples among all samples predicted as positive:

$\mathrm{Precision} = \frac{TP}{TP + FP} .$

(14)
Recall is the proportion of correctly predicted positive samples among all actual positive samples:

$\mathrm{Recall} = \frac{TP}{TP + FN} .$

(15)
F1-score is the harmonic mean of precision and recall:

$\mathrm{F1\text{-}score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} .$

(16)

Since the dataset used in this study is class-balanced, macro-averaged values of the above metrics provide a fair and reliable measure of classification performance. Nevertheless, it is still important to report precision, recall, and F1-score in addition to accuracy, as these metrics capture complementary aspects of performance and highlight the trade-offs between different types of classification errors.
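
For a multi-class task, these quantities are usually derived per class from a confusion matrix and then averaged. The following sketch shows one common convention (macro F1 as the mean of per-class F1-scores, and accuracy as the overall fraction of correct predictions). The function name and the toy three-class confusion matrix are assumptions and do not correspond to any result in this paper.

```python
import numpy as np

def macro_metrics(confusion):
    """confusion[i, j] = number of samples of true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp   # belonging to the class, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / confusion.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

# Toy class-balanced confusion matrix (assumption): 10 samples per class.
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 1, 9]]
accuracy, precision, recall, f1 = macro_metrics(confusion)
```

With balanced classes, macro-averaged recall coincides with overall accuracy, which is one reason the remaining metrics are worth reporting separately. The sketch does not guard against classes that are never predicted, which would cause a division by zero in the precision term.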

5.2. Numerical Results

In this section, we report the numerical results obtained under two different data partitioning strategies. First, we consider the results based on the subject-independent split, which represents the correct experimental protocol and provides a realistic estimate of model performance on unseen users. Second, we examine the results obtained with the subject-overlapping split, as originally used in the paper, which inadvertently introduces data leakage by allowing samples from the same individuals to appear in both training and test sets. By comparing these two evaluation settings, we are able to quantify the extent to which the reported performance is inflated due to the flawed partitioning. In addition, we also analyze further factors influencing the results, such as the effect of model quantization.

Table 3 summarizes the accuracy and F1-score reported in the original publication, along with the results obtained from our re-implementation. When trained and evaluated without separating subjects, our results closely match those reported in the paper, confirming the correctness of the re-implementation. However, when a subject-independent data split is applied, both accuracy and F1-score drop significantly (to 60.7% and 60.4%, respectively), clearly demonstrating that the original evaluation protocol suffers from data leakage and does not reflect true model generalization to unseen users.

Figure 6 and Figure 7 present the confusion matrices for the activity classification task under two different data partitioning strategies. In Figure 6, where the dataset was randomly split without separating subjects, the classification performance is consistently high across all activities, with several classes achieving perfect accuracy. This apparent robustness, however, reflects the model’s exploitation of subject-specific characteristics rather than its ability to generalize to unseen individuals. In contrast, Figure 7, which corresponds to the subject-independent split, reveals a substantial drop in classification performance, particularly for activities such as bend, fall, and run. Misclassifications become more frequent, and confusion between similar activities is clearly visible. This stark performance difference between the two settings confirms that the original evaluation protocol was affected by data leakage. The subject-independent results provide a more realistic measure of the model’s generalization capability, demonstrating that the performance reported in the original paper was significantly overestimated.

Figure 8 and Figure 9 illustrate the impact of post-training quantization with and without taking human-based separation into account during training. In Figure 8, where the network was trained without considering separation by human subjects (incorrect approach), the model achieves high accuracy in full precision (FP32) and maintains stable performance down to 5-bit quantization (around 92.9%). However, the accuracy drops significantly at 4 bits and below, with extreme degradation at 2 bits (46.4%) and 1 bit (14.3%). In contrast, Figure 9 shows the results when the training was correctly performed with subject-based separation. Here, the FP32 baseline accuracy is considerably lower (60.7%), and accuracy decreases gradually as quantization bits are reduced. The degradation pattern is smoother compared to Figure 8, but the overall accuracy is consistently lower, reaching only 52.9% at 3 bits and collapsing at 2 and 1 bits (17.1% and 14.3%, respectively). Overall, the comparison demonstrates that ignoring human-based separation during training may lead to artificially inflated accuracy in both FP32 and higher-bit quantization, while the correct subject-based separation yields lower but more reliable and realistic performance across quantization levels.

Figure 10 and Figure 11 compare the training dynamics under two different data partitioning strategies: without considering subject identities (Figure 10) and with subject-aware partitioning (Figure 11). In the subject-unaware setting, the model quickly achieves very high test accuracy, closely following the training accuracy curve. However, this is accompanied by suspiciously low test loss values, suggesting that the model may be exploiting subject-specific information rather than genuinely learning the underlying actions. In contrast, when subject identities are taken into account, the gap between training and test performance becomes more pronounced. Although the overall test accuracy is lower, the results provide a more realistic estimate of the model’s generalization ability, reflecting the true difficulty of the task. This comparison highlights the importance of subject-aware data partitioning in avoiding data leakage and ensuring a fair evaluation of model performance.

6. Discussion

Our results demonstrate that improper data partitioning, specifically the lack of subject separation, leads to severe data leakage in WiFi-CSI-based human action recognition. This not only inflates baseline performance but also obscures the true effects of engineering decisions, such as post-training quantization. In contrast, subject-independent partitioning yields lower but realistic performance, offering a fair assessment of generalization to unseen users [46]. These findings emphasize the need for rigorous evaluation protocols in WiFi-CSI-based HAR. Researchers should consistently apply subject-independent splits, report training dynamics, and consider complementary analyses such as confusion matrices or ground truth versus predicted plots. Establishing such practices will help prevent misleading conclusions, improve reproducibility, and ensure that future research in this domain produces results that are both reliable and practically relevant.
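The difference between the two partitioning protocols can be sketched in a few lines. The example below is a minimal illustration, assuming that each sample carries a subject identifier; the metadata layout (three subjects, seven activities, 20 repetitions each) and all function names are hypothetical and serve only to show how leakage can be detected.

```python
import random


def random_split(samples, test_fraction=0.2, seed=0):
    """Flawed protocol: shuffle all samples, ignoring who produced them."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def subject_split(samples, test_subjects):
    """Subject-independent protocol: a subject appears on one side only."""
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Subjects present in both sets; a non-empty result signals leakage."""
    return {s["subject"] for s in train} & {s["subject"] for s in test}


# Hypothetical toy metadata: 3 subjects x 7 activities x 20 repetitions.
activities = ["bend", "fall", "lie down", "run", "sit down", "stand up", "walk"]
samples = [
    {"subject": subject, "activity": activity, "trial": trial}
    for subject in (1, 2, 3)
    for activity in activities
    for trial in range(20)
]

rand_train, rand_test = random_split(samples)
subj_train, subj_test = subject_split(samples, test_subjects={3})

print("random split, shared subjects:", shared_subjects(rand_train, rand_test))
print("subject split, shared subjects:", shared_subjects(subj_train, subj_test))
```

A check such as `shared_subjects` is cheap to run before training and could be reported alongside the results. With more than three subjects, the same idea extends to leave-one-subject-out cross-validation.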

Although our analysis focuses on WiFi CSI-based human action recognition, similar methodological vulnerabilities may arise in other radio-frequency sensing modalities, such as radar or ultra-wideband systems. These technologies also capture user-dependent signal patterns, so improper partitioning can similarly inflate performance estimates. Addressing such cross-modal reproducibility concerns would help ensure methodological integrity across the broader RF sensing community.

7. Conclusions

This paper critically examined a WiFi-CSI-based human action recognition method and revealed that subject-unaware data partitioning introduces severe data leakage, leading to unrealistic performance estimates. By reimplementing the approach with subject-independent partitioning, we provided a more reliable assessment of generalization and highlighted how flawed protocols can also distort the evaluation of model compression techniques such as quantization. To ensure integrity and reproducibility, future research should adopt rigorous subject-aware evaluation protocols and openly share datasets and code.

Author Contributions

Conceptualization, D.V.; methodology, D.V.; software, D.V. and A.Q.C.; validation, D.V. and A.Q.C.; formal analysis, D.V. and A.Q.C.; investigation, D.V. and A.Q.C.; resources, D.V.; data curation, D.V. and A.Q.C.; writing—original draft preparation, D.V.; writing—review and editing, D.V.; visualization, A.Q.C.; supervision, D.V.; project administration, D.V.; funding acquisition, D.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study were derived from the CSI-HAR database, available for download at https://github.com/parisafm/CSI-HAR-Dataset (accessed on 19 September 2025).

Acknowledgments

We would like to express our sincere gratitude to our colleague Krisztián for his invaluable assistance and expertise in GPU computing. His guidance and support have been instrumental in optimizing our computational workflows and accelerating the progress of this research project. We would like to express our heartfelt gratitude to the entire team of Nokia Bell Labs, Budapest, for fostering an environment of collaboration, support, and positivity throughout the duration of this project.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CPU	central processing unit
CSI	channel state information
FN	false negative
FP	false positive
GPU	graphics processing unit
HAR	human action recognition
LSTM	long short-term memory
MIMO	multiple-input multiple-output
OFDM	orthogonal frequency division multiplexing
PCA	principal component analysis
PTQ	post-training quantization
ReLU	rectified linear unit
SVM	support vector machine
TN	true negative
TP	true positive

References

1. Lei, Z.; Rong, B.; Jiahao, C.; Yonghong, Z. Smart City Healthcare: Non-Contact Human Respiratory Monitoring with WiFi-CSI. IEEE Trans. Consum. Electron. 2024, 70, 5960–5968.
2. Jiang, H.; Cai, C.; Ma, X.; Yang, Y.; Liu, J. Smart home based on WiFi sensing: A survey. IEEE Access 2018, 6, 13317–13325.
3. Zhang, R.; Jiang, C.; Wu, S.; Zhou, Q.; Jing, X.; Mu, J. Wi-Fi sensing for joint gesture recognition and human identification from few samples in human-computer interaction. IEEE J. Sel. Areas Commun. 2022, 40, 2193–2205.
4. Shahverdi, H.; Moshiri, P.F.; Nabati, M.; Asvadi, R.; Ghorashi, S.A. A csi-based human activity recognition using canny edge detector. In Human Activity and Behavior Analysis; CRC Press: Boca Raton, FL, USA, 2024; pp. 67–82.
5. Miao, F.; Huang, Y.; Lu, Z.; Ohtsuki, T.; Gui, G.; Sari, H. Wi-Fi sensing techniques for human activity recognition: Brief survey, potential challenges, and research directions. ACM Comput. Surv. 2025, 57, 1–30.
6. Wang, Y.; Wu, K.; Ni, L.M. Wifall: Device-free fall detection by wireless networks. IEEE Trans. Mob. Comput. 2016, 16, 581–594.
7. Kecman, V. Support vector machines–an introduction. In Support Vector Machines: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1–47.
8. Al-qaness, M.A. Device-free human micro-activity recognition method using WiFi signals. Geo. Spat. Inf. Sci. 2019, 22, 128–137.
9. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
10. Wang, Y.; Liu, J.; Chen, Y.; Gruteser, M.; Yang, J.; Liu, H. E-eyes: Device-free location-oriented activity identification using fine-grained wifi signatures. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, Maui, HI, USA, 7–11 September 2014; pp. 617–628.
11. Chen, C.; Shu, Y.; Shu, K.I.; Zhang, H. WiTT: Modeling and the evaluation of table tennis actions based on WIFI signals. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3100–3107.
12. Yousefi, S.; Narui, H.; Dayal, S.; Ermon, S.; Valaee, S. A survey on behavior recognition using WiFi channel state information. IEEE Commun. Mag. 2017, 55, 98–104.
13. Schmidhuber, J.; Hochreiter, S. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
14. Chen, Z.; Zhang, L.; Jiang, C.; Cao, Z.; Cui, W. WiFi CSI based passive human activity recognition using attention based BLSTM. IEEE Trans. Mob. Comput. 2018, 18, 2714–2724.
15. Schäfer, J.; Barrsiwal, B.R.; Kokhkharova, M.; Adil, H.; Liebehenschel, J. Human activity recognition using CSI information with nexmon. Appl. Sci. 2021, 11, 8860.
16. Shen, X.; Ni, Z.; Liu, L.; Yang, J.; Ahmed, K. WiPass: 1D-CNN-based smartphone keystroke recognition using WiFi signals. Pervasive Mob. Comput. 2021, 73, 101393.
17. Zhang, C.; Jiao, W. Imgfi: A high accuracy and lightweight human activity recognition framework using csi image. IEEE Sens. J. 2023, 23, 21966–21977.
18. Xu, Z.; Lin, H. Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field and CNNs for Stock Return Predictions. arXiv 2023, arXiv:2310.07427.
19. Moshiri, P.F.; Shahbazian, R.; Nabati, M.; Ghorashi, S.A. A CSI-based human activity recognition using deep learning. Sensors 2021, 21, 7225.
20. Jawad, S.K.; Alaziz, M. Human Activity and Gesture Recognition Based on WiFi Using Deep Convolutional Neural Networks. Iraqi J. Electr. Electron. Eng. 2022, 18, 110–116.
21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
23. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
24. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
25. Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401.
26. Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–40.
27. Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28.
28. Wang, Z.; Jiang, K.; Hou, Y.; Dou, W.; Zhang, C.; Huang, Z.; Guo, Y. A survey on human behavior recognition using channel state information. IEEE Access 2019, 7, 155986–156024.
29. Liu, J.; Liu, H.; Chen, Y.; Wang, Y.; Wang, C. Wireless sensing for human activity: A survey. IEEE Commun. Surv. Tutor. 2019, 22, 1629–1645.
30. Apicella, A.; Isgrò, F.; Prevete, R. Don’t push the button! exploring data leakage risks in machine learning and transfer learning. arXiv 2024, arXiv:2401.13796.
31. Domnik, J.; Holland, A. On Data Leakage Prevention and Machine Learning. In Proceedings of the 35th Bled eConference Digital Restructuring and Human (Re) Action, Bled, Slovenia, 26–29 June 2022; p. 695.
32. Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in data mining: Formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data (TKDD) 2012, 6, 1–21.
33. Kapoor, S.; Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 2023, 4, 100804.
34. Ma, Y.; Zhou, G.; Wang, S. WiFi sensing with channel state information: A survey. ACM Comput. Surv. (CSUR) 2019, 52, 1–36.
35. Caire, G.; Shamai, S. On the capacity of some channels with channel state information. IEEE Trans. Inf. Theory 2002, 45, 2007–2019.
36. Guo, J.; Ho, I.W.H. CSI-based efficient self-quarantine monitoring system using branchy convolution neural network. In Proceedings of the 2022 IEEE 8th World Forum on Internet of Things (WF-IoT), Yokohama, Japan, 26 October–11 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
37. Sun, R.; Lei, T.; Chen, Q.; Wang, Z.; Du, X.; Zhao, W.; Nandi, A.K. Survey of image edge detection. Front. Signal Process. 2022, 2, 826967.
38. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
39. Nagel, M.; Fournarakis, M.; Amjad, R.A.; Bondarenko, Y.; Van Baalen, M.; Blankevoort, T. A white paper on neural network quantization. arXiv 2021, arXiv:2106.08295.
40. Gringoli, F.; Schulz, M.; Link, J.; Hollick, M. Free your CSI: A channel state information extraction platform for modern Wi-Fi chipsets. In Proceedings of the 13th International Workshop on Wireless Network Testbeds, Experimental Evaluation & Characterization, Los Cabos, Mexico, 25 October 2019; pp. 21–28.
41. Majumdar, N.; Banerjee, S. MATLAB Graphics and Data Visualization Cookbook; PACKT Publishing: Birmingham, UK, 2012.
42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
44. Çalik, R.C.; Demirci, M.F. Cifar-10 image classification with convolutional neural networks for embedded systems. In Proceedings of the 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA), Aqaba, Jordan, 28 October–1 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–2.
45. Aslam, S.; Nassif, A.B. Deep learning based CIFAR-10 classification. In Proceedings of the 2023 Advances in Science and Engineering Technology International Conferences (ASET), Dubai, United Arab Emirates, 20–23 February 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4.
46. Saupe, D.; Hahn, F.; Hosu, V.; Zingman, I.; Rana, M.; Li, S. Crowd workers proven useful: A comparative study of subjective video quality assessment. In Proceedings of the QoMEX 2016: 8th International Conference on Quality of Multimedia Experience, Lisbon, Portugal, 6–8 June 2016.

Figure 1. Example CSI images of human actions from the CSI-HAR [19] database. The 52 CSI channels were transformed to RGB images using the imagesc function of MATLAB R2022b [41]. (a) Bend. (b) Fall. (c) Lie down. (d) Run. (e) Sit down. (f) Stand up. (g) Walk.

Figure 2. Flow diagram of the preprocessing steps that prepare raw CSI data for the CNN.

Figure 3. Proposed CNN for WiFi-CSI-based HAR.

Figure 4. Data split without subject separation. This figure demonstrates the flawed data partitioning strategy used by Shahverdi et al. [4], in which samples from the same individuals appear in both the training and test sets. In this approach, CSI measurements are first extracted and then randomly divided into a training set (80% of samples) and a test set (20% of samples). Such a procedure enables the model to exploit subject-specific characteristics rather than learning generalizable activity patterns, resulting in data leakage and artificially inflated evaluation metrics.

Figure 5. Data split with subject separation. This figure shows the proper data partitioning strategy, in which entire sets of samples from each individual are assigned exclusively to either the training or the test set. CSI measurements are then extracted for each subject. This methodology guarantees that the model is evaluated on data from unseen individuals, effectively avoiding data leakage and enabling a more accurate assessment of the model’s ability to generalize in human activity recognition tasks.

Figure 6. Confusion matrix (%) of the activity classification results obtained when the dataset is randomly split into 80% training and 20% testing samples without separating subjects. This evaluation setting allows samples from the same individual to appear in both sets, leading to inflated performance due to data leakage.

Figure 7. Confusion matrix (%) of the activity classification results when the dataset is split by subject, with data from the first two participants used for training and the third participant used exclusively for testing. This subject-independent evaluation provides a realistic estimate of model generalization to unseen users.

Figure 8. Classification accuracy (%) under post-training quantization with different bit widths, evaluated on a model trained with random sample splitting (80% training, 20% testing) without separating subjects. The results show that accuracy remains stable down to 5-bit quantization, with performance degrading significantly at 4 bits and lower.

Figure 9. Classification accuracy (%) under post-training quantization with different bit widths, evaluated on a model trained with subject-based separation (the dataset is split by subject, with data from the first two participants used for training and the third participant used exclusively for testing).

Figure 10. The model of Shahverdi et al. [4] was retrained on the CSI-HAR [19] dataset without taking the identity of individuals into account. In the top part of the figure, the blue curve illustrates training accuracy, whereas the black curve denotes test accuracy. In the bottom part, the red curve corresponds to the training loss, while the black curve reflects the test loss.

Figure 11. The model of Shahverdi et al. [4] was retrained on the CSI-HAR [19] dataset while taking human identities into account. In the top part of the figure, the blue curve illustrates the training accuracy, and the black curve denotes the test accuracy. In the bottom part, the red curve corresponds to the training loss, while the black curve represents the test loss.

Table 1. Parameter settings.

Parameter	Value
Loss function	Cross-entropy
Optimizer	Adam [43] ( $β_{1} = 0.9$ , $β_{2} = 0.999$ , $ϵ = 10^{- 8}$ )
Learning rate	0.001
Weight decay	0.0
Batch size	128
Number of epochs	40

Table 2. Computer configuration.

Parameter	Value
Computer model	STRIX Z270H Gaming
Operating system	Windows 10
CPU	Intel(R) Core(TM) i7-7700K CPU 4.20 GHz (4 cores, 8 threads)
Memory	15 GB
GPU	NVIDIA GeForce GTX 1080

Table 3. Reported and re-implemented classification results under different data partitioning strategies.

	Accuracy	F1-Score
Reported in [4]	0.92	0.92
Retrained without subject separation	0.929	0.928
Retrained with subject separation	0.607	0.604

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Varga, D.; Cao, A.Q. Why Partitioning Matters: Revealing Overestimated Performance in WiFi-CSI-Based Human Action Recognition. Signals 2025, 6, 59. https://doi.org/10.3390/signals6040059
