1. Introduction
With the rapid development of Industry 4.0 and intelligent manufacturing, accurate perception of robotic arm motion has become increasingly important in application scenarios such as automated production, human–robot collaboration, and intelligent control. Existing sensing methods mainly rely on vision or on additional sensors. However, vision-based methods are susceptible to factors such as illumination changes and target occlusion, which limits their applicability. Sensor-based solutions usually incur additional hardware deployment and maintenance costs and may interfere with the normal operation of the equipment to a certain extent. Compared with these two approaches, sensing based on Wi-Fi channel state information (CSI) can leverage existing wireless communication infrastructure to obtain robotic arm motion information in a low-cost and non-intrusive manner, showing promising application potential in the field of robotic arm monitoring. Nevertheless, in robotic arm motion recognition, extracting discriminative features of different motions from CSI signals still faces the following challenges.
First, signal perturbations of CSI caused by robotic arm motion are weak and locally distributed, making effective features difficult to capture adequately. In practical deployment environments, the limited deployment space and the physical dimensions of the robotic arm usually lead to small and localized CSI fluctuations during motion. Such subtle signal perturbations are easily masked by environmental noise and changes in measurement conditions. In addition, robotic arm movements are usually highly repetitive and continuous, which makes the differences among different motions even more difficult to distinguish under noisy conditions. Therefore, it is necessary for a robotic arm motion recognition framework to remain sensitive to weak local patterns and extract discriminative features robustly under noisy conditions to accurately distinguish fine-grained motions.
Second, in CSI signals generated by continuous robotic arm motion, discriminative information is often concentrated in only a few key periods, making adaptive extraction of these critical temporal features another major challenge. Specifically, within a fixed-length CSI acquisition window, the channel variations that truly reflect motion differences usually appear only in a few key temporal segments, while the remaining periods mainly correspond to stable background or weakly correlated responses. If all temporal positions are assigned equal weights, useful discriminative information may be diluted by redundant signals, thereby reducing recognition accuracy. Moreover, the key motion segments of different motions may vary considerably in both length and occurrence time, which further increases the difficulty of temporal modeling. Therefore, it is important for a robotic arm motion recognition model to automatically identify key periods and dynamically enhance discriminative information.
In addition, Wi-Fi-based robotic arm motion recognition can also be influenced by deployment conditions, multipath propagation, and motion speed variations, which further increase the practical difficulty of CSI-based representation learning.
To address the above challenges, this paper proposes MSPoolNet, a multi-stage CSI-based model for robotic arm motion recognition. The proposed framework is developed to better accommodate weak local CSI perturbations and sparsely distributed informative temporal cues, while preserving a favorable balance between recognition performance and computational efficiency. The main contributions of this paper can be summarized as follows:
We propose an efficient CSI-based robotic arm motion recognition model, MSPoolNet, which consists of three key modules: an adaptive temporal downsampling module, a temporal gating module, and a Transformer-based feature encoding module. Specifically, the adaptive temporal downsampling module is used to extract local time–frequency patterns at the input stage. The temporal gating module can dynamically highlight key temporal segments and enhance the representation capability of discriminative features. The Transformer-based feature encoding module achieves global temporal modeling of fine-grained features while maintaining low computational overhead.
We propose a fine-grained feature extraction module based on CSI, which integrates a lightweight Transformer encoding structure and replaces conventional self-attention with pooling operations, thereby enhancing the modeling of subtle temporal dependencies and fine-grained discriminative patterns in CSI signals while achieving an effective balance between computational efficiency and representation capability.
We conduct extensive experiments on two public CSI datasets for robotic arms. The results show that the proposed method achieves the best accuracy among the lightweight models on RoboMNIST, within 0.19 percentage points of a considerably larger standard Transformer, and the best overall performance on RoboFiSense, which verifies its applicability to related robotic arm motion recognition tasks.
3. Proposed Method
3.1. Problem Formulation
In the robotic arm Wi-Fi sensing scenario considered in this study, a wireless transmitter and multiple passive receivers are deployed around the robotic workspace to capture channel variations induced by arm motion, without requiring attached sensors or vision equipment. As the robotic arm performs different motion patterns, the transmitted Wi-Fi signals undergo multipath propagation, and the resulting channel state information is recorded by the receivers over successive packets and subcarriers. These CSI measurements provide a non-intrusive representation of robotic motion and serve as the sensing basis for subsequent recognition.
In an OFDM-based Wi-Fi system, the received signal can be written as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{n}$ denote the transmitted signal, received signal, and additive noise, respectively [21,22]. The CSI matrix $\mathbf{H} \in \mathbb{C}^{T \times S}$ characterizes the channel response over $T$ packets and $S$ subcarriers, and each element is defined as

$$H(t,s) = |H(t,s)|\, e^{j \angle H(t,s)}, \tag{2}$$

where $|H(t,s)|$ and $\angle H(t,s)$ denote the amplitude and phase of subcarrier $s$ at time index $t$. This study uses CSI amplitude as the input feature. This choice follows the common setting adopted in many CSI-based recognition studies, where amplitude provides a comparatively stable and reproducible representation. In contrast, the raw CSI phase is highly sensitive to hardware-induced offsets and usually requires additional calibration procedures, such as phase unwrapping, sanitization, and offset removal. These procedures introduce extra preprocessing complexity and computational cost, which is less consistent with the lightweight design goal of MSPoolNet. Therefore, this work focuses on amplitude-based robotic arm motion recognition, while phase-amplitude fusion is left for future investigation.
Given a CSI amplitude matrix $\mathbf{A} \in \mathbb{R}^{T \times S}$, the objective of the classification task is to infer the activity label corresponding to the observed channel variations. Formally, let $\mathcal{D} = \{(\mathbf{A}_i, y_i)\}_{i=1}^{M}$ denote the labeled training set, where $\mathbf{A}_i$ represents the CSI amplitude matrix and $y_i$ denotes the corresponding class label. The goal of this study is to learn a parametric mapping $f_\theta$ from CSI amplitude matrices to class labels by minimizing the cross-entropy loss over the training set.
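As a small illustration of the training objective, the following sketch computes the mean cross-entropy for a batch of class scores. It is a minimal NumPy example under stated assumptions rather than the authors' implementation; the function name `cross_entropy` and the array shapes are hypothetical.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy for logits of shape (N, C) and integer labels of shape (N,)."""
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability assigned to the true class of each sample.
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

With uninformative (all-zero) scores over ten classes, the loss equals the natural logarithm of 10, which is a convenient sanity check at the start of training.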
3.2. Proposed Framework
MSPoolNet is developed as a multi-stage framework for robotic arm motion recognition from Wi-Fi CSI. Given a raw CSI amplitude matrix $\mathbf{A}$, the framework consists of six sequential stages: signal preprocessing, an adaptive temporal downsampling module, a temporal gating module, a patch embedding module, a Transformer-based feature encoding module, and a classification head. Through this hierarchical design, the input is progressively transformed from normalized CSI representations into compact and discriminative features for final motion recognition.
In robotic arm CSI sensing, discriminative information is often subtle and tends to appear as localized variations in time–frequency responses. Moreover, useful cues are unevenly distributed along the temporal dimension, since a fixed-length acquisition window usually contains a considerable proportion of stable background segments and weakly correlated disturbances. Practical deployment also requires an effective balance between parameter efficiency and inference performance. To accommodate these characteristics, MSPoolNet first performs signal preprocessing to obtain a normalized CSI representation. It then employs an adaptive temporal downsampling module to enhance local time–frequency patterns while reducing frequency-domain redundancy, uses a temporal gating module to emphasize informative CSI responses, and adopts a Transformer-based feature encoding module to model compact token representations efficiently.
Accordingly, MSPoolNet is organized as a progressive pipeline comprising input preprocessing, local pattern extraction, key temporal interval selection, and high-level representation integration. This design is tailored to the characteristics of robotic arm CSI signals, namely subtle local disturbances and sparsely distributed informative temporal cues, as well as to the practical need for efficient back-end modeling. The overall architecture is shown in Figure 1, and the key components are described in the following subsections.
3.2.1. Signal Preprocessing
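The raw CSI data are stored in the form of a complex matrix $\mathbf{H} \in \mathbb{C}^{T \times S}$. The amplitude matrix $\mathbf{A} \in \mathbb{R}^{T \times S}$ is extracted from $\mathbf{H}$ as

$$A(t,s) = |H(t,s)|. \tag{3}$$

Log compression is applied to reduce the dynamic range:

$$\tilde{A}(t,s) = \log\bigl(1 + A(t,s)\bigr). \tag{4}$$

The compressed amplitude is further normalized along each subcarrier:

$$X(t,s) = \frac{\tilde{A}(t,s) - \mu_s}{\sigma_s}. \tag{5}$$

Here, $\mu_s$ and $\sigma_s$ denote the mean and standard deviation of subcarrier $s$ over the temporal dimension within the same CSI sample, respectively.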
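The three preprocessing steps (amplitude extraction, log compression, and per-subcarrier normalization) can be sketched as follows. This is an illustrative NumPy version under assumptions, not the reference implementation: the function name `preprocess_csi`, the `log(1 + A)` form of the compression, and the small constant `eps` that guards against division by zero are all assumptions.

```python
import numpy as np


def preprocess_csi(H: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map a complex CSI matrix of shape (T, S) to a normalized amplitude matrix (T, S)."""
    A = np.abs(H)                               # amplitude extraction
    A_log = np.log1p(A)                         # log compression; log(1 + A) is an assumed form
    mu = A_log.mean(axis=0, keepdims=True)      # per-subcarrier mean over time
    sigma = A_log.std(axis=0, keepdims=True)    # per-subcarrier standard deviation over time
    return (A_log - mu) / (sigma + eps)         # eps is an assumed numerical safeguard
```

Because the statistics are computed within each sample, every subcarrier of the output has approximately zero mean and unit variance along the temporal axis, independently of the absolute signal level of that recording.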
3.2.2. Adaptive Temporal Downsampling Module
To better preserve weak and locally distributed motion-related perturbations in robotic arm CSI, MSPoolNet employs an adaptive temporal downsampling module as the front-end representation stage. By enhancing local time–frequency responses and reducing redundant subcarrier information before tokenization, this module produces a compact feature representation for subsequent temporal gating and encoding. This design is supported by recent findings showing that early convolutional layers can improve both the optimization process and the feature quality of Transformer architectures [27]. The module corresponds to the overall mapping process defined in Equation (6), and its primary roles are to extract local time–frequency patterns and perform early-stage resolution reduction.
The adaptive temporal downsampling module is composed of two convolutional layers, each followed by batch normalization and a nonlinear activation function [28]. Considering that the discriminative information of robotic arm motions in CSI is primarily distributed along the temporal dimension, while adjacent subcarriers in the frequency dimension often exhibit strong correlation and redundancy, an asymmetric downsampling strategy is adopted, with relatively moderate compression in the temporal dimension and more aggressive compression in the frequency dimension. Specifically, the first layer performs joint downsampling in both time and frequency, whereas the second layer further reduces the frequency resolution. This design aims to preserve the dynamic boundaries and temporal evolution of the motion as much as possible, thereby avoiding excessive loss of temporal detail at an early stage, while simultaneously exploiting the redundancy among adjacent subcarriers to reduce the input scale for subsequent encoding.
For robotic arm motion recognition, inter-class differences are usually not reflected in large-scale global shape variations but rather in subtle local response differences across neighboring time steps and adjacent subcarriers, such as short-term energy fluctuations, local amplitude perturbations, and variations in inter-subcarrier correlation patterns. Therefore, the adaptive temporal downsampling module in the proposed model is not merely used for general feature extraction but is designed to enhance local spatiotemporal responses in raw CSI at an early stage so that discriminative cues related to fine-grained motion differences can be more effectively preserved.
Let the preprocessed CSI tensor be $\mathbf{X}$. To align the feature size for subsequent patch embedding, zero-padding is applied along the frequency dimension to obtain $\mathbf{X}_{\mathrm{pad}}$. The stem output is defined as

$$\mathbf{F} = \mathrm{Stem}(\mathbf{X}_{\mathrm{pad}}) \in \mathbb{R}^{C_s \times T' \times S'}. \tag{6}$$

Here, $C_s$ denotes the number of output channels, while $T'$ and $S'$ represent the downsampled temporal and frequency resolutions, respectively. The adaptive temporal downsampling module maps the original CSI input into a more compact, multi-channel feature representation, thereby establishing the feature basis for subsequent temporal gating and token-based modeling. The two-layer Conv-BN-GELU cascaded structure of the adaptive temporal downsampling module and its asymmetric time–frequency downsampling scheme are illustrated in Figure 1.
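To make the asymmetric downsampling concrete, the sketch below tracks how the temporal and frequency resolutions change across two convolutional layers. The kernel sizes, strides, and paddings in `HYPOTHETICAL_STEM`, as well as the example input size, are assumptions chosen only for illustration; the actual settings are those listed in the architecture hyperparameter table.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_shape(t: int, s: int, layers) -> tuple:
    """Apply each layer's (kernel, stride, padding) pairs to the (time, frequency) size."""
    for (k_t, k_s), (s_t, s_s), (p_t, p_s) in layers:
        t = conv_out(t, k_t, s_t, p_t)
        s = conv_out(s, k_s, s_s, p_s)
    return t, s


# Assumed configuration: layer 1 halves both axes, layer 2 halves frequency only.
HYPOTHETICAL_STEM = [
    ((3, 3), (2, 2), (1, 1)),
    ((3, 3), (1, 2), (1, 1)),
]
```

Under these assumed settings, an input with 450 time steps and 256 padded subcarriers would be reduced to 225 time steps and 64 frequency bins, that is, a factor of 2 in time but a factor of 4 in frequency.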
3.2.3. Temporal Gating Module
The CSI features produced by the adaptive temporal downsampling module remain continuous along the temporal dimension, whereas the truly informative discriminative cues associated with robotic arm motions are typically concentrated within only a few key temporal intervals. Within a fixed-length CSI acquisition window, stable background segments before and after motion onset, transition periods, and weak disturbance regions often occupy a substantial portion of the sequence. In contrast, the channel variations that genuinely reflect motion differences are usually short-lived and appear at uncertain temporal locations. If these informative intervals are not explicitly highlighted, the subsequent encoder must process a large amount of weakly correlated or irrelevant CSI responses, thereby increasing the difficulty of accurately identifying critical motion-related segments. To better focus the representation on motion-relevant temporal cues, the proposed framework employs a temporal gating module to perform sample-dependent feature reweighting along the temporal dimension.
The temporal gating module compresses the convolutional features along the channel and frequency dimensions into a one-dimensional temporal descriptor, which characterizes the overall response intensity of the CSI sequence at different time positions. Based on this descriptor, a multi-layer perceptron (MLP) predicts time-step-wise gating weights in the range of 0 to 1, and these weights are then applied to reweight the convolutional features. In this way, the module emphasizes informative channel responses associated with robotic arm motions while preserving the original temporal order, thereby suppressing the influence of stable background segments and weakly correlated disturbances during subsequent tokenization and encoding.
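This module consists of three steps. First, the output $\mathbf{F} \in \mathbb{R}^{C_s \times T' \times S'}$ from the adaptive temporal downsampling module is averaged across the channel and frequency dimensions, compressing each time step into a scalar to obtain a temporal vector that reflects the overall activity intensity at each time step:

$$d_t = \frac{1}{C_s S'} \sum_{c=1}^{C_s} \sum_{s=1}^{S'} F(c,t,s), \quad t = 1, \dots, T'. \tag{7}$$

Next, the time vector $\mathbf{d} \in \mathbb{R}^{T'}$ is fed into a two-layer MLP. After dimensionality reduction via a bottleneck layer, it is mapped back to the original time dimension to obtain the time-step-wise gating weights:

$$\mathbf{g} = \sigma\bigl(\mathbf{W}_2\, \mathrm{GELU}(\mathbf{W}_1 \mathbf{d} + \mathbf{b}_1) + \mathbf{b}_2\bigr). \tag{8}$$

Here, $\sigma(\cdot)$ denotes the sigmoid function, which is used to constrain the weights to the interval $(0, 1)$. The GELU activation function is used to maintain the smoothness of the nonlinear transformation [29]. Finally, the gating weights are applied element-wise to the feature map via broadcasting:

$$\tilde{F}(c,t,s) = g_t \cdot F(c,t,s). \tag{9}$$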
The proposed temporal gating module does not modify the original temporal order of the sequence but instead redistributes the importance of different temporal positions through continuous weighting. Compared with methods based on predefined fixed windows or manually selected segments, this module does not impose prior assumptions on the location or duration of informative temporal intervals. Instead, it adaptively generates temporal weights according to the CSI response of each input sample. As a result, it is more suitable for handling temporal variations caused by different robotic arm motion speeds and motion patterns.
To avoid excessive disturbance to CSI features that are still unstable during the early stage of training, the parameters of the final layer in the gating predictor are initialized to zero. Under this setting, the weights assigned to different time steps remain approximately uniform at the beginning of training, allowing the model to first learn basic CSI representations before gradually establishing the emphasis and suppression mechanism for key temporal segments. This design helps improve training stability.
Figure 2a illustrates the overall workflow of the temporal gating module. The features are first compressed along the channel and frequency dimensions, after which time-step-wise gating weights are predicted and multiplied back into the feature map, thereby explicitly performing temporal reweighting before tokenization.
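The three gating steps and the zero initialization of the final layer can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: the class name `TemporalGate`, the bottleneck width, the weight scale, and the tanh approximation of GELU are assumptions.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # Tanh approximation of GELU (assumed here for a dependency-free sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


class TemporalGate:
    """Sample-dependent reweighting of a (C, T, S) feature map along the time axis."""

    def __init__(self, t_len: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, size=(bottleneck, t_len))
        self.b1 = np.zeros(bottleneck)
        # Final layer initialized to zero, so every gate starts at sigmoid(0) = 0.5.
        self.W2 = np.zeros((t_len, bottleneck))
        self.b2 = np.zeros(t_len)

    def __call__(self, F: np.ndarray):
        d = F.mean(axis=(0, 2))                                    # step 1: temporal descriptor (T,)
        g = sigmoid(self.W2 @ gelu(self.W1 @ d + self.b1) + self.b2)  # step 2: gates in (0, 1)
        return F * g[None, :, None], g                             # step 3: broadcast over C and S
```

At initialization all gates equal 0.5, so the module scales the feature map uniformly and leaves the relative importance of time steps unchanged until training moves the final layer away from zero.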
3.2.4. Patch Embedding
The gated feature map $\tilde{\mathbf{F}}$ is projected onto a sequence of tokens via a convolutional layer with a stride equal to the kernel size. This convolution divides the feature map into several non-overlapping patches and maps each patch to a $D$-dimensional embedding vector, yielding $N$ tokens. To preserve the positional information of the patches within the two-dimensional feature map, this paper introduces a learnable positional encoding for each token:

$$\mathbf{Z}_0 = \mathrm{PatchEmbed}(\tilde{\mathbf{F}}) + \mathbf{E}_{\mathrm{pos}}, \quad \mathbf{Z}_0, \mathbf{E}_{\mathrm{pos}} \in \mathbb{R}^{N \times D}. \tag{10}$$
The features refined by the adaptive temporal downsampling module and temporal gating module are converted into compact token representations through patch embedding and then used as the input to the subsequent Transformer-based feature encoding module.
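A convolution whose stride equals its kernel size is equivalent to cutting the feature map into non-overlapping patches and applying one shared linear projection to each flattened patch. The sketch below shows this equivalence in NumPy; the function name `patch_embed`, the argument layout, and the example sizes are hypothetical.

```python
import numpy as np


def patch_embed(F, W, b, E_pos, patch):
    """Tokenize a (C, T, S) feature map into N = (T/pt) * (S/ps) tokens of dimension D.

    W has shape (D, C * pt * ps), b has shape (D,), and E_pos has shape (N, D).
    """
    C, T, S = F.shape
    pt, ps = patch
    if T % pt or S % ps:
        raise ValueError("feature map size must be divisible by the patch size")
    gt, gs = T // pt, S // ps
    # (C, gt, pt, gs, ps) -> (gt, gs, C, pt, ps) -> (N, C * pt * ps)
    patches = F.reshape(C, gt, pt, gs, ps).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(gt * gs, C * pt * ps)
    tokens = patches @ W.T + b          # shared linear projection per patch
    return tokens + E_pos, (gt, gs)     # add the learnable positional encoding
```

The returned grid size `(gt, gs)` is what allows the token sequence to be reshaped back into a two-dimensional map for the pooling-based encoder.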
3.2.5. Transformer-Based Feature Encoding Module
After patch embedding, the token sequence is reorganized into a two-dimensional feature map and fed into multiple lightweight encoding blocks. The adaptive temporal downsampling module primarily serves to extract local time–frequency responses and reduce redundancy in the frequency dimension, while the temporal gating module further highlights informative temporal intervals. Through these two stages, category-relevant local CSI cues are retained in the form of compact token representations. However, the relationships among these local cues still need to be further integrated so as to form stable, high-level discriminative features.
For robotic arm CSI sensing, inter-class differences are usually not determined by the response of a single local block alone but rather by the joint variation patterns of multiple local subcarrier responses along the temporal dimension. Therefore, the role of the back-end encoder is not to rediscover local perturbations from the raw CSI input but to further integrate the filtered and informative CSI cues within a compact representation space. To meet this requirement, PoolFormer [30] is adopted as the back-end encoding module to perform token interaction and representation refinement with relatively low computational complexity.
First, the compact tokens obtained after patch embedding are reorganized as the input to the back-end encoding stage. Then, contextual information is progressively aggregated among neighboring CSI tokens through pooling operations, rather than through explicit global pairwise interactions. In this way, the remaining informative responses can be further integrated after local enhancement and temporal filtering. Finally, PoolFormer enables efficient high-level aggregation of local CSI cues while maintaining a more favorable balance between recognition performance and computational complexity.
During the spatial mixing phase, this paper employs a neighborhood aggregation method based on local pooling, rather than explicitly computing the pairwise similarities between all tokens. For the input feature map $\mathbf{Z}$, the pooling operation is represented as

$$\mathbf{Z}' = \mathbf{Z} + \mathrm{DropPath}\bigl(\mathrm{Pool}(\mathrm{LN}_{2d}(\mathbf{Z})) - \mathrm{LN}_{2d}(\mathbf{Z})\bigr). \tag{11}$$

Here, DropPath denotes stochastic depth regularization [31]. This step facilitates information exchange between adjacent tokens through pooling interactions within local neighborhoods, without the need to explicitly compute global attention. For CSI inputs that have undergone local pattern enhancement and temporal filtering in the earlier stages, this lightweight spatial mixing is sufficient to meet the requirements for subsequent discriminative representation updates.

The model further updates the feature representations through a convolutional channel transformation module:

$$\mathbf{Z}'' = \mathbf{Z}' + \mathrm{DropPath}\bigl(\mathrm{MLP}(\mathrm{LN}_{2d}(\mathbf{Z}'))\bigr), \tag{12}$$

where $\mathrm{LN}_{2d}$ denotes layer normalization [32]. Specifically, the MLP consists of three components: point-wise convolution for dimensionality increase, depthwise separable convolution, and point-wise convolution for dimensionality reduction, with the Hardswish activation function used in the middle layer [33]. This architecture preserves a certain degree of local spatial modeling capability while performing channel transformations, thereby helping to further refine the CSI token representations. The detailed structure of a single lightweight Transformer encoding block is illustrated in Figure 2b.
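The pooling-based spatial mixing can be illustrated with a stride-1 average pooling followed by subtraction of the input, which is the token mixer used in PoolFormer. The NumPy sketch below is a simplified illustration under assumptions: normalization and DropPath are omitted, the 3 × 3 window is assumed, and padded positions are excluded from the average.

```python
import numpy as np


def avg_pool2d_same(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Stride-1 average pooling over the last two axes of x (C, H, W), ignoring padding."""
    _, H, W = x.shape
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))
    valid = np.pad(np.ones((H, W)), ((p, p), (p, p)))
    total = np.zeros(x.shape, dtype=float)
    count = np.zeros((H, W))
    for di in range(k):
        for dj in range(k):
            total += padded[:, di:di + H, dj:dj + W]   # sum of neighbors at this offset
            count += valid[di:di + H, dj:dj + W]       # number of real (non-padded) neighbors
    return total / count


def pool_mixer(x: np.ndarray, k: int = 3) -> np.ndarray:
    """PoolFormer-style token mixer: local average minus the token itself."""
    return avg_pool2d_same(x, k) - x
```

The mixer has no learnable parameters and its cost grows linearly with the number of tokens, whereas self-attention grows quadratically; a constant feature map yields a zero update, so only local variations are propagated through the residual branch.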
3.2.6. Classification Head
The feature maps produced by the $L$ lightweight Transformer encoding blocks are further normalized and aggregated by global average pooling along the spatial dimension, yielding a global feature vector $\mathbf{v}$. Finally, a single layer of normalization and a linear mapping are applied to generate the classification prediction:

$$\hat{\mathbf{y}} = \mathbf{W}_c\, \mathrm{LN}(\mathbf{v}) + \mathbf{b}_c. \tag{13}$$

Here, $\mathrm{LN}$ denotes the standard layer normalization applied to vector representations, while $\mathrm{LN}_{2d}$ is used for two-dimensional feature maps of shape $B \times C \times H \times W$. Specifically, $\mathrm{LN}_{2d}$ first permutes the feature map to $B \times H \times W \times C$, applies layer normalization along the channel dimension $C$ at each spatial location, and then permutes the result back to $B \times C \times H \times W$.
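The permute-normalize-permute behavior of the two-dimensional layer normalization, followed by global average pooling and the linear classifier, can be sketched as follows. The function names, parameter names, and the value of `eps` are assumptions made for this NumPy illustration.

```python
import numpy as np


def layer_norm(v, gamma, beta, eps=1e-5):
    """Standard layer normalization over the last axis."""
    mu = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps) * gamma + beta


def layer_norm_2d(x, gamma, beta, eps=1e-5):
    """Normalize a (B, C, H, W) feature map along C at every spatial location."""
    x_last = x.transpose(0, 2, 3, 1)                 # (B, H, W, C)
    y = layer_norm(x_last, gamma, beta, eps)         # statistics over the channel axis
    return y.transpose(0, 3, 1, 2)                   # back to (B, C, H, W)


def classification_head(x, g2d, b2d, g, b, W, bias):
    """Normalize, global-average-pool, normalize again, and project to class scores."""
    v = layer_norm_2d(x, g2d, b2d).mean(axis=(2, 3))  # (B, C) global feature vector
    return layer_norm(v, g, b) @ W.T + bias            # (B, num_classes) logits
```

Normalizing over channels at each location, rather than over the whole map, keeps the statistics independent of the spatial size of the token grid.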
4. Experimental Setup and Results
4.1. Datasets and Evaluation Setup
The proposed method is evaluated on two publicly available robotic arm CSI datasets. Among them, RoboMNIST is used as the primary benchmark for fine-grained motion recognition, while RoboFiSense is employed to further validate the applicability of the proposed method to related robotic arm motion recognition tasks.
Table 1 summarizes the main characteristics of the two datasets and the evaluation protocols adopted in this study.
4.1.1. RoboMNIST Dataset
RoboMNIST [22] is a multimodal dataset developed for multi-robot activity recognition. In this dataset, two Franka Emika Panda robotic arms are programmed to write the digits 0–9 on a vertical virtual plane, corresponding to ten activity categories. To simulate motion variability in real-world scenarios, multiplicative uniform noise is applied to the first three robot joints during data acquisition, increasing the effective trajectory deviation from the native repeatability of 0.1 mm to approximately 32 cm. Data are collected using three sensing modules, each equipped with a Raspberry Pi 4 running Nexmon firmware [34] as a CSI sniffer under the IEEE 802.11ac standard with an 80 MHz bandwidth. Each recording lasts 15 s with a sampling rate of 30 Hz, and each sensing module produces a complex-valued CSI matrix $\mathbf{H} \in \mathbb{C}^{T \times S}$ (nominally $T = 450$ packets per recording). The dataset covers three motion speed settings, namely, high, medium, and low, and each combination of activity category, robot, and speed contains at least 32 repetitions. In this study, the CSI amplitude data from two sniffers are concatenated along the channel dimension and used as the model input, resulting in a total of 2135 samples for classification.
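The construction of the two-sniffer input can be sketched as below. This is only an illustration of stacking amplitudes along a channel axis; the function name `stack_sniffers` and the requirement that both recordings already share the same shape are assumptions, and any packet alignment performed in the actual pipeline is not shown.

```python
import numpy as np


def stack_sniffers(H_a: np.ndarray, H_b: np.ndarray) -> np.ndarray:
    """Combine two complex CSI matrices of shape (T, S) into an amplitude tensor (2, T, S)."""
    if H_a.shape != H_b.shape:
        raise ValueError("both sniffers must provide CSI matrices of the same shape")
    # Stacking on a new leading axis is equivalent to concatenating two (1, T, S) tensors.
    return np.stack([np.abs(H_a), np.abs(H_b)], axis=0)
```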
4.1.2. RoboFiSense Dataset
RoboFiSense [21] is the first publicly available CSI benchmark for robotic arm motion recognition. In this dataset, a Franka Emika robotic arm performs eight action categories, namely, Arc, Elbow, Rectangle, Silence, SLFW, SLRL, SLUD, and Triangle. Meanwhile, two Raspberry Pi-based sniffers synchronously collect CSI data at a sampling rate of 30 Hz within a 12 s acquisition window, producing $\mathbf{H} \in \mathbb{C}^{T \times S}$ (nominally $T = 360$ packets per window). After removing pilot subcarriers, guard bands, and the DC subcarrier, 232 valid subcarriers are retained. The dataset was collected under four spatial sniffer configurations and three robotic arm speed settings in order to investigate the sensitivity of Wi-Fi sensing to changes in environmental and operational conditions. In this study, the CSI amplitude data from the two sniffers are fused, and a total of 552 samples are used to evaluate the applicability of the proposed method to robotic arm motion recognition tasks.
4.2. Training Configuration
The model was trained using the AdamW optimizer [35] with a cosine-annealing learning rate schedule and a linear warm-up strategy [36]. The loss function was set to cross-entropy. Gradient clipping was applied during training to improve optimization stability, and validation accuracy was used as the criterion for early stopping to reduce the risk of overfitting. Except for the cross-speed evaluation on RoboFiSense, all experiments were conducted using five-fold stratified cross-validation, and the mean and standard deviation were reported. In addition, all cross-validation experiments were performed under a fixed random seed to reduce the influence of data partition variability.
Table 2 lists the training hyperparameter settings for the two datasets.
Table 3 summarizes the key architecture hyperparameters of MSPoolNet used in the experiments.
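A linear warm-up followed by cosine annealing can be written as a single function of the training step. The sketch below is a generic formulation under assumptions; the function name `lr_at`, the per-step granularity, and the example values are hypothetical and do not reproduce the settings in the hyperparameter table.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The warm-up phase limits the size of early updates while the normalization statistics and gating weights are still unstable, and the cosine phase then lowers the step size smoothly toward the end of training.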
4.3. Experimental Environment
All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 5090 GPU using Python 3.12 and PyTorch 2.7.0. The evaluation metrics included test accuracy, macro-average F1 score (Macro-F1), macro-average precision (Macro-Precision), macro-average recall (Macro-Recall), and single-sample inference latency. The reported latency refers to the model inference latency, i.e., the average forward-pass time of the recognition model after the CSI tensor is constructed. It does not include CSI acquisition, window buffering, amplitude extraction, normalization, tensor construction, or data-transfer time. Specifically, letting $C$ denote the number of classes and $TP_c$, $FP_c$, and $FN_c$ denote the true positives, false positives, and false negatives for class $c$, with $N$ denoting the total number of test samples, these metrics are defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{c=1}^{C} TP_c, \tag{14}$$

$$\text{Macro-Precision} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}, \tag{15}$$

$$\text{Macro-Recall} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}, \tag{16}$$

$$\text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} \frac{2\, TP_c}{2\, TP_c + FP_c + FN_c}. \tag{17}$$
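The macro-averaged metrics can be computed directly from per-class counts, as in the following plain-Python sketch. The function name `macro_metrics` and the convention of returning zero for an undefined ratio (for example, a class that is never predicted) are assumptions.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Accuracy and macro-averaged precision, recall, and F1 from integer labels."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p receives a false positive
            fn[t] += 1   # true class t receives a false negative

    def safe_div(a, b):
        return a / b if b else 0.0   # assumed convention for empty denominators

    precision = [safe_div(tp[c], tp[c] + fp[c]) for c in range(num_classes)]
    recall = [safe_div(tp[c], tp[c] + fn[c]) for c in range(num_classes)]
    f1 = [safe_div(2 * tp[c], 2 * tp[c] + fp[c] + fn[c]) for c in range(num_classes)]
    return {
        "accuracy": sum(tp) / len(y_true),
        "macro_precision": sum(precision) / num_classes,
        "macro_recall": sum(recall) / num_classes,
        "macro_f1": sum(f1) / num_classes,
    }
```

Macro averaging gives every class the same weight regardless of its sample count, so a model cannot obtain a high score by performing well only on the most frequent motions.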
Standard experiments were performed using five-fold stratified cross-validation, with the mean and standard deviation reported. For the cross-speed experiments, a leave-one-speed-out evaluation protocol was adopted, and the mean and standard deviation across the three speed partitions were reported. To ensure comparability, all baseline models implemented in this study followed the same data preprocessing pipeline, early stopping strategy, and training protocol as the proposed method. In particular, the RoboFiSense baseline comparisons were obtained by re-training all baseline models under the same experimental configuration as MSPoolNet, including the same data split protocol, preprocessing pipeline, training strategy, early stopping criterion, and evaluation metrics.
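The leave-one-speed-out protocol can be expressed as a simple split generator: each speed setting is held out once for testing while the remaining speeds are used for training. The sketch below is a minimal plain-Python illustration; the function name and the string speed labels are hypothetical.

```python
def leave_one_speed_out(speeds):
    """Yield (held_out_speed, train_indices, test_indices) for every distinct speed."""
    for held_out in sorted(set(speeds)):
        train = [i for i, s in enumerate(speeds) if s != held_out]
        test = [i for i, s in enumerate(speeds) if s == held_out]
        yield held_out, train, test
```

With three speed settings this produces three partitions, and the reported mean and standard deviation are taken across them. Because no sample of the held-out speed is seen during training, the protocol measures robustness to motion speed rather than in-distribution accuracy.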
4.4. Baseline Methods
To evaluate the overall recognition performance of MSPoolNet, we compare it with several representative baseline models covering convolutional, recurrent, hybrid spatiotemporal, Transformer-based, and recent task-related architectures.
CNN is adopted as a representative convolutional baseline for modeling local patterns in CSI representations. It mainly reflects the capability of convolutional architectures for extracting local time–frequency features.
LSTM is adopted as a recurrent baseline that treats the CSI amplitude matrix as a temporal sequence, where the subcarrier vector at each time step is regarded as the input vector and the final hidden state is used for classification.
BiLSTM extends the above recurrent formulation by modeling temporal dependencies in both forward and backward directions. The hidden states from the two directions are concatenated before classification.
ConvLSTM [37] is adopted as a hybrid spatiotemporal baseline. By replacing the standard matrix multiplications in the gating operations with convolutions, it preserves the two-dimensional CSI structure during temporal modeling.
Transformer is adopted as a standard Transformer encoder baseline. The CSI matrix is partitioned into fixed-length temporal chunks as input tokens, and mean pooling over the output sequence is used for classification.
ViT is adopted as a standard Vision Transformer baseline. The two-dimensional CSI amplitude map is divided into non-overlapping patch tokens and encoded by stacked Transformer blocks, and the final aggregated representation is used for classification.
BiVTC is included as a task-related Transformer baseline for robotic arm Wi-Fi sensing. It adopts a dual-branch ViT architecture to process two independent sniffer inputs in parallel and concatenates the corresponding token representations before classification.
RoboDSTAN is included as a recent task-related baseline for robotic arm motion recognition from Wi-Fi sensing. It adopts a dual-stream temporal convolutional architecture with self-attention, fuses the two sniffer streams through concatenation, and performs final prediction with an MLP classifier.
4.5. Results on RoboMNIST
To evaluate the overall recognition performance of MSPoolNet on fine-grained robotic arm motion recognition, we compare it with several representative baseline models on the RoboMNIST dataset.
Table 4 summarizes the five-fold cross-validation results of different models on the RoboMNIST dataset. In terms of absolute classification performance, the standard Transformer with a relatively larger parameter scale achieves the highest accuracy and Macro-F1 score, indicating that global token interaction provides strong representation capability for this fine-grained motion recognition task. However, this model contains 1.9 M parameters and exhibits substantially higher inference latency than the proposed method, suggesting that its performance advantage is achieved at the cost of increased model complexity.
By comparison, the proposed method attains 99.44% accuracy and 99.44% Macro-F1 with only 368 K parameters, achieving the best performance among all lightweight models and demonstrating a more favorable balance among parameter efficiency, inference latency, and classification accuracy. Compared with ConvLSTM, CNN, RoboDSTAN, LSTM, and BiLSTM, which have comparable parameter scales, the proposed method consistently achieves better classification performance. These results indicate that the combined design of an adaptive temporal downsampling module, temporal gating module, and efficient pooling operations is more suitable for fine-grained robotic arm motion recognition from CSI than architectures relying only on recurrent modeling or convolutional stacking.
In addition, the proposed method and the standard Transformer both clearly outperform ViT and BiVTC on this dataset. This result suggests that directly transferring a visual Transformer architecture to the RoboMNIST task does not necessarily lead to satisfactory performance. Instead, the design of the adaptive temporal downsampling module and the choice of lightweight pooling operations remain critical for effective CSI-based fine-grained motion recognition.
To further evaluate the class-level discriminative capability of the proposed method, Figure 3 shows the confusion matrices of the proposed method and several representative baseline models on the RoboMNIST dataset. Both the proposed method and the standard Transformer exhibit highly concentrated diagonal responses, whereas ConvLSTM, RoboDSTAN, CNN, and LSTM show more evident off-diagonal confusion in several categories. These results indicate that the proposed method can distinguish fine-grained motion categories more reliably while maintaining relatively low model complexity, and that its class-level discriminative capability is overall superior to that of most lightweight baseline models.
To further evaluate the efficiency of the proposed method, Table 5 compares several representative models in terms of parameter count, classification accuracy, and single-sample inference latency. Under comparable parameter scales, the proposed method achieves higher classification accuracy than ConvLSTM, CNN, RoboDSTAN, and LSTM. Compared with the standard Transformer, the proposed method shows only a 0.19 percentage point decrease in accuracy, while reducing the parameter count and inference latency to approximately one-fifth and three-fifths of those of the standard Transformer, respectively. These results indicate that the proposed method provides a more favorable trade-off between efficiency and performance.
In addition, the proposed method achieves 2.53 percentage points higher accuracy than CNN, despite CNN exhibiting slightly lower inference latency. Compared with ConvLSTM, the proposed method achieves higher accuracy while maintaining substantially lower inference latency (0.65 ms vs. 3.65 ms). Overall, the proposed method delivers a more balanced performance under lightweight deployment constraints.
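The timing procedure behind the single-sample latency figures is not described in this section. One common way to obtain such a number is sketched below under stated assumptions: a few warm-up calls are discarded and the median of repeated wall-clock measurements is reported. The function names and the `warmup` and `repeats` defaults are hypothetical.

```python
import statistics
import time


def single_sample_latency_ms(predict, sample, warmup=10, repeats=100):
    """Median wall-clock time of predict(sample) in milliseconds.

    `predict` is any callable that runs inference on one sample.
    Warm-up calls are excluded so one-off setup costs do not skew the result.
    """
    for _ in range(warmup):
        predict(sample)
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings_ms)


def relative_cost(candidate, reference):
    """Ratio of a candidate's cost (latency or parameters) to a reference's."""
    return candidate / reference


# Latency ratio for the two values quoted above (0.65 ms vs. 3.65 ms).
latency_ratio_vs_convlstm = relative_cost(0.65, 3.65)
```

The median is used here rather than the mean because occasional scheduler stalls can inflate individual measurements.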
Figure 4 illustrates the training dynamics of the major models. As shown, the proposed method reaches a high validation accuracy at an early stage of training and exhibits only minor fluctuations across subsequent epochs, indicating fast convergence and stable generalization performance on the current task.
4.6. Results on RoboFiSense
To further examine the applicability of MSPoolNet beyond the primary fine-grained benchmark, we conduct comparative experiments on RoboFiSense, which corresponds to a related robotic arm activity recognition task.
Table 6 presents the five-fold cross-validation results on the RoboFiSense dataset. To avoid protocol inconsistency in the RoboFiSense comparison, all baseline models were re-trained and evaluated under the same five-fold protocol using the same data split strategy. On this validation task, the proposed method achieves the best overall performance, reaching 99.46% in both accuracy and Macro-F1 while maintaining low model inference latency. Compared with the baseline models, the proposed method shows consistent improvements in accuracy, F1 score, precision, and recall, indicating that the proposed architecture is not only effective for fine-grained motion recognition on RoboMNIST but also transferable to related robotic arm motion recognition tasks.
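The exact split strategy is not specified beyond being five-fold and shared by all models. A minimal sketch of one such protocol is given below, assuming class-stratified folds and a fixed random seed so that every model sees identical partitions; both choices are assumptions, and `stratified_k_fold` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs with per-class balance.

    A fixed seed makes the folds reproducible, so every model can be
    trained and evaluated on exactly the same partitions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled samples of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        splits.append((train, test))
    return splits


# Toy labels: three classes with ten samples each.
toy_labels = [0] * 10 + [1] * 10 + [2] * 10
toy_splits = stratified_k_fold(toy_labels, k=5, seed=0)
```

Reusing the same `seed` for every baseline is what keeps the comparison free of split-induced variation.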
These results further suggest that the adaptive temporal downsampling module for local spatiotemporal representation learning, the temporal gating module for informative segment selection, and the efficient pooling operations of the Transformer-based feature encoding module exhibit good generalization ability across different robotic arm CSI classification tasks. From the perspective of task characteristics, the class differences in RoboFiSense are more related to coarse-grained motion patterns than those in RoboMNIST, which makes this dataset comparatively easier to distinguish and leads to higher and more stable overall performance.
The results show that MSPoolNet improves accuracy by 2.27 percentage points over RoboDSTAN while reducing model inference latency from 5.65 ms to 1.29 ms. This result demonstrates that the collaborative design of the adaptive temporal downsampling module, temporal gating module, and Transformer-based feature encoding module is effective for the current task and indicates that an architecture tailored to CSI characteristics can lower inference cost without sacrificing recognition accuracy.
It is worth noting that the proposed method achieves better recognition performance with a lower parameter scale than BiVTC, which adopts a dual-branch ViT architecture with a relatively larger parameter scale. This characteristic makes the proposed architecture more suitable for deployment in resource-constrained scenarios.
4.7. Cross-Speed Generalization on RoboFiSense
To assess the robustness of the proposed method under motion-speed variation, we further conduct leave-one-speed-out experiments on RoboFiSense.
Table 7 presents the cross-speed evaluation results on the RoboFiSense dataset. To further examine motion-speed generalization under a consistent comparison protocol, we re-evaluated all baselines under the same leave-one-speed-out setting on RoboFiSense. For each fold, two speeds were used for training and the remaining speed was used for testing. Across the three leave-one-speed-out partitions, the proposed method achieves an average accuracy of 99.09%, outperforming all baseline models, and demonstrates strong adaptability under unseen speed conditions.
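The leave-one-speed-out partitioning described above can be sketched as follows. The speed labels in the toy example are hypothetical placeholders; the only property taken from the text is that each partition trains on two speeds and tests on the remaining one.

```python
def leave_one_speed_out(speeds):
    """Yield one partition per distinct speed.

    `speeds` holds the speed label of every sample. Each partition is a
    tuple (held_out_speed, train_indices, test_indices) in which the test
    set contains only the held-out speed.
    """
    partitions = []
    for held_out in sorted(set(speeds)):
        train = [i for i, s in enumerate(speeds) if s != held_out]
        test = [i for i, s in enumerate(speeds) if s == held_out]
        partitions.append((held_out, train, test))
    return partitions


# Hypothetical speed labels for six samples.
toy_speeds = ["slow", "medium", "fast", "slow", "fast", "medium"]
toy_partitions = leave_one_speed_out(toy_speeds)
```

With three distinct speeds this yields three partitions, and no sample recorded at the test speed is ever seen during training.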
Table 7 reports the class-level recall for each held-out speed partition, with the corresponding cross-partition standard deviation computed across the three speed partitions for each model and class. At the class level, the proposed method maintains high recall for most categories across different speed partitions. Only the Rectangle category shows a larger standard deviation, which indicates that this class is more sensitive to speed-induced temporal distribution shifts. This result suggests that the temporal distribution shifts introduced by speed variation do not affect all classes equally and that categories whose trajectory patterns are more dependent on execution rhythm are more sensitive to changes in motion speed.
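For reference, per-class recall and its spread across partitions can be computed as in the sketch below. Whether the reported standard deviation uses the sample (n - 1) or population (n) form is not stated; the sample form is assumed here, and the recall values in the example are made up.

```python
import statistics


def per_class_recall(confusion):
    """Recall of each class from a confusion matrix (rows = true class)."""
    recalls = []
    for k, row in enumerate(confusion):
        support = sum(row)
        recalls.append(row[k] / support if support else 0.0)
    return recalls


def mean_and_std(values):
    """Mean and sample standard deviation (n - 1 denominator, an assumption)."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical recalls of one class on three held-out speed partitions.
toy_recalls = [0.8, 0.9, 1.0]
toy_mean, toy_std = mean_and_std(toy_recalls)
```

A class whose recall varies strongly between partitions, as reported for Rectangle, shows up as a large second value from `mean_and_std`.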
Overall, the cross-speed experiments show that the proposed method is effective not only under standard in-distribution classification settings but also under speed-shifted conditions. At the same time, these results indicate that robustness to motion speed variation remains an important issue for further study in robotic arm Wi-Fi sensing. Compared with RoboDSTAN, MSPoolNet achieves a higher mean overall recall with lower variation across speed partitions (99.09 ± 1.57% vs. 95.47 ± 2.26%). This comparison indicates that the proposed method is not only effective under conventional five-fold cross-validation but also exhibits stronger stability under unseen speed conditions.
From a methodological perspective, this improvement may be attributed to the robust extraction of local spatiotemporal patterns by the adaptive temporal downsampling module and the adaptive emphasis on informative motion segments introduced by the temporal gating module, which together enable the model to preserve strong discriminative capability under temporal stretching or compression caused by speed variation.
4.8. Test-Time Signal Degradation Analysis
To further verify the robustness of MSPoolNet from the perspective of signal perturbation, we conduct a test-time signal degradation experiment on both RoboMNIST and RoboFiSense. In this experiment, the model is trained using the original clean training samples, while artificial degradation is applied only to the test samples for inference. Four representative degradation settings are considered: Gaussian noise, random subcarrier dropout, contiguous frequency masking, and temporal shifting. These settings simulate different forms of CSI corruption, including measurement noise, partial subcarrier loss, localized frequency-band disturbance, and temporal misalignment.
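The four degradation settings can be illustrated with the NumPy sketch below. It assumes a CSI amplitude window stored as a `(time, subcarriers)` array, zero-filling for dropped or masked subcarriers, and zero-padding for the temporal shift; the actual fill values, noise scaling, and shift handling used in the experiments are not given in this section, so these are assumptions.

```python
import numpy as np


def add_gaussian_noise(csi, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return csi + rng.normal(0.0, sigma, size=csi.shape)


def drop_subcarriers(csi, ratio, rng):
    """Zero out a random subset of subcarriers (columns)."""
    out = csi.copy()
    n_sub = csi.shape[1]
    n_drop = int(round(ratio * n_sub))
    dropped = rng.choice(n_sub, size=n_drop, replace=False)
    out[:, dropped] = 0.0
    return out


def mask_frequency_band(csi, width, rng):
    """Zero out `width` adjacent subcarriers at a random position."""
    out = csi.copy()
    start = int(rng.integers(0, csi.shape[1] - width + 1))
    out[:, start:start + width] = 0.0
    return out


def shift_in_time(csi, shift):
    """Delay the window by `shift` samples (shift >= 0), zero-padding the start."""
    if shift == 0:
        return csi.copy()
    out = np.zeros_like(csi)
    out[shift:] = csi[:-shift]
    return out


# Degradation is applied to test samples only; training data stays clean.
rng = np.random.default_rng(0)
clean_test_window = np.ones((100, 64))  # hypothetical window size
degraded = shift_in_time(mask_frequency_band(clean_test_window, 8, rng), 10)
```

Each function returns a new array, so the clean test sample remains available for evaluating the other degradation settings.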
As shown in Table 8, MSPoolNet maintains high recognition accuracy under mild signal degradation, while stronger degradation produces more visible performance decreases. On RoboMNIST, Gaussian noise has little influence across the tested noise levels, with the accuracy remaining close to 99.44% and the decrease limited to 0.04 percentage points. Under subcarrier dropout, the accuracy remains at 99.39% at 2% dropout and 98.50% at 5% dropout, but decreases more noticeably to 95.78% when the dropout ratio increases to 10%. For contiguous frequency masking, the accuracy decreases from 98.92% with 8 masked subcarriers to 97.19% and 95.18% with 16 and 24 masked subcarriers, respectively. Under temporal shifting, the model achieves 99.06%, 97.94%, and 95.22% accuracy for shifts of 10, 20, and 30 samples, respectively. These results indicate that removing or misaligning a larger portion of CSI components can cause an evident performance decline. The larger decreases caused by subcarrier dropout and frequency masking can be attributed to their direct disruption of frequency-domain CSI structures.
On RoboFiSense, the influence of these perturbations is generally smaller under the same selected levels. The accuracy remains 99.73% at two of the three tested Gaussian noise levels and decreases by only 0.09 percentage points at the remaining level. Under subcarrier dropout, the accuracy remains at 99.37% and 99.46% at dropout ratios of 2% and 5%, respectively, and decreases to 98.09% at 10% dropout. For frequency masking, the accuracy decreases from 99.55% with 8 masked subcarriers to 99.00% and 98.01% with 16 and 24 masked subcarriers, respectively. Under temporal shifts of 10, 20, and 30 samples, the corresponding accuracies are 99.73%, 99.64%, and 99.28%. These results indicate that the proposed architecture retains stable recognition performance when representative signal components are mildly perturbed or partially missing, while stronger subcarrier loss, frequency masking, and temporal misalignment can cause clear performance decreases. However, these results should be interpreted as evidence of robustness under selected synthetic test-time signal degradation levels, and they do not fully demonstrate robustness under real deployment conditions such as environmental changes, device repositioning, hardware variation, or real Wi-Fi interference.
4.9. Ablation Study
To quantify the contribution of each key module and structural design choice, ablation experiments are conducted on RoboMNIST.
Figure 5 summarizes the ablation results on the RoboMNIST dataset. Overall, the results indicate that the performance gain of the proposed method does not come from any single component. Instead, it arises from the joint effect of the adaptive temporal downsampling module, temporal gating module, and Transformer-based feature encoding module.
As shown in Figure 5a, removing the temporal gating module causes the largest accuracy drop, indicating that explicitly emphasizing informative temporal segments within a fixed-length CSI window is crucial for the target task. Removing the adaptive temporal downsampling module also leads to a clear performance degradation, which suggests that the subsequent tokenization and encoding stages alone are insufficient to fully characterize the subtle local responses contained in robotic arm CSI signals. When the lightweight Transformer encoding blocks are replaced by standard self-attention, the accuracy decreases slightly, while the model size and inference cost increase. This result suggests that lightweight pooling operations are more suitable than standard attention for achieving a favorable balance between recognition performance and computational complexity in this setting.
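To make the two ablated ideas concrete, the sketch below illustrates a generic sigmoid temporal gate and a parameter-free average-pooling token mixer of the kind often used in place of self-attention. This is not the MSPoolNet implementation: the gate parameterization, the `pooled - tokens` form, the kernel size, and all names are assumptions chosen for illustration.

```python
import numpy as np


def temporal_gate(features, weight, bias):
    """Reweight each time step by a learned score in (0, 1).

    features: (T, D) array; weight: (D,) vector; bias: scalar.
    Returns the gated features and the per-step gate values.
    """
    gate = 1.0 / (1.0 + np.exp(-(features @ weight + bias)))
    return features * gate[:, None], gate


def avg_pool_mixer(tokens, kernel=3):
    """Mix each token with its neighbours by local average pooling.

    tokens: (N, D) array; kernel must be odd. The input is subtracted so
    the result acts as a residual update. The mixer has no trainable
    parameters, unlike the projection matrices of self-attention.
    """
    pad = kernel // 2
    padded = np.pad(tokens, ((pad, pad), (0, 0)), mode="edge")
    pooled = np.stack(
        [padded[i:i + kernel].mean(axis=0) for i in range(tokens.shape[0])]
    )
    return pooled - tokens


# Toy input: 6 time steps with 4 features each.
toy_features = np.arange(24, dtype=float).reshape(6, 4)
gated, gate_values = temporal_gate(toy_features, np.zeros(4), 0.0)
mixed = avg_pool_mixer(gated, kernel=3)
```

In a sketch like this, the pooling mixer costs a number of operations linear in the token count, whereas self-attention scales quadratically, which is consistent with the size and latency increase observed when standard self-attention is substituted.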
Figure 5b presents the influence of different structural settings. Although the smaller-stride setting yields a slightly higher accuracy than the default setting, it also increases the parameter count and computational cost, indicating that a higher input resolution does not substantially improve the overall performance ceiling. The shallower depth setting reduces latency further, but the corresponding accuracy drop is more pronounced. By contrast, the deeper setting provides only limited improvement, suggesting that simply increasing the number of back-end encoding layers is not the most effective way to improve performance on the current task.
Figure 5c,d further compare the effects of local mixing settings and patch settings. The results show that different back-end configurations have varying influences on both performance and efficiency. In particular, a larger local mixing kernel brings a slight accuracy gain, but it also increases inference latency. Meanwhile, neither finer nor coarser patch partitioning consistently outperforms the default configuration.
Overall, although the default setting is not optimal for every individual metric, it provides a more reasonable overall trade-off among accuracy, stability, model size, and inference latency.
To further examine whether the key components remain effective on RoboFiSense, we additionally evaluated two representative ablated variants under the same five-fold protocol. As shown in Figure 6, the full MSPoolNet achieves 99.46% accuracy on RoboFiSense. Removing the temporal gate decreases the accuracy to 98.46%, corresponding to a 1.00 percentage point drop. This result indicates that adaptive temporal weighting also contributes to RoboFiSense, although the effect is relatively moderate on this dataset. By contrast, replacing the adaptive temporal downsampling/convolutional stem leads to a larger decrease, with accuracy dropping to 92.47%. This suggests that the front-end local time–frequency feature extraction and downsampling module is a more critical component for preserving discriminative CSI patterns before tokenization. Overall, the additional RoboFiSense ablation confirms that the two main temporal modules are not only effective on RoboMNIST but also contribute under the RoboFiSense recognition setting.
4.10. Phase Feature Comparison
To further examine whether the CSI phase provides additional recognition information, we conducted a brief phase feature comparison experiment on RoboMNIST. The same MSPoolNet architecture and five-fold evaluation protocol were used, while only the input feature setting was changed. Four input settings were compared. The amplitude-only result in this comparison is reported as the internal reference under the same experimental configuration as the phase feature comparison, rather than as a replacement for the main RoboMNIST result in Table 4. The amplitude-only setting uses the CSI magnitude as the input. The sanitized phase-only setting uses phase features after temporal unwrapping and packet-wise linear trend removal along subcarriers. The two amplitude–phase settings concatenate the amplitude feature with either temporally unwrapped phase or sanitized phase.
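The phase sanitization described above can be sketched in NumPy as follows, assuming complex CSI stored as a `(packets, subcarriers)` array. The least-squares line fit (slope and offset per packet) and the concatenation along the subcarrier axis are assumptions; the text specifies only temporal unwrapping followed by packet-wise linear trend removal along subcarriers.

```python
import numpy as np


def sanitize_phase(csi_complex):
    """Unwrap phase along time, then remove a per-packet linear trend.

    csi_complex: (T, S) complex array (T packets, S subcarriers).
    """
    # Temporal unwrapping: remove 2*pi jumps between consecutive packets.
    phase = np.unwrap(np.angle(csi_complex), axis=0)

    # Least-squares line over the subcarrier index, fitted per packet.
    k = np.arange(phase.shape[1], dtype=float)
    k_centered = k - k.mean()
    slope = (phase * k_centered).sum(axis=1, keepdims=True) / (k_centered ** 2).sum()
    offset = phase.mean(axis=1, keepdims=True)
    return phase - slope * k_centered - offset


def amplitude_phase_input(csi_complex):
    """Concatenate amplitude and sanitized phase along the subcarrier axis."""
    return np.concatenate([np.abs(csi_complex), sanitize_phase(csi_complex)], axis=1)


# Toy CSI whose phase is exactly linear across subcarriers in every packet.
slopes = np.array([0.01, 0.02, 0.03, 0.04])[:, None]
offsets = np.array([0.1, 0.2, -0.1, 0.3])[:, None]
toy_csi = 2.0 * np.exp(1j * (slopes * np.arange(30) + offsets))
toy_sanitized = sanitize_phase(toy_csi)
```

A purely linear phase, such as one caused by a timing offset, is removed entirely by this step, so only deviations from the per-packet line remain as input features.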
As shown in Table 9, the sanitized phase-only setting performs poorly, while adding phase to amplitude leads to only a marginal change compared with the amplitude-only setting. These results indicate that phase information does not provide a practically meaningful improvement over the amplitude-only setting in this experiment. Therefore, the current study keeps MSPoolNet as an amplitude-based lightweight model, while reliable phase-amplitude fusion remains a future direction requiring more careful calibration and robustness evaluation.
5. Conclusions
This study investigates fine-grained robotic arm motion recognition from CSI amplitude for contactless robotic state sensing. To accommodate weak local CSI perturbations, temporally sparse discriminative cues, and lightweight deployment requirements, MSPoolNet is developed by integrating an adaptive temporal downsampling module, a temporal gating module, and a lightweight pooling-based Transformer encoding module. Through this design, the model enhances local time–frequency responses, emphasizes informative temporal segments, and aggregates compact token representations for final recognition.
Experimental results on RoboMNIST and RoboFiSense demonstrate the effectiveness of the proposed method for robotic arm motion recognition. In particular, MSPoolNet achieves strong recognition performance while maintaining a compact model structure. The cross-speed, ablation, and test-time signal degradation results further indicate that the proposed architecture provides a favorable balance among recognition accuracy, model efficiency, and robustness.
Nevertheless, the robustness evaluation in the current study is still limited. Although test-time signal degradation experiments are conducted to examine the influence of noise, subcarrier loss, frequency masking, and temporal misalignment, these experiments do not fully represent real-environment robustness. In particular, the current study does not yet include systematic evaluations under room layout changes, transmitter/receiver placement variations, realistic Wi-Fi interference, hardware differences, or long-term environmental dynamics. Such deployment-oriented robustness requires more comprehensive data collection and evaluation protocols across diverse environments, positions, and interference conditions.
Another limitation is that this study mainly focuses on CSI amplitude. Although the phase feature comparison indicates that adding phase may provide a slight accuracy improvement, the observed gain is marginal under the current setting. Considering the trade-off between the additional calibration and fusion complexity introduced by phase processing and the limited performance benefit observed in this experiment, we did not further develop a phase-aware model in the present study. Future work can therefore investigate more reliable phase calibration and phase-guided model design, especially under speed variation, signal degradation, and real deployment changes.
In summary, future work will focus on more comprehensive real-world robustness evaluation, online deployment validation (e.g., on embedded platforms such as Raspberry Pi), and reliable phase-aware or phase-amplitude fusion models to further improve the practicality of CSI-based robotic arm sensing under complex deployment conditions.