Article

TCN-MAML: A TCN-Based Model with Model-Agnostic Meta-Learning for Cross-Subject Human Activity Recognition

by Chih-Yang Lin 1,*, Chia-Yu Lin 2, Yu-Tso Liu 2, Yi-Wei Chen 2, Hui-Fuang Ng 3 and Timothy K. Shih 2,*

1 Department of Mechanical Engineering, National Central University, Taoyuan City 32001, Taiwan
2 Department of Computer Science and Information Engineering, National Central University, Taoyuan City 32001, Taiwan
3 Department of Computer Science, Universiti Tunku Abdul Rahman, Kampar 31900, Perak, Malaysia
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(13), 4216; https://doi.org/10.3390/s25134216
Submission received: 26 May 2025 / Revised: 2 July 2025 / Accepted: 3 July 2025 / Published: 6 July 2025
(This article belongs to the Special Issue Sensors and Sensing Technologies for Object Detection and Recognition)

Abstract

Human activity recognition (HAR) using Wi-Fi-based sensing has emerged as a powerful, non-intrusive solution for monitoring human behavior in smart environments. Unlike wearable sensor systems that require user compliance, Wi-Fi channel state information (CSI) enables device-free recognition by capturing variations in signal propagation caused by human motion. This makes Wi-Fi sensing highly attractive for ambient healthcare, security, and elderly care applications. However, real-world deployment faces two major challenges: (1) significant cross-subject signal variability due to physical and behavioral differences among individuals, and (2) limited labeled data, which restricts model generalization. To address these sensor-related challenges, we propose TCN-MAML, a novel framework that integrates temporal convolutional networks (TCN) with model-agnostic meta-learning (MAML) for efficient cross-subject adaptation in data-scarce conditions. We evaluate our approach on a public Wi-Fi CSI dataset using a strict cross-subject protocol, where training and testing subjects do not overlap. The proposed TCN-MAML achieves 99.6% accuracy, demonstrating superior generalization and efficiency over baseline methods. Experimental results confirm the framework’s suitability for low-power, real-time HAR systems embedded in IoT sensor networks.

1. Introduction

The field of human activity recognition (HAR) has garnered increasing attention, emerging as a dynamic area of research, particularly in the realm of device-free recognition. Traditional HAR methods typically rely on camera-based and sensor-based systems. However, these methods often incur significant costs and require wearable devices, which can lead to user discomfort and inconvenience [1,2,3,4,5]. A notable advancement in device-free recognition is the utilization of Channel State Information (CSI) derived from Wi-Fi signals [6], a method that has gained significant traction in recent studies [7,8,9,10].
HAR systems based on Wi-Fi are particularly notable for their wide range of applications in fields such as healthcare, security, and elderly care [11]. These systems leverage Wi-Fi signals to detect and analyze human activities within indoor settings, offering a non-intrusive and user-friendly alternative to traditional sensor- and device-based approaches. Despite their numerous benefits, Wi-Fi-based HAR systems face several challenges that hinder their widespread adoption. A primary challenge is the inherent variability of Wi-Fi signals, which can fluctuate across different timeframes, environments, and individuals. This variability, stemming from factors such as signal interference, environmental changes, and individual physical characteristics, creates significant obstacles to achieving accurate and consistent activity recognition. The impact of these signal fluctuations is demonstrated in Figure 1, where Wi-Fi CSI signals for the ’kicking with the right leg’ action from the Human-to-Human Interaction (HHI) dataset [7] exhibit marked differences between various subject pairs. These variations, stemming from inter-subject differences and temporal changes, pose considerable challenges to consistent cross-subject recognition, even for human observers.
Traditional machine learning methods often struggle with limited training data, particularly in cross-subject recognition tasks. To address this challenge, few-shot learning [12] has emerged as a promising solution. This paper introduces a novel Wi-Fi-based HAR framework designed to mitigate issues arising from cross-subject signal variability and data scarcity. Our approach integrates Temporal Convolutional Networks (TCNs) [13] with Model-Agnostic Meta-Learning (MAML) [12], forming the TCN-MAML framework, which facilitates rapid adaptation to new subjects with minimal fine-tuning. Compared to recurrent models such as LSTM and attention-based models like Transformer, TCNs offer a lightweight architecture with lower memory and computational requirements, rendering them highly suitable for real-time deployment in resource-constrained environments. Additionally, MAML optimizes model initialization, significantly reducing training overhead while ensuring fast convergence.
To assess the scalability and deployment feasibility of TCN-MAML, we conducted extensive experiments on the HHI dataset [7], partitioning it into distinct training and testing subsets for evaluation. The use of dilated causal convolutions enhances inference efficiency, thereby enabling fast execution on low-power devices. Furthermore, the ability to adapt with limited labeled data significantly reduces the need for frequent retraining, which is a crucial factor for practical IoT and embedded system deployments. These findings highlight the potential of TCN-MAML in real-world HAR applications, where power efficiency, low latency, and robustness to new subjects are critical considerations. The major contributions of this study are as follows:
  • We propose a novel TCN-MAML model for effective few-shot cross-subject human activity recognition using Wi-Fi signals. To the best of our knowledge, this study is the first to integrate Temporal Convolutional Networks with the MAML algorithm for HAR, and the combination surpasses state-of-the-art methods in performance.
  • We propose three augmentation methods specifically designed for Wi-Fi signals. These methods incorporate variations into the original signals to effectively expand the dataset and enrich its diversity. Applying the proposed augmentation techniques yields over a 10% improvement in test accuracy compared to the baseline model trained without augmentation.
  • This study is the first to conduct realistic cross-subject experiments by partitioning the HHI dataset into two non-overlapping subsets, where the model was trained on one subset and tested on the other. The proposed approach yields a remarkable recognition accuracy of 99.60% on new subjects.
The remainder of this paper is structured as follows: Section 2 provides a review of related Wi-Fi-based HAR, with a focus on few-shot learning approaches. Section 3 defines the targeted problem and presents a detailed description of the proposed method. Section 4 presents the experiments and results. Lastly, Section 5 concludes the paper and outlines potential avenues for future research.

2. Related Work

In recent years, Wi-Fi-based human activity recognition (HAR) has garnered significant attention due to its wide applications in healthcare monitoring, security, elderly care [11], and other domains. While many existing approaches have achieved impressive performance on benchmark datasets, a large portion of prior research tends to overlook or underemphasize the challenge of cross-subject generalization, a critical factor for real-world deployment where models must adapt to new users without retraining. In this section, we review existing Wi-Fi-based HAR methods and several recent HAR systems that employ few-shot learning for cross-domain or cross-subject recognition.

2.1. Wi-Fi-Based Human Activity Recognition (HAR)

Early-stage Wi-Fi-based HAR studies primarily relied on Convolutional Neural Network (CNN)-based models. This choice was motivated by the unique activity patterns discernible within Channel State Information (CSI) signals, rendering CNNs an effective option for activity recognition. For instance, Kabir et al. [14] introduced the CSI-based Inception Attention Network (CSI-IANet) for Wi-Fi-based HAR. This model combined CNNs with spatial attention mechanisms and was evaluated using the HHI dataset [7]. Notably, CSI-IANet achieved an average accuracy of 91.30%, making it the first CNN-based model to surpass the 90% accuracy threshold on this dataset. Subsequently, Shafiqul et al. [6] introduced HHI-AttentionNet, which incorporated a depth-wise CNN and a customized attention mechanism. Their method utilized a Butterworth low-pass filter to eliminate outliers and high-frequency noise, and a Gaussian smoothing function to mitigate short peaks, achieving an impressive 95.47% accuracy on the HHI dataset. These models achieved strong performance on benchmark datasets but were primarily evaluated in single-subject or controlled cross-subject settings, thereby limiting their ability to generalize to new individuals. Another notable model is H2HI-Net [15], a unified architecture that combines a Residual Neural Network with a Bi-directional Gated Recurrent Unit (Bi-GRU). The Residual Neural Network encodes spatial embeddings, while the Bi-GRU focuses on temporal embeddings. The spatial and temporal representations are then concatenated and processed by a two-layer DenseNet. The model achieves an average accuracy of 96.39% on the HHI dataset. In our previous work [10], we proposed the TCN-AA model, which incorporates data augmentation, an attention mechanism, and Temporal Convolutional Networks (TCNs) to effectively extract features from CSI signals while maintaining a low parameter count compared to other methods. TCN-AA achieved an accuracy of 99.42% on the HHI dataset, outperforming H2HI-Net by 3%. However, while TCN-AA enhances feature extraction, it does not explicitly address cross-subject adaptation, rendering it less effective in real-world settings where new individuals are introduced.
In contrast, the proposed TCN-MAML model extends beyond TCN-AA by integrating Model-Agnostic Meta-Learning (MAML), which allows the model to rapidly adapt to new subjects with minimal training samples. Unlike H2HI-Net, which relies on recurrent components, TCN-MAML leverages a fully convolutional structure, rendering it more computationally efficient while achieving strong cross-subject generalization. Our approach not only maintains high accuracy (99.6%) but also demonstrates improved adaptability in scenarios where training data is limited, making it more suitable for real-world deployments.

2.2. Few-Shot Learning-Based HAR Approaches

Few-shot learning techniques have been applied in Wi-Fi-based HAR to improve cross-subject generalization, allowing models to recognize activities performed by previously unseen individuals. As shown in Table 1, different few-shot approaches vary in their adaptability, reliance on additional data, and computational feasibility. ProtoNet [16] addresses cross-subject variability by learning distance-based embeddings, but these embeddings remain fixed after training, making it difficult for the model to adapt to highly dynamic CSI signals from new subjects. TOSS [17] enhances cross-domain generalization by integrating semi-supervised meta-learning, where the model is exposed to both labeled and unlabeled target-domain data. While this improves adaptation, it still requires additional labeled target data, which may not always be available in real-world settings. CSI-GDAM [18] applies a graph-based attention mechanism to model relationships between different activities, improving representation learning. However, it requires manually defining graph structures, such as node representations for activity classes and edge weights based on inter-class relationships, which constitutes extensive feature engineering. This manual design process limits the model’s flexibility and scalability across different environments. CAUTION [19], designed specifically for Wi-Fi-based authentication, transforms CSI signals into a low-dimensional feature space for identity verification rather than general HAR. While effective in authentication tasks, its reliance on manual feature extraction further restricts its adaptability. In contrast, our TCN-MAML model eliminates the need for predefined embeddings and additional labeled target data while maintaining high adaptability to new subjects. Unlike CSI-GDAM and CAUTION, it does not require feature engineering and is applicable to general HAR applications. By integrating TCN’s efficient temporal feature extraction with MAML’s gradient-based adaptation, TCN-MAML allows for fast generalization with minimal data, reducing the need for frequent retraining. This makes it a scalable, data-efficient, and computationally feasible solution for real-world Wi-Fi-based HAR applications, such as healthcare monitoring, security surveillance, and smart home automation.

2.3. Recent Advances in Deep Learning-Based Wi-Fi HAR

Recent studies have continued to improve Wi-Fi-based human activity recognition (HAR) by leveraging deep learning models and signal selection strategies. Sousa et al. [20] proposed a CNN-based subcarrier selection network that automatically identifies the most discriminative CSI subcarriers, significantly reducing input complexity while maintaining high recognition accuracy. This approach jointly performs subcarrier selection and classification, offering a lightweight alternative for real-world deployment. In another study, Shahverdi et al. [21] explored the use of various deep learning architectures, including CNNs, LSTMs, and GRUs, to model temporal patterns in CSI signals. Their experiments demonstrated that GRU models were particularly effective in capturing time-dependent activity features, leading to robust HAR performance. These works reflect an ongoing trend toward optimizing both input representation and model efficiency in recent CSI-based HAR research.

3. Proposed Method

In this section, we first detail the CSI dataset used and define the problem addressed in this study. Next, we present the proposed pre-processing and augmentation techniques. Finally, a detailed discussion of the novel Temporal Convolutional Network with Model-Agnostic Meta-Learning (TCN-MAML) model is provided.

3.1. Preliminaries

3.1.1. The HHI CSI Dataset

The publicly accessible Human-to-Human Interaction (HHI) CSI dataset [7] is employed in this study. This dataset consists of CSI data collected from 66 individuals, comprising 40 unique subject pairs, each instructed to execute 12 distinct interaction activities ($N_c = 12$). These activities include actions such as approaching, departing, handshaking, high-fiving, hugging, kicking with the left leg, kicking with the right leg, pointing with either hand, and various punching and pushing motions. Each subject pair performed 10 trials for each activity, culminating in a total of 4800 samples. Each trial lasted between 5 and 6 s. In every trial, the subjects were asked to stand still for 2 s and then permitted to start the interaction, resulting in packet lengths ranging from 1040 to 2249 ($N_p \in [1040, 2249]$). The data-collection environment was set up as a 2 × 3 Multiple-Input Multiple-Output (MIMO) system, comprising 2 transmitters ($N_t = 2$) and 3 receivers ($N_r = 3$). The Channel State Information (CSI) data were obtained using a publicly available CSI tool [22] across 30 subcarriers ($N_s = 30$). Readers may refer to [7] for additional information regarding the data-collection process.

3.1.2. Problem Definition

In this study, we partition the HHI dataset based on subject pairs to rigorously evaluate cross-subject generalization. Let D denote the HHI dataset, and S represent all subject pairs within it. The dataset is divided into two distinct subsets, $S_1$ and $S_2$, such that $S_1 \cup S_2 = S$ and $S_1 \cap S_2 = \emptyset$. This ensures that no subject pairs overlap between training and testing. Specifically, $S_1$ contains 30 subject pairs for training, while $S_2$ consists of 20 subject pairs for testing. The number of activity types remains identical across both subsets, i.e., $N_c^{S_1} = N_c^{S_2} = N_c$. The Wi-Fi CSI samples collected from $S_1$ and $S_2$ are denoted as $D_{s1}$ (source domain) and $D_{s2}$ (target domain), respectively. Our objective is to develop a Wi-Fi-based HAR model that is trained on $D_{s1}$ and can effectively generalize to recognize the same activities in $D_{s2}$. To demonstrate the challenge in Wi-Fi-based cross-subject recognition, Figure 2 visualizes CSI data using t-Distributed Stochastic Neighbor Embedding (t-SNE) for similar activities from $D_{s1}$ and $D_{s2}$. It is evident that significant distribution disparities exist between $D_{s1}$ and $D_{s2}$, rendering cross-subject recognition a highly challenging task.
We acknowledge that partitioning by subject pairs may introduce potential biases, as different individuals exhibit variations in physical characteristics, movement patterns, and signal interactions. However, this setup reflects real-world deployment scenarios, where the model must recognize activities performed by previously unseen individuals rather than memorizing subject-specific patterns. Despite these challenges, our experimental results demonstrate that the proposed model achieves high recognition accuracy under this strict evaluation setup, validating its robustness in handling cross-subject variability. A detailed discussion on the impact of this partitioning strategy on generalization performance will be presented in Section 4.

3.2. Preprocessing

Direct use of raw CSI values for HAR is challenging due to the inherent noise within the CSI signals. Therefore, preprocessing CSI values is crucial before further analysis. While the transmitter sends data at a predetermined frequency during the collection phase, the actual sampling rate can fluctuate due to various internal and external factors. In the HHI dataset, packet sizes range from 1040 to 2249. To safeguard data quality, a threshold of 1500 packets ($N_p = 1500$) is established, and data with fewer packets than this threshold are discarded. To maintain consistency and streamline subsequent analysis, packet counts exceeding 1500 are normalized to this value. This procedure guarantees that all input data adhere to a uniform size, represented as $(N_t \times N_r, N_p, N_s)$. The values of each Transmitter–Receiver pair (TR pair) are then standardized to the range [−1, 1] to ensure that the CSI values are comparable across different TR pairs. To further improve data quality, a low-pass Butterworth filter is applied to each TR pair. This filter is particularly effective at eliminating high-frequency noise that could potentially distort the CSI signals. Lastly, a one-dimensional Discrete Wavelet Transform (1D-DWT) is applied twice to down-sample the processed signal. The original signal is decomposed into approximation coefficients, which retain low-frequency components, and detail coefficients, which contain high-frequency noise. Since the most relevant activity information resides in the low-frequency domain, we retain the approximation coefficients while reducing the packet length from 1500 to 375, effectively lowering computational complexity while preserving key motion features. Compared to alternative techniques, the 1D-DWT offers superior temporal structure preservation. Unlike the Short-Time Fourier Transform (STFT), which suffers from fixed window-size trade-offs, or Principal Component Analysis (PCA), which may remove essential nonlinear dependencies, the 1D-DWT dynamically adjusts to signal variations while achieving effective compression. As shown in Figure 3, the application of a Butterworth filter removes high-frequency noise, while the repeated 1D-DWT maintains key feature patterns, thereby reducing both data storage requirements and training time, making it ideal for resource-constrained environments.
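To make the pipeline concrete, the following is a minimal sketch of the preprocessing chain described above (packet-length thresholding, per-TR-pair scaling, Butterworth filtering, and two rounds of 1D-DWT). The filter order, cutoff frequency, and wavelet family ("db1") are illustrative assumptions, not the authors' exact settings.

```python
# Minimal preprocessing sketch for one CSI sample of shape
# (N_t * N_r, N_p_raw, N_s). Filter order, cutoff, and the 'db1'
# wavelet are assumptions for illustration.
import numpy as np
import pywt
from scipy.signal import butter, filtfilt

def preprocess_csi(csi, n_packets=1500, cutoff=0.1, order=4):
    if csi.shape[1] < n_packets:
        return None                      # fewer than 1500 packets: discard
    csi = csi[:, :n_packets, :].astype(float)
    # Standardize each TR pair to the range [-1, 1].
    for i in range(csi.shape[0]):
        p = csi[i]
        csi[i] = 2.0 * (p - p.min()) / (p.max() - p.min() + 1e-8) - 1.0
    # Low-pass Butterworth filter along the time axis removes
    # high-frequency noise.
    b, a = butter(order, cutoff)         # cutoff as a normalized frequency
    csi = filtfilt(b, a, csi, axis=1)
    # Two rounds of 1D-DWT, keeping only the approximation coefficients:
    # packet length 1500 -> 750 -> 375.
    for _ in range(2):
        csi, _detail = pywt.dwt(csi, "db1", axis=1)
    return csi                           # shape (N_t * N_r, 375, N_s)
```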

3.3. Data Augmentation

While an abundance of large image datasets, such as ImageNet [23] and COCO [24], exists for image recognition tasks, no such large datasets are available for Wi-Fi-based applications. Consequently, the application of data augmentation techniques becomes crucial for Wi-Fi datasets and for training Wi-Fi-based HAR models. This study proposes three augmentation techniques tailored for Wi-Fi data to mitigate the data scarcity problem.

3.3.1. Random Dropout

Random dropout is a data augmentation mechanism in which a fraction of the CSI values is randomly set to zero to simulate unstable Wi-Fi transmission environments (i.e., packet loss). The probability of setting a value to zero, denoted as $\lambda$, is randomly selected from the range (0, 0.07). This augmentation technique was inspired by the success of dropout regularization in deep learning. Hence, it is reasonable to expect that applying dropout as an augmentation technique for Wi-Fi data would yield positive results.
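A random-dropout pass of this kind takes only a few lines; the sketch below assumes the mask is drawn independently per CSI value, which the text implies but does not state explicitly.

```python
import numpy as np

def random_dropout(csi, max_rate=0.07):
    # Draw the drop probability lambda from (0, 0.07), then zero out each
    # CSI value independently with that probability to mimic packet loss.
    lam = np.random.uniform(0.0, max_rate)
    keep_mask = np.random.rand(*csi.shape) >= lam
    return csi * keep_mask
```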

3.3.2. Inter-Class Mixing

CSI signals are susceptible to external interferences such as background radio noise and disturbances caused by moving subjects, causing the received signals to fluctuate over time. To improve the resilience of the HAR model to noisy signals, an augmentation technique called inter-class mixing is proposed. This technique involves combining samples from different action classes to enhance the diversity of the dataset. The mixing procedure is represented by the following equation:
$$D = A \times (1 - \varepsilon_1) + B \times \varepsilon_2 + C \times \varepsilon_3 \tag{1}$$
A new sample, D, is generated by combining three existing samples, A, B, and C. Sample D inherits its class label from sample A, whereas samples B and C have labels different from that of sample A. Throughout our experiments, the value of $\varepsilon_k$ (where $k \in \{1, 2, 3\}$) was randomly chosen from the range (0, 0.05). This mixing technique introduces diversity into the dataset and thereby enhances the model's robustness.

3.3.3. Intra-Class Mixing

It is observed that when different subjects perform the same activity, significant variations occur in the received CSI signals, caused by discrepancies in the subjects' heights and body shapes. This disturbance from subject movement is more pronounced than that from background noise. Motivated by this observation, the intra-class mixing augmentation technique is proposed. Intra-class mixing utilizes the same mixing equation as Equation (1), with the distinction that A, B, and C are samples from the same action class. The intra-class mixing technique should be beneficial in strengthening the model's resilience to intra-class variations.
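Both mixing variants reduce to a single helper built around Equation (1); the sketch below is an illustrative reading in which the caller decides whether B and C come from the same class as A (intra-class) or from different classes (inter-class).

```python
import numpy as np

def mix_samples(a, b, c, eps_max=0.05):
    # Equation (1): D = A*(1 - eps1) + B*eps2 + C*eps3, with each eps_k
    # drawn from (0, 0.05). D inherits the class label of sample A.
    e1, e2, e3 = np.random.uniform(0.0, eps_max, size=3)
    return a * (1.0 - e1) + b * e2 + c * e3
```

For inter-class mixing, b and c are sampled from classes other than a's class; for intra-class mixing, all three samples share a class.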

3.4. TCN-MAML Model

The proposed model parameters are listed in Table 2.

3.4.1. TCN Model

CSI signals, by nature, are time-series data. As such, sequential deep learning models such as Long Short-Term Memory (LSTM) [25] and Transformer [26] are traditionally well-suited for processing such signals. However, these models demand significant computational resources, require large datasets, and often exhibit long training times, rendering them impractical for real-time Wi-Fi-based human activity recognition (HAR) applications. LSTMs process data sequentially, leading to high memory consumption and limited parallelization, while Transformers rely on self-attention mechanisms with quadratic complexity, making them computationally expensive. In contrast, Temporal Convolutional Networks (TCNs) [13] offer a more efficient alternative by employing one-dimensional causal and dilated convolutions, which effectively capture long-range temporal dependencies without the need for recurrence mechanisms. Compared to LSTMs and Transformers, TCNs reduce training time, enable parallel computation, and maintain high recognition accuracy with fewer computational resources. An overview of the TCN architecture used in this study is shown in Figure 4. Each TCN layer begins with a dilated causal convolution layer, followed by a ReLU activation layer and a dropout layer, as shown in Figure 5.
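As a reference for the layer structure in Figure 5, the following PyTorch sketch implements one dilated causal convolution block and stacks three of them with the dilation rates and filter counts from Table 2. The input channel count of 6 ($N_t \times N_r$) and the left-padding scheme are assumptions based on the data layout of Section 3.2, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNLayer(nn.Module):
    """One TCN block: dilated causal convolution -> ReLU -> dropout."""
    def __init__(self, in_ch, out_ch, kernel_size=15, dilation=1, dropout=0.1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation   # causal: pad left only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                              # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))               # output at t sees inputs <= t
        return self.drop(torch.relu(self.conv(x)))

# Three layers with dilation rates 1, 2, 4 and 30/50/75 filters (Table 2);
# 6 input channels correspond to the N_t x N_r TR pairs (assumed layout).
backbone = nn.Sequential(
    TCNLayer(6, 30, dilation=1),
    TCNLayer(30, 50, dilation=2),
    TCNLayer(50, 75, dilation=4),
)
```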

3.4.2. TCN with MAML

To facilitate cross-subject recognition, the TCN model must efficiently adapt to new subject pairs. To achieve this, we propose TCN-MAML, which integrates Temporal Convolutional Networks (TCNs) with Model-Agnostic Meta-Learning (MAML) [12]. Unlike traditional training, which requires extensive retraining for each new subject, MAML learns an optimal model initialization, allowing for rapid adaptation with minimal labeled samples. This is particularly advantageous for Wi-Fi CSI-based HAR, where collecting large labeled datasets for every new user is impractical due to inter-subject variability.
MAML’s gradient-based adaptation allows TCN-MAML to adjust directly from task-specific gradients, rather than relying on predefined embeddings or handcrafted features. This enables the model to automatically generalize to new subjects, significantly improving cross-subject recognition while maintaining computational efficiency. Unlike approaches that require explicit task-specific engineering, TCN-MAML autonomously learns transferable representations, making it well-suited for deployment in low-resource environments. By combining TCN’s efficient feature extraction with MAML’s adaptability, TCN-MAML ensures both accuracy and scalability in real-world Wi-Fi HAR applications, including healthcare monitoring and security surveillance, where minimizing retraining efforts is crucial for practical deployment.
The MAML algorithm achieves this by dividing model training into two phases: inner-loop optimization and outer-loop optimization. For each meta-learning episode i, an N-way K-shot task $T_i = [S_i, Q_i]$ is randomly sampled from the training dataset. The model (e.g., the TCN), denoted as $f_{\theta_i}$ with model weights $\theta_i$ at the ith episode, together with the support set $S_i$, is used to kick-start the inner-loop optimization process. At the start of the inner-loop optimization, a temporary weight vector $\theta_{tmp}$ is created and initialized to $\theta_i$. The temporary model $f_{\theta_{tmp}}$ is then fine-tuned on $S_i$ with a few gradient-descent steps. Next, the model with the updated $\theta_{tmp}$ is evaluated on the query set $Q_i$ to produce the meta-training loss. The meta loss is then backpropagated in the outer-loop optimization process to update $\theta_i$. This process is repeated until the model has been trained on all episodes. Note that at each training episode, the N classes are randomly selected, so they might differ between episodes. The inner-loop and outer-loop optimization can be formulated as:
Inner Loop:
$$\theta_{tmp}^{(m)} =
\begin{cases}
\theta_i - \alpha \nabla_\theta \mathcal{L}_{T_i}\left(S_i, f_{\theta_i}, Y_{S_i}\right), & \text{if } m = 1 \\
\theta_{tmp}^{(m-1)} - \alpha \nabla_\theta \mathcal{L}_{T_i}\left(S_i, f_{\theta_{tmp}^{(m-1)}}, Y_{S_i}\right), & \text{if } 1 < m \le M
\end{cases} \tag{2}$$
Outer Loop:
$$\theta_{i+1} = \theta_i - \beta \nabla_\theta \mathcal{L}_{T_i}\left(Q_i, f_{\theta_{tmp}^{(M)}}, Y_{Q_i}\right), \qquad Q_i \subset D_{s1} \cup D_{s2} \tag{3}$$
where:
  • $\theta_i$: model parameters before adaptation to task i;
  • $\theta_{tmp}^{(m)}$: task-specific parameters after m inner-loop updates;
  • $\alpha$, $\beta$: learning rates for the inner- and outer-loop updates, respectively;
  • $S_i$, $Q_i$: support and query sets sampled from task $T_i$;
  • $Y_{S_i}$, $Y_{Q_i}$: ground-truth labels for the support and query sets;
  • $f_\theta$: model parameterized by $\theta$;
  • $\mathcal{L}_{T_i}$: task-specific loss function;
  • M: total number of inner-loop updates, with m indexing the mth update.
After MAML training, the final model should be able to adapt to a new task through a few fine-tuning steps using a small support set sampled from the new domain.
In the proposed TCN-MAML model, the support sets used in the inner-loop optimization are sampled from the $D_{s1}$ subset ($S_i \subset D_{s1}$), whereas the query sets in the outer-loop optimization are sampled from both $D_{s1}$ and $D_{s2}$ ($Q_i \subset D_{s1} \cup D_{s2}$). Note that the support set and the query set contain samples from the same N classes, even though the samples might come from different subsets. By including samples from both the source and target domains in the outer-loop optimization process, the TCN model's adaptability to the target domain is effectively enhanced, thereby enabling cross-domain recognition. The proposed TCN-MAML model requires gathering a few labeled samples from the target domain to perform fine-tuning for that domain. In many real-world situations, it is reasonable to assume that obtaining samples from the target domain is relatively straightforward [17]. For instance, when deploying smart devices in a new environment, it is often customary to request that users engage in specific tasks or activities to facilitate the calibration of these devices for the new environment.
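The two optimization loops of Equations (2) and (3) can be sketched as follows. This is a simplified, single-task illustration using plain SGD inner updates; it relies on torch.func.functional_call (PyTorch ≥ 2.0), so it is a conceptual sketch rather than a drop-in reproduction of the training code, which the paper reports running under PyTorch 1.10.1.

```python
import torch

def maml_episode(model, loss_fn, support, query, alpha=5e-4, inner_steps=5):
    """Return the outer-loop (query) loss for one N-way K-shot task.

    support/query: (inputs, labels) tuples; support comes from D_s1,
    query from D_s1 or D_s2 (Section 3.4.2)."""
    x_s, y_s = support
    x_q, y_q = query
    # Inner loop (Eq. 2): adapt a temporary copy theta_tmp on the support set.
    theta_tmp = dict(model.named_parameters())
    for _ in range(inner_steps):
        out = torch.func.functional_call(model, theta_tmp, (x_s,))
        grads = torch.autograd.grad(loss_fn(out, y_s),
                                    list(theta_tmp.values()), create_graph=True)
        theta_tmp = {name: p - alpha * g
                     for (name, p), g in zip(theta_tmp.items(), grads)}
    # Outer loop (Eq. 3): the query loss, evaluated through the adapted
    # weights, is backpropagated to update the shared initialization theta_i.
    out_q = torch.func.functional_call(model, theta_tmp, (x_q,))
    return loss_fn(out_q, y_q)
```

A typical outer-loop step with a meta-optimizer at learning rate $\beta$ (e.g., AdamW, as used in Section 4.1) would then be: `loss = maml_episode(model, loss_fn, support, query)`, followed by `opt.zero_grad()`, `loss.backward()`, and `opt.step()`.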

4. Experiments

In this section, extensive experiments are carried out to demonstrate the effectiveness of the TCN-MAML model in cross-subject human-to-human interaction recognition.

4.1. Experimental Setup

For the N-way K-shot few-shot learning setup, the value of N is set to 5 and K is set to 1. As such, in each training episode, the support set S contains 5 × 1 = 5 samples randomly drawn from the $D_{s1}$ subset of the HHI dataset. The size of the query set Q is set to 1, and the query samples are randomly drawn from both the $D_{s1}$ and $D_{s2}$ subsets of the HHI dataset (see Section 3 for the dataset split). To ensure adequate query samples are drawn from both subsets, we set the sampling ratio to 3:1 for $D_{s1}$ and $D_{s2}$, respectively.
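For clarity, episode construction under this setup might look as follows; the dictionaries mapping class labels to sample lists and the 0.75 query-sampling probability (the 3:1 ratio) are illustrative assumptions.

```python
import random

def sample_episode(ds1, ds2, n_way=5, k_shot=1):
    """ds1/ds2: dicts mapping class label -> list of CSI samples.
    Support comes from D_s1 only; the single query sample comes from
    D_s1 or D_s2 at a 3:1 ratio (Section 4.1)."""
    classes = random.sample(sorted(ds1.keys()), n_way)
    support = [(random.choice(ds1[c]), c)
               for c in classes for _ in range(k_shot)]
    pool = ds1 if random.random() < 0.75 else ds2   # 3:1 sampling ratio
    q_class = random.choice(classes)
    query = [(random.choice(pool[q_class]), q_class)]
    return support, query
```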
The TCN-MAML model is trained over 30 meta-epochs, with each epoch comprising multiple 5-way 1-shot tasks. The total training time is approximately 6 h. All experiments were conducted on a computer equipped with a GeForce GTX 1060 graphics card (NVIDIA, Santa Clara, CA, USA) running the Windows 10 operating system, with Python 3.8 and the PyTorch 1.10.1 deep learning framework. The model is trained using the AdamW optimizer with an exponential learning rate decay of 0.988 per epoch.

4.2. Performance Evaluation

Cross-Subject Recognition Performance

The performance of the proposed TCN-MAML model on Wi-Fi-based human activity recognition is evaluated using the HHI dataset. Its performance is compared with our previously proposed TCN-AA model [10], which has been shown to achieve state-of-the-art recognition accuracy on the HHI dataset. To illustrate the efficacy of the models in cross-subject recognition, they are evaluated using two non-overlapping subsets of the HHI dataset. Specifically, both the TCN-AA and TCN-MAML models are trained on the $D_{s1}$ subset and evaluated on the separate subset $D_{s2}$. To ensure a fair comparison, both models adhere to the same TCN architecture and employ the same augmentation techniques described in Section 3. The TCN-AA model is trained using conventional supervised learning, while the TCN-MAML model undergoes 5-way 1-shot learning with MAML. Another difference is that the dropout rate for TCN-AA is configured at 0.5, whereas for TCN-MAML, a dropout rate of 0.1 is chosen to achieve optimal performance. As shown in Figure 6, the TCN-AA model attains the best overall training accuracy of 97.84% and achieves 99.6% accuracy on the validation set. For more data-processing details, please refer to our previous work [10]. However, when assessing TCN-AA on the $D_{s2}$ subset, which was not included in the training process, its performance drops significantly. As illustrated in the confusion matrix in Figure 7, the average recognition accuracy across the 12 action classes is 67.4%, much lower than the 99.6% validation accuracy on $D_{s1}$.
Furthermore, a notable confusion of the model in recognizing similar activities is observed, such as ’pointing with the left hand’ versus ’pointing with the right hand’ and between ’kicking with the right leg’ and ’kicking with the left leg.’ These results indicate that existing Wi-Fi-based HAR models are not robust enough for cross-subject recognition applications. In contrast, the confusion matrix of TCN-MAML in Figure 8 exhibits greater robustness compared to TCN-AA. The results show that the proposed TCN-MAML model performs significantly better under the same conditions and does not exhibit the issues mentioned above.
To provide a more comprehensive evaluation, we also compare the performance of our proposed method across three different datasets: NTU-Fi HAR [27], UT-HAR [28], and our original HHI dataset. As shown in Table 3, our model achieves a testing accuracy of 99.12% on NTU-Fi HAR and 98.66% on UT-HAR, while maintaining a superior accuracy of 99.6% on the HHI dataset. These results confirm the effectiveness and generalizability of our method across diverse Wi-Fi-based human activity recognition datasets.
To further strengthen the comparative analysis, we provide a benchmark evaluation of our earlier TCN-AA model on the HHI dataset, alongside several recent and representative human activity recognition methods. These include classical models such as SVM, deep learning-based CNN and RNN variants, and attention-enhanced architectures. As presented in Table 4, the TCN-AA model achieves the highest recognition accuracy of 99.42%, outperforming H2HI-Net (96.39%) and HHI-AttentionNet (95.47%).
These results highlight the strong discriminative capability of the TCN-based model on the HHI dataset. The consistent performance improvement across a range of baselines—including CNNs, GRUs, and attention-based frameworks—demonstrates the robustness of our approach. This comparison also serves as an empirical foundation to emphasize the benefits of our full TCN-MAML model, which further enhances cross-subject generalization in data-scarce conditions.
Additionally, to evaluate the actual effect of the augmented data, in the second experiment the augmented dataset with 800 samples per class was randomly down-sampled to 400 samples per class, making its sample size consistent with the raw dataset. Table 6 shows the classification results using the down-sampled datasets. For this experiment, the testing accuracy increments of the three augmentation methods are 1.6%, 3%, and 0.6%, respectively. Although the improvement is not as substantial as when utilizing all augmented samples (Table 5), the augmented samples do enhance the overall diversity and quality of the training data, as evidenced by the observed reduction in loss across all three methods. Note that the validation accuracies in Table 5 and Table 6 are lower than the respective training and testing accuracies. This is mainly due to the much smaller sample sizes in the validation sets, since 10-fold cross-validation was applied throughout the experiments.
Furthermore, we evaluated the computational efficiency and memory requirements of our model to assess its feasibility for real-world deployment. The inference speed of TCN-MAML was tested on a dataset containing approximately 4.44 million packets captured over 74 min at 1000 packets per second. The entire evaluation process was completed in less than six minutes, operating at a speed 12 times faster than real-time, making it suitable for real-time applications. In terms of memory usage, the model consumes about 3.15 MB per second, ensuring feasibility for deployment on resource-constrained edge devices. Compared to recurrent architectures such as LSTM and Transformer, which have higher computational demands due to sequential dependencies, the TCN-based model significantly reduces inference time and memory overhead. This efficiency makes TCN-MAML well-suited for practical applications such as healthcare monitoring and security surveillance, where low latency and lightweight deployment are crucial.
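For reference, the reported throughput figures are mutually consistent (a worked check of the stated numbers, not an additional measurement):

$$74\,\text{min} \times 60\,\tfrac{\text{s}}{\text{min}} \times 1000\,\tfrac{\text{packets}}{\text{s}} = 4.44 \times 10^{6}\ \text{packets}, \qquad \frac{74\,\text{min}}{6\,\text{min}} \approx 12\times\ \text{real-time}.$$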

5. Conclusions

In this study, we introduced TCN-MAML, a novel model for human-to-human activity recognition using Wi-Fi CSI signals. By integrating Temporal Convolutional Networks (TCNs) with Model-Agnostic Meta-Learning (MAML), the proposed framework enabled the recognition of activities performed by previously unseen subject pairs without requiring full retraining. This capability is critical for real-world applications such as patient monitoring, where rapid adaptability and generalization are essential.
The model achieved a remarkable accuracy of 99.6% (Figure 9) on the HHI dataset, demonstrating strong generalization across subjects under a challenging benchmark. However, to further validate its robustness in real-world environments, expanding the dataset to include more diverse and subtle activities will be essential.
In addition, for deployment in safety-critical scenarios, we recommend integrating secondary verification mechanisms, such as CSI-based skeletal modeling or anomaly detection, to verify uncertain patterns and reduce potential risks. These mechanisms can trigger additional validation steps, such as camera activation or caregiver alerts, ensuring reliable operation in sensitive applications.

Author Contributions

Conceptualization, C.-Y.L. (Chih-Yang Lin); Methodology, C.-Y.L. (Chih-Yang Lin); Software, C.-Y.L. (Chia-Yu Lin) and Y.-T.L.; Validation, Y.-T.L. and Y.-W.C.; Formal Analysis, C.-Y.L. (Chia-Yu Lin), Y.-T.L. and Y.-W.C.; Investigation, C.-Y.L. (Chia-Yu Lin), Y.-T.L. and Y.-W.C.; Data Curation, C.-Y.L. (Chia-Yu Lin), Y.-T.L. and Y.-W.C.; Writing—Original Draft Preparation, Y.-T.L. and Y.-W.C.; Writing—Review and Editing, Y.-T.L. and Y.-W.C.; Visualization, C.-Y.L. (Chia-Yu Lin), Y.-T.L. and Y.-W.C.; Supervision, H.-F.N. and T.K.S.; Project Administration, C.-Y.L. (Chih-Yang Lin) and Y.-T.L.; Funding Acquisition, C.-Y.L. (Chih-Yang Lin). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Science and Technology Council (NSTC) under Grants 111-2221-E-008-110-MY3 and 114-2221-E-008-024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Human-to-Human Interaction (HHI) CSI dataset used in this study is publicly available at https://data.mendeley.com/datasets/3dhn4xnjxw/1. The code for this study is available at https://github.com/Teddy0955/TCN-MAML (both accessed on 5 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dang, L.M.; Min, K.; Wang, H.; Piran, M.J.; Lee, C.H.; Moon, H. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognit. 2020, 108, 107561. [Google Scholar] [CrossRef]
  2. Beddiar, D.R.; Nini, B.; Sabokrou, M.; Hadid, A. Vision-based human activity recognition: A survey. Multimed. Tools Appl. 2020, 79, 30509–30555. [Google Scholar] [CrossRef]
  3. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef]
  4. Münzner, S.; Schmidt, P.; Reiss, A.; Hanselmann, M.; Stiefelhagen, R.; Dürichen, R. CNN-based sensor fusion techniques for multimodal human activity recognition. In Proceedings of the 2017 ACM International Symposium on Wearable Computers, Maui, HI, USA, 11–15 September 2017; pp. 158–165. [Google Scholar]
  5. Yadav, S.K.; Tiwari, K.; Pandey, H.M.; Akbar, S.A. A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions. Knowl. Based Syst. 2021, 223, 106970. [Google Scholar] [CrossRef]
  6. Shafiqul, I.M.; Jannat, M.K.A.; Kim, J.W.; Lee, S.W.; Yang, S.H. HHI-AttentionNet: An enhanced human-human interaction recognition method based on a lightweight deep learning model with attention network from CSI. Sensors 2022, 22, 6018. [Google Scholar] [CrossRef] [PubMed]
  7. Alazrai, R.; Awad, A.; Baha’A, A.; Hababeh, M.; Daoud, M.I. A dataset for Wi-Fi-based human-to-human interaction recognition. Data Brief 2020, 31, 105668. [Google Scholar] [CrossRef] [PubMed]
  8. Uddin, M.H.; Ara, J.M.K.; Rahman, M.H.; Yang, S. A study of real-time physical activity recognition from motion sensors via smartphone using deep neural network. In Proceedings of the 2021 5th International Conference on Electrical Information and Communication Technology (EICT), Khulna, Bangladesh, 17–19 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  9. Ma, Y.; Zhou, G.; Wang, S. WiFi sensing with channel state information: A survey. ACM Comput. Surv. (CSUR) 2019, 52, 1–36. [Google Scholar] [CrossRef]
  10. Lin, C.Y.; Lin, C.Y.; Liu, Y.T.; Chen, Y.W.; Shih, T.K. WiFi-TCN: Temporal Convolution for Human Interaction Recognition based on WiFi signal. IEEE Access 2024, 12, 126970–126982. [Google Scholar] [CrossRef]
  11. Ahad, M.A.R. Activity recognition for health-care and related works. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, Singapore, 8–12 October 2018; pp. 1765–1766. [Google Scholar]
  12. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: New York, NY, USA, 2017; pp. 1126–1135. [Google Scholar]
  13. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  14. Kabir, M.H.; Rahman, M.H.; Shin, W. CSI-IANet: An inception attention network for human-human interaction recognition based on CSI signal. IEEE Access 2021, 9, 166624–166638. [Google Scholar] [CrossRef]
  15. Abdel-Basset, M.; Hawash, H.; Moustafa, N.; Mohammad, N. H2HI-Net: A dual-branch network for recognizing human-to-human interactions from channel-state information. IEEE Internet Things J. 2021, 9, 10010–10021. [Google Scholar] [CrossRef]
  16. Mettes, P.; Van der Pol, E.; Snoek, C. Hyperspherical prototype networks. In Proceedings of the Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  17. Zhou, Z.; Wang, F.; Yu, J.; Ren, J.; Wang, Z.; Gong, W. Target-oriented semi-supervised domain adaptation for WiFi-based HAR. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications, Online, 2–5 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 420–429. [Google Scholar]
  18. Zhang, Y.; Chen, Y.; Wang, Y.; Liu, Q.; Cheng, A. CSI-based human activity recognition with graph few-shot learning. IEEE Internet Things J. 2021, 9, 4139–4151. [Google Scholar] [CrossRef]
  19. Wang, D.; Yang, J.; Cui, W.; Xie, L.; Sun, S. CAUTION: A Robust WiFi-based human authentication system via few-shot open-set recognition. IEEE Internet Things J. 2022, 9, 17323–17333. [Google Scholar] [CrossRef]
  20. Sousa, C.; Fernandes, V.; Coimbra, E.A.; Huguenin, L. Subcarrier selection for HAR using CSI and CNN: Reducing complexity and enhancing accuracy. In Proceedings of the 2024 IEEE Virtual Conference on Communications (VCC), Online, 3–5 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  21. Shahverdi, H.; Nabati, M.; Fard Moshiri, P.; Asvadi, R.; Ghorashi, S.A. Enhancing CSI-based human activity recognition by edge detection techniques. Information 2023, 14, 404. [Google Scholar] [CrossRef]
  22. Halperin, D.; Hu, W.; Sheth, A.; Wetherall, D. Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Comput. Commun. Rev. 2011, 41, 53. [Google Scholar] [CrossRef]
  23. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  24. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  25. Graves, A. Long short-term memory. Supervised Seq. Label. Recurr. Neural Netw. 2012, 385, 37–45. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar] [CrossRef]
  27. Yang, J.; Chen, X.; Zou, H.; Wang, D.; Xu, Q.; Xie, L. EfficientFi: Toward large-scale lightweight WiFi sensing via CSI compression. IEEE Internet Things J. 2022, 9, 13086–13095. [Google Scholar] [CrossRef]
  28. Yousefi, S.; Narui, H.; Dayal, S.; Ermon, S.; Valaee, S. A survey on behavior recognition using WiFi channel state information. IEEE Commun. Mag. 2017, 55, 98–104. [Google Scholar] [CrossRef]
Figure 1. Significant variations in Wi-Fi CSI signals of the same "Kicking with the right leg" activity among different samples from the HHI dataset [7].
Figure 2. Visualization of CSI data using t-SNE of similar activities from $D_{s1}$ and $D_{s2}$. Red dots correspond to data from $D_{s1}$ and blue dots represent data from $D_{s2}$.
Figure 3. The sample preprocessing results of amplitude data for three interactions. The first row is the raw data; the second row is the data after normalization and Butterworth low-pass filtering; and the third row is the data after 1D-DWT down-sampling.
Figure 4. An overview of the architecture of the TCN model used in this paper. A kernel size of two is shown for illustration purposes.
Figure 5. The details of a TCN layer.
Figure 6. The accuracy and loss curves of the TCN-AA model trained and validated on $D_{s1}$.
Figure 7. The confusion matrix of the TCN-AA model trained on $D_{s1}$ and tested on $D_{s2}$.
Figure 8. The confusion matrix of the TCN-MAML model trained on $D_{s1}$ and tested on $D_{s2}$.
Figure 9. The accuracy and loss curves of the proposed TCN-MAML model.
Table 1. Comparison of few-shot learning approaches in Wi-Fi-based HAR.

Method | Predef. Embed. 1 | New Subj. Adapt. 2 | Extra Data 3 | Feature Engr. 4 | General HAR 5
ProtoNet | O | X | X | X | O
TOSS | X | O | O | X | O
CSI-GDAM | X | X | O | O | O
CAUTION | X | X | X | O | X
TCN-MAML (Ours) | X | O | X | X | O

1 Predef. Embed. = requires predefined embeddings. 2 New Subj. Adapt. = adaptable to new subjects. 3 Extra Data = needs additional target data. 4 Feature Engr. = feature engineering required. 5 General HAR = general HAR application.
Table 2. TCN-MAML parameters.

Component | Parameter | Value
TCN Backbone | Number of Layers | 3
 | Dilation Rates | 1, 2, 4
 | Number of Filters | 30, 50, 75
 | Kernel Size | 15
 | Dropout Rate | 0.1
Output Shape | Final TCN Output Dimension | $(N_t \times N_r, N_p, N_f^L)$
FCN Classifier | Number of Dense Layers | 2
 | First Dense Layer Neurons | 128
 | Second Dense Layer Neurons | 64
 | Dropout Rate | 0.1
 | Output Layer Neurons | 12 ($N_c$)
 | Batch Size | 32
Meta-learning | Outer Learning Rate | $5 \times 10^{-4}$
 | Task-level Inner Update Learning Rate | $5 \times 10^{-4}$
 | Inner-loop Adaptation Steps | 5
 | Inner-loop Testing Steps | 10
Table 3. Cross-dataset accuracy comparison.

Dataset | Testing Accuracy (%)
NTU-Fi HAR | 99.12
UT-HAR | 98.66
HHI | 99.6
Table 4. A comparison of classification accuracy (%) on the HHI dataset with recent methods.

Method | Accuracy (%)
TCN-AA (Ours) | 99.42
SVM | 86.21
CSI-IANet | 91.30
DCNN | 88.66
HHI-AttentionNet | 95.47
GraSens | 86.00
E2EDLF | 86.30
Attention-BiGRU | 87.00
H2HI-Net | 96.39
Table 5. The performance of the proposed augmentation techniques. The augmented dataset contains twice the sample size of the raw dataset.

Augmentation Method | Train Acc. (%) | Valid Acc. (%) | Test Acc. (%) | Train Loss | Valid Loss | Test Loss
Raw data | 100 | 67.2 | 87.0 | 0.001 | 1.098 | 0.553
Raw + Dropout | 99.7 | 84.2 | 94.6 | 0.002 | 0.184 | 0.109
Raw + Intra-mixing (30%) | 98.8 | 86.5 | 94.9 | 0.008 | 0.178 | 0.055
Raw + Intra-mixing (20%) | 98.2 | 83.1 | 90.9 | 0.01 | 0.192 | 0.069
Raw + Inter-mixing (30%) | 99.7 | 77.8 | 93.3 | 0.006 | 0.234 | 0.082
Raw + Inter-mixing (20%) | 98.8 | 76.8 | 89.3 | 0.01 | 0.223 | 0.09

Note: Raw data = baseline dataset without augmentation. Dropout = random dropout applied to training samples. Intra-mixing = mixing data within the same class. Inter-mixing = mixing data across different classes.
Table 6. The performance of the proposed augmentation techniques. The augmented dataset contains the same sample size as the raw dataset.

Augmentation Method | Train Acc. (%) | Valid Acc. (%) | Test Acc. (%) | Train Loss | Valid Loss | Test Loss
Raw data | 100 | 67.2 | 87.0 | 0.001 | 1.098 | 0.553
(Raw + Dropout)/2 | 99.1 | 67.2 | 88.6 | 0.008 | 0.305 | 0.133
(Raw + Intra-mixing (30%))/2 | 100 | 66.4 | 90.0 | 0.001 | 0.361 | 0.141
(Raw + Intra-mixing (20%))/2 | 99.2 | 63.5 | 87.0 | 0.006 | 0.372 | 0.158
(Raw + Inter-mixing (30%))/2 | 99.0 | 64.0 | 87.6 | 0.008 | 0.35 | 0.172
(Raw + Inter-mixing (20%))/2 | 95.6 | 65.0 | 82.6 | 0.012 | 0.363 | 0.188

Note: Raw data = baseline dataset without augmentation. Dropout = random dropout applied to training samples. Intra-mixing = mixing data within the same class. Inter-mixing = mixing data across different classes.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
