Article

Non-Contact Cross-Person Activity Recognition by Deep Metric Ensemble Learning

by Chen Ye 1, Siyuan Xu 1, Zhengran He 2, Yue Yin 2, Tomoaki Ohtsuki 2 and Guan Gui 1,*
1 School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2 Department of Information and Computer Science, Keio University, Yokohama 223-8522, Japan
* Author to whom correspondence should be addressed.
Bioengineering 2024, 11(11), 1124; https://doi.org/10.3390/bioengineering11111124
Submission received: 16 October 2024 / Revised: 3 November 2024 / Accepted: 5 November 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Intelligent Systems for Human Action Recognition)

Abstract

In elderly monitoring and indoor intrusion detection, the recognition of human activity is a key task. Owing to several strengths of Wi-Fi-based devices, including non-contact sensing and privacy protection, these devices have been widely applied in smart homes. Using deep learning techniques, numerous Wi-Fi-based activity recognition methods can achieve satisfactory recognition; however, these methods may fail to recognize the activities of an unknown person who was absent from the learning process. In this study, using channel state information (CSI) data, a novel cross-person activity recognition (CPAR) method is proposed by a deep learning approach with generalization capability. Building on one of the state-of-the-art deep neural networks (DNNs) used in activity recognition, i.e., attention-based bi-directional long short-term memory (ABLSTM), the snapshot ensemble is first adopted to train several base-classifiers for enhancing the generalization and practicability of recognition. Second, to discriminate the extracted features, metric learning is further introduced by using the center loss, obtaining the snapshot ensemble-used ABLSTM with center loss (SE-ABLSTM-C). In the CPAR experiments, the proposed SE-ABLSTM-C method markedly improved the recognition accuracies to an application level for seven categories of activities.


1. Introduction

Human activity recognition (HAR) is increasingly demanded in an aging society [1,2,3]. In some special periods, such as the recent outbreak of coronavirus disease 2019 (COVID-19), a shortage of medical resources results in an urgent need for home healthcare [4]. In HAR applications involving walking activities [5,6,7,8], the adopted devices can be divided into two main categories, namely wearable ones [9,10,11,12,13,14,15,16,17] and non-contact ones [18,19,20,21,22,23,24,25,26,27,28,29,30,31]. Wearable devices, such as a wristband with an accelerometer, can usually identify a user’s static or motion statuses precisely for rehabilitation and other purposes, via electromyography (EMG) [15], electroencephalography (EEG) [17], an inertial measurement unit (IMU) [13], etc. Based on whether they detect the central nervous system (CNS), the invasive and non-invasive wearable sensors for activity monitoring and rehabilitation are outlined in Figure 1, referring to [10]. The invasive sensors can be divided into bioelectrical ones (e.g., electroencephalography) and biomechanical ones (e.g., capacitance sensors); the non-invasive sensors further include electromechanical ones (e.g., accelerometers). However, wearable devices may cause discomfort and skin inflammation due to direct contact with the skin [9].
In contrast, non-contact devices mainly include vision-based camera and infrared sensors [18,19,20,21], and radio frequency (RF)-based Wi-Fi and lidar [22,23,24,25,26,27,28,29,30,31]. In particular, compared with a camera or infrared sensor, the use of Wi-Fi has four main strengths [22,23,24,25,26,27,28,29,30]: (1) it can penetrate non-metals like clothing and quilts; (2) it is insensitive to light and temperature; (3) it avoids the invasion of privacy; (4) it is inexpensive. Owing to the potential of Wi-Fi-based devices for HAR, they have been adopted in numerous practical applications, e.g., home monitoring [24] and indoor intrusion detection [22].
Wi-Fi-based HAR is usually conducted using the received signal strength indicator (RSSI) or channel state information (CSI) [32,33,34]. Unlike RSSI, which only measures the power of a received signal, CSI describes both the amplitude and phase of the received signal for each subcarrier, and has been used in most existing HAR studies [22,23,24,25,26,27,28,29,30]. During the transmission of Wi-Fi signals, the RF signals from a transmitter usually traverse multiple paths to reach a receiver, so the CSI is typically sensitive to the ambient environment. That is, surrounding obstacles and human motion may reflect or interfere with the signal propagation, resulting in variations of the CSI [32,33,34]. The HAR-aimed CSI datasets collected by specific tools [35], including our dataset, will be described in Section 2.1.

1.1. Conventional Wi-Fi-Based HAR Methods

HAR can be converted to a binary or multi-class classification task. Common machine learning approaches such as the support vector machine (SVM) [36] and random forest [37] require hand-crafted feature selection based on domain expertise; in contrast, deep learning (DL) is a more powerful technique that can automatically extract semantic features [38,39,40,41]. Specifically, using CSI data, some DL approaches have achieved satisfactory performance in many studies on HAR [25,42,43,44,45]. Owing to the use of convolution kernels, the convolutional neural network (CNN) can usually extract the important features of the data of interest, and it was used to train on CSI data for HAR in a transfer learning fashion in [42]. In [28], the CSI data were converted to red-green-blue (RGB) images as the input of a two-dimensional CNN, for exploring the signal patterns of activities. Furthermore, CNN-based activity recognition was studied by combining few-shot learning in [25] and multi-task learning in [43]. Considering the sequential characteristics of human activities, the recurrent neural network (RNN)-based [46] long short-term memory (LSTM), with its advanced memory capacity [47,48,49], has been introduced to handle CSI in some HAR methods [24,44,45]. In [24,44], the CSI was treated as temporal sequences for predicting different activities by LSTM-based networks. Moreover, in [45], beyond handling only the past information of input samples as the standard LSTM does, a bi-directional LSTM (BLSTM) that also handles their future information [50] was introduced to further extract the features of CSI. In addition, an attention mechanism that can learn the importance of features and time steps was combined with BLSTM to obtain the attention-based BLSTM (ABLSTM) approach [45], yielding improved performance.
In addition, in [26], HAR was conducted in scenarios across different environments and categories of activity using CSI data. In [51], a CNN-based ensemble learning strategy with heterogeneous classifiers was presented for HAR, under the same conditions as the training phase. To the best of our knowledge, most conventional Wi-Fi-based HAR methods focus on recognizing the activities of a person who has been seen in the training phase, without considering the differences in height, weight, and activity pattern between the seen persons and the unknown ones in the recognition phase, and thus may lack generalization ability across different subjects [52,53,54]. In particular, activity patterns usually vary from person to person, and the corresponding samples in HAR do not follow the same data distribution, which hinders the generalization of classifiers. Conventional sensor-based cross-person activity recognition (CPAR) methods, including Wi-Fi-based ones, that rely on transfer learning or domain adaptation typically require additional knowledge from unknown persons during training, such as labeled or unlabeled samples, which may be difficult to obtain in many practical cases [53,54]. On the other hand, some domain generalization (DG) methods that require no target domain data have been presented for Wi-Fi-based gesture recognition, e.g., WiSR in [55] and WiSGP in [56]. Note that these DG methods, which target gesture recognition, are probably unable to be applied directly to CPAR, considering the differences in body size and activity pattern.

1.2. Our Proposal

Combining the snapshot ensemble and metric learning, this study develops a new generalized method aimed at CPAR using CSI data, i.e., at recognizing the activities of unknown persons, which does not need any knowledge from unknown persons in the training phase. Given the dominant performance of the ABLSTM classification network [45] on HAR for the persons seen in the training phase, the snapshot ensemble fashion [57] is first adopted in the ABLSTM network to obtain SE-ABLSTM, improving the generalization capability of recognition. Using the cosine annealing learning rate (LR), the snapshot ensemble can obtain multiple models that converge to different local minima of the loss in one training process. To further improve the recognition generalization, we introduce metric learning to extract discriminative features with compact intra-category distances into the base-classifiers via the high-powered center loss function [58,59], proposing the snapshot ensemble-used ABLSTM with center loss (SE-ABLSTM-C). In the activity recognition task on unknown persons, outperforming the state-of-the-art HAR method of ABLSTM [45], the proposed SE-ABLSTM-C method improved the average accuracy over seven activities, consisting of waving, clapping, walking, lying down, sitting down, falling, and picking up, by approximately 3–14% for each subject. The improved generalization of our proposal to unknown persons will be helpful for flexible health monitoring with the aid of mobile devices like robots [21].
The early results of SE-ABLSTM were presented in our previous work as a conference version [60]. This paper further describes the use of the snapshot ensemble by SE-ABLSTM in a contrastive manner, and extends SE-ABLSTM by combining metric learning to obtain the new SE-ABLSTM-C method. In addition, the research problem statement, more comprehensive experiments with added subjects, and the limitations of our proposal are respectively supplemented. The code and download link of our dataset are available at https://github.com/NJUPT-Sivan/Cross-person-HAR (4 November 2024). The two main contributions of this study are summarized as follows:
1. We used the Espressif 32 (ESP32) CSI tool [34] to collect and release a CSI dataset on the mentioned seven activities in an indoor environment.
2. A new SE-ABLSTM-C method is proposed to improve the generalization ability of CPAR, obtaining higher accuracies than those of conventional HAR methods.
The organization of this paper is as follows. Section 2 presents the CSI dataset collection, and the problem and treatment by ensemble learning; Section 3 describes the framework and each part of our proposed CPAR method; Section 4 states the experiments and limitations; finally, Section 5 concludes this study.

2. Preliminaries

As the preliminary work, the CSI dataset collection, and the problem and treatment for CPAR, are respectively stated.

2.1. Dataset Collection

Compared with wearable phone-based accelerometer datasets, such as WISDM (Wireless Sensor Data Mining) [61], UCI-HAR [62], and mHealth (mobile Health) [63], CSI datasets enable the non-contact sensing of human activities by the Wi-Fi technique. The HAR-aimed CSI datasets in typical existing studies [23,24,25,26,27,28], listed in Table 1, were usually collected on different activities from multiple subjects over several rounds in an indoor environment, with 200–1100 data records. Some of these studies [23,24,25,26] adopted the Intel 5300 Network Interface Card and Atheros for collecting CSI data, but the 5300 CSI tool has a relatively high price and complex configuration requirements [35]. Some of the CSI datasets have been made public, such as those in [24,28]. In our dataset, five male subjects, named A, B, C, D, and E, with heights of 165–175 cm and weights of 50–68 kg, each repeated 20 rounds of 7 activities, yielding a total of 700 data records, a medium scale among all the mentioned CSI datasets. As a prior study on CPAR, our medium-scale dataset will be used in the experiments in Section 4, and the scale and diversity of the dataset, including subjects and categories of activity, can be adjusted or extended based on actual needs. In addition, the basic information of each subject in our dataset is summarized in Table 1.
Referring to recent research in [34], we adopted two ESP32 development boards, consisting of a station (STA) and an access point (AP), for collecting CSI data. As shown in Figure 2a, the STA first transmits a ping command to the AP as a receiver, and the AP then replies to the ping command from the STA, establishing a unique connection. In an indoor environment (length: 5.2 m, width: 3.6 m, and height: 2.4 m), due to the multi-path transmission of the Wi-Fi signal between the transmitter–receiver pair, the CSI will vary with the movements of a person. The layout of the experiment room with its physical dimensions is shown in Figure 2b, where only two tripods of 1 m height were laid out for placing the STA and AP, without extra furniture, and a monitoring area of 2.4 m × 2.0 m was set in the center of the room. In addition, the pattern of walking back and forth was performed within the monitoring area, and the other six activities were conducted around the center of the monitoring area; therein, a chair of 45 cm height was used for sitting down, a common mat for lying down and falling, and an empty bottle for picking up. Figure 3 shows an example of raw CSI data samples for the seven considered activities. The amplitude scale for sitting down and lying down is roughly within −10–10 mV; the larger scale for clapping, walking, and picking up is roughly within −20–20 mV; the largest scale for waving and falling is roughly within −30–30 mV. In addition, for sitting down and falling, the highlighted sharp changes of amplitude may correspond to the occurrences of the respective activity.
The settings of the ESP32 CSI tool are listed in Table 2. With the compiler version of the Internet of Things development framework (IDF) v4.3.2, a single antenna is configured for both the transmitter and receiver, with a sampling rate of 63 Hz. The number of CSI channels equals that of the subcarriers, i.e., 32, and the CSI with in-phase and quadrature (IQ) data in 64 ways is used as the considered data in this study. In the ESP32, the analog-to-digital converter (ADC) calibrates the input analog voltage to the reference voltage of 1100 mV and determines each bit of the output digital result. In a sufficiently large room, the distance of the line of sight (LoS) between the transmitter and receiver is 2.5 m, with both devices placed at a height of 1.0 m. In comparison to the numbers of antennas and subcarriers, namely 3 and 30, in the common 5300 CSI dataset as in [24], our dataset adopted a compact single antenna in both the STA and AP and a close number of subcarriers, namely 32.

2.2. Problem and Treatment by Ensemble Learning

Since the data of the activities of the targeted unknown subjects cannot be obtained before the recognition phase of CPAR, some transfer learning or domain adaptation approaches that need re-training on new samples may be inapplicable [52,53]. Fortunately, ensemble learning does not require any additional data beyond those from the seen subjects in the initial training phase [64].
Figure 4 illustrates the ensemble learning-induced expansion of the space of individual hypotheses for approximating a true hypothesis [64]. The hypothesis space is depicted by the features of a sample $\mathbf{x} \in \mathbb{R}^d$ with $d = 2$, and we assume that the ensemble contains $M$ base-classifiers $h_1, h_2, \ldots, h_M$, which are respectively trained within their corresponding individual hypothesis spaces. From a representation perspective, a true hypothesis $g$ of the learning task may not lie in the hypothesis space considered by the current learning algorithm; in this case, the use of a single classifier $h_m$, $m \in \{1, 2, \ldots, M\}$, is likely to be invalid. By combining multiple base-classifiers, the space of individual hypotheses expands, which may yield a better approximation to the true hypothesis. In the CPAR application, the features of CSI samples from the seen persons can be regarded as constructing the hypothesis space, and those from an unknown person are probably outside the constructed hypothesis space. In our proposal, the introduced ensemble strategy, which covers multiple individual hypothesis spaces, enables a generalization improvement, helping to approximate an activity category of the unknown person.

3. Proposed CPAR Method

The framework of our proposed CPAR method, and its two main compositions of snapshot ensemble learning and metric learning, are respectively described.

3.1. System Framework

The system framework of our proposed CPAR method is shown in Figure 5, consisting of four main parts: CSI data preparation, feature extraction, ensemble and classifiers training, and CPAR on unknown persons.

3.1.1. CSI Data Preparation

Using the ESP32 CSI tool [34], raw CSI data were collected on human activities by the transmitter–receiver pair. To train effective classifiers, data pre-processing is indispensable, and it is composed of two steps:
  • Using a 3.2 s time window with a 0.8 s forward slide, the data are sliced into samples of length 200. Since the observation time of each round is 16.0 s in our experiments, the number of samples becomes 17 = (16.0 − 3.2)/0.8 + 1 for each subcarrier (a minimal slicing sketch is given after this list).
  • The sliced samples are labeled as 0, 1, …, 6 in turn, corresponding to the seven specific activities, i.e., waving, clapping, walking, lying down, sitting down, falling, and picking up.
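As a minimal sketch of the slicing step above (assuming the raw CSI of one round is stored as a NumPy array of shape (n_packets, n_subcarriers); the variable names and the approximate packet count are illustrative), the sliding window can be implemented as follows:
```python
import numpy as np

WIN = 200   # samples per window (about 3.2 s at the 63 Hz sampling rate)
STEP = 50   # forward slide (about 0.8 s)

def slice_round(csi_round, label):
    """Slice one 16.0 s round of CSI into overlapping 200-length samples.

    csi_round: array of shape (n_packets, n_subcarriers); label: integer 0-6.
    Returns stacked samples of shape (n_windows, WIN, n_subcarriers) and labels.
    """
    starts = range(0, csi_round.shape[0] - WIN + 1, STEP)
    samples = np.stack([csi_round[s:s + WIN] for s in starts])
    labels = np.full(len(samples), label)   # 0-6: waving ... picking up
    return samples, labels

# Example: a round of about 1000 packets yields (1000 - 200)//50 + 1 = 17 samples.
```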

3.1.2. Feature Extraction by ABLSTM Network

In [45], the presented ABLSTM-based method was shown to achieve higher classification accuracy on HAR than other conventional methods, such as CNN- or LSTM-based methods [24,42]. Given its remarkable feature extraction ability, the ABLSTM network is selected as the backbone deep neural network (DNN) in our proposal. Reviewing the structure of the ABLSTM network [45] shown in Figure 6: first, the BLSTM layer with 200 units extracts the features of the pre-processed CSI series, namely samples $\mathbf{x}_i$ of length 200, considering both the forward and backward states of the sequence, and outputs a feature matrix of size $200 \times 400$. Meanwhile, the feature matrix output from the BLSTM layer passes into the attention layer with 400 units, which assigns weights to features and time steps according to their importance, generating an attention matrix. After concatenating the feature matrix and the attention matrix, the modified feature matrix sequentially passes through a flatten layer and a fully connected layer of length 7 associated with the considered activities, and is converted into a feature vector. Finally, the converted feature vector is evaluated by the softmax loss $\mathcal{L}_{\mathrm{softmax}}$, consisting of the softmax activation function $f_{\mathrm{softmax}}$ and the cross-entropy (CE) loss $\mathcal{L}_{\mathrm{CE}}$, to produce a predicted category vector $\hat{\mathbf{y}}_i$. Note that the standard ABLSTM network [45] we adopted has no Transformer modules [65], which could otherwise benefit the parallel computation of sequence samples.
The softmax loss can be formulated as follows:
$$\mathcal{L}_{\mathrm{softmax}} = \mathcal{L}_{\mathrm{CE}}\big(f_{\mathrm{softmax}}(\mathbf{z}_i)\big) = \mathbb{E}\big\{-\log f_{\mathrm{softmax}}(\mathbf{z}_i)\big\},\qquad(1)$$
with
$$f_{\mathrm{softmax}}(z_{i,k}) = \frac{e^{z_{i,k}}}{\sum_{k'=1}^{K} e^{z_{i,k'}}},\qquad(2)$$
where $\mathbf{z}_i = [z_{i,1}, \ldots, z_{i,K}] = f_{\mathrm{FE}}(\mathbf{x}_i; \mathbf{W})$ is the non-normalized probability vector output from the fully connected layer, and $f_{\mathrm{FE}}$ is the mapping function from the sample space to the feature embedding (FE) space, i.e., $f_{\mathrm{FE}}: \mathcal{X} \to \mathcal{Y}$, which takes a training sample $\mathbf{x}_i$ and the model weights $\mathbf{W}$ as arguments. $\mathbb{E}\{\cdot\}$ is the expected value operator.
Note that the adopted ABLSTM structure follows its original configuration except for the fully connected layer, and its computational load has been evaluated in [45]. In the following ensemble learning, the computational load of ABLSTM increases approximately $M$-fold, where $M$ equals the number of cycles. In addition, owing to the flexibility of the ABLSTM structure, the softmax loss can be combined with other losses via a factor $\lambda$ that balances their proportions, such as the center loss belonging to metric learning [58,59]. The hyper-parameters of the ABLSTM network were set experimentally in this study, as stated in Section 4.1.
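As an illustrative sketch (not the authors' exact implementation), a backbone with the structure described above can be assembled in TensorFlow/Keras roughly as follows; the simple per-time-step attention weighting and the layer sizes are assumptions for illustration:
```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ablstm(seq_len=200, n_features=64, n_classes=7):
    inputs = layers.Input(shape=(seq_len, n_features))
    # BLSTM with 200 units per direction -> feature matrix of size (200, 400)
    feat = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(inputs)
    # Attention: learn a weight for each time step from the BLSTM features
    score = layers.Dense(1, activation="tanh")(feat)         # (batch, 200, 1)
    attn = layers.Softmax(axis=1)(score)                      # weights over time steps
    attended = layers.Multiply()([feat, attn])                # attention matrix
    merged = layers.Concatenate(axis=-1)([feat, attended])    # concatenation
    flat = layers.Flatten()(merged)
    outputs = layers.Dense(n_classes, activation="softmax")(flat)  # 7 activities
    return models.Model(inputs, outputs)

model = build_ablstm()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # softmax loss, Eq. (1)
              metrics=["accuracy"])
```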

3.1.3. Ensemble and Classifiers Training

Figure 7 depicts the training phase and recognition phase in the proposed CPAR method. To improve the generalization ability of the ABLSTM model, an advanced ensemble strategy termed the snapshot ensemble [57] is adopted, which will be described in Section 3.2. In addition, by combining with the softmax loss, the center loss $\mathcal{L}_{\mathrm{center}}$ acts on discriminative feature extraction for training high-performance classifiers [58,59], which will be described in Section 3.3.

3.1.4. CPAR on Unknown Persons

Unlike the simple HAR task in which the subjects’ activities have appeared in the training phase, this study focuses on the CPAR task, which targets the activities of an unknown person. The predicted category vectors of the several trained base-classifiers are combined by simple averaging to yield the final activity prediction.

3.2. Snapshot Ensemble Learning

Figure 8 compares stochastic gradient descent (SGD) optimization, the common ensemble, and the snapshot ensemble. In Figure 8a, SGD optimization is illustrated via a typical LR schedule that is constant or decreasing: the model converges to a local minimum from a starting point, yielding one classifier $h$. However, a single classifier may have poor generalization ability because it falls into a local minimum with some incorrect predictions. In contrast, ensemble learning probably improves generalization by combining multiple classifiers with differing predictions. In addition, compared with a single classifier, ensemble learning may obtain a better approximation to a true hypothesis lying outside the considered hypothesis space, as mentioned in Section 2.2. As shown in Figure 8b, unlike the single training process of SGD optimization, the common ensemble conducts multiple training processes, obtaining one independent base-classifier in each. More concretely, in the first training, the model converges to a local minimum from a starting point to obtain the first base-classifier $h_1$; similarly, $M$ base-classifiers $h_1, h_2, \ldots, h_M$ are respectively obtained in $M$ training processes. However, in the different training processes of the common ensemble, independent base-classifiers may be obtained from models that converge to the same local minimum, e.g., $h_1$ and $h_M$, so some local minima do not contribute to the convergence of the models. As shown in Figure 8c, by using a cosine annealing LR with cycles of decrease and sharp increase [57], the snapshot ensemble successively obtains $M$ base-classifiers in only one training process, whose base-models converge to different local minima, yielding better generalization than the common ensemble.
For model training, unlike the constant or decreasing LR used in the SGD optimization or common ensemble, the schedule of cyclic cosine annealing LR in snapshot ensemble is defined as
$$\alpha(t) = \frac{\alpha_0}{2}\left[\cos\!\left(\pi\,\frac{\mathrm{mod}(t-1,\lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right],\qquad(3)$$
where $\alpha_0$ is the initial LR, $T$ is the total number of epochs, $M$ is the number of cycles, and $t$ is the ordinal number of the epoch within each cycle. $\lceil\cdot\rceil$ denotes rounding up to an integer, and $\mathrm{mod}(\cdot)$ denotes the modulo function that calculates the remainder of a division. In one cycle, $t$ increases from 1 to $\lceil T/M \rceil$, resulting in a decreasing LR $\alpha(t)$ from its maximum $\alpha_0$, and the model is saved when the LR decreases to its minimum. At the beginning of the next cycle, $\alpha(t)$ sharply recovers to the maximum $\alpha_0$ and repeats the decreasing process. In this way, $M$ classifiers are successively saved in $M$ cycles during one training process. In our experiments, $\alpha_0$, $T$, and $M$ are empirically set to 0.01, 500, and 10, respectively. Correspondingly, the number of epochs within each cycle, $\lceil T/M \rceil$, equals 50. Based on the ABLSTM with only the softmax loss [45], the approach using the snapshot ensemble is given in Algorithm 1, which we name SE-ABLSTM in this study.
Algorithm 1: SE-ABLSTM: Snapshot Ensemble-used ABLSTM
  Input: Training dataset $D_{\mathrm{train}} = \{\mathbf{x}_i, y_i\}$
  Output: Base-classifiers $h_1, h_2, \ldots, h_M$
  1:  Construct ABLSTM [45] with softmax loss $\mathcal{L}_{\mathrm{softmax}}$;
  2:  Set parameters: initial LR $\alpha_0$, total number of epochs $T$, and number of cycles $M$;
  3:  for $m = 1, 2, \ldots, M$ do
  4:      Obtain the current model weights, i.e., $\mathbf{W}_m(1) = \mathbf{W}_{\mathrm{cur}}$;
  5:      for $t = 1, 2, \ldots, \lceil T/M \rceil$ do
  6:          $\alpha(t) = \frac{\alpha_0}{2}\left[\cos\!\left(\pi\,\frac{\mathrm{mod}(t-1,\lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right]$;
  7:          Train on $D_{\mathrm{train}}$;
  8:          Forward propagation: predicted category $\hat{\mathbf{y}}_i^{\,m} = h_m(\mathbf{x}_i; \mathbf{W}_m(t))$;
  9:          Backward propagation: true category label $y_i \rightarrow$ one-hot vector $\mathbf{y}_i$;
  10:         $\mathbf{W}_m(t+1) = \mathbf{W}_m(t) - \alpha(t)\,\dfrac{\partial \mathcal{L}_{\mathrm{softmax}}}{\partial \mathbf{W}_m}$;
  11:     end
  12: Obtain $h_m$ with trained $\mathbf{W}_m$;
  13: end
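A minimal sketch of how the cyclic schedule in Equation (3) and the per-cycle snapshot saving of Algorithm 1 might be realized in TensorFlow/Keras is given below; the callback and file names are illustrative assumptions:
```python
import math
import tensorflow as tf

ALPHA0, T, M = 0.01, 500, 10        # initial LR, total epochs, number of cycles
CYCLE_LEN = math.ceil(T / M)        # epochs per cycle (50)

def cosine_annealing_lr(epoch, lr=None):
    # Equation (3): the LR decays from ALPHA0 towards 0 within each cycle,
    # then jumps back to ALPHA0 at the start of the next cycle.
    t = epoch % CYCLE_LEN
    return (ALPHA0 / 2.0) * (math.cos(math.pi * t / CYCLE_LEN) + 1.0)

class SnapshotSaver(tf.keras.callbacks.Callback):
    """Saves the model weights at the end of each cycle (one snapshot per cycle)."""
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % CYCLE_LEN == 0:
            m = (epoch + 1) // CYCLE_LEN
            self.model.save_weights(f"snapshot_{m}.h5")

# model.fit(x_train, y_train, epochs=T, batch_size=128,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(cosine_annealing_lr),
#                      SnapshotSaver()])
```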
In the recognition phase, the simple averaging for the predictions of M trained models can be formulated as follows,
$$H(\mathbf{x}_{\mathrm{test}}) = \hat{\mathbf{y}}_{\mathrm{avg}} = \frac{1}{M}\sum_{m=1}^{M} h_m(\mathbf{x}_{\mathrm{test}}),\qquad(4)$$
where $\mathbf{x}_{\mathrm{test}}$ is a test sample, and the base-classifiers $h_m$ are respectively obtained by the ABLSTM in this study. Since the result of $H(\mathbf{x}_{\mathrm{test}})$ is a vector, we determine the snapshot ensemble-based final predicted category label by
$$\hat{y}_{\mathrm{test}} = \arg\max_{k}\, \hat{y}_{\mathrm{avg},k},\qquad(5)$$
where $\hat{y}_{\mathrm{avg},k}$ is the $k$-th element of the averaged prediction $\hat{\mathbf{y}}_{\mathrm{avg}}$.
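A minimal sketch of the averaging in Equations (4) and (5), assuming `models` is the list of the M trained snapshot classifiers:
```python
import numpy as np

def ensemble_predict(models, x_test):
    # Average the predicted category vectors of the M base-classifiers (Eq. (4))
    # and take the arg max over the categories (Eq. (5)).
    y_avg = np.mean([m.predict(x_test) for m in models], axis=0)
    return np.argmax(y_avg, axis=-1)
```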

3.3. Center Loss Belonging to Metric Learning

Although the softmax loss can deal with separable features in a classification task, it does not exploit the discriminative characteristics of features. Separable features are usually indiscriminative; in contrast, discriminative features are easily separable owing to their clustered distribution in the feature space. We introduce metric learning to further improve the generalization of the proposed CPAR method, using the center loss function, which performs well in recognition tasks [58,59]. The center loss aims to obtain more compact intra-category distances in the feature space, and can be formulated as follows:
$$\mathcal{L}_{\mathrm{center}} = \frac{1}{2}\,\mathbb{E}\left\{\left\| f_{\mathrm{FE}}(\mathbf{x}_i;\mathbf{W}) - \mathbf{c}_{y_i} \right\|_2^2\right\},\qquad(6)$$
where $\mathbf{c}_{y_i}$ is the learnable center feature corresponding to the $y_i$-th category. Combining the softmax loss in Equation (1) and the center loss in Equation (6), a hybrid metric (HM)-based loss function is composed as
$$\mathcal{L}_{\mathrm{HM}} = \mathcal{L}_{\mathrm{softmax}} + \lambda\,\mathcal{L}_{\mathrm{center}},\qquad(7)$$
where $\lambda$ balances $\mathcal{L}_{\mathrm{softmax}}$ and $\mathcal{L}_{\mathrm{center}}$. By replacing the softmax loss with the HM loss, the SE-ABLSTM-C approach is further obtained.
In the optimization of the center feature $\mathbf{c}_{y_i}$, it should be updated when the embedded feature $\mathbf{z}_i = f_{\mathrm{FE}}(\mathbf{x}_i; \mathbf{W}(t))$ changes. Considering the huge computational load of updating $\mathbf{c}_{y_i}$ over the entire set of training samples $\mathbf{x}_i$ in each epoch, a batch containing $B$ training samples $\mathbf{x}_b$ is usually used to calculate, by averaging, the feature centers of $\mathbf{z}_b$ for each category contained in the batch, and $\mathbf{c}_{y_i}$ is updated to move closer to these feature centers by
$$\mathbf{c}_{y_i}(t+1) = \mathbf{c}_{y_i}(t) - \alpha_{\mathrm{center}}\,\Delta\mathbf{c}_{y_i},\qquad(8)$$
with
$$\Delta\mathbf{c}_{y_i} = \frac{\sum_{b=1}^{B}\delta(y_b = y_i)\,\big(\mathbf{c}_{y_i}(t) - \mathbf{z}_b\big)}{1 + \sum_{b=1}^{B}\delta(y_b = y_i)},\qquad(9)$$
where $\alpha_{\mathrm{center}}$ is the LR for the update of the center loss, and $\delta(\cdot)$ denotes a conditional selection function that equals 1 if the condition holds and 0 otherwise. In the case where there are no samples $\mathbf{x}_b$ corresponding to the $y_i$-th category in the current batch, namely
$$y_i \neq y_b, \quad \forall\, b = 1, 2, \ldots, B,\qquad(10)$$
$\mathbf{c}_{y_i}$ remains unchanged. By combining the center loss with the SE-ABLSTM approach, the SE-ABLSTM-C approach is obtained, which uses the HM loss $\mathcal{L}_{\mathrm{HM}}$ for the backward propagation instead of $\mathcal{L}_{\mathrm{softmax}}$ in SE-ABLSTM. Correspondingly, SE-ABLSTM-C is obtained by only replacing the weight update in Line 10 of Algorithm 1 with
$$\mathbf{W}_m(t+1) = \mathbf{W}_m(t) - \alpha(t)\,\frac{\partial \mathcal{L}_{\mathrm{HM}}}{\partial \mathbf{W}_m}.\qquad(11)$$
For a better understanding, the schematic diagram of center loss for generating discriminative features is shown in Figure 9.
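As an illustrative sketch (not the authors' exact implementation), the center loss in Equation (6) can be realized with one trainable center per category; here the centers are updated by gradient descent, which approximates the update rule of Equations (8) and (9), and the combined loss follows Equation (7) with λ = 1.0 as used in the experiments:
```python
import tensorflow as tf

class CenterLoss(tf.keras.layers.Layer):
    """One learnable center per category; penalizes the squared distance between
    each embedded feature and the center of its true category (Eq. (6))."""
    def __init__(self, n_classes, feat_dim, **kwargs):
        super().__init__(**kwargs)
        self.centers = self.add_weight(name="centers",
                                       shape=(n_classes, feat_dim),
                                       initializer="zeros", trainable=True)

    def call(self, features, labels):
        # features: (batch, feat_dim) embedded features f_FE(x; W)
        # labels:   (batch,) integer category labels
        centers_batch = tf.gather(self.centers, labels)
        return 0.5 * tf.reduce_mean(
            tf.reduce_sum(tf.square(features - centers_batch), axis=-1))

# Hybrid metric loss, Eq. (7): L_HM = L_softmax + LAMBDA * L_center, LAMBDA = 1.0
```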

4. Experimental Results

In this section, the settings of experimental parameters and tasks, the evaluations of performance and generalization on HAR methods, and the limitation of this study, are respectively stated.

4.1. Parameter and Task Setting

Table 3 summarizes the parameter settings in the experiments. The information regarding the prepared CSI dataset is summarized in Table 3a. In each round with an observation time of 16.0 s, a 3.2 s time window with a 0.8 s slide is used to obtain 200-length samples. Correspondingly, 17 samples are obtained in each round, and the total number of samples is 11,900 = 17 × 700 for our 700 CSI data records (see Table 1). The hyper-parameters for model training are summarized in Table 3b. We chose TensorFlow 2.4.1 as the Python framework, the loss function combining the softmax loss and the center loss with an empirically selected $\lambda = 1.0$, and the Adam optimizer to train $M = 10$ base-classifiers. The batch size and initial LR $\alpha_0$ are set to 128 and 0.01, respectively. In the training process, $M$ cycles are set, resulting in 50 epochs in each cycle. In addition, for the comparison methods without the snapshot ensemble, including CNN [42], LSTM [24], and ABLSTM [45], the LR and the number of epochs in each training are $1 \times 10^{-4}$ and 50, respectively.
Since the proposed CPAR method of SE-ABLSTM-C is devoted to improving generalization across different subjects rather than different environments, in the experiments, denoting the CSI data collected from the five subjects as A, B, C, D, and E, two kinds of tasks for HAR are set as follows:

4.1.1. Task I

As an auxiliary task for selecting a backbone network with high performance, the activities of a seen person are recognized by classifiers trained on all five subjects. The collected data are divided into one training dataset {A, B, C, D, E} and five test datasets {A}, {B}, {C}, {D}, and {E}. After training the network model on {A, B, C, D, E}, the datasets {A}, {B}, {C}, {D}, and {E} are respectively used for HAR. The sample ratio of training, validation, and test is 8:1:1.

4.1.2. Task II

As the main task, Task II aims to conduct CPAR on unknown persons by training classifiers using the data of seen persons. The collected data are divided into five patterns, each consisting of a training and a test dataset, namely {A, B, C, D} → {E}, {A, B, C, E} → {D}, {A, B, D, E} → {C}, {A, C, D, E} → {B}, and {B, C, D, E} → {A}. For example, after training the network model on all the data in {A, B, C, D}, the data in {E} are used for CPAR. In the following experiments, we use the abbreviation ABCD-E instead of {A, B, C, D} → {E} for simplicity, and similarly for the other patterns.
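A minimal sketch of the leave-one-subject-out split used by Task II, assuming the pre-processed samples and labels are stored in dictionaries keyed by subject ID (hypothetical variable names):
```python
import numpy as np

def loso_split(samples, labels, test_subject):
    """Train on all seen subjects and hold out one unknown subject for CPAR."""
    train_ids = [s for s in samples if s != test_subject]
    x_train = np.concatenate([samples[s] for s in train_ids])
    y_train = np.concatenate([labels[s] for s in train_ids])
    return x_train, y_train, samples[test_subject], labels[test_subject]

# Example: Pattern ABCD-E trains on subjects A-D and tests on subject E.
# x_tr, y_tr, x_te, y_te = loso_split(samples, labels, test_subject="E")
```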

4.2. Evaluation of Performance in Task I

In Task I, for recognizing the activities of seen persons, the confusion matrices generated by the conventional HAR methods are shown in Figure 10. In general, all four methods, namely CNN [42], LSTM [24], BLSTM, and ABLSTM [45], can recognize the seven considered activities of waving, clapping, walking, lying down, sitting down, falling, and picking up relatively precisely. Specifically, in Figure 10d, most of the recognition accuracies by ABLSTM are close to 100%, such as 99.41% for both waving and clapping and 98.24% for both lying down and sitting down. Even for the recognition results by CNN shown in Figure 10a, most accuracies are at least 85%, except for the lowest one of 82.35% for picking up. Compared with the other five activities, the accuracies for falling are lower than 90% by CNN, LSTM, or BLSTM, which may be due to the similarity of the two activities of falling and picking up.
In addition, Table 4 summarizes the total average accuracies over all seven activities for the three conventional HAR methods and BLSTM. Here, the accuracy metric is defined by
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\%,\qquad(12)$$
where $TP$ and $TN$ are the numbers of positive and negative samples that are correctly classified (true positives and true negatives), and $FP$ and $FN$ are the numbers of negative and positive samples that are misclassified (false positives and false negatives), respectively. More concretely, improving on the recognition accuracy of 91.01% by CNN, LSTM obtained 92.77% owing to its good capacity for handling sequential samples. In addition, owing to handling the samples in both the forward and backward directions, BLSTM further improved the total accuracy to 94.37%, and ABLSTM with an attention mechanism obtained the best result of 97.23%.

4.3. Evaluation of Generalization in Task II

To further evaluate the generalization ability of our proposal for recognizing the activities of unknown persons, the effectiveness of contrastive losses belonging to metric learning is assessed, and an ablation study and a comparison of HAR methods on CPAR are respectively conducted.

4.3.1. Contrastive Losses

The snapshot ensemble and the center loss belonging to metric learning, are two main compositions in the proposed SE-ABLSTM-C method. Among various contrastive losses, the triplet loss and the center loss are deemed two typical ones [58,59].
Based on ABLSTM [45], the effectiveness of the triplet loss and the center loss is compared in Table 5, where we respectively name the ABLSTM with triplet loss or center loss as ABLSTM-T and ABLSTM-C. In general, by combining a contrastive loss with the softmax loss, both ABLSTM-T and ABLSTM-C brought about performance improvements over ABLSTM for most activities in each pattern, except in Pattern ABCD-E. In Pattern BCDE-A, for example, relying on anchor samples, ABLSTM-T obtained higher accuracies for most activities except waving, and an average accuracy of 85.46%, over the 83.57% of ABLSTM. In addition, in most patterns, namely ABCE-D, ABDE-C, and ACDE-B, ABLSTM-C basically obtained better performances for the different activities and higher average accuracies (74.48%, 77.46%, and 85.04%) than both ABLSTM and ABLSTM-T, owing to the drawing together of intra-category features. Given this, ABLSTM-C with the center loss is selected in the following experiments.

4.3.2. Ablation Study

In our previous report [60], using the same CSI dataset, the adopted snapshot ensemble-used ABLSTM (i.e., SE-ABLSTM) demonstrated its superiority over the common ensemble-used ABLSTM on CPAR with the seven considered activities, in most patterns and in total average accuracy. Given this, as another main component of the proposed SE-ABLSTM-C method, SE-ABLSTM is selected to verify the effectiveness of the ensemble strategy in the ablation study.
The accuracy comparison of the ablation study is shown in Table 6, which verifies the effects of the snapshot ensemble and the center loss via SE-ABLSTM and ABLSTM-C, respectively. Since the snapshot ensemble can improve the recognition generalization of ABLSTM by obtaining multiple base-models that converge to different local minima, SE-ABLSTM typically outperformed ABLSTM. In Pattern ACDE-B, for example, by using the cyclic cosine annealing LR, SE-ABLSTM obtained higher accuracies for all seven activities than ABLSTM, and a higher average accuracy of 88.87% over the 80.83% of ABLSTM. In the other four patterns (ABCD-E, ABCE-D, ABDE-C, and BCDE-A), the average accuracies by SE-ABLSTM of 81.47%, 75.83%, 78.93%, and 85.67% were respectively higher than the 81.34%, 71.63%, 76.49%, and 83.57% by ABLSTM.
Consistent with the analysis based on the results in Table 5, ABLSTM-C basically brought about better performances for the different activities and higher average accuracies than ABLSTM, owing to realizing compact intra-category features. By introducing the center loss into the SE-ABLSTM approach, our proposed SE-ABLSTM-C was obtained. For each of the three activities of clapping, lying down, and sitting down, our proposal obtained the best performance in at least three of the five patterns. Moreover, the proposed SE-ABLSTM-C improved the average accuracies to the highest values of 84.37%, 78.86%, 80.74%, 89.03%, and 86.68% in the respective patterns. In addition, the performance gains of our proposal over each of its components (i.e., ABLSTM, SE-ABLSTM, and ABLSTM-C) were quantified by the corresponding improvements in average accuracy.

4.3.3. HAR Methods on CPAR

Unlike the precise recognitions of over 90% in Task I as shown in Table 4, in the five patterns (ABCD-E, ABCE-D, ABDE-C, ACDE-B, and BCDE-A) in Table 7, the average accuracies of the four conventional HAR methods, namely CNN [42], LSTM [24], ABLSTM [45], and LAGMAT (local and global alignment), which uses a distance matrix for correlation representation [54], significantly degraded to 68.61–83.99% on CPAR due to the differing characteristics of seen and unknown persons. Therein, as one of the state-of-the-art sensor-based CPAR methods, LAGMAT learns domain-invariant features by utilizing the local and global correlations of sensor signals belonging only to the training data.
In Patterns ABDE-C and ACDE-B, for example, the 82.35% and 71.76% obtained by LSTM for clapping significantly outperformed the 19.71% and 22.65% obtained by CNN, respectively, owing to the memory capacity acting on sequential CSI data. In Pattern ABCE-D, compared with LSTM, ABLSTM further improved the performances for the four activities of clapping, lying down, falling, and picking up, and the average accuracy to 85.57%. In both Patterns ABDE-C and ACDE-B, ABLSTM, as the backbone network of our proposal, obtained higher average accuracies of 76.49% and 80.83%, respectively, than both CNN and LSTM. Outperforming the four conventional methods (CNN, LSTM, ABLSTM, and LAGMAT), owing to the use of the snapshot ensemble and the center loss, the proposed SE-ABLSTM-C obtained the highest accuracy in at least three of the five patterns for each of the five activities of waving, clapping, walking, lying down, and sitting down, and the highest accuracy in at least one pattern for the other two activities. For the average accuracies in all five patterns, the highest values of 84.37%, 78.86%, 80.74%, 89.03%, and 86.68% were obtained by our proposal, respectively. Furthermore, in the Student’s t-test [66], the p-values on the average accuracies over all five patterns between the proposed SE-ABLSTM-C method and the conventional CNN [42], LSTM [24], ABLSTM [45], and LAGMAT [54] methods are 0.016, 0.047, 0.038, and 0.051, respectively. All the p-values are less than or approximately equal to the significance level of 0.05, which suggests that our proposal markedly improved the average accuracies over the four conventional methods.

4.4. Limitations

Reviewing the relatively unsatisfactory performance for falling in Task I, a similar limitation was found for CPAR in Task II. In Table 7, in Pattern ABCE-D for example, the four conventional HAR methods (CNN [42], LSTM [24], ABLSTM [45], and LAGMAT [54]) and the proposed SE-ABLSTM-C method only obtained accuracies of 2.94–62.65% for falling, which is insufficient for recognizing the activity of falling. The confusion matrices of the mentioned five HAR methods on CPAR in Pattern ABCE-D are shown in Figure 11, where true falling was quite easily mispredicted as picking up, especially at rates of 92.06% and 88.82% by LSTM and the proposed SE-ABLSTM-C, respectively. As another example, in Pattern BCDE-A, the recognition accuracies of the four conventional methods and our proposal were within 33.82–81.76%, which may not satisfy a high-performance recognition of picking up. The confusion matrices of the mentioned five HAR methods on CPAR in Pattern BCDE-A are shown in Figure 12, where true picking up was easily mispredicted as falling, especially at rates of 63.24% and 62.35% by LSTM and the proposed SE-ABLSTM-C, respectively.
Even though the proposed SE-ABLSTM-C method obtained the highest accuracy in two of the five patterns (ABCD-E, ABCE-D, ABDE-C, ACDE-B, and BCDE-A) for falling and in one pattern for picking up, there are some low recognition accuracies, such as 3.82% for falling in Pattern ABCE-D and 37.06% for picking up in Pattern BCDE-A, as shown in Table 7. Recalling the important finding in [29,45] that the performance of Wi-Fi-based recognition methods is vulnerable to the similarity of different activities, the low performances may be mainly caused by the poor distinguishability of the two activities of falling and picking up, unlike the relatively good distinguishability of the other five activities. Note that the performance of LAGMAT was less affected by the similarity of falling and picking up, owing to its alleviation of the distribution shifts of data between seen and unknown persons.
Fortunately, in all five patterns shown in Table 7, our proposal obtained accuracies of at least 85.89% for the two clearly distinguishable activities of clapping and lying down (e.g., clapping in Pattern ACDE-B), and at least 68.82% for the relatively distinguishable waving, walking, and sitting down (e.g., sitting down in Pattern ABCD-E). In scenarios that require robust recognition of certain activities, categories similar to them, such as falling, picking up, and squatting, may need to be excluded.
Recalling the number of data records in each dataset shown in Table 1, our dataset is of medium scale. By combining an ensemble strategy and metric learning, our proposal can be deemed a prior study on CPAR, and the number of subjects may need to be increased further in future exploration. In addition, since this study mainly focuses on cross-person activity recognition, we selected the seven typical activities of waving, clapping, walking, lying down, sitting down, falling, and picking up. The proposed SE-ABLSTM-C method is not currently applicable to the finer-grained classification of falling modes, which may have different characteristics in direction or velocity.

5. Conclusions

Owing to the practicability of Wi-Fi-based devices, including non-contact sensing, robustness to light and temperature, and privacy protection, we proposed a novel CPAR method named SE-ABLSTM-C for recognizing the activities of unknown persons. By adopting the snapshot ensemble, the generalization ability of our proposal on CPAR is markedly improved, based on the selected backbone network of ABLSTM. Moreover, the introduced metric learning further improves the generalization by discriminating the features of samples, outperforming typical conventional HAR methods. In addition, our CSI dataset has been made public for the development of related studies. Under the assumption of distinguishable activities, the proposed CPAR method may be applied in developing commercial products for elderly healthcare or intrusion detection systems.
In the future, possible research directions include CPAR with similar categories of activities, the recognition of activities unseen in the training phase as an open-set problem, the realization of parallel computation over sequential samples by introducing the Transformer [65], and human gait trajectory generation combining Wi-Fi and cameras [7,21].

Author Contributions

Conceptualization, C.Y. and S.X.; methodology, C.Y. and S.X.; software, C.Y. and S.X.; validation, C.Y. and S.X.; formal analysis, C.Y. and S.X.; investigation, C.Y., S.X., Z.H. and Y.Y.; resources, C.Y. and G.G.; data curation, C.Y. and S.X.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y.; visualization, C.Y. and S.X.; supervision, T.O. and G.G.; project administration, G.G.; funding acquisition, C.Y. and T.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Foundation of Jilun Medical Intelligent Technology (Nanjing) Co., Ltd. (Grant No. 2024-298), Natural Science Research Start-up Foundation of Recruiting Talents of Nanjing University of Posts and Telecommunications (Grant No. NY223134), and JST ASPIRE (Japan) (Grant No. JPMJAP2326).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and download link of our dataset are available at https://github.com/NJUPT-Sivan/Cross-person-HAR (4 November 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ohtsuki, T. A smart city based on ambient intelligence (invited paper). IEICE Trans. Commun. 2017, E100.B, 1547–1553. [Google Scholar] [CrossRef]
  2. Xiao, Z.; Yu, H.; Yang, Y.; Gu, J.; Li, Y.; Zhuang, F.; Yu, D.; Ren, Z. HarMI: Human activity recognition via multi-modality incremental learning. IEEE J. Biomed. Health Inform. 2022, 26, 939–951. [Google Scholar]
  3. Basset, M.A.; Hawash, H.; Chang, V.; Chakrabortty, R.K.; Ryan, M.J. Deep learning for heterogeneous human activity recognition in complex IoT applications. IEEE Internet Things J. 2022, 9, 5653–5665. [Google Scholar] [CrossRef]
  4. Keidar, D.; Yaron, D.; Goldstein, E.; Shachar, Y.; Blass, A.; Charbinsky, L.; Aharony, I.; Lifshitz, L.; Lumelsky, D.; Neeman, Z.; et al. COVID-19 classification of X-ray images using deep neural networks. Eur. Radiol. 2021, 31, 9654–9663. [Google Scholar] [CrossRef] [PubMed]
  5. Semwal, V.B.; Gupta, A.; Lalwani, P. An optimized hybrid deep learning model using ensemble learning approach for human walking activities recognition. J. Supercomput. 2021, 77, 12256–12279. [Google Scholar] [CrossRef]
  6. Semwal, V.B.; Lalwani, P.; Mishra, M.K.; Bijalwan, V.; Chadha, J.S. An optimized feature selection using bio-geography optimization technique for human walking activities recognition. Computing 2021, 103, 2893–2914. [Google Scholar] [CrossRef]
  7. Semwal, V.B.; Jain, R.; Maheshwari, P.; Khatwani, S. Gait reference trajectory generation at different walking speeds using LSTM and CNN. Multimed. Tools Appl. 2023, 82, 33401–33419. [Google Scholar] [CrossRef]
  8. Semwal, V.B.; Kim, Y.; Bijalwan, V.; Verma, A.; Singh, G.; Gaud, N.; Baek, H.; Khan, A.M. Development of the LSTM model and universal polynomial equation for all the sub-phases of human gait. IEEE Sens. J. 2023, 23, 15892–15900. [Google Scholar] [CrossRef]
  9. Lara, O.D.; Labrador, M.A. A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 2013, 15, 1192–1209. [Google Scholar]
  10. Wang, X.; Yu, H.; Kold, S.; Rahbek, O.; Bai, S. Wearable sensors for activity monitoring and motion control: A review. Biomim. Intell. Robot. 2023, 3, 100089. [Google Scholar] [CrossRef]
  11. Patil, P.; Kumar, K.S.; Gaud, N.; Semwal, V.B. Clinical human gait classification extreme learning machine approach. In Proceedings of the 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, 3–5 May 2019. [Google Scholar]
  12. Dua, N.; Singh, S.N.; Semwal, V.B. Multi-input CNN-GRU based human activity recognition using wearable sensors. Computing 2021, 103, 1461–1478. [Google Scholar] [CrossRef]
  13. Semwal, V.B.; Gaud, N.; Lalwani, P.; Bijalwan, V.; Alok, A.K. Pattern identification of different human joints for different human walking styles using inertial measurement unit (IMU) sensor. Artif. Intell. Rev. 2022, 55, 1149–1169. [Google Scholar] [CrossRef]
  14. Challa, S.K.; Kumar, A.; Semwal, V.B. A multibranch CNN-BiLSTM model for human activity recognition using wearable sensor data. Vis. Comput. 2022, 103, 4095–4109. [Google Scholar] [CrossRef]
  15. Chen, W.; Lyu, M.; Ding, X.; Wang, J.; Zhang, J. Electromyography-controlled lower extremity exoskeleton to provide wearers flexibility in walking. Biomed. Signal Process. Control. 2023, 79, 104096. [Google Scholar] [CrossRef]
  16. Dua, N.; Singh, S.N.; Semwal, V.B.; Challa, S.K. Inception inspired CNN-GRU hybrid network for human activity recognition. Multimed. Tools Appl. 2023, 82, 5369–5403. [Google Scholar] [CrossRef]
  17. Liu, C.; Downey, R.J.; Salminen, J.S.; Arvelo, R.S.; Richer, N.; Pliner, E.M.; Hwang, J.; Cruz-Almeida, Y.; Manini, T.M.; Hass, C.J.; et al. Electrical brain activity during human walking with parametric variations in terrain unevenness and walking speed. Imaging Neurosci. 2023, 2, 1–33. [Google Scholar] [CrossRef]
  18. Beddiar, D.R.; Nini, B.; Sabokrou, M.; Hadid, A. Vision-based human activity recognition: A survey. Multimed. Tools Appl. 2020, 79, 30509–30555. [Google Scholar]
  19. Zhao, D.; Li, H.; Yan, S. Spatial-temporal synchronous transformer for skeleton-based hand gesture recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1403–1412. [Google Scholar] [CrossRef]
  20. Muthukumar, K.A.; Bouazizi, M.; Ohtsuki, T. An infrared array sensor-based approach for activity detection, combining low-cost technology with advanced deep learning techniques. Sensors 2022, 22, 3898. [Google Scholar] [CrossRef]
  21. Challa, S.K.; Kumar, A.; Semwal, V.B.; Dua, N. An optimized-LSTM and RGB-D sensor-based human gait trajectory generator for bipedal robot walking. IEEE Sens. J. 2022, 22, 24352–24363. [Google Scholar] [CrossRef]
  22. Habaebi, M.H.; Ali, M.M.; Hassan, M.M.; Shoib, M.S.; Zahrudin, A.A.; Kamarulzaman, A.A.; Azhan, W.S.W.; Islam, M.R. Development of physical intrusion detection system using Wi-Fi/ZigBee RF signals. Procedia Comput. Sci. 2015, 76, 547–552. [Google Scholar] [CrossRef]
  23. Arshad, S.; Feng, C.; Liu, Y.; Hu, Y.; Yu, R.; Zhou, S.; Li, H. Wi-chase: A WiFi based human activity recognition system for sensorless environments. In Proceedings of the IEEE 18th International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), Macau, China, 12–15 June 2017. [Google Scholar]
  24. Yousefi, S.; Narui, H.; Dayal, S.; Ermon, S.; Valaee, S. A survey on behavior recognition using WiFi channel state information. IEEE Commun. Mag. 2017, 55, 98–104. [Google Scholar] [CrossRef]
  25. Ding, X.; Jiang, T.; Zhong, Y.; Wu, S.; Yang, J.; Xue, W. Improving WiFi-based human activity recognition with adaptive initial state via one-shot learning. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021. [Google Scholar]
  26. Zhang, Y.; Wang, X.; Wang, Y.; Chen, H. Human activity recognition across scenes and categories based on CSI. IEEE Trans. Mob. Comput. 2022, 21, 2411–2420. [Google Scholar] [CrossRef]
  27. Forbes, G.; Massie, S.; Craw, S. WiFi-based human activity recognition using Raspberry Pi. In Proceedings of the IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020. [Google Scholar]
  28. Moshiri, P.F.; Shahbazian, R.; Nabati, M.; Ghorashi, S.A. A CSI-based human activity recognition using deep learning. Sensors 2021, 21, 7225. [Google Scholar] [CrossRef] [PubMed]
  29. Nakamura, T.; Bouazizi, M.; Yamamoto, K.; Ohtsuki, T. Wi-Fi-based fall detection using spectrogram image of channel state information. IEEE Internet Things J. 2022, 9, 17220–17234. [Google Scholar] [CrossRef]
  30. Islam, M.S.; Jannat, M.K.A.; Hossain, M.N.; Kim, W.-S.; Lee, S.-W.; Yang, S.-H. STC-NLSTMNet: An improved human activity recognition method using convolutional neural network with NLSTM from WiFi CSI. Sensors 2023, 23, 356. [Google Scholar] [CrossRef]
  31. Bouazizi, M.; Ye, C.; Ohtsuki, T. 2D LIDAR-based approach for activity identification and fall detection. IEEE Internet Things J. 2022, 9, 10872–10890. [Google Scholar] [CrossRef]
  32. Halperin, D.; Hu, W.; Sheth, A.; Wetherall, D. Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Comput. Commun. Rev. 2011, 41, 53. [Google Scholar] [CrossRef]
  33. Wang, W.; Liu, A.X.; Shahzad, M.; Ling, K.; Lu, S. Understanding and modeling of wifi signal based human activity recognition. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, New York, NY, USA, 7–11 September 2015. [Google Scholar]
  34. Hernandez, S.M.; Bulut, E. Lightweight and standalone IoT based WiFi sensing for active repositioning and mobility. In Proceedings of the IEEE 21st International Symposium on “A World of Wireless, Mobile and Multimedia Networks” (WoWMoM), Cork, Ireland, 31 August–3 September 2020. [Google Scholar]
  35. Sen, S.; Lee, J.; Kim, K.-H.; Congdon, P. Avoiding multipath to revive in-building WiFi localization. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 25–28 June 2013. [Google Scholar]
  36. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  37. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  38. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
  39. Wang, J.; Zhang, X.; Gao, Q.; Yue, H.; Wang, H. Device-free wireless localization and activity recognition: A deep learning approach. IEEE Trans. Veh. Technol. 2017, 66, 6258–6267. [Google Scholar] [CrossRef]
  40. Hou, C.; Liu, G.; Tian, Q.; Zhou, Z.; Hua, L.; Lin, Y. Multi-signal modulation classification using sliding window detection and complex convolutional network in frequency domain. IEEE Internet Things J. 2022, 9, 19438–19449. [Google Scholar] [CrossRef]
  41. Wang, Y.; Gui, G.; Lin, Y.; Wu, H.-C.; Yuen, C.; Adachi, F. Few-shot specific emitter identification via deep metric ensemble learning. IEEE Internet Things J. 2022, 9, 24980–24994. [Google Scholar] [CrossRef]
  42. Jiang, W.; Miao, C.; Ma, F.; Yao, S.; Wang, Y.; Yuan, Y.; Xue, H.; Song, C.; Ma, X.; Koutsonikolas, D.; et al. Towards environment independent device free human activity recognition. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New York, NY, USA, 29 October–2 November 2018. [Google Scholar]
  43. Yang, J.; Chen, X.; Zou, H.; Wang, D.; Xu, Q.; Xie, L. EfficientFi: Towards large-scale lightweight wifi sensing via csi compression. IEEE Internet Things J. 2022, 9, 13086–13095. [Google Scholar] [CrossRef]
  44. Zhang, J.; Wu, F.; Wei, B.; Zhang, Q.; Huang, H.; Shah, S.W.; Cheng, J. Data augmentation and dense-LSTM for human activity recognition using WiFi signal. IEEE Internet Things J. 2021, 8, 4628–4641. [Google Scholar] [CrossRef]
  45. Chen, Z.; Zhang, L.; Jiang, C.; Cao, Z.; Cui, W. WiFi CSI based passive human activity recognition using attention based BLSTM. IEEE Trans. Mob. Comput. 2019, 18, 2714–2724. [Google Scholar] [CrossRef]
  46. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
  47. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  48. Gers, F.A.; Schraudolph, N.N.; Schmidhuber, J. Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 2002, 3, 115–143. [Google Scholar]
  49. Li, B.; Cui, W.; Wang, W.; Zhang, L.; Chen, Z.; Wu, M. Two-stream convolution augmented transformer for human activity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
  50. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
  51. Cui, W.; Li, B.; Zhang, L.; Chen, Z. Device-free single-user activity recognition using diversified deep ensemble learning. Appl. Soft Comput. 2021, 102, 107066. [Google Scholar] [CrossRef]
  52. Wang, J.; Lan, C.; Liu, C.; Ouyang, Y.; Qin, T. Generalizing to unseen domains: A survey on domain generalization. IEEE Trans. Knowl. Data Eng. 2021, 35, 8052–8072. [Google Scholar]
  53. Qian, H.; Pan, S.J.; Miao, C. Latent independent excitation for generalizable sensor-based cross-person activity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
  54. Lu, W.; Wang, J.; Chen, Y. Local and global alignments for generalizable sensor-based human activity recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
  55. Liu, S.; Chen, Z.; Wu, M.; Liu, C.; Chen, L. WiSR: Wireless domain generalization based on style randomization. IEEE Trans. Mob. Comput. 2024, 23, 4520–4532. [Google Scholar] [CrossRef]
  56. Liu, S.; Chen, Z.; Wu, M.; Wang, H.; Xing, B.; Chen, L. Generalizing wireless cross-multiple-factor gesture recognition to unseen domains. IEEE Trans. Mob. Comput. 2024, 23, 5083–5096. [Google Scholar] [CrossRef]
  57. Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J.E.; Weinberger, K.Q. Snapshot Ensembles: Train 1, get M for free. arXiv 2017, arXiv:1704.00109. [Google Scholar]
  58. Kaya, M.; Bilge, H.S. Deep metric learning: A survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
  59. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  60. Xu, S.; He, Z.; Shi, W.; Wang, Y.; Ohtsuki, T.; Gui, G. Cross-person activity recognition method using snapshot ensemble learning. In Proceedings of the IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022. [Google Scholar]
  61. Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM SIGKDD Explor. Newsl. 2011, 12, 74–82. [Google Scholar] [CrossRef]
  62. Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A public domain dataset for human activity recognition using smartphones. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 24–26 April 2013. [Google Scholar]
  63. Kumar, D.; Jeuris, S.; Bardram, J.E.; Dragoni, N. Mobile and wearable sensing frameworks for mHealth Studies and applications: A systematic review. ACM Trans. Comput. Healthc. 2020, 2, 1–28. [Google Scholar] [CrossRef]
  64. Dietterich, T.G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
  65. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  66. Wellek, S. Testing Statistical Hypotheses of Equivalence and Noninferiority, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2010. [Google Scholar]
Figure 1. Outline of invasive and non-invasive wearable sensors for activity monitoring and rehabilitation.
Figure 2. (a) An illustration of our CSI data collection for HAR in an indoor environment. (b) The layout of the experiment room with the walking pattern.
Figure 3. An example of raw CSI data for different activities.
Figure 4. An illustration of the space expansion of individual hypotheses via ensemble, for learning a better approximation to a true hypothesis.
Figure 5. System framework of the proposed CPAR method.
Figure 6. Structure of ABLSTM network. Note that center loss can be optionally combined with softmax loss (see dashed line).
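For reference, the following is a minimal Keras-style sketch of an attention-based bidirectional LSTM classifier of the kind shown in Figure 6. The layer sizes and the simple additive attention used here are illustrative assumptions, not the authors' exact architecture; the input shape (200 time steps × 32 subcarriers) follows Tables 2 and 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal ABLSTM sketch (assumed layer sizes, not the authors' implementation).
def build_ablstm(time_steps=200, n_subcarriers=32, n_classes=7, units=128):
    inputs = layers.Input(shape=(time_steps, n_subcarriers))
    # Bidirectional LSTM over the CSI time series, keeping all time steps.
    h = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
    # Simple additive attention: score each step, normalize, take the weighted sum.
    scores = layers.Dense(1, activation="tanh")(h)            # (batch, T, 1)
    weights = layers.Softmax(axis=1)(scores)                  # attention over time
    context = layers.Lambda(
        lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([h, weights])
    outputs = layers.Dense(n_classes, activation="softmax")(context)
    return models.Model(inputs, outputs)

model = build_ablstm()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The attention layer pools the BLSTM outputs over time before the softmax classifier; the center-loss branch indicated by the dashed line in Figure 6 would be attached to the pooled feature vector.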
Figure 7. Training phase and recognition phase in the proposed CPAR method.
Figure 8. Comparison of SGD optimization, common ensemble, and snapshot ensemble. (a) SGD optimization via a constant or decreasing learning rate (LR). (b) Common ensemble via a constant or decreasing LR. (c) Snapshot ensemble via a cyclic LR.
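As a concrete illustration of the cyclic LR in Figure 8c, below is a minimal sketch of the cosine-annealing schedule used by snapshot ensembles [57], with the values from Table 3(b) (initial learning rate 0.01, M = 10 cycles, 50 epochs per cycle). The helper name and the snapshot-saving call are illustrative assumptions, not the authors' code.

```python
import math

# Cyclic cosine-annealing schedule of snapshot ensembles (Huang et al. [57]).
# Values follow Table 3(b): alpha_0 = 0.01, 50 epochs per cycle, M = 10 cycles.
def snapshot_lr(epoch, alpha_0=0.01, epochs_per_cycle=50):
    """Learning rate for a given (0-indexed) training epoch."""
    t = epoch % epochs_per_cycle                 # position within the current cycle
    return 0.5 * alpha_0 * (math.cos(math.pi * t / epochs_per_cycle) + 1.0)

# Could be plugged into Keras as, e.g.,
#   tf.keras.callbacks.LearningRateScheduler(snapshot_lr)
# At the end of each cycle the current weights would be saved as one snapshot
# (base-classifier), e.g. model.save_weights(f"snapshot_{epoch // 50}.h5").
```

Within each cycle the rate decays from the initial value towards zero; resetting it at the cycle boundary lets training escape the local minimum whose weights were just saved as a base-classifier.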
Figure 9. A schematic diagram of center loss.
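To make the schematic concrete, the following is a minimal TensorFlow sketch of the center loss of Wen et al. [59] combined with the softmax loss, using the balance factor λ = 1.0 from Table 3(b). The function and variable names are illustrative, not the authors' implementation, and the per-class centers are assumed to be maintained as a separate trainable tensor.

```python
import tensorflow as tf

# Center loss (Wen et al. [59]): L_C = 1/2 * sum_i || f_i - c_{y_i} ||^2,
# combined with the softmax (cross-entropy) loss as L = L_S + lambda * L_C.
def center_loss(features, labels, centers):
    """features: (batch, dim) embeddings; labels: (batch,) int class ids;
    centers: (num_classes, dim) per-class centers (updated during training)."""
    batch_centers = tf.gather(centers, labels)        # c_{y_i} for each sample
    return 0.5 * tf.reduce_sum(tf.square(features - batch_centers))

def total_loss(logits, features, labels, centers, lam=1.0):
    softmax_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return softmax_loss + lam * center_loss(features, labels, centers)
```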
Figure 10. Confusion matrices by conventional HAR methods for seen persons (Task I). (a) CNN [42]. (b) LSTM [24]. (c) BLSTM. (d) ABLSTM [45].
Figure 11. Confusion matrices by HAR methods on CPAR in Pattern ABCE-D (Task II). (a) CNN [42] (avg. accuracy: 74.83%). (b) LSTM [24] (68.61%). (c) ABLSTM [45] (71.63%). (d) LAGMAT [54] (77.01%). (e) SE-ABLSTM-C (78.86%).
Figure 12. Confusion matrices by HAR methods on CPAR in Pattern BCDE-A (Task II). (a) CNN [42] (avg. accuracy: 83.99%). (b) LSTM [24] (83.29%). (c) ABLSTM [45] (83.57%). (d) LAGMAT [54] (85.67%). (e) SE-ABLSTM-C (86.68%).
Table 1. Dataset Description.

(a) CSI Datasets for HAR

Study | CSI Collection Tool | No. of Data (subj. × act. × round) | Publicity
S. Arshad, et al. [23] | 5300 CSI Tool | 720 (12 × 3 × 20) | No
S. Yousefi, et al. [24] | 5300 CSI Tool | 720 (6 × 6 × 20) | Yes
X. Ding, et al. [25] | 5300 CSI Tool | 200–400 (- × 4 × -) | No
Y. Zhang, et al. [26] | 5300 CSI Tool | 500 (1 × 10 × 50) | No
G. Forbes, et al. [27] | Raspberry Pi | 1100 (1 × 11 × 100) | No
F. Moshiri, et al. [28] | Raspberry Pi | 420 (3 × 7 × 20) | Yes
Our dataset | ESP32 CSI Tool | 700 (5 × 7 × 20) | Yes

(b) Basic Information of Five Subjects in Our Dataset

Subject | Gender | Height (cm) | Weight (kg)
A | Male | 172 | 52
B | Male | 173 | 60
C | Male | 165 | 50
D | Male | 175 | 68
E | Male | 168 | 59
Table 2. ESP32 CSI Tool Setting.

Parameter | Specification
Project | ACTIVE_AP/STA
Compiler version | IDF v4.3.2
No. of antennae | 1
Sampling rate | 63 Hz
No. of subcarriers | 32
No. of Wi-Fi channels | 6
Packet rate | 100 packet/s
Voltage | 1100 mV
Distance of LoS path | 2.5 m
Height of placement | 1.0 m
Table 3. Experimental Parameter.

(a) CSI Dataset Preparation

Parameter | Specification
Observation time per round | 16.0 s
Time window length | 3.2 s
Sliding window length | 0.8 s
Sample length | 200
No. of samples per round | 17
Total no. of samples | 11,900

(b) Hyper-Parameter for Model Training

Parameter | Specification
Python framework | Tensorflow 2.4.1
Loss function | Softmax + Center
Balance factor λ | 1.0
Optimizer | Adam
Batch size | 128
Initial learning rate α0 | 0.01
No. of cycles M | 10
No. of epochs per cycle | 50
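The entries of Table 3(a) fit together as follows: at roughly 63 Hz, a 3.2 s window contains about 200 CSI packets (the sample length), and sliding that window by 0.8 s over a 16.0 s round gives (16.0 - 3.2) / 0.8 + 1 = 17 samples per round, hence 17 × 700 rounds = 11,900 samples in total. A minimal sketch of this segmentation is given below; the array and function names are illustrative, not taken from the authors' code.

```python
import numpy as np

# Sliding-window segmentation implied by Table 3(a): a 200-packet window
# (about 3.2 s at ~63 Hz) slid by ~50 packets (0.8 s) over one 16.0 s round.
def segment_round(csi_round, win_len=200, step=50):
    """csi_round: (time, subcarriers) CSI amplitudes for one recording round."""
    samples = [csi_round[s:s + win_len]
               for s in range(0, csi_round.shape[0] - win_len + 1, step)]
    return np.stack(samples)              # (n_samples, win_len, subcarriers)

csi_round = np.random.randn(1008, 32)     # ~16 s at 63 Hz, 32 subcarriers
print(segment_round(csi_round).shape)     # -> (17, 200, 32)
```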
Table 4. Total Average Accuracy Comparison of Conventional HAR Methods for All Seven Activities of Seen Persons (Task I).

CNN [42] | LSTM [24] | BLSTM | ABLSTM [45]
91.01% | 92.77% | 94.37% | 97.23%
Table 5. Accuracy Comparison on Contrastive Losses (Task II).

Approach | Waving | Clapping | Walking | Lying Down | Sitting Down | Falling | Picking Up | Avg. | Pattern
ABLSTM [45] | 86.76% | 92.06% | 74.12% | 78.24% | 62.65% | 89.71% | 85.88% | 81.34% | ABCD-E
ABLSTM-T | 87.35% | 91.47% | 69.12% | 79.41% | 61.18% | 90.29% | 79.41% | 79.75% | ABCD-E
ABLSTM-C | 84.12% | 88.53% | 75.59% | 81.47% | 57.65% | 92.06% | 89.12% | 81.22% | ABCD-E
ABLSTM [45] | 94.41% | 86.18% | 85.84% | 88.53% | 66.47% | 6.47% | 75.53% | 71.63% | ABCE-D
ABLSTM-T | 97.65% | 84.41% | 93.81% | 90.00% | 69.41% | 0.88% | 77.06% | 73.31% | ABCE-D
ABLSTM-C | 97.35% | 92.94% | 85.84% | 90.00% | 75.79% | 0.88% | 78.53% | 74.48% | ABCE-D
ABLSTM [45] | 97.35% | 82.06% | 73.16% | 99.71% | 94.71% | 22.12% | 66.18% | 76.49% | ABDE-C
ABLSTM-T | 98.53% | 77.35% | 46.02% | 100.00% | 96.47% | 17.40% | 72.06% | 72.55% | ABDE-C
ABLSTM-C | 90.59% | 81.18% | 76.11% | 99.71% | 94.12% | 26.55% | 73.82% | 77.46% | ABDE-C
ABLSTM [45] | 95.29% | 69.71% | 81.47% | 90.27% | 93.24% | 50.00% | 85.88% | 80.83% | ACDE-B
ABLSTM-T | 98.53% | 71.47% | 93.53% | 80.83% | 96.47% | 53.82% | 80.88% | 82.22% | ACDE-B
ABLSTM-C | 97.06% | 80.00% | 92.35% | 89.97% | 95.59% | 49.71% | 90.59% | 85.04% | ACDE-B
ABLSTM [45] | 91.18% | 92.35% | 92.65% | 98.53% | 98.53% | 70.59% | 42.30% | 83.57% | BCDE-A
ABLSTM-T | 87.16% | 98.24% | 95.59% | 99.41% | 98.82% | 74.41% | 44.71% | 85.46% | BCDE-A
ABLSTM-C | 97.65% | 96.47% | 92.94% | 98.53% | 97.06% | 82.06% | 32.65% | 85.34% | BCDE-A
ABLSTM-T: ABLSTM with triplet loss. ABLSTM-C: ABLSTM with center loss.
Table 6. Accuracy Comparison on Ablation Study (Task II).

Approach | Waving | Clapping | Walking | Lying Down | Sitting Down | Falling | Picking Up | Avg. | Avg. imp. | Pattern
ABLSTM [45] | 86.76% | 92.06% | 74.12% | 78.24% | 62.65% | 89.71% | 85.88% | 81.34% | 3.03% | ABCD-E
SE-ABLSTM [60] | 82.65% | 96.18% | 74.41% | 85.29% | 66.18% | 76.47% | 89.12% | 81.47% | 2.90% | ABCD-E
ABLSTM-C | 84.12% | 88.53% | 75.59% | 81.47% | 57.65% | 92.06% | 89.12% | 81.22% | 3.15% | ABCD-E
SE-ABLSTM-C (prop.) | 81.18% | 96.47% | 74.71% | 87.06% | 68.82% | 94.41% | 87.94% | 84.37% | - | ABCD-E
ABLSTM [45] | 94.41% | 86.18% | 85.84% | 88.53% | 66.47% | 6.47% | 75.53% | 71.63% | 7.23% | ABCE-D
SE-ABLSTM [60] | 97.94% | 94.12% | 91.45% | 92.94% | 68.53% | 2.35% | 83.53% | 75.83% | 3.03% | ABCE-D
ABLSTM-C | 97.35% | 92.94% | 85.84% | 90.00% | 75.79% | 0.88% | 78.53% | 74.48% | 4.38% | ABCE-D
SE-ABLSTM-C (prop.) | 98.82% | 96.47% | 91.15% | 94.12% | 80.29% | 3.82% | 87.35% | 78.86% | - | ABCE-D
ABLSTM [45] | 97.35% | 82.06% | 73.16% | 99.71% | 94.71% | 22.12% | 66.18% | 76.49% | 4.25% | ABDE-C
SE-ABLSTM [60] | 97.94% | 85.29% | 72.57% | 100.00% | 97.94% | 28.32% | 70.29% | 78.93% | 1.81% | ABDE-C
ABLSTM-C | 90.59% | 81.18% | 76.11% | 99.71% | 94.12% | 26.55% | 73.82% | 77.46% | 3.28% | ABDE-C
SE-ABLSTM-C (prop.) | 95.59% | 88.53% | 79.35% | 100.00% | 95.88% | 32.45% | 73.24% | 80.74% | - | ABDE-C
ABLSTM [45] | 95.29% | 69.71% | 81.47% | 90.27% | 93.24% | 50.00% | 85.88% | 80.83% | 8.20% | ACDE-B
SE-ABLSTM [60] | 98.53% | 85.29% | 93.53% | 99.12% | 100.00% | 54.41% | 91.18% | 88.87% | 0.16% | ACDE-B
ABLSTM-C | 97.06% | 80.00% | 92.35% | 89.97% | 95.59% | 49.71% | 90.59% | 85.04% | 3.99% | ACDE-B
SE-ABLSTM-C (prop.) | 99.41% | 85.59% | 97.65% | 97.64% | 100.00% | 50.88% | 92.06% | 89.03% | - | ACDE-B
ABLSTM [45] | 91.18% | 92.35% | 92.65% | 98.53% | 98.53% | 70.59% | 42.30% | 83.57% | 3.11% | BCDE-A
SE-ABLSTM [60] | 96.18% | 95.88% | 95.00% | 100.00% | 100.00% | 75.00% | 37.65% | 85.67% | 1.01% | BCDE-A
ABLSTM-C | 97.65% | 96.47% | 92.94% | 98.53% | 97.06% | 82.06% | 32.65% | 85.34% | 1.34% | BCDE-A
SE-ABLSTM-C (prop.) | 96.47% | 99.12% | 94.41% | 100.00% | 98.24% | 81.47% | 37.06% | 86.68% | - | BCDE-A
SE-ABLSTM: Snapshot ensemble-used ABLSTM. ABLSTM-C: ABLSTM with center loss. SE-ABLSTM-C (prop.): Snapshot ensemble-used ABLSTM with center loss. "Avg. imp." denotes the improvement in average accuracy achieved by SE-ABLSTM-C over the method in each row.
Table 7. Accuracy Comparison among HAR Methods on CPAR (Task II).

Approach | Waving | Clapping | Walking | Lying Down | Sitting Down | Falling | Picking Up | Avg. | Pattern
CNN [42] | 50.59% | 92.65% | 67.65% | 81.76% | 65.29% | 84.12% | 65.88% | 72.56% | ABCD-E
LSTM [24] | 96.76% | 92.65% | 76.18% | 82.35% | 65.00% | 85.29% | 89.41% | 83.95% | ABCD-E
ABLSTM [45] | 86.76% | 92.06% | 74.12% | 78.24% | 62.65% | 89.71% | 85.88% | 81.34% | ABCD-E
LAGMAT [54] | 35.88% | 91.76% | 75.29% | 83.24% | 57.06% | 73.53% | 77.65% | 70.63% | ABCD-E
SE-ABLSTM-C (prop.) | 81.18% | 96.47% | 74.71% | 87.06% | 68.82% | 94.41% | 87.94% | 84.37% | ABCD-E
CNN [42] | 92.94% | 94.12% | 90.86% | 88.24% | 79.71% | 2.94% | 75.00% | 74.83% | ABCE-D
LSTM [24] | 95.88% | 61.47% | 83.78% | 83.24% | 67.94% | 2.65% | 85.29% | 68.61% | ABCE-D
ABLSTM [45] | 94.41% | 86.18% | 85.84% | 88.53% | 66.47% | 6.47% | 75.53% | 71.63% | ABCE-D
LAGMAT [54] | 77.65% | 84.71% | 72.57% | 95.59% | 67.06% | 62.65% | 78.82% | 77.01% | ABCE-D
SE-ABLSTM-C (prop.) | 98.82% | 96.47% | 91.15% | 94.12% | 80.29% | 3.82% | 87.35% | 78.86% | ABCE-D
CNN [42] | 96.47% | 19.71% | 77.88% | 100.00% | 96.76% | 38.05% | 80.59% | 72.78% | ABDE-C
LSTM [24] | 97.94% | 82.35% | 43.36% | 100.00% | 95.29% | 9.14% | 70.00% | 71.15% | ABDE-C
ABLSTM [45] | 97.35% | 82.06% | 73.16% | 99.71% | 94.71% | 22.12% | 66.18% | 76.49% | ABDE-C
LAGMAT [54] | 97.06% | 65.59% | 83.78% | 95.29% | 71.18% | 53.69% | 72.06% | 76.95% | ABDE-C
SE-ABLSTM-C (prop.) | 95.59% | 88.53% | 79.35% | 100.00% | 95.88% | 32.45% | 73.24% | 80.74% | ABDE-C
CNN [42] | 97.94% | 22.65% | 95.88% | 88.79% | 97.35% | 66.76% | 94.71% | 80.58% | ACDE-B
LSTM [24] | 99.41% | 71.76% | 95.88% | 47.20% | 97.35% | 57.65% | 77.94% | 78.17% | ACDE-B
ABLSTM [45] | 95.29% | 69.71% | 81.47% | 90.27% | 93.24% | 50.00% | 85.88% | 80.83% | ACDE-B
LAGMAT [54] | 91.18% | 81.18% | 83.53% | 83.19% | 79.12% | 63.82% | 90.59% | 81.80% | ACDE-B
SE-ABLSTM-C (prop.) | 99.41% | 85.89% | 97.65% | 97.64% | 100.00% | 50.88% | 92.06% | 89.03% | ACDE-B
CNN [42] | 86.76% | 99.12% | 94.41% | 99.71% | 98.24% | 71.76% | 37.94% | 83.99% | BCDE-A
LSTM [24] | 93.82% | 97.94% | 88.82% | 98.53% | 98.24% | 71.18% | 33.82% | 83.29% | BCDE-A
ABLSTM [45] | 91.18% | 92.35% | 92.65% | 98.53% | 98.53% | 70.59% | 42.30% | 83.57% | BCDE-A
LAGMAT [54] | 78.53% | 94.41% | 86.47% | 94.12% | 90.00% | 74.41% | 81.76% | 85.67% | BCDE-A
SE-ABLSTM-C (prop.) | 96.47% | 99.12% | 94.41% | 100.00% | 98.24% | 81.47% | 37.06% | 86.68% | BCDE-A
SE-ABLSTM-C (prop.): Snapshot ensemble-used ABLSTM with center loss. In the Student's t-test [66], the p-values on the average accuracies for all five patterns between the proposed SE-ABLSTM-C and conventional CNN [42], LSTM [24], ABLSTM [45], and LAGMAT [54] are 0.016, 0.047, 0.038, and 0.051, respectively.
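The comparison behind these p-values can be sketched over the five pattern-wise average accuracies listed in Table 7; the snippet below compares SE-ABLSTM-C with CNN. A paired two-sided Student's t-test is assumed here, since the exact test variant is specified only via [66], so the computed value may differ slightly from the reported one.

```python
from scipy import stats

# Per-pattern average accuracies (%) from Table 7, in the order
# ABCD-E, ABCE-D, ABDE-C, ACDE-B, BCDE-A.
se_ablstm_c = [84.37, 78.86, 80.74, 89.03, 86.68]
cnn         = [72.56, 74.83, 72.78, 80.58, 83.99]

# Paired two-sided t-test across the five cross-person patterns (assumed variant).
t_stat, p_value = stats.ttest_rel(se_ablstm_c, cnn)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```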