Human Activity Recognition via Score Level Fusion of Wi-Fi CSI Signals

Wi-Fi signals are ubiquitous and provide a convenient, covert, and non-invasive means of recognizing human activity, which is particularly useful for healthcare monitoring. In this study, we investigate a score-level fusion structure for human activity recognition using Wi-Fi channel state information (CSI) signals. The raw CSI signals undergo an important preprocessing stage before being classified using conventional classifiers at the first level. The output scores of two conventional classifiers are then fused via an analytic network that does not require an iterative search for learning. Our experimental results show that the fusion provides good generalization and a shorter learning processing time compared with state-of-the-art networks.


Introduction
Human activity recognition (HAR) is a field of research and technology that focuses on developing methods for automatically identifying and understanding human activities using sensor data [1,2]. HAR has a wide range of applications in various domains, including healthcare, security, sports, robotics, and surveillance (see e.g., [1,3]). In recent years, the importance of HAR has grown significantly due to its potential benefits and the many practical applications it offers [4].
There are three main approaches to HAR: vision-and-sound-based [5], mobile-device-based [6], and Wi-Fi-signal-based [7]. Each approach has its own advantages and challenges. The vision-and-sound-based approach can raise concerns about security and privacy due to the use of cameras and microphones. The mobile-device-based approach requires individuals to wear or carry smart devices, which can be expensive and inconvenient. In contrast, the Wi-Fi-based approach does not involve the use of cameras or microphones, so it does not raise the same concerns about security and privacy [8]. Additionally, individuals do not need to wear or carry any devices, as Wi-Fi signals are readily available in most environments. This makes the Wi-Fi-based approach a promising candidate for HAR. A brief review is provided in Section 2 regarding each of these approaches.
Wi-Fi signals, which operate within the radio frequency spectrum, can be affected by various factors such as interference, obstructions, and signal absorption. When humans move or interact with objects within the signal path, they can inadvertently affect the propagation of Wi-Fi signals [9]. This can lead to observable patterns in signal quality and connectivity. For example, the human body itself can act as an obstacle that attenuates or blocks Wi-Fi signals. When a person moves around a space, their physical presence can cause fluctuations in signal strength as the signals encounter the obstruction posed by their body. These disturbances can manifest as recognizable patterns in Wi-Fi connectivity, often seen as intermittent drops or variations in signal strength [7]. By capitalizing on this phenomenon, it is possible to use disturbed Wi-Fi CSI signals for HAR.
Vision-and-Sound-Based HAR
For vision-and-sound-based systems, a comprehensive survey of space-time and hierarchical approaches for HAR is provided by Aggarwal et al. [1]. In video-based systems, moving pictures are converted into 2D images, and image features are extracted for classification. For example, Anitha et al. [14] developed a system to recognize hand gestures by converting human action videos into 2D images and extracting features using the Laplace Smoothing Transform (LST) and Kernel Principal Component Analysis (KPCA), with KNN used for classification. Local space-time features can also be used with classifiers such as SVM for recognizing human actions [15]. Ahmad et al. [16] used Spatio-Temporal Interest Points (STIPs) to detect important changes in images, extracting appearance and motion features using Histogram of Oriented Gradient (HOG) and Histogram of Optical Flow (HOF) descriptors, with SVM used for classification. Since using videos and images can jeopardize user privacy, other non-revealing sensors have been explored for HAR. For instance, Fu et al. [17] developed a motion detector with sub-millisecond temporal resolution using a contrast vision sensor. Sound signals have also been used, such as in the work of Stork et al. [18], who proposed a Non-Markovian Ensemble Voting (NEV) method to classify multiple human activities in real-time based on characteristic sounds. Additionally, 3D skeletal data have been utilized, as in the work of Ramezanpanah et al. [19], who represented 3D skeletal features of human action using Laban movement analysis and dynamic time warping, with SVM used for activity classification.
In addition to the classical machine learning methods mentioned above, recent work has focused on using deep learning for human action or activity recognition. For example, Wang et al. [20] developed a system for 3D human activity recognition using a re-configurable Convolutional Neural Network (CNN). Other examples of deep learning-based applications include the work of Dobhal et al. [21], who recognized human activity based on binary motion images and deep learning. In the work of Mahjoub et al. [22], image sequences were combined into a single Binary Motion Image (BMI) for feature representation, with a CNN used for classification.

Mobile-Device-Based HAR
In healthcare systems, wearable sensors are commonly used to recognize human activities. According to a survey by Thakur et al. [2], smartphone sensors such as accelerometers, gyroscopes, magnetometers, digital compasses, microphones, GPS, and cameras can be used to monitor and recognize human activities. For example, Anjum et al. [23] developed a smartphone application that uses the embedded motion sensor to track users' physical activities and to provide feedback. The application estimates calories burned and breaks it down by activity such as walking, running, climbing stairs, descending stairs, driving, cycling, and being inactive. Another example is a framework proposed by Nandy et al. [24] that combines features from a smartphone accelerometer and a wearable heart rate sensor to recognize intense physical activity. The framework uses an ensemble model based on different classifiers. In addition to smartphone sensors, wearable acoustic sensors such as the bodyscope developed by Yatani et al. [25] can be used to detect and classify human activities. Human activities can also be remotely monitored via the use of wearable sensors that track heart rate, respiration rate, and body acceleration. Castro et al. [26] developed a remote monitoring system based on the Internet of Things (IoT) that uses these sensors.
The development of mobile-device-based HAR has followed a similar trend to vision-and-sound-based HAR in terms of the popularity of deep learning deployment. In addition to the MLP and the CNN, the LSTM is also a popular choice. For example, Voicu et al. [27] used smartphone sensors such as accelerometers, gyroscopes, and gravity sensors to recognize physical activity, with an MLP used for learning and classification. Rustam et al. [28] employed sensor data from gyroscopes and accelerometers with a deep network model called Deep Stacked Multilayered-Perceptron (DS-MLP) for HAR. Chen et al. [29] constructed a CNN network for HAR based on a single accelerometer, with modified convolution kernels to adapt to the characteristics of tri-axial acceleration signals. Ghate et al. [30] constructed a hybrid deep learning model that combines deep neural networks with LSTM and GRU for effective classification of CNN-engineered features. The network integrates a CNN with a Random Forest Classifier (DeepCNN-RF) to add randomness to the model.

Wi-Fi-Based HAR
Different from the mobile-device-based approach, Wi-Fi signals offer a device-free solution for HAR. Compared with vision-based methods, Wi-Fi signals do not provide detailed information related to privacy. Two types of Wi-Fi signals are commonly used for HAR: the RSSI and the CSI signals. The RSSI refers to the coarse-grained received signal strength indicator, and the CSI refers to the fine-grained channel state information. The RSSI is susceptible to signal fading, distortion, and inconsistency, since it is a low-resolution parameter measured per packet index with only a single value per packet. For this reason, the RSSI is being replaced by the CSI in Wi-Fi sensing solutions. The Wi-Fi CSI is a fine-grained signal measured via Orthogonal Frequency Division Multiplexing (OFDM) subcarriers. The following review is organized according to these two types of Wi-Fi technology.
In terms of using RSSI signals, four wireless technologies, namely Wi-Fi, BLE, Zigbee, and LoRaWAN, have been evaluated for indoor localization. According to a study by Sadowski et al. [31], Wi-Fi was found to be the most accurate among the investigated wireless technologies based on RSSI signals. In another example, Hsieh et al. [32] recognized human activity in an indoor environment using Wi-Fi RSSI and investigated the effectiveness of several machine learning methods, such as the MLP and the SVM, for activity detection.
In terms of using CSI signals, it has been found that CSI can capture unique patterns of small-scale fading caused by different human activities at a subcarrier level, which is not available in traditional received signal strength (RSS) extracted at the per-packet level [33]. Most methods that use CSI signals employ deep networks. For example, Chen et al. [34] constructed a deep learning network for HAR that uses an attention-based bi-directional long short-term memory model. Wang et al. [35] developed a deep learning network that combines hidden features from both temporal and spatial dimensions for accurate and reliable recognition. Other examples include the work of Lu et al. [36] and Islam et al. [37], who implemented a channel-exchanging fusion network to fuse CSI amplitude and phase features for HAR and constructed a spatio-temporal convolution with nested long short-term memory (STC-NLSTMNet) to extract spatial and temporal features concurrently for automatic recognition of human activities, respectively. Recent algorithms such as transfer learning and attention mechanisms have also been used in HAR. For instance, Yang et al. [38] used an attention mechanism with LSTM features at different dimensions for HAR, while Jung et al. [8] performed in-air handwritten signature recognition using transfer learning due to limited data availability.
Finally, we shall focus on the use of CNN, a popular network for CSI-based HAR. In the work of Moshiri et al. [39], CSI data were converted into images and a 2D CNN classifier was employed for HAR. Zhang et al. [40] exploited semantic activity features and temporal features from different dimensions to characterize activity at different locations. The semantic activity features were extracted using a CNN combined with a convolutional attention module (CBAM), while the temporal features were extracted using a Bidirectional Gated Recurrent Unit (BGRU) combined with a self-attention mechanism. In addition to semantic features, dimension reduction techniques have also been used with CNN for HAR. Showmik et al. [41] proposed a Principal Component-based Wavelet Convolutional Neural Network (PCWCNN), which uses PCA and Discrete Wavelet Transform (DWT) as preprocessing algorithms and a Wavelet CNN for classification. Zou et al. [42] fused a tailored CNN model with a variant of the C3D model using vision for HAR.
Summarizing the above work, it is noted that vision-and-sound-based approaches face a fundamental limitation in securing the audio and video channels against potential violations of human privacy.

Wi-Fi for Human Activity Recognition
As mentioned in the related work section, the Wi-Fi RSSI and the Wi-Fi CSI are two types of signal measurements that can be utilized for HAR [7]. The principle underlying these technologies is the Doppler effect, where the wavelength of the reflected signals changes according to the relative motion of the human within the signal propagation space [43]. The RSSI stands for received signal strength indicator, which is an energy characteristic of the Media Access Control layer [44]. However, due to its reliance on channel strength superposition, it may not accurately reflect changes in the channel, leading to a reduced detection rate. On the other hand, the CSI represents fine-grained channel state information, which includes specific indicators such as carrier signal strength, amplitude, phase, and signal delay [45]. At the physical layer, the CSI can capture micro-dynamic changes in human activity, enabling the detection of rapid changes caused by the superposition of multipath signal components. Conceptually, the channel response relates to the RSSI much as a rainbow relates to a beam of sunlight: OFDM can be thought of as the prism that refracts the RSSI into the CSI, allowing the components of different wavelengths to be separated. As a result, the OFDM modulation system can make Wi-Fi-based HAR systems more robust against complex indoor environments, improving their effectiveness and accuracy [46].

Wi-Fi CSI Signal Features
The CSI is the channel response extracted from the OFDM subcarriers using fine-grained wireless channel measurement [7,45,47]. In an OFDM system, the CSI data on each subcarrier are modulated and converted into the frequency domain via the Fast Fourier Transform [46]. This provides an estimate of the amplitude and phase of each subcarrier of the channel. During operation, the transmitted signal traverses multiple paths before arriving at the receiver. Each of these paths introduces distinct variations in time delay, amplitude attenuation, and phase shift, where the Channel Impulse Response (CIR) [46] can be expressed as

h(t) = Σ_{i=1}^{n} a_i e^{−jθ_i} δ(t − τ_i). (1)

In this equation, a signal from the ith path is represented by a_i for its amplitude, θ_i for its phase, and τ_i for its time delay; n denotes the total number of paths, and δ(t) refers to the Dirac delta function.
For data collection in practice, the Channel Frequency Response (CFR) can be utilized to model the transmitting channel in place of the CIR. This is because the commodity hardware may not have the required time resolution to capture rapid changes in the signal. Under the unlimited bandwidth condition, the CFR can be derived from the CIR by applying the Fast Fourier Transform.
In the frequency domain, the channel response of each subcarrier can be written as

H(f_k) = |H(f_k)| e^{j∠H(f_k)}, k = 1, . . . , K,

where H(f_k) denotes the CSI of the k-th subcarrier, |H(f_k)| its amplitude, and ∠H(f_k) the corresponding phase shift information. By packing the CSI data based on the subcarrier index and the packet number, we can write

H = [H_{k,j}] ∈ C^{K×p}, k ∈ {1, 2, . . . , K}, j ∈ {1, 2, . . . , p},

where {1, 2, . . . , K} denotes the subcarrier indices and {1, 2, . . . , p} denotes the packet numbers.
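As a numerical illustration of the multipath channel model and the K-subcarrier-by-p-packet packing above, the following sketch builds a synthetic CFR and extracts its amplitude and phase. All channel parameters here (path count, frequencies, delays) are invented for illustration only:

```python
import numpy as np

# Synthetic multipath channel: n paths, each with amplitude a_i,
# phase theta_i, and time delay tau_i (all values illustrative).
rng = np.random.default_rng(0)
n_paths, K, p = 5, 52, 100                  # paths, subcarriers, packets

a = rng.uniform(0.1, 1.0, n_paths)          # amplitude attenuation a_i
theta = rng.uniform(0, 2 * np.pi, n_paths)  # phase shift theta_i
tau = rng.uniform(0, 1e-7, n_paths)         # time delay tau_i (seconds)

# CFR sampled at K subcarrier frequencies f_k (illustrative 2.4 GHz band):
f = np.linspace(2.412e9, 2.432e9, K)
H = np.array([np.sum(a * np.exp(-1j * (theta + 2 * np.pi * fk * tau)))
              for fk in f])

amplitude = np.abs(H)        # |H(f_k)|
phase = np.angle(H)          # angle of H(f_k)

# Packing K subcarriers over p packets into a CSI matrix (p copies of the
# same static channel here, purely for illustration of the K x p layout):
csi = np.tile(H[:, None], (1, p))
print(csi.shape)  # (52, 100)
```

In a real measurement, each of the p columns would come from a different received packet, so human motion would show up as variation across columns.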

Proposed System
In this study, we adopt the score fusion strategy to learn and predict the class labels of human activities. Figure 1 shows the pipeline of the implemented system. Essentially, the raw Wi-Fi CSI signals go through a preprocessing stage for normalization. The differently normalized data, process1 and process2, are then classified separately via a linear model based on the linear Least Squares Error (LSE) and a nonlinear model based on the SVM utilizing the Radial Basis Function (RBF) kernel. The learned scores are subsequently concatenated to form the input features for fusion learning. The fusion learning uses ANnet, or alternatively another LSE, KNN, or SVM-RBF classifier, for the final decision, which is based on the one-versus-rest technique.

Preprocessing
In view of their noisy nature, the raw Wi-Fi CSI signals are preprocessed before being fed into the learning algorithms. First, the signals are cropped according to each activity at different lengths. Since the activities have different durations, the CSI signals are resized based on linear interpolation. The resized data are then packed in matrix form for further processing prior to learning and prediction. Subsequently, low-pass filtering is performed to remove high-frequency noise. Finally, a z-score normalization is performed to remove the signal bias.
The pre-processed signals are packed as shown in (4) for a subsequent stage of learning and prediction:

H_processed = [h_1, h_2, . . . , h_K]^T ∈ R^{K×C}, (4)

where h_k ∈ R^{C} denotes the processed series of subcarrier k ∈ {1, 2, . . . , K} and C denotes the cropping size of the preprocessing. Figure 2 shows a sample of CSI raw signals and the preprocessed form before z-score normalization. The flow of the preprocessing steps is summarized in Figure 3.
As illustrated in this figure, the set of signals cropped at length1 is named process1, while the other set cropped at length2 is named process2. The signals of process1 and process2 are eventually standardized via z-score normalization in the preprocessing step.
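The crop-resize-filter-normalize chain can be sketched as follows. This is a hypothetical realization: the paper specifies only a low-pass filter with a normalized pass-band, so the fourth-order Butterworth design, the `preprocess` function name, and all data here are our illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(csi_amp, crop_len=100, passband=0.05):
    """Sketch: resize each subcarrier series to crop_len via linear
    interpolation, low-pass filter it, then z-score normalize per row."""
    K, T = csi_amp.shape
    # 1) Resize to a common length via linear interpolation.
    x_old = np.linspace(0.0, 1.0, T)
    x_new = np.linspace(0.0, 1.0, crop_len)
    resized = np.vstack([np.interp(x_new, x_old, row) for row in csi_amp])
    # 2) Low-pass filter (4th-order Butterworth, normalized pass-band;
    #    filtfilt gives zero-phase filtering).
    b, a = butter(4, passband)
    filtered = filtfilt(b, a, resized, axis=1)
    # 3) Z-score normalization to remove the signal bias.
    mu = filtered.mean(axis=1, keepdims=True)
    sd = filtered.std(axis=1, keepdims=True) + 1e-12
    return (filtered - mu) / sd

rng = np.random.default_rng(1)
raw = rng.normal(size=(52, 700))          # one activity: 52 subcarriers
process1 = preprocess(raw, crop_len=100, passband=0.05)
process2 = preprocess(raw, crop_len=50, passband=0.10)
print(process1.shape, process2.shape)     # (52, 100) (52, 50)
```

The two calls with different crop lengths and pass-bands mirror how process1 and process2 are generated from the same raw signals.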

Classification Stage
The matrix H_processed for each activity sample is flattened to form a row vector h^T ∈ R^{1×D} so that the M training samples can be stacked in matrix form as follows:

X = [h_1, h_2, . . . , h_M]^T ∈ R^{M×D}.

Correspondingly, the G target activities are encoded with a one-hot encoder as

Y = [y_1, y_2, . . . , y_M]^T ∈ {0, 1}^{M×G},

where each sample row contains a '1' at the column position corresponding to the class label. For first-level activity classification, two base classifiers, namely the LSE and the SVM utilizing the RBF kernel, have been deployed.
For training the linear prediction model Y = XW based on LSE, the learning weights W ∈ R^{D×G} can be found deterministically as

Ŵ = (X^T X + λI)^{−1} X^T Y,

with regularization λ = 0.001 and consideration of over-/under-determined systems. Subsequently, the prediction of test samples can be computed using

Ŷ_t = X_t Ŵ,

where X_t ∈ R^{N×D} denotes the test matrix. These output scores will be used in the subsequent fusion stage for final prediction. For binary classification, the SVM learning can be written as

minimize: (1/2)‖w‖² + C Σ_{i=1}^{M} ξ_i
subject to: y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , M,

where φ corresponds to the RBF feature mapping, and x_i and y_i are the ith sample's input feature vector and target value, respectively. For multiclass problems, multiple SVMs can be implemented with the one-versus-rest technique for class prediction. In our study, the output probability scores of the multicategory SVM are used as input features in the fusion stage.
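A minimal sketch of the first-level LSE classifier on synthetic stand-ins for the flattened CSI features. The regularized normal equation shown is one common realization of the LSE solution with λ-regularization; all data, dimensions, and helper names here are illustrative, and the SVM-RBF counterpart is omitted:

```python
import numpy as np

def one_hot(labels, G):
    """Encode integer class labels into an M x G one-hot target matrix Y."""
    Y = np.zeros((len(labels), G))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def lse_fit(X, Y, lam=0.001):
    """Regularized least squares: W = (X^T X + lam*I)^{-1} X^T Y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Toy data: M flattened CSI samples of dimension D, G activity classes.
rng = np.random.default_rng(2)
M, D, G = 120, 40, 7
labels = rng.integers(0, G, M)
class_means = rng.standard_normal((G, D))
X = one_hot(labels, G) @ class_means + rng.standard_normal((M, D))

W = lse_fit(X, one_hot(labels, G))    # learned weights, D x G
scores = X @ W                        # output scores, later reused for fusion
pred = scores.argmax(axis=1)          # one-versus-rest class decision
```

For an under-determined system (D > M), the equivalent dual form W = X^T (X X^T + λI)^{-1} Y would be the cheaper choice.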

Fusion Stage
We denote the prediction scores obtained from the first-level LSE and SVM as Ŷ_LSE ∈ R^{M×G} and Ŷ_SVM ∈ R^{M×G}, respectively, where M denotes the sample size and G denotes the number of activity categories. The scores for fusion can then be stacked as

F_scores = [Ŷ_LSE, Ŷ_SVM] ∈ R^{M×2G}. (11)

According to [48], a network of two layers with sufficient hidden nodes can learn well from data samples of limited size. Here, we implement a two-layer network known as ANnet [48], given by

Y = φ(φ(F_scores)W_1)W_2, (12)

to learn the stacked output scores (11) from the first-level LSE and SVM classification. In our implementation, the arctan (tan^{−1}) function has been adopted as the nonlinear transformation φ. Similar to the LSE, the learning target Y can adopt the one-hot encoding, where the weights can be learned layer-wise based on

W_1 = [φ(F_scores)]^† φ^{−1}(Y W_2^†), (13)
W_2 = [φ(φ(F_scores)W_1)]^† Y, (14)

where † denotes the Moore-Penrose inverse, implemented in Python with the stability of the inversion taken into consideration. In this learning, a random perturbation γR(k) ∈ R^{M×2G}, where R is a random matrix, has been included to spread the data. The scaling factor γ ∈ R of the perturbation and the random perturbation seed k ∈ N can be considered as hyperparameters to be determined via cross-validation on the training set. For prediction using unseen F_testscores, the learned network weights can be substituted into (12) to obtain the estimated scores:

Ŷ_test = φ(φ(F_testscores)W_1)W_2.

The one-versus-rest technique can be applied to obtain the class label prediction.
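The layer-wise, pseudo-inverse-based learning can be sketched in NumPy. This is a reconstruction under assumptions, not the exact procedure of [48]: the provisional-W2 backward-target step, the clipping used to keep the back-propagated target inside arctan's range, and the `annet_fit`/`annet_predict` names are all illustrative, and the toy score matrices stand in for real first-level LSE/SVM outputs:

```python
import numpy as np

phi = np.arctan          # nonlinear transformation used in the sketch

def annet_fit(F, Y, gamma=0.01, seed=0):
    """Sketch of analytic two-layer learning for Y = phi(phi(F) W1) W2:
    both weight matrices are solved with Moore-Penrose pseudo-inverses,
    so no iterative gradient search is required."""
    rng = np.random.default_rng(seed)
    Ft = F + gamma * rng.standard_normal(F.shape)     # perturbation gamma*R(k)
    W2 = rng.standard_normal((F.shape[1], Y.shape[1]))  # provisional W2
    # Back-propagate the target through the output layer and through phi
    # (clip so the target stays inside arctan's open range (-pi/2, pi/2)):
    T = np.clip(Y @ np.linalg.pinv(W2), -1.5, 1.5)
    W1 = np.linalg.pinv(phi(Ft)) @ np.tan(T)          # hidden-layer weights
    # Re-solve the output layer against the actual hidden activations:
    W2 = np.linalg.pinv(phi(phi(Ft) @ W1)) @ Y
    return W1, W2

def annet_predict(F, W1, W2):
    return phi(phi(F) @ W1) @ W2

# Toy fusion demo: two first-level score blocks (stand-ins for the LSE and
# SVM probability scores), concatenated as F_scores in R^{M x 2G}.
rng = np.random.default_rng(3)
M, G = 120, 7
labels = rng.integers(0, G, M)
Y = np.eye(G)[labels]                                 # one-hot targets
F_scores = np.hstack([Y + 0.1 * rng.standard_normal((M, G)),
                      Y + 0.1 * rng.standard_normal((M, G))])

W1, W2 = annet_fit(F_scores, Y)
pred = annet_predict(F_scores, W1, W2).argmax(axis=1)  # one-versus-rest
```

`numpy.linalg.pinv` handles the stability of the inversion via singular-value truncation, which is why it is a natural fit for this gradient-free scheme.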

Database
Three datasets have been utilized for our experimentation. The first CSI dataset, which we called HAR-RP, has been obtained from [39]. This dataset contains seven different activities, namely RUN, WALK, FALL, BEND, SIT DOWN, STAND UP, and LIE DOWN, performed by three volunteers in an indoor environment. In total, the dataset consists of 420 samples of CSI signals with 60 samples for each activity. These time-sequence data are composed of 52 subcarriers, where the period of each activity determines the data length, ranging from 600 to 1100 sampling measurements. The extracted CSI signals consist of three different types of subcarriers, namely the null subcarriers, the pilot subcarriers, and the data subcarriers. Among these, only the data subcarriers contain information crucial to human activities. Therefore, following [39], the pilot and null subcarriers are not utilized in this experimentation.
The second CSI dataset [49], which we called HAR-RT, has been collected with an Asus RT-AC86U at a channel bandwidth of 80 MHz. HAR-RT consists of six different activities, namely SIT, STAND, SIT-DOWN, STAND-UP, WALK, and FALL. This indoor CSI information was collected from different spots in the room at different frequency bandwidths in order to avoid possible location and bandwidth dependency. Table 1 shows that the HAR-RT dataset has a total of 1084 samples for six activities with 256 subcarriers, where these time-sequence data have been normalized.
The third CSI dataset, which we called HAR-ARIL, has been obtained from [50]. This dataset contains six distinct hand activities, namely, hand up, hand down, hand left, hand right, hand circle, and hand cross. The 1440 data samples have been collected based on 15 samples from each activity at 16 different locations by 15 individuals. To ensure data quality, the dataset was manually curated to include 1394 samples. The HAR-ARIL data are normalized raw CSI amplitude data with a 52-dimensional vector [50].

Experimental Setup
We conducted three experiments to analyse the fusion system for HAR, as shown in Table 2. In experiment I, various preprocessing parameters were tested with three learning classifiers on the HAR-RP, HAR-RT, and HAR-ARIL datasets in terms of the signal cropping size and the normalized pass-band of the low-pass filter. In experiment II, various fusion combinations were evaluated on two differently preprocessed versions of the data (called process1 and process2) of HAR-RP, HAR-RT, and HAR-ARIL, with and without data transformation or normalization. In experiment III, a comparison between the proposed fusion combination and state-of-the-art (SOTA) methods from [39,49,50] was carried out to observe the accuracy standing of activity recognition.

HAR-RP, HAR-RT and HAR-ARIL
For the HAR-RP database [39], we randomly selected 336 samples to form the training set and used the remaining 84 samples as test samples, following the 80/20 ratio of a five-fold partitioning. For the HAR-RT database [49], 867 samples were selected to form the training set and the remaining 217 samples were used as the test set. For the HAR-ARIL database, 1116 samples were used for training and the remaining 278 samples for testing. By permuting the above 80/20 ratios, a five-fold cross-validation was employed to evaluate the testing accuracy for all three datasets. All experiments were conducted on a PC equipped with an i9 processor at 3.7 GHz with 32 GB of RAM.
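The 80/20 five-fold partitioning described above can be sketched as follows; `five_fold_indices` is an illustrative helper, not a function from the paper:

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Permute sample indices, then rotate an 80/20 train/test split
    across five folds (a sketch of the partitioning described above)."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test

# HAR-RP has 420 samples, giving 336 training and 84 test samples per fold.
sizes = [(len(tr), len(te)) for tr, te in five_fold_indices(420)]
print(sizes[0])  # (336, 84)
```

Averaging the test accuracy over the five rotations yields the reported cross-validated accuracy.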
In experiment I, the linear LSE method, the SVM-RBF, and the KNN were utilized as learning classifiers to determine the best combination of the signal cropping size and the normalized pass-band of the low-pass filter. This experiment was conducted in two steps. In step 1, various cropping sizes and normalized pass-bands of the low-pass filter were applied separately to preprocess the CSI signals in order to determine their desired operating ranges. In step 2, the top two combinations of the cropping size and the normalized pass-band were determined based on the recognition accuracy. These two combinations with the top accuracies were named process1 and process2, respectively.
In experiment II, a score-level fusion of the above results was performed using either ANnet, or another LSE classifier, SVM-RBF classifier, or KNN classifier. This experiment was also conducted in two steps. In step 1, process1 and process2 (the differently preprocessed data described above) from experiment I were trained with LSE and SVM-RBF individually in order to generate scores with and without feature transformation/normalization. In step 2, the corresponding scores from the different classifier settings were concatenated to form a new set of features for score-level fusion using another LSE, SVM-RBF, KNN, or ANnet that adopted the training strategy given by (13)-(14).
In experiment III, the proposed score-level fusion was compared with the SOTA methods in [39,49,50]. The experiment for our fusion was conducted according to the training and test settings in [39,49,50], where the accuracy of activity recognition and the processing times were recorded. As for the SOTA methods on the HAR-RP dataset, Moshiri et al. [39] converted the CSI data into a 2D array to form a pseudo-color image and then utilized a 2D CNN for learning and recognition. Data pre-processing was not applied to the CSI amplitudes, since they believed that any extra filtering could result in losing essential information and affect the classification performance. Moreover, the CNN was recommended instead of the LSTM to overcome high computational complexity and long training time. For the HAR-RT dataset, Schäfer et al. [49] proposed using normalized raw CSI data directly in an LSTM network with 100 hidden nodes for HAR. The dataset was randomly divided into 80% for training and 20% for testing. The input dimension of the network was equal to the number of subcarriers, and the dropout rate of the LSTM was set at 40%. As for the HAR-ARIL dataset, the adopted classical machine learning baseline SOTA were the DTW + KNN and SVM-RBF classifiers [50].

Experiment I(a) to Observe the Effect of Cropping Size and Pass-Band on the LSE Classifier
Table 3 shows the impact of the cropping size and the normalized pass-band on the accuracy of LSE for the HAR-RP, HAR-RT and HAR-ARIL databases. The cropping size refers to the number of effective points of the time-series data, and the normalized pass-band refers to the bandwidth of the low-pass filter. For the HAR-RP dataset, the result shows that a small cropping size leads to a high accuracy at an intermediate range of normalized pass-bands. However, when the normalized pass-band falls below a certain range, the accuracy of LSE drops significantly. For example, at a normalized pass-band of 0.05, the accuracy with cropping size 50 is 40.4%.
The result also shows that a combination of cropping size 100 and normalized pass-band 0.05 gives the highest accuracy of 73.6% among all the tested combinations, closely followed by the second-best combination with an accuracy of 72.8%. For the HAR-RT database, the result shows that the accuracy increases as the pass-band value increases. For example, when the cropping size is 50, the accuracy scores at pass-band values of 0.1, 0.5, 0.8 and 1.0 are 30.4%, 45.1%, 48.8%, and 71.4%, respectively. The result also shows that the top two accuracy scores are 71.4% and 66.4%, respectively, at cropping sizes 50 and 100. A similar trend in terms of the cropping size and the pass-band is observed for the HAR-ARIL database.

Experiment I(b) to Observe the Effect of Cropping Size and Pass-Band on the SVM-RBF Classifier
As seen from Table 4, the accuracy of SVM-RBF is generally higher than that of LSE in Experiment I(a). For the HAR-RP database, it can be observed that the accuracy tends to be slightly higher when the cropping size is small. For example, at a cropping size of 50, the accuracy is 96.3% for pass-band 0.1, which is the highest accuracy in the table. The second-highest accuracy of 95.6% is achieved at a cropping size of 100 and a pass-band of 0.05. This suggests that with a moderate cropping size and pass-band, the SVM-RBF is able to achieve a high level of accuracy. For the HAR-RT database, a similar cropping size pattern is observed at pass-band 1.0, where the top two accuracies of 95.4% and 92.6% are obtained. These values are significantly higher than all other values in the table, which are around 80%. For the HAR-ARIL database, the two best-performing results are 72.5% at pass-band 0.3 with cropping size 50 and 73.5% at pass-band 0.5 with cropping size 100. It is clear that the accuracy is affected by both the cropping size and the pass-band. The patterns of the combination of cropping size and pass-band do show certain common trends between LSE and SVM-RBF. Table 5 shows the chosen parameters, which correspond to the top two accuracies for each of the HAR-RP and HAR-RT datasets.

Experiment I(c) to Observe the Effect of Cropping Size and Pass-Band on the KNN Classifier
According to Table 6, KNN outperforms LSE in terms of accuracy in Experiment I(a), but it is not as accurate as SVM in Experiment I(b). When examining the HAR-RP database, it is clear that accuracy tends to increase slightly with cropping sizes of 50 and 100. For example, the highest accuracy of 92.6% is achieved with a cropping size of 50 and a pass-band of 0.1. The second-highest accuracy of 92.3% is achieved with a cropping size of 100 and a pass-band of 0.05. A similar pattern is observed in the HAR-RT database, where the two highest accuracies of 82.3% and 76.7% are achieved with a pass-band of 1.0 and cropping sizes of 50 and 100, respectively.
In the HAR-ARIL database, although performance remains relatively consistent across different cropping sizes and pass-band values, the two highest accuracies are also observed with cropping sizes of 50 and 100. This suggests that KNN can achieve high levels of accuracy with smaller cropping sizes and moderate pass-band values.

The scores of the first-level LSE and SVM-RBF on the two differently preprocessed data from experiment I are next utilized for fusion under several settings utilizing ANnet, KNN, and another set of LSE and SVM-RBF classifiers. In other words, the LSE and SVM-RBF methods are implemented individually on each preprocessed dataset (process1 and process2) before fusion, with and without transformation/normalization, to observe whether the scores from each method are distinguishable. Subsequently, the two individual score sets are fused to form a new set of features for the final classification decision using LSE, SVM-RBF, KNN, or ANnet. Table 7 shows that applying the score-level fusion method with transformation/normalization increases the accuracy of the algorithm in most cases. For the HAR-RP database, the highest accuracy of 97.6% is achieved by applying ANnet and SVM-RBF on the concatenated first-level LSE score of process1 and SVM-RBF score of process2 with transformation. For the HAR-RT database, the highest accuracy of 96.4% is achieved by applying SVM-RBF on the concatenated first-level SVM-RBF scores of process1 and process2 with transformation/normalization.

Figure 4 shows the training and test accuracies of the SOTA methods, namely the 2D CNN, the 1D CNN, and the BiLSTM, against the proposed ANnet fusion on the HAR-RP database. The results show significant over-fitting of the SOTA methods compared with that of ANnet. Figure 5 shows the training and test accuracies of the compared algorithms, namely the LSTM and the proposed ANnet fusion, on the HAR-RT database. The results show significant over-fitting of the LSTM compared with ANnet.
Figure 6 shows the training and test accuracies of the compared algorithms, namely the DTW + KNN, the SVM-RBF, and the proposed ANnet fusion, on the HAR-ARIL database. The results show comparable test accuracies of the proposed ANnet with the classical DTW + KNN method. In terms of the training processing time, our fusion benefits from low computational complexity compared with the deep learning methods in Table 8, since our model is a combination of LSE and SVM-RBF with analytic learning.

Summary of Results and Observations
• Expt I: This experiment reveals that the preprocessing steps of selecting the cropping size and the normalized pass-band have a significant impact on the recognition accuracy. In particular, each database shows its best accuracy at a different combination of settings. For example, the HAR-RP and HAR-ARIL datasets show that a small cropping size leads to a high accuracy at an intermediate range of normalized pass-bands. For the HAR-RT database, the accuracy increases as the pass-band value increases.
• Expt II: This experiment shows that fusion using SVM-RBF and ANnet outperforms the LSE and KNN in general. Moreover, many of their fused results show an improved accuracy compared with that before fusion.
• Expt III: This experiment shows that the proposed fusion has either comparable or better accuracy than that of the SOTA. In particular, the SOTA methods show significant over-fitting in view of their higher model complexity compared with the proposed fusion method. In other words, the proposed fusion method has capitalized on low model complexity with sufficient mapping capability to generalize the prediction.

Conclusions
In response to the relatively poor generalization of complex network models on small data sizes, a fusion method with simpler model complexity has been proposed for human activity recognition. This method involves fusing the scores of two first-level classifiers using an analytic network to make the final decision. Experiments have shown not only that this fusion improves recognition accuracy but also that preprocessing of Wi-Fi signals plays a critical role in achieving good baseline recognition accuracy. In particular, varying the cropping size contributes to signal diversity for fusion gain. Additionally, linear LSE and SVM-RBF have been shown to contain complementary information for fusion gain. The proposed simple fusion structure has demonstrated good generalization compared with classical machine learning state-of-the-art methods for the tested datasets.