Wearable Sensor-Based Human Activity Recognition Method with Multi-Features Extracted from Hilbert-Huang Transform

Wearable sensors-based human activity recognition introduces many useful applications and services in health care, rehabilitation training, elderly monitoring and many other areas of human interaction. Existing works in this field mainly focus on recognizing activities by using traditional features extracted from Fourier transform (FT) or wavelet transform (WT). However, these signal processing approaches are suitable for a linear signal but not for a nonlinear signal. In this paper, we investigate the characteristics of the Hilbert-Huang transform (HHT) for dealing with activity data with properties such as nonlinearity and non-stationarity. A multi-features extraction method based on HHT is then proposed to improve the effect of activity recognition. The extracted multi-features include instantaneous amplitude (IA) and instantaneous frequency (IF) by means of empirical mode decomposition (EMD), as well as instantaneous energy density (IE) and marginal spectrum (MS) derived from Hilbert spectral analysis. Experimental studies are performed to verify the proposed approach by using the PAMAP2 dataset from the University of California, Irvine for wearable sensors-based activity recognition. Moreover, the effect of combining multi-features vs. a single-feature are investigated and discussed in the scenario of a dependent subject. The experimental results show that multi-features combination can further improve the performance measures. Finally, we test the effect of multi-features combination in the scenario of an independent subject. Our experimental results show that we achieve four performance indexes: recall, precision, F-measure, and accuracy to 0.9337, 0.9417, 0.9353, and 0.9377 respectively, which are all better than the achievements of related works.


Introduction
In recent years, services provided by computing devices have shifted from proprietary computing services to flexible services focusing on human need. Different kinds of small devices with computing power and communications capabilities have grown in popularity. Computer systems are closely linked to human users by natural interaction methods. Thus, how to recognize human activities is an important part of supporting technology for these computer systems.
as input signals and utilizing the back propagation (BP) neural network as a classifier. The effect of combining multi-features vs. a single-feature are investigated and discussed in the scenario of a dependent subject. The experimental results show that combined multi-features instantaneous energy density, marginal spectrum, instantaneous amplitude, and instantaneous frequency (IE-MS-IA-IF) can further improve the performance. Finally, we test the effect of multi-features combination in the scenario of an independent subject. The experimental results show that we achieve four performance indexes: recall, precision, F-measure, and accuracy to 0.9337, 0.9417, 0.9353, and 0.9377 respectively, which are all better than related works tested on the same dataset.
The rest of this paper is structured as follows. In Section 2, an extensive review of feature extraction and classification algorithm is presented. Section 3 introduces the public dataset and data preprocessing method we use in this study, and Section 4 describes the activity recognition methodology in detail and analyzes the degree of stationarity (DS) for activity data. In Section 5, the classification methods and parameters setting of BP neural network are introduced. Next, we present the results obtained from experiments and compare the performance measures with relevant work in Section 6. Then, the limitation and future work are discussed in Section 7. Finally, the paper is summarized in Section 8.

Related Work
The algorithm for the wearable sensors activity recognition system contains two parts, feature extraction and activity classification. In this section, we will introduce some common methods for these two parts.

Feature Extraction
In the time-domain, a signal's characteristics contain the statistical properties of the data collected from the tri-axial accelerometers such as mean value, mean square root, and variance. The mean value and the mean square root are the expression of the signal's magnitude, and the other statistical properties reflect the volatility and dispersion of the data. Some time ago, Li et al. proposed a shapelet-based model which can effectively recognize complex activities [12] in the time domain.
The space-domain characteristics describe correlation coefficient for different sensors on different body parts. Frequency domain characteristics include the entropy and the energy of acceleration data. They are generally used as a measurement of the periodicity of a piece of data. Some of the existing works perform activity recognition using frequency characteristic from short-time Fourier transform (STFT) [13], Wigner-Ville distribution(WVD), and WT [14,15].
HHT is a novel signal analysis theory proposed by Huang in the 1990s [16]. The main innovative point of HHT is that it proposes the idea of intrinsic mode function (IMF) and introduces the EMD method. HHT decomposes raw data to several IMFs by using the EMD method and obtains IF by using the Hilbert transform (HT) on each IMF. At that point, signals are finally expressed as energy distribution in the time-frequency domain, this is called the Hilbert spectrum. We can also further get the MS of the signal by the method proposed in [16]. HHT is designed to work for data that is non-stationary and nonlinear [17], such as the propagation of seismic waves [18,19] and detection of earthquake's hypocenter [20,21].
In addition, HHT is also used for analysis of other nonlinear systems such as atomic decomposition [22], biomedicine [23], earthquake engineering [24], structural damage monitoring [25], radar data processing [26], mechanical equipment fault diagnosis [27], and diffusion of contaminants [28]. For activity recognition, Wang et al. [10], and Liao et al. [11] introduced EMD to process accelerometer signals to extract features such as IA and IF. Ayachi et al. [9] developed algorithms for automatic detection and segmentation of activities of daily living (ADL) tasks, by using a combination of EMD and other transform approaches. However, to the best of our knowledge, for activity recognition, little work discusses and evaluates the effects of these combined features to classification. Moreover, although it is one of the major components, Hilbert spectral analysis is yet to be introduced to feature extraction in the field of activity recognition.

Classifier for Activity Recognition
Some classification algorithms in human activity recognition are commonly used at present, such as naive Bayes, decision tree, support vector machine (SVM), and K-nearest neighbor (KNN). This subsection will explain these algorithms briefly.
Naive Bayes is a simple classifier which is widely used in human activity recognition [29][30][31][32]. In naive Bayes, a posteriori probability is obtained from priori probability according to new input. The advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification. Meanwhile, the disadvantage is that we need to know a priori probability in advance.
Decision tree is also a widely adopted algorithm in this field [29,30]. The strength of it is that the decision-making process is intuitive and well-understood [33]. Yet, when many values are uncertain or outcomes are linked, the calculation can be very complex, and even lead to overfitting.
KNN is a non-parametric method used for classification and regressions [12]. The advantage of KNN is that it is simple to implement and far less expensive to retrain, but it leads to a large amount of computations when there are a great many training samples [34].
SVM is a kind of pattern recognition method which is based on statistical theory [35]. SVM has the advantage of high classification accuracy and does not need a large number of samples to input, thus, SVM is widely used in activity recognition [36,37]. However, a potential drawback of the SVM is that the input data require full labeling of input data.
Compared with the above algorithms, BP neural networks have strong fault-tolerance, robustness, and self-learning ability. So, we choose the BP neural network as the classifier in this paper.

Dataset and Signal Preprocessing
This section presents the dataset and signal preprocessing method we use in our paper. In order to contrast with existing work, the public dataset called PAMAP2 is used. It was contributed by the University of California, Irvine [38]. It was donated on 6 August 2012.

Hardware Setup
The dataset PAMAP2 uses three inertial measurement units (IMUs) and one heart rate monitor as sensors during the data collection. Each IMU contains two 3-aixs micro-electro-mechanical system (MEMS) accelerometers, one scale is ±16 g and the other is ±6 g, one ±1500 • /s 3-axis MEMS gyroscope, and one 3-axis magneto-resistive magnetic sensor whose scale is ±400 µT. All IMUs were sampled at 100 Hz, and the heart rate monitor was sampled at 9 Hz [39]. One IMU and heart rate monitor was on the chest, one IMU over the wrist on the dominant arm, and the other IMU on the dominant side's ankle.
The previous work [40] shows that three sensors can achieve a great balance between recognition accuracy and numbers of sensors. In wearable sensors systems, as the number of sensors increases, the average recognition accuracy increases, but this is neither comfortable nor practical for users. Nevertheless, previous work showed that one or two sensors are not enough for real-time activity recognition [41]. Thus, in this paper, we use the accelerometer data from three sensors.

Subject and Data Format
There were eight males and one female who participated in the data collection. Their age ranged from 23.91 to 30.53 years, body mass index (BMI) ranged from 22.49 kg·m −2 to 27.73 kg·m −2 , one subject was left-handed and the other eight were right-handed. In accordance with the regulation, they all collected over 10 h of data separately.
The dataset PAMAP2 contains 12 activities (defined in Table 1), including basic activities (running, walking, cycling, and nordic walking), postures (lying, sitting, and standing), everyday work (ascending stairs and descending stairs), household (ironing and vacuum cleaning), and motor movements (rope jumping). Each subject had to follow this protocol. Beyond that, if the subject had additional free time, they were allowed to collect six optional activities to enrich the range of the dataset, the activities were watching TV, computer work, car driving, folding laundry, house cleaning, and playing soccer.

Data Processing Flow
This section provides a brief introduction to the data processing flow in this paper. The process diagram is shown in Figure 1. The data processing flow, including data preprocessing, data segmentation, feature extraction, and classification.

Data Preprocessing
The real-time data received was saved as text-files, 54 columns in each text-file. Each IMU's data contain the columns as laid out in Table 2 and the meaning of each column is shown in Table 3.  Table 3. The meaning of each column in IMU data.

Number Meaning
3D-acceleration data (ms −2 ), scale: ±16 g, resolution: 13-bit 5-7 3D-acceleration data (ms −2 ), scale: ±6 g, resolution: 13-bit 8-10 3D-gyroscope data (rads −1 ) 11-13 3D-magnetometer data (µT) [14][15][16][17] orientation (invalid in this data collection) Data labeled with activity ID = 0 mainly covers transient activities between performing different activities. We discard them first. At the same time, to ensure the accuracy of each activity, we discard the first 10 s of the data. As we can see in Table 3, columns 14-17 are invalid in this data collection. Then, to avoid repeating the calculation and according to the recommendation of the dataset, we choose to use ±16 g 3D-accelerometer data. That is to say, we eliminated the data in columns 5-7 and 14-17.

Data Segmentation
Wearable activity recognition systems are composed of a set of sensors worn on the human body, and the sensors deliver a data stream every second. To accurately recognize each period of activity, we should deliver the data stream to each activity period every time. Since we use neural networks as a classifier in this work, we also need to consider the number of activity periods we delivered each time.
In this section, we present the data segment method we use in our paper-sliding window approach.
The sliding window approach is one of the most popularly used segmentation methods, and is based on the thinking that the real-time data is more valuable than the stale data [42]. So, in the field of activity recognition, the sliding window approach has been widely employed in previous works [43,44]. However, in the segmentation process, how to choose a window size is one of the key problems. Decreasing the window size allows a faster activity detection, and reduces resources and energy needs, however, it may not contain one complete activity cycle. On the contrary, large data windows are normally considered for the recognition of complex activities but will result in the increment of computational cost. A range of window sizes from 0.08 s [45] to 6.7 s [30], or even 30 s [44] have been used in previous studies. The sliding window approach has been proven to be especially beneficial for the recognition of periodic signal. However, different activities have different periods. It is hard to determine the best window size for one dataset.
In this paper, we use the dataset from Reiss and Stricker [39]. For comparision with their work, we use a similar preprocessing method. The preprocessed data is segmented using a sliding window with 5.12 s of window size, shifted by 1 s between consecutive windows.

The Feature Extraction Method Based on HHT
It is known to us that only when processed data has stationarity, linearity, and periodicity, can those data be converted to frequency domain by FT. Then, the frequency characteristic can be obtained. The data processed by WT can be non-stationary, but it is required to be linear. However, most of the data we meet in real life is nonlinear. What is worse, only a few of them can be approximately seen as linear data. So, both WT and FT have their own limitations. To overcome the above limitation, Huang proposed HHT [16] to process a nonlinear and non-stationary signal, which has been proven to be effective in practical engineering. Besides, HHT presents the results of feature extraction in time-frequency-energy space and the results contain physical meaning, which makes it easier for us to understand. In this section, we will briefly introduce the HHT algorithm and analyze the stationarity of the activity data.

A Brief Introduction to HHT
The steps of HHT algorithm are shown as follows and the detailed introduction to each step is described in the corresponding subsection.
Step 1 EMD. The EMD method is needed to decompose the raw signal into several IMFs to which the HT can be applied. Then the raw signal can be denoted by the sum of IMFs and residue.
Step 2 HT. Using the HT extend the real signal IMFs into the complex plane. Next, the IA and IF can be derived based on the result of HT.
Step 3 Hilbert spectral analysis. The Hilbert spectrum can be deduced according to the relationship among IF, IA and time. The final result of this step is to show the frequency-time distribution of signal amplitude and we can analyze the localized features of signal from it.
Step 4 The derivation of IE, MS, and DS. Based on the Hilbert spectrum, we can get the results of IE, MS, and DS. Furthermore, we can analyze the characteristics and stationarity of different activity data.

EMD
According to [16], if we directly use HT to convert activity data, the obtained IF may be negative. That is to say, the IF will lose physical meaning. To avoid this situation, HHT decomposes raw data to several IMFs by using the EMD method. IMF needs to have two characteristics:

•
In the whole dataset, the number of extrema and the number of zero-crossings must be equal or differ by no more than one. • At any time points, the average value of the envelope defined by the local maxima and the local minima equals zero.
In order to give a intuitive impression first, the EMD process is shown in Figure 2. Then we will give a detailed interpretation of this algorithm as follows.  We identify all extrema of a signal, then use cubic spline interpolation to connect all the local maxima and repeat this method to process all the local minima. The constructed upper and lower envelops will cover all the data between them. The mean of these two envelops is calculated and denoted as m 1 . Besides, a new component h 1 can be obtained from the difference between the raw data and m 1 , denoted by However, in reality, h 1 is always not an IMF because of the overshoots and undershoots, which can generate new extrema and influence this process. Thus, in order to get a better result, the sifting process will be repeated several times. In the second sifting process, the result of the first sifting process h 1 serves as the raw data, then we repeat above steps until the result can be regarded as an IMF. The first IMF component can be designated as where c 1 is the first IMF component and the k means the sifting times. Moreover, we can eliminate c 1 from the raw data and designate the rest of the data as Then we regard the r 1 as the raw data and repeat using the above sifting process. This mentioned process is iterative and IMFs c 2 , . . . , c n can be obtained. The whole algorithm will stop until the residue is too small to continue or it becomes a monotonic function from which no more IMF can be extracted. Finally, the raw data X(t) can be denoted by where the X(t) is the raw signal, c i is the ith IMF and r n is the residue. The EMD algorithm (Algorithm 1) also can be expressed as follows.

Algorithm 1 EMD algorithm
Input: The raw data X(t), i = 1 Output: The IMFs c 1 , . . . , c n , and the residue r n 1: Subfunction for IMF component 2: loop 3: Using cubic spline interpolation to respectively connect all the local maxima and minima. 4: Calculate the mean of these two envelops and denote it as m.

HT
The HT of a signal is the same as the convolution of a signal with the function h(t) = 1 πt . In order to get the instantaneous features of activity data, we use the HT to convert each IMF component: where x(t) indicates the IMF that will be converted, y(t) is the result after transform. Next, z(t) can be derived on this basis.
Moreover, the IA a(t) and instantaneous phase θ(t) about z(t) can be obtained.
The IF ω(t) is defined as the derivative of instantaneous phase θ(t).

Hilbert Spectral Analysis
The Hilbert spectrum H(ω, t) can be deduced according to the relationship among IF ω(t), IA a(t) and time. For example, this paper plots Hilbert spectrums about running and cycling in Figures 3 and 4. These activity data all come from the x-axis of the accelerometer on the wrist.  Figure 3. The Hilbert-Huang spectrum for the running data with 100 Hz frequency. The activity energy appears in skeleton lines representing each intrinsic mode function (IMF). The Hilbert-Huang spectrum was filtered by a 9 × 9 weighted Gaussian filter. The Hilbert spectrum in the colour map format for the activity data is given in Figures 3 and 4. The energy of each IMF was plotted as a skeleton line. In order to give a intuitive feeling on IE, the Hilbert spectrum in the smoothed format is obtained by using 9 × 9 weighted Gaussian filter. As we can see from the above figure, the energy of the activity running is concentrated mainly in the 0-5 Hz frequency region while the energy of activity cycling is dispersedly distribute in 0-25 Hz frequency region. This can reflect the exercise intensity of these activities. At the same time, some features about energy can be deduced.

IE and MS
Based on the Hilbert spectrum H(ω, t), we can get definitions about MS and IE. Then, we denote them as h(ω) and IE(t) respectively.
According to the above analysis, we can use HHT to convert each piece of data and get their IE. These activity data all come from the z-axis of the accelerometer on the wrist. The IE of activity 1 to activity 6 is shown in Figure 5.
As we can see in Figure 5, in some time periods, the numerical value and variation range of IE differ from activity to activity. On one hand, from the view of numerical value, the IE of lying is around 0.2 when the IE of cycling is around 80. On the other hand, from the view of variation range, the peak-to-peak value of lying only 0.1. If we plot lying and cycling in one figure, we will find that the numerical value of lying is nearly a constant. On the contrary, the peak-to-peak value of cycling can be more than 100. That is to say, cycling is more drastic than lying, which conforms to our intuitive feeling. Meanwhile, it also conforms to Huang's theory that physical quantities such as IF, MS, and IE gained by HHT have practical meanings. So, we think HHT is more suitable for activity recognition.
However, in some time periods, the IE of two activities can be similar in numerical value and variation range. In this paper, we use the mean and the variance of IE as eigenvalues. The trend of signals' amplitude may be different, but the mean and variance of IE can be same. So, if we only use one sensor to recognize activity, the judge result may be wrong. The comparison is shown in Figure 6. In the athletic process, the exercise intensity will decrease when the participant feels tired. This will cause the feature of the activity to be similar to the features of other activities, which will affect the results of machine judgement. However, different body parts have different reactions during movement. The reaction of the same body part in different activities is similar, but we can solve the problem by adding sensors. Thus, this paper analyzes activity data from 3D accelerometers on the chest, wrist, and ankle. For an example, we consider IE as a feature, according to the above analysis, we can get the mean and the variance of IE to classify activities. Then, in total, we can get 18 eigenvalues from three sensors (each sensor has 6 eigenvalues, the mean and the variance of IE in x, y, z axis).
In addition to IE, there are many other features, such as IF, IA, and MS. We introduce them briefly. Firstly, MS represents the value of energy in each frequency. As shown in Figure 7, the abscissa axis represents frequency and the vertical axis represents energy intensity. As shown in Figure 7, the variation tendency of these curves is similar. The energy in low frequency is much larger than that in high frequency, which conforms to our intuitive feeling that these activities are low-frequency-actions. It is impossible for people to perform these actions hundreds of times per second. Thus, in Figure 7, the energy is nearly equal to zero in high frequency.
The IF and the IA can greatly represent the instantaneous features of a signal. However, for one signal, we can get hundreds of (and even thousands of) eigenvalues about the IF and the IA. It will be difficult for the neural network to train such a large classification network. At the same time, it will need to spend lots of time finishing this work. Therefore, we also consider to use the mean and variance of them in our classification.

DS
This paper has mentioned several signal processing methods, including FT, STFT, and FFT which all require that the processed signal be stationary. However, this requirement is often ignored by researchers for simplification. Meanwhile, it is extremely difficult to meet the original definition of stationary signal. If we use that definition to judge a signal, nearly all signals will be considered as non-stationary signals.
So, Huang proposes a different definition for stationarity based on HHT [16]: Among them, n(ω) is the mean of the MS, DS(ω) is the DS of a signal. If a signal is stationary, the Hilbert spectrum H(ω, t) will not be a function of time, and the DS(ω) will be zero. On the contrary, the higher the index value DS(ω) is, the more non-stationary the signal is. We analyze the data coming from the accelerometer located on the wrist as an example.
It can be seen from Figure 8 that the cycling data is highly non-stationary. The DS will increase when the frequency increases. Besides, the data of standing is the most non-stationary when the frequency is between 50 Hz to 60 Hz. Then, the overall DS of standing is also poor. Because of these obvious differences among different activities and activity data's high non-stationarity, we determine that HHT is suitable for activity recognition.

Activity Classification
The above extracted features will be used as the inputs to this part, that is, the classification. Artificial neural networks are important methods of classification in the field of activity recognition, as proven by the work in [46][47][48]. Thus, in our work, the BP neural network is employed as the classifier to validate the extracted features.

Artificial Neural Network
A BP neural network is a multi-layers feed-forward neural network. According to statistics, 80% to 90% of the neural network models use the BP network or its extensions. Meanwhile, they are often used as a classifier in the field of activity recognition [49].
BP neural networks were proposed by Rumelhart and Mcclelland in 1986 [50]. From the aspect of structure, it is a typical multi-layers forward neural network with one input layer, several hidden layers, and an output layer. The full connection mode is adopted between the layers and a mutual connection does not exist between neurons in the same layer. The main characteristics of the BP neural networks are signal forward transmission, error backward propagation, and supervised learning. Moreover, it has been theoretically proven that a three-layer network with one hidden layer can approximate any nonlinear function [51]. A typical three-layer BP neural network is shown in Figure 9.

The Learning Algorithm Based on a BP Neural Network [52]
As shown in Figure 9 above, the structure of a three-layer BP neural network is demonstrated. The specific implementation steps of this supervised algorithm are given. The situation of more than one hidden layer can be derived based on this algorithm. Hereby, the BP neural network algorithm is shown as follows.
Step 1 Network initialization. The initial parameters of network should be set. For example, setting the weights of the input layer to the hidden layer, the hidden layer to the output layer as any small random number, and setting the initial threshold value.
Step 2 Training samples collection. According to the rules of supervised learning, both the input vector X = (x 1 , x 2 , · · · , x m ) and the corresponding output vector D = (d 1 , d 2 , · · · , d l ) should be provided.
Step 3 Output calculation. Calculating the output from the input layer to the hidden layer and the output layer.
For each neuron in the input layer: the input is x i , the output is O i = x i (i = 1, 2, · · · , m), i is the total number of neurons in the input layer.
For each neuron in the hidden layer: the input is , j is the total number of neurons in the hidden layer.
For each neuron in the output layer: the input is x k = ∑ m j=1 w jk O j − θ k , the output is y k = g(x k )(k = 1, 2, · · · n), k is the total number of neurons in the output layer.
In the above equations, w ij is the weight from the ith input layer to the jth hidden layer. w jk is the weight from the jth hidden layer to the kth output layer. f (x) and g(k) are transfer functions. There are many kinds of transfer functions which can be selected.
Step 4 Weight adjustment. According to the error, the BP neural network adjusts the weights from the output layer nodes to the hidden layer nodes and the weights from the hidden layer nodes to the input layer nodes. The network error function is defined as In Equation (15), −∇E(t) represents inverse direction of the gradient of the error function in the Tth training.
For the weight w jk (t + 1) from the output layer to the hidden layer, the adjusting formula is In Equation (16), η is the learning rate, η > 0, d is the desired output. For the weight w ij (t + 1) from the hidden layer to the input layer, the adjusting formula is In Equation (17), δ k is the error in the output node k.
Step 5 Iterative loop. Return to Step 2 and iteratively operate until the error meets the requirements.

Parameters Setting
The feature vector will be used as input to the neural network. In this paper, the activity is classified according to which types of activity are taking place, the corresponding element will be set as 1, and the remaining elements will be zero. For instance, suppose there are five kinds of activities in total. If an unknown activity is classified as the second kind of activity, then the output value will be [0, 1, 0, 0, 0]. Therefore, the number of neurons in the output layer and the number of activity types will remain the same. Besides, the number of neurons in the hidden layer can make a great difference to the performance of the BP neural network. If the number of neurons is too few, the network cannot fully describe the relationship between input and output. On the contrary, if the number of neurons is too large, it will lead to the network needing to take a longer time to learn. What is worse, the classification results may turn out poorly. So, it is important to choose a suitable parameter as the number of neurons in the hidden layer. For a three-layer neural network, there are some common empirical formulas. The empirical formula selected in this paper is as follows: where m represents the number of neurons in the hidden layer, x represents the number of neurons in the input layer, and y represents the number of neurons in the output layer. The number obtained by this formula is usually not an integer, therefore, the number of neurons is selected from two integers that are near to the calculated results. The neural network toolbox in MATLAB [53] is used to classify activities. The learning rate is set to 0.1. The Trainlm function is selected as the training function, which uses the Levenberg-Marquardt algorithm to update the weights. Moreover, it is the fastest back propagation algorithm in this toolbox. Because the output value of the neural network is 0 or 1, the hidden layer chooses the logsig function as the transfer function since it can map the input values on the [−∞, ∞] interval to the [0, 1] interval. In addition, the output layer selects the purelin function as the transfer function. The performance evaluation index is the mean square error, that is, reducing the mean square error to the acceptable range through iterative learning.

Feature Selection
As we mentioned in section4, a raw signal can be decomposed to several IMFs and their IA and IF can be obtained. It can be seen from the Figure 2 that the first IMF has the highest frequency and the penultimate signal has the lowest frequency. We regard the first two IMFs as high frequency noise. Usually, daily activities belong to the low frequency motion and the energy of them located at low frequency band [9]. Meanwhile, the number of IMFs from decomposition is a variable, in order to ensure these features can be obtained. The presented method considers the mean and variance of the third and the fourth IMFs as features to be selected. Then based on all IA and IF, the Hilbert spectrum and relevant physical quantities such as IE, MS, DS can be deduced. There are some relevant papers using EMD to do the feature extraction [9][10][11], but they only consider using the mean and variance of all or part of IA and IF. However, according to the above analysis from Figure 5, the IE can be an obvious feature for activity recognition. So based on their work, besides using the IA and IF, we also discuss the effect when the mean and variance of IE and MS as features.

Results
This section introduces how to use the features of HHT for activity recognition. Moreover, we discuss some different conditions upon experiment, including using each feature independently, combining some of them, and analyzing the effect of two scenarios: dependent subjects and independent subjects. In addition, we make some comparisons with relevant work.

Performance Measures
In order to compare our results with [54], this paper uses same performance measures: recall, precision, F-measure and accuracy, which are also commonly used in the field of classification. Firstly, we introduce the concept of confusion matrix, that is a matrix where each column of this matrix represents the activities in a annotated (or target) class while each row represents the activities in an recognized (or output) class. Moreover, we represent the element in the ith row and jth column as P ij , the sum of all elements in the ith row as S i , and the sum of all elements in the jth column as R j . Then let C be the total number of activity classes and N be the total number of samples of all activities. Based on these notations, we can clearly define these performance measures as follows.

Recognition Performance When Only Using IE as Feature
As described in the previous sections, this paper uses three wearable sensors to collect activity data. Then, these data are divided into several segments for every 5.12 s. Each sensor collects data from its x, y, and z axes. This paper classifies 12 activities which include most common activities in daily life. It is important, because we can provide useful guidelines to users if we can recognize these activities accurately.
We divide each subject's activity data that have been preprocessed into two parts: one half for training the neural network and the other half for testing the training effect. For each data segment, we can get several eigenvalues. Then, we use training data to train the neural network and use testing data to evaluate the classification accuracy. According to Equation (18), the structure of neural network in this experiment is 18-14-12. The three numbers represent the number of neurons in the input layer, the hidden layer, and the output layer, respectively. The confusion matrix of training data is shown in Table 4.
Using the trained neural network to classify the data of the other parts, testing data, the confusion matrix is obtained in Table 5.
Based on the above two confusion matrices, Equations (20)-(23), we can obtain the performance measures for these two conditions in Table 6.
In our real life, however, only the untrained data needs to be classified, thus, the results of Table 5 are more practical and our focus is on trying to improve the performance measures of testing data. Therefore, in the next subsections, we will discuss the activity recognition from this perspective. Table 4. Confusion matrix of training data with three sensors using instantaneous energy density (IE) as feature.

Recognition Performance of Other Features of HHT
Through the experiment, the effect of three different types of features on activity recognition is analyzed. The first feature is the mean and variance of IE, which we have discussed in Section 6.2. The second feature is the mean and variance of MS and the third one is the mean and the variance of IA and IF. We use the data coming from the accelerometers of three sensors. The accelerometer has three (x, y, z) axis, so when the IE or MS is considered as a feature for activity recognition, the number of eigenvalues is 18. Then, a raw signal can always be decomposed into several IMFs and the total number of IMFs is variable. Therefore, for each signal, the third and the fourth IMFs are selected and this the reason we have illustrated this in Section 5.4. Next, the IA and IF of each IMFs are extracted as features. Thus, when we use the IA and IF as feature, 72 eigenvalues can be obtained in total.
In this experiment, the structure of a neural network is 18-14-12 when the IE or MS was considered as a feature for activity recognition. If the features are IF and IA, the structure of the neural network is 72-29-12. The three numbers represent the number of neurons in the input layer, the hidden layer, and the output layer respectively. The confusion matrices obtained by using MS, IA and IF as features are shown in Figure 10. As we can seen from Figure 10, the judge error between the second activity sitting and the third activity standing is huge, especially when using IA and IF as features, the correct rates are less than 50%. Maybe this is the reason that these two activities be seen as the same activity in the [54]. Meanwhile, we have discussed these activities in Figure 5. The features of them are very similar, which leads us to conclude that it is difficult to distinguish them. In addition, from the results of the confusion matrix, we can intuitively draw the conclusion that the effect of IA and IF is much worse than IE or MS in the activity recognition. To compare and evaluate these three types of features, we calculate their performance measures, as shown in Table 7. As we predicted, the results shows that the effect of IE and MS are roughly equivalent. The precision and accuracy of IE are a little better than that of MS while the recall and F-measure of MS are relatively better than that of IE. From the previous analysis, the figures of the MS among different activities sometimes looks very similar, nevertheless, we know that, according to the classification results, the neural network can still clearly find these differences and classify them correctly. However, when the IF and IA are considered as features for activity recognition, the performance measures are rather poor. The error rate of some activities is very high, and it is difficult to apply it to real life. We maintain that the IF and IA are instantaneous features, which describe the characteristics of a signal at a certain time, but there are too many eigenvalues to put them all into classification. Thus, we use the mean and variance to process these eigenvalues and use them as inputs to the BP neural network. However, this process eliminates the instantaneous feature and leads to steps which aim to stop the instantaneous feature becoming meaningless. This results in the effect of IA and IF being unsatisfactory. It will be better if we put all instantaneous features into classification without considering time consumption and network magnitude, but it is difficult to process hundreds of, and even thousands of eigenvalues. However, for the IA and IF, the correct rates of some activities like cycling are higher than the other two methods. Therefore, we infer that this method may have a special characteristic which can help improve the performance measures of the other two methods and we consider combining these features in the next subsection.

Recognition Performance of Multi-Features Combination
According to the analysis of above subsection, we can get relatively satisfied recognition results when using IE or MS as features. In order to further improve the performance of activity recognition, we consider combining these features both pairwise and by combing all four. Thus, we will have four different types of feature groups including IE-MS, IE-IAIF, MS-IAIF, and IE-MS-IAIF. Then, we take these feature groups as inputs to the BP neural network. The corresponding confusion matrix of each type of feature group can be gained. In this experiment, the structure of the neural network is 90-32-12 when the IE-IAIF or MS-IAIF was considered as a feature for activity recognition. If the features are the IE-MS, the structure of the neural network is 36-20-12 and if the features are the IE-MS-IAIF, the structure of the neural network is 108-36-12. The three numbers represent the number of neurons in the input layer, the hidden layer, and the output layer respectively. We plot their confusion matrices in Figure 11.
To intuitively compare the effect of these feature groups, we calculate their performance measures and show the results in Figure 12.
It can be seen from the results that nearly all the numbers in Figure 12 are bigger than 0.9 while all the numbers in Table 7 are less than 0.9, so the performance of each feature group is better than that of an individual feature. Thus, we believe that multi-features combination can further improve the performance of activity recognition. In addition, when we use each feature separately, the performance measures of IE and MS are much better than that of IAIF. However, when we combine these features in pairs, the MS-IAIF rather than IE-MS shows the best performance. We think that IE and MS may contain many similar characteristics which mean that it is difficult for these two features to play a complementary role for each other. At the same time, according to the above analysis, the correct rate of some activities using IAIF as a feature is better than the condition of using MS. Thus, IAIF can provide some special characteristics for MS to improve the performance. Moreover, it shows the best results when we combine all three features together, but it only a little bit better than the results of MS-IAIF.

Recognition Performance of Independent Subject
In the above discussion, each type of activity data is composed of all subjects' corresponding data and we divide these data into two parts: training data and testing data. However, in our daily life, people focus more on pursuing personalized experience and use the tailor-made devices. Thus, the independent subject validation is important for the activity recognition system. In this subsection, the effect of the presented method in the scenario of independent subject is analyzed. We use the same data preprocessing method as in the scenario of the dependent subject and process each subject's data independently. That is to say, we will separately divide each subject's data into two parts without mixing them with any other subjects' data. According to the analysis of the previous subsection, the feature IE-MS-IAIF shows the best performance in the scenario of dependent subject; therefore, we also use this feature as an example to do the experiment. Besides, in this dataset, only five subjects (including subject ID 1, 2, 5, 6, and 8 ) finish all 12 activities that we want to classify. All other datasets have some missed activity type that mean they cannot be used. As for the data of subject ID 6, the duration of its activity, rope jumping, is less than 3 s, which is too short to use. Thus, in this experiment, we have data from four subjects and we show the results of them in Table 8.
In order to integrate their respective results into an overall performance measures we propose the definition of overall performance measures which is denoted by where overall-PM represents overall performance measures that are composed of four specific indexes, PM i is performance measures of the ith subject that is also composed of four indexes, and they are illustrated in Equations (24) and (25). In addition, m is the total number of subjects, which is 4 in this experiment, N i is the sample number of ith subject, and N is the total number of all samples. Then, we can get the complete results of this experiment in Table 8. during an action, which makes each activity can be easily distinguished, while some people like subject ID 5 may make smaller movements, which introduces difficulties to activity recognition. This is the challenge that we face in our daily life. Besides, the more important reason for this phenomenon is that the duration of some of subject ID 5's activities is too short. This results in the sample size of these activities being small and affects the classification performance of the BP neural network. However, it is not a problem for practical application because the user can supply enough activity data for the algorithm. Thus, we can focus more on the results of the other three subjects. An overall-PM2 is calculated which does not concern the result of subject ID 5. All four performance measures of Overall-PM1 are bigger than 0.91 while those of Overall-PM2 are bigger than 0.93. In the meantime, as for the sample size, although the duration of some activities of subject ID 5 are only about 1 min, the four indexes can be satisfied. For any subject, most of the activities' durations are less than 3 min, which is short. Nevertheless, the results show relatively high performance. Thus, the presented method can realize accurate activity recognition with a small amount of training data and we can believe that if we can give enough data as inputs, the results can be further improved.

Comparing with Relevant Work
Many related works [29][30][31][32][33][34][35][36][37] discuss the classification methods for activity recognition. In these papers, several classification approaches are introduced at the same time. Then, the author can determine a relatively good method according to some specific evaluation indexes. In addition, we used fuzzy classification [55] and the FFT algorithm to process other datasets before, and good classification results are gained with the proposed methods. Moreover, the dataset we use in this paper is from the experiment of [54]. In the work [54], the authors propose both dataset and benchmark for different methods of human activity recognition. Therefore, in order to evaluate our work, we make a comparison of the classification results with the outcome presented in [54]. In that paper, independent subject evaluation is done with leave-one-subject-out (LOSO) 9-fold cross-validation and dependent subject is done with standard 9-fold cross-validation. According to the previous analysis, the scenario of an independent subject is more common and practical, which allows a device to be more suitable for a specific user. Thus, we compare our results of the independent subject in Table 8 with the results from [54] which uses LOSO 9-fold cross-validation. Meanwhile, the activity recognition experiment we undertook is described as all the activity recognition tasks in that paper, so the corresponding results of the task are selected. It can be seen form the Figure 13 that the performance measures of the first four methods are all less than 0.9. KNN shows the best results in [54] which has a similar effect (slightly lower) to the results of Overall-PM1. If we do not consider subject ID 5 because of the sample size, Overall-PM2 has outperformed KNN significantly. There are some differences between the ways we preprocess the dataset. Besides, different classifiers are used to classify activities. However, the effectiveness of the HHT is remarkable. Especially, when the processed data in activity dataset are nonlinear and non-stationary (it can be seen in Figure 8), and HHT has obtained a very good effect processing this type of data in other areas [22][23][24][25][26][27][28]. Hence, the proposed method is verified to be valuable and practical in activity recognition and it can effectively make up the shortcoming of traditional methods when processing activity data.

Limitation and Future Work
Before we summarize the paper, current limitations of the proposed method should be pointed out to direct our future works.
Firstly, although the PAMAP2 dataset we used in this paper is well known for activity recognition and benchmarking, the dataset still contains null values and missing data fields due to packet loss in signal collection or the misuse of sensors by practitioners. To the best of our knowledge, how to deal with missing data is another critical issue for activity data processing based on body-worn sensors. In this paper, we only conduct a set of simple dropping policies to deal with the problem. For example, if there are missing data from one sensor in a certain time point, we will drop all data which are collected from three sensors in this time point. These policies are simple to facilitate our experimental study on feature extraction. Nevertheless, our means for missing data will break the continuity of activity data, or may reduce the recognition accuracy. Accordingly, in our future works, we plan to evaluate the influence of missing data to feature extraction for activity recognition, and to investigate some useful methods like interpolation or prediction, with the objective to fix the problem of missing data and improve recognition effect.
Another limitation of our work is that, due to the limitation of time and space, we do not conduct a detailed study on the impact of sensor numbers on the proposed feature extraction approach. In the view of beauty and comfort considerations, however, people are inclined to wear fewer sensors. Thus, it is important to explore the relationship between recognition results and the number of sensors. Moreover, we can find a tradeoff between them based on the relationship. We will also investigate a possible future research task, to integrate other various classification tools (beyond BP neural network) with the proposed feature extraction methods. As we mentioned in Section 2.2, many kinds of classifiers are widely used in activity recognition. It is valuable to study the effects of the proposed methods with different classifiers.

Conclusions
In this paper, a feature extraction method based on HHT has been proposed for activity recognition since the activity data are non-stationary and nonlinear. The experimental results show that our approach is capable of handling this problem and three conclusions can be drawn.
Firstly, In addition to the features (IA and IF) from EMD, we also use two other related features (IE and MS) from Hilbert spectral analysis. In the scenario of dependent subjects, the effect of multi-features combination and single-feature are discussed. The comparison results show further performance improvement with multi-features combination. At the same time, the highest performance measures can be obtained when we combine all three features together, that is, using IE-MS-IAIF as features.
Secondly, in the scenario of the independent subject, the four overall performance indexes: recall, precision, F-measure, and accuracy can reach 0.9337, 0.9417, 0.9353, and 0.9377 respectively. The experimental results from two scenarios show that our method is both suitable for the scenario of independent subject and dependent subject.
Finally, compared with relevant paper, the relatively high performance of the presented method demonstrated that the effectiveness of the HHT is remarkable and it can make up the shortcoming of traditional methods when the processed data are nonlinear and non-stationary.