Recognition of Human Activities Using Continuous Autoencoders with Wearable Sensors

This paper provides an approach for recognizing human activities with wearable sensors. The continuous autoencoder (CAE) as a novel stochastic neural network model is proposed which improves the ability of model continuous data. CAE adds Gaussian random units into the improved sigmoid activation function to extract the features of nonlinear data. In order to shorten the training time, we propose a new fast stochastic gradient descent (FSGD) algorithm to update the gradients of CAE. The reconstruction of a swiss-roll dataset experiment demonstrates that the CAE can fit continuous data better than the basic autoencoder, and the training time can be reduced by an FSGD algorithm. In the experiment of human activities’ recognition, time and frequency domain feature extract (TFFE) method is raised to extract features from the original sensors’ data. Then, the principal component analysis (PCA) method is applied to feature reduction. It can be noticed that the dimension of each data segment is reduced from 5625 to 42. The feature vectors extracted from original signals are used for the input of deep belief network (DBN), which is composed of multiple CAEs. The training results show that the correct differentiation rate of 99.3% has been achieved. Some contrast experiments like different sensors combinations, sensor units at different positions, and training time with different epochs are designed to validate our approach.


Introduction
Human activities recognition as an important artificial intelligence (AI) research area has become a hot topic in recent years. It has attracted people's attention for its application prospects in ambulatory monitoring, fall detection and educational domain. The current research on human activities recognition mainly includes vision-based recognition and sensor-based recognition [1]. With the development of wireless sensor technology, such sensors as inertial sensor, acceleration sensor and magnetic sensor are more and more applied to human activities recognition, behavior classification and human activity monitoring domains [2]. Early studies for activities recognition employ single or multiple video cameras as the data collector. The vision-based systems are adapted to the laboratory environment in which visual disturbance can be avoided. However, the recognition accuracy will decrease in outdoor environments due to the influence of variable lighting and different disturbances [3]. Furthermore, the single camera can only collect two-dimensional scenes, which will lose some significant information. Due to these restrictions, the sensor-based recognition is applied which can get rid of the influence of light and shade of recognizing human behavior.
Recently, several human activities' recognition approaches have been articulated which acquire data by using wearable sensors. In [4,5], the inertial sensors were used to detect the fall activities of humans. In [6], an incremental diagnosis method for wearable inertial and magnetic sensors system

Materials and Methods
The classification technology of neural networks has been applied to human activities recognition, and has achieved good results. In this section, we propose a new neural network model called continuous autoencoder to classify and recognize human activities. The CAE adds Gaussian random units into activation functions to extract the features of nonlinear data. A novel FSGD algorithm instead of a traditional stochastic gradient decent algorithm is proposed to update the gradients of CAE.

Basic Autoencoder
Autoencoder is a neural network model in which the output is as same as the input, such as y piq " x piq . The autoencoder has two processes: an encoding process (encoder) and a decoding process (decoder). The encoder transforms inputs into hidden features, and the decoder reconstructs hidden features to approximate output. The structure of the autoencoder is shown in Figure 1.

Basic Autoencoder
Autoencoder is a neural network model in which the output is as same as the input, such as . The autoencoder has two processes: an encoding process (encoder) and a decoding process (decoder). The encoder transforms inputs into hidden features, and the decoder reconstructs hidden features to approximate output. The structure of the autoencoder is shown in Figure 1.
where m is the number of input,  controls relative importance of the second term, the first term of loss Equation (2) is an average sum-of-squares error term, the second term is the weight decay term which tends to decrease the magnitude of weights, and helps to prevent over-fitting.

Gaussian Continuous Unit
The zero-mean Gaussian stochastic unit with variance 2  is added into the activation function, which can be defined as: Figure 1. Basic autoencoder model. x i , i P 1, . . . , n represents the input of autoencoder, h j , j P 1, . . . , k is the value of hidden units,x i , i P 1, . . . , n is the approximate output, W piq , i P 1, 2 denotes the weight matrix, b is the bias term.
The square error loss function of single sample is calculated as: JpW, b; x, yq " 1 2 h W,b pxq´y 2 (1) where x and y stand for the real input and output respectively, h W,b pxq is the output of activation function. The error loss function of whole network can be obtained: where m is the number of input, λ controls relative importance of the second term, the first term of loss Equation (2) is an average sum-of-squares error term, the second term is the weight decay term which tends to decrease the magnitude of weights, and helps to prevent over-fitting.

Gaussian Continuous Unit
The zero-mean Gaussian stochastic unit with variance σ 2 is added into the activation function, which can be defined as: where f p¨q represents the activation function which can be set as sigmoid function, h W,b is the output of activation function with input x i , b i is the bias term. Np0, 1q means a zero-mean Gaussian unit, n " σ¨Np0, 1q subjects to the distribution as: considering that the Gaussian stochastic unit which is added into the activation function can make the curve of sigmoid function fluctuate. Therefore, the improved sigmoid function is proposed which has two parameters to control the steepness of activation function. The improved sigmoid function f p¨q in Equation (3) can be defined as: where a i " ř i W ij x i`bi`σ¨N p0, 1q, and k is the gain control parameter, which can regulate the range of sigmoid function. The range of f pa i q changes to r´0.5k, 0.5ks from r´1, 1s after importing k. c i is the exponential control parameter which can regulate the range of approximate linear operating. According to Taylor mean value theorems, the Taylor expansion of Equation (4) at point 0 can be calculated as where R n pa i q is the remainder term of Taylor expansion. Equation (5) approximates the linear function near the 0 point and the nonlinear function away from the 0 point. The effect of parameter c i can be seen from Equation (5). It can control the smoothness of curve and avoid the occurrence of the abrupt curvature change. CAE can be built by adding a stochastic unit into a basic autoencoder. In this paper, the effect of stochastic unit is analyzed by means of manifold learning theory. High dimensional data is often scattered on a low dimensional manifold. Assuming that the stochastic operator ppx|r xq attempts to get the low dimensional manifold data r x to approximate the high dimensional data x. The pattern of Gaussian stochastic unit is different from the low dimensional manifold, so the gradient of ppx|r xq need to be greatly changed to approximate x. CAE can also be regarded as a manifold learning algorithm. Adding Gaussian stochastic unit into activation function can prevent over-fitting and local optimum.
In order to verify the effectiveness of autoencoder with Gaussian unit, the simulation is performed which applies the nonlinear manifold swiss-roll dataset as the experiment subject. The swiss-roll dataset containing 2000 points can avoid the error of singularity data.
The results of comparative experiments implementing the basic autoencoder and CAE which have the same network structure and parameters are shown in Figure 2. Figure 2a is the original swiss-roll dataset, while Figure 2b,c are the reconstruction of swiss-roll dataset implementing autoencoder and CAE, respectively. It can be seen that the result of reconstruction by using CAE is more smooth, and the reconstructed data is as approximate as the original data.

Fast Stochastic Gradient Descent
Gradient descent (GD) is an efficient algorithm to search for the optimal solution. Early research for neural network employed GD as the algorithm of updating network gradients. The GD function is defined as where j represents the iteration epochs, m is the number of samples, j  is the parameters vector of current iteration epoch, The error of loss function can asymptotically converge to local optimization through iterative update of  . In 1952, Kiefer and Wolfowitz put forward stochastic gradient descent (SGD) algorithm [23], which was widely applied in the machine learning [24,25] SGD algorithm is shown in Algorithm 1. It has solved the problem that the results of GD algorithm converge to local optimization instead of global optimization. However, it will take a long time to train network before the end of iteration epochs.

Fast Stochastic Gradient Descent
Gradient descent (GD) is an efficient algorithm to search for the optimal solution. Early research for neural network employed GD as the algorithm of updating network gradients. The GD function is defined as where j represents the iteration epochs, m is the number of samples, θ j is the parameters vector of current iteration epoch, θ j`1 is the parameter vector of next iteration epoch, ∇ θ denotes the gradient operator, and Jpx piq j , θ j q is the loss function which meets The error of loss function can asymptotically converge to local optimization through iterative update of θ. In 1952, Kiefer and Wolfowitz put forward stochastic gradient descent (SGD) algorithm [23], which was widely applied in the machine learning [24,25] domain. SGD algorithm does not need to calculate all the m samples at one time. It only calculates one sample in one iteration epoch. The SGD function in one iteration epoch is defined as SGD algorithm is shown in Algorithm 1. It has solved the problem that the results of GD algorithm converge to local optimization instead of global optimization. However, it will take a long time to train network before the end of iteration epochs.

Algorithm 1 Stochastic Gradient Descent
1: for j " 1 to n do 2: Draw i P t1, 2, . . . , mu at random. 3: θ j Ð θ j´α ∇ θ J´x piq j , θ j4 : end for 5: return θ j In this paper, FSGD algorithm is proposed to update the gradients of CAE. FSGD adopts the cross validation method in training process. According to the training error, FSGD determines whether the training can be ended ahead of time. It is suggested that the iteration epochs should be set as a big number. If the error of loss function converges to a relatively small stable range, FSGD will break the loop ahead of time. FSGD algorithm structure is shown in Algorithm 2. Where Losspθq is the output of loss function, n is the iteration epochs, ε indicates the minimum loss constant which is set as the convergence criterion, K is the time range of breaking the loop.
In order to verify the efficiency and performance of FSGD, comparative experiment is designed with the application of SGD and FSGD as CAE updating gradient algorithm, respectively. The iteration epochs are set to 1000. The minimum loss constant ε is 0.003 and K " 100. It means that if the average error of the loss function is less than 0.003 during continuous 100 epochs, the iteration will be broken ahead of time. It is suggested that small-scale data is firstly trained, then ε can be determined by the a posteriori method.
The FSGD algorithm error curve is shown in Figure 3. It can be seen that the iteration error is reduced to ε at 80th epoch, and it sustains less than ε within 100 epochs. Thus, FSGD breaks the loop at 180th epoch.
The results of experiment are shown in Table 1. SGD algorithm takes 176.53 s to reduce the error to 0.0023 after 1000 epochs. In addition, FSGD only takes 29.277 s to reduce the error to ε range. It can be concluded that FSGD can shorten the training time and achieve the expected results more effectively. In this paper, FSGD algorithm is proposed to update the gradients of CAE. FSGD adopts the cross validation method in training process. According to the training error, FSGD determines whether the training can be ended ahead of time. It is suggested that the iteration epochs should be set as a big number. If the error of loss function converges to a relatively small stable range, FSGD will break the loop ahead of time. FSGD algorithm structure is shown in Algorithm 2. In order to verify the efficiency and performance of FSGD, comparative experiment is designed with the application of SGD and FSGD as CAE updating gradient algorithm, respectively. The iteration epochs are set to 1000. The minimum loss constant  is 0.003 and 100 K  . It means that if the average error of the loss function is less than 0.003 during continuous 100 epochs, the iteration will be broken ahead of time. It is suggested that small-scale data is firstly trained, then  can be determined by the a posteriori method.
The FSGD algorithm error curve is shown in Figure 3. It can be seen that the iteration error is reduced to  at 80th epoch, and it sustains less than  within 100 epochs. Thus, FSGD breaks the loop at 180th epoch. The results of experiment are shown in Table 1. SGD algorithm takes 176.53 s to reduce the error to 0.0023 after 1000 epochs. In addition, FSGD only takes 29.277 s to reduce the error to  range. It can be concluded that FSGD can shorten the training time and achieve the expected results more effectively.

Human Activities Dataset
In 2010, Altun et al. [26] collected human activity data by using body-worn inertial and magnetic sensors. Nineteen activities were classified: sitting (A1), standing on the ground (A2), lying on the back (A3), lying on the right side (A4), going upstairs (A5), going downstairs(A6), standing in an elevator (A7), walking around in an elevator (A8), walking in a parking lot (A9), walking on a treadmill (A10), walking on a treadmill with 15˝angle of inclination (A11), running on a treadmill (A12), exercising on a stepper (A13), exercising on cross trainer (A14), cycling on an exercise bike in horizontal position (A15), cycling on an exercise bike in vertical position (A16), rowing (A17), jumping (A18), playing basketball (A19). Each activity is performed by eight different subjects for 60 segments. In this way, signal segments amounting to 9120 (= 60ˆ19ˆ8) can be obtained.
Each subject wears different sensors in five parts of their body: left and right arms; left and right legs; and the body torso. Each sensor has a tri-axial accelerometer, a tri-axial gyroscope, and a tri-axial magnetometer. Sensor units are calibrated to acquire data at 25 Hz sampling frequency. Each 5-s signal has 125 rows of data. Each sensor has five MTx miniature inertial three degrees of freedom orientation trackers. The MTx tracker in accelerometers can sense up to˘5 g gravitational acceleration, g " 9.806665 m{s 2 . The MTx tracker in gyroscopes can sense up to˘1200˝{s angular velocity. The tracker in magnetometers can sense magnetic fields in the range of˘75 µT. The z-axis acceleration and gyroscope signals of the right arm for walking in the parking lot and jumping are shown in Figure 4, respectively.
Each subject wears different sensors in five parts of their body: left and right arms; left and right legs; and the body torso. Each sensor has a tri-axial accelerometer, a tri-axial gyroscope, and a triaxial magnetometer. Sensor units are calibrated to acquire data at 25 Hz sampling frequency. Each 5s signal has 125 rows of data. Each sensor has five MTx miniature inertial three degrees of freedom orientation trackers. The MTx tracker in accelerometers can sense up to ±5 g gravitational acceleration,

Feature Extraction and Reduction
The original signals have a large amount of data. If these data are applied to recognition of human activities directly, the correct differentiation rates will be affected by the redundant data, and it will take a lone time to classify these data. In this section we propose TFFE method to extract features from original signals, and employ principal component analysis (PCA) to reduce data dimension.

Feature Extraction and Reduction
The original signals have a large amount of data. If these data are applied to recognition of human activities directly, the correct differentiation rates will be affected by the redundant data, and it will take a lone time to classify these data. In this section we propose TFFE method to extract features from original signals, and employ principal component analysis (PCA) to reduce data dimension.

Feature Extraction
The original signals are acquired by accelerometer, gyroscope and magnetometer which have 125ˆ45 dimension of data in each 5 s window. Because the original signals do not have easily-detected features, TFFE method is proposed to extract features from the original sensor data. The common time domain features include mean value, variance, skewness, kurtosis, correlation between axes (CORR) and mean absolute deviation (MAD). The frequency domain features include power spectral density (PSD), discrete cosine transform (DCT), fast Fourier transform (FFT) and cepstrum coefficients. The correct differentiation rates could be improved if time and frequency domain features are chosen properly. In Section 4.2, we will design the contrast experiments and evaluate the correct differentiation rates of these features. According to the result of contrast experiments, the following features can be selected.
Firstly, TFFE extracts four-dimensional time domain feature: mean value, MAD, skewness, and CORR. They can be calculated as where E t¨u is the expected operator, N = 125 and i P t1, ..., 45u, s i,n refers to the data in row n column i, σ i is the standard deviation, s j,n is the data in row n column j, µ j represents the mean value of s j . Secondly, the ten-dimensional frequency domain features can be acquired from original signals. The features are the maximum five peaks of fast Fourier transform (FFT) and cepstrum coefficients of the signals, which can be calculated as The instances of frequency domain features for two activities are shown in Figure 5. After feature extraction, the dimension of each signal segment is reduced from 5625 (=125ˆ45) to 630 (=14ˆ45).

Feature Reduction
After feature extraction, the dimension of each data segment is 630 (= 14 × 45). In this paper, PCA [27] method is adopted to reduce the dimension of features. The essence of PCA is to calculate the optimal linear combinations of features by linear transformation. The results of PCA represent the highest variance in the feature subspace.
The eigenvalues and contribution rate of covariance matrix are shown in Figure 6. It can be seen that after being sorted in descending order, the contribution rate of the largest three eigenvalues accounts for more than 98% of total contribution rate. These eigenvalues can be used to form the transformation matrix. After PCA feature reduction, the dimension of each signal segment is reduced from 630 (=14 × 45) to 42 (=14 × 3). Scatter plots of the first three transformed features are shown in Figure 7. The features of different classifications are better clustered and more distinct than the original data.

Feature Reduction
After feature extraction, the dimension of each data segment is 630 (= 14ˆ45). In this paper, PCA [27] method is adopted to reduce the dimension of features. The essence of PCA is to calculate the optimal linear combinations of features by linear transformation. The results of PCA represent the highest variance in the feature subspace.
The eigenvalues and contribution rate of covariance matrix are shown in Figure 6. It can be seen that after being sorted in descending order, the contribution rate of the largest three eigenvalues accounts for more than 98% of total contribution rate. These eigenvalues can be used to form the transformation matrix. After PCA feature reduction, the dimension of each signal segment is reduced from 630 (=14ˆ45) to 42 (=14ˆ3). Scatter plots of the first three transformed features are shown in Figure 7. The features of different classifications are better clustered and more distinct than the original data.

Results and Discussion
In this section, six layers of DBN is established to train feature data and classify activities. Then, K-fold and confusion matrix method are employed to analyze the results of experiment. At the end of this section, we design the contrast experiment for our approach and some existing methods using the same dataset.

Network Structure
DBN is established by two layers of the CAE network and one layer of the BP network in logical construction. In physical construction, DBN is composed of one input layer, one output layer and four hidden layers.
The network structure is shown in Figure 8. H and output layer T compose the back propagation (BP) network.
The parameters of DBN include learning rate, momentum, dropout rate, number of epochs and batch size. In the process of the training network, the proper parameter can improve the correct differentiation rates of activity recognition. According to the experience and the results of small-scale data training, the parameters can be set as follows: Learning rate: if the learning rate is set relatively small, the error curve will converge slowly and the training time is too long. Otherwise, if it is set relatively big, the error curve will oscillate. Because the setting of training epochs is 100 in this paper, the learning rate is set to be 0.6. This setting can make the mean squared error curve converge fast.
Momentum: the momentum can fine-tune the direction of gradient. In the activation function of CAE, the Gaussian unit is added which can also change the gradient stochastically. Thus, the momentum is set to be 0.06. The value of momentum is relatively small because of the effect of CAE.

Results, Discussion
In this section, six layers of DBN is established to train feature data and classify activities. Then, K-fold and confusion matrix method are employed to analyze the results of experiment. At the end of this section, we design the contrast experiment for our approach and some existing methods using the same dataset.

Network Structure
DBN is established by two layers of the CAE network and one layer of the BP network in logical construction. In physical construction, DBN is composed of one input layer, one output layer and four hidden layers.
The network structure is shown in Figure 8. In V layer, there are 42 units that contain features extracted by TFFE method and reduced by PCA method. The T layer contains 19 units, corresponding to the 19 activities' binary codes. The hidden layer H 0 contains 10 units that are used to store the low-dimensional features. The hidden layer H 1 contains 42 units that are used to reconstruct the low-dimensional features to high-dimensional approximate output. The hidden layer H 2 and H 3 contain eight units and 42 units, respectively. The input layer V and hidden layer H 0 , H 1 compose the first CAE network, the hidden layer H 1 , H 2 and H 3 compose the second CAE network. Hidden layer H 3 and output layer T compose the back propagation (BP) network.
The parameters of DBN include learning rate, momentum, dropout rate, number of epochs and batch size. In the process of the training network, the proper parameter can improve the correct differentiation rates of activity recognition. According to the experience and the results of small-scale data training, the parameters can be set as follows: Learning rate: if the learning rate is set relatively small, the error curve will converge slowly and the training time is too long. Otherwise, if it is set relatively big, the error curve will oscillate. Because the setting of training epochs is 100 in this paper, the learning rate is set to be 0.6. This setting can make the mean squared error curve converge fast.
Momentum: the momentum can fine-tune the direction of gradient. In the activation function of CAE, the Gaussian unit is added which can also change the gradient stochastically. Thus, the momentum is set to be 0.06. The value of momentum is relatively small because of the effect of CAE. Dropout rate: the dropout is proposed by Hinton et al. [28], which can be used to prevent overfitting. During the training of DBN, the connection weights between visual layer and hidden layer are probabilistically dropped. In addition, these weights will be back in the retaining process. As the weights get sparse, DBN will select the most representative features which have much less prediction error. The dropout rate is set to be 0.1. The result of experiments indicates that the correct differentiation rates can be improved more than 2% by the dropout rate.
After the setting of network parameters, the DBN can be trained as the following steps: Step (1): each CAE is an individual network under unsupervised training. In the process of training, CAE extracts features from input data and stores features into weight matrix W . The local optimization of each CAE can be acquired which will be used to search the global optimization of entire DBN in next steps.
Step (2): one layer of BP network is set at the bottom layer of DBN. BP network will acquire the approximate output of CAE network. Through supervised training, the BP will calculate the error of the whole DBN.
Step (3): the error will be passed back to previous layers of CAE. According to the error, CAE will use supervised fine-tune strategy to update the weight matrix. The process of reconstruction will be repeated 100 epochs until the error converges. Then, the global optimization will be acquired to classify the human activities.
DBN overcomes the disadvantages of traditional single layer BP network: local optimum and long training time. Under the effect of the FSGD algorithm, the error of loss function reaches 0.002 at the 52th epoch and is sustained less than 0.002 within the next 10 epochs. Then, the training of the network is broken at the 62th epoch. As shown in Figure 9, the mean squared error of DBN network is reduced to 10 −3 . Dropout rate: the dropout is proposed by Hinton et al. [28], which can be used to prevent over-fitting. During the training of DBN, the connection weights between visual layer and hidden layer are probabilistically dropped. In addition, these weights will be back in the retaining process. As the weights get sparse, DBN will select the most representative features which have much less prediction error. The dropout rate is set to be 0.1. The result of experiments indicates that the correct differentiation rates can be improved more than 2% by the dropout rate.
After the setting of network parameters, the DBN can be trained as the following steps: Step (1): each CAE is an individual network under unsupervised training. In the process of training, CAE extracts features from input data and stores features into weight matrix W. The local optimization of each CAE can be acquired which will be used to search the global optimization of entire DBN in next steps.
Step (2): one layer of BP network is set at the bottom layer of DBN. BP network will acquire the approximate output of CAE network. Through supervised training, the BP will calculate the error of the whole DBN.
Step (3): the error will be passed back to previous layers of CAE. According to the error, CAE will use supervised fine-tune strategy to update the weight matrix. The process of reconstruction will be repeated 100 epochs until the error converges. Then, the global optimization will be acquired to classify the human activities.
DBN overcomes the disadvantages of traditional single layer BP network: local optimum and long training time. Under the effect of the FSGD algorithm, the error of loss function reaches 0.002 at the 52th epoch and is sustained less than 0.002 within the next 10 epochs. Then, the training of the network is broken at the 62th epoch. As shown in Figure 9, the mean squared error of DBN network is reduced to 10´3.

Comparative Evaluation
The K-fold validation method is applied to the simulation experiment. In this section, K is set to be six, it means that 9120 (=60 × 19 × 8) feature vectors will be divided into six partitions. In each partition, there are 1520 feature vectors, each vector contains 42 (=14 × 3) dimensional features. One of the partitions is used for testing, and the others are used for training. The training process will repeat six times to do the cross validation. In the training process, each partition can be regarded as testing data for one time, and can be trained for five times. The advantage of the K-fold method is that each feature vector can be used for testing and training, which can avoid underfitting and overfitting effectively. The average accuracy of six times training can be obtained as the final results.
The confusion matrix of 6-fold is shown in Table 2. It can be observed that A7 is easy to be mistaken for A8 and A2. The confusion rates of these activities are 6% and 2%, respectively. The activities A7 and A8 are both performed in elevator in which the accelerometer and gyroscope sensors could be affected, thus the correct differentiation rate of these two sensors combinations are relatively lower than accelerometer combinations. A7 and A2 are both activities about standing, the only difference between them is the places of activities. A19 is difficult to be recognized for other activities, while other activities are easy to be recognized for A19. The reason is that the features of A19 are more clutter than others. Furthermore, it can be perceived that the confusion rate between the two activities is not symmetrical. A7 has a high confusion rate for A8, while A8 is not easy to be mistaken for A7.
In order to validate the predictability of DBN, another experiment is designed which uses the 10-fold cross validation method. In this experiment, the 7600 (=50 × 19 × 8) feature vectors is applied to the training dataset, the 1520 (=10 × 19 × 8) feature vectors is applied to the predicting dataset. These two parts have different data that can avoid the experience effect of DBN. There are two steps in this experiment: (1) The DBN is established, and the 7600 feature vectors is used to train the network; (2) The trained DBN is applied to predicting the correct differentiation rate of 1520 feature vectors.
The confusion matrix of predicting is shown in Table 3. The correct differentiation rate of predicting is 94.9%. According to the results, we observe that the predicting correct differentiation rates of A2, A7 and A8 are relatively lower than other activities. A7 is easily to be mistaken for A2, and A2 also has a high confusion rate for A7. These two activities are both standing which have the similar characteristics. We also notice that the results of predicting confusion matrix are not consistent with training confusion matrix. In training confusion matrix, A8 is not easily mistaken for A7. However, there is an opposite result in predicting confusion matrix, A8 has a high confusion rate for A7. Many researches have proved that the more data is trained, the higher predicting correct differentiation rate will be achieved. Thus, the predicting correct differentiation rate can be increased if the DBN has learned more activity features.

Comparative Evaluation
The K-fold validation method is applied to the simulation experiment. In this section, K is set to be six, it means that 9120 (=60ˆ19ˆ8) feature vectors will be divided into six partitions. In each partition, there are 1520 feature vectors, each vector contains 42 (=14ˆ3) dimensional features. One of the partitions is used for testing, and the others are used for training. The training process will repeat six times to do the cross validation. In the training process, each partition can be regarded as testing data for one time, and can be trained for five times. The advantage of the K-fold method is that each feature vector can be used for testing and training, which can avoid underfitting and overfitting effectively. The average accuracy of six times training can be obtained as the final results.
The confusion matrix of 6-fold is shown in Table 2. It can be observed that A7 is easy to be mistaken for A8 and A2. The confusion rates of these activities are 6% and 2%, respectively. The activities A7 and A8 are both performed in elevator in which the accelerometer and gyroscope sensors could be affected, thus the correct differentiation rate of these two sensors combinations are relatively lower than accelerometer combinations. A7 and A2 are both activities about standing, the only difference between them is the places of activities. A19 is difficult to be recognized for other activities, while other activities are easy to be recognized for A19. The reason is that the features of A19 are more clutter than others. Furthermore, it can be perceived that the confusion rate between the two activities is not symmetrical. A7 has a high confusion rate for A8, while A8 is not easy to be mistaken for A7.
In order to validate the predictability of DBN, another experiment is designed which uses the 10-fold cross validation method. In this experiment, the 7600 (=50ˆ19ˆ8) feature vectors is applied to the training dataset, the 1520 (=10ˆ19ˆ8) feature vectors is applied to the predicting dataset. These two parts have different data that can avoid the experience effect of DBN. There are two steps in this experiment: (1) The DBN is established, and the 7600 feature vectors is used to train the network; (2) The trained DBN is applied to predicting the correct differentiation rate of 1520 feature vectors.
The confusion matrix of predicting is shown in Table 3. The correct differentiation rate of predicting is 94.9%. According to the results, we observe that the predicting correct differentiation rates of A2, A7 and A8 are relatively lower than other activities. A7 is easily to be mistaken for A2, and A2 also has a high confusion rate for A7. These two activities are both standing which have the similar characteristics. We also notice that the results of predicting confusion matrix are not consistent with training confusion matrix. In training confusion matrix, A8 is not easily mistaken for A7. However, there is an opposite result in predicting confusion matrix, A8 has a high confusion rate for A7. Many researches have proved that the more data is trained, the higher predicting correct differentiation rate will be achieved. Thus, the predicting correct differentiation rate can be increased if the DBN has learned more activity features.     Figure 10. It can be seen that the rate of kurtosis is relatively low, the rate of MAD is higher than other time domain features. According to the result, TFFE chooses MAD, mean value, CORR and skewness as the four-dimensional time domain features. The average correct differentiation rate of these four features is 75.7%.
In Section 3.1, the TFFE method has been introduced. The result of contrast experiments can decide which time and frequency domain features are chosen. The correct differentiation rates of time domain features are shown in Figure 10. It can be seen that the rate of kurtosis is relatively low, the rate of MAD is higher than other time domain features. According to the result, TFFE chooses MAD, mean value, CORR and skewness as the four-dimensional time domain features. The average correct differentiation rate of these four features is 75.7%. The correct differentiation rates of frequency domain features are shown in Figure 11. The rate of cepstrum coefficients and FFT are higher than other features. According to this result, TFFE chooses the maximum five peaks of cepstrum coefficients and FFT as the ten-dimensional frequency domain features. Based on Table 4, it can be concluded that there is little difference between the four-dimensional time domain and the ten-dimensional frequency domain in correct differentiation rates. Although the rate of frequency domain features is higher than time domain features, the training time of each time domain feature is only 1/5 of the frequency domain features. Considering the recognition efficiency of human activities, the combination of time domain and frequency domain features is applied for the training data of DBN.
The results of single sensor and sensor combinations experiment are shown in Table 5. The highest correct differentiation rate is obtained by accelerometers, followed by magnetometers and gyroscopes. The highest rate of sensor combinations is magnetometers and accelerometers. The rates of combinations are higher than the single sensor. The rates will be higher by adding accelerometers The correct differentiation rates of frequency domain features are shown in Figure 11. The rate of cepstrum coefficients and FFT are higher than other features. According to this result, TFFE chooses the maximum five peaks of cepstrum coefficients and FFT as the ten-dimensional frequency domain features.
In Section 3.1, the TFFE method has been introduced. The result of contrast experiments can decide which time and frequency domain features are chosen. The correct differentiation rates of time domain features are shown in Figure 10. It can be seen that the rate of kurtosis is relatively low, the rate of MAD is higher than other time domain features. According to the result, TFFE chooses MAD, mean value, CORR and skewness as the four-dimensional time domain features. The average correct differentiation rate of these four features is 75.7%. The correct differentiation rates of frequency domain features are shown in Figure 11. The rate of cepstrum coefficients and FFT are higher than other features. According to this result, TFFE chooses the maximum five peaks of cepstrum coefficients and FFT as the ten-dimensional frequency domain features. Based on Table 4, it can be concluded that there is little difference between the four-dimensional time domain and the ten-dimensional frequency domain in correct differentiation rates. Although the rate of frequency domain features is higher than time domain features, the training time of each time domain feature is only 1/5 of the frequency domain features. Considering the recognition efficiency of human activities, the combination of time domain and frequency domain features is applied for the training data of DBN.
The results of single sensor and sensor combinations experiment are shown in Table 5. The highest correct differentiation rate is obtained by accelerometers, followed by magnetometers and gyroscopes. The highest rate of sensor combinations is magnetometers and accelerometers. The rates of combinations are higher than the single sensor. The rates will be higher by adding accelerometers Based on Table 4, it can be concluded that there is little difference between the four-dimensional time domain and the ten-dimensional frequency domain in correct differentiation rates. Although the rate of frequency domain features is higher than time domain features, the training time of each time domain feature is only 1/5 of the frequency domain features. Considering the recognition efficiency of human activities, the combination of time domain and frequency domain features is applied for the training data of DBN.
The results of single sensor and sensor combinations experiment are shown in Table 5. The highest correct differentiation rate is obtained by accelerometers, followed by magnetometers and gyroscopes. The highest rate of sensor combinations is magnetometers and accelerometers. The rates of combinations are higher than the single sensor. The rates will be higher by adding accelerometers into sensor combinations. Considering the cost of sensors, the combination of magnetometers and accelerometers is accurate enough to recognize human activities. In order to verify the validity of DBN, we design the contrast experiment for the correct differentiation rates of DBN, rule-based algorithm (RBA), dynamic time warping (DTW), least squares method (LSM), Bayesian decision making (BDM), k-nearest neighbor algorithm (k-NN) and support vector machines (SVM). Specifically, Altun et al. [29] employed these algorithms except DBN to classify human activities in the same dataset. The K-fold (K = 10) cross-validation techniques are adopted in this experiment. The correct differentiation rate of these algorithms are shown in Table 6. The correct differentiation rate of DBN is 99.3%, which is relatively higher than other algorithms. It can be seen that DBN is more suitable for recognition of human activities than other algorithms. The correct differentiation rates of sensor units at different positions on the body are displayed in Table 7. LA and RA represent sensor units on the left arm and right arm, LL and RL denote sensor units on the left leg and right leg, and T denotes the sensor units on torso. It can be seen that the correct differentiation rates of units on RL is higher than units at other positions. The rates of unit combinations are higher than the single unit. The average correct differentiation rate of units at two positions is 95.9%. That means a high rate can be achieved with only two sensor units on the body to avoid to affect the natural movements. The DBN also can be constructed by restricted boltzmann machine (RBM), sparse autoencoder (SAE) and autoencoder (AE). These models are popular research topic in recent years, especially in image pattern recognition. The four DBNs are established by CAE, RBM, SAE and AE, separately. These DBNs have the same parameters of network which can ensure the fairness of results. The correct differentiation rate of these algorithms are shown in Table 8. We observe that the correct differentiation rate of RBM is relatively lower than other models. The RBM is more suitable for image recognition than the signal process. The rate of CAE is higher than AE and SAE. It can be concluded that CAE is better adapted to recognition of continuous signals than other models.
Considering the training time and correct rates of different training epochs, the experiment results are shown in Table 9. It only takes 62.875 s to achieve a rate as high as 70.1%, thus the high recognition efficiency of DBN is proved. When epoch = 10, the rate becomes 93.2% accordingly. Under such a condition, DBN can be applied in real-time systems to acquire a recognition result within minutes. When epoch = 100, the rate can increase 6.1% by taking tenfold time than the previous training time. Under this condition, DBN can be applied in a system which demands higher recognition rates. After the training process, the trained DBN is stored into the file system of the computer. Then different signal segments are used for activity recognition. Each signal segment has 5-s data. When segments = 9120, there is a total of 12.67 h data which equals to the amount of activities of a person within half a day. The recognition time can be shown in Table 10 in which three recognition results are all lower than 1 s. Thus, it indicates that human activities can be recognized by the trained DBN instantly. The 19 activities are performed by eight different subjects for 5 min. The 5-min signals are divided into 5-s segments, which means each subject in each activity has 60 segments to be trained. The profiles of these subjects are given in Table 11. These subjects preform the activities according to their personal habits. The correct differentiation rates of these subjects are extremely close to each other. We can conclude that the features of each activity of the subjects cannot be influenced by the personal characteristics and habits of the subjects. The correct differentiation rates are mainly related to the amount of training dataset.

Conclusions
In this paper, we put forward a new approach for the recognition of human activities with wearable sensors. When data is being prepossessed, the 5625 (=125ˆ45) dimensional data in each 5-s signal is acquired by accelerometer, gyroscope and magnetometer. The TFFT method is presented to extract features from the original sensor data. Then, the 630 (=14ˆ45) dimensional features can be got. In addition, PCA is applied to feature reduction, thus the dimension of features is reduced from 630 to 42 (=14ˆ3).
In the process of data recognition, a CAE model is designed which adds Gaussian noise into the activation function. The FSGD algorithm is proposed to shorten the training time of CAE. Then, DBN is constructed by two layers of CAE and one layer of the BP network. These two layers of CAE are applied to unsupervised pre-training. The layer of BP is used for supervised fine-tuning and to update the weight matrix of CAE. The effectiveness of our approach is validated by the experiment results.