Asymmetric Residual Neural Network for Accurate Human Activity Recognition

Abstract: Human activity recognition (HAR) using deep neural networks has become a hot topic in human-computer interaction. Machines can effectively identify naturalistic human activities by learning from a large collection of sensor data. Activity recognition is not only an interesting research problem but also has many practical real-world applications. Motivated by the success of residual networks in automatically learning high-level representations, we propose a novel asymmetric residual network, named ARN. ARN is implemented using two identical path frameworks consisting of (1) a short time window, which is used to capture spatial features, and (2) a long time window, which is used to capture fine temporal features. The long time window path can be made very lightweight by reducing its channel capacity, while still being able to learn useful temporal representations for activity recognition. In this paper, we mainly focus on proposing a new model to improve the accuracy of HAR. To demonstrate the effectiveness of the ARN model, we carried out extensive experiments on benchmark datasets (i.e., OPPORTUNITY, UniMiB-SHAR) and compared the results with conventional and state-of-the-art learning-based methods. We discuss the influence of the network parameters on performance to provide insights into its optimization. Results from our experiments show that ARN is effective in recognizing human activities from wearable sensor data.


I. Introduction
Human Activity Recognition (HAR) is very important for human-computer interaction and is an indispensable part of many current real-world applications. The goal of HAR is to teach machines to understand the data acquired by sensors. To improve awareness in human-computer interaction, the latent features in the data from on-body devices must be learned. HAR using wearable device data is at the core of intelligent assistive technology. Particularly for elderly people who live in remote areas and need continuous monitoring, HAR can greatly increase their safety [1]. HAR has also attracted attention because of its growing range of applications in smart homes [2], intelligent traffic control [3], medical/health assistance [4,5], skill-based assessment [6], and even the security field [7].
Nowadays, accelerometers, gyroscopes and magnetic field sensors are widely used in smartphones (e.g., iPhone, Samsung, Huawei) and smart bracelets (e.g., Apple Watch, Fitbit). With the increasing number of wearable sensors and IoT devices, there is a growing trend of collecting user activity data in real time. The key technologies in HAR include a sliding time window over time-series data captured with on-body sensors, manually designed feature extraction procedures, and a wide variety of supervised learning methods.
In the past years, researchers have made considerable progress in wearable activity recognition, using algorithms such as Logistic Regression [8], Decision Tree [9], and Hidden Markov Model [10]. Many of these traditional methods are characterized by manual feature extraction. Their performance often depends on the quality of the manually extracted features, and a good manual feature extraction method is hard to find. When different sensors are used, the distribution of the data features changes, so the features must be analyzed and extracted manually again. Manual feature extraction is therefore inconvenient and has limited application prospects. These methods are practical, but not conceptually novel. Compared with manual feature extraction methods, deep learning techniques can discover adequate features without expert knowledge and without a systematic exploration of the feature space. Deep learning techniques have achieved remarkable results in many fields, such as image recognition [11,12], speech recognition [13], and natural language processing [14]. Existing deep learning methods for human activity recognition can be further divided into two categories: Deep Neural Networks (DNNs) [15] and Convolutional Neural Networks (CNNs) [16]. Work using DNN methods for HAR includes [17], which investigated deep neural networks with wearable sensor data, and [18], which explored temporal deep neural networks for active biometric authentication.
Recently, many researchers have used CNNs for wearable activity recognition [19]. CNNs can model the entire sequence by sharing weights from local to global, extract abstract features at hierarchical layers through a series of convolutional operations, and process the raw activity signals to capture latent features. Yang et al. [20] proposed CNN-based approaches to automatically extract discriminative features for HAR.
However, all the deep learning methods mentioned above use a single-path neural network and do not consider the spatial and temporal features of the data separately. Biological studies of retinal ganglion cells in the primate visual system show that parvocellular cells (P-cells) provide fine spatial detail and color, but their temporal resolution is low. In addition, there are high-frequency magnocellular cells (M-cells), which are very sensitive to changes over time but not sensitive to spatial details and colors. Inspired by these facts, we propose a model that handles a set of activity data with two synchronized networks, using a short time window to capture spatial features and a long time window to capture fine temporal features, corresponding to the P-cells and the M-cells, respectively. Our network is an end-to-end network whose input is the original sensor data, so the data collected from the wearable device can be fed directly into the network. The DRN model is applicable to both supervised and unsupervised learning approaches. In this paper, we use supervised learning, because label information can help bridge the semantic gap.
We propose a novel Dual Residual Network, named DRN. As a new kind of deep learning network, DRN divides the components of activity recognition into two parts: (1) a residual net using a short time window (i.e., 32 or 64); and (2) a residual net using a long time window (i.e., 64 or 96). The last-layer representations of the two parts are concatenated, and the fused representation is used for accurate activity recognition. The advantages of the DRN over other existing methods are listed in Table 1. To the best of our knowledge, this is the first work that applies a dual residual net to activity recognition.
The main contributions of the paper are as follows:
1. The proposed network consists of a dual residual net that not only effectively manages information flow, but also automatically learns effective activity feature representations, while capturing the fine feature distribution of different activities from wearable sensor data.
2. The proposed DRN is an asymmetric network with two paths that work separately on short and long sliding windows. Our wide path is designed to capture global features but few spatial details, analogous to M-cells, and our narrow path is lightweight, similar to the small receptive field of P-cells. Results from our experiments show that the model is suitable for activity recognition.
The remainder of this paper is structured as follows. In Section II, we briefly introduce the related work. In Section III, we highlight the motivation of our method and provide some theoretical analysis for its implementation. In Section IV, we present our experimental results and the corresponding analysis. In Section V, we evaluate the hyperparameters, and in Section VI we conclude the paper.

II. Related Work

i. Traditional Features Extraction for Human Activity Recognition
Traditional feature extraction methods, including statistics of raw signals [21] and symbolic representations [22], are deemed to play an important role, comparable to transforming the data with one or a few neurons in one layer of a deep learning model. Generating a window for every time step, which we denote as setting the sampling stride to 1, also achieves dense predictions over the sequence. Methods that are able to exploit the temporal dependencies in time-series data appear to be the natural choice for modeling human movement captured with sensor data. However, this straightforward approach results in a huge number of windows, and the prediction process becomes intractably slow.

ii. Convolutional Neural Networks for Human Activity Recognition
CNNs can automatically extract features from raw sensor data without the need for expert knowledge [19]. A standard convolutional neural network consists of convolutional layers, max-pooling layers, fully-connected (FC) layers and softmax layers. Instead of using predefined filters as in traditional feature extraction methods, CNNs learn locally connected neurons that represent data-specific filters.
Because CNNs share weights among neurons, they have far fewer connections between neurons than traditional neural networks [23].
Convolutional layers are an important component of CNNs. They use several convolution filters (or kernels) that aim to learn feature representations of the raw input. The dimension of the filters is determined by the input dimension. A convolution kernel is a function that generalizes a linear model for the underlying local patch. It works well for abstraction when instances of latent concepts are linearly separable. In each convolutional layer, neurons of the current layer are connected to the neurons of the previous layer through a feature mapping operation. Thus, the feature maps of a layer are obtained by applying an element-wise nonlinear activation function to the convolved results of the previous layer. The value of feature map j in layer l+1, $x_j^{l+1}$, is calculated by:

$$x_j^{l+1} = \sigma\Big(\sum_{i=1}^{M^l} x_i^l * k_{ij}^l + b_j^l\Big)$$

where $M^l$ is the total number of feature maps in the l-th layer, $k_{ij}^l$ is the convolution kernel connecting feature map i of layer l to feature map j of layer l+1, $*$ denotes the convolution operation, and $b_j^l$ is the bias. $\sigma(\cdot)$ is the activation function. The most notable nonlinear activation function is ReLU, which is defined as $\sigma(x) = \max(x, 0)$. ReLU is much faster to compute than the sigmoid or tanh activation functions, induces sparsity in the hidden neurons, and makes it easier for networks to obtain sparse representations. Because ReLU outputs zero for negative inputs, it may hinder backpropagation through those neurons, but many research results have shown that ReLU works much better than sigmoid and tanh [24].
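As an illustration of the feature-map computation described above, the following is a minimal NumPy sketch for the 1-D (temporal) case. The function names, the array layout (feature maps × time) and the "valid" (no padding) output length are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np


def relu(x):
    """ReLU activation: sigma(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)


def conv_feature_map(prev_maps, kernels, bias):
    """Compute one output feature map from all feature maps of the previous layer.

    prev_maps: array of shape (M, T), M feature maps of length T (assumed layout).
    kernels:   array of shape (M, s), one length-s kernel per input feature map.
    bias:      scalar bias of this output feature map.
    Returns an array of length T - s + 1 (no padding).
    """
    num_maps, length = prev_maps.shape
    kernel_len = kernels.shape[1]
    out_len = length - kernel_len + 1
    pre_activation = np.full(out_len, float(bias))
    for i in range(num_maps):
        for t in range(out_len):
            # Sliding dot product (cross-correlation, as in most deep learning libraries).
            pre_activation[t] += np.dot(prev_maps[i, t:t + kernel_len], kernels[i])
    return relu(pre_activation)
```

With two input maps of length 6 and kernels of length s = 5, the output map has 6 - 5 + 1 = 2 values; any position whose weighted sum plus bias is negative is clipped to zero by ReLU.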
Pooling layers, which follow the convolutional layers, are another component of CNNs. In a pooling layer, a pooling operation reduces the number of neuron connections between neighboring convolutional layers, thereby reducing computational complexity.
Fully-connected layers unfold the 2-D feature matrix into a 1-D feature vector to suit classification tasks, and contain about 90% of the parameters of the entire CNN.
The loss function plays an important role in classification tasks. The most common loss function is the softmax loss. Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is the i-th input and $y^{(i)} \in \{1, \dots, K\}$ is its target label among the K labels, the prediction $a_j^{(i)}$ of the j-th class for the i-th input is transformed with the softmax function:

$$p_j^{(i)} = \frac{\exp(a_j^{(i)})}{\sum_{k=1}^{K} \exp(a_k^{(i)})}$$

Softmax normalizes the predictions to a probability distribution over all classes. The softmax loss is:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} 1\{y^{(i)} = j\} \log p_j^{(i)}$$

where $1\{\cdot\}$ is the indicator function.
Regularization is required in CNNs. Overfitting is an unavoidable problem in convolutional neural networks, but it can be effectively reduced by regularization.
As a means of regularization, dropout can prevent the dependence of different neurons in a network, and force the network to be more accurate even in the absence of certain information [25].
The convolutional neural network model can automatically learn different hierarchical layers of abstract features, and achieve remarkable results in computer vision [26], restoration [27], HAR [20] and in other fields [28].
Several advanced algorithms have been evaluated on HAR in the last few years. The hand-crafted features method [29] uses simple statistical values (e.g., standard deviation, mean, max, min, median) or frequency-domain features based on the Fourier transform of the signal to analyze the time series of human activity recognition data. Because it is simple to set up and has a low computational cost, it is still used in some areas, but its accuracy cannot satisfy the requirements of modern AI games. In addition, for the recognition of complex high-level activities, identifying the relevant features with these traditional approaches is time-consuming [19].
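The kind of hand-crafted features mentioned above can be sketched in a few lines of NumPy. This is an illustrative example only: the exact feature set of [29] is not given here, and the function name, the (T, D) window layout and the choice of the dominant frequency bin as the frequency-domain feature are assumptions.

```python
import numpy as np


def handcrafted_features(window):
    """Simple statistical and frequency-domain features for one window.

    window: array of shape (T, D), T time steps and D sensor channels, T >= 2.
    Returns a 1-D vector of 6 * D values:
    mean, std, max, min, median and dominant non-DC frequency bin per channel.
    """
    stats = [
        window.mean(axis=0),
        window.std(axis=0),
        window.max(axis=0),
        window.min(axis=0),
        np.median(window, axis=0),
    ]
    # Magnitude spectrum of each channel; row 0 is the DC component.
    spectrum = np.abs(np.fft.rfft(window, axis=0))
    # Index of the strongest non-DC frequency bin for each channel.
    dominant_bin = spectrum[1:].argmax(axis=0) + 1
    return np.concatenate(stats + [dominant_bin.astype(float)])
```

For a single-channel window containing a pure sine wave with 4 cycles over 32 samples, the last feature is frequency bin 4.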
Recently, the most popular approach to HAR has been machine learning. The authors of [1] carried out experiments to evaluate the recognition performance of supervised and unsupervised machine learning techniques. Many researchers have adopted CNNs to build human activity recognition systems, such as [7,17,20,30]. CNNs are based on the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. It is known that the power of CNNs stems in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. In addition, because they can act as feature extractors, multiple convolution operators are stacked to create a hierarchy of progressively more abstract features. A growing number of studies use variants of CNNs to learn sensor-based data representations for human activity recognition and have achieved remarkable performance [20]. The model in [31] consists of two or three temporal-convolution layers with a ReLU activation function followed by a max-pooling layer and a softmax classifier, which can be applied over all sensors simultaneously. Yang et al. [20] introduce four temporal-convolutional layers on a single sensor, followed by a fully-connected layer and a softmax classifier, and show that deeper networks can find correlations between different sensors.

III. Dual Residual Network
In this section, we describe the proposed model, which has a narrow path (Sec. ii) and a wide path (Sec. iii) whose outputs are concatenated (Sec. iv) and sent to the fully-connected layer. The loss function is introduced in Sec. v.

i. Network Architecture
As shown in Fig. 1, convolutional layers and residual layers in our architecture are used to model the recognition task. In the convolutional layers, general activity features are extracted from the raw sensor data. In the residual layers, specific features are extracted from the general features, and these specific features are used for human activity recognition.
The convolutional layers (i.e., conv. layers) of our architecture consist of one layer with 64 sliding windows (filters) of size s = 5 × 1, a batch normalization layer, and a ReLU layer, together with a pooling layer [19]. The residual layers contain four "blocks". The details of the residual blocks are shown in Table 2, and the values of $n_1$, $n_2$, $n_3$, $n_4$ are set to 3, 4, 6, 3, respectively.
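To make the idea of residual learning concrete, here is a minimal NumPy sketch of a block with an identity shortcut and of a stage that stacks several such blocks. It is a simplification under stated assumptions: dense (matrix) transforms stand in for the convolutions, batch normalization is omitted, and the function names and shapes are hypothetical. The actual layer parameters of the DRN are those of Table 2.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two linear maps with a ReLU in between.

    x:  array of shape (T, C), T time steps and C feature channels.
    w1: array of shape (C, C); w2: array of shape (C, C).
    The identity shortcut requires F(x) to have the same shape as x.
    """
    hidden = relu(x @ w1)
    return relu(hidden @ w2 + x)


def residual_stage(x, block_weights):
    """Apply a sequence of residual blocks, e.g. n = 3, 4, 6 or 3 blocks per stage."""
    for w1, w2 in block_weights:
        x = residual_block(x, w1, w2)
    return x
```

When the residual branch contributes nothing (all-zero weights), each block reduces to ReLU of its input, which is the property that lets deep stacks of such blocks fall back to an identity-like mapping.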

ii. Narrow Path
The narrow path can be any convolutional model that works on sequence data as a spatiotemporal volume. Examples include [32], which introduced Two-Stream Inflated 3D Convolutional Networks, in which the filters and pooling kernels of very deep image classification convolutional networks are expanded into 3D; [33], which introduced spatiotemporal ResNets as a combination of two-stream convolutional networks and ResNets; and [34], which introduced non-local operations as a generic family of building blocks for capturing long-range dependencies. The key concept in our narrow path is a short sliding time window that scans the sequential activity data. A typical value of T that we study is 32 [35]. Denoting the number of sensor channels as S, the raw clip size is T × S. This path feeds compact information into the network in order to capture spatial features.

iii. Wide Path
In parallel to the narrow path, the wide path is another convolutional model with a long sliding time window. Both paths operate on the same raw activity data sequences; the wide path uses a sliding time window of length αT, α times longer than that of the narrow path. A typical value in our experiments is α = 3 [36]. Our wide path feeds a long sequence of activity data into the network in order to pursue global features throughout the network hierarchy. Our wide path is distinguished from existing methods in that it can use a significantly lower channel capacity while still achieving good accuracy for the DRN model. The low channel capacity can also be interpreted as a weaker ability to represent spatial semantics. Our wide path not only has a long sliding time window, but also pursues high-dimensional features throughout the network hierarchy, maintaining temporal fidelity as much as possible.

iv. Lateral Concatenation
Our lateral concatenation fuses the narrow path and the wide path. We denote the representation shape of the narrow path as {T, S} and the representation shape of the wide path as {αT, S}. The two representations are fused by concatenation along the time dimension. Therefore, the shape of the concatenation layer is {(1 + α)T, S}.
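The shape bookkeeping of the two paths and their concatenation can be checked with a short NumPy sketch. This only illustrates the shapes {T, S}, {αT, S} and {(1 + α)T, S}; it operates on raw windows rather than learned representations, and the function names and the assumption that both windows start at the same time step are hypothetical.

```python
import numpy as np


def two_path_windows(sequence, start, T, alpha):
    """Cut a short window (narrow path) and a long window (wide path).

    sequence: array of shape (N, S), N time steps and S sensor channels.
    Both windows are assumed to start at the same time step `start`.
    Returns (narrow, wide) with shapes (T, S) and (alpha * T, S).
    """
    if start + alpha * T > len(sequence):
        raise ValueError("sequence too short for the wide window")
    narrow = sequence[start:start + T]
    wide = sequence[start:start + alpha * T]
    return narrow, wide


def lateral_concat(narrow_repr, wide_repr):
    """Concatenate the two representations along the time axis."""
    return np.concatenate([narrow_repr, wide_repr], axis=0)
```

For T = 32, α = 3 and S = 3 channels, the narrow window has shape (32, 3), the wide window (96, 3), and the concatenation (128, 3), which matches (1 + α)T = 128.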

v. Loss Function
To train classification models, classification objectives (such as logistic loss and softmax loss) have been widely explored. For accurate human activity recognition, predictions that differ from the ground-truth label cannot contribute to the update of the network parameters. By contrast, for tasks such as depth estimation, predictions that are close to the ground-truth labels also help to update the network parameters. In this work, we employ the softmax loss for training the human activity recognition model. For each training sequence x, the probability of each label k ∈ {1, 2, ..., K} in our model is computed via softmax:

$$p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$$

where $z_i$ are the logits or unnormalized log probabilities.
Here, the $z_i$ are computed by adding a fully-connected layer on top of the sequence data embedding, i.e., $z_i = W_i^T \phi(x) + b_i$, where $W_i$ and $b_i$ are the weights and bias for target label i, respectively. Let q(k|x) denote the ground-truth distribution over classes for this training example such that $\sum_{k=1}^{K} q(k|x) = 1$. The cross-entropy loss for the example is computed as:

$$\ell = -\sum_{k=1}^{K} q(k|x) \log p(k|x)$$
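The softmax probability and the cross-entropy loss for a single example can be sketched as follows. This is a minimal NumPy illustration, not the training code used in the experiments; the function names are hypothetical, and the subtraction of the maximum logit is a standard numerical-stability step that does not change the result.

```python
import numpy as np


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def softmax(logits):
    """p(k|x) = exp(z_k) / sum_i exp(z_i)."""
    return np.exp(log_softmax(logits))


def cross_entropy(logits, q):
    """Cross-entropy between the ground-truth distribution q and softmax(logits)."""
    return -np.sum(q * log_softmax(logits))
```

With K = 4 equal logits and a one-hot ground truth, every class gets probability 1/4 and the loss is log 4.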

IV. Experiments
To demonstrate the performance of our proposed DRN method, we carried out extensive experiments on two widely used benchmark datasets, i.e., OPPORTUNITY and UniMiB-SHAR.

i. Dataset
Human activity features are usually unique and cyclical, and natural human activities include walking, running, jumping and so on. Therefore, a set of activity data that includes a variety of types of natural human activities should be considered in dataset construction. We use benchmark datasets to validate the model performance, and use different action sequences to verify whether they belong to the same person. There are many benchmark activity datasets, such as the OPPORTUNITY [37], WISDM [38], UniMiB-SHAR [39], MHEALTH [40], and PAMAP2 [41] datasets. In this paper, we evaluate our method using the following two datasets.
The OPPORTUNITY dataset has been widely used in many studies. It contains four subjects performing 17 different (morning) Activities of Daily Living (ADLs) in a sensor-rich environment, as listed in Tables 3 and 4. The data were acquired at a sampling frequency of 30 Hz using 7 wireless body-worn inertial measurement units (IMUs). Each IMU consists of a 3D accelerometer, a 3D gyroscope and a 3D magnetic sensor. In addition, 12 3D accelerometers were placed on the back, arms, ankles and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed 5 ADL sessions and 1 drill session. During each ADL session, named "ADL1" to "ADL5", subjects were asked to perform the activities naturally. During the drill sessions, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of recordings in total, and the data are labeled on a timestamp level. The dataset was used in an open activity recognition challenge in which participants competed to achieve the highest recognition performance. In our experiments, the training and testing sets have 63 dimensions (36-D on the hand, 9-D on the back and 18-D on the ankle).
The UniMiB-SHAR dataset was collected from 30 healthy subjects (6 male and 24 female) using the 3D accelerometer of a Samsung Galaxy Nexus I9250 with Android OS version 5.1.1. It contains 11771 samples of both human activities and falls performed by 30 subjects aged from 18 to 60. The data are sampled at a constant sampling rate of 50 Hz and split into 17 different activity classes, 9 safe activities and 8 dangerous activities (e.g., a falling action), as shown in Tables 3 and 5.
The OPPORTUNITY and UniMiB-SHAR datasets were both collected in real environments. The two datasets have their own characteristics and contain different sensors. The UniMiB-SHAR dataset contains only accelerometer data, which has a low power cost. The OPPORTUNITY dataset combines accelerometer, gyroscope and magnetic sensor data, and can provide accurate limb orientation.

ii. Baseline
We compared our proposed DRN method against several classic or state-of-the-art activity recognition methods. We roughly divided these methods into two categories: conventional recognition methods, including HC [29], CBH [42], and CBS [43], and learning-based methods, including AE [44], MLP [45], CNN [20], LSTM [46], Hybrid [19], and ResNet. For the conventional methods, we use hand-crafted features; readers can find more details in [35]. For the learning-based methods, we use raw activity data as input. Following [35], the hyper-parameters of the learning-based baseline models, except ResNet, for the OPPORTUNITY and UniMiB-SHAR datasets are provided in Table 6.

iii. Implementation and Setting
Our DRN model is implemented in TensorFlow [47], a system that transfers complex data structures to artificial neural networks for analysis and processing. The computing platform is equipped with 2× Intel E5-2600 CPUs, 128 GB of RAM, and an NVIDIA TITAN Xp GPU with 12 GB of memory. The model is trained using the ADADELTA gradient descent algorithm with default parameters (i.e., an initial learning rate of 1) for 50 epochs. The batch size is set to 128. The hyper-parameters of the proposed model are provided in Table 2.
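For readers unfamiliar with ADADELTA, the following NumPy sketch shows one parameter update as the rule is commonly stated. It is not the TensorFlow implementation used in the experiments; the function name is hypothetical, and the decay rate rho = 0.95 and epsilon = 1e-6 are assumed values, since only the learning rate of 1 is given in the text.

```python
import numpy as np


def adadelta_step(param, grad, state, lr=1.0, rho=0.95, eps=1e-6):
    """One ADADELTA update.

    state is a pair (avg_sq_grad, avg_sq_delta) of running averages,
    both initialised to zero. rho and eps are assumed values.
    Returns the updated parameter and the updated state.
    """
    avg_sq_grad, avg_sq_delta = state
    # Running average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Step scaled by the ratio of the two running RMS values.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # Running average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    return param + lr * delta, (avg_sq_grad, avg_sq_delta)
```

A usage example: minimising f(x) = x² from x = 1 uses the gradient 2x at each step, starting from `state = (0.0, 0.0)`. The first steps are small because the running average of squared updates starts at zero.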

Sliding Time Window Size:
The length of the sliding window T is an important hyper-parameter of the proposed model. As in the baseline methods, we carried out comparative studies using T = 32 (approximately 1 s), T = 64 (approximately 2 s) and T = 96 (approximately 3 s). For the proposed model, we use T = 32 or T = 64 as the hyper-parameter of the narrow path and T = 64 or T = 96 as the hyper-parameter of the wide path, respectively.

iv. Performance Measure
ADL datasets are often highly unbalanced. The OPPORTUNITY dataset is extremely imbalanced, as the NULL class represents more than 75% of the recorded data. For this dataset, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes might skew the performance statistics to the detriment of the least represented classes. As a result, many previous studies such as [19] use an evaluation metric that is independent of the class distribution: the F1-score. The F1-score combines two measures, the precision p and the recall r: p is the number of correct positive examples divided by the number of all positive examples returned by the classifier, and r is the number of correct positive results divided by the number of all positive samples. The F1-score is the harmonic mean of p and r, with a best value of 1 and a worst value of 0. In this paper, we use an additional evaluation metric to make the comparison with them easier: the weighted F1-score (the sum of the class F1-scores, weighted by the class proportion):

$$F_w = \sum_{g} 2 w_g \frac{p_g \cdot r_g}{p_g + r_g}$$

where $p_g$ and $r_g$ are the precision and recall of class g, $w_g = N_g / N_{total}$, $N_g$ is the number of samples in class g, and $N_{total}$ is the total number of samples.
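The weighted F1-score can be computed directly from its definition. The sketch below is a plain NumPy illustration; the function name is hypothetical, and the convention of scoring a class as 0 when its precision or recall is undefined is an assumption.

```python
import numpy as np


def weighted_f1(y_true, y_pred):
    """Sum of per-class F1-scores, each weighted by the class proportion in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for g in np.unique(y_true):
        tp = np.sum((y_pred == g) & (y_true == g))
        fp = np.sum((y_pred == g) & (y_true != g))
        fn = np.sum((y_pred != g) & (y_true == g))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall > 0:
            f1 = 2.0 * precision * recall / (precision + recall)
        else:
            f1 = 0.0  # assumed convention for undefined F1
        weight = np.sum(y_true == g) / len(y_true)
        total += weight * f1
    return float(total)
```

As a worked example, for `y_true = [0, 0, 0, 1]` and `y_pred = [0, 0, 1, 1]`, class 0 has p = 1, r = 2/3 and F1 = 0.8 with weight 0.75, and class 1 has p = 0.5, r = 1 and F1 = 2/3 with weight 0.25, so the weighted F1-score is 0.6 + 1/6 ≈ 0.767.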

v. Results and discussions
In this section, we present and discuss the results.To get insight into how these methods are applied to the domain, we show the performance of these methods and evaluate some key parameters.
The weighted F1-scores of all models on OPPORTUNITY and UniMiB-SHAR are listed in Table 7. Results on these datasets show that the proposed DRN method substantially outperforms all other methods against which it was compared. Compared to the best of the conventional recognition methods (e.g., CBS), our DRN method achieves absolute boosts of 4.98% and 14.65% on the OPPORTUNITY dataset and the UniMiB-SHAR dataset, respectively. In addition, most of the learning-based recognition methods outperform the conventional recognition methods. In particular, for the OPPORTUNITY dataset, the Hybrid method achieves the best performance among the learning-based baselines, and our DRN method achieves a boost of 2.4% over it. For the UniMiB-SHAR dataset, the MLP method achieves the best performance among the learning-based baselines, and our DRN method achieves a boost of 2.48% over it. We also compared against a single-path residual network, i.e., ResNet: our DRN achieves absolute boosts of 1.26% and 1.32% on the OPPORTUNITY dataset and the UniMiB-SHAR dataset, respectively.
From Table 7, we can observe that the gap between the learning-based methods and the conventional methods is larger on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. The reason is that the OPPORTUNITY dataset has more sensor channels than the UniMiB-SHAR dataset. By carefully comparing the results, we found that our proposed method showed a higher degree of performance improvement on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. This suggests that our method is effective. We also observe from Table 7 that the length of the sliding time window has an impact on the performance of activity recognition. A short time window contains too little information. As the time window grows, it contains more information, and the accuracy improves accordingly. However, using an even longer sliding time window does not yield better recognition performance [35]. Most methods perform best in human activity recognition tasks when T = 64. The reason is that longer frames potentially contain data related to a larger number of activities, which makes their majority labeling less accurate.
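The effect of the window length on majority labeling can be seen in a small segmentation sketch. This is an illustrative NumPy example and not the preprocessing code of the experiments; the function name and the `stride` parameter are assumptions, the input is assumed to contain at least T samples, and ties are resolved in favour of the smallest label value.

```python
import numpy as np


def sliding_windows(data, labels, T, stride):
    """Segment a recording into windows of length T with majority labels.

    data:   array of shape (N, D), N time steps and D sensor channels, N >= T.
    labels: array of shape (N,), one activity label per time step.
    Returns (windows, window_labels) with shapes (W, T, D) and (W,).
    """
    windows, window_labels = [], []
    for start in range(0, len(data) - T + 1, stride):
        segment_labels = labels[start:start + T]
        values, counts = np.unique(segment_labels, return_counts=True)
        windows.append(data[start:start + T])
        # Majority label of the window; ties go to the smallest label value.
        window_labels.append(values[counts.argmax()])
    return np.stack(windows), np.array(window_labels)
```

A window that straddles the boundary between two activities receives only one of the two labels, so the longer the window, the more samples inside it can disagree with the assigned label.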

V. Hyperparameters Evaluation
Impact of the narrow and wide path window lengths: The proposed model has two paths: a narrow path (with a short sliding time window) and a wide path (with a long sliding time window). To verify the impact of different combinations of window lengths on the results, we ran comparison experiments with three combinations (i.e., 32-64, 32-96, 64-96) on the two datasets. The weighted F1-score results are shown in Table 8. From Table 8, we can observe that on the OPPORTUNITY dataset, DRN-(32-96) outperforms DRN-(32-64) and DRN-(64-96); the minimum input size is 63 (D) × 32 (T) and the weighted F1-score is 90.29%. For the UniMiB-SHAR dataset, DRN-(32-64) outperforms DRN-(32-96) and DRN-(64-96); the minimum input size is 3 (D) × 32 (T) and the weighted F1-score is 76.39%. The performance gap between the three settings is very small, which indicates that DRN is a stable model that is not sensitive to the lengths of the sliding time windows.

VI. Conclusions
In this paper, we propose a novel dual residual network for activity recognition using wearable device data, named DRN.In order to improve the accuracy of activity recognition, our method consists of two paths.The first path uses a short time window to capture spatial features, and the second path uses a long time window to capture fine temporal features.Comprehensive experiments on the two benchmark human activity recognition datasets demonstrate that the DRN outperforms the state-of-the-art methods.
Unlike other deep learning methods, our method considers the spatial and temporal features of the data at the same time. It can effectively manage information flow and automatically learn activity feature representations. Our network is an end-to-end network whose input is the original sensor data, so the data collected by the wearable device can be fed directly into the network. This method has good application prospects. However, biometric identification, fingerprint recognition, iris recognition and other technologies have achieved more than 98% accuracy and are widely used in people's lives. By comparison, our technology is expensive and its recognition accuracy cannot yet satisfy application-level requirements. It still has much room for improvement.
In future work, we may investigate how to extract finer data features using a new model with multiple paths, and focus on dynamic data fusion algorithms that maximize the retention of data features and obtain higher recognition accuracy. We plan to run experiments on all the publicly available datasets. Finally, we want to study the recognition of emergency human activities.

Figure 1: The proposed DRN architecture for human activity recognition. T 1 and T 2 in the raw data layer denote the lengths of the time windows corresponding to the narrow path and wide path, respectively (T 1 < T 2 ). D denotes the number of sensor channels. In the conv. layers, K denotes the number of kernels in the layer, and the length of the kernels is s = 5. In the res. layers, each block doubles the number of feature maps of its input signal, which is processed by a preset number of building blocks for residual learning.
Unlike the OPPORTUNITY dataset, the UniMiB-SHAR dataset does not have any NULL class and remains relatively balanced. This allows researchers to work toward more robust features and classification schemes. In our experiments, the UniMiB-SHAR training and testing sets have 3 dimensions.

Table 1: Comparison of different HAR technologies (columns: method, manual features, high-level features, spatial features, temporal features).

Table 2: The layer parameters of the DRN model.

Table 3: The details of the experimental datasets.

Table 4: Classes and proportions of the OPPORTUNITY dataset.

Table 5: Classes and proportions of the UniMiB-SHAR dataset.

Table 7: Weighted F1-score performance of different methods on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.

Table 8: Weighted F1-score performance comparison of DRN with combinations of different sliding window lengths on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.