Coarse-Fine Convolutional Deep-Learning Strategy for Human Activity Recognition

In the last decade, deep learning techniques have further improved human activity recognition (HAR) performance on several benchmark datasets. This paper presents a novel framework to classify and analyze human activities. A new convolutional neural network (CNN) strategy is applied to single-user movement recognition using a smartphone. Three parallel CNNs are used for local feature extraction, and they are later fused in the classification stage. The whole CNN scheme is based on a feature fusion of a fine-CNN, a medium-CNN, and a coarse-CNN. A tri-axial accelerometer and a tri-axial gyroscope embedded in a smartphone are used to record the acceleration and angular velocity signals. The six human activities successfully classified are walking, walking-upstairs, walking-downstairs, sitting, standing and laying. A performance evaluation is presented for the proposed CNN.


Introduction
Human Activity Recognition (HAR) is the automatic understanding of human actions performed by an individual or a group of people. It is applied in numerous areas and sectors, such as smartphones, tablets, cars, games, health, security, commercial organizations and governments [1,2]. It has always been approached using sensors, namely video cameras, infrared cameras, microphones, GPS, gyroscopes, accelerometers, proximity sensors, ultrasound sensors, light sensors, etc. [3][4][5][6]. Most of these sensors are integrated into a smartphone. In recent years, smartphones have been preferred for implementing better HAR systems [7,8] due to the increasing accuracy of their built-in sensors, their popularity, their low cost, and their wireless connectivity. For these reasons, smartphones are opening a new horizon in applications for understanding users' personal activities and their world contexts. In addition, the literature reports that HAR systems embedded in smartphones are reaching good performance; however, they have not reached 100% recognition [3][4][5][6]. The single-user activities that HAR systems typically aim to identify are walking, walking-upstairs, walking-downstairs, sitting, standing and laying, among others. The six activities identified in this article are significant because they are geared toward a future ability to assist people with disabilities in everyday household activities via a single smartphone. There are commercial HAR platforms developed by important companies such as Google (https://developers.google.com/location-context/activity-recognition/), Microsoft Azure (http://www.md2c.nl/meetup-microsoftdata-science-azure-machine-learning-workshop/), and IBM human action recognition (https://www.ibm.com/blogs/research/category/ai/).
Google "activity recognition" API identifies 2 types of
In the last decade, this research area has received significant attention due to the increasing number of HAR applications in different areas, the reduction in sensor prices, and the availability of built-in sensors in handheld devices. Human actions are identified by applying feature extraction or selection, in the time or frequency domain, to the signals detected by a smartphone's sensors. Since no known set of features ensures 100% identification of all the activities a person can perform, the problem persists and requires further attention from researchers. Sensor-based HAR research follows two approaches.
• Video camera sensor: This research area focuses on recognizing activities performed by a group of people. The most distinguished studies analyze whole videos [14][15][16][17], 3D videos [18,19] or still images [16].
• Infrared camera, microphone, GPS, gyroscope, accelerometer, proximity sensor, ultrasound sensor and light sensor: This HAR research aims to identify the activity of a single person. Survey works give an overview of the different techniques and terminologies [14][20][21][22][23].
We are interested in this emerging research area, which uses smartphone sensors for single-user activity recognition. The main works and their techniques are described as follows.

Machine Learning-Based HAR Methods
Focusing on the literature on single-person movement identification with a smartphone, the field of machine learning-based HAR methods includes a competitive work by Anguita [3], which uses statistical features such as mean, minimum, maximum, standard deviation, skewness, kurtosis, angles, entropy, correlations, energy, and energy bands. The authors used a support vector machine (SVM) as the classifier and achieved good results in identifying 6 human activities (walking, walking-upstairs, walking-downstairs, sitting, standing and laying). The second most related work is "Human activity recognition by smartphone" by Le Tuan [4]. The author used time-domain and frequency-domain features (mean, minimum, maximum, standard deviation, energy, inter-quartile range, entropy, auto-regression, correlations, skewness, kurtosis, and the energy of a frequency), obtaining a 561-feature vector as an activity descriptor, and used a naive Bayes classifier and a decision tree as classifiers. Another important method is proposed in [5], where statistical features are used. By employing time-frequency features, the authors obtained good results in identifying the same 6 human activities as the two previous works. Another relevant work is based on Bag-of-Features [6], using a hierarchical recognition scheme over motion primitives, motion vocabulary size, weighting schemes of motion primitive assignments, and learning machine kernel functions. Also, Lane's work [24] used a Bayesian classifier to identify 4 to 6 human activities (walking, walking-upstairs, walking-downstairs, sitting, standing and running). Other researchers used a k-NN classifier [25][26][27], SVM (Kim et al. [28]), Quadratic Discriminant Analysis (QDA) [29], a Multilayer Neural Network [30], a Probabilistic Neural Network [31], and Classification Rules [32].
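To make the hand-crafted descriptors used by these methods concrete, the following is a minimal sketch of a few of the time-domain statistics listed above, computed on one window of a single sensor axis. It is an illustration only and not the feature pipeline of any cited work: the function name `window_features`, the use of population moments, and the energy definition (mean of squared samples) are assumptions.

```python
import math


def window_features(window):
    """Compute a few time-domain statistics for one window of one sensor axis.

    Assumptions (not taken from the cited works): population moments are used,
    and energy is the mean of the squared samples.
    """
    n = len(window)
    mean = sum(window) / n
    variance = sum((v - mean) ** 2 for v in window) / n
    std = math.sqrt(variance)
    if std > 0:
        # Standardized third and fourth moments.
        skewness = sum(((v - mean) / std) ** 3 for v in window) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in window) / n
    else:
        # A constant window has no spread; report zeros by convention.
        skewness = kurtosis = 0.0
    energy = sum(v * v for v in window) / n
    return {
        "mean": mean,
        "min": min(window),
        "max": max(window),
        "std": std,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "energy": energy,
    }


# Toy window (hypothetical values, not real sensor data).
features = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

Classical pipelines concatenate such statistics over all axes and windows into a fixed-length descriptor, which is then passed to a classifier such as an SVM.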
Finally, there are works in which the authors applied a Hidden Markov Model for segmenting human activities [33,34]. Using the same public database as in [3], the authors obtained good results in identifying the same 6 human activities considered in this research work, and they defined an "Activity sequence modeling" to identify the relationship among activities.

Convolutional Neural Network-Based HAR Methods
A different approach to the feature extraction task is based on deep learning with CNNs, and several works have adapted it to the HAR problem. The most related work using CNNs is [35], where the authors used the "divide and conquer" paradigm and a 1D convolutional neural network to identify the actions performed by humans. Six activities are efficiently identified: walk, walk upstairs, walk downstairs, sit, stand, and lay. Despite the good classification, the authors did not achieve 100% accuracy. Another close work, by Ignatov [36], presents a user-independent deep learning-based approach for online human activity classification. Ignatov proposes to use Convolutional Neural Networks for local feature extraction together with simple statistical features that preserve information about the global form of the time series. The author investigated the impact of the time series length on the recognition accuracy and limited it to one second, which makes continuous real-time activity classification possible. The accuracy of the proposed approach is evaluated on two commonly used datasets, WISDM and UCI.
Other, less accurate HAR works that use CNNs as a base are [37][38][39][40]. There are also approaches that exploit a deep Recurrent Neural Network (RNN) [41] or that combine a Long Short-Term Memory (LSTM) RNN with a CNN. Ordonez and Roggen [42] proposed DeepConvLSTM, which combines convolutional and recurrent layers. Edel et al. [43] proposed a binarized bidirectional LSTM-RNN that reduces memory consumption and replaces most of the arithmetic operations with bitwise operations, achieving an increase in power efficiency.
Despite the variety of proposals in the HAR field using convolutional or recurrent networks, there is still room for work toward 100% recognition of human activities. In this paper, a novel framework is proposed to analyze and classify single-user activity using a smartphone, evaluated on well-known public smartphone datasets [3,13]. Our scheme is based on a coarse-fine convolutional neural network strategy, which is explained in the following section.

Proposal
The architecture of the proposed coarse-fine CNN system is shown in Figure 1. A CNN is a feed-forward neural network whose structure is inspired by the biological visual system. The main idea is the hierarchization of the visually analyzed information. On one side, the "coarse" information is perceived, i.e., circles, lines, shapes, and colors. On the other side, the "average" (medium) information is perceived and, finally, the "fine" detailed information. In the present proposal, detailed information is represented by several stages of convolution and max-pooling, while "coarse" information is represented by a single stage of convolution and pooling. The three levels of information are merged in the classification stage of the whole CNN. The overall structure of the CNNs is described below:
• Convolutional layer: In the one-dimensional case, the convolution between a signal vector $x \in \mathbb{R}^{N}$ and a kernel vector $h \in \mathbb{R}^{M}$ is a vector $c \in \mathbb{R}^{M+N-1}$, where $c = x * h$ and $*$ represents the convolution operation. Thus, in the discrete domain, the convolution is expressed as $c(n) = \sum_{m=0}^{M-1} h(m)\, x(n-m)$ for $n = 0, \ldots, M+N-2$, with $x(k) = 0$ outside $0 \le k \le N-1$. In other words, a reflected vector $h$, which is also called a convolutional filter, slides along the signal $x$; a dot product is computed at each value of $n$, and the concatenated values form the output vector $c$.
• Fully-connected layer: This stage concatenates the outputs of the three partial CNNs: a fine-CNN, a medium-CNN, and a coarse-CNN. The output of each partial CNN is flattened into a one-dimensional vector and used for the classification. In this proposal, the fully-connected stage comprises one input layer, one hidden layer, and one output layer.
• Soft-max layer: Finally, the output of the last layer is passed to a soft-max layer that computes the probability distribution over the predicted human activities: walking, walking-upstairs, walking-downstairs, sitting, standing and laying.
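The one-dimensional convolution described above can be sketched in a few lines of plain Python. This is only an illustration of the definition (a "full" convolution producing M + N − 1 output samples), not the TensorFlow implementation used in the paper; the function name `conv1d_full` is an assumption.

```python
def conv1d_full(x, h):
    """Full discrete convolution c = x * h of a signal x (length N)
    with a kernel h (length M); the result has M + N - 1 samples."""
    n_out = len(x) + len(h) - 1
    c = [0.0] * n_out
    for n in range(n_out):
        for m in range(len(h)):
            k = n - m
            # Samples of x outside 0..N-1 are treated as zero.
            if 0 <= k < len(x):
                c[n] += h[m] * x[k]
    return c


# Toy example: a length-3 signal and a length-2 kernel give 3 + 2 - 1 = 4 outputs.
c = conv1d_full([1.0, 2.0, 3.0], [1.0, 1.0])
```

Deep-learning libraries usually compute a cross-correlation (no kernel reflection) and crop or pad the output, but the sliding dot product is the same idea.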
All three partial CNNs (the fine-CNN, the medium-CNN, and the coarse-CNN) are trained jointly as a single network. Training and optimization are carried out using the back-propagation algorithm and stochastic gradient descent, respectively.
Figure 1. Proposed coarse-fine convolutional neural network topology.

System Architecture
The whole proposed CNN architecture presented in Figure 1 is fed by six signals coming from an accelerometer and a gyroscope. The input data pass through the three partial CNNs as follows:
• Fine-CNN (see Figure 2a): The first convolutional layer comprises 18 filters with a kernel $h_1$ of size 1 × 2 and a convolution step of 1. It is followed by max-pooling layer 1, with a size of 1 × 2 and a step of 2. The activation function is ReLU. The second convolutional layer comprises 18 filters with a kernel $h_2$ of size 1 × 2 and a convolution step of 1. It is followed by max-pooling layer 2, with a size of 1 × 2 and a step of 2. The activation function is ReLU. The third convolutional layer comprises 36 filters with a kernel $h_3$ of size 1 × 2 and a convolution step of 1. It is followed by max-pooling layer 3, with a size of 1 × 2 and a step of 2. The activation function is ReLU. Finally, the fourth convolutional layer comprises 36 filters with a kernel $h_4$ of size 1 × 2 and a convolution step of 1. It is followed by max-pooling layer 4, with a size of 1 × 2 and a step of 2. The activation function is ReLU.
• Medium-CNN: After the first convolutional layer, max-pooling layer 1 is applied with a size of 1 × 4 and a step of 2. The activation function is ReLU. The second convolutional layer comprises 36 filters with a kernel $h_2$ of size 1 × 2 and a convolution step of 3. It is followed by max-pooling layer 2, with a size of 1 × 4 and a step of 2. The activation function is also ReLU.
• Coarse-CNN: After the convolutional layer, max-pooling layer 1 is applied with a size of 1 × 16 and a step of 2. The activation function is ReLU.
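As a sanity check on the fine-CNN description, the sketch below traces the temporal length of a 128-sample input window through its four convolution and max-pooling stages. It assumes "same" padding, so that a layer with step s maps a length n to ceil(n / s); the padding mode is not stated in the text, and the names `FINE_CNN_STAGES` and `trace_fine_cnn` are hypothetical. Under that assumption the branch ends with 8 time steps × 36 filters, which agrees with the 8 × 36 × 3 = 864 inputs reported for the fully-connected layer.

```python
import math

# (filters, convolution step, max-pooling step) for the four fine-CNN stages,
# as listed in the text. Kernel and pooling sizes do not change the output
# length under the assumed "same" padding.
FINE_CNN_STAGES = [(18, 1, 2), (18, 1, 2), (36, 1, 2), (36, 1, 2)]


def out_len(n, step):
    """Output length of a 1D layer with 'same' padding (assumption)."""
    return math.ceil(n / step)


def trace_fine_cnn(input_len=128):
    """Return (time steps, filters) at the output of the fine-CNN branch."""
    length = input_len
    filters = None
    for filters, conv_step, pool_step in FINE_CNN_STAGES:
        length = out_len(length, conv_step)  # convolution keeps the length
        length = out_len(length, pool_step)  # pooling halves it
    return length, filters


length, filters = trace_fine_cnn()
branch_features = length * filters   # flattened size of one branch
fused_features = 3 * branch_features  # three branches concatenated
```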

The outputs of the final max-pooling layers of the three partial CNNs are then flattened. The joint vector is subsequently passed to a fully-connected layer that consists of 864 neurons (8 × 36 × 3). We applied dropout in this layer with a dropout rate of 0.00005. Finally, the outputs of the fully-connected layer are passed to a soft-max layer that computes the probability distribution over the six activity classes. The model is trained to minimize the cross-entropy loss function using the back-propagation algorithm, and the training parameters are optimized with stochastic gradient descent [44].
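The soft-max output and the cross-entropy loss mentioned above can be illustrated with a minimal sketch in plain Python. The logit values are made up for the example and the helper names `softmax` and `cross_entropy` are assumptions; this is not the paper's TensorFlow code.

```python
import math

ACTIVITIES = ["walking", "walking-upstairs", "walking-downstairs",
              "sitting", "standing", "laying"]


def softmax(logits):
    """Map raw scores to a probability distribution over the classes."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one sample with a one-hot target."""
    return -math.log(probs[true_index])


# Hypothetical logits for one window whose true class is "sitting".
logits = [0.2, -1.0, 0.1, 3.0, 1.5, -0.5]
probs = softmax(logits)
predicted = ACTIVITIES[probs.index(max(probs))]
loss = cross_entropy(probs, ACTIVITIES.index("sitting"))
```

The loss approaches zero as the probability assigned to the true class approaches one, which is the quantity that stochastic gradient descent drives down during training.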
For the proposed fine-coarse CNN, the total loss function is defined as:
$$\zeta_T = \zeta_{FMC}(\Theta_1, \Theta_2, \Theta_3) + \zeta_{CLA}(W)$$
where $\zeta_{FMC}(\Theta_1, \Theta_2, \Theta_3)$ corresponds to the loss function of the fine-CNN, medium-CNN, and coarse-CNN, and $\zeta_{CLA}(W)$ corresponds to the loss function of the whole classification layer (dropout and soft-max layers). The total loss function $\zeta_T$ can be rewritten as:
$$\zeta_T(\Omega, W) = -\sum_{i=1}^{3} \sum_{h} \log \hat{p}(m_h/\Theta_i) - \sum_{h} \log \hat{p}(m_h/W)$$
where $\Omega = \{\Theta_1, \Theta_2, \Theta_3\}$ are the parameter sets of the three partial CNNs and $W$ is the parameter set of the classification layer (dropout and soft-max layers). $m_h$ is the movement type, where $h$ = {walking, walking-upstairs, walking-downstairs, sitting, standing and laying}. $\hat{p}(m_h/\Theta_i)$ stands for the conditional probability of a given movement type conditioned on the parameter set $\Theta_i$, and $\hat{p}(m_h/W)$ stands for the conditional probability of a given movement type conditioned on the classification-layer parameter set $W$. The parameter set for each partial CNN is defined as follows:

UCI HAR Dataset
The accelerometer and gyroscope sensors built into a smartphone were used to collect tri-axial movement information from both sensors [3]. The sensor data were collected from 30 volunteers aged 19 to 49 years. Carrying a Samsung Galaxy SII smartphone in a vertical position in their pockets, each subject performed six activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. Tri-axial linear acceleration and tri-axial angular velocity data were collected, sampled at a constant rate of 50 Hz using the embedded accelerometer and gyroscope. Each realization of a single activity was divided into windows of 2.56 s, which at 50 Hz gives 128 samples per window (2.56 s × 50 Hz = 128). The database is structured into two sets: 70% of the volunteers (21 persons) were selected for training and 30% (9 persons) for testing. Table 1 shows the distribution of activities over the two sets. An example of a single recording can be found in Figure 3, where four activities are shown. Table 2 shows the experimental setup of the hyperparameters. As can be seen from the pooling size parameter, there are three vector sizes, [1 × 2], [1 × 4] and [1 × 16], for the fine-CNN, medium-CNN and coarse-CNN, respectively (see Figure 2). All signals were filtered with a digital FIR low-pass filter with a cut-off frequency of 10 Hz, and the filtered signals are used as input to the proposed neural network.
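The preprocessing described above (10 Hz low-pass FIR filtering followed by segmentation into 128-sample windows) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset authors' pipeline: the filter is a Hamming-windowed sinc design with a hypothetical length of 31 taps, and the windows are taken without overlap because the text does not specify a filter design or an overlap.

```python
import math

FS_HZ = 50.0      # sampling rate stated in the text
WINDOW_S = 2.56   # window duration stated in the text
WINDOW_LEN = int(round(FS_HZ * WINDOW_S))  # 128 samples


def lowpass_fir(cutoff_hz, fs_hz, num_taps=31):
    """Hamming-windowed sinc low-pass FIR taps (num_taps is an assumption)."""
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hamming)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unit gain at 0 Hz


def fir_filter(signal, taps):
    """Causal FIR filtering; samples before the start are taken as zero."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for m, t in enumerate(taps):
            if n - m >= 0:
                acc += t * signal[n - m]
        out.append(acc)
    return out


def split_windows(signal, window_len=WINDOW_LEN, step=WINDOW_LEN):
    """Cut a signal into fixed-length windows (non-overlapping by default)."""
    last_start = len(signal) - window_len
    return [signal[s:s + window_len] for s in range(0, last_start + 1, step)]


# Toy usage on a constant signal (hypothetical data, one sensor axis).
taps = lowpass_fir(cutoff_hz=10.0, fs_hz=FS_HZ)
filtered = fir_filter([1.0] * 384, taps)
windows = split_windows(filtered)
```

Each of the six sensor axes would be processed in the same way, giving one 128-sample window per axis as input to the network.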

Evaluation
We implemented the proposed coarse-fine convolutional deep-learning strategy for human activity recognition in Python with TensorFlow (Python 2.7, TensorFlow 1.1), running on an iMac (OS X) with an Intel Core i5 CPU. To evaluate the proposal, the influence of each partial CNN is evaluated first, and then the whole parallel CNN strategy is evaluated. The performance evaluation for the three proposed CNNs (Fine-CNN, Medium-CNN, and Coarse-CNN), as well as for the proposed merged architecture, is presented as follows.

Learning Evaluation
Evaluation covers both the training and testing tasks. For the training task, the accuracy and loss are evaluated for the four CNNs. One of the most important parameters to define is the size of the convolutional filters, which was set experimentally to [1 × 2]. Figure 4 shows the classification accuracy curve. The Coarse-Fine CNN is not very sensitive to this parameter: while the best accuracy was obtained for a filter of size [1 × 2], the accuracy does not drop significantly until the filter size exceeds [1 × 2].
Figure 5b depicts the precision performance for the same four CNNs; the proposed CNN (magenta plot) reaches the best performance, i.e., fewer iterations and the best accuracy.
Other parameters analyzed in the training task were the training-validation loss and the training-validation accuracy. Figure 6 shows their evolution over the iterations. The training task follows the "subject-dependent test" paradigm, meaning that the same dataset is used for the learning and testing tasks. In Figure 6, the solid magenta line corresponds to the proposed CNN architecture, whose accuracy reaches 100% (see Figure 6b) and whose loss reaches zero (see Figure 6a). The other colored plots correspond to the partial CNNs that are fused: red→coarse-CNN, blue→medium-CNN, and green→fine-CNN.
Figure 7 shows the testing performance for the proposed CNN. For this test, the testing dataset is completely different from the one used in the learning task; the test follows the "subject-independent test" paradigm. In Figure 7, the solid magenta line corresponds to the proposed CNN architecture, whose accuracy reaches 100%. It seems that the fusion of the partial information given by the fine-CNN, medium-CNN, and coarse-CNN makes it possible to obtain 100% correct classification of HAR activities.
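The difference between the two test paradigms comes down to how windows are assigned to the training and testing sets. The sketch below illustrates a subject-independent split in which no volunteer contributes windows to both sets, using the 21/9 partition of the 30 volunteers. The record layout, the function name `split_by_subject`, and the choice of which subject IDs go to the test set are all hypothetical.

```python
def split_by_subject(samples, test_subjects):
    """Subject-independent split: all windows of a subject go to one set."""
    test_subjects = set(test_subjects)
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test


# Hypothetical records: two windows for each of the 30 volunteers.
samples = [
    {"subject": sid, "window": w, "label": "walking"}
    for sid in range(1, 31)
    for w in range(2)
]

# Hypothetical choice of the 9 test volunteers (the actual IDs are not given here).
test_ids = range(22, 31)
train, test = split_by_subject(samples, test_ids)
train_subjects = {s["subject"] for s in train}
held_out_subjects = {s["subject"] for s in test}
```

In a subject-dependent test, by contrast, windows from the same volunteers appear in both the learning and testing data, which typically yields more optimistic accuracy figures.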
The confusion matrix for the classification of the six single-user activities in the testing task is given in Table 3. The per-activity performance is: walking 100%, ascending stairs 100%, descending stairs 100%, sitting 100%, standing 100% and laying 100%, giving a mean average of 100%.
Table 3. Testing confusion matrix of the six single user activities.
Table 4 and Figure 8 compare our coarse-fine convolutional deep-learning strategy for human activity recognition with the most competitive works reported in the literature. Note that this comparison presents the classification performance for each of the six single-user activities as well as the mean average performance. The comparison includes the two most competitive methods using a machine learning approach [3,33] and the three most competitive works using convolutional networks [35,36,38]. As can be seen, our proposal accurately recognizes (100%) each of the movements under the scheme of fusing fine, medium and coarse information from the defined convolutional neural networks.
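For reference, the per-activity percentages and the mean average reported from a confusion matrix can be computed as in the following sketch. The labels are toy values chosen for illustration (they are not the paper's results), and the function names are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_accuracy(cm):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Toy labels (hypothetical): class indices 0..5 stand for the six activities.
y_true = [0, 0, 1, 1, 2, 2, 3, 4, 5, 5]
y_pred = [0, 0, 1, 2, 2, 2, 3, 4, 5, 5]

cm = confusion_matrix(y_true, y_pred, n_classes=6)
class_acc = per_class_accuracy(cm)
mean_acc = sum(class_acc) / len(class_acc)
```

In this toy case one window of class 1 is misclassified as class 2, so class 1 scores 50% while the other classes score 100%.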

As given in Table 4, where the authors used the same database, the proposed method improves on the best result from the literature by about 2%, i.e., from 98% (San-Segundo [33]) to 100%.
Figure 8. Comparison against the most competitive works (Ronao, Anguita, Cho, Ignatov, San-Segundo, and our proposal).

WISDM Dataset
To further test our proposal, this paper uses a second standard HAR dataset, which is publicly available from the WISDM group [13]. The dataset comprises six activities: walking, jogging, walking upstairs, walking downstairs, sitting, and standing. While these activities were performed, the accelerometer was sampled at 20 Hz. The dataset description is shown in Table 5.
Table 5. WISDM dataset description [13].


Evaluation
Following the same methodology as for the proposed Coarse-fine convolutional network, and with the same parameters defined in Section 5 "Experiments", the experimentation was carried out with the second dataset, WISDM. The resulting confusion matrix for the classification of the six single-user activities in the testing task is given in Table 6. The per-activity performance is: walking 100%, jogging 100%, upstairs 100%, downstairs 100%, sitting 100% and standing 100%, giving a mean average of 100%.
Table 6. Testing confusion matrix of the six single user activities.
Table 7 compares our coarse-fine convolutional deep-learning strategy for human activity recognition with the most competitive work reported in the literature. Note that this comparison presents the classification performance for each of the six single-user activities as well as the mean average performance. The comparison is against the most competitive work using convolutional networks [45]. As can be seen, our proposal accurately recognizes (100%) each of the movements under the scheme of fusing fine, medium and coarse information from the defined convolutional neural networks.

As given in Table 7, where the author used the same database and a CNN, the proposed method improves the result by about 0.7%, i.e., from 99.33% to 100%.

Conclusions
Human activity recognition is a challenging problem. In this paper, a novel CNN framework is presented to classify single-user activities based on local feature extraction under a parallel scheme. The whole CNN strategy is based on coarse-medium-fine feature extraction followed by fusion of the features in a classification stage.
The sensors used to record the acceleration and angular velocity signals were a tri-axial accelerometer and a tri-axial gyroscope embedded in a smartphone.
Six human activities were successfully classified: walking, walking-upstairs, walking-downstairs, sitting, standing and laying, giving an average recognition of 100%.
Future work includes taking into account more complex human activities and finding associations with health issues for common physical diseases.

Conflicts of Interest:
The authors declare that there are no conflicts of interest regarding the publication of this paper.