A Hierarchical Learning Approach for Human Action Recognition

In the domain of human action recognition, existing works mainly focus on using RGB, depth, skeleton and infrared data for analysis. While these methods have the benefit of being non-invasive, they can only be used within limited setups, are prone to issues such as occlusion and often need substantial computational resources. In this work, we address human action recognition through inertial sensor signals, which have a wide range of practical applications in fields such as sports analysis and human-machine interfaces. For that purpose, we propose a new learning framework built around a 1D-CNN architecture, which we validated by achieving very competitive results on the publicly available UTD-MHAD dataset. Moreover, the proposed method provides some answers to two of the greatest challenges currently faced by action recognition algorithms, namely (1) the recognition of high-level activities and (2) the reduction of their computational cost in order to make them accessible to embedded devices. Finally, this paper also investigates the traceability of the features throughout the proposed framework, both in time and duration, as we believe it could play an important role in future works in making the solution more intelligible, hardware-friendly and accurate.


Introduction
As human beings, part of our happiness or suffering depends on our interactions with our direct environment, which are governed by an implicit or explicit set of rules [1]. Meanwhile, a growing proportion of these interactions are now targeted towards connected devices such as smart phones, smart watches, intelligent home assistants and other IoT objects, since their number and range of applications keep growing every year [2]. As a consequence of this phenomenon, efforts are made to improve the users' experience through the elaboration of ever-more intuitive and comfortable human-machine interfaces [3]. With the relatively recent successes of artificial intelligence based algorithms and the emergence of a wide range of new sensors, human action recognition (HAR) has become one of the most popular problems in the field of pattern recognition and plays a key role in the development of such interfaces. In that context, RGB, depth, skeleton and infrared stream based approaches have attracted most of the scientific community's attention, as their non-invasive nature is well suited for applications such as surveillance, and have led to the creation of commonly accepted benchmarks such as NTU-RGB-D [4], which was recently extended as NTU-RGB-D 120 [5]. On the other hand, action recognition based on visual streams can only be used within limited setups, is prone to issues such as occlusion and often needs substantial computational resources.
Admittedly less popular, HAR based on inertial sensor signals is still a relevant subject of research. Although this approach requires the users to wear some kind of device, it provides more privacy and can be used in virtually any setting, which keeps it attractive for fields such as sports analysis and human-machine interfaces. The remainder of this paper is organized as follows:
• Section 2 quickly reviews the literature.
• Section 3 provides a detailed explanation of the method.
• Section 4 details the temporal analysis, showing how feature elements can be localized in time and that they report dynamics of different durations.
• Section 5 provides the results in order to validate both the performances and the properties of the proposed method.
• Section 6 summarizes the important results and their implications and provides ideas for future work.

Neural Network Based HAR
Originally based on conventional machine learning methods such as support vector machines (SVM), decision trees or k-nearest neighbours (k-NN), early methods for HAR relied on various time- and/or frequency-domain features extracted from the raw data. However, as in many other fields, inertial sensor based HAR benefited from gradient descent based methods. Besides improving the classification capabilities, these methods also alleviated the need for feature engineering, as they learn the features and the classifier simultaneously from the annotated data. As a matter of fact, Kwapisz et al. [7] observed that multi-layer perceptrons (MLPs) were superior to decision trees and logistic regression for HAR. Similarly, Weiss et al. [8] observed that MLPs outperformed decision trees, k-NN, naive Bayes and logistic regression when training and evaluation were done on data gathered from the same user, and also noted that these user-specific models performed dramatically better than impersonal models, where the training and evaluation users differ. Since then, models based on convolutional neural networks (CNN) have been proposed for inertial sensor based HAR [9][10][11][12] and have also been proven superior to k-NN and SVM [11]. With these methods, the input signals usually go through multiple convolution and pooling layers before being fed to a classifier. Although 1D-CNN architectures are well suited for time-series data such as IMU signals, CNNs are better known in their 2D form, considering the tremendous success they have had in the field of image recognition. Consequently, Jiang et al. [13] proposed to construct an activity image based on IMU signals, in order to treat HAR as an image recognition problem. Back to 1D-CNNs, Ronao et al. [12] showed that wider kernels and a low pooling size resulted in better performances. More recently, Murad et al. [14] implemented a recurrent neural network (RNN) and demonstrated the superiority of their method over SVM and k-NN. Moreover, employing the long short-term memory (LSTM) variation of RNN along with CNNs, Hammerla et al. [15] and Ordonez et al. [16] achieved better results than with some strictly CNN based methods. However, the CNN architectures they used in their comparative studies had respectively a maximum of three and four convolution blocks. Meanwhile, Imran and Raman [9] conducted an analysis of the topology of the 1D-CNN, from which they concluded that five consecutive blocks of convolution and pooling layers were better than four. Furthermore, with their architecture and a data augmentation strategy, they also achieved state-of-the-art results on the UTD-MHAD dataset for both sensor based and multi-modal HAR. Yet, leaving data augmentation aside, the method proposed in this article slightly outperforms theirs for sensor based HAR.

Temporal Normalization
As the popular machine learning frameworks require the inputs to be of a constant shape during the training process, and as different actions usually have different durations, temporal normalization is mandatory. On that account, different strategies have been proposed. One way is to randomly sample a fixed number of frames from a longer sequence while keeping the chronological ordering [9,15]. This can also serve as a data augmentation technique, since the sampled frames can vary from one epoch to another. Another way is to sample a fixed number of consecutive frames from the action sequence [8]. In this case, the algorithms often have to learn from incomplete representations. Finally, a more conservative way to achieve temporal normalization is to extend the signals with zeros, up to a fixed length. This technique is also known as zero-padding. It has the benefit of enabling the algorithms to learn features on complete and undistorted sequences. Therefore, it is the preferred strategy in this work. Moreover, contrary to Imran and Raman [9], who reported that the performances of their CNN based method decayed as the amount of zero-padding increased, our method proved itself to be robust against it.
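As a concrete illustration, zero-padding normalization can be sketched as follows (a minimal example; the function name and the channels-first array layout are our own, not taken from the original implementation):

```python
import numpy as np

def zero_pad(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Extend a (channels, T) IMU sequence with trailing zeros up to target_len.

    The original samples are left untouched, so features are learned on
    complete, undistorted sequences (unlike random frame sampling).
    """
    channels, t = signal.shape
    if t > target_len:
        raise ValueError(f"sequence of length {t} exceeds target {target_len}")
    padded = np.zeros((channels, target_len), dtype=signal.dtype)
    padded[:, :t] = signal
    return padded

# Example: a 6-channel IMU recording (3-axis accelerometer + 3-axis gyroscope)
seq = np.random.randn(6, 180)
assert zero_pad(seq, 326).shape == (6, 326)
```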

Proposed Method
Like the majority of HAR algorithms, the main objective of our method is to successfully classify specific actions based on a multivariate time series input. In the sensor based variation of HAR addressed here, the input time series, $X_0$, consists of a fixed number of time steps, $T_0$, each reporting a dynamic occurring at a certain moment $t$ by providing, in vector form, $x_t$, the speed and angular-rate variations, $\gamma$ and $\omega$, with respect to a 3-dimensional $(x, y, z)$ orthogonal coordinate system whose origin corresponds to the IMU sensor. This can be expressed mathematically as:

$$X_0 = [\,x_1, \dots, x_{T_0}\,] \in \mathbb{R}^{6 \times T_0}, \qquad x_t = [\gamma_x^t, \gamma_y^t, \gamma_z^t, \omega_x^t, \omega_y^t, \omega_z^t]^\top \quad (1)$$

In order to train our algorithm, the original time series, $X_0$, is conveyed to a 1D-CNN constituted of multiple convolution blocks in cascade (i.e., the output of a specific block serves as the input of the next), each applying a certain set of convolution filters, batch normalization and max pooling. From this process, $L$ other time series are created, $\{X_l \in \mathbb{R}^{F_l \times T_l}\}_{l=1}^{L}$. Here, $L$ is the number of convolution blocks; $F_l$ is the number of filters in the $l$th convolution block's filter bank, $W_l \in \mathbb{R}^{F_l \times m_l \times F_{l-1}}$ (whose filters are of size $m_l$ along the temporal axis); and $T_l$ is the number of time steps of the $l$th generated time series, $X_l$. Hence, the $l$th time series is described as:

$$X_l = [\,x_1^l, \dots, x_{T_l}^l\,] \in \mathbb{R}^{F_l \times T_l} \quad (2)$$

In these blocks, convolutions are carried out without bias; thus, they can be described by the following equation:

$$z_t^l = \phi\left(W_l * [\,x_t^{l-1}, \dots, x_{t+m_l-1}^{l-1}\,]\right) \quad (3)$$

where $z_t^l$ is the vector resulting from the application of all the convolution filters of $W_l$ to the portion of the previous time series whose vectors are between the $t$th and $(t + m_l - 1)$th time steps inclusively; $\phi$ is the SeLU activation function [17], which demonstrated better experimental results; and $*$ is the convolution operator, which can be equivalently expressed as Equation (4), where $W_{l,k}^i$ is the $i$th column of the $k$th filter of $W_l$:

$$z_t^{l,k} = \phi\left(\sum_{i=1}^{m_l} (W_{l,k}^i)^\top \, x_{t+i-1}^{l-1}\right) \quad (4)$$
Subsequently applying batch normalization and max pooling to the convolution output, the elements of the time step vectors, $x_t^{l,k}$, of the different time series, $\{X_l\}_{l=1}^{L}$, are calculated as:

$$x_t^{l,k} = \max\left(\mathrm{BN}_{\Gamma_k, \beta_k}(z_{2t}^{l,k}),\; \mathrm{BN}_{\Gamma_k, \beta_k}(z_{2t+1}^{l,k})\right) \quad (5)$$

where $\mathrm{BN}_{\Gamma_k, \beta_k}(z_{2t}^{l,k})$ is the batch normalization operation, which is defined as:

$$\mathrm{BN}_{\Gamma_k, \beta_k}(z^k) = \Gamma_k \, \frac{z^k - \mu_B^k}{\sqrt{(\sigma_B^k)^2 + \epsilon}} + \beta_k \quad (6)$$

where $\mu_B^k$ and $(\sigma_B^k)^2$ refer to the mean and variance of the elements of the $k$th dimension computed on the examples of the mini-batch, $B$; $\epsilon$ is an arbitrarily small constant used for numerical stability; and finally, $\Gamma$ and $\beta$ are parameters learned in the optimization process in order to restore the representation power of the network [18]. It can be deduced from Equation (5) that the width of the max pooling operator is 2; hence, the length of the time series is halved after each convolution block (i.e., $T_l = \frac{1}{2} T_{l-1}$). Rather than connecting the output of the last convolution block to a classifier, as was done in previous 1D-CNN based works [9][10][11][12], the proposed method instead achieves high-level reasoning (inference and learning) by connecting to a classifier the maximum and minimum values of each dimension of each time series generated throughout the network. Equivalently, this concept can be regarded as generating a feature vector, $V$, by sampling elements of the different time series as such:

$$V = \left[\max_t x_t^{1,1},\; \min_t x_t^{1,1},\; \dots,\; \max_t x_t^{L,F_L},\; \min_t x_t^{L,F_L}\right] \quad (7)$$

When doing so, it is important to keep a constant ordering of the elements of the feature vector. This way, the sampled values from a specific time series associated with a specific convolution filter always exploit the same connections to the classifier. For that purpose, Algorithm 1 was used in order to create the feature vectors. As CNNs learn convolution filters that react to specific features, these maximum and minimum activation values are correlated with specific motions.
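Algorithm 1 itself is not reproduced in this excerpt, but the feature sampling it performs can be sketched with PyTorch's `torch.max` and `torch.min` (a minimal illustration; the function name is ours):

```python
import torch

def build_feature_vector(time_series: list) -> torch.Tensor:
    """Sample the max and min of every filter dimension of every
    intermediate time series {X_l}, in a fixed order, so that a given
    feature-vector slot always maps to the same filter and block.

    Each tensor in `time_series` has shape (F_l, T_l).
    """
    parts = []
    for x in time_series:
        max_vals, _ = torch.max(x, dim=1)  # per-filter maxima (indices discarded here)
        min_vals, _ = torch.min(x, dim=1)
        parts.append(max_vals)
        parts.append(min_vals)
    return torch.cat(parts)

# With 5 blocks of 32 filters each, V has 5 * (32 + 32) = 320 elements.
series = [torch.randn(32, 326 // 2 ** (l + 1)) for l in range(5)]
assert build_feature_vector(series).shape == (320,)
```

Note that `torch.max`/`torch.min` also return the index of the sampled element along the temporal dimension, which is what makes the temporal analysis of Section 4 possible.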
To ensure that the convolution filters learn and capture discriminating dynamics for every action class, different sets of filter banks were independently learned through multiple binary classification problems and grouped convolutions. From this process, illustrated in Figure 1, $N$ different feature vectors, $\{V_c\}_{c=1}^{N}$, were created based on the discriminating filters learned for each of the $N$ different actions. Additionally, this process also resulted in $N$ different binary classifiers.
As each classifier can be taken individually with its specific set of filters in order to recognize a specific action, our method is modular. Thus, it is possible to dramatically reduce the computation and memory requirements during inference if a single action, or a reduced set of actions, is targeted. Figure 1. A high-level representation of the network architecture and learning mechanisms. Each class has its own convolution group that does the feature extraction and its own binary classifier.
Although it is probable that performances could be improved by tailoring a specific architecture to each class, this avenue was not explored in the present work. Instead, each convolution group has the same architecture, which is given in Figure 2.
As illustrated, each group consists of five consecutive blocks, each defined by the operations introduced above: a convolution without bias, batch normalization, a SeLU activation and max pooling. According to the proposed architecture, it is also important to observe that in the first block, the convolution kernels' width ($m_1$) is set to 1; hence, instantaneous dynamics are sampled. As the information moves to subsequent blocks, longer discriminating dynamics are harvested by the coaction of both larger kernels and pooling layers, which compress the temporal information while providing a certain degree of invariance towards translations and elastic distortions [19]. It is also possible to observe, in Figure 2, that each convolution layer has a total of 32 filters ($F_l$), which ensures that each feature vector (depicted at the bottom) has a total of 320 elements (32 min and 32 max for each of the 5 generated time series). In order to train the binary classifiers (see Figure 1), some relabelling had to be done. Therefore, all the examples were relabelled with respect to the different groups as 1 if the group index, $c$, matched the original multi-class label and 0 otherwise.
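A convolution group matching this description can be sketched in PyTorch as follows. The kernel widths beyond the first two blocks are assumptions for illustration (this excerpt only fixes $m_1 = 1$ and, in Section 4, $m_2 = 3$; Figure 2 is not reproduced), and without 'same' padding the pooled lengths are only approximately halved:

```python
import torch
from torch import nn

def conv_block(in_ch: int, out_ch: int, kernel: int) -> nn.Sequential:
    """One block: convolution (no bias) -> batch norm -> SeLU -> max pool of width 2."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, bias=False),
        nn.BatchNorm1d(out_ch),
        nn.SELU(),
        nn.MaxPool1d(2),
    )

class ConvGroup(nn.Module):
    """Five cascaded blocks of 32 filters; returns every intermediate
    time series so that max/min features can be sampled from each."""

    def __init__(self, in_channels: int = 6, filters: int = 32,
                 kernels: tuple = (1, 3, 3, 3, 3)):  # m_1=1, m_2=3; the rest assumed
        super().__init__()
        chans = [in_channels] + [filters] * len(kernels)
        self.blocks = nn.ModuleList(
            conv_block(c_in, c_out, k)
            for c_in, c_out, k in zip(chans[:-1], chans[1:], kernels)
        )

    def forward(self, x: torch.Tensor) -> list:
        outputs = []
        for block in self.blocks:
            x = block(x)
            outputs.append(x)  # keep X_1 ... X_5 for feature sampling
        return outputs

# A batch of 2 zero-padded sequences of 326 time steps and 6 IMU channels.
outs = ConvGroup()(torch.randn(2, 6, 326))
assert len(outs) == 5 and outs[0].shape == (2, 32, 163)
```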
This process is thus described by Equation (8), where $\tilde{y}_c$ refers to the label attributed to the examples going through the $c$th binary classifier and $y_c$ is the $c$th element of the original multi-class label expressed as the vector $y \in \{0, 1\}^N$, such that the true class element is set to 1 and all others to 0 (one-hot encoding):

$$\tilde{y}_c = \begin{cases} 1 & \text{if } y_c = 1 \\ 0 & \text{otherwise} \end{cases} \quad (8)$$
In the proposed method, predictions were made based on a one-vs.-all approach, which is depicted by Equation (9), where $\hat{y}_c$ corresponds to the probability output for the positive class of the $c$th binary classifier:

$$\hat{c} = \underset{c \in \{1, \dots, N\}}{\arg\max}\; \hat{y}_c \quad (9)$$
As for the convolution groups, the architectures of the binary classifiers are identical. As shown in Figure 3, the feature vector (whose elements are in red) first goes through a batch normalization layer (in purple) before going through two fully connected layers (in blue). Unlike in the convolution modules, bias was allowed, and the chosen activation function was ReLU. Additionally, dropout layers (in yellow), randomly dropping half of the connections after both fully connected layers, were inserted in order to reduce overfitting. Lastly, as part of each learning process, the classifiers' outputs ($\hat{y}_c$ and $1 - \hat{y}_c$) were made "probabilistic" (positive and summing to one) using the softmax function (in green). The cost function used during the training phase was the weighted cross-entropy, which is expressed by Equation (10):

$$\mathcal{L}_c = -\left(w_1 \, \tilde{y}_c \log \hat{y}_c + w_0 \, (1 - \tilde{y}_c) \log(1 - \hat{y}_c)\right) \quad (10)$$

As this has been shown to be an effective way to deal with unbalanced data [20], the weights were set to be inversely proportional to the number of examples of each class in the training set.
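The classifier head and cost function described above can be sketched as follows (the hidden layer width is an assumption, since Figure 3 is not reproduced here, and the dropout placement follows the text literally); the `predict` helper implements the one-vs.-all decision of Equation (9):

```python
import torch
from torch import nn

def binary_head(feat_dim: int = 320, hidden: int = 128) -> nn.Sequential:
    """BN on the feature vector, two FC layers (bias allowed, ReLU),
    dropout of 0.5 after both FC layers, then a 2-way softmax.
    The hidden width of 128 is an assumption, not from the paper."""
    return nn.Sequential(
        nn.BatchNorm1d(feat_dim),
        nn.Linear(feat_dim, hidden),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(hidden, 2),
        nn.Dropout(0.5),       # "after both fully connected layers", per the text
        nn.Softmax(dim=1),
    )

def weighted_bce(y_hat_pos, y, w_pos, w_neg):
    """Eq. (10): cross-entropy with class weights, to counter the
    1-vs-26 imbalance created by the binary relabelling."""
    return -(w_pos * y * torch.log(y_hat_pos)
             + w_neg * (1 - y) * torch.log(1 - y_hat_pos)).mean()

def predict(pos_probs: torch.Tensor) -> torch.Tensor:
    """One-vs.-all decision: the class whose binary classifier reports
    the highest positive-class probability. pos_probs: (batch, N)."""
    return pos_probs.argmax(dim=1)
```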

Temporal Analysis
With the proposed method, it is possible to localize in time each element of the feature vector used in order to make predictions and/or train the network. As a matter of fact, the functions torch.max and torch.min of the PyTorch library, used in order to create our feature vectors, return the index location of the maximum or minimum value of each row of the input tensor in a given dimension, k. Hence, this section will explain how the sampled elements of the feature vector can be associated with a confidence interval in the original time series (IMU signals) and that elements sampled from different convolution blocks report dynamics of different durations.
For the first convolution block output, this analysis is trivial. As the convolution kernels of the first block are of width one, if the returned index of the min and/or max function is $t$, Equation (5) indicates that this element is related to a specific dynamic that occurred either during the $(2t)$th or the $(2t + 1)$th time frame of the original time series (IMU signals). Unfortunately, the exact occurrence cannot be determined, as pooling is not a bijective operation. Furthermore, as information gets sampled from the elements generated by deeper blocks, this uncertainty grows, but remains bounded, making it possible to determine a confidence interval for the sampled features.
In order to illustrate this process, let us take a look at Figure 4, which takes the previous example one convolution block deeper. For both convolution blocks, the first line refers to the block's input; the second is the result of the convolution layer; and the third is the result of the pooling layer. In order to simplify the analysis, the time step vectors only have one dimension, and batch normalization is omitted, as it has no influence on the temporal traceability. As the second block's convolution filter has a width of three (i.e., $m_2 = 3$), if a feature is sampled from the second time series, $X_2$, with a relative position index of $t$, the corresponding dynamic consequently occurred between the $(4t)$th and the $(4t + 7)$th time steps of the original signal.
Generally speaking, knowing that the stride of the convolutions is set to one and the max pooling width is two, it is possible to determine the length of a certain feature based on the convolution block from which it was sampled. Defining $D_l$, $m_l$ and $E_l$ as the duration of the features, the width of the convolution filters and the temporal uncertainty caused by the max pooling of the $l$th convolution block, it is possible to inductively calculate the duration of each convolution block's features using the following set of equations:

$$D_l = D_{l-1} + 2^{l-1}(m_l - 1), \qquad D_0 = 1$$
$$E_l = E_{l-1} + 2^{l-1}, \qquad E_0 = 0$$

Consequently, the confidence interval, $I$, during which the dynamic of a feature sampled at relative index $t$ of the $l$th time series occurred, can be expressed as:

$$I_l(t) = \left[\,2^l t,\; 2^l t + D_l + E_l - 1\,\right]$$

Table 1 gives the relative duration, $D$, the relative uncertainty, $E$ (both with respect to the original time step), and the real duration (knowing that the original signal was acquired at 30 Hz), based on the layer from which the features were sampled and the width of the convolution filters specified by the proposed architecture. Table 1. Characterisation of the uncertainty ($E$) and relative duration ($D$) of the sampled features based on the network's hyper-parameters ($l$ and $m_l$).
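These recursions can be checked numerically against the two worked examples of this section (block 1 mapping index $t$ to frames $2t$ to $2t+1$, and block 2, with $m_2 = 3$, mapping it to frames $4t$ to $4t+7$); the function names below are ours:

```python
def feature_extent(kernel_widths):
    """Inductively compute the relative duration D_l and pooling
    uncertainty E_l of each convolution block (stride 1, pool width 2)."""
    d, e = 1, 0  # D_0, E_0: a raw time step has unit duration, no uncertainty
    extents = []
    for l, m in enumerate(kernel_widths, start=1):
        d += 2 ** (l - 1) * (m - 1)
        e += 2 ** (l - 1)
        extents.append((d, e))
    return extents

def confidence_interval(l, t, kernel_widths):
    """Original-signal time steps [start, end] covered by the feature
    sampled at relative index t of the l-th time series."""
    d, e = feature_extent(kernel_widths)[l - 1]
    start = 2 ** l * t
    return start, start + d + e - 1

# Block 1 (m_1 = 1): index t maps to frames 2t .. 2t+1
assert confidence_interval(1, 5, [1, 3]) == (10, 11)
# Block 2 (m_2 = 3): index t maps to frames 4t .. 4t+7
assert confidence_interval(2, 5, [1, 3]) == (20, 27)
```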

Experiments
Using the publicly available UTD-MHAD dataset [21], which contains a total of 27 different actions, each performed 4 times by 8 different subjects (4 males and 4 females), this section will first demonstrate how our hierarchical framework makes our method robust towards zero-padding and that it is possible to train and infer on sequences of different lengths without impacting the performances. Then, we will assess the performances and compare them with state-of-the-art methods. Finally, we will conduct an analysis of the complexity of the method and see how it compares with some well-known CNN architectures.
Although the dataset provides RGB, depth and skeleton streams, our experiments were conducted only on the inertial sensor signals, which were acquired from a single IMU worn on the right wrist for Actions 1 through 21 and on the right thigh for Actions 22 through 27, as displayed in Figure 5. For all our experiments, the evaluation protocol suggested in the original UTD-MHAD paper [21] was followed, where odd-numbered subjects are used for training and even-numbered subjects for evaluation, except in Section 5.2, which will additionally provide an assessment of the performances through a leave-one-out k-fold cross-validation protocol. In all cases, the experiments were performed using the PyTorch deep learning framework with stochastic gradient descent as the optimizer and a batch size of 16. The learning rate was initialized at 0.01 and decayed by 1% per epoch. Weights were initialized using the Xavier method, and the bias of the classifier was initialized using a Gaussian distribution of mean 0 and variance 1.
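This training configuration can be sketched in PyTorch as follows (a sketch under stated assumptions: whether the uniform or normal variant of Xavier initialization was used is not specified, so the uniform variant is assumed, and the batch size of 16 would be set in the `DataLoader`):

```python
import torch
from torch import nn

def configure_training(model: nn.Module):
    """SGD with lr 0.01 decayed by 1% every epoch; Xavier weight init;
    N(0, 1) bias init where a bias exists (convolutions have none)."""
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.Linear)):
            nn.init.xavier_uniform_(module.weight)  # uniform variant assumed
            if module.bias is not None:
                nn.init.normal_(module.bias, mean=0.0, std=1.0)  # variance 1
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # gamma=0.99 multiplies the learning rate by 0.99 at every scheduler step,
    # i.e., a 1% decay per epoch when stepped once per epoch.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
    return optimizer, scheduler
```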

Robustness against Zero-Padding and Flexibility Towards the Input's Length
The idea behind our feature sampling method was to create a feature vector based on information sampled from multiple local signals instead of encoding the whole sequence into a feature vector. Hence, meaningless and/or redundant information such as zero-padding affects neither the feature vector nor the classification performances. In order to validate this hypothesis, signals were extended to various lengths with zero-padding. More precisely, training using sequences normalized to 326, 700 and 1000 frames was conducted, for which the results are reported in Figure 6. As we can see, the performances are rather steady around 83% regardless of the amount of zero-padding added to the original sequence. This is an interesting result, as Imran and Raman [9] previously conducted a 4-fold cross-validation analysis using only the subjects of the training set and observed that, using the conventional CNN approach, where only the information encoded in the last layer is provided to the classifier, zero-padding the IMU signals of the UTD-MHAD dataset caused the accuracy to decay by nearly 7%. More precisely, they transitioned from randomly sampled sequences of 107 time frames (which corresponded to the duration of the shortest action) to sequences zero-padded up to 326 time frames (which corresponded to the longest action). Intermediate lengths were also tested, where shorter sequences were zero-padded, while longer ones were randomly sampled. As observed in Table 2, their accuracy kept decreasing as the amount of zero-padding increased. Table 2. Mean accuracy of the k-fold cross-validation analysis produced by Imran et al. [9] with respect to zero-padding on their 5 layer 1D-CNN. Moreover, it was also observed that with our hierarchical method, it was possible to train the model using sequences of a certain length and proceed to classify sequences of a different length, which is impossible with the other CNN approaches.
As a matter of fact, the parameters learned at the 100th epoch of the model trained with the sequence extended to 700 time frames (see Figure 6) resulted in an accuracy of 84.88%, which was also reproduced regardless of the testing examples being zero-padded up to 326 or 1000 time frames.

Performances
With our method, we obtained a maximal accuracy score of 85.35%, which resulted in the confusion matrix provided in Figure 7. As we can see, the performances were better on actions for which the inertial sensor was placed on the thigh rather than on the wrist. In order to compare our results, we built Table 3, which indicates the accuracy of previous methods and the modality they used in order to achieve it. Furthermore, we also report the performances on the NTU-RGB-D dataset for the methods that validated their approach on it using the cross-subject/cross-view evaluation protocol (CS/CV), as it is a well-established benchmark for HAR.
As we can see, our method is very competitive, as it outperforms the current state-of-the-art method on inertial data before the latter benefited from data augmentation, a technique which, in our case, was not implemented. Another important observation that can be made by taking a closer look at Table 3 is that the best overall performance was achieved by Imran and Raman [9] through a multi-stream fusion approach including inertial signals. Moreover, in the conclusion of their work, Imran and Raman argued that it was not only the complementarity between the different modalities that helped them achieve these results, but also the complementarity between the architectures treating the different data streams, thus further stressing the importance of 1D-CNN architectures. Table 3. Comparison with the literature. CS/CV, cross-subject/cross-view.

Computational Complexity Analysis
As stated earlier, one of the greatest challenges currently faced by HAR algorithms is their portability to resource-limited electronics. We therefore analysed the computational complexity and the number of parameters of our method, and this analysis demonstrates that our approach could easily be integrated into resource-limited electronics.
As a matter of fact, the number of operations needed by our model is directly proportional to the normalized length of the input sequence. As we can see in Figure 8, the relation is linear, amounting to 86.37 million multiply-accumulate (MAC) operations when the normalized length is equal to 326. Moreover, as our method is modular, each binary classifier can be studied independently; for that purpose, the precision and recall scores for each class are given in Figure 9. For the same reason, the number of operations and parameters can be divided by 27 in order to integrate the model into a simple device recognizing only one action. In that case, the number of parameters would only be 343,830, which takes about 1.4 MB of memory if the parameters are represented with single-precision floating-point numbers, as was the case in our experiments. As for the number of operations, it comes down to about 3.2 million MACs. Since a MAC accounts for two floating-point operations (one multiplication and one addition), if a processing unit has a throughput of 1 operation per clock cycle, a processing speed of only 6.4 MHz would be necessary in order to run our model in under 1 s. As many inexpensive microcontrollers offer sufficient performances, our method could easily be embedded into smart devices. Finally, comparing our method with some of the most popular CNN architectures regarding the number of computations (Figure 10) and the number of parameters (Figure 11), we can appreciate once more how hardware-friendly it really is.
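These figures can be reproduced with a short back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the embedded-deployment figures.
total_macs = 86.37e6          # MACs for all 27 classes at input length 326
per_class_macs = total_macs / 27          # ~3.2 million MACs per binary classifier
per_class_params = 343_830                # parameters for one convolution group + head
memory_bytes = per_class_params * 4       # single-precision floats (4 bytes each)

assert round(per_class_macs / 1e6, 1) == 3.2
assert round(memory_bytes / 1e6, 1) == 1.4   # ~1.4 MB

# One MAC = 2 floating-point operations; at 1 operation per clock cycle,
# running one class in under a second needs ~6.4 MHz.
required_mhz = per_class_macs * 2 / 1e6
assert round(required_mhz, 1) == 6.4
```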

Discussion
In this paper, we presented a new framework based on a 1D-CNN architecture in order to perform action recognition on signals generated by IMUs. Although less accurate than methods using visual streams, this approach still has its advantages, as it provides more privacy and can be used anywhere, as long as wearing the acquisition device does not become too inconvenient. Moreover, we demonstrated that our method is computationally inexpensive enough to be integrated into embedded devices. On another note, as the latent representation can be viewed as an organized set of elements assessing the presence of specific sub-events, the proposed method works in a hierarchical way. The temporal relationships between these sub-events, however, are not captured implicitly or explicitly. We therefore believe that the proposed approach could help address the unsolved problem [29] of high-level action recognition with machine learning based methods, since the difficulty that previous algorithms faced with this type of action lies in the important temporal variations between the sub-events composing them. This hierarchical approach also allowed us to infer on time series extended to various lengths without compromising the accuracy.
Finally, we also demonstrated that the elements of the latent representation can be traced in time, both in location and duration. In future work, we would like to investigate this ability further, as we believe that it offers more transparency into the inner workings of the model by providing a way to interpret them. For example, this could be done by assessing whether specific kernels are in fact reacting to discriminating movements, based on the movements occurring in their confidence intervals. On top of opening a window into the black box of deep learning algorithms, we believe that this approach could help reduce over-fitting and propose better architectures, tailored to specific actions, thus improving the overall performances.