1. Introduction
As human beings, part of our happiness or suffering depends on our interactions with our direct environment, which is governed by an implicit or explicit set of rules [1]. Meanwhile, a growing proportion of these interactions now involve connected devices such as smartphones, smart watches, intelligent home assistants and other IoT objects, as their number and range of applications keep growing every year [2]. As a consequence, efforts are being made to improve the user experience through ever more intuitive and comfortable human-machine interfaces [3]. With the relatively recent successes of artificial-intelligence-based algorithms and the emergence of a wide range of new sensors, human action recognition (HAR) has become one of the most popular problems in the field of pattern recognition and plays a key role in the development of such interfaces. In that context, RGB, depth, skeleton and infrared stream based approaches have attracted most of the scientific community's attention, as their non-invasive nature is well suited for applications such as surveillance, and have led to the creation of commonly accepted benchmarks such as NTU-RGB-D [4], which was recently extended as NTU-RGB-D 120 [5]. On the other hand, action recognition based on visual streams can only be used within limited setups, is prone to issues such as occlusion and often requires substantial computational resources.
Admittedly less popular, HAR based on inertial sensor signals remains a relevant subject of research. Although this approach requires the users to wear some kind of device, the miniaturization of microelectromechanical systems (MEMS) means that inertial measurement units (IMU) are now seamlessly integrated into day-to-day objects such as remote controls or smartphones and into wearable technologies such as smart watches or smart clothes. Furthermore, for applications such as sports analysis, fall detection or remote physiotherapy, it provides an interesting alternative, as it can be used anywhere, offers more privacy and is considerably less expensive.
Although many different datasets exist in that branch of HAR, none has really established itself as a clear benchmark, as the placement strategies of the IMUs or the sets of activities are mostly targeted towards specific applications. In this work, we opted for the University of Texas at Dallas multimodal human action dataset (UTD-MHAD) in order to validate our approach, as we believed that it offered the most variability in terms of activities (27) and actors (eight). Not only did the proposed method achieve competitive results on the classification task using IMU signals, it was also designed to address two of the biggest challenges currently faced by HAR algorithms [6]. For that purpose, we propose a hierarchical representation of the different actions obtained by sampling the maximum and minimum activation of each convolution kernel throughout the network, which could potentially help with (1) the recognition of high-level activities characterized by large time variations in the sub-movements composing them. Moreover, as the proposed solution is built around a 1D-CNN architecture, it was also observed to require dramatically fewer operations and less memory than the most popular architectures, thus greatly contributing to (2) the portability of HAR algorithms to resource-limited electronics. Finally, in this paper, we also investigated the traceability of the features throughout the proposed framework, both in time and duration, as we believe it could play an important role in future works in making the solution more intelligible, hardware-friendly, robust and accurate. The rest of the present article observes the following structure:
Section 3 provides a detailed explanation of the method.
Section 4 details the temporal analysis, showing how feature elements can be localized in time and how they report dynamics of different durations.
Section 5 provides the results in order to validate both the performances and properties of the proposed method.
Section 6 summarizes the important results and their implications and provides ideas for future work.
3. Proposed Method
Like the majority of HAR algorithms, the main objective of our method is to provide a way to successfully classify specific actions based on a multivariate time series input. In the sensor based variation of HAR addressed here, the input time series, $X^0$, consists of a fixed number of time steps, $T_0$, each reporting a dynamic occurring at a certain moment $t$ by providing, in vector form, $\mathbf{x}^0_t$, the speed and angular rate variations, $(a_x, a_y, a_z)$ and $(\omega_x, \omega_y, \omega_z)$, with respect to a 3-dimensional $(x, y, z)$ orthogonal coordinate system whose origin corresponds to the IMU sensor. This can be expressed mathematically as:

$$X^0 = \left\{ \mathbf{x}^0_t \right\}_{t=1}^{T_0}, \qquad \mathbf{x}^0_t = \left[ a_{x,t},\, a_{y,t},\, a_{z,t},\, \omega_{x,t},\, \omega_{y,t},\, \omega_{z,t} \right]^\top \in \mathbb{R}^6 \quad (1)$$
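To fix ideas, below is a minimal PyTorch sketch of this input arrangement (our illustration, not code from the original work; the window length of 128 is a hypothetical value, as the actual $T_0$ depends on how the dataset is segmented):

```python
import torch

# Hypothetical window length; the actual T_0 depends on the dataset segmentation.
T_0 = 128

# One action sample: 3 accelerometer axes stacked with 3 gyroscope axes,
# giving the 6-dimensional vector of Equation (1) at every time step.
accel = torch.randn(3, T_0)  # speed variations (a_x, a_y, a_z)
gyro = torch.randn(3, T_0)   # angular rate variations (w_x, w_y, w_z)

# PyTorch's Conv1d expects (batch, channels, time), so X^0 becomes (1, 6, T_0).
x0 = torch.cat([accel, gyro], dim=0).unsqueeze(0)
print(x0.shape)  # torch.Size([1, 6, 128])
```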
In order to train our algorithm, the original time series, $X^0$, is conveyed to a 1D-CNN constituted of multiple convolution blocks in cascade (i.e., the output of a specific block serves as the input for the next), each applying a certain set of convolution filters, batch normalization and max pooling. From this process, $L$ other time series are created, $X^1, X^2, \ldots, X^L$. Here, $L$ is the number of convolution blocks; $K_l$ is the number of filters in the $l$th convolution block's filter bank, $W^l$ (whose filters are of size $F_l$ along the temporal axis); $T_l$ is the number of corresponding time steps of the $l$th generated time series, $X^l$. Hence, the $l$th time series is described as:

$$X^l = \left\{ \mathbf{x}^l_t \right\}_{t=1}^{T_l}, \qquad \mathbf{x}^l_t \in \mathbb{R}^{K_l} \quad (2)$$
In these blocks, convolutions are carried out without bias; thus, they can be described by the following equation:

$$\mathbf{c}^l_t = \phi\left( W^l \ast \mathbf{x}^{l-1}_{t:t+F_l-1} \right) \quad (3)$$

where $\mathbf{c}^l_t$ is the vector resulting from the application of all the convolution filters of $W^l$ to the portion of the previous time series whose vectors are between the $t$th and $(t+F_l-1)$th time steps inclusively, $\mathbf{x}^{l-1}_{t:t+F_l-1}$; $\phi$ is the SeLU activation function [17], which demonstrated better experimental results; and $\ast$ is the convolution operator, which can be equivalently expressed as Equation (4), where $\mathbf{w}^l_{k,i}$ is the $i$th column of the $k$th filter of $W^l$:

$$c^l_{t,k} = \phi\left( \sum_{i=1}^{F_l} \mathbf{w}^l_{k,i} \cdot \mathbf{x}^{l-1}_{t+i-1} \right) \quad (4)$$
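To make Equations (3) and (4) concrete, the following sketch (our illustration, not code from the original work) checks that a bias-free one-dimensional convolution in PyTorch matches the explicit sum over the $F_l$ filter columns, with the SeLU activation applied afterwards:

```python
import torch
import torch.nn.functional as F

K, C, F_l, T = 4, 6, 3, 10  # filters, input channels, kernel width, time steps
x = torch.randn(1, C, T)    # previous time series X^{l-1}
w = torch.randn(K, C, F_l)  # filter bank W^l (no bias term)

# Library version of Equation (3): bias-free convolution followed by SeLU.
out = torch.selu(F.conv1d(x, w))

# Explicit version of Equation (4): for output step t and filter k, sum the
# dot products of the i-th filter column with the (t+i)-th input vector.
t, k = 2, 1
manual = sum((w[k, :, i] * x[0, :, t + i]).sum() for i in range(F_l))
assert torch.allclose(torch.selu(manual), out[0, k, t], atol=1e-5)
```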
Subsequently applying batch normalization and max pooling to the convolution output, the elements of the time step vectors, $\mathbf{x}^l_t$, of the different time series, $X^l$, are calculated as:

$$x^l_{t,k} = \max\left( \mathrm{BN}\left(c^l_{2t-1,k}\right),\ \mathrm{BN}\left(c^l_{2t,k}\right) \right) \quad (5)$$

where $\mathrm{BN}$ is the batch normalization operation, which is defined as:

$$\mathrm{BN}\left(c_{t,k}\right) = \gamma_k\, \frac{c_{t,k} - \mu_{B,k}}{\sqrt{\sigma^2_{B,k} + \epsilon}} + \beta_k \quad (6)$$

where $\mu_{B,k}$ and $\sigma^2_{B,k}$ refer to the mean and variance of the elements of the $k$th dimension computed on the examples of the mini-batch, $B$; $\epsilon$ is an arbitrarily small constant used for numerical stability; and finally, $\gamma_k$ and $\beta_k$ are parameters learned in the optimization process in order to restore the representation power of the network [18]. It can be deduced from Equation (5) that the width of the max pooling operator is 2; hence, the length of the time series is halved after each convolution block (i.e., $T_l = T_{l-1}/2$).
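Reading Equations (3)-(6) together, one convolution block can be sketched as a small PyTorch module (a minimal rendition under our reading of the text, not the authors' released code):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """One block of the cascade: bias-free 1D convolution with SeLU
    (Equations (3)-(4)), batch normalization (Equation (6)) and width-2
    max pooling (Equation (5)), which halves the temporal length."""

    def __init__(self, in_channels, out_channels, kernel_width):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_width, bias=False)
        self.act = nn.SELU()
        self.bn = nn.BatchNorm1d(out_channels)
        self.pool = nn.MaxPool1d(kernel_size=2)

    def forward(self, x):  # x: (batch, K_{l-1}, T_{l-1})
        return self.pool(self.bn(self.act(self.conv(x))))  # (batch, K_l, T_l)
```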
Rather than connecting the output of the last convolution block to a classifier, as was done in previous 1D-CNN based works [9,10,11,12], the proposed method instead achieves high-level reasoning (inference and learning) by connecting to a classifier the maximum and minimum values of each dimension of each time series generated throughout the network. Equivalently, this concept can be regarded as generating a feature vector, $\mathbf{f}$, by sampling elements of the different time series as such:

$$\mathbf{f} = \left[ \max_t\left(x^1_{t,1}\right),\, \min_t\left(x^1_{t,1}\right),\, \ldots,\, \max_t\left(x^L_{t,K_L}\right),\, \min_t\left(x^L_{t,K_L}\right) \right] \quad (7)$$

When doing so, it is important to keep a constant ordering of the elements of the feature vector. This way, the sampled values from a specific time series associated with a specific convolution filter always exploit the same connections to the classifier. For that purpose, Algorithm 1 was used in order to create the feature vectors.
As CNNs learn convolution filters that react to specific features, these maximum and minimum activation values are correlated with specific motions. To ensure that the convolution filters learn and capture discriminating dynamics for every action class, different sets of filter banks were independently learned through multiple binary classification problems and grouped convolutions. From this process, illustrated in Figure 1, $N$ different feature vectors, $\mathbf{f}^1, \ldots, \mathbf{f}^N$, were created based on the discriminating filters learned for each of the $N$ different actions. Additionally, this process also resulted in $N$ different binary classifiers.
Algorithm 1: Feature vector harvesting method.
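The listing of Algorithm 1 is not reproduced here, but its described effect, sampling the per-filter extrema of every generated time series in a fixed order, can be sketched as follows (a hypothetical rendition, with `block_outputs` standing in for the list $X^1, \ldots, X^L$):

```python
import torch

def harvest_features(block_outputs):
    """Sample, for every generated time series X^l of shape (batch, K_l, T_l),
    the maximum and minimum activation of each of its K_l filters over time,
    concatenating them in a fixed order so that each sampled value always
    reuses the same connections to the classifier."""
    parts = []
    for x in block_outputs:                # x: (batch, K_l, T_l)
        max_vals, _ = torch.max(x, dim=2)  # per-filter maxima (and their indices)
        min_vals, _ = torch.min(x, dim=2)  # per-filter minima
        parts.extend([max_vals, min_vals])
    return torch.cat(parts, dim=1)         # (batch, 2 * sum of all K_l)
```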
As each classifier can be taken individually with its specific set of filters in order to recognize a specific action, our method is modular. Thus, it is possible to dramatically reduce the computation and memory requirements during inference if a single action or a reduced set of actions is targeted.
Although it is probable that performances could be improved by tailoring a specific architecture to each class, this avenue was not explored in the present work. Instead, each convolution group has the same architecture, which is provided by Figure 2. As illustrated, it consists of five consecutive blocks, each defined by the operations described above: bias-free convolution, SeLU activation, batch normalization and max pooling.
According to the proposed architecture, it is also important to observe that in the first block, the convolution kernels' width ($F_1$) is set to 1; hence, instantaneous dynamics are sampled. As the information moves to subsequent blocks, longer discriminating dynamics are harvested by the coaction of both larger kernels and pooling layers, which compress the temporal information while providing a certain degree of invariance towards translations and elastic distortions [19]. It is also possible to observe, in Figure 2, that each convolution layer has a total of 32 filters ($K_l = 32$), which ensures that each feature vector (depicted at the bottom) has a total of 320 elements (32 min and 32 max for each of the 5 generated time series).
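Chaining five such blocks and harvesting the extrema yields the 320-element feature vector. In the sketch below (reusing the hypothetical `ConvBlock` and `harvest_features` above), only $F_1 = 1$ and $K_l = 32$ are taken from the text; the deeper kernel widths and the window length are placeholder assumptions:

```python
import torch
import torch.nn as nn

# Only F_1 = 1 and K_l = 32 are stated in the text; the deeper kernel widths
# and the window length below are placeholder assumptions.
kernel_widths = [1, 3, 3, 3, 3]
channels = [6, 32, 32, 32, 32, 32]  # 6 IMU channels in, 32 filters per block

blocks = nn.ModuleList(
    [ConvBlock(channels[i], channels[i + 1], kernel_widths[i]) for i in range(5)]
)

x = torch.randn(1, 6, 128)  # hypothetical input window X^0
outputs = []
for block in blocks:
    x = block(x)
    outputs.append(x)       # keep X^1, ..., X^5 for feature harvesting

features = harvest_features(outputs)
print(features.shape)  # torch.Size([1, 320]): 32 min + 32 max per time series
```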
In order to train binary classifiers (see Figure 1), some relabelling had to be done. Therefore, all the examples were relabelled with respect to the different groups as 1 if the group index, $c$, matched the original multi-class label and 0 otherwise. This process is thus described by Equation (8), where $y_c$ refers to the label attributed to the examples going through the $c$th binary classifier and $z_c$ is the $c$th element of the original multi-class label expressed as the vector $\mathbf{z}$ such that the true class element is set to 1 and all others to 0 (one-hot-encoding):

$$y_c = \begin{cases} 1 & \text{if } z_c = 1 \\ 0 & \text{otherwise} \end{cases} \quad (8)$$
In the proposed method, predictions were made based on a one-vs.-all approach, which is depicted by Equation (9), where $p_c$ corresponds to the probability output for the positive class of the $c$th binary classifier:

$$\hat{y} = \operatorname*{arg\,max}_{c \in \{1, \ldots, N\}} \left( p_c \right) \quad (9)$$
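Equations (8) and (9) amount to slicing the one-hot labels and taking an arg max across the per-class positive probabilities; a minimal sketch, assuming the $N$ positive-class probabilities are stacked column-wise in `probs`:

```python
import torch

def relabel(z, c):
    """Equation (8): an example is positive (1) for the c-th binary
    classifier iff c matches the original one-hot multi-class label z."""
    return z[:, c]

def predict(probs):
    """Equation (9): one-vs.-all decision, where probs[:, c] is the
    positive-class probability output by the c-th binary classifier."""
    return torch.argmax(probs, dim=1)

# Usage: 3 examples, N = 4 classes, one-hot encoded.
z = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 0., 0.],
                  [0., 0., 0., 1.]])
print(relabel(z, c=1))  # tensor([1., 0., 0.])
```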
As for the convolution groups, the architectures of the binary classifiers are identical. As shown in Figure 3, the feature vector (whose elements are in red) first goes through a batch-normalization layer (in purple) before going through two fully connected layers (in blue). Unlike in the convolution modules, bias was allowed, and the chosen activation function was ReLU. Additionally, dropout layers (in yellow) randomly dropping half of the connections were inserted after both fully connected layers in order to reduce overfitting. Lastly, as part of each learning process, the classifiers' two outputs ($p_c$ and $1 - p_c$) were made "probabilistic" (positive and summing to one) using the softmax function (in green).
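A literal PyTorch reading of this description is sketched below; the hidden layer width is a placeholder, as the actual sizes are those of Figure 3:

```python
import torch.nn as nn

HIDDEN = 64  # placeholder width; the actual layer sizes are those of Figure 3

classifier = nn.Sequential(
    nn.BatchNorm1d(320),     # normalize the 320-element feature vector
    nn.Linear(320, HIDDEN),  # first fully connected layer (bias allowed)
    nn.ReLU(),
    nn.Dropout(p=0.5),       # randomly drops half of the connections
    nn.Linear(HIDDEN, 2),    # second fully connected layer: two class scores
    nn.Dropout(p=0.5),
    nn.Softmax(dim=1),       # outputs made positive and summing to one
)
```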
The cost function used during the training phase was the weighted cross-entropy, which is expressed by Equation (10). As this has been shown to be an effective way to deal with unbalanced data [20], the weights were set to be inversely proportional to the number of training examples of a specific class relative to the total number of training examples. Thus, by assigning a weight of 1 to the positive class' examples, the weight of the negative class' examples (label = 0) is $\frac{M_c}{M - M_c}$, where $M_c$ is the number of training examples of the specific class $c$ and $M$ is the total number of training examples:

$$\mathcal{L}_c = -\frac{1}{M} \sum_{m=1}^{M} w_{y_m} \left[ y_m \log(p_m) + (1 - y_m) \log(1 - p_m) \right], \qquad w_1 = 1, \quad w_0 = \frac{M_c}{M - M_c} \quad (10)$$
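Under this reading, the negative-class weight for the $c$th binary problem follows directly from the class counts. Below is a small sketch of the weighting scheme (our reconstruction, plugged into PyTorch's weighted negative log-likelihood for illustration):

```python
import torch
import torch.nn as nn

def class_weights(binary_labels):
    """Per-class weights for the c-th binary problem: 1 for the positive
    class and M_c / (M - M_c) for the negative one, i.e., inversely
    proportional to the number of examples of each class."""
    m = binary_labels.numel()       # M: total number of training examples
    m_c = int(binary_labels.sum())  # M_c: positive examples of class c
    return torch.tensor([m_c / (m - m_c), 1.0])  # index 0: negative, 1: positive

labels = torch.tensor([1, 0, 0, 0, 0, 0, 0, 0])
# Since the classifier above already ends in a softmax, the weighted
# cross-entropy of Equation (10) can be computed with NLLLoss applied
# to the log of the predicted probabilities.
criterion = nn.NLLLoss(weight=class_weights(labels))
# loss = criterion(torch.log(probs), labels)
```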
4. Temporal Analysis
With the proposed method, it is possible to localize in time each element of the feature vector used to make predictions and/or train the network. As a matter of fact, the functions torch.max and torch.min of the PyTorch library, used in order to create our feature vectors, return the index location of the maximum or minimum value of each row of the input tensor in a given dimension, $k$. Hence, this section explains how the sampled elements of the feature vector can be associated with a confidence interval in the original time series (IMU signals) and shows that elements sampled from different convolution blocks report dynamics of different durations.
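For reference, the index bookkeeping relied upon looks as follows (a toy call, not the authors' code):

```python
import torch

x2 = torch.randn(1, 32, 16)             # a generated time series X^l: (batch, K_l, T_l)
values, indices = torch.max(x2, dim=2)  # per-filter maxima and their time indices
# indices[0, k] is the position t at which filter k peaked; the analysis
# below maps this t back to an interval of the original IMU signal.
```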
For the first convolution block output, this analysis is trivial. As the convolution kernels of the first block are of width one, if the returned index of the min and/or max function is $t$, Equation (5) indicates that this element is related to a specific dynamic that occurred either during the $(2t-1)$th or the $(2t)$th time frame of the original time series (IMU signals). Unfortunately, the exact occurrence cannot be determined, as pooling is not a bijective operation. Furthermore, as information gets sampled from the elements generated by deeper blocks, this uncertainty grows, but remains bounded, making it possible to determine a confidence interval for the sampled features.
In order to illustrate this process, let us take a look at Figure 4, which takes the previous example one convolution block deeper. For both convolution blocks, the first line refers to the block's input; the second is the result of the convolution layer; and the third is the result of the pooling layer. In order to simplify the analysis, the time step vectors only have one dimension, and batch normalization is omitted, as it has no influence on the temporal traceability. As the second block's convolution filter has a width of three (i.e., $F_2 = 3$), if a feature is sampled from the second time series, $X^2$, with a relative position index of $t$, the corresponding dynamic consequently occurred between the $(4t-3)$th and the $(4t+4)$th time steps of the original signal.
Generally speaking, knowing that the stride of the convolutions is set to one and that the max pooling width is two, it is possible to determine the length of a certain feature based on the convolution block from which it was sampled.
Defining $D_l$, $F_l$ and $E_l$ as the duration of the features, the width of the convolution filters and the temporal uncertainty caused by max pooling of the $l$th convolution block, it is possible to inductively calculate the duration of each convolution block feature using the following set of equations:

$$D_0 = 1, \qquad E_l = 2^{l-1}, \qquad D_l = D_{l-1} + \left(F_l - 1\right) 2^{l-1} + E_l \quad (11)$$

Consequently, the confidence interval, $I$, during which the dynamic of a feature sampled at index $t$ of the $l$th time series occurred, can be expressed as:

$$I = \left[\, 2^l (t-1) + 1,\ \ 2^l (t-1) + D_l \,\right] \quad (12)$$
Table 1 gives the relative duration, $D_l$, and the relative uncertainty, $E_l$ (both expressed in original time steps), as well as the real duration of the features (knowing that the original signal was acquired at 30 Hz), based on the layer from which they were sampled and the widths of the convolution filters specified by the proposed architecture.
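The inductive computation of Equations (11) and (12) is short enough to spell out. In this sketch, only $F_1 = 1$ is fixed by the text; the deeper kernel widths are placeholders standing in for the architecture's actual values:

```python
# Only F_1 = 1 is fixed by the text; the deeper widths are placeholders.
kernel_widths = [1, 3, 3, 3, 3]
SAMPLING_RATE = 30.0  # Hz, the original acquisition rate

durations, uncertainties = [], []
d = 1  # D_0: a single original time step
for l, f in enumerate(kernel_widths, start=1):
    e = 2 ** (l - 1)                    # E_l: uncertainty added by max pooling
    d = d + (f - 1) * 2 ** (l - 1) + e  # D_l, Equation (11)
    durations.append(d)
    uncertainties.append(e)

def interval(l, t):
    """Equation (12): original time steps covered by the feature sampled
    at relative index t of the l-th time series."""
    start = 2 ** l * (t - 1) + 1
    return start, start + durations[l - 1] - 1

print(durations)                               # e.g., [2, 8, 20, 44, 92]
print([d / SAMPLING_RATE for d in durations])  # real durations in seconds
print(interval(2, 1))                          # (1, 8): consistent with Figure 4
```

With these assumed widths, the first two durations, 2 and 8 time steps, match the first-block analysis and the Figure 4 example, respectively.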