4.1. Contributions Per Stage and Reasons for Exclusion during Selection Process
During the selection process, 1243 publications complied with the Inclusion Criteria and were examined, with 52 of them reaching Stage IV because their full text complied with the Content Criteria. Table 6 illustrates the number of contributions per stage.
Referring to the predefined Content Criteria, common cases are presented in which contributions violated the criteria during the selection process.
Regarding Criterion A, two major issues kept contributions from reaching Stage IV. Approaches to vision-based activity recognition using images and videos are very prominent in the literature corpus, even though they were not explicitly searched for based on the inclusion criteria keyword list. Beyond that, contributions were excluded despite using IMUs because the sensors were not attached to the human body but to objects such as golf clubs, surfboards, ice hockey sticks, skateboards and so forth.
The keywords IMU and accelerometer led to contributions dealing with sensor-based control of robots and unmanned aerial vehicles (UAVs). Another active field of research is activity recognition of animals, e.g., dogs, horses and lizards. Both issues constitute a violation of Content Criterion B.
Regarding Content Criterion C, many contributions do not gather data in the physical world. Instead, motion patterns are taken from a 3D simulation of a human working in a digital twin of a scenario resembling a P+L facility. Researchers also use virtual reality to create a feeling of immersion and record data of a human working in this simulated environment. Thus far, there is no empirical proof known to the authors that motion patterns recorded from a human linked to a virtual reality can be used for HAR in a real-world facility.
There are fields of research related to HAR that do not deal with the quantification of human activity (Content Criterion D). Rather, they aim at qualitative assessment, e.g., with regard to ergonomics. Further examples are robot control via gestures and the analysis of human behaviour to create more realistic digital human simulation models. An active field of research related to HAR is emerging in the domain of medicine. Advances in sensor technology sparked research aiming at the detection of clinical conditions and their consequences, e.g., Parkinson's disease, Alzheimer's disease, stroke, medial knee osteoarthritis, leg prostheses, asymmetries during the gait cycle, head trauma and many more.
Contributions that did not provide a use case, and thus cannot be considered application-oriented, violated Content Criterion E. They were either surveys, and thus moved to a separate list (see Section 2), or they dealt with fundamentals of HAR, lacking the prospect of deployment in P+L from a practitioner's perspective. In cases where the application did not take place in P+L, the reviewers discussed whether its transfer to this domain seems feasible. While locomotion and handling activities that are part of daily living, e.g., using tools, seemed transferable, there were cases where this was not so. The first group of these activities were sport activities such as jumping and dribbling in basketball, boxing, alpine skiing, soccer, ball throwing in baseball and handball, and underwater sports, e.g., fly kicks. The second group consists of ADLs that do not resemble P+L activities. Most prevalent among them were dance moves, performing music with a violin or piano, and locomotion with a rolling walker, crutches or a wheelchair.
In the literature, the meaning of the term physical activity depends on the context. Consequently, the search for Activity or Human Activity often resulted in violations of Content Criterion F. Eye tracking and the recognition of facial expressions and emotions such as stress cannot be considered limb or torso movements. Work on pose tracking and path analysis did not aim to recognise the physical activity. For example, it is not necessarily the case that the employee moved on foot as he/she could have used a vehicle.
Advances in sensor technology for HAR applications are published in journals and conference proceedings. Current trends are devices for finger, head and back movement tracking that can be deployed easily. These hardware showcases may be useful for HAR in the long run, but violate Content Criterion G. Contributions that do not show practical application scenarios as they solely focus on the hardware showcase were excluded.
Contributions with no clear pattern recognition methods and non-standard performance metrics for HAR were not considered, e.g., contributions with vague procedures for data collection, pre-processing, segmentation and classification. Some contributions rely on conditional flow charts based on thresholding, or they simplify the description of the algorithms by only mentioning them.
4.2. Systematic Review of Relevant Contributions
Analysing the number of relevant publications per year reveals an increase in the second half of the observed ten-year time interval. Thus, the growing relevance of this review's scope is confirmed (see Figure 3). The peaks in the years 2016 and 2018, in particular when compared to 2017, cannot be explained by special issues of domain-specific journals. The 52 publications are spread over 41 different journals or conferences. This fact underlines that HAR for P+L benefits from a wide range of interdisciplinary expertise. The International Workshop on Sensor-based Activity Recognition and Interaction (iWOAR) accounts for three publications, making it the most frequent venue. The remaining journals or conferences account for one or two contributions each. In total, 165 unique authors were involved in writing the 52 contributions. Their countries of affiliation span 26 countries worldwide. None of the authors participated in writing more than three publications.
The results of the systematic literature review according to the predefined Categorisation Scheme are illustrated in Table 7. Entries are listed in chronologically ascending order and alphabetically according to the last name of the corresponding author.
During categorisation, it turned out that active markers are not used in any of the relevant contributions. OMMC is used in only two contributions [54]. The remaining research was done using IMUs, sometimes in conjunction with other sensor technology that is not considered in this review. Therefore, the utilised sensor technology is not explicitly stated in Table 7.
The FWCI could not be recorded for three publications [48]. They are marked with a "-". Some of the publications from 2018 were published at the end of the year; thus, it is too early to draw conclusions from their FWCI of 0. The highest FWCI is held by Bulling et al. The mean FWCI of the remaining 49 publications was calculated including those with a FWCI of 0 but excluding those where it could not be captured.
In the following, the major findings per category are summarised in two parts—Application and HAR Methods.
4.2.1. Application
The first part of the systematic review focuses on the application domain, the observed activities, the sensor attachment and the utilised datasets. When pointing out the major findings, particular emphasis is put on the contributions from the P+L domain.
Eight publications are assigned to the P+L domain, three of them in conjunction with other domains. Interestingly, all publications of the P+L domain have been published since 2013. This underlines the increasing significance and research interest in HAR for P+L. The applications cover a wide range of sectors, such as manufacturing and assembly [58], warehousing [72], construction [90] and maintenance [11]. Warehousing, in particular order picking, is the only sector in logistics. Other areas of logistics, such as packaging, in-house transport, external transport and handling processes such as loading, unloading and reloading, are not observed in the contributions. The majority of the remaining 44 contributions focus on simplistic activities or simple exercises, as pointed out in the next paragraph.
The eight contributions from the P+L domain correspond to those that cover working activities. Tao et al. and Koskimäki et al. focused solely on working activities but did not involve locomotion. Zhang and Sawchuk did not recognise the activity itself but motions that are performed during working, such as neck bending, neck extension, kneeling and so forth. The Skoda dataset utilised in [11] involves manipulative gestures in car maintenance, but does not provide labelled data of locomotion (see Section 4.2.1). However, the other datasets utilised in these contributions do involve locomotion activities. The remaining publications [72] from P+L apply HAR methods to both stationary working processes and locomotion activities. Apart from Reining et al. [85], the observed contributions in the domain of P+L assume that the activity definition is known at design time and is not going to change at run time of a HAR method (see Section 1). Attribute-based representations are proposed to address this issue, following the concept of Rueda and Fink [92] and Lampert et al. [93].
Within the entire literature corpus, locomotion is by far the most frequently represented. Forty-six contributions aim to recognise associated activities such as walking, running, standing and climbing stairs, which can be recorded without elaborate set-ups. In total, 23 contributions deal with ADL, followed by exercises with 13 occurrences. It is noticeable that more complex activities that require greater effort for recording are represented less often than simple ones. This is because the HAR method is the focus of most publications, while datasets are often regarded as a tool for evaluation. The domain in which the data were created is of secondary importance.
Among the contributions on P+L, passive MoCap sensor technology is applied only once [85]. The remaining contributions from P+L utilise IMUs. The combination of Surface Electromyography (sEMG) and IMUs was proposed by Tao et al. Using a single IMU attached to a hand or arm is proposed twice [58].
Examining the sensor attachment in the entire literature corpus, no clear picture emerges. While placing the sensors on the torso is most prominent (35), other locations are common as well. The reviewers could not establish a link between the sensor placement and the recognised activities. While some researchers attach the sensors to the arm performing a motion, e.g., drilling or opening a window, other approaches try to recognise ADLs and locomotion activities with sensors placed on the waist. The same issue also occurs the other way around, when researchers aim to recognise locomotion with IMUs placed on the hand. Out of the 21 contributions utilising smartphone sensors, 13 include the creation of individual datasets. The latter group solely takes smartphone data into consideration without deploying further sensors. Placing sensors on the head or a helmet is proposed rather rarely. Among the six contributions that do so, Wolff et al. presented the only work based on the idea of using head-worn sensors exclusively without further body-worn devices. Transferring this approach to P+L could possibly minimise the impairment of employees when performing manual work.
Seven out of the eight P+L contributions use data recorded in a laboratory. Work stations or relevant features of a facility are rebuilt to record data close to reality. Nevertheless, these data are not recorded in a real process with the employees who routinely perform the manual activities. There are four P+L contributions that utilise data recorded from real-life working routines, of which three also use laboratory data. There is a single P+L contribution that solely uses real-life data [72]. No work could be found on the issue of training a classifier with laboratory data for deployment in a real-life P+L facility. Besides, no work was found concerning transfer learning between datasets and scenarios in an industrial context. Examining the entire literature corpus, only six papers use both real-life and laboratory data, compared to 39 papers using real-life and 19 using laboratory data in total.
There are six P+L contributions using individual datasets, while data from a repository are utilised three times. A single P+L contribution uses both data sources [83]. Examining the entire literature corpus, it is striking that a majority of 38 contributions utilise individual datasets, and 34 of them do so without referring to data that are publicly available in repositories. Thus, there are four contributions using both individual and repository data [22]. As most contributions lack a recording protocol and a detailed description of the recorded activities, it remains unclear whether the motion patterns and their assigned activity labels, as well as the underlying activity definitions, are comparable.
During the review process, the authors traced information about the utilised datasets back to their origin, e.g., a repository or the contribution in which they were first mentioned. Table 8 lists publicly available datasets utilised by a single contribution. They are listed in alphabetical order, providing a reference for further reading and stating the contribution that utilises the dataset. Table 9 describes those datasets that are used by more than one contribution in greater detail.
It was noticed that the oldest dataset, Skoda from 2008, is still used eight years later, in 2016, in [11]. Even though several contributions use excerpts from the same datasets, a performance comparison between the proposed methods is not necessarily possible. In some cases, the excerpts taken from a dataset differ. One reason is that authors may exclude activities with too few subjects or solely rely on IMU data without taking other data sources into account. Since a wide variety of data is used, a comparison of classification performance seems hardly possible from a practitioner's perspective.
Irrespective of the HAR method applied, effort for dataset creation is another factor to consider in commercial applications for P+L. However, none of the authors state the effort for dataset creation. It remains unclear how much time the set-up of equipment and the recording sessions consume. The same applies to the process of annotation, in some contributions referred to as labelling. Cross tests and repetition tests may reveal the consistency of labels among several annotators and the inherent annotation error caused by vague class descriptions. In the observed contributions, the time spent on annotating and the consistency of the labels are not discussed.
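To make the notion of label consistency concrete, the following minimal Python sketch shows how agreement between two annotators could be quantified with Cohen's kappa; this measure is not taken from the reviewed contributions, and the label sequences are purely hypothetical.

```python
# Sketch: quantifying label consistency between two hypothetical annotators
# with Cohen's kappa. The label sequences below are illustrative only.
from sklearn.metrics import cohen_kappa_score

# Frame-wise activity labels assigned independently by two annotators
annotator_a = ["walk", "walk", "pick", "pick", "stand", "carry", "carry", "stand"]
annotator_b = ["walk", "walk", "pick", "stand", "stand", "carry", "walk", "stand"]

# Kappa corrects raw agreement for agreement expected by chance;
# values close to 1 indicate consistent labelling, values near 0 chance level.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```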
4.2.2. HAR Methods
The HAR methods represent the second part of the systematic review, covering the standard pattern recognition chain—pre-processing, segmentation, feature extraction and shallow classification methods—as well as deep learning and the performance metrics.
Different data representations have been used by the publications in Table 7, depending on the HAR applications and sensors. Inertial measurements recorded by IMUs are, in general, deployed for solving HAR. Usually, more than three devices are placed on the human body, e.g., on the hands, legs, head and torso. In contrast, the authors of [44] recorded acceleration measurements from only one device, which is placed on the waist. The authors of [52] proposed using the magnitude of the acceleration vector computed from its three components x, y and z. The authors of [86] used the logarithmic magnitude of a two-dimensional Discrete Fourier Transform of IMU signals and proposed utilising this magnitude as an image input for a CNN.
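The two inertial representations mentioned above can be illustrated with a minimal Python sketch; the window length and the use of a random placeholder signal are assumptions for illustration, not values taken from the cited works.

```python
# Sketch of two inertial data representations mentioned above, assuming a
# window of raw accelerometer samples with shape (time_steps, 3) for x, y, z.
import numpy as np

rng = np.random.default_rng(0)
window = rng.standard_normal((128, 3))  # placeholder for one segmented window

# (a) Magnitude of the acceleration vector per sample, cf. [52]
magnitude = np.sqrt((window ** 2).sum(axis=1))          # shape (128,)

# (b) Logarithmic magnitude of a 2D Discrete Fourier Transform, cf. [86];
#     the result can be treated as a single-channel image for a CNN.
spectrum = np.fft.fft2(window)                           # 2D DFT over time and axes
log_magnitude = np.log(np.abs(spectrum) + 1e-8)          # avoid log(0)
print(magnitude.shape, log_magnitude.shape)
```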
Otherwise, local human joint poses have also been used for HAR. In this case, Optical Motion Capturing (OMoCap) or devices such as the Kinect were deployed. In [54], the authors represented OMoCap data by using 3D joint positions and rotations specifying a posture. The authors divided the human joints into five groups according to the main human body parts. In [85], the authors interpreted the joint positions and orientations of OMoCap data as multichannel time series, where each joint pose component x, y and z is considered individually. This approach is similar to how acceleration data are handled. The authors of [81] used quaternion and Euler angle representations of human joints, along with their velocities and accelerations, as input data. This can also be seen as a geometrical and parametric representation of the joint poses.
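A minimal sketch of such a joint-pose representation is given below, converting per-joint quaternions to Euler angles and stacking the pose components into a multichannel time series; the joint count, quaternion order and angle convention are assumptions for illustration.

```python
# Sketch: turning per-joint quaternions into Euler angles and flattening the
# joint poses into a multichannel time series. Joint count, quaternion order
# (x, y, z, w) and the 'xyz' angle convention are assumptions for illustration.
import numpy as np
from scipy.spatial.transform import Rotation as R

frames, joints = 200, 5
rng = np.random.default_rng(1)
quats = rng.standard_normal((frames, joints, 4))
quats /= np.linalg.norm(quats, axis=-1, keepdims=True)   # unit quaternions
positions = rng.standard_normal((frames, joints, 3))      # 3D joint positions

# Euler angles per frame and joint
euler = R.from_quat(quats.reshape(-1, 4)).as_euler("xyz").reshape(frames, joints, 3)

# Multichannel time series: each position/orientation component is one channel
channels = np.concatenate([positions, euler], axis=-1).reshape(frames, -1)
print(channels.shape)  # (200, 5 * 6) -> 30 channels over 200 frames
```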
Due to different characteristics of the sensors, sampling rates, units, random noise or malfunctioning, pre-processing approaches are needed. The authors of [71] used a median filter for smoothing the signal. In [47], a third-order average filter was deployed for reducing random noise. Low- and high-pass filtering have been used for separating the acceleration components due to body movements and gravity, and for eliminating noise. The authors of [44] argued that the low-frequency component of the acceleration is due to gravity, and the high-frequency component is due to the dynamic motion of the human body. In [84], a low-pass Butterworth filter was used for such separation. The authors of [57] computed the gravity component by averaging the acceleration measurements, which can be seen as taking the zero-frequency component. The body acceleration was calculated by subtracting the gravity component from the acceleration measurements. The authors of [80] also separated these two components; however, there was no clear explanation of which method was deployed. Furthermore, the authors of [64] used a low-pass filter, the authors of [71] a third-order low-pass Butterworth filter, and the authors of [62] an average filter for reducing noise. The term noise filter was also used in [84], but there was no explanation of the particular method. In [68], a zero-mean and unit-variance normalisation, and in [83] a max-min normalisation to the range [0, 1] were carried out, as there are differences among the units and scales of the measurements. The authors of [67] did not use any pre-processing and forwarded the raw data to a convolutional neural network.
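The gravity/body separation and the two normalisation schemes described above can be sketched as follows; the sampling rate, cut-off frequency and filter order are assumptions and not values reported in the reviewed publications.

```python
# Sketch of common pre-processing steps reported above: gravity/body
# separation with a low-pass Butterworth filter and channel normalisation.
# Sampling rate, cut-off frequency and filter order are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0        # assumed sampling rate in Hz
cutoff = 0.3     # assumed cut-off frequency in Hz for the gravity component
order = 3

rng = np.random.default_rng(2)
acc = rng.standard_normal((1000, 3))           # raw tri-axial acceleration

# Low-pass Butterworth filter extracts the slowly varying gravity component
b, a = butter(order, cutoff / (fs / 2), btype="low")
gravity = filtfilt(b, a, acc, axis=0)
body = acc - gravity                            # dynamic body acceleration

# Zero-mean, unit-variance normalisation per channel (cf. [68]) ...
z_norm = (body - body.mean(axis=0)) / body.std(axis=0)
# ... or max-min normalisation to [0, 1] (cf. [83])
min_max = (body - body.min(axis=0)) / (body.max(axis=0) - body.min(axis=0))
```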
In contrast to the aforementioned works, the authors of [63] normalised the extracted features before the training stage. For example, the authors of [90] normalised the extracted features to the range [0, 1].
Segmentation refers to extracting a sequence of continuous measurements or pre-processed data that is likely to portray a human activity. In HAR, the sliding-window approach is the most common method for creating segments to be processed by a classifier. In this approach, a window is moved over the time-series data by a certain step size to extract a segment [29]. The window size directly controls the delay of the recognition system. The step size is selected according to the segmentation precision—taking into account that short activities can be skipped—and the computational effort. Table 10 shows the usual window sizes and overlap percentages, along with the sampling rates of the measurements, presented by the publications in Table 7. It is noticeable that the higher the sampling rate, the smaller the window. Publications using low sampling rates handle activities that can be seen as compositions of short-duration activities and therefore have longer durations.
In contrast, there are approaches for segmenting sequences using additional measurements or events, e.g., eye movement [29].
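A minimal sketch of the sliding-window segmentation is given below; the window and step sizes are placeholders, as the reviewed publications use varying values (see Table 10).

```python
# Sketch of the sliding-window segmentation described above. Window and step
# sizes are placeholders; the reviewed publications use varying values (Table 10).
import numpy as np

def sliding_windows(data: np.ndarray, window_size: int, step: int) -> np.ndarray:
    """Cut a (time_steps, channels) series into overlapping windows."""
    starts = range(0, data.shape[0] - window_size + 1, step)
    return np.stack([data[s:s + window_size] for s in starts])

signal = np.random.default_rng(3).standard_normal((1000, 6))  # e.g., 2 IMUs x 3 axes
# 2 s windows at 50 Hz with 50% overlap -> window_size=100, step=50
segments = sliding_windows(signal, window_size=100, step=50)
print(segments.shape)  # (19, 100, 6)
```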
In addition to segmentation, a traditional supervised-HAR pipeline includes the extraction of relevant handcrafted features, a feature reduction, a training stage and a classification. The handcrafted features should capture the intrinsic characteristics of a certain human action. Feature reduction is deployed to reduce the dimensionality of the feature space while keeping the discriminant properties of the features. In the training stage, a classifier is trained using the extracted features and the ground-truth activity labels. Finally, the classifier assigns activity classes to unseen segments.
Feature Extraction. In standard pattern recognition methods, feature extraction is an important stage. It allows representing the data in a compact manner, which helps in later classification stages. The features are divided into two main groups, statistical and application-based. Table 11 shows the statistical features that are mentioned in the publications in Table 7. Time-domain features focus on the waveform characteristics, and frequency-domain features focus on the periodic structure of the signal [90]. The Fourier transform is applied to the raw or pre-processed signals to acquire the estimated spectral density of the time series. As a remark, the authors of [64] mentioned the usage of time- and frequency-domain features, but did not specify which ones exactly.
Application-based features refer to features that were created for a certain application or dataset. These features are based on geometric, structural and kinematic relations. Table 12 shows the application-based features used by the publications in Table 7.
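A small selection of such statistical features can be computed per window as sketched below; the chosen features are illustrative and do not reproduce the feature set of any single publication.

```python
# Sketch: a handful of time- and frequency-domain features per window and
# channel, in the spirit of the statistical features listed in Table 11.
# The selection is illustrative, not the feature set of any single publication.
import numpy as np

def extract_features(window: np.ndarray) -> np.ndarray:
    """window: (time_steps, channels) -> flat feature vector."""
    feats = [
        window.mean(axis=0),                                # mean
        window.std(axis=0),                                 # standard deviation
        window.min(axis=0),                                 # minimum
        window.max(axis=0),                                 # maximum
        np.abs(np.fft.rfft(window, axis=0)).mean(axis=0),   # mean spectral magnitude
    ]
    return np.concatenate(feats)

window = np.random.default_rng(4).standard_normal((100, 6))
print(extract_features(window).shape)  # (5 features x 6 channels,) -> (30,)
```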
The authors of [44] deployed Principal Component Analysis (PCA) for reducing the dimensionality of their features. PCA is a holistic method that considers its inputs as points in a high-dimensional space and finds a lower-dimensional feature space along the directions of highest variance, where classification becomes easier. Linear Discriminant Analysis (LDA) is another holistic method that tries to overcome the drawbacks of PCA. It minimises the intra-class variance and maximises the inter-class variance of a set of inputs/features, finding an optimal projection by maximising the ratio between the inter- and intra-class variations of the inputs. Kernel Discriminant Analysis (KDA), used in [29], is a non-linear discriminating approach based on kernel techniques to find non-linear discriminating features. Quadratic Discriminant Analysis (QDA) was deployed by Siirtola and Röning [57]. The authors of [90] followed Recursive Feature Elimination (RFE) for finding the best set of features. RFE can be seen as a dense parameter search, which iteratively selects or rejects a set of features after training and deploying a classifier. A bag-of-features representation using K-means for clustering was proposed by Deng et al. [54]. They used the K-means algorithm for clustering M motion sequences. These motion sequences are 3D joint positions and orientations through m frames. Additionally, the authors of [50] utilised Random Projection (RP). Sequential forward- or backward-feature selection (SFFS) was used in [45].
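As an illustration of the feature reduction step, the following sketch applies PCA and LDA with scikit-learn to a synthetic feature matrix; the dimensionalities and class counts are placeholders.

```python
# Sketch of feature reduction with PCA and LDA (scikit-learn). The feature
# matrix, labels and number of components are placeholders for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 30))          # 300 windows, 30 handcrafted features
y = rng.integers(0, 4, size=300)            # 4 hypothetical activity classes

# PCA: unsupervised projection along the directions of highest variance
X_pca = PCA(n_components=10).fit_transform(X)

# LDA: supervised projection maximising inter-class vs. intra-class variance;
# at most (n_classes - 1) components are available.
X_lda = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)             # (300, 10) (300, 3)
```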
The authors of [52] trained an HMM per axis (x, y and z) of the pre-processed acceleration measurements, fusing them with a weighted sum. The authors of [29] also deployed HMMs, the authors of [49] used coupled HMMs, and the authors of [78] used hierarchical conditional HMMs. The authors of [44] used an NB classifier based on the Probability Density Functions (PDFs) of all of the 19 features, which were previously reduced using PCA, assuming that the features in the lower-dimensional space are not mutually correlated. The authors of [81] also used an NB classifier assuming that the features are mutually independent. Similarly, refs. [29] used NB. The authors of [45] deployed SVMs. The authors of [65] used an SVM-based binary decision tree classifier. The authors of [53] proposed a hardware-friendly SVM that is meant to be deployed on smartphone devices. The authors of [29] trained a K-Nearest Neighbor (KNN) based classifier. The authors of [81] used a Dynamic Bayesian Mixture Model (DBMM) for combining the conditional probability outputs of different classifiers, namely NB, SVM and MLP. A weight is assigned to each base classifier according to a learning process, using an uncertainty measure as a confidence level. A Multilayer Perceptron (MLP)—a network with two fully connected layers and a softmax layer—was deployed in [45]. Additionally, other classifiers, methods or approaches were used: Dynamic Time Warping (DTW) [46], Conditional Random Fields (CRF) [62], Random Forests (RF) [50], Decision Trees (DT) [48], Logistic Regression (LR) [48], the Least Squares Method [45], Gaussian Mixture Models (GMM) [64], Template Matching (TM) [75], Correlation [75] and the Euclidean distance (ED) [75]. In addition, the authors of [61] combined single classifiers by majority voting and averaging of probabilities. Joint Boosting (JB) [69], Bagging [56] and Stacking [69] were also deployed.
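To illustrate the final classification step of the standard pipeline, the sketch below trains two of the shallow classifiers named above, an SVM and a Random Forest, on synthetic handcrafted features; it does not reproduce any result of the reviewed works.

```python
# Sketch: training two of the shallow classifiers named above on handcrafted
# features. Data are synthetic placeholders; no reviewed result is reproduced.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.standard_normal((400, 30))               # feature vectors per window
y = rng.integers(0, 5, size=400)                 # hypothetical activity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("SVM accuracy:", svm.score(X_te, y_te))
print("RF accuracy:", rf.score(X_te, y_te))
```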
Deep architectures have recently been proposed for solving HAR. Differently from the standard HAR pipeline, deep architectures combine feature extraction and classification in a single approach. Their features are learned directly from data and are therefore more discriminative. Besides, they overcome some problems regarding the computation and adaptability of handcrafted features.
Convolutional Neural Networks (CNNs).
CNNs, first proposed in [107], have recently been used for solving HAR problems. CNNs combine feature extraction and classification in an end-to-end approach. They contain hierarchical structures combining convolutional operations with learnable filters and non-linear activation functions, downsampling operations and classifiers [108]. Networks for HAR are relatively small—in comparison to networks for image classification or document analysis—with a maximum of four convolutional layers and a maximum of three pooling layers. The authors of [67] used three convolutional layers and three pooling layers. As a remark, the authors of [87] did not explain their network in detail. Classification is usually performed using a softmax layer [83] or by means of an attribute representation, a sigmoid function and a distance measurement, e.g., the Euclidean, Cosine or Bray–Curtis distance [85]. The authors of [86] used spectrograms of inertial signals as image inputs for CNNs. Tao et al. [86] concatenated sEMG vectors, corresponding to muscle activation levels, to the output of the first fully connected layer and fed the result to a softmax layer.
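The following PyTorch sketch outlines a CNN of the size range described above—at most four convolutional and three pooling layers—operating on a spectrogram-like image input; all layer sizes and the input resolution are assumptions.

```python
# Sketch (PyTorch): a small CNN of the size range described above, taking a
# spectrogram-like single-channel image and ending in softmax classification.
# All layer sizes and the input resolution are assumptions for illustration.
import torch
import torch.nn as nn

class SmallHARCNN(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, n_classes),            # softmax is applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

x = torch.randn(4, 1, 64, 64)                     # batch of spectrogram "images"
print(SmallHARCNN()(x).shape)                     # torch.Size([4, 8])
```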
Temporal Convolutional Neural Networks (tCNNs).
In multichannel time-series HAR, the CNN input consists of a stack of segmented sequences from different sensors over a certain temporal duration. tCNNs carry out convolutions and downsampling operations along the time axis, sharing feature extractors among sensors. The authors of [14] also used a small CNN with only one convolutional layer, a pooling layer and an MLP; however, they proposed parallel branches per spatial axis x, y and z, with late fusion. The authors of [77] also used a small network with one temporal convolutional layer, one pooling layer and an MLP. The authors of [79] used a small tCNN with one convolutional layer, one pooling layer, one fully connected layer and a softmax layer. The authors of [83] used temporal convolutional networks and also proposed an architecture with different parallel branches per IMU device with late fusion. The authors of [84] proposed using an Encoder–Decoder TCNN and a dilated TCNN.
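A minimal temporal CNN in the spirit of the small variants above (one temporal convolution, one pooling layer, one fully connected layer and a softmax output) can be sketched in PyTorch as follows; the channel counts and kernel length are assumptions.

```python
# Sketch (PyTorch): a minimal temporal CNN in the spirit of the small tCNN
# variants above: one temporal convolution, one pooling layer, one fully
# connected layer and a softmax output. Sizes are assumptions.
import torch
import torch.nn as nn

class TinyTCNN(nn.Module):
    def __init__(self, n_channels: int = 6, n_classes: int = 8, window: int = 100):
        super().__init__()
        self.conv = nn.Conv1d(n_channels, 32, kernel_size=5, padding=2)  # over time
        self.pool = nn.MaxPool1d(2)
        self.fc = nn.Linear(32 * (window // 2), n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time_steps)
        h = self.pool(torch.relu(self.conv(x)))
        return self.fc(h.flatten(start_dim=1))    # logits; softmax in the loss

x = torch.randn(4, 6, 100)                         # 4 windows, 6 channels, 100 steps
print(TinyTCNN()(x).shape)                         # torch.Size([4, 8])
```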
Recurrent Neural Networks (RNNs).
The authors of [11] proposed a network with four temporal convolutional layers and two Long Short-Term Memory (LSTM) layers followed by a softmax layer. The authors of [88] proposed using dilated temporal ConvLSTMs. The network consists of one initial convolutional layer followed by three dilated temporal convolutional layers with different dilation factors. The feature maps of the last dilated temporal convolutional layer are fed to two LSTM layers followed by a softmax layer. Chen et al. [82] proposed a combination of LSTMs with raw-data input, an MLP with handcrafted-feature inputs and late fusion. Differently, Zhu et al. [91] used an LSTM network with statistical features as input.
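The combination of temporal convolutions and LSTM layers can be sketched as follows; the layer sizes are placeholders and the sketch does not replicate any of the cited architectures.

```python
# Sketch (PyTorch): temporal convolutions followed by LSTM layers and a softmax
# classifier, in the spirit of the architectures above. Sizes are placeholders.
import torch
import torch.nn as nn

class ConvLSTMHAR(nn.Module):
    def __init__(self, n_channels: int = 6, n_classes: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time_steps)
        h = self.conv(x).transpose(1, 2)           # -> (batch, time_steps, 64)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])                 # classify from the last time step

x = torch.randn(4, 6, 100)
print(ConvLSTMHAR()(x).shape)                      # torch.Size([4, 8])
```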
For training the CNNs, tCNNs and RNNs, the Adam [79] and RMSProp [11] optimisation methods were usually used. For avoiding overfitting, the authors of [86] utilised L2-regularisation, the authors of [23] added random noise to the normalised measurements, and the authors of [23] used dropout.
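The training choices mentioned above translate into a few lines of PyTorch, sketched below with assumed hyperparameter values; weight decay realises the L2 penalty.

```python
# Sketch (PyTorch): the training choices mentioned above. Hyperparameter values
# are assumptions; weight decay implements an L2 penalty on the weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Dropout(p=0.5),
                      nn.Linear(64, 8))
optim = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Alternatively: torch.optim.RMSprop(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 30)                          # one synthetic mini-batch
labels = torch.randint(0, 8, (32,))
noisy = features + 0.01 * torch.randn_like(features)    # additive input noise

optim.zero_grad()
loss = criterion(model(noisy), labels)
loss.backward()
optim.step()
print(float(loss))
```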
Following the metrics reported by the publications in Table 7, the most frequently used metric is the accuracy. However, considering the different datasets presented in the publications, most of them suffer from a class imbalance problem, that is, some classes contain more samples than others. Generally, the performances reported in the publications show relatively good results in terms of accuracy. Nevertheless, these performances must be revised. Using the F1 measures, computing the mean and weighted average of precision and recall, could give a more impartial assessment of the performance for HAR (see Table 13).
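The effect of class imbalance on the accuracy can be illustrated with a short sketch using synthetic predictions; the F1 measures reveal the weakness of a classifier that ignores the minority class.

```python
# Sketch: accuracy versus (macro and weighted) F1 on an imbalanced, synthetic
# label distribution, illustrating why accuracy alone can be misleading.
from sklearn.metrics import accuracy_score, f1_score

# 90% of the ground truth belongs to class 0; the classifier ignores class 1
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print("accuracy:   ", accuracy_score(y_true, y_pred))                       # 0.90
print("macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```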