Human Activity Recognition for Production and Logistics—A Systematic Literature Review

Abstract: This contribution provides a systematic literature review of Human Activity Recognition (HAR) for Production and Logistics. An initial list of 1243 publications that complies with predefined Inclusion Criteria was surveyed by three reviewers. Fifty-two publications that comply with the Content Criteria were analysed regarding the observed activities, sensor attachment, utilised datasets, sensor technology and the applied methods of HAR. This review is focused on applications that use marker-based Motion Capturing or Inertial Measurement Units. The analysed methods can be deployed in industrial applications of Production and Logistics or transferred from related domains into this field. The findings provide an overview of the specifications of state-of-the-art HAR approaches, covering statistical pattern recognition and deep architectures, and they outline a future road map for further research from a practitioner's perspective.


Introduction
In the vision of Industry 4.0, tasks and responsibilities are shared among human employees and robots [1,2]. Nevertheless, it is expected that manual activities remain dominant in Production and Logistics (P+L) with a steady number of employees [3,4]. Automated robots are not expected to fully replace manual labour in P+L in the foreseeable future [5,6]. This is because it remains challenging to imitate the cognitive and motor skills of humans by machines [7]. Detailed information on the occurrence, duration and properties of relevant human activities is crucial to draw conclusions on how to enhance employee performance [8]. It is seen as a managerial failure not to account for human characteristics [9]. Due to advancements in sensor technology and data processing, IT-supported approaches for automated activity recognition and assessment are gaining significance [10].
Human Activity Recognition (HAR) is the task of classifying human movements. HAR methods became relevant in applications such as mobile- or ambient-assisted living, smart homes, rehabilitation, health support, and industrial settings [11][12][13][14]. HAR commonly processes signals from videos, Motion Capturing (MoCap) systems, a set of on-body sensors or other data sources [14][15][16][17]. Traditionally, methods of statistical pattern recognition have been utilised for recognising human movements [18][19][20]. These methods extract relevant hand-crafted features from pre-processed and segmented sequences, and they train a classifier for assigning action labels to the sequences. Recently, deep learning methods have been used to combine the aforementioned pipeline into a single method. They have become the state-of-the-art method for solving HAR problems in the context of gesture recognition, activities of daily living (ADL) as well as in industrial settings [11,21,22].
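The traditional chain of segmentation, hand-crafted feature extraction and classification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the function names, the toy one-channel signals, the choice of mean and standard deviation as features, and the nearest-centroid classifier, which merely stands in for whichever classifier a given publication trains.

```python
from math import dist
from statistics import mean, pstdev


def sliding_windows(signal, size, step):
    """Segment a 1-D signal into fixed-size windows (hypothetical helper)."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def extract_features(window):
    """Hand-crafted features per window: mean and standard deviation."""
    return (mean(window), pstdev(window))


def fit_centroids(feature_vectors, labels):
    """Average the feature vectors of each class (nearest-centroid stand-in)."""
    grouped = {}
    for vector, label in zip(feature_vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Assign the label of the closest class centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Toy acceleration magnitudes (invented values, not taken from any dataset).
standing = [1.0, 1.01, 0.99, 1.0] * 4
walking = [0.5, 1.5, 0.4, 1.6] * 4

train_vectors, train_labels = [], []
for label, signal in (("standing", standing), ("walking", walking)):
    for window in sliding_windows(signal, size=4, step=4):
        train_vectors.append(extract_features(window))
        train_labels.append(label)

centroids = fit_centroids(train_vectors, train_labels)
unseen_label = predict(centroids, extract_features([0.6, 1.4, 0.5, 1.5]))
```

Deep learning methods replace the explicit `extract_features` step with representations learned from the windows themselves, which is why they are described as combining the pipeline into a single method.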
Automated recognition of human activities in P+L requires their definition, for example Locomotion, Retrieval, Utility Usage [23]. The underlying assumption is that sensor patterns can be assigned explicitly to activity classes that are known at design time and that remain identical at run time [24,25]. In industrial environments, there is no finite set of activities that can be segregated unambiguously, as their definition may vary depending on the use case. As a simple example, items such as boxes or tools can be picked with the left hand, the right hand, or both hands and from different heights. Defining a separate activity for each variant results in an extensive set of activities with overlapping features. It may be of interest to differentiate between fast and slow walking, lifting a box from the ground or from a shelf, paper-based or digital pick confirmation, and so forth. The characteristics of human motion are expected to change continuously as new information and handling technology emerges and facility layouts and processes adapt constantly.
Sensor Technology for capturing human motions is divided into optical and non-optical systems [26]. Optical systems either rely on active or passive markers that are attached to a person, or they are deployed markerless, using for instance RGB(-D) cameras. Optical, marker-based Motion Capturing (OMMC) is considered more precise than markerless systems, in particular for clinical studies [27,28]. On the downside, its use is restricted to constrained laboratory settings, as vibration, dust and rapid temperature changes make it infeasible in industrial environments. Among the non-optical systems, Inertial Measurement Units (IMUs) have become highly relevant as they can be deployed in the challenging environments of P+L. They are not affected by occlusion and they do not portray human identities as in the case of videos [12]. These low-power devices are cheap, highly reliable, non-invasive and easy to use. In general, attaching sensors or markers to a person's body may lead to an impairment when performing manual activities. Therefore, minimalistic recording set-ups are preferred in industrial applications. Magnetic and mechanical systems as well as Electromyography are further options, but they are not the focus of this contribution. The examined sensor technologies are highlighted in Figure 1 and form the foundation for the literature review.
Figure 1. Sensor Technology for capturing human motions [26]. The selection covered in this contribution is highlighted.
In HAR, recording, analysing, and annotating measurements from IMUs is an expensive and time-demanding task.HAR faces challenges with regards to the settings of the recording environment, number of participants as well as sensors and their configuration [20,29].Due to the intra-and inter-class variability of human motion, a high quantity of observations from different subjects is necessary [11].Markers and sensors may be configured with different sampling rates and resolutions.Besides, their placement on the human body may vary [30].Therefore, data collection and annotation for HAR causes immense effort.In addition, raw inertial measurements are visually difficult to interpret by a human annotator.Additional video streams are necessary to make the observed activity apparent.This becomes a problem in real scenarios where video streams might not be allowed [23,29].Furthermore, annotations may be inconsistent among different annotators.Additional repetitions and validations are needed to enhance the quality of the dataset while further increasing the annotation effort.Gathering data in real scenarios requires a great expense and the data are prone to be disturbed by external factors.Nonetheless, the closeness to reality of the recorded data is ensured.On the downside, such an approach implies that the facility layout does not change until data analysis has concluded and an activity classifier has been trained.Controlled environments are appealing, as sensor measurements are less affected by noise [31,32] and recordings can be repeated under different settings and layouts that do not yet exist.Highly precise OMMC-Systems can be used in controlled environments [31].The skeleton model visualisation of a human wearing an OMMC-Suit renders synchronised videos obsolete, facilitating the annotation process.
Considering the aforementioned points from a P+L practitioner's perspective, the following guiding questions towards the current state of research are derived:
1. What is the current status of research regarding HAR for P+L and related domains from a practitioner's perspective?
2. What are the specifications of current applications regarding the sensor technology, recording environment and utilised datasets?
3. What methods of HAR are deployed?
4. What is the research gap to enhance HAR in P+L? What does the future road map look like?
The scope of this contribution is to answer those questions by conducting a systematic literature review.
The remainder of this contribution is structured as follows. In Section 2, this contribution is demarcated from related surveys to underline its novelty value. Next, the method of the literature review is presented in Section 3. In Section 4, the findings of the review are presented and assessed. This contribution concludes with a discussion of the guiding questions in Section 5.

Demarcation from Related Surveys
In total, 68 surveys and literature reviews were identified during the review process. Among those, six were identified as having a scope related to this contribution. They are listed in Table 1 to underline the novelty value of this systematic literature review. A short content description is given with regard to the commonalities and differences to the scope of this contribution. The surveys are listed in ascending chronological order.

Ref.  Year  Author & Description
[33]  2013  Lara and Labrador reviewed the state of the art in HAR based on wearable sensors. They addressed the general structure of HAR systems and design issues. Twenty-eight systems are evaluated in terms of recognition performance, energy consumption and other criteria.
[34]  2014
While all related surveys deal with HAR based on IMUs or OMMC data, P+L is not addressed. Therefore, the selection of reviewed approaches was not performed with regard to the specific demands of P+L as outlined in Section 1.

Method of Literature Review
This literature review is based on the guidelines suggested by Kitchenham and Brereton [38], Kitchenham et al. [39], Kitchenham [40] and Chen et al. [41]. The three-step pipeline of the method is illustrated in Figure 2. Based on Inclusion Criteria, a list of potentially relevant publications was created. During the selection process, three reviewers (C.R., F.N. and F.M.R.), who are experts from the domains of P+L and HAR, assigned the listed publications to one of the four stages according to predefined Content Criteria. Publications that reached Stage IV were considered relevant for the literature analysis. Their specifications were documented by the reviewers in a structured manner and they served as a point of departure for further literature search.

[Figure 2: pipeline of the review method; recoverable panel labels: Criteria, Selection Process]
The three steps of the method are explained in greater detail in the remainder of this section.

Inclusion Criteria
First, criteria for including literature were defined (see Table 2).As this contribution targets the current state of the art, original contributions are of primary interest.Surveys and literature reviews were added to a separate list (see Section 2).Nevertheless, the same Inclusion Criteria applied to both types of contribution.The literature search was conducted on the selected databases and according to feasible combinations of the listed keywords.They were defined by the reviewers.Synonyms and spelling variations of the keywords were also considered.All contributions must be published between 1 January 2009 and 31 December 2018.Going further back in time would not be expedient to capture the state of the art in a rapidly progressing field of research such as HAR.Only English publications were considered.Accepted source types were conference proceedings and peer-reviewed journals.Grey literature, such as technical reports, work in progress, homepages and student degree theses, were excluded from the reviewing process to ensure the quality of the observed material.
The creation of the initial literature list to consider during the selection process happened without interference by reviewers as the Inclusion Criteria were objective. A duplicate removal took place once all databases were searched.

Selection Process
The selection process was conducted for all potentially relevant publications that meet the Inclusion Criteria. Each document was reviewed by each of the three reviewers independently. When this subjective process led to disagreement, the Content Criteria were discussed and possibly adjusted until agreement was reached. Ultimately, the Content Criteria in Table 3 were applied for the selection process. In cases where the content of a peer-reviewed journal paper and an earlier conference publication overlapped, the journal paper was preferred. The same procedure was applied when an author wrote several papers with the same scope, refining the applied methods. In these cases, solely the newest contribution was considered. Original contributions that reached Stage IV and related surveys (see Section 2) served as a starting point for further literature search. References from the Stage IV publications and other contributions from the corresponding authors were examined according to the stages of the selection process, starting with the title. The Inclusion and Content Criteria remained the same. After examination of the reference lists and the authors' profiles, a second duplicate removal took place.
(II) Title: The title does not conflict with any Content Criteria. This is because the title either complies with the criteria or it is ambiguous.
(III) Abstract: The abstract's content does not conflict with any Content Criteria. This is because the content either complies with the criteria or necessary specifications are missing.
(IV) Full Text: Reading the full text confirms compliance with all Content Criteria. Properties of the publication are recorded in the literature overview.

Literature Analysis
Once the selection process was complete, a systematic literature analysis was conducted for all contributions assigned to the fourth stage.
In Stage V, the final list of relevant contributions was extended by including the authors' names and affiliation, year of publication, name of journal or conference and the Field Weighted Citation Impact (FWCI): "[...] FWCI is an indicator of mean citation impact, and compares the actual number of citations received by a document with the expected number of citations for documents of the same document type (article, review, book, or conference proceeding), publication year, and subject area" [43]. Therefore, recording the FWCI is expected to give valuable insights into each contribution's relevance for its respective domain. The metrics were taken from the Scopus database (www.scopus.com/sources) on 6 June 2019.
Once the general information was acquired, each contribution's striking features with regard to this review's guiding questions were analysed. Thus, Stage VI required the reviewers to read all relevant literature and briefly summarise: This procedure enabled the reviewers to get an overview of the research area. Beyond that, the application domain, the sensor technology and utilised methods and datasets were given attention when studying the literature.
Based on the contribution analysis, a categorisation scheme was derived by the reviewers in Stage VII. This scheme consists of two layers: root categories and subcategories. The initial categories were created by the reviewers. As in the selection process, points of disagreement were discussed among the reviewers. The initial categories were subject to change when allocating the contributions in the following stage.
The systematic review of the literature was accompanied by a continuous refinement of the categorisation scheme, ultimately leading to the one illustrated in Table 5. After the categorisation scheme was finalised, all publications were allocated accordingly in Stage VIII. Per root category, the criteria for multiple subcategories can be met, e.g., a contribution can consider both working activities and exercises or attach sensors and markers to different body parts at the same time.
Apart from the categories in Table 5, the evaluation metrics of HAR were also analysed. As HAR datasets are typically imbalanced, proper metrics for comparing the performance of different methods have to be chosen. In general, accuracy is used extensively for evaluating algorithms that solve classification tasks. Nevertheless, F1 measures are more suitable for evaluating the performance of HAR on highly imbalanced datasets. The F1 measure of a class is the harmonic mean of its precision and recall, and the mean F1 measure averages this value over all classes. Additionally, the weighted F1 measure considers the proportion of each class in the dataset.
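The difference between accuracy and the two F1 measures on an imbalanced dataset can be made concrete with a small sketch. The function names and the toy labels below are assumptions made for illustration; the per-class convention of returning 0 when precision or recall is undefined is also an assumption, not something prescribed by the reviewed publications.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 per class: harmonic mean of that class's precision and recall."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
        recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def mean_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores weighted by each class's share of the true labels."""
    scores = per_class_f1(y_true, y_pred)
    counts = Counter(y_true)
    return sum(counts[label] / len(y_true) * f1 for label, f1 in scores.items())


# Toy imbalanced example: the rare class "pick" is never recognised.
y_true = ["walk"] * 9 + ["pick"]
y_pred = ["walk"] * 10
```

In this toy case the classifier ignores the rare class entirely, yet accuracy stays high, whereas the mean F1 measure drops sharply because both classes count equally.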

Results
Following the method as outlined in Section 3, the results of the literature review are presented in this section. First, the results of the selection process are illustrated and major causes for paper exclusion in accordance with the Content Criteria for the selection process are pointed out. This is followed by a structural analysis of the Stage IV literature using the predefined categorisation scheme.

Contributions Per Stage and Reasons for Exclusion during Selection Process
During the selection process, 1243 publications were examined as they complied with the Inclusion Criteria, with 52 of them making it to Stage IV as their full text complied with the Content Criteria for the selection process. Table 6 illustrates the number of contributions per stage. Referring to the predefined Content Criteria, common cases are presented below in which contributions violated the criteria during the selection process.
Regarding Criterion A, two major issues kept contributions from reaching Stage IV. Approaches of vision-based activity recognition using images and videos have been very prominent in the literature corpus, even though they were not explicitly searched for based on the keyword list of the Inclusion Criteria. Beyond that, contributions were excluded despite using IMUs, because the sensors were not attached to the human body but to objects such as golf clubs, surfboards, ice hockey sticks, skateboards and so forth.
The keywords IMU and accelerometer led to contributions dealing with sensor-based control of robots and unmanned aerial vehicles (UAVs). Another active field of research is activity recognition of animals, e.g., dogs, horses and lizards. Both issues meant a violation of Content Criterion B.
In contradiction to Content Criterion C, many contributions do not gather data in the physical world. Instead, motion patterns are taken from a 3D simulation of a human working in a digital twin of a scenario resembling a P+L facility. Researchers also use virtual reality to create a feeling of immersion and record data of a human working in this simulated environment. Thus far, there is no empirical proof known to the authors that motion patterns recorded from a human linked to a virtual reality can be used for HAR in a real-world facility.
There are fields of research related to HAR that do not deal with quantification of human activity (Content Criterion D). Rather, they aim at qualitative assessment, e.g., with regard to ergonomics. Further examples are robot control via gestures and the analysis of human behaviour to create more realistic digital human simulation models. An active field of research related to HAR is emerging in the domain of medicine. Advances in sensor technology sparked research aiming at the detection of clinical pictures and their consequences, e.g., Parkinson's disease and Alzheimer's disease, stroke, medial knee osteoarthritis, leg prostheses, asymmetries during the gait cycle, head trauma and many more.
Contributions that did not provide a use case, and thus cannot be considered application-oriented, violated Content Criterion E. They were either surveys and thus moved to a separate list (see Section 2), or they dealt with fundamentals of HAR, lacking the prospect of deployment in P+L from a practitioner's perspective. Where the application did not take place in P+L, the reviewers discussed whether its transfer to this domain seems feasible. While locomotion and handling activities that are part of daily living, e.g., using tools, seemed transferable, other activities did not. The first group of these activities were sport activities such as jumping and dribbling in basketball, boxing, alpine skiing, soccer, ball throwing in baseball and handball, and underwater sports, e.g., fly kicks. The second group consists of ADLs that do not resemble P+L activities. Most present among them were dance moves, performing music with a violin or piano, or locomotion with a rolling walker, crutches or a wheelchair.
In the literature, the meaning of the term physical activity depends on the context. Consequently, the search for Activity or Human Activity often resulted in violations of Content Criterion F. Eye tracking and the recognition of facial expressions and emotions such as stress cannot be considered limb or torso movements. Work on pose tracking and path analysis did not aim to recognise the physical activity. For example, it is not necessarily the case that the employee moved on foot as he/she could have used a vehicle.
Advances in sensor technology for HAR applications are published in journals and conference proceedings. Current trends are devices for finger, head and back movement tracking that can be deployed easily. These hardware showcases may be useful for HAR in the long run, but violate Content Criterion G. Contributions that do not show practical application scenarios as they solely focus on the hardware showcase were excluded.
Contributions with no clear pattern recognition methods and non-standard performance metrics for HAR were not considered, e.g., contributions with a vague description of data collection, pre-processing, segmentation and classification. Some contributions rely on conditional flow charts based on thresholding, or they merely mention the algorithms without describing them.

Systematic Review of Relevant Contributions
Analysing the number of relevant publications per year reveals an increase in the second half of the observed ten-year time interval.Thus, the growing relevance of this review's scope is confirmed (see Figure 3).The peaks in the years 2016 and 2018, in particular when compared to 2017, cannot be explained by special issues of domain-specific journals.The 52 publications are spread over 41 different journals or conferences.This fact underlines that HAR for P+L benefits from a wide range of interdisciplinary expertise.The International Workshop on Sensor-based Activity Recognition and Interaction (iWOAR) accounts for three publications, making it the most frequent one.The remaining journals or conferences are accountable for a single or two contributions each.In total, 165 unique authors were involved in writing the 52 contributions.Their countries of affiliation are spread worldwide over 26 countries.None of the authors has participated in writing more than three publications.
The results of the systematic literature review according to the predefined Categorisation Scheme are illustrated in Table 7. Entries are listed in ascending chronological order and alphabetically according to the last name of the corresponding author. During categorisation, it turned out that active markers are not used in any of the relevant contributions. OMMC is used in only two contributions [54,85]. The remaining research was done using IMUs, sometimes in conjunction with other sensor technology that is not considered in this review. Therefore, the utilised sensor technology is not explicitly stated in Table 7.
The FWCI could not be recorded for three publications [48,57,71]. They are marked with a "-". Some of the publications from 2018 were published at the end of the year. Thus, it is too early to draw conclusions from their FWCI of 0. The highest FWCI of 64.63 is held by Bulling et al. The mean value of the remaining 49 publications, including those with an FWCI of 0 but excluding those where it could not be captured, is 10.04.
In the following, the major findings per category are summarised in two parts: Application and HAR Methods.

Application
The first part of the systematic review focuses on the application domain, the observed activities, the sensor attachment and the utilised datasets. When pointing out the major findings, particular emphasis is put on the contributions from the P+L domain.

Domain
Eight publications are assigned to the P+L domain, three of them in conjunction with other domains. Interestingly, all publications of the P+L domain have been published since 2013. This underlines the increasing significance of and research interest in HAR for P+L. The applications cover a wide range of sectors, such as manufacturing and assembly [58,86], warehousing [72,83,85], construction [90] and maintenance [11,14]. Warehousing, in particular order picking, is the only logistics sector represented. Other areas of logistics, such as packaging, in-house transport, external transport and handling processes such as loading, unloading and reloading are not observed in the contributions. The majority of the remaining 44 contributions focus on simplistic activities or simple exercises, as pointed out in the next paragraph.

Activity
The eight contributions from the P+L domain correspond to those that cover working activities. Tao et al. and Koskimaki et al. focused solely on working activities but did not involve locomotion. Zhang and Sawchuk did not recognise the activity itself but motions that are performed during working, such as neck bending, neck extension, kneeling and so forth. The Skoda dataset utilised in [11,14] involves manipulative gestures in car maintenance, but does not provide labelled data of locomotion (see Section 4.2.1.4). However, the other datasets utilised in these contributions do involve locomotion activities. The remaining publications [72,76,83] from P+L apply HAR methods on both stationary working processes and locomotion activities. Apart from Reining et al. [85], the observed contributions in the domain of P+L assume that the activity definition is known at design time and is not going to change at run time of a HAR method (see Section 1). Attribute-based representations are proposed to address this issue, following the concept of Rueda and Fink [92] and Lampert et al. [93].
Within the entire literature corpus, locomotion is by far the most frequently represented activity. Forty-six contributions aim to recognise the associated activities such as walking, running, standing and climbing stairs that can be recorded without elaborate set-ups. In total, 23 contributions deal with ADL, followed by exercises with 13 occurrences. It is noticeable that more complex activities that require greater effort for recording are represented less than simple ones. This is because the method of HAR is the focus of most publications, while datasets are often regarded as a tool for evaluation. The domain in which the data were created is of secondary importance.

Attachment
Among the contributions on P+L, passive MoCap sensor technology is applied only once [85]. The remaining contributions from P+L utilise IMUs. The combination of Surface Electromyography (sEMG) and IMUs was proposed by Tao et al. Using a single IMU attached to a hand or arm is proposed twice [58,86].
Examining the sensor attachment in the entire literature corpus, no clear picture emerges.While placing the sensors on the torso is most prominent (35), other locations are common as well.The reviewers could not establish a link between the sensor placement and the recognised activities.While some researchers attach the sensors to the arm performing a motion, e.g., drilling or opening a window, other approaches try to recognise ADLs and locomotion activities with sensors placed on the waist.The same issue also occurs the other way around, when researchers aim to recognise locomotion with IMUs placed on the hand.Out of the 21 contributions utilising smartphone sensors, 13 include the creation of individual datasets.The latter group solely takes smartphone data into consideration without deploying further sensors.Placing sensors on the head or a helmet is proposed rather rarely.Among the six contributions that do so, Wolff et al. presented the only work based on the idea of using head-worn sensors exclusively without further body-worn devices.Transferring this approach to P+L could possibly minimise the impairment of employees when performing manual work.

Dataset
Seven out of the eight P+L contributions use data recorded in a laboratory. Work stations or relevant features of a facility are rebuilt to record data close to reality. Nevertheless, these data are not recorded in a real process with those employees that routinely perform the manual activities. There are four P+L contributions that utilise data recorded from real-life working routines, of which three also use laboratory data. There is a single P+L contribution that solely uses real-life data [72]. No work could be found on the issue of training a classifier using data from a laboratory for deployment in a real-life P+L facility. Besides, no work was found concerning transfer learning between datasets and scenarios in an industrial context. Examining the entire literature corpus, only six papers use both real-life and laboratory data, compared with 39 contributions using real-life data and 19 using laboratory data.
There are six P+L contributions using individual datasets, while data from a repository are utilised three times.A single P+L contribution uses both data sources [83].Examining the entire literature corpus, it is striking that a majority of 38 contributions utilises individual datasets, and 34 of them do so without referring to data that are publicly available in repositories.Thus, there are four contributions using both individual and repository data [22,54,71,83].As most contributions lack a recording protocol and a detailed description of the recorded activities, it remains unclear whether the motion patterns and their assigned activity labels as well as the underlying activity definitions are comparable.
During the review process, the authors traced information about the used datasets back to their origin, e.g., in a repository or the first time they were mentioned in a contribution. Table 8 lists publicly available datasets utilised by a single contribution. They are listed in alphabetical order, providing a reference for further reading and stating the contribution that utilises this dataset.
Wearable Action Recognition Database (WARD) [68]
Table 9 describes those datasets that are used by more than one contribution in greater detail.
[105] Hand Gesture: The dataset from 2013 contains 70 minutes of arm movements per subject from eight ADLs as well as from playing tennis. Two recorded subjects were equipped with three IMUs on the right hand and arm. Utilised in [22,29,71].
[106] Skoda: This dataset from the year 2008 contains ten manipulative gestures performed by a single worker in a car maintenance scenario. Twenty accelerometers were used for recording. Utilised in [11,14].
It was noticed that the oldest dataset from 2008, Skoda, is still used eight years later in 2016 in [11]. Even though several contributions use excerpts from the same datasets, a performance comparison between the proposed methods is not necessarily possible. In some cases, the excerpts taken from a dataset differ. One reason is that authors may exclude activities with too few subjects or solely rely on IMU data without taking other data sources into account. Since a wide variety of data is used, a comparison of classification performance seems hardly possible from a practitioner's perspective.
Irrespective of the HAR method applied, effort for dataset creation is another factor to consider in commercial applications for P+L.However, none of the authors state the effort for dataset creation.It remains unclear how much time the set-up of equipment and the recording sessions consume.The same applies to the process of annotation, in some contributions referred to as labelling.Cross tests and repetition tests may reveal the consistency of labels among several annotators and the inherent annotation error caused by vague class descriptions.In the observed contributions, the time spent on annotating and the consistency of the labels are not discussed.

HAR Methods
The HAR methods represent the second part of the systematic review, covering the standard pattern recognition chain (pre-processing, segmentation, and shallow classification methods) and deep learning, in addition to the performance metrics.

Data Representation
Different data representations have been used by the publications in Table 7, depending on the HAR applications and sensors. In general, inertial measurements recorded by sensors integrated in IMUs are deployed for solving HAR. Usually, more than three devices are placed on the human body, e.g., on the hands, legs, head, and torso. In contrast, the authors of [44,82,91] recorded acceleration measurements from only one device, which is placed on the waist. The authors of [52,75,79] proposed using the magnitude of the acceleration vector from the three components x, y, and z. The authors of [86] used the logarithmic magnitude of a two-dimensional Discrete Fourier Transform of IMU signals. They proposed utilising this magnitude as an image input for a CNN.
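As an illustration of the magnitude representation mentioned above, the following minimal sketch computes the Euclidean norm of tri-axial acceleration samples. The function name and the sample values are hypothetical and are not taken from any of the reviewed publications.

```python
import math

def acceleration_magnitude(samples):
    """Return the Euclidean norm of each (x, y, z) acceleration sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]

# Hypothetical tri-axial samples (illustrative values only)
samples = [(0.0, 0.0, 1.0), (3.0, 4.0, 0.0), (1.0, 2.0, 2.0)]
magnitudes = acceleration_magnitude(samples)
```

The magnitude does not change when the device is rotated, which is one plausible reason for preferring it when the sensor orientation is not controlled.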
Alternatively, the local human joint poses have also been used for HAR. In this case, Optical Motion Capturing (OMoCap) or devices such as Kinect were deployed. In [54], the authors represented OMoCap data by using 3D-joint positions and rotations specifying a posture. The authors divided the human joints into five groups, according to the main human body parts. In [85], the authors interpreted the joint positions and orientations of OMoCap data as multichannel time series, where each joint pose per component x, y, and z is considered individually. This approach is similar to how acceleration data are handled. The authors of [81] used quaternion and Euler angle representations of the human joints, along with their velocities and accelerations, as input data. This can also be seen as a geometrical and parametric representation of the joint poses.

Pre-Processing
Due to different characteristics of sensors, sampling rates, units, random noise or malfunctioning, pre-processing approaches are needed. The authors of [71,80] used a median filter for smoothing the signal. In [47], a third-order average filter was deployed for reducing random noise. Low- and high-pass filtering have been used for separating the acceleration components due to body movements and gravity, and for eliminating noise. The authors of [44,61,87] argued that the low-frequency component of the acceleration is due to gravity, and the high-frequency component is due to the dynamic motion of the human body. In [84], a low-pass Butterworth filter was used for such separation. The authors of [57,59,65] computed the gravity component by averaging the acceleration measurements, which can be seen as taking the zero-frequency component. The body acceleration was calculated by subtracting the gravity component from the acceleration measurements. The authors of [80] also separated these two components; however, there was no clear explanation of which method was deployed. For reducing noise, the authors of [64,65] used a low-pass filter, the authors of [71] used a third-order low-pass Butterworth filter, and the authors of [62] used an average filter. The term noise filter was also used in [84], but there was no explanation of the particular method. In [68,77], a zero-mean and unit-variance normalisation was carried out, and in [83] a Max-Min normalisation to the range [0, 1], as there are differences among the units and scales of the measurements. The authors of [67] did not use any pre-processing and forwarded the raw data to a convolutional neural network.
In contrast to the aforementioned works, the authors of [63,64,90] normalised the extracted features before the training stage. For example, the authors of [90] normalised the extracted features to the range [0, 1].
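Two of the pre-processing steps described in this section can be sketched as follows: estimating the gravity component as the average of the acceleration measurements and subtracting it to obtain the body acceleration, and a Max-Min normalisation to the range [0, 1]. This is a minimal sketch under stated assumptions; the function names and the example signal are hypothetical, and the reviewed publications may implement these steps differently.

```python
from statistics import fmean

def split_gravity_and_body(signal):
    """Estimate gravity as the mean (zero-frequency component) of one axis
    and return it together with the remaining body acceleration."""
    gravity = fmean(signal)
    body = [value - gravity for value in signal]
    return gravity, body

def min_max_normalise(values):
    """Scale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant signal carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]

# Hypothetical measurements along one axis (illustrative values only)
z_axis = [9.0, 10.0, 11.0, 10.0]
gravity, body = split_gravity_and_body(z_axis)
scaled = min_max_normalise(z_axis)
```

In practice, the average would typically be taken per window or replaced by a low-pass filter, since the sensor orientation, and with it the gravity component per axis, changes over time.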

Segmentation
Segmentation refers to extracting a sequence of continuous measurements or pre-processed data that is likely to portray a human activity. In HAR, the sliding-window approach is the most common method for creating segments to be processed by a classifier. In this approach, a window is moved over the time-series data by a certain step to extract a segment [29]. The window size directly controls the delay of the recognition system. The step size is selected according to segmentation precision (taking into account that short activities can be skipped) and computation effort. Table 10 shows the usual window sizes and overlapping percentages, along with the sampling rates of the measurements, that are presented by the publications in Table 7. It is noticeable that the higher the sampling rate, the smaller the window. Publications using low sampling rates handle activities that can be seen as compositions of short-duration activities and that therefore have longer durations.
Alternatively, there are approaches for segmenting sequences using additional measurements or events, e.g., eye movement [29].
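The sliding-window approach can be sketched in a few lines. The function below is an illustrative assumption rather than the implementation of any reviewed publication; the parameter names `window_size` and `overlap` are hypothetical, and trailing samples that do not fill a complete window are dropped.

```python
def sliding_windows(samples, window_size, overlap=0.5):
    """Cut a sequence into fixed-size windows with a fractional overlap."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # The step is the number of samples the window advances each time.
    step = max(1, int(round(window_size * (1.0 - overlap))))
    last_start = len(samples) - window_size
    return [samples[start:start + window_size]
            for start in range(0, last_start + 1, step)]

# Hypothetical signal of ten samples, windows of four samples, 50% overlap
signal = list(range(10))
segments = sliding_windows(signal, window_size=4, overlap=0.5)
```

With a sampling rate of f Hz, a window of n samples spans n / f seconds, which is the delay mentioned above that the window size imposes on the recognition system.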

Shallow Methods
In addition to segmentation, a traditional supervised HAR pipeline includes the extraction of relevant handcrafted features, a feature reduction, a training stage and a classification. The handcrafted features should capture the intrinsic characteristics of a certain human action. A feature reduction is deployed for reducing the dimensionality of the feature space while keeping the discriminant properties of the features. In the training stage, a classifier is trained by using the extracted features and the ground-truth activity labels. Finally, the classifier assigns activity classes.
Feature Extraction. In the standard pattern recognition methods, feature extraction is an important stage. It allows representing data in a compact manner, which helps with later classification stages. The features are divided into two main groups: statistical features and application-based features.
Table 11 shows the statistical features that are mentioned in the publications in Table 7. Time-domain features focus on the waveform characteristics, and frequency-domain features focus on the periodic structure of the signal [90]. The Fourier transform is applied to the raw or pre-processed signals to acquire the estimated spectral density of the time series. As a remark, the authors of [64] mentioned the usage of time- and frequency-domain features, but they did not specify which ones exactly. Application-based features refer to features that were created for a certain application or dataset. These features are based on geometric, structural and kinematic relations. Table 12 shows the application-based features used in Table 7.
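To make the distinction between time- and frequency-domain features concrete, the following sketch computes three common time-domain features (mean, variance, energy) and one frequency-domain feature (the dominant frequency obtained from a Discrete Fourier Transform) for a single window. The function names, the naive DFT implementation and the example window are illustrative assumptions; a production system would typically rely on an FFT library.

```python
import cmath
from statistics import fmean, pvariance

def time_domain_features(window):
    """Mean, population variance and energy (mean of squares) of a window."""
    return {
        "mean": fmean(window),
        "variance": pvariance(window),
        "energy": fmean([value * value for value in window]),
    }

def dft_magnitudes(window):
    """Magnitudes of the DFT bins 0..n/2 (naive O(n^2) transform)."""
    n = len(window)
    return [abs(sum(value * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, value in enumerate(window)))
            for k in range(n // 2 + 1)]

def dominant_frequency(window, sampling_rate_hz):
    """Frequency in Hz of the strongest non-constant DFT component."""
    magnitudes = dft_magnitudes(window)
    strongest = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return strongest * sampling_rate_hz / len(window)

# Hypothetical window: a periodic movement repeating every four samples,
# recorded at an assumed sampling rate of 8 Hz
window = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
features = time_domain_features(window)
features["dominant_frequency_hz"] = dominant_frequency(window, 8.0)
```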
Table 12. Application-based features for HAR.

Gravity variation: Gravity acceleration computed using the harmonic mean of the acceleration along the three axes (x, y, z) [65]
Eigenvalues of Dominant Directions [50]
Structural Trend [55,56]
Magnitude of change [55,56]
Time Autoregressive Coefficients [47]
Kinematics
User steps frequency: Number of detected steps per unit time [65]
Walking Elevation: Correlation between the acceleration along the y-axis vs. the gravity acceleration or acceleration along the z-axis [65]
Correlation Hand and foot: Acceleration correlation between wrist and ankle [65]
Heel Strike Force: Mean and variance of the Heel Strike Force, which is computed using dynamics [65]
Average Velocity: Integral of the acceleration [50]

Feature Reduction. The authors of [44,45,46,68,81] deployed Principal Component Analysis (PCA) for reducing the dimensionality of their features. PCA is a holistic method that considers its inputs as points in a high-dimensional space and finds a lower-dimensional feature space along the directions of highest variance, where classification becomes easier. Linear Discriminant Analysis (LDA) is another holistic method that tries to overcome the drawbacks of PCA. It minimises the intra-class variance and maximises the inter-class variance of a set of inputs or features, finding an optimal projection by maximising the ratio between the inter- and intra-class variations of the inputs. Kernel Discriminant Analysis (KDA), used in [29,47,47], is a non-linear discriminating approach based on kernel techniques to find non-linear discriminating features. Quadratic Discriminant Analysis (QDA) was deployed by Siirtola and Röning [57]. The authors of [90] followed Recursive Feature Elimination (RFE) for finding the best set of features. RFE can be seen as a dense parameter search, which iteratively selects or rejects a set of features after training and deploying a classifier. A Bag-of-Features representation using k-means for clustering was proposed by Deng et al. [54]. They used the k-means algorithm for clustering M motion sequences. These motion sequences are 3D joint positions and orientations through m frames. Additionally, the authors of [50,68] utilised a Random Projection (RP). Sequential forward or backward feature selection (SFFS) was used in [45,90].
Classification. The authors of [52] trained one Hidden Markov Model (HMM) per axis (x, y, and z) of the pre-processed acceleration measurements and fused them with a weighted sum. The authors of [29,62,71] also deployed HMMs, the authors of [49] used coupled HMMs, and the authors of [78] used hierarchical conditional HMMs. The authors of [44] used a Naive Bayes (NB) classifier based on the Probability Density Function (PDF) of all 19 features, which were previously reduced using PCA, assuming that the features in the lower-dimensional space are not mutually correlated. The authors of [81,90] also used an NB classifier, assuming that the features are mutually independent. Similarly, the authors of [29,45,50,56,63,68,72] used NB. The authors of [45,50,61,67,68,72,80,90] deployed Support Vector Machines (SVMs). The authors of [65] used an SVM-based binary decision tree classifier. The authors of [53] proposed a hardware-friendly SVM that is meant to be deployed on smartphone devices. The authors of [29,45,46,57,63,64,68,80,90] trained a K-Nearest Neighbor (KNN) based classifier. The authors of [81] used a Dynamic Bayesian Mixture Model (DBMM) for combining conditional probability outputs from different classifiers, namely, the NB, SVM and Multilayer Perceptron (MLP). A weight is assigned to each base classifier, according to a learning process, using an uncertainty measure as a confidence level. The MLP, a network with two fully connected layers and a softmax layer, was deployed by [45,46,47,48,50,56,59,61].

Deep Learning
Deep architectures have recently been proposed for solving HAR. Unlike the standard HAR pipeline, deep architectures combine feature extraction and classification in a single approach. Their features are learned directly from data and are therefore more discriminative. In addition, they overcome some problems regarding the computation and adaptability of handcrafted features.
Convolutional Neural Networks (CNNs). CNNs, first proposed in [107], have recently been used for solving HAR problems. CNNs combine feature extraction and classification in an end-to-end approach. CNNs contain hierarchical structures combining convolutional operations using learnable filters and non-linear activation functions, downsampling operations, and classifiers [108]. Compared with networks for image classification or document analysis, networks for HAR are relatively small, with a maximum of four convolutional layers and a maximum of three pooling layers. The authors of [67] used three convolutional layers and three pooling layers. As a remark, the authors of [87] did not explain their network in detail. Classification is usually performed using a softmax layer [83,86] or by means of an attribute representation, a sigmoid function and a distance measurement, e.g., Euclidean, Cosine or Bray-Curtis distances [85]. The authors of [86] used spectrograms of inertial signals as image inputs for CNNs. Tao et al. [86] concatenated sEMG vectors, corresponding to muscle activation levels, to the output of the first fully-connected layer and fed the result into a softmax layer.
Temporal Convolutional Neural Networks (tCNNs). In multichannel time-series HAR, the CNN input consists of a stack of segmented sequences from different sensors for a certain temporal duration. tCNNs carry out convolutions and downsampling operations along the time axis, sharing different feature extractors among sensors. The authors of [14] also used a small CNN with only one convolutional layer, a pooling layer and an MLP. However, they proposed parallel branches per spatial axis, x, y and z, with late fusion. The authors of [77] also used a small network with one temporal convolutional layer, one pooling layer and an MLP. The authors of [79] used a small tCNN with one convolutional layer, one pooling layer, one fully-connected layer and a softmax layer. The authors of [83] used temporal convolutional networks. They also proposed an architecture with different parallel branches per IMU device with late fusion. The authors of [84] proposed using an Encoder-Decoder TCNN and a dilated TCNN.
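The operations that a tCNN applies along the time axis can be illustrated without a deep learning framework. The sketch below applies one filter to every sensor channel, followed by a ReLU activation and a temporal max pooling. It is only an illustration of the operations and not a trainable network: the filter is hand-picked here, whereas a tCNN learns its filters from data, and all names and values are hypothetical.

```python
def temporal_conv(window, kernel):
    """Slide one kernel along the time axis of every channel (no padding).

    window: list of channels, each a list of samples over time.
    As is usual in deep learning, the kernel is applied without flipping.
    """
    size = len(kernel)
    return [[sum(kernel[j] * channel[t + j] for j in range(size))
             for t in range(len(channel) - size + 1)]
            for channel in window]

def relu(feature_maps):
    """Set negative activations to zero."""
    return [[max(0.0, value) for value in row] for row in feature_maps]

def temporal_max_pool(feature_maps, size=2):
    """Downsample each feature map along time with non-overlapping maxima."""
    return [[max(row[i:i + size])
             for i in range(0, len(row) - size + 1, size)]
            for row in feature_maps]

# Hypothetical segment with two sensor channels and six time steps
window = [[0.0, 1.0, 2.0, 3.0, 2.0, 1.0],
          [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
kernel = [-1.0, 1.0]  # hand-picked difference filter (assumption)
features = temporal_max_pool(relu(temporal_conv(window, kernel)))
```

Because the same kernel is applied to each channel separately, the channel dimension is preserved, and the resulting feature maps can be fused later, for example by a fully-connected layer.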
Recurrent Neural Networks (RNNs). The authors of [11] proposed a network with four temporal convolutional layers and two Long Short-Term Memory (LSTM) layers followed by a softmax layer. The authors of [88] proposed to use dilated temporal ConvLSTMs. The network consists of one initial convolutional layer followed by three dilated temporal convolutional layers with different dilation factors. The feature maps of the last dilated temporal convolutional layer are fed to two LSTM layers followed by a softmax layer. Chen et al. [82] proposed a combination of LSTMs with raw-data input, an MLP with handcrafted-feature inputs and late fusion. In contrast, Zhu et al. [91] used an LSTM network with statistical features as input.

Metrics
Across the publications in Table 7, the most used metric is accuracy. However, most of the datasets presented in the publications have an imbalance problem, that is, some classes contain more samples than other classes. Generally, the performances reported in the publications show relatively good results, approaching 100% accuracy. Nevertheless, these performances must be reconsidered, because accuracy is dominated by the most frequent classes. Using F1 measures (mean and weighted averages), which are computed from the precision and recall, could give a more impartial conclusion of the performance for HAR (see Table 13).
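The effect of class imbalance on accuracy can be shown with a small, constructed example. The sketch below computes the accuracy as well as the mean (macro) and weighted F1 measures from the per-class precision and recall. The labels and predictions are hypothetical and chosen only to illustrate the point; the function names are assumptions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_scores(y_true, y_pred):
    """Return per-class F1, its unweighted mean and its support-weighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        per_class[label] = 2 * precision * recall / total if total else 0.0
    mean_f1 = sum(per_class.values()) / len(labels)
    weighted_f1 = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, mean_f1, weighted_f1

# Hypothetical imbalanced test set: nine "walk" samples, one "pick" sample.
# The classifier always predicts the majority class.
y_true = ["walk"] * 9 + ["pick"]
y_pred = ["walk"] * 10
acc = accuracy(y_true, y_pred)
per_class, mean_f1, weighted_f1 = f1_scores(y_true, y_pred)
```

In this constructed case, the accuracy is 0.9 although the minority class is never recognised, whereas the mean F1 is roughly 0.47 and thus exposes the failure.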

Discussion and Conclusions
The purpose of this systematic literature review is to capture the state of the art of Human Activity Recognition for Production and Logistics. To achieve this goal, HAR applications in related domains are taken into account as well. Fifty-two contributions were selected from an initial list of 1243 publications according to predefined Inclusion and Content Criteria. The relevant literature is categorised and analysed by domain experts from P+L and HAR. Summarising the findings, the guiding questions from Section 1 are answered:
1. What is the current status of research regarding HAR for P+L and related domains from a practitioner's perspective?
For the past 10 years, eight publications dealing with HAR in P+L have been identified. They address a variety of use cases but none covers the entire domain. Apart from two applications [83,85], the approaches assume a predefined set of activities, which is a downside amid the versatility of human work in P+L. Furthermore, the necessary effort for dataset creation is unknown, making the expenditure for deploying HAR in industry difficult to predict.
In applications for related domains, locomotion activities as well as exercises and ADLs that resemble manual work in P+L are covered, allowing for their transfer to this domain.

2. What are the specifications of current applications regarding the sensor technology, recording environment and utilised datasets?
The vast majority of research is done using IMUs placed on a person or using the accelerometers of smartphones. The sensor attachment could not be derived from the activities to be recognised; no link was apparent to the reviewers. Seven out of the eight P+L contributions use data recorded in a laboratory. In total, 39 contributions use real-life data versus 19 that use laboratory data. Only four papers use both real-life and laboratory data. The reviewers did not find work regarding the training of a classifier using data from a laboratory for deployment in a real-life P+L facility or transfer learning between datasets and scenarios in an industrial context. Most of the publications proposed their own datasets or used individual excerpts from data available in repositories; thus, replicating their methods and results is hardly possible.

3. What methods of HAR are deployed?
Current publications solve HAR either using a standard pattern recognition algorithm or using deep networks. Publications follow only the sliding-window approach for segmenting signals. The window size differs strongly according to the recording scenarios. However, the overlapping is usually 50%. For the standard methods, there is a large number of statistical features in time and frequency, with the variance, mean, correlation, energy and entropy being the most common. Deep architectures have been applied successfully for solving HAR. In comparison with applications in the vision domain, the networks are relatively shallow. Temporal CNNs or combinations of tCNNs and RNNs show the best results. Accuracy is the most used metric for evaluating the HAR methods. However, methods using datasets with unbalanced annotation should be evaluated with precision, recall and F1-metrics; otherwise, the performance of the method is not evaluated correctly.

4. What is the research gap to enhance HAR in P+L? What does the future road map look like?
From the reviewers' perspective, further research on HAR for P+L should focus on five issues. First, a high-quality benchmark dataset for HAR methods to deploy in P+L is missing. This dataset should contain motion patterns that are as close to reality as possible, and it should allow for comparison among different methods, thus being relevant for application in industry. Second, it must become possible to quantify the data creation effort, including both recording and annotation following a predefined protocol. This allows for a holistic effort estimation when deploying HAR in P+L. Third, most of the observed activities in the literature corpus are simplistic and they do not cover the entirety of manually performed work in P+L. Furthermore, the definition of activities cannot be considered fixed at design time and expected to remain the same during run time in such a rapidly evolving industry. Methods of HAR for P+L must address this issue. Fourth, method-wise, the segmentation approach should be revised in detail, as a window-based approach is currently the only method for generating activity hypotheses. This method does not handle activities that differ in their duration. A new method for computing activities with strongly different durations is needed. Fifth, the methods using deep networks do not include a confidence measure. Even though these network methods show state-of-the-art performance on benchmark datasets, they are still overconfident in their predictions. For this reason, integrating deep architectures with probabilistic reasoning for solving HAR using context information can be difficult.
Looking towards the future, the authors plan to tackle the issues along the outlined road map with further research.

Figure 1. Sensor Technology for capturing human motions [26]. The selection covered in this contribution is highlighted.

Figure 2. Method of the literature review.

• the initial situation and scope;
• the methodological and empirical results; and
• the further research demand.

Figure 3. Papers distribution by year of publication.

Table 2. Inclusion Criteria for original contributions and related surveys.

Table 3. Content Criteria for selection process.

Table 4 explains the four stages that contributions could reach during the selection process in accordance with the Content Criteria.

Table 4. Stages I-IV of selection process.
Smartphone: Worn in a pocket or a bag. If attached to a limb, the subcategory is checked as well
Dataset
Repository: Utilised dataset is available in a repository
Individual: Dataset is created specifically for the contribution and not available in a repository
Laboratory: Recording takes place in a constrained laboratory environment
Real-life: Recording takes place in a real-life environment, e.g., a real warehouse or in public places
Name of dataset: Name, origin, repository and description of dataset
Sensor
Passive Markers: Markers reflect light for the camera to capture
Active Markers: Markers emit light for the camera to capture
IMU: Devices that measure specific forces such as acceleration or gyroscopes
Data Preparation (DP)
Pre.-Pr.: Pre-Processing: Normalisation, noise filtering, low-pass and high-pass filtering, and re-sampling
Segm.: Segmentation: Sliding-window approach
Shallow Method
FE-Stat. Feat.: Statistical feature extraction: Time- and Frequency-Domain Features
FE-App.-based: Application-based features, e.g., Kinematics, Body model, Event-Based
FR: Feature reduction, e.g., Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), Kernel Discriminant Analysis (KDA), Random Projection (RP)
Classification method: Random Forest (RF), Decision Trees (DT), Dynamic Time Warping (DTW), K-Nearest Neighbor (KNN), Fuzzy-Logic (FL), Logistic Regression (LR), Bayesian Network (BN), Least-Squares (LS), Conditional Random Field (CRF), Factorial Conditional Random Field (FCR), Conditional Clauses (CC), Gaussian Mixture Models (GMM), Template Matching (TM), Dynamic Bayesian Mixture Model (DBMM), Emerging Patterns (EP), Gradient-Boosted Trees (GBT), Sparsity Concentration Index (SCI)

Table 6. Examined Publications per Stage.

Table 7. Systematic Review of relevant literature (Stage VIII).

Table 8. Overview of publicly available datasets utilised by a single relevant contribution.

Table 9. Overview of publicly available datasets utilised by two or more relevant contributions.

Table 11. Statistical features divided in two main groups: time and frequency domain.

Table 13. Number of publications using these metrics as performance metric for HAR.