LARa: Creating a Dataset for Human Activity Recognition in Logistics Using Semantic Attributes

Optimizations in logistics require recognition and analysis of human activities. The potential of sensor-based human activity recognition (HAR) in logistics is not yet well explored. Despite a significant increase in HAR datasets in the past twenty years, no available dataset depicts activities in logistics. This contribution presents the first freely accessible logistics-dataset. In the ’Innovationlab Hybrid Services in Logistics’ at TU Dortmund University, two picking and one packing scenarios were recreated. Fourteen subjects were recorded individually when performing warehousing activities using Optical marker-based Motion Capture (OMoCap), inertial measurement units (IMUs), and an RGB camera. A total of 758 min of recordings were labeled by 12 annotators in 474 person-h. All the given data have been labeled and categorized into 8 activity classes and 19 binary coarse-semantic descriptions, also called attributes. The dataset is deployed for solving HAR using deep networks.


Introduction
Human activity recognition (HAR) assigns human action labels to signals of movements. Signals are time series that are obtained from video frames, marker-based motion capturing systems (Mocap), or inertial measurements. This work focuses on HAR using Mocap and inertial measurements. Methods of HAR are critical for many applications, e.g., medical and rehabilitation support, smart homes, sports, and industry [1][2][3]. Nevertheless, HAR is a complicated task due to the large intra- and inter-class variability of human actions [1]. In addition, extensive annotated data for HAR is scarce, mainly due to the complexity of the annotation process. Moreover, HAR datasets are likely to be unbalanced. Usually, there exist more samples of frequent activities, e.g., walking or standing, than of infrequent ones such as picking an article [4,5].
Warehousing is an essential element of every supply chain. The main purpose of warehousing is storing articles and satisfying customers' orders in a cost- and time-efficient manner. Despite an increase in automation and digitization in warehousing and the impression of a shrinking number of employees, the employee numbers are rising [6,7]. Manual order picking and packing are labor-intensive and costly processes in logistics. Information on the occurrence, duration, and properties of relevant The contribution also answers the following research questions in the context of the first freely accessible logistics HAR dataset, the Logistic Activity Recognition Challenge (LARa):

1. What is the state of the art of dataset creation for multichannel time-series HAR?
2. What guidelines are proposed for creating a novel dataset for HAR?
3. What are the properties of a logistics dataset for HAR created by following these guidelines?
4. How does a tCNN perform on this dataset using softmax compared to an attribute representation?
This contribution is organized as follows. Section 2 presents the related work on multichannel time-series HAR. In Section 3, the freely accessible dataset LARa is introduced. First, the data recording steps in the logistics scenarios are presented. Second, the activity classes and semantic attributes are explained. Third, findings of the annotation and revision process are highlighted. Section 3 concludes with an overview of the LARa dataset. Section 4 presents an example of solving HAR on the LARa dataset using deep architectures. Finally, Section 5 offers a discussion and the conclusions of this contribution. Additionally, Appendix A gives an overview of state-of-the-art datasets for HAR. Based on the datasets' descriptions, the guideline for creating the novel dataset in Section 3 is derived.

Related Work
Methods of supervised statistical pattern recognition have been used successfully for HAR. The standard pipeline consists of preprocessing, segmentation, statistical feature extraction, and classification. High- and low-pass filters are common preprocessing steps. Filtering out high frequencies serves denoising, as faulty sensor measurements lie in the high-frequency spectrum, whereas changes in human motion lie rather in the low-frequency range. Low-pass filter operations are used for separating gravitation and the inclination of the IMUs with respect to a constant frame, i.e., the earth [19]. A segmentation approach, e.g., a sliding window, divides the input signal into segments of a certain time duration. Statistical features are computed from the time and frequency domains. They are, for example, the mean, variance, channel correlation, entropy, energy, and coherence [1,11,20,21]. The authors in [10] present a summary of such features. Using these features, the parameters of a classifier are computed. The classifier assigns an activity label to an unknown input. Some examples of classifiers are Naïve Bayes, Support Vector Machines (SVMs), Random Forests, Dynamic Time Warping (DTW), and Hidden Markov Models (HMMs) [11,22]. These methods, however, might show low performance on challenging HAR problems. In addition, different combinations of features must be selected manually per activity. This makes the methods hard to scale and prone to overfitting [3,19].
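The segmentation and feature-extraction steps of this pipeline can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure of any cited work: the window size, step, channel count, and the function names `sliding_windows` and `statistical_features` are assumptions chosen for the example, and only a subset of the listed features (mean, variance, energy, channel correlation) is computed.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Split a [T, C] signal into overlapping segments of shape [window_size, C]."""
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

def statistical_features(window):
    """Per-channel mean, variance, and energy plus pairwise channel correlations."""
    mean = window.mean(axis=0)
    var = window.var(axis=0)
    energy = (window ** 2).sum(axis=0) / window.shape[0]
    corr = np.corrcoef(window, rowvar=False)            # [C, C] correlation matrix
    upper = corr[np.triu_indices(window.shape[1], k=1)]  # each channel pair once
    return np.concatenate([mean, var, energy, upper])

# Synthetic stand-in for a three-channel inertial recording (assumed values).
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 3))

windows = sliding_windows(signal, window_size=100, step=50)
features = np.stack([statistical_features(w) for w in windows])
print(windows.shape, features.shape)
```

Each row of `features` would then be passed to a classifier such as an SVM or a Random Forest; the classifier itself is omitted here.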
The authors in [11] evaluate HAR for order picking using statistical pattern recognition. They present a novel dataset of human order-picking activities. They use a low number of sensor devices. Specifically, they deployed three inertial measurement units (IMUs), which are worn by workers in two different scenarios. They computed handcrafted statistical features on segments that were extracted with the sliding-window approach. The authors evaluated three classifiers, namely, an SVM, a Naïve Bayes, and a Random Forest. The authors in [19] solve HAR for activities of daily living. They compute statistical features on three data streams, namely the raw inertial measurements and their AC and DC components. They propose a hierarchical approach with bagging of simple classifiers on different combinations of device locations on the human body.
Deep architectures have also been deployed for solving HAR. Temporal Convolutional Neural Networks (tCNNs), Recurrent Neural Networks (RNNs), e.g., Long Short-Term Memory networks (LSTMs), and combinations of both are examples of architectures in the field. tCNNs are hierarchical architectures that combine feature extraction along time and classification in an end-to-end approach. They learn the features and the parameters of the classifier directly from raw data. tCNNs are presented in [17,18,23]. They are composed of convolution and pooling operations that are carried out along the time axis. Through their hierarchical composition, tCNNs become increasingly discriminative with respect to human actions. The combination of stacked convolutional and pooling layers finds temporal relations that are invariant to temporal translation. They are also robust against noise. Moreover, these architectures share small temporal filters among all the sensors in the IMUs. Local temporal neighborhoods are likely to be correlated independently of the sensors' type. The authors in [2] introduce an architecture that combines temporal convolutions with LSTM layers, which replace the fully-connected layers. LSTMs are recurrent units with memory cells and a gating system, which are suitable for learning long temporal dependencies in sequences. These units do not suffer from exploding or vanishing gradients during training. The authors in [24] utilize shallow recurrent networks, namely a three-layered LSTM and a one-layered bidirectional LSTM (BLSTM). Bidirectional LSTMs process their input sequences in both forward and backward directions. The BLSTMs outperform the convolutional architectures. Nevertheless, tCNNs show more robust behavior against parameter changes. The authors in [3] propose a tCNN that is adapted for IMUs, called IMU-CNN. The architecture is composed of convolutional branches, one per IMU. These branches compute an intermediate representation per IMU, and these representations are then combined in the last fully-connected layers. The authors compared IMU-CNN with the tCNN and a tCNN-LSTM similar to [2]. The IMU-CNN shows a better performance, as it is more robust against IMU faults and asynchronous data. The authors in [20] investigate the effect of data normalization on the deep architecture's performance. They compare the normalization to zero mean and unit standard deviation, batch normalization, and a pressure-mean subtraction. The architecture's performance improves when utilizing normalization techniques. Extending the work of [3], the authors use four sensor fusion strategies. They find that late fusion strategies are beneficial. Additionally, they evaluate the robustness of the architectures with respect to the proportion of the training dataset that is used.
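To make the idea of temporal convolution with shared filters concrete, the following sketch applies the same small temporal filters to every sensor channel and then pools along the time axis. It is an illustration under assumptions, not the architecture of [3] or of any other cited work: the filter count, filter length, pooling size, and random weights are placeholders, and the convolution is the unflipped (cross-correlation) form that deep learning frameworks commonly use.

```python
import numpy as np

def temporal_conv(x, filters):
    """Valid convolution along time with filters shared by all channels.

    x: [T, C] input window, filters: [F, K]; returns [T - K + 1, C, F].
    """
    k = filters.shape[1]
    # windows[t, c, :] holds x[t:t + k, c]
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)
    return np.einsum("tck,fk->tcf", windows, filters)

def relu(x):
    return np.maximum(x, 0.0)

def temporal_max_pool(x, size):
    """Non-overlapping max pooling along the time axis of a [T, C, F] tensor."""
    t = (x.shape[0] // size) * size          # drop the incomplete last block
    return x[:t].reshape(t // size, size, *x.shape[1:]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 6))        # assumed: 200 time steps, 6 sensor channels
filters = rng.normal(size=(4, 5))    # assumed: 4 filters of temporal length 5

feature_maps = temporal_max_pool(relu(temporal_conv(x, filters)), size=2)
print(feature_maps.shape)
```

Because the filters slide only along time and are reused for every channel, the number of learnable parameters does not grow with the number of sensors.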
The authors in [16] propose using an attribute-based representation for HAR. In object recognition and word-spotting problems, attributes are semantic descriptions of objects or words. They coarsely represent a class. In [12], a search for attributes is presented, as there are no datasets with such annotations. The selected attributes are better suited for solving HAR. For such a search, the authors deploy an evolutionary algorithm. First, they assign random binary representations to action classes as the population. Second, they evaluate a population using deep architectures with a sigmoid activation function. The validation performance serves as the evolutionary fitness. The authors deploy non-local mutations on the populations. They conclude that using attribute representations boosts the performance of HAR. Even a random attribute representation performs comparably to directly classifying human actions. A drawback of this approach was the lack of a semantic definition of the attributes.
Attribute-based representations have been explored in depth for HAR in [13]. Particularly in the manual order-picking process, attribute representations were expected to be beneficial for dealing with the versatility of activities. This contribution compared the performance of deep architectures trained using different attribute representations, and it evaluated their quantitative performance as well as their quality from the perspective of practical application. Expert-given attribute representations performed better than a random one created following the conclusions in [16]. A semantic relation between attributes and activities enhances HAR not only quantitatively with regard to performance, but it also ensures a transfer of the attributes between activities by domain experts. In this preliminary work, the mapping between activity classes and attribute representations was one-to-one. This reduced the task to a multiclass problem, which limits the benefits of attribute-based representations.
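A one-to-one mapping between classes and attribute vectors can be illustrated with a small sketch. Everything in it is an assumption made for the example: the class names, the four-attribute table, the logits, and the nearest-neighbor decision rule are hypothetical and are not taken from [13], [16], or the LARa annotation. A network with sigmoid outputs produces one score per attribute, and the class whose binary attribute vector lies closest to these scores is predicted.

```python
import numpy as np

# Hypothetical table: one binary attribute vector per class (one-to-one mapping).
class_names = ["standing", "walking", "handling"]
attribute_table = np.array([
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(attribute_scores, table):
    """Index of the class whose attribute vector is closest (Euclidean) to the scores."""
    distances = np.linalg.norm(table - attribute_scores, axis=1)
    return int(np.argmin(distances))

# Assumed network outputs before the sigmoid, one logit per attribute.
logits = np.array([-2.0, 1.5, -0.5, 3.0])
scores = sigmoid(logits)

predicted = class_names[predict_class(scores, attribute_table)]
print(predicted)  # handling
```

If several attribute vectors were allowed per class, the table would simply contain more than one row for that class, together with a row-to-class lookup.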
An important element of these supervised methods is annotated data [20]. A drawback of using deep methods is the need for extensive annotated data. This contrasts with statistical pattern recognition. However, capturing and annotating data for HAR is laborious and expensive. Moreover, annotations regarding attributes do not exist. These fine-grained annotations represent an extra cost. In [13], human actions were given unique attribute representations. Nevertheless, human actions might include different combinations of attributes. Different combinations of attributes might be helpful for zero-shot learning and for reducing the effects of class imbalance. They also might allow clustering signals of a certain activity that differ slightly in the human movements. So far, there is no large-scale, freely accessible dataset of human activities in complex industrial processes, nor one that uses attributes. In addition, there are no standard guidelines for creating such a dataset. Thus, they need to be defined beforehand. A review of existing datasets and their shortcomings with regard to the goal of this paper is presented in the appendix to further motivate the introduction of the new dataset in the following section.

Introducing the LARa Dataset
This section states the LARa dataset's specifications. Requirements and specifications of LARa are based on a detailed review of datasets for HAR, see Appendix A. In particular, the origin of the laboratory set-ups, the subjects' characteristics as well as the recording and annotation procedure are showcased. For data recording, the researchers created physical replicas of real-world warehouses in a laboratory. They are called scenarios in this contribution. This subsection gives insights into the replicas' creation, and it explains the underlying warehousing processes. Next, the sensors' configuration and the proper preparation of the subjects are presented.

Guidelines for Creating and Publishing a Dataset
The datasets discussed in Appendix A show no uniform guidelines for dataset creation. Based on this overview of the datasets and their descriptions, a guideline for the creation of a dataset is derived.
If possible, the recording should take place under real conditions. Realistic environments, e.g., a real warehouse or a detailed replica, ensure that natural movements are recorded. A replica requires a large laboratory. In addition, objects similar to those in the real scenarios are needed, e.g., a picking cart. The subjects' selection depends on the variety of people in the real environment, e.g., employees of a real warehouse. The selection criteria involve age, sex, height, and handedness.
In addition to the realistic environment, the behavior of the subjects should be implemented as naturally as possible. Instead of just recording individual activities in isolation, recording a whole process enables natural behavior and thus natural movements. A recording should therefore not only consist of one activity, e.g., lifting a box, but should occur as part of a process, e.g., lift the box → walk with the box → pick the article → put the article in the box → walk with the box → place the box. Using a recording protocol and RGB camera for documentation, discrepancies, such as the slipping of sensors or markers, are noticeable after the recordings.
It is recommended to use different sensor types with a high frame rate. Since there is no uniform positioning of sensors, several sets of different positions on the human body can be experimented with.
OMoCap and RGB videos could help in complex annotation-scenarios. The annotation is to be carried out by domain experts such as physiotherapists, dance teachers, or, in the case of logistics, logistics experts. As soon as several people annotate or are expected to benefit from the annotated data, an annotation guideline is necessary. A revision of the annotation is recommended to improve the quality of labeled data. To ensure other applications, the representation of the activity classes should be as granular as possible. The granularity depends on the number of activities and can be increased by a binary coarse-semantic description.
Necessary general information such as location and period of the recordings must be specified. The method of data acquisition and the description of the activities are part of the description of the dataset. In addition to the method of annotation and its effort, the labeled activity classes must be described. The dataset should contain labeled and raw data from all sensors. Access to the annotation tool must be guaranteed for understanding the process of annotation.

Laboratory Set-Ups based on Logistics Scenarios
This subsection explains three logistics scenarios for data recording. The warehousing processes' graphical representation is based on the guidelines defined by the Object Management Group [25]. The graphical and textual descriptions of the scenario guide researchers when applying methods of HAR that take context into consideration. A detailed explanation of the scenarios might be helpful for approaches involving context, preconditions and effects, e.g., Computational Causal Behavior Models (CCBM) [26]. This context may be the constraints of the warehousing process. For example, some activities can only be performed in a specific order or at a specific location and time.
Data were recorded in physical set-ups created in a controlled environment, the 'Innovationlab Hybrid Services in Logistics' at TU Dortmund University [15]. A group of researchers created the physical replicas of warehousing scenarios following a cardboard engineering approach [27,28].

Logistics Scenario 1: Simplified Order Picking System
The first scenario is not based on a real warehouse. Nevertheless, this process may exist in reality. The process is illustrated in Figure 1; the physical laboratory set-up is presented in Figure 2. At the beginning of an order-picking process, the subject places boxes on an empty order-picking cart. These empty boxes are provided at the base. In a real warehouse, this base may be a conveyor that transports empty boxes to the order picker while transporting full boxes to the shipping area. In the laboratory, stacking frames recreated the conveyor. This simplification does not influence human-motion behavior. The boxes and the cart were standard items that are common in the industry.
Next, the subject moves the cart to a retrieval location. The researchers who guided the recordings specify where to go. An order-picking aisle was recreated by placing boxes on frames. When the subject arrives at a retrieval location, they pick articles from a box or they open a fronted bin. The subject places the articles in an empty box on the cart. The articles were small, light items, such as bags of 500 g. This procedure of taking the cart to a new location and retrieving goods is repeated until all boxes on the cart are full. The subject takes the cart back to the base and places the full boxes on the conveyor. The order-picking process starts anew. When all boxes in the aisles are empty, the order-picking process has to end. The research team refills the boxes.

Logistics Scenario 2: Real-World Order Picking and Consolidation System
The second scenario is based on a real warehouse. Access to the site and process documentation was granted by industry partners of the chair of Materials Handling and Warehousing. In contrast to Scenario 1, the second scenario takes information technology processes such as scanning barcode labels or pushing buttons for pick confirmation into account. For the sake of clarity, the order-picking process and the consolidation process of the picked goods are illustrated separately in Figures 3 and 4, respectively. The physical laboratory set-up of Scenario 2 is illustrated in Figure 5.
The order-picking cart is bigger than the one used in the first scenario as visible in Figures 2 and 3. It has three shelves of equal size that are filled with cardboard boxes of different shapes and sizes. Each box is held open with a rubber band. In the real warehouse, a so-called put-to-light (PtL) frame is attached to the cart. It gives a visual signal where to place articles and has buttons to press for retrieval and submission confirmation. Small calculators are attached to the cart to replicate this system in the laboratory. On its shorter end, the cart has two handles, a small screen, a stamp pad, a plastic bag for packaging waste and a second bag, which is filled with more small plastic bags. Apart from the screens, all items could be purchased. A labeled cardboard replicates the screen. The research group gives information to the subject, which is usually displayed on the screens. For example, this information might be the retrieval location or the picking quantity.
Subjects deploy a stamp and a knife. They are attached to the OMoCap suit. Additionally, subjects operate a handheld scanner, which is attachable to the cart. To assure a natural motion of the subjects when using the scanner, all items have barcode labels that need to be scanned. Thus, the subjects have to use the scanner correctly to trigger an acoustic signal that confirms a scan operation.
An order consists of several items that need to be picked in varying quantities. For each order-picking cycle, one cart works on the orders of several customers at the same time. This is referred to as order batching. The articles are household goods of varying dimensions and weights, such as cutlery, dishes, or storage boxes. They are stored in plastic and cardboard boxes and open-lid bins. Some of the cardboard boxes were sealed with tape to protect the goods. These storage units are placed on shelves of different heights or on the ground. Stacking frames and shelves formed two aisles. In the real-world system, a flow-through rack is deployed for goods consolidation. In the laboratory, pipe-racking systems were used to recreate it. Each chute of the flow-through rack is equipped with a barcode label and a human-readable ID.
In general, the subject scans all labeled units, e.g., a single article or a newly labeled plastic bag, to ensure that the correct article is picked. There are three cases for scanning an article's barcode label. In the first case, the articles are individually packed. Every article already has a barcode label attached. Second, some articles are in a secondary packaging, e.g., a cardboard box or a plastic bag that needs to be opened before retrieval. The articles in this secondary packaging have an individual barcode. Third, some articles do not have an attached barcode label. In this case, the barcode at the shelf has to be scanned. A roll of barcode labels is provided next to the respective articles. These labels need to be attached to the retrieval unit.
To begin the order-picking process, the subjects scan the barcode of the cart to trigger the order-picking mode. The screen shows the next retrieval location. When they arrive there, they scan the article's barcode label, which may be found on the article or on the shelf, as explained previously. If the article is correct, the screen indicates the correct withdrawal quantity.
Next, the subject retrieves the correct amount of articles. If necessary, they open sealed cardboard boxes with the knife. They dispose of packaging waste using the plastic bag at the cart. If the article already has a barcode label, the subject can scan it so that the PtL frame visually indicates the correct box into which to submit the articles. For articles that do not have a barcode label, the subject wraps the desired quantity of articles in a plastic bag and seals it with a barcode label provided at the shelf. Pressing a button confirms each submission into a box on the cart. The button is on the PtL frame above the box. If this is the first item in a box, the box must be marked with a stamp. This is a quality assurance measure to trace back the employee who packed the box. The subject takes the cart from one retrieval location to the next until the order is complete.
The order-picking process is followed by the consolidation of the packed goods in preparation for dispatch. For consolidation, the boxes must be inserted on the back side of a flow-through rack. On the front side, the packaging workplaces are located, where dispatch preparation takes place. As with the order-picking mode, the subjects scan a specific barcode on the cart to trigger the consolidation mode. Next, they take the cart to the consolidation point, which is shown on the cart's display. The subjects scan the barcode of a box so that the scanner's display shows the correct chute. After they have inserted the box, they scan the barcode label at the chute to confirm the submission. This procedure repeats until there are no more boxes on the cart.

Logistics Scenario 3: Real-World Packaging Process
The third scenario is the packaging process that follows the order picking and consolidation of Scenario 2 in the same real-world warehouse. The packaging process serves the dispatch preparation of the picked articles. In general, the consignment size per order does not exceed 5 boxes. Thus, shipping by pallet is not feasible. The real-world packaging process is illustrated in Figure 6. Its physical laboratory set-up can be observed in Figure 7.

Each packaging workplace is equipped with a computer, a printer, a bubble wrap dispenser, a tape dispenser, a scale for weighing boxes, and a trash bin. Next to the table, a conveyor is located where all boxes have to be placed that are ready for shipment. The packaging table in the laboratory is a model often found in real warehouses. Further tables were placed next to the packaging table to provide space for the equipment. The table on the far left was used to recreate the surface of the conveyor. When a box was pushed onto the surface, a researcher took the box. The actual motion of a conveyor is not necessary to ensure a human motion that is close to reality. The dimensions of the tables in the laboratory closely resemble the table from the real-world warehouse.
For the tools, equipment has been purchased that is similar to the real-world system. The bubble wrap dispenser was recreated by cutting a small opening in a cardboard box. The wrap was refilled manually by the researchers present during the recordings. A fully functional computer was placed on the table. Mouse and keyboard were attached to the computer and a spreadsheet application was running on it. When computer work was necessary, the subjects were tasked to perform basic tasks in the program. The printers were substituted by a researcher handing the printed items to the subject. As the weight scale is an area on the table's surface, it could be recreated by indicating a certain area with colored stripes.
As explained previously, all boxes to be prepared for shipment were stored in a flow-through rack. During recordings, the rack from Scenario 2 was used. It was moved next to the packaging table. When the second and third scenarios were recorded in immediate succession, the flow-through rack was already filled with boxes containing articles.
At the beginning of the packaging process, the subject goes to the computer and chooses a packing order. Next, they take all boxes that belong to one order from the flow-through rack and place them on the packing table. The rubber band of each box is removed and the barcode needs to be scanned with the hand scanner. When doing so, the packing list of the order is printed automatically.
For each box, the subject evaluates its filling level to decide whether repacking is necessary. This is the case when the box is either rather empty or overfull. In the first case, more articles from a different box of the same order are added. In the second case, the articles protrude from the box. Articles may be bigger than the box due to incorrect article master data. When the filling level is low, the contents of several boxes are combined. When a box is removed from the order, this information must be entered into the computer. Contents of an overfilled box are put into a bigger one. The subject can get boxes of different sizes from storage next to the packing table. When repacking articles from one box to another, each one needs to be scanned and the repacking must be confirmed at the computer.
The subject confirms that all boxes of an order are filled properly. In case the packing list has been altered due to repacking, it is reprinted automatically. Next, the subject puts the packing list in each box and fills them up with bubble wrap. Then, each box must be pushed onto a scale. The subjects need to trigger the weighing process at the computer. The system will check if the actual weight of the box corresponds to the expected weight according to the master data and the packing list.
Once all boxes are packed correctly and their weight has been approved, the subjects seal them using a tape dispenser. The printer automatically prints the shipping labels when all boxes of one order are ready to be sealed. The subject applies a label to each box. Eventually, each box is pushed onto the conveyor surface.

Configuration of Sensors and Markers
The OMoCap system tracked 39 reflective markers from a suit, see Figure 8. A VICON system consisted of 38 infrared cameras recording at a sampling rate of 200 fps. Three different sets of on-body devices or IMUs record tri-axial linear and angular acceleration, see Figure 9. IMU-sets 1 and 3 served as proof of concept and they are not part of the dataset. The six IMUs of the second set from MbientLab [29] are attached to the arms, legs, chest, and waist. They record tri-axial linear and angular acceleration at a rate of 100 Hz.
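The relation between the two sampling rates can be made explicit with a short sketch. It assumes a recording unit of two minutes, streams that start at the same instant, and a naive alignment that keeps every second OMoCap frame without anti-alias filtering; the function name `decimate_to_imu_rate`, the array shape, and this alignment strategy are illustrative assumptions, not the synchronization procedure used for the dataset.

```python
import numpy as np

OMOCAP_FPS = 200            # OMoCap sampling rate stated in the text
IMU_HZ = 100                # IMU sampling rate stated in the text
RECORDING_SECONDS = 2 * 60  # assumed length of one recording unit

omocap_frames = OMOCAP_FPS * RECORDING_SECONDS  # 24000 frames per recording
imu_samples = IMU_HZ * RECORDING_SECONDS        # 12000 samples per recording

def decimate_to_imu_rate(omocap, factor=OMOCAP_FPS // IMU_HZ):
    """Keep every factor-th OMoCap frame so both streams share a 100 Hz grid."""
    return omocap[::factor]

# Placeholder OMoCap stream: one row per frame, three arbitrary channels.
omocap = np.zeros((omocap_frames, 3))
aligned = decimate_to_imu_rate(omocap)
print(omocap_frames, imu_samples, aligned.shape)
```

In practice, a low-pass filter before decimation and a check of the actual start offsets between devices would be advisable.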

Characteristics of Participating Subjects
A total of 14 subjects (S) were involved in the recording process. Their characteristics, including sex, age, weight, height, and handedness, are listed in Table 1. Examining the minimum and maximum of these characteristics shows that a wide spectrum of physical characteristics is present. Thus, the subjects' motion patterns vary widely. In addition, the ratio of left-handed to right-handed subjects closely resembles that of the general population [30,31]. Each subject participated in a total of 30 recordings of 2 min each, which corresponds to 30 recordings/subject × 2 min/recording × 14 subjects = 840 min of recorded material. In Scenario 1, subjects 1 to 6 performed 30 recordings wearing the OMoCap suit and the IMU-set 1. Subjects 7 to 14, wearing the OMoCap suit and the IMU-sets 2 and 3, participated in 2 recordings in Scenario 1, 14 recordings in Scenario 2, and 14 packing recordings in Scenario 3. Due to heavy noise and issues with the sensor readings, some recordings had to be scrapped, and they are not included in the dataset. Thus, the number of recordings per subject deviates in Table 1. A total of 379 recordings (758 min) were annotated and are included in the dataset. Figure 10 shows the varying physical features of all subjects true to scale.
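The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch uses only numbers stated in the text (14 subjects, 30 recordings of 2 min each, 379 annotated recordings, and the 474 person-hours of annotation effort from the abstract); the variable names are chosen for illustration, and the count of excluded recordings is derived under the assumption that all 30 recordings per subject were originally captured.

```python
SUBJECTS = 14
RECORDINGS_PER_SUBJECT = 30
MINUTES_PER_RECORDING = 2
ANNOTATED_RECORDINGS = 379
ANNOTATION_PERSON_HOURS = 474  # annotation effort stated in the abstract

planned_recordings = SUBJECTS * RECORDINGS_PER_SUBJECT            # 420
planned_minutes = planned_recordings * MINUTES_PER_RECORDING      # 840
annotated_minutes = ANNOTATED_RECORDINGS * MINUTES_PER_RECORDING  # 758

# Recordings not contained in the dataset, assuming all 420 were captured.
excluded_recordings = planned_recordings - ANNOTATED_RECORDINGS   # 41
excluded_minutes = planned_minutes - annotated_minutes            # 82

# Subjects 7 to 14: recordings per scenario add up to the 30 per subject.
assert 2 + 14 + 14 == RECORDINGS_PER_SUBJECT

# Annotation effort in minutes per minute of annotated material.
annotation_ratio = ANNOTATION_PERSON_HOURS * 60 / annotated_minutes

print(planned_minutes, annotated_minutes, excluded_recordings)
print(round(annotation_ratio, 1))
```

Under these assumptions, roughly 10% of the planned material is missing from the dataset, and annotation took well over half an hour per minute of recording.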

Recording Procedure
The LARa dataset was recorded in 7 sessions. In the first 3 sessions, subjects 1 to 6 went through Scenario 1. In sessions 4 to 7, data were recorded in all three scenarios with subjects 7 to 14.

Preliminaries
Before the recording, each subject was measured according to the information necessary for the VICON Nexus software: body mass, height, leg length, knee width, ankle width, shoulder offset, elbow width, wrist width, and hand thickness. Subsequently, the test subjects were equipped with an OMoCap suit, a headband, and work safety shoes, as used in real warehouses. Markers and IMUs were attached to the suit. To document the proper positioning of all markers and IMUs, each subject was photographed from four sides before the recording.

Recording Process
For the sake of recording realistic motions, the subjects were introduced to the scenarios by a domain expert in advance. Test runs were carried out before recordings commenced. The subjects were allowed to familiarize themselves with the processes and objects. The subjects did not perform individual and isolated movements as in other datasets that originate from laboratories, e.g., [32]. Rather, realistic motion sequences were the goal. To achieve this, the subjects were only instructed about their tasks within a scenario. They were not told how to perform specific motions necessary to fulfill their task. Thus, the way they handled items, picked boxes, and moved to a location was not influenced by the researchers. The motion was solely determined by each subject's individual preference. In addition, the subjects were not given detailed information about the underlying research goal to avoid a bias in their motion behavior.
Between recording units of two minutes each, a break of only a few seconds was necessary to start the next capture. Hence, the subjects were able to remain focused on the task. The subjects did not take off the suit between recordings. After the recordings concluded, each subject was photographed again from four sides to verify the proper positioning of the markers and sensors.

Documentation and Protocol
A protocol was kept before, during and after the recordings for each subject to ensure repeatability of the recording sessions: time, size of the suit and shoes, room temperature, the use of velcro to fit the suit to the person, RGB video files that were created, number and descriptions of photos taken, remarks, and incidents.
The expenditure of time for recording is made up of the OMoCap system's calibration, the preparation of each subject, their introduction to the scenarios, and the recordings. In total, the expenditure was over 197 PHR (8.22 days) to record 14 h of material. To support the subsequent annotation of the data, the sessions were captured by an RGB camera. In sessions 1 to 3, only occasional recordings were created with an RGB camera, but at least one recording per subject is available. Due to the increasing complexity of human motion and the increasing spectrum of objects in Scenarios 2 and 3, the later sessions, i.e., the subjects 7 to 14, were captured entirely by a camera to ensure that the performed activities are apparent to the annotators. In addition to the 14 subjects, the RGB camera recorded other people who were in the test field at the same time. They provided guidance when the task was unclear, ensured that none of the markers or sensors detached, and continuously maintained the experimental setup, e.g., by refilling the shelves with packed goods. In addition, photos taken before and after the recordings are included in the protocol.
The Remarks section in the protocol includes the number and time of the breaks taken by the subjects, re-calibration of the OMoCap system during the session, injuries of the subjects and unusual movement during recording, e.g., drinking. Incidents mainly include lost or shifted markers and sensors. If a loss was observed during a recording, it was aborted, deleted, and restarted from the beginning. In three instances, a detachment was noticed after the recording session:

• Incidents with respect to S11: After recording 27, it was noticed that the marker of the left finger (see Figure 8, marker number 22) was misplaced. The research group could not determine when exactly the marker shifted its position. After recording 30, it was noticed that the marker of the right ankle (see Figure 8, marker number 35) was lost.

• Incidents with respect to S13: After the last recording (number 30), it was noticed that the marker from the right finger (see Figure 8, marker number 23) and the marker from the left wrist (see Figure 8, marker number 18) were missing. One of the lost markers was found on the left side of the subject's chest.

• Incidents with respect to S14: After recording number 15, it was noticed that the marker of the right forearm (see Figure 8, marker number 17) was stuck to the leg. For the subsequent recordings (number 16 to 30), the marker was put back to its proper position.
Despite these incidents, the data acquired through these recordings were found to be usable.

Classes and Attributes
This subsection explains the definitions of human activities in the dataset. The dataset considers periodic and static activities, following [1]. The dataset contains annotations of semantic coarse-descriptions of the activities. These semantic definitions are called attributes, and they are motivated by HAR methods in [13,16]. An attribute representation can be seen as an intermediate binary mapping between sequential data and human activities. This intermediate mapping is beneficial for solving HAR problems because it allows sharing high-level concepts among the activity classes. As a result, the consequences of the unbalanced-class problem can be reduced. A dataset for HAR contains a set of $N$ sequential samples $X = \{x_1, x_2, \dots, x_N\} \in \mathbb{R}^{D}$ (for the LARa dataset, either the OMoCap or the IMU data), where $D$ represents the number of joints or sensors for each dimension $[x, y, z]$. This parameter is also addressed as the number of sequence channels. The samples come with their respective activity classes $Y^c = \{y^c_1, y^c_2, \dots, y^c_N\} \in \mathbb{N}$ from a set of $C$ activity classes.
Following the method in [16], this dataset additionally provides attribute annotations $Y^a = \{y^a_1, y^a_2, \dots, y^a_N\}$, where $Y^a$ is drawn from an attribute representation $A$. $A$ is a binary attribute representation of size $[K, M]$, with $M$ attribute representations of size $K$, i.e., elements of $\mathbb{B}^{K}$, for all of the activity classes. A single attribute representation $y^a$ serves as an intermediate representation between an input signal $x \in X$ and the expected activity class $Y^c$, i.e., $x \rightarrow Y^a \rightarrow Y^c$. There are $M$ different attribute representations. This is different from [16], where the authors assign a single, random attribute representation to an activity class. In this work, the number $M$ of representations is determined after the annotation process. In the annotation process, a set of attributes is assigned to short windows of the recordings, according to the human movements. Table 7 shows the number of different attribute representations per activity class in LARa.
The definition of activities and their semantic attributes is derived from the researchers' experience [13,33], and from HAR methods [1,16]. The attributes' and activities' terminology by default implies industrial context. This excludes irrelevant activities for warehousing, such as smoking or preparing coffee. This is referred to as a Closed-World Condition [34].

Activity Classes
There are eight activity classes $C = \{c_1, \dots, c_8\} \in \mathbb{N}^{8}$, see Table 2. Standing, Walking and Cart emphasize the subject's locomotion. The Handling activities refer to a motion of the arms and hands when manipulating an article, box, or tool. These activities do not consider holding an element while standing or walking. Synchronization is crucial for proper annotation and for transferring the labels to different sensor streams.

Attributes
There are $K = 19$ attributes $A \in \mathbb{B}^{K}$. These are coarse-semantic descriptions of the activities. They are mostly related to the locomotion and the pose when moving. The human pose changes according to the elements handled and the heights at which they are handled. The attributes are subdivided into five groups, see Table 3 and Figure 11.

Table 3. Attributes and their semantic meaning.

I - Legs
A
B Step: A single step where the feet leave the ground without a foot swing [35] (pp. 3-7). This can also refer to a step forward, followed by a step backwards using the same foot.
C Standing Still: Both feet stay on the ground.

II - Upper Body
A Upwards: At least one hand reaches shoulder height (80% of a person's total height [36] (p. 146)) or is lifted beyond that during the handling activity.
B Centred: Handling is possible without bending over, kneeling or lifting arms to shoulder joint height.
C Downwards: The hands are below the height of the knees (lower than 30% of a person's total height [36] (p. 146)). The subject's spine is horizontal or they are kneeling.
D No Intentional Motion: Default value when no intentional motion is performed, e.g., when standing without doing anything, carrying a box or walking with a cart. This is because there is no intentional motion when performing these activities, only a steady stance.
E Torso Rotation: Rotation in the transverse plane [37] (pp. 2-3). Either a rotating motion, e.g., when taking something from the cart and turning towards the shelf, or a fixed position when handling something while the torso is rotated.

III - Handedness
A Right Hand: The subject handles or holds something using the right hand.
B Left Hand: The subject handles or holds something using the left hand.
C No Hand: Hands are not used, neither for holding nor for handling something.

IV - Item Pose
A Bulky Unit: Items that the subject cannot put the hands around, e.g., boxes.
Handy Unit: Items that can be carried with a single hand or that the subjects can put their hands around, e.g., small articles, plastic bags.
Cart: Either bringing the cart into proper position before taking it to a different location (Handling) or walking with the cart to a new location (No Intentional Motion).
Computer: Using mouse and keyboard.
No Item: Activities that do not include any item, e.g., when the subject fumbles for something when on the search for a specific item.

V - None
A None: Equivalent to the None class.
During the labeling, annotators follow these rules: at least one of the attributes per group must be assigned. In group I, the attributes are disjoint, since a subject performs only one of the motions at a time. The attributes A-D of group II are disjoint, while the torso rotation is independent. In the third group, the choice between right and left is non-exclusive, as one can use both arms at the same time. In group IV, the attributes are disjoint. Annotators give priority to the items according to a hierarchy: Utility-Auxiliary → Computer → HandyUnit → BulkyUnit → Cart. The None and the Synchronization classes have a fixed attribute representation. The execution of the waving motion for synchronizing is predefined.

Exemplary Activity Sequence and Its Proper Annotation
Table 4 shows an exemplary warehousing process that consists of four process steps. This process is an excerpt from Scenario 2. In the first process step, the subject is initially standing (Act. 1) before walking to the cart without holding anything in hands (Act. 2). Then, the cart is brought into proper position with both hands while performing smaller steps (Act. 3) and the subject pulls the cart to the retrieval location using the right hand (Act. 4).
At the beginning of process step 2, the subject is standing while resting the right hand on the cart's handle (Act. 5). Then the subject proceeds to take the scanner from the cart. The first half of this left-handed handling motion is done while performing a step (Act. 6), while the second half is performed while standing with both feet on the ground (Act. 7). It is important to note that the scanner is annotated as a Handy Unit because it is handled as such. In contrast, using it in the following activity is annotated with Utility Auxiliary. The label is located on the subject's right and at eye level, so a Torso Rotation is necessary and the handling is performed upwards (Act. 8). The ninth activity refers to the subject mounting the scanner back on the cart (Act. 9).
In the third process step, the subject picks the item from the shelf (Act. 10-12) and places it in a box located on the lowest level of the cart (Act. 13 and 14). Finally, the pick is confirmed by clicking the put-to-light button located above the box (Act. 15).
There is a wide variety of activity sequences that may constitute the same process. For example, different subjects use different hands when handling an element. In addition, their body motions differ when lifting something from the same height depending on their body size. Thus, the exemplary sequence of activities in Table 4, their class labels and attribute representation are one of many viable options.
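The labeling rules for the attribute groups can be checked mechanically. The following is a minimal sketch, assuming that attributes are handled by name; the function `check_annotation` and the set names are hypothetical and not part of the LARa annotation tool. The names Gait Cycle and Utility Auxiliary are taken from the running text, and the sketch covers groups I to IV only.

```python
# Attribute groups by name. The grouping is an illustrative assumption based
# on the attribute definitions; it is not the official order of the 19 attributes.
LEGS = {"Gait Cycle", "Step", "Standing Still"}
UPPER_A_TO_D = {"Upwards", "Centred", "Downwards", "No Intentional Motion"}
TORSO = {"Torso Rotation"}
HANDS = {"Right Hand", "Left Hand", "No Hand"}
ITEMS = {"Bulky Unit", "Handy Unit", "Utility Auxiliary",
         "Cart", "Computer", "No Item"}

GROUPS = {
    "I - Legs": LEGS,
    "II - Upper Body": UPPER_A_TO_D | TORSO,
    "III - Handedness": HANDS,
    "IV - Item Pose": ITEMS,
}

# Sets whose members must not be assigned together.
DISJOINT = {
    "I - Legs": LEGS,
    "II - Upper Body (A-D)": UPPER_A_TO_D,
    "IV - Item Pose": ITEMS,
}


def check_annotation(active):
    """Return a list of rule violations for a set of active attribute names."""
    active = set(active)
    problems = []
    # Rule: at least one attribute per group must be assigned.
    for name, members in GROUPS.items():
        if not active & members:
            problems.append(f"{name}: no attribute assigned")
    # Rule: attributes within these sets are disjoint.
    for name, members in DISJOINT.items():
        if len(active & members) > 1:
            problems.append(f"{name}: attributes are disjoint")
    return problems


# Assumed annotation of walking without holding anything (cf. Act. 2).
walking = {"Gait Cycle", "No Intentional Motion", "No Hand", "No Item"}
print(check_annotation(walking))
```

Left Hand and Right Hand may be assigned together, so group III is only checked for being non-empty.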

Annotation and Revision
A Python tool was created for annotating the OMoCap data, see Figure A3. The procedure of the annotation and revision is described by Reining et al. [38]. The annotation tool offers a visualization of the skeleton from the OMoCap data and a window-based annotation frame. A window is a segment that is extracted from the sequential data. In the annotation process, an annotator provides the activity class and the attribute representation of a window. Window sizes are variable. Consequently, the annotator selects the size of each window. Twelve annotators labeled the OMoCap data of the 14 subjects. Apart from two annotators, none of them had any prior experience regarding the annotation of OMoCap data. Each annotator followed the guidelines, as mentioned in Section 3.6. RGB videos served as an additional aid for complex activities.
The total time effort for annotation comprised over 474 PHR (19.75 days or 0.65 months). Table 5 illustrates the annotation effort per individual annotator. The information given in the table relates to two-minute recordings. With a range of 39 min to almost 3 h of annotation per recording, the annotators differ greatly in their annotation speed. The reasons for the different annotation speeds are the different levels of experience of the annotators, the different settings of window sizes of activities, and the individually selectable playback speed of the OMoCap recordings in the annotation tool. An average of 37.5 min was required for a one-minute recording. Following the annotation, data were revised by four domain experts, see Table 6. The revision of an annotated two-minute recording varied between 4 and 121 min, depending on the quality of the annotation. Compared to the annotation, the average time for a revision is considerably lower at 11.19 min for a one-minute recording.
The dataset is unbalanced. The Handling classes represent nearly 60% of the recordings. These classes also show a higher variability of their attribute representations; this means that these classes show up in many different forms. The class Handling (centered) is the most frequent activity by far.
The representations of the Walking activity class differ with regard to the Handedness and Item Pose. This is because the Gait Cycle and the No Intentional Motion attributes are fixed. The third class, Cart, can only have three representations: the cart is pushed or pulled using the Left Hand, the Right Hand, or both hands while walking. By definition, there is only one valid representation for both the Synchronization and None classes. This is reflected in the results of the annotation and revision, see Table 7.

Folder Overview of the LARa Dataset
LARa contains data of an OMoCap system, one IMU-set, and one RGB camera as well as the recording protocol, the tool for annotation and revision, and the networks of activity classes and attributes. Table 8 gives an overview of the sizes of the folders and the formats of the files. The files of the OMoCap data, IMU data, and RGB videos are named after the logistics scenario, subject, and recording. For example, the file name L01_S02_R12 means logistics scenario 01, subject 02, recording 12.
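The naming scheme can be decoded programmatically. The following is a minimal sketch; the pattern is inferred from the single example L01_S02_R12, so the two-digit field width and the helper name `parse_recording_name` are assumptions.

```python
import re

# Pattern inferred from the example "L01_S02_R12"; two-digit fields are assumed.
NAME_PATTERN = re.compile(r"L(\d{2})_S(\d{2})_R(\d{2})")


def parse_recording_name(name):
    """Split a recording name into scenario, subject, and recording numbers."""
    match = NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"not a LARa-style recording name: {name!r}")
    scenario, subject, recording = (int(group) for group in match.groups())
    return {"scenario": scenario, "subject": subject, "recording": recording}


print(parse_recording_name("L01_S02_R12"))
```

Because the pattern is searched rather than matched against the full string, the sketch also accepts names that carry a prefix or a file extension.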

Deploying LARa for HAR
The tCNN, proposed in [18], was deployed for solving HAR using the LARa dataset. Some minor changes to the architecture are proposed here. Our tCNN contains four convolutional layers, no downsampling operations, and three fully-connected layers. Downsampling operations are not deployed as they affect the performance of the network negatively, following the conclusions of [16]. The convolutional layers are composed of 64 filters of size [5,1], which perform convolutions along the time axis. The first and second fully-connected layers contain 128 units. Considering the definitions in Section 3.6, there are two different last fully-connected layers, depending on the task. A softmax layer is used for direct classification of the activity classes. It has C = 8 units. A fully-connected layer with sigmoid activation function is used for computing attributes. This layer contains 19 units. The number of output units thus corresponds to the number of classes or attributes, respectively, see Section 3.6. Figure 12 shows the tCNN's architecture.
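The layer sizes that result from this description can be traced with a short calculation. The following sketch assumes unpadded convolutions with stride 1, which the text does not state, and uses the input size [T = 200, D = 126] of the OMoCap data; the function names are hypothetical.

```python
def conv_output_length(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def tcnn_shapes(T=200, D=126, n_conv=4, kernel_t=5, filters=64,
                fc_units=128, n_out=8):
    """List (layer name, output shape) pairs for the described tCNN.

    Assumes no padding and stride 1, so each [5,1] convolution shortens the
    time axis by four samples and leaves the channel axis untouched.
    """
    shapes = [("input", (1, T, D))]
    t = T
    for i in range(n_conv):
        t = conv_output_length(t, kernel_t)
        shapes.append((f"conv{i + 1}", (filters, t, D)))
    shapes.append(("flatten", (filters * t * D,)))
    shapes.append(("fc1", (fc_units,)))
    shapes.append(("fc2", (fc_units,)))
    shapes.append(("output", (n_out,)))
    return shapes


for name, shape in tcnn_shapes():          # softmax variant, 8 classes
    print(name, shape)
print(tcnn_shapes(n_out=19)[-1])           # sigmoid variant, 19 attributes
```

Under these assumptions, the time axis shrinks from 200 to 184 samples over the four convolutional layers, while the 126 sequence channels are preserved because the filters only span the time axis.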
The architecture processes sequence segments that consist of a feature map input of size $[T, D]$, with $T$ the sequence length and $D$ the number of sequence channels. The sequence segments are extracted following a sliding-window approach with a window size of $T = 200$ and a step size of $s = 25$ (87.5% overlap). The number of sequence channels $D$ is 126, as there are measurements of position and rotation in $[x, y, z]$ for the 21 joints of the LARa OMoCap dataset. This excludes the joint "lower_back", as it is used for normalizing the human poses with respect to the subject. In general, the input sequence is of size $[T = 200, D = 126]$ for the dataset. The tCNN computes either an activity class $y^c$ or a binary attribute representation $y^a$ from an input sequence. Predicting attribute representations follows the method in [16]. Differently from a standard tCNN, this architecture contains a sigmoid activation function replacing the softmax layer. The sigmoid activation function is computed as $\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$. This function is applied to each element of the output layer. The output $\tilde{y}^a \in \mathbb{B}^{19}$ can be considered as binary pseudo-probabilities for each attribute being present or not in the input sequence. The architecture is trained using the binary cross-entropy loss given by:

$$E(y^a, \tilde{y}^a) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^a_i \log(\tilde{y}^a_i) + (1 - y^a_i) \log(1 - \tilde{y}^a_i) \right],$$

with $n = 19$ the number of attributes, $y^a \in \mathbb{B}^{19}$ the target attribute representation, and $\tilde{y}^a \in \mathbb{B}^{19}$ the output of the architecture. Following [3,12], input sequences are normalized per sensor channel to the range of $[0, 1]$. Additionally, Gaussian noise with parameters $[\mu = 0, \sigma = 0.01]$ is added. This noise simulates the sensors' inaccuracies.
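The segmentation and the loss described above can be illustrated in a few lines. This is a minimal sketch in plain Python, not the authors' implementation; the function names are hypothetical, the loss is assumed to be averaged over the attributes, and the small constant `eps` is added only to keep the logarithm finite.

```python
import math


def window_starts(n_frames, window=200, step=25):
    """Start indices of all full windows in a sequence of n_frames frames."""
    if n_frames < window:
        return []
    return list(range(0, n_frames - window + 1, step))


def overlap_ratio(window=200, step=25):
    """Fraction of a window shared with its successor."""
    return (window - step) / window


def sigmoid(x):
    """Element-wise activation of the attribute output layer."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(target, output, eps=1e-12):
    """Mean binary cross-entropy between binary targets and sigmoid outputs."""
    total = 0.0
    for y, p in zip(target, output):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(target)


print(window_starts(250))   # three full windows fit into 250 frames
print(overlap_ratio())      # overlap for T = 200 and s = 25
print(binary_cross_entropy([1, 0], [sigmoid(0.0), sigmoid(0.0)]))
```

With $T = 200$ and $s = 25$, consecutive windows share 175 of 200 frames, which reproduces the stated overlap of 87.5%.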
Following the training procedures from [1,2], the LARa OMoCap dataset is divided into three sets: training, validation, and testing. The training set comprises recordings from subjects [S01, S02, S03, S04, S07, S08, S09, S10]. The validation and testing sets comprise recordings from [S05, S11, S12] and [S06, S13, S14], respectively. An early stopping approach is followed using the validation set. This set is also deployed for finding proper training hyperparameters. Recordings with label None are not considered for training, following the procedure in [3]. The architecture is trained using batch gradient descent with the RMSProp update rule with an RMS decay of 0.9, a learning rate of $1 \times 10^{-5}$, and a batch size of 400. Moreover, Dropout was applied to the first and second fully-connected layers. In the case of predicting attributes and for solving HAR, a nearest neighbor (NN) approach was used for computing a class $c \in C$. The Euclidean distance is measured from the predicted attribute vector $\tilde{y}^a$ to the attribute representation $A \in \mathbb{B}^{[M,K]}$, with $M = 204$ and $K = 19$. This is possible as each activity class $c \in C$ is related to a certain number of binary attribute vectors in the attribute representation $A$, see Table 7. LARa provides the attribute representation $A$. Both the attribute vector $\tilde{y}^a$ and the attribute representations $A$ are normalized using the 2-norm. The tCNN is also trained using a softmax layer predicting activity classes directly. In this case, the architecture is trained using the cross-entropy loss.
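The nearest-neighbor step from a predicted attribute vector to an activity class can be sketched as follows. The attribute representation below is a toy example with four attributes and three rows, chosen only for illustration; it is not the representation provided with LARa ($M = 204$, $K = 19$), and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit 2-norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def nearest_class(predicted, representations):
    """Return the class of the attribute vector closest to the prediction.

    representations: list of (class_label, binary_attribute_vector) pairs;
    a class may appear with several attribute vectors.
    """
    query = l2_normalize(predicted)
    best_label, best_dist = None, float("inf")
    for label, attrs in representations:
        ref = l2_normalize(attrs)
        dist = math.sqrt(sum((q - r) ** 2 for q, r in zip(query, ref)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy attribute representation (hypothetical values, K = 4).
toy_representation = [
    ("Standing", [0, 1, 1, 0]),
    ("Walking", [1, 0, 1, 0]),
    ("Walking", [1, 0, 0, 1]),
]

print(nearest_class([0.9, 0.1, 0.2, 0.7], toy_representation))
```

Because several attribute vectors may belong to the same class, the class label of the closest row is returned rather than the row itself.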
Tables 9 and 10 present the performance of the method solving HAR on the LARa OMoCap dataset using the softmax layer and the attribute representation. Precision is computed as $P = \frac{TP}{TP+FP}$ and recall as $R = \frac{TP}{TP+FN}$, with $TP$, $FP$, and $FN$ being the true positives, false positives, and false negatives. The weighted F1 is calculated as $wF1 = \sum_{i}^{C} 2 \times \frac{n_i}{N} \times \frac{P_i \times R_i}{P_i + R_i}$, with $n_i$ being the number of window samples of class $C_i \in C$. Handling and moving Cart activities show the best performances. Using the attribute representation boosts the performance in comparison with the softmax classifier. The approach classifies the Synchronization and Standing activities when using attribute representations. In general, deploying an attribute representation boosts the performance of HAR. These results coincide with [13,16]. Attributes belonging to frequent classes help with the classification of less frequent classes. The effects of the unbalanced problem are also reduced.
Tables 11 and 12 show the confusion matrices of the predictions using the tCNN in combination with the softmax layer and the NN using the attribute representation. In general, the method exhibits difficulties predicting the class Standing. The method mispredicts Standing sequence segments as Handling (centered) ones. The class Walking also presents some mispredictions. Following the results in Tables 9 and 10, solving HAR using the attribute representation offers a better performance in comparison with the usage of a softmax layer. The classification of activity classes, e.g., Synchronization, Standing, and Walking, improves significantly.
Table 13 presents the performance on the attributes. Attributes are correctly classified in general. Attributes none and error are not present in the test dataset. However, they are not misclassified. Attribute Torso Rotation is also not mispredicted. Nevertheless, the precision and recall of this attribute are zero. This suggests that it is not classified when it should be. Further improvement on this particular attribute is needed.
Both trained tCNNs (using a softmax and a sigmoid layer) and the attribute representation A are included in the annotation and revision tool. Implementation code of the annotation tool is also available in [39]. These results seek to give a first evaluation of the dataset for solving HAR.
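The precision, recall, and weighted F1 used in the evaluation can be computed from a confusion matrix. The following is a minimal sketch; the function name and the two-class confusion matrix are hypothetical and do not reproduce any value from Tables 9 to 12.

```python
def weighted_f1(confusion):
    """Weighted F1 from a confusion matrix.

    confusion[i][j] is the number of windows of true class i that were
    predicted as class j. Classes with zero precision and recall contribute
    nothing to the sum.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    score = 0.0
    for i in range(n_classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(n_classes)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        n_i = sum(confusion[i])  # number of window samples of class i
        if precision + recall:
            score += 2 * (n_i / total) * precision * recall / (precision + recall)
    return score


# Hypothetical two-class example.
print(weighted_f1([[8, 2], [1, 9]]))
```

The weighting by $n_i / N$ lets frequent classes dominate the score, which is worth keeping in mind given the class imbalance of the dataset.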

Discussion and Conclusions
This contribution presents the first freely accessible dataset for the sensor-based recognition of human activities in logistics using semantic attributes, called LARa. Guidelines for creating a dataset were developed based on an analysis of related datasets. Following these guidelines, 758 min of picking and packing activities of 14 subjects were recorded, annotated, and revised. The dataset contains OMoCap data, IMU data, RGB videos, the recording protocol, and the tool for annotation and revision. Multichannel time-series HAR was solved for LARa using temporal convolutional neural networks (tCNNs). Classification performance is consistent with the state of the art using tCNNs. Semantic descriptions or attributes of human activities improve classification performance. This supports the effort of annotating attributes and the conclusions from [16].
From an application perspective, the following approaches for fundamental research as well as industrial application result from the LARa dataset:

• The laboratory dataset LARa will be deployed on IMU data recorded in an industrial environment. The addition of more subjects and the inclusion of further logistical processes and objects is conceivable. New attributes may be added.

• Another approach to recognize human activity is the context. The context may provide information about locations and articles and broaden the application spectrum of the dataset. Context information about the process is provided in this contribution.

• Dependencies between the activities have to be examined, e.g., state-machines. Can information about dependencies increase the accuracy of the recognition of human activities in logistics?

• Finally, the industrial applicability must be proven through a comparison between sensor-based HAR and manual-time management methods, such as REFA and MTM. Can manual-time management methods be enhanced using HAR and LARa?
Further experiments concerning the relation between the activity classes and the attributes will be relevant. Analyzing the architecture's filters and their activations using the attribute representations will be useful for understanding how the deep architectures process input signals. The LARa dataset can be used for solving retrieval problems in HAR. Retrieval tasks might help facilitate data annotation. Additionally, data-stream approaches will be relevant to address using this dataset. A comparison of HAR methods using statistical pattern recognition and deep architectures is also to be addressed. The extensive LARa dataset will be of use for investigating RNNs. Computational causal behavior models are of interest for including the flow charts of the scenarios and longer temporal relations of the input signals.

Appendix A. Related Datasets
This section gives an overview of state-of-the-art datasets for HAR. Moreover, it underlines the necessity of a new dataset for HAR in logistics. Similarities among different datasets serve as a basis for creating new guidelines. The guidelines describe how a dataset for HAR is created, answering the second research question. The novel LARa dataset is created based on the new guidelines and the findings gained from a dataset similar to logistics. Due to data protection regulations, the following analysis is restricted to OMoCap and IMU datasets.
Appendix A.1. State-of-the-Art Datasets for HAR
A dataset overview was conducted. This overview is a modification of the guidelines for Literature Reviews suggested by Reining et al. [10], , and Chen et al. [43]. Figure A1 illustrates the dataset overview steps. Firstly, datasets are searched according to a predefined list of keywords. Secondly, the list of datasets is filtered by Content Criteria. Thirdly, datasets are obtained from papers that present methods for HAR. This step includes a loop. Fourthly, the content and properties of the datasets are analyzed. The search process with predefined keywords was carried out on the following platforms: Figshare, Github, Google Scholar, Researchgate, Science Direct, Scopus, UCI Machine Learning Repository, and Zenodo. Only datasets presented in English were taken into consideration during this process. All datasets must have been published by 31 March 2020. While collecting the datasets, the keywords used for the search were grouped into three categories. The platforms were searched with the keywords from Category A in combination with the keywords from Category B or C, e.g., IMU dataset, HAR dataset, or accelerometer database.

Figure A1. Steps of the dataset overview, including filtering datasets and analysing datasets.
A filtering process reduces the list of searched datasets. This process is based on a predefined list of keywords, see Table A1, and it is divided into four stages, see Figure A1. At each stage, a dataset is examined as to whether it meets one of the Content Criteria, consecutively. If the dataset fails any of the Content Criteria, it is excluded from the following stages. Datasets in Stage IV were taken into account for further analysis.

Physical Activity
Caspersen et al. [44] defined physical activity "as any bodily movement produced by skeletal muscles that results in energy expenditure". The definition of physical activity is limited by torso and limb movement [10].
Stage I contains 173 unique datasets that consider the recognition of human movements. In Stage II, 78 datasets were excluded as these do not consist of measurements from IMU or OMoCap. They are based mainly on RGB and depth data. Moreover, datasets with inertial measurements from objects are also excluded. In Stage III, 25 datasets are excluded as access to them is restricted or not possible. Unreachable URLs are the common cause. Moreover, datasets that required a paid account or a one-time payment, or that produced downloading errors, are not considered. In Stage IV, 9 datasets are excluded as they do not include human activities according to Caspersen's definition [44] and to the limitation of the bodily movement to torso and limb movement. These activities relate to emotion, gender identification, occupancy detection, or facial expression. Furthermore, simulations are excluded. At this stage, unlabeled datasets are also excluded.
A total of 61 datasets corresponded to all four Content Criteria. Table A2 shows the filtering stages. Scientific publications that describe these datasets in detail were searched. They also offer applications of the datasets for solving different problems. These publications additionally became a source for identifying related datasets, as they tend to compare them.
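The number of datasets remaining after each filtering stage follows directly from the exclusions reported above. The following short sketch, with a hypothetical helper name, retraces this arithmetic.

```python
def remaining_after_stages(initial, exclusions):
    """Number of datasets left after each consecutive exclusion step."""
    remaining = [initial]
    for excluded in exclusions:
        remaining.append(remaining[-1] - excluded)
    return remaining


# 173 datasets in Stage I; 78, 25, and 9 datasets excluded in Stages II to IV.
counts = remaining_after_stages(173, [78, 25, 9])
print(counts)
```

The last entry of `counts` matches the 61 datasets that meet all four Content Criteria.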
Datasets and the corresponding publications from Stage IV are organized according to a categorization scheme. It consists of five categories: General Information, Domain, Data Specification, Sensor Type, and Sensor Location. Table A3 presents four of these categories. Table A4 organizes the datasets following the categorization scheme. The fifth root category refers to the attachment of the sensors and markers to the subject's body, divided into 13 body regions. Key information regarding the annotation was noted. This includes the annotation method, the number and area of the annotators' expertise, the annotation effort, the annotation tool, and supporting sensors.
Figure A2 shows a significant increase in published datasets since 2009. No time frame restrictions were applied during the filtering process. However, no records could be found before 2003 and only four were found in the period from 2003 to 2008.

The recordings were made in a laboratory environment

Real Life
The recordings were made in a real environment, e.g., outdoors, on a sports field, or in a production facility

OMoCap [Hz]
Optical marker-based Motion Capture with frames per second or hertz as a unit

IMU [Hz]
Inertial measurement unit with hertz as a unit

Other Sensors
Sensors except IMU and OMoCap

Phone, Watch, Glasses
Use of sensors built in smartphone, smartwatch, or smart glasses

Authors give a different meaning to the term activity class. The word activity could describe a movement, an action, a motion, a gesture, or a process step. In some instances, dance movements and gestures have been considered as activities. The overview table shows examples of these instances, such as the Leuven Action Database [45], the Vicon Physical Action Dataset [46], the Sensors Activity dataset [47], and the KIT Whole-Body Human Motion Database [48].
Only a few authors have considered a hierarchy of activities and their descriptive features, which are called postures or attributes. The specification in the UTD Multimodal Human Action Dataset [49] is an example. The movements with the left and right hands are regarded as individual activities; thus, the same motion with the opposite hand represents a different activity. Additionally, the duration of the activity differs. Activities may last from a fraction of a second up to several min. For example, the IMU Dataset for Motion and Device Mode Classification in [50] contains activities lasting for 45 to 60 min. The duration of activities in the Physical Activity Monitoring Dataset (PAMAP2) [5] is approximately 3 min.
The recording times vary widely. Without the two longest recordings [51,52], the average recording time per dataset is 603 min. Forty datasets contain 15 or fewer subjects. The number of activity classes that are present in a dataset varies heavily depending on the number and properties of its application domains. For example, [53] solely addressed locomotion activities that were subdivided in Slowly Walking and Running. In contrast, the dataset presented by Müller et al. contains 70 activity classes in the domains Exercises, Locomotion, ADL, and Dance [54].
IMU sensors dominate with 51 datasets. In addition to the significantly lower cost compared to an OMoCap system, their flexible use in a real environment is particularly advantageous. Recording rates vary from 10 Hz [55][56][57] to 700 Hz [58], with an average rate of 86.2 Hz. The selection of the recording rate is rather arbitrary and not well justified. It is striking that smartphones were frequently used as IMUs (24 times). Apart from the smartphones, there are seven datasets with smartwatches [52,59-64] and two datasets with smart glasses [62,63]. Further, data that rely on IMUs only, without video recordings, would be difficult to annotate, e.g., in non-orchestrated scenarios. To facilitate the annotation, subjects performed only one activity in one recording, e.g., [65,66]. The realistic body movement of a daily activity would not be captured in this particular case. Moreover, IMUs tend to be affected by noise in the presence of metal stands and by drift in long-lasting recordings. In [67], recordings of 2-6 days per subject were taken; this assumes realistic human activities. From the subject's point of view, this could cause fatigue and irritation and, in some cases, lead to the subject forgetting to carry or attach the measurement device. The positions of sensors vary. Subjects carry smartphones in their pockets or in their hands. Smartphones have also been placed on the belt. Moreover, one dataset contains recordings from six smartphones that are distributed over the entire body [51]. The placements may differ from the placements specified in Table A4. One reason is the difference in the interpretation of body parts. In individual cases, the sensor was placed on the hip and not on the waist or vice versa. If the smartphone was placed in the pocket, the position is regarded as hip or upper leg depending on the position of the pocket.
The OMoCap system requires a complex and cost-intensive infrastructure. A total of 15 datasets have OMoCap data with eleven different recording rates between 30 and 500 Hz. Four datasets contain both IMU and OMoCap data. The attachment of the markers is determined by the respective systems, e.g., provided by Vicon [68]. In addition to IMUs and the OMoCap systems, other sensors were used. RGB video streams are commonly recorded for annotation purposes [4,69,70]. Other sensors are BodyMedia, depth cameras, electromyography (EMG), Global Positioning System (GPS), heart rate monitor, light, infrared, microphone, photoplethysmogram (PPG), pressure sensor hand-glove, and radio-frequency identification (RFID). OMoCap serves as ground truth in [71,72].
The descriptions of the datasets rarely contain information about annotation. No standardized annotation process can be identified from the available information. There is no common annotation tool, and the tool used is rarely specified; one exception is [70], which uses the Anvil software [73]. Only in exceptional cases, such as [48], has the annotation tool been published. The annotations were carried out by both non-domain-experts and domain-experts. In Daphnet Gait, the gait symptom of Parkinson's patients was analyzed by physiotherapists [69]. In addition to expertise, the number of annotators differs. The Hand Gesture dataset [74] was annotated by one person, whereas AndyData-lab-onePerson [70] included three annotators. The specified annotation effort varies greatly. Some data were annotated in real-time [5,69,75,76]. Other data, such as [4], required 14 to 20 min for the annotation of a one-minute recording.
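To make such effort figures comparable across datasets, they can be expressed as annotation minutes per recorded minute. The following minimal sketch does this for the figures stated in the abstract of this contribution (758 min of recordings, 474 person-h of labeling) and sets the result against the 14 to 20 min reported for [4]. The function name and variable names are illustrative assumptions and not part of any dataset's tooling.

```python
def annotation_minutes_per_recorded_minute(person_hours, recorded_minutes):
    """Convert a total annotation effort into minutes of effort per recorded minute."""
    if recorded_minutes <= 0:
        raise ValueError("recorded_minutes must be positive")
    return person_hours * 60.0 / recorded_minutes


# Figures taken from the abstract of this contribution.
lara_ratio = annotation_minutes_per_recorded_minute(person_hours=474, recorded_minutes=758)

# Range reported for [4]: 14 to 20 min per one-minute recording.
reported_range = (14.0, 20.0)

print(f"LARa: {lara_ratio:.1f} annotation min per recorded min")
print(f"[4]:  {reported_range[0]:.0f} to {reported_range[1]:.0f} annotation min per recorded min")
```

The ratio only normalizes the totals; it does not account for differences in the number of annotators, the label granularity, or revision passes.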
There is no standardized structure for datasets. The recording protocol is not globally predetermined. Likewise, there is no standardized vocabulary. The same term, such as the activity class mentioned before, is understood differently depending on the author. In addition to repositories, datasets are often stored on private websites. As a result of this non-permanent storage, 16 datasets are no longer available. The repositories include Figshare [77], the UCI Machine Learning Repository [78], and Zenodo [79]. Collaboration and file-hosting platforms such as GitHub [80] and Dropbox [81] were also used for file sharing. Further, ResearchGate [82] is used to store and access datasets as well as the papers relevant to them.
Five datasets deal with working activities. The activity classes can be divided into two categories: on the one hand, office work or general work such as "writing on paper" [64], "typing on keyboard" [64], "working" [58], and "LAB_WORK" [52]; on the other hand, physical work in production [83] and logistics. AndyData-lab-onePerson partially covers logistical activity classes [70]. Maurice et al. [70] considered ergonomics in an industrial environment with six activities, such as screwing at different heights and carrying weights. The annotated movements were divided into three levels. Level one, general posture, includes locomotion. Level two, detailed posture, describes the position of the torso and the hands. Level three, current action, includes movements from intra-logistics and production, such as reach, pick, place, release, carry, manipulate objects, and screw. Movements in the packaging process, among other things, are missing to fully cover intra-logistics activities.
No dataset from Table A4 meets all the requirements for describing logistical activities, but AndyData-lab-onePerson can serve as a blueprint for creating a logistics-dataset.

Funding: This research was funded by the German Research Foundation, grant numbers Fi799/10-2 and HO2403/14-2.

Conflicts of Interest: The authors declare no conflict of interest.