Human Activity Recognition through Recurrent Neural Networks for Human–Robot Interaction in Agriculture

Abstract: The present study deals with human awareness, which is a very important aspect of human–robot interaction. This feature is particularly essential in agricultural environments, owing to the information-rich setup that they provide. The objective of this investigation was to recognize human activities associated with an envisioned synergistic task. In order to attain this goal, a data collection field experiment was designed that derived data from twenty healthy participants using five wearable sensors (embedded with tri-axial accelerometers, gyroscopes, and magnetometers) attached to them. The above task involved several sub-activities, which were carried out by agricultural workers in real field conditions, concerning load lifting and carrying. Subsequently, the obtained signals from on-body sensors were processed for noise-removal purposes and fed into a Long Short-Term Memory neural network, which is widely used in deep learning for feature recognition in time-dependent data sequences. The proposed methodology demonstrated considerable efficacy in predicting the defined sub-activities with an average accuracy of 85.6%. Moreover, the trained model properly classified the defined sub-activities in a range of 74.1–90.4% for precision and 71.0–96.9% for recall. It can be inferred that the combination of all sensors can achieve the highest accuracy in human activity recognition, as concluded from a comparative analysis for each sensor's impact on the model's performance. These results confirm the applicability of the proposed methodology for human awareness purposes in agricultural environments, while the dataset was made publicly available for future research.


Introduction
Agriculture employs a significant number of workers all over the world, particularly in developing countries. The adoption of modern technologies, including Information and Communication Technologies (ICT), such as Internet of Things (IoT), Artificial Intelligence (AI), Farm Management Information Systems (FMIS), wearable computers and robotics, has led to the so-called "agriculture 4.0" [1,2]. However, despite the plethora of technological advances, which have arguably improved farmers' living standards to a substantial degree, their safety is frequently underestimated. Epidemiological studies have identified several safety and health issues associated with agricultural occupations [3,4]. Focusing on the non-fatal health issues, work-related musculoskeletal disorders (MSDs) have proved to be the most common ones [5,6].
The most common manual operations in agriculture include harvesting, weeding, digging, pruning, sorting, and load lifting and carrying, with the last task having gained relatively little attention. This does not conform to the gravity of the matter, as has been processed and transformed into the appropriate shape for a series of mathematical or logical operations. In general, the purpose of a ML algorithm is to be able to produce a model that will fit the data in the best possible way so as to predict unknown examples with the highest accuracy. In the case of HAR, the aim of the ML algorithm is to learn the characteristic features of the signals collected from the on-body sensors in order to be able to classify the correct activity for a particular timeframe. Afterwards, important feature vectors are extracted to minimize the classification errors and computation time [28]. Finally, the classification phase serves to map the selected features into a set of activities by exploiting ML techniques. Through implementing ML algorithms, models can be developed via iterative learning from the extracted features, up until they are able to optimally model a process. ML has extensively been implemented in agriculture, thus offering valuable solutions to several tasks such as crop, livestock, water, and soil management [29], to mention but a few. As far as the HAR is concerned, a plethora of ML models have been utilized, such as Hidden Markov Model [30], Support Vector Machine [31], K-Nearest Neighbor [32], Naive Bayes [33], Decision Tree [34], and Long Short-Term Memory (LSTM) [35]. Nonetheless, the literature regarding the use of ML for automated recognition, by using the data from wearable sensors collected throughout agricultural operations, is very limited. Indicative studies are those of Patil et al. [36] (use of accelerometers to detect digging, harvesting, and sowing) and Sharma et al. 
[37][38][39] (use of GPS, accelerometers, and microphone sensors to detect harvesting, weeding, bed making, and transplantation).
The aim of the present study was to identify human activities related to a particular task, namely lifting a crate and placing it onto a robot suitable for agricultural operations, with the use of an ML algorithm (LSTM) for sequential data classification. Since the agricultural environment is a dynamic ecosystem, which is susceptible to unforeseeable situations [40], other human sub-activities comprising this task were also investigated, including standing still as well as walking with and without the crate. Two common lightweight agricultural Unmanned Ground Vehicles (UGVs) were used: the Husky and Thorvald robots. For the purpose of gathering the data, 20 healthy participants took part in outdoor experimental sessions, wearing five IMU sensors (embedded with tri-axial accelerometers, gyroscopes, and magnetometers) at different body positions. To the best of our knowledge, no similar study exists. This investigation, through providing the activity "signatures" of the workers, has the potential to increase human awareness during HRI, thus contributing toward establishing an optimal ecosystem in terms of both cost savings and safety. Finally, the present dataset is made publicly available [41] for the sake of future examination by other researchers.

Experimentation Setup
The experimental tests were carried out on a farm in the region of Volos, in central Greece. The study involved 20 participants (13 male, 7 female), whose average age, height, and weight were 30.95 years (SD ≈ 4.85), 1.75 m (SD ≈ 0.08), and 75.40 kg (SD ≈ 17.20), respectively, where SD denotes the standard deviation. The participants' demographics are summarized in Table 1. To be eligible for inclusion in the present investigation, subjects must not have had any history of surgeries or sustained any musculoskeletal injury during the last year that could influence their performance. Prior to any experimental procedure, all participants had to complete an informed consent form that was approved by the Institutional Ethical Committee. Following informed consent, each participant had to perform a specific task. In particular, they had to walk an unobstructed distance of 3.5 m, lift a crate, and carry it back to their point of departure, where they had to place it on a stationary agricultural robot. This task can be divided into six continuous sequential sub-activities, namely:
1. Standing still until the signal is given to start;
2. Walking a distance of 3.5 m without carrying any crate;
3. Bending down to approach the crate;
4. Lifting the crate from the ground to an upright position;
5. Walking back the distance of 3.5 m while carrying the crate;
6. Placing the crate onto the robot.
For the purpose of the present study, two UGVs were utilized (Husky and Thorvald), which are usually used in outdoor environments (Figure 1) [42,43]. The two available UGVs correspond to a deposit height of the crate equal to 40 cm (Husky) and 80 cm (Thorvald). Furthermore, the crate was either empty (tare weight equal to 1.5 kg) or loaded with weight plates so that the total mass (crate and plates) was approximately equal to 20% of each participant's mass, similarly to [44,45]. The available weight plates had a mass of 1 and 2.5 kg, so that the required mass to be lifted and carried could be adjusted easily. An open plastic crate, commonly used in agriculture, was used, with handles on both sides at a height of 28 cm above its base. The dimensions of the crate were 31 × 53 × 35 cm (height × width × depth). Consequently, each participant carried out four sub-cases:
1. Empty crate-Husky;
2. Crate full of the required weight-Husky;
3. Empty crate-Thorvald;
4. Crate full of the required weight-Thorvald.
Each sub-case was performed three times in a randomized order and at each participant's own pace, which amounts to 12 efforts for each subject. Finally, all participants were instructed to carry out a five-minute warm-up in order to avoid possible injuries. The inclusion of different subjects, in terms of gender, age, weight, and height, as well as of different loading heights on the robots, was aimed at a large variability in the collected data, so that the trained model would be able to identify the activities conducted under a broad range of conditions.

Data Acquisition and Sensors
At the start of each day of experiments, the five VICON IMeasureU Blue Trident sensors were calibrated according to the manufacturer's directions [46]. These IMUs are small and lightweight (12 g). Prior to the start of each effort, the sensors were attached to the chest (breastbone), cervix (approximately T1 vertebra), lumbar region (approximately L4), and right and left wrists, as depicted in Figure 2. The IMUs were attached to the two wrists via special Velcro straps (provided by the manufacturer), while the remaining three sensors were attached via double-sided tape, similarly to studies such as [47]. Each IMU encompasses a tri-axial accelerometer, a tri-axial gyroscope, and a tri-axial magnetometer. The specifications of the IMUs are summarized in Table 2 according to [48]. These kinds of sensors have been used in several recent studies, including [49][50][51][52]. The sampling frequency used throughout the experimental sessions was 50 Hz, which is considered to be adequate for this kind of investigation, similarly to experimental investigations such as [21,24,52].

The Capture.U software (provided by VICON) [53] was used to synchronize the sensors and capture the data, while the latter were saved directly to the sensors for further processing. Since Capture.U is available only for iOS devices, an Apple iPad mini (64 GB) [54] was utilized for the present investigation.

Distinguishing the Sub-Activities
The Capture.U software in conjunction with the iPad offers the choice of simultaneously recording the experimental session at hand. This feature was particularly useful for distinguishing the sub-activities and finding the critical instant at which the transition between them took place, similarly to [55]. Each sequence starts with the subject standing still (labeled as sub-activity "0"), which is used as a two-fold baseline: (a) for establishing a distinctive and realistic "idle" activity and (b) for allowing the timely synchronization of the sensors before starting the sequence. The rest of the sub-activities are described next:

• The walking without the crate sub-activity (labeled as "1") begins immediately after one of the feet first leaves the ground (beginning of stance phase of gait [56]);
• The third sub-activity (labeled as "2") begins when the participant starts bending their trunk, kneeling, or simultaneously performs them both, which are usually referred to as stoop, squat, and semi-squat techniques, respectively, in the relevant literature [57];
• The fourth sub-activity (labeled as "3") begins when the participant starts lifting the crate from the ground [58];
• The fifth sub-activity (labeled as "4") begins when the participant starts the stance phase of gait, similarly to the above description, this time while carrying the crate;
• The sixth sub-activity (labeled as "5") begins when the participant starts bending their trunk, kneeling, or simultaneously performs them both, as described above, and ends when the entire surface of the crate is placed onto the UGV.
Obviously, as the required tasks are continuous in nature, the beginning of one task corresponds to the end of the previous one. Since the participants performed the tasks in their own way and at their own pace, following the above well-defined criteria was of major importance in order to assure the reliability of the results. In this regard, it should be highlighted that whenever at least one of the sensors was not synchronized with the others (this was recognized while processing the dataset), the corresponding measurements of the remaining four sensors were also discarded. The aforementioned labeling of the sub-activities was used for the rest of the signal pre-processing, as will be elaborated below.

Outlier and NaN Handling
Even under normal conditions, data collection from sensors in real-life applications can be problematic. Throughout experimentation, hardware failures and malfunctions can occur, resulting in gaps or irregular values in the dataset. During the experimental sessions, sensors could randomly stop collecting data, thus creating gaps in the dataset. These gaps were identified in the early stages, and the affected experiment was discarded entirely. In other cases, the sensors recorded all data properly; however, during the pre-processing phase, irregular values, i.e., outliers, could appear in the dataset. These outliers usually differ from most values by orders of magnitude, and they do not represent the physical behavior of the subject. In the present study, there was a limited number of outliers, which were all removed manually from the dataset.
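The gap check described above can be sketched as follows. This is only an illustration: the data layout (one dictionary per trial, mapping a sensor name to its list of samples) and the function names `has_gaps` and `keep_complete_trials` are hypothetical and are not taken from the study. A trial is dropped as a whole as soon as any of its sensor streams contains a missing value.

```python
import math

def has_gaps(samples):
    """Return True if any reading in a sensor stream is missing (None or NaN)."""
    return any(v is None or math.isnan(v) for v in samples)

def keep_complete_trials(trials):
    """Keep only the trials in which every sensor stream is gap-free."""
    return [t for t in trials if not any(has_gaps(s) for s in t.values())]

# Two toy trials; the second one has a NaN in the chest stream.
trials = [
    {"chest": [0.1, 0.2, 0.3], "lumbar": [0.0, 0.1, 0.1]},
    {"chest": [0.1, float("nan"), 0.3], "lumbar": [0.0, 0.1, 0.1]},
]
clean = keep_complete_trials(trials)
```

The outliers, in contrast, were removed manually in the study, so no automatic rule is sketched for them here.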

Noise Reduction
Signals or one-dimensional data usually contain unwanted or unknown components that can be introduced by the capturing, transmitting, or storing device. In this investigation, the focus was on removing the noise from the collected data and not on identifying its root cause. Noise removal techniques remove irregular fluctuations by running filters (mathematical operations) over the entire signal and replacing the "noisy" values with "smooth" ones. These methods enable the ML algorithms to better learn the trends and fluctuations instead of random value variations. In this analysis, the median filter [59] was used as the noise removal method. This is a nonlinear filter in which each output sample is calculated as the median of the input samples under the selected window, that is, the middle value after sorting the input values. Furthermore, the median filtering of signals uses a sliding window having an odd number of taps. After examining various values, eleven taps were utilized. In addition, no isolated extreme values (such as a possible large-valued sample resulting from impulse noise) appeared in the filtering phase, since outliers were removed manually beforehand. The median filter's effect on the signals is demonstrated in Figure 3.
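A minimal sketch of an eleven-tap median filter is given below, written with the Python standard library so that it is self-contained. The function name `median_filter` and the edge policy (shrinking the window to the samples that exist near the ends of the signal) are assumptions of this sketch; the text does not state how the edges were handled.

```python
from statistics import median

def median_filter(signal, taps=11):
    """Sliding-window median filter with an odd number of taps.

    Near the edges the window shrinks to the available samples
    (an assumption; the source does not state its edge policy).
    """
    if taps % 2 == 0:
        raise ValueError("taps must be odd")
    half = taps // 2
    n = len(signal)
    return [
        median(signal[max(0, i - half): min(n, i + half + 1)])
        for i in range(n)
    ]

# Toy signal with one spike; the spike does not survive the filter.
noisy = [0.0, 0.1, 0.0, 5.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0]
smooth = median_filter(noisy, taps=11)
```

A library routine such as `scipy.signal.medfilt` with `kernel_size=11` performs the same operation, although it zero-pads the signal at the edges instead of shrinking the window.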

Activity Count and Class Imbalance
In the present study, the experimental sessions, which were designed for the participants, included six sub-activities, as described in Section 2.1. As expected, these sub-activities did not require the same time to be executed. More specifically, bending to approach the crate, lifting it from the ground to an upright position, and placing it on a loading surface were very short activities. On the contrary, walking with and without the crate lasted approximately three times as long as the other sub-activities, as shown in Figure 4a, since the crate and the robot were initially 3.5 m apart.
The observed imbalance in the training set may become troublesome. In fact, class imbalance can create problems pertaining to the ML algorithms' performance, particularly when some classes can be mistaken for one another owing to their commonalities. For the purpose of reducing the prevalence of the over-represented classes, with the least effect on the remaining information, an under-sampling technique was implemented. The under-sampling was performed by removing every other entry of the dataset whenever the sub-activity matched one of the two most populated classes. Although the classes still do not include an equal number of instances, there is a better balance between them, as can be gleaned from Figure 4b.
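The under-sampling rule can be illustrated with the short sketch below. It assumes that "every other entry" is counted separately within each of the two most populated classes; the function name `undersample_majority` and the toy labels are hypothetical.

```python
from collections import Counter

def undersample_majority(labels, n_majority=2):
    """Return the indices to keep after dropping every other entry of the
    n_majority most populated classes; other classes are kept in full."""
    majority = {c for c, _ in Counter(labels).most_common(n_majority)}
    seen = Counter()
    keep = []
    for i, label in enumerate(labels):
        if label in majority:
            seen[label] += 1
            if seen[label] % 2 == 0:  # drop every second occurrence
                continue
        keep.append(i)
    return keep

# Toy labels: the two walking classes (1 and 4) dominate.
labels = [1] * 6 + [2] * 2 + [4] * 6 + [5] * 2
kept = undersample_majority(labels)
balanced = Counter(labels[i] for i in kept)
```

In this toy case the two walking classes are halved from six to three entries each, while the short classes keep both of their entries, so the classes end up closer in size without becoming exactly equal.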

Temporal Window Definition
The temporal window is defined as the time span of sensor data that is needed to identify an activity. Depending on the addressed problem, this window can be as small as 1 s [60] or larger than 6 s [61]. The effect of the temporal window has been thoroughly investigated for a multitude of HAR problems, while a similar approach sets the window at 2.56 s [31]. For the present analysis, the temporal window was set to 2 s after extensive investigation of values that ranged from 0.5 to 5 s. Each temporal window is assigned a class, representing the sub-activity that the subject was conducting during that particular time. An indicative schematic of the temporal window and class assignment is presented in Figure 5.

Overlap
Class assignment based on temporal windows can identify some activities accurately; however, there can be activities that fall between temporal windows and are not represented in either. With the intention of solving this issue, overlapping temporal windows were applied automatically to all signals. Consequently, each temporal window does not start at the end of the previous one, but exactly at its middle, in order to achieve a 50% overlap. This technique offers a double benefit: first, it minimizes the chance of missing an activity that falls between windows and, second, it increases the number of training examples available to the ML algorithm. A representative schematic of the overlapping technique is presented in Figure 6.
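The windowing described in the last two subsections can be sketched as follows. At a sampling frequency of 50 Hz, a 2 s window holds 100 samples, and a 50% overlap means that a new window starts every 50 samples. The function name `sliding_windows` is hypothetical, and the rule used to label a window (the most frequent per-sample label inside it) is an assumption of this sketch, since the text does not state how windows spanning a transition were labeled.

```python
from collections import Counter

SAMPLING_HZ = 50   # sampling frequency stated in the text
WINDOW_S = 2.0     # temporal window stated in the text
OVERLAP = 0.5      # 50% overlap stated in the text

def sliding_windows(samples, labels, hz=SAMPLING_HZ,
                    window_s=WINDOW_S, overlap=OVERLAP):
    """Cut a recording into overlapping windows and label each one.

    The window label is the most frequent per-sample label inside it
    (majority vote, an assumption of this sketch).
    """
    size = int(round(window_s * hz))            # 2 s * 50 Hz = 100 samples
    step = int(round(size * (1.0 - overlap)))   # 50 samples for 50% overlap
    windows = []
    for start in range(0, len(samples) - size + 1, step):
        chunk = samples[start:start + size]
        label = Counter(labels[start:start + size]).most_common(1)[0][0]
        windows.append((chunk, label))
    return windows

# Toy 6 s recording: 3 s of class 0 followed by 3 s of class 1.
samples = list(range(300))
labels = [0] * 150 + [1] * 150
windows = sliding_windows(samples, labels)
```

Without overlap the same recording would yield three windows; with a 50% overlap it yields five, which illustrates the second benefit mentioned above.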

Categorical Variables
By definition, the sub-activities that were assigned as classes to each temporal window are categorical variables, since they are descriptive qualitative variables. Each sub-activity's description was set as a class, and a number was assigned to each class to simplify the data collection process. ML models can work with categorical variables; however, in order to be included in the calculations, these variables need to be transformed into numerical representations. Human intuition points toward the use of a simple integer as the numerical value of an abstract categorical variable. Nevertheless, integers are ordered and, therefore, they imply an order and arithmetic relations between the activities, e.g., that walking with the crate is two times bending, which makes no sense. This issue can be solved by transforming the integers into one-hot vectors, which all have the same magnitude, i.e., one, and whose length equals the number of sub-activities. These vectors contain zeros, except for a single entry equal to one, whose position is different for each activity. By using this technique, all numerical representations have the same magnitude, have no order between them, and can be used in the calculations of the ML algorithms. The sub-activity description, its assigned value, and the resulting vector are presented in Table 3.
Table 3. Sub-activity description, assigned value, and vectorial transformation.
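The transformation from integer labels to one-hot vectors can be sketched as follows. The function name `one_hot` is hypothetical, and placing the non-zero entry at the position given by the integer label is an assumption; the ordering used in Table 3 may differ.

```python
def one_hot(label, n_classes=6):
    """Return a one-hot vector with a single 1 at position `label`."""
    if not 0 <= label < n_classes:
        raise ValueError("label out of range")
    return [1 if i == label else 0 for i in range(n_classes)]

# One vector per sub-activity label 0..5.
encoded = [one_hot(k) for k in range(6)]
```

Every vector has length six and contains exactly one non-zero entry, so no sub-activity is numerically "larger" than another.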

Train/Test Split
The dataset was split into a training portion, containing the examples used for training the model, and a testing portion, containing the examples used to evaluate the model's performance and robustness. The testing portion was completely removed from the dataset prior to any operation that could cause information from it to leak into the training set and compromise the validity of the model's predictions. Given the amount of data obtained from the experimentation phase, an 80/20 split was selected for the training/test datasets. The split was conducted at the subject level in order to evaluate the performance of the trained model on the specific characteristics of an unknown subject's movements. Thus, four subjects were randomly selected for testing purposes, i.e., to have the trained model predict their activities as recorded. The data from the remaining sixteen subjects were used for training the ML algorithm.
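A subject-level 80/20 split can be sketched as below. The function name `split_by_subject`, the integer subject identifiers, and the random seed are hypothetical; the point of the sketch is only that whole subjects, rather than individual windows, are assigned to the test set.

```python
import random

def split_by_subject(subject_ids, n_test=4, seed=0):
    """Randomly hold out `n_test` whole subjects for testing."""
    rng = random.Random(seed)
    unique = sorted(set(subject_ids))
    test_subjects = set(rng.sample(unique, n_test))
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx, test_subjects

# 20 subjects with 12 efforts each, as in the experimental design.
subject_ids = [s for s in range(1, 21) for _ in range(12)]
train_idx, test_idx, test_subjects = split_by_subject(subject_ids)
```

Because the split is made per subject, no recording of a test subject can appear in the training set.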

Normalization
Normalization is the process where all features of a dataset are scaled into a common range. In the present study, the StandardScaler from Python's Sklearn library was utilized [62], which calculates the standard score of a sample x as:
$$ z = \frac{x - u}{SD} $$
where u is the mean of the training samples and SD is the standard deviation of the training samples. Normalization is applied only to the training dataset so as to prevent information leaking toward the test dataset, which needs to be completely unknown during the training process. Better optimization during training can also be achieved via normalization, as it can speed up the convergence of the non-convex cost function toward a minimum.
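The standard score can be reproduced with a few lines of standard-library Python, shown here only to make the computation explicit. The function names are hypothetical. The population standard deviation is used because that is what StandardScaler computes by default; a non-constant feature is assumed so that SD is not zero.

```python
from statistics import fmean, pstdev

def fit_standard_score(train_values):
    """Compute the mean u and the standard deviation SD of the training samples."""
    u = fmean(train_values)
    sd = pstdev(train_values)  # population SD, as StandardScaler uses by default
    return u, sd

def standard_score(values, u, sd):
    """Apply z = (x - u) / SD with statistics fitted on the training set."""
    return [(x - u) / sd for x in values]

# Toy feature column with mean 5 and standard deviation 2.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u, sd = fit_standard_score(train)
z_train = standard_score(train, u, sd)
```

In this sketch, u and SD are derived from the training samples alone, which keeps the statistics of any held-out data out of the fitted scaler.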

Machine Learning Algorithm (LSTM)
LSTM networks constitute a type of neural network architecture built in a recurrent manner, introducing memory cells, as well as the connections between them, with the intention of constructing a directed graph along a sequence. In a general sense, recurrent neural networks process sequences by employing these memory cells, in contrast to simple artificial neural networks. Although they are designed for handling problems of a sequential nature, recurrent neural networks frequently confront the problem of vanishing gradients, or are unable to "memorize" long sequences of data. However, the characteristic cell structures found in LSTMs, also called gates, render the network capable of varying the retained information [63], as they regulate which part of the information is going to be either discarded or stored in the long-term memory. This leads to the much desired optimization of the memorizing process. Problems with dynamic sequential behavior have been proven to be suitable for such kinds of networks. HAR falls under this premise, because all activities are time-dependent sequences, which makes LSTM a suitable algorithm for the problem that the present study tackles.
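For reference, the gating mechanism mentioned above is commonly written as follows. This is the widely used textbook formulation and not necessarily the exact variant or notation of [63]; the symbols W, U, and b (input weights, recurrent weights, and biases), x_t (input), h_t (hidden state), and c_t (cell state) are introduced here for illustration only.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The forget gate f_t decides which part of the previous cell state is discarded, while the input gate i_t decides which part of the new candidate is stored, which corresponds to the regulation of the long-term memory described in the text.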

Performance Metrics
In this subsection, the utilized performance metrics are briefly described. Generally, metrics of this type are employed with the objective of offering a common measure of the trained classifier's performance against the unknown examples originating from the testing set. The result of this prediction, as compared with the actual class label assigned to each activity, can acquire one of the following values:
• True Positive (TP) or True Negative (TN), in case that it is classified correctly;
• False Positive (FP) or False Negative (FN), in case that it is misclassified.
Subsequently, the aforementioned values are used to compute the performance metrics that commonly appear in classification problems [64]. Table 4 summarizes the performance metrics that were used in the present investigation for appraising the classifier's performance, in conjunction with a concise description of them and their mathematical relationship. Finally, for the purpose of assessing the performance of the present algorithm with respect to the given data, a loss function is utilized, which is also known as an objective or cost function. Since the present study deals with a multiclass classification problem, the categorical cross-entropy was adopted, which calculates the loss between the probability one-hot vectors of the real and the predicted class [65]. The mathematical formula for the cross-entropy loss that was used is given by:
$$ H(p, q) = -\sum_{x} p(x) \log q(x) $$
In the above equation, p(x) is the probability vector for the real class, and q(x) is the probability vector for the predicted class. This function yields a small loss for good predictions and increases the loss steeply for bad predictions (tending to infinity when the prediction is completely wrong).
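The relationship between the four outcome counts, the metrics, and the loss can be illustrated with the sketch below. The metric formulas are the standard definitions of accuracy, precision, and recall; since Table 4 is not reproduced here, the exact set of metrics it lists is assumed. The function names and the small constant `eps`, which guards against taking the logarithm of zero, are also assumptions of this sketch.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are recovered."""
    return tp / (tp + fn)

def categorical_cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log(q(x)) for a one-hot p and a predicted q."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

# Real class is label 2; compare a confident and a poor prediction.
p = [0, 0, 1, 0, 0, 0]
good = categorical_cross_entropy(p, [0.02, 0.02, 0.90, 0.02, 0.02, 0.02])
bad = categorical_cross_entropy(p, [0.30, 0.30, 0.05, 0.15, 0.10, 0.10])
```

Because p is one-hot, only the probability assigned to the real class contributes to the sum, so the loss is small for the confident prediction and grows steeply as that probability approaches zero.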

Proposed Machine Learning Pipeline
The complete ML pipeline of the proposed methodology is summarized in this section. The pipeline starts with the signal preprocessing, which includes the loading of the data, the fusion of variables and axes from all sensors, the removal of NaNs (i.e., Not a Number) and outliers, the removal of noise, and the balancing of the classes. The next step is the feature engineering and data transformation that is needed to bring the data into proper condition for training. This includes the temporal window definition, the overlap of temporal windows, the class assignment and labeling, the splitting of the training and test datasets, and the normalization of the training dataset. The following process is the ML model training with the use of an LSTM architecture, which learns each sub-activity's features from the training dataset. Finally, the model is validated with a 10-fold cross-validation, where the training and test datasets are shuffled after each training run and the performance metrics are calculated. Upon completion of the 10-fold cross-validation, the performance metrics are averaged and presented. A schematic of the pipeline is presented in Figure 7.
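The windowing and normalization steps of such a pipeline can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the window length of 100 samples, the 50% overlap, the majority-vote labeling, the z-score normalization, the 45 channels, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def sliding_windows(signal, labels, window, overlap):
    """Cut a (n_samples, n_channels) signal into overlapping windows.

    Each window receives the most frequent label among its samples
    (majority vote is an assumed labeling rule).
    """
    step = max(1, int(window * (1.0 - overlap)))
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        X.append(signal[start:start + window])
        y.append(np.bincount(labels[start:start + window]).argmax())
    return np.stack(X), np.array(y)

def fit_normalizer(X_train):
    """Per-channel mean and standard deviation computed on training windows only."""
    mean = X_train.mean(axis=(0, 1))
    std = X_train.std(axis=(0, 1)) + 1e-8  # avoid division by zero
    return mean, std

# Synthetic stand-in for fused sensor data: 1000 samples, 45 channels, 5 labels.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 45))
labels = np.repeat(np.arange(5), 200)

X, y = sliding_windows(signal, labels, window=100, overlap=0.5)

# Simple hold-out split; statistics come from the training part only.
X_train, X_test = X[:15], X[15:]
mean, std = fit_normalizer(X_train)
X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std
```

Estimating the normalization statistics on the training windows alone, and reusing them for the test windows, keeps information from the test set out of the training phase.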

LSTM Architecture
Extensive investigation and tryouts were conducted with the aim of identifying the optimal specifications and hyper-parameter values for the LSTM architecture. The selected architecture is illustrated in Figure 8.

The three LSTM cells consist of 10 memory units each. The first and second fully connected layers have 10 nodes each. All the aforementioned layers use Rectified Linear Unit (ReLU) activation [66], and each one is followed by a dropout layer with a 50% drop rate. The output layer has six nodes, one for each class, and uses the softmax activation [67]. The total number of trainable parameters sums up to 12,701. Training of the model ranged from 25 to 50 epochs. A typical plot of the decrease in training and validation loss over the epochs is shown in Figure 9.

Confusion Matrix
Based on the results of the performance metrics, the model's performance can be visualized in confusion matrices. In particular, the confusion matrix is a table that displays the aforementioned values in such a way that one can easily view the number of properly classified examples, as well as the false positives and false negatives. In this analysis, which is a multiclass classification, the confusion matrix is of size 6 × 6, where six is the number of activities that are predicted, as seen in Table 5.
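As an illustration of how such a matrix is assembled, the sketch below builds a 6 × 6 confusion matrix from integer class labels and derives the per-class TP, FP, and FN counts from it. The ten label pairs are invented for the example, and the convention of rows for actual classes and columns for predicted classes is an assumption that may differ from the layout of Table 5.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate one count per (actual, predicted) pair
    return cm

# Hypothetical labels for six classes (not data from the study).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 4, 5, 5])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 4, 5, 4])

cm = confusion_matrix(y_true, y_pred, n_classes=6)

tp = np.diag(cm)            # correctly classified examples per class
fp = cm.sum(axis=0) - tp    # predicted as the class but actually another one
fn = cm.sum(axis=1) - tp    # belonging to the class but predicted as another one
```

The diagonal holds the correctly classified examples, so off-diagonal entries show directly which pairs of activities the model confuses.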

Classification Report
The classification report displays the precision, recall, and F1-score for each class. Individual performance metrics for each class are a useful tool for understanding a model's weaknesses and strengths. The classes were engineered in the preprocessing phase to be generally balanced. Nevertheless, both the macro (unweighted) average and the weighted average of all metrics are calculated. Since the accuracy is calculated over all predictions, it has a single value, which describes the general performance of the trained model. The classification report is shown in Table 6. Overall, the "Walking without crate" sub-activity presents the highest scores, achieving 0.904 for precision, 0.969 for recall, and 0.937 for F1-score. In contrast, the "Bending" sub-activity achieves the lowest values for precision (0.741) and F1-score (0.763), while the "Placing crate" sub-activity achieves the lowest recall (0.710). The trained model's total performance is measured by the accuracy metric, which amounts to 85.6% over all the defined activities.
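The difference between the macro and the weighted averages can be shown with a small sketch that derives a classification report from a confusion matrix. The 3 × 3 matrix is a toy example and not the data of Table 6; the function name and the returned dictionary layout are hypothetical, and rows are assumed to hold the actual classes.

```python
import numpy as np

def classification_report(cm):
    """Per-class precision, recall, F1, their averages, and overall accuracy."""
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1).astype(float)    # actual examples per class
    predicted = cm.sum(axis=0).astype(float)  # predictions per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    per_class = np.stack([precision, recall, f1])  # shape (3, n_classes)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "macro_avg": per_class.mean(axis=1),                    # every class counts equally
        "weighted_avg": per_class @ (support / support.sum()),  # weighted by class support
        "accuracy": tp.sum() / cm.sum(),
    }

# Toy confusion matrix with balanced classes (10 actual examples each).
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 1, 9]])
report = classification_report(cm)
```

With perfectly balanced classes, as in this toy matrix, the macro and weighted averages coincide; they diverge only when the class supports differ.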

Feature Selection
The effect of each variable on the performance of the model was also investigated. The accelerometer, gyroscope, and magnetometer data were used both individually and in combination, in order to compare with the result that the model achieved when all variables were utilized. Next, the accuracy of each approach was measured along with its error, and the results are presented in Table 7. As can be deduced from Table 7, the combination of all sensors performed better than the cases where each sensor was used individually. This was an expected result that has been highlighted by several related studies, such as [24,68]. On the other hand, considering the usage of a single sensor, gyroscopes appear to slightly outperform accelerometers, demonstrating an accuracy of 82.987% and 82.708%, respectively. The magnetometer was observed to have the poorest performance, while its supplemental usage, as part of a case considering only two types of sensors, is suggested only in combination with a gyroscope, leading to an approximately 3.07% increase in accuracy. In contrast, the synergy of accelerometers and gyroscopes resulted in approximately 1.10% and 0.76% increases in accuracy as compared to the sole use of accelerometers and gyroscopes, respectively.

Discussion and Main Conclusions
HAR is of major importance in the design process of agricultural collaborative robotic systems, as they should be able to operate in dynamic and crowded farm environments, where almost nothing is structured. In addition, these collaborative systems do not use isolated cells, as is the case with conventional industrial robots. Toward optimizing the required activities, robots work in the same region concurrently with their "coworkers", namely humans. "Cobots", as these robots are usually referred to [69], can carry out either the same task or distinct tasks. The present envisioned application focuses on the latter scenario, where the robot follows the workers while they harvest and, subsequently, the workers place the crate onto the robot. Afterwards, the robot can safely transfer the full crates outside the field. Aside from the aim of providing safety and saving time, this cooperation can contribute to preventing the fatigue of agricultural workers, because the arduous task of carrying the crates over a long distance is performed by robots based on human-aware planning. Apart from HRI, the results of this study are also applicable to conventional in-field operations such as lifting crates and loading them onto platforms for transfer to storage.
The activity recognition of workers is closely related to an essential feature of HRI, which is usually referred to as "social-aware robot navigation" [70,71]. While autonomous navigation is restricted to obstacle avoidance and reaching the target destination [72][73][74], social navigation additionally takes into consideration other factors associated with human naturalness, comfort, and sociability [75,76]. More specifically, naturalness relates to navigating along paths similar to those of humans by adjusting the robot's speed and its distance from farmers. Comfort additionally offers the feeling of safety, whereas sociability has to do with abstracting decisions pertaining to the robot's movements by considering ethical and regional notions [71]. In a nutshell, HAR within agricultural human-robot ecosystems has great potential to assure a socially acceptable, safe motion of robots and to provide free space for farmers to perform their activities unaffected by the simultaneous presence of robots, while the latter can approach them when required [12,77,78].
The present study focuses solely on HAR. To this end, data from 20 healthy participants carrying out a particular task were gathered by five wearable IMUs. This task included walking an unobstructed distance, lifting a crate (either empty or with a total mass of 20% of each participant's body mass), and carrying it to the point of departure, where they had to place it onto an immovable UGV (either a Husky or a Thorvald). By carefully distinguishing the sub-activities comprising the above task, the obtained signals were properly preprocessed for use in the learning phase (training of the model) and the testing phase (evaluating the model's performance and robustness) of an ML process (LSTM).
Overall, the problem of properly classifying stationary activities was challenging. The "Bending", "Lifting", and "Placing" activities were initially misclassified by a large margin. However, noise removal and normalization increased the overall performance of the trained model significantly. One of the factors that affected the performance of the model was the width of the temporal window: when it was varied by more than one second, the overall performance decreased substantially. The LSTM architecture provided the appropriate tools for the model to learn the features of the activity signals. Early experimentation with artificial neural networks (ANN) and one-dimensional convolutional neural networks (CNN) resulted in low performance of the trained model. Nevertheless, further investigation of more elaborate architectures that combine the benefits of multiple methods, such as CNN-LSTM [77] or convolutional LSTM [78] networks, might be worth conducting. However, as is characteristic of data-driven approaches, the volume and variability of data play a significant role in a model's performance. That being stated, this study has shown that obtaining data from 20 subjects, each equipped with five IMU sensors and performing a few recordings, and fine-tuning a state-of-the-art LSTM network can help train a robust model, which can properly classify all activities with an accuracy larger than 76%. Toward increasing the volume and variability of data (and, thus, the overall accuracy), a study with a sample consisting of more participants, covering a wider range of ages, physical strength, and anthropometric characteristics, is in the immediate plans of the authors. Moreover, with the intention of providing real-world data, these experimental tests are planned to be performed in a real agricultural environment by workers at their own pace, according to the complex conditions that they may face.
Additionally, it can be concluded that the gyroscope and the accelerometer can each be used independently for recognizing the specific sub-activities, which are commonly performed in agricultural environments. However, their synergetic contribution can somewhat increase the overall performance. In contrast, the use of a magnetometer alone cannot lead to equally reliable results and should only be considered for supplementary use. The best performance was obtained when the data from all sensors were fused. Furthermore, the present methodology showed the highest precision for the sub-activity of walking without the crate. On the contrary, as anticipated, the sub-activity presenting the lowest precision was that of bending down in order to approach the crate, since it can be executed in several ways, depending on each participant. For example, it was observed that, most of the time, participants would solely bend their trunks (stooping), or kneel without bending their trunk enough, or simultaneously stoop and kneel to catch the crate. This resulted from instructing the participants to carry out the task in their own way, which is justified by our intention to increase the variability of the dataset in order to capture, as widely as possible, most of the different manners in which someone can perform the desired task.
Obviously, assuring a fluid and safe HRI in agriculture involves a plethora of different issues. However, each issue must be addressed separately, at a preliminary stage, before a viable solution is proposed. This study demonstrates a framework for both conducting direct field measurements and applying an ML approach to accurately and automatically identify the activities of workers, by analytically presenting the applied methodology at each phase. Finally, the examined dataset is made publicly available, thus assuring research transparency while allowing for experimental reuse and lowering the barriers for meta-studies.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The dataset used in this work is publicly available in [41].