Deep ConvLSTM Network with Dataset Resampling for Upper Body Activity Recognition Using Minimal Number of IMU Sensors

Human activity recognition (HAR) is the study of identifying specific human movements and actions based on images, accelerometer data and inertial measurement unit (IMU) sensors. In sensor-based HAR applications, most researchers have used many IMU sensors to obtain an accurate HAR classification. The use of many IMU sensors not only limits the deployment phase but also increases the difficulty and discomfort for users. As reported in the literature, the original model used data from 19 sensors consisting of accelerometers and IMU sensors. The imbalanced class distribution is another challenge to the recognition of human activity in real life. This is a real-life scenario, and the classifier may predict some of the imbalanced classes with very high accuracy. When a model is trained using an imbalanced dataset, the model's performance can degrade. In this paper, two approaches, namely resampling and multiclass focal loss, were used to address the imbalanced dataset. The resampling method was used to reconstruct the imbalanced class distribution of the IMU sensor dataset prior to model development and learning using the cross-entropy loss function. A deep ConvLSTM network with a minimal number of IMU sensor data was used to develop the upper-body HAR model. On the other hand, the multiclass focal loss function was used in the HAR model and classified minority classes without the need to resample the imbalanced dataset. Based on the experimental results, the developed HAR model using a cross-entropy loss function and reconstructed dataset achieved a good performance of 0.91 in both model accuracy and F1-score. The HAR model with a multiclass focal loss function and imbalanced dataset has a slightly lower model accuracy and F1-score, both differing by 1% from the resampling method.
In conclusion, the upper body HAR model using a minimal number of IMU sensors and proper handling of imbalanced class distribution by the resampling method is useful for the assessment of home-based rehabilitation involving activities of daily living.


Introduction
Human activity recognition (HAR), which is one of the branches of pattern recognition, has received great attention in both the research and industry sectors. HAR can help to discover a deep understanding in various contexts, such as kinematics, human-computer interaction and health surveillance. A wide variety of applications are implemented centering on HAR, for instance, rehabilitation activities [1], human activity-based recommendation systems [2], gait analysis [3] and kinematics analysis [4][5][6].
HAR can be classified in two types: video-based HAR and sensor-based HAR. For video-based HAR, the human motion is recorded using a video camera and the video is

• A hybrid model was developed for HAR using IMU data.
• The random undersampling method and multiclass focal loss function were applied to address the imbalanced dataset that will occur in real-life human activity recognition applications.
• A high-performance deep model was developed using a smaller dataset and a reasonable result was achieved compared to other implementations.
The rest of the paper is organized as follows: Section 2 describes the related works on human activity recognition in deep learning. The proposed methods are described in Section 3, while the experimental result of the proposed method is discussed in Section 4. Finally, Section 5 concludes the paper.

Related Work
HAR is the study of understanding the activity performed by an agent using sensor data. HAR aims to identify and classify the activity conducted by the agent using a model with minimum error. Conventional machine learning methods, such as support vector machine [12], random forest [13] and logistic regression [14], require hand-crafted features from the dataset. This is a challenge for sensor-based datasets because of the large size of the dataset and the presence of noise in the data. In addition, expertise in the HAR domain is required to extract only the significant features as hand-crafted features so that the model performs well.
Recently, deep learning has been applied in many challenging research areas, such as computer vision and natural language processing. In these tasks, deep learning has outperformed previous algorithms and machine learning models. The advantage of using deep learning is that the model can learn a complex representation from a large dataset without hand-crafted feature preprocessing or deep expert knowledge of the domain.

Sensor-Based HAR
Due to the recent improvements in IoT devices and sensors, wearable sensors are widely used in most electronic devices, such as smartphones, watches, and bands. The data from these sensors can be used to analyze and interpret body movements. However, the performance of an activity recognition system depends mainly on the sensor modality. Some of the widely used wearable sensors are accelerometers, inertial measurement unit (IMU) sensors, electromyography (EMG) sensors, and electrocardiography (ECG) sensors. Among these types of sensors, accelerometers and IMUs are the most used in HAR.
An accelerometer measures the acceleration of an object. In HAR, it is usually mounted on body parts such as the waist [15], arm [16], ankle [17], and wrist [18]. In addition to the accelerometer, the IMU has been widely chosen because it can measure force, angular velocity and orientation at the fulcrum parts of the body. This is made possible by the incorporation of accelerometers, gyroscopes, and magnetometers inside the IMU sensors. Changes in pitch, roll, and yaw in the IMU sensor can help detect body movement.

Convolutional Neural Network in HAR
The convolutional neural network (CNN) is one of the deep learning architectures that are widely implemented for applications such as image classification, speech recognition, and natural language processing. It builds on three important ideas: sparse interactions, parameter sharing, and equivariant representations [15]. There are several outstanding CNN architectures, for example LeNet [19], AlexNet [20], GoogLeNet [21], VGGNet [22] and ResNet [23]. These architectures have greatly improved CNN's performance.
The existing works [24][25][26][27][28] have shown that CNN can classify human activities with high accuracy. A CNN-based feature extraction method was adopted to extract the local dependency and scale-invariant characteristics in time series data. A CNN approach [24] was used to extract the feature representation of activities so that the variation of activities can be analyzed by the model. The authors also conducted tests on three public datasets, which are SKODA, OPPORTUNITY, and Actitracker. The model consists of a pair of convolution layers and one max-pooling layer, followed by two fully connected layers.
In [25], tri-axial acceleration data were collected by the single accelerometer to perform the activity classification using CNN. The model was trained in eight different classes of activities that include falling, running, jumping, walking, walking quickly, step walking, walking upstairs, and walking downstairs.
Actigraphy sensors were used in [26] to collect actigraphy data during sleep to study sleep and physical activity patterns. Physical activity data were collected during the awake time and used to assess the sleep quality of the subject in poor or good sleep efficiency. It showed that the CNN model has the highest specificity and sensitivity compared to all other methods used in the experiments.
In [27], the performance of deep learning models and machine learning models was compared. In the experiments, the CNN model outperformed all the machine learning models in HAR on both public datasets, namely the OPPORTUNITY dataset [29] and the Hand Gesture dataset. Transfer learning in CNN [28] was performed to evaluate the effectiveness of the CNN framework in adapting to data collected by different types of sensors. The study covered transfer learning for sensor data between different users, application domains, sensor modalities, and sensor locations. The results showed that CNN can be trained and adapted to different datasets effectively while reducing the training time.

Recurrent Neural Network in HAR
A recurrent neural network (RNN), unlike a CNN, is suited to modeling temporal correlations and is widely used in applications such as speech recognition and natural language processing. Since an RNN receives its input sequentially, earlier inputs inevitably have less influence on the final prediction results. To solve this shortcoming in RNN, long short-term memory (LSTM) [30] was introduced. LSTM mitigates the vanishing gradient problem in RNN and can accept longer sequences of time steps than the plain RNN architecture.
Existing work on HAR using RNN mainly focuses on optimizing learning speed and resource consumption. High recognition and training speeds were achieved by using an RNN model with a high throughput [31]. A new approach [32] that binarizes all parameters in the hidden layer of the RNN has also been introduced and achieved promising results on the OPPORTUNITY dataset. In addition, the execution time and memory consumption of the model are better than those of other existing methods.

Hybrid Neural Network in HAR
The hybrid model is intended to overcome the shortcomings of both the CNN and RNN neural networks. CNN is good at learning sparse representations from the input data, while RNN performs better at learning temporal representations from the input data. Thus, the idea of a hybrid model is to combine these two modules and allow the model to learn a rich representation of the data in the form of spatial and temporal feature representations.
In [33], CNN and LSTM were combined to perform HAR. The input was fed into the CNN structure followed by the LSTM module. The hybrid model worked better than using CNN or RNN alone because it was able to learn a rich representation from the data provided. In [34], a CNN and gated recurrent unit (GRU) model framework was introduced, applying the same architecture as [33]. The results demonstrated that the model generalized well on three different public datasets and produced consistent results with this hybrid structure.

Handling Imbalanced Dataset in Machine Learning
An imbalanced dataset is a dataset in which the classes are unequally distributed, with far more samples in the majority class than in the minority classes. Such an imbalance can normally occur during the data collection phase. It causes the classifier to be biased toward the majority class, while the error rate of the minority classes appears insignificant because of their small number of samples.
In [35], several methods to deal with oversampling and undersampling for the imbalanced dataset were discussed and evaluated. There are various types of sampling methods to be used to tackle the imbalanced dataset acquired from wireless sensors. An inverse random undersampling method [36] was applied to produce several training datasets. By using these newly produced datasets from random undersampling for model training, the model's performance increases compared with the model without using this approach.
In [37], a loss function-based optimization, named Focal Loss, was introduced to optimize the model during training without any preprocessing of the dataset. Loss function-based optimization makes it easier and more effective for the model to learn from hard examples, which are the minority classes. A suitable technique for handling an imbalanced dataset, whether it is resampling the dataset or a loss function-based optimization, can help the model achieve better performance on an imbalanced dataset.

Proposed Architecture
This article used the OPPORTUNITY [29] dataset as the benchmark dataset. It contains data from accelerometers and IMU sensors of four subjects. The reasons for choosing this dataset as a benchmark dataset are as follows:

• The dataset contains annotated data for high-level activity classes (for example open drawer, open door, close door) and these activities are related to activities of daily living using IMU sensors.
• The dataset has imbalanced class distributions, and the same problem will occur during the data collection phase. Therefore, a method is needed to train the model with an imbalanced dataset.
• Using the weights of the model obtained from this dataset and the transfer learning approach, the knowledge gained can be used to improve the generalization of research in the assessment of rehabilitation progress.
Several experiments were designed to study the problem of class imbalance that is usually observed in the real-life sensor dataset. In terms of model selection, a deep ConvLSTM [33] model was used as a benchmark model to assess the performance of the model developed after applying the data resampling method in the imbalanced dataset. An upper body activity recognition model was developed using a deep ConvLSTM architecture with a reduced number of sensor data and the results were compared with the original implementation.

Dataset Details
The dataset consisted of multiple sensor data collected from four subjects. There are 18 classes of high-level activities that are performed by each participant. The details of OPPORTUNITY Activity Recognition Dataset can be found in Tables 1 and 2. The guideline given in the OPPORTUNITY challenge was followed to ensure that the results were comparable to current research and other state-of-the-art technologies that used this dataset for HAR. As shown in Table 3, run 5 from Subject 1 was used as the validation set while runs 4 and 5 from Subjects 2 and 3 were used as test datasets. The filename convention was S1-ADLx where x is the run number. The remaining data were used as the training set. The details of the training dataset, the validation dataset and the test dataset are shown in Table 3. Figure 1 shows the sensor positions on the human body. The abbreviations of the OPPORTUNITY dataset used in Figure 1 can be found in Appendix A.

The Imbalanced Dataset
Imbalanced data typically refers to a classification problem in which the number of samples per class is not equally distributed. Usually, one class has a very high number of samples, known as the majority class, while other classes have far fewer samples than the majority class. Training on a highly imbalanced dataset can be a challenge, as most of the metrics used to evaluate model performance assume that the dataset is well distributed.
In the OPPORTUNITY dataset, the majority class was found to be 40 times larger than the minority classes. This strong difference in class distribution led to the model learning the representation of the majority class well, but not that of the minority classes. However, the model must be trained to identify the minority classes, which are the activities other than NULL (no activity) stated in Table 2 and which represent real-life scenarios; hence, further feature engineering and handling strategies are needed for this dataset before the model is trained.
To address the problem of the imbalanced dataset, a resampling approach was applied to the OPPORTUNITY dataset. The undersampling method was used to reduce the size of the majority class, which is the NULL class in the OPPORTUNITY dataset. In order to apply undersampling, firstly the NULL class data were segmented out from the original dataset. Then, part of the NULL class data of each subject was randomly selected and combined with the training dataset before feeding them into the training pipeline. In the experiment, for each subject, 10% of the original NULL-class data was selected from the dataset.
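The undersampling step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the function name `undersample_null`, the integer code `NULL_LABEL = 0`, the fixed random seed and the toy data are all assumptions made for the example. The function is meant to be applied to the data of one subject at a time.

```python
import random

NULL_LABEL = 0  # assumed integer code for the NULL (no activity) class


def undersample_null(samples, labels, keep_fraction=0.10, seed=0):
    """Keep every non-NULL sample and a random fraction of the NULL samples."""
    rng = random.Random(seed)
    null_idx = [i for i, y in enumerate(labels) if y == NULL_LABEL]
    other_idx = [i for i, y in enumerate(labels) if y != NULL_LABEL]
    n_keep = int(round(keep_fraction * len(null_idx)))
    kept_null = rng.sample(null_idx, n_keep)
    # Sort so that the retained samples stay in their original order.
    keep = sorted(other_idx + kept_null)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data for one subject: 100 NULL samples and 20 activity samples.
labels = [0] * 100 + [1] * 12 + [2] * 8
samples = list(range(len(labels)))
new_samples, new_labels = undersample_null(samples, labels)
print(len(new_labels), new_labels.count(NULL_LABEL))
```

In this toy case the NULL class shrinks from 100 to 10 samples while the two activity classes are left untouched, so the class distribution becomes far more even.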
In addition, a focal loss function [37] was used to address the class imbalance problem without the resampling method. Focal loss was originally proposed for the object detection domain and can be transferred to the sensor classification domain. This approach reduces the relative loss for well-classified examples in the majority class and focuses the model optimization on hard, misclassified examples in the minority classes.
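To make the down-weighting effect concrete, the sketch below compares cross-entropy with a multiclass focal loss on one easy and one hard example. It assumes the common form FL = -(1 - p)^γ log(p), where p is the softmax probability of the true class, and omits the optional class-weighting factor; the exact variant used in this work may differ. The function names and the toy scores are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross-entropy loss for a single example."""
    p = softmax(scores)[target]
    return -math.log(p)


def focal_loss(scores, target, gamma=1.0):
    """Focal loss for a single example (assumed form, no alpha weighting)."""
    p = softmax(scores)[target]
    return -((1.0 - p) ** gamma) * math.log(p)


easy = [4.0, 0.0, 0.0]  # class 0 is the target and is predicted confidently
hard = [0.0, 1.0, 0.0]  # class 0 is the target but receives a low score
for name, scores in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(scores, 0)
    fl = focal_loss(scores, 0, gamma=1.0)
    print(name, round(ce, 4), round(fl, 4), round(fl / ce, 4))
```

The ratio of focal loss to cross-entropy equals (1 - p)^γ, so it is small for the confident example and close to one for the hard example. With γ = 0 the focal loss reduces to the cross-entropy loss.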

Model Architecture
The structure of Deep ConvLSTM [33] is composed of convolutional and LSTM recurrent layers, respectively designed to learn the sparse feature representation and to model the temporal dependencies between each activation of the feature representation. The architecture of the model is shown in Figure 2. The input data presented in the input layer are the sensory data that were preprocessed earlier. Four 1D convolutional layers were used to extract the spatial feature representation of the input data and then the output was fed to the stacked LSTM layer, where the second LSTM used the outputs of the first LSTM and computed the result. The output of the LSTM layer was fed to the dropout layer with a probability of 0.5. Finally, the linear layer with linear transformation was applied to generate the probability output for each class of activity.
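As a rough aid to reading the architecture, the following sketch traces how the length of the time axis changes across four unpadded 1D convolutions before the sequence reaches the stacked LSTM. The window length and kernel size are illustrative assumptions, since the text above does not state them; only the number of convolutional layers is taken from the description.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution along the time axis (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


# Illustrative values only; the text does not state them.
window_len = 24      # time steps per input window (assumed)
kernel_size = 5      # temporal kernel size of each convolution (assumed)
num_conv_layers = 4  # four 1D convolutional layers, as described above

length = window_len
for layer in range(1, num_conv_layers + 1):
    length = conv1d_out_len(length, kernel_size)
    print(f"after conv {layer}: {length} time steps")

# The stacked LSTM then consumes a sequence of `length` feature vectors.
lstm_steps = length
```

Under these assumed values, each convolution removes four time steps, so the LSTM layers see a shorter sequence than the raw input window.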

Learning Algorithm
The PyTorch [38] framework was used to perform the training and testing of the HAR model. The model was developed using PyTorch's API and was trained using the algorithm shown in Algorithm 1. The cross-entropy loss and focal loss, as shown in Equations (1) and (2), respectively, were used as the model optimization loss. During the training process, an early stop strategy was applied to track the model's validation loss and prevent the model from overfitting. If the model's validation loss did not decrease in five consecutive training loops, the training was stopped.
where s_p is the score of the positive class in C, and γ is the hyperparameter gamma for focal loss.
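The early stop strategy described in this section can be sketched as follows. This is a minimal illustration rather than the algorithm used in this work: the class name `EarlyStopping` is hypothetical, the validation losses are simulated, and "did not decrease" is interpreted here as "did not fall below the best loss seen so far", which is an assumption.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Simulated validation losses: improvement stalls after the third value.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.65, 0.63, 0.64]
stopper = EarlyStopping(patience=5)
stopped_at = None
for loop_index, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = loop_index
        break
print(stopped_at, stopper.best_loss)
```

In a real training loop, the model weights would typically be saved whenever `best_loss` improves so that the best checkpoint can be restored after stopping.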

Experimental Design
Using the OPPORTUNITY dataset, three comparative experiments were conducted with different configurations and parameters. In Experiment 1, the Deep ConvLSTM network was trained on the original dataset, consisting of 113 channels of accelerometers and IMU sensors. Data were normalized, and linear interpolation was performed to fill in the missing data. The sliding window approach was used to increase the number of samples available for training and testing. The training process is summarized in Figure 3. In Experiment 2, the undersampling method was applied in the data preprocessing stage, where the NULL class was resampled prior to model training. In Experiment 3, the number of sensors in the dataset was reduced to only five IMU sensors: RUA, RLA, BACK, LUA, and LLA. Using only the five IMU sensors also reduced the number of features used for model development and the number of IMU sensors required in the deployment phase.
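The two preprocessing steps mentioned above, linear interpolation of missing readings and sliding-window segmentation, can be sketched as below. The function names, the use of `None` for missing values, the copy-nearest handling of gaps at the ends of a channel, and the window length and step are all assumptions made for illustration; normalization is omitted for brevity.

```python
def interpolate_missing(values):
    """Linearly interpolate None entries in a single sensor channel."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("channel has no valid readings")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:      # leading gap: copy the first valid value
            out[i] = values[right]
        elif right is None:   # trailing gap: copy the last valid value
            out[i] = values[left]
        else:                 # interior gap: interpolate between neighbors
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out


def sliding_windows(sequence, window_len, step):
    """Split a sequence into overlapping fixed-length windows."""
    return [sequence[s:s + window_len]
            for s in range(0, len(sequence) - window_len + 1, step)]


channel = [0.0, None, 2.0, 3.0, None, None, 6.0, 7.0]
filled = interpolate_missing(channel)
windows = sliding_windows(filled, window_len=4, step=2)
print(filled)
print(len(windows))
```

With a step smaller than the window length, consecutive windows overlap, which is how the sliding window approach increases the number of training samples.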

Evaluation Criteria
The evaluation metrics commonly used to evaluate the performance of HAR models are accuracy, F1-score, recall, and Area Under the Curve (AUC). However, the OPPORTUNITY dataset is extremely imbalanced: the NULL class accounts for 70% of all samples in this dataset. This represents real-life scenarios, and a classifier can predict the NULL class with very high accuracy. Therefore, accuracy alone is not an appropriate index for performance evaluation on an imbalanced dataset.
In addition to the model accuracy, the F1-score was used as a model evaluation metric. The F1-score is commonly used to evaluate models developed on an imbalanced dataset, as it shows the balance between precision and recall. The relative contributions of precision and recall to the F1-score are equal. Therefore, the combination of the F1-score, confusion matrix and model accuracy can help in identifying the best model for this application, especially in handling imbalanced datasets. The two metrics were calculated from the per-class counts, where n and CN denote the class number and total class number, respectively, and TP_n, FP_n, TN_n, and FN_n are the true positives, false positives, true negatives, and false negatives of class n, respectively.
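The sketch below shows one way to compute accuracy and per-class F1-scores from a confusion matrix, and why accuracy alone can be misleading on an imbalanced dataset. The averaging scheme used in this work is not stated in the text above, so both a macro average and a support-weighted average are shown as assumptions; the function names and the toy confusion matrix are illustrative.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    n_classes = len(confusion)
    scores = []
    for n in range(n_classes):
        tp = confusion[n][n]
        fp = sum(confusion[i][n] for i in range(n_classes)) - tp
        fn = sum(confusion[n]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[n][n] for n in range(len(confusion)))
    return correct / total


def weighted_f1(confusion):
    """Per-class F1 averaged with weights proportional to class support."""
    total = sum(sum(row) for row in confusion)
    f1 = per_class_f1(confusion)
    return sum(f * sum(row) / total for f, row in zip(f1, confusion))


# Toy case: a classifier that always predicts class 0 (e.g., NULL).
always_null = [[70, 0, 0],
               [20, 0, 0],
               [10, 0, 0]]
acc = accuracy(always_null)
macro = sum(per_class_f1(always_null)) / len(always_null)
weighted = weighted_f1(always_null)
print(round(acc, 3), round(macro, 3), round(weighted, 3))
```

In this toy case the accuracy equals the share of the majority class even though no minority-class sample is recognized, whereas both F1 averages are visibly lower.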

Training and Testing on Original OPPORTUNITY Dataset
The Deep ConvLSTM model was implemented following the details in the previous literature [33]. In this benchmark model, 113 channels consisting of accelerometers and IMU sensors were used. The distributions of data for the training, validation and test sets are shown in Figure 4. As shown in Table 3, the validation dataset consists of only one subset of subject 1, which is S1-ADL5. The training dataset consists of samples from subject 1, subject 2 and subject 3. Therefore, the model presents a greater gap between the training and validation loss (Figure 5a). Based on the distribution of the dataset as shown in Figure 4, the imbalanced class distribution was clearly observed. The number of NULL class samples was high compared to the other classes in this dataset. The model training results, consisting of the loss curves for training and validation, are shown in Figure 5. In the model testing phase, the model gave 0.83 in the F1-score and 0.85 in model accuracy. Through the confusion matrix, as shown in Figure 6, it was observed that most classes had low true positive scores, although the F1-score and accuracy of the model were high. This is due to the imbalanced distribution of the OPPORTUNITY dataset. Therefore, to solve this problem, further improvements were made to the algorithm and more experiments were conducted.

Removal of the Majority Class in the Dataset
In this section, the NULL class of the dataset was removed to verify if the model can learn the pattern of the other classes. For this purpose, the entire NULL class was removed from the dataset. The distributions of the datasets for training, validation and test dataset are shown in Figure 7. The distributions of the datasets for all the sets were more balanced after the removal of the NULL class. The training results and confusion matrix of the model are shown in Figures 8 and 9, respectively. Although this model worked well in training and the validation sets (Figure 8), the model failed to properly classify the test samples and had many false positive cases, as shown in the confusion matrix in Figure 9.

Resampling OPPORTUNITY Dataset
The resampling method was used to reduce the number of NULL class samples in the dataset. To resample the dataset of each subject, 10% of the NULL class samples were retained. The model was trained using this new dataset. Figures 10-12 show the distribution of the dataset after resampling, the results of the model training and the confusion matrix of the model, respectively. The model achieved good performance after applying the resampling method. This model can classify the 18 annotated classes in the dataset.

Upper Body Recognition Model
To enable practical implementation during the model deployment stage, the number of sensors used in model training was reduced from 19 to 5 units. The accelerometer data were removed from the dataset and only five IMU sensors were selected, which were RLA, RUA, BACK, LLA, and LUA.
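The channel reduction described above amounts to keeping only the columns that belong to the five selected IMUs. The sketch below illustrates this with a hypothetical column naming scheme of the form "<sensor type> <position> <signal>"; the actual column names in the dataset files may differ, and the function name `select_upper_body` is an assumption.

```python
# Hypothetical column names; real files may label channels differently.
columns = [
    "Accelerometer RKN^ accX",
    "Accelerometer HIP accX",
    "IMU BACK accX", "IMU BACK gyroX",
    "IMU RUA accX", "IMU RLA accX",
    "IMU LUA accX", "IMU LLA accX",
    "IMU L-SHOE accX",
]

UPPER_BODY_IMUS = ("BACK", "RUA", "RLA", "LUA", "LLA")


def select_upper_body(column_names):
    """Return indices of channels that belong to the five selected IMUs."""
    keep = []
    for idx, name in enumerate(column_names):
        parts = name.split()
        if len(parts) >= 2 and parts[0] == "IMU" and parts[1] in UPPER_BODY_IMUS:
            keep.append(idx)
    return keep


kept = select_upper_body(columns)
kept_names = [columns[i] for i in kept]
print(kept_names)
```

The returned indices can then be used to slice every sample so that only the retained channels are passed to the model.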
The model was trained with the resampled IMU dataset that was obtained after removing the unwanted sensor data. The results of the model training are shown in Figure 13. The model confusion matrix on the test dataset is presented in Figure 14. The confusion matrix shows a low rate of false positives, and each class shows a good true positive rate compared to previous experiments.

Loss Function Based Optimization for Imbalanced Dataset
In this experiment, the multiclass focal loss was used in the sensor classification task to replace the conventional multiclass loss method, namely cross-entropy loss. The hyperparameter gamma acts as a relaxation parameter that is used to adjust the weight for each class. When the gamma value is high, the model focuses on optimization of the minority classes. The learning process is more difficult, and it is harder to correctly classify all the classes. By using this approach, the model can be trained directly with the imbalanced distribution dataset without the need for random undersampling.
This approach was tested with the five-IMU-sensor dataset, as mentioned in Section 4.4. By setting the gamma value to 1.0, both models achieved the best accuracy score of 0.9 and F1-score of 0.9. Although the model using multiclass focal loss was successful in classifying the minority classes, the result was no better than that of the random undersampling approach. The result of the multiclass focal loss is shown in Table 4.

Summary of Results
The model developed with proper handling of the imbalanced dataset can achieve an overall high F1-score on all 18 activity classes for both the full-body and upper-body datasets. The model results for all the experiments are presented in Table 5. Table 5 also shows a comparison of the developed model with other works. Several deep neural network models have been used for human activity recognition. InnoHAR used a multi-level neural network structure based on the combination of an Inception Neural Network and GRU. The experiment was conducted on the Intel Atom E3826 processor platform, but the imbalanced class distribution remains unsolved, as reported in [34]. InnoHAR has a slightly better performance in the HAR classification, as shown by its F-measure of 0.94 compared to our work with an F-measure of 0.91. A ConvLSTM network has the advantage of having fewer parameters and a low memory footprint from hidden representations, which is more suitable for wearable activity recognition during the deployment stage. Although the developed model is slightly lower in F1-score (3%) compared to InnoHAR [34], it remains at the same level of performance as state-of-the-art technologies. More importantly, a home-based rehabilitation assessment system can be developed with this model and a minimal number of sensors.

Conclusions
Data imbalance is a common problem in wearable sensor data acquisition. It can greatly degrade the performance of the model during the deployment stage. This article presented a method to address an imbalanced dataset for upper body activity recognition. A resampling method was used to ensure the convergence of the model and produce a high-performance model for classifying different activities. The results of the experiment showed that this method remains at the same level of performance as the state-of-the-art technology. This model used only five IMU sensors and achieved the same F1-score as the model using 19 sensors.
Although the HAR model with a multiclass focal loss function can address the imbalanced dataset, it has a slightly lower model accuracy and F1-score with a 1% (for both) difference from the resampling method. The application of the multiclass focal loss function in the HAR model will facilitate the subsequent deployment phase on edge devices. However, the accuracy of the model and the F1-score in the multiclass classification will be sacrificed. This tradeoff will be taken into consideration in the next implementation phase.
To emphasize, the contributions of these works are the random undersampling to address the problem of data imbalance in real-life human activity recognition and the reduction in the number of features that are needed for the model training that, at the same time, achieved reasonable results compared to previous work. In practical application, this will reduce the number of IMU sensors required to be attached on the human body.
In the future, the developed model will be implemented on a mobile device to perform real-time activity classification in real-world scenarios. Real-world scenarios involve the comparison of movement between healthy and hemiparetic limbs and the quantification of human movement during the activities of daily living. It is hoped that the future models using HAR will be the bridging tool between engineering and medical research.
Institutional Review Board Statement: Not applicable. A publicly available dataset [29] was used in this work.
Informed Consent Statement: Not applicable. A publicly available dataset [29] was used in this work.

Data Availability Statement: Not applicable.
Acknowledgments: We would like to express our sincere gratitude to the reviewers for their helpful comments that improved the quality of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
Table 1 shows the abbreviations of the OPPORTUNITY Activity Recognition dataset, as shown in Figure 1. The details and sequences of activities performed by the subjects during the data collection are shown in Tables 2 and 3. The 'ˆ' symbol means the upper sensor placement and the '_' symbol means the lower sensor placement on the same body segments.

Table 3. Drill run details.
Activities    Details
Activity 1    Open then close the fridge
Activity 2    Open then close the dishwasher
Activity 3    Open then close 3 drawers (at different heights)