Article

Exploring Cutout and Mixup for Robust Human Activity Recognition on Sensor and Skeleton Data †

Department of Computer Engineering, Dongguk University, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled “Applying mixup for Time Series in Transformer-Based Human Activity Recognition”, which will be presented at The 5th International Workshop on AI for Social Good in the Connected World at The 23rd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Bangkok, Thailand, 9–12 December 2024.
Appl. Sci. 2024, 14(22), 10286; https://doi.org/10.3390/app142210286
Submission received: 13 September 2024 / Revised: 11 October 2024 / Accepted: 29 October 2024 / Published: 8 November 2024

Abstract

Human Activity Recognition (HAR) is an essential area of research in Artificial Intelligence and Machine Learning, with numerous applications in healthcare, sports science, and smart environments. While advances such as attention-based models and Graph Neural Networks have driven great progress in the field, this work focuses on data augmentation methods that tackle issues like data scarcity and task variability in HAR. We investigate and extend the mixup and cutout data augmentation methods, first popularized in Computer Vision and Natural Language Processing, to sensor-based and skeleton-based HAR datasets. Both techniques, customized for time-series and skeletal data, improve the robustness and performance of HAR models by diversifying the training data and mitigating the drawbacks of limited training data. Specifically, we customize mixup data augmentation for sensor-based datasets and cutout data augmentation for skeleton-based datasets with the goal of improving model accuracy without collecting additional data. Our results show that mixup and cutout improve the accuracy and generalization of activity recognition models on both sensor-based and skeleton-based human activity datasets. This work showcases the potential of data augmentation for transformers and Graph Neural Networks by offering a novel method for enhancing time-series and skeletal HAR tasks.

1. Introduction

Human Activity Recognition (HAR) has become a crucial area of research within Artificial Intelligence and Machine Learning due to its wide range of practical applications in daily life [1,2,3], driven by wearable devices and smart sensors. HAR typically involves automatically detecting and classifying human activities based on data collected from various sensors, such as accelerometers and gyroscopes, usually worn by subjects. Recognizing human activities holds significant potential in many areas. In healthcare, various Machine Learning technologies, including deep neural networks for activity detection [4,5,6], have been used to ensure the welfare of hospital patients and the general population. In sports science, deep learning models have made a positive impact on modeling sports like football [7] and karate [8], as well as on fall detection [9], posture recognition [10], and many other tasks [11]. HAR has also been applied in areas such as home automation and energy efficiency [12,13,14,15], anomaly detection and public safety [16,17], and gesture recognition and virtual reality [18,19].
Recent advancements in HAR have leveraged deep learning architectures to improve recognition accuracy [11,20,21,22]. For sensor-based HAR, Convolutional Neural Networks (CNNs) have been effective in capturing spatial features from sensor data, while Long Short-Term Memory (LSTM) networks model temporal dependencies [23,24,25,26]. Transformer models, known for their self-attention mechanisms, have also been applied to handle long-range temporal dependencies in time-series data [27,28]. Hybrid models combining CNNs and LSTMs have shown improved performance by leveraging both spatial and temporal features [24,26,29]. Despite their outstanding performance in an experimental environment, the real-world environment still poses a problem, mainly because of poor data distribution [22].
In skeleton-based HAR, Graph Neural Networks (GNNs) have gained prominence due to their ability to model the human body’s skeletal structure as a graph. By representing human body joints as nodes and the connections between them as edges, GNNs can effectively capture both spatial and temporal dependencies in skeleton data, making them well-suited for HAR tasks. Moreover, GNNs can handle the non-Euclidean structure of skeleton data, which provides a significant advantage over traditional methods that assume grid-like data structures [30,31,32,33,34]. Models like the Spatial–Temporal Graph Convolutional Network (ST-GCN) capture spatial relationships between joints and temporal dynamics across frames [35]. Adaptive Graph Convolutional Networks (AGCNs) and attention mechanisms have further enhanced the capacity to focus on relevant joints and motions [36]. Despite these advancements, the models often require large, diverse datasets to perform well and may not generalize effectively to new subjects or environments due to data scarcity and variability [37].
Despite the amount of research done in Human Activity Recognition, several challenges in the field still need to be addressed, such as the lack of data [38], variability in measurement hardware and techniques [39], and misalignment of activities [40,41]. While sensors generate vast amounts of data, labeled and diverse datasets for HAR remain limited. On top of that, human motions are complex and varied, making accurate recognition challenging. In this research, we address the issue of HAR data scarcity and low diversity and provide a potential solution. This issue has also been tackled by T. Alafif et al. [42], who discussed data scarcity in abnormal behavior detection in crowds and addressed the problem by using generative adversarial networks to augment the dataset with artificial samples. On the other hand, Q. Zhu et al. [43] proposed semi-supervised Machine Learning that uses unlabeled data alongside labeled datasets to improve activity recognition with smartphone inertial sensors.
However, a significant challenge in HAR is the scarcity and limited diversity of available datasets. Collecting large-scale, annotated HAR datasets is resource-intensive due to the need for specialized equipment, privacy considerations, and the inherent variability of human movements. Traditional data augmentation techniques used in image processing do not directly apply to time-series and skeleton data, leaving a gap in effectively enhancing HAR datasets. In this research, we explore the use of mixup and cutout data augmentation [44,45] alongside attention-based architectures, first proposed by A. Vaswani et al. [46], and Graph Neural Networks, which have gained popularity in the HAR space [34], to tackle the problem of low diversity and potential scarcity when training activity recognition models. Our experimental results show that these techniques improve model accuracy in state-of-the-art activity recognition models.
To summarize, the contributions of this paper are:
  • We introduce and adapt the mixup data augmentation technique for sensor-based time-series data in HAR, improving data diversity and model generalization while preserving the temporal structure.
  • We extend the cutout data augmentation technique to skeleton-based HAR datasets to simulate partial occlusions and missing data in joint positions, thereby improving model robustness.
  • We empirically demonstrate that integrating these techniques with attention-based models and Graph Neural Networks generally improves performance and accuracy across various HAR datasets.

2. Background Literature

Human Activity Recognition (HAR) is a field within Artificial Intelligence and Machine Learning that focuses on the automatic detection and classification of human activities based on data collected from various methods, including both sensor-based and skeleton-based approaches [1,38]. Sensor-based HAR typically involves accelerometers, gyroscopes, or other wearable devices to capture movement and orientation data. At the same time, skeleton-based HAR uses data derived from visual sensors such as RGB cameras or depth sensors, which track the positions of skeleton joints over time. HAR has significant applications in diverse areas, including healthcare, where it can monitor patient activities and detect falls, sports science for performance analysis, and smart home environments to enhance automation and security [4,11]. Understanding and recognizing human activities is crucial for developing systems that provide personalized feedback, improve safety, and support independent living.
Traditional approaches to HAR often involve classical Machine Learning techniques such as k-Nearest Neighbors (k-NN) [47,48], Support Vector Machines (SVM), and decision trees [29,49]. These methods rely heavily on manual feature extraction and selection processes, where domain experts identify relevant features from the raw sensor data. While these approaches have been effective to some extent, they face challenges, such as the labor-intensive nature of feature engineering and the difficulty in capturing complex patterns in the data. Additionally, these models may struggle to scale and generalize across different datasets and environments due to their reliance on handcrafted features [38]. The advancement towards deep learning has significantly improved HAR by enabling models to learn features from the raw data automatically [11].
Convolutional Neural Networks (CNNs) have been used to capture spatial dependencies in the data [50], while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are adept at modeling temporal dependencies [51]. These deep learning models have shown superior performance compared to traditional methods, as they can handle large volumes of data and extract hierarchical features that represent complex patterns in human activities. Similarly, attention-based models have revolutionized the fields of Computer Vision (CV) and Natural Language Processing (NLP) by introducing the transformer architecture that relies on self-attention mechanisms [46]. Unlike traditional sequence models, transformers can capture long-range dependencies and complex patterns in data without the need for recurrent connections. This capability has led to the development of state-of-the-art models in NLP, such as BERT and GPT-3 [52,53], and in Computer Vision, such as Vision Transformers (ViTs) [54]. The scalability and efficiency of transformers make them promising candidates for applications in time-series data like HAR, where capturing temporal dependencies is crucial.
On the other hand, Graph Neural Networks (GNNs), first proposed by M. Gori et al. in 2005 [33] and later developed by F. Scarselli et al. in 2009 [32], have also shown great promise for activity recognition tasks. For skeleton data, typically represented as sequences of joints or key-points in a graph structure, GNNs are a natural fit due to their ability to capture complex relationships between body joints over time. Their utility was demonstrated in one of the initial works of Yan et al. [55], who applied GNNs to skeleton sequences and proposed a model named the Spatial–Temporal Graph Convolutional Network (ST-GCN). GNNs can model both spatial and temporal dependencies by treating each joint as a node and the connections between them as edges, allowing for more precise activity recognition. Studies have demonstrated the effectiveness of GNNs in extracting both local and global body movement patterns, which are essential for recognizing complex human activities. In particular, approaches such as the aforementioned ST-GCN and Dynamic Graph Convolutional Networks (DGCNs) [56] have been particularly successful in skeleton-based HAR, where the model simultaneously learns spatial features from the skeleton’s structure and temporal features from the sequential nature of the data.
Data augmentation is a technique used to improve the generalization and robustness of Machine Learning models by artificially increasing the diversity of the training dataset. The effectiveness of data augmentation has been demonstrated across various domains, including vision and natural language, which have led to improvements in model performance and stability [57]. Mixup is a specific data augmentation technique that generates new training samples by linearly interpolating between pairs of existing samples and their corresponding labels [44]. Initially introduced for image classification tasks, mixup has been shown to improve model generalization and robustness by encouraging the model to behave linearly between training examples. The mathematical formulation of mixup involves creating a new sample as a weighted combination of two samples and their labels, effectively smoothing the decision boundary of the classifier, according to the authors H. Zhang et al. in their research [44]. Previous studies have extended mixup data augmentation to other domains, including NLP, where it has also demonstrated significant benefits [58].
Another powerful data augmentation technique is cutout, which enhances model robustness by randomly masking out sections of the input data during training. Initially proposed for image classification by T. DeVries et al. in 2017 [45], cutout works by occluding parts of the input samples, forcing the model to rely on surrounding information to make accurate predictions. This technique has been shown to improve generalization by making the model less reliant on specific features and more capable of capturing broader patterns in the data. In the context of Human Activity Recognition (HAR), cutout data augmentation can be adapted to mask out key-point sections in skeleton data, simulating real-world conditions where some sensor data might be noisy or missing. This research shows that introducing such variability during training helps models become more robust to partial occlusions and missing data, ultimately leading to improved performance in HAR tasks. By demonstrating its effectiveness through our experiments, we provide a potential solution to some of the critical challenges in HAR and pave the way for further exploration and application of data augmentation techniques in different domains of Human Activity Recognition.

3. Methodology

Our proposed method applies to both sensor-based and skeleton-based Human Activity Recognition (HAR) models, incorporating mixup and cutout data augmentation techniques. The datasets used in this study consist of time-series sensor data collected from devices such as accelerometers and gyroscopes, and skeleton data extracted from visual sensors like RGB cameras or depth sensors. For sensor-based HAR, each dataset contains sequences of sensor readings from subjects performing various activities, while for skeleton-based HAR, the data track the positions of important body joints over time. The continuous sensor and skeleton data are segmented into fixed-length windows to create consistent input samples for the models, with each segment labeled according to the corresponding activity.
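To make the preprocessing step concrete, the following is a minimal sketch of how a continuous sensor stream could be segmented into fixed-length, partially overlapping windows; the window length, step size, and majority-vote labeling rule are illustrative assumptions rather than the exact settings used in our experiments.

```python
import numpy as np

def segment_windows(stream, labels, window_size=128, step=64):
    """Segment a continuous sensor stream of shape (N, C) into fixed-length
    windows of shape (window_size, C). Each window is labeled by the majority
    activity of its samples. All parameter values here are illustrative."""
    windows, window_labels = [], []
    for start in range(0, len(stream) - window_size + 1, step):
        end = start + window_size
        windows.append(stream[start:end])
        # Majority vote over the per-sample activity labels inside the window
        vals, counts = np.unique(labels[start:end], return_counts=True)
        window_labels.append(vals[np.argmax(counts)])
    return np.stack(windows), np.array(window_labels)
```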
This research applies the mixup data augmentation technique for time-series sensor data to address data scarcity and improve model robustness. Mixup generates new training samples by linearly interpolating between pairs of existing sensor data samples and their corresponding labels, enhancing the diversity of the data. For skeleton-based HAR, we employ the cutout data augmentation technique, which involves masking random parts of the skeleton data to improve model robustness by forcing it to focus on incomplete or missing joint information, ultimately enhancing the generalization capabilities of the model.
Building on this approach, we first apply the mixup data augmentation technique to the sensor-based HAR data, which generates new training samples by blending existing data points and their labels. For each training iteration, pairs of time-series samples $(X_i, y_i)$ and $(X_j, y_j)$ are randomly selected from the training dataset. A mixing coefficient $\lambda \in [0, 1]$ is sampled from a Beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The selected pairs are linearly combined to generate new samples:

$$\tilde{X} = \lambda X_i + (1 - \lambda) X_j \quad (1)$$

$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j \quad (2)$$

Here, $X_i$ and $X_j$ are time-series matrices of shape $(T, C)$, where $T$ is the sequence length and $C$ is the number of sensor channels. This implies $X_i, X_j \in \mathbb{R}^{T \times C}$ and $y_i, y_j \in \mathbb{R}^{K}$, where $K$ is the number of activity classes. The labels $y_i$ and $y_j$ are one-hot encoded vectors representing the activities.
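As a concrete illustration of Equations (1) and (2), the snippet below sketches how mixup could be applied to a batch of sensor windows and their one-hot labels; the function name, batch-level pairing strategy, and default alpha value are our own illustrative assumptions rather than fixed parts of the method.

```python
import numpy as np

def mixup_time_series(X, y, alpha=0.2, rng=np.random.default_rng(0)):
    """Mixup for a batch of time-series samples.
    X: (B, T, C) sensor windows; y: (B, K) one-hot activity labels.
    alpha parameterizes the Beta distribution; 0.2 is only an illustrative default."""
    lam = rng.beta(alpha, alpha)             # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(X))           # random pairing of samples within the batch
    X_mix = lam * X + (1.0 - lam) * X[perm]  # Equation (1)
    y_mix = lam * y + (1.0 - lam) * y[perm]  # Equation (2)
    return X_mix, y_mix
```

Mixing within a shuffled batch keeps the temporal axis intact, since interpolation is applied element-wise across aligned time steps and channels.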
The graphs in Figure 2 and Figure 3 display sensor readings from a human activity sample before (Figure 2) and after (Figure 3) applying the mixup technique. The data are presented across six separate channels, each representing different sensor readings over time. The x-axis of each plot denotes time, while the y-axis indicates the sensor reading values. For each channel, the signal exhibits a combination of patterns derived from two different samples, reflecting the linear interpolation characteristic of mixup.
The mixup augmentation technique blends the time-series data from two separate activities, creating synthetic samples that enhance the training dataset’s diversity. The representation of each class after dimensionality reduction with t-SNE [59], similar to [22], can be seen in Figure 4. The figure shows that mixup supplies additional data points that improve model generalization, particularly for classes with fewer samples.
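A projection in the spirit of Figure 4 can be produced with a short script such as the one below; the perplexity value and the flattening of each window into a single feature vector are illustrative choices, not the exact settings used for the figure.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, title):
    """Project HAR windows X of shape (B, T, C) to 2-D with t-SNE and color
    points by their integer class label y. Perplexity is an illustrative choice."""
    flat = X.reshape(len(X), -1)  # flatten each window into one feature vector
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(flat)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
    plt.title(title)
    plt.show()
```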
We employ transformer-based models proposed by S. Ek et al. [22] and by I. Dirgová Luptáková et al. [60]. Their transformer architecture is a customized form of the Vision Transformer [54], with some modifications to adapt it to HAR data. For the sake of clarity, we simplify the model architecture and explain its components below. First, the input time-series data is projected into a higher-dimensional space using a linear embedding layer, transforming the raw sensor readings into a suitable format for the Vision Transformer; positional embeddings are then added to preserve the temporal order, as shown in Equations (3) and (4):
$$E = \mathrm{Embedding}(X) \quad (3)$$

$$E_{pos} = E + \mathrm{PositionalEncoding}(T, d_{model}) \quad (4)$$
where $d_{model}$ is the dimensionality of the embeddings. The model consists of multiple Vision Transformer encoder layers, each comprising multi-head self-attention mechanisms and feed-forward networks:
$$H_l = \mathrm{TransformerEncoderLayer}(H_{l-1}) \quad (5)$$

for $l = 1, \ldots, L$, where $H_0 = E_{pos}$ and $L$ is the number of encoder layers. The output of the final encoder layer is passed through a global average pooling layer to aggregate the sequence information into a fixed-size vector.

$$z = \mathrm{GlobalAveragePooling1D}(H_L) \quad (6)$$
The pooled vector is then passed through fully connected layers to produce the final activity classification. The cross-entropy loss function is used to train the model, comparing the predicted activity probabilities $\hat{y}$ with the mixup-augmented labels $\tilde{y}$:

$$\mathcal{L} = -\sum_{k} \tilde{y}_k \log \hat{y}_k \quad (7)$$
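For readers who prefer code to equations, the following condensed PyTorch sketch mirrors the pipeline of Equations (3)–(7): linear embedding, learned positional encoding, stacked self-attention encoder layers, global average pooling, and a classification head. The layer sizes and hyperparameters are illustrative and do not reproduce the exact configurations of [22,60].

```python
import torch
import torch.nn as nn

class HARTransformerSketch(nn.Module):
    """Minimal sketch of the encoder pipeline in Equations (3)-(7).
    All hyperparameters (d_model, heads, layers) are illustrative."""
    def __init__(self, n_channels, n_classes, seq_len, d_model=64, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)                 # Equation (3)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))   # learned positional term, Equation (4)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)       # Equation (5), repeated L times
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                 # x: (B, T, C)
        h = self.embed(x) + self.pos      # embedding plus positional encoding
        h = self.encoder(h)               # multi-head self-attention encoder stack
        z = h.mean(dim=1)                 # global average pooling over time, Equation (6)
        return self.head(z)               # class logits

# With mixup-augmented soft labels y_mix, Equation (7) corresponds to a soft
# cross-entropy such as: loss = -(y_mix * logits.log_softmax(-1)).sum(-1).mean()
```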
This methodology of implementing mixup on sensor-based HAR data is empirically validated through experiments on four different HAR datasets, comparing the results to baseline models without mixup augmentation. The findings demonstrate that mixup augmentation leads to improved accuracy and robustness in transformer-based HAR models.
After applying mixup for sensor-based HAR, we turn our focus to skeleton-based HAR, where we utilize the cutout data augmentation technique to improve model robustness by randomly masking portions of the skeleton data. Figure 5 shows how the skeleton key-points are randomly masked as a data augmentation technique. Given a skeleton sample $X \in \mathbb{R}^{T \times J \times D}$, where $T$ is the sequence length, $J$ is the number of joints in the skeleton, and $D$ is the dimensionality of each joint, cutout for skeleton data can be defined as randomly selecting a set of joints $I_{cutout} \subseteq \{1, 2, \ldots, J\}$ to be masked, where the size of $I_{cutout}$ is controlled by a hyperparameter $p$ representing the proportion of joints to be cut out. Then, for each selected joint $j \in I_{cutout}$, its coordinates are set to zero (or to an alternative value, such as the dataset mean) across all frames, such that $x_{t,j} = 0$ for all $t \in \{1, 2, \ldots, T\}$, $j \in I_{cutout}$. The masking function $M$ can be represented as shown in Equation (8):
$$M(x)_{t,j} = \begin{cases} x_{t,j}, & \text{if } j \notin I_{cutout} \\ 0, & \text{if } j \in I_{cutout} \end{cases} \quad (8)$$
This cutout augmentation technique effectively masks or “removes” selected joints from the skeleton data, simulating missing or occluded parts of the body, as shown in Figure 5. By doing so, the model is trained to recognize activities using only the remaining visible joints, encouraging it to focus on broader movement patterns rather than relying on specific joints. This enhances the model’s robustness, improving its ability to generalize to real-world scenarios where sensor data might be incomplete or noisy.
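A minimal sketch of this joint-level cutout, under the assumption that a skeleton sample is stored as a (T, J, D) array, is shown below; the function name and the default masking proportion are illustrative.

```python
import numpy as np

def skeleton_cutout(x, p=0.2, rng=np.random.default_rng(0)):
    """Cutout for a skeleton sample x of shape (T, J, D): mask a random
    proportion p of joints across all frames, as in Equation (8).
    p is a hyperparameter; 0.2 is only an illustrative default."""
    T, J, D = x.shape
    n_masked = max(1, int(round(p * J)))
    masked_joints = rng.choice(J, size=n_masked, replace=False)  # the set I_cutout
    x_aug = x.copy()
    x_aug[:, masked_joints, :] = 0.0  # zero out the selected joints in every frame
    return x_aug
```

Replacing zeros with the dataset mean, as mentioned above, only changes the fill value assigned to the masked joints.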

4. Results

Our experimental results were obtained on an NVIDIA Ampere A100 GPU with 80 gigabytes of video RAM, manufactured by NVIDIA Corporation, based in Santa Clara, CA, USA. For mixup on sensor-based data, we tested our method on four datasets, namely HHAR [61], MotionSense [62], UCI-HAR [63], and the RealWorld dataset [64]. In our experiments, we show that models trained on mixup-augmented datasets produce better results than the baseline transformer-based models [22,60], as shown in Table 1 and, more succinctly, in Figure 6.
The table illustrates the accuracy of the transformer model across three different training data scenarios: clean data, mixup data, and a combination of mixup and clean data. For the HHAR dataset, the model trained on clean data achieved an accuracy of 98.01%, while the model trained on mixup augmented data reached a slightly higher accuracy of 98.15%. The highest accuracy of 98.22% was observed when the model was trained on a combination of mixup and clean data, demonstrating the effectiveness of our approach in leveraging both augmented and original data to enhance model performance. Similarly, the MotionSense dataset showed a slight decrease in accuracy when trained solely on mixup data (97.98%) compared to clean data (98.17%). However, the combination of mixup and clean data resulted in an improved accuracy of 98.3%, highlighting the robustness of our method in integrating augmented data for better generalization. For the UCI-HAR dataset, the model trained on mixup data achieved a significantly higher accuracy of 94.5%, compared to 93.32% with clean data. This substantial improvement underscores the potential of mixup augmentation in scenarios with limited training data or higher variability. The combination of mixup and clean data also performed well, achieving an accuracy of 94.44%, close to the mixup-only scenario. The RealWorld dataset also benefited from mixup augmentation. The model trained on clean data achieved an accuracy of 93.94%, while the mixup augmented data resulted in an accuracy of 94.54%. The highest accuracy of 94.66% was observed with the combination of mixup and clean data, reinforcing the trend seen across other datasets.
On the other hand, we used the PYSKL library proposed in [65] to test cutout augmentation on the STGCN [55] and STGCN++ [65] models. Table 2 compares the performance of the ST-GCN and ST-GCN++ models with and without the application of cutout augmentation on the NTU RGB + D [66] and NTU RGB + D 120 [67] action recognition datasets, evaluated in both cross-subject (XSubject) and cross-view (XView) settings. The NTU RGB + D dataset is one of the most extensive skeleton-based action recognition datasets, consisting of 60 different actions performed by 40 subjects in front of an RGB + D camera, with data captured from multiple angles. The extended NTU RGB + D 120 dataset includes 120 action classes, covering a more comprehensive range of activities and more diverse scenarios. The results demonstrate that cutout augmentation provides a noticeable improvement in several cases, particularly in the XView setting. For instance, in NTU RGB + D, the accuracy for ST-GCN with cutout increased from 94.89% to 95.48%, while ST-GCN++ with cutout showed a slight improvement from 94.93% to 94.98%. Similarly, for the NTU RGB + D 120 dataset, cutout improved the performance of ST-GCN from 83.05% to 83.25% in the XSubject setting, and the accuracy of ST-GCN++ in the XView setting increased from 88.64% to 88.68%. These results suggest that the cutout augmentation technique helps the models generalize better, particularly in scenarios involving different views or subjects, by forcing the model to focus on incomplete data and improving its robustness.
Overall, the results clearly indicate that both mixup and cutout augmentations, whether used alone or in combination with clean data, enhance the accuracy of models for HAR tasks. Both mixup applied to sensor-based data and cutout to skeleton-based data contribute to improving model robustness. The combination of augmented and clean data often yields the best performance, suggesting that integrating these augmentation techniques can effectively address data scarcity and variability issues, leading to more robust and generalizable models across different HAR modalities.

5. Conclusions

In this research, we explored the application of transformer models, specifically Vision Transformers (ViTs), for Human Activity Recognition (HAR) using time-series data. To address the prevalent data scarcity issue and enhance HAR model robustness, we adapted the mixup data augmentation technique for time-series data. Our experiments on four diverse HAR datasets (HHAR, MotionSense, UCI-HAR, and RealWorld) consistently demonstrated that models trained with mixup-augmented data outperform those trained on clean data alone.
In addition to mixup, we implemented cutout augmentation on Graph Neural Network (GNN) models trained on skeleton-based HAR datasets. By masking random joints from the skeleton data, cutout forces the GNN models to learn activity patterns from incomplete data, improving generalization and robustness, particularly in scenarios where skeleton data may be noisy or incomplete. The application of cutout augmentation on skeleton datasets such as NTU RGB + D 60 and NTU RGB + D 120 showed noticeable performance gains, particularly in cross-subject and cross-view evaluations.
Our approach of combining mixup augmentation with Vision Transformer models and cutout augmentation with GNN models for skeleton data presents a novel and effective methodology for advancing HAR. This combination leverages the strengths of both techniques: the powerful temporal dependency capture of transformers for time-series data and the spatial–temporal representation learning of GNNs for skeleton data. Together, these approaches address key challenges in HAR, such as variability in sensor data, incomplete skeleton data, and the need for robust, scalable models.

Author Contributions

Conceptualization, H.D. and J.K.; Data curation, H.D.; Formal analysis, H.D. and J.K.; Funding acquisition, J.K.; Methodology, H.D.; Software, H.D.; Supervision, J.K.; Visualization, H.D.; Writing—original draft, H.D.; Writing—review and editing, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2021R1A2C2008414), the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-2020-0-01789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2024-RS-2023-00254592) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). This work was supported by the Dongguk University Research Fund of 2024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We utilized publicly available HHAR, MotionSense, UCI-HAR, RealWorld, and NTU RGB + D 60 and 120 datasets to train our models. The datasets can be accessed using the following links: https://archive.ics.uci.edu/dataset/344/heterogeneity+activity+recognition (last accessed on 31 October 2024), https://github.com/mmalekzadeh/motion-sense?tab=readme-ov-file (last accessed on 31 October 2024), https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones (last accessed on 31 October 2024), https://www.uni-mannheim.de/dws/research/projects/activity-recognition/dataset/dataset-realworld (last accessed on 31 October 2024), and https://github.com/shahroudy/NTURGB-D (last accessed on 31 October 2024).

Acknowledgments

The writing process involved the use of AI language models. We declare that the authors reviewed and edited the content generated by these models to ensure accuracy and relevance.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gupta, S. Deep Learning Based Human Activity Recognition (HAR) Using Wearable Sensor Data. Int. J. Inf. Manag. Data Insights 2021, 1, 100046. [Google Scholar] [CrossRef]
  2. Kumar, P.; Chauhan, S.; Awasthi, L.K. Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions. Arch. Comput. Methods Eng. 2024, 31, 179–219. [Google Scholar] [CrossRef]
  3. Younesi, A.; Ansari, M.; Fazli, M.; Ejlali, A.; Shafique, M.; Henkel, J. A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends. IEEE Access 2024, 12, 41180–41218. [Google Scholar] [CrossRef]
  4. Bibbò, L.; Vellasco, M.M. Activity Recognition (HAR) in Healthcare. Appl. Sci. 2023, 24, 13009. [Google Scholar] [CrossRef]
  5. Ohashi, H.; Al-Naser, M.; Ahmed, S.; Akiyama, T.; Sato, T.; Nguyen, P.; Nakamura, K.; Dengel, A. Augmenting Wearable Sensor Data with Physical Constraint for DNN-Based Human-Action Recognition. In Proceedings of the ICML 2017 Times Series Workshop, Sydney, Australia, 6–11 August 2017; pp. 6–17. [Google Scholar]
  6. Fridriksdottir, E.; Bonomi, A.G. Accelerometer-Based Human Activity Recognition for Patient Monitoring Using a Deep Neural Network. Sensors 2020, 20, 6424. [Google Scholar] [CrossRef]
  7. Cuperman, R.; Jansen, K.; Ciszewski, M. An End-to-End Deep Learning Pipeline for Football Activity Recognition Based on Wearable Acceleration Sensors. Sensors 2022, 22, 1347. [Google Scholar] [CrossRef]
  8. Echeverria, J.; Santos, O.C. Toward Modeling Psychomotor Performance in Karate Combats Using Computer Vision Pose Estimation. Sensors 2021, 21, 8378. [Google Scholar] [CrossRef]
  9. Wu, J.; Wang, J.; Zhan, A.; Wu, C. Fall Detection with CNN-Casual LSTM Network. Information 2021, 12, 403. [Google Scholar] [CrossRef]
  10. Fan, J.; Bi, S.; Wang, G.; Zhang, L.; Sun, S. Sensor Fusion Basketball Shooting Posture Recognition System Based on CNN. J. Sens. 2021, 2021, 6664776. [Google Scholar] [CrossRef]
  11. Adel, B.; Badran, A.; Elshami, N.E.; Salah, A.; Fathalla, A.; Bekhit, M. A Survey on Deep Learning Architectures in Human Activities Recognition Application in Sports Science, Healthcare, and Security. In Proceedings of the ICR’22 International Conference on Innovations in Computing Research, Athens, Greece, 29–31 August 2022; Daimi, K., Al Sadoon, A., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 121–134. [Google Scholar]
  12. Rashid, N.; Demirel, B.U.; Faruque, M.A.A. AHAR: Adaptive CNN for Energy-Efficient Human Activity Recognition in Low-Power Edge Devices. IEEE Internet Things J. 2022, 9, 13041–13051. [Google Scholar] [CrossRef]
  13. Das, D.; Nishimura, Y.; Vivek, R.P.; Takeda, N.; Fish, S.T.; Plötz, T.; Chernova, S. Explainable Activity Recognition for Smart Home Systems. ACM Trans. Interact. Intell. Syst. 2023, 13, 1–39. [Google Scholar] [CrossRef]
  14. Najeh, H.; Lohr, C.; Leduc, B. Real-Time Human Activity Recognition in Smart Home on Embedded Equipment: New Challenges. In Proceedings of the Participative Urban Health and Healthy Aging in the Age of AI: 19th International Conference, ICOST 2022, Paris, France, 27–30 June 2022; Aloulou, H., Abdulrazak, B., de Marassé-Enouf, A., Mokhtari, M., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 125–138. [Google Scholar]
  15. Bouchabou, D.; Nguyen, S.M.; Lohr, C.; LeDuc, B.; Kanellos, I. A Survey of Human Activity Recognition in Smart Homes Based on IoT Sensors Algorithms: Taxonomies, Challenges, and Opportunities with Deep Learning. Sensors 2021, 21, 6037. [Google Scholar] [CrossRef] [PubMed]
  16. Wastupranata, L.M.; Kong, S.G.; Wang, L. Deep Learning for Abnormal Human Behavior Detection in Surveillance Videos—A Survey. Electronics 2024, 13, 2579. [Google Scholar] [CrossRef]
  17. Maeda, S.; Gu, C.; Yu, J.; Tokai, S.; Gao, S.; Zhang, C. Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows. arXiv 2024, arXiv:2404.17381. [Google Scholar]
  18. Shen, J.; De Lange, M.; Xu, X.O.; Zhou, E.; Tan, R.; Suda, N.; Lazarewicz, M.; Kristensson, P.O.; Karlson, A.; Strasnick, E. Towards Open-World Gesture Recognition. arXiv 2024, arXiv:2401.11144. [Google Scholar]
  19. Sabbella, S.R.; Kaszuba, S.; Leotta, F.; Serrarens, P.; Nardi, D. Evaluating Gesture Recognition in Virtual Reality. arXiv 2024, arXiv:2401.04545. [Google Scholar]
  20. Challa, S.K.; Kumar, A.; Semwal, V.B. A Multibranch CNN-BiLSTM Model for Human Activity Recognition Using Wearable Sensor Data. Vis. Comput. 2022, 38, 4095–4109. [Google Scholar] [CrossRef]
  21. Betancourt, C.; Chen, W.-H.; Kuan, C.-W. Self-Attention Networks for Human Activity Recognition Using Wearable Devices. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; pp. 1194–1199. [Google Scholar]
  22. Ek, S.; Portet, F.; Lalanda, P. Lightweight Transformers for Human Activity Recognition on Mobile Devices. arXiv 2022, arXiv:2209.11750. [Google Scholar]
  23. Mekruksavanich, S.; Jitpattanakul, A.; Youplao, P.; Yupapin, P. Enhanced Hand-Oriented Activity Recognition Based on Smartwatch Sensor Data Using LSTMs. Symmetry 2020, 12, 1570. [Google Scholar] [CrossRef]
  24. Mekruksavanich, S.; Jitpattanakul, A. Hybrid Convolution Neural Network with Channel Attention Mechanism for Sensor-Based Human Activity Recognition. Sci. Rep. 2023, 13, 12067. [Google Scholar] [CrossRef]
  25. Kashyap, S.K.; Mahalle, P.N.; Shinde, G.R. Human Activity Recognition Using 1-Dimensional CNN and Comparison with LSTM. In Sustainable Technology and Advanced Computing in Electrical Engineering; Mahajan, V., Chowdhury, A., Padhy, N.P., Lezama, F., Eds.; Springer Nature: Singapore, 2022; pp. 1017–1030. [Google Scholar]
  26. Krishna, K.S.; Paneerselvam, S. An Implementation of Hybrid CNN-LSTM Model for Human Activity Recognition. In Proceedings of the Advances in Electrical and Computer Technologies, Tamil Nadu, India, 1–2 October 2021; Sengodan, T., Murugappan, M., Misra, S., Eds.; Springer Nature: Singapore, 2022; pp. 813–825. [Google Scholar]
  27. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. arXiv 2022, arXiv:2202.07125. [Google Scholar]
  28. Genet, R.; Inzirillo, H. A Temporal Kolmogorov-Arnold Transformer for Time Series Forecasting. arXiv 2024, arXiv:2406.02486. [Google Scholar]
  29. Charabi, I.; Abidine, M.B.; Fergani, B. A Novel CNN-SVM Hybrid Model for Human Activity Recognition. In Proceedings of the IoT-Enabled Energy Efficiency Assessment of Renewable Energy Systems and Micro-Grids in Smart Cities, Tipasa, Algeria, 26–28 November 2023; Hatti, M., Ed.; Springer Nature: Cham, Switzerland, 2024; pp. 265–273. [Google Scholar]
  30. Ghosh, P.; Saini, N.; Davis, L.S.; Shrivastava, A. All About Knowledge Graphs for Actions. arXiv 2020, arXiv:2008.12432. [Google Scholar]
  31. Hu, L.; Liu, S.; Feng, W. Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition. arXiv 2022, arXiv:2208.08599. [Google Scholar]
  32. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef]
  33. Gori, M.; Monfardini, G.; Scarselli, F. A New Model for Learning in Graph Domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734. [Google Scholar]
  34. Ahmad, T.; Jin, L.; Zhang, X.; Lai, S.; Tang, G.; Lin, L. Graph Convolutional Neural Network for Human Action Recognition: A Comprehensive Survey. IEEE Trans. Artif. Intell. 2021, 2, 128–145. [Google Scholar] [CrossRef]
  35. Wu, W.; Tu, F.; Niu, M.; Yue, Z.; Liu, L.; Wei, S.; Li, X.; Hu, Y.; Yin, S. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 2370–2383. [Google Scholar] [CrossRef]
  36. Han, H.; Zeng, H.; Kuang, L.; Han, X.; Xue, H. A Human Activity Recognition Method Based on Vision Transformer. Sci. Rep. 2024, 14, 15310. [Google Scholar] [CrossRef]
  37. Ju, W.; Yi, S.; Wang, Y.; Xiao, Z.; Mao, Z.; Li, H.; Gu, Y.; Qin, Y.; Yin, N.; Wang, S.; et al. A Survey of Graph Neural Networks in Real World: Imbalance, Noise, Privacy and OOD Challenges. arXiv 2024, arXiv:2403.04468. [Google Scholar]
  38. Arshad, M.H.; Bilal, M.; Gani, A. Human Activity Recognition: Review, Taxonomy and Open Challenges. Sensors 2022, 22, 6463. [Google Scholar] [CrossRef]
  39. Zhang, J.; Wu, C.; Wang, Y.; Wang, P. Detection of Abnormal Behavior in Narrow Scene with Perspective Distortion. Mach. Vis. Appl. 2019, 30, 987–998. [Google Scholar] [CrossRef]
  40. Kwon, H.; Abowd, G.D.; Plötz, T. Handling Annotation Uncertainty in Human Activity Recognition. In Proceedings of the 2019 ACM International Symposium on Wearable Computers: Association for Computing Machinery, New York, NY, USA, 9 September 2019; pp. 109–117. [Google Scholar]
  41. Saini, R.; Kumar, P.; Roy, P.P.; Dogra, D.P. A Novel Framework of Continuous Human-Activity Recognition Using Kinect. Neurocomputing 2018, 311, 99–111. [Google Scholar] [CrossRef]
  42. Alafif, T.; Alzahrani, B.; Cao, Y.; Alotaibi, R.; Barnawi, A.; Chen, M. Generative Adversarial Network Based Abnormal Behavior Detection in Massive Crowd Videos: A Hajj Case Study. J. Ambient. Intell. Humaniz. Comput. 2022, 13, 4077–4088. [Google Scholar] [CrossRef]
  43. Zhu, Q.; Chen, Z.; Soh, Y.C. A Novel Semisupervised Deep Learning Method for Human Activity Recognition. IEEE Trans. Ind. Inform. 2019, 15, 3821–3830. [Google Scholar] [CrossRef]
  44. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]
  45. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  47. Ferreira, P.J.S.; Cardoso, J.M.P.; Mendes-Moreira, J. kNN Prototyping Schemes for Embedded Human Activity Recognition with Online Learning. Computers 2020, 9, 96. [Google Scholar] [CrossRef]
  48. Mohsen, S.; Elkaseer, A.; Scholz, S.G. Human Activity Recognition Using K-Nearest Neighbor Machine Learning Algorithm. In Proceedings of the Sustainable Design and Manufacturing, Split, Croatia, 15–17 September 2021; Scholz, S.G., Howlett, R.J., Setchi, R., Eds.; Springer: Singapore, 2022; pp. 304–313. [Google Scholar]
  49. Maswadi, K.; Ghani, N.A.; Hamid, S.; Rasheed, M.B. Human Activity Classification Using Decision Tree and Naïve Bayes Classifiers. Multimed. Tools Appl. 2021, 80, 21709–21726. [Google Scholar] [CrossRef]
  50. Khan, Z.N.; Ahmad, J. Attention Induced Multi-Head Convolutional Neural Network for Human Activity Recognition. Appl. Soft Comput. 2021, 110, 107671. [Google Scholar] [CrossRef]
  51. Mekruksavanich, S.; Jitpattanakul, A. User Identification Based on Human Activity Recognition Using Wearable Sensors: An Experiment Using Deep Learning Models. Electronics 2021, 10, 308. [Google Scholar] [CrossRef]
  52. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  53. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  54. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  55. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018, arXiv:1801.07455. [Google Scholar] [CrossRef]
  56. Zheng, Y.; Gao, C.; Chen, L.; Jin, D.; Li, Y. DGCN: Diversified Recommendation with Graph Convolutional Networks. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar]
  57. Mumuni, A.; Mumuni, F. Data Augmentation: A Comprehensive Survey of Modern Approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
  58. Lewy, D.; Mańdziuk, J. AttentionMix: Data Augmentation Method That Relies on BERT Attention Mechanism. arXiv 2023, arXiv:2309.11104. [Google Scholar]
  59. Van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  60. Dirgová Luptáková, I.; Kubovčík, M.; Pospíchal, J. Wearable Sensor-Based Human Activity Recognition with Transformer Model. Sensors 2022, 22, 1911. [Google Scholar] [CrossRef]
  61. Abbas, S.; Alsubai, S.; Haque, M.I.U.; Sampedro, G.A.; Almadhor, A.; Hejaili, A.A.; Ivanochko, I. Active Machine Learning for Heterogeneity Activity Recognition Through Smartwatch Sensors. IEEE Access 2024, 12, 22595–22607. [Google Scholar] [CrossRef]
  62. Malekzadeh, M.; Clegg, R.G.; Cavallaro, A.; Haddadi, H. Protecting Sensory Data against Sensitive Inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, Porto, Portugal, 23–26 April 2018; pp. 1–6. [Google Scholar]
  63. Sonawane, M.; Dhayalkar, S.R.; Waje, S.; Markhelkar, S.; Wattamwar, A.; Shrawne, S.C. Human Activity Recognition Using Smartphones. arXiv 2024, arXiv:2404.02869. [Google Scholar]
  64. Sztyler, T.; Stuckenschmidt, H. On-Body Localization of Wearable Devices: An Investigation of Position-Aware Activity Recognition. In Proceedings of the 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), Sydney, NSW, Australia, 14–19 March 2016; pp. 1–9. [Google Scholar]
  65. Duan, H.; Wang, J.; Chen, K.; Lin, D. PYSKL: Towards Good Practices for Skeleton Action Recognition. In Proceedings of the MM’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  66. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA; pp. 1010–1019. [Google Scholar]
  67. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701. [Google Scholar] [CrossRef]
Figure 1. HAR Data Augmentation Framework.
Figure 2. Human activity samples from RealWorld dataset before mixup.
Figure 3. Human activity sample from RealWorld dataset after mixup.
Figure 4. RealWorld dataset with (left) and without (right) mixup (T-SNE).
Figure 5. Cutout visualization on human skeleton sample. The white key-points on the right skeleton indicate where cutout has been applied to mask parts of the skeleton.
Figure 6. Comparison of model accuracy across different datasets and training methods.
Table 1. Comparison of transformer-based HAR model accuracy using clean, mixup, and combined training data across different datasets.

Architecture | Training Data | HHAR | MotionSense | UCI | RealWorld | Average
HART [22] | Clean dataset | 98.01% | 98.17% | 93.32% | 93.94% | 95.86%
HART [22] | Mixup dataset | 98.15% | 97.98% | 94.50% | 94.54% | 96.29%
HART [22] | Clean + Mixup | 98.22% | 98.30% | 94.44% | 94.66% | 96.41%
HAR-Transformer [60] | Clean dataset | 94.99% | 97.33% | 89.62% | 88.57% | 92.63%
HAR-Transformer [60] | Mixup dataset | 96.53% | 97.40% | 92.26% | 91.49% | 94.42%
HAR-Transformer [60] | Clean + Mixup | 97.04% | 97.91% | 91.75% | 91.52% | 94.56%
Table 2. Performance comparison of ST-GCN and ST-GCN++ models with and without cutout augmentation on the NTU 60 and NTU 120 datasets.

Dataset Configuration | NTU RGB + D XSubject | NTU RGB + D XView | NTU RGB + D 120 XSubject | NTU RGB + D 120 XView
STGCN [55] | 88.31% | 94.89% | 83.05% | 88.45%
STGCN with cutout | 88.50% | 95.48% | 83.25% | 88.30%
STGCN++ [65] | 88.02% | 94.93% | 84.35% | 88.64%
STGCN++ with cutout | 88.23% | 94.98% | 84.16% | 88.68%
