1. Introduction
Over the past few decades, human activity recognition (HAR) based on wearable sensors has garnered significant attention due to its wide applications in health monitoring, sports training, and medical rehabilitation. In the field of health monitoring, HAR technology enables continuous tracking of daily activities for the elderly and patients, providing early warnings for abnormal behaviors and ensuring their health and safety [1,2,3]. In sports training, HAR can deliver precise, quantitative feedback by analyzing athletes’ movement patterns, thereby enhancing training efficiency and preventing sports injuries [4,5,6]. Furthermore, in medical rehabilitation, HAR is widely applied in the control of prosthetics and assistive devices for patients with motor disabilities by detecting muscle activity and hand movements [7,8,9].
Early wearable sensor-based HAR methods primarily relied on manually extracted time-domain features (such as mean, variance, and standard deviation) and frequency-domain features (such as centroid frequency and average frequency), followed by traditional machine learning algorithms (such as Support Vector Machines, Decision Trees, and K-Nearest Neighbors) for classification [10,11]. However, these methods exhibit limited performance when dealing with complex sensor data. On the one hand, manually extracted features fail to fully capture the rich information in the raw signals, resulting in constrained recognition accuracy. On the other hand, feature engineering is heavily dependent on expert knowledge, making it both time-consuming and labor-intensive, and difficult to adapt to diverse application scenarios and complex sensor data [12,13,14].
With the rapid development of deep learning, significant breakthroughs have been achieved in fields such as image recognition [15] and natural language processing [16], and these technologies have begun to be widely applied in HAR [17]. Deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), can automatically learn and extract high-level, abstract features from raw sensor data through their multi-layered network structures. This greatly reduces the reliance on manual feature engineering and significantly improves the accuracy of activity recognition.
Although deep learning-based HAR models have achieved remarkable progress, traditional architectures still encounter several technical challenges when dealing with long-term time-series sensor data. Models such as CNNs and RNNs have difficulty capturing long-range temporal dependencies inherent in long-term sequences. In addition, as the network depth increases, the extracted features tend to include a large amount of redundant information, which degrades the overall recognition efficiency.
To address the above challenges, this paper proposes an attention-based CNN–Bidirectional Gated Recurrent Unit (BiGRU)–Transformer hybrid deep learning model, aiming to more effectively extract key features from long-term time-series sensor data and significantly improve the accuracy and generalization of human activity recognition. The main contributions of this work can be summarized as follows:
To comprehensively extract spatial features, a multi-branch Convolutional Neural Network module is employed. Unlike previous models, which typically use a single convolution kernel, this model uses three convolutional kernels of different sizes to capture multi-scale local spatial features. This multi-kernel approach enhances the model’s ability to capture a broader range of spatial patterns from the input data.
To effectively extract temporal features from long-term time-series data, the model integrates the strengths of the BiGRU module in modeling short-range dependencies and the Transformer module in capturing long-range dependencies. This hybrid architecture, combining RNN and Transformer, distinguishes it from prior models that typically rely solely on RNNs. The inclusion of the Transformer enables the model to better capture complex, long-term temporal relationships in sensor data, improving its temporal modeling capabilities.
To reduce feature redundancy and emphasize critical information, an attention mechanism is introduced to dynamically weight the extracted features. Unlike prior models that often rely on simple channel stacking, which may result in redundant feature representations, our attention mechanism suppresses unnecessary information and emphasizes the most important features. This approach not only reduces data redundancy but also enhances the classification performance by focusing on the most relevant aspects of the data.
The effectiveness and superiority of the proposed model were evaluated using the WISDM, PAMAP2, and UCI-HAR datasets.
The subsequent sections of the paper are structured as follows: Section 2 reviews related work; Section 3 provides a detailed description of the proposed CNN–BiGRU–Transformer model; Section 4 describes the model setup, including dataset description, data preprocessing, evaluation metrics, and hyperparameter settings; Section 5 presents the evaluation results and provides a detailed analysis; Section 6 discusses the ablation study in detail; finally, Section 7 concludes the paper and outlines possible directions for future research.
2. Related Work
The research methods in the field of HAR have continuously evolved with advancements in technology. This section reviews research related to wearable sensor-based HAR, with a particular focus on deep learning-based approaches and their progress and challenges in feature extraction. Early wearable sensor-based HAR studies primarily employed traditional machine learning methods. Researchers manually designed and extracted time-domain and frequency-domain features from sensor signals, and then used classifiers such as Support Vector Machines, Decision Trees, or K-Nearest Neighbors for activity classification. However, these methods heavily relied on manual feature design, had limited feature representation capabilities, and struggled to fully exploit the deep information within complex sensor data. With the rise of deep learning, the Convolutional Neural Network was introduced to the HAR field due to its powerful local feature extraction capabilities. Sena et al. [18] proposed a CNN-based approach for human activity recognition, which achieved significant improvements over traditional machine learning methods. Wan et al. [19] developed a CNN model for activity classification, and results showed that the CNN surpassed Multi-layer Perceptrons and Support Vector Machines in overall recognition accuracy. However, CNNs mainly focus on local spatial features, and their ability to model temporal dependencies in sequential data remains limited.
To address the limitations of CNNs in modeling temporal dependencies in sequential data, researchers have incorporated RNNs and their variants, such as the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU). Wang et al. [20] proposed an LSTM-based HAR classification algorithm, which achieved high recognition accuracy on the UCI-HAR dataset. Li et al. [21] proposed a deep learning model based on residual blocks and Bidirectional Long Short-Term Memory (BiLSTM), which was evaluated on two public datasets, WISDM and PAMAP2.
Simultaneously, researchers have proposed various CNN-RNN hybrid architectures, which combine the advantages of CNNs in spatial feature extraction with those of RNNs in modeling temporal dependencies. In these models, the CNN is typically used to extract high-level spatial features and may also perform sequence dimensionality reduction, while the RNN processes the reduced feature sequence, modeling the temporal dependencies within it and extracting temporal features. The CNN-LSTM model proposed by Xia et al. [22] demonstrated superior performance on several benchmark HAR datasets. Challa et al. [23] explored the combination of a multi-branch CNN and BiLSTM, utilizing CNN branches with different convolution kernel sizes to extract multi-scale local features and employing BiLSTM to extract temporal features from the sequence. Khan et al. [24] proposed a hybrid model combining a 1D-CNN and LSTM for transition activity recognition. Albogamy [25] used a hybrid LSTM–GRU architecture and achieved excellent performance on multiple datasets. Although LSTM and GRU can model dependencies within sequential data to some extent, their modeling ability and computational efficiency still leave considerable room for improvement when dealing with longer sequences.
In recent years, the Transformer model [16] has achieved tremendous success in natural language processing. Its architecture, based on self-attention mechanisms and parallel computation, has demonstrated remarkable capabilities in modeling long-range dependencies within sequential data. The Transformer has been widely applied in various fields, including computer vision [26], speech processing [27], and time-series analysis [28]. In the context of HAR, the Transformer offers new insights for addressing the challenge of modeling dependencies in long-term time-series data. However, directly applying the standard Transformer to high-dimensional sensor data may incur prohibitively high computational costs. Some studies have therefore combined a Transformer with a CNN for dimensionality reduction; for instance, Al-Qaness et al. [29] proposed the PCNN-Transformer model for fall detection.
Furthermore, to optimize feature representations and reduce information redundancy, attention mechanisms have been increasingly applied to HAR models [30,31,32]. An attention mechanism assigns different weights to different feature dimensions or time steps, enabling the model to focus on the information most relevant to the task at hand. Khan et al. [33] enhanced the feature selection capability of a CNN model by incorporating the Squeeze-and-Excitation (SE) attention module. Khatun et al. [34] combined the self-attention mechanism with a CNN-LSTM model, achieving high recognition accuracy on the H-Activity dataset and thereby validating the effectiveness of attention mechanisms in enhancing HAR performance.
Based on existing research, this paper proposes an innovative CNN-BiGRU-Transformer hybrid model that integrates the strengths of each module to more effectively address the human activity recognition problem in long-term wearable sensor time-series data.
3. Proposed Model
The CNN–BiGRU–Transformer model proposed in this paper consists of three main modules. The first is a multi-branch CNN module, which extracts local spatial features from multi-sensor signals and captures the interrelationships between different sensors, such as changes in acceleration and angular velocity; these patterns effectively represent the local dynamic characteristics of the motion state. The second is the BiGRU–Transformer module, responsible for modeling dependencies in the temporal data and extracting temporal features. Finally, a fully connected layer performs classification. The model integrates the complementary strengths of the CNN, BiGRU, and Transformer in feature extraction, yielding more comprehensive feature information and improving the accuracy and generalization ability of the model. The overall structure of the CNN–BiGRU–Transformer model is shown in Figure 1.
3.1. Spatial Feature Extraction Module
To accommodate different application scenarios, convolutional layers of varying dimensions have been developed, with one-dimensional convolutional layers primarily used for processing time-series data [35]. The proposed model incorporates one-dimensional convolutional layers with three kernel sizes (3, 5, and 7), enabling it to capture features at multiple scales. The CNN block architecture used in this paper is shown in Figure 2. In each branch, given its input tensor, two consecutive convolutional operations are applied, followed by max-pooling to reduce the temporal dimensionality of the time-series data. The calculation process is shown in Equations (1)–(3) [36].
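To make the branch structure concrete, a minimal PyTorch sketch of one branch and the three-branch stack is given below. The ReLU activation, "same" padding, and the 64/128 filter counts (borrowed from the hyperparameter study in Section 4.4.1) are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """One CNN branch: two 1D convolutions followed by max-pooling."""
    def __init__(self, in_channels, kernel_size, mid_channels=64, out_channels=128):
        super().__init__()
        padding = kernel_size // 2  # keep the temporal length unchanged before pooling
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, mid_channels, kernel_size, padding=padding),
            nn.ReLU(),
            nn.Conv1d(mid_channels, out_channels, kernel_size, padding=padding),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),  # halve the temporal dimension
        )

    def forward(self, x):          # x: (batch, C, T)
        return self.block(x)       # -> (batch, out_channels, T // 2)

class MultiBranchCNN(nn.Module):
    """Three parallel branches with kernel sizes 3, 5, and 7."""
    def __init__(self, in_channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [ConvBranch(in_channels, k) for k in (3, 5, 7)]
        )

    def forward(self, x):          # x: (batch, T, C) -> transpose for Conv1d
        x = x.transpose(1, 2)
        return [branch(x) for branch in self.branches]
```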
Considering the inconsistency of input parameters across the WISDM, UCI-HAR, and PAMAP2 datasets, the proposed model is designed as an adaptive framework capable of handling different input dimensions. These datasets vary in the number and types of sensor channels, ranging from simple tri-axial accelerometer signals to multimodal inertial measurements. To ensure compatibility, all data are uniformly represented as $\mathbf{X} \in \mathbb{R}^{T \times C}$, where $T$ is the time window length and $C$ is the number of sensor channels. All available channels are utilized in the CNN stage, and multi-kernel convolutions (3, 5, 7) are applied to extract multi-scale features. This adaptive design effectively mitigates input inconsistency across datasets and enhances the generalization ability of the model.
An attention mechanism is employed to perform a weighted fusion of the features from the different branches, allowing the network to adjust the weights according to each branch’s contribution. Because the different convolutional kernels capture distinct features, this weighted fusion enables the network to exploit these diverse features effectively, thereby enhancing the model’s overall performance. The application of the channel attention mechanism is shown in Figure 3.
First, average pooling is applied to the features of the three branches to extract each branch’s global features. A fully connected network then generates attention weights for the three branches, which are normalized using the Softmax function. The branch features are subsequently weighted and summed according to these attention weights, emphasizing the contributions of the most important branches. The final calculation process is given in Equation (4) [37]:
Here, $\alpha_i$ denotes the attention weight corresponding to branch $i$.
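The weighted branch fusion described above can be sketched as follows, assuming the three branches produce feature maps of identical shape (as in the sketch above); the hidden size of the scoring network is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchAttentionFusion(nn.Module):
    """Weighted fusion of multi-branch CNN features via branch-level attention weights."""
    def __init__(self, channels, hidden=32):
        super().__init__()
        # Small fully connected network that scores each branch from its pooled descriptor.
        self.score = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, branch_feats):
        # branch_feats: list of tensors, each of shape (batch, channels, T)
        stacked = torch.stack(branch_feats, dim=1)       # (batch, B, channels, T)
        pooled = stacked.mean(dim=-1)                     # global average pooling over time
        weights = F.softmax(self.score(pooled).squeeze(-1), dim=1)   # (batch, B)
        fused = (stacked * weights[:, :, None, None]).sum(dim=1)     # weighted sum of branches
        return fused
```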
3.2. Temporal Feature Extraction Module
Human activity data is a form of time-series data, for which considering the temporal context is crucial for accurate activity recognition. The BiGRU, as a simplified and efficient variant of Recurrent Neural Networks, offers a more compact structure along with faster computation and training. In this study, BiGRU is employed to model human activity time-series data and capture contextual information within sequences.
By leveraging its bidirectional structure, BiGRU can simultaneously process forward and backward information within sequences, significantly enhancing classification performance. The framework of the BiGRU is shown in Figure 4.
The Transformer model is widely used in natural language processing. However, due to differences in data types and task objectives, certain modifications are necessary when applying it to human activity recognition tasks. In this study, the Encoder module of the Transformer model is employed to construct a time-series model, enabling the extraction of temporal features from human activity data. The design of the Transformer Encoder is shown in Figure 5.
The self-attention mechanism is the core component of the Transformer model. The computation process is as follows (Equation (5) [16]):
Here, $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively, while $d_k$ denotes the dimensionality of each vector in the input tensor. The attention matrix is obtained by computing the dot product between $Q$ and $K$, scaling the result, and then applying the Softmax function to calculate the weight of each vector in the input tensor. The multi-head self-attention mechanism is expressed as follows (Equation (6) [16]):
The calculation for each head is as follows (Equation (7) [16]):
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent the linear transformation matrices for each head, and an additional linear transformation matrix is applied after the multi-head concatenation.
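For reference, Equations (5)–(7) follow the standard scaled dot-product and multi-head attention formulations of [16], which can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\qquad\text{(Eq. 5)}

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}
\qquad\text{(Eq. 6)}

\mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right)
\qquad\text{(Eq. 7)}
```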
Another core component of the Transformer Encoder is the Feed-Forward Network (FFN), which enhances the model’s ability to process information by performing a nonlinear mapping and dimensional expansion on the input data. This allows the encoder to capture features of the input sequence more accurately and has proven highly effective in natural language processing tasks.
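The FFN described here commonly takes the standard two-layer form of [16], restated below for reference:

```latex
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
```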
The parallel architecture of the Transformer and BiGRU accelerates the model’s training process by improving computational efficiency. Similar to the feature fusion approach used in the multi-branch CNN, features extracted by the Transformer and BiGRU are fused using an attention mechanism. This mechanism selects the optimal feature representations, which improves the final classification performance.
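A possible realization of this parallel temporal module is sketched below; the hidden sizes, number of attention heads, encoder depth, and the final mean pooling over time are illustrative assumptions, positional encoding is omitted for brevity, and the two-branch fusion reuses the same attention idea as in the spatial module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalModule(nn.Module):
    """Parallel BiGRU and Transformer-encoder branches fused by attention."""
    def __init__(self, d_model, hidden=64, num_heads=4, num_layers=1):
        super().__init__()
        # d_model (feature size from the CNN stage) must be divisible by num_heads.
        self.bigru = nn.GRU(d_model, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d_model)  # map BiGRU output back to d_model
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.score = nn.Linear(d_model, 1)  # scores each branch from its pooled summary

    def forward(self, x):                      # x: (batch, T, d_model)
        gru_out = self.proj(self.bigru(x)[0])  # short-range context, (batch, T, d_model)
        trf_out = self.encoder(x)              # long-range context, (batch, T, d_model)
        stacked = torch.stack([gru_out, trf_out], dim=1)          # (batch, 2, T, d_model)
        weights = F.softmax(self.score(stacked.mean(dim=2)).squeeze(-1), dim=1)
        fused = (stacked * weights[:, :, None, None]).sum(dim=1)  # (batch, T, d_model)
        return fused.mean(dim=1)               # pooled temporal feature for classification
```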
3.3. Classification Module
At the final stage of the model, a fully connected layer is used to integrate the extracted features and perform the final classification. The fully connected layer combines features through linear transformations and nonlinear activation functions. The computation is defined as follows (Equation (8) [38]):
Here, $W$ represents the weight matrix, and $b$ denotes the bias vector. At the final classification stage, the output of the fully connected layer is passed through the Softmax function to compute the probability of each class; the calculation formula is as follows (Equation (9) [39]):
Here, the subscript $i$ indexes the elements of the output vector, and $K$ denotes the number of classes.
In the final step, the model outputs the class with the highest predicted probability. The calculation process is as follows (Equation (10) [36]):
4. Model Setup
The proposed CNN–BiGRU–Transformer model was implemented in Python using PyTorch and executed on a workstation equipped with an Intel Core i5 CPU and an NVIDIA RTX 4060 Ti GPU. The model was trained and evaluated on the WISDM, PAMAP2, and UCI-HAR datasets. The detailed software and hardware configurations are summarized in Table 1.
4.1. Dataset Description
WISDM [40]: The WISDM dataset was created by the Department of Computer and Information Science at Fordham University, New York, NY, USA, to support research in human activity recognition. The dataset contains 1,098,207 samples, with the proportions of each activity category shown in Table 2. It includes data from 36 participants, each of whom carried a smartphone, with accelerometer data collected at a frequency of 20 Hz. The WISDM dataset records six distinct activities: walking, jogging, walking upstairs, walking downstairs, sitting, and standing. Walking accounts for the largest proportion at 39%, while standing represents the smallest proportion at 4%. The samples are labeled by activity type, facilitating human activity classification and recognition tasks, and the dataset is widely used in fields such as human activity recognition and health monitoring.
PAMAP2 [41]: The PAMAP2 dataset was developed at the German Research Center for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, with the aim of providing high-quality, multidimensional data for human activity recognition research. The dataset records activity data from nine participants using multiple wearable sensors, including heart rate monitors, accelerometers, and gyroscopes, at a sampling frequency of 100 Hz. PAMAP2 includes 18 different daily activities, such as walking, running, stair climbing, and cycling. The time-series data for each activity comprise measurements from the various sensors, and the sample distribution for each activity is shown in Table 3.
UCI-HAR [42]: The UCI-HAR dataset, distributed through the UCI Machine Learning Repository, contains sensor data recorded from a waist-mounted smartphone worn by 30 volunteers performing daily activities. The dataset includes a total of 10,299 samples, and the distribution of activity categories is shown in Table 4. The signals were recorded using the built-in accelerometer and gyroscope at a sampling frequency of 50 Hz. The dataset comprises six activities: walking, walking upstairs, walking downstairs, sitting, standing, and lying. Among these categories, lying accounts for the highest proportion at 19.5%, while walking downstairs exhibits the lowest proportion at 13.7%. The meticulously labeled activity samples make the dataset widely used for human activity recognition, wearable computing research, and mobile health monitoring applications.
4.2. Dataset Preprocessing
The datasets used in this study were collected from real-world scenarios and contain incomplete, inconsistent, imbalanced, and noisy data as well as outliers, requiring preprocessing before further analysis. The preprocessing pipeline mainly includes normalization, handling of missing values, and data segmentation.
The sensor time series data from all three datasets are segmented using a sliding-window approach. For the WISDM and PAMAP2 datasets, window sizes of 200 and 171 sample points are used, respectively, whereas the UCI-HAR dataset follows its standard protocol with a window size of 128 sample points. A 50% overlap is applied across all datasets to improve data utilization and preserve temporal continuity at window boundaries. After segmentation, the samples are divided into training, validation, and test sets according to a 7:2:1 ratio. Specifically, the WISDM dataset yields 10,709 samples (7496 training, 2142 validation, 1071 testing), the PAMAP2 dataset yields 6038 samples (4227 training, 1207 validation, 604 testing), and the UCI-HAR dataset yields 10,299 samples (7209 training, 2060 validation, 1030 testing). The training set is used for model optimization, the validation set for hyperparameter tuning and ablation experiments, and the test set for final model evaluation, ensuring a balanced distribution and sufficient samples for reliable performance assessment.
Because the data originate from heterogeneous sensors with different measurement ranges, normalization is performed to scale all features to a consistent range. This step is crucial for stabilizing and accelerating the training of deep learning models, as it improves the efficiency and convergence behavior of gradient-based optimization algorithms.
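A minimal sketch of this preprocessing pipeline is shown below, assuming the raw signals are already loaded as a NumPy array of shape (N, C) with integer-encoded per-sample labels; labeling each window by its majority label and z-score scaling are common conventions assumed here for illustration, while the window sizes, 50% overlap, and 7:2:1 split follow the description above.

```python
import numpy as np

def sliding_windows(signals, labels, window, overlap=0.5):
    """Segment a (N, C) signal array into fixed-length windows with overlap.

    Each window is labeled with the majority label of the samples it covers.
    """
    step = int(window * (1 - overlap))
    X, y = [], []
    for start in range(0, len(signals) - window + 1, step):
        X.append(signals[start:start + window])
        y.append(np.bincount(labels[start:start + window]).argmax())  # majority label
    return np.asarray(X), np.asarray(y)

def zscore_normalize(train, *others):
    """Normalize each channel using statistics computed on the training set only."""
    mean = train.mean(axis=(0, 1), keepdims=True)
    std = train.std(axis=(0, 1), keepdims=True) + 1e-8
    return [(a - mean) / std for a in (train, *others)]

# Example: WISDM-style segmentation (window of 200 samples, 50% overlap),
# followed by a 7:2:1 train/validation/test split.
# X, y = sliding_windows(raw_signals, raw_labels, window=200)
# n = len(X); i, j = int(0.7 * n), int(0.9 * n)
# X_train, X_val, X_test = X[:i], X[i:j], X[j:]
# X_train, X_val, X_test = zscore_normalize(X_train, X_val, X_test)
```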
4.3. Evaluation Metrics
Most activity classes in the datasets are imbalanced, with certain activities accounting for significantly higher proportions than others. Therefore, relying solely on overall recognition accuracy cannot fully reflect the algorithm’s performance. To address this, multiple evaluation metrics, including accuracy, precision, recall, F1 score, and the confusion matrix, were used to comprehensively evaluate the model. The calculation formulas are shown in Equations (11)–(14) [43].
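Equations (11)–(14) correspond to the standard definitions of these metrics (with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives), restated below for reference:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\qquad\text{(Eq. 11)}

\mathrm{Precision} = \frac{TP}{TP + FP}
\qquad\text{(Eq. 12)}

\mathrm{Recall} = \frac{TP}{TP + FN}
\qquad\text{(Eq. 13)}

F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\qquad\text{(Eq. 14)}
```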
4.4. Hyperparameter Evaluation
Hyperparameters play a crucial role in the performance of deep learning models. Identifying the optimal set of hyperparameters resembles searching for the best solution among numerous possibilities, which is often challenging. Thus, an iterative approach is adopted in this study, adjusting various hyperparameter configurations to determine the optimal combination. The evaluation of hyperparameters is based on the WISDM dataset. The model is first trained on the training set, and then evaluated using the validation set to obtain corresponding results. Based on these results, the model’s hyperparameters are assessed, with a focus on the following key aspects.
4.4.1. Number of Filters in Convolutional Layers
The number of filters in the convolutional layers determines the model’s feature extraction capacity. If the number of filters is too small, the model cannot extract sufficient features, leading to poor recognition performance. Conversely, if the number of filters is too large, the feature extraction ability improves, but the model’s complexity increases. Therefore, finding a balanced number of filters is crucial to ensuring high performance while minimizing unnecessary computational overhead. In this study, four filter configurations for the first and second convolutional layers were tested: 16/32, 32/64, 64/128, and 128/256. By comparing the parameter count and recognition accuracy of each configuration, the optimal combination was identified. As shown in Figure 6, the 64/128 combination achieves high recognition accuracy while maintaining relatively low model complexity.
4.4.2. Batch Size
The batch size determines the amount of data used for each model update. A smaller batch size may lead to unstable updates, whereas a larger batch size can slow down convergence and increase training difficulty. To identify the optimal batch size for the model, evaluations were conducted using different settings (32, 64, 128, and 256). The evaluation results are presented in Figure 7, where it can be observed that a batch size of 128 achieves the highest recognition accuracy.
4.4.3. Choice of Optimizer
The optimizer is a crucial factor affecting both the training performance and convergence speed of the model. Different optimizers (such as Adam, SGD, and Adagrad) can have a significant impact on model performance. In this study, commonly used optimizers were evaluated and compared to determine the most suitable choice for the model. The evaluation results are presented in Figure 8, where it can be observed that the Adam optimizer achieves the highest training efficiency.
Therefore, the final set of hyperparameters adopted in this study is summarized in Table 5. Based on these settings, we further evaluate the computational efficiency of the proposed model. As shown in Table 6, the CNN–BiGRU–Transformer model achieves high performance with a relatively small number of trainable parameters and floating-point operations (FLOPs) while maintaining millisecond-level inference times per sample on a PC-grade GPU across the benchmark datasets. These results demonstrate the high computational efficiency of the model and suggest that it has promising potential for real-time human activity recognition in practical applications.
6. Ablation Study
To evaluate the impact of different modules on the overall model performance, four comparative models were designed and compared with the final proposed CNN-BiGRU–Transformer model. All models were trained on the training set of the WISDM dataset and evaluated on the validation set. Specifically, Model A contains only the three-branch CNN for spatial feature extraction, without any temporal modeling. Model B extends Model A by introducing a BiGRU module to capture local temporal dependencies. Model C incorporates a Transformer encoder on top of the CNN to capture global temporal dependencies, without including the BiGRU module or attention mechanism. Model D further extends Model B by combining both BiGRU and Transformer modules to jointly capture short- and long-term temporal dependencies, but without the attention mechanism. Finally, the proposed model integrates all modules, including the attention mechanism, to adaptively fuse the extracted spatial and temporal features.
As shown in Table 10, different combinations of modules have a significant impact on model performance. In this table, Acc., Prec., Rec., and F1 denote accuracy, precision, recall, and F1-score, respectively. Model A, which relies solely on the CNN for spatial feature extraction without temporal modeling, achieves an accuracy of 94.35%, demonstrating relatively limited performance. With the addition of the BiGRU module, Model B effectively captures local temporal dependencies, improving the accuracy to 94.48% and the F1 score to 94.51%. Model C, which incorporates the Transformer to model global temporal dependencies but excludes both the BiGRU and attention mechanisms, achieves further improvements, reaching an accuracy of 96.61% and an F1 score of 96.65%. Moreover, Model D combines the BiGRU and Transformer modules, enabling the network to jointly capture both short- and long-term dependencies and achieving an accuracy of 97.51% and an F1 score of 97.53%. Finally, the proposed CNN–BiGRU–Transformer model, equipped with the attention mechanism, attains the best overall performance, with an accuracy of 98.56% and an F1 score of 98.47%, confirming that the integration of the multi-branch CNN, BiGRU, Transformer, and attention effectively enhances both spatial–temporal feature representation and classification robustness. To further verify the generalizability of the module design, the same ablation experiments were repeated on the PAMAP2 dataset, and the detailed results are provided in the Supplementary Materials (see Figure S1 and Table S1). For the UCI-HAR dataset, since its activity categories and data characteristics are highly similar to those of the WISDM dataset, the ablation experiments were not repeated.
The classification performance of each model can be further observed from the confusion matrices in Figure 12. Model A performs reasonably well in static activities such as sitting and standing but exhibits significant confusion in dynamic activities such as upstairs and downstairs, with an accuracy of only 88%. After introducing the BiGRU module, Model B improves the recognition accuracy for upstairs and downstairs activities to 90% and 89%, respectively, demonstrating enhanced capability in modeling local temporal dependencies. Model C, which incorporates the Transformer to capture global temporal dependencies, further improves the recognition accuracy for upstairs and downstairs to 92% and 91%, respectively. Model D, which combines both the BiGRU and Transformer modules, captures both local and global temporal dependencies, resulting in an even higher accuracy for upstairs and downstairs activities at 95% and 93%, respectively. Finally, our proposed model, which applies attention to the outputs of the BiGRU and Transformer, achieves the best overall performance, with accuracies reaching 98% for upstairs and 96% for downstairs, demonstrating the effectiveness of attention in enhancing both spatial–temporal feature representation and classification robustness.
7. Conclusions
This paper proposes an attention-based CNN–BiGRU–Transformer model for human activity recognition using wearable sensor data. By combining CNN for local spatial feature extraction with BiGRU and Transformer for modeling short- and long-range temporal dependencies, and further incorporating an attention mechanism to emphasize informative patterns, the model effectively exploits the spatial–temporal structure of multichannel time-series signals.
Experiments on three benchmark datasets demonstrate the effectiveness of the proposed approach. The CNN–BiGRU–Transformer model achieves recognition accuracies of 98.41% on WISDM, 95.62% on PAMAP2, and 96.74% on UCI-HAR. Compared with traditional machine learning methods, it substantially reduces the need for manual feature engineering by automatically learning discriminative representations from raw data. In comparison with existing deep learning models, it provides competitive or superior accuracy while maintaining a relatively small number of trainable parameters, moderate FLOPs, and millisecond-level inference times per sample on a PC-grade GPU.
Future work will focus on developing lightweight and deployable variants of the proposed model for wearable and mobile platforms, for example through model compression, pruning, and knowledge distillation, as well as incorporating personalized data and adaptive learning strategies to improve robustness across different users and real-world environments. In addition, although this study concentrates on human activity recognition, the underlying spatial–temporal modeling framework is also relevant to other domains that involve complex temporal and spatial patterns, such as grazing behavior recognition in agricultural cyber–physical systems based on triaxial accelerometer data [51], sustainable vehicle routing under uncertainty using time-dependent data-driven models [52], and few-shot industrial defect detection with multi-scale feature representations and memory mechanisms [53]. Exploring these directions would further validate the generality of the proposed architecture and extend its applicability to a broader range of engineering and industrial applications. Moreover, differences in input parameter configurations across datasets, such as variations in sensor modalities, channel dimensionality, and sampling characteristics, can influence the model's performance and generalization behavior. Since the proposed framework processes all available sensor channels in a unified end-to-end manner and employs attention mechanisms to adaptively highlight informative features, it is inherently capable of handling such heterogeneity. Nonetheless, a more systematic examination of how different input parameter settings affect model robustness would be a meaningful direction for future research, particularly when deploying the framework across diverse sensing platforms and real-world environments.