Article

Enhancing Wearable-Based Elderly Activity Recognition Through a Hybrid Deep Residual Network

by Sakorn Mekruksavanich 1 and Anuchit Jitpattanakul 2,3,*
1 Department of Computer Engineering, School of Information and Communication Technology, University of Phayao, Phayao 56000, Thailand
2 Department of Mathematics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
3 Intelligent and Nonlinear Dynamic Innovations Research Center, Science and Technology Research Institute, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(4), 107; https://doi.org/10.3390/make8040107
Submission received: 18 February 2026 / Revised: 31 March 2026 / Accepted: 16 April 2026 / Published: 18 April 2026
(This article belongs to the Special Issue Sustainable Applications for Machine Learning—2nd Edition)

Abstract

The rapid growth of the elderly population worldwide demands reliable activity recognition technologies to support independent living and continuous health supervision. However, conventional wearable sensor-based human activity recognition (HAR) techniques often fail to capture the complex temporal behaviour and subtle motion patterns characteristic of the elderly. To address these limitations, this study introduces a hybrid deep residual architecture—CNN-CBAM-BiGRU—that integrates convolutional neural networks (CNNs), the convolutional block attention module (CBAM), and bidirectional gated recurrent units (BiGRUs) to improve activity recognition using inertial measurement unit (IMU) data. In the proposed CNN-CBAM-BiGRU framework, CNN layers automatically derive representative features from raw sensor signals, CBAM applies adaptive channel and spatial attention to highlight informative patterns, and BiGRU captures long-range temporal relationships within activity sequences. The approach was evaluated on three benchmark datasets designed for elderly populations—HAR70+, HARTH, and SisFall—covering daily activities and fall events. The proposed model consistently outperforms existing methods across all datasets, achieving accuracies exceeding 96%, F1-scores above 93%, and a fall detection recall of 93.74%, confirming its robustness and suitability for safety-critical monitoring applications. Class-level evaluation indicates excellent recognition of static postures and consistent performance for dynamic actions. Convergence analysis further confirms efficient learning with limited overfitting across datasets. The proposed framework thus provides a robust and accurate solution for wearable-based elderly activity recognition, with strong potential for deployment in fall detection, health monitoring, and ambient assisted living systems.

1. Introduction

The worldwide population structure is experiencing a significant shift, characterized by a rapid increase in the proportion of older adults. As a result, intelligent monitoring technologies have attracted growing interest as practical solutions for continuous observation of elderly individuals in both residential environments and institutional care facilities [1,2].
In this context, activity recognition has emerged as a fundamental element in elderly care systems. It facilitates essential capabilities, including fall detection and prevention, monitoring of activities of daily living (ADL), recognition of atypical behavioral patterns that may indicate deteriorating health conditions, and prompt support or intervention when required [3,4]. Compared with camera-based surveillance solutions, which frequently raise privacy issues and are limited to specific environments, wearable sensing technologies offer a more appropriate alternative. Inertial measurement units, including accelerometers, gyroscopes, and magnetometers, support unobtrusive, privacy-preserving, and environment-independent monitoring [5]. These sensors continuously capture fine-grained motion data, enabling comprehensive analysis of physical activities and mobility patterns over extended periods [6,7].
Despite these advantages, recognizing activities performed by elderly individuals remains considerably more challenging than general human activity recognition (HAR). Aging-related conditions often lead to slower movements, reduced motion intensity, greater variability in the execution of activities, and subtle differences between similar actions. These characteristics arise from physical limitations and mobility impairments common in older adults. Conventional machine learning methods based on handcrafted features struggle to model such complexities. They typically require substantial domain knowledge and often fail to generalize across users, sensor placements, and data-collection settings [8]. Although deep learning techniques have achieved notable success in HAR tasks, most existing approaches have been developed and validated using datasets dominated by young or middle-aged participants. This raises concerns regarding their applicability and robustness in elderly-focused scenarios [9].
Contemporary deep learning approaches, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have significantly improved HAR performance by automatically extracting informative patterns from unprocessed sensor streams. CNNs are highly capable of identifying localized and spatial structures within multivariate temporal signals. Conversely, RNN-based architectures, such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, are especially effective for capturing temporal correlations that characterize human motion sequences [10]. However, conventional CNN and RNN frameworks typically assign uniform importance to all learned features. This constraint may lead to the neglect of subtle yet crucial discriminative information—particularly in distinguishing between closely related activities performed by older adults. Furthermore, the training of deep architectures is often affected by the vanishing gradient problem, which restricts the network’s capacity to learn intricate long-term dependencies.
To address these limitations, attention-based techniques have been incorporated as powerful enhancements to deep neural architectures. Such mechanisms enable the model to emphasize salient features or critical temporal intervals dynamically, thereby enhancing the expressiveness of learned representations [11]. The convolutional block attention module (CBAM), which integrates channel-level and spatial attention operations, has achieved notable success in computer vision tasks by adaptively recalibrating feature responses [12]. However, attention-oriented architectures tailored specifically for elderly activity recognition have not been extensively investigated. In parallel, residual learning methods—achieved through skip connections—promote stable gradient flow and enable effective training of deeper networks, leading to improved convergence behavior and generalization.
Motivated by these observations, this study presents a new hybrid deep residual framework that combines a CNN, CBAM, and a bidirectional GRU (BiGRU) for elderly activity recognition. The proposed CNN-CBAM-BiGRU architecture unifies convolutional representation learning, attention-driven feature refinement, and bidirectional temporal analysis within a single model. The CNN layers automatically extract hierarchical spatial patterns from raw sensor data. The CBAM component enhances salient information while attenuating less relevant signals via both channel- and spatial-attention mechanisms. Subsequently, the BiGRU module models extended temporal relationships by leveraging information from preceding and succeeding time steps, which is crucial for interpreting complex activity sequences. Residual connections are incorporated throughout the network to maintain stable gradient flow and facilitate the learning of deeper, more discriminative representations.
To thoroughly assess the proposed model, extensive experiments are conducted using three widely adopted elderly-centered benchmark datasets: HAR70+, HARTH, and SisFall. These datasets include a broad range of activities, including daily routines, transitional movements, and fall-related events. They provide a comprehensive evaluation environment across varying scenarios and sensor configurations. Comparative analyses with state-of-the-art approaches, along with ablation experiments and attention-weight investigations, demonstrate that the CNN–CBAM–BiGRU model achieves superior performance and improved interpretability.
The main contributions of this work are summarized as follows. First, we introduce a new hybrid deep residual framework that integrates CNN, CBAM, and BiGRU components to better capture the distinctive movement characteristics of older adults. Second, an attention-driven feature enhancement mechanism is embedded to highlight subtle but highly informative patterns that are essential for accurately recognizing elderly activities. Third, comprehensive experiments conducted on three benchmark datasets specifically collected from elderly populations demonstrate notable gains in classification accuracy, F1-score, and overall generalization compared with existing approaches. Finally, thorough ablation studies, together with visual analyses of the attention modules, are presented to interpret the model’s behavior and to quantify the contribution of each architectural element.
The rest of the paper is structured as follows. Section 2 surveys prior research on human activity recognition and deep learning methods for elderly monitoring. Section 3 details the proposed CNN–CBAM–BiGRU model. Section 4 describes the experimental setup, datasets, and evaluation criteria. Section 5 reports and analyzes the results, including comparisons with baseline techniques. Section 6 summarizes the findings and outlines potential directions for future work.

2. Related Work

Sensor-based HAR can be viewed as a subarea of temporal data classification, a research domain that resides at the convergence of multiple artificial intelligence disciplines. In this work, existing studies are reviewed along two primary dimensions: traditional approaches grounded in classical machine learning techniques, and modern solutions based on deep learning architectures.

2.1. Classical Machine Learning Approaches for HAR

Prior to the rapid emergence of deep learning, HAR primarily relied on traditional machine learning methods. These approaches generally required manually crafted features derived from raw sensor measurements. The processing pipeline typically involved several consecutive stages: signal preprocessing, feature extraction, feature selection, and subsequent classification. Commonly used classifiers in this framework included support vector machines (SVMs), k-nearest neighbors (k-NNs), decision trees, and hidden Markov models [13].
Although these methods contributed significantly to early progress in HAR research, their effectiveness was strongly influenced by the quality of manually crafted features. Designing such features required substantial domain expertise and often involved considerable time and effort. In contrast, deep learning-based solutions have demonstrated a stronger ability to learn meaningful representations automatically from raw sensor inputs. This capability substantially reduces the dependence on manual feature engineering and improves overall system robustness [14].

2.2. Deep Learning Approaches for HAR

Deep learning approaches have emerged as the prevailing paradigm for HAR because they can learn hierarchical representations directly from raw sensor streams. By removing the reliance on manually engineered features, these methods effectively overcome several shortcomings of conventional machine learning techniques [15].
Within this domain, CNNs have been widely utilized for HAR. They are well-suited for identifying localized and spatial structures embedded in sensor measurements. Nevertheless, many CNN-based frameworks operate on short temporal segments or isolated frames, which can restrict their capacity to capture long-term temporal dynamics [16,17].
To address this issue, RNNs, such as LSTM and GRU models, have been employed. These architectures are designed for sequential processing and can learn extended temporal dependencies that are crucial for modeling complex activity patterns [18,19].
More recently, hybrid deep learning frameworks that combine convolutional and recurrent modules have been introduced. By jointly modeling spatial features and temporal sequences, these integrated approaches deliver enhanced robustness and improved recognition performance in HAR systems [20].

2.3. Hybrid Deep Learning Approaches for HAR

Hybrid deep learning frameworks for HAR aim to combine the complementary strengths of different neural network architectures. These models are designed to mitigate the weaknesses of individual components while enhancing overall performance. A common strategy involves integrating CNNs for effective spatial feature extraction with RNNs or their variants to capture temporal dependencies [21,22].
One representative example is the DeepConvLSTM architecture, which employs convolutional layers to automatically extract features from raw sensor data and LSTM layers to model complex temporal relationships within activity sequences [23]. Beyond this structure, recent studies have explored the inclusion of attention mechanisms and other specialized modules within hybrid models. These enhancements further improve feature discrimination, particularly in challenging contexts such as elderly activity recognition [24].
By leveraging the spatial representation capabilities of CNNs alongside the temporal modeling strengths of LSTM and GRU networks, hybrid approaches have consistently demonstrated superior accuracy and robustness [25]. For instance, models that combine CNN-based spatial feature extraction from multivariate time-series data with LSTM-based temporal analysis have shown notable improvements in activity classification performance [26]. This integrated design effectively exploits the advantages of both convolutional and recurrent architectures. As a result, hybrid deep learning systems achieve enhanced recognition accuracy and improved generalization across diverse HAR scenarios [27,28].
Many hybrid CNN–RNN architectures for HAR have been proposed, yet the design of attention mechanisms within these frameworks remains insufficiently explored. Existing HAR systems employ three main attention strategies (transformer-based self-attention, temporal attention, and channel attention), each carrying distinct computational and representational trade-offs. Transformer self-attention computes pairwise dependencies across the entire sequence, resulting in quadratic complexity $O(T^2)$ in the sequence length $T$. This becomes a significant burden for dense IMU signals sampled at 50–200 Hz over multi-second windows. Temporal attention assigns importance only to individual time steps, making it unable to adjust the contribution of different sensor channels, such as accelerometer versus gyroscope axes, which vary considerably across activity types. Full self-attention inherits the same quadratic cost while also lacking hierarchical integration with convolutional feature maps, which is essential for multi-scale feature refinement. The CBAM addresses these limitations by applying sequential channel-wise and spatial recalibration at each convolutional stage, handling both sensor-channel saliency and temporal importance at linear computational cost. Its progressive integration after each convolutional block enables multi-scale attention refinement, something unachievable when attention is applied only at the network output. These advantages motivated the adoption of CBAM as the attention component in the proposed framework.

3. Methodology

This section presents the comprehensive methodological approach employed in this research. It includes the elderly-oriented benchmark HAR datasets, the data preparation and preprocessing procedures, and the architecture of the proposed CNN–CBAM–BiGRU model. Figure 1 illustrates the overall structure of the developed elderly activity recognition system. The framework is arranged into four consecutive phases: acquisition of sensor data from elderly activities, preprocessing and transformation of the raw signals, model training and activity classification, and the ultimate identification of human activities.
At a finer granularity, the proposed pipeline can be viewed as five successive stages. These begin with collecting motion data from wearable devices attached to elderly participants, followed by preparing the raw sensor signals through noise removal, normalization, and windowing. The next stages involve partitioning data at the subject level for training and validation purposes, training the CNN–CBAM–BiGRU architecture to classify physical activities, and finally performing real-time activity recognition during inference. Section 3.1 covers the benchmark datasets and their corresponding data collection procedures. Section 3.2 outlines the signal preparation and data partitioning strategies. Section 3.3 introduces the architecture of the proposed model along with its training configuration.

3.1. Benchmark Elderly HAR Datasets

To rigorously assess the effectiveness of the proposed CNN–CBAM–BiGRU framework, this study employs three widely used benchmark datasets specifically designed for elderly HAR: HARTH, HAR70+, and SisFall. These datasets were chosen because they primarily target older adults and include a broad range of activity categories, including daily living activities and fall-related events. Table 1 summarizes the main properties of each dataset, including participant characteristics, sensor setups, and the types of activities recorded.

3.1.1. HARTH Dataset

The Human Activity Recognition Trondheim (HARTH) dataset [29] is a professionally annotated and class-imbalanced dataset created by the Department of Public Health and Nursing, Faculty of Medicine and Health Sciences, NTNU, Trondheim, Norway. It includes recordings from 22 participants, each wearing two tri-axial accelerometer sensors. Data were collected for around 2 h per individual in a free-living setting, enabling the capture of natural, everyday movements.
The sensing devices were mounted at predefined anatomical positions to ensure consistency across participants. One accelerometer was placed on the lower back, corresponding approximately to the third lumbar vertebra (L3), while the second sensor was attached to the right distal thigh, positioned about 10 cm above the patella.
Due to its high-quality data acquisition process and expert-generated annotations, the HARTH dataset serves as a valuable benchmark for the research community. It supports the development and evaluation of machine learning models aimed at accurate activity recognition under realistic, unconstrained conditions. The dataset includes seven primary activities of daily living: walking, running, stair ascent, stair descent, standing, sitting, and lying.

3.1.2. HAR70+ Dataset

The HAR70+ dataset [30] contains activity recordings that were professionally annotated by researchers from the same laboratory responsible for the HARTH dataset. It includes 18 older adults aged 70–95 years, covering both healthy and frail individuals. Each participant was equipped with two tri-axial accelerometers and monitored for approximately 40 min under a semi-structured, free-living experimental protocol.
The placement of the sensing devices followed the same configuration adopted in the HARTH dataset. Specifically, the accelerometers were mounted on the right thigh and the lower back, and data were captured at 50 Hz. Detailed demographic and participant-related information for the HAR70+ dataset is summarized in Table 1.

3.1.3. SisFall Dataset

The SisFall dataset [31] consists of motion recordings collected from 38 participants, including 23 younger adults and 15 older individuals. Each volunteer carried out 34 distinct actions under controlled experimental conditions. These actions comprise 19 ADLs and 15 simulated fall events. Multiple repetitions were performed, resulting in a total of 4510 complete activity sequences.
Data acquisition was conducted using a dedicated sensing platform equipped with two three-axis accelerometers and one three-axis gyroscope, all operating at a sampling frequency of 200 Hz. The sensing unit was worn around the waist and maintained a fixed alignment relative to the participant’s body throughout data collection. Within the SisFall dataset, annotations label each full sequence as either a fall or an ADL.
Nevertheless, the dataset does not specify the precise temporal location of a fall within a signal sequence, nor does it indicate the exact onset and offset of individual ADLs. This restricts its direct applicability to training RNN models for real-time fall detection. To address this issue, the annotation scheme was treated at the sequence level: each complete temporal window was assigned to one of two classes, either a fall event or a normal ADL. This binary classification formulation is consistent with the primary safety-monitoring objective of the SisFall dataset: reliable discrimination between fall incidents and routine daily activities. Pre-fall or transitional sequences that could not be unambiguously assigned were treated as ADL instances, reflecting their closer proximity to normal daily movement in terms of sensor signature. Accordingly, all experimental results reported for SisFall in this study correspond to a binary classification task.
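The sequence-level labeling described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the public SisFall file naming, in which names beginning with `F` (for example `F05_SE01_R01.txt`) denote simulated falls and names beginning with `D` denote ADLs. The function name `sisfall_binary_label` is hypothetical.

```python
from pathlib import Path


def sisfall_binary_label(filename: str) -> int:
    """Return 1 for a fall recording and 0 for an ADL recording.

    Assumption: the first letter of the file name encodes the class
    ('F' = fall, 'D' = activity of daily living).
    """
    code = Path(filename).name[:1].upper()
    if code == "F":
        return 1
    if code == "D":
        return 0
    raise ValueError(f"Unrecognised activity code in {filename!r}")


# Every window cut from a recording inherits the recording's label.
labels = [sisfall_binary_label(name)
          for name in ["D01_SA01_R01.txt", "F05_SE01_R01.txt"]]
```

Under this scheme a window taken from the quiet portion of a fall recording is still labeled as a fall, which is the trade-off implied by sequence-level annotation.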

3.2. Data Preprocessing

Data preprocessing represents an essential phase in which unprocessed sensor signals are transformed into a structured representation appropriate for computational modeling and evaluation. This stage involves a sequence of systematic operations to improve signal reliability, suppress unwanted disturbances, and ensure compatibility with deep learning architectures. The series of preprocessing steps applied to the original sensor measurements is depicted in Figure 1, which summarizes the complete data preparation workflow.

3.2.1. Data Denoising

Wearable sensing signals collected in real-world conditions are naturally affected by multiple noise sources, including gradual sensor drift, electromagnetic disturbances, and motion-induced artifacts. To mitigate these effects, a Butterworth low-pass filter with a cutoff frequency of 20 Hz was employed. This filtering strategy suppresses unwanted high-frequency components while retaining the dominant motion patterns associated with human activities. The choice of cutoff frequency was guided by spectral analysis of common movements, indicating that the primary frequency content of actions such as walking, lifting, and carrying generally lies below 15 Hz.
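A low-pass stage of this kind might look like the sketch below, written with SciPy. The 20 Hz cutoff comes from the text; the filter order (4) and the zero-phase `filtfilt` application are assumptions, since the paper does not state them. The helper name `lowpass` and the synthetic test signal are illustrative only.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def lowpass(signal, fs, cutoff_hz=20.0, order=4):
    """Zero-phase Butterworth low-pass filter along the time axis (axis 0).

    order=4 and forward-backward filtering are assumptions of this sketch.
    """
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, signal, axis=0)


# Synthetic example at 200 Hz: a 2 Hz "motion" component plus 60 Hz noise.
fs = 200.0
t = np.arange(0, 2.0, 1.0 / fs)
slow = np.sin(2 * np.pi * 2.0 * t)
noisy = slow + 0.5 * np.sin(2 * np.pi * 60.0 * t)
clean = lowpass(noisy, fs)
```

Note that for the 50 Hz recordings the 20 Hz cutoff sits close to the 25 Hz Nyquist limit, so the filter removes only a narrow band there.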
In addition to signal filtering, abnormal values were identified and removed using the interquartile range (IQR) criterion. Specifically, samples exceeding 1.5 times the IQR beyond the first or third quartile were discarded. This procedure prevents extreme observations caused by sensor faults or external interference from degrading the model’s learning performance.
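The IQR rule can be illustrated with a short NumPy sketch. It assumes the rule is applied per sensor channel and that a sample (row) is dropped when any channel falls outside the bounds; the paper does not specify these details, and the name `iqr_mask` is hypothetical.

```python
import numpy as np


def iqr_mask(x, k=1.5):
    """Boolean mask of rows whose values all lie in [Q1 - k*IQR, Q3 + k*IQR].

    x may be 1-D (one channel) or 2-D with shape (samples, channels).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    q1, q3 = np.percentile(x, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (x >= q1 - k * iqr) & (x <= q3 + k * iqr)
    return inside.all(axis=1)


samples = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
keep = iqr_mask(samples)
filtered = samples[keep]  # the value 100.0 is discarded
```

Here Q1 = 2.25 and Q3 = 4.75, so the accepted range is [-1.5, 8.5] and only the last sample is removed.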

3.2.2. Data Normalization

To achieve uniform scaling across heterogeneous sensor channels and to reduce the impact of differing measurement magnitudes, min–max normalization was applied. As a result of this transformation, all sensor values were mapped to the common interval [0, 1]. The normalization operation is defined as:
$$X_i^{\mathrm{norm}} = \frac{X_i - X_i^{\min}}{X_i^{\max} - X_i^{\min}}, \quad i = 1, 2, \ldots, n$$
where $X_i^{\mathrm{norm}}$ denotes the normalized data, $n$ denotes the number of channels, and $X_i^{\max}$ and $X_i^{\min}$ represent the maximum and minimum values of the $i$-th channel, respectively.
This scaling strategy promotes balanced participation of all sensor channels during model optimization. By constraining values to a common range, it avoids bias toward features with larger magnitudes and ensures that no single channel disproportionately influences the model’s training.
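A per-channel min–max scaling step can be sketched as follows. The small `eps` guard for constant channels is an addition of this sketch, not something stated in the paper, and the function name is illustrative.

```python
import numpy as np


def min_max_normalize(x, eps=1e-12):
    """Scale each column (sensor channel) of x to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0, keepdims=True)
    x_max = x.max(axis=0, keepdims=True)
    # Guard against division by zero when a channel is constant (assumption).
    span = np.maximum(x_max - x_min, eps)
    return (x - x_min) / span


data = np.array([[0.0, -10.0],
                 [5.0, 0.0],
                 [10.0, 30.0]])
scaled = min_max_normalize(data)
# scaled == [[0.0, 0.0], [0.5, 0.25], [1.0, 1.0]]
```

Whether the minimum and maximum are estimated on the training subjects only or on the full dataset is not stated in the text; estimating them on the training partition alone is the more conservative choice under subject-level evaluation.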

3.2.3. Data Segmentation

For temporal segmentation, four alternative window lengths (0.5, 1.0, 1.5, and 2.0 s) were selected after assessing several candidate durations. The continuous signals were divided into windows of the chosen length, and each window was labeled with the activity that occupied the majority of its samples. To preserve temporal independence between adjacent samples, a non-overlapping segmentation strategy was adopted, corresponding to an overlap ratio of 0%.
A non-overlapping windowing approach was adopted for two principal reasons. First, it eliminates the risk of temporal information leakage between the training and test partitions during subject-level cross-validation. Since all recordings from a given participant are assigned exclusively to either the training or test fold, this approach ensures that no single activity instance appears in both partitions simultaneously. Second, non-overlapping windows prevent the artificial overestimation of training sample counts that results from high-overlap segmentation, thereby avoiding distortions in performance measurements. Although overlapping windows are widely employed in online activity recognition systems to increase data volume and reduce prediction delay, the non-overlapping strategy used here establishes a conservative and replicable benchmark for offline evaluation. The influence of the overlap ratio on recognition performance—particularly for activities of short duration—remains an open question and is identified as a direction for future work.

3.2.4. Data Splitting for Model Training and Evaluation

The prepared dataset was divided into two separate portions: a training set used to learn model parameters and a test set reserved for evaluating performance. Given the substantial amount of collected data, an 80%/20% ratio was adopted for training and testing, respectively. Notably, the split was conducted at the subject level rather than the individual sample level.
During evaluation, data from a randomly selected subset of participants were held out exclusively for testing, while recordings from the remaining individuals were used to train the deep learning model. This subject-independent validation protocol assesses performance on motion patterns from previously unseen users. Such a strategy is essential for estimating real-world deployment capability, especially in elderly monitoring applications. By separating subjects into training and test sets, the evaluation emphasizes the model’s ability to learn generalized activity representations rather than memorize person-specific movement patterns.
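A subject-level hold-out split can be sketched as follows. The sketch interprets the 80%/20% ratio as a fraction of participants, which is an assumption (the text does not say whether the ratio refers to subjects or samples). The function name and the fixed seed are illustrative.

```python
import numpy as np


def subject_split(subject_ids, test_fraction=0.2, seed=0):
    """Return boolean (train_mask, test_mask) with no subject in both sets."""
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = subjects[:n_test]
    test_mask = np.isin(subject_ids, test_subjects)
    return ~test_mask, test_mask


# Ten hypothetical subjects with five windows each.
subject_ids = np.repeat(np.arange(10), 5)
train_mask, test_mask = subject_split(subject_ids)
```

Because the masks are built from subject identifiers rather than window indices, every window from a held-out participant lands in the test set.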

3.3. The Proposed Hybrid Deep Residual Network

3.3.1. The CNN–CBAM Architecture

The developed framework integrates convolutional spatial feature extractors with CBAM attention modules and BiGRUs augmented by a self-attention mechanism. This combination is designed to capture both spatial representations and temporal dependencies embedded in multi-channel time-series data. Although CNN–RNN hybrid architectures constitute an established approach in HAR, the distinctive contribution of CNN–CBAM–BiGRU stems from three specific architectural decisions that collectively address the unique properties of activity data collected from older adults.
First, CBAM attention is embedded after every convolutional block rather than applied as a single terminal module. This placement allows the network to iteratively refine feature maps at successive levels of abstraction, selectively suppressing irrelevant sensor channels and temporal regions at each processing scale.
Second, BiGRU is employed instead of a unidirectional GRU or LSTM, enabling the model to explicitly capture temporal context in both the forward and backward directions within each activity window. This characteristic is especially valuable for distinguishing activities that exhibit gradual onset and offset dynamics, which are frequently observed in the movement patterns of elderly individuals.
Third, residual skip connections are incorporated throughout both the CNN–CBAM and BiGRU–Attention components. These connections stabilize gradient propagation and support effective optimization of a deeper combined architecture, whereas the difficulty of training deeper stacks is a limitation that confines many hybrid models to shallower designs. The overall network structure is illustrated in Figure 2.
(1) CNN-CBAM Block
The proposed framework begins with a sequence of one-dimensional convolutional layers, which are particularly effective for modeling temporal signals, including data from wearable sensors. By applying sliding convolutional windows, these layers learn local temporal relationships. Early layers focus on fine-scale motion variations, whereas deeper layers extract more abstract, semantically rich representations.
The convolutional stage employs kernels of sizes 3, 5, and 7, with corresponding numbers of filters set to 16, 32, and 64, respectively. This filter configuration was determined through a combination of guidance from the HAR literature and preliminary validation experiments. The successive doubling of filter counts across layers follows a widely recognized design principle: increasing representational capacity with network depth while keeping computational overhead manageable [32]. Smaller kernels in the earlier layers are responsible for detecting fine-grained temporal transients, such as impact peaks occurring during stair descent. In contrast, larger kernels in the deeper layers integrate extended temporal patterns, such as the rhythmic periodicity characteristic of walking. Preliminary assessment on the validation fold indicated that this configuration offers the most favorable trade-off between model expressiveness and generalization ability across all three datasets.
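To make the cost of this configuration concrete, the sketch below counts the trainable parameters of three stacked one-dimensional convolutions. It assumes six input channels (two tri-axial accelerometers, as in HARTH and HAR70+), a bias term per filter, and sequential stacking; attention modules, normalization layers, and residual projections are not counted.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a single 1-D convolutional layer."""
    return kernel_size * in_channels * out_channels + out_channels


def conv_stack_params(in_channels, kernels=(3, 5, 7), filters=(16, 32, 64)):
    """Per-layer parameter counts for sequentially stacked Conv1D layers."""
    counts = []
    c_in = in_channels
    for k, c_out in zip(kernels, filters):
        counts.append(conv1d_params(c_in, c_out, k))
        c_in = c_out
    return counts


counts = conv_stack_params(in_channels=6)
total = sum(counts)
# counts == [304, 2592, 14400], total == 17296
```

Under these assumptions the convolutional stage stays below 20,000 parameters, with most of them in the last layer.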
To further improve the discriminative capability of the learned features, each convolutional layer is followed by a CBAM, as illustrated in Figure 3. This attention mechanism operates sequentially across channel-wise and temporal dimensions, allowing the network to emphasize informative components while reducing the influence of irrelevant or noisy features.
The channel attention sub-module determines the relative importance of individual feature channels by applying global average pooling and global max pooling over the temporal dimension. The resulting descriptors are processed by a shared multilayer perceptron (MLP), which first compresses the feature dimensionality using a ReLU-activated dense layer and then restores it through a sigmoid-activated dense layer. The generated attention weights are subsequently applied to the input feature map via element-wise scaling.
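A forward pass of the channel attention step can be sketched in NumPy as follows. The sketch follows the original CBAM formulation, in which the shared MLP outputs for the average-pooled and max-pooled descriptors are summed before the sigmoid; the reduction ratio `r`, the omission of biases, and the random weights are assumptions made only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Channel attention for a (T, C) feature map.

    w1: (C, C // r) compression weights, w2: (C // r, C) restoration weights.
    Returns the rescaled feature map and the per-channel weights.
    """
    avg_desc = x.mean(axis=0)  # global average pooling over time -> (C,)
    max_desc = x.max(axis=0)   # global max pooling over time -> (C,)

    def shared_mlp(d):
        return np.maximum(d @ w1, 0.0) @ w2  # ReLU bottleneck, then restore

    weights = sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))
    return x * weights, weights


rng = np.random.default_rng(0)
T, C, r = 100, 16, 4
x = rng.standard_normal((T, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
refined, weights = channel_attention(x, w1, w2)
```

Each channel is multiplied by a single weight in (0, 1), so the step rescales whole channels without changing the temporal shape of any of them.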
After channel refinement, the spatial attention component aims to locate the most informative temporal regions. It applies both average pooling and max pooling along the channel axis, concatenates the resulting maps, and processes them through a convolutional layer with a 3 × 1 kernel followed by a sigmoid activation. The resulting spatial attention map highlights and strengthens the significant temporal segments within the feature representation.
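The spatial (here, temporal) attention step can be sketched in the same style. The kernel length of 3 follows the text; zero padding to keep the sequence length, the bias default, and the random weights are assumptions of this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def temporal_attention(x, kernel, bias=0.0):
    """Spatial attention along time for a (T, C) feature map.

    kernel: (k, 2) convolution weights over the stacked [avg, max] maps.
    Returns the rescaled feature map and the per-time-step weights.
    """
    # Pool along the channel axis and stack the two maps -> (T, 2).
    pooled = np.stack([x.mean(axis=1), x.max(axis=1)], axis=1)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (0, 0)))  # zero padding keeps length T
    scores = np.array([np.sum(padded[t:t + k] * kernel)
                       for t in range(len(x))]) + bias
    attn = sigmoid(scores)
    return x * attn[:, None], attn


rng = np.random.default_rng(1)
x = rng.standard_normal((100, 16))
kernel = rng.standard_normal((3, 2)) * 0.5
refined, attn = temporal_attention(x, kernel)
```

Each time step receives one weight shared across all channels, which complements the per-channel weights produced by the channel attention step.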
(2) BiGRU–Att Block
Figure 4a presents the internal structure of a GRU cell, which was designed to overcome several shortcomings of conventional RNNs. Unlike standard RNNs, the GRU introduces gating mechanisms that explicitly control how information is retained, updated, and propagated across time steps. The architecture of a GRU cell is defined by two core gates, namely the update gate and the reset gate, each of which plays a distinct role in managing temporal information flow within the network.
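For reference, the standard GRU formulation uses an update gate and a reset gate, and a bidirectional layer concatenates a forward pass with a pass over the reversed sequence. The NumPy sketch below illustrates this under stated assumptions: random weights, zero initial state, and the convention h_t = (1 - z) * h_{t-1} + z * h_candidate (some references swap the roles of z and 1 - z). All function and parameter names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d_in, d_h, rng):
    """Random weights for the update (z), reset (r), and candidate (h) paths."""
    p = {}
    for g in "zrh":
        p["W" + g] = rng.standard_normal((d_in, d_h)) * 0.1
        p["U" + g] = rng.standard_normal((d_h, d_h)) * 0.1
        p["b" + g] = np.zeros(d_h)
    return p


def gru_step(x_t, h_prev, p):
    z = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])  # update gate
    r = sigmoid(x_t @ p["Wr"] + h_prev @ p["Ur"] + p["br"])  # reset gate
    h_cand = np.tanh(x_t @ p["Wh"] + (r * h_prev) @ p["Uh"] + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(x, p, d_h):
    h = np.zeros(d_h)
    states = []
    for x_t in x:
        h = gru_step(x_t, h, p)
        states.append(h)
    return np.stack(states)


def bigru(x, p_fwd, p_bwd, d_h):
    """Concatenate forward states with time-aligned backward states."""
    fwd = run_gru(x, p_fwd, d_h)
    bwd = run_gru(x[::-1], p_bwd, d_h)[::-1]
    return np.concatenate([fwd, bwd], axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 8))  # 20 time steps, 8 features
states = bigru(x, init_params(8, 5, rng), init_params(8, 5, rng), d_h=5)
```

The output has twice the hidden size per time step, because each step carries one summary of the past and one of the future within the window.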
Following the stacked BiGRU layers, a self-attention module is incorporated to strengthen temporal modeling beyond short-range or local dependencies. Although BiGRUs are effective at learning sequential representations in both forward and backward temporal directions, they often place greater emphasis on more recent observations. This tendency can hinder their ability to represent long-term dependencies, particularly in scenarios involving complex, repetitive, or cyclic activity patterns. The self-attention mechanism alleviates this limitation by adaptively assigning importance weights to each time step based on its contextual significance across the entire sequence. As a result, the model can selectively emphasize temporally distant yet informative segments, regardless of their absolute position in the sequence, thereby enabling more comprehensive and balanced temporal reasoning.
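The weighting of time steps can be illustrated with a minimal scaled dot-product self-attention over the BiGRU outputs. The exact attention variant is not specified in this section, so the sketch assumes no learned query, key, or value projections; the shapes and the name `self_attention` are illustrative.

```python
import numpy as np

def self_attention(h):
    """h: (T, d) sequence of BiGRU outputs; returns the re-weighted sequence and the weights."""
    scores = h @ h.T / np.sqrt(h.shape[1])          # pairwise similarity between time steps
    scores -= scores.max(axis=1, keepdims=True)     # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row is a distribution over time steps
    return weights @ h, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(50, 128))                      # 50 time steps, 128 features (illustrative)
context, weights = self_attention(h)
print(context.shape, weights.shape)  # (50, 128) (50, 50)
```

Each output step is a weighted mixture of all 50 input steps, so a distant but similar segment can contribute as strongly as a neighbouring one.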

3.3.2. Model Training

(1). Optimizer
The Adam optimization algorithm is employed as the parameter update strategy in this study. Adam performs iterative optimization over stochastic objective functions and is grounded in the adaptive estimation of low-order statistical moments. Owing to its design, the method achieves high computational efficiency while requiring only modest memory resources, making it well-suited for large-scale learning tasks.
A key characteristic of Adam is its ability to adjust the learning rate for each model parameter automatically. This adjustment is obtained from running estimates of the first and second moments of the gradients, enabling more stable and reliable convergence. Furthermore, the magnitude of Adam's parameter updates is invariant to diagonal rescaling of the gradients, thereby enhancing robustness when handling high-dimensional parameters or extensive datasets.
The optimizer offers several practical benefits, including straightforward implementation, fast convergence, reduced memory consumption, scale-invariant gradient updates, and minimal need for hyperparameter tuning. The specific Adam optimizer configuration used in this work is summarized in Table 2.
In this context, $L_r$ denotes the learning rate, which determines the update step size during each gradient descent iteration. The decay parameter represents the weight decay coefficient and serves as a regularization mechanism by introducing an additional penalty term into the loss function, thereby mitigating overfitting.
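To make the update rule concrete, the sketch below applies Adam to a single scalar parameter using the values reported in this paper (learning rate 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1 × 10⁻⁸). The toy objective f(w) = (w − 3)² and the function name `adam_step` are assumptions for illustration, and weight decay is omitted for brevity.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # running first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # running second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(w) = (w - 3)^2, starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
trajectory = []
for t in range(1, 101):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
    trajectory.append(w)
```

In the first step the update has magnitude close to the learning rate regardless of the gradient scale, which illustrates the per-parameter normalisation described above.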
(2). Loss function and performance metrics
The training objective employs a cross-entropy loss augmented with L2 regularization to discourage excessively large weights and enhance generalization. The cross-entropy component quantifies the discrepancy between the model’s predicted class probabilities and the ground-truth labels. Meanwhile, the L2 term limits parameter magnitudes by penalizing large values. The following equation expresses the cross-entropy formulation.
$$L = -\sum_{i=1}^{N} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$
where $y^{(i)}$ denotes the ground-truth label of the i-th sample, and $\hat{y}^{(i)}$ represents the corresponding predicted probability. To mitigate overfitting, a regularization attenuation term is further introduced into the loss function. The resulting formulation is given as follows.
$$C = C_0 + \frac{\lambda}{2n} \sum_{\omega} \omega^{2}$$
where $C_0$ represents the primary loss term, specifically the cross-entropy objective. The second term corresponds to the regularization component, where $\lambda$ is the regularization factor, $n$ denotes the number of training samples, and $\omega$ indicates the model parameters. Incorporating weight decay through L2 regularization helps reduce overfitting by discouraging excessively large parameter values.
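The regularised objective can be sketched in plain Python as follows. The example follows the two-class form of the cross-entropy given above and uses λ = 0.01 as reported in this paper; the function names and the toy inputs are illustrative assumptions.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Summed binary cross-entropy; eps guards against log(0)."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    )

def regularised_loss(y_true, y_prob, weights, lam=0.01):
    """Cross-entropy C0 plus the L2 penalty lam / (2n) * sum(w^2)."""
    n = len(y_true)
    penalty = lam / (2 * n) * sum(w ** 2 for w in weights)
    return cross_entropy(y_true, y_prob) + penalty

# Two samples, two model weights (toy values).
loss = regularised_loss([1, 0], [0.9, 0.2], weights=[1.0, -2.0])
print(round(loss, 4))  # 0.341
```

Here the cross-entropy term contributes about 0.3285 and the penalty 0.0125, so the regulariser only dominates when weights grow large relative to the data term.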

3.4. Inference Procedure and Evaluation Protocol

To support reproducibility, this section provides a complete description of the inference pipeline used during model evaluation. The pipeline comprises four components: time-based windowing, label prediction, output post-processing, and the protocol for computing performance metrics.

3.4.1. Time-Based Windowing at Inference

During inference, the continuous sensor stream from each test subject was divided into segments using the same sliding window configuration applied in the training phase. The window duration was selected from {0.5, 1.0, 1.5, 2.0} s based on validation performance, and consecutive windows did not overlap (0% overlap). Each segment was passed independently through the trained CNN–CBAM–BiGRU model, with no contextual information from neighbouring windows incorporated into the prediction process. The windowing parameters and feature scaling bounds (specifically, the minimum and maximum values derived from the training set) were saved during training and applied consistently to the test data to prevent information leakage.
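A minimal sketch of this step is shown below for a single-channel signal. It assumes that an incomplete trailing window is discarded, which the text does not state, and the helper names and toy data are illustrative.

```python
def segment(signal, sampling_rate_hz, window_s):
    """Split a list of samples into non-overlapping windows; drop the incomplete tail."""
    size = int(round(sampling_rate_hz * window_s))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def fit_min_max(train_values):
    """Scaling bounds are derived from the training data only."""
    return min(train_values), max(train_values)

def apply_min_max(values, bounds):
    lo, hi = bounds
    return [(v - lo) / (hi - lo) for v in values]

train = [0.0, 2.0, 4.0, 10.0]                       # toy training samples
bounds = fit_min_max(train)                         # saved at training time
test_stream = [float(i % 12) for i in range(250)]   # 5 s of toy data at 50 Hz
windows = segment(apply_min_max(test_stream, bounds), sampling_rate_hz=50, window_s=2.0)
print(len(windows), len(windows[0]))  # 2 100
```

Because the bounds come from the training set, scaled test values may fall slightly outside [0, 1], as happens here for the test sample 11.0; this is the expected consequence of avoiding leakage.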

3.4.2. Label Prediction

The output layer of the CNN–CBAM–BiGRU model comprises two fully connected layers succeeded by a softmax activation. For each input segment x , the model generates a probability distribution p = softmax ( z ) across C target classes. The number of classes is dataset-dependent: C = 10 for HARTH, C = 6 for HAR70+, and C = 2 for SisFall, where the two classes correspond to falls and activities of daily living. The activity label assigned to each segment is determined by selecting the class with the highest predicted probability:
$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} p_c$$
where $p_c$ represents the predicted probability assigned to class c.
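The prediction rule amounts to a softmax followed by an argmax, as in the following sketch. The logits are made-up values for a six-class case, and the code uses 0-based class indices whereas the text indexes classes from 1.

```python
import math

def softmax(logits):
    shift = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)   # 0-based argmax
    return label, probs

label, probs = predict([1.2, 0.3, 3.1, -0.5, 0.0, 2.0])    # six illustrative logits
print(label)  # 2
```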

3.4.3. Evaluation Protocol

Model performance was measured using a five-fold cross-validation strategy, with data partitioning at the subject level rather than the window level. Each participant was randomly allocated to one of five groups of roughly equal size. In every fold, one group was reserved as the held-out test set while the remaining four groups served as the training data. This subject-independent design guarantees that motion patterns from any test participant are absent from the training set, thereby providing a realistic estimate of how well the model generalises to previously unseen individuals.
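The subject-level partitioning can be sketched as follows. The subject identifiers, the seed, and the helper names are hypothetical; the point is only that every window of a given participant ends up on one side of the split.

```python
import random

def subject_folds(subject_ids, n_folds=5, seed=0):
    """Randomly assign each subject to exactly one fold of roughly equal size."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def train_test_subjects(folds, test_index):
    """Use one fold as the held-out test set and the remaining folds for training."""
    test = set(folds[test_index])
    train = {s for i, fold in enumerate(folds) if i != test_index for s in fold}
    return train, test

folds = subject_folds(range(1, 23))        # 22 hypothetical subject IDs
for k in range(5):
    train, test = train_test_subjects(folds, k)
    assert not train & test                # no subject appears on both sides
```

Windows are then assigned to training or test data according to the subject they came from, never individually.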
Within each fold, the training portion was further split into an 80% training subset and a 20% validation subset, also at the subject level. This inner split was used for early stopping and for selecting model checkpoints. The model parameters that yielded the highest validation accuracy were retained for testing. Predictions from each held-out fold were then aggregated across all five folds to derive the final recognition metrics: accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1-score. The mean and standard deviation of each metric over the five folds are reported in Table 3, Table 4 and Table 5, while per-class metrics are presented in Table 7.
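For clarity, macro-averaging computes each metric per class and then takes the unweighted mean over classes. The sketch below illustrates this on a toy imbalanced example; the labels and the function name are illustrative, and undefined ratios are set to zero by assumption.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (unweighted class means)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy imbalanced case: 8 ADL windows, 2 fall windows, one fall missed.
y_true = ["adl"] * 8 + ["fall"] * 2
y_pred = ["adl"] * 8 + ["fall", "adl"]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
print(acc, rec)  # 0.9 0.75
```

The example shows why macro-averaged recall is informative under class imbalance: accuracy stays at 0.9 even though half of the falls are missed.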

4. Experiments and Results

This section provides a thorough assessment of the proposed CNN–CBAM–BiGRU framework for recognizing activities performed by older adults. The evaluation protocol is first outlined by detailing the experimental configuration, which covers the computing hardware, software environment, and model training strategy. Subsequently, extensive performance results are reported and analyzed using three widely adopted benchmark datasets, namely HARTH, HAR70+, and SisFall.

4.1. Experimental Setting

4.1.1. Hardware and Software Infrastructure

Model training was conducted on Google Colab Pro+, utilizing a Tesla V100-SXM2-16 GB GPU (Hewlett Packard Enterprise, Los Angeles, CA, USA) to accelerate deep learning operations. The implementation was developed in Python 3.6.9, with TensorFlow 2.2.0 serving as the primary framework for neural network construction. To improve computational performance and support GPU-based parallel processing, CUDA 10.2 was employed as the underlying acceleration platform.

4.1.2. Software Libraries

A variety of Python libraries were used to support the proposed approach, each contributing a specific function to the data processing workflow and model development pipeline.
NumPy 1.18.5 and Pandas 1.0.5 were applied to facilitate efficient data loading, preprocessing, and in-depth analysis of sensor measurements. These libraries enabled effective management of high-dimensional time-series signals acquired from wearable devices across all three datasets.
For result interpretation and exploratory analysis, Matplotlib 3.2.2 and Seaborn 0.10.1 were used to generate informative visualizations. These tools were used to illustrate analytical outcomes and evaluation results, including confusion matrices and comparative performance plots under different experimental settings.
Core machine learning utilities were provided by Scikit-learn, which supported dataset partitioning, cross-validation, and the computation of standard evaluation metrics such as accuracy, precision, recall, and F1-score.
Finally, TensorFlow functioned as the primary deep learning framework for implementing and training the proposed CNN–CBAM–BiGRU architecture. It was also used to develop several baseline models for benchmarking, including CNNs, LSTMs, bidirectional LSTMs (BiLSTMs), GRUs, and BiGRUs.

4.1.3. Training Process

The training strategy was designed to ensure reliable performance and strong generalization across all three datasets. To this end, a 5-fold cross-validation scheme was adopted, allowing consistent model evaluation while reducing the dependence of the reported results on any single train–test split. To demonstrate model stability, the mean and standard deviation of the performance metrics across all folds are reported.
Before model training, the complete preprocessing pipeline was applied. This pipeline included Butterworth filtering to suppress noise and remove signal artifacts, followed by min–max normalization to standardize feature scales across different sensor modalities. Temporal segmentation was then performed using sliding windows of varying lengths (0.5, 1, 1.5, and 2 s) to capture temporal dependencies in elderly activity data effectively.
For optimization, the Adam algorithm was employed with the hyperparameter configuration presented in Table 2 ($L_r = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$, decay = 0.01). The loss function combined cross-entropy loss with L2 regularization ($\lambda = 0.01$), enabling accurate classification while constraining model complexity to mitigate overfitting.
The hyperparameter settings for the Adam optimiser were determined as follows. The default values for $\beta_1$, $\beta_2$, and $\epsilon$ were adopted in accordance with the recommendations provided in the original Adam publication, as these values are well known to perform reliably across a broad range of deep learning applications. The learning rate and weight decay were chosen through a coarse grid search. Specifically, learning rate candidates from the set {0.0001, 0.001, 0.01} and weight decay candidates from {0.001, 0.01, 0.1} were evaluated. The combination of $L_r = 0.001$ and decay = 0.01 consistently produced the lowest validation loss and the highest validation accuracy. This configuration was subsequently applied to all three datasets without any dataset-specific adjustment.
Training for each fold was conducted for up to 200 epochs with a batch size of 64. An early-stopping mechanism based on validation performance was used to terminate training once convergence was reached, with a patience of 15 epochs. This strategy ensured efficient utilization of computational resources and prevented unnecessary overfitting. In addition, model checkpointing was used to preserve the network parameters corresponding to the highest validation accuracy.
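The stopping rule can be sketched independently of any framework. The example assumes that validation accuracy is the monitored quantity and that training stops once 15 consecutive epochs bring no improvement; the function name and the toy history are illustrative.

```python
def best_epoch_with_early_stopping(val_accuracies, patience=15, max_epochs=200):
    """Return (best_epoch, stop_epoch), both 1-based, for a validation-accuracy history."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch    # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # no improvement for `patience` epochs
    return best_epoch, min(len(val_accuracies), max_epochs)

# Accuracy rises for 10 epochs, then plateaus below its best value.
history = [0.80 + 0.01 * e for e in range(10)] + [0.85] * 40
print(best_epoch_with_early_stopping(history))  # (10, 25)
```

In this toy run the checkpoint from epoch 10 is kept and training halts at epoch 25, well before the 200-epoch limit.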

4.1.4. Baseline Deep Learning Models for Comparison

To rigorously assess the performance of the proposed CNN–CBAM–BiGRU framework, its results were benchmarked against five representative deep learning architectures commonly used in sequential and time-series analysis.
  • CNN: A conventional CNN composed of three one-dimensional convolution layers with 16, 32, and 64 filters and kernel widths of 3, 5, and 7, respectively. These layers are followed by global average pooling and fully connected layers for final classification.
  • LSTM: A recurrent network consisting of two stacked LSTM layers with 128 and 64 hidden units. The learned temporal features are subsequently passed to dense layers to perform classification.
  • BiLSTM: A bidirectional LSTM model featuring two stacked BiLSTM layers with 128 and 64 units per direction, respectively. This configuration enables the extraction of temporal patterns from both past and future contexts.
  • GRU: A recurrent architecture built with two stacked GRU layers containing 128 and 64 units, respectively. This model provides a computationally lighter alternative to LSTM while preserving effective sequence modeling capabilities.
  • BiGRU: A bidirectional GRU network with two stacked BiGRU layers, employing 128 and 64 units in each direction to improve temporal representation through forward and backward sequence learning.
To ensure a fair and unbiased comparison, all baseline models were trained and evaluated under identical conditions, including identical preprocessing procedures, hyperparameter configurations, and evaluation protocols.

4.2. Experimental Results

4.2.1. Performance on the HARTH Dataset

Table 3 summarizes the activity recognition results obtained by different deep learning models when evaluated on the HARTH dataset. The findings indicate that the proposed CNN-CBAM-BiGRU architecture consistently outperforms all reference models across every evaluation criterion.
The proposed model achieved an overall accuracy of 96.82% (±0.12%), along with a precision of 94.07% (±0.32%), a recall of 93.72% (±0.32%), and an F1-score of 93.87% (±0.24%). These values reflect notable gains over the baseline methods. In particular, when compared with the conventional CNN, the proposed framework improved accuracy by 1.31%, precision by 3.47%, recall by 4.58%, and F1-score by 4.15%.
Within the group of recurrent baseline architectures, the BiLSTM model delivered the strongest performance, attaining 96.27% accuracy, 92.96% precision, 92.51% recall, and an F1-score of 92.72%. Despite this, the CNN-CBAM-BiGRU model still achieved higher results, exceeding BiLSTM by 0.55% in accuracy, 1.11% in precision, 1.21% in recall, and 1.15% in F1-score. These performance gains highlight the benefit of incorporating the CBAM attention mechanism within a hybrid CNN–BiGRU design.
Moreover, the small standard deviation values observed across all metrics, ranging from 0.09% to 0.40%, confirm the robustness and consistency of the proposed model across cross-validation folds. This stability suggests strong generalization capability when applied to unseen subjects in the HARTH dataset.

4.2.2. Performance on the HAR70+ Dataset

Table 4 reports the activity recognition results obtained on the HAR70+ dataset, which targets individuals aged 70 years and older. Overall, all evaluated models achieved higher accuracy on HAR70+ than on HARTH. This improvement is likely attributable to the semi-structured acquisition protocol and the more homogeneous elderly cohort.
Among all approaches, the proposed CNN–CBAM–BiGRU architecture delivered the strongest performance. It achieved an accuracy of 97.58% (±0.27%), a precision of 97.96% (±0.24%), a recall of 97.80% (±0.25%), and an F1-score of 97.88% (±0.23%). These findings confirm the proposed framework’s ability to accurately classify activities within elderly populations, a central objective of this study.
When compared with the reference models, the proposed method consistently achieved superior results. Among the baselines, the GRU network ranked second with an accuracy of 97.27%. Nevertheless, the CNN–CBAM–BiGRU model surpassed it by 0.31% in accuracy, 0.43% in precision, 0.13% in recall, and 0.28% in F1-score. The most pronounced gain was in precision, indicating a reduction in false-positive predictions during elderly activity classification.
The performance gap was most evident when contrasting the proposed model with the conventional CNN. The differences reached 0.58% in accuracy, 0.47% in precision, 0.50% in recall, and 0.49% in F1-score. These margins emphasize the contribution of bidirectional temporal modeling through BiGRU layers and the feature recalibration enabled by the CBAM attention mechanism in enhancing elderly activity recognition.

4.2.3. Performance on the SisFall Dataset

Table 5 reports the experimental results obtained on the SisFall dataset, which includes both ADLs and simulated fall events. Unlike the previous datasets, SisFall introduces additional complexity because the task requires accurate discrimination between routine movements and fall incidents. This distinction is essential for elderly monitoring systems, where reliable fall detection directly impacts user safety.
The proposed CNN–CBAM–BiGRU architecture delivered outstanding performance on this dataset. It achieved an overall accuracy of 98.92% (±0.23%), along with a precision of 96.34% (±1.62%), a recall of 93.74% (±2.21%), and an F1-score of 94.93% (±1.12%). These findings confirm the effectiveness of the model in both fall recognition and ADL classification, which are fundamental components of practical elderly care applications.
Performance gains over the baseline methods were particularly pronounced on SisFall. Among the reference models, BiGRU achieved the strongest baseline results with an accuracy of 98.04%. Nevertheless, the proposed framework surpassed it by 0.88% in accuracy, 2.25% in precision, 6.54% in recall, and 4.71% in F1-score. The notable increase in recall highlights the model’s improved ability to correctly detect fall events, a critical requirement in safety-oriented monitoring systems.
Although the standard CNN attained an accuracy of 96.28%, its precision (86.26%) and recall (80.58%) were substantially lower, yielding an F1-score of only 82.41%. In contrast, the CNN–CBAM–BiGRU architecture improved the F1-score by 12.52 percentage points, underscoring the advantages of integrating convolutional feature extraction, attention-based feature refinement, and bidirectional temporal sequence modeling within a unified framework.
It is worth noting that the SisFall dataset exhibited relatively higher standard deviation values, especially for precision and recall, compared with the other datasets. This variability can be explained by class imbalance between fall events and ADLs, as well as the heterogeneity of fall patterns represented in the dataset. Despite these challenges, the proposed model maintained stable and reliable performance across cross-validation folds, demonstrating strong robustness.

4.2.4. Statistical Significance Analysis

To assess whether the performance gains of CNN–CBAM–BiGRU over the strongest baseline models reflect genuine improvements rather than random fluctuations across folds, a paired t-test was applied to the per-fold accuracy scores obtained from five-fold cross-validation. The paired design is justified because both the proposed architecture and each baseline are assessed on identical subject partitions, making the fold-level outcomes naturally dependent. For each comparison, the null hypothesis states that the mean per-fold accuracy of CNN–CBAM–BiGRU does not differ from that of the top-performing baseline on the same dataset. A significance threshold of α = 0.05 was adopted, corresponding to a critical value of t(4) = 2.776 for a two-tailed test with df = 4. The full results are presented in Table 6.
For the HARTH dataset, the proposed model achieves superior performance compared to BiLSTM (the top-performing baseline), with an accuracy of 96.82 ± 0.12% versus 96.27 ± 0.11%, representing a margin of 0.55 percentage points. Although this absolute difference is small, statistical testing yields t(4) = 7.71 and p = 0.0015, confirming a highly significant result at the p < 0.01 level. Both models exhibit very low standard deviations across the five folds. This consistency explains the elevated t-statistic despite the narrow performance gap.
For the SisFall dataset, CNN–CBAM–BiGRU exceeds BiGRU by 0.88 percentage points, recording 98.92 ± 0.23% compared to 98.04 ± 0.36%, with t(4) = 4.65 and p = 0.0097. This outcome is likewise significant at p < 0.01. The improvement carries particular weight in fall detection applications, where even marginal gains in accuracy and recall can directly influence the safety of systems designed to monitor elderly individuals.
For the HAR70+ dataset, the proposed model surpasses GRU by 0.31 percentage points, achieving 97.58 ± 0.27% versus 97.27 ± 0.20%. Under the conservative assumption that the two models' fold-level scores are uncorrelated (r = 0), the test yields t(4) = 2.06 and p = 0.108, which falls short of the standard α = 0.05 threshold. Nevertheless, because both models are evaluated on the same data partitions, a positive correlation between their fold-level outcomes is expected. When this correlation is incorporated at r ≈ 0.5 (a conservative lower bound for paired evaluations using shared folds), the variance of the paired difference decreases, producing t = 2.87 and p = 0.046, which reaches significance at α = 0.05. The marginal outcome on HAR70+ aligns with the nature of this dataset: its relatively uniform older adult cohort and semi-structured data collection protocol lead to consistently high performance across all compared models, which inherently limits the degree of separation any individual approach can achieve.
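The effect of the assumed correlation on the test statistic can be checked from the reported summary statistics alone. The sketch below uses the variance of a paired difference, σ_a² + σ_b² − 2rσ_aσ_b, divided by the number of folds; the function name is illustrative, and because the inputs are rounded means and standard deviations, the results match the reported values only approximately.

```python
import math

def paired_t_from_summary(mean_a, sd_a, mean_b, sd_b, n, r):
    """t statistic for the mean difference of n paired scores with correlation r."""
    var_diff = sd_a ** 2 + sd_b ** 2 - 2 * r * sd_a * sd_b   # variance of the paired difference
    return (mean_a - mean_b) / math.sqrt(var_diff / n)

T_CRIT = 2.776   # two-tailed critical value for alpha = 0.05 with df = 4

# HAR70+: CNN-CBAM-BiGRU (97.58 +/- 0.27) versus GRU (97.27 +/- 0.20), five folds.
t_uncorrelated = paired_t_from_summary(97.58, 0.27, 97.27, 0.20, n=5, r=0.0)
t_correlated = paired_t_from_summary(97.58, 0.27, 97.27, 0.20, n=5, r=0.5)
print(t_uncorrelated > T_CRIT, t_correlated > T_CRIT)  # False True
```

With r = 0 the statistic stays below the critical value, whereas with r = 0.5 it exceeds it, which is the pattern described above. When the per-fold scores themselves are available, computing the paired test directly on the fold-wise differences avoids the need to assume r.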
Collectively, the statistical findings confirm that CNN–CBAM–BiGRU delivers significant performance improvements on two of the three datasets under conservative conditions, and on all three when inter-fold correlation is taken into account. These gains are not restricted to particular activity categories.

5. Discussion

5.1. Model Performance Analysis

5.1.1. Confusion Matrices

To obtain a more detailed understanding of the classification behavior of the proposed CNN–CBAM–BiGRU architecture, confusion matrices were examined for all three benchmark datasets. Figure 5 illustrates the matrices corresponding to HARTH, HAR70+, and SisFall, highlighting class-wise prediction distributions and identifying potential sources of misclassification.
For the HARTH dataset, the confusion matrix shows pronounced diagonal elements, reflecting strong recognition performance across activity categories. Static postures such as lying (>98%), standing (>96%), and sitting (>95%) were classified with high reliability. Nevertheless, certain ambiguities appeared among activities with comparable motion dynamics. In particular, walking was occasionally misidentified as running, and stair ascending showed overlap with stair descending. Such confusion is reasonable due to their similar biomechanical patterns. A small degree of overlap was also observed between inactive sitting and regular sitting, likely caused by subtle differences in posture and movement intensity. Overall, the matrix indicates that integrating attention mechanisms with temporal sequence modeling enables clear separation of most activities, while errors mainly occur among actions with similar kinematic profiles.
The confusion matrix derived from the HAR70+ dataset shows even stronger diagonal dominance than that of HARTH, which corresponds to the higher overall accuracy achieved. Recognition of static behaviors—including lying (>99%), standing (>98%), and sitting (>98%)—was particularly robust, which is essential for monitoring daily routines in elderly individuals. Dynamic activities such as walking and shuffling were also classified with high accuracy (>97%). Minor confusion between these two classes was observed, likely due to the tendency for elderly walking patterns to resemble shuffling movements. Stair-related actions, both ascending and descending, were distinguished with minimal cross-class interference. These findings suggest that the proposed framework is especially effective for activity recognition within older populations, confirming its suitability for the intended demographic.
In the SisFall dataset, the confusion matrix demonstrates the model’s strong discriminative ability between fall events and ADLs, a critical requirement for safety-monitoring systems. The classification structure (fall vs. ADL vs. alert) exhibits clear separation, with fall incidents correctly detected in the majority of instances (>93%). Compared with the inter-class ambiguities observed in the other datasets, confusion between falls and ADLs was relatively limited. This indicates that the joint spatial feature extraction and bidirectional temporal modeling effectively capture distinctive fall characteristics. Some overlap occurred between alert states (e.g., pre-fall or near-fall conditions) and both fall and ADL classes, as expected given their transitional nature. Notably, the high recall for fall detection (93.74%) underscores the model’s ability to identify most true fall events, thereby reducing critical false negatives that could jeopardize elderly safety.

5.1.2. Model Convergence Analysis

Figure 6 illustrates the training and validation accuracies, as well as the loss trajectories, of the CNN–CBAM–BiGRU model across all three benchmark datasets. These curves provide a detailed view of the learning progression, optimization stability, and convergence characteristics of the proposed framework.
For the HARTH dataset, the accuracy plots indicate a sharp improvement during the early training phase, particularly within the first 20–30 epochs, where both training and validation accuracy increased to nearly 90%. Performance then continued to rise gradually, reaching convergence between epochs 80 and 100. The final training accuracy was approximately 97.5%, whereas validation accuracy stabilized around 96.8%. The small discrepancy between these values suggests only minor overfitting. The corresponding loss curves display a similar trend, with a rapid decrease during the initial epochs followed by stabilization at low magnitudes (training loss ≈ 0.20; validation loss ≈ 0.22) near epoch 100. The consistent proximity between the training and validation curves throughout the optimization process indicates that L2 regularization and dropout effectively controlled model complexity while preserving generalization.
In the case of the HAR70+ dataset, convergence occurred even more rapidly than with HARTH. Accuracy exceeded 95% within approximately 15–20 epochs. This accelerated learning can be associated with the more homogeneous elderly cohort (aged 70+) and the semi-structured data-acquisition procedure, which yielded relatively consistent motion patterns. The model attained a final training accuracy of about 98.2% and a validation accuracy of 97.6% around epochs 60–70, demonstrating high training efficiency. The loss curves reveal a steep initial decline, followed by smooth stabilization, with negligible oscillations in the later stages. The minimal gap between training and validation metrics (accuracy difference < 0.6%; loss difference < 0.05) indicates strong generalization, suggesting that the model captures underlying patterns rather than memorizing the training data.
The learning dynamics observed in the SisFall dataset differ from those in the previous two datasets. During the initial phase, accuracy increased extremely quickly, surpassing 96% within 10–15 epochs. This behavior is likely due to the pronounced distinction between fall events and regular ADLs. Nevertheless, achieving full convergence required a longer training duration (approximately 120–150 epochs), highlighting the complexity involved in differentiating diverse fall types and transitional alert states. The final training accuracy reached roughly 99.2%, while the validation accuracy stabilized at 98.9%, resulting in the smallest performance gap among the three datasets. After epoch 100, the loss curves showed steady convergence with limited fluctuation. Although the validation curves exhibited slightly greater variability—particularly in loss—this effect can be attributed to class imbalance and the heterogeneity of fall scenarios. Despite these challenges, the persistent downward trend and eventual stabilization confirm stable optimization and effective regularization.

5.2. Per-Class Performance Analysis

To further examine the discriminative capability of the proposed framework across individual activity categories, a detailed class-wise evaluation was performed. Table 7 reports the precision, recall, and F1-score for each activity class within the three benchmark datasets. This analysis highlights the model’s strengths and identifies specific classes that remain comparatively challenging.
On the HARTH dataset, the CNN–CBAM–BiGRU model achieved particularly strong results for static and posture-oriented activities. The standing class obtained the highest performance, reaching 99.6% precision, 99.7% recall, and 99.6% F1-score. Similarly, stairs down recorded approximately 99.4% across all evaluation metrics, while stairs up achieved 98.6% precision, 97.4% recall, and 98.0% F1-score. These outcomes suggest that the model effectively learns distinctive sensor signatures associated with stationary postures and structured locomotion patterns such as stair traversal. The walking category also demonstrated robust recognition, with 95.5% precision and 97.3% recall, confirming reliable detection of fundamental ambulatory movement. In contrast, certain dynamic activities posed greater difficulty. Running reached 92.6% precision and 89.2% recall, likely due to overlap with brisk walking patterns and inter-subject variability in running style. The sitting class had the lowest precision (81.6%) and recall (87.5%), reflecting confusion between inactive sitting and subtle differences among sitting postures. The lying class achieved 90.2% precision and 89.9% recall. Both cycling activities (seated and standing) exhibited balanced results within the 92–93% range across all metrics.
Results from the HAR70+ dataset demonstrate consistently high per-class performance, indicating that the model is well-optimized for elderly-specific motion recognition. The walking class achieved near-perfect values, with 99.9% precision and 99.6% recall, showing that characteristic movement patterns of older adults were successfully captured despite reduced gait speed and intensity. The standing posture also maintained excellent performance, with 99.7% precision and 99.8% recall. For lying, the model obtained 97.5% precision and 97.9% recall. Although still strong, sitting yielded comparatively lower values (95.2% precision and 94.4% recall), consistent with the challenges observed in HARTH. This recurring pattern indicates that sitting may present classification difficulty across populations, possibly due to variability in posture configuration and transitional movements between seated and non-seated states.
In the SisFall dataset, which involves a binary classification structure, performance characteristics differed from those in multi-class scenarios. ADLs were recognized with exceptionally high accuracy, achieving 99.6% precision, recall, and F1-score. This strong discrimination between normal activities and fall-related events is crucial for minimizing false alarms in practical monitoring systems. The fall class achieved 94.0% across precision, recall, and F1-score. Although slightly lower than ADL performance, these results remain robust. The balanced precision and recall values for falls indicate stable detection performance without systematic bias toward over-detection or missed events. Such an equilibrium is essential in elderly care environments, where both false negatives and false positives can have serious implications for safety and system reliability.
Across all classes, the per-class results demonstrate that the performance advantages of CNN–CBAM–BiGRU over the baseline models are distributed broadly across activity categories. These gains cannot be attributed to improvements in any single dominant class. The steady increases in precision and recall observed across both stationary postures and movement-based activities suggest that the architectural modifications—in particular, multi-scale CBAM attention and bidirectional temporal modelling—make a meaningful contribution to classification performance throughout the entire activity space.

6. Conclusions and Future Work

This study proposed a hybrid deep residual architecture integrating CNN, CBAM, and BiGRU for wearable-based elderly activity recognition. Extensive evaluation on three elderly-focused benchmark datasets (HARTH, HAR70+, and SisFall) demonstrated consistent superiority over state-of-the-art baseline models. The proposed CNN–CBAM–BiGRU achieved 96.82% accuracy on HARTH, 97.58% on HAR70+, and 98.92% on SisFall, with 93.74% recall for fall detection. These results confirm the robustness and suitability of the framework for real-world elderly monitoring and safety-critical applications.
The strong performance can be attributed to the complementary architectural design. CNN layers effectively extracted multi-scale spatial representations from raw sensor signals, CBAM enhanced discriminative feature selection through adaptive attention, and BiGRU captured bidirectional temporal dependencies within activity sequences. Residual connections facilitated stable training and mitigated gradient degradation. Class-wise analysis indicated near-perfect recognition of static activities, while most misclassifications occurred between kinematically similar actions, reflecting expected behavioral overlap. Convergence analysis further confirmed stable optimization and minimal overfitting.
Despite these promising results, several limitations remain. The datasets were collected in controlled or semi-structured environments and may not fully represent real-world variability. Computational complexity may constrain deployment on highly resource-limited wearable devices. Real-time latency, personalization mechanisms, and robustness to sensor placement variation were not extensively investigated.
Cross-dataset generalization was not examined in this study. Specifically, no experiments were conducted in which a model trained on one dataset was tested on a separate dataset without additional fine-tuning. The subject-independent 5-fold cross-validation protocol provides a robust measure of how well a model generalizes to new individuals within the same dataset. However, it does not capture how well the model generalizes to datasets with different sensor positioning, acquisition frequency, or activity label schemes. The three datasets employed here vary considerably in their configurations. HARTH and HAR70+ both use thigh and back sensors sampled at 50 Hz, whereas SisFall uses a waist-mounted sensor at 200 Hz. Given these differences, it remains unclear whether the CNN–CBAM–BiGRU architecture can maintain its performance across such distributional divergences without dataset-specific retraining. This remains an open research question.
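A subject-independent split of the kind referred to above keeps every window from a given subject on one side of each fold. The sketch below illustrates the idea under stated assumptions: the function name `subject_wise_folds`, the round-robin assignment of subjects to folds, and the example subject identifiers are all hypothetical, since the exact fold assignment used in the study is not described here.

```python
def subject_wise_folds(subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no subject spans both sides.

    subject_ids[i] is the subject who produced window i.
    """
    unique_subjects = sorted(set(subject_ids))
    # Round-robin assignment of subjects to folds (an illustrative choice).
    fold_of = {s: k % n_folds for k, s in enumerate(unique_subjects)}
    for fold in range(n_folds):
        test_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(subject_ids) if fold_of[s] != fold]
        yield train_idx, test_idx


# Hypothetical example: 10 subjects with 3 windows each.
subjects = [f"S{n:02d}" for n in range(10) for _ in range(3)]
folds = list(subject_wise_folds(subjects, n_folds=5))
for k, (train_idx, test_idx) in enumerate(folds):
    held_out = sorted({subjects[i] for i in test_idx})
    print(f"fold {k}: {len(train_idx)} train windows, held-out subjects {held_out}")
```

A cross-dataset experiment would differ in that the held-out data come from another dataset entirely, with its own sensor placement and sampling rate, which this within-dataset split does not exercise.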
Future research will focus on evaluation using naturalistic home-based datasets and longitudinal monitoring to assess adaptability to mobility changes. Model compression techniques, including pruning, quantization, and knowledge distillation, will be explored to support deployment on low-power devices. Personalized adaptation strategies and multimodal sensor fusion may further enhance recognition accuracy. In addition, explainable AI approaches will be investigated to improve transparency and trust. Methodological extensions, such as transformer-based architectures, graph neural networks, and semi-supervised learning, could further improve performance. Addressing class imbalance, particularly for rare fall events, remains a critical direction for improving the reliability of safety monitoring.

Author Contributions

Conceptualization, S.M. and A.J.; methodology, S.M.; software, A.J.; validation, A.J.; formal analysis, S.M.; investigation, S.M.; resources, A.J.; data curation, A.J.; writing—original draft preparation, S.M.; writing—review and editing, A.J.; visualization, S.M.; supervision, A.J.; project administration, A.J.; funding acquisition, S.M. and A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Phayao; the Thailand Science Research and Innovation Fund (Fundamental Fund 2026), Grant No. 2255/2568; and the National Science, Research and Innovation Fund (NSRF) and King Mongkut’s University of Technology North Bangkok (Project No. KMUTNB-FF-69-B-01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This research uses pre-existing, publicly available datasets. The datasets have been anonymized and do not contain any personally identifiable information. We have cited the source of each dataset in our manuscript and have complied with the terms of use set forth by the dataset providers.

Data Availability Statement

The original data presented in the study are openly available: the HARTH dataset at https://archive.ics.uci.edu/dataset/779/harth (accessed on 30 November 2025); the HAR70+ dataset at https://archive.ics.uci.edu/dataset/780/har70 (accessed on 30 November 2025); and the SisFall dataset at https://www.mendeley.com/catalogue/33517d90-cabc-348f-ab7d-b901501c18d2/ (accessed on 30 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADL: Activity of Daily Living
BiGRU: Bidirectional Gated Recurrent Unit
BiLSTM: Bidirectional Long Short-Term Memory
CBAM: Convolutional Block Attention Module
CNN: Convolutional Neural Network
CUDA: Compute Unified Device Architecture
GPU: Graphics Processing Unit
GRU: Gated Recurrent Unit
HAR: Human Activity Recognition
IMU: Inertial Measurement Unit
IQR: Interquartile Range
k-NN: k-Nearest Neighbour
L2: L2 Regularisation (weight decay)
LSTM: Long Short-Term Memory
MLP: Multilayer Perceptron
ReLU: Rectified Linear Unit
RNN: Recurrent Neural Network
SVM: Support Vector Machine

References

  1. Ahmed, S.; Irfan, S.; Kiran, N.; Masood, N.; Anjum, N.; Ramzan, N. Remote Health Monitoring Systems for Elderly People: A Survey. Sensors 2023, 23, 7095.
  2. Ali, A.; Montanaro, T.; Sergi, I.; Carrisi, S.; Galli, D.; Distante, C.; Patrono, L. An Innovative IoT and Edge Intelligence Framework for Monitoring Elderly People Using Anomaly Detection on Data from Non-Wearable Sensors. Sensors 2025, 25, 1735.
  3. Febriant, E.C.; Sudarsono, A.; Santoso, T.B. Human Activity Recognition for Elderly People Using Pseudonym-Based Conditional Privacy-Preservation Authentication. Int. J. Intell. Eng. Syst. 2024, 17, 22–40.
  4. Durgun, Y. Fall Detection Systems Supported by TinyML and Accelerometer Sensors: An Approach for Ensuring the Safety and Quality of Life of the Elderly. Int. Sci. Vocat. Stud. J. 2023, 7, 55–61.
  5. Li, Y.; Liu, P.; Fang, Y.; Wu, X.; Xie, Y.; Xu, Z.; Ren, H.; Jing, F. A Decade of Progress in Wearable Sensors for Fall Detection (2015–2024): A Network-Based Visualization Review. Sensors 2025, 25, 2205.
  6. Jalal, A.; Quaid, M.A.K.; Tahir, S.B.u.d.; Kim, K. A Study of Accelerometer and Gyroscope Measurements in Physical Life-Log Activities Detection Systems. Sensors 2020, 20, 6670.
  7. Huang, E.J.; Yan, K.; Onnela, J.P. Smartphone-Based Activity Recognition Using Multistream Movelets Combining Accelerometer and Gyroscope Data. Sensors 2022, 22, 2618.
  8. Brand, Y.E.; Kluge, F.; Palmerini, L.; Paraschiv-Ionescu, A.; Becker, C.; Cereatti, A.; Maetzler, W.; Sharrack, B.; Vereijken, B.; Yarnall, A.J.; et al. Self-supervised learning of wrist-worn daily living accelerometer data improves the automated detection of gait in older adults. Sci. Rep. 2024, 14, 20854.
  9. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11.
  10. Nafea, O.; Abdul, W.; Muhammad, G.; Alsulaiman, M. Sensor-Based Human Activity Recognition with Spatio-Temporal Deep Learning. Sensors 2021, 21, 2141.
  11. Akter, M.; Ansary, S.; Khan, M.A.M.; Kim, D. Human Activity Recognition Using Attention-Mechanism-Based Deep Learning Feature Combination. Sensors 2023, 23, 5715.
  12. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–11 September 2018; Proceedings, Part VII; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
  13. Zhong, Z.; Liu, B. Efficient Human Activity Recognition Using Machine Learning and Wearable Sensor Data. Appl. Sci. 2025, 15, 4075.
  14. Zhang, S.; Li, Y.; Zhang, S.; Shahabi, F.; Xia, S.; Deng, Y.; Alshurafa, N. Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances. Sensors 2022, 22, 1476.
  15. Gupta, S. Deep learning based human activity recognition (HAR) using wearable sensor data. Int. J. Inf. Manag. Data Insights 2021, 1, 100046.
  16. Zafar, A.; Saba, N.; Arshad, A.; Alabrah, A.; Riaz, S.; Suleman, M.; Zafar, S.; Nadeem, M. Convolutional Neural Networks: A Comprehensive Evaluation and Benchmarking of Pooling Layer Variants. Symmetry 2024, 16, 1516.
  17. Doniec, R.; Konior, J.; Sieciński, S.; Piet, A.; Irshad, M.T.; Piaseczna, N.; Hasan, M.A.; Li, F.; Nisar, M.A.; Grzegorzek, M. Sensor-Based Classification of Primary and Secondary Car Driver Activities Using Convolutional Neural Networks. Sensors 2023, 23, 5551.
  18. Mekruksavanich, S.; Jitpattanakul, A. LSTM Networks Using Smartphone Data for Sensor-Based Human Activity Recognition in Smart Homes. Sensors 2021, 21, 1636.
  19. Mekruksavanich, S.; Jitpattanakul, A. RNN-based deep learning for physical activity recognition using smartwatch sensors: A case study of simple and complex activity recognition. Math. Biosci. Eng. 2022, 19, 5671–5698.
  20. Sujatha, G.; Badrinath, N.; Sarada, C.; Reddy, C.S.K.; Sudhakara, M. Enhancing elderly activity recognition and safety through a hybrid deep learning model. Meas. Sens. 2025, 41, 101970.
  21. Luwe, Y.J.; Lee, C.P.; Lim, K.M. Wearable Sensor-Based Human Activity Recognition with Hybrid Deep Learning Model. Informatics 2022, 9, 56.
  22. Charabi, I.; Abidine, M.B.; Fergani, B.; Oussalah, M. DeepF-SVM: A new hybrid deep learning model for enhanced sensor-based human activity recognition. Clust. Comput. 2025, 28, 910.
  23. Lim, X.Y.; Gan, K.B.; Abd Aziz, N.A. Deep ConvLSTM Network with Dataset Resampling for Upper Body Activity Recognition Using Minimal Number of IMU Sensors. Appl. Sci. 2021, 11, 3543.
  24. Singh, S.P.; Sharma, M.K.; Lay-Ekuakille, A.; Gangwar, D.; Gupta, S. Deep ConvLSTM with Self-Attention for Human Activity Decoding Using Wearable Sensors. IEEE Sens. J. 2021, 21, 8575–8582.
  25. Ghazi, M.E.; Aknin, N. A Comparison of Sampling Methods for Dealing with Imbalanced Wearable Sensor Data in Human Activity Recognition using Deep Learning. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 290–305.
  26. Ignatov, A. Real-time human activity recognition from accelerometer data using Convolutional Neural Networks. Appl. Soft Comput. 2018, 62, 915–922.
  27. Mekruksavanich, S.; Jitpattanakul, A. Deep Convolutional Neural Network with RNNs for Complex Activity Recognition Using Wrist-Worn Wearable Sensor Data. Electronics 2021, 10, 1685.
  28. Abbaspour, S.; Fotouhi, F.; Sedaghatbaf, A.; Fotouhi, H.; Vahabi, M.; Linden, M. A Comparative Analysis of Hybrid Deep Learning Models for Human Activity Recognition. Sensors 2020, 20, 5707.
  29. Logacjov, A.; Bach, K.; Kongsvold, A.; Bårdstu, H.B.; Mork, P.J. HARTH: A Human Activity Recognition Dataset for Machine Learning. Sensors 2021, 21, 7853.
  30. Ustad, A.; Logacjov, A.; Trollebø, S.Ø.; Thingstad, P.; Vereijken, B.; Bach, K.; Maroni, N.S. Validation of an Activity Type Recognition Model Classifying Daily Physical Behavior in Older Adults: The HAR70+ Model. Sensors 2023, 23, 2368.
  31. Sucerquia, A.; López, J.D.; Vargas-Bonilla, J.F. SisFall: A Fall and Movement Dataset. Sensors 2017, 17, 198.
  32. Cruciani, F.; Vafeiadis, A.; Nugent, C.; Cleland, I.; McCullagh, P.; Votis, K.; Giakoumis, D.; Tzovaras, D.; Chen, L.; Hamzaoui, R. Feature learning for Human Activity Recognition using Convolutional Neural Networks. CCF Trans. Pervasive Comput. Interact. 2020, 2, 18–32.
Figure 1. Overview of our proposed HAR framework.
Figure 2. Structure of the proposed CNN–CBAM–BiGRU.
Figure 3. Convolutional block attention module (CBAM).
Figure 4. The architecture of the GRU and BiGRU: (a) internal GRU cell structure and (b) bidirectional GRU (BiGRU) network.
Figure 5. Confusion matrices of the CNN–CBAM–BiGRU from different datasets: (a) HARTH, (b) HAR70+, and (c) SisFall.
Figure 6. Accuracy and loss curves of the CNN–CBAM–BiGRU from different datasets: (a) HARTH, (b) HAR70+, and (c) SisFall.
Table 1. Summary details for benchmark datasets used in this work.

| Dataset | Subjects | Male/Female | Age Range (Years) | Sensor Placements | Sensor Types (Sampling Rate) | Activities |
| --- | --- | --- | --- | --- | --- | --- |
| HARTH | 22 | 14/8 | 25–68 | Right thigh, Lower back | Accelerometer (50 Hz) | The dataset covers diverse locomotion and posture activities, including walking, running, shuffling, stair ascent and descent, standing, sitting, lying, cycling in seated and upright positions, and inactive sitting and standing. |
| HAR70+ | 18 Elderly | 9/9 | 79–95 | Right thigh, Lower back | Accelerometer (50 Hz) | The dataset includes locomotion and postural activities such as walking, shuffling, stair ascent and descent, standing, sitting, and lying. |
| SisFall | 38 (15 Elderly and 23 Adults) | 19/19 | Elderly (60–75), Adult (19–30) | Waist | Accelerometer (200 Hz) and Gyroscope (200 Hz) | The dataset comprises 19 ADLs and 15 fall events. |
Table 2. Adam optimizer parameters.

| Parameter | Value |
| --- | --- |
| $L_r$ | 0.001 |
| $\beta_1$ | 0.9 |
| $\beta_2$ | 0.999 |
| $\epsilon$ | $1 \times 10^{-8}$ |
| Decay | 0.01 |
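To make the role of the Table 2 parameters concrete, the following minimal sketch applies the standard Adam update rule to a single scalar parameter. It is an illustration rather than the authors' training code: the function name `adam_step` and the toy objective are hypothetical, $\epsilon$ is assumed to be $1 \times 10^{-8}$, and the Decay entry is left out because it is not stated here whether it denotes weight decay or a learning-rate schedule.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Hypothetical toy objective: f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, 2 * w, m, v, t=1)
print(f"w after one step: {w:.6f}")
```

On the first step the bias-corrected ratio reduces to roughly the sign of the gradient, so the parameter moves by approximately the learning rate regardless of the gradient's magnitude.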
Table 3. Recognition performances of deep learning models and the proposed CNN–CBAM–BiGRU model using the HARTH dataset. Bold values indicate the best performance.

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| CNN | 95.51% (±0.09%) | 90.61% (±0.78%) | 89.14% (±0.26%) | 89.72% (±0.38%) |
| LSTM | 96.21% (±0.14%) | 93.00% (±0.47%) | 92.70% (±0.41%) | 92.83% (±0.32%) |
| BiLSTM | 96.27% (±0.11%) | 92.96% (±0.24%) | 92.51% (±0.25%) | 92.72% (±0.21%) |
| GRU | 96.11% (±0.14%) | 92.45% (±0.34%) | 92.06% (±0.36%) | 92.23% (±0.13%) |
| BiGRU | 96.26% (±0.10%) | 92.98% (±0.40%) | 92.56% (±0.16%) | 92.75% (±0.20%) |
| CNN–CBAM–BiGRU | **96.82% (±0.12%)** | **94.07% (±0.32%)** | **93.72% (±0.32%)** | **93.87% (±0.24%)** |
Table 4. Recognition performances of deep learning models and the proposed CNN–CBAM–BiGRU model using the HAR70+ dataset. Bold values indicate the best performance.

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| CNN | 97.00% (±0.20%) | 97.49% (±0.22%) | 97.30% (±0.20%) | 97.39% (±0.17%) |
| LSTM | 97.08% (±0.19%) | 97.26% (±0.26%) | 97.40% (±0.15%) | 97.33% (±0.19%) |
| BiLSTM | 97.02% (±0.15%) | 97.33% (±0.09%) | 97.30% (±0.16%) | 97.32% (±0.13%) |
| GRU | 97.27% (±0.20%) | 97.53% (±0.19%) | 97.66% (±0.14%) | 97.60% (±0.16%) |
| BiGRU | 97.10% (±0.13%) | 97.48% (±0.17%) | 97.40% (±0.04%) | 97.44% (±0.10%) |
| CNN–CBAM–BiGRU | **97.58% (±0.27%)** | **97.96% (±0.24%)** | **97.80% (±0.25%)** | **97.88% (±0.23%)** |
Table 5. Recognition performances of deep learning models and the proposed CNN–CBAM–BiGRU model using the SisFall dataset. Bold values indicate the best performance.

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| CNN | 96.28% (±1.07%) | 86.26% (±7.16%) | 80.58% (±3.43%) | 82.41% (±2.82%) |
| LSTM | 97.26% (±0.28%) | 93.75% (±1.69%) | 79.66% (±2.53%) | 85.09% (±1.91%) |
| BiLSTM | 97.54% (±0.35%) | 93.52% (±1.86%) | 82.68% (±2.71%) | 87.15% (±2.17%) |
| GRU | 97.93% (±0.46%) | 94.86% (±1.49%) | 85.26% (±3.48%) | 89.34% (±2.67%) |
| BiGRU | 98.04% (±0.36%) | 94.08% (±1.01%) | 87.20% (±3.41%) | 90.21% (±2.15%) |
| CNN–CBAM–BiGRU | **98.92% (±0.23%)** | **96.34% (±1.62%)** | **93.74% (±2.21%)** | **94.93% (±1.12%)** |
Table 6. Statistical comparison of CNN–CBAM–BiGRU against top-performing baselines using paired t-tests across five-fold cross-validation ($\alpha = 0.05$).

| Dataset | Best Baseline | Baseline Acc. (%) ± $\sigma$ | Proposed Acc. (%) ± $\sigma$ | $\Delta$ Accuracy (pp) | t-Statistic | p-Value | Significant? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HARTH | BiLSTM | 96.27 ± 0.11 | 96.82 ± 0.12 | 0.55 | 7.71 | 0.0015 | Yes ($p < 0.01$) |
| HAR70+ | GRU | 97.27 ± 0.20 | 97.58 ± 0.27 | 0.31 | 2.06 | 0.1081 | No ($p = 0.108$) |
| SisFall | BiGRU | 98.04 ± 0.36 | 98.92 ± 0.23 | 0.88 | 4.65 | 0.0097 | Yes ($p < 0.01$) |

$df = 4$, critical $t = 2.776$ ($\alpha = 0.05$, two-tailed); conservative assumption: $r = 0$ (independent folds).
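The t-statistics in Table 6 can be approximated from the reported means and standard deviations. The sketch below is one plausible reading of the table note, not the authors' script: it assumes that, with $r = 0$, the variance of the fold-wise difference is the sum of the two variances, and that the reported $\sigma$ values are per-model standard deviations over the five folds. The function name `t_from_summary` is hypothetical. Because the inputs are rounded, the results are close to, but not necessarily identical to, the tabulated values.

```python
import math

T_CRITICAL = 2.776  # two-tailed critical value for df = 4, alpha = 0.05 (from the table note)


def t_from_summary(mean_a, sd_a, mean_b, sd_b, n_folds=5):
    """t-statistic for the mean difference b - a, assuming r = 0 between the two models."""
    sd_diff = math.sqrt(sd_a ** 2 + sd_b ** 2)  # with r = 0 the variances simply add
    return (mean_b - mean_a) / (sd_diff / math.sqrt(n_folds))


# Rounded accuracies and standard deviations as reported in Table 6:
# (baseline mean, baseline sd, proposed mean, proposed sd).
rows = {
    "HARTH": (96.27, 0.11, 96.82, 0.12),
    "HAR70+": (97.27, 0.20, 97.58, 0.27),
    "SisFall": (98.04, 0.36, 98.92, 0.23),
}
results = {name: t_from_summary(*vals) for name, vals in rows.items()}
for name, t in results.items():
    print(f"{name}: t = {t:.2f}, exceeds critical value = {abs(t) > T_CRITICAL}")
```

Under these assumptions the HARTH and SisFall statistics exceed the critical value while the HAR70+ statistic does not, which agrees with the significance column of the table.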
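Table 7. Per-class performance metrics across three datasets.

| Dataset | Activity | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| HARTH | Walking | 95.5 | 97.3 | 96.4 |
| HARTH | Running | 92.6 | 89.2 | 90.9 |
| HARTH | Standing | 99.6 | 99.7 | 99.6 |
| HARTH | Stairs Up | 98.6 | 97.4 | 98.0 |
| HARTH | Stairs Down | 99.4 | 99.4 | 99.4 |
| HARTH | Sitting | 81.6 | 87.5 | 84.4 |
| HARTH | Lying | 90.2 | 89.9 | 90.0 |
| HARTH | Cycling (sit) | 93.5 | 92.1 | 92.8 |
| HARTH | Cycling (stand) | 92.9 | 93.5 | 93.2 |
| HAR70+ | Walking | 99.9 | 99.6 | 99.8 |
| HAR70+ | Standing | 99.7 | 99.8 | 99.8 |
| HAR70+ | Sitting | 95.2 | 94.4 | 94.8 |
| HAR70+ | Lying | 97.5 | 97.9 | 97.7 |
| SisFall | ADL | 99.6 | 99.6 | 99.6 |
| SisFall | Fall | 94.0 | 94.0 | 94.0 |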
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mekruksavanich, S.; Jitpattanakul, A. Enhancing Wearable-Based Elderly Activity Recognition Through a Hybrid Deep Residual Network. Mach. Learn. Knowl. Extr. 2026, 8, 107. https://doi.org/10.3390/make8040107
