Article

Conv-ScaleNet: A Multiscale Convolutional Model for Federated Human Activity Recognition

1 Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia
2 Centre for Advanced Analytics (CAA), COE for Artificial Intelligence, Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia
3 Infineon Technologies Sdn. Bhd., Free Trade Zone, Melaka 75350, Malaysia
* Author to whom correspondence should be addressed.
AI 2025, 6(9), 218; https://doi.org/10.3390/ai6090218
Submission received: 14 July 2025 / Revised: 30 August 2025 / Accepted: 3 September 2025 / Published: 8 September 2025

Abstract

Background: Artificial Intelligence (AI) techniques have been extensively deployed in sensor-based Human Activity Recognition (HAR) systems. Recent advances in deep learning, especially Convolutional Neural Networks (CNNs), have advanced HAR by enabling automatic feature extraction from raw sensor data. However, these models often struggle to capture multiscale patterns in human activity, limiting recognition accuracy. Additionally, traditional centralized learning approaches raise data privacy concerns, as personal sensor data must be transmitted to a central server, increasing the risk of privacy breaches. Methods: To address these challenges, this paper introduces Conv-ScaleNet, a CNN-based model designed for multiscale feature learning and compatibility with federated learning (FL) environments. Conv-ScaleNet integrates a Pyramid Pooling Module to extract both fine-grained and coarse-grained features and employs sequential Global Average Pooling layers to progressively capture abstract global representations from inertial sensor data. The model supports federated learning by training locally on user devices, sharing only model updates rather than raw data, thus preserving user privacy. Results: Experimental results demonstrate that the proposed Conv-ScaleNet achieves approximately 98% and 96% F1-scores on the WISDM and UCI-HAR datasets, respectively, confirming its competitiveness in FL environments for activity recognition. Conclusions: The proposed Conv-ScaleNet model addresses key limitations of existing HAR systems by combining multiscale feature learning with privacy-preserving training. Its strong performance, data protection capability, and adaptability to decentralized environments make it a robust and scalable solution for real-world HAR applications.

1. Introduction

Sensor-based HAR uses data from various sensors, e.g., accelerometers and gyroscopes in mobile phones, to recognize and classify human activities [1,2,3,4]. It facilitates numerous applications, e.g., health monitoring, fitness tracking, and personalized services, and is thus a key element of context-aware systems in the growing Internet of Things (IoT) ecosystem. The advent of Artificial Intelligence (AI), particularly Deep Learning (DL), has enhanced HAR technology by overcoming limitations in adaptability and accuracy. The DL approach enables automated feature extraction from raw sensor data without requiring hand-designed features [5,6,7]. In contrast to traditional machine learning models, DL models can capture intricate patterns in inertial data, which can significantly improve activity classification performance.
However, applying DL models to HAR applications has its shortcomings, the foremost being centralized processing and the accompanying data privacy problems [8,9,10]. Most classic DL models rely on centralized collection and processing of data, wherein sensitive individualized data, e.g., activity patterns that can reflect an individual's current health state, are stored and processed in a single central facility. This centralized design significantly increases the risk of personal data breaches because a single point of failure can be attacked. For example, theft of HAR data from a centralized database could expose the personal data of thousands or even millions of users. Such attacks not only violate user privacy but also raise serious ethical and legal issues.
To address these challenges, FL has emerged as a promising alternative. In the FL framework, data learning and analysis are conducted in a decentralized manner [11,12,13]. Specifically, model training occurs locally on each client device or system. Raw sensor data thus remains on local devices, and only model updates are transmitted to a central server, where they are aggregated into a global model. This approach decentralizes computation, reducing the risk of data leakage that can occur in centralized frameworks. Nevertheless, applying FL to sensor-based HAR presents significant challenges, mainly due to the inherent heterogeneity of the data. Environmental conditions and noise (i.e., different weather conditions, surface types, crowded versus spacious areas, obstacles, etc.), as well as differences in motion styles, gestures, and pace among individuals, result in behavioral variability, even when the same activities are performed with the same device. This data heterogeneity can substantially degrade the performance of the HAR system. Local models trained on data from one group of users may not generalize well to other groups of users. Additionally, this variability across users can cause client drift, where local model updates diverge, impacting the convergence and overall performance of the global model.
The factors mentioned above pose challenges to achieving reliable and accurate HAR in the FL framework, particularly when traditional models such as 1-dimensional Convolutional Neural Networks (1D CNNs) are employed in HAR applications. The models may struggle to handle the data heterogeneity effectively. This heterogeneity results in complex data features, and the 1D CNNs may be suboptimal to extract these intrinsic complex features from the inertial data, which are important for accurate activity recognition. FL operates in a decentralized mode, where local models are trained at the clients using their respective local datasets. Only the model updates are shared with a central server and aggregated to construct a global model, which is then distributed back to the clients. This process is repeated iteratively until the global model converges; see Figure 1. Local training models are vital in this context as they enable each client to capture unique and informative features of their data.
Therefore, this study focuses on improving the 1D CNN architecture for local data training in FL. In the FL framework, decentralized datasets are stored on user devices, e.g., smartphones or wearable sensors, and no raw inertial sensor data is transmitted to the server. In the proposed 1D CNN model termed Conv-ScaleNet, the Pyramid Pooling Module (PPM) and Global Average Pooling (GAP) are incorporated to capture multiscale features of the inertial data, facilitating effective feature extraction that aggregates local and global data characteristics. This multiscale feature representation enhances the model’s ability to handle diverse data distributions. Experiments on publicly available HAR datasets, i.e., WISDM and UCI-HAR, demonstrate that the proposed Conv-ScaleNet outperforms existing approaches in federated settings, achieving a higher accuracy and F1-score while preserving user privacy.
The following are the contributions of this paper:
  • To develop a smartphone-based HAR system that protects privacy by decentralizing model training through an FL architecture. Raw sensor data remains on each user’s device, and only model updates are exchanged with the central server for aggregation.
  • To integrate the Pyramid Pooling Module (PPM) and Global Average Pooling (GAP) into the DL architecture to produce a multiscale feature representation. This design enables multiscale feature extraction from the inertial sequence through pooling operations at different scales. The aggregation of local and global characteristics enhances the model's adaptability to varied data distributions.
  • To assess the performance of the proposed Conv-ScaleNet on publicly accessible datasets. This assessment shows that the proposed model outperforms other existing approaches in the FL environment.

2. Related Work

Advancements in DL have significantly enhanced HAR with increased accuracy and robustness in various applications such as healthcare monitoring and smart environments [13,14,15,16,17]. DL models can intrinsically extract meaningful spatiotemporal features from raw sensor data, outperforming traditional methods dependent on handcrafted features. Numerous DL models have been proposed for human activity recognition. For instance, Meng et al. focused on optimizing sensor configurations for specific populations [19]. In that work, a comparative study investigated single-sensor HAR for stroke survivors and able-bodied individuals, using accelerometers, gyroscopes, and sEMG sensors to identify the optimal sensor type and placement. Additionally, the authors demonstrated that a model trained on healthy individuals could effectively classify activities in stroke survivors. This finding highlights opportunities for rehabilitation monitoring with minimal sensor setups. Similarly, research on hand gesture recognition for prosthesis and armband-based human–machine interfaces has examined how factors such as measurement location, gesture type, and validation protocol affect classification accuracy [18]. Using high-density surface electromyography (HD-sEMG) to simulate various electrode placements, the authors evaluated ten commonly used dynamic and gesture-and-hold gestures under intraday, interday, and intersubject protocols. The results showed that while traditional pattern recognition approaches achieved high accuracy within subjects, performance decreased across days or users due to sEMG variability and electrode shifts. This underscores the need for robust configurations that balance gesture set size, accuracy, and generalizability for practical applications.
Furthermore, Abdel-Basset et al. proposed ST-deepHAR, a spatiotemporal dual-channel model integrating a Long Short-Term Memory (LSTM) layer and an attention mechanism with a modified residual block for smartphone HAR [20]. This approach achieved 97.7% accuracy on the UCI-HAR dataset and 96.4% on the WISDM dataset. Gupta adopted a hybrid CNN-GRU model, attaining an accuracy of 96.54% using smartwatch data and 90.44% with smartphone data [7]. Additionally, Bashar et al. proposed a smartphone-based HAR model that combines activity-driven handcrafted features with neighborhood component analysis for feature selection, followed by a dense neural network with four hidden layers [21]. The proposed model attained an accuracy of 95.79%. Similarly, Raja Sekaran et al. devised a smartphone-based inertial-sensor HAR model [22]. The empirical results revealed that the proposed model achieved 96.4% accuracy while mitigating the computational overhead and privacy concerns associated with vision-based HAR. Kumar and Suresh introduced an ensemble model named Deep-HAR by integrating a CNN and a Recurrent Neural Network (RNN) [23]. The model demonstrated its superiority in smartphone HAR when evaluated on the WISDM and KU-HAR smartphone datasets.
A Hierarchical Hybrid Deep Learning Architecture for Wearable Sensor-Based Human Activity Recognition, coined HiHAR, was proposed by Hoai Thu and Han [24]. This hierarchical deep model integrates a CNN and Bidirectional Long Short-Term Memory (BiLSTM) to extract short-term and long-term dependencies from sensor data. The model comprises two stages: a local stage that captures spatiotemporal features from a single window and a global stage that extracts long-term context from adjacent windows. HiHAR was tested on the UCI HAPT and MobiAct smartphone HAR datasets and attained high accuracy, with scores of 97.98% and 96.16%, respectively. Although DL approaches exhibit promising performance, they often raise privacy concerns. Centralized data collection and analysis in DL increase the risk of exposing sensitive data, and centralized systems are vulnerable to single points of failure, where one subsystem's failure can disrupt the entire system.
To mitigate these challenges, FL has been introduced as an alternative solution. In this decentralized data processing framework, FL enables HAR models to be trained on user devices for the sake of privacy [12,25]. Presotto et al. devised FedCLAR, a federated clustering approach for HAR [26]. FedCLAR groups users with similar activity patterns by examining a subset of the model weights shared with a central server. This selective analysis minimizes communication overhead, and the system outperforms conventional FL solutions in HAR tasks. The authors further enhanced the model by integrating federated clustering and semi-supervised learning [27]. In the resulting model, SS-FedCLAR, each client employs a combination of active learning and label propagation to compute pseudo-labels, which are subsequently used to collaboratively train a federated clustering model. On the other hand, Sarkar et al. proposed a Graph Neural Network, GraFeHTy, in a federated setting for HAR [28]. This approach constructs a similarity graph from sensor measurements to leverage relationships between labeled and unlabeled data. The experimental results showed that the proposed model attained an accuracy of 91% in a centralized setting and 81.7% on the WISDM dataset in a federated setting. Xiao et al. proposed HARFLS, a federated learning system for sensor-based human activity recognition [29]. The model integrates a perceptive extraction network (PEN), consisting of a convolutional feature network and an LSTM–attention relation network, to capture both local and global features from HAR data. Experimental results show that PEN outperforms 14 existing HAR algorithms in F1-score, and HARFLS with PEN achieves better recognition on WISDM and PAMAP2 than 11 federated learning systems with different feature extraction methods. PEN achieved F1-scores of 98.97% on WISDM, 96.33% on UCI_HAR2012, 97.78% on PAMAP2, and 96.89% on OPPORTUNITY, resulting in a mean F1-score of 97.49% (±1.83) and demonstrating robust and consistent performance across diverse datasets.
On the other hand, Yussif et al. proposed the Efficient Graph and Temporal Convolution Network, abbreviated as EGTCN, for human activity classification [30]. The approach enhances segment-based and frame-wise predictions while decreasing computational cost. FL is incorporated into EGTCN to preserve user privacy by training models locally on edge devices and aggregating updates centrally for efficient and secure activity recognition in decentralized environments. The proposed EGTCN outperforms current state-of-the-art methods with an F1-score of 99.99% on the WISDM dataset, 98.3% on UCI-HAR, and 99.92% on the PAMAP2 dataset, demonstrating that its weighted dual-objective optimization can dramatically improve classification performance.
Shen et al. introduced Federated Multi-Task Attention (FedMAT), a federated activity recognition framework that addresses mismatches in sensor data contributed by different individuals [31]. FedMAT extracts shared and private features to combine information from various sensors effectively, and it consistently outperforms the compared baselines across four different datasets. The proposed model achieved an accuracy of 96.88% and an F1-score of 96.81% on the HHAR dataset; 92.61% accuracy and a 91.84% F1-score on PAMAP2; 75.72% accuracy and a 75.03% F1-score on the ExtraSensory dataset; and 89.78% accuracy and an 83.02% F1-score on the SmartJLU dataset.
Despite significant advances in FL for human activity recognition, many existing models face challenges in capturing multiscale features from sensor data, which are crucial for activity recognition. Conv-ScaleNet addresses this challenge by incorporating a Pyramid Pooling Module (PPM) with Global Average Pooling (GAP) to improve feature extraction, while leveraging FL to train models locally on user devices for privacy protection.

3. Materials and Methods

As aforementioned, FL in sensor-based HAR encounters challenges due to data heterogeneity, which can degrade model performance. Traditional models, such as 1D CNNs, may struggle to extract the intrinsic, complex features required for reliable activity recognition. Thus, we introduce an FL-based HAR framework with an improved convolutional local training model, named Conv-ScaleNet, as depicted in Figure 2.
In the proposed decentralized system, each client trains the Conv-ScaleNet using their local inertial data. The Conv-ScaleNet is designed by incorporating the Pyramid Pooling Module (PPM) and Global Average Pooling (GAP) to capture multiscale features of the inertial data. Next, the locally trained model updates are transmitted to a central server, and an aggregation strategy is employed to combine the updates and construct a global model. This collaborative learning ensures data privacy preservation by keeping the raw data locally while enhancing the generalizability of the global model.

3.1. Data Acquisition and Preprocessing

In this study, we employed two publicly available smartphone-based HAR datasets, i.e., WISDM and UCI-HAR, which are both labeled with each activity instance assigned to a predefined class. The Wireless Sensor Data Mining (WISDM) dataset is available at https://www.cis.fordham.edu/wisdm/dataset.php (accessed on 2 June 2025) [32]. The dataset consists of 1,098,207 instances of sensor-based activity data from 36 subjects, sampled at 20 Hz, recorded using the accelerometer embedded in a smartphone. Six activities were performed, including walking, jogging, walking upstairs, walking downstairs, sitting, and standing, according to a predefined protocol. On the other hand, the UCI Human Activity Recognition (HAR) dataset is available at https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones (accessed on 2 June 2025) [33]. It contains recordings of six daily activities, including walking, walking upstairs, walking downstairs, sitting, standing, and lying, performed by 30 subjects wearing a Samsung Galaxy S II smartphone on the waist. Motion signals were captured at 50 Hz using the phone’s embedded accelerometer and gyroscope. Participants were aged between 19 and 48 years, though no gender distribution is reported. Each sample consists of a 128-time-step window of nine sensor signals: three axial linear accelerations, three axial body accelerations, and three axial angular velocities.
For both datasets, we applied a split of 65% for training, 25% for validation, and 10% for testing, ensuring each data instance appeared in only one partition. The validation set was used for hyperparameter tuning and model selection, while the independent testing set was reserved for model performance evaluation. In this study, data samples from the same participant could be present across training, validation, and testing sets, but no samples overlapped across these sets. This experimental strategy was designed to approximate realistic deployment conditions where an individual may interact with multiple applications or devices. However, care was taken to avoid direct data leakage by ensuring that training samples were excluded from both validation and testing.
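A minimal sketch of this segmentation and splitting procedure is shown below, assuming the raw recordings are available as NumPy arrays. The sliding-window step, the majority-label rule, and the shuffling seed are illustrative assumptions rather than details reported in the paper; the 128-sample window matches the UCI-HAR segmentation described above.

```python
import numpy as np

def make_windows(signal, labels, window=128, step=64):
    """Segment a multichannel recording of shape (T, C) into fixed-length windows."""
    xs, ys = [], []
    for start in range(0, len(signal) - window + 1, step):
        xs.append(signal[start:start + window])
        # Label each window by the majority activity it contains (assumption).
        ys.append(np.bincount(labels[start:start + window]).argmax())
    return np.stack(xs), np.array(ys)

def split_65_25_10(x, y, seed=0):
    """Shuffle windows and split them 65/25/10 into train/validation/test,
    so that no window appears in more than one partition."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train, n_val = int(0.65 * len(x)), int(0.25 * len(x))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])
```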

3.2. Local Training Model: Conv-ScaleNet Architecture

The proposed Conv-ScaleNet is an enhanced convolutional-based local training model in the FL framework for human activity recognition. It is designed to capture multiscale features from inertial sensor data. As depicted in Figure 3, the Conv-ScaleNet architecture begins with two convolutional blocks to extract both localized and high-level feature patterns from the input inertial signals. The first convolutional block employs a 1D kernel of size 5 to capture local, short-range dependencies, while the second convolutional block adopts a larger 1D kernel of size 7 to capture broader patterns. These two convolutional blocks complement each other to learn short and long-range dependencies, which are crucial for reliable human activity recognition. Batch normalization and ReLU activation functions are integrated into the convolutional blocks to stabilize the model training. Specifically, batch normalization improves model convergence, and ReLU introduces nonlinearity to facilitate the model’s ability to learn complex patterns from the sensor data. Each convolutional block is followed by a max pooling layer to reduce feature map dimensions.
To enhance the multiscale feature learning, Conv-ScaleNet incorporates a pooling module, using Pyramid Pooling operations at different scales—e.g., pool sizes 1, 2, and 4—to learn features at various levels of granularity. The Pyramid Pooling multiscale behavior allows the model to detect fine-level details as well as broader contextual information in the given data. By pooling features at different scales, the Pyramid Pooling Module captures information across multiple spatial areas and thereby enhances the capacity of the model to learn various sizes and shapes as well as objects. On the other hand, GAP provides global context information, complementing the local context extracted by the former pooling.
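To make the architecture description concrete, the following PyTorch sketch outlines a plausible Conv-ScaleNet layout. The kernel sizes (5 and 7) and the pyramid pool sizes (1, 2, 4) follow the text; the channel widths, padding, and classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PyramidPooling1D(nn.Module):
    """Pool the feature map at several scales and concatenate the results."""
    def __init__(self, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool1d(s) for s in pool_sizes])

    def forward(self, x):                        # x: (batch, channels, length)
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)

class ConvScaleNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=6, dropout=0.1):
        super().__init__()
        self.block1 = nn.Sequential(             # kernel 5: short-range patterns
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(2))
        self.block2 = nn.Sequential(             # kernel 7: broader patterns
            nn.Conv1d(64, 128, kernel_size=7, padding=3),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(2))
        self.ppm = PyramidPooling1D((1, 2, 4))   # fine- to coarse-grained context
        self.gap = nn.AdaptiveAvgPool1d(1)       # global context
        feat = 128 * (1 + 2 + 4) + 128           # PPM features + GAP features
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(feat, num_classes))

    def forward(self, x):                        # x: (batch, sensor axes, time)
        h = self.block2(self.block1(x))
        z = torch.cat([self.ppm(h), self.gap(h).flatten(1)], dim=1)
        return self.head(z)
```

For a WISDM-style input of shape (batch, 3, 128), the two blocks reduce the sequence to 32 steps, and the concatenated PPM and GAP features form a 1024-dimensional multiscale descriptor fed to the classifier.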

3.3. Model Aggregation

Traditional centralized learning faces challenges in a federated setting, particularly due to non-IID data distributions across clients. Hence, the proposed approach performs dynamic weighting for global model aggregation, combining the local model updates in a decentralized manner so that each client's contribution is proportional to the number of data samples it holds.
In the proposed Conv-ScaleNet based FL, the central server constructs a global model by merging local model updates, and the aggregation is weighted dynamically based on the clients’ dataset sizes to prevent models from being biased towards clients with larger datasets [34]. The global model at each iteration is computed as follows:
$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$
where $w_{t+1}^{k}$ is the local model update from client $k$, $n_k$ is the number of data samples on client $k$, and $n$ denotes the total number of data samples across all clients.
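The aggregation rule above can be implemented in a few lines. The sketch below assumes each client returns a PyTorch state_dict together with its sample count; it is a minimal illustration of the weighted FedAvg step, not the authors' exact implementation.

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted FedAvg: combine client state_dicts with weights n_k / n."""
    n = sum(client_sizes)
    return {
        key: sum((n_k / n) * state[key].float()
                 for state, n_k in zip(client_states, client_sizes))
        for key in client_states[0]
    }
```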

4. Experiments and Results

The proposed model framework was developed in the Python 3.12.2 environment. The software was executed on a Windows workstation equipped with an 11th Gen Intel Core i5-1135G7 central processing unit (CPU), an Intel Iris Xe Graphics graphical processing unit (GPU), 24 GB of random-access memory (RAM), and a 475 GB solid-state drive (SSD). Accuracy and F1-score were used to assess the performance of the proposed model:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
The F1-score quantifies the balance between precision and recall:
$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
where $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$.
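Both metrics can be computed directly from the predicted and true labels, for example with scikit-learn as in the sketch below. The weighted averaging mode for the multi-class F1-score is an assumption, since the paper does not state which averaging scheme was used.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # illustrative activity labels
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
# Weighted F1 averages per-class F1-scores by class support, a common
# choice for imbalanced multi-class HAR evaluation (assumption).
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Accuracy = {acc:.4f}, F1-score = {f1:.4f}")
```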

4.1. Model Performance Analysis

To evaluate model training performance, the training loss progression of two CNN models—the 1D CNN and the proposed Conv-ScaleNet—under the FedAvg method is illustrated in Figure 4 and Figure 5. Note that the 1D CNN architecture serves as a baseline for comparison and is trained under an FL setting using the FedAvg aggregation method. In Figure 4, the 1D CNN model begins with a high loss that rapidly decreases before gradually stabilizing at a low value. A similar trend can be observed in Figure 5 for the proposed Conv-ScaleNet. Nevertheless, Conv-ScaleNet's loss curve is smoother and exhibits fewer fluctuations, indicating more stable convergence. This suggests that our proposed model is better suited for FL scenarios where training stability is critical. Since FedAvg assigns aggregation weights proportional to each client's data size $n_k$, we also examined the effect of client imbalance. In our experiments, class imbalance was introduced by retaining only 30% of the "Walking" activity while keeping the other activity classes unchanged, thereby simulating skewed class distributions. In cases where certain clients had substantially more samples, their updates exerted greater influence on the global model, leading to slightly faster convergence but occasionally introducing bias toward majority-client data distributions. Despite this, the loss curves in Figure 4 and Figure 5 demonstrate smooth convergence, indicating that both models maintained stability under moderate levels of imbalance. Under this imbalanced setting, the 1D CNN model achieved 93.17% accuracy and a 92.20% F1-score, whereas Conv-ScaleNet attained 98.57% accuracy and a 98.53% F1-score, further confirming its superiority under federated training conditions.
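The imbalance protocol described above (retaining only 30% of the "Walking" windows) can be reproduced with a simple downsampling step; the sketch below is an illustrative reconstruction, with the random seed and array representation as assumptions.

```python
import numpy as np

def downsample_class(x, y, target_class, keep_fraction=0.3, seed=0):
    """Keep only `keep_fraction` of one activity class (e.g., 'Walking')
    while leaving all other classes untouched."""
    rng = np.random.default_rng(seed)
    cls_idx = np.where(y == target_class)[0]
    keep = rng.choice(cls_idx, size=int(keep_fraction * len(cls_idx)),
                      replace=False)
    sel = np.sort(np.concatenate([keep, np.where(y != target_class)[0]]))
    return x[sel], y[sel]
```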
Table 1 reports the Wilcoxon signed-rank test results used to evaluate whether the proposed model's accuracy across 20 runs deviated from the baseline threshold of 98.92%. The test yields a W statistic of 52.5 and a p-value of 0.383991, indicating that the observed difference is not statistically significant at the 0.05 level. Although the median accuracy is marginally higher than the baseline, the result shows that the model's performance is stable and consistently aligned with the baseline. The low variability, evidenced by tied ranks, confirms that the training process is not only robust but also highly reproducible, generating equivalent results across independent runs. This confirms that Conv-ScaleNet can be reliably and repeatedly trained to meet the high-performance standard set by the baseline results.
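The test in Table 1 corresponds to a one-sample Wilcoxon signed-rank test on the per-run deviations from the baseline, as in the sketch below; the 20 accuracy values shown are placeholders, not the actual measurements.

```python
import numpy as np
from scipy.stats import wilcoxon

baseline = 98.92  # baseline accuracy threshold (%)
# Accuracies from 20 independent runs (illustrative placeholder values).
runs = np.array([98.95, 98.90, 98.92, 98.97, 98.88, 98.93, 98.92, 98.94,
                 98.89, 98.92, 98.96, 98.91, 98.92, 98.93, 98.90, 98.95,
                 98.92, 98.91, 98.94, 98.92])

# Test whether the median of (run - baseline) differs from zero;
# zero differences (tied runs) are dropped by the default zero_method.
stat, p = wilcoxon(runs - baseline)
print(f"W = {stat:.1f}, p = {p:.6f}")
```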

4.2. Hyperparameter and Model Evaluation

Table 2, Table 3 and Table 4 summarize the results of hyperparameter tuning for Conv-ScaleNet. Table 2 shows the effect of different learning rates with the dropout fixed at 0.1 and the batch size fixed at 10. A learning rate of 0.005 achieves the best performance, with a 98.65% F1-score and 98.92% accuracy; a lower learning rate (0.001) and a higher learning rate (0.05) both degrade performance. Table 3 records the effect of dropout using a fixed learning rate of 0.005 and a batch size of 10. The results reveal that lower dropout values yield better performance, whereas a higher dropout of 0.5 significantly degrades accuracy to 96.53% due to excessive regularization. Table 4 shows the effect of different batch sizes under fixed settings of learning rate 0.005 and dropout 0.1. A batch size of 10 delivers the most stable and balanced performance, with a 98.65% F1-score and 98.92% accuracy, while a smaller batch size of 5 and a larger batch size of 15 lead to slight reductions in performance. Overall, the optimal hyperparameter configuration for Conv-ScaleNet is a learning rate of 0.005, a dropout of 0.1, and a batch size of 10. For the optimizer, Stochastic Gradient Descent (SGD) was employed, offering a simple and effective update rule while maintaining stable convergence. Model training was conducted over 500 rounds, with 10 clients participating in each round and performing one local epoch per round.
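Under these settings, one federated round can be sketched as follows. The helper reuses the fedavg aggregation sketch from Section 3.3 and assumes standard PyTorch DataLoaders per client; it is an illustrative reconstruction of the training loop, not the authors' exact code.

```python
import copy
import torch

# Tuned settings from Tables 2-4: lr = 0.005, dropout = 0.1, batch size = 10;
# SGD optimizer, 500 rounds, 10 clients, one local epoch per round.
def train_round(global_model, client_loaders, lr=0.005, local_epochs=1):
    states, sizes = [], []
    loss_fn = torch.nn.CrossEntropyLoss()
    for loader in client_loaders:
        model = copy.deepcopy(global_model)      # start from the global model
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for xb, yb in loader:                # local on-device training
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        states.append(model.state_dict())
        sizes.append(len(loader.dataset))
    global_model.load_state_dict(fedavg(states, sizes))  # weighted aggregation
    return global_model
```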
Table 5 records the performance comparison between the proposed Conv-ScaleNet and the baseline on the testing set. The baseline is a 1D CNN architecture trained under an FL setting using the FedAvg aggregation method. The results show that Conv-ScaleNet outperforms the 1D CNN in both accuracy and F1-score. Specifically, the proposed model exhibits approximately a 5% improvement in accuracy and a 6% increase in F1-score over the baseline 1D CNN. This indicates that Conv-ScaleNet classifies more instances correctly and balances recall and precision better, as reflected in the higher F1-score. The findings signify the superior ability of Conv-ScaleNet to capture complex dependencies, which are crucial for representing and discriminating activities. The incorporation of a pooling module comprising Pyramid Pooling and GAP enhances feature extraction and multiscale feature representation, effectively capturing features at different scales and yielding superior classification performance.
Table 6 presents the influence of different convolutional kernel sizes on model performance. The kernel size pair (5, 7) attains the highest F1-score of 98.65% and accuracy of 98.92%. Smaller kernel combinations, such as (4, 7) and (5, 6), exhibit lower performance, possibly because their limited receptive fields restrict the model from extracting longer-term dependencies from the inertial data. Increasing the kernel size to (5, 8) brings no further improvement; instead, the F1-score degrades slightly, as the larger kernel may introduce redundant information that acts as noise and hampers effective pattern recognition.

4.3. Performance Comparison with Existing Methods

To evaluate the efficacy of the proposed Conv-ScaleNet model for HAR in FL settings, we compare its performance against several existing methods. Table 7 presents the accuracy and F1-score comparisons between the proposed model and existing approaches. In this study, all compared methods and Conv-ScaleNet are evaluated on the same public HAR datasets (UCI-HAR and WISDM) to ensure comparability under a unified benchmark. From the empirical results, our proposed Conv-ScaleNet achieves the highest performance, with an accuracy of 98.92% and an F1-score of 98.65%. In comparison, several existing FL methods have reported lower results. For example, the Federated Personalized Random Forest (PP-FPRF), proposed by Liu et al. [35], adapts random forest classifiers for personalized FL and obtains 94.5% accuracy, while GraFeHTy, a graph neural network, achieves only 81.7% [28]. Similarly, FedCLAR [26] and its semi-supervised extension SS-FedCLAR [27], both based on simple feedforward deep neural networks, achieve F1-scores of 89% and 88%, respectively. Another semi-supervised transformer model, CapMatch, obtains an F1-score of 89.14% [36]. FedCoad, which tackles non-IID challenges through contrastive learning and adaptive control variates, adopts a 1D CNN as its base model and attains an F1-score of 73.55%, showing improvements in convergence at the expense of predictive performance [37].
Beyond federated settings, non-federated models have also demonstrated competitive accuracy, yet they lack scalability and privacy preservation. For instance, Saidani et al.'s transformer model with hybrid features achieves 97.3% accuracy [38]. WISNet, a custom one-dimensional Convolutional Neural Network (1D CNN) that integrates a Convolved Normalized Pooled (CNPM) block, an Identity and Basic Normalization (IDBN) block, and a Channel and Spatial attention (CASb) block, obtains 96.41% accuracy on WISDM and 95.66% on UCI-HAR [39]. Ye et al. proposed LBSNet, a dynamic quantization framework with activity-aware bitwidth allocation, achieving 96.27% and 96.24% accuracy on WISDM and UCI-HAR, respectively [40]. Mekruksavanich and Jitpattanakul proposed a hierarchical CNN–LSTM ensemble for activity recognition and user identification, addressing privacy concerns in centralized environments [41]. Likewise, Akter et al. presented a CNN-based HAR model using CBAM attention with spectrogram-based features, achieving 93.48% accuracy [42].
The high performance of the proposed model, i.e., accuracy of 98.92% and F1-score of 98.65% on WISDM, as well as 96.41% accuracy and a 96.41% F1-score on UCI-HAR, can be attributed to the model’s ability to capture features at multiple temporal scales. The Pyramid Pooling Module combines fine-grained and global patterns, which is important because human activity signals contain both short, sharp movements and longer, smoother trends. Additionally, the integration of Global Average Pooling further enhances contextual understanding without adding significant complexity. Compared to fixed-scale pooling, this multiscale approach helps the network remain robust to variations in activity duration and intensity across subjects. Together, these design choices allow the model to extract rich, discriminative features that align well with the nature of wearable sensor signals, leading to the observed strong performance.
In conclusion, the empirical results exhibit the superiority of the proposed Conv-ScaleNet at classification performance while preserving data privacy in federated settings. Nevertheless, beyond model performance, another critical consideration in federated HAR environments is the robustness of the federated frameworks against poisoning or adversarial attacks. Robust aggregation strategies are significant to mitigate these threats [43,44,45]. Incorporating robust aggregation with Conv-ScaleNet is an important direction for future research to ensure both classification performance and security for real-world deployments.

5. Conclusions

HAR possesses great potential for real-world applications. Nevertheless, achieving high recognition performance while preserving user privacy poses a challenge due to centralized data collection. In this paper, we devise an FL-driven model for human activity recognition, coined Conv-ScaleNet. The proposed model enhances multiscale feature extraction while maintaining user data privacy. To facilitate multiscale feature extraction, a dynamic pooling module is implemented by integrating Pyramid Pooling and Global Average Pooling to produce enhanced feature representation. In the experiments, we verify the applicability of FL for HAR by training local models directly on devices and sharing only the generated model updates. The obtained experimental results demonstrate that the proposed Conv-ScaleNet achieves approximately 98% and 96% F1-scores on the WISDM and UCI-HAR datasets, respectively. These findings signify that the proposed model can capture multiscale feature dependencies for effective feature learning in HAR tasks, while FL ensures privacy-preserving model learning.

Author Contributions

Conceptualization, X.W.T., Y.H.P., S.Y.O., and Z.Y.L.; methodology, X.W.T., Y.H.P., and F.S.H.; writing—original draft preparation, X.W.T.; writing—review and editing, X.W.T. and Y.H.P.; visualization, X.W.T. and F.S.H.; supervision, Y.H.P., S.Y.O., and Z.Y.L.; project administration, Y.H.P. and S.Y.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The following supporting information can be downloaded at [32] https://www.cis.fordham.edu/wisdm/dataset.php (accessed 2 June 2025), and [33] https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones (accessed 2 June 2025).

Acknowledgments

The authors acknowledge the editors and reviewers for their valuable comments and suggestions.

Conflicts of Interest

Author Fu San Hiew was employed by Infineon Technologies Sdn. Bhd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HAR	Human Activity Recognition
CNN	Convolutional Neural Network
DL	Deep Learning
FL	Federated Learning

References

  1. De-La-Hoz-Franco, E.; Ariza-Colpas, P.; Quero, J.M.; Espinilla, M. Sensor-Based Datasets for Human Activity Recognition—A Systematic Review of Literature. IEEE Access 2018, 6, 59192–59210. [Google Scholar] [CrossRef]
  2. Shiranthika, C.; Premakumara, N.; Chiu, H.L.; Samani, H.; Shyalika, C.; Yang, C.Y. Human Activity Recognition Using CNN & LSTM. In Proceedings of the ICITR 2020-5th International Conference on Information Technology Research: Towards the New Digital Enlightenment, Online, 2–4 December 2020; Institute of Electrical and Electronics Engineers Inc.: New York City, NY, USA, 2020. [Google Scholar] [CrossRef]
  3. Souza, W.T.D.; Rajamohan, K. Human Activity Recognition Using Accelerometer and Gyroscope Sensors. Int. J. Eng. Technol. 2017, 9, 1171–1179. [Google Scholar] [CrossRef]
  4. Dayakaran, D.; Kadiresan, N. Federated Learning Framework for Human Activity Recognition Using Smartphones. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2024; Volume 235, pp. 2069–2078. [Google Scholar] [CrossRef]
  5. Deng, L.; Yu, D. Deep learning: Methods and applications. Found. Trends Signal Process 2013, 7, 197–387. [Google Scholar] [CrossRef]
  6. Dong, S.; Wang, P.; Abbas, K. A Survey on Deep Learning and Its Applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
  7. Gupta, S. Deep learning based human activity recognition (HAR) using wearable sensor data. Int. J. Inf. Manag. Data Insights 2021, 1, 100046. [Google Scholar] [CrossRef]
  8. Abdulrahman, S.; Tout, H.; Ould-Slimane, H.; Mourad, A.; Talhi, C.; Guizani, M. A survey on federated learning: The journey from centralized to distributed on-site learning and beyond. IEEE Internet Things J. 2021, 8, 5476–5497. [Google Scholar] [CrossRef]
  9. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.G. Data collection and quality challenges in deep learning: A data-centric AI perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  10. Khan, A.R.; Manzoor, H.U.; Ayaz, F.; Imran, M.A.; Zoha, A. A Privacy and Energy-Aware Federated Framework for Human Activity Recognition. Sensors 2023, 23, 9339. [Google Scholar] [CrossRef]
  11. Liu, Y.; Kang, Y.; Zou, T.; Pu, Y.; He, Y.; Ye, X.; Ouyang, Y.; Zhang, Y.Q.; Yang, Q. Vertical Federated Learning: Concepts, Advances, and Challenges. IEEE Trans. Knowl. Data Eng. 2024, 36, 3615–3634. [Google Scholar] [CrossRef]
  12. Zhang, J.; Cheng, X.; Wang, C.; Wang, Y.; Shi, Z.; Jin, J.; Song, A.; Zhao, W.; Wen, L.; Zhang, T. FedAda: Fast-convergent adaptive federated learning in heterogeneous mobile edge computing environment. World Wide Web 2022, 25, 1971–1998. [Google Scholar] [CrossRef]
  13. Guendouzi, B.S.; Ouchani, S.; EL Assaad, H.; EL Zaher, M. A systematic review of federated learning: Challenges, aggregation methods, and development tools. J. Netw. Comput. Appl. 2023, 220, 103714. [Google Scholar] [CrossRef]
  14. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal Deep Learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML’11), Bellevue, WA, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 689–696. [Google Scholar]
  15. Sun, Y.; Wang, X.; Tang, X. Deep Learning Face Representation from Predicting 10,000 Classes. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1891–1898. [Google Scholar]
  16. Bengio, Y. Deep Learning of Representations for Unsupervised and Transfer Learning. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop (UTLW’11), Ft. Lauderdale, FL, USA, 11–12 June 2011; JMLR.org: Cambridge, MA, USA, 2011; Volume 27, pp. 17–37. [Google Scholar]
  17. Le, Q.V.; Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; Ng, A.Y. On Optimization Methods for Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML’11), Bellevue, WA, USA, 28 June–2 July 2011; Omnipress: Madison, WI, USA, 2011; pp. 265–272. [Google Scholar]
  18. Meng, L.; Jiang, X.; Liu, X.; Fan, J.; Ren, H.; Guo, Y.; Diao, H.; Wang, Z.; Chen, C.; Dai, C.; et al. User-Tailored Hand Gesture Recognition System for Wearable Prosthesis and Armband Based on Surface Electromyogram. IEEE Trans. Instrum. Meas. 2022, 71, 1–16. [Google Scholar] [CrossRef]
  19. Meng, L.; Zhang, A.; Chen, C.; Wang, X.; Jiang, X.; Tao, L.; Fan, J.; Wu, X.; Dai, C.; Zhang, Y.; et al. Exploration of Human Activity Recognition Using a Single Sensor for Stroke Survivors and Able-Bodied People. Sensors 2021, 21, 799. [Google Scholar] [CrossRef]
  20. Abdel-Basset, M.; Hawash, H.; Chakrabortty, R.K.; Ryan, M.; Elhoseny, M.; Song, H. ST-DeepHAR: Deep Learning Model for Human Activity Recognition in IoHT Applications. IEEE Internet Things J. 2021, 8, 4969–4979. [Google Scholar] [CrossRef]
  21. Bashar, S.K.; Al Fahim, A.; Chon, K.H. Smartphone Based Human Activity Recognition with Feature Selection and Dense Neural Network. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 20–24 July 2020; pp. 5888–5891. [Google Scholar]
  22. Raja Sekaran, S.; Han, P.Y.; Yin, O.S. Smartphone-Based Human Activity Recognition Using Lightweight Multiheaded Temporal Convolutional Network. Expert. Syst. Appl. 2023, 227, 120132. [Google Scholar] [CrossRef]
  23. Kumar, P.; Suresh, S. Deep-HAR: An ensemble deep learning model for recognizing the simple, complex, and heterogeneous human activities. Multimed. Tools Appl. 2023, 82, 30435–30462. [Google Scholar] [CrossRef]
  24. Hoai Thu, N.T.; Han, D.S. HiHAR: A Hierarchical Hybrid Deep Learning Architecture for Wearable Sensor-Based Human Activity Recognition. IEEE Access 2021, 9, 145271–145281. [Google Scholar] [CrossRef]
  25. Ouyang, X.; Xie, Z.; Zhou, J.; Huang, J.; Xing, G. ClusterFL: A similarity-aware federated learning system for human activity recognition. In Proceedings of the MobiSys 2021, 19th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 24 June–2 July 2021; Association for Computing Machinery, Inc.: New York, NY, USA, 2021; pp. 54–66. [Google Scholar] [CrossRef]
  26. Presotto, R.; Civitarese, G.; Bettini, C. FedCLAR: Federated Clustering for Personalized Sensor-Based Human Activity Recognition. In Proceedings of the 2022 IEEE International Conference on Pervasive Computing and Communications, PerCom 2022, Pisa, Italy, 21–25 March 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 227–236. [Google Scholar] [CrossRef]
  27. Presotto, R.; Civitarese, G.; Bettini, C. Federated Clustering and Semi-Supervised learning: A new partnership for personalized Human Activity Recognition. Pervasive Mob Comput. 2022, 88, 101726. [Google Scholar] [CrossRef]
  28. Sarkar, A.; Sen, T.; Roy, A.K. GraFeHTy: Graph Neural Network using Federated Learning for Human Activity Recognition. In Proceedings of the 20th IEEE International Conference on Machine Learning and Applications, ICMLA 2021, Virtual, 13–15 December 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 1124–1129. [Google Scholar] [CrossRef]
  29. Xiao, Z.; Xu, X.; Xing, H.; Song, F.; Wang, X.; Zhao, B. A Federated Learning System with Enhanced Feature Extraction for Human Activity Recognition. Knowledge-Based Systems 2021, 229, 107338. [Google Scholar] [CrossRef]
  30. Yussif, S.B.; Xie, N.; Yang, Y.; Huang, Y.; Wang, G.; Du, Z. EGTCN: An Efficient Graph and Temporal Convolution Network for Sensor-Based Human Activity Recognition in Federated Learning. IEEE Sens. J. 2024, 24, 34892–34906. [Google Scholar] [CrossRef]
  31. Shen, Q.; Feng, H.; Song, R.; Teso, S.; Giunchiglia, F.; Xu, H. Federated Multi-Task Attention for Cross-Individual Human Activity Recognition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 3423–3429. [Google Scholar]
  32. Weiss, G.M. WISDM Smartphone and Smartwatch Activity and Biometrics Dataset. UCI Machine Learning Repository. 2019. Available online: https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset (accessed on 2 June 2025).
  33. Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A Public Domain Dataset for Human Activity Recognition using Smartphones. In Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 24–26 April 2013. [Google Scholar]
  34. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017; Singh, A., Zhu, J., Eds.; PMLR: Cambridge, MA, USA, 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  35. Liu, S.; Wang, J.; Zhang, W. Federated personalized random forest for human activity recognition. Math. Biosci. Eng. 2022, 19, 953–971. [Google Scholar] [CrossRef] [PubMed]
  36. Xiao, Z.; Tong, H.; Qu, R.; Xing, H.; Luo, S.; Zhu, Z.; Song, F.; Feng, L. CapMatch: Semi-Supervised Contrastive Transformer Capsule with Feature-Based Knowledge Distillation for Human Activity Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 2690–2704. [Google Scholar] [CrossRef]
  37. Iwan, I.; Yahya, B.N.; Lee, S.-L. Federated Model with Contrastive Learning and Adaptive Control Variates for Human Activity Recognition. Front. Inf. Technol. Electron. Eng. 2025, 26, 896–911. [Google Scholar] [CrossRef]
  38. Saidani, O.; Alsafyani, M.; Alroobaea, R.; Alturki, N.; Jahangir, R.; Jamel, L. An Efficient Human Activity Recognition Using Hybrid Features and Transformer Model. IEEE Access 2023, 11, 101373–101386. [Google Scholar] [CrossRef]
  39. Sharen, H.; Jani Anbarasi, L.; Rukmani, P.; Gandomi, A.H.; Neeraja, R.; Narendra, M. WISNet: A deep neural network based human activity recognition system. Expert. Syst. Appl. 2024, 258, 124999. [Google Scholar] [CrossRef]
  40. Ye, N.; Zhang, L.; Cheng, D.; Bu, C.; Sun, S.; Wu, H.; Song, A. Dynamic instance-aware layer-bit-select network on human activity recognition using wearable sensors. Eng. Appl. Artif. Intell. 2024, 137, 109260. [Google Scholar] [CrossRef]
  41. Mekruksavanich, S.; Jitpattanakul, A. Biometric User Identification Based on Human Activity Recognition Using Wearable Sensors: An Experiment Using Deep Learning Models. Electronics 2021, 10, 308. [Google Scholar] [CrossRef]
  42. Akter, M.; Ansary, S.; Khan, A.-M.; Kim, D. Human Activity Recognition Using Attention-Mechanism-Based Deep Learning Feature Combination. Sensors 2023, 23, 5715. [Google Scholar] [CrossRef]
  43. Pillutla, K.; Kakade, S.M.; Harchaoui, Z. Robust Aggregation for Federated Learning. IEEE Trans. Signal Process. 2022, 70, 1142–1154. [Google Scholar] [CrossRef]
  44. Nabavirazavi, S.; Taheri, R.; Shojafar, M.; Iyengar, S.S. Impact of Aggregation Function Randomization against Model Poisoning in Federated Learning. In Proceedings of the 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, UK, 1 November 2023; pp. 165–172. [Google Scholar]
  45. Taheri, R.; Arabikhan, F.; Gegov, A.; Akbari, N. Robust Aggregation Function in Federated Learning. In Proceedings of the Advances in Information Systems, Artificial Intelligence and Knowledge Management; Saad, I., Rosenthal-Sabroux, C., Gargouri, F., Chakhar, S., Williams, N., Haig, E., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 168–175. [Google Scholar]
Figure 1. FL framework for human activity recognition.
Figure 2. The overview of the proposed HAR FL framework with Conv-ScaleNet.
Figure 3. The proposed Conv-ScaleNet architecture.
Figure 4. Train loss for 1D CNN.
Figure 5. Train loss for Conv-ScaleNet.
Table 1. Wilcoxon signed-rank test analysis for Conv-ScaleNet accuracy and F1-score against the baseline.

| Metric   | Wilcoxon W Value | p-Value  |
|----------|------------------|----------|
| Accuracy | 52.5             | 0.383991 |
| F1-score | 52.5             | 0.383991 |
Table 2. Effect of learning rate on Conv-ScaleNet.

| Learning Rate | Dropout | Batch Size | F1-Score (%) | Accuracy (%) |
|---------------|---------|------------|--------------|--------------|
| 0.001         | 0.1     | 10         | 97.70        | 97.96        |
| 0.005         | 0.1     | 10         | 98.65        | 98.92        |
| 0.05          | 0.1     | 10         | 94.89        | 96.53        |
Table 3. Effect of dropout on Conv-ScaleNet.

| Learning Rate | Dropout | Batch Size | F1-Score (%) | Accuracy (%) |
|---------------|---------|------------|--------------|--------------|
| 0.005         | 0.5     | 10         | 95.52        | 96.53        |
| 0.005         | 0.3     | 10         | 98.65        | 98.92        |
| 0.005         | 0.1     | 10         | 98.65        | 98.92        |
Table 4. Effect of batch size on Conv-ScaleNet.

| Learning Rate | Dropout | Batch Size | F1-Score (%) | Accuracy (%) |
|---------------|---------|------------|--------------|--------------|
| 0.005         | 0.1     | 5          | 98.20        | 98.56        |
| 0.005         | 0.1     | 10         | 98.65        | 98.92        |
| 0.005         | 0.1     | 15         | 98.66        | 98.80        |
Table 5. Performance comparison between the proposed Conv-ScaleNet and baseline 1D CNN.

| Methods       | F1-Score (%) | Accuracy (%) |
|---------------|--------------|--------------|
| 1D CNN        | 92.83        | 94.01        |
| Conv-ScaleNet | 98.65        | 98.92        |
Table 6. Classification performance across convolutional kernel sizes.

| Kernel Size | F1-Score (%) | Accuracy (%) |
|-------------|--------------|--------------|
| (4, 7)      | 98.28        | 98.56        |
| (5, 6)      | 98.21        | 98.56        |
| (5, 7)      | 98.65        | 98.92        |
| (5, 8)      | 98.28        | 98.44        |
Table 7. Performance comparison between the proposed Conv-ScaleNet and other methods.

WISDM

| Systems                                | F1-Score (%) | Accuracy (%) |
|----------------------------------------|--------------|--------------|
| PP-FPRF * [35]                         | -            | 94.5         |
| GraFeHTy * [28]                        | -            | 81.7         |
| FedCLAR * [26]                         | 89           | -            |
| SS-FedCLAR * [27]                      | 88           | -            |
| CapMatch * [36]                        | 89.14        | -            |
| FedCoad * [37]                         | 73.55        | 76.67        |
| Hybrid features and Transformer * [38] | -            | 97.3         |
| WISNet * [39]                          | -            | 96.41        |
| LBSNet * [40]                          | -            | 96.27        |
| Conv-ScaleNet                          | 98.65        | 98.92        |

UCI-HAR

| Systems              | F1-Score (%) | Accuracy (%) |
|----------------------|--------------|--------------|
| WISNet * [39]        | -            | 95.66        |
| LBSNet * [40]        | -            | 96.24        |
| Biometric HAR * [41] | 90.10        | 91.78        |
| AM-DLFC * [42]       | -            | 93.48        |
| Conv-ScaleNet        | 96.41        | 96.41        |

* extracted from the original papers.

