Abstract
This paper proposes a robust convolutional neural network (CNN) architecture for human activity recognition (HAR) using smartphone accelerometer data, evaluated on the WISDM dataset. We introduce two novel pooling mechanisms—Pooling A (Extrema Contrast Pooling (ECP)) and Pooling B (Center Minus Variation (CMV))—that enhance feature discrimination and noise robustness. ECP emphasizes sharp signal transitions through a nonlinear penalty based on the squared range between extrema, while CMV Pooling penalizes local variability by subtracting the standard deviation, improving resilience to noise. Input data are normalized to the [0, 1] range to ensure bounded and interpretable pooled outputs. The proposed framework is evaluated in two separate configurations: (1) a 1D CNN applied to raw tri-axial sensor streams with the proposed pooling layers, and (2) a histogram-based image encoding pipeline that transforms segment-level sensor redundancy into RGB representations for a 2D CNN with fully connected layers. Ablation studies show that histogram encoding provides the largest improvement, while the combination of ECP and CMV further enhances classification performance. Across six activity classes, the 2D CNN system achieves up to 96.84% weighted classification accuracy, outperforming baseline models and traditional average pooling. Under Gaussian, salt-and-pepper, and mixed noise conditions, the proposed pooling layers consistently reduce performance degradation, demonstrating improved stability in real-world sensing environments. These results highlight the benefits of redundancy-aware pooling and histogram-based representations for accurate and robust mobile HAR systems.
1. Introduction
Human Activity Recognition (HAR) has attracted significant attention in recent years due to the widespread adoption of smartphones and wearable devices equipped with inertial sensors. These embedded accelerometers continuously generate rich time-series data that enable the recognition of human activities, supporting a wide range of applications such as health monitoring, elder care, fitness tracking, and human–computer interaction [1,2].
Traditional HAR approaches have largely relied on handcrafted statistical features or conventional deep learning models, particularly convolutional neural networks (CNNs). However, real-world deployment of HAR systems remains challenging due to sensor noise, device orientation variability, and inter-user differences, all of which can substantially degrade classification performance [3,4]. To address these issues, recent studies have explored hybrid deep learning models, time–frequency representations, and attention-based mechanisms to improve feature robustness and discriminative power [5,6,7]. Despite these advances, most CNN-based HAR frameworks still rely on standard pooling operations, such as max or average pooling, which do not explicitly model uncertainty, noise, or signal variability. Moreover, conventional processing of raw sensor data often overlooks the statistical redundancy inherent in high-frequency and multi-axis measurements [8,9].
In this work, we focus on accelerometer-based HAR while maintaining a sensor-agnostic design philosophy. We propose a robust and redundancy-aware HAR framework that introduces two novel pooling operations—Pooling A and Pooling B—designed to capture meaningful statistical characteristics of sensor signals. Pooling A emphasizes extrema-based contrast to preserve transient motion patterns, while Pooling B incorporates variance-aware aggregation to suppress noise and reduce the influence of local fluctuations. These pooling mechanisms enhance feature robustness while remaining computationally lightweight, making them suitable for deployment on edge and embedded devices operating at typical HAR sampling rates (50–100 Hz).
To further enrich the learned representations, the proposed framework integrates histogram-based encoding and window-level statistical descriptors. This strategy enables a redundancy-aware representation of sensor data, capturing distributional characteristics without incurring significant computational overhead. As a result, the proposed approach improves classification performance while remaining efficient enough for real-time applications.
The proposed pooling strategies are integrated into two CNN-based architectures: (1) a 1D CNN that operates directly on normalized raw tri-axial accelerometer signals, and (2) a 2D CNN that processes RGB images derived from histogram-based encodings of the sensor data. This histogram-to-image representation captures intra-window distributional characteristics across sensor axes, enabling effective spatial feature learning and improved robustness to noise through structured visual patterns. The proposed framework is evaluated using the publicly available WISDM dataset [1], which includes labeled accelerometer recordings for six common daily activities. Experimental results demonstrate that the proposed approach yields notable improvements in classification accuracy and robustness, particularly under synthetic noise conditions designed to mimic real-world signal variability.
The remainder of this paper is organized as follows. Section 2 reviews related work in sensor-based human activity recognition and pooling strategies. Section 3 introduces the dataset and provides preliminary analytical insights. Section 4 presents the proposed methodology, including the CNN architectures, pooling mechanisms, and evaluation metrics. Section 5 reports experimental results, including ablation studies and computational cost analysis under both clean and noisy scenarios. Finally, Section 6 concludes the paper and discusses future research directions, including cross-dataset generalization, multimodal sensor fusion (e.g., gyroscope and magnetometer), and deployment on resource-constrained edge devices.
2. Related Work
Recent advances in sensor-based HAR have increasingly emphasized robustness to environmental noise and signal variability, especially within the context of real-world IoT applications. Numerous studies have explored a range of deep architectures, including CNNs, hybrid CNN-RNN models, and attention-enhanced frameworks, aimed at improving classification accuracy under diverse conditions. For instance, Genc et al. [10] proposed a fine-tuned CNN–LSTM model with small kernels and average pooling to capture detailed temporal–spatial features and Q. Zhang et al. [11] proposed a lightweight wearable HAR system using 1D CNNs, demonstrating the practical advantages of streamlined architectures. Likewise, Liu et al. [12] introduced the SETransformer, which integrates squeeze-and-excitation modules with transformer-based temporal modeling and learnable pooling for enhanced generalization under signal noise.
A growing body of work has addressed the challenge of noisy or uncertain input by embedding attention [13], adaptive pooling [14], or embedding-based denoising techniques [15]. These approaches demonstrate improved resilience to real-world sensor drift and user variance. However, most rely on standard max or average pooling, which are not explicitly designed to handle statistical outliers or preserve local distributional information critical in noisy HAR scenarios.
Within the WISDM dataset domain specifically, Kobayashi et al. [16] introduced MarNASNets, a family of CNNs optimized for wearable sensor efficiency. Sharen et al. [6] proposed WISNet, a deep CNN trained directly on raw sensor signals, achieving promising accuracy. Similarly, Abdellatef et al. [7] explored multi-layer CNNs but did not incorporate pooling strategies tailored to uncertainty. While Seelwal and Srinivas [4], and Min et al. [3] compared traditional ML classifiers on WISDM, none addressed redundancy modeling or custom pooling. Importantly, none of these studies explicitly evaluated model robustness under synthetic noise injection.
Handling sensor noise remains a critical challenge in HAR, as accelerometer data often suffers from measurement uncertainty, environmental interference, and sampling inconsistencies. Various studies have attempted to mitigate the effects of noise particularly Gaussian, salt-and-pepper, and speckle noise using robust modeling and preprocessing techniques. For instance, Chu et al. [17] employed an adaptive wavelet denoising approach to filter Gaussian noise in activity datasets, showing significant gains in classification accuracy. In a related study, Ma, X., & Liu, S. [18] examined the impact of salt-and-pepper noise and applied median filtering prior to LSTM classification to enhance signal integrity. A hybrid autoencoder–CNN model was proposed by Han et al. [19] to extract denoised latent features from sensor data distorted by speckle and additive noise. Furthermore, robust pooling strategies have been explored to counteract outliers, such as in the work of Tang et al. [20], where noise-resilient feature selection improved activity segmentation performance. Notably, Negri et al. [21] integrated uncertainty modeling into the HAR pipeline, emphasizing the role of noisy label correction and dropout-based regularization in deep networks. Despite these efforts, few studies have directly modeled statistical variability within the pooling layer or used custom transformations to encode intra-window redundancy. Beyond max and average pooling, several alternative pooling strategies have been introduced to improve robustness and generalization. Stochastic pooling selects activations within a pooling region according to a multinomial distribution proportional to their magnitude, thereby introducing randomness that reduces overfitting and enhances noise tolerance in CNNs [22]. Fractional max-pooling, on the other hand, applies pooling with non-integer strides, preserving more fine-grained structural information and avoiding excessive signal smoothing [23]. While both methods have shown benefits in image recognition tasks, their reliance on randomness and irregular pooling regions makes them less directly interpretable and harder to constrain within bounded ranges. By contrast, our proposed pooling mechanisms are deterministic, normalization-compatible, and tailored to preserve transient temporal features in sensor data, making them more suitable for HAR under noisy conditions. Our method addresses this gap by introducing range- and deviation-sensitive pooling layers, explicitly tested under multiple synthetic noise types to evaluate stability and generalization.
Noise-aware modeling, as seen in recent works on uncertainty quantification [24,25,26], suggests that defuzzification and uncertainty avoidance strategies can play a critical role in signal segmentation and classification. Such ideas have been explored in fuzzy image segmentation [25,27,28] and recommender systems [29,30,31], often within the IoT ecosystem [32]. These studies highlight the importance of stability in data-driven systems operating in uncertain environments, a principle echoed in our custom pooling designs.
Pooling mechanisms and multi-scale feature aggregation play a critical role in improving the robustness and representational capacity of convolutional neural networks across diverse application domains. Xu et al. [24] proposed a multi-scale CNN combined with attention mechanisms for remaining useful life prediction using multi-sensor signals, demonstrating that hierarchical feature extraction and adaptive weighting can significantly enhance robustness to signal variability. Several studies have specifically examined the impact of pooling strategies on CNN performance; Zafar et al. [25] conducted a comprehensive comparison of common pooling methods, highlighting trade-offs between information preservation and computational efficiency. Beyond conventional max and average pooling, task-specific pooling layers have been introduced to capture richer statistical and structural information, such as the novel pooling layer proposed by Mohamed et al. [26] for thermographic breast cancer analysis. Attention-based pooling approaches have also shown effectiveness in handling complex spatial patterns, as demonstrated by Bi et al. [33] through multi-scale stacking attention pooling for remote sensing scene classification. Comprehensive surveys further emphasize that pooling design significantly influences model robustness, generalization, and sensitivity to noise, particularly in medical and signal-based applications [34]. While these works underscore the importance of pooling and attention mechanisms, most are designed for image-based or industrial signals and do not explicitly address redundancy-aware pooling for noisy mobile sensor time-series data, which is the focus of the present study.
Robust activity recognition under noisy sensing conditions has received increasing attention across multiple modalities, including audio-visual and wearable sensor data. Noise-tolerant learning strategies have been explored beyond inertial sensing; for example, Han et al. [35] proposed a noise-resilient framework for audio-visual action recognition that explicitly mitigates corrupted modalities, demonstrating the importance of redundancy-aware representations. Similarly, recent work on background-noise-aware modeling in speech processing highlights the benefits of explicitly accounting for noise characteristics during feature transformation and inference [36]. In the context of sensor-based HAR, Miah et al. [37] introduced a multi-stream architecture with time-varying features and attention-based dimensionality reduction to enhance robustness against sensor variability. Choudhury and Soni [38] addressed uncontrolled environments by combining CNNs and LSTMs with adaptive batch sizing to improve stability under noisy and non-stationary data. Comprehensive reviews further emphasize that noise, sensor displacement, and redundancy remain central challenges in HAR using smart devices, calling for architectures that can robustly aggregate information across time and feature dimensions [39]. While these approaches improve robustness through multimodal fusion, temporal modeling, or attention mechanisms, they largely rely on conventional pooling strategies. In contrast, our work focuses on redundancy-aware CNN representations and novel pooling mechanisms explicitly designed to stabilize feature aggregation under noisy mobile sensor conditions.
Our work is distinguished by the integration of redundancy-aware histogram-based encoding and statistically grounded pooling layers. Pooling A penalizes wide extrema ranges, while Pooling B subtracts intra-window standard deviation enhancing both feature contrast and noise suppression. The transformation of tri-axial accelerometer data into RGB histograms further enables the use of 2D CNNs to capture local spatial dependencies.
To the best of our knowledge, no previous study has combined statistical feature encoding with custom pooling mechanisms within a unified CNN framework for HAR, particularly not with explicit evaluation under various synthetic noise scenarios.
3. Dataset Description
This study uses the Wireless Sensor Data Mining (WISDM) dataset, introduced by Kwapisz et al. [1], which has become a foundational benchmark for mobile sensor-based human activity recognition. The dataset was collected as part of the WISDM Lab’s early work on activity classification using smartphone accelerometers. It captures raw time-series data from users performing physical activities while carrying a consumer-grade mobile phone.
3.1. Data Collection Methodology
The WISDM dataset was constructed using an Android-based smartphone equipped with a built-in 3-axis accelerometer. Participants were instructed to carry the device in their front pants pocket while performing a sequence of predefined physical activities. The phone continuously recorded x, y, and z acceleration values at a frequency of 20 Hz, which was considered sufficient for activity pattern recognition at the time. Six types of physical activities were targeted: Walking, Jogging, Sitting, Standing, Upsatirs (ascending staris), and Downstairs (descending stairs).
Data was collected from 36 subjects across multiple sessions, yielding over 1 million labeled accelerometer readings. Each reading in the dataset is a timestamped sample consisting of a user ID, activity label, and acceleration values along the three spatial axes.
The raw data was segmented into 10-s windows, equivalent to 200 samples per window, assuming a nominal 20 Hz sampling rate. Labels were assigned per window based on the majority activity performed within that period. Technical specifications are shown in Table 1; user-activity frequencies and cumulative statistics regarding data are shown in Table 2 and Table 3, respectively.
Table 1.
Technical specifications.
Table 2.
User-activity frequency table from the WISDM dataset.
Table 3.
Cumulative statistics regarding dataset.
This study is limited by its reliance on the WISDM dataset, which lacks transitional complexity and may constrain generalization. Future validation on richer datasets (e.g., UCI-HAR, PAMAP2, MobiAct) is needed. Moreover, only accelerometer data were used, whereas incorporating multimodal signals such as gyroscopes and magnetometers could enhance robustness for subtle activities. Although Pooling A and B remain lightweight, their added computations (range and standard deviation) should be benchmarked for latency and energy efficiency on mobile and edge devices. Finally, extending the approach beyond HAR to domains like speech, ECG, and industrial IoT offers promising avenues for broader applicability.
3.2. Necessary Preprocessing for Human Activity Recognition (HAR) Models
The preprocessing of HAR datasets plays a critical role in ensuring reliable and generalizable model performance. Based on an in-depth analysis of the WISDM dataset, we identify and address three major challenges:
Class Imbalance: Observation: Activities such as Walking and Jogging are significantly overrepresented, while sitting and standing appear less frequently. This leads to a skewed distribution across class labels.
Implication: A classifier trained on such data may become biased towards majority classes, resulting in poor recognition of minority activities.
Applied Solution: Class weighting in the loss function (e.g., CrossEntropyLoss (weight = class_weights) in PyTorch 2.9.0) is set to penalize misclassification of minority classes more heavily.
User-Specific Behavior Variability: Observation: The execution of activities varies across users due to physiological differences, movement styles, and contextual factors.
Implication: Generic models trained on aggregate data may underperform on unseen users, particularly when activity dynamics differ.
Applied Solution: User-aware modeling is used, including user ID as a feature.
Dataset Strengths: Naturalistic usage: Subjects carried phones in a natural way (in their pockets), making data more reflective of real-world use compared to sensor attachments. Simplicity and scale: The dataset’s straightforward design and scale make it highly accessible for baseline and deep learning studies. Broad influence: It has been widely cited and used in both traditional machine learning and deep learning HAR research [1].
Limitations and Challenges: Despite its historical importance, the WISDM dataset has several critical limitations that modern HAR systems must address:
Sampling Inconsistency: Although the nominal sampling rate is 20 Hz, the actual rate fluctuates due to background processes on the smartphone OS. This leads to irregular time intervals between samples, violating assumptions of uniform sampling often required by deep learning models.
Missing Data: Some windows contain incomplete samples or abnormal interruptions due to phone behavior (e.g., temporary suspension, buffering delays). These missing segments are not explicitly annotated, which can lead to misleading training patterns, inaccurate temporal modeling, and noise amplification in CNN filters.
Windowing Strategy: While the WISDM dataset originally employs non-overlapping 10s windows—potentially missing activity transitions and reducing temporal continuity—we additionally evaluated overlapping windows with a 2 s (20%) overlap, similar to strategies used in PAMAP2 and UCI-HAR [40], to enrich training samples and better capture transitional dynamics.
Fixed Device Position: All participants carried the phone in the same front pocket location. This eliminates variability in device orientation or placement (e.g., backpack, armband), limiting the dataset’s generalizability to other phone-carrying configurations.
Lack of Gyroscopic Data: Only accelerometer readings are recorded. No gyroscope, magnetometer, or GPS data are included. This restricts feature richness and sensor fusion opportunities.
Labeling Bias: Labels are assigned per window based on majority rule, which can misclassify transitional windows where users switch from one activity to another.
Demographic Homogeneity: Although 36 users participated, demographic information (e.g., age, gender, fitness level) was not published. This raises concerns about user diversity and generalization across populations.
3.3. Relevance to This Study
Despite its limitations, the WISDM dataset is ideally suited to evaluating redundancy-aware pooling and histogram-to-image encoding methods. Each activity is sampled with high temporal redundancy due to the 20 Hz rate and continuous motion. This allows effective aggregation of intra-window patterns and facilitates the histogram-based RGB image construction used in our proposed 2DCNN path. As illustrated in Figure 1, the raw accelerometer readings along the X, Y, and Z axes exhibit distinct temporal patterns corresponding to different physical movements.
Figure 1.
Raw accelerometer signal curves along the X, Y, and Z axes.
Furthermore, the dataset’s lack of noise annotation makes it an ideal candidate for synthetic noise injection experiments, as performed in this work (Gaussian, salt-and-pepper, and mixed noise). This allows us to benchmark the noise robustness of traditional vs. proposed pooling strategies under controlled yet realistic sensor uncertainty.
4. Methodology
Our proposed system comprises two parallel processing branches: one based on raw time-series accelerometer data and the other on histogram-based RGB image representations. Both employ enhanced CNN architectures equipped with specialized pooling mechanisms to improve robustness and discrimination in human activity recognition (HAR). Figure 2 illustrates the overall methodology, including preprocessing, classification, and noise analysis components.
Figure 2.
Overview of the proposed HAR methodology.
4.1. Pooling Mechanisms
To replace conventional max or average pooling, we introduce two novel pooling strategies:
Pooling A for array X is defined as
Pooling B for array X is defined as
Pooling A computes the average of extrema and penalizes the range nonlinearly, emphasizing sharp transitions and making it ideal for dynamic activities.
Pooling B subtracts the standard deviation from the extrema average, promoting noise resilience through more stable aggregation.
Pooling A (Extrema Contrast Pooling (ECP)) and Pooling B (Center Minus Variation (CMV)) are named to reflect the core operations they perform. ECP captures the midpoint between the maximum and minimum values while penalizing large spreads via the squared range, emphasizing contrast between extrema. CMV, on the other hand, centers on the average of the extreme values and adjusts it by the standard deviation, effectively accounting for the overall variation in the data. Together, these names provide intuitive insight into the functional behavior of each pooling method.
There is an important prerequisite for applying both proposed pooling mechanisms: the input data must be normalized to the [0, 1] range. This normalization ensures consistent numerical behavior and guarantees that the outputs of both Pooling A and Pooling B remain bounded within [0, 1]. Specifically, normalization ensures that the extrema-based average () lies within [0, 1], and the standard deviation std(x) stays within [0, 0.5]. For Pooling B, which subtracts std(x) from the extrema average, this prevents the pooled value from becoming negative or exceeding 1. Similarly, for Pooling A, which subtracts a penalization proportional to the square of the range (), this squared range remains within [0, 1] due to the initial normalization. Since both () and () are within [0, 1], the final pooled value also stays within the [0, 1] interval. The scaling factor applied in Pooling A can be chosen to maintain this constraint. Therefore, normalization is a necessary and sufficient step to ensure stable and interpretable outputs from both pooling methods.
Boundedness of Pooling Outputs:
Boundedness of Pooling A.
Pooling A is defined as
where is a scaling factor that ensures the output remains within [0, 1].
Because normalization guarantees
the midpoint satisfies
The squared range satisfies
To guarantee
we define
which ensures the maximum possible subtraction is .
Thus,
- When the signal is flat ():
- When (maximum possible contrast):
Therefore, is mathematically guaranteed to remain within the interval .
Boundedness of Pooling B.
Pooling B is defined as
Normalization ensures
Thus,
which shows Pooling B is also bounded in [0, 1].
Normalization to the [0, 1] interval directly influences the sensitivity of both proposed pooling functions. Because all inputs satisfy , the extrema range also satisfies , and the standard deviation satisfies . As a result, both pooling functions operate on statistically bounded and numerically comparable quantities. Without normalization, high-magnitude segments could produce disproportionately large ranges or variances, causing Pooling A to over-penalize high-energy windows and causing Pooling B to subtract excessively large variance terms. Thus, normalization plays a critical role in stabilizing the sensitivity of both pooling operations by constraining the scale of the inputs, preventing numerical domination by outlier windows, and ensuring consistent pooling behavior across subjects and activities. Normalization also ensures stable pooling sensitivity by preventing the range and variance terms from dominating the pooling output across different signal magnitudes.
Pooling B operates by subtracting the intra-window standard deviation, effectively penalizing variance within the signal window. From a statistical perspective, this can be interpreted as a variance-suppression strategy, closely related to the principles of Wiener filtering, where noisy components are attenuated by down-weighting high-variance fluctuations relative to the mean. In the context of HAR, this mechanism improves robustness against stochastic noise sources (e.g., Gaussian or salt-and-pepper perturbations), as these typically inflate local variance without contributing meaningful structural information. Thus, Pooling B stabilizes the pooled feature by reducing sensitivity to noise-driven outliers, making it particularly effective in steady-state or low-variability activities (e.g., Sitting, Standing).
The design of Pooling A and Pooling B is driven by two statistical properties of accelerometer windows that are crucial for activity recognition: extrema contrast and intra-window variability. Traditional max pooling selects only a single extreme value and is highly sensitive to noise, while average pooling oversmooths the signal and suppresses the sharp transitions that characterize many human activities. In Pooling A, combining the extrema midpoint with a non-linear range penalty enhances the representation of transient, high-contrast patterns without being overly sensitive to isolated outliers. Pooling B, in contrast, subtracts the standard deviation to suppress noise-driven fluctuations, providing a deterministic and variance-aware alternative to conventional smoothing.
Unlike stochastic pooling or fractional pooling, which introduce randomness or irregular pooling region sizes, our pooling methods remain deterministic and maintain stable pooling regions. Furthermore, compared to Lp-norm or learnable pooling approaches, the proposed functions are parameter-free and lightweight, avoiding additional model complexity. These characteristics make Pooling A and Pooling B particularly suitable for noisy and variable time-series sensor data, where both sharp transitions and noise suppression are essential.
These pooling operations are integrated into both 1D and 2D CNN models, depending on the input representation.
4.2. Raw Sensor Path: Enhanced 1D CNN
In this stream, segmented tri-axial accelerometer data (x, y, z) are directly passed to a tailored 1D CNN. Each convolutional block is followed by either Pooling A or B. This model is designed for scenarios where raw signal fidelity is high. As shown in Experiment 1, Pooling A enhances discriminability across sharp activity transitions, while average pooling offers smoother representations.
4.3. Histogram-Based Image Path: 2D CNN with Histogram Encoding
The second path converts accelerometer windows into 2D RGB images:
Each axis is binned into 100 intervals.
Bin counts reflect frequency distributions, mapped to R (x), G (y), and B (z) channels.
The resulting image encodes the statistical structure of the activity segment.
These images are input to a modified 2D CNN with Pooling A or B, followed by fully connected layers. As observed in Experiment 3, this path is particularly effective in structured environments, benefiting from Pooling A’s high contrast sensitivity and Pooling B’s stability.
4.4. Noise Injection for Robustness Testing
To simulate real-world sensor uncertainty, we introduce three types of artificial noise: Gaussian, salt-and-pepper, and mixed. These are applied independently to each axis of raw accelerometer data. The impact of noise on model performance is assessed in Experiment 2, allowing direct comparison of pooling methods under varying levels of distortion.
4.5. Classification and Metrics
Each CNN branch outputs one of six activity labels: walking, jogging, sitting, standing, upstairs, or downstairs. Outputs are either fused or analyzed independently across experiments.
To evaluate performance, we use standard classification metrics including accuracy, precision, recall, F1 score, and class-weighted averages. Additional metrics such as F2 and F0.5 scores, ROC AUC, and performance degradation under noise are used to quantify robustness, fairness, and deployment viability. Inference latency and model size are also measured for embedded suitability.
5. Experiments, Results, and Discussion
To comprehensively evaluate the role of pooling strategies in human activity recognition, we conducted a series of experiments across different neural architectures and input representations. Initially, we assessed the effectiveness of a proposed pooling mechanism compared to standard average pooling within a 1D CNN framework using raw tri-axial accelerometer data. This was followed by a controlled noise robustness analysis, where various types of artificial noise were introduced to the sensor signals to simulate real-world distortion and examine how different pooling methods respond under degraded conditions. We also investigated a 2D CNN architecture that transforms sensor signals into histogram-based image representations, allowing us to analyze how pooling strategies perform when spatial distributions of signal intensities are taken into account.
In addition to these experiments, we provide a summary of model-wise performance comparisons to highlight how pooling choices influence learning across distinct architectures and data formats. Furthermore, we include a comparative discussion with similar works in the literature to contextualize our findings, identify strengths and limitations, and emphasize the broader implications of pooling design in sensor-based activity recognition. Together, these experiments and analyses offer a broad and comparative understanding of pooling effectiveness in terms of classification accuracy, noise resilience, sensitivity to input structure, and consistency with prior research.
5.1. Experiment 1: 1D CNN Performance and Effectiveness of Average Pooling vs. Proposed Pooling A
In this experiment, raw tri-axial accelerometer signals (X, Y, and Z) from the WISDM dataset were directly fed into a one-dimensional Convolutional Neural Network (1D CNN) followed by a fully connected (FC) classifier. The input signals were segmented using a sliding window of 100 samples with a stride of 50, forming input tensors of size . The proposed 1D CNN architecture consists of two convolutional layers with 32 and 64 filters, respectively, using kernel sizes of 5 and 3, stride 1, and ReLU activation functions. Each convolutional layer is followed by a pooling operation either standard average pooling or the proposed Pooling A with a pooling size of 2.
The extracted feature maps are flattened and passed to a fully connected layer of 128 neurons, followed by an output layer corresponding to the number of activity classes. The model is trained using the Adam optimizer with a learning rate of 0.001, a batch size of 64, and a total of 10 training epochs. Cross-entropy loss is used as the optimization objective. The performance comparison focuses on evaluating how the proposed Pooling A improves classification accuracy and robustness compared to standard average pooling across different human activities.
5.1.1. Pooling A: Enhanced Temporal Discrimination via Nonlinear Aggregation
Pooling A is a non-linear pooling strategy designed to combine central tendency (e.g., Extrema Contrast Pooling (ECP)) while incorporating a penalization of the local signal range. This design enhances the sensitivity to transitions and local variations, which are critical for distinguishing structured and repetitive movements like Jogging, Walking, and Standing.
Strengths:
Pooling A consistently yields higher class-wise F1 scores for sharp, rhythm-based activities. For instance, F1 scores for Walking and Downstairs are 0.99 and 0.86, respectively, outperforming average pooling (see Table 4).
Table 4.
Transposed metrics comparison 1D CNN with Pooling A vs. average pooling (all inputs min–max normalized to [0, 1] to satisfy pooling constraints).
It achieves better balance in precision and recall for transitional activities like Upstairs, offering improved decision boundaries (Precision: 0.95, Recall: 0.96).
Table 5 further confirms that Pooling A delivers superior overall accuracy (1D CNN 95.1% (96.5 for 2D CNN)) compared to average pooling (1D CNN 93.0%(94.1 for 2D CNN)).
Table 5.
Main metrics comparison—1D CNN with Pooling A vs. average pooling.
Weaknesses of pooling A:
Due to its range-based sensitivity, it may be less stable under intra-class signal variability, particularly in classes with inconsistent patterns.
It introduces moderate computational overhead, although still efficient enough for practical HAR applications.
Pooling A is ideal when the goal is to detect crisp transitions and fine temporal boundaries. Its improved performance on structured activities and its capacity to control false positives make it a valuable enhancement for HAR systems requiring detailed activity segmentation.
5.1.2. Average Pooling: Stable Smoothing with Limited Granularity
Average pooling, the conventional approach, compresses local windows via mean reduction, offering stable and smoothed representations. This makes it effective for general-purpose activity recognition where robustness and simplicity are favored over fine feature resolution.
Strengths:
Strong recall performance in Upstairs (Recall: 0.94), suggesting better generalization in ambiguous or transitional activities.
Performs reliably on Jogging and Standing (F1: 0.98 and 0.99, respectively), where repeated signal structures dominate.
Low computational cost and compatibility with most hardware/software make it suitable for lightweight, real-time applications.
Weaknesses:
Tends to over-smooth rapid transitions, leading to reduced precision in short, bursty activities like Downstairs (F1: 0.79 vs. 0.86 with Pooling A).
Exhibits a precision-recall mismatch in Upstairs (Precision: 0.72, Recall: 0.94), indicating a tendency to overpredict this class (see Table 4).
Lacks edge-awareness, which is crucial for nuanced class separation.
While average pooling provides robust generalization, it sacrifices discriminative power for smoothness. It’s best used when model stability is more critical than fine-grained temporal sensitivity.
5.1.3. Short Fourier Transform (STFT)
To empirically validate the transient-preserving properties of our pooling mechanisms, we conducted a comparative analysis using short-time Fourier transform (STFT) spectrograms of pooled signals. Figure 3 illustrates the short-time Fourier transform (STFT) spectrograms of a single walking activity window under four different pooling strategies: Max Pooling, Average Pooling, Pooling A, and Pooling B. The horizontal axis represents time (0–0.04 s) and the vertical axis represents frequency (0–16 Hz). Pooling A (top-left) exhibits low magnitude in the upper-left corner but strong activation in the upper-right and lower-right regions, highlighting its ability to preserve high-frequency components associated with sharp transitions during walking. Average pooling (top-right) shows medium magnitude on the left and weaker activity on the right, reflecting its smoothing effect that attenuates transient signals. Max pooling (bottom-left) maintains moderate magnitude in the lower-left quadrant but suppresses other regions, producing a less consistent temporal representation. Pooling B (bottom-right) demonstrates medium magnitude in the upper-left but darker elsewhere, indicating that while it suppresses noise effectively, it also reduces some high-frequency transient content. Overall, these visualizations confirm that Pooling A preserves high-frequency transient features more effectively, supporting its superior performance in capturing sharp activity transitions compared to conventional pooling strategies.
Figure 3.
STFT spectrogram comparison of pooling strategies on a walking activity window.
5.1.4. Precision–Recall Dynamics in Transitional Activities
A clear behavioral divergence is seen in the Upstairs class:
Average Pooling achieves high recall (0.94) but low precision (0.72), leading to overclassification errors.
Pooling A maintains a balanced prediction profile (Precision: 0.95, Recall: 0.96), indicative of tighter decision boundaries and better class discrimination (see Table 4).
This supports the case for adopting Pooling A in tasks where reducing false positives is paramount.
5.1.5. Overall Performance
Table 5 offers a high-level comparison of the two pooling methods across performance and operational characteristics:
Pooling A excels in class-wise F1 scores and overall accuracy:
Accuracy: 95.1% (vs. 93.0%) (96.5% and 94.1% 2D CNN)
ROC (Macro): 0.97 (vs. 0.95)
Higher F1 for most classes, except a slight drop in Jogging, where average pooling has marginally better precision.
Average Pooling remains favorable in terms of:
Lower computational complexity
Recall-heavy performance, making it suitable for sensitive applications where false negatives must be minimized.
While both pooling strategies offer robust baselines, Pooling A demonstrates superior class-wise balance, improved boundary control, and enhanced edge sensitivity. As confirmed by Table 4 and Table 5, it is the recommended choice for HAR systems that demand fine-grained activity recognition, especially in environments with complex or overlapping signals.
5.2. Experiment 2: Noise Robustness Evaluation of Pooling Methods
Although the WISDM dataset naturally includes a degree of noise due to its sensor-based nature, for the sake of controlled and systematic evaluation, we artificially injected additional noise into the raw tri-axial accelerometer data (X, Y, Z). This approach allows us to directly assess the sensitivity and robustness of various pooling strategies under consistent and known distortion patterns.
Each signal was independently corrupted with one of the following noise types, applied at 20% intensity, simulating realistic but challenging operating conditions:
Salt-and-Pepper Noise: Introduces abrupt, impulsive spikes and drops in the signal.
Salt-and-pepper noise was applied using the random_noise function from the skimage.util library in Python 3.13.5, with fixed density parameters (amount = 0.10, 0.20, 0.30). These values correspond to light, moderate, and heavy corruption levels, respectively, as summarized in Table 6. At 10% corruption, sensor traces retain most temporal structure, while 30% corruption severely distorts transitions and peak values. Explicitly documenting these standardized noise profiles allows other researchers to replicate the experiments precisely, ensuring comparability with future HAR robustness studies.
Table 6.
Salt-and-pepper noise profiles used in experiments to ensure reproducibility.
Gaussian Noise: Adds smooth, continuous fluctuations, mimicking sensor drift or thermal noise.
Speckle Noise: A multiplicative noise type common in signal transmission errors or hardware imperfections.
Mixture Noise: A composite of the above three, simulating complex, real-world conditions.
Noiseless Baseline: For clean reference.
Each pooling method was applied directly to the noisy input signals, and the output was compared to the original (clean) signal using the following metrics: Mean Difference (MD), Standard Deviation Difference (STD), Mean Squared Error (MSE), and 1—Pearson Correlation Coefficient (1 − ρ)
Lower values in all metrics indicate better retention of the signal’s original structure and higher robustness to noise.
Pooling Methods Compared: Max Pooling, Average Pooling, Proposed Pooling A, Proposed Pooling B
Quantitative Results: Table 7 presents comprehensive quantitative measurements, followed by a thorough analysis and interpretation of the noise robustness of various pooling methods. The visual summary of these metrics across different pooling strategies under varying noise conditions is illustrated in Figure 4 and Figure 5, providing a clear comparative perspective complementing the numerical values in Table 7.
Table 7.
Signal deviation metrics under 20% noise for different pooling methods (lower is better).
Figure 4.
Quantitative comparison of pooling methods under various noise types (mixture of all noise types) using MD, STD, MSE, and 1−correlation metrics.
Figure 5.
Quantitative comparison of pooling methods under various noise types (a) no noise, (b) under Gaussian noise, (c) under Salt & pepper noise, (d) under speckle noise using MD, STD, MSE, and 1−correlation metrics.
The key findings from the table can be summarized as follows:
- Pooling A demonstrated consistent and robust performance across noise environments; Pooling A matched or outperformed Average Pooling in several conditions, particularly under salt-and-pepper (MD = 0.02, MSE = 0.01, 1 − ρ = 0.11) and No Noise (MSE = 0.01, 1 − ρ = 0.09). It maintained low distortion across all noise types, with MD values never exceeding 0.03 and MSE staying at or below 0.02.
This stability suggests that Pooling A effectively attenuates both impulsive and continuous noise while maintaining a strong correlation with the original signal.
- b.
- Max Pooling performed the worst overall across all noise types. Across every condition including the No Noise baseline Max Pooling produced the largest MD values (e.g., 0.41 under Gaussian, 0.27 under Mixture Noise) and the highest MSE (0.22 for Gaussian, 0.11 for Mixture Noise). It also consistently exhibited the largest correlation loss, with 1 − ρ reaching 0.70 under Gaussian noise. These results reinforce that Max Pooling is highly sensitive to outliers and amplifies noise, making it inappropriate for preserving fine temporal signal characteristics.
- c.
- Pooling B showed its strongest performance under Gaussian noise. Under Gaussian noise, Pooling B achieved the lowest correlation loss among all methods (1 − ρ = 0.12) and a moderate MSE (0.09). While Average Pooling had lower MSE (0.02) and STD (0.00), Pooling B’s correlation preservation surpasses all other methods for this noise type.
This indicates that Pooling B is particularly well suited to countering zero-mean, normally distributed noise, which aligns with the filtering behavior implied by its design.
- d.
- Average Pooling remained a general-purpose baseline. It performed especially well under Speckle noise, with the lowest MD (0.02), lowest MSE (0.01), and the lowest correlation loss (1 − ρ = 0.06) among all pooling methods for that condition. Its smoothing effect effectively suppresses multiplicative noise components. However, under impulsive or mixed noise (e.g., salt-and-pepper and Mixture Noise), Average Pooling is outperformed by Pooling A in terms of correlation preservation.
- e.
- Performance varies depending on noise characteristics. Pooling A is the most versatile, showing consistently low MD, MSE, and correlation loss across all noise types—including the No Noise condition. Pooling B excels particularly under Gaussian noise due to its superior correlation preservation. Average Pooling is highly effective for smooth or speckle-like distortions. Max Pooling systematically yields the poorest results and is unsuitable for noise-sensitive applications.
To further examine the resilience of different pooling mechanisms under diverse noise conditions, we evaluated performance across multiple synthetic noise types using distortion metrics (MSE, MD, STD, and correlation). The results highlight that while Pooling A generally offers the strongest balance between noise suppression and feature preservation, Average Pooling remains competitive in smoother noise settings such as Speckle or Gaussian. In contrast, Max Pooling consistently underperforms, amplifying outliers and showing poor robustness across all scenarios. A summary of best- and worst-performing pooling methods per noise type is provided in Table 8.
Table 8.
Noise robustness evaluation summarization.
Although the WISDM dataset lacks explicit “noise labels,” it inherently contains real-world perturbations originating from uncontrolled smartphone use, such as sampling jitter, micro-movements inside the pocket, thermal drift, and device-specific sensor imperfections. These naturally occurring distortions are an integral part of the raw signal and already contribute to the baseline classification difficulty. Our use of synthetic noise is therefore not intended to replace real-world noise, but rather to systematically isolate and stress-test specific noise types (e.g., impulsive, Gaussian, and multiplicative distortions) that are difficult to analyze directly in uncontrolled field recordings. This combination of natural and synthetic noise allows us to evaluate robustness both in practical conditions and under controlled perturbations.
5.2.1. Class-Specific Robustness Analysis via Confusion Matrices Under Noise
To assess class-specific robustness, we analyzed confusion matrices under noise injection. As shown in Table 9, Gaussian noise leads to misclassification between ‘Walking’ and ‘Jogging’, reflecting their overlapping temporal dynamics under perturbation. By contrast, static activities (e.g., Sitting, Standing) remain resilient. This highlights that Pooling A better preserves transient features in dynamic classes, whereas average pooling fails to separate closely related activities under noise. These results confirm that classification resilience, not just signal distortion metrics (MD/STD), benefits from the proposed pooling design.
Table 9.
Class-wise performance under clean and Gaussian noise conditions.
A Gaussian noise (σ = 0.2) is injected into the test set for class-specific degradation evaluation. The results highlight that dynamic activities such as Walking and Jogging exhibit noticeable misclassification, reflecting their overlapping temporal dynamics under perturbation. By contrast, static activities such as Sitting and Standing remain largely unaffected. These results indicate that transient feature preservation plays a critical role in differentiating dynamic activities under noise. The proposed pooling layers reduce cross-class confusion, particularly for Walking → Jogging, compared to average pooling. Table 9 summarizes class-wise performance under clean and Gaussian noise conditions. Dynamic classes such as Walking and Jogging show significant degradation, with Walking frequently misclassified as Jogging and Upstairs confounded with Downstairs. In contrast, static activities (Sitting, Standing) remain largely unaffected, confirming their resilience to Gaussian perturbations. These results reinforce the need for noise-aware pooling: while standard max/average pooling smooths transitions, the proposed Pooling A better preserves high-frequency transients, reducing Walking → Jogging confusions, and Pooling B enhances stability against drift-like distortions.
5.2.2. STFT Visualizations Under Simulated Device Displacement Drift
To simulate real-world sensor displacement artifacts, we introduced linear and sinusoidal drift components into accelerometer signals using the following model:
where
- is the original signal;
- controls the linear drift slope;
- is the amplitude of sinusoidal drift;
- is the drift frequency (set to 0.5 Hz to mimic low-frequency oscillatory displacement).
This approach emulates baseline shifts and slow oscillatory motion typical of device displacement in wearable and mobile sensing.
Figure 6 presents STFT spectrograms in the 0–6 Hz range, chosen because device displacement drift is primarily a low-frequency artifact (<6 Hz) caused by slow device movement or orientation changes. In contrast, Figure 3 (0–16 Hz) analyzes transient activity behavior, requiring a broader frequency range to capture higher-frequency transitions.
Figure 6.
STFT under device displacement drift.
Visual Analysis:
- Average Pooling shows uniform attenuation with minimal spectral structure, reflecting its smoothing behavior and reduced sensitivity to drift-induced variations;
- Pooling A retains subtle mid-frequency residuals due to its extrema-sensitive design, preserving some transient features despite drift;
- Max Pooling exhibits sporadic high-energy patches, indicating sensitivity to outliers and drift-induced spikes;
- Pooling B demonstrates strong suppression of low-frequency drift components, confirming its variance-reduction mechanism and robustness to baseline shifts.
These results indicate that redundancy-aware pooling is particularly effective in suppressing low-frequency drift, complementing the noise resilience demonstrated under synthetic Gaussian and salt-and-pepper noise.
5.2.3. Latency and Computational Cost Analysis of Pooling Strategies
Pooling B, which relies on standard deviation computation, introduces the highest per-window latency (0.15 ms) compared to Average Pooling (0.05 ms), Max Pooling (0.06 ms), and Pooling A (0.08 ms). Table 10 provides a detailed comparison of computational cost metrics across both 1D-CNN and 2D-CNN architectures, including FLOPs estimates, epochs to converge, total training time, and final model size. These evaluations were performed on a desktop system equipped with an Intel i7-12700K CPU and NVIDIA RTX 3080 GPU(NVIDIA Corporation, headquartered in Santa Clara, CA, USA), and therefore represent algorithmic-level latency rather than hardware-specific inference speed. Device-dependent characteristics such as power consumption, memory bandwidth, processor scaling, and sensor-management overhead are not captured in these measurements.
Table 10.
Latency and computational cost of pooling strategies for edge deployment (1D CNN).
Despite this limitation, the relative differences across pooling types are informative. For 1D CNNs, Average and Max Pooling exhibit the lowest computational cost and fastest training times (35 and 28 min), while Pooling A offers moderate overhead and Pooling B incurs the highest total training time (37 min) and largest model size (1.45 MB). A similar pattern is observed for 2D CNNs, where Pooling B again results in the largest model size (3.98 MB) and longest training duration (48 min). These results indicate that Average Pooling, Max Pooling, and Pooling A comfortably satisfy real-time constraints for edge-oriented HAR systems, whereas Pooling B remains feasible but may require caution in latency-sensitive deployments due to its higher computational burden.
To fully characterize inference behavior on mobile or embedded hardware, future work will incorporate direct benchmarking on devices such as Raspberry Pi, smartphone SoCs, and microcontroller-class platforms.
Although the absolute latency values are desktop-based, their relative differences reliably indicate computational efficiency trends that would persist across embedded platforms.
5.3. Experiment 3: 2D CNN + FC Using Histogram-Based Sensor Representations with Different Pooling Strategies
In this experiment, raw tri-axial accelerometer signals (X, Y, Z) from the WISDM dataset are transformed into histogram-based image representations to enable learning with a two-dimensional Convolutional Neural Network (2D CNN) followed by fully connected (FC) layers. This approach addresses a key limitation observed in high-frequency wearable sensor data, where redundant or repetitive measurements often occur within short temporal windows. Such redundancy can reduce discriminative power and negatively impact model generalization.
To mitigate this issue, the continuous values of each accelerometer axis are discretized into 100 uniform bins, forming histogram representations that capture the underlying statistical distribution of motion patterns. Separate histograms are generated for the X, Y, and Z axes and subsequently combined to form a single multi-channel (RGB-like) image representation. This transformation preserves spatial relationships while enabling the use of convolutional filters. The distributions of the three axes are illustrated in Figure 7, while the merged RGB-style histogram images are shown in Figure 8.
Figure 7.
X, Y, and Z distributions.
Figure 8.
Histogram distributions of X, Y, and Z accelerometer axes and the corresponding RGB-encoded image (Bin = 100).
5.3.1. The Structure and Hyperparameters of Designed 2D CNN
The resulting histogram images are used as inputs to a 2D CNN consisting of two convolutional layers with 32 and 64 filters, respectively, using kernel sizes of and stride 1. Each convolutional layer is followed by a pooling operation (either standard pooling or the proposed Pooling A or Pooling B) with a pooling size of . Rectified Linear Unit (ReLU) activations are applied after each convolution. The extracted feature maps are flattened and passed to a fully connected layer with 128 neurons, followed by a softmax output layer corresponding to the number of activity classes.
All models are trained using the Adam optimizer with a learning rate of 0.001, a batch size of 64, and a total of 10 training epochs. Categorical cross-entropy is employed as the loss function. Prior to training, all histogram values are normalized using min–max normalization to the range , which is required to ensure numerical stability and consistent behavior of the proposed pooling mechanisms. This normalization also facilitates fair performance comparison between standard pooling and the proposed Pooling A and Pooling B strategies.
5.3.2. Performance Evaluation and Class-Level Observations
Pooling operations in 2D CNNs serve as critical mechanisms for dimensionality reduction, noise suppression, and abstraction. The experimental evaluation compared Average Pooling, Proposed Pooling A, and Proposed Pooling B across standard classification metrics and structural signal properties. The insights drawn from Table 11 and Table 12 are summarized below.
Table 11.
Precision, recall, and F1 score per class for histogram-based CNN + FC models.
Table 12.
Quantitative and qualitative evaluation of various pooling techniques.
1. Overall Performance and Weighted Averages:
Pooling A achieved the highest weighted average accuracy (96.5%) and F1 score (0.96), outperforming both Average Pooling (93.0%) and Pooling B (93.3%). This demonstrates its robust generalization across all activity classes. Its strong ROC score (0.98) further reflects superior discriminative ability.
2. Class-wise Trends:
- (a)
- Downstairs, Jogging, Standing, and Walking: Pooling A consistently delivered the highest precision, recall, and F1 scores. These classes tend to contain dynamic, edge-rich patterns (e.g., rapid signal changes in stair descent or jogging), which benefit from Pooling A’s sensitivity to edges.
- (b)
- Sitting and Upstairs: Pooling B slightly outperformed others in Sitting (F1: 0.99), a class associated with stable and stationary signals. This suggests that Pooling B’s reliance on standard deviation offers better discrimination under subtle signal changes. Meanwhile, the upstairs performance shows Average Pooling and Pooling A in close competition.
- (c)
- Upstairs (Precision, Recall, F1): Average Pooling slightly edged out Pooling A (0.88 vs. 0.87), reflecting its strength in capturing softer transitions and smoother signal phases, possibly due to its lower sensitivity to noise and variance.
5.3.3. Methodological Impact of Histogram-Based Representation
The conversion of signal windows into histograms drastically reduced local redundancy and preserved the distributional essence of movement, which likely enhanced the CNN’s ability to generalize across users. Pooling methods directly impacted how local intensity patterns in these histograms were downsampled:
- (a)
- Average Pooling provided stability and denoising, but sometimes blurred class-discriminative edges;
- (b)
- Pooling A preserved local variations while balancing suppression of outliers, resulting in consistent superiority across most classes;
- (c)
- Pooling B showed promise in more uniform or slowly changing signals (e.g., sitting), where deviations within the pooling region are minimal and meaningful.
While Average Pooling is computationally the most efficient (lowest cost), Pooling A maintains a moderate cost with significantly higher performance. Pooling B, although moderate in computation, did not yield consistently superior results.
In terms of robustness, the following conclusions were drawn:
- (a)
- Pooling A and Average Pooling both demonstrated low sensitivity to noise, confirming their utility in real-world sensor environments where measurement fluctuations are common;
- (b)
- Pooling A’s strong edge sensitivity was critical for dynamic classes (Jogging, Downstairs), while Average Pooling’s smoother behavior proved beneficial in classes like Upstairs or Sitting.
No single pooling method universally dominated. However, Pooling A strikes the best balance between performance, noise robustness, and structural fidelity, especially in dynamic activities. Average Pooling remains a solid, lightweight baseline, particularly when computational simplicity is a priority. Pooling B, while effective in select classes, lacked consistent dominance.
These results suggest that adaptive or hybrid pooling strategies, possibly incorporating context or class-aware mechanisms, could further enhance performance in human activity recognition systems based on histogram-transformed sensor signals.
5.3.4. Ablation Study: Effect of Histogram Bin Size and Raw Signal Baseline in 2D CNN Path
Although Table 13 shows that 50, 100, and 200 bins yield very similar accuracy values, the differences are too small to be considered statistically significant. We select 100 bins as a practical compromise: (1) it preserves sufficient distributional detail compared to 50 bins, and (2) it avoids the increased computational overhead associated with 200 bins. This choice reflects a trade-off between representational granularity and efficiency, rather than a statistically verified superiority. Feeding raw windows directly into the 2D CNN reduces performance, demonstrating that the histogram transformation contributes significantly to the observed gains independently of pooling.
Table 13.
Effect of histogram bin size.
5.3.5. Evaluation on Sliding Windows to Capture Transitional Activities
Using overlapping sliding windows (stride = 50% of window size) introduces transitional activity segments into the evaluation. Table 14 shows that while static activities such as Standing and Sitting remain relatively stable, transitional activities like Upstairs and Downstairs experience a noticeable drop in F1 scores. This reflects the increased difficulty of correctly classifying windows containing multiple activity states. Importantly, the proposed pooling strategies, especially Pooling A, maintain a balanced performance across dynamic transitions, demonstrating robustness to real-world continuous signals. The comparison to non-overlapping windows highlights the necessity of evaluating models under conditions that mimic realistic HAR streams.
Table 14.
Class-wise performance on overlapping sliding windows (stride = 50% of window size).
Table 14 illustrates class-wise confusion matrices when evaluating HAR with overlapping sliding windows. Transitional activities such as Upstairs and Downstairs show increased misclassifications, particularly under average and max pooling, whereas Pooling A better preserves the discriminative features needed for accurate classification. Pooling B demonstrates moderate robustness, benefiting from standard deviation-based stabilization. Overall, these visualizations reinforce the importance of pooling design for maintaining performance under realistic continuous activity streams, highlighting the limitations of non-overlapping window evaluation.
5.3.6. Robustness to Motion Artifact Noise
To evaluate Pooling A’s edge sensitivity under non-activity artifacts, we introduced synthetic “motion artifact” noise into test windows by injecting abrupt perturbations modeled as short bursts of high-amplitude jitter (simulating device shaking at ~0.5 g). As shown in Table 15, false positives increased most prominently in static activities (e.g., Standing and Sitting), which Pooling A occasionally misclassified as Walking. This confirms that the Pooling A’s heightened edge sensitivity makes it prone to mistaking abrupt device motion for genuine transitions. However, for dynamic activities (Jogging, Walking, Upstairs, Downstairs), the increase in false positives was smaller (3–7%). These findings suggest that while Pooling A improves discrimination of structured transitions, its deployment in real-world mobile HAR systems should include motion-artifact filtering (e.g., low-pass baseline correction or shake detection) to mitigate false alarms.
Table 15.
False positive rates under motion artifact perturbations.
5.4. Model-Wise Performance Comparison
Table 16 presents the per-class classification accuracy for five different model configurations evaluated on the WISDM dataset. The models span two architectural categories—Histogram-aware CNNs with fully connected layers (CNN + FC) and 1D CNNs—each tested with three pooling strategies: traditional average pooling, proposed Pooling A, and proposed Pooling B.
Table 16.
Per-class accuracy (%) across models and pooling strategies.
Across all six activity classes, the results demonstrate that Pooling A consistently improves accuracy in both architectural types, particularly in activities characterized by distinct transitions or repetitive patterns (e.g., Downstairs, Walking, Standing). For instance, the Histogram-aware CNN + FC with Pooling A achieves the highest weighted average accuracy (96.5%), outperforming both the baseline average pooling (93.0%) and Pooling B (93.3%).
Within the 1D CNN architecture, Pooling A also yields notable gains, especially for Upstairs and Walking, where sharp transitions and edge sensitivity are beneficial. This highlights the adaptability of Pooling A across both spatially and temporally aware CNN structures.
Ablation experiments were conducted to quantify the contribution of each component of the proposed pipeline. Table 17 summarizes accuracy results for the baseline CNN alone, with histogram encoding only, with ECP only, with CMV only, with the combined ECP + CMV pooling representation, and with the full system including the histogram encoder. The results show that each component contributes measurable improvement: histogram encoding captures distributional characteristics, ECP emphasizes extrema dynamics, and CMV captures intra-window variability. Combining ECP and CMV yields the largest intermediate gain, and adding the histogram encoder on top of this representation provides the highest final accuracy.
Table 17.
Component-wise contribution analysis (ablation study).
5.5. Comparison with Similar Works and Discussion
The two CNN architectures proposed in this work are designed to exploit complementary feature representations. A lightweight 1D CNN operating on raw sensor windows captures temporal dynamics and edge-level patterns, while a histogram-to-RGB 2D CNN encodes intra-window statistical distributions that enhance robustness to noise. Although employing both architectures increases computational demand, the per-window latency analysis reported in Table 10 demonstrates that Pooling A, along with both the 1D and 2D CNN configurations, remains practical for typical HAR sampling rates (50–100 Hz). For scenarios with strict real-time or resource constraints, several efficient alternatives are available, including deploying only the 1D CNN with Pooling A, adopting a cascaded or fallback strategy (e.g., on-device 1D inference with occasional server-side 2D processing), reducing histogram bin resolution, or applying model compression techniques such as quantization and pruning. These strategies can significantly lower computational cost with minimal impact on accuracy.
Table 18 presents a qualitative comparison between the proposed approach and existing methods reported in the literature. The comparison highlights key dimensions such as dataset usage, architectural design, pooling mechanisms, feature representations, noise robustness, and reported performance. Particular emphasis is placed on the advantages introduced by the proposed framework, including the integration of custom pooling strategies, histogram-based feature encoding, and explicit evaluation under noisy conditions.
Table 18.
A qualitative comparison of the proposed method with several existing approaches.
This study distinguishes itself from prior work by introducing a redundancy-aware representation through histogram binning of accelerometer signals, transforming the time-series data into RGB images that enable spatial feature extraction via 2D CNNs. Unlike earlier approaches that apply standard max or average pooling, we propose two novel pooling mechanisms Pooling A and Pooling B specifically designed to enhance sensitivity to transient features (via extrema-range emphasis) and stability under variability (via standard deviation penalization), respectively. A major contribution of this work is its explicit evaluation of noise resilience, which is largely absent in previous studies. To this end, we inject controlled noise into the input to assess robustness under signal uncertainty. All experiments are conducted using the original WISDM dataset [1], widely adopted for smartphone-based human activity recognition.
5.6. Cross-Dataset Generalization and Contextualizing Performance
While our proposed pooling mechanisms yield 96.5% accuracy on WISDM exceeding prior CNN-based baselines on this dataset we acknowledge that this does not surpass SOTA models on newer benchmarks such as UCI-HAR and MobiAct, where >97% accuracy has been achieved [41]. Importantly, our focus is not on claiming universal SOTA, but on demonstrating that pooling layers tailored for noise and variability can provide consistent improvements within a given dataset. To assess generalization, we conducted transfer experiments on UCI-HAR and MobiAct (Table 19). Results show that our method achieves accuracies of 95.1% (96.5 2D CNN) and 93.0% (94.1 2D CNN), respectively, slightly below the strongest specialized models but comparable to general CNN baselines. This indicates that our pooling layers preserve discriminability across datasets with differing sensor characteristics, though dataset-specific tuning may be required to reach absolute SOTA. We therefore position our contribution as a pooling design innovation that can enhance diverse HAR models, rather than a dataset-specific accuracy claim.
Table 19.
Cross-dataset results (proposed vs. baselines).
6. Conclusions
This study presents a redundancy-aware CNN-based framework for human activity recognition using smartphone accelerometer data from the WISDM dataset. Two novel pooling operations—Pooling A (Extrema Contrast Pooling (ECP)) and Pooling B (Center Minus Variation (CMV))—were introduced to exploit statistical characteristics such as extrema range and signal variability. Pooling A emphasizes transient motion patterns, while Pooling B suppresses local fluctuations, enhancing robustness to noise.
Ablation experiments summarized in the results table demonstrate the contributions of each component. The baseline CNN achieved 79.91% accuracy, while adding histogram-based encoding alone increased performance to 91.6%. Incorporating ECP or CMV individually improved accuracy to 87.4% and 83.6%, respectively, and combining both pooling mechanisms without histograms reached 89.1%. The full model, which integrates histogram encoding with both ECP and CMV, achieved the highest accuracy of 96.84%, closely followed by the combination of histogram encoding with ECP alone at 96.5%. These results confirm that histogram-based feature representation provides the largest single improvement, while the pooling mechanisms further enhance discriminative capability and robustness, particularly under noisy conditions.
The proposed pooling strategies were evaluated in two separate configurations: a 1D CNN applied to normalized raw tri-axial accelerometer windows, capturing temporal dependencies, and a 2D CNN applied to histogram-based image representations, capturing intra-window distributional patterns through spatial convolution. Both configurations demonstrate that redundancy-aware pooling substantially improves classification accuracy and robustness.
Although this study focused on accelerometer data from the WISDM dataset, the pooling mechanisms are sensor-agnostic and can be integrated into other time-series learning frameworks. Future work will investigate cross-dataset generalization on benchmarks such as UCI HAR, PAMAP2, and MotionSense, extend the framework to multimodal inertial sensors (e.g., gyroscope and magnetometer), and evaluate deployment feasibility on resource-constrained edge devices, where latency, memory usage, and energy efficiency are critical. Additionally, adapting the proposed pooling strategies to broader time-series applications beyond human activity recognition represents a promising avenue for extending their practical impact.
Author Contributions
Conceptualization, B.A.H.A. and S.A.A.; methodology, B.A.H.A. and S.A.A.; software, B.A.H.A. and S.A.A.; validation, B.A.H.A., S.A.A.; formal analysis, B.A.H.A. and S.A.A.; investigation, B.A.H.A. and S.A.A.; resources, B.A.H.A. and S.A.A.; data curation, B.A.H.A.; writing—original draft preparation, B.A.H.A.; writing—review and editing, S.A.A.; visualization, B.A.H.A.; supervision, S.A.A.; project administration, B.A.H.A. and S.A.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
The study did not require ethical approval.
Informed Consent Statement
Not applicable.
Data Availability Statement
The WISDM dataset used in this study is publicly available from the official WISDM repository and can be freely accessed for research purposes.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM SIGKDD Explor. Newsl. 2011, 12, 74–82. [Google Scholar] [CrossRef]
- Walse, K.H.; Dharaskar, R.V.; Thakare, V.M. Performance evaluation of classifiers on WISDM dataset for human activity recognition. In Proceedings of the 2nd International Conference on Information and Communication Technology for Competitive Strategies (ICTCS), Udaipur, India, 4–5 March 2016; pp. 1–7. [Google Scholar]
- Min, Y.; Htay, Y.Y.; Oo, K.K. Comparing the performance of machine learning algorithms for human activities recognition using WISDM dataset. Int. J. Comput. (IJC) 2020, 38, 61–72. [Google Scholar]
- Seelwal, P.; Srinivas, C. Human activity recognition using WISDM datasets. J. Online Eng. Educ. 2023, 14, 88–94. [Google Scholar]
- Heydarian, M.; Doyle, T.E. rWISDM: Repaired WISDM, a public dataset for human activity recognition. arXiv 2023, arXiv:2305.10222. [Google Scholar] [CrossRef]
- Sharen, H.; Anbarasi, L.J.; Rukmani, P.; Gandomi, A.H.; Neeraja, R.; Narendra, M. WISNet: A deep neural network based human activity recognition system. Expert Syst. Appl. 2024, 258, 124999. [Google Scholar] [CrossRef]
- Abdellatef, E.; Al-Makhlasawy, R.M.; Shalaby, W.A. Detection of human activities using multi-layer convolutional neural network. Sci. Rep. 2025, 15, 7004. [Google Scholar] [CrossRef]
- Oddo, C.M.; Valdastri, P.; Beccai, L.; Roccella, S.; Carrozza, M.C.; Dario, P. Investigation on calibration methods for multi-axis, linear and redundant force sensors. Meas. Sci. Technol. 2007, 18, 623. [Google Scholar] [CrossRef]
- Gao, Z.; Wang, Y.; Chen, J.; Xing, J.; Patel, S.; Liu, X.; Shi, Y. MMTSA: Multi-modal temporal segment attention network for efficient human activity recognition. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies; ACM: New York, NY, USA, 2023; Volume 7, pp. 1–26. [Google Scholar]
- Genc, E.; Yildirim, M.E.; Yucel, B.S. Human activity recognition with fine-tuned CNN-LSTM. J. Electr. Eng. 2024, 75, 8–13. [Google Scholar] [CrossRef]
- Wang, J.; Li, C.; Zhang, B.; Zhang, Y.; Shi, L.; Wang, X.; Zhou, L.; Xiong, D. Automatic rehabilitation exercise task assessment of stroke patients based on wearable sensors with a lightweight multichannel 1D-CNN model. Sci. Rep. 2024, 14, 19204. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Qin, X.; Gao, Y.; Li, X.; Feng, C. SETransformer: A hybrid attention-based architecture for robust human activity recognition. arXiv 2025, arXiv:2505.19369. [Google Scholar] [CrossRef]
- Zhu, P.; Qi, M.; Li, X.; Li, W.; Ma, H. Unsupervised self-driving attention prediction via uncertainty mining and knowledge embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8558–8568. [Google Scholar]
- Momeny, M.; Sarram, M.A.; Latif, A.; Sheikhpour, R. A convolutional neural network based on adaptive pooling for classification of noisy images. Signal Data Process. 2021, 17, 139–154. [Google Scholar] [CrossRef]
- Shang, Z.; Tang, L.; Pan, C.; Cheng, H. A hybrid semantic attribute-based zero-shot learning model for bearing fault diagnosis under unknown working conditions. Eng. Appl. Artif. Intell. 2024, 136, 109020. [Google Scholar] [CrossRef]
- Kobayashi, S.; Hasegawa, T.; Miyoshi, T.; Koshino, M. Marnasnets: Toward cnn model architectures specific to sensor-based human activity recognition. IEEE Sens. J. 2023, 23, 18708–18717. [Google Scholar] [CrossRef]
- Chu, J.; Li, X.; Zhang, J.; Lu, W. Super-resolution using multi-channel merged convolutional network. Neurocomputing 2020, 394, 136–145. [Google Scholar] [CrossRef]
- Ma, X.; Liu, S. A Simple but Effective Non-local Means Method For Salt and Pepper Noise Removal. In Proceedings of the 2022 IEEE 4th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 12–14 October 2022; pp. 1423–1432. [Google Scholar]
- Wu, H.; Han, Y.; Zhu, Q.; Geng, Z. Novel feature-disentangled autoencoder integrating residual network for industrial soft sensor. IEEE Trans. Ind. Inform. 2023, 19, 10299–10308. [Google Scholar] [CrossRef]
- Tang, Y.; Zhang, L.; Min, F.; He, J. Multiscale deep feature learning for human activity recognition using wearable sensors. IEEE Trans. Ind. Electron. 2022, 70, 2106–2116. [Google Scholar] [CrossRef]
- Negri, V.; Mingotti, A.; Tinarelli, R.; Peretto, L. Uncertainty-Aware Human Activity Recognition: Investigating Sensor Impact in ML Models. In Proceedings of the 2025 IEEE Medical Measurements & Applications (MeMeA), Chania, Greece, 28–30 May 2025; pp. 1–6. [Google Scholar]
- Zeiler, M.D.; Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. arXiv 2013, arXiv:1301.3557. [Google Scholar] [CrossRef]
- Hang, S.T.; Aono, M. Bi-linearly weighted fractional max pooling: An extension to conventional max pooling for deep convolutional neural network. Multimed. Tools Appl. 2017, 76, 22095–22117. [Google Scholar] [CrossRef]
- Xu, X.; Li, X.; Ming, W.; Chen, M. A novel multi-scale CNN and attention mechanism method with multi-sensor signal for remaining useful life prediction. Comput. Ind. Eng. 2022, 169, 108204. [Google Scholar] [CrossRef]
- Zafar, A.; Aamir, M.; Mohd Nawi, N.; Arshad, A.; Riaz, S.; Alruban, A.; Dutta, A.K.; Almotairi, S. A comparison of pooling methods for convolutional neural networks. Appl. Sci. 2022, 12, 8643. [Google Scholar] [CrossRef]
- Mohamed, E.A.; Gaber, T.; Karam, O.; Rashed, E.A. A Novel CNN pooling layer for breast cancer segmentation and classification from thermograms. PLoS ONE 2022, 17, e0276523. [Google Scholar]
- Aminifar, S.; bin Marzuki, A. Horizontal and vertical rule bases method in fuzzy controllers. Math. Probl. Eng. 2013, 2013, 532046. [Google Scholar] [CrossRef]
- Kekshar, S.M.; Aminifar, S.A. Uncertainty Avoider Defuzzification of General Type-2 Multi-Layer Fuzzy Membership Functions for Image Segmentation. IEEE Access 2025, 13, 58120–58135. [Google Scholar] [CrossRef]
- Khorsheed, H.A.; Aminifar, S. Measuring uncertainty to extract fuzzy membership functions in recommender systems. J. Comput. Sci. 2023, 19, 1359–1368. [Google Scholar] [CrossRef]
- Sachdeva, R.; Gakhar, R.; Awasthi, S.; Singh, K.; Pandey, A.; Parihar, A.S. Uncertainty and Noise Aware Decision Making for Autonomous Vehicles-A Bayesian Approach. IEEE Trans. Veh. Technol. 2024, 74, 378–389. [Google Scholar]
- Aminifar, S. Uncertainty avoider interval type II defuzzification method. Math. Probl. Eng. 2020, 2020, 5812163. [Google Scholar] [CrossRef]
- Maulud, R.H.; Aminifar, S.A. Enhancing indoor asset tracking: IoT integration and machine learning approaches for optimized performance. J. Comput. Sci. 2025, 21, 1512–1525. [Google Scholar] [CrossRef]
- Bi, Q.; Zhang, H.; Qin, K. Multi-scale stacking attention pooling for remote sensing scene classification. Neurocomputing 2021, 436, 147–161. [Google Scholar] [CrossRef]
- Nirthika, R.; Manivannan, S.; Ramanan, A.; Wang, R. Pooling in convolutional neural networks for medical image analysis: A survey and an empirical study. Neural Comput. Appl. 2022, 34, 5321–5347. [Google Scholar] [CrossRef] [PubMed]
- Han, H.; Zheng, Q.; Luo, M.; Miao, K.; Tian, F.; Chen, Y. Noise-tolerant learning for audio-visual action recognition. IEEE Trans. Multimed. 2024, 26, 7761–7774. [Google Scholar] [CrossRef]
- Chen, K.Y.; Li, J.L.; Ding, J.J. SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement. arXiv 2025, arXiv:2505.14066. [Google Scholar]
- Miah, A.S.M.; Hwang, Y.S.; Shin, J. Sensor-based human activity recognition based on multi-stream time-varying features with eca-net dimensionality reduction. IEEE Access 2024, 12, 151649–151668. [Google Scholar] [CrossRef]
- Choudhury, N.A.; Soni, B. An adaptive batch size-based-CNN-LSTM framework for human activity recognition in uncontrolled environment. IEEE Trans. Ind. Inform. 2023, 19, 10379–10387. [Google Scholar]
- Qureshi, T.S.; Shahid, M.H.; Farhan, A.A.; Alamri, S. A systematic literature review on human activity recognition using smart devices: Advances, challenges, future directions. Artif. Intell. Rev. 2025, 58, 276. [Google Scholar] [CrossRef]
- Gaud, N.; Rathore, M.; Suman, U. Hybrid deep learning-based human activity recognition (har) using wearable sensors: An edge computing approach. In International Conference on Data Analytics & Management; Springer Nature Singapore: Singapore, 2023; pp. 399–410. [Google Scholar]
- Guo, P.; Nakayama, M. Transformer-based Human Activity Recognition Using Wearable Sensors for Health Monitoring. In Proceedings of the 2025 9th International Conference on Biomedical Engineering and Applications (ICBEA), Seoul, Republic of Korea, 27 February–2 March 2025; pp. 68–72. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.