Article

Wearable Sensor-Free Adult Physical Activity Monitoring Using Smartphone IMU Signals: Cross-Subject Deep Learning with Window-Length and Sensor Modality Studies

by Mussa Turdalyuly 1,*, Ay Zholdassova 2,*, Tolganay Turdalykyzy 1,* and Aydin Doshybekov 3

1 School of Engineering and Information Technology, META University, Almaty 050000, Kazakhstan
2 Software Engineering Department, International Engineering Technological University, Almaty 050060, Kazakhstan
3 Department of Basic Military Training, Abai Kazakh National Pedagogical University, Almaty 050010, Kazakhstan
* Authors to whom correspondence should be addressed.
Information 2026, 17(4), 368; https://doi.org/10.3390/info17040368
Submission received: 11 February 2026 / Revised: 30 March 2026 / Accepted: 10 April 2026 / Published: 14 April 2026

Abstract

Human activity recognition (HAR) using inertial sensors is essential for health monitoring and wellness applications, yet robust classification in real-world adult scenarios remains challenging due to subject variability and activity transitions in smartphone sensing environments. This study investigated smartphone-based physical activity recognition using accelerometer and gyroscope signals under a cross-subject evaluation protocol. To reduce label ambiguity and improve generalization, the original activity set was grouped into a reduced 6-class taxonomy. We evaluated lightweight deep learning models, including a smartphone-only convolutional neural network (CNN) and a multimodal fusion model combining smartphone and smartwatch signals. Using GroupKFold cross-subject validation, the smartphone-only CNN achieved competitive performance with Macro-F1 ≈ 0.46, while multimodal fusion did not provide consistent improvements. We also examined temporal segmentation and showed that shorter windows (2.0 s) yield better results than longer windows. Sensor ablation confirmed the importance of gyroscope information, and per-class analysis indicated that dynamic activities could be recognized reliably, whereas stairs and static categories remained difficult. Overall, the results demonstrate the practicality of smartphone-based activity recognition using built-in smartphone sensors without external wearable devices for adult activity monitoring and provide recommendations for window length and sensor selection in cross-subject HAR.

1. Introduction

Human activity recognition (HAR) using inertial measurement unit (IMU) signals has become a fundamental technology for modern health monitoring, wellness applications, and ubiquitous mobile systems. Accurate recognition of daily activities enables important use cases such as estimating physical activity levels, detecting sedentary behavior, and supporting personalized lifestyle analytics [1]. With the widespread availability of smartphones equipped with accelerometers and gyroscopes, smartphone-based HAR has gained significant attention due to its low cost and scalability compared to dedicated wearable platforms [2]. Throughout this paper, the term ‘wearable-free’ refers to the absence of dedicated external wearable sensors (e.g., smartwatches or body-worn IMU devices); the proposed approach exclusively relies on the built-in IMU of a standard smartphone carried by the participant. Recent studies further indicate that smartphone-only sensing remains a viable solution for large-scale deployment, provided that evaluation protocols adequately reflect real-world usage conditions and subject variability [3].
Despite the rapid progress of deep learning methods, robust activity recognition in real-world adult scenarios remains challenging. A key difficulty is the high variability across individuals: adults differ in movement patterns, gait characteristics, and activity execution style. In addition, smartphone placement and orientation can vary substantially (e.g., pocket, hand, bag), leading to inconsistent signal distributions even for the same activity [2]. As a result, models often experience performance degradation when tested on unseen users, making cross-subject generalization a critical requirement for practical deployment [1]. Recent empirical analyses confirm that this degradation is particularly pronounced under uncontrolled smartphone placement and naturalistic usage conditions, underscoring the importance of subject-independent evaluation protocols [3].
Another important challenge concerns the ambiguity of fine-grained activity labels. Many HAR datasets include a large number of activity categories, some of which exhibit overlapping motion signatures or appear in short transitional segments. Under sliding-window segmentation, longer windows may contain transitions between activities, which increases label noise when majority voting is applied [4]. Consequently, both the activity taxonomy and the temporal segmentation strategy strongly affect the reliability of training data and final recognition quality. Recent work further suggests that careful window-length selection is especially critical in cross-subject settings, where inter-user variability amplifies the negative impact of activity mixing within longer windows [5].
In this work, we investigated adult physical activity recognition using smartphone IMU signals under a realistic cross-subject evaluation protocol. We used the WISDM smartphone and smartwatch dataset [6] and adopted GroupKFold validation to ensure that data from each participant were strictly separated between training and testing splits. To reduce label ambiguity and improve robustness, we aggregated fine-grained activities into a reduced 6-class taxonomy. We evaluated lightweight deep learning models that can be trained on consumer-grade hardware and systematically analyzed design choices that affect performance, including window length and sensor modality selection (accelerometer vs. gyroscope). In addition, we explored whether multimodal fusion with smartwatch signals provides measurable benefits compared to smartphone-only sensing, an issue that remains open in recent smartphone-centric HAR studies [3].
Our experimental results demonstrate that smartphone-only sensing remains a strong baseline for adult activity monitoring, achieving approximately Macro-F1 ≈ 0.46 on the reduced-class protocol. Window-length analysis indicates that shorter windows (2.0 s) outperform longer windows, suggesting improved temporal localization and reduced activity mixing. Sensor ablation shows that gyroscope signals provide highly informative motion cues, and the combination of accelerometer and gyroscope yields the most balanced performance across classes. Per-class evaluation highlights that dynamic activities such as locomotion and sports are recognized reliably, while static and stairs remain challenging categories due to low signal variability and semantic overlap with locomotion-related patterns.
Contributions. The main contributions of this work are summarized as follows:
  • We conducted adult HAR experiments under a realistic cross-subject GroupKFold protocol to evaluate generalization to unseen users.
  • We proposed a reduced 6-class activity taxonomy to mitigate label ambiguity and improve robustness.
  • We provided a systematic window-length analysis (2.0 s, 4.0 s, 6.0 s) to quantify segmentation effects.
  • We performed sensor modality ablation (accelerometer-only, gyroscope-only, and combined IMU) to identify informative signals.
  • We analyzed multimodal fusion with smartwatch data and discussed practical implications for smartphone-based activity monitoring without additional wearable devices.
In particular, the study provides empirical insight into how segmentation strategy, sensor modality, and multimodal fusion affect cross-subject generalization in smartphone-based HAR systems.

2. Related Work

Human activity recognition (HAR) based on inertial sensor data has been widely studied due to its relevance to mobile health and ubiquitous computing [1]. Early smartphone-based HAR research primarily relied on handcrafted feature extraction from accelerometer signals combined with classical machine learning classifiers. These studies demonstrated that consumer-grade devices could be used for activity recognition in everyday settings [7]. However, handcrafted feature pipelines often exhibit limited robustness under cross-subject evaluation and varying sensor placement, which motivated the transition toward end-to-end deep learning approaches that learn discriminative representations directly from raw time-series data [2].
Deep learning models have subsequently become the dominant paradigm in HAR. Convolutional neural networks (CNNs) are widely adopted due to their ability to efficiently capture local temporal patterns and their suitability for deployment on mobile devices [8]. Hybrid architectures that combine CNN-based feature extraction with recurrent sequence modeling, such as long short-term memory networks (LSTMs), have also been extensively explored and shown to perform well on multimodal wearable datasets [8,9]. These approaches illustrate that deep models can learn rich temporal representations that outperform traditional handcrafted pipelines, particularly when applied to raw inertial signals [10]. More recent work continues to favor temporal convolutional designs, as they offer a favorable balance between recognition performance and computational efficiency, which is critical for on-device smartphone inference [3,11].
A persistent challenge in HAR is cross-subject generalization. Random train–test splits may substantially overestimate performance because samples from the same individual can appear in both training and evaluation sets. Cross-subject evaluation protocols, which explicitly test models on previously unseen participants, provide a more realistic assessment of generalization capability but remain challenging due to inter-person variability, behavioral differences, and heterogeneous smartphone placement and orientation [1,2,3]. As a result, subject-independent validation has become an essential criterion for evaluating the real-world applicability of smartphone-based HAR systems.
Temporal segmentation plays a central role in shaping HAR performance. Sliding-window approaches are commonly used to partition continuous sensor streams into fixed-length segments suitable for supervised learning. Window length and overlap determine the trade-off between temporal resolution and motion context. Prior studies showed that longer windows may introduce label ambiguity due to activity transitions, while excessively short windows may fail to capture sufficient discriminative information [4]. Subsequent work has reinforced the importance of careful window-length selection, particularly in cross-subject settings where motion patterns vary significantly across users [5,12].
Multimodal fusion has also been investigated as a means of improving HAR performance by combining signals from multiple devices, such as smartphones and smartwatches. Datasets such as WISDM enable systematic evaluation of smartphone–smartwatch fusion strategies for daily activity recognition [6]. Although multimodal sensing can provide complementary motion information, fusion approaches are often affected by device heterogeneity, temporal misalignment, and inconsistent user behavior. To address these issues, modality-robust training strategies, including modality dropout, have been proposed to improve resilience to missing or noisy sensor streams [13]. Nevertheless, empirical evidence indicates that early fusion does not consistently yield performance gains over smartphone-only sensing, particularly under strict cross-subject evaluation protocols [3].
Benchmark datasets have played a key role in advancing HAR research by enabling reproducible evaluation under controlled conditions. Smartphone-based datasets containing accelerometer and gyroscope recordings are widely used to analyze segmentation strategies, sensor modality contributions, and cross-subject robustness [14,15]. These studies consistently report that variations in smartphone placement and orientation have a significant impact on recognition accuracy, underscoring the need for placement-robust models in mobile HAR systems.
Recent work has also focused on the development of efficient and lightweight HAR models suitable for resource-constrained mobile devices. Compact CNN-based and residual architectures, sometimes augmented with channel re-weighting or attention-inspired mechanisms, have demonstrated competitive performance while maintaining low computational overhead [16,17]. In parallel, transformer-based time-series models have been explored for inertial HAR; however, several studies emphasize the trade-off between modeling long-range dependencies and the computational and memory constraints of on-device inference, which has motivated the development of compressed or distilled transformer variants [18,19].
Overall, prior research demonstrated the potential of deep learning for smartphone-based HAR while highlighting the importance of realistic cross-subject evaluation, appropriate temporal segmentation, sensor modality selection, and deployment-oriented model design [1,4,8]. Building on these insights, the present study focuses on smartphone-based activity recognition using built-in smartphone inertial sensors and conducts systematic ablation studies under a strict cross-subject protocol to provide practical guidance for real-world mobile sensing applications.

3. Materials and Methods

3.1. Dataset

In this study, we used the WISDM507 dataset, which contains inertial sensor recordings collected from adult participants performing a diverse set of everyday physical activities. The dataset provides multivariate time-series signals from consumer devices, including smartphone and smartwatch sensors. Each sample includes tri-axial measurements from the accelerometer (x, y, z) and gyroscope (x, y, z). The dataset includes multiple activity types ranging from locomotion-related movements (e.g., walking and jogging) to daily living actions (e.g., eating and drinking) and sport-like behaviors [6].
The original dataset contains a fine-grained set of activities labeled with short activity codes (e.g., A, B, C), which are mapped to human-readable activity names using the official activity key file. Each record is associated with a subject identifier, enabling cross-subject evaluation and preventing subject leakage between training and test sets. Benchmark datasets such as WISDM and other widely used smartphone-based HAR datasets have played a critical role in advancing inertial activity recognition research by enabling reproducible evaluation under controlled yet realistic conditions [6,14,15].

3.2. Preprocessing and Window Segmentation

Sensor streams were aligned by timestamp and segmented into fixed-length windows using a sliding-window approach. Sliding windows are commonly used in HAR to convert continuous time-series into samples suitable for supervised learning [4,12]. We applied 50% overlap between consecutive windows, a commonly adopted setting that balances sample diversity and temporal continuity in inertial HAR pipelines. Window labels were assigned using majority voting.
To reduce information leakage, normalization was performed using z-score statistics, computed only from the training split of each fold. Proper preprocessing and temporal segmentation are critical for HAR systems, as window length and overlap can significantly affect classification accuracy and robustness.
After preprocessing, each window is represented as a multichannel time-series tensor containing synchronized accelerometer and gyroscope signals. These tensors serve as input samples for the deep learning models used in this study. Longer windows may introduce label ambiguity due to activity transitions, while shorter windows may lack sufficient motion context, particularly in cross-subject scenarios with high inter-user variability [4,12].
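As a concrete sketch, the segmentation and per-fold normalization described above can be expressed as follows (a minimal NumPy illustration; the function names and the `zscore_fit` helper are ours, not part of any dataset tooling):

```python
import numpy as np

def sliding_windows(signal, labels, win_len, overlap=0.5):
    """Segment a continuous multichannel stream into fixed-length windows.

    signal: (T, C) array of synchronized IMU channels.
    labels: (T,) per-sample activity labels.
    Each window label is assigned by majority vote over its samples.
    """
    step = int(win_len * (1.0 - overlap))
    windows, window_labels = [], []
    for start in range(0, len(signal) - win_len + 1, step):
        seg_labels = labels[start:start + win_len]
        values, counts = np.unique(seg_labels, return_counts=True)
        windows.append(signal[start:start + win_len])
        window_labels.append(values[np.argmax(counts)])
    return np.stack(windows), np.array(window_labels)

def zscore_fit(train_windows):
    """Compute z-score statistics from the training split only,
    so no information leaks from the test subjects."""
    mean = train_windows.mean(axis=(0, 1), keepdims=True)
    std = train_windows.std(axis=(0, 1), keepdims=True) + 1e-8
    return mean, std
```

With 50% overlap, consecutive windows share half of their samples, which increases sample diversity without duplicating whole segments.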

3.3. Reduced 6-Class Activity Taxonomy

The original dataset includes fine-grained activity labels, some of which are difficult to separate under cross-subject evaluation due to overlapping motion signatures and short transitional segments. To reduce ambiguity and improve robustness, we grouped activities into six broader categories:
  • Locomotion
  • Stairs
  • Static
  • Eat–drink
  • Sports
  • Upper-body
This grouping supports more stable evaluation and provides clearer interpretation of performance patterns by mitigating label noise caused by activity transitions and semantic overlap. Grouping activities into higher-level categories is consistent with prior HAR studies emphasizing the importance of temporal structure and contextual aggregation for robust recognition, particularly under subject-independent evaluation protocols [4,12].
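Such a grouping amounts to a simple mapping from fine-grained labels to the six categories. The sketch below is illustrative only: the concrete activity-to-group assignments are our own assumptions, not the paper's exact mapping.

```python
# Hypothetical fine-label -> group mapping (illustrative assignments).
GROUPS = {
    "walking": "locomotion", "jogging": "locomotion",
    "stairs": "stairs",
    "sitting": "static", "standing": "static",
    "eating pasta": "eat-drink", "drinking": "eat-drink",
    "kicking": "sports", "dribbling": "sports",
    "typing": "upper-body", "folding clothes": "upper-body",
}

def to_reduced_label(fine_label: str) -> str:
    """Map a fine-grained activity name to its 6-class group."""
    return GROUPS[fine_label.lower()]
```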

3.4. Deep Learning Models

We evaluated a lightweight smartphone-only CNN baseline and a multimodal fusion model combining smartphone and smartwatch signals. CNN architectures are commonly used in HAR due to their ability to efficiently extract local temporal patterns from raw inertial signals and their suitability for deployment on resource-constrained mobile devices [8,10].
For fusion robustness, we considered modality dropout based on the ModDrop principle [13], which has been shown to improve HAR performance in the presence of missing or noisy sensor streams. Deep learning models, particularly CNN-based and hybrid CNN–LSTM architectures, have increasingly replaced handcrafted feature pipelines, demonstrating the effectiveness of end-to-end representation learning for smartphone-based HAR under both subject-dependent and cross-subject evaluation settings [2,10].
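A minimal sketch of how the ModDrop principle can be applied to two IMU streams follows (an illustrative NumPy implementation under our own assumptions about the drop rate and scheme, not the authors' exact training procedure):

```python
import numpy as np

def modality_dropout(phone, watch, p=0.3, rng=None):
    """ModDrop-style augmentation: per training sample, zero out at most
    one modality with total probability p, so the fusion model learns to
    tolerate a missing or noisy stream.

    phone, watch: (batch, channels, time) arrays; never drops both at once.
    """
    rng = rng if rng is not None else np.random.default_rng()
    r = rng.random(phone.shape[0])
    drop_phone = (r < p / 2)[:, None, None]
    drop_watch = ((r >= p / 2) & (r < p))[:, None, None]
    return phone * ~drop_phone, watch * ~drop_watch
```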
The CNN architecture used in this study follows a lightweight temporal convolutional design suitable for smartphone-based HAR. The model consists of stacked 1D convolutional layers with ReLU activation functions and max-pooling operations for temporal downsampling. Batch normalization is applied after convolutional layers to stabilize training. The extracted feature representation is passed to a fully connected classification layer producing activity probabilities. This lightweight design balances recognition performance and computational efficiency, which is important for potential deployment in smartphone-based activity monitoring scenarios.
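The described design can be sketched in PyTorch as follows. Layer widths, kernel sizes, and depth are illustrative assumptions on our part; the paper does not report exact hyperparameters at this point.

```python
import torch
import torch.nn as nn

class LightweightHARCNN(nn.Module):
    """Sketch of a lightweight temporal 1D-CNN for smartphone HAR:
    stacked Conv1d + BatchNorm + ReLU blocks with max-pooling for
    temporal downsampling, then a fully connected classifier."""

    def __init__(self, in_channels=6, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),   # collapse the temporal axis
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        # x: (batch, channels, time), e.g. 6 IMU channels per window
        return self.classifier(self.features(x))
```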

3.5. Evaluation Protocol and Metrics

We adopted 5-fold GroupKFold cross-validation to ensure cross-subject evaluation, such that samples from the same participant never appear in both training and test splits. This protocol mitigates the overestimation of performance that can occur under random splits and provides a more realistic assessment of generalization to unseen users [1,2]. Performance was reported using Accuracy and Macro-F1. Macro-F1 is particularly suitable for HAR tasks with class imbalance because it equally weights the performance of all activity classes regardless of their frequency.
Five folds were selected to maintain sufficient subject diversity within each test split while keeping the evaluation computationally manageable.
Class imbalance is a common characteristic of HAR datasets because different activities occur with unequal frequency and duration. In this study, the impact of class imbalance was primarily addressed through the use of the Macro-F1 metric, which assigns equal importance to each activity class and therefore provides a balanced evaluation of recognition performance. Although additional techniques such as class weighting or resampling could further mitigate imbalance effects, the present work focused on evaluating model behavior under the natural class distribution of the dataset. Exploring additional imbalance-handling strategies represents an interesting direction for future research.
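The protocol above can be sketched with scikit-learn's GroupKFold, which guarantees that no subject's windows appear in both splits of any fold. The helper below is our own illustration; `train_and_predict` stands in for any model pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

def cross_subject_eval(X, y, subjects, train_and_predict, n_splits=5):
    """Subject-wise k-fold evaluation with Macro-F1.

    X: (N, ...) window tensors; y: (N,) labels;
    subjects: (N,) subject IDs used as GroupKFold groups.
    train_and_predict: callable (X_tr, y_tr, X_te) -> y_pred.
    Returns mean and std of Macro-F1 across folds.
    """
    gkf = GroupKFold(n_splits=n_splits)
    scores = []
    for tr, te in gkf.split(X, y, groups=subjects):
        y_pred = train_and_predict(X[tr], y[tr], X[te])
        scores.append(f1_score(y[te], y_pred, average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```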

4. Experimental Setup

4.1. Implementation

All experiments were implemented in Python 3.11 using PyTorch 2.2.1. The same preprocessing pipeline, model configurations, and evaluation protocol were applied consistently across all experiments to ensure fair comparison.

4.2. Hardware

Training was performed on a consumer-grade laptop with an NVIDIA GeForce RTX 4050 Laptop GPU (8 GB VRAM), demonstrating that the proposed models can be trained without access to server-class or high-memory GPUs.

4.3. Training Configuration

Models were trained using the AdamW optimizer (PyTorch implementation) with a learning rate of 1 × 10−3, batch size 256 for training, and batch size 512 for evaluation. Each experiment was trained for 10 epochs using cross-entropy loss. Mixed-precision training was enabled when CUDA was available to improve computational efficiency.
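The reported configuration can be sketched as a PyTorch training loop (an illustration under stated assumptions; the 256/512 batch sizes are set on the DataLoader and omitted here, and the gradient-scaler usage is only active when CUDA is present):

```python
import torch
from torch import nn

def train(model, loader, epochs=10, lr=1e-3):
    """Sketch of the reported setup: AdamW (lr 1e-3), cross-entropy
    loss, 10 epochs, mixed precision when CUDA is available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    use_amp = device == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            with torch.autocast(device_type=device, enabled=use_amp):
                loss = loss_fn(model(xb), yb)
            scaler.scale(loss).backward()
            scaler.step(opt)
            scaler.update()
    return model
```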

4.4. Experiments

We conducted several systematic experiments to evaluate model performance under different conditions:
  • Model comparison: smartphone-only CNN versus multimodal fusion using smartphone and smartwatch signals.
  • Window-length study: evaluating the effect of different sliding-window sizes (2.0 s, 4.0 s, 6.0 s) on recognition performance.
  • Sensor modality ablation: assessing models trained on accelerometer-only, gyroscope-only, and combined accelerometer–gyroscope inputs.
  • Per-class analysis: using class-wise F1-scores and confusion matrices to analyze error patterns and activity-specific behavior.
These experimental designs follow established best practices in HAR research, emphasizing realistic cross-subject evaluation, temporal segmentation analysis, sensor modality assessment, and deployment-oriented model efficiency [4,15,16].

4.5. Deployment Feasibility

The proposed CNN architecture is lightweight and computationally efficient, consisting of standard convolutional layers commonly used in smartphone-based HAR systems. The model was successfully trained on a consumer-grade laptop GPU, demonstrating modest computational requirements.
Given its relatively low complexity and use of standard operations, the model can be deployed on smartphones using mobile deep learning frameworks such as TensorFlow Lite or PyTorch Mobile. Similar CNN-based HAR models with comparable architectures were previously implemented on mobile devices, confirming the practical feasibility of smartphone-based activity recognition without requiring external wearable sensors [10,16].
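As an illustration of the PyTorch Mobile route mentioned above, a trained model can be exported via TorchScript tracing. This is a hypothetical sketch with a toy stand-in model; the input shape assumes 6 IMU channels and a 2 s window (40 samples at an assumed 20 Hz rate).

```python
import torch
from torch import nn

# Toy stand-in for the trained CNN; any traced nn.Module works the same way.
model = nn.Sequential(
    nn.Conv1d(6, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 6),
)
model.eval()

# Trace with a representative input: one window of 6 IMU channels.
example = torch.randn(1, 6, 40)
scripted = torch.jit.trace(model, example)
scripted.save("har_cnn.pt")  # loadable by PyTorch Mobile runtimes
```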

5. Results and Discussion

This section reports cross-subject HAR performance for adult activity monitoring. Results are averaged across GroupKFold folds and reported as mean ± standard deviation. The analysis covers model comparison, window-length effects, sensor modality ablation, and per-class performance, with findings interpreted in the context of prior work on smartphone-based HAR and cross-subject generalization [2,4,15].

5.1. Model Comparison

Table 1 presents the comparison between the smartphone-only CNN baseline and the multimodal fusion model under the reduced 6-class cross-subject GroupKFold protocol.
The smartphone-only CNN achieved the best overall results, reaching Macro-F1 ≈ 0.46. This level of performance is consistent with prior studies using strict cross-subject evaluation protocols, where performance typically decreases compared to random train–test splits due to increased inter-subject variability. In contrast, multimodal fusion with smartwatch signals did not consistently improve performance. This outcome suggests that naive early fusion may be affected by device heterogeneity and temporal misalignment, limiting the benefit of additional wearable sensors [6,13]. These findings further indicate that multimodal fusion does not necessarily guarantee improved performance under strict subject-independent evaluation, highlighting the importance of robust modality integration strategies in cross-user HAR scenarios. These results support the feasibility of smartphone-only sensing for scalable adult activity monitoring, confirming prior findings that smartphone-based HAR can achieve competitive performance without relying on additional wearables [10,15].
Strict cross-subject evaluation represents a significantly more challenging scenario compared to subject-dependent protocols commonly reported in the literature. In subject-dependent evaluation, samples from the same individuals may appear in both training and test sets, leading to optimistic performance estimates. In contrast, subject-independent evaluation requires models to generalize to entirely unseen users, which typically results in lower recognition accuracy due to inter-subject variability, differences in motion patterns, and device placement variability. Therefore, the performance observed in this study should be interpreted in the context of realistic deployment conditions rather than idealized experimental setups.
To better contextualize the observed performance, Table 2 provides a structured comparison of representative HAR studies grouped by validation protocol. The studies are divided into two groups: (A) studies using user-dependent or random train–test splits, and (B) studies employing strict cross-subject evaluation (LOSO or GroupKFold). This distinction is critical, as the choice of evaluation protocol has a direct and substantial impact on reported performance metrics.
Studies in Group A consistently report F1-scores above 0.98. However, these results were obtained under user-dependent conditions, where samples from the same individual may have appeared in both training and test sets. Under such protocols, the model is effectively evaluated on familiar motion patterns rather than generalizing to entirely unseen users, which leads to systematically optimistic performance estimates that do not reflect real-world deployment conditions [1,2].
In contrast, Group B illustrates the substantial performance reduction observed under strict cross-subject evaluation. Soleimani and Nazerfard [20] directly quantified this effect on the Opportunity dataset: without transfer learning, models trained on one subject and tested on another achieve Weighted F1 of only 0.21–0.48, representing a performance drop of 22–47% compared to supervised training on the same subject’s data. This directly demonstrates that cross-subject generalization is a fundamentally challenging problem, independent of model architecture. Logacjov et al. [21] similarly reported that on the HARTH dataset, recorded with two fixed-placement dedicated accelerometers under free-living conditions, the best model (SVM) achieves Macro-F1 = 0.81 under LOSO, but with a standard deviation of ±0.18, reflecting critical per-class disparities: stairs-related classes reach only 40–64% per-class F1 even under these favorable sensing conditions. Hoelzemann et al. [22] further confirmed this pattern in a sports context: under LOSO evaluation across 24 basketball players, sport-specific micro-activities such as rebound and layup yield Macro-F1 as low as 0.20–0.25 during game sessions, due to high intra-class variability and naturalistically uncontrolled movement execution. Furthermore, Garcia-Gonzalez et al. [3] demonstrated that even under a user-independent 7-fold cross-validation on WISDM, the best model achieved only F1 = 84.6%, confirming that WISDM is an inherently more challenging benchmark compared to simpler locomotion-focused datasets, regardless of the evaluation protocol used.
The Macro-F1 = 0.46 achieved in the present study falls within the lower-to-mid range of performance observed across Group B studies (0.21–0.81) and is substantially above the expected random baseline for a six-class classification task. Importantly, the present study operates under conditions that are arguably more challenging than those in Group B: WISDM507 contains 51 subjects with uncontrolled smartphone placement and orientation, a heterogeneous set of 18 fine-grained activity labels grouped into 6 classes that include low-discriminability categories (Stairs, Eat-drink, Upper-body), and uses a strict GroupKFold protocol that ensures complete subject separation across all folds. By contrast, HARTH uses two fixed-placement dedicated accelerometers with professional annotations, and the Opportunity dataset contains only four subjects in a laboratory setting. Under these more demanding conditions, the present study’s Macro-F1 = 0.46 represents a competitive and meaningful result that is consistent with the current state of cross-subject HAR benchmarking.
These results confirm that the obtained performance level is comparable to previously reported smartphone-based HAR studies when accounting for differences in evaluation protocols.

5.2. Window-Length Analysis

The effect of sliding-window length on smartphone-only CNN performance under the reduced 6-class protocol is summarized in Table 3.
Shorter windows achieved the best performance, with 2 s segmentation outperforming 4 s and 6 s windows. This finding is consistent with prior work showing that window length strongly affects HAR performance and that longer windows may increase label ambiguity due to activity transitions [4,12]. Short windows provide better temporal localization and reduce within-window activity mixing, which is particularly relevant for smartphone-only setups where sensor placement can vary across users [15].
This result likely reflects a balance between capturing sufficient motion context for recognizing activity patterns and avoiding mixing of multiple activities within longer windows, especially during activity transitions.

5.3. Sensor Modality Ablation

Table 4 reports the results of the sensor modality ablation study for the smartphone-only CNN under the reduced 6-class protocol.
Gyroscope-only input achieved the highest accuracy, highlighting the importance of rotational motion cues for HAR. Accelerometer-only performance was lower, indicating that translational acceleration alone is insufficient for robust recognition across subjects. The combined accelerometer + gyroscope configuration provided the best balanced Macro-F1, supporting the use of multi-sensor IMU input as a practical strategy for capturing complementary motion information under cross-subject evaluation [2]. These findings align with prior research emphasizing the value of multi-modal smartphone sensor fusion for capturing complementary motion information, even in a wearable-free scenario [8,10].

5.4. Per-Class Performance

Per-class recognition performance for the best configuration (smartphone-only CNN with a 2 s window) is summarized in Table 5.
The class-wise confusion patterns of the smartphone-only CNN are visualized in the normalized confusion matrix shown in Figure 1. Figure 2 further illustrates the distribution of per-class F1-scores for the smartphone-only CNN under the reduced 6-class protocol.
Per-class analysis shows that dynamic activities, such as locomotion and sports, are recognized reliably, while static and stairs activities remain challenging due to subtle motion patterns and overlap with locomotion-like behaviors. These confusion patterns are consistent with prior HAR studies and highlight the difficulty of recognizing low-movement or transitional activities using smartphone-only inertial data [1,15].
The confusion matrix reveals that stairs activities are often misclassified as locomotion, which can be attributed to similar periodic lower-body movements captured by smartphone sensors. Similarly, static activities exhibit confusion with low-intensity upper-body or transitional movements, particularly under variable smartphone placement conditions. These observations highlight the inherent difficulty of distinguishing low-motion or transitional activities using smartphone-only inertial data without explicit contextual information.
Figure 2 further emphasizes the imbalance in per-class recognition performance, illustrating that classes with higher motion intensity benefit more from short-window temporal segmentation, while low-variance activities remain underrepresented in discriminative feature space. These findings are consistent with prior HAR studies reporting reduced recognition accuracy for static and stairs-related activities under smartphone-based sensing setups [1,15].
Overall, the per-class results suggest that while smartphone-only CNN models can reliably recognize dynamic activities, improving performance for low-motion and transitional classes may require enhanced temporal modeling, activity context awareness, or placement-robust feature representations.
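The per-class F1-scores in Table 5 and the row-normalized confusion matrix in Figure 1 can be computed directly from predicted and true labels, as sketched below. This is a self-contained NumPy illustration with toy labels; the function names are our own, not taken from the study's codebase.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """Per-class F1 from integer labels; returns an array of length n_classes."""
    f1 = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1

def normalized_confusion(y_true, y_pred, n_classes):
    """Confusion matrix with each row normalized to sum to 1 (as in Figure 1)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    rows = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, rows, out=np.zeros_like(cm), where=rows > 0)

# Toy example with 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
f1 = per_class_f1(y_true, y_pred, 3)
macro_f1 = f1.mean()   # Macro-F1 is the unweighted mean of per-class F1
```

Because Macro-F1 averages classes equally, weak classes such as stairs (F1 = 0.140) pull the overall score down even when dynamic classes are recognized well, which is why it is the primary metric under class imbalance.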

6. Conclusions

This study investigated smartphone-based adult physical activity recognition using inertial signals under a strict cross-subject GroupKFold validation protocol on the WISDM dataset [6]. In contrast to random data splits that may overestimate performance, the adopted evaluation setup provides a realistic assessment of generalization to unseen users, which is critical for real-world mobile HAR applications.
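The subject-independent protocol can be illustrated with scikit-learn's GroupKFold, which guarantees that no subject contributes windows to both the training and test folds. The toy shapes and variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 12 windows from 6 subjects (2 windows each).
X = np.random.randn(12, 40, 6)          # (windows, samples, channels)
y = np.random.randint(0, 6, size=12)    # one of 6 grouped activity classes
subjects = np.repeat(np.arange(6), 2)   # subject id per window

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    # No subject appears in both splits: evaluation is on unseen users.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```

A random split, by contrast, would place windows from the same subject on both sides, letting the model exploit subject-specific motion signatures and inflating reported accuracy.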
A lightweight smartphone-only CNN achieved competitive performance on the reduced 6-class activity taxonomy, reaching Macro-F1 ≈ 0.46, indicating that smartphone-based sensing represents a practical solution using built-in smartphone sensors for adult activity monitoring. Multimodal fusion with smartwatch data did not consistently improve performance, suggesting that device heterogeneity and temporal misalignment may limit the benefits of naive early fusion in practical scenarios.
Window-length experiments showed that shorter temporal segmentation (2.0 s) provides improved recognition quality compared to longer windows, consistent with prior findings that window size significantly influences HAR performance and that longer windows can introduce label ambiguity due to activity transitions [4]. Sensor modality ablation further highlighted the importance of gyroscope signals, with combined accelerometer and gyroscope input offering the best balance between accuracy and Macro-F1. This supports the use of multi-sensor IMU data as a practical means of capturing complementary motion information under cross-subject evaluation.
Despite these encouraging results, certain activity classes, particularly stairs and static, remain challenging due to subtle motion characteristics and overlap with other activity patterns. These limitations are consistent with previous smartphone-based HAR studies and underline the need for more robust temporal modeling and placement-invariant representations. Future work will therefore focus on multi-scale temporal modeling, orientation- and placement-invariant feature learning, and more advanced fusion strategies to further improve recognition robustness under unconstrained real-world conditions.

7. Limitations and Future Work

Although the proposed approach achieves robust performance for several activity categories, stairs and static remain challenging due to low signal variability and overlap with locomotion-related patterns. Multimodal fusion did not outperform smartphone-only sensing, suggesting that alignment-aware fusion and attention-based modality weighting may be required. Future work will explore multi-scale temporal architectures, frequency-domain representations, and deployment-oriented optimization such as quantization and model compression. Additionally, real-time deployment on a physical smartphone device and evaluation of inference latency under on-device constraints represent an important direction that would further validate the practical applicability of the proposed approach.
Furthermore, the current study was evaluated on a single benchmark dataset (WISDM507). Although this dataset provides a challenging and realistic evaluation setting, future work should validate the proposed approach across additional smartphone-based HAR datasets to further assess cross-dataset generalizability.

Author Contributions

Conceptualization, M.T. and T.T.; methodology, M.T. and A.Z.; software, M.T., A.Z. and T.T.; validation, M.T. and T.T.; formal analysis, A.D.; investigation, M.T. and A.Z.; resources, T.T. and A.Z.; data curation, T.T. and A.Z.; writing–original draft preparation, M.T., A.Z., T.T. and A.D.; writing–review and editing, M.T.; visualization, A.Z. and T.T.; supervision, A.D.; project administration, M.T.; funding acquisition, A.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP32726323).

Data Availability Statement

The data presented in this study are openly available in the WISDM Smartphone and Smartwatch Activity and Biometrics Dataset at https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset (accessed on 1 February 2026). These data were used in this study and are cited in Reference [6].

Acknowledgments

The authors would like to thank colleagues and collaborators for valuable feedback and support. We also acknowledge the open research community for providing public datasets and tools enabling reproducible research in human activity recognition. During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-based language model, version GPT-5.3) for language editing and improving the clarity of the text. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
HAR: Human Activity Recognition
IMU: Inertial Measurement Unit
CNN: Convolutional Neural Network
LSTM: Long Short-Term Memory
TCN: Temporal Convolutional Network
F1: F1-score
Macro-F1: Macro-averaged F1-score
GPU: Graphics Processing Unit
VRAM: Video Random Access Memory

References

1. Zhang, S.; Wang, L.; Zhu, J. Deep Learning in Human Activity Recognition with Wearable Sensors: Advances and Challenges. Sensors 2022, 22, 1476.
2. Sousa Lima, W.; Souto, E.; El-Khatib, K.; Jalali, R.; Gama, J. Human Activity Recognition Using Inertial Sensors in a Smartphone: An Overview. Sensors 2019, 19, 3213.
3. Garcia-Gonzalez, D.; Rivero, D.; Fernandez-Blanco, E.; Luaces, M.R. Deep Learning Models for Real-Life Human Activity Recognition from Smartphone Sensor Data. Internet Things 2023, 24, 100925.
4. Baños, O.; Gálvez, J.M.; Damas, M.; Pomares, H.; Rojas, I. Window Size Impact in Human Activity Recognition. Sensors 2014, 14, 6474–6499.
5. Mennella, C.; Esposito, M.; De Pietro, G.; Maniscalco, U. Multiscale Activity Recognition Algorithms to Improve Cross-Subjects Performance Resilience in Rehabilitation Monitoring Systems. Comput. Methods Programs Biomed. 2025, 267, 108792.
6. Weiss, G.M. WISDM Smartphone and Smartwatch Activity and Biometrics Dataset. UCI Mach. Learn. Repos. 2019.
7. Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity Recognition Using Cell Phone Accelerometers. ACM SIGKDD Explor. Newsl. 2011, 12, 74–82.
8. Ordóñez, F.J.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016, 16, 115.
9. Hernandez, N.; Ben-Abdallah, F.; Mazzara, M.; Dragoni, N. Human Activity Recognition Using Deep Learning: A Survey. Sensors 2020, 20, 155.
10. Ronao, C.A.; Cho, S.-B. Human Activity Recognition with Smartphone Sensors Using Deep Learning Neural Networks. Expert Syst. Appl. 2016, 59, 235–244.
11. Sekaran, S.R.; Han, P.Y.; Yin, O.S. Smartphone-Based Human Activity Recognition Using a Lightweight Multiheaded Temporal Convolutional Network. Expert Syst. Appl. 2023, 227, 120132.
12. Bulling, A.; Blanke, U.; Schiele, B. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. ACM Comput. Surv. 2014, 46, 33.
13. Neverova, N.; Wolf, C.; Taylor, G.; Nebout, F. ModDrop: Adaptive Multi-Modal Gesture Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1692–1706.
14. Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A Public Domain Dataset for Human Activity Recognition Using Smartphones. ESANN 2013, 3, 437–442.
15. Shoaib, M.; Bosch, S.; Incel, O.D.; Scholten, H.; Havinga, P.J.M. Complex Human Activity Recognition Using Smartphone and Wrist-Worn Motion Sensors. Sensors 2016, 16, 426.
16. San-Segundo, R.; Gil-Martín, M.; Díaz-Morcillo, A.; Montero, J.M. Human Activity Recognition Using a Smartwatch and a Smartphone. Pattern Recognit. Lett. 2018, 119, 22–29.
17. Mekruksavanich, S.; Jitpattanakul, A. Efficient and Explainable Human Activity Recognition Using Deep Residual Network with Squeeze-and-Excitation Mechanism. Appl. Syst. Innov. 2025, 8, 57.
18. Lamaakal, I.; Yahyati, C.; Maleh, Y.; El Makkaoui, K.; Ouahbi, I.; Abd El-Latif, A.A.; Zomorodi, M.; Abd El-Rahiem, B. A Tiny Inertial Transformer for Human Activity Recognition via Multimodal Knowledge Distillation and Explainable AI. Sci. Rep. 2025, 15, 42335.
19. Zhou, H.; Zhang, X.; Feng, Y.; Zhang, T.; Xiong, L. Efficient Human Activity Recognition on Edge Devices Using DeepConv LSTM Architectures. Sci. Rep. 2025, 15, 13830.
20. Soleimani, E.; Nazerfard, E. Cross-Subject Transfer Learning in Human Activity Recognition Systems Using Generative Adversarial Networks. arXiv 2019, arXiv:1903.12489.
21. Logacjov, A.; Bach, K.; Kongsvold, A.; Bårdstu, H.B.; Mork, P.J. HARTH: A Human Activity Recognition Dataset for Machine Learning. Sensors 2021, 21, 7853.
22. Hoelzemann, A.; Romero, J.L.; Bock, M.; Van Laerhoven, K.; Lv, Q. Hang-Time HAR: A Benchmark Dataset for Basketball Activity Recognition Using Wrist-Worn Inertial Sensors. Sensors 2023, 23, 5879.
Figure 1. Normalized confusion matrix for the smartphone-only CNN model under the reduced 6-class protocol (window = 2.0 s).
Figure 2. Per-class F1-score distribution for the smartphone-only CNN model under the reduced 6-class protocol (window = 2.0 s).
Table 1. Model comparison on the reduced 6-class protocol (cross-subject GroupKFold).

| Experiment | Input/Setting | Accuracy (Mean ± Std) | Macro-F1 (Mean ± Std) |
|---|---|---|---|
| CNN phone-only (baseline) | 4.0 s window, phone accel + gyro (6 ch) | 0.4716 ± 0.0596 | 0.4626 ± 0.0408 |
| FusionTCN + ModDrop (p = 0.3) | 4.0 s window, phone + watch (12 ch) | 0.4189 ± 0.0348 | 0.4074 ± 0.0520 |
Table 2. Comparison of HAR studies by validation protocol, primary metric, and reported performance. Group A: user-dependent protocols. Group B: strict cross-subject protocols. Studies in Group B are directly comparable to the present work.

| Study | Dataset | Validation Protocol | Classes | Activity Types | Primary Metric | Value | Key Note |
|---|---|---|---|---|---|---|---|
| Group A: user-dependent/random-split protocols (for comparison only) | | | | | | | |
| Ronao & Cho [10] (Expert Syst. Appl., 2016) | UCI-HAR (30 subjects, smartphone) | User-dep. split (70% train/30% test) | 6 | Walk, Upstairs, Downstairs, Sit, Stand, Laying | Accuracy | 95.75% | No subject separation; locomotion-focused |
| Mekruksavanich & Jitpattanakul [17] (ASI, 2025) | WISDM v1.1 (51 subjects, smartphone) | 5-fold CV (user-dep., no subject separation) | 6 | Walk, Jog, Upstairs, Downstairs, Sit, Stand | Accuracy & F1 | 98.78%; F1: 98.09% | Same WISDM dataset; user-dep. split only |
| Garcia-Gonzalez et al. [3] (IoT, 2023) | WISDM (smartphone) | 7-fold CV (user-independent) | 6 | Walk, Jog, Upstairs, Downstairs, Sit, Stand | F1 | 84.6% | User-indep. but only 6 locomotion classes |
| Group B: strict cross-subject protocols (LOSO/GroupKFold), directly comparable to the present study | | | | | | | |
| Soleimani & Nazerfard [20] (arXiv, 2019) | Opportunity Challenge (4 subjects, body-worn IMU) | Cross-subject (train on one subject, test on another) | 6 | ADL micro-activities: Relaxing, Coffee time, Sandwich, etc. | Weighted F1 (no transfer) | 0.21–0.48 | 22–47% perf. drop vs. supervised; even with GAN: 0.49–0.73 |
| Logacjov et al. [21] (Sensors, 2021) | HARTH (22 subjects, thigh + back accel., fixed placement) | LOSO (22 subjects, free-living) | 12 | Free-living daily activities incl. stairs asc./desc., cycling, etc. | Macro-F1 (best: SVM) | 0.81 (±0.18) | Stairs: 40–64% per-class F1; SD = ±0.18 across classes |
| Hoelzemann et al. [22] (Sensors, 2023) | Hang-Time HAR (24 players, wrist IMU, 2 countries) | LOSO (24 players, game + drill sessions) | 10 | Basketball-specific: Dribble, Pass, Layup, Rebound, Run, etc. | Macro-F1 (game session) | ~0.25 (sport classes) | Rebound & layup < 0.20; high intra-class variability |
| Present study | | | | | | | |
| Present study (CNN, smartphone-only) | WISDM507 (51 subjects, smartphone IMU, free placement, 18 orig. labels) | GroupKFold strict (subject-indep., 5 folds) | 6 (grouped from 18) | Locomotion, Stairs, Static, Eat–drink, Sports, Upper-body | Macro-F1 | 0.46 | Random baseline typically < 0.20 for six-class tasks; full subject separation; free smartphone placement |
Table 3. Effect of window length on smartphone-only CNN performance (reduced 6-class protocol).

| Window Length | Accuracy (Mean ± Std) | Macro-F1 (Mean ± Std) |
|---|---|---|
| 2.0 s | 0.4595 ± 0.0439 | 0.4571 ± 0.0443 |
| 4.0 s | 0.4568 ± 0.0553 | 0.4473 ± 0.0525 |
| 6.0 s | 0.4451 ± 0.0535 | 0.4297 ± 0.0558 |
Table 4. Sensor modality ablation study for smartphone-only CNN (reduced 6-class protocol).

| Input Modality | Accuracy (Mean ± Std) | Macro-F1 (Mean ± Std) |
|---|---|---|
| accel-only (3 ch) | 0.4030 ± 0.0603 | 0.3997 ± 0.0502 |
| gyro-only (3 ch) | 0.5030 ± 0.0467 | 0.4491 ± 0.0417 |
| accel + gyro (6 ch) | 0.4534 ± 0.0559 | 0.4535 ± 0.0643 |
Table 5. Per-class F1-scores for the best configuration (CNN phone-only, window = 2.0 s).

| Class | F1-Score |
|---|---|
| locomotion | 0.512 |
| eat–drink | 0.462 |
| sports | 0.452 |
| upper-body | 0.325 |
| static | 0.151 |
| stairs | 0.140 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

