Next Article in Journal
Comparative Analysis of Markerless Motion-Capture Models for Assessing Football Kinematics During 30 m Long-Pass Tasks
Previous Article in Journal
Eye-Tracking-Based Evaluation of Visual Search Efficiency in Simulated VR Menu Interfaces: Effects of Card Layout Structure and Target Spatial Quadrant
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

From IMU Streams to Real-Time Decisions: Past-Only Next-Window Badminton Action Prediction

1
College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2
College of Mechanical Engineering, Sichuan University, Chengdu 610065, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(12), 3651; https://doi.org/10.3390/s26123651
Submission received: 1 April 2026 / Revised: 18 May 2026 / Accepted: 5 June 2026 / Published: 8 June 2026
(This article belongs to the Section Intelligent Sensors)

Abstract

We study real-time next-window badminton action prediction from wearable IMU streams where the system must predict the action label of the upcoming 100 ms window using past-only (causal) information. To handle severe class imbalance in continuous streams, we employ window-level downsampling of the dominant background class and compress multi-sensor time/frequency features using PCA before temporal modeling. We evaluate the full pipeline under a hop-based streaming protocol and show that our BiLSTM + MHSA model achieves high recognition performance (test accuracy 96.36%, Macro-F1 95.82%) while remaining deployable in real time, reaching 58.20 windows/s end to end (including preprocessing), i.e., 5.82× the real-time requirement (10 windows/s under a 100 ms output interval), on a Windows PC with an NVIDIA RTX 3080 GPU. These results support low-latency applications such as live coaching feedback and tactical analytics.

1. Introduction

Wearable sensing has become a practical foundation for fine-grained sports analytics, enabling an automated understanding of athletes’ actions, tactics, and biomechanics from lightweight on-body sensors. In badminton, inertial measurement units (IMUs) can capture fast and subtle motion cues of strokes, and recent datasets and studies have demonstrated the feasibility of recognizing rich shot categories from multi-sensor IMU streams. In this paper, we focus on a badminton IMU dataset containing 11 shot types plus an “other” class, recorded at 100 Hz with five IMU placements (lower, upper, left foot, right foot, and racket), yielding 30 raw channels (three-axis accelerometer + three-axis gyroscope per device).
Beyond offline recognition, many real-world applications (e.g., real-time coaching, tactical feedback, and downstream control) require predictive inference: the system should anticipate what action is happening in the immediate future rather than only classifying the past. We therefore study next-window action prediction: given a history of past windows, the model predicts the action label of an upcoming window using a strict past-only (causal) formulation. Concretely, with a 100 Hz IMU stream, the system outputs one prediction every 100 ms, enabling continuous online inference without accessing future observations. This hop-based protocol requires a minimum throughput of 10 windows/s for real-time operation; in our end-to-end stream-replay benchmark (feature extraction, PCA, and model inference), the pipeline reaches 58.20 windows/s (5.82× real time) on a Windows PC equipped with an NVIDIA RTX 3080 GPU. This setting naturally aligns with online deployment, where the current/future window is not fully observed when the decision must be made.
Next-window prediction for badminton IMU is challenging for several reasons. First, badminton motions are highly dynamic with abrupt state transitions; discriminative patterns may be short-lived while still depending on longer-term context. Second, many strokes exhibit similar short-term signatures (e.g., clear vs. smash), making long-range temporal cues important for disambiguation. Third, the data distribution is strongly imbalanced: the background “other” class can dominate the stream, while rare strokes appear sparsely, which can bias training and degrade minority-class performance. Finally, obtaining accurate frame-/window-level labels for such high-frequency sensor data is labor-intensive, motivating learning and labeling strategies that reduce human annotation cost.
To address these challenges, we propose an LSTM-based pipeline for IMU next-window prediction. We first transform each IMU window into a compact representation via multi-channel time/frequency-domain features (13 features per channel) and apply standardization followed by PCA for dimensionality reduction, yielding an m-dimensional embedding per window. We then construct a past-only sequence of length H and feed it into a temporal model combining BiLSTM-based sequence encoding with multi-head attention and an MLP classifier head. This design targets both short-term variations and long-term dependencies, matching the nature of badminton motion streams. In addition, to alleviate labeling cost, we adopt a self-supervised labeling approach derived from LIMU-BERT-style IMU representation learning, which can generate reliable labels and significantly reduce manual annotation overhead.
The main contribution of this paper is a complete causal prediction pipeline for badminton IMU streams. We formulate the task as strict past-only next-window prediction, where the system outputs one prediction every 100 ms without accessing future observations. To support this setting, we combine multi-channel time/frequency features, PCA-based compression, and a lightweight BiLSTM + MHSA temporal encoder that captures both short-term stroke dynamics and longer-range motion context. We further evaluate deployability with a full end-to-end streaming benchmark, including feature extraction, standardization, PCA, and model inference, and show that the pipeline reaches 58.20 windows/s, or 5.82× the real-time requirement, on a Windows PC with an NVIDIA RTX 3080 GPU. Finally, because continuous badminton streams are dominated by the other class, we incorporate window-level downsampling and ablation analyses to clarify how imbalance handling, PCA dimensionality, and attention affect prediction robustness, especially at longer horizons.
The remainder of the paper is organized to make this pipeline explicit. Section 3 describes the dataset, preprocessing, temporal model, and labeling strategy, while Section 4 evaluates prediction accuracy, real-time feasibility, calibration, and ablation results. This organization connects each methodological component to the deployment requirements of low-latency badminton analytics.

Abbreviations

For clarity, the main acronyms used in this paper are summarized as follows: IMU, inertial measurement unit; PCA, principal component analysis; LSTM, long short-term memory; BiLSTM, bidirectional LSTM; MHSA, multi-head self-attention; MLP, multilayer perceptron; ECE, expected calibration error; UWB, ultra-wideband; IoU, intersection over union; GT, ground truth; and PR, precision–recall.

2. Related Work

2.1. Wearable Sensing for Badminton Analytics

Wearable IMU sensing has been widely adopted for sports motion analysis due to its low cost, portability, and high temporal resolution. For badminton, prior work has demonstrated effective stroke recognition from wearable IMUs (and sometimes additional modalities such as UWB) [1,2,3,4,5,6]. In parallel, forecasting in badminton has been studied at the match/event level, e.g., movement forecasting [7] and rally-wise behavior imitation from offline match trajectories [8]. More broadly, wearable inertial sensing has also been explored for anticipatory/predictive inference beyond badminton, including motion intention prediction [9], forecasting of future gait patterns from IMUs [10], and online frameworks that couple action recognition with motion prediction for early risk assessment [11].
This paper builds on the publicly released badminton wearable-sensing dataset introduced by Van Herbruggen et al. [1]. Their work collected synchronized IMU/UWB recordings from badminton players and demonstrated the feasibility of wearable-based badminton shot recognition under the original dataset protocol. In that protocol, the goal was to recognize badminton actions from the available sensor recordings using the dataset’s original sensing configuration and label definition, providing an important foundation for subsequent badminton IMU studies. Building on this dataset, we reformulate the problem from offline shot recognition to causal next-window prediction. Specifically, our setting differs in sensing configuration, class taxonomy, task definition, and evaluation protocol: we use the five-IMU subset and predict the upcoming IMU window from past windows only under a streaming protocol. Therefore, the recognition accuracy reported by Van Herbruggen et al. [1] is treated as related context rather than as a head-to-head baseline; a controlled numerical comparison would require rerunning the prior recognition model under the same sensor subset, 12-class label set, past-only input constraint, next-window target definition, and train/validation/test split. Compared with these lines of work, our focus is a strictly causal sensor-stream setting that directly supports low-latency online feedback.

2.2. Real-Time Constraints and Temporal Modeling for Wearable IMU Streams

In real-time wearable IMU systems, low-latency inference under strict causal constraints is a primary requirement: models must process streaming signals online and output stable predictions within tight timing budgets. Recent reviews of intelligent wearables summarize multiple real-time deployment cases for motion-intent and biomechanical inference [12,13]. Capturing both short-term discriminative cues and long-range context is crucial for highly dynamic motions such as badminton strokes. The Long-Short Term Memory (LSTM) [14,15] framework was proposed to separate fine-grained variations from longer contextual dependencies in time series under complex interventions, providing a principled approach to modeling dynamics. Real-time wearable pipelines have also demonstrated simultaneous action recognition and whole-body motion/dynamics prediction in online settings [16]. Attention-based LSTM sensor-fusion studies further show robust performance under challenging NLOS conditions, reinforcing the practical value of temporal modeling for real-time wearable systems [17]. Recent multimodal wearable forecasting research further reports strong continuous multi-step-ahead biomechanical prediction using a transformer-style encoder–decoder architecture (KsFormer) [18]. Building on this motivation, we adopt an LSTM-inspired design that leverages BiLSTM encoding and multi-head attention to aggregate historical information, and we tailor it to the strict past-only next-window prediction objective.

2.3. Self-Supervised Learning and Automatic Labeling for IMU Data

Self-supervised learning (SSL) has emerged as a transformative paradigm for IMU sensing, enabling robust representation learning from massive unlabeled streams and significantly mitigating the reliance on labor-intensive manual annotation [19,20]. The foundational work of LIMU-BERT [21] first demonstrated that masked modeling objectives could effectively unlock the latent temporal patterns in IMU sequences. Building upon this, recent universal models such as oneHAR [22] have further scaled these representations across diverse sensor modalities and datasets. More recently, bio-inspired SSL frameworks [23] have refined this process by incorporating movement-specific inductive biases into the pre-training phase. Inspired by these advancements in automated labeling and feature extraction [24], we propose a self-supervised labeling strategy specifically tailored for badminton dynamics. This approach generates high-fidelity, window-level labels that demonstrate strong empirical alignment with ground truth across various subjects, ensuring both data efficiency and labeling precision.

3. Research Methods

This section describes the complete construction of the proposed next-window prediction pipeline. We first specify the dataset source and the IMU-only experimental subset; then, we define how continuous sensor streams are converted into sliding-window samples under a past-only constraint. We next describe the treatment of class imbalance, feature standardization and PCA compression as well as the temporal prediction model used to map historical IMU windows to future action labels. Finally, we introduce the self-supervised labeling component used to reduce annotation effort and support scalable window-level supervision.

3.1. Dataset Overview

The dataset used in this study was originally collected and released by Van Herbruggen et al. [1]. We did not collect new player data; instead, we used the IMU-only subset of this badminton dataset obtained from the original authors for reproducible evaluation. The original dataset contains wearable-sensing recordings from three experienced badminton players performing multiple stroke categories with an additional other label assigned to periods in which no target stroke is executed.
For the experiments in this paper, we use the annotated IMU-only CSV recordings available for the three players. The recordings are sampled at 100 Hz using Axivity AX6 inertial sensors. Five sensors are placed on the player/equipment system: lower body/waist, upper body, left foot, right foot, and racket. Each sensor provides tri-axial acceleration and tri-axial angular velocity, resulting in 5 × 6 = 30 raw IMU channels per time step (denoted as D 0 = 30 ). This multi-position setup captures both whole-body movement and racket-specific motion, which is important for distinguishing fast badminton strokes with similar local motion patterns.
Following the labels available in the released IMU-only annotations, our final experimental setting is formulated as a 12-class next-window prediction task, consisting of 11 badminton action classes and one other class. This distinction separates the broader label inventory described in the source dataset from the specific label set used in our reproducible IMU-only experiments. Figure 1 illustrates the data-collection workflow and wearable sensor placement used in the source dataset.

3.2. LSTM Model Application

Badminton motions exhibit high dynamics and abrupt state transitions, posing significant challenges for temporal sequence models to extract discriminative features from IMU data. The long short-term memory (LSTM) model is a specialized temporal sequence modeling framework designed to capture both short-term fine-grained variations and long-term contextual dependencies in dynamic time-series data—an advantage that makes it well suited for processing badminton IMU signals characterized by rapid strokes, smashes, and drops.
Inspired by the BiLSTM open-source model proposed by Cai et al. [15], we learn motion representations from the badminton IMU dataset. This subsection introduces the application process and evaluates the ability of the model to capture the dynamic features of badminton motion.

3.2.1. Data Processing

Each recording is treated as a continuous labeled IMU stream
D = { ( s t , y t ) } t = 0 T 1 , s t R D 0 , y t Y , ϕ : Y { 0 , , C 1 } , C = 12 .
The goal is to construct prediction samples that respect the causal nature of online deployment: at prediction time, only historical sensor observations are available, whereas the target label belongs to the upcoming window.
To transform the continuous stream into model inputs, the IMU sequence is divided into overlapping windows of length W with hop size h o p . The k-th window starts at s k = k · h o p and is represented as
S ( k ) = s s k s s k + 1 s s k + W 1 R W × D 0 , N w = T W h o p + 1 .
The window label is assigned according to the label at the window end,
y k = ϕ y t k , t k = s k + W 1 .
For next-window prediction, the sample at index k is formed from previous windows only, while the supervision target corresponds to a future window. This design prevents information leakage from future observations and matches the intended streaming inference setting.
Each window is converted into a compact multi-channel representation. For channel c { 0 , , D 0 1 } , let x ( k , c ) R W denote the one-dimensional signal segment in window k. We summarize this segment using 13 time- and frequency-domain descriptors:
ψ ( x ) = μ , σ , x min , x max , med , q 0.1 , q 0.9 , rms , E , μ f , σ f , M max , dom R 13 ,
where the frequency-domain terms are computed from the single-sided magnitude spectrum | RFFT ( x ) | . Concatenating the descriptors from all IMU channels gives the window-level feature vector
w k = Concat ψ ( x ( k , 0 ) ) , , ψ ( x ( k , D 0 1 ) ) R d , d = 13 D 0 .
After standardization and PCA compression, each window is represented by z k R m . A history tensor X k R H × m is then constructed from the most recent H past window embeddings and used as the causal input to the temporal prediction model.

3.2.2. Class Imbalance Handling

The other label constitutes the majority of window samples. We therefore perform random undersampling: within each file, we retain all non-other windows and randomly keep an equal number of other windows (when non-other windows exist); files containing no non-other windows are excluded from balancing. After balancing, we randomly split the remaining windows into train/validation/test sets with a 70/10/20 ratio (fixed seed for reproducibility).

3.2.3. PCA-Based Dimensionality Reduction with Feature Standardization

To alleviate scale mismatch and feature redundancy, we apply feature standardization followed by PCA. Specifically, we fit a standard scaler on the training split only and transform the validation/test splits using the same parameters to avoid information leakage. PCA is then fitted on the standardized training features, and all splits are projected into the PCA space. We set the PCA dimension by retaining a target explained-variance ratio (e.g., 95%) or by selecting the best-performing dimension on the validation set. In practice, this compresses correlated statistics (e.g., energy-related measures) into a smaller set of orthogonal components and improves downstream training stability.

3.2.4. Causal Temporal Prediction Architecture

The temporal model maps a history of past IMU window embeddings to the action label distribution of one or more future windows. For a window index k, the causal input is the history tensor
X k = z k H + 1 , , z k R H × m ,
where H = history _ len is the number of historical windows and m = enc _ in is the PCA-reduced feature dimension. The corresponding supervision is defined over a prediction horizon,
y k tar = y k + f 1 , , y k + f K { 0 , , C 1 } K ,
where f s [ future _ start , future _ end ] and K = future _ end future _ start + 1 . This formulation separates the observed history from the future target labels and therefore preserves the past-only constraint required for streaming deployment.
For mini-batch training, the input tensor is X R B × H × m and the target tensor is Y tar { 0 , , C 1 } B × K . The network produces horizon-wise logits
O = F θ ( X ) R B × K × C , y ^ i , s = arg max c O i , s , c .
The first stage projects each time-step feature vector x τ R m into the model dimension d = d _ model ,
h τ = Dropout ( W i n x τ + b i n ) R d ,
which yields a sequence representation H R B × H × d . This projected sequence is encoded by L = depth stacked bidirectional LSTM layers with hidden size h = d _ hidden , allowing the model to summarize short-term stroke variations together with broader temporal context within the available history.
The BiLSTM output is then projected back to the model dimension and refined by multi-head self-attention (MHSA). After LayerNorm and a residual connection, the refined sequence is
U = LN ( Z + MHSA ( Z ) ) , Z R B × H × d .
A fixed-length causal representation is obtained by last-step pooling,
u = U : , H 1 , : R B × d ,
so the classifier uses the representation at the most recent available historical window. A feed-forward MLP maps this representation to K C logits and reshapes them into horizon-specific class scores,
r = MLP ( u ) R B × ( K C ) , O = Reshape ( r ) R B × K × C .
This non-autoregressive prediction head estimates all horizon steps in a single forward pass, which reduces inference overhead and supports real-time use.
During training, the horizon dimension is flattened so that each horizon step contributes one classification term:
L = 1 B K i = 1 B j = 1 K L LS O i , j , : , y i , j ,
where L LS denotes label-smoothed cross-entropy with ε = label _ smoothing . Together with PCA compression and fixed-size sliding windows, this architecture is designed to balance temporal modeling capacity with the throughput requirements of real-time badminton analytics.
The architecture of the proposed model is illustrated in Figure 2.

3.3. Self-Supervised Learning Approach

To generate reliable labels for the IMU sensor data of badminton players, we adopt a self-supervised learning-based annotation method, which is derived from the LIMU-BERT framework proposed by Xu et al. [21]. Unlike manual annotation that relies on experience, this self-supervised approach (originally designed for IMU sensing tasks) enables automatic label generation by leveraging intrinsic time-series characteristics of wearable sensor data, significantly reducing human labor costs while ensuring label consistency—aligning with the core advantage of LIMU-BERT in mining unlabeled sensor data.

4. Results and Analysis

A series of experiments were conducted to verify the proposed method. This section presents the core results obtained in the experiment with a focus on analyzing the performance of real-time prediction models in highly dynamic IMU data. The subsequent analysis provided a solid foundation for verifying the effectiveness and superiority of the proposed method.

4.1. Window-Level Offline Evaluation

We evaluate the proposed model on the window-level next-window prediction task (prediction horizon = 1 step). Since the dataset is class-imbalanced (e.g., the other class has substantially larger support), we report Macro-F1 and balanced accuracy in addition to overall accuracy. Table 1 summarizes the overall performance, and Table 2 provides per-class precision/recall/F1, which enables a fine-grained inspection of class-wise strengths and failure modes.

4.2. Performance of the Optimized LSTM Model

To validate the effectiveness of the combined hyperparameter optimization strategy, we integrate the optimal temporal configuration ( w = 10 , h o p = 4 , H = 49 ) and the recommended PCA dimensionality ( d = 52 ) to construct the final LSTM model. Figure 3 presents key performance visualizations of this integrated model, including training/validation accuracy, class distribution balance, confusion matrices, and loss dynamics.
To avoid repeating the aggregate metrics already reported in Window-Level Offline Evaluation, we focus here on optimization dynamics and class-wise error patterns. In Figure 3a, the training/validation curves remain close throughout optimization, indicating stable convergence and no obvious overfitting under the selected configuration.
Figure 3b further shows that most confusion is concentrated among semantically similar stroke classes, while the dominant “other” class remains well separated rather than being over-predicted. The main residual weakness is the rare lob_backhand class (very limited support), which is consistent with the long-tail distribution and suggests that future gains will primarily come from targeted data balancing or class-aware augmentation rather than further global hyperparameter tuning.
To further contextualize the IMU-only setting, we discuss its relation to prior wearable-sensing-based badminton recognition work without treating it as a directly comparable benchmark.

4.3. End-to-End Real-Time Inference Benchmark

A system is considered real-time feasible in our setting if throughput is at least 10 windows/s, since the deployment hop is 100 ms. Our end-to-end pipeline achieves 58.20 windows/s, i.e., 5.82× the real-time requirement. To verify this under realistic processing overhead, we run an end-to-end stream-replay benchmark. Raw IMU CSV streams are replayed in chronological order, and the pipeline performs sliding-window inference per hop, including window-level feature extraction, standardization, PCA projection, history buffering, and a single model forward pass. After warming up for N w windows, we benchmark N e consecutive windows and report throughput and latency percentiles.
Metrics. Let f s be the sampling rate (Hz) and let h out denote the output hop size (samples). The system produces one prediction every Δ t = h out / f s seconds; thus, the minimum required throughput is
TP min = 1 Δ t = f s h out ( windows / s ) .
Given the measured throughput TP (windows/s), we define the real-time factor (RTF) with respect to the output hop as
RTF hop = TP TP min = TP · h out f s .
A system is considered real-time feasible if RTF hop > 1 .
Results and analysis. With f s = 100 Hz and h out = 10 (i.e., Δ t = 0.1 s, TP min = 10  windows/s), our end-to-end pipeline achieves TP = 58.20 windows/s, corresponding to RTF hop = 5.82 , which satisfies the real-time requirement with a clear safety margin. In addition, model-only inference (batch size 1) reaches 185.17 windows/s with median latency p 50 = 5.08 ms, indicating that the remaining end-to-end overhead mainly stems from preprocessing (feature computation, normalization, and PCA) rather than the network forward pass. This margin suggests the system can tolerate moderate deployment overhead (I/O, scheduling, logging) while still meeting the hop-based real-time constraint.

4.4. Calibration and Selective Prediction

Beyond accuracy, deployment quality depends on whether confidence scores are well calibrated and useful for selective prediction. We therefore evaluate reliability with the Expected Calibration Error (ECE) and assess confidence ranking quality with precision–recall analysis. Using B confidence bins, ECE is defined as
ECE = b = 1 B | I b | N acc ( I b ) conf ( I b ) ,
where I b is the sample set in bin b, N is the number of test windows, acc ( I b ) is the empirical accuracy, and conf ( I b ) is the mean predicted confidence; we set B = 15 in this paper. A lower ECE indicates better alignment between predicted probabilities and true correctness frequencies.
Figure 4a shows that the reliability curve closely follows the diagonal, indicating good calibration. For the reliability diagram (Figure 4a), the x-axis is the binned mean confidence c ¯ b [ 0 , 1 ] , and the y-axis is the corresponding empirical accuracy acc ( I b ) [ 0 , 1 ] for bin I b . Figure 4b further shows strong precision–recall behavior, suggesting that confidence can effectively separate likely-correct from likely-incorrect predictions. For the PR curve (Figure 4b), the x-axis is recall R = T P T P + F N and the y-axis is precision P = T P T P + F P . Moreover, the micro-AP is 0.986 and macro-AP is 0.988, indicating strong confidence ranking quality. Precision decreases when recall approaches 1 because lowering the decision threshold introduces more low-confidence false positives. Together, these results support confidence-based deployment strategies (e.g., thresholding or abstention) in streaming scenarios.

4.5. Ablation Studies

Downsampling under severe class imbalance. The raw window-level labels are highly imbalanced where the other class dominates. Without handling this imbalance, the classifier tends to collapse to a trivial majority-class predictor, yielding misleadingly high accuracy while failing on minority actions. Therefore, we apply downsampling to mitigate the dominance of the majority class and stabilize optimization. Figure 5 provides a class-distribution comparison before and after downsampling.
Effect of PCA. We further evaluate whether PCA-based dimensionality reduction benefits horizon prediction. Table 3 reports the mean accuracy (averaged over all steps within the prediction horizon) under four prediction horizons. PCA consistently improves performance for short-to-medium horizons (10–40), indicating that denoising and compact representations help the encoder learn more robust temporal features.
Effect of MHSA. We ablate the multi-head self-attention (MHSA) module by removing it from the encoder while keeping all other settings unchanged. As shown in Table 4, MHSA provides consistent gains across different horizons, especially at longer horizons, which suggests that attention helps highlight informative temporal segments within the historical window for more reliable prediction across horizons.
BiLSTM vs. UniLSTM. Finally, we compare a bidirectional LSTM encoder with a unidirectional LSTM encoder under the same training protocol. Note that the bidirectionality is applied only within the observed input window and does not access any future observations from the prediction horizon; hence, it does not introduce information leakage. Figure 6 visualizes the all-correct rate (the percentage of samples for which all steps within the prediction horizon are correctly predicted) over training. BiLSTM achieves a higher all-correct rate throughout training, suggesting that modeling both earlier and later temporal dependencies inside the input window yields a more informative representation for prediction across the horizon. Additional ablation details are provided in Figure 7 for key-parameter sensitivity and in Figure 8 for PCA dimensionality analysis.

5. Conclusions and Future Work

5.1. Conclusions

This paper studied real-time badminton next-window action prediction from wearable IMU streams under a strict past-only protocol. We proposed a practical pipeline that combines sliding-window feature extraction, standardization and PCA compression, and a lightweight temporal model (BiLSTM with multi-head self-attention) to capture both short- and long-range motion context. Experiments on a multi-player, multi-sensor badminton dataset demonstrate strong recognition performance under severe class imbalance, while the end-to-end stream-replay benchmark confirms real-time feasibility (58.20 windows/s including preprocessing). Overall, the results indicate that compact IMU representations together with causal temporal modeling can enable accurate and deployable online action prediction for badminton analytics.

5.2. Future Work

Future extensions will focus on moving from action-level prediction toward richer real-time badminton intelligence. One direction is to incorporate UWB-based positional information so that body-motion cues from IMUs can be combined with on-court trajectories for strategy-level forecasting, including player movement intent, court coverage, and tactical patterns. This extension also requires multimodal fusion architectures that can align asynchronous IMU and UWB streams while remaining robust to realistic noise, occlusion, and sensor dropouts. Beyond sensing and prediction, large language models (LLMs) may serve as higher-level reasoning and generation modules: real-time predicted actions and positions could be translated into coherent play-by-play descriptions, tactical summaries, and short-horizon forecasts, making the system output more interpretable for players, coaches, and spectators.

6. Additional Analyses

This section provides additional analyses, including key-parameter sensitivity, PCA dimensionality effects, calibration and selective prediction, downsampled streaming/event-level evaluation, self-supervised labeling assessment, and deployment-oriented confidence analysis.

6.1. Impact of Key Parameters on Recognition Performance

To identify the optimal hyperparameter combination for the LSTM model, we conducted a grid search over three key parameters: window size (w), model hop size (hop), and history length (hist). The grid search results are visualized in Figure 7, which quantifies the model accuracy across different parameter combinations. Here, hop denotes the model-window stride used for feature sequence construction, which is distinct from the 100 ms deployment hop used in the real-time benchmark.
As indicated by the peak accuracy in Figure 7, the optimal parameter combination is determined as w = 10 (window size), hop = 4 (model hop size), and hist = 49 (history length). A small window size ( w = 10 ) captures fine-grained dynamic features of badminton motions (e.g., rapid stroke transitions) without over-smoothing short-term IMU signal fluctuations. A small model hop size ( hop = 4 ) ensures dense sampling of the time series, preserving temporal continuity and reducing information loss between adjacent windows. A large history length ( hist = 49 ) enables the LSTM model to leverage long-term contextual dependencies of motion sequences, which is critical for distinguishing similar badminton shots (e.g., forehand clear vs. smash) that share short-term IMU patterns but differ in long-term motion context.

6.2. Impact of PCA Dimensionality on Recognition Performance

We investigated the impact of PCA dimensionality on recognition performance while keeping the temporal configuration fixed ( w = 10 , h o p = 4 , H = 49 ) and using the same split protocol. We first conducted a coarse scan over a broad range of PCA components, which is followed by a fine-grained scan around the promising region. Across all evaluated settings, the best observed test accuracy was 96.13 % at d = 52 PCA components. To balance accuracy and model efficiency, we also report the smallest dimensionality whose accuracy is within 0.20 percentage points the best result; this yields a recommended setting of d = 36 with 96.03 % accuracy. Overall, PCA enables substantial dimensionality reduction with a negligible drop in accuracy, indicating that the handcrafted time/frequency features contain considerable redundancy and can be compactly represented without sacrificing recognition performance. The combined figure consolidates the coarse and fine accuracy curves, the best-per-dimension accuracy envelope, and the accuracy gap to the best, enabling a compact view of the accuracy–efficiency trade-off and the diminishing returns beyond the recommended dimensionality.

6.3. Calibration and Selective-Prediction Protocol

Calibration analysis evaluates whether the model confidence reflects the empirical probability of correct prediction, which is important when the system is used for real-time coaching or tactical feedback. For each test window, the model produces logits z R C and class probabilities p = softmax ( z ) . The predicted label is y ^ = arg max c p c , and the associated confidence is max c p c . These quantities support Expected Calibration Error (ECE), coverage–risk analysis, and confidence-thresholded selective prediction where low-confidence windows can be abstained from rather than forced into potentially unreliable action decisions. Table 5 reports the calibration result on the test set.

6.4. Streaming/Event-Level Evaluation (Downsampled Setting)

To better reflect event detection performance under controlled class imbalance, we additionally evaluate streaming predictions in a downsampled setting. Following our training-time strategy, we downsample background (other) windows to match the number of non-other windows (1:1) before constructing history sequences and running inference. We then merge consecutive non-other windows into predicted action segments and match them to ground-truth segments using temporal IoU with threshold θ = 0.5 . We report event-level precision/recall/F1 and detection delay measured by time-to-detect (median and 90th percentile), where delays are computed based on N = 2 consecutive correct predictions at the native resolution of this event-level protocol. Table 6 summarizes the event-level performance in the downsampled streaming setting.
We further report the sensitivity to the temporal IoU threshold by sweeping θ { 0.5 , 0.6 , 0.7 , 0.8 } , as summarized in Table 7.
This downsampled setting reduces the dominance of the other class and therefore provides an upper-bound estimate of event detection performance under a balanced background.

6.5. Performance Evaluation of Self-Supervised Labeling

In this experiment, the comparison between ground truth (GT) labels and self-supervised predicted (pred) labels for three players is visualized in Figure 9. Each player’s results include a side-by-side comparison of GT labels (left column) and model-generated labels (right column), enabling both a qualitative and quantitative assessment of the self-supervised labeling effectiveness.
Quantitative evaluation reveals that the self-supervised model achieves high labeling accuracy across all participants: Player 1 reaches 0.9888 , Player 2 0.9753 , and Player 3 the highest at 0.9913 . Qualitatively, the predicted labels in Figure 9 closely align with the GT labels, accurately capturing the temporal boundaries of badminton actions (e.g., serves, smashes, and drops) without obvious misclassifications. This validates the model’s ability to generate reliable labels without manual annotation, significantly reducing labor costs.
Notably, window size configuration is critical for balancing labeling quality and temporal responsiveness. An excessively small window leads to label “glitches” (spurious action transitions) due to heightened sensitivity to short-term IMU signal noise—this is because small windows fail to average out random fluctuations in sensor data. Conversely, an overly large window introduces substantial labeling delay, as it requires accumulating more temporal data before generating a label, which cannot keep pace with the rapid dynamics of badminton motions (e.g., quick stroke reversals or sudden direction changes). The window size adopted in this paper optimizes this trade-off: as evidenced by the smooth label sequences and high accuracy in Figure 9, it effectively minimizes noise-induced glitches while maintaining sufficient temporal responsiveness to match the fast-changing characteristics of IMU data.

6.6. Deployment-Oriented Additional Analyses

To complement the main streaming and calibration results, we provide two additional analyses that are directly relevant to real-time deployment. The first examines confidence-based selective prediction, where the system may abstain from low-confidence windows rather than forcing unreliable decisions. The second evaluates the sensitivity of event-level performance to the temporal IoU threshold used for segment matching. Together, these analyses clarify how the proposed system behaves under practical confidence filtering and event-detection criteria.
Figure 10 shows the trade-off between prediction coverage and risk under confidence thresholding. This result supports selective deployment modes in which uncertain predictions can be withheld or flagged for downstream review, thereby improving the reliability of the displayed feedback.
Table 7 reports the event-level sensitivity to the temporal IoU threshold. As expected, stricter matching thresholds reduce precision, recall, and F1 because predicted segments must align more tightly with ground-truth action intervals. The gradual performance decrease indicates that the predicted event boundaries remain reasonably stable across a range of temporal matching criteria.

Author Contributions

Conceptualization, Q.Z.; methodology, Q.Z. and B.G.; investigation, Q.Z. and J.W.; data curation, Q.Z. and J.W.; formal analysis, Q.Z.; software, Q.Z.; visualization, Q.Z.; resources, J.W.; writing—original draft preparation, Q.Z.; writing—review and editing, Q.Z. and B.G.; supervision, B.G.; project administration, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank Sichuan University for support. The authors also gratefully acknowledge the dataset contribution from the laboratory team of Van Herbruggen et al. [1], whose publicly released badminton IMU/UWB dataset made this study possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Van Herbruggen, B.; Fontaine, J.; Simoen, J.; De Mey, L.; Peralta, D.; Shahid, A.; De Poorter, E. Strategy analysis of badminton players using deep learning from IMU and UWB wearables. Internet Things 2024, 27, 101260. [Google Scholar] [CrossRef]
  2. Steels, T.; Van Herbruggen, B.; Fontaine, J.; De Pessemier, T.; Plets, D.; De Poorter, E. Badminton activity recognition using accelerometer data. Sensors 2020, 20, 4685. [Google Scholar] [CrossRef] [PubMed]
  3. Kiang, C.T.; Yoong, C.K.; Spowage, A.C. Local sensor system for badminton smash analysis. In IEEE Instrumentation and Measurement Technology Conference (I2MTC); IEEE: Piscataway, NJ, USA, 2009; pp. 883–888. [Google Scholar]
  4. Ghosh, I.; Ramamurthy, S.R.; Chakma, A.; Roy, N. Decoach: Deep learning-based coaching for badminton player assessment. Pervasive Mob. Comput. 2022, 83, 101608. [Google Scholar] [CrossRef]
  5. Peralta, D.; Van Herbruggen, B.; Fontaine, J.; Debyser, W.; Wieme, J.; De Poorter, E. Badminton stroke classification based on accelerometer data: From individual to generalized models. In IEEE International Conference on Big Data; IEEE: Piscataway, NJ, USA, 2022; pp. 5542–5548. [Google Scholar] [CrossRef]
  6. Jin, G.; Li, X. Wearable sensing for badminton stroke recognition with one-dimensional convolutional neural network. Sci. Rep. 2025, 15, 25158. [Google Scholar] [CrossRef] [PubMed]
  7. Chang, K.S.; Wang, W.Y.; Peng, W.C. Where will players move next? Dynamic graphs and hierarchical fusion for movement forecasting in badminton. Proc. Aaai Conf. Artif. Intell. 2023, 37, 6998–7005. [Google Scholar] [CrossRef]
  8. Wang, K.D.; Wang, W.Y.; Hsieh, P.C.; Peng, W.C. Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian Motion. In Proceedings of the Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track; Springer Nature: Cham, Switzerland, 2024; pp. 348–364. [Google Scholar] [CrossRef]
  9. Tang, C.; Xu, Z.; Occhipinti, E.; Yi, W.; Xu, M.; Kumar, S.; Virk, G.S.; Gao, S.; Occhipinti, L.G. From brain to movement: Wearables-based motion intention prediction across the human nervous system. Nano Energy 2023, 115, 108712. [Google Scholar] [CrossRef]
  10. Zhang, W.; Zhang, H.; Jiang, Z.; Servati, A.; Servati, P. Real-time forecasting of pathological gait via IMU navigation: A few-shot and generative learning framework for wearable devices. Discov. Electron. 2025, 2, 51. [Google Scholar] [CrossRef]
  11. Guo, C.; Rapetti, L.; Darvish, K.; Grieco, R.; Draicchio, F.; Pucci, D. Online Action Recognition for Human Risk Prediction with Anticipated Haptic Alert via Wearables. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids); IEEE: Piscataway, NJ, USA, 2023; pp. 1–8. [Google Scholar] [CrossRef]
  12. Xiao, X.; Yin, J.; Xu, J.; Tat, T.; Chen, J. Advances in machine learning for wearable sensors. ACS Nano 2024, 18, 22734–22751. [Google Scholar] [CrossRef] [PubMed]
  13. Chen, S.; Peng, C.; Yang, B.; Lin, J.; Zhou, L.; Jiang, Z.; Liu, Z.; Liu, Y.; Tang, L. Recent advances in intelligent wearable systems: From multiscale biomechanical features towards human motion intent prediction. npj Artif. Intell. 2026, 2, 33. [Google Scholar] [CrossRef]
  14. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  15. Cai, R.; Huang, H.; Jiang, Z.; Li, Z.; Zhou, C.; Liu, Y.; Liu, Y.; Hao, Z. Disentangling long-short term state under unknown interventions for online time series forecasting. Proc. Aaai Conf. Artif. Intell. 2025, 39, 15641–15649. [Google Scholar] [CrossRef]
  16. Darvish, K.; Ivaldi, S.; Pucci, D. Simultaneous Action Recognition and Human Whole-Body Motion and Dynamics Prediction from Wearable Sensors. In 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids); IEEE: Piscataway, NJ, USA, 2022; pp. 488–495. [Google Scholar] [CrossRef]
  17. Ren, M.; Wei, J.; Qin, J.; Guo, X.; Wang, H.; Li, S. Attention based LSTM framework for robust UWB and INS integration in NLOS environments. Sci. Rep. 2025, 15, 21637. [Google Scholar] [CrossRef]
  18. Zhou, H.; Peng, Y.; Li, X.; Lyu, X.; Shou, D.; Li, G.; Wang, L. Multimodal wearable sensors-driven KsFormer model for continuous multi-step ahead prediction of lower limb joint moments and ground reaction forces. Biomim. Intell. Robot. 2026, 6, 100287. [Google Scholar] [CrossRef]
  19. Tan, T.; Shull, P.B.; Hicks, J.L.; Uhlrich, S.D.; Chaudhari, A.S. Self-supervised learning improves accuracy and data efficiency for IMU-based ground reaction force estimation. IEEE Trans. Biomed. Eng. 2024, 71, 2095–2104. [Google Scholar] [CrossRef] [PubMed]
  20. Jiang, A.; Ye, J. SelfVis: Self-supervised learning for human activity recognition based on area charts. IEEE Trans. Emerg. Top. Comput. 2024, 13, 196–206. [Google Scholar] [CrossRef]
  21. Xu, H.; Zhou, P.; Tan, R.; Li, M.; Shen, G. LIMU-BERT: Unleashing the potential of unlabeled data for IMU sensing applications. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems (SenSys); Association for Computing Machinery: New York, NY, USA, 2021; pp. 220–233. [Google Scholar]
  22. Wei, Q.; Huang, J.; Gao, Y.; Dong, W. One Model to Fit Them All: Universal IMU-based Human Activity Recognition with LLM-assisted Cross-dataset Representation. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2025, 9, 139:1–139:22. [Google Scholar] [CrossRef]
  23. Tarale, P.P.; Chu, K.; Varghese, A.; Liu, K.C.; Xu, M.A.; Iyyer, M.; Lee, S.I. Bio-Inspired Self-Supervised Learning for Wrist-worn IMU Signals. arXiv 2026, arXiv:2603.10961. [Google Scholar]
  24. Wang, J.; Zhao, Z.; Cui, J.; Lu, J.; Wu, B. TrackLet: Data-Driven Inertial Tracking on Your Own IMU Data. IEEE Trans. Mob. Comput. 2025, 24, 8301–8313. [Google Scholar] [CrossRef]
Figure 1. Illustration of the badminton IMU dataset workflow used in the source dataset from Van Herbruggen et al. [1]: (a) IMU placement on the upper body, lower body/waist, left foot, right foot, and racket; (b) data collection during real badminton matches with multi-camera time synchronization and later manual labeling; and (c) data processing from raw IMU signals to synchronized and filtered streams, shot labeling, and strategy annotation. The colored curves in panel (c) schematically represent multi-channel IMU signals.
Figure 1. Illustration of the badminton IMU dataset workflow used in the source dataset from Van Herbruggen et al. [1]: (a) IMU placement on the upper body, lower body/waist, left foot, right foot, and racket; (b) data collection during real badminton matches with multi-camera time synchronization and later manual labeling; and (c) data processing from raw IMU signals to synchronized and filtered streams, shot labeling, and strategy annotation. The colored curves in panel (c) schematically represent multi-channel IMU signals.
Sensors 26 03651 g001
Figure 2. Overall architecture of the real-time badminton model. The past-only multivariate input sequence is first projected into a latent feature space, which is followed by stacked bidirectional LSTM layers for temporal encoding. Multi-head self-attention is then applied to capture global temporal dependencies, and the refined representation is finally mapped to class logits through a feed-forward classification head. Here, d model and d hidden denote feature dimensions, and the last-time-step pooling selects the final encoded state.
Figure 2. Overall architecture of the real-time badminton model. The past-only multivariate input sequence is first projected into a latent feature space, which is followed by stacked bidirectional LSTM layers for temporal encoding. Multi-head self-attention is then applied to capture global temporal dependencies, and the refined representation is finally mapped to class logits through a feed-forward classification head. Here, d model and d hidden denote feature dimensions, and the last-time-step pooling selects the final encoded state.
Sensors 26 03651 g002
Figure 3. Performance of the optimized LSTM-PCA model ( w = 10 , h o p = 4 , H = 49 , d = 52 ). (a) Training/validation accuracy and loss curves. (b) Confusion matrix on the test set.
Figure 3. Performance of the optimized LSTM-PCA model ( w = 10 , h o p = 4 , H = 49 , d = 52 ). (a) Training/validation accuracy and loss curves. (b) Confusion matrix on the test set.
Sensors 26 03651 g003aSensors 26 03651 g003b
Figure 4. Calibration and confidence-ranking diagnostics on the test set: (a) reliability diagram with the dashed ideal-calibration line; (b) micro precision–recall curve showing strong confidence ranking quality.
Figure 4. Calibration and confidence-ranking diagnostics on the test set: (a) reliability diagram with the dashed ideal-calibration line; (b) micro precision–recall curve showing strong confidence ranking quality.
Sensors 26 03651 g004
Figure 5. Class distribution comparison before and after downsampling.
Figure 5. Class distribution comparison before and after downsampling.
Sensors 26 03651 g005
Figure 6. BiLSTM vs. UniLSTM ablation on horizon prediction (horizon length 20). The metric is the all-correct rate over epochs for train/validation splits.
Figure 6. BiLSTM vs. UniLSTM ablation on horizon prediction (horizon length 20). The metric is the all-correct rate over epochs for train/validation splits.
Sensors 26 03651 g006
Figure 7. Grid search results for LSTM model accuracy: optimal parameters are w = 10 , hop = 4 , hist = 49 .
Figure 7. Grid search results for LSTM model accuracy: optimal parameters are w = 10 , hop = 4 , hist = 49 .
Sensors 26 03651 g007
Figure 8. PCA dimensionality analysis (combined): coarse and fine accuracy curves, best-per-dimension accuracy envelope, and accuracy gap to the best.
Figure 8. PCA dimensionality analysis (combined): coarse and fine accuracy curves, best-per-dimension accuracy envelope, and accuracy gap to the best.
Sensors 26 03651 g008
Figure 9. Self-supervised labeling for Player 3: GT vs. predicted labels with a side zoom of the largest discrepancy segment ((top) GT, (bottom) Pred). Accuracy: 0.9913.
Figure 9. Self-supervised labeling for Player 3: GT vs. predicted labels with a side zoom of the largest discrepancy segment ((top) GT, (bottom) Pred). Accuracy: 0.9913.
Sensors 26 03651 g009
Figure 10. Coverage–risk curve under confidence thresholding. As the confidence threshold increases, coverage decreases while Macro-F1 improves, indicating that abstention effectively filters low-confidence errors.
Figure 10. Coverage–risk curve under confidence thresholding. As the confidence threshold increases, coverage decreases while Macro-F1 improves, indicating that abstention effectively filters low-confidence errors.
Sensors 26 03651 g010
Table 1. Window-level test performance (prediction horizon = 1 step).
Table 1. Window-level test performance (prediction horizon = 1 step).
MetricValue
Accuracy0.9636
Macro-F10.9582
Weighted-F10.9636
Balanced Accuracy0.9654
Table 2. Per-class window-level metrics on the test set (prediction horizon = 1 step).
Table 2. Per-class window-level metrics on the test set (prediction horizon = 1 step).
ClassPrecisionRecallF1Support
clear_forehand0.93140.98250.9562456
dab_forehand0.93970.95320.9464278
drive_forehand0.93620.94620.9412279
drop_forehand0.95850.94930.9539414
lob_backhand0.91490.95560.934845
lob_forehand0.91810.96020.9387327
net_drop_backhand0.99260.96750.9799277
net_drop_forehand0.98460.99740.9910770
serve_backhand0.94640.95240.9494315
serve_forehand0.94720.98550.9660619
smash_forehand0.97440.98000.9772350
other0.97280.95490.96384165
Table 3. Ablation study results of PCA (mean accuracy).
Table 3. Ablation study results of PCA (mean accuracy).
ModelPrediction Horizon
10204080
No PCA0.8990.7990.6570.519
With PCA0.9150.8250.6860.519
Table 4. Ablation study results of MHSA (mean accuracy).
Table 4. Ablation study results of MHSA (mean accuracy).
ModelPrediction Horizon
10204080
No MHSA0.9030.7820.6560.515
With MHSA0.9150.8250.6860.519
Table 5. Calibration on the test set.
Table 5. Calibration on the test set.
MetricValue
ECE (bins = B)0.0267
Table 6. Event-level performance in the downsampled streaming setting (IoU θ = 0.5 ).
Table 6. Event-level performance in the downsampled streaming setting (IoU θ = 0.5 ).
SettingPrec.Rec.F1Med. (ms)P90 (ms)
Downsampled (1:1)0.9590.8890.923040
Table 7. Event-level results under different temporal IoU thresholds θ in the downsampled streaming setting.
Table 7. Event-level results under different temporal IoU thresholds θ in the downsampled streaming setting.
IoU θ PrecisionRecallF1
0.50.9590.8890.923
0.60.9420.8730.906
0.70.9210.8540.887
0.80.8650.8020.832
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhu, Q.; Wang, J.; Guo, B. From IMU Streams to Real-Time Decisions: Past-Only Next-Window Badminton Action Prediction. Sensors 2026, 26, 3651. https://doi.org/10.3390/s26123651

AMA Style

Zhu Q, Wang J, Guo B. From IMU Streams to Real-Time Decisions: Past-Only Next-Window Badminton Action Prediction. Sensors. 2026; 26(12):3651. https://doi.org/10.3390/s26123651

Chicago/Turabian Style

Zhu, Qinglin, Jiao Wang, and Bin Guo. 2026. "From IMU Streams to Real-Time Decisions: Past-Only Next-Window Badminton Action Prediction" Sensors 26, no. 12: 3651. https://doi.org/10.3390/s26123651

APA Style

Zhu, Q., Wang, J., & Guo, B. (2026). From IMU Streams to Real-Time Decisions: Past-Only Next-Window Badminton Action Prediction. Sensors, 26(12), 3651. https://doi.org/10.3390/s26123651

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop