1. Introduction
Wearable sensing has become a practical foundation for fine-grained sports analytics, enabling an automated understanding of athletes’ actions, tactics, and biomechanics from lightweight on-body sensors. In badminton, inertial measurement units (IMUs) can capture fast and subtle motion cues of strokes, and recent datasets and studies have demonstrated the feasibility of recognizing rich shot categories from multi-sensor IMU streams. In this paper, we focus on a badminton IMU dataset containing 11 shot types plus an “other” class, recorded at 100 Hz with five IMU placements (lower, upper, left foot, right foot, and racket), yielding 30 raw channels (three-axis accelerometer + three-axis gyroscope per device).
Beyond offline recognition, many real-world applications (e.g., real-time coaching, tactical feedback, and downstream control) require predictive inference: the system should anticipate the action that is about to occur rather than only classify what has already happened. We therefore study next-window action prediction: given a history of past windows, the model predicts the action label of an upcoming window using a strict past-only (causal) formulation. Concretely, with a 100 Hz IMU stream, the system outputs one prediction every 100 ms, enabling continuous online inference without accessing future observations. This hop-based protocol requires a minimum throughput of 10 windows/s for real-time operation; in our end-to-end stream-replay benchmark (feature extraction, PCA, and model inference), the pipeline reaches 58.20 windows/s (5.82× real time) on a Windows PC equipped with an NVIDIA RTX 3080 GPU. This setting aligns naturally with online deployment, where the upcoming window has not yet been fully observed when the decision must be made.
Next-window prediction for badminton IMU is challenging for several reasons. First, badminton motions are highly dynamic with abrupt state transitions; discriminative patterns may be short-lived while still depending on longer-term context. Second, many strokes exhibit similar short-term signatures (e.g., clear vs. smash), making long-range temporal cues important for disambiguation. Third, the data distribution is strongly imbalanced: the background “other” class can dominate the stream, while rare strokes appear sparsely, which can bias training and degrade minority-class performance. Finally, obtaining accurate frame-/window-level labels for such high-frequency sensor data is labor-intensive, motivating learning and labeling strategies that reduce human annotation cost.
To address these challenges, we propose an LSTM-based pipeline for IMU next-window prediction. We first transform each IMU window into a compact representation via multi-channel time/frequency-domain features (13 features per channel) and apply standardization followed by PCA for dimensionality reduction, yielding an m-dimensional embedding per window. We then construct a past-only sequence of length H and feed it into a temporal model combining BiLSTM-based sequence encoding with multi-head attention and an MLP classifier head. This design targets both short-term variations and long-term dependencies, matching the nature of badminton motion streams. In addition, to alleviate labeling cost, we adopt a self-supervised labeling approach derived from LIMU-BERT-style IMU representation learning, which can generate reliable labels and significantly reduce manual annotation overhead.
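The per-window representation described above (per-channel features, standardization, then PCA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: it computes only five illustrative features per channel rather than the 13 the paper uses, the window and hop sizes are hypothetical, the PCA is a plain SVD-based projection, and the input is synthetic noise standing in for a 30-channel stream.

```python
import numpy as np

def window_features(window: np.ndarray) -> np.ndarray:
    """Map one window of shape (samples, channels) to a flat feature vector."""
    # Time-domain statistics, one value per channel.
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    minimum = window.min(axis=0)
    maximum = window.max(axis=0)
    # Frequency-domain feature: spectral energy per channel, excluding the DC bin.
    spectrum = np.abs(np.fft.rfft(window, axis=0)) ** 2
    energy = spectrum[1:].sum(axis=0)
    return np.concatenate([mean, std, minimum, maximum, energy])

def fit_standardize_pca(features: np.ndarray, m: int):
    """Fit z-score statistics and the top-m principal directions on training features."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8  # guard against constant features
    z = (features - mu) / sigma
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mu, sigma, vt[:m]

def transform(features: np.ndarray, mu, sigma, components) -> np.ndarray:
    """Apply the fitted standardization and PCA projection."""
    return ((features - mu) / sigma) @ components.T

# Synthetic stand-in for 20 s of a 30-channel stream sampled at 100 Hz.
rng = np.random.default_rng(0)
stream = rng.normal(size=(2000, 30))
w, hop = 50, 10  # hypothetical window and hop sizes, in samples
windows = [stream[s:s + w] for s in range(0, len(stream) - w + 1, hop)]
feats = np.stack([window_features(x) for x in windows])
mu, sigma, comps = fit_standardize_pca(feats, m=16)
emb = transform(feats, mu, sigma, comps)
print(feats.shape, emb.shape)
```

In a deployment setting the statistics and PCA basis would be fitted once on training data and then reused unchanged for every incoming window, which is why fitting and transforming are separate functions here.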
The main contribution of this paper is a complete causal prediction pipeline for badminton IMU streams. We formulate the task as strict past-only next-window prediction, where the system outputs one prediction every 100 ms without accessing future observations. To support this setting, we combine multi-channel time/frequency features, PCA-based compression, and a lightweight BiLSTM + MHSA temporal encoder that captures both short-term stroke dynamics and longer-range motion context. We further evaluate deployability with a full end-to-end streaming benchmark, including feature extraction, standardization, PCA, and model inference, and show that the pipeline reaches 58.20 windows/s, or 5.82× the real-time requirement, on a Windows PC with an NVIDIA RTX 3080 GPU. Finally, because continuous badminton streams are dominated by the other class, we incorporate window-level downsampling and ablation analyses to clarify how imbalance handling, PCA dimensionality, and attention affect prediction robustness, especially at longer horizons.
The remainder of the paper is organized to make this pipeline explicit. Section 3 describes the dataset, preprocessing, temporal model, and labeling strategy, while Section 4 evaluates prediction accuracy, real-time feasibility, calibration, and ablation results. This organization connects each methodological component to the deployment requirements of low-latency badminton analytics.
Abbreviations
For clarity, the main acronyms used in this paper are summarized as follows: IMU, inertial measurement unit; PCA, principal component analysis; LSTM, long short-term memory; BiLSTM, bidirectional LSTM; MHSA, multi-head self-attention; MLP, multilayer perceptron; ECE, expected calibration error; UWB, ultra-wideband; IoU, intersection over union; GT, ground truth; and PR, precision–recall.
4. Results and Analysis
We conducted a series of experiments to verify the proposed method. This section presents the core results, focusing on how the real-time prediction model performs on highly dynamic IMU data, and the analysis that follows is used to assess the effectiveness of the proposed method.
4.1. Window-Level Offline Evaluation
We evaluate the proposed model on the window-level next-window prediction task (prediction horizon = 1 step). Since the dataset is class-imbalanced (e.g., the other class has substantially larger support), we report Macro-F1 and balanced accuracy in addition to overall accuracy. Table 1 summarizes the overall performance, and Table 2 provides per-class precision/recall/F1, which enables a fine-grained inspection of class-wise strengths and failure modes.
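To illustrate why Macro-F1 and balanced accuracy are reported alongside overall accuracy, the sketch below computes all three from a confusion matrix. The function names and the toy labels are illustrative assumptions; the example uses a trivial majority-class predictor on an imbalanced label set, not results from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def summary_metrics(cm):
    """Overall accuracy, Macro-F1, and balanced accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)    # number of true samples per class
    predicted = cm.sum(axis=0)  # number of predictions per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    present = support > 0       # average only over classes that occur in the labels
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_f1": f1[present].mean(),
        "balanced_accuracy": recall[present].mean(),
    }

# Toy case: class 0 plays the role of a dominant background class.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10  # majority-class predictor
metrics = summary_metrics(confusion_matrix(y_true, y_pred, num_classes=3))
print(metrics)
```

In this toy case the overall accuracy is 0.8 even though both minority classes are never predicted, whereas balanced accuracy falls to one third, which is the behavior that motivates the additional metrics.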
4.2. Performance of the Optimized LSTM Model
To validate the effectiveness of the combined hyperparameter optimization strategy, we integrate the optimal temporal configuration (window size w, model hop size hop, and history length hist) and the recommended PCA dimensionality to construct the final LSTM model.
Figure 3 presents key performance visualizations of this integrated model, including training/validation accuracy, class distribution balance, confusion matrices, and loss dynamics.
To avoid repeating the aggregate metrics already reported in Section 4.1 (Window-Level Offline Evaluation), we focus here on optimization dynamics and class-wise error patterns. In Figure 3a, the training/validation curves remain close throughout optimization, indicating stable convergence and no obvious overfitting under the selected configuration. Figure 3b further shows that most confusion is concentrated among semantically similar stroke classes, while the dominant “other” class remains well separated rather than being over-predicted. The main residual weakness is the rare lob_backhand class (very limited support), which is consistent with the long-tail distribution and suggests that future gains will primarily come from targeted data balancing or class-aware augmentation rather than further global hyperparameter tuning.
To further contextualize the IMU-only setting, we discuss its relation to prior wearable-sensing-based badminton recognition work without treating it as a directly comparable benchmark.
4.3. End-to-End Real-Time Inference Benchmark
A system is considered real-time feasible in our setting if its throughput is at least 10 windows/s, since the deployment hop is 100 ms. Our end-to-end pipeline achieves 58.20 windows/s, i.e., 5.82× the real-time requirement. To verify this under realistic processing overhead, we run an end-to-end stream-replay benchmark. Raw IMU CSV streams are replayed in chronological order, and the pipeline performs sliding-window inference per hop, including window-level feature extraction, standardization, PCA projection, history buffering, and a single model forward pass. After a warm-up phase, we benchmark a run of consecutive windows and report throughput and latency percentiles.
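The measurement procedure described above (untimed warm-up, then per-window timing with throughput and latency percentiles) can be sketched as a small harness. This is an assumed structure, not the authors' benchmark code: `step` is a hypothetical callable standing in for the full per-hop pipeline, and the warm-up length and reported percentiles are illustrative choices.

```python
import time
import numpy as np

def benchmark(step, inputs, warmup):
    """Run `step` on each input; time only the calls after the first `warmup` inputs."""
    for x in inputs[:warmup]:
        step(x)  # warm-up calls are executed but not timed
    latencies = []
    for x in inputs[warmup:]:
        start = time.perf_counter()
        step(x)
        latencies.append(time.perf_counter() - start)
    lat = np.asarray(latencies)
    return {
        "windows": len(lat),
        "throughput_wps": len(lat) / lat.sum(),  # windows per second
        "p50_ms": 1e3 * np.percentile(lat, 50),
        "p95_ms": 1e3 * np.percentile(lat, 95),
    }

# Stand-in workload: an FFT over a synthetic 50-sample, 30-channel window.
rng = np.random.default_rng(0)
replay = [rng.normal(size=(50, 30)) for _ in range(60)]
stats = benchmark(lambda x: np.fft.rfft(x, axis=0), replay, warmup=10)
print(stats)
```

If the model runs on a GPU, kernel launches are typically asynchronous, so a harness like this would need to synchronize the device before reading the clock; otherwise the measured latency may reflect only the launch cost.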
Metrics. Let $f_s$ be the sampling rate (Hz) and let $h$ denote the output hop size (samples). The system produces one prediction every $h / f_s$ seconds; thus, the minimum required throughput is
$$T_{\mathrm{req}} = \frac{f_s}{h} \quad \text{(windows/s)}.$$
Given the measured throughput $T$ (windows/s), we define the real-time factor (RTF) with respect to the output hop as
$$\mathrm{RTF} = \frac{T}{T_{\mathrm{req}}} = \frac{T\,h}{f_s}.$$
A system is considered real-time feasible if $\mathrm{RTF} \geq 1$.
Results and analysis. With $f_s = 100$ Hz and $h = 10$ (i.e., $h / f_s = 0.1$ s, $T_{\mathrm{req}} = 10$ windows/s), our end-to-end pipeline achieves $T = 58.20$ windows/s, corresponding to $\mathrm{RTF} = 5.82$, which satisfies the real-time requirement with a clear safety margin. In addition, model-only inference (batch size 1) is faster than the full pipeline, indicating that the remaining end-to-end overhead mainly stems from preprocessing (feature computation, normalization, and PCA) rather than the network forward pass. This margin suggests the system can tolerate moderate deployment overhead (I/O, scheduling, logging) while still meeting the hop-based real-time constraint.
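As a worked check of the real-time factor, the short sketch below recomputes the required throughput and the RTF from the figures reported in this section (100 Hz sampling, a 100 ms hop, and 58.20 windows/s measured). The function names are illustrative, and the mean per-window time is derived here from the reported throughput rather than taken from the paper.

```python
def required_throughput(fs_hz: float, hop_samples: int) -> float:
    """Minimum windows/s needed to emit one prediction per hop."""
    return fs_hz / hop_samples

def real_time_factor(measured_wps: float, fs_hz: float, hop_samples: int) -> float:
    """Measured throughput divided by the required throughput."""
    return measured_wps / required_throughput(fs_hz, hop_samples)

fs_hz, hop_samples, measured_wps = 100.0, 10, 58.20
t_req = required_throughput(fs_hz, hop_samples)
rtf = real_time_factor(measured_wps, fs_hz, hop_samples)
mean_ms_per_window = 1e3 / measured_wps   # average processing time per window
budget_ms = 1e3 * hop_samples / fs_hz     # time available per hop
print(t_req, round(rtf, 2), round(mean_ms_per_window, 2), budget_ms)
```

Read this way, the reported throughput corresponds to roughly 17 ms of processing per window on average against a 100 ms budget per hop.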
4.4. Calibration and Selective Prediction
Beyond accuracy, deployment quality depends on whether confidence scores are well calibrated and useful for selective prediction. We therefore evaluate reliability with the Expected Calibration Error (ECE) and assess confidence ranking quality with precision–recall analysis. Using $B$ confidence bins, ECE is defined as
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N}\,\bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|,$$
where $S_b$ is the sample set in bin $b$, $N$ is the number of test windows, $\mathrm{acc}(S_b)$ is the empirical accuracy, and $\mathrm{conf}(S_b)$ is the mean predicted confidence; the number of bins $B$ is held fixed in this paper. A lower ECE indicates better alignment between predicted probabilities and true correctness frequencies.
Figure 4a shows that the reliability curve closely follows the diagonal, indicating good calibration. For the reliability diagram (Figure 4a), the x-axis is the mean confidence within each bin, and the y-axis is the corresponding empirical accuracy for that bin. Figure 4b further shows strong precision–recall behavior, suggesting that confidence can effectively separate likely-correct from likely-incorrect predictions. For the PR curve (Figure 4b), the x-axis is recall and the y-axis is precision. Moreover, the micro-AP is 0.986 and the macro-AP is 0.988, indicating strong confidence ranking quality. Precision decreases as recall approaches 1 because lowering the decision threshold admits more low-confidence false positives. Together, these results support confidence-based deployment strategies (e.g., thresholding or abstention) in streaming scenarios.
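The ECE computation described in this subsection can be sketched as below, assuming equal-width confidence bins on [0, 1]. The function name, the binning convention (right-closed bins), and the bin count of 10 in the example are assumptions for illustration; the toy confidences are synthetic and unrelated to the reported results.

```python
import numpy as np

def expected_calibration_error(confidence, correct, num_bins):
    """Bin-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Right-closed bins: a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidence, edges[1:-1], right=True), 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.any():
            acc_b = correct[mask].mean()       # empirical accuracy in the bin
            conf_b = confidence[mask].mean()   # mean predicted confidence in the bin
            ece += mask.mean() * abs(acc_b - conf_b)  # weight by the bin's share of samples
    return ece

# Toy example: four windows at confidence 0.85 (three correct), two at 0.65 (one correct).
conf = [0.85, 0.85, 0.85, 0.85, 0.65, 0.65]
hit = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(conf, hit, num_bins=10)
print(ece)
```

Each occupied bin contributes its accuracy-confidence gap weighted by the fraction of windows it holds, so empty bins are skipped rather than counted as zero-gap bins.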
4.5. Ablation Studies
Downsampling under severe class imbalance. The raw window-level labels are highly imbalanced, with the other class dominating. Without handling this imbalance, the classifier tends to collapse to a trivial majority-class predictor, yielding misleadingly high accuracy while failing on minority actions. Therefore, we apply downsampling to mitigate the dominance of the majority class and stabilize optimization. Figure 5 provides a class-distribution comparison before and after downsampling.
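A window-level downsampling step of this kind might look like the following sketch. It assumes random selection of majority-class windows and a 1:1 ratio between the majority class and all other windows combined; the function name, the `ratio` parameter, and the integer label encoding are hypothetical choices for illustration.

```python
import numpy as np

def downsample_majority(labels, majority_label, rng, ratio=1.0):
    """Return sorted indices keeping all minority windows and a random subset of majority windows."""
    labels = np.asarray(labels)
    maj_idx = np.flatnonzero(labels == majority_label)
    min_idx = np.flatnonzero(labels != majority_label)
    # Keep `ratio` majority windows per minority window, capped by availability.
    keep = min(len(maj_idx), int(round(ratio * len(min_idx))))
    kept_maj = rng.choice(maj_idx, size=keep, replace=False)
    # Sorting preserves chronological order of the retained windows.
    return np.sort(np.concatenate([min_idx, kept_maj]))

# Toy stream: label 0 stands in for the dominant background class.
labels = np.array([0] * 90 + [1] * 6 + [2] * 4)
idx = downsample_majority(labels, majority_label=0, rng=np.random.default_rng(0))
kept = labels[idx]
print(np.bincount(kept))
```

Passing the random generator explicitly keeps the selection reproducible, which matters when comparing runs with and without downsampling.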
Effect of PCA. We further evaluate whether PCA-based dimensionality reduction benefits horizon prediction. Table 3 reports the mean accuracy (averaged over all steps within the prediction horizon) under four prediction horizons. PCA consistently improves performance for short-to-medium horizons (10–40), indicating that denoising and compact representations help the encoder learn more robust temporal features.
Effect of MHSA. We ablate the multi-head self-attention (MHSA) module by removing it from the encoder while keeping all other settings unchanged. As shown in Table 4, MHSA provides consistent gains across different horizons, especially at longer horizons, which suggests that attention helps highlight informative temporal segments within the historical window for more reliable prediction across horizons.
BiLSTM vs. UniLSTM. Finally, we compare a bidirectional LSTM encoder with a unidirectional LSTM encoder under the same training protocol. Note that the bidirectionality is applied only within the observed input window and does not access any future observations from the prediction horizon; hence, it does not introduce information leakage. Figure 6 visualizes the all-correct rate (the percentage of samples for which all steps within the prediction horizon are correctly predicted) over training. BiLSTM achieves a higher all-correct rate throughout training, suggesting that modeling both earlier and later temporal dependencies inside the input window yields a more informative representation for prediction across the horizon. Additional ablation details are provided in Figure 7 for key-parameter sensitivity and in Figure 8 for PCA dimensionality analysis.
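The two horizon-level metrics used in these ablations, mean accuracy over all steps and the all-correct rate, can be computed as in the sketch below. The array layout (one row per sample, one column per horizon step) and the function name are assumptions, and the labels are toy values.

```python
import numpy as np

def horizon_metrics(y_true, y_pred):
    """y_true, y_pred: integer labels of shape (num_samples, horizon)."""
    hits = np.asarray(y_true) == np.asarray(y_pred)
    return {
        # Fraction of correct steps, averaged over every step of every sample.
        "mean_accuracy": hits.mean(),
        # Fraction of samples whose entire horizon is predicted correctly.
        "all_correct_rate": hits.all(axis=1).mean(),
    }

y_true = [[1, 1, 2], [0, 0, 0], [3, 3, 3], [2, 2, 1]]
y_pred = [[1, 1, 2], [0, 1, 0], [3, 3, 3], [2, 2, 2]]
hm = horizon_metrics(y_true, y_pred)
print(hm)
```

The all-correct rate is the stricter of the two: a single wrong step anywhere in the horizon removes the whole sample from the numerator, so it can never exceed the mean accuracy.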
6. Additional Analyses
This section provides additional analyses, including key-parameter sensitivity, PCA dimensionality effects, calibration and selective prediction, downsampled streaming/event-level evaluation, self-supervised labeling assessment, and deployment-oriented confidence analysis.
6.1. Impact of Key Parameters on Recognition Performance
To identify the optimal hyperparameter combination for the LSTM model, we conducted a grid search over three key parameters: window size (w), model hop size (hop), and history length (hist). The grid search results are visualized in Figure 7, which quantifies the model accuracy across different parameter combinations. Here, hop denotes the model-window stride used for feature sequence construction, which is distinct from the 100 ms deployment hop used in the real-time benchmark.
As indicated by the peak accuracy in Figure 7, the optimal combination pairs a small window size (w) and a small model hop size (hop) with a large history length (hist). A small window size captures fine-grained dynamic features of badminton motions (e.g., rapid stroke transitions) without over-smoothing short-term IMU signal fluctuations. A small model hop size ensures dense sampling of the time series, preserving temporal continuity and reducing information loss between adjacent windows. A large history length enables the LSTM model to leverage long-term contextual dependencies of motion sequences, which is critical for distinguishing similar badminton shots (e.g., forehand clear vs. smash) that share short-term IMU patterns but differ in long-term motion context.
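To make the role of the history length concrete, the sketch below builds past-only input sequences at the window-index level: each input holds the hist most recent window embeddings, and the target is the label of a window that lies strictly after them. The function name, the `horizon` parameter, and the toy data are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def build_past_only_sequences(embeddings, labels, hist, horizon=1):
    """Pair each history of `hist` windows with the label `horizon` windows ahead."""
    xs, ys = [], []
    for t in range(hist, len(embeddings) - horizon + 1):
        xs.append(embeddings[t - hist:t])   # windows t-hist .. t-1 only
        ys.append(labels[t + horizon - 1])  # target window is never part of the input
    return np.stack(xs), np.array(ys)

# Toy data: the embedding of window i is simply [i], and its label is i,
# so it is easy to see which windows end up in each history.
emb = np.arange(10, dtype=float).reshape(10, 1)
lab = np.arange(10)
X, y = build_past_only_sequences(emb, lab, hist=4)
print(X.shape, y[:3], X[0].ravel())
```

With ten windows and a history of four, this yields six training pairs, and the last window index inside every history is always smaller than the index of its target window.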
6.2. Impact of PCA Dimensionality on Recognition Performance
We investigated the impact of PCA dimensionality on recognition performance while keeping the temporal configuration (w, hop, hist) fixed and using the same split protocol. We first conducted a coarse scan over a broad range of PCA components, followed by a fine-grained scan around the promising region. Across all evaluated settings, we record the best observed test accuracy and the number of PCA components at which it occurs. To balance accuracy and model efficiency, we also report the smallest dimensionality whose accuracy is within a fixed tolerance (in percentage points) of the best result; this yields the recommended setting. Overall, PCA enables substantial dimensionality reduction with a negligible drop in accuracy, indicating that the handcrafted time/frequency features contain considerable redundancy and can be compactly represented without sacrificing recognition performance. The combined figure (Figure 8) consolidates the coarse and fine accuracy curves, the best-per-dimension accuracy envelope, and the accuracy gap to the best, enabling a compact view of the accuracy–efficiency trade-off and the diminishing returns beyond the recommended dimensionality.
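The selection rule described above (take the smallest dimensionality whose accuracy lies within a tolerance of the best) can be sketched in a few lines. The function name and the tolerance argument are illustrative, and the accuracy values in the example are invented placeholders, not the scan results from the paper.

```python
def recommend_dimensionality(accuracy_by_dim, tolerance_pp):
    """Return (best_dim, recommended_dim) given accuracies in percent keyed by PCA dimension."""
    best_dim = max(accuracy_by_dim, key=accuracy_by_dim.get)
    best_acc = accuracy_by_dim[best_dim]
    # Every dimension whose accuracy gap to the best is within the tolerance is eligible.
    eligible = [d for d, acc in accuracy_by_dim.items() if best_acc - acc <= tolerance_pp]
    return best_dim, min(eligible)

# Placeholder scan results (percent accuracy), for illustration only.
scan = {8: 90.0, 16: 93.5, 32: 94.6, 64: 95.0, 128: 94.8}
best_dim, recommended_dim = recommend_dimensionality(scan, tolerance_pp=0.5)
print(best_dim, recommended_dim)
```

With these placeholder numbers the best accuracy occurs at 64 components while 32 components already fall within the tolerance, which mirrors the kind of accuracy-efficiency trade-off the section describes.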
6.3. Calibration and Selective-Prediction Protocol
Calibration analysis evaluates whether the model confidence reflects the empirical probability of correct prediction, which is important when the system is used for real-time coaching or tactical feedback. For each test window, the model produces logits $z \in \mathbb{R}^{C}$ over the $C$ classes and class probabilities $p = \mathrm{softmax}(z)$. The predicted label is $\hat{y} = \arg\max_{c} p_c$, and the associated confidence is $\hat{p} = \max_{c} p_c$. These quantities support Expected Calibration Error (ECE), coverage–risk analysis, and confidence-thresholded selective prediction, where the system can abstain on low-confidence windows rather than force potentially unreliable action decisions. Table 5 reports the calibration result on the test set.
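The quantities used in this protocol follow the standard softmax construction, sketched below: probabilities from logits, the predicted label as the most probable class, and the confidence as that class's probability. The function name and the toy logits are illustrative assumptions.

```python
import numpy as np

def predict_with_confidence(logits):
    """Return (predicted labels, confidences, probabilities) for logits of shape (N, C)."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow in exp.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs.max(axis=-1), probs

toy_logits = [[2.0, 0.0, 0.0],   # one clearly preferred class
              [0.0, 0.0, 0.0]]   # uniform: confidence equals 1/C
pred, conf, probs = predict_with_confidence(toy_logits)
print(pred, conf)
```

The second row shows the lower bound on confidence: with C classes and identical logits, the maximum probability is 1/C, so any confidence threshold below that value would never trigger abstention.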
6.4. Streaming/Event-Level Evaluation (Downsampled Setting)
To better reflect event detection performance under controlled class imbalance, we additionally evaluate streaming predictions in a downsampled setting. Following our training-time strategy, we downsample background (other) windows to match the number of non-other windows (1:1) before constructing history sequences and running inference. We then merge consecutive non-other windows into predicted action segments and match them to ground-truth segments using temporal IoU with a fixed threshold. We report event-level precision/recall/F1 and detection delay measured by time-to-detect (median and 90th percentile), where delays are computed based on a fixed number of consecutive correct predictions at the native resolution of this event-level protocol. Table 6 summarizes the event-level performance in the downsampled streaming setting.
We further report the sensitivity to the temporal IoU threshold by sweeping it over a range of values, as summarized in Table 7. This downsampled setting reduces the dominance of the other class and therefore provides an upper-bound estimate of event detection performance under a balanced background.
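The event-level protocol can be sketched as three steps: merge window labels into segments, compute temporal IoU between segments, and count matches above a threshold. The sketch makes several assumptions not stated in the text: runs are merged only when consecutive windows share the same non-background label, matching is greedy and one-to-one within the same class, segment ends are exclusive window indices, and the thresholds in the example are arbitrary.

```python
def windows_to_segments(labels, background="other"):
    """Merge runs of identical non-background labels into (start, end, label), end exclusive."""
    labels = list(labels)
    segments, start = [], None
    for i, lab in enumerate(labels + [background]):  # sentinel closes a trailing run
        if start is not None and lab != labels[start]:
            segments.append((start, i, labels[start]))
            start = None
        if start is None and lab != background:
            start = i
    return segments

def temporal_iou(a, b):
    """Intersection over union of two (start, end, label) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_events(pred, truth, iou_threshold):
    """Greedy one-to-one matching of same-class segments; returns (precision, recall, f1)."""
    used, tp = set(), 0
    for p in pred:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(truth):
            if j in used or g[2] != p[2]:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_threshold:
            used.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

gt = ["other", "smash", "smash", "smash", "other", "clear", "clear", "other"]
pr = ["other", "other", "smash", "smash", "other", "clear", "clear", "clear"]
gt_seg, pr_seg = windows_to_segments(gt), windows_to_segments(pr)
print(gt_seg, pr_seg)
print(match_events(pr_seg, gt_seg, 0.5), match_events(pr_seg, gt_seg, 0.7))
```

In this toy example both predicted segments overlap their ground-truth counterparts with an IoU of 2/3, so they count as detections at a threshold of 0.5 and as misses at 0.7, which illustrates why stricter thresholds lower all three event-level scores.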
6.5. Performance Evaluation of Self-Supervised Labeling
In this experiment, the comparison between ground truth (GT) labels and self-supervised predicted (pred) labels for three players is visualized in Figure 9. Each player’s results include a side-by-side comparison of GT labels (left column) and model-generated labels (right column), enabling both a qualitative and quantitative assessment of the self-supervised labeling effectiveness.
Quantitative evaluation indicates that the self-supervised model achieves high labeling accuracy for all three players, with Player 3 scoring the highest. Qualitatively, the predicted labels in Figure 9 closely align with the GT labels, accurately capturing the temporal boundaries of badminton actions (e.g., serves, smashes, and drops) without obvious misclassifications. This supports the model’s ability to generate reliable labels without manual annotation, significantly reducing labor costs.
Notably, window size configuration is critical for balancing labeling quality and temporal responsiveness. An excessively small window leads to label “glitches” (spurious action transitions) because it fails to average out random fluctuations in the sensor data and is therefore highly sensitive to short-term IMU signal noise. Conversely, an overly large window introduces substantial labeling delay, as it requires accumulating more temporal data before generating a label and so cannot keep pace with the rapid dynamics of badminton motions (e.g., quick stroke reversals or sudden direction changes). The window size adopted in this paper balances this trade-off: as evidenced by the smooth label sequences and high accuracy in Figure 9, it limits noise-induced glitches while maintaining sufficient temporal responsiveness to match the fast-changing characteristics of IMU data.
6.6. Deployment-Oriented Additional Analyses
To complement the main streaming and calibration results, we provide two additional analyses that are directly relevant to real-time deployment. The first examines confidence-based selective prediction, where the system may abstain from low-confidence windows rather than forcing unreliable decisions. The second evaluates the sensitivity of event-level performance to the temporal IoU threshold used for segment matching. Together, these analyses clarify how the proposed system behaves under practical confidence filtering and event-detection criteria.
Figure 10 shows the trade-off between prediction coverage and risk under confidence thresholding. This result supports selective deployment modes in which uncertain predictions can be withheld or flagged for downstream review, thereby improving the reliability of the displayed feedback.
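A coverage-risk trade-off of this kind can be computed as in the sketch below, where coverage is the fraction of windows whose confidence meets the threshold and risk is the error rate among those accepted windows. The function name, the threshold grid, and the toy confidences are assumptions for illustration only.

```python
import numpy as np

def coverage_risk_curve(confidence, correct, thresholds):
    """Return a list of (threshold, coverage, risk) tuples for confidence thresholding."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for t in thresholds:
        accepted = confidence >= t
        coverage = accepted.mean()
        # Risk is the error rate among accepted predictions; undefined if none are accepted.
        risk = float("nan") if not accepted.any() else 1.0 - correct[accepted].mean()
        rows.append((t, float(coverage), float(risk)))
    return rows

conf = [0.95, 0.90, 0.70, 0.55]
hit = [True, True, False, True]
curve = coverage_risk_curve(conf, hit, thresholds=[0.5, 0.8])
for row in curve:
    print(row)
```

Raising the threshold from 0.5 to 0.8 in this toy case halves the coverage and removes the single error from the accepted set, which is the basic mechanism behind abstaining on uncertain windows.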
Table 7 reports the event-level sensitivity to the temporal IoU threshold. As expected, stricter matching thresholds reduce precision, recall, and F1 because predicted segments must align more tightly with ground-truth action intervals. The gradual performance decrease indicates that the predicted event boundaries remain reasonably stable across a range of temporal matching criteria.