4.1. Trajectory Evaluation
We evaluate the proposed method on the CurlTracer dataset [3], which contains 1033 recorded throws sampled at Hz.
Motivated by the characteristics of curling delivery and expert feedback, we adopt a two-checkpoint analysis. Specifically, we report Mean Absolute Error (MAE) and Median Absolute Error (MdAE) at the initial forecast step and at the midpoint of the prediction horizon, in addition to full-trajectory statistics. Both “first-step” and “mid-step” errors are computed per sliding window and then aggregated across all windows in the test split. We also generate continuous trajectory predictions for qualitative inspection (Figure 3).
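For concreteness, the checkpoint aggregation can be expressed as a short sketch; the array names and shapes below are illustrative assumptions, not the exact evaluation code.

```python
import numpy as np

def checkpoint_errors(y_true, y_pred):
    """Two-checkpoint error aggregation over sliding windows.

    y_true, y_pred: arrays of shape (num_windows, horizon, 2)
    holding ground-truth and predicted 2-D positions.
    Returns MAE/MdAE at the first step, the mid step, and over
    the full horizon, aggregated across all test windows.
    """
    # Euclidean error per window and per forecast step
    dist = np.linalg.norm(y_pred - y_true, axis=-1)   # (num_windows, horizon)
    mid = dist.shape[1] // 2                          # midpoint of the horizon

    stats = {}
    for name, errs in {"first": dist[:, 0],
                       "mid": dist[:, mid],
                       "full": dist.ravel()}.items():
        stats[name] = {"MAE": float(np.mean(errs)),
                       "MdAE": float(np.median(errs))}
    return stats
```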
For comparison, we re-implemented the constant-friction pivot–slide model of Shegelski and Lozowski [24]. A single average pivot ratio calibrated on of the throws ( ) was applied to the remaining . The resulting MAE/MdAE of m/ m is an order of magnitude worse than our learning-based method, but it serves as a lightweight, interpretable baseline.
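As an illustration of the calibration protocol (not the pivot–slide dynamics themselves, which follow [24]), the single scalar pivot ratio can be fitted on the training throws by grid search and then applied unchanged to the held-out throws; `simulate_throw` below is a hypothetical stand-in for the physics integrator.

```python
import numpy as np

def calibrate_pivot_ratio(train_throws, simulate_throw,
                          candidates=np.linspace(0.1, 1.0, 91)):
    """Pick the single pivot ratio minimising mean trajectory MAE
    on the training split; simulate_throw(throw, ratio) is assumed
    to return a predicted trajectory aligned with throw["xy"]."""
    best_ratio, best_mae = None, np.inf
    for ratio in candidates:
        errs = [np.mean(np.linalg.norm(
                    simulate_throw(t, ratio) - t["xy"], axis=-1))
                for t in train_throws]
        mae = float(np.mean(errs))
        if mae < best_mae:
            best_ratio, best_mae = ratio, mae
    return best_ratio  # applied unchanged to the held-out split
```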
To clarify comparison fairness for Table 1, all learning-based models (Plain LSTM and Attention-LSTM) were trained and evaluated under the same protocol: identical trajectory-level train/test split, identical pre-processing and feature construction, identical sliding-window settings ( , , stride S), and identical evaluation metrics. Model selection was performed using the same validation-based criterion within this unified pipeline. The only architectural difference between the two learning-based variants is the presence or absence of the attention block. For the physics baseline, calibration was performed on the training split only, and testing was conducted on the held-out split to keep data usage consistent across methods.
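A minimal sketch of the shared sliding-window construction, assuming each trajectory is an array of 2-D positions; `W`, `H`, and `S` stand for the (identical) window length, horizon, and stride used by every model in Table 1.

```python
import numpy as np

def make_windows(traj, W, H, S):
    """Cut one trajectory of shape (T, 2) into input/target windows.

    W: input window length, H: prediction horizon, S: stride.
    """
    X, Y = [], []
    for start in range(0, len(traj) - W - H + 1, S):
        X.append(traj[start:start + W])          # past W positions
        Y.append(traj[start + W:start + W + H])  # next H positions
    return np.asarray(X), np.asarray(Y)
```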
To complement Table 1, we additionally benchmarked three lightweight sequence models (Attention-LSTM, BiLSTM, and BiGRU) in a supplementary multi-seed profiling protocol for edge-oriented comparison. Results are summarized in Table 2.
In this supplementary comparison, BiGRU attains the lowest error, while the proposed Attention-LSTM remains the fastest and most compact of the three architectures and is close to BiLSTM in accuracy. To make this trade-off explicit, we provide an accuracy–latency Pareto visualization (Figure 4). For statistical transparency, Figure 5 further visualizes seed-level mean ± std error bars for the same comparison.
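The seed-level aggregation behind Figure 5 amounts to mean ± std over per-seed errors; a minimal sketch follows, where `results` holds hypothetical per-seed MAE lists rather than the measured values of Table 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_seed_errorbars(results):
    """Seed-level mean ± std error bars (cf. Figure 5).

    results: dict mapping model name -> list of per-seed MAE values.
    """
    names = list(results)
    means = [np.mean(results[n]) for n in names]
    stds = [np.std(results[n], ddof=1) for n in names]  # sample std over seeds
    plt.errorbar(range(len(names)), means, yerr=stds, fmt="o", capsize=4)
    plt.xticks(range(len(names)), names)
    plt.ylabel("MAE (m)")
    plt.show()
```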
Removing the self-attention layer (“Plain LSTM”) degrades MAE by approximately , indicating that global temporal context is important for modeling late-stage curl. For reference, the analytic prediction we use is given with m and m, where and compensate for the rink-centric coordinate orientation. Despite relying only on 2-D coordinates and timestamps, the attention-augmented LSTM outperforms the physics model by an order of magnitude, underscoring the value of temporal context in this low-dimensional setting.
To further clarify the attention design choice, we compared four variants under the same data split and evaluation pipeline: no attention (plain), scaled dot-product attention, additive attention, and a two-head multi-head attention variant. Results are summarized in Table 3.
In this comparison, additive attention achieves lower MAE but with substantially higher latency, whereas scaled dot-product attention reduces error relative to the no-attention baseline with only a small latency increase (Keras: 0.334 → 0.253 m, 80.13 → 82.37 ms; TFLite FP32 latency: 10.37 → 10.50 ms). Given the edge-oriented objective of this study, we selected scaled dot-product attention as a practical accuracy–latency compromise.
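For reference, the selected variant corresponds to scaled dot-product self-attention over the LSTM states. The sketch below is a minimal Keras approximation with illustrative layer sizes, not the paper's exact configuration; swapping `layers.Attention` for `layers.AdditiveAttention`, or a two-head `layers.MultiHeadAttention`, yields the other variants in Table 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_attention_lstm(W, H, units=64):
    """Minimal Attention-LSTM sketch with 2-D coordinate inputs."""
    inp = layers.Input(shape=(W, 2))                  # past W (x, y) points
    seq = layers.LSTM(units, return_sequences=True)(inp)
    # Scaled dot-product self-attention over the LSTM state sequence
    att = layers.Attention(use_scale=True)([seq, seq])
    ctx = layers.GlobalAveragePooling1D()(att)        # summarise global context
    out = layers.Dense(H * 2)(ctx)                    # H future (x, y) points
    out = layers.Reshape((H, 2))(out)
    return tf.keras.Model(inp, out)
```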
We further examine the accuracy at the first and midpoint checkpoints.
Following Section 3.5, we computed overlap-derived percentile envelopes (5th–95th) on the same held-out windows as descriptive uncertainty indicators. The envelope is tighter in the near horizon and widens toward the long horizon, consistent with the checkpoint trend in point-error statistics. To complement these point estimates, we additionally report compact error-distribution statistics (median/IQR/p90) on held-out windows for deployment reliability in Table 4.
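A compact sketch of these descriptive statistics, assuming per-window, per-step Euclidean errors are available; the envelope here is taken across windows at each horizon step, a simplification of the overlap-derived computation described in Section 3.5.

```python
import numpy as np

def envelope_and_stats(dist):
    """Descriptive uncertainty indicators on held-out windows.

    dist: per-window, per-step Euclidean errors, shape (num_windows, H).
    Returns the 5th-95th percentile envelope per horizon step and the
    median/IQR/p90 summary of the kind reported in Table 4.
    """
    lo, hi = np.percentile(dist, [5, 95], axis=0)    # envelope per step
    q25, q50, q75, p90 = np.percentile(dist.ravel(), [25, 50, 75, 90])
    return (lo, hi), {"median": q50, "IQR": q75 - q25, "p90": p90}
```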
Consistent with the online-use objective, we focus operational interpretation on the near/mid horizons and stop-region guidance. Here, the aggregation sample size is the number of test windows ( ) generated from held-out trajectories; for the split reported in this work, . The three checkpoints (first, mid, and last) are visualized in Figure 6, Figure 7, and Figure 8 to illustrate model behavior across the horizon.
Stopping-point accuracy is shown in Figure 9: predicted stops (purple crosses) versus ground truth (green circles), overlaid with the standard hog line and house markings. In the illustrated case, the deviation is m. Considering the measurement characteristics of a monocular camera pipeline, we regard this level of accuracy as suitable for practical use.
4.2. Real-Time Evaluation
To assess real-time feasibility, we measured computational performance under two representative environments using the same prediction pipeline (84 sliding windows per throw with a stride of 5 frames, ≈0.08 s).
On a Google Colab free-tier instance (AMD EPYC 7B12, 8 threads), a full-trajectory forecast was completed in s. The per-window latency averaged ≈0.07 s.
On a Raspberry Pi 4B (4 GB RAM), the total prediction time was 21 s per throw. The mean per-window latency was ≈0.25 s (including I/O and post-processing). Accordingly, we published updates at an effective cadence of ≈0.25 s while buffering intermediate frames. Adjusting the update rate can further trade freshness for computational headroom as needed. These results indicate that the model maintains acceptable throughput and compute overhead on a low-power edge device, supporting real-time field applications.
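The per-window latency figures above can be reproduced with a simple timing loop of the following form; here `model.predict` stands in for the full pipeline, so I/O and post-processing costs are not captured by this sketch.

```python
import time

def profile_per_window(model, windows):
    """Measure total and mean per-window forecast latency."""
    t0 = time.perf_counter()
    for w in windows:
        model.predict(w[None, ...], verbose=0)   # one sliding window
    total = time.perf_counter() - t0
    return total, total / len(windows)           # seconds, seconds/window
```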
4.3. Lightweighting of Models in Edge AI Scenarios
To support on-device inference in edge AI settings, the trained model was converted into four TFLite variants: FP32 (builtins + select_tf_ops), dynamic-range quantization (DRQ), FP16, and INT8 (full integer quantization with select_tf_ops). All experiments were executed on the same host as the original Keras baseline.
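The four variants correspond to standard `TFLiteConverter` configurations, sketched below under our stated settings; `rep_dataset` is a representative-input generator assumed for INT8 calibration.

```python
import tensorflow as tf

def convert_variants(model, rep_dataset):
    """Produce the four TFLite variants evaluated in Table 5."""
    variants = {}

    # FP32 (builtins + select_tf_ops)
    c = tf.lite.TFLiteConverter.from_keras_model(model)
    c.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                   tf.lite.OpsSet.SELECT_TF_OPS]
    variants["fp32"] = c.convert()

    # Dynamic-range quantization (DRQ)
    c = tf.lite.TFLiteConverter.from_keras_model(model)
    c.optimizations = [tf.lite.Optimize.DEFAULT]
    variants["drq"] = c.convert()

    # FP16
    c = tf.lite.TFLiteConverter.from_keras_model(model)
    c.optimizations = [tf.lite.Optimize.DEFAULT]
    c.target_spec.supported_types = [tf.float16]
    variants["fp16"] = c.convert()

    # INT8 (full integer quantization with select_tf_ops)
    c = tf.lite.TFLiteConverter.from_keras_model(model)
    c.optimizations = [tf.lite.Optimize.DEFAULT]
    c.representative_dataset = rep_dataset
    c.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
                                   tf.lite.OpsSet.SELECT_TF_OPS]
    variants["int8"] = c.convert()

    return variants
```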
Table 5 reports the window-level MAE (overall and first-step) for each TFLite variant, evaluated on the same test split of windows. To keep the comparison fair, quantization effects in this subsection are interpreted relative to FP32 under the same feature setting. Under this setting, FP32 and DRQ remain close in the early horizon, while FP16 and INT8 exhibit larger deviations. A closer comparison shows that the degradation is not uniform across the horizon. Relative to FP32 (first-step MAE m; window MAE m), FP16 increases to m (+ m) at the first step and m (+ m) over the full window, while INT8 increases to m (+ m) and m (+ m), respectively. This indicates that precision reduction has a moderate short-horizon effect but a larger impact on long-horizon outputs. As a compact stability proxy for online publication, we also report the horizon-gap index from Table 5: FP32 m, DRQ m, FP16 m, INT8 m. A larger value of this index indicates stronger long-horizon output fluctuation, which is consistent with wider overlap-derived spread/percentile envelopes and a potentially higher risk of UI flicker in continuous updates. A plausible mechanism is that low-precision arithmetic introduces discretization and rounding perturbations in recurrent/attention computations, which become more visible as the prediction horizon extends [25]. In our setting, this supports the practical choice of FP32/DRQ when accuracy robustness is prioritized and FP16/INT8 when additional compression is required.
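As a sketch of how such a horizon-gap proxy can be computed (assuming, for illustration, the gap between last-step and first-step MAE over held-out windows; the exact definition accompanies Table 5):

```python
import numpy as np

def horizon_gap(dist):
    """Illustrative horizon-gap proxy.

    dist: per-window, per-step Euclidean errors, shape (num_windows, H).
    Returns last-step MAE minus first-step MAE; a larger value signals
    stronger long-horizon fluctuation.
    """
    first = float(dist[:, 0].mean())
    last = float(dist[:, -1].mean())
    return last - first
```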
We additionally benchmarked the end-to-end inference time for three representative throws. As shown in Table 6, all TFLite runtimes provide substantial speed-ups over the Keras baseline, typically reducing latency by 7–8×. Among them, the FP32 variant offers the best balance between execution time and fidelity, requiring no additional calibration while maintaining accuracy close to the original model.
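The TFLite timing protocol amounts to repeated interpreter invocations over the per-throw windows; a minimal sketch follows, with FP32 input handling and illustrative names rather than the exact benchmark harness.

```python
import time
import numpy as np
import tensorflow as tf

def time_tflite(tflite_bytes, windows, runs=3):
    """Average end-to-end inference time per throw for a TFLite model."""
    interp = tf.lite.Interpreter(model_content=tflite_bytes)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]

    t0 = time.perf_counter()
    for _ in range(runs):
        for w in windows:                          # all windows of one throw
            interp.set_tensor(inp["index"], w[None].astype(np.float32))
            interp.invoke()
            _ = interp.get_tensor(out["index"])
    return (time.perf_counter() - t0) / runs       # seconds per throw
```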
Overall, within this lightweight feature setting, the TFLite FP32 and DRQ models preserve most of the in-config predictive performance while reducing inference time to well under per full-trajectory prediction. These results confirm that lightweight deployment is feasible with moderate accuracy degradation, enabling real-time trajectory feedback on embedded devices.