In this section, we present a comprehensive empirical evaluation of the proposed WinStat family of positional encodings across heterogeneous benchmarks. Our experimental design aims to identify which encoding variant is most effective across diverse datasets. To this end, we evaluate all proposed variants on each dataset (HPC, the ETT hourly variants, NYC, and TINA) alongside established baselines, using identical architectural configurations and training protocols to ensure fair comparison. Performance is assessed through MSE and MAE metrics across multiple independent runs. To verify that improvements stem from meaningful positional structure rather than incidental artifacts, we include decoder-input shuffling tests that systematically disrupt temporal order. This ablation establishes a causal link between the proposed local semantic encodings and forecasting accuracy, while the learned mixture weights provide interpretable insight into which positional components contribute most in each domain.
5.1. Comparative Results
We use the five previously introduced datasets as a comprehensive testing ground to evaluate the proposed encodings (WinStat, WinStatLag, WinStatFlex, WinStatTPE) against established baselines under identical architectural configurations and data splits. Collectively, these datasets constitute a heterogeneous benchmark environment, capturing a wide spectrum of temporal properties such as seasonality, stationarity, locality, and the coexistence of high- and low-frequency information. This diversity provides a rigorous basis for assessing the robustness and generality of the proposed methods across problems with distinct structural and semantic characteristics.
Table 3 consolidates all results across datasets and methods, including Informer (with timeF), sinusoidal APE only, a no-PE ablation, and the WinStat family. Informer serves as the reference baseline, with removal of PE leading to substantial performance degradation (higher MSE/MAE), whereas sinusoidal APE shows moderate losses compared with Informer.
Figure 3 shows that, among the WinStat variants, WinStatFlex and WinStatTPE consistently achieve the lowest errors and the smallest run-to-run variance on HPC. Additionally, results on the ETTh, NYC, and TINA datasets, which are representative state-of-the-art benchmarks, confirm that the observed patterns generalize beyond the HPC dataset.
On the ETT hourly benchmarks (ETTh1/ETTh2), both WinStatFlex and WinStatTPE markedly outperform Informer, while the no-PE ablation consistently ranks last, underscoring the necessity of explicit positional information.
Figure 4 shows the performance on ETTh1: TUPE falls between WinStatFlex and WinStatTPE, highlighting its intermediate effectiveness. Sinusoidal APE performs poorly, in some cases even worse than the shuffled WinStat variants, suggesting that purely geometric absolute signals may misalign with the true temporal dependencies of this subset unless complemented by local or semantic components. As shown in
Table 3, the ranking for ETTh2, from best to worst, is: WinStatFlex > WinStat base > TUPE > WinStatTPE > WinStatLag, with ROPE performing particularly poorly and occupying the last position. As illustrated in
Figure 5, the visual comparison clearly emphasizes this gap and reveals the high variance in ROPE’s results, underscoring its instability across runs. This confirms that semantic and local positional information, as captured by the WinStat variants, remains critical for accurately modeling long-range dependencies in these datasets.
Regarding the hourly NYC aggregates, we caution that the series was artificially extended, which may bias the metrics toward optimistic values due to the smoothing effects of generative augmentation. With that caveat, WinStat base achieves the best performance overall. WinStatLag, WinStatFlex, and ROPE show very similar results, while TUPE performs poorly, approaching the performance level of the
no-PE ablation. WinStatTPE still surpasses sinusoidal APE, but the ranking highlights that, on this dataset, the effectiveness of positional encoding varies substantially across methods. As shown in Table 3 and Figures 6 and 7, this variability is visually evident, with marked contrasts in performance stability between the WinStat family and methods such as TUPE; the results are split across two figures for readability.
The fourth benchmark, TINA—an industrial, high-dimensional dataset originally curated for anomaly analysis—poses a stringent test of computational efficiency and robustness. Even after careful optimization, its scale requires substantial hardware and long runs; however, its breadth provides a valuable stress test for the positional mechanisms. As summarized in
Table 3, the results indicate that, in terms of MAE, WinStatLag achieves the best performance, followed closely by the rest of the WinStat family, with Informer slightly behind; ROPE, Fixed PE, and TUPE all perform worse than Informer, and TUPE ranks last, even below the no-PE ablation. These fluctuations show that the relative effectiveness of positional encodings depends strongly on the particular characteristics of this dataset. Graphically, the behavior of the encodings can be observed in
Figure 8, where WinStatLag and WinStatTPE stand out, achieving substantially better results than the other methods. However, it becomes difficult to establish a clear hierarchy, as the relative performance varies depending on the metric considered.
Computational complexity, encompassing both execution time and memory usage, represents an additional dimension of this analysis. As detailed in the Time/Epoch rows of Table 3, the Informer and no-PE configurations demonstrate the highest temporal efficiency, owing to their reduced computational load. Conversely, WinStatTPE exhibits the highest computational cost, attributable to the overhead of computing the TPE component and aggregating the weighted encodings. For the remaining WinStat variants, training duration scales by a factor of two to three, comparable to algorithms such as ROPE or Informer on the largest datasets, reflecting the processing required for the additional window statistics.
To further evaluate computational cost in terms of execution time, we conducted a GPU profiling study on the moderately sized ETTh1 and ETTh2 datasets, as shown in Figure 9. This analysis corroborates the timing impact across the four WinStat models: notably, WinStatFlex and WinStatTPE show roughly a 33% increase in training duration compared with the Informer architecture. This overhead is inherent to the design, stemming from the additional processing required to derive statistics within the sliding window.
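For readers who wish to reproduce this kind of per-epoch timing measurement, a simple wall-clock timer suffices; the loop body below is a hypothetical stand-in workload, not our actual training step:

```python
import time
from contextlib import contextmanager

@contextmanager
def epoch_timer(log):
    """Append the elapsed wall-clock time of the enclosed block to `log`."""
    start = time.perf_counter()
    yield
    log.append(time.perf_counter() - start)

times = []
for epoch in range(3):
    with epoch_timer(times):
        # Hypothetical stand-in for one training epoch.
        sum(i * i for i in range(100_000))

print([round(t, 4) for t in times])
```

Averaging such per-epoch times over several runs gives the Time/Epoch figures reported in Table 3.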
In terms of memory consumption, while the computation of window-based statistics introduces an additional processing load, the specific impact on memory allocation remains minimal. The observed constraints are inherent to the underlying Transformer architecture rather than the positional encoding scheme. Since the WinStat method computes encodings via a sliding window without retaining auxiliary data, it maintains a memory footprint comparable to other standard methods, exhibiting no additional overhead.
This observation is further supported by
Figure 10, which visualizes memory usage for the ETTh1 and ETTh2 datasets. The results indicate negligible variance across most models, with only a minor increment observed in WinStatFlex. While WinStatTPE presents a distinct outlier with increased consumption—likely due to implementation specifics or GPU memory allocation dynamics—the primary proposed method, WinStatFlex, confirms that the memory overhead remains efficient and within reasonable limits.
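To make the sliding-window computation concrete, the sketch below derives a window-statistics encoding from the input alone, retaining no auxiliary state beyond the output tensor; the window size and the choice of mean and standard deviation as statistics are illustrative assumptions, not the exact WinStat formulation:

```python
import numpy as np

def window_stat_encoding(x, window=24):
    """Per-position statistics of a trailing window: (T, F) -> (T, 2F).

    Computed on the fly from the series itself, so the memory footprint
    is limited to the output tensor (illustrative sketch of the idea).
    """
    T, F = x.shape
    enc = np.zeros((T, 2 * F))
    for t in range(T):
        w = x[max(0, t - window + 1): t + 1]   # trailing window ending at t
        enc[t, :F] = w.mean(axis=0)            # local level
        enc[t, F:] = w.std(axis=0)             # local dispersion
    return enc

x = np.sin(np.linspace(0, 8 * np.pi, 200))[:, None]  # toy seasonal series
enc = window_stat_encoding(x, window=24)
print(enc.shape)  # (200, 2)
```

Because the encoding is recomputed per batch and nothing is cached across steps, its memory cost matches that of any additive positional encoding of the same shape.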
Finally, to validate the performance of WinStatFlex, we conducted a Bayesian Signed-Rank Test, as visualized in
Figure 11. By establishing a Region of Practical Equivalence (ROPE) interval of 0.05, the analysis confirms that WinStatFlex significantly outperforms the no-PE, sinusoidal PE (sin/cos), and Informer baselines: in these cases, the posterior probability mass lies almost entirely outside the equivalence interval, indicating a clear, statistically meaningful advantage. A degree of practical equivalence is observed only against the TUPE and RoPE encodings; while WinStatFlex maintains favorable performance, these methods emerge as its closest competitors, with part of the probability mass falling within the equivalence boundaries. Note that, because the runs are not fully independent samples, the test's conclusions should be interpreted with some caution. Nevertheless, these results provide further evidence of the advantage of the WinStat family over the other positional encoding solutions.
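As a rough illustration of how such ROPE probabilities are obtained, the sketch below applies a Bayesian bootstrap to paired error differences; this is a simplified stand-in for the full Bayesian signed-rank test, and the delta values are invented for illustration only:

```python
import numpy as np

def rope_probabilities(delta, rope=0.05, n_samples=20_000, seed=0):
    """Posterior P(left), P(rope), P(right) for the mean paired difference.

    Bayesian bootstrap: Dirichlet(1,...,1) weights over the observed
    differences induce a posterior over the weighted mean (a simplified
    stand-in for the Bayesian signed-rank test).
    """
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(len(delta)), size=n_samples)
    means = w @ delta
    p_left = np.mean(means < -rope)
    p_rope = np.mean(np.abs(means) <= rope)
    p_right = np.mean(means > rope)
    return p_left, p_rope, p_right

# Hypothetical per-run differences (baseline MSE minus competitor MSE);
# positive values favor the competitor.
delta = np.array([0.12, 0.08, 0.15, 0.10, 0.09])
p_left, p_rope, p_right = rope_probabilities(delta, rope=0.05)
print(p_left, p_rope, p_right)
```

When the probability mass concentrates in the right region, as here, the competitor is judged practically better; mass inside the ROPE indicates practical equivalence.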
5.3. Comparing the Semantic Quality of the Encoding
In this section, we highlight the semantic role of positional encodings and their importance for evaluating encoding quality. Although semantics lie outside the architecture, whose layers process only numerical tensors, such information may be implicitly captured through locality and patterns, which our proposed encodings aim to exploit.
A straightforward way to assess this is to shuffle the decoder inputs during inference and compare the performance with intact sequences, as suggested in [6]. This procedure provides an intuitive measure of whether an encoding preserves order and locality, as a significant drop in performance indicates that the model relies on sequential structure to make accurate predictions. In contrast, if shuffling has little effect, it suggests that the encoding does not effectively capture temporal dependencies.
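The shuffling ablation can be sketched as follows; `toy_model` is a hypothetical, order-sensitive stand-in for a trained forecaster, used only to show how the intact and shuffled errors are compared:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_decoder_inputs(x_dec, rng):
    """Permute the time axis of the decoder input (batch, seq_len, features)."""
    perm = rng.permutation(x_dec.shape[1])
    return x_dec[:, perm, :]

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def toy_model(x_dec):
    """Hypothetical order-sensitive forecaster: last step plus a trend term."""
    trend = np.arange(x_dec.shape[1])[None, :, None] * 0.01
    return (x_dec + trend)[:, -1, :]

x_dec = rng.normal(size=(8, 24, 3))  # (batch, seq_len, features)
y_true = toy_model(x_dec)            # targets from the intact ordering

err_intact = mse(toy_model(x_dec), y_true)
err_shuffled = mse(toy_model(shuffle_decoder_inputs(x_dec, rng)), y_true)

# A large gap between the two errors indicates reliance on sequential structure.
print(err_intact, err_shuffled)
```

In the experiments below, the same comparison is run with the actual trained models, and the error gap (shuffled minus intact) is the quantity reported.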
To examine this property, we employed the five datasets evaluated previously, selected to study the impact of input shuffling across problems with diverse temporal and semantic characteristics, particularly in terms of seasonality and stationary patterns. The decoder shuffling procedure was applied to the WinStat family, as well as to TUPE, ROPE, and the Informer model itself, enabling a direct comparison of its effect across these different approaches.
As shown in Table 9, the performance of the original Informer and its shuffled variant is nearly identical, especially on HPC, which is consistent with the findings of the original study in [6]. The version without positional encoding performs at an intermediate level, leading to broadly similar results across all three models and underscoring the limited semantics and weak order preservation provided by the traditional sinusoidal scheme.
In contrast, WinStatFlex exhibits a marked drop in performance when inputs are shuffled, a desirable behavior that reflects its ability to capture both contextual dependencies and the intrinsic sequential structure of the data.
A similar effect is observed for WinStatTPE, whose degradation under shuffling clearly exceeds that of the baseline Informer. This outcome indicates that the semantic information captured by T-PE is likewise sensitive to disorder, highlighting its superior ability to encode a meaningful temporal structure compared with the traditional approach.
Although these two methods achieved the best results in the previous experiments within our model family, the remaining variants exhibit a consistent pattern: all suffer substantial performance degradation when input shuffling is applied. This provides empirical evidence that our proposed encodings are sensitive to structural changes in the data and, consequently, that they successfully inject meaningful positional information with a strong semantic component.
Across the evaluated benchmarks, WinStat base consistently demonstrates superior performance on the TINA and NYC datasets, reflecting the particular temporal characteristics of these series. In TINA, the time series exhibits complex, low-frequency fluctuations over a wide temporal range, making the statistical component especially critical for capturing relevant patterns. In NYC, the synthetic extension of the series similarly emphasizes the importance of statistical features, helping to stabilize learning despite the artificial augmentation. In contrast, TUPE consistently displays a counterintuitive behavior on these same datasets: applying input shuffling leads to better performance than the unshuffled model, contrary to the expected degradation. This can be attributed both to the nature of the mechanism, which is driven more strongly by attention computations than by the explicit encoding itself, and to the specific characteristics of the datasets. The smoothing effects in NYC and the wide-ranging fluctuations in TINA reduce the direct impact of positional information on performance, highlighting a fundamental difference from the WinStat family, whose encodings are more robustly anchored in the statistical and semantic structure of the data.
We can analyze this differential behavior graphically through visualizations that compare the original and shuffled models across the evaluated datasets. In these plots, higher values indicate a larger degradation under shuffling, meaning that semantic and ordering information is being effectively injected into the model. These visual comparisons highlight how the encoding methods from the WinStat family affect the model's ability to capture meaningful temporal and structural dependencies across datasets.
First, the HPC dataset exhibits minimal differences between the original and shuffled Informer model, as illustrated in
Figure 13, which is particularly noteworthy and indicates the limited ordinal information contributed by this approach. A similar behavior is observed for the other tested methods, TUPE and ROPE. In contrast, all four WinStat variants show a substantially larger differential, highlighting a strong ordinal contribution to the model that is effectively disrupted when input shuffling is applied.
For ETTh1 and ETTh2, the observed results differ considerably. In ETTh1, as shown in
Figure 14, input shuffling imposes a substantial penalty on Informer, although this effect is smaller in terms of MAE compared with the WinStat family. TUPE, in contrast, exhibits almost negligible degradation, while ROPE even shows a negative difference, indicating better performance under shuffling than in the original configuration, which suggests that this method is not well suited for this dataset. In ETTh2, illustrated in
Figure 15, the behavior shifts: WinStatFlex now suffers the largest degradation (in contrast to WinStatLag in ETTh1). Most notably, ROPE exhibits a pronounced negative delta, reflecting a substantial improvement under shuffling. As these experiments were repeated multiple times, this is not an artifact, but rather a clear indication that this mechanism is ineffective for the ETTh2 dataset.
The NYC dataset, shown in Figure 16, exhibits one of the largest discrepancies observed in this experiment, with the WinStatFlex encoding achieving a difference on the order of 10^2. This represents, without doubt, the most pronounced result across all benchmarks. Although similar trends can be identified in other models within the WinStat family (WinStat base and WinStatLag), the magnitude of the difference in this case is considerably greater, highlighting the distinct sensitivity of this dataset to the proposed encoding.
Finally, as illustrated in
Figure 17, in the TINA dataset, several encodings—such as ROPE, TUPE, and the original Informer—exhibit negative differences, meaning that the shuffled models actually outperform their unshuffled counterparts. While this behavior is theoretically counterintuitive, it reinforces the notion that these positional encoding mechanisms are not well suited to the specific characteristics of this dataset. In contrast, all variants of the WinStat family display the expected degradation under shuffling—most notably WinStatTPE—thereby confirming their stronger capacity to encode meaningful positional and semantic information.