2. From Words to Numbers: Applying Transformers to Time-Series Forecasting
Originally developed for sequence-to-sequence modeling in natural language processing, the Transformer architecture was introduced by Vaswani et al. in [3] with the aim of facilitating machine translation between languages. The classical architecture contains an Encoder and a Decoder, the latter designed for autoregressive text generation.
The adaptation of the Transformer for time-series forecasting is described in [4]. The authors' improvements to the Transformer's initial configuration address the memory bottleneck and weak context modeling:
- The memory bottleneck arises from the nature of time series. The standard self-attention mechanism has quadratic complexity O(L²) with respect to the input sequence length L. For a long time series, this quickly becomes infeasible in terms of both memory and computation.
- Weak context modeling is called "locality-agnostic" behavior by the authors. Time series often have strong local dependencies (short-term correlations). Standard self-attention treats all positions equally, which makes it less effective at capturing these local patterns.
The solution to these two problems was the implementation of a convolutional self-attention mechanism that generates queries and keys using causal convolution [4], which makes local context easier to integrate into the attention mechanism. The second step of this solution involves the development of a LogSparse Transformer, featuring a complexity of O(L(log L)²), aimed at improving the forecasting accuracy, particularly for time series with high sampling rates.
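As an illustration of the convolutional query/key generation described above, the sketch below uses a causal 1-D convolution so that each query summarizes the local shape of the series rather than a single point. This is a minimal NumPy sketch under stated assumptions: a toy univariate sequence and a hand-picked width-3 filter, whereas the real model learns multiple filters per attention head.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: the output at t depends only on x[t-k+1..t].

    x: (L, d) sequence; kernel: (k,) filter applied per feature column.
    The sequence is left-padded with zeros so the output keeps length L.
    """
    k = len(kernel)
    padded = np.vstack([np.zeros((k - 1, x.shape[1])), x])  # zero "past"
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        out[t] = kernel @ padded[t:t + k]  # weighted sum of the last k steps
    return out

# Toy sequence: queries from a width-3 causal filter instead of a pointwise
# (kernel size 1) projection, so each query "sees" the local shape.
L, d = 8, 1
x = np.arange(L, dtype=float).reshape(L, d)
q = causal_conv1d(x, np.array([0.2, 0.3, 0.5]))
```

With a kernel of size 1 this reduces to the standard pointwise projection; larger kernels are what let the attention scores reflect local patterns.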
It is important to mention that a high noise level in the time series affects the accuracy of the canonical Transformer's predictions, especially when it surpasses 50% of the entire signal. This is because scalar-product attention treats all input positions equally, so when noise dominates, the attention scores correlate with noise patterns, not with the true temporal relationships. This violates the assumption that temporal proximity or local shape is significant, and the forecast becomes less reliable [4].
In [5], a new architecture is introduced, called the Temporal Fusion Transformer (TFT), designed for interpretable multi-horizon forecasting. Besides multi-horizon forecasting, handling heterogeneous inputs and interpretability are also very important for time-series analysis. Thus, static features (for example, customer ID or region), known future inputs (holidays or planned prices), and observed historical inputs (like past sales) influence the forecasting precision. Traditional models (Autoregressive Integrated Moving Average—ARIMA; Long Short-Term Memory—LSTM; the Vanilla Transformer of [3]) typically fail to provide both high accuracy and interpretability. In order to overcome these disadvantages, the TFT uses temporal processing with attention and LSTM, as well as quantile outputs and other components, as described in [6]. At the same time, the gating mechanisms of the TFT prevent overfitting by letting parts of the network skip unnecessary computations. Variable selection networks provide interpretability by assigning importance weights to features. Static covariate encoders encode time-invariant features (e.g., category, region) and inject this information throughout the model. Temporal processing captures local sequential dependencies through LSTM layers, while multi-head attention focuses on long-term dependencies and relationships over time. Finally, the quantile outputs generate prediction intervals through simultaneous forecasts of different percentiles at each time step [5].
When the noise level in the time series used for prediction with the TFT exceeds 50%, autocorrelation maps are weaker because the noise affects temporal dependencies, reducing the effective information for the LSTM Encoder. As a result, the forecasts will fluctuate around the mean, signifying higher residuals. At the same time, the attention maps will be flattened because the salient time points will be obscured, causing less efficient learning of long-term dependencies. Another phenomenon that appears is gate saturation, which forces the TFT to become less reactive to subtle variations in the signal. Ultimately, quantile widening makes the probabilistic forecasts less reliable. All this can increase the Mean Absolute Percentage Error (MAPE) by approximately 25–40% or even more, depending on the implementation.
As mentioned before, Vanilla Transformers are generally powerful but present quadratic complexity in terms of memory and computation with respect to the input sequence length L, making them impractical for processing very long sequences (e.g., long text, audio, or time series). The Reformer architecture, described in [7], addresses this by introducing two main efficiency innovations that allow Transformers to scale to much longer sequences without loss of modeling power: Locality-Sensitive Hashing (LSH) Attention and Reversible Residual Layers. Here, keys and queries are mapped into buckets using a hash function, according to the principle introduced in [8]. Attention is computed only within the same bucket, drastically reducing memory and computation. Ultimately, the complexity is reduced from O(L²) to O(L log L), where L represents the length of the time series. In general, standard residual connections store all intermediate activations for backpropagation, consuming memory. The Reformer instead uses reversible layers, where inputs can be recomputed from outputs during backpropagation, saving memory.
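The LSH bucketing idea can be illustrated as follows. This is a minimal sketch using random hyperplane (sign-bit) hashing, not Reformer's exact shared-QK, chunked implementation; the bucket count and plane count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors, n_planes=4):
    """Hash vectors with random hyperplanes: vectors with a small angle
    between them tend to fall on the same side of each plane and therefore
    into the same bucket (here, one of 2**n_planes buckets)."""
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    signs = (vectors @ planes.T) > 0                      # (L, n_planes) bits
    return (signs * (2 ** np.arange(n_planes))).sum(axis=1)

# Attention restricted to same-bucket pairs: the cost scales with the
# bucket sizes instead of with L**2.
L, d = 16, 8
qk = rng.standard_normal((L, d))                          # shared query/key space
buckets = lsh_buckets(qk)
pairs = [(i, j) for i in range(L) for j in range(L) if buckets[i] == buckets[j]]
```

In the full architecture the buckets are additionally sorted and chunked so the computation stays dense on hardware; the sketch only shows why nearby queries and keys end up attending to each other.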
When the Reformer is used for time-series forecasting with a noise level higher than 50%, the attention locality becomes unstable, and global context learning degrades faster than for the canonical Transformer. Speed and memory are not influenced by random variations in the signal. However, the forecasting error can be severely affected because the noisy tokens will dominate the local buckets, and the attention weights become less relevant. In the end, this leads to overfitting and, of course, to higher forecasting errors. This degradation in accuracy is in the same range as in the TFT case.
In [9], a new type of Transformer architecture, called Informer, is introduced. The goal is again to address the limitations that the initial architecture, designed for text processing, exhibits on long sequences. Thus, the main objective was to maintain the modeling power of the self-attention mechanism while reducing the computational and memory burdens, thereby making the prediction more efficient for long output horizons. The most important contributions were related to the following:
- ProbSparse self-attention mechanism: A probabilistic sparse attention mechanism (ProbSparse) is implemented that selects only the "dominant" queries to compute full attention while treating the other queries approximately. This reduces both the time and memory complexity from O(L²) to O(L log L), where L denotes the time-series length. The method leverages the observation that, in attention distributions, many query–key pairs contribute very little, and thus, focusing on the "important" ones is sufficient.
- Self-attention distilling mechanism (layer pooling): A distillation process is developed between attention layers, where intermediate representations are pooled in order to shrink the computational graph with increasing depth. This helps mitigate memory usage for very long time series.
- Generative-style Decoder (one-shot decoding): Instead of autoregressive decoding (step by step), a "generative decoding" approach is adopted. This outputs the full forecast horizon in one forward pass, significantly speeding up the inference when forecasting long sequences.
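A simplified version of the ProbSparse idea can be sketched as follows. The activity score below is a simplified stand-in for Informer's max-minus-mean sparsity measure, and the number of retained queries u is a free parameter here (in the paper it grows only logarithmically with L); "lazy" queries receive the mean of the values instead of full attention.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Keep full attention only for the u most 'active' queries.

    Activity score: max minus mean of each query's scaled scores. Rows with
    a near-uniform attention distribution contribute little and are replaced
    by the mean of V (the 'lazy' default).
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                       # (L, L)
    measure = scores.max(axis=1) - scores.mean(axis=1)  # per-query activity
    top = np.argsort(measure)[-u:]                      # dominant queries
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))      # lazy default output
    w = np.exp(scores[top])
    w /= w.sum(axis=1, keepdims=True)                   # softmax rows
    out[top] = w @ V                                    # full attention
    return out

rng = np.random.default_rng(1)
L, d = 32, 4
Q = K = V = rng.standard_normal((L, d))
out = probsparse_attention(Q, K, V, u=8)
```

Only u rows of the score matrix are ever exponentiated and normalized, which is where the complexity saving comes from.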
After Informer, which focused on efficiency for long sequences, researchers identified two other challenges: Transformer attention is not always the best at capturing periodic or seasonal dependencies in the input sequence, and pure attention may overfit noise and fail to generalize for long-term forecasting.
In time-series forecasting with a noise level greater than 50%, Informer generally performs well. The situation changes dramatically when the signal must not be denoised, as in trading operations. ProbSparse attention acts as a robust filter, but when this filtering is not desired, the mechanism automatically leads to high forecasting errors. The learning of long-term dependencies is partially preserved, which is a positive aspect since the long-term trend remains detectable; when the noise dominates, however, long-term modeling becomes severely affected. In conclusion, the overall robustness of Informer can be considered good as long as the noise component can safely be filtered out. This is why, in this specific situation, where a prediction must be made using a highly volatile time series, Informer is actually not appropriate, because the MAE or MAPE degradation can surpass 40%.
To address some of the issues with Informer, Autoformer was introduced in [10]. This architecture is based on two new concepts: time-series decomposition and a novel autocorrelation mechanism.
The series decomposition block splits each input sequence into trend and seasonal parts. One of the most important advantages in this case is that Autoformer will process these components separately from one another, allowing better modeling of smooth long-term trends and periodic seasonal fluctuations. This method reduces noise and improves stability for long forecasting horizons. At the same time, instead of computing attention across all positions, the autocorrelation mechanism focuses on periodic dependencies in the time series so that the model can understand repeating patterns (e.g., daily or weekly cycles in energy data). Another feature of this architecture is progressive decomposition, where the principal and seasonal parts of the signal are refined at each time step, resulting in improved forecasting precision.
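The decomposition block can be sketched as a moving average that extracts the trend, with the remainder treated as the seasonal part. This is a minimal NumPy sketch under simplifying assumptions (edge padding with the boundary values, a hand-picked window); Autoformer performs this split inside the network and refines it progressively.

```python
import numpy as np

def series_decomp(x, window):
    """Split a series into a smooth trend (moving average) and a
    'seasonal' remainder, in the spirit of Autoformer's decomposition block."""
    pad = window // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")[:len(x)]
    seasonal = x - trend                     # the two parts sum back to x
    return trend, seasonal

t = np.arange(200)
x = 0.01 * t + np.sin(2 * np.pi * t / 24)    # linear trend + daily cycle
trend, seasonal = series_decomp(x, window=24)
```

Choosing the window equal to the cycle length (24 here) makes the periodic component average out of the trend, which is why the two parts can be modeled separately.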
When it comes to highly volatile time-series forecasting, although Autoformer performs better than Reformer, the problems that emerge when the random component does not need to be filtered out are very similar to those described above for Informer. This is because Autoformer isolates high-frequency components. Time-series decomposition then extracts the volatile trend. However, this intrinsically leads to high forecasting errors. At the same time, the autocorrelation attention mechanism focuses on structured patterns while ignoring the noise, causing even further degradation of the prediction accuracy.
In [11], a new Transformer architecture is introduced, called the Frequency Enhanced Decomposed Transformer (FEDformer). The challenges related to quadratic complexity in relation to the length of the time series, along with the difficulty in modeling global patterns like trends and seasonality, are once more addressed and solved differently. In this case, the contributions reside in Frequency-Enhanced Decomposition (FED) and Frequency-Enhanced Attention (FEA). FED decomposes the input into seasonal and long-term parts, utilizing frequency-domain information to enhance the decomposition process, allowing the model to better map the underlying patterns. FEA operates in the frequency domain, enabling the model to concentrate on the most important frequency components; at the same time, it also enhances the ability of this Transformer type to capture long-term dependencies and periodic patterns. By integrating these components, FEDformer reduces the memory cost to linear growth in the sequence length, improving efficiency over conventional Transformers. Empirical results show that FEDformer outperforms leading methods, reducing prediction errors by more than 10% in both multivariate and univariate series, as described in [11].
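The frequency-enhancement idea can be illustrated with a plain FFT sketch. This is a deliberate simplification: here the k strongest components are kept by amplitude, whereas FEDformer selects a fixed-size subset of frequency modes inside learned blocks.

```python
import numpy as np

def keep_top_frequencies(x, k):
    """Frequency-domain filtering in the spirit of FEDformer's frequency
    enhancement: keep only the k strongest Fourier components of a real
    signal and discard the rest."""
    spec = np.fft.rfft(x)
    keep = np.argsort(np.abs(spec))[-k:]     # indices of dominant modes
    filtered = np.zeros_like(spec)
    filtered[keep] = spec[keep]              # zero out everything else
    return np.fft.irfft(filtered, n=len(x))

t = np.arange(480)
x = np.sin(2 * np.pi * t / 24) + 0.2 * np.random.default_rng(7).standard_normal(480)
x_freq = keep_top_frequencies(x, k=3)
```

The sketch also makes the limitation discussed below visible: any fluctuation that does not align with a dominant frequency mode is zeroed out, structured or not.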
As mentioned previously, the prediction of highly volatile time series, where the random component contains actionable information and does not need to be removed, is also difficult with FEDformer. This is because, in this case as well, the high-frequency component is isolated, while the low-frequency component, denoting the long-term trend, is retained. Unfortunately, stochastic fluctuations that do not repeat or align with the dominant frequency components are effectively down-weighted. Thus, FEDformer treats unstructured micro-fluctuations as useless noise. Singular events or sudden volatility are not modeled efficiently, so critical signals are lost, leading to high forecasting errors. Because actionable signals can be present in high-frequency or non-repeating variations, the model is forced to underreact to non-regular variations due to its design, thus delaying trading decisions.
In [12], a Transformer architecture called PatchTST (Patch Time Series Transformer) is introduced. PatchTST incorporates two key innovations: segmentation of the time series into patches and channel independence. This approach retains local semantic information and reduces the quadratic complexity of attention mechanisms, enabling the Transformer to focus on longer histories. Channel independence means that each univariate channel is processed separately while sharing the same weights and representation across all series at the input, thus stimulating efficient learning and reducing the computational burden. Empirical results demonstrate that PatchTST significantly improves long-term forecasting accuracy compared to the best Transformer-based models. Additionally, this architecture performs well in self-supervised pre-training tasks, outperforming supervised training on large datasets. Transfer learning experiments further confirm its effectiveness across different datasets.
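The patching step can be sketched in a few lines. This is a minimal NumPy illustration; the patch length and stride below are arbitrary choices, not the paper's defaults, and each patch would subsequently be linearly projected into an attention token.

```python
import numpy as np

def make_patches(x, patch_len, stride):
    """Cut a univariate series into (possibly overlapping) patches. Each
    patch becomes one attention token, so the attention cost depends on the
    number of patches rather than on the raw sequence length."""
    starts = range(0, len(x) - patch_len + 1, stride)
    return np.stack([x[s:s + patch_len] for s in starts])

x = np.arange(512, dtype=float)
patches = make_patches(x, patch_len=16, stride=8)   # (n_patches, patch_len)
```

With a stride of 8, a 512-step series becomes 63 tokens, which is exactly the quadratic-cost reduction the text describes; it is also where the smoothing of point-level micro-fluctuations originates.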
Using PatchTST for predicting highly volatile time series in which the high-frequency component contains actionable information is again problematic. This is because the patch representation averages the short-term spikes and micro-fluctuations, reducing their magnitude. On the other hand, channel-independent attention does not recover the information lost in patching, while the Transformer Encoder captures dependencies at the patch level, ignoring fine-grained point-to-point variations. In conclusion, PatchTST tends to smooth the high frequency and micro-fluctuations, making the model slower to react or insensitive to short-term trading opportunities.
After these variations in architecture were introduced, different combinations between neural networks and Transformer models were developed for time-series prediction. For example, hybrid Transformer–CNN (Convolutional Neural Networks) models were implemented, integrating 1D CNNs with Transformers to enhance the model's ability to learn long-term and short-term patterns in multivariate time series, as presented in [13]. In [14], a hybrid CNN–Transformer architecture is proposed, effectively modeling both long-term and short-term variations simultaneously.
The architecture described in [13] performs much better than everything else presented so far. This is because the CNN feature extractor models micro-fluctuations and short-term volatility more efficiently. At the same time, locally correlated spikes that may represent trading signals are preserved. The Transformer attention mechanism correlates the micro-fluctuations with the broader context, so even if the unstructured noise is removed, small fluctuations containing actionable information are kept. Given the structure's complexity, the model might become saturated with noise, and in cases where the volatility surpasses 50%, the forecasting errors can become high.
The architecture in [14], compared to that in [13], is theoretically more appropriate for trading operations, given the structure of the model. The CNN and Transformer work in parallel, with the CNN extracting the features and the Transformer capturing the macro-context. Afterwards, a fusion mechanism integrates the two. This is achieved through gating or attention-based fusion that dynamically weights the CNN relative to the Transformer contributions, depending on context. Although it can perform better on highly volatile time series, this architecture also presents the same problem as the previous one: it can quickly become saturated with noise, especially when more than 50% of the time-series variation is attributable to noise.
Even more challenges appear when the time series used in forecasting contains a certain amount of generated energy that is traded through the stock market, while the rest is consumed directly by the final clients. Thus, the goal of this work is to predict the energy demand in such a situation using a Transformer architecture, which usually delivers results with very high precision, given its efficient training and its resilience when handling volatile time series. For a better understanding of how the proposed method compares to the current state of the art, please refer to Appendix A, Table A1. It should be mentioned that to achieve high forecasting precision, the data needs to be analyzed and preprocessed, as described in Section 3.
3. Methodology
As mentioned in Section 1, the proposed forecasting methodology is based on a systematic noise analysis and a model performance evaluation that uses the maximum theoretical limit of the Pearson Correlation Coefficient. The systematic noise analysis explores how noise variability affects the variability of the entire time series, as well as the attenuation of this effect, in order to train the Encoder-only Transformer more efficiently. In fact, this analysis is intrinsically related to the second part of the study, pertaining to the Pearson correlation. Thus, the degree to which the time series can be predicted by the model in the ideal case is first estimated. For this purpose, the proposed approach considers the heteroscedastic or homoscedastic character of the noise. To the authors' knowledge, this combined solution for forecasting highly volatile time series is completely new.
The data used in this work originates from a Romanian energy producer that owns various PV plants across the country. As mentioned in the previous section, its clients utilize a share of this PV-generated electricity as well as electricity delivered by the main grid, while the rest is sold via OPCOM (Operatorul Pieței de Energie Electrică şi Gaze Naturale din România—The Electricity and Natural Gas Market Operator in Romania).
In other words, 100% of the generated electrical energy is sold.
The data provided spans from 1 January 2023 to 31 January 2024 and is sampled at 15-min intervals. The variation in energy demand in this interval is illustrated in Figure 1.
The given time interval contains 38,016 samples. Given that this electricity demand includes both the consumption of the final clients and the share being traded through OPCOM, the time series is highly volatile. The forecasting models treat this volatility as noise, so from this point on, it will be referred to as such. Unfortunately, this component cannot be filtered out, as it is an integral part of the signal to be forecast, containing actionable information.
The basic statistical parameters of the series under study have been determined. The average is 0.0523 MW; the standard deviation, 0.0245 MW; the range, 0.1800 MW; the median, 0.05 MW; and the 25%, 50%, and 75% percentiles, 0.037 MW, 0.05 MW, and 0.065 MW.
The average and median are close to one another, which means that there are few outliers in the series, and the signal is centered around a stable level of approximately 0.05 MW. The range, being 3.44 times greater than the average, can be considered significant. Regarding the percentiles, one can observe that 50% of the values are found in the interval [0.037; 0.065] of length 0.028 MW, which is quite narrow. This means that the series has a central stable part and longer “tails” towards the extremes.
Table 1 synthesizes these parameters.
Trend decomposition is performed using Singular Spectrum Analysis (SSA) to estimate the noise level in the signal. This approach is an efficient way to decompose a signal into its distinct parts, such as long-term and periodic (seasonal) trends, along with the residual (noise). SSA is a non-parametric spectral estimation technique that uses Singular Value Decomposition (SVD). In our specific case, as mentioned above, the components are the long-term and seasonal trends, with the remainder representing noise. This method consists of four steps:
- The embedding (converting the initial time series into a multidimensional space by creating lagged copies and generating the trajectory matrix).
- SVD, which decomposes the trajectory matrix to extract the principal components of the time series; these can then be grouped to identify the principal trend, seasonal (oscillatory) components, and residual (noise).
- Grouping or component selection for trend extraction, which determines which components to keep, rather than combining them.
- Reconstruction (Hankelization), which ensures that the end result of the SSA will represent a one-dimensional time series.
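The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data; the window length L and the number r of retained singular triples are hypothetical choices that would be tuned on the real series.

```python
import numpy as np

def ssa_trend(x, L, r):
    """Minimal SSA: embed, SVD, keep the r leading components, Hankelize.

    L is the window (embedding) length, r the number of singular triples
    retained as 'trend'; the remainder of x is treated as noise."""
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[i:i + L] for i in range(K)])   # trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (U[:, :r] * s[:r]) @ Vt[:r]                      # rank-r approximation
    # Hankelization: average each anti-diagonal back into a 1-D series
    trend = np.zeros(N)
    counts = np.zeros(N)
    for col in range(K):
        trend[col:col + L] += Xr[:, col]
        counts[col:col + L] += 1
    return trend / counts

t = np.arange(300)
x = 0.02 * t + 0.5 * np.random.default_rng(2).standard_normal(300)
trend = ssa_trend(x, L=48, r=2)
```

In practice the grouping step would inspect the singular spectrum to split trend from oscillatory components; here the two leading triples are simply assumed to carry the trend.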
The most important advantages of this method in trend decomposition are as follows:
- No predefined assumptions: Unlike polynomial or Moving Average filtering, SSA does not require predefined model assumptions.
- Efficient handling of nonlinear trends: It can effectively extract smooth and even nonlinear trends.
- Efficient separation of the trend from noise: By selecting the appropriate singular values, the periodic (seasonal) component can be separated from the trend.
The noise level in the signal is visualized in Figure 2.
The features of this decomposition, in terms of variability for each time-series component, are as follows: long-term trend, 2.1298 × 10⁻⁴; seasonal trend, 1.3098 × 10⁻⁵; noise variability, 3.2849 × 10⁻⁴. More importantly, the contribution of noise variability to the variability of the entire signal is 54.75%.
Theoretical research in deep learning shows that machine learning models, particularly neural networks, tend to perform better when the input data are normalized and exhibit distributions close to Normality. Proper normalization can improve numerical stability and accelerate convergence during training. This behavior has been consistently reported across foundational works in the field [15,16,17]. Thus, this signal has to be tested to determine whether the data follows this statistical distribution or not. To this end, the Kolmogorov–Smirnov (KS) and Jarque–Bera (JB) tests were carried out [18,19].
The KS test operates with the empirical cumulative distribution function (ECDF) of the time series and with the cumulative distribution function (CDF) of the desired theoretical distribution—in our case, the Normal Distribution. Whether the tested data follows the desired distribution or not is determined by comparing the two functions. Given the time series X = {x₁, x₂, …, xₙ}, the ECDF is defined as in (1):

$F_e(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(x_i \le x) \quad (1)$

where
n—the number of observations in the time series;
$\mathbf{1}(x_i \le x)$—a characteristic (indicator) function that takes the value 1 if x_i ≤ x and 0 otherwise.
For a Normal Distribution N(μ, σ²), the theoretical CDF is given by (2):

$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt \quad (2)$

where
μ—the mean of the Normal Distribution;
σ²—the variance.
If the population parameters μ and σ² are unknown, they are estimated from the sample as in (3) and (4):

$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (3)$

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \hat{\mu}\right)^2 \quad (4)$
On the other hand, the KS statistic $D_n$ represents the greatest absolute deviation between $F_e(x)$ and the theoretical function $F(x)$, taking into account the whole time series, as in (5):

$D_n = \sup_x \left| F_e(x) - F(x) \right| \quad (5)$
The critical value for the KS test depends on the significance level α, usually set to 0.05, and on the size of the time series. If the resulting probability value is higher than α, there is no strong evidence against Normality; if the probability value is lower than α, the time series is considered not normally distributed.
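The KS procedure of Eqs. (1)–(5) can be sketched directly; this is a minimal NumPy implementation that fits μ and σ from the sample, as in (3) and (4). Note that estimating the parameters from the same sample makes the standard critical values only approximate (the Lilliefors variant of the test addresses this).

```python
import numpy as np
from math import erf, sqrt

def ks_statistic(x):
    """KS distance between the sample ECDF and a Normal CDF whose mean and
    standard deviation are estimated from the sample, as in Eqs. (1)-(5)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / (sigma * sqrt(2))
    cdf = np.array([0.5 * (1 + erf(v)) for v in z])   # theoretical CDF, Eq. (2)
    ecdf_hi = np.arange(1, n + 1) / n                 # ECDF just after each sample
    ecdf_lo = np.arange(0, n) / n                     # ECDF just before each sample
    return max(np.abs(ecdf_hi - cdf).max(), np.abs(ecdf_lo - cdf).max())

rng = np.random.default_rng(3)
d_normal = ks_statistic(rng.standard_normal(2000))   # near-Normal: small D_n
d_skewed = ks_statistic(rng.exponential(size=2000))  # skewed: large D_n
```

Evaluating the supremum at both sides of each jump of the step-shaped ECDF is what makes the computed maximum exact.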
The JB test assesses Normality based on the third and fourth moments of the distribution, that is, Skewness and Kurtosis, respectively. If a dataset is normally distributed, its Skewness should be 0, and its Kurtosis should be 3. Given the same time series and the mean and variance as in (3) and (4), the Skewness is defined as in (6):

$S = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^3}{s^3} \quad (6)$

where
S—the Skewness;
n—the number of data points in the sequence;
X_i—the values in the time series;
$\bar{X}$—the mean, estimated as in (3);
s—the standard deviation.
The Kurtosis measures the "tails" of the distribution. Its formula can be seen in (7):

$K = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^4}{s^4} - 3 \quad (7)$

where
K—the (excess) Kurtosis;
n—the number of observations in the time series;
X_i—the values in the time series;
$\bar{X}$—the mean, estimated as in (3);
s—the standard deviation.
For a Normal Distribution, K = 0 (because the Normal Distribution has a theoretical Kurtosis of 3, which is subtracted in (7)).
The JB statistic is calculated as in (8):

$JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right) \quad (8)$

where
S—the sample Skewness;
K—the (excess) Kurtosis of the distribution;
n—the sample size (number of observations in the time series).
If the time series follows a Normal Distribution, the JB statistic asymptotically follows a Chi-square distribution with 2 degrees of freedom, according to (9) [19]:

$JB \sim \chi^2_2 \quad (9)$
The probability is computed in (10):

$p = 1 - F_{\chi^2_2}(JB) \quad (10)$

where $F_{\chi^2_2}$ is the CDF of the Chi-square distribution with 2 degrees of freedom.
If the probability value is greater than the desired threshold, which is usually set at 0.05, the time series is considered to follow a Normal Distribution, and vice versa.
Practically, the KS test checks for deviations in the entire cumulative distribution function, while JB focuses on deviations in Skewness and Kurtosis. The JB test is more relevant when the Normality deviations are due to these two specific moments (Skewness and Kurtosis), while KS is more general in character.
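The JB statistic of Eqs. (6)–(8) can be computed directly from the sample moments; a minimal NumPy sketch, checked on a Normal sample and on a heavy-tailed one:

```python
import numpy as np

def jarque_bera(x):
    """JB statistic from the sample Skewness and excess Kurtosis (Eqs. (6)-(8));
    under Normality it asymptotically follows a Chi-square with 2 df (Eq. (9))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    s = np.sqrt((xc ** 2).mean())          # standard deviation
    S = (xc ** 3).mean() / s ** 3          # Skewness, 0 for a Normal
    K = (xc ** 4).mean() / s ** 4 - 3      # excess Kurtosis, 0 for a Normal
    return n / 6 * (S ** 2 + K ** 2 / 4)

rng = np.random.default_rng(4)
jb_normal = jarque_bera(rng.standard_normal(5000))       # should be small
jb_heavy = jarque_bera(rng.standard_t(df=3, size=5000))  # heavy tails: large
```

A value far above the χ²₂ quantile at the chosen α (about 5.99 at α = 0.05) rejects Normality, which is exactly what the heavy-tailed sample triggers.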
In our specific case, both tests rejected the hypothesis that the data are normally distributed. For a visualization of the data, refer to Figure 3.
The power of the tests was 1. The p-value of KS was 0, whereas the p-value of JB was 0.001. The Kurtosis in this case was 4.2525, which is greater than 3, meaning that the time series is very noisy.
In Figure 3, one can also observe that the histogram does not fit very well to the curve of the Theoretical Normal Distribution. This was obtained using the mean and the standard deviation of the entire signal, 0.0523 and 0.0245, respectively. The fact that both statistical tests rejected the Normal Distribution hypothesis indicates that the noise variability makes a large contribution to the variability of the whole time series. Given this, a method must be used to attenuate the noise. In this case, the Moving Average was chosen. This method is very simple to implement and usually delivers good results, because the error it introduces into the newly averaged time series is small enough to permit, theoretically, good forecasting precision.
The formula for the Moving Average is presented in (11):

$MA_t = \frac{1}{N}\sum_{i=0}^{N-1} x_{t-i} \quad (11)$

where
$x_t$—the value of the time series at time t;
N—the window size (number of data points considered for the average).
In other words, summation is performed over the past N observations.
For this specific situation, the chosen window was 24 h, i.e., N = 96 samples, taking into account that the signal was sampled every 15 min.
In MATLAB®, online version R2025b, this type of averaging is obtained through zero padding for the first few values, and then the actual values in the series are used. The newly obtained vector has exactly the same length as the initial one and the same trend components. Only the noise will be attenuated.
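This zero-padded causal averaging can be sketched as follows; a NumPy illustration of the behavior described above, with a 24 h window corresponding to N = 96 samples at a 15-min rate. The first N−1 outputs are biased toward zero by the padding, mirroring the zero-padding behavior described for the MATLAB implementation.

```python
import numpy as np

def causal_moving_average(x, N):
    """Causal moving average with zero padding for the first N-1 samples:
    the output has exactly the same length as the input, the trend
    components are preserved, and only the noise is attenuated."""
    padded = np.concatenate([np.zeros(N - 1), x])
    kernel = np.ones(N) / N
    return np.convolve(padded, kernel, mode="valid")

# Synthetic stand-in for the demand signal: a slow oscillation plus noise.
x = np.sin(np.linspace(0, 20, 1000)) + 0.3 * np.random.default_rng(5).standard_normal(1000)
smoothed = causal_moving_average(x, N=96)
```

The reduced standard deviation of the smoothed output relative to the input is the same effect reported below for the real series (0.0224 MW versus 0.0245 MW).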
The results of this operation are shown in Figure 4.
Visually, Figure 4 does not significantly differ from Figure 1. It is also relevant to observe the comparison between the newly obtained empirical distribution and the Theoretical Normal Distribution after averaging (see Figure 5).
Once again, both of the above-mentioned tests demonstrate that the data do not follow a Normal Distribution. The power of the tests and the p-values were identical to those in the previous case. In spite of this, as depicted in Figure 5, the Theoretical Normal Distribution fits better with the histogram this time, with a mean of 0.0523 MW and a standard deviation of 0.0224 MW. The Kurtosis here is 3.8255, meaning that the noise level is lower than in the previous situation. In addition, the standard deviation is slightly lower in this case than in the previous one, which serves as additional proof, besides the visual results and the Kurtosis, that the noise has been successfully attenuated. The contribution of noise variability to the variability of the entire signal is now 45.79%.
Because there are only minor differences between the averaged and non-averaged signals, the error introduced by the Moving Average will not be too high. As mentioned above, this will help the Transformer to provide a precise forecast, as will be seen in the next section. Subsequently, all the forecasting models were trained on the averaged signal, which is closer to the Normal Distribution, and the results were compared to the original, non-averaged signal.
4. Results and Comparison to Other Methods
Initially, Transformers were implemented for natural language processing applications, and as mentioned in Section 2, they have been efficiently adapted for time-series prediction, given their versatility in learning long-term patterns that frequently appear in time series. The classical Transformer is actually a sequence-to-sequence (seq2seq) neural network architecture, introduced in [3], designed to model dependencies in sequential data without recurrence or convolution. The key concepts used in the Transformer for time-series prediction are the Encoder, the self-attention mechanism, and the Decoder. Before entering the Encoder, each input element is represented as a vector and has a positional encoding added to it (so that the model can process the order). The data is passed through stacked layers, each containing:
- Multi-Head Self-Attention, which lets each position attend to all other positions in the sequence;
- A Feedforward Network that applies nonlinear transformations to enrich the representations;
- Residual Connections and a Normalization Layer that stabilize and speed up the training.
The result will be a set of contextualized hidden representations for the entire input sequence, where each value in the time series has an ordered position. These representations are then sent either directly to the Decoder or to the prediction head, depending on the adopted architecture.
Because Transformers do not possess a sense of order, positional encodings are added to input embeddings to provide a sense of chronology.
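The sinusoidal positional encodings of [3] can be sketched as follows; this is a minimal NumPy version (some time-series variants learn the encodings instead of fixing them).

```python
import numpy as np

def positional_encoding(L, d_model):
    """Sinusoidal positional encoding: even dimensions get sin, odd get cos,
    with geometrically spaced wavelengths, so each position receives a
    unique, order-aware vector to add to its input embedding."""
    pos = np.arange(L)[:, None]                       # positions 0..L-1
    i = np.arange(d_model // 2)[None, :]              # dimension pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((L, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(L=96, d_model=64)            # e.g., one day of 15-min steps
```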
On the other hand, the role of the Decoder is to generate the prediction stepwise. This is conditioned simultaneously by:
- The Encoder output embeddings (the context is taken from the input sequence);
- The previously generated outputs (autoregressive generation).
In general, a Decoder contains a Feedforward Network with a Normalization Layer, a Multi-Head Causal Attention Mechanism, and a Cross-Attention Mechanism.
Thus, within the Decoder, the output sequence generated so far is processed, and further, a mask is used so that the model cannot “cheat” by looking at future outputs (this is very important in autoregressive tasks).
When the Encoder processes the input sequence (say, past time-series values), it produces a set of hidden states (contextual representations). Each hidden state corresponds to one time step in the input and is enriched with information from the whole sequence, thanks to the well-known self-attention mechanism described in (12) and introduced by Vaswani et al. in [
3]:
where
Qr—queries that come from the Decoder’s current hidden state (the state of what is being generated);
KyT—the transposed matrix of the keys;
Va—values;
dimk—the key vector dimension.
When the Decoder is generating the output sequence (e.g., the future values), it will use the previous values generated so far and those values in the input that are relevant to the forecast. In other words, the Decoder applies the attention mechanism over the Encoder outputs [
3], where a combination of weights reflects which input time steps are most relevant to predicting the current output.
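The scaled dot-product attention of (12), together with the causal mask used in the Decoder, can be sketched in NumPy as follows (shapes and names are illustrative, not the authors’ implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (L_q, L_k) attention logits
    if causal:
        # forbid attending to future positions (strict upper triangle)
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # numerically stable row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With `causal=True`, position t can only attend to positions 1..t, which is exactly the masking that prevents the Decoder from seeing future outputs.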
The Encoder takes the input sequence (e.g., words in NLP (natural language processing), or past time-series values, as in this case) and produces contextualized representations of each time step.
In contrast to Long Short-Term Memory (LSTM) or Recurrent Neural Networks (RNNs), Transformers do not perform sequential processing. This allows for parallel computation and better long-range modeling.
For forecasting, many models use just the Encoder (e.g., Neural Basis Expansion Analysis for Time Series—N-BEATS) or Encoder–Decoder structures (as in NLP).
In this particular case, the architecture used contains just the Encoder. This kind of architecture was adopted for two reasons. First, the full Transformer, containing both an Encoder and a Decoder, becomes saturated with noise and does not perform well on this noisy series, since it becomes prone to overfitting. Second, less complex architectures, such as those containing just a Decoder, cannot efficiently model the input sequence: the Decoder normally conditions on the previous output sequence, so without an Encoder it lacks the global context of the input series, and its latent embeddings remain unmodeled. As a result, the complex features of the noisy time series are not captured, and the forecast loses accuracy.

In terms of data partitioning, the above series of 38,016 samples corresponds to 396 days. Using sliding windows, 2-day contexts and 1-day-ahead forecasts were generated. Thus, the model effectively used 395 days of context (the last day in the series can never serve as an input for prediction) and 394 days for forecasts (the first 2 days in the series can never serve as forecast targets because they lack previous context). As will be seen below, no validation set was used; this decision was made because of the high volatility of the series and on the basis of the multiple tests that were carried out. The results thus obtained are then compared with the Vanilla Transformer, a Transformer without an Encoder, N-BEATS, Prophet, and Holt–Winters (Exponential Smoothing), as well as other algorithms, as shown in
Table 2. All of these methods were chosen for comparison because of their high resilience in forecasting volatile time series.
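The sliding-window partitioning described above (2-day context of 192 steps, 1-day horizon of 96 steps, at 96 samples per day) can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def make_windows(series: np.ndarray, context: int = 192, horizon: int = 96,
                 stride: int = 96):
    """Slice a 1-D series into (context, horizon) pairs with a daily stride."""
    X, Y = [], []
    for start in range(0, len(series) - context - horizon + 1, stride):
        X.append(series[start:start + context])                      # 2-day input
        Y.append(series[start + context:start + context + horizon])  # next day
    return np.stack(X), np.stack(Y)

# 38,016 samples = 396 days at 96 steps/day (placeholder data for illustration)
series = np.arange(38016, dtype=float)
X, Y = make_windows(series)
```

This reproduces the counts in the text: 394 forecastable days, since the first two days lack a preceding context and the last day is never an input.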
In
Table 2, the proposed method’s results are in bold because it achieved the best performance among the compared methods.
The AI architectures were trained with Adam, and the MAE (Mean Absolute Error) was minimized. The MAE is calculated as in (14). The results were compared using the MAE, MAPE, MSE (Mean Squared Error), RMSE (Root Mean Square Error), R², Pearson Correlation Coefficient, and Directional Accuracy (DA), defined in (13) through (19).
In (13) through (18),
n — the total number of observations;
y_i — the actual (true) value;
ŷ_i — the forecasted value;
ȳ — the average of the real values;
1{·} — an indicator function that equals 1 if the expression is true;
y_t — the real (actual) value at time t;
y_{t−1} — the real (actual) value at time t−1;
ŷ_t — the predicted value at time t;
ŷ_{t−1} — the predicted value at time t−1.
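The evaluation metrics listed above can be sketched in NumPy (a minimal illustration, not the authors’ code):

```python
import numpy as np

def metrics(y, y_hat):
    """MAE, MAPE, MSE, RMSE, R^2, Pearson's r and Directional Accuracy."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / y))      # assumes y != 0
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    pearson = np.corrcoef(y, y_hat)[0, 1]        # undefined if var(y_hat) = 0
    # DA: share of steps where the predicted and real directions agree
    da = 100.0 * np.mean(np.sign(np.diff(y)) == np.sign(np.diff(y_hat)))
    return dict(MAE=mae, MAPE=mape, MSE=mse, RMSE=rmse, R2=r2,
                Pearson=pearson, DA=da)
```

Note that Pearson’s r degenerates when the forecast has near-zero variance, which is exactly why the metric is omitted for some baselines in Table 2.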
The Pearson Correlation Coefficient (Pearson’s r) is a statistical parameter that represents the linear interdependency between two variables.
This coefficient takes values in the interval [−1; +1]:
- •
r = −1 means that the two variables are perfectly negatively correlated.
- •
r = 0 means that the variables are not linearly correlated (though other, nonlinear relationships may still exist).
- •
r = +1 means that the two variables are perfectly positively correlated.
The formula for the Pearson Correlation Coefficient for two variables
A and
B is described in (19):
where
Cov(A,B) — the covariance of A and B;
σ_A — the standard deviation of A;
σ_B — the standard deviation of B.
The expanded form of (19) can be seen in (20).
where
Ā — the average value of A;
B̄ — the average value of B.
In our case, if the model works perfectly, meaning that it can forecast the entire useful signal (that is, principal and seasonal trends together), the maximum Pearson Correlation Coefficient that can be achieved will be approximately 67.27% (see (27)).
A Breusch–Pagan F (BP–F) statistical test was carried out to determine whether the noise is additive or multiplicative. The test revealed the latter, which means that the noise in the model strongly varies with the level of the signal. Higher or lower predicted values are associated with systematically larger or smaller noise values, so the noise is heteroscedastic rather than uniform. The values obtained from this test were LM stat = 1551.7376, LM p-val = 0, F stat = 1623.5474, and F p-val = 0. The first value, denoting the Lagrange Multiplier (LM) statistic, measures how strongly the noise variance depends on the useful signal. A high value, as in this case, indicates heteroscedasticity. The second value, denoting the probability of observing the LM statistic under the null hypothesis of homoscedasticity, is 0, so this hypothesis is rejected. The third value represents an alternative version of the LM statistic, based on the F-distribution. It also tests the heteroscedasticity, and a high value confirms the multiplicative character of the noise. Finally, the fourth value denotes the probability of observing the F-statistic under the null hypothesis of homoscedasticity. Since the value is 0, this hypothesis is again rejected.
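The Breusch–Pagan test regresses the squared residuals on the fitted signal; a minimal NumPy version of the LM and F statistics (the paper’s exact implementation is not specified, so this is a sketch under standard assumptions) could be:

```python
import numpy as np

def breusch_pagan(resid, fitted):
    """LM and F statistics for heteroscedasticity of resid w.r.t. fitted."""
    n = len(resid)
    e2 = np.asarray(resid, float) ** 2
    X = np.column_stack([np.ones(n), fitted])     # auxiliary regressors
    beta, *_ = np.linalg.lstsq(X, e2, rcond=None)
    e2_hat = X @ beta
    ss_res = np.sum((e2 - e2_hat) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    lm = n * r2                                   # Lagrange Multiplier statistic
    f = (r2 / 1) / ((1.0 - r2) / (n - 2))         # F variant, 1 regressor
    return lm, f
```

Large LM and F values, as reported in the text, reject the homoscedasticity null and support the multiplicative character of the noise.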
Thus, for heteroscedastic noise, taking a signal
y consisting of a useful trend and noise, as in (21), one can write the corresponding correlation between the entire signal and its predicted values, as in (22).
where
y — the overall (original) signal;
s — the useful signal;
n — the noise;
ŷ — the forecasted signal.
Assuming that
s and
n have a covariance of 0, the covariance between the useful signal and the original signal when the forecasting is perfect can be written as in (23):
where
Cov(s, y) — the covariance between the useful signal without noise and the overall signal with noise;
σ_s² — the variance of the useful signal.
Given the initial supposition, one can further describe the forecast and the original signal as in (24) and (25).
In both (24) and (25),
σ_ŷ² — the variance of the forecasted signal;
σ_s² — the variance of the initial useful signal;
σ_y² — the variance of the original signal;
E[s²] — the expected squared value of the useful signal.
Finally, the correlation between ŷ and y can be written as in (26):
where
r(ŷ, y) — the Pearson Correlation Coefficient between the forecasted signal ŷ and the original signal y;
σ_s² — the variance of the initial useful signal;
σ_n² — the variance of the noise;
E[s²] — the expected squared value of s.
In (26), the term σ_n²·E[s²] is the contribution of the noise variance to the variance of the entire signal (for multiplicative noise), and thus the maximum value of the Pearson Correlation Coefficient for the non-averaged signal can be defined as in (27):
In order to validate the result in (24), the covariance between the useful signal and noise was computed and found to be 0. The heteroscedastic/multiplicative character of the noise can also be confirmed by visually inspecting
Figure 2. For an overview of the entire methodology used, refer to
Figure 6.
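The ceiling in (27) can also be checked numerically. Simulating a multiplicative-noise signal y = s(1 + n) with Cov(s, n) = 0, the empirical correlation between s and y approaches σ_s/√(σ_s² + σ_n²·E[s²]). The sketch below uses synthetic data under these assumptions, not the paper’s series:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(200_000)
s = 10.0 + np.sin(2 * np.pi * t / 96)          # useful signal (daily cycle)
n = rng.normal(0.0, 0.3, size=t.shape)         # multiplicative noise factor
y = s * (1.0 + n)                              # heteroscedastic overall signal

# Theoretical ceiling: r_max = sigma_s / sqrt(sigma_s^2 + sigma_n^2 * E[s^2])
var_s, var_n, Es2 = s.var(), n.var(), np.mean(s ** 2)
r_max = np.sqrt(var_s / (var_s + var_n * Es2))

# A perfect forecast recovers s, so corr(s, y) is the best achievable Pearson r
r_emp = np.corrcoef(s, y)[0, 1]
```

The empirical correlation converges to the theoretical ceiling as the sample grows, mirroring the 67.27% bound derived for the paper’s dataset.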
In terms of the actual forecast when using the Transformer or N-BEATS architecture, in this particular case, the model learns from a 2-day window (that is, 192 time steps in the past) and predicts the next day (meaning the next 96 time steps). All (100%) of the data available for training was used for training (no validation set was used for any AI architecture) because the series is very noisy, and the dependencies on which the forecast is based are very hard to determine otherwise. As mentioned previously, the metrics presented in
Table 2 are the MAPE, MAE, MSE, RMSE, and R2, but more important are the Pearson Correlation Coefficient and the DA. For some of the methods, the predicted values had near-zero variance, making the Pearson correlation numerically unstable; thus, the metric has been omitted from the table. As far as the DA is concerned, a value of less than 50% means that the model provides an inverse or constant prediction, approximately 50% means that the model performs randomly, between 50 and 65% signifies moderate accuracy, and between 65% and 80% means high precision. Thus, the Encoder-only Transformer performs the best, with 77.89%. All these algorithms were chosen for comparison because they are very resilient when it comes to highly volatile time-series forecasting [
20,
21,
22,
23]. For example, Holt–Winters builds a running estimate of the baseline demand, its trend, and its daily cycle. It then projects this structure forward to predict the next day.
The day-ahead forecast obtained with the Holt–Winters algorithm is depicted in
Figure 7.
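Holt–Winters maintains running estimates of the level, trend, and seasonal cycle and projects them forward. A minimal additive variant with a daily cycle of 96 steps is sketched below; the smoothing constants are illustrative, not the paper’s configuration:

```python
import numpy as np

def holt_winters_additive(y, m=96, alpha=0.3, beta=0.05, gamma=0.1, horizon=96):
    """Additive Holt-Winters: level + trend + seasonal cycle of length m."""
    y = np.asarray(y, float)
    level = y[:m].mean()
    trend = (y[m:2 * m].mean() - y[:m].mean()) / m
    season = y[:m] - level                        # initial seasonal profile
    for t in range(m, len(y)):
        s_old = season[t % m]
        level_old = level
        level = alpha * (y[t] - s_old) + (1 - alpha) * (level + trend)
        trend = beta * (level - level_old) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s_old
    # project the learned structure forward over the next `horizon` steps
    h = np.arange(1, horizon + 1)
    return level + h * trend + season[(len(y) + h - 1) % m]
```

Calling it on the last weeks of history yields a 96-step (one-day) forecast comparable to the one in Figure 7.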
On the other hand, in the case of the Prophet algorithm, the weekly seasonality is enabled, meaning that the algorithm will learn a 7-day cycle, with differences across weekdays. Daily seasonality is enabled as well, so a 24 h cycle will also be learned, capturing intra-day fluctuations that are useful for forecasting the electricity demand. At the same time, the trend component is included, meaning that the growth in the long-term part of the data is considered piecewise linear. Further, no holiday effects are specified, so no explicit holiday impact is modeled. The day-ahead forecast obtained with the Prophet algorithm is depicted in
Figure 8.
N-BEATS is a data-driven, block-based neural forecasting model that takes a window of recent history and predicts multiple future steps, learning both trend and seasonality automatically, without explicit decomposition. This makes it faster than Transformer architectures.
The context length in
Table 3 was set to 192 steps because larger values risk saturating the architecture with noise, in which case it stops learning and overfits; lower values are therefore preferable in this context. The context length and the prediction length are fixed and are not determined by Brute Force; the remaining hyperparameters are obtained through optimization. Note that the context length is twice the prediction length. The hidden size is the number of neurons in the hidden layers of each block. The model uses two N-BEATS blocks, each of which predicts a component of the forecast and passes the residuals forward.
The input hyperparameters of N-BEATS when optimized with Brute Force (also called Grid Search) are shown in
Table 3.
Theoretically, more blocks can refine the model and capture subtler patterns. A lower number can be trained faster, but this does not always deliver good results. In this case, just two blocks were used, given the high volatility of the data, in order to avoid noise saturation of the architecture. This is again the main reason why the depth of the Feedforward Network (number of layers in each block) is set to 2, and not to a higher value. The dropout is set to 0.1, meaning that 10% of the neurons are deactivated during training, in order to prevent overfitting. In this context, the batch size specifies how many training examples are fed into the model at once.
The best results were obtained for a Transformer without a Decoder whose hyperparameters were optimized with Brute Force. Compared to Optuna, Brute Force is more thorough, taking into account all possible combinations of hyperparameters with the goal of minimizing the error function. This is the best strategy to follow when the time series is less predictable, as in this case.
The first step of this optimization algorithm is to define a hyperparameter grid, with finite candidate values for
- •
Context length (Lc);
- •
Embedding dimension (dmodel);
- •
Number of attention heads (h);
- •
Number of Encoder layers (N);
- •
Hidden size of the feedforward layer (dff);
- •
Learning rate (η);
- •
Batch size (B).
In the second step, itertools.product generates all possible combinations of hyperparameters.
In the third step, for each configuration θ, the following operations take place:
- •
A dataset is created with context length Lc = 192 and prediction length Lp = 96;
- •
A Transformer Encoder model is trained for 10 epochs using the Adam optimizer;
- •
The forecast is evaluated on the unaveraged test set using the MAE and MAPE (as in (26) and (27)).
Finally, in the fourth step, the best-performing configuration θ* with the hyperparameters yielding the smallest MAE (computed on all 96-step forecast windows, except for the last one, which is initially excluded from the training) is stored. Then, the model chosen in this way is tested on this final 96-step window. The results of this comparison are presented in
Figure 9 and
Figure 10.
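The four-step Brute Force procedure can be summarized in the following skeleton. The candidate values below are illustrative placeholders (only L_c = 192 is stated in the text), and `train_and_score` stands in for the 10-epoch training and MAE evaluation described above:

```python
import itertools

# Step 1: define the hyperparameter grid (placeholder candidate values)
grid = {
    "context_length": [192],
    "d_model":        [32, 64],
    "n_heads":        [2, 4],
    "n_layers":       [1, 2],
    "d_ff":           [64, 128],
    "lr":             [1e-3, 1e-4],
    "batch_size":     [32, 64],
}

def grid_search(train_and_score):
    """Steps 2-4: enumerate all configurations, keep the MAE-minimizing one."""
    best_theta, best_mae = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*grid.values()):   # step 2: all combinations
        theta = dict(zip(keys, values))
        mae = train_and_score(theta)                   # step 3: train + evaluate
        if mae < best_mae:                             # step 4: keep theta*
            best_theta, best_mae = theta, mae
    return best_theta, best_mae
```

With these placeholder sets the grid has 1·2·2·2·2·2·2 = 64 configurations, each trained and scored once.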
The mathematical formulation can be seen in (28)–(32).
The hyperparameter space is shown in (28):
Each hyperparameter (L_c, d_model, h, N, d_ff, η, B) is chosen from a finite candidate set.
The total number of combinations can be seen in (29):
where
|Θ| — the total number of hyperparameter combinations;
L_c — the context length, a fixed hyperparameter with a single value (192);
{d_model} — the finite set of candidate embedding dimensions;
{h} — the finite set of candidate numbers of attention heads;
{N} — the finite set of candidate numbers of Encoder layers;
{d_ff} — the finite set of candidate hidden sizes of the feedforward layer;
{η} — the finite set of candidate learning rates;
{B} — the finite set of candidate batch sizes.
For each configuration θ, the model produces forecasts over the prediction horizon t = 1,…, Lp.
Thus, the MAE is described in this case as in (30):
In both (30) and (31),
t — the time step;
L_p — the prediction length;
y_t — the real unaveraged load at horizon t;
ŷ_t — the forecasted load at time step t.
The best hyperparameter set is the one minimizing MAE, according to (32):
The values of the optimal hyperparameters for the Encoder-only Transformer obtained with Brute Force are described in
Table 4.
Brute Force and the Encoder-only Transformer, as seen in
Table 2,
Table 3 and
Table 4, perform much better than Optuna and N-BEATS in terms of MAPE and Pearson’s Correlation Coefficient, despite Optuna’s ability to carry out Bayesian optimization of the hyperparameters. The primary reason lies in the characteristics of the time series: Brute Force is more thorough than Optuna and, in this case, delivers more accurate results because it scans the entire solution space and ultimately finds the right combination of hyperparameters.
The training of the Encoder-only Transformer with optimized hyperparameters is illustrated in
Figure 9, and the forecast is shown in
Figure 10. Thus, the training of the Encoder-only Transformer is very robust despite the noise level, which is still rather high, even after applying the Moving Average. As a result, the MAE drops continuously from 0.028 MW to approximately 0.005 MW after 300 epochs. On the other hand, as shown in
Figure 10, the day-ahead forecast yielded an MAPE of 11.63% and a Pearson’s Correlation Coefficient of 53% compared to the original, non-averaged signal.
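The noise attenuation referred to above is a Moving Average; a simple centered version is sketched below (the window length is illustrative, since the paper’s exact window is not restated in this section):

```python
import numpy as np

def moving_average(y, window=96):
    """Centered moving average used to attenuate noise before training."""
    kernel = np.ones(window) / window
    # mode="same" keeps the series length; the edges are only partially
    # covered by the kernel, so in practice the first and last window/2
    # samples should be discarded before training
    return np.convolve(np.asarray(y, float), kernel, mode="same")
```

Training on the averaged series and evaluating against the original one explains why the metrics against the non-averaged signal are slightly worse.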
5. Discussion
The results obtained in this work demonstrate the challenges and opportunities inherent in forecasting highly volatile time series—as is usually the case, for example, with energy generation from renewable sources. As mentioned above, in the present dataset, approximately 55% of the signal represents random variations, leaving only about 45% as the useful signal. To assess how well the statistical distribution of the time series fits the Normal Distribution, two statistical normality tests, namely the Kolmogorov–Smirnov and Jarque–Bera tests, were performed; they revealed that the noise attenuated with the Moving Average better fits the theoretical Normal Distribution. On the other hand, standard forecasting metrics, like the Pearson Correlation Coefficient, are sensitive to highly volatile content. Indeed, for several models—including N-BEATS (optimized with Brute Force and Optuna), the Vanilla Transformer (optimized with Brute Force), and the Decoder-only Transformer (optimized with Brute Force)—the forecast exhibits negligible variation, rendering the Pearson Correlation Coefficient non-computable. This highlights a critical limitation of conventional performance metrics when applied to very noisy time series.
To address this limitation, we introduced a procedure to estimate the maximum theoretical Pearson Correlation Coefficient between the forecast and the actual power demand based on the noise character of the series (additive or multiplicative, heteroscedastic or homoscedastic). This procedure provides a robust benchmark for model evaluation, allowing practitioners to distinguish between data inconsistency and genuine forecasting deficiencies. When the theoretical performance ceiling is known, the model results can be better interpreted, and more informed decisions can be made, even when most standard metrics fail.
The comparison of the models underlines the practical benefits of the current approach. The Encoder-only Transformer, applied to the noise-attenuated time series, consistently yielded strong results in terms of both Pearson Correlation and MAPE relative to the original signal. This demonstrates that noise reduction, when combined with an appropriate model architecture, significantly enhances predictive performance in volatile environments. Conversely, models that fail to produce variable forecasts underscore the importance of aligning evaluation metrics with data characteristics: without sufficient forecast variability, conventional correlation measures provide little actionable insight.
Methodologically, these results stress the need for customized evaluation frameworks in forecasting. Traditional metrics may misrepresent model performance in highly volatile environments, leading to potentially misguided operational decisions.
Finally, the study opens several avenues for future research. First, extending the proposed Pearson ceiling methodology to multivariate time series could enable more comprehensive evaluations across interconnected systems. Second, integrating probabilistic forecasting approaches may further capture uncertainty in very noisy data, providing richer insights for real-time decisions pertaining to energy consumption or generation. Lastly, exploring adaptive noise reduction techniques in conjunction with Transformer-based architectures could yield additional improvements in prediction accuracy for smart grids, among other applications.
Further, a sensitivity analysis was conducted to evaluate the robustness of the model. The first parameter that was studied was the variation in MAPE, depending on optimal or nonoptimal hyperparameters and the number of training epochs. The results can be seen in
Figure 11.
As one can see in
Figure 11, the best results are obtained with 300 training epochs and hyperparameters optimized with Brute Force. The averaged MAPE refers to the comparison of the forecast with the actual signal whose noise has been attenuated. When comparing the forecast with the non-averaged signal contaminated with noise, the errors are slightly higher, which is expected, since the Transformer was trained on the averaged signal. This is the first limitation of the proposed method. The same behavior can also be observed for the variation in RMSE in
Figure 12. The same types of considerations also apply in this case.
On the other hand, the variation in the Pearson Correlation Coefficient with respect to the number of training epochs and hyperparameter optimization through Brute Force can be seen in
Figure 13.
Again, when comparing the variation in the Pearson Correlation Coefficient between the forecast and the series contaminated with noise, the coefficient is consistently lower than that obtained when one compares the forecast with the attenuated-noise series. The reason is the same as before—namely, that the Encoder-only Transformer was trained on the averaged signal. Although the difference caused by this approach is not large, there is a danger that if the noise level increases, this approach might become unreliable.
The superiority of the proposed solution compared to the baseline methods, as illustrated in
Table 2, can be further proven through the Diebold–Mariano (DM) test. If one observes
Table A2, in
Appendix A, the DM statistics are consistently negative, meaning that the Transformer always performs better than the other methods. Moreover, the fact that the
p-value is always 0, or very close to 0, indicates that the Transformer’s superior performance is not random but is statistically demonstrated. In this case,
p must be lower than 0.05. Thus, the validity of the proposed method is verified.
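A minimal one-step Diebold–Mariano statistic under squared-error loss can be sketched as follows (a simplification without small-sample correction or longer-horizon HAC variance, so not necessarily the exact variant used in Appendix A):

```python
import numpy as np

def diebold_mariano(e1, e2):
    """DM statistic comparing forecast errors e1 (proposed) vs e2 (baseline).

    Negative values mean the first model has the smaller squared-error loss;
    |DM| > 1.96 corresponds to p < 0.05 under the asymptotic normal null.
    """
    d = np.asarray(e1, float) ** 2 - np.asarray(e2, float) ** 2
    n = len(d)
    return d.mean() / np.sqrt(d.var(ddof=0) / n)   # h = 1 loss differential
```

Consistently negative DM values with near-zero p-values, as in Table A2, indicate that the proposed Transformer’s advantage is statistically significant rather than random.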
Another limitation of this approach is that the model must be trained daily in order to guarantee the accuracy of the day-ahead forecast.
Another important aspect is that all the concepts presented in this work, relating to both data preprocessing techniques and forecasting models, can be easily integrated as modules into a foundation model dedicated to helping transmission system operators (TSOs) or distribution system operators (DSOs) make the right decisions in real time regarding efficient power system operation under normal or abnormal conditions.