3.4.1. Transformer Architecture
The Transformer architecture comprises the following components:
First, the self-attention mechanism serves as the core component of the Transformer model, enabling each element in the sequence (e.g., a time point) to interact with all other elements and aggregate global information through weighted summation. This allows the model to directly capture long-range dependencies in time series, regardless of the distance between elements [31].
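The weighted-summation idea above can be sketched as scaled dot-product attention. The following NumPy sketch is illustrative only; the function name, shapes, and random weights are assumptions, not the paper's implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence.

    X: (T, d_model) sequence of T time points. Every output position is a
    weighted sum over ALL positions, so dependencies are captured
    regardless of the distance between elements.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                          # (T, d_model) global aggregation

rng = np.random.default_rng(0)
T, d = 8, 16
X = rng.standard_normal((T, d))
out = self_attention(X, rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
```

The softmax row-normalization is what makes the aggregation a weighted summation: each row of `weights` sums to one.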
Second, the Multi-Head Context-Attention Mechanism extends the self-attention mechanism by running multiple independent “attention heads” in parallel instead of computing attention only once. Each head learns to focus on different contextual information in different representation subspaces; for instance, one head may focus on short-term patterns while another attends to long-term cycles. This design enables the model to jointly process information from different positions and capture diverse temporal characteristics [31].
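The parallel-heads design can be sketched by splitting the model width into subspaces, attending in each, and merging the results; the helper names and shapes below are assumptions for illustration:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run n_heads independent attention heads in parallel and merge them.

    X: (T, d_model). Each head attends in its own d_model // n_heads
    subspace, so different heads can specialise (e.g., short-term patterns
    vs. long-term cycles).
    """
    T, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):  # (T, d_model) -> (n_heads, T, d_head)
        return M.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                           # per-head softmax
    heads = w @ V                                           # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # merge subspaces
    return concat @ Wo                                      # final projection

rng = np.random.default_rng(1)
T, d = 8, 16
X = rng.standard_normal((T, d))
W = [rng.standard_normal((d, d)) for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=4)
```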
Third, the Position-wise Feed-Forward Network processes the output from the self-attention module. This network consists of fully connected layers and nonlinear activation functions, applying identical transformations to each position in the sequence independently. Its primary function is to introduce non-linearity and facilitate feature transformation, thereby enhancing the model’s representational capacity [31].
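The “identical transformation per position” property can be shown with a minimal two-layer sketch, assuming a ReLU activation (common but not stated in the text); all names and sizes are illustrative:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the SAME two-layer MLP to every sequence position independently.

    X: (T, d_model). The ReLU supplies the non-linearity; because weights
    are shared across positions, the transform is 'position-wise'.
    """
    hidden = np.maximum(0.0, X @ W1 + b1)  # (T, d_ff), ReLU activation
    return hidden @ W2 + b2                # (T, d_model), back to model width

rng = np.random.default_rng(2)
T, d_model, d_ff = 8, 16, 64
X = rng.standard_normal((T, d_model))
W1 = rng.standard_normal((d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)); b2 = np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)
```

Because the weights are shared, feeding a single position through the network alone yields exactly the corresponding row of the full output.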
The above core components are organized within the classical Transformer encoder-decoder framework. As shown in Figure 2, this structure consists of two parts: the encoder on the left and the decoder on the right, each containing six identical layers. Input sequences are combined with word embeddings and positional encoding before being fed into the encoder. Similarly, output sequences are combined with word embeddings and positional encoding and fed into the decoder. The encoder’s output is then fed into the decoder through attention mechanisms, and softmax is applied to the decoder’s output to predict the next token. Word embedding and positional encoding will be formally introduced in subsequent discussions. We first analyze each layer of the encoder and decoder in detail [31].
3.4.4. Transforming 1D Variations into 2D Variations
Within the methodology, we introduce TimesNet [19], whose core innovation lies in analyzing intra-period and inter-period features through a two-dimensional convolutional network by folding temporal data into a two-dimensional tensor. Compared with other methods, e.g., spectral methods [40], this innovation effectively integrates the modeling of fine-grained intra-period variations and long-term inter-period evolution. The specific implementation comprises two key steps:
First, the model identifies dominant periods based on the Fourier transform. The model performs FFT on the input sequence and selects the top-k frequencies with the highest spectral energy, using their corresponding period lengths as candidates. This step aligns with traditional spectral analysis philosophy, aiming to discover the data’s repetitive “rhythms” from the frequency domain.
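The period-detection step can be illustrated with a minimal FFT sketch. The helper name `dominant_periods` and the toy series are assumptions for illustration, not TimesNet's exact implementation (which averages amplitudes over channels):

```python
import numpy as np

def dominant_periods(x, k=2):
    """Pick the k period lengths with the highest FFT spectral energy.

    x: 1D series of length T. Returns candidate period lengths T // freq,
    mirroring the 'find the data's repetitive rhythms' step described above.
    """
    T = len(x)
    amp = np.abs(np.fft.rfft(x))       # amplitude spectrum
    amp[0] = 0.0                       # ignore the DC (mean) component
    top = np.argsort(amp)[-k:][::-1]   # top-k frequency bins by energy
    return [T // f for f in top]       # frequency index -> period length

# toy series: a daily (period 24) and a faster (period 8) rhythm
t = np.arange(96)
x = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)
periods = dominant_periods(x, k=2)     # strongest rhythm first
```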
Second, the model folds the 1D sequence into a 2D tensor according to the identified periods. For each candidate period length p, the model reshapes the original one-dimensional sequence into a two-dimensional matrix of shape $\lfloor L/p \rfloor \times p$ (where L is the sequence length). In this matrix, the row direction represents short-term variations within a single period (e.g., changes over 24 h in a day), while the column direction represents long-term evolution across multiple periods (e.g., trends over consecutive days). This transformation decomposes complex temporal variations into two orthogonal directions that are more amenable to modeling.
Once the data is converted into a two-dimensional structure, the model can employ mature two-dimensional convolutional neural networks to capture complex dependencies both within and between periods simultaneously. This contrasts with traditional methods: Fourier analysis excels at capturing global periodicity but loses temporal localization information; the wavelet transform can capture local time–frequency characteristics but has limited capacity for modeling long-term dependencies; whereas TimesNet’s 2D transformation attempts to combine the advantages of both, explicitly modeling temporal and frequency domain features simultaneously through a structured approach.
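The folding step is essentially a reshape; a minimal sketch (the helper name and the handling of leftover points are assumptions, since TimesNet pads rather than truncates):

```python
import numpy as np

def fold_to_2d(x, p):
    """Reshape a 1D series into a (cycles, period) matrix for 2D convolution.

    Rows hold within-period (intra-period) variation, and columns hold the
    evolution of the same phase across consecutive periods (inter-period
    variation). Trailing points that do not fill a whole period are dropped
    here; padding, as used in practice, would keep them.
    """
    f = len(x) // p                 # number of complete cycles
    return x[: f * p].reshape(f, p)

x = np.arange(10.0)                 # toy series with an assumed period of 4
m = fold_to_2d(x, p=4)              # 2 complete cycles -> shape (2, 4)
```

Reading across a row of `m` traverses one period; reading down a column follows the same phase from one period to the next, which is exactly what a 2D convolution kernel then sees in its two directions.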
The model is derived from the state-of-the-art TimesNet time series analysis method [19] by incorporating the Transformer strategy [31] and the wavelet approach, allowing for the quantification of predictive uncertainties and explanation of prediction results. The three parts and the integrated method will be introduced in the following subsections.
Unlike traditional machine learning and deep learning methods (e.g., LSTM), which only capture temporal dependencies among adjacent time points and thus fail to capture long-term dependencies, the key innovation of TimesNet is that it transforms the analysis of 1D temporal variations into 2D space based on the inherent periodicity of data. This allows it to explore not only the short-term temporal pattern within a period (intra-period variation), but also the variations among consecutive periods (inter-period variation) [41].
In this model, the block handling the transformation of 1D time series into 2D space and the processing of 2D variations is referred to as TimesBlock. The time series $X$ of length $T$ exhibits multi-periodicity that is identified through the Fourier transform [42] via frequency analysis. The Fourier transform is employed to identify dominant periods in the time series data. Based on these, the top-$k$ dominant periods are selected. However, the selected top-$k$ periods are not necessarily the most significant; later, we will discuss how to dynamically learn the optimal $k$ for optimization. For each period length $p$, the original 1D series $X$ is divided into $f$ complete segments of length $p$. This is represented by the following formula:
$$f = \left\lfloor \frac{T}{p} \right\rfloor,$$
where $T$ denotes the length of the input time series, $p$ denotes a period length used to decompose the time series, and $f$ denotes the number of complete cycles within the total duration $T$. These segments are then reshaped into a two-dimensional matrix $X_{2D}^{p} \in \mathbb{R}^{f \times p}$. Consequently, we obtain a set of two-dimensional matrices denoted $\{X_{2D}^{p_1}, \ldots, X_{2D}^{p_k}\}$. For each period length $p_i$, the rows and columns of $X_{2D}^{p_i}$ represent intra-period variation and inter-period variation, respectively. The transformed 2D matrices $\{X_{2D}^{p_i}\}_{i=1}^{k}$ are regarded as images and processed by 2D convolutional layers to extract intra-period and inter-period variations from the original 1D time series, leveraging the strengths of convolutional architectures in image processing [43,44]. The extracted 2D features are then transformed back into 1D space to generate the final time-series predictions.
The architecture diagram is shown in Figure 3. First, the 1D historical time series is fed into the network, where it passes through an adaptive module consisting of a Fourier transform and a wavelet transform. The two transforms are computed simultaneously, and their features are fused by weighting. With this adaptive design, the global frequency resolution of the Fourier transform and the local time–frequency analysis capability of the wavelet transform can be balanced to enhance the expression of the time-series features.
The weighting mechanism employs learnable gating units to dynamically integrate global spectral features ($F$) obtained via the Fourier transform with local time–frequency features ($W$) derived from the wavelet transform. Specifically, we concatenate these features into $[F; W]$ and generate an adaptive fusion weight $\alpha \in (0, 1)$ through a lightweight gated network.
The balancing mechanism dynamically fuses the strengths of both methods through adaptive weights: assigning a higher Fourier weight ($\alpha$) to stable periodic components, leveraging its frequency-recognition capability, while allocating a higher wavelet weight ($1 - \alpha$) to transient anomalous components, utilising its sensitivity to local variations. Finally, the 1D time series is transformed into a 2D matrix $X_{2D}$.
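The gated fusion can be sketched as a sigmoid gate over the concatenated features; the gate's exact architecture is not specified in the text, so the single-layer form, names, and shapes below are assumptions:

```python
import numpy as np

def gated_fusion(F, W_feat, Wg, bg):
    """Fuse Fourier features F and wavelet features W_feat with a learned gate.

    A lightweight gate maps the concatenation [F; W] to a weight alpha in
    (0, 1); the output is alpha * F + (1 - alpha) * W. A larger alpha leans
    on the Fourier branch (stable periodic components), a smaller alpha on
    the wavelet branch (transient, local variations).
    """
    concat = np.concatenate([F, W_feat], axis=-1)         # [F; W]
    alpha = 1.0 / (1.0 + np.exp(-(concat @ Wg + bg)))     # sigmoid gate
    return alpha * F + (1.0 - alpha) * W_feat             # convex combination

rng = np.random.default_rng(3)
T, d = 8, 16
F = rng.standard_normal((T, d))       # global spectral features (Fourier)
Wav = rng.standard_normal((T, d))     # local time-frequency features (wavelet)
Wg = rng.standard_normal((2 * d, d))  # gate weights over the concatenation
fused = gated_fusion(F, Wav, Wg, np.zeros(d))
```

Because the gate output is a convex combination, every fused value lies between the two branch values, which is what makes the balancing stable.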
3.4.5. Linear Transition for Forecasting
After extracting multi-scale temporal features through the stacked TimesBlocks, we employ a lightweight linear projection layer to map the high-dimensional representations directly to future time series. This projection serves as a crucial component for temporal feature decoding and horizon adaptation, efficiently transforming the learned 2D variations into future sequences without introducing excessive parameters.
For forecasting tasks, the transition from encoded features to predictions is designed for both parameter efficiency and temporal coherence. We project the high-dimensional features directly onto the future horizon using a single linear transformation. This approach is effective because the final representation from the stacked TimesBlocks, denoted as $Z \in \mathbb{R}^{T \times d_{model}}$, already encapsulates the necessary multi-scale temporal semantics, thus requiring only a lightweight layer to decode the future sequence:
$$\hat{Y}_{all} = Z W + b, \qquad W \in \mathbb{R}^{d_{model} \times (H \cdot C)}.$$
This operation generates $T$ tentative prediction sequences, each of length $H$. To form the final forecast, the output $\hat{Y}_{all} \in \mathbb{R}^{T \times (H \cdot C)}$ is first reshaped into a three-dimensional tensor in $\mathbb{R}^{T \times H \times C}$, which explicitly separates the input time steps, forecast horizon, and data variates. The final forecast $\hat{Y} \in \mathbb{R}^{H \times C}$ is then obtained by selecting the sequence corresponding to the last input time step:
$$\hat{Y} = \hat{Y}_{all}[T, :, :].$$
This selection is grounded in the understanding that the feature vector at the final time step $T$ inherently aggregates the most comprehensive historical context, making its corresponding prediction the most reliable and context-aware.
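The projection-and-select step can be sketched as follows; the function name and dimension sizes are illustrative assumptions:

```python
import numpy as np

def linear_forecast(Z, W, b, H, C):
    """Decode stacked-TimesBlock features into an H-step forecast.

    Z: (T, d_model) final representation. A single linear layer produces one
    tentative length-H forecast (over C variates) per input step; after
    reshaping to (T, H, C), we keep the sequence at the last input step,
    whose feature vector has seen the full history.
    """
    Y_all = Z @ W + b                  # (T, H * C) tentative forecasts
    Y_all = Y_all.reshape(-1, H, C)    # separate horizon and variates
    return Y_all[-1]                   # (H, C) forecast from the last step

rng = np.random.default_rng(4)
T, d_model, H, C = 12, 16, 6, 3
Z = rng.standard_normal((T, d_model))  # stand-in for encoded features
W = rng.standard_normal((d_model, H * C))
forecast = linear_forecast(Z, W, np.zeros(H * C), H, C)
```

Note the parameter count is just `d_model * H * C + H * C`, independent of the input length `T`, which is what keeps the decoding head lightweight.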