3.4.1. Transformer Architecture
The Transformer architecture comprises the following components:
First, the self-attention mechanism serves as the core component of the Transformer model, enabling each element in the sequence (e.g., a time point) to interact with all other elements and aggregate global information through weighted summation. This allows the model to directly capture long-range dependencies in time series, regardless of the distance between elements [31].
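The weighted-summation idea above can be sketched as scaled dot-product attention. The following NumPy sketch is illustrative only; the function name, shapes, and random weights are assumptions, not the paper's implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence.

    X: (T, d_model) sequence of T time points. Every output position is a
    weighted sum over ALL positions, so dependencies are captured
    regardless of the distance between elements.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                          # (T, d_model) global aggregation

rng = np.random.default_rng(0)
T, d = 8, 16
X = rng.standard_normal((T, d))
out = self_attention(X, rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
```

The softmax row-normalization is what makes the aggregation a weighted summation: each row of `weights` sums to one.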
Second, the Multi-Head Context-Attention Mechanism extends the self-attention mechanism by running multiple independent “attention heads” in parallel instead of computing attention only once. Each head learns to focus on different contextual information in different representation subspaces; for instance, one head may focus on short-term patterns while another attends to long-term cycles. This design enables the model to jointly process information from different positions and capture diverse temporal characteristics [31].
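The parallel-heads design can be sketched by splitting the model width into subspaces, attending in each, and merging the results; the helper names and shapes below are assumptions for illustration:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run n_heads independent attention heads in parallel and merge them.

    X: (T, d_model). Each head attends in its own d_model // n_heads
    subspace, so different heads can specialise (e.g., short-term patterns
    vs. long-term cycles).
    """
    T, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):  # (T, d_model) -> (n_heads, T, d_head)
        return M.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                           # per-head softmax
    heads = w @ V                                           # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # merge subspaces
    return concat @ Wo                                      # final projection

rng = np.random.default_rng(1)
T, d = 8, 16
X = rng.standard_normal((T, d))
W = [rng.standard_normal((d, d)) for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=4)
```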
Third, the Position-wise Feed-Forward Network processes the output from the self-attention module. This network consists of fully connected layers and nonlinear activation functions, applying identical transformations to each position in the sequence independently. Its primary function is to introduce non-linearity and facilitate feature transformation, thereby enhancing the model’s representational capacity [31].
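The “identical transformation per position” property can be shown with a minimal two-layer sketch, assuming a ReLU activation (common but not stated in the text); all names and sizes are illustrative:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the SAME two-layer MLP to every sequence position independently.

    X: (T, d_model). The ReLU supplies the non-linearity; because weights
    are shared across positions, the transform is 'position-wise'.
    """
    hidden = np.maximum(0.0, X @ W1 + b1)  # (T, d_ff), ReLU activation
    return hidden @ W2 + b2                # (T, d_model), back to model width

rng = np.random.default_rng(2)
T, d_model, d_ff = 8, 16, 64
X = rng.standard_normal((T, d_model))
W1 = rng.standard_normal((d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)); b2 = np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)
```

Because the weights are shared, feeding a single position through the network alone yields exactly the corresponding row of the full output.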
The above core components are organized within the classical Transformer encoder-decoder framework. As shown in Figure 2, this structure consists of two parts: the encoder on the left and the decoder on the right, each containing six identical layers. Input sequences are combined with word embeddings and positional encoding before being fed into the encoder. Similarly, output sequences are combined with word embeddings and positional encoding and fed into the decoder. The encoder’s output is then fed into the decoder through attention mechanisms, and softmax is applied to the decoder’s output to predict the next token. Word embedding and positional encoding will be formally introduced in subsequent discussions. We first analyze each layer of the encoder and decoder in detail [31].
3.4.4. Transforming 1D Variations into 2D Variations
Within the methodology, we introduce TimesNet [19], whose core innovation lies in analyzing intra-period and inter-period features through a two-dimensional convolutional network by folding temporal data into a two-dimensional tensor. Compared with other methods, e.g., spectral methods [40], this innovation effectively integrates the modeling of fine-grained intra-period variations and long-term inter-period evolution. The specific implementation comprises two key steps:
First, the model identifies dominant periods based on the Fourier transform. The model performs FFT on the input sequence and selects the top-k frequencies with the highest spectral energy, using their corresponding period lengths as candidates. This step aligns with traditional spectral analysis philosophy, aiming to discover the data’s repetitive “rhythms” from the frequency domain.
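The period-detection step can be illustrated with a minimal FFT sketch. The helper name `dominant_periods` and the toy series are assumptions for illustration, not TimesNet's exact implementation (which averages amplitudes over channels):

```python
import numpy as np

def dominant_periods(x, k=2):
    """Pick the k period lengths with the highest FFT spectral energy.

    x: 1D series of length T. Returns candidate period lengths T // freq,
    mirroring the 'find the data's repetitive rhythms' step described above.
    """
    T = len(x)
    amp = np.abs(np.fft.rfft(x))       # amplitude spectrum
    amp[0] = 0.0                       # ignore the DC (mean) component
    top = np.argsort(amp)[-k:][::-1]   # top-k frequency bins by energy
    return [T // f for f in top]       # frequency index -> period length

# toy series: a daily (period 24) and a faster (period 8) rhythm
t = np.arange(96)
x = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)
periods = dominant_periods(x, k=2)     # strongest rhythm first
```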
Second, the model folds the 1D sequence into a 2D tensor according to the identified periods. For each candidate period length p, the model reshapes the original one-dimensional sequence into a two-dimensional matrix of shape $\lfloor L/p \rfloor \times p$ (where L is the sequence length). In this matrix, the row direction represents short-term variations within a single period (e.g., changes over 24 h in a day), while the column direction represents long-term evolution across multiple periods (e.g., trends over consecutive days). This transformation decomposes complex temporal variations into two orthogonal directions that are more amenable to modeling.
Once the data is converted into a two-dimensional structure, the model can employ mature two-dimensional convolutional neural networks to capture complex dependencies both within and between periods simultaneously. This contrasts with traditional methods: Fourier analysis excels at capturing global periodicity but loses temporal localization information; the wavelet transform can capture local time–frequency characteristics but has limited capacity for modeling long-term dependencies; whereas TimesNet’s 2D transformation attempts to combine the advantages of both, explicitly modeling temporal and frequency domain features simultaneously through a structured approach.
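The folding step is essentially a reshape; a minimal sketch (the helper name and the handling of leftover points are assumptions, since TimesNet pads rather than truncates):

```python
import numpy as np

def fold_to_2d(x, p):
    """Reshape a 1D series into a (cycles, period) matrix for 2D convolution.

    Rows hold within-period (intra-period) variation, and columns hold the
    evolution of the same phase across consecutive periods (inter-period
    variation). Trailing points that do not fill a whole period are dropped
    here; padding, as used in practice, would keep them.
    """
    f = len(x) // p                 # number of complete cycles
    return x[: f * p].reshape(f, p)

x = np.arange(10.0)                 # toy series with an assumed period of 4
m = fold_to_2d(x, p=4)              # 2 complete cycles -> shape (2, 4)
```

Reading across a row of `m` traverses one period; reading down a column follows the same phase from one period to the next, which is exactly what a 2D convolution kernel then sees in its two directions.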
The model is derived from the state-of-the-art TimesNet time series analysis method [19] by incorporating the Transformer strategy [31] and the wavelet approach, allowing for the quantification of predictive uncertainties and explanation of prediction results. The three parts and the integrated method will be introduced in the following subsections.
Unlike traditional machine learning and deep learning methods (e.g., LSTM), which only capture temporal dependencies among adjacent time points and thus fail to capture long-term dependencies, the key innovation of TimesNet is that it transforms the analysis of 1D temporal variations into 2D space based on the inherent periodicity of data. This allows it to explore not only the short-term temporal pattern within a period (intra-period variation), but also the variations among consecutive periods (inter-period variation) [41].
In this model, the block handling the transformation of 1D time series into 2D space and the processing of 2D variations is referred to as TimesBlock. The time series $X$ of length $T$ exhibits multi-periodicity that is identified through the Fourier transform [42] via frequency analysis. The Fourier transform is employed to identify dominant periods in the time series data. Based on these, the top-$k$ dominant periods are selected. However, the selected top-$k$ periods are not necessarily the most significant; later, we will discuss how to dynamically learn the optimal $k$ for optimization. For each period length $p$, the original 1D series $X$ is divided into $f$ complete segments of length $p$. This is represented by the following formula:
$$f = \left\lfloor \frac{T}{p} \right\rfloor,$$
where $T$ denotes the length of the input time series, $p$ denotes a period length used to decompose the time series, and $f$ denotes the number of complete cycles within the total duration $T$. These segments are then reshaped into a two-dimensional matrix $X_{2D}^{p} \in \mathbb{R}^{f \times p}$. Consequently, we obtain a set of two-dimensional matrices denoted $\{X_{2D}^{p_1}, \ldots, X_{2D}^{p_k}\}$. For each period length $p_i$, the rows and columns of $X_{2D}^{p_i}$ represent intra-period variation and inter-period variation, respectively. The transformed 2D matrices $\{X_{2D}^{p_i}\}_{i=1}^{k}$ are regarded as images and processed by 2D convolutional layers to extract intra-period and inter-period variations from the original 1D time series, leveraging the strengths of convolutional architectures in image processing [43,44]. The extracted 2D features are then transformed back into 1D space to generate the final time-series predictions.
The architecture diagram is shown in Figure 3. First, the 1D historical time series is fed into the network, where it passes through an adaptive module consisting of a Fourier transform and a wavelet transform. The two transforms are computed simultaneously, and their features are fused by weighting. With this adaptive design, the global frequency resolution of the Fourier transform and the local time–frequency analysis capability of the wavelet transform can be balanced to enhance the expression of the time-series features.
The weighting mechanism employs learnable gating units to dynamically integrate global spectral features ($F$) obtained via the Fourier transform with local time–frequency features ($W$) derived from the wavelet transform. Specifically, we concatenate these features into $[F; W]$ and generate an adaptive fusion weight $\alpha \in (0, 1)$ through a lightweight gated network.
The balancing mechanism dynamically fuses the strengths of both methods through adaptive weights: assigning a higher Fourier weight ($\alpha$) to stable periodic components, leveraging its frequency-recognition capability, while allocating a higher wavelet weight ($1 - \alpha$) to transient anomalous components, utilising its sensitivity to local variations. Finally, the 1D time series is transformed into a 2D matrix $X_{2D}$.
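The gated fusion can be sketched as a sigmoid gate over the concatenated features; the gate's exact architecture is not specified in the text, so the single-layer form, names, and shapes below are assumptions:

```python
import numpy as np

def gated_fusion(F, W_feat, Wg, bg):
    """Fuse Fourier features F and wavelet features W_feat with a learned gate.

    A lightweight gate maps the concatenation [F; W] to a weight alpha in
    (0, 1); the output is alpha * F + (1 - alpha) * W. A larger alpha leans
    on the Fourier branch (stable periodic components), a smaller alpha on
    the wavelet branch (transient, local variations).
    """
    concat = np.concatenate([F, W_feat], axis=-1)         # [F; W]
    alpha = 1.0 / (1.0 + np.exp(-(concat @ Wg + bg)))     # sigmoid gate
    return alpha * F + (1.0 - alpha) * W_feat             # convex combination

rng = np.random.default_rng(3)
T, d = 8, 16
F = rng.standard_normal((T, d))       # global spectral features (Fourier)
Wav = rng.standard_normal((T, d))     # local time-frequency features (wavelet)
Wg = rng.standard_normal((2 * d, d))  # gate weights over the concatenation
fused = gated_fusion(F, Wav, Wg, np.zeros(d))
```

Because the gate output is a convex combination, every fused value lies between the two branch values, which is what makes the balancing stable.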
3.4.5. Linear Transition for Forecasting
After extracting multi-scale temporal features through the stacked TimesBlocks, we employ a lightweight linear projection layer to map the high-dimensional representations directly to future time series. This projection serves as a crucial component for temporal feature decoding and horizon adaptation, efficiently transforming the learned 2D variations into future sequences without introducing excessive parameters.
For forecasting tasks, the transition from encoded features to predictions is designed for both parameter efficiency and temporal coherence. We project the high-dimensional features directly onto the future horizon using a single linear transformation. This approach is effective because the final representation from the stacked TimesBlocks, denoted as $Z \in \mathbb{R}^{T \times d_{model}}$, already encapsulates the necessary multi-scale temporal semantics, thus requiring only a lightweight layer to decode the future sequence:
$$\hat{Y}_{all} = Z W + b, \qquad W \in \mathbb{R}^{d_{model} \times (H \cdot C)}.$$
This operation generates $T$ tentative prediction sequences, each of length $H$. To form the final forecast, the output $\hat{Y}_{all} \in \mathbb{R}^{T \times (H \cdot C)}$ is first reshaped into a three-dimensional tensor in $\mathbb{R}^{T \times H \times C}$, which explicitly separates the input time steps, forecast horizon, and data variates. The final forecast $\hat{Y} \in \mathbb{R}^{H \times C}$ is then obtained by selecting the sequence corresponding to the last input time step:
$$\hat{Y} = \hat{Y}_{all}[T, :, :].$$
This selection is grounded in the understanding that the feature vector at the final time step $T$ inherently aggregates the most comprehensive historical context, making its corresponding prediction the most reliable and context-aware.
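The projection-and-select step can be sketched as follows; the function name and dimension sizes are illustrative assumptions:

```python
import numpy as np

def linear_forecast(Z, W, b, H, C):
    """Decode stacked-TimesBlock features into an H-step forecast.

    Z: (T, d_model) final representation. A single linear layer produces one
    tentative length-H forecast (over C variates) per input step; after
    reshaping to (T, H, C), we keep the sequence at the last input step,
    whose feature vector has seen the full history.
    """
    Y_all = Z @ W + b                  # (T, H * C) tentative forecasts
    Y_all = Y_all.reshape(-1, H, C)    # separate horizon and variates
    return Y_all[-1]                   # (H, C) forecast from the last step

rng = np.random.default_rng(4)
T, d_model, H, C = 12, 16, 6, 3
Z = rng.standard_normal((T, d_model))  # stand-in for encoded features
W = rng.standard_normal((d_model, H * C))
forecast = linear_forecast(Z, W, np.zeros(H * C), H, C)
```

Note the parameter count is just `d_model * H * C + H * C`, independent of the input length `T`, which is what keeps the decoding head lightweight.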