DESTformer: A Transformer Based on Explicit Seasonal–Trend Decomposition for Long-Term Series Forecasting

Abstract: Seasonal–trend-decomposed transformers have empowered long-term time series forecasting by capturing global temporal dependencies (e.g., period-based dependencies) in disentangled temporal patterns. However, existing methods design various auto-correlation or attention mechanisms in the seasonal view while ignoring the fine-grained temporal patterns in the trend view of the series decomposition component, which causes an information utilization bottleneck. To this end, a Transformer-based seasonal–trend decomposition methodology with a multi-scale attention mechanism in the trend view and a multi-view attention mechanism in the seasonal view is proposed, called DESTformer. Specifically, rather than using a moving average operation to obtain the trend data, a frequency domain transform is first applied to extract the seasonal (high-frequency) and trend (low-frequency) components, explicitly capturing different temporal patterns in both the seasonal and trend views. For the trend component, a multi-scale attention mechanism is designed to capture fine-grained sub-trends under different receptive fields. For the seasonal component, instead of a frequency-only attention mechanism, a multi-view frequency domain (i.e., frequency, amplitude, and phase) attention mechanism is designed to enhance the ability to capture complex periodic changes. Extensive experiments are conducted on six benchmark datasets covering five practical applications: energy, transportation, economics, weather, and disease. Compared to the state-of-the-art FEDformer, our model reduces MSE and MAE by averages of 6.5% and 3.7%, respectively. These experimental results verify the effectiveness of our method and point out a new way of handling trend and seasonal patterns in long-term time series forecasting tasks.


Introduction
Long-term time series prediction refers to forecasting sequence changes over a longer horizon based on historical data, e.g., predicting 24 future points or more, as in Informer [1], Autoformer [2], and FEDformer [3]. It has widespread applications in fields such as electricity forecasting [1,2], traffic flow prediction [4][5][6], inventory control [7], and healthcare management [8,9]. For example, in the energy sector, long-term forecasting is used to optimize the operation and management of the power grid, improving energy efficiency and reliability. However, high nonlinearity, long-term temporal dependency, and entangled multi-scale temporal components (e.g., trend and seasonality) make long-term forecasting a very challenging task.
First, learning multi-scale temporal dependency is nontrivial. Guided by the idea of time series decomposition, long sequences can be decoupled into more expressive seasonal and trend components. In traditional transformer-based methods [10], such as Informer [1], Reformer [11], and Preformer [12], attention values are often calculated based on position-aware data points from the original time series. Due to the high noise and complexity in long-term forecasting tasks, these methods often produce suboptimal results. The latest methods for constructing seasonal-trend attention are still limited to single-perspective frequency domain learning [3] and fixed-length subsequence learning [2]. We believe that multi-view seasonal attention learning and variable-length sub-trend attention learning can flexibly capture more potential information in long sequences, thereby improving the predictive and generalization capabilities of the model.
Second, on the basis of time series decomposition, effective learning of seasonal-trend representations becomes more important [13]. Existing methods that combine time series decomposition with the transformer architecture [10], such as Autoformer [2] and FEDformer [3], have demonstrated strong predictive capabilities on datasets with strong seasonality and weak noise disturbances. Although these methods adopt progressive decomposition, they still cannot effectively distinguish between seasonal and trend representations. At the same time, they often focus more on learning seasonal components and neglect long-term trend fluctuations, which greatly limits these methods. Therefore, we believe that effective and targeted seasonal-trend representation learning can maximize the model's long-sequence prediction capabilities.
To address the above challenges, a transformer model based on seasonal-trend decomposition is proposed, called DESTformer. First, DESTformer effectively decouples and denoises complex sequences through a frequency domain transform. Next, for the seasonal component, a multi-view attention mechanism (MVI-Attention) is proposed to replace the traditional self-attention mechanism for capturing complex periodic changes. MVI-Attention simultaneously calculates self-attention from three perspectives, i.e., frequency, amplitude, and phase, and then converts the result back to the time domain through the inverse discrete Fourier transform. For the trend component, a multi-scale attention mechanism (MSC-Attention) is proposed to replace the traditional self-attention mechanism for capturing sub-trends under different receptive fields. MSC-Attention extracts sub-trends through one-dimensional convolutions with multi-scale receptive fields and completes the attention aggregation of sub-trends by calculating the correlation coefficients between sequences. Finally, through the fast Fourier transform and sampling, DESTformer reduces the computational cost of the transformer from quadratic to linear complexity.
To sum up, our main contributions are as follows:

• A transformer architecture based on seasonal-trend decomposition is proposed that can effectively decouple complex long sequences and learn representations of the seasonal and trend components in a targeted manner.
• A multi-view attention mechanism (MVI-Attention) is proposed that can perform holistic modeling from multiple perspectives in the frequency domain to capture important periodic structures in time series.
• A multi-scale attention mechanism (MSC-Attention) is proposed to enhance information utilization in the trend view via the modeling of variable-length sub-trends, thus learning information-rich trend representations.
• Extensive experiments are conducted on six benchmark datasets in multiple domains (energy, transportation, economics, weather, and disease). Experimental results show that DESTformer improves over state-of-the-art methods by 6.0% and 4.8% in multivariate and univariate long-term time series prediction tasks, respectively.

Long-Term Time Series Forecasting
Time series prediction tasks aim at forecasting the future time series in the prediction window, given historical time series data in the conditioning window. Long-term time series forecasting is characterized by the large length of the predicted series. Mainstream time series prediction models can be divided into traditional statistical methods and machine learning-based methods. Traditional statistical methods mainly include autoregressive methods, such as ARIMA [14,15], and additive models, such as Holt-Winters [16] and Prophet [17]. In particular, Holt-Winters [16] and Prophet [17] capture versatile temporal patterns (e.g., trends, seasonality, and randomness) to better model nonlinearities. These methods have strong explainability and are suitable for dealing with relatively stable and regular time series data. However, they suffer from some limitations, such as sensitivity to outliers and missing values, difficulty in dealing with nonlinear and complex time series data, and difficulty in integrating other relevant information, such as timestamp information. In particular, when applied to long-term forecasting tasks, the aforementioned statistical methods fail to capture reliable dependencies.
In recent years, transformers based on self-attention mechanisms [10] have shown powerful capabilities on sequential data, such as natural language processing [18], audio processing [19], and even computer vision [20]. However, in long-term forecasting tasks, due to the quadratic complexity in memory and time with respect to the sequence length L, applying self-attention mechanisms to long-term time series prediction is computationally expensive. LogTrans [21] introduced local convolution into the transformer and proposed the LogSparse attention mechanism to select exponentially growing intervals of time steps, reducing the complexity to O(L log L). Reformer [11] used a locality-sensitive hashing (LSH) attention mechanism to reduce the complexity to O(L log L). Informer [1] extended the transformer with the ProbSparse attention mechanism based on KL-divergence to achieve O(L log L) complexity. Nevertheless, it is worth noting that these methods are based on ordinary transformers and attempt to change self-attention mechanisms into sparse versions; they still follow the pointwise dependence modeling principle. Autoformer [2] decomposed complex time series into seasonality and trend, and used the auto-correlation mechanism of the sequence to capture reliable temporal dependencies; FEDformer [3] used a similar decomposition idea to complete the self-attention calculation for seasonal components in the frequency domain. In this paper, the frequency domain transform technique for disentangling the seasonal and trend components is based on the inherent seasonality and trend of time series.

Time Series Decomposition
As a standard method for time series analysis, time series decomposition [22] decomposes a time series into several different levels of representation, each of which represents a predictable underlying category, and has mainly been used to explore historical changes over time. For prediction tasks, decomposition is usually used as a preprocessing step for historical sequences before predicting future sequences [23][24][25], such as trend-seasonal decomposition in Prophet [26], basis expansion in N-BEATS [27], and matrix decomposition in DeepGLO [28]. However, such a preprocessing operation is limited to a simple decomposition of historical sequences and ignores the hierarchical interaction between the underlying patterns of long-term future sequences. CoST [29] utilized contrastive representation learning that is aware of seasonality and trends. LaST [13] decomposed seasonal-trend representations in latent space based on variational inference, and supervised and separated the representations from the perspectives of self and input reconstruction to achieve optimal performance. However, traditional methods use the moving average operation to extract trend features, which results in weak robustness to noise. In this paper, we explore the idea of decomposition from a new perspective. Specifically, we map the time series to the frequency domain and then separate the high-frequency part as the seasonal component and the low-frequency part as the trend component through frequency domain masking. At the same time, for the high-frequency part, we filter the Top-K amplitudes and their corresponding frequencies to complete the denoising, making the model more robust.
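For concreteness, the classical preprocessing-style decomposition criticized above can be sketched in a few lines: the trend is a centered moving average and the seasonal part is the residual. This is an illustrative numpy sketch, not code from any of the cited systems; the window size and padding mode are arbitrary choices:

```python
import numpy as np

def moving_average_decompose(x, window=5):
    """Classical decomposition: trend = centered moving average,
    seasonal = residual after removing the trend."""
    pad = window // 2
    padded = np.pad(x, (pad, pad), mode="edge")        # replicate endpoints
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")  # same length as x
    seasonal = x - trend
    return seasonal, trend

t = np.arange(100)
x = 0.05 * t + np.sin(2 * np.pi * t / 10)   # linear trend + period-10 cycle
seasonal, trend = moving_average_decompose(x, window=11)
```

Because the trend is a plain average over a fixed window, a single outlier shifts every trend value inside its window, which is exactly the robustness weakness that motivates the frequency-domain alternative used in this paper.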

Methodology
In this section, a detailed description of the DESTformer architecture is provided. As mentioned earlier, long-term forecasting tasks involve complex temporal patterns in sequences. To effectively address this issue, a frequency domain transform module is used to decompose the original sequence into seasonal and trend components for modeling fine-grained temporal patterns. In addition, MVI-Attention and MSC-Attention are designed to capture the representations of the seasonal and trend components, respectively, thereby achieving accurate prediction.

Problem Definition
First, the problem definition for long-term forecasting is provided. Given a historical input sequence X of length T_x, the goal is to forecast the future sequence Ŷ of length T_y. A seasonal representation H_S and a trend representation H_T are learned for the desired predicted sequence Ŷ. Given the learned representations of the seasonal and trend parts, P(Y | H_S, H_T) is ultimately modeled.

DESTformer Architecture
In this section, a detailed description of the overall architecture of DESTformer is provided, as shown in Figure 1. Combining the idea of time series decomposition, improvements are made to the transformer, including a frequency domain transform module, a multi-view attention mechanism, a multi-scale attention mechanism, and the corresponding encoders and decoders.

Frequency Decomposition
Compared to directly extracting seasonal and trend features from the original sequence (e.g., CoST [29]), the approach of first decomposing and then extracting targeted features can effectively reduce interference from other features. It has been widely utilized in various data-denoising and pattern-filtering tasks [30,31]. In long-term forecasting problems, time series decomposition can learn complex temporal representations. Unlike traditional methods that obtain trend components through fixed-window moving averages, a new time series decomposition method is used that maps the sequence to the frequency domain and takes the high frequencies as the seasonal component and the low frequencies as the trend component. Compared to traditional decomposition methods, the effect of the frequency domain transform is more obvious [30,31]. It effectively avoids the overall impact of outliers on the trend component in traditional methods:

X_T = F^{-1}(Mask_low(F(X))), X̃_S = F^{-1}(Mask_high(F(X))),

where F denotes the FFT and F^{-1} is its inverse, F(X) ∈ R^{I×K}, where I = T_x/2 + 1, and the low-frequency mask keeps the first ξ = 3 frequency bins. At this point, the seasonal component X̃_S contains a large amount of noise. In long-term forecasting tasks, the presence of noise often reduces the generalization ability of the model. Therefore, the seasonal component is further denoised by selecting the Top-K frequencies with the largest amplitudes to obtain the final seasonal component X_S.
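The frequency decomposition with Top-K denoising described in this section can be sketched with numpy's real FFT. This is a minimal univariate reconstruction of the idea rather than the paper's exact implementation; the cutoff ξ and the value of K here are illustrative:

```python
import numpy as np

def freq_decompose(x, xi=3, top_k=5):
    """Split a series into trend (lowest xi frequency bins) and seasonal
    (remaining bins), keeping only the top_k largest-amplitude seasonal
    frequencies as denoising."""
    spec = np.fft.rfft(x)              # F(X), length T/2 + 1
    low = np.zeros_like(spec)
    low[:xi] = spec[:xi]               # low-frequency mask -> trend
    trend = np.fft.irfft(low, n=len(x))

    high = spec.copy()
    high[:xi] = 0                      # high-frequency mask -> raw seasonal
    amp = np.abs(high)
    keep = np.argsort(amp)[-top_k:]    # Top-K amplitudes
    denoised = np.zeros_like(high)
    denoised[keep] = high[keep]
    seasonal = np.fft.irfft(denoised, n=len(x))
    return seasonal, trend

np.random.seed(0)
t = np.arange(128)
x = 0.02 * t + np.sin(2 * np.pi * t / 16) + 0.1 * np.random.randn(128)
seasonal, trend = freq_decompose(x, xi=3, top_k=5)
```

Here the period-16 oscillation (frequency bin 8) survives the Top-K filter, while most of the additive noise, spread thinly across many bins, is discarded.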

Model Inputs
The input to the encoder is the past sequence X of length T_x. Consistent with Autoformer, the sequence X is decomposed into the seasonal component X_S and the trend component X_T through the frequency domain transform. The latter half of the seasonal component sequence is concatenated with a zero vector of length T_y along the time dimension as the seasonal input to the decoder. The latter half of the trend component sequence is concatenated with a vector of length T_y filled with the sequence mean along the time dimension as the trend input to the decoder. The input to the encoder is represented as X_en, the seasonal input to the decoder is denoted as X_des, and the trend input to the decoder is represented as X_det. Mathematically, we have

X_ens, X_ent = FreDecomp(X_en), X_des = Concat(X_ens[T_x/2:], 0), X_det = Concat(X_ent[T_x/2:], Mean(X_en)).

MVI-Attention and MSC-Attention are provided in Sections 3.3 and 3.4, respectively. They replace self-attention to extract the seasonal and trend representations, respectively.
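The decoder-input construction above can be sketched as follows. This is an illustrative univariate numpy sketch under the stated scheme (latter half of each component plus a length-T_y placeholder: zeros for the seasonal part, the trend mean for the trend part); the function name is ours:

```python
import numpy as np

def build_decoder_inputs(x_s, x_t, t_y):
    """Build decoder inputs from decomposed encoder series of length T_x:
    latter half of each component, padded with a length-t_y placeholder
    (zeros for seasonal, the trend mean for trend)."""
    t_x = len(x_s)
    half_s, half_t = x_s[t_x // 2:], x_t[t_x // 2:]
    x_des = np.concatenate([half_s, np.zeros(t_y)])          # seasonal input
    x_det = np.concatenate([half_t, np.full(t_y, x_t.mean())])  # trend input
    return x_des, x_det

x_s = np.sin(np.linspace(0, 6.28, 96))   # toy seasonal component, T_x = 96
x_t = np.linspace(0, 1, 96)              # toy trend component
x_des, x_det = build_decoder_inputs(x_s, x_t, t_y=24)
```

The placeholders give the decoder a length-(T_x/2 + T_y) canvas whose last T_y positions are progressively refined into the forecast.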

Decoder
In the decoder, a frequency domain transform module is introduced to further obtain clearer seasonal-trend representations. Suppose there are M decoder layers. In each decoder layer, cross MVI-Attention and cross MSC-Attention are first used to fuse the output representation of the encoder with the input representation of the decoder. Then, two interactive frequency domain decomposition modules are used to further complete time series decomposition and aggregation. Finally, seasonal-trend representations are learned through a feedforward network and residual connections. Formally, the l-th decoder layer is represented as X^l_de = Decoder(X^{l-1}_de, X^N_en).

Multi-View Attention

Extracting periodic fluctuations from the seasonal component sequence is particularly important. Through the FFT [32], the seasonal component is mapped to the frequency domain, represented by complex values. After mapping the time series to the frequency domain, a time series can be completely represented by three attributes, i.e., frequency, amplitude, and phase, which reflect different characteristics of the periodic fluctuations of the sequence. Amplitude usually reflects the maximum distance that a sequence deviates from its equilibrium position at a certain moment, while phase represents the different states of a periodic sequence at different moments. By mapping the learned seasonal representation from the time dimension to the frequency domain, F ∈ R^{I×D} is obtained, where D represents the feature dimension. The real and imaginary parts of F are represented as F_r and F_i, respectively. Amplitude and phase are represented by A(·) and Φ(·), respectively, and can be described as A(F) = sqrt(F_r² + F_i²) and Φ(F) = arctan(F_i / F_r).
The inputs to MVI-Attention, i.e., the queries, keys, and values, are denoted as q_s ∈ R^{T_x×D}, k_s ∈ R^{T_x×D}, and v_s ∈ R^{T_x×D}, respectively. They are mapped to the frequency domain and combined with the sampling strategy to obtain Q_s, K_s, and V_s. Further, the amplitude and phase representations corresponding to Q_s, K_s, and V_s are obtained, respectively. Attention is then computed in each of the three views, i.e., frequency, amplitude, and phase, with an activation σ such as Softmax or tanh. Finally, the iDFT is applied to obtain the seasonal representation, where A_i and φ_i represent the amplitude and phase at the i-th frequency, respectively, and the corresponding conjugate terms complete the inverse transform.
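The amplitude and phase views that MVI-Attention operates on follow directly from the real and imaginary parts of the FFT. A minimal numpy sketch of this view extraction (the toy signal is ours, and this only illustrates the view computation, not the full attention):

```python
import numpy as np

def amplitude_phase(x):
    """Map a series to the frequency domain and return the three views
    used by MVI-Attention: complex spectrum, amplitude A(.), phase Phi(.)."""
    spec = np.fft.rfft(x)
    f_r, f_i = spec.real, spec.imag
    amplitude = np.sqrt(f_r**2 + f_i**2)   # A = sqrt(Fr^2 + Fi^2)
    phase = np.arctan2(f_i, f_r)           # Phi = atan2(Fi, Fr)
    return spec, amplitude, phase

t = np.arange(64)
x = 3.0 * np.sin(2 * np.pi * t / 8)        # amplitude 3, period 8
spec, amp, phi = amplitude_phase(x)
```

For this pure sinusoid, the energy concentrates in frequency bin 8 (64/8 cycles), with spectral amplitude 3·N/2 = 96 and phase −π/2 (a sine is a −90°-shifted cosine) — the kind of structure the amplitude and phase views expose separately.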

Multi-Scale Attention
For the trend component, sub-trends with different receptive fields often have a significant impact on future trends. Existing research is often limited to learning fixed-length trends. Under this condition, choosing an appropriate lookback window becomes a critical issue: a small window can lead to underfitting, while a large window can lead to overfitting. A direct solution is to optimize this hyperparameter through grid search [33], but this is computationally expensive. Therefore, a multi-scale autoregressive mixture is used to adaptively capture sub-trends with different receptive fields. The size of the j-th convolutional kernel is denoted as g_j. The inputs to MSC-Attention, i.e., the queries, keys, and values, are denoted as q_t ∈ R^{T_x×D}, k_t ∈ R^{T_x×D}, and v_t ∈ R^{T_x×D}, respectively. Unlike traditional self-attention mechanisms, a mean convolutional kernel g_q = (1/J) Σ_{j=1}^{J} g_j is used to obtain the query vector Q_t. Further, K_t and V_t are obtained from the J multi-scale convolutions. After capturing the sub-trends with different receptive fields, softmax is used for activation, and the trend representation is generated by aggregating the sub-trends. The entire process of the algorithm is summarized in Algorithm 1.
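The multi-scale idea can be sketched as follows: sub-trends are mean convolutions of the series at several kernel sizes, the query comes from the mean kernel, and the sub-trends are aggregated with softmax weights derived from their correlation with the query. This is a toy univariate sketch of the mechanism's structure under our own simplifications, not the paper's MSC-Attention:

```python
import numpy as np

def smooth(x, k):
    """Sub-trend of x under a receptive field of size k (mean convolution)."""
    pad = k // 2
    padded = np.pad(x, (pad, k - 1 - pad), mode="edge")
    return np.convolve(padded, np.ones(k) / k, mode="valid")  # length preserved

def msc_attention(x, kernels=(2, 4, 8)):
    """Toy MSC-Attention: sub-trends at several scales are scored by their
    correlation with a mean-kernel query and aggregated with softmax."""
    g_q = int(round(np.mean(kernels)))          # mean convolutional kernel g_q
    query = smooth(x, g_q)
    subs = [smooth(x, k) for k in kernels]      # keys/values, one per scale
    scores = np.array([np.corrcoef(query, s)[0, 1] for s in subs])
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over scales
    return sum(w * s for w, s in zip(weights, subs))

x = np.linspace(0, 1, 64) + 0.05 * np.sin(np.arange(64))
trend_repr = msc_attention(x)
```

Each kernel size plays the role of one lookback window, so no single window has to be tuned by grid search; the softmax weighting lets the data decide how much each receptive field contributes.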

Complexity Analysis
In DESTformer, MVI-Attention is used to capture the periodic fluctuations of the seasonal term, while MSC-Attention is used to capture the long-term changes of the trend term. For a sequence of length L, in MVI-Attention, the FFT is used to effectively reduce the time complexity to O(L log L) [2]. On this basis, we also leverage a sampling strategy to effectively reduce the time and memory complexity to O(L) [3]. In MSC-Attention, we use one-dimensional convolutions to extract J different sub-trends. Since the time complexity of one-dimensional convolution when encoding time series is O(L) and J is a constant, the final time complexity of MSC-Attention is O(J²L) = O(L), and the memory complexity of MSC-Attention combined with the sampling strategy is also O(L). In summary, DESTformer achieves O(L) time and memory complexity. In Table 1, we summarize the comparison of time complexity and memory usage for the training and inference steps.


Experiments
To evaluate the proposed DESTformer model, a series of experiments are designed to compare it with state-of-the-art methods for long-term forecasting. In addition, ablation studies are conducted to investigate the roles and effects of each module in the model. Finally, efficiency analysis and t-SNE [35] representation visualization experiments are performed.

Baselines
The DESTformer is compared with five state-of-the-art long-term forecasting methods, including the classical statistical baseline ARIMA [37], the transformer-based Informer [1] and LogTrans [21], as well as FEDformer [3] and Autoformer [2], which combine time series decomposition with the transformer.

Evaluation Metrics
For the six long-term time series forecasting tasks, we choose MAE and MSE to evaluate the prediction performance of the various models. MAE and MSE are calculated as

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,

where y_i and ŷ_i denote the i-th ground-truth and predicted values, respectively, and n is the number of predicted points.
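The two metrics above are straightforward to compute; a minimal sketch with a worked example:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])
# mse: (0.25 + 0 + 1) / 3 ≈ 0.4167; mae: (0.5 + 0 + 1) / 3 = 0.5
```

MSE penalizes large residuals quadratically, so it is more sensitive to occasional large forecast errors than MAE.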

Implementation Details
The method is optimized using the Adam [38] optimizer. For all methods, the learning rate is set to 0.00001, and the batch size is set to 32. The method is trained using the L2 loss, and early stopping is applied within 20 epochs during training. All experiments are repeated five times with different random seeds, and the final results are reported as the average of the metrics. The code is implemented in PyTorch [39]. The training/validation/test data are split in a 6/2/2 ratio, consistent with Informer. The convolutional kernel size of MSC-Attention is selected from {2, 4, 8, 16, 32, 64}. The DESTformer consists of two encoder layers and one decoder layer. All models are trained/tested on an NVIDIA Tesla V100 32G GPU.
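A typical patience-based early-stopping helper, as commonly used in this kind of setup, can be sketched as follows. The class and the small patience value are illustrative; the text above applies early stopping within 20 epochs:

```python
class EarlyStopping:
    """Minimal early-stopping helper: stop when the validation loss has
    not improved for `patience` consecutive epochs."""
    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81]   # no improvement after epoch 2
flags = [stopper.step(l) for l in losses]   # stop fires on the 5th epoch
```

Combined with averaging over five random seeds, this keeps each run short while guarding against overfitting to the validation split.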

Multivariate Forecasting Results
According to the experimental results on multivariate forecasting tasks, shown in Table 2, DESTformer performs best in all prediction length settings on all benchmarks. This verifies the effectiveness of explicit seasonal and trend decomposition via high-frequency filtering (for seasonal data) and low-frequency filtering (for trend data), followed by the multi-scale and multi-view attention mechanisms in the trend and seasonal views. Note that when the input length is set to 96 and the prediction length is set to 336, the MSE of the DESTformer decreases by 6.4%, 5.2%, 6.0%, 3.4%, and 8.9% on the ETT, Electricity, Exchange, Traffic, and Weather datasets, respectively. Finally, on the relatively special ILI dataset, when the input length is set to 36 and the prediction length is set to 60, the DESTformer reduces the MSE by 4.4%. Overall, in the above experimental settings, the average MSE of the DESTformer decreases by 5.7%. Notably, the DESTformer performs well even on the Exchange dataset, in which no obvious periodicity is observed. In addition, as the prediction length increases, the performance of the DESTformer remains relatively stable, indicating better long-term stability. This is meaningful for real-world applications, such as weather warnings and long-term energy consumption planning. To better illustrate the predictive performance of DESTformer, we visualize the predicted sequences and their corresponding true sequences on the six datasets. As shown in Figure 2, DESTformer captures the long-term temporal patterns and accurately fits the fluctuations of future long sequences across different tasks.

Univariate Forecasting Results
As shown in Table 3, the univariate experimental results on two typical datasets are listed. Compared with the baseline models, DESTformer still achieves the best performance in long-term prediction tasks, with notable gains when the input length is set to 96 and the prediction length is set to 336.

Self-Attention vs. MSC-Attention
To investigate the difference between the multi-scale attention mechanism proposed in the DESTformer and the traditional attention mechanism, a third ablation experiment is set up, in which the multi-scale attention mechanism of the DESTformer for trend information is replaced with a traditional attention mechanism; this version of the model is named DESTformer-t. Table 4 shows the experimental results. It can be seen that our proposed multi-scale attention mechanism outperforms the traditional attention mechanism, with smaller MSEs in the experiments. In particular, the multi-scale attention mechanism also achieves a smaller MSE at large prediction lengths. Therefore, it is believed that the multi-scale attention mechanism helps the model to better learn trend information and thus achieve better performance in prediction tasks.

Efficiency Analysis
To show the efficiency of DESTformer, with the multi-view attention mechanism in the seasonal domain and the multi-scale mixed attention mechanism in the trend domain, we compare the memory cost and time cost in the training process with state-of-the-art models, including Informer, Autoformer, and FEDformer. As shown in Figure 3, DESTformer achieves O(L) complexity in both time and space efficiency. In addition, it shows superior efficiency in long-term time series forecasting tasks.

Conclusions
In this paper, we propose an explicit seasonal-trend decomposed transformer, called DESTformer, for long-term forecasting. DESTformer first explicitly extracts seasonal and trend components via high- and low-frequency filtering of the data after the frequency transform. To enhance information utilization, a multi-scale attention mechanism in the trend domain and a multi-view attention mechanism in the frequency domain are proposed, capturing fine-grained sub-trends under different receptive fields and complex periodic changes, respectively. Experimental results verify the effectiveness of our method, thus providing a new approach for handling trend and seasonal patterns in long-term time series prediction tasks. Despite the outstanding performance of DESTformer in long-term forecasting tasks, there are still some limitations. First, the effect of the multi-scale attention mechanism is influenced by the sub-trend selection method, but determining a suitable set of sub-trends adds extra workload to model training. Second, we only conducted experiments on datasets with obvious periodicity, and we hope to further test our model on more complex (even non-stationary) tasks in the future.

Figure 1. DESTformer architecture. The encoder combines traditional STL decomposition ideas to achieve separation and modeling of seasonal and trend components through representation learning of the original sequence. The decoder adopts an innovative frequency domain decomposition and representation learning method to further optimize and enhance the seasonal-trend representation.

Experimental Settings

Datasets

To validate the long-term forecasting capability of the DESTformer, experiments are conducted on six real-world datasets. (1) The ETT dataset (Electricity Transformer Temperature dataset) (https://github.com/zhouhaoyi/ETDataset (accessed on 12 July 2023)): this dataset is commonly used for long sequence time series prediction and contains data from two different regions in China, recorded at 2 h, 1 h, and 15 min intervals from July 2016 to July 2018. Each data point includes the oil temperature and six electricity load indicators. (2) The Electricity dataset (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (accessed on 12 July 2023)): this dataset contains hourly electricity consumption data for 321 customers between 2012 and 2014. (3) The Exchange dataset [36]: this dataset contains daily exchange rates for eight different countries from 1990 to 2016. (4) The Traffic dataset (http://pems.dot.ca.gov (accessed on 12 July 2023)): this dataset is a collection of hourly data from the California Department of Transportation and includes road occupancy rates measured by different sensors on highways in the San Francisco Bay Area. (5) The Weather dataset (https://www.bgc-jena.mpg.de/wetter/ (accessed on 12 July 2023)): this dataset contains 21 meteorological indicators recorded every 10 min throughout the year 2020, including temperature and humidity. (6) The ILI dataset (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html (accessed on 12 July 2023)): this dataset contains weekly data on influenza-like illness (ILI) patients recorded by the Centers for Disease Control and Prevention in the United States from 2002 to 2021, describing the proportion of ILI patients to the total number of patients.

Figure 3. Efficiency analysis of the two special attention mechanisms proposed in this article with the DESTformer, under the same experimental setup as Autoformer.

Table 2. Multivariate results with different prediction lengths O ∈ {96, 192, 336, 720}. We set the input length I as 36 for ILI and 96 for the others. A lower MSE or MAE indicates a better prediction.