Article

SegmentedCrossformer—A Novel and Enhanced Cross-Time and Cross-Dimensional Transformer for Multivariate Time Series Forecasting

Department of Information & Communication Sciences, Sophia University, Tokyo 102-8554, Japan
*
Author to whom correspondence should be addressed.
Forecasting 2025, 7(3), 41; https://doi.org/10.3390/forecast7030041
Submission received: 20 May 2025 / Revised: 29 July 2025 / Accepted: 29 July 2025 / Published: 3 August 2025
(This article belongs to the Section Forecasting in Computer Science)

Abstract

Multivariate Time Series Forecasting (MTSF) has been advanced by a long line of models over the last two decades, ranging from traditional statistical approaches to RNN-based models. More recently, deep learning has made substantial progress on time series problems through a series of Transformer-based models. Despite the breakthroughs brought by attention mechanisms, many challenges remain that call for more sophisticated models. Existing attention-based Transformers outperform classical models thanks to their ability to capture temporal dependencies, but they still lack efficient strategies for jointly learning dependencies among variables and in the time domain. Aiming to address these issues, we propose a novel Transformer, SegmentedCrossformer (SCF), a Transformer-based model that captures both cross-time and cross-dimension dependencies in an efficient manner. The model is built upon an encoder–decoder architecture operating at different scales and is compared with the previous state of the art. Experimental results on different datasets show the effectiveness of SCF, with unique advantages and efficiency.

1. Introduction

Multivariate time series consisting of multiple univariates in different dimensions are of importance in many real-life scenarios. Each univariate represents one specific domain or attribute such as temperature, daily number of passengers in an airport, and so on. As a representative and critical problem in time series, MTSF aims to effectively predict future patterns within a certain period given past time series. In today’s digital era, extensive time series data is ubiquitous in almost every industry, including traffic, retail supply, energy, healthcare, and weather forecasting. Utilization of large-scale time series data enables reasonable forecasting using large deep learning models.
Success in deep learning tasks such as Natural Language Processing (NLP) and Computer Vision (CV) motivates innovations in time series domains, along with the launch of models with intricate architectures. Recent generations of Large Language Models (LLMs) capable of NLP tasks such as long text generation [1,2] and machine translation [3] inspire the construction of large models for time series tasks. Derivatives of Transformer architectures with attention mechanisms are on the rise following the success of the Transformer [4] in machine translation. Compared to classical models like RNNs, Transformers are prominent in recognizing relations among segments of long text sequences. Furthermore, positional information is also incorporated by attention mechanisms, so that relationships among segments at any position can be extracted.
Inspired by the classical Transformer [4] for machine translation on benchmark datasets, subsequent models attain strong performance in domains including image classification [5,6], machine translation, and video processing [7], even with smaller datasets [8] under limited resources. However, for large models with millions of parameters, the cost of attention architectures becomes prohibitive on gigantic datasets [9]. To alleviate computational constraints, a series of Transformer-based architectures has been developed to improve performance in particular scenarios. Through careful pretraining procedures, large models are expected to produce excellent results in downstream tasks with smaller datasets.
The hidden potential of Transformers promotes the boom of deep learning models. At present, Transformer-based architectures are still at an early stage and need improvements in various aspects. Several popular Transformer-based models, such as FEDformer [10], Autoformer [11], Reformer [12], and Informer [13], have achieved state-of-the-art performance with sophisticated architectures that replace the full attention mechanism. However, to reduce the computational complexity of large-scale models, these models do not fully consider dependencies in both the temporal domain and the dimensions. Some of them, such as [14], capture only the temporal dependencies within each univariate separately. To address these problems, we propose a novel approach that captures dependencies across both time and dimensions. Compared to the preceding state of the art, our model (i) considers dependencies for the time series at each time point as well as correlations among the univariates, (ii) contains a novel cross-time self-attention algorithm that extracts the significance of a variate, together with the other univariates, over a given history, and (iii) proposes an efficient encoder–decoder architecture for forecasting that is benchmarked against previous Transformer-based models. To compare with baselines, we conduct experiments on the benchmark datasets used by state-of-the-art models, and the experimental results show the effectiveness of our model on MTSF problems.
The rest of the paper is organized as follows: Section 2 gives an overall literature review of models related to time series forecasting. Section 3 elaborates the details of our method, including the architecture and mathematical formulation. In Section 4, we present experiments implementing our proposed method, including datasets, the experimental setup, and an analysis and discussion of the results. Finally, we conclude our work in Section 5. In Appendix A, we provide the pseudocode for our method and present comprehensive results for the baselines as well as ablation studies.

2. Related Work

Research on time series tasks has made significant progress through the development of various architectures, ranging from early statistical methods to deep learning approaches such as Recurrent Neural Networks (RNNs) [15,16], Graph Neural Networks (GNNs) [17], Long Short-Term Memory (LSTM) networks [18], and, more recently, Transformer-based models. To address the lack of large-scale datasets in specific domains, generative pretraining on data from various domains through Foundation Models (FMs) enables effective transfer learning by fine-tuning on limited domain-specific data. Specifically, representation learning explores the characteristics and periodicity of time series, by which the models can be applied to downstream tasks. Compared to long text sequences, time series data are more complicated, exhibiting seasonal trends and continuous values rather than sequences of vocabulary tokens carrying semantic information. To achieve effective learning under these challenges, a series of new models, along with enhanced algorithms based on existing models and optimization techniques, have been proposed.
Multivariate Time Series Forecasting (MTSF). Multivariate time series consist of multiple univariates, represented as sequences of numerical values for different specific dimensions over the time domain. For example, an MTSF dataset for climate consists of information about climate conditions including, but not limited to, temperature, humidity, air pressure, wind speed, etc., over a period in a particular region, recorded in diverse time units such as minutes, hours, days, weeks, and so on. However, as mentioned earlier, MTSF over a long-term history poses many challenges. Unlike text sequences in NLP and image pixels in CV, Multivariate Time Series (MTS) contain temporal dependencies over time as well as correlations between dimensions, in contrast to univariate time series tokens. Thus, to address the challenges in MTSF tasks, novel methods proposed for representation learning explore trends in time series over the time domain and decomposition strategies that split MTS into smaller segments or patches for efficient computation [10,19].
Classical Architectures. Explorations into time series forecasting originate from the early classic methods [20]. Different from image pixels in CV or sequences of text tokens in NLP, performance of time series tasks is affected by seasonal patterns over time. To address such challenges, a variety of architectures [15,16,17,18] for time series were proposed over the last two decades. In the early stages, a series of traditional statistical methods were proposed, including Exponential Smoothing (ETS) [21] and Auto-Regressive Moving Averages (ARMA) [22]. While such classical benchmarks have been widely used in early studies, several recent works on Transformer-based time series forecasting [10,11,13] have adopted the convention of comparing only against state-of-the-art deep learning models, not statistical or naive methods. This is primarily because (i) the performance gap between statistical models and modern deep architectures on high-dimensional, non-linear, and multivariate tasks is already well-established, (ii) Transformer-based models are explicitly designed to address the shortcomings of both traditional statistical and RNN-based methods by modeling long-range dependencies and cross-variable interactions using attention mechanisms, and (iii) Statistical baselines such as ARIMA or ETS are inherently incapable of handling large-scale multivariate sequences with missing values, non-stationarity, or spatio-temporal structures, which are central to the current task formulations.
In recent years, the extensive development of deep learning has led to the emergence of models for time series tasks, too. For example, recurrent-neural-network-based methods such as DeepAR [23] and LSTNet [24] are built on the RNN architecture, which feeds the output of one time step back as the input of the next. To address backpropagation through long recurrences, LSTM [18] was proposed with a computational complexity of O(1) per time step, and subsequent variants, such as the Gated Recurrent Unit (GRU) [25], simplify the LSTM framework and further reduce the computational cost. To adapt to time series, convolutional-neural-network-based methods such as TCN [25] include convolutions for learning dependencies. Among graph-neural-network-based methods, MTGNN [17], StemGNN [26], AGCRN [27], etc. have been proposed for long time series forecasting.
Transformers. Transformer-based methods have already made significant progress in NLP and CV tasks, following the introduction of the Transformer [4] for machine translation. However, the quadratic computational complexity of vanilla Transformers with respect to sequence length leads to a prohibitive computational burden for long time series. In addition, a more effective architecture is needed to handle long time series with more complicated patterns. To address such challenges, a series of enhanced Transformer-based methods has been proposed for MTSF. For example, Autoformer [11] proposes a novel decomposition algorithm with autocorrelation for long-term forecasting, capturing dependencies and seasonal patterns with decomposition blocks within the encoder. In contrast, FEDformer [10] proposes a different seasonal-trend decomposition block with the Fourier transform to address the lack of a global view of the time series caused by expensive memory costs. To deal with the memory bottleneck of full attention, recent works propose efficient models to capture intricate patterns in MTS. For example, Ref. [28] proposes the LogSparse Transformer with a cost of O(L(log L)^2) and convolutional self-attention that produces queries and keys by causal convolution, enabling local context to be incorporated into attention. Later, Informer [13] further broke the memory bottleneck with the ProbSparse self-attention mechanism, achieving O(L log L) computational complexity. Furthermore, FEDformer [10] and Pyraformer [29] achieve O(L) complexity with respect to the sequence length L: the former proposes attention-based blocks for decomposition and pattern capture, and the latter introduces a pyramid-based architecture called the Pyramidal Attention Module (PAM) that learns intra- and inter-scale dependencies at different resolutions. Interestingly, PatchTST [14] splits the input time series into patches and separates channels, learning features with shared embedding and Transformer blocks. Similarly, Crossformer [30] also splits time series sequences into segments and learns dependencies through intermediate routers for efficient computation. In addition, other Transformer-based models such as [9,12] achieve state-of-the-art performance through novel self-attention methods for downstream tasks. A host of related recent architectures can be found in [31].
Despite significant progress in time series forecasting models, there are still challenges that need further improvements. For example, N-BEATS [32] handles only univariate time series forecasting with residual links and deep fully connected layers. PatchTST [14] separates multivariate time series into independent channels that share the same embedding and Transformer weights; however, independent channels are unable to interact with each other. Crossformer [30] attempts to learn relationships between segments but only through a few intermediate routers for efficient computation, which cannot fully capture dependencies. Other methods such as Autoformer [11] and FEDformer [10] solve MTSF problems with complicated algorithms for decomposition or for partial attention among all segments.
To address these challenges, we propose a new method to bridge the gap. This new method makes the following contributions to MTSF: (1) it proposes an effective novel Transformer-based method for multivariate time series forecasting, (2) it enhances the performance of the state of the art with a novel strategy that efficiently captures cross-time and cross-dimension dependencies, and (3) it offers a foundation model for future related tasks in the time series domain.

3. Methodology

Multivariate Time Series Forecasting (MTSF) aims to predict future values of a time series given a certain history. In MTSF, time series have multiple univariates, each of which represents a variable or an attribute depending on the scenario. With well-organized time series data, forecasting can be performed at different temporal granularities. In cases where long-term dependencies are necessary, time series at fine granularities are recorded over a long history. To perform long-term forecasting, these dependencies must be captured, and the temporal granularity varies in practice, ranging from hours for stock prices to days for weather forecasting.
Problem Definition. Suppose the given history is $x_{1:T} \in \mathbb{R}^{T \times D}$, where $T$ is the number of past time steps and $D$ is the dimension at each time step. The goal of MTSF is to predict the future values of the time series $x_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$, where $\tau$ is the number of future time steps. Specifically, in MTSF, $D$ is greater than 1.
To process multivariate time series, we first split the input time series into segments and then embed each segment into the embedded dimension. Then, inspired by Crossformer [30], we propose a new enhanced Two-Stage Attention (TSA) that considers the dependencies of one dimension on its own history and on the other dimensions. Like the classical Transformer [4], our model adopts an encoder–decoder structure equipped with the enhanced TSA. Like the Hierarchical Encoder–Decoder (HED) structure constructed in [30], our model captures information in the Multivariate Time Series (MTS) at different scales with multiple blocks of encoders and decoders before the final forecast.
Figure 1 above depicts the overall architecture of our model with the main components, each of which is expanded and further illustrated in subsequent sections.

3.1. Time Series Segment-Wise Embedding

A multivariate time series over $T$ time steps is represented by a $T \times D$ array $x_{1:T} \in \mathbb{R}^{T \times D}$, where $T$ is the number of historical time steps and $D$ is the dimension, i.e., the number of univariates in the MTS. Like the Transformers of the previous state of the art [5,8,9,10,11,12,13,14], the MTS is embedded into a larger dimension before the attention stages that capture dependencies across time. That is, given an input MTS $x_{1:T} \in \mathbb{R}^{T \times D}$, the algorithm embeds the time series point at each time step into an embedded vector: $x_t \rightarrow h_t$, $x_t \in \mathbb{R}^{D}$, $h_t \in \mathbb{R}^{d_{model}}$, where $1 \le t \le T$, and $x_t$ and $h_t$ denote the time series vector and the embedded vector at time $t$. Thus, an MTS $x_{1:T} \in \mathbb{R}^{T \times D}$ is embedded into $T$ vectors $\{h_1, h_2, h_3, \ldots, h_T\}$. Figure 2 below shows the process of segmentation for embedded vectors.
Dimension-Wise Segmentation. We propose a segment-wise method that embeds the time series in segments. That is, given an input MTS $x_{1:T} \in \mathbb{R}^{T \times D}$ and a segment length $L_{seg}$, our method first transposes the input and represents it as a $D$-dimensional collection $x_{1:D} = \{x_1, x_2, x_3, \ldots, x_D\}$, where each $x_d \in \mathbb{R}^{T}$, $1 \le d \le D$, contains all values of dimension $d$ across the $T$ time steps.
Then, given the segment length $L_{seg}$, each vector $x_d \in \mathbb{R}^{T}$ is divided into $T / L_{seg}$ segments, assuming that $T$ is divisible by $L_{seg}$. The mathematical representation of the segments in each dimension is shown in Equation (1):
$$x_{1:T}^{\top} = x_{1:D} = \{\, x_{i,s} \mid 1 \le i \le D,\; 1 \le s \le T / L_{seg} \,\},$$
where $i$ denotes the $i$-th dimension of the MTS and $s$ indexes the $s$-th segment within dimension $i$, so the set collects all temporal segments of every dimension. Mathematically, $x_{1:D}$ is the transpose of $x_{1:T}$.
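For concreteness, the segmentation in Equation (1) amounts to a transpose followed by a reshape of the input array. The snippet below is a minimal sketch under an assumed PyTorch layout of (batch, T, D); the function and variable names are illustrative, not the authors' implementation.

```python
import torch

def segment_dimension_wise(x: torch.Tensor, seg_len: int) -> torch.Tensor:
    """Split x of shape (B, T, D) into per-dimension segments of length seg_len.

    Returns a tensor of shape (B, D, T // seg_len, seg_len), i.e. the segments
    x_{i,s} of Equation (1), assuming T is divisible by seg_len.
    """
    B, T, D = x.shape
    assert T % seg_len == 0, "T must be divisible by the segment length"
    x = x.transpose(1, 2)                         # (B, D, T): group values by dimension
    return x.reshape(B, D, T // seg_len, seg_len)

# Example: T = 24 time steps, D = 7 variables, segment length L_seg = 6.
segs = segment_dimension_wise(torch.randn(32, 24, 7), seg_len=6)
print(segs.shape)  # torch.Size([32, 7, 4, 6])
```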
Cross-Time Dimension Embedding. After segmentation of the input MTS, all segments are embedded into vectors of a higher dimension $d_{model}$, referred to as the embedded dimension. For each $x_{i,s} \in \mathbb{R}^{L_{seg}}$, i.e., the $s$-th segment of the $i$-th dimension, we follow the embedding algorithm in [30] and embed the segment by a linear projection followed by a position embedding, as shown in Equation (2):
$$h_{i,s} = E\, x_{i,s} + E_{i,s}^{(pos)},$$
In Equation (2), $E \in \mathbb{R}^{d_{model} \times L_{seg}}$ is the learnable embedding matrix for the linear projection and $E_{i,s}^{(pos)} \in \mathbb{R}^{d_{model}}$ is the learnable position embedding for the segment at position $(i, s)$. After the embedding stage, the input MTS is embedded into a 2D array $H = \{ h_{i,s} \in \mathbb{R}^{d_{model}} \mid 1 \le i \le D,\; 1 \le s \le T / L_{seg} \}$, where $h_{i,s}$ is the embedding vector of segment $x_{i,s}$, i.e., the segment of the $i$-th dimension over a time span of length $L_{seg}$, and $H \in \mathbb{R}^{D \times (T/L_{seg}) \times d_{model}}$ is the segment-wise embedding of the MTS.
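The embedding in Equation (2) is a per-segment linear projection plus a learnable position embedding for each position (i, s). Below is a minimal PyTorch sketch assuming the (batch, D, T/L_seg, L_seg) segment layout from the previous snippet; module and argument names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class SegmentEmbedding(nn.Module):
    def __init__(self, n_dims: int, n_segs: int, seg_len: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(seg_len, d_model)                         # plays the role of E
        self.pos = nn.Parameter(torch.zeros(n_dims, n_segs, d_model))   # E^(pos)_{i,s}

    def forward(self, segs: torch.Tensor) -> torch.Tensor:
        # segs: (B, D, T/L_seg, L_seg) -> H: (B, D, T/L_seg, d_model)
        return self.proj(segs) + self.pos

embed = SegmentEmbedding(n_dims=7, n_segs=4, seg_len=6, d_model=256)
H = embed(torch.randn(32, 7, 4, 6))
print(H.shape)  # torch.Size([32, 7, 4, 256])
```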

3.2. Two-Stage Attention

The 2D embedded array obtained from the embedding stage is used for self-attention in Transformers [5], with various algorithms for attention in encoder and decoder blocks. Given a 2D embedding array, classical Transformers like ViT [5] and Swin Transformer [33] flatten the 2D array into a 1D sequence and feed it into Transformer blocks. However, simply flattening the embedding array of a long-term time series into a 1D sequence is prohibitive for the attention stage, incurring a computational cost of O(N^2) for input length N.
We propose a novel enhanced Two-Stage Attention (TSA) consisting of a multistep process with Multi-head Self-Attention (MSA), which considers attention both for a univariate over its own past and for its influence on the other univariates. Compared to the TSA proposed in [30], our method uses two stages to capture dependencies in both cross-time and cross-dimension directions over segments of embedding vectors, instead of intermediate routers that collect information to reduce complexity. The overall architecture of TSA is shown in Figure 3.
Attention. Attention [4] maps a query and a set of key–value pairs to an output computed as a weighted sum of the values, where the query, keys, and values are all vectors. Given a query $Q$ together with keys $K$ and values $V$, the output of attention is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
To use attention in our method, we follow the operations in [4] for the attention function; further details can also be found in [4].
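As a concrete reference, Equation (3) can be written in a few lines; the sketch below assumes a (batch, length, d) tensor layout and is purely illustrative.

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q: (B, L_q, d_k), K: (B, L_k, d_k), V: (B, L_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (B, L_q, L_k)
    return torch.softmax(scores, dim=-1) @ V        # (B, L_q, d_v)

out = attention(torch.randn(2, 4, 64), torch.randn(2, 6, 64), torch.randn(2, 6, 64))
print(out.shape)  # torch.Size([2, 4, 64])
```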
Multi-head Self-Attention. To perform multiple attention functions on queries, keys, and values simultaneously, MSA applies learnable linear projections of dimensions $d_k$, $d_k$, and $d_v$, respectively, to the queries, keys, and values. Each attention function has its own projected version of the queries, keys, and values and produces output values of dimension $d_v$. With $h$ attention functions working in parallel, the outputs are concatenated and further projected to form the final value.
Compared with single-head attention, MSA attends to information from different representation subspaces at different positions. Given $h$ attention functions in parallel, the mathematical representation of MSA is as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O},$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$
where $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$ are the weight matrices of the linear projections for the query $Q$, key $K$, and value $V$, and $h$ is the number of parallel attention functions, or heads. In the general case, one uses $d_k = d_v = d_{model} / h$ for the dimensions of the $Q$, $K$, and $V$ weights of each head.
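The sketch below spells out Equations (4) and (5) with h parallel heads and d_k = d_v = d_model / h, as stated above. It is an illustrative re-implementation (torch.nn.MultiheadAttention provides an equivalent built-in); it is not the authors' code.

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks W_i^Q for all heads
        self.w_k = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        B, L_q, _ = q.shape
        L_k = k.size(1)
        def split(x, L):  # (B, L, d_model) -> (B, h, L, d_k)
            return x.view(B, L, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(q), L_q), split(self.w_k(k), L_k), split(self.w_v(v), L_k)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5      # per-head attention scores
        heads = torch.softmax(scores, dim=-1) @ V               # (B, h, L_q, d_k)
        heads = heads.transpose(1, 2).reshape(B, L_q, -1)       # Concat(head_1, ..., head_h)
        return self.w_o(heads)

msa = MSA(d_model=256, n_heads=4)
x = torch.randn(2, 4, 256)
print(msa(x, x, x).shape)  # torch.Size([2, 4, 256])
```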
Cross-Time Stage. The embedded segment vectors from the embedding layer are processed by TSA operations within an encoder–decoder structure in multiple stages. Instead of using intermediate routers with MSA as in Crossformer [30], we propose a novel two-stage attention that applies MSA, in order, to the embeddings of each univariate over its own history and then across the other univariates. In the first stage, our method performs MSA over the segments of each univariate separately. Given the 2D embedding $H = \{ h_{i,s} \in \mathbb{R}^{d_{model}} \mid 1 \le i \le D,\; 1 \le s \le T/L_{seg} \}$, the output of MSA for each dimension is shown in Equation (6) below:
$$\mathrm{crosstime}_i = \mathrm{MSA}(h_{i,:}, h_{i,:}, h_{i,:}), \quad 1 \le i \le D,$$
Specifically, $h_{i,:}$ denotes the embedded segment vectors of the $i$-th dimension, i.e., $h_{i,:} = \{ h_{i,s} \in \mathbb{R}^{d_{model}} \mid 1 \le s \le T/L_{seg} \}$ for each $i$ between 1 and $D$.
Using MSA, we follow a structure similar to the cross-time stage in [30], but with our method handling the attention of each univariate over its given history:
$$\tilde{Y}_{i,:}^{time} = \mathrm{LayerNorm}(h_{i,:} + \mathrm{crosstime}_i),$$
$$Y_{i,:}^{time} = \mathrm{LayerNorm}(\tilde{Y}_{i,:}^{time} + \mathrm{MLP}(\tilde{Y}_{i,:}^{time})),$$
Equation (8) produces the output of the stage for each univariate. To obtain the final output of the cross-time stage, all $Y_{i,:}^{time}$ are concatenated as follows:
$$Y^{time} = \mathrm{Concat}(Y_{1,:}^{time}, Y_{2,:}^{time}, \ldots, Y_{D,:}^{time}),$$
The output of this stage has the same shape as the 2D input embedding $H$, such that $Y^{time} \in \mathbb{R}^{D \times d_{model}}$.
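A minimal sketch of the cross-time stage (Equations (6)–(9)) is given below. It assumes a (batch, D, T/L_seg, d_model) embedding layout, uses torch.nn.MultiheadAttention in place of the MSA block, and fixes the MLP width arbitrarily; names and shapes beyond the equations themselves are assumptions.

```python
import torch
import torch.nn as nn

class CrossTimeStage(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int = 512):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, D, n_segs, d_model); attend over the segments of each dimension separately.
        B, D, S, d = H.shape
        h = H.reshape(B * D, S, d)
        attn_out, _ = self.msa(h, h, h)       # Equation (6), one univariate at a time
        y = self.norm1(h + attn_out)          # Equation (7)
        y = self.norm2(y + self.mlp(y))       # Equation (8)
        return y.reshape(B, D, S, d)          # stacked over dimensions, Equation (9)

stage = CrossTimeStage(d_model=256, n_heads=4)
print(stage(torch.randn(2, 7, 4, 256)).shape)  # torch.Size([2, 7, 4, 256])
```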
Cross-Dimension Stage. In our method, the second stage of TSA captures dependencies between each univariate and the other univariates over the time domain. Instead of a small number of intermediate routers, we apply MSA to $Y^{time}$ from the cross-time stage together with the embedded vectors of the other univariates. To reduce the computational cost, the groups of segments of the other univariates are linearly projected from dimension $D-1$ down to 1, resulting in a computational complexity of O(D) instead of O(D^2).
This stage consists of two sub-stages. The first sub-stage takes the 2D embedding $H$ and extracts the segment embeddings of the other univariates. That is, for the $i$-th dimension, the segment embeddings of every dimension except $i$ are extracted for the attention function as follows:
$$H_{-i} = \{\, h_{d,s} \in H \mid 1 \le d \le D,\ d \ne i,\ 1 \le s \le T/L_{seg} \,\},$$
In other words, $H_{-i}$ consists of all segment embeddings in $H$ except those of the $i$-th dimension, so that $H_{-i} \in \mathbb{R}^{(D-1) \times d_{model}}$ and, for each $i$, the vectors $h_{i,1}, h_{i,2}, \ldots, h_{i,s}$ with $1 \le s \le T/L_{seg}$ are omitted from $H_{-i}$. To reduce the complexity, we apply a linear projection with learnable weights to $H_{-i}$ before the MSA of the next sub-stage. Given $H_{-i} \in \mathbb{R}^{(D-1) \times d_{model}}$ for each $i$, the linear projection is described by the following equation:
$$\bar{Y}_i = W_i H_{-i} + b,$$
where $W_i \in \mathbb{R}^{d_{model} \times (D-1)}$ and $b \in \mathbb{R}^{d_{model}}$ denote the weight matrix and bias for each $H_{-i}$.
The second sub-stage takes the linear projection of $H_{-i}$ together with the output of the cross-time stage $Y^{time}$ and applies MSA to capture dependencies on the other univariates over the time domain. With the output $\bar{Y}_i$ of the previous sub-stage as input for each $i$, we adopt a structure similar to Crossformer, combined with our method for MSA. The mathematical representation of this stage is described below:
$$T_i = \mathrm{MSA}(Y_{i,:}^{time}, \bar{Y}_i, \bar{Y}_i),$$
$$\tilde{Y}_i^{dim} = \mathrm{LayerNorm}(Y_{i,:}^{time} + T_i),$$
$$Y_i^{dim} = \mathrm{LayerNorm}(\tilde{Y}_i^{dim} + \mathrm{MLP}(\tilde{Y}_i^{dim})),$$
$Y_i^{dim}$ is the output of the cross-dimension stage for each dimension $i$. By concatenating all $Y_i^{dim}$ for $1 \le i \le D$, the final output of TSA is obtained, as shown below:
$$\mathrm{TSA}(H) = Y^{dim} = \mathrm{Concat}(Y_1^{dim}, Y_2^{dim}, \ldots, Y_D^{dim}),$$
where $Y^{dim} \in \mathbb{R}^{D \times d_{model}}$.
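The cross-dimension stage can be sketched in the same style. The snippet below is one plausible reading of Equations (10)–(15), in which, for each dimension i, the segments of the remaining D-1 dimensions are collapsed by a learnable projection before being attended to by the cross-time output; the exact projection shapes may differ from the paper, so treat this as an assumption-laden illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class CrossDimStage(nn.Module):
    def __init__(self, n_dims: int, d_model: int, n_heads: int, d_ff: int = 512):
        super().__init__()
        # One learnable projection W_i (collapsing the D-1 other dimensions) per dimension, plus a shared bias b.
        self.w = nn.Parameter(torch.randn(n_dims, n_dims - 1) / (n_dims - 1) ** 0.5)
        self.b = nn.Parameter(torch.zeros(d_model))
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, Y_time: torch.Tensor) -> torch.Tensor:
        # Y_time: (B, D, n_segs, d_model), the output of the cross-time stage.
        B, D, S, d = Y_time.shape
        outputs = []
        for i in range(D):
            others = torch.cat([Y_time[:, :i], Y_time[:, i + 1:]], dim=1)    # H_{-i}: (B, D-1, S, d)
            y_bar = torch.einsum("k,bksd->bsd", self.w[i], others) + self.b  # Equation (11)
            q = Y_time[:, i]                                                 # (B, S, d)
            t_i, _ = self.msa(q, y_bar, y_bar)                               # Equation (12)
            y = self.norm1(q + t_i)                                          # Equation (13)
            y = self.norm2(y + self.mlp(y))                                  # Equation (14)
            outputs.append(y)
        return torch.stack(outputs, dim=1)                                   # Equation (15)

stage = CrossDimStage(n_dims=7, d_model=256, n_heads=4)
print(stage(torch.randn(2, 7, 4, 256)).shape)  # torch.Size([2, 7, 4, 256])
```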

3.3. Hierarchical Encoder–Decoder

The encoder–decoder structure is adopted by the Transformer and its state-of-the-art successors [3,5,10,12,13,14,29,30] in MTSF to capture dependencies in information at different scales. We also follow the Hierarchical Encoder–Decoder (HED) structure of [30], built on our TSA method. With HED blocks, information at different scales, from fine to coarse, is used for forecasting.
Encoder. To construct the encoder of our model, we propose a simple algorithm that applies TSA in each layer. The output of each layer is modeled as $Y^{enc,l} = \mathrm{Encoder}(Y^{enc,l-1})$ and represented as:
$$Y^{enc,l} = \begin{cases} H, & l = 1 \\ \mathrm{TSA}(Y^{enc,l-1}), & 1 < l \le L \end{cases}$$
where $H$ denotes the embedding obtained after dimension-wise segmentation and $L$ is the number of encoder layers. As the equation shows, for $l = 1$ the output is simply $H$; for $l > 1$, the output of layer $l$ is the TSA of the output of layer $l-1$.
Decoder. Following the HED structure, we construct the decoder for forecasting with the same number of layers as the encoder. The input of the decoder at layer $l$ depends on the output of both the encoder at layer $l$ and the decoder at layer $l-1$, so the decoder at layer $l$ can be modeled as $Y^{dec,l} = \mathrm{Decoder}(Y^{enc,l}, Y^{dec,l-1})$, with the mathematical representation shown below:
$$\tilde{Y}^{dec,l} = \begin{cases} \mathrm{TSA}(E^{dec}), & l = 1 \\ \mathrm{TSA}(\tilde{Y}^{dec,l-1}), & l > 1 \end{cases}$$
$$\bar{Y}_d^{dec,l} = \mathrm{MSA}(\tilde{Y}_d^{dec,l}, Y_d^{enc,l}, Y_d^{enc,l}), \quad 1 \le d \le D,$$
$$\bar{Y}^{dec,l} = \mathrm{Concat}(\bar{Y}_1^{dec,l}, \bar{Y}_2^{dec,l}, \ldots, \bar{Y}_D^{dec,l}),$$
$$\hat{Y}^{dec,l} = \mathrm{LayerNorm}(\tilde{Y}^{dec,l} + \bar{Y}^{dec,l}),$$
$$Y^{dec,l} = \mathrm{LayerNorm}(\hat{Y}^{dec,l} + \mathrm{MLP}(\hat{Y}^{dec,l})),$$
Specifically, $E^{dec} \in \mathbb{R}^{D \times (\tau / L_{seg}) \times d_{model}}$ is the learnable position embedding used as the decoder input when $l = 1$, and at layer $1 < l \le L$, $\tilde{Y}^{dec,l}$ is the output of TSA applied to $\tilde{Y}^{dec,l-1}$. Then, $\tilde{Y}^{dec,l}$ is processed dimension-wise by MSA with $Y_d^{enc,l}$ as the key and value, linking it to the corresponding encoder output for dimension $d$ at layer $l$. Afterwards, the $\bar{Y}_d^{dec,l}$ are concatenated and passed through the residual connection and MLP within the two LayerNorms.
The prediction of the model is produced as the sum of the predictions of each decoder layer, obtained with learnable linear projections. Let $W^{l}$ denote the weight matrix of layer $l$; the MTSF prediction of the future $\tau$ time steps $x_{T+1:T+\tau}$, given a history of $T$ time steps, is represented as:
$$\mathrm{pred}(x_{1:T}) = x_{T+1:T+\tau} = \sum_{l=1}^{L} W^{l}\, Y^{dec,l}, \qquad W^{l} \in \mathbb{R}^{L_{seg} \times d_{model}},$$
The Multivariate Time Series (MTS) in our method is processed in segments, allowing the prediction of the future $\tau$ steps to also be represented in segmented form: $x_{T+1:T+\tau} = \{ x_{i,d} \in \mathbb{R}^{L_{seg} \times d_{model}} \mid 1 \le i \le \tau / L_{seg},\ 1 \le d \le D \}$. Specifically, for each layer $l$, $x_{T+1:T+\tau}^{l} = \{ x_{i,d}^{l} \in \mathbb{R}^{L_{seg} \times d_{model}} \mid 1 \le i \le \tau / L_{seg},\ 1 \le d \le D,\ 1 \le l \le L \}$.
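The prediction step in Equation (22) reduces to one linear head per decoder layer whose per-segment outputs are summed and then unfolded back onto the time axis. The following sketch, with assumed shapes and names, illustrates that step only (the surrounding HED wiring is omitted).

```python
import torch
import torch.nn as nn

class LayerwisePredictionHead(nn.Module):
    def __init__(self, n_layers: int, seg_len: int, d_model: int):
        super().__init__()
        # One learnable W^l per decoder layer, mapping d_model back to L_seg values.
        self.heads = nn.ModuleList(nn.Linear(d_model, seg_len) for _ in range(n_layers))

    def forward(self, dec_outputs: list) -> torch.Tensor:
        # dec_outputs[l]: (B, D, n_out_segs, d_model) from decoder layer l.
        pred = sum(head(y) for head, y in zip(self.heads, dec_outputs))  # (B, D, n_out_segs, L_seg)
        B, D, S, L = pred.shape
        return pred.reshape(B, D, S * L).transpose(1, 2)                 # (B, tau, D)

head = LayerwisePredictionHead(n_layers=3, seg_len=6, d_model=256)
outs = [torch.randn(2, 7, 4, 256) for _ in range(3)]
print(head(outs).shape)  # torch.Size([2, 24, 7])
```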

4. Experiments

The proposed model is implemented on four real-world time series datasets widely used by the baseline models, and its results are compared with those of state-of-the-art Transformers for MTSF.
In our experiments, we describe the model configurations, present the results of forecasting tasks on benchmark datasets, and validate the effectiveness of the proposed method. Like previous baselines, we use Mean Square Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics.
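For reference, the two metrics are computed as follows (standard definitions, not code taken from the paper).

```python
import numpy as np

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))

y_true = np.array([[0.1, 0.3], [0.2, 0.5]])
y_pred = np.array([[0.0, 0.4], [0.3, 0.5]])
print(mse(y_pred, y_true), mae(y_pred, y_true))  # approximately 0.0075 and 0.075
```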

4.1. Datasets

We choose four MTS datasets as real-world benchmarks widely used in the state of the art [15,16,17] to evaluate the proposed model and report performance with the same metrics: (1) Electricity Transformer Temperature (ETT-small), (2) Weather (WTH), (3) Influenza-Like Illness (ILI), and (4) Exchange-Rate. However, unlike [30], we split all chosen datasets with a train/validation/test ratio of 0.6:0.2:0.2. Configurations and details of all used datasets are shown in Table 1.
Electricity Transformer Temperature (ETT-small). ETT-small consists of four subsets: ETTh1, ETTh2, ETTm1, and ETTm2. It records data for an electricity transformer over two years, with seven attributes at each time step including load and oil temperature, at hourly (ETTh1, ETTh2) and minute-level (ETTm1, ETTm2) granularity. Full versions of the datasets and further details are available in [11].
Weather (WTH). WTH contains the meteorological information of weather in over 1600 locations in the US in 4 years between 2010 and 2013. It consists of over 10 features including temperature, humidity, wind velocity, wind degree, and so on, every hour.
Influenza-Like Illness (ILI). ILI contains information about influenza patients in multiple states in the United States recorded by the Centers for Disease Control (CDC) from 2002 to 2020. It includes patients’ personal information, outpatient illness, and viral surveillance at national, regional, and state levels. The latest information and visualization about influenza tests in various areas can be viewed in [16].
Exchange-Rate. The Exchange-Rate dataset includes the daily exchange rates of eight countries between 1990 and 2010. More information can be found in [11].

4.2. Setup

We conduct experiments with the proposed method on the real-world MTS datasets, processing inputs in batches on a machine with 4 GPUs, each with 24 GB of memory. To evaluate the performance of our method, we use Mean Square Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics.
Baselines. To show the effectiveness of our method, we use the recent models for MTSF as baseline methods, including (1) LSTnet, (2) Transformer, (3) Autoformer, (4) Informer, (5) Reformer, (6) FEDformer, (7) Pyraformer.
Model Parameters. For the four real-world benchmark datasets, we follow configurations similar to the Crossformer model [30] for segment length, input length, and prediction length. By default, input time series are processed in batches of size $B = 32$, the number of encoder layers is 3, the number of MSA heads is 4, and $d_{model} = 256$. For ECL and Traffic, $d_{model}$ is also set to 64. Specifically, for the datasets ETTh1, ETTh2, WTH, ECL, and Traffic, the input length is chosen from {24, 48, 96, 168, 336, 720}; for ETTm1 and ETTm2, from {24, 48, 96, 192, 288, 672}; for ILI, from {24, 36, 48, 60}; and for Exchange-Rate, from {24, 36, 48, 60, 96, 168, 336, 720}. By default, the segment length is chosen from {6, 4, 24} and the batch size is 32. The learning rate is chosen from $\{5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}, 10^{-5}\}$, with $10^{-4}$ as the default. For short-term prediction, the prediction length is chosen from {4, 6, 24, 48}, whereas for long-term prediction it is larger than 48, chosen from {168, 336, 720}. The number of epochs is 20.
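For readability, the default setup described above can be collected into a single illustrative configuration (not the authors' actual configuration file).

```python
# Default experiment configuration summarized from Section 4.2 (illustrative only).
default_config = {
    "batch_size": 32,
    "n_encoder_layers": 3,
    "n_heads": 4,
    "d_model": 256,                 # 64 is also used for ECL and Traffic
    "seg_len_choices": [6, 4, 24],
    "learning_rate": 1e-4,          # searched over {5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5}
    "epochs": 20,
    "input_len_choices": {
        "ETTh1/ETTh2/WTH/ECL/Traffic": [24, 48, 96, 168, 336, 720],
        "ETTm1/ETTm2": [24, 48, 96, 192, 288, 672],
        "ILI": [24, 36, 48, 60],
        "Exchange-Rate": [24, 36, 48, 60, 96, 168, 336, 720],
    },
    "pred_len_short": [4, 6, 24, 48],
    "pred_len_long": [168, 336, 720],
}
print(default_config["learning_rate"])
```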

4.3. Model Results

We show the results of our new method with configurations under four real-world datasets as well as the performance of the state of the art including traditional models and Transformer-based methods. The full list of results is presented in the Appendix A.3.
Table 2 presents the results of our proposed model for different prediction lengths, using MSE and MAE as evaluation metrics. The model configurations, including other parameters such as input length, segment length, and batch size, follow Section 4.2. In this section, we only show the results of short-term forecasting at different prediction lengths for the four datasets. For each prediction length, we choose and show the best result among all model configurations. Detailed information about performance versus model parameters for all configurations is given in Appendix A.2.
To show the effectiveness of our method, Table 3 lists the results of state-of-the-art baselines, including Transformers and typical traditional methods; the results of our method are shown under SCF. In this case, long-term forecasting with longer prediction lengths is additionally implemented for our method to align with the other baselines for comparison. For more details about specific baselines, refer to [31]. A comprehensive list of results is also presented in Appendix A.3.

4.4. Analysis and Discussion

Table 2 and Table 3 show the results of our method and its comparison with the other baselines. As shown in the tables, our method produces excellent results on many benchmark datasets, as expected. As explained in the previous section, our method has the advantage of fully considering all dimensions across the two stages of cross-time and cross-dimension attention. Unlike the LogSparse Transformer [28], our method has O(L) computational complexity while still attending over all univariates at each time point. Table 3 shows that our method yields the best results on the WTH dataset compared to the other baseline models, demonstrating its effectiveness on long-term multivariate time series. As mentioned in Section 4.1, the WTH dataset contains more types of univariates that record more information over a long history. On the other hand, the advantage of our two-stage attention is reflected in long-term time series: it fully captures dependencies and patterns through self-attention between segments for all dimensions across time. In addition, with a large-scale time series dataset, our method is better suited to long predictions. For example, as shown in Table 2, our method produces the best results for long-term prediction with prediction lengths of 288 and 672 on ETTm1, implying that, with sufficient historical information, our method can be more powerful for forecasting over long horizons.
However, on some datasets the method does not provide the best results, for a few reasons. First, for efficient computation, our method applies a simple linear projection in TSA, implicitly assuming linear relations for the weights after each stage; however, the complex, irregular distributions of many multivariate time series do not follow such linear relationships, and addressing non-linear relationships will require a new algorithm in future work. Secondly, MSA is performed over time and then over dimensions in order, which may fail to learn dependencies between segments of two different dimensions at different time steps. Thirdly, the choice of parameters such as input length and segment length may affect the performance of our method; for example, a short input length cannot support long predictions due to insufficient historical information. According to the experimental results, although our method does not achieve the best results on all datasets, it consistently either outperforms the other state-of-the-art baselines or performs comparably, demonstrating the effectiveness of our novel approach over existing methods.

5. Conclusions

We propose a novel Transformer-based method for multivariate time series forecasting and demonstrate its effectiveness over previous state-of-the-art approaches on commonly used benchmark datasets. We deploy models with our method in different configurations and record detailed results in this work. As discussed in the previous section, our method outperforms baseline models through a strategy that captures the dependencies of each dimension on its own past and on the other dimensions across history via multi-head self-attention, connected by simple linear projections within a hierarchical encoder–decoder structure for prediction. Our method contributes to time series forecasting and motivates improvements in models for specific domains. With its success, our novel architecture could be applied across key domains such as weather forecasting, energy systems, environmental monitoring, healthcare, and so on, contributing to reliable long-term forecasting.
However, a few issues limit the performance of our method on some datasets. First, we assume linear relationships across dimensions, so linear projections are used for the weights after the two-stage attention; however, irregular time series patterns may involve more complex relations for which linear projection is not effective. Secondly, for computational simplicity, our method does not model seasonal trends explicitly, so seasonality is not clearly captured in forecasting; visualizing the seasonal trends learned by our method is necessary future work. Thirdly, decomposition methods need further consideration for a better representation of the input time series. We will concentrate on addressing these problems and improving our method toward a foundation model in future work.

Author Contributions

Conceptualization, Z.Y.; methodology, Z.Y.; software, Z.Y.; validation, Z.Y.; formal analysis, Z.Y.; investigation, Z.Y.; resources, T.G.; data curation, Z.Y.; writing—original draft preparation, Z.Y.; writing—review and editing, Z.Y. and T.G.; visualization, Z.Y.; supervision, T.G.; project administration, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sets are available in the public domain.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Algorithms

This section provides the pseudocode for each component of our model architecture, which is described mathematically in Section 3. As elaborated in the Methodology, we present algorithms for dimension-wise segmentation, cross-time embedding, two-stage attention, and the encoder and decoder of the HED structure.
Algorithm A1 SegmentedCrossformer
1: Inputs:
2:     Multivariate Time Series x_{1:T} ∈ R^{T×D} with D dimensions over T time steps
3:     Segment Length: L_seg
4:     Embedded Dimension: d_model
5:     Learnable Embedding Vector E ∈ R^{d_model×L_seg}
6:     Learnable Position Embedding Vector E_pos ∈ R^{d_model}
7:     Number of Heads in MSA: h
8:     Number of Encoder and Decoder Layers: l
9:     Learnable Position Embedding Vector: E_dec
10: Procedure SegmentedCrossformer(x_{1:T}, L_seg, d_model, h, l):
11:     # Initialization
12:     Set Query Q, Key K and Value V
13:     Set Learnable Embedding Vector E ∈ R^{d_model×L_seg}
14:     Set Learnable Position Embedding Vector E_pos ∈ R^{d_model}
15:     Set Learnable Position Embedding Vector E_dec
16:     # Implementation starts from here
17:     Set x_D = DimWiseEmbedding(x_{1:T}, L_seg)
18:     Set H_T = CrossTimeEmbedding(x_D, d_model, E, E_pos)
19:     Set H = TwoStageAttn(H_T)
20:     # Hierarchical Encoder–Decoder to make the prediction
21:     Set Y = HED(H, l, E_dec, D)
22:     return Y
Algorithm A2 Dimension-Wise Embedding
1: Inputs:
2:     Multivariate Time Series x_{1:T} ∈ R^{T×D} with D dimensions over T time steps
3:     Segment Length: L_seg
4: Procedure DimWiseEmbedding(x_{1:T}, L_seg):
5:     Reshape x_{1:T} ∈ R^{T×D} into x_{1:D} ∈ R^{(T/L_seg)×L_seg×D}
6:     Set x = x_{1:D}
7:     return x
Algorithm A3 Cross-Time Embedding
1: Inputs:
2:     Segmented Multivariate Time Series x_{1:D} ∈ R^{(T/L_seg)×L_seg×D}
3:     Embedded Dimension: d_model
4:     Learnable Embedding Vector E ∈ R^{d_model×L_seg}
5:     Learnable Position Embedding Vector E_pos ∈ R^{d_model}
6: Procedure CrossTimeEmbedding(x_{1:D}, d_model, E, E_pos):
7:     Set Embedding Vector H = {}
8:     Set E^(pos)_{i,s} = E_pos where 1 ≤ i ≤ D and 1 ≤ s ≤ T/L_seg
9:     for each segment x_{i,s} in x_{1:D} do:
10:        Set h_{i,s} = E x_{i,s} + E^(pos)_{i,s}
11:        Set H = Concat(H, h_{i,s})
12:    end for
13:    return H
Algorithm A4 Multi-head Self-Attention (MSA)
1: Inputs:
2:     Query: Q, Key: K, Value: V
3:     Number of Heads in MSA: h
4:     Embedded Dimension: d_model
5: Procedure MSA(Q, K, V, h, d_model):
6:     Set Y = {}
7:     Set d_k = d_v = d_model / h
8:     Set W^O ∈ R^{h·d_v×d_model}
9:     for i ← 1 to h do:
10:        Set W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}
11:        Set head_i = softmax(Q W_i^Q (K W_i^K)^T / sqrt(d_k)) V W_i^V
12:        Set Y = Concat(Y, head_i)
13:    end for
14:    return Y W^O
Algorithm A5 Two-Stage Attention
1: Inputs:
2:     2D Embedding Vector H = { h_{i,s} ∈ R^{d_model} | 1 ≤ i ≤ D, 1 ≤ s ≤ T/L_seg }
3:     Query Q, Key K, Value V for Multi-head Self-Attention (MSA)
4:     Number of Heads in MSA: h
5:     Embedded Dimension: d_model
6:
7: Procedure CrossTimeStage(H):
8:     Set Y^time = {}
9:     for i ← 1 to D do:
10:        Set h_{i,:} = H[i]
11:        Set crosstime_i = MSA(h_{i,:}, h_{i,:}, h_{i,:}, h, d_model)
12:        Set Ỹ_{i,:}^time = LayerNorm(h_{i,:} + crosstime_i)
13:        Set Y_{i,:}^time = LayerNorm(Ỹ_{i,:}^time + MLP(Ỹ_{i,:}^time))
14:        Set Y^time = Concat(Y^time, Y_{i,:}^time)
15:    end for
16:    return Y^time
17:
18: Procedure CrossDimStage(H):
19:    Set Y^dim = {}
20:    for i ← 1 to D do:
21:        Set H_{-i} = { h_{d,s} ∈ H | 1 ≤ d ≤ D, d ≠ i, 1 ≤ s ≤ T/L_seg }
22:        Set learnable weight for linear projection W_i and bias b ∈ R^{d_model}
23:        Set Ȳ_i = W_i H_{-i} + b
24:        Set T_i = MSA(Y_{i,:}^time, Ȳ_i, Ȳ_i, h, d_model)
25:        Set Ỹ_i^dim = LayerNorm(Y_{i,:}^time + T_i)
26:        Set Y_i^dim = LayerNorm(Ỹ_i^dim + MLP(Ỹ_i^dim))
27:        Set Y^dim = Concat(Y^dim, Y_i^dim)
28:    end for
29:    return Y^dim
30:
31: Procedure TwoStageAttn(H):
32:    Set H_temp = CrossTimeStage(H)
33:    Set Y = CrossDimStage(H_temp)
34:    return Y
Algorithm A6 Hierarchical Encoder–Decoder (HED)
1: Inputs:
2:     Embedded Vector from TSA: H
3:     Number of Encoder and Decoder Layers: L
4:     Learnable Position Embedding Vector: E_dec
5: Procedure Encoder(H, l):
6:     if l = 1 then: Set Y^enc = H
7:     else: Set Y^enc = TSA(H)
8:     end if
9:     return Y^enc
10:
11: Procedure Decoder(H, l, E_dec):
12:    if l = 1 then: Set Ỹ^dec = TSA(E_dec)
13:    else: Set Ỹ^dec = TSA(H)
14:    end if
15:    return Ỹ^dec
16:
17: Procedure HED(H, L, E_dec, D):
18:    Set Y = 0
19:    Set H_enc = H, H_dec = H
20:    for l ← 1 to L do:
21:        Set Y^{enc,l} = Encoder(H_enc, l), Ỹ^{dec,l} = Decoder(H_dec, l, E_dec)
22:        # update the output of each encoder and decoder layer to be the input for the next iteration
23:        Set H_enc = Y^{enc,l}, H_dec = Ỹ^{dec,l}
24:        Set Ȳ^{dec,l} = {}
25:        for d ← 1 to D do:
26:            Set Ỹ_d^{dec,l} = Ỹ^{dec,l}[d], Y_d^{enc,l} = Y^{enc,l}[d]
27:            Set Ȳ_d^{dec,l} = MSA(Ỹ_d^{dec,l}, Y_d^{enc,l}, Y_d^{enc,l}, h, d_model)
28:            Set Ȳ^{dec,l} = Concat(Ȳ^{dec,l}, Ȳ_d^{dec,l})
29:        end for
30:        Set Ŷ^{dec,l} = LayerNorm(Ỹ^{dec,l} + Ȳ^{dec,l})
31:        Set Y^{dec,l} = LayerNorm(Ŷ^{dec,l} + MLP(Ŷ^{dec,l}))
32:        # accumulate the prediction as the sum over the layers l
33:        Set W^l ∈ R^{L_seg×d_model}
34:        Set Y = Y + W^l Y^{dec,l}
35:    end for
36:    return Y

Appendix A.2. Ablation Study

By default, we set up experiments for our model with parameters in Section 4, showing results of performance and comparing with the state of the art. To show the effectiveness of our methods, we conduct ablation studies with more configurations with hyperparameter tuning.
Input Length. The input length represents the size of the look-back window that contains historical information for inference. Based on different datasets, we set up input length depending on the size of the time series in each dataset. By default, we set up a short input length in {24, 48, 96}. In the ablation study, we set up input length based on datasets. For ETTh1, ETTh2, WTH, ECL, and Traffic, we set the input length to be {24, 48, 96, 168, 336, 720}. For ETTm1, ETTm2, and Exchange-Rate, we set up input length with {24, 48, 96, 192, 288, 672}. For ILI, we set the input length to be {24, 36, 48, 60}. The tables below show results of our methods for four real-world benchmark datasets with different input lengths.
Table A1. Results of our method with various input lengths with Datasets ETTh1, ETTh2, and WTH.
| Datasets | Metrics | Input length 24 | 48 | 96 | 168 | 336 | 720 |
|---|---|---|---|---|---|---|---|
| ETTh1 | MSE | 0.551 | 0.593 | 0.538 | 0.464 | 0.531 | 0.913 |
| ETTh1 | MAE | 0.510 | 0.534 | 0.529 | 0.496 | 0.529 | 0.695 |
| ETTh2 | MSE | 0.118 | 0.111 | 0.526 | 1.057 | 0.913 | 1.957 |
| ETTh2 | MAE | 0.220 | 0.217 | 0.445 | 0.705 | 0.684 | 1.137 |
| WTH | MSE | 0.139 | 0.132 | 0.127 | 0.303 | 0.388 | 0.456 |
| WTH | MAE | 0.172 | 0.172 | 0.178 | 0.318 | 0.376 | 0.429 |
Table A2. Results of our method for various input lengths with Dataset ETTm1, ETTm2, and Exchange-Rate.
| Input length | ETTm1 MSE | ETTm1 MAE | ETTm2 MSE | ETTm2 MAE | Exchange-Rate MSE | Exchange-Rate MAE |
|---|---|---|---|---|---|---|
| 24 | 0.454 | 0.431 | 0.199 | 0.286 | 0.271 | 0.313 |
| 48 | 0.388 | 0.394 | 0.222 | 0.303 | 0.361 | 0.364 |
| 96 | 0.342 | 0.377 | 0.338 | 0.382 | 2.335 | 1.161 |
| 192 | 0.345 | 0.391 | 0.456 | 0.424 | 3.156 | 1.364 |
| 288 | 0.365 | 0.396 | 0.690 | 0.544 | 2.464 | 1.281 |
| 672 | 0.340 | 0.390 | 1.113 | 0.724 | 1.927 | 1.145 |
Table A3. Results of our method for various input lengths with Dataset ILI.
| Input length | ILI MSE | ILI MAE |
|---|---|---|
| 24 | 1.790 | 0.875 |
| 36 | 5.417 | 1.631 |
| 48 | 2.679 | 1.020 |
| 60 | 7.004 | 1.898 |
Table A1, Table A2 and Table A3 show the best results for model configurations with different input lengths for four real-world datasets with MSE and MAE as metrics.

Appendix A.3. Comprehensive Results

To show the effectiveness of our model, we collect comprehensive results of the state-of-the-art methods, which were implemented with a variety of datasets in past research. In this section, we record the comprehensive results of both our method and the state-of-the-art methods, including Transformer-based models and classical architectures such as RNN-based methods.
Table A4, Table A5 and Table A6 below show comprehensive results of MTSF models for different prediction lengths on the real-world datasets.
Table A4. Results of models with different prediction length for Datasets ETTh1, ETTm1, WTH, and ILI. The best results are highlighted in bold.
| Dataset | Pred. length | MTGNN MSE | MTGNN MAE | DLinear MSE | DLinear MAE | LSTNet MSE | LSTNet MAE | LSTMa MSE | LSTMa MAE | STformer MSE | STformer MAE | SCF MSE | SCF MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 24 | 0.336 | 0.393 | 0.312 | 0.355 | 1.293 | 0.901 | 0.650 | 0.624 | 0.368 | 0.441 | 0.464 | 0.496 |
| ETTh1 | 48 | 0.386 | 0.429 | 0.352 | 0.383 | 1.456 | 0.960 | 0.702 | 0.675 | 0.445 | 0.465 | 0.582 | 0.561 |
| ETTh1 | 168 | 0.466 | 0.474 | 0.416 | 0.430 | 1.997 | 1.214 | 1.212 | 0.867 | 0.652 | 0.608 | 0.972 | 0.725 |
| ETTh1 | 336 | 0.736 | 0.643 | 0.450 | 0.452 | 2.655 | 1.369 | 1.424 | 0.994 | 1.069 | 0.806 | 0.979 | 0.734 |
| ETTh1 | 720 | 0.916 | 0.750 | 0.486 | 0.501 | 2.143 | 1.380 | 1.960 | 1.322 | 1.071 | 0.817 | 0.985 | 0.758 |
| ETTm1 | 24 | 0.260 | 0.324 | 0.217 | 0.289 | 1.968 | 1.170 | 0.621 | 0.629 | 0.278 | 0.348 | 0.342 | 0.377 |
| ETTm1 | 48 | 0.386 | 0.408 | 0.278 | 0.330 | 1.999 | 1.215 | 1.392 | 0.939 | 0.445 | 0.458 | 0.465 | 0.453 |
| ETTm1 | 96 | 0.428 | 0.446 | 0.310 | 0.354 | 2.762 | 1.542 | 1.339 | 0.913 | 0.420 | 0.455 | 0.550 | 0.510 |
| ETTm1 | 288 | 0.469 | 0.488 | 0.369 | 0.386 | 1.257 | 2.076 | 1.740 | 1.124 | 0.733 | 0.597 | 0.745 | 0.613 |
| ETTm1 | 672 | 0.620 | 0.571 | 0.416 | 0.417 | 1.917 | 2.941 | 2.736 | 1.555 | 0.777 | 0.625 | 0.973 | 0.730 |
| WTH | 24 | 0.307 | 0.356 | 0.357 | 0.391 | 0.615 | 0.545 | 0.546 | 0.570 | 0.307 | 0.359 | 0.127 | 0.178 |
| WTH | 48 | 0.388 | 0.422 | 0.425 | 0.444 | 0.660 | 0.589 | 0.829 | 0.677 | 0.381 | 0.416 | 0.203 | 0.256 |
| WTH | 168 | 0.498 | 0.512 | 0.515 | 0.516 | 0.748 | 0.647 | 1.038 | 0.835 | 0.497 | 0.502 | 0.298 | 0.323 |
| WTH | 336 | 0.506 | 0.523 | 0.536 | 0.537 | 0.782 | 0.683 | 1.657 | 1.059 | 0.566 | 0.564 | 0.388 | 0.376 |
| WTH | 720 | 0.510 | 0.527 | 0.582 | 0.571 | 0.851 | 0.757 | 1.536 | 1.109 | 0.589 | 0.582 | 0.456 | 0.429 |
| ILI | 24 | 4.265 | 1.387 | 2.940 | 1.205 | 4.975 | 1.660 | 4.220 | 1.335 | 3.150 | 1.232 | 4.287 | 1.406 |
| ILI | 36 | 4.777 | 1.496 | 2.826 | 1.184 | 5.322 | 1.659 | 4.771 | 1.427 | 3.512 | 1.243 | 5.417 | 1.631 |
| ILI | 48 | 5.333 | 1.592 | 2.677 | 1.155 | 5.425 | 1.632 | 4.945 | 1.462 | 3.499 | 1.234 | 5.901 | 1.704 |
| ILI | 60 | 5.070 | 1.552 | 3.011 | 1.245 | 5.477 | 1.675 | 5.176 | 1.504 | 3.715 | 1.316 | 7.005 | 1.898 |
Table A5. Results of models with different prediction length for Datasets ETTh2 and ETTm2. The best results are highlighted in bold.
| Dataset | Pred. length | SCF MSE | SCF MAE | Informer MSE | Informer MAE | LogTrans MSE | LogTrans MAE | Reformer MSE | Reformer MAE | LSTNet MSE | LSTNet MAE | LSTMa MSE | LSTMa MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTh2 | 24 | 0.259 | 0.338 | 0.720 | 0.665 | 0.828 | 0.750 | 1.531 | 1.613 | 2.742 | 1.457 | 1.143 | 0.813 |
| ETTh2 | 48 | 0.427 | 0.405 | 1.457 | 1.001 | 1.806 | 1.034 | 1.871 | 1.735 | 3.567 | 1.687 | 1.671 | 1.221 |
| ETTh2 | 168 | 1.057 | 1.057 | 3.489 | 1.515 | 4.070 | 1.681 | 4.660 | 1.846 | 3.242 | 2.513 | 4.117 | 1.674 |
| ETTh2 | 336 | 0.913 | 0.684 | 2.723 | 1.340 | 3.875 | 1.763 | 4.028 | 1.688 | 2.544 | 2.591 | 3.434 | 1.549 |
| ETTh2 | 720 | 1.957 | 1.957 | 3.467 | 1.473 | 3.913 | 1.552 | 2.015 | 4.625 | 4.625 | 3.709 | 3.963 | 1.788 |
| ETTm2 | 24 | 0.199 | 0.286 | 0.173 | 0.301 | 0.211 | 0.332 | 0.333 | 0.429 | 1.101 | 0.831 | 0.580 | 0.572 |
| ETTm2 | 48 | 0.223 | 0.303 | 0.303 | 0.409 | 0.427 | 0.487 | 0.558 | 0.571 | 2.619 | 1.393 | 0.747 | 0.630 |
| ETTm2 | 168 | 0.346 | 0.377 | 0.365 | 0.453 | 0.768 | 0.642 | 0.658 | 0.619 | 3.142 | 1.365 | 2.041 | 1.073 |
| ETTm2 | 336 | 0.712 | 0.550 | 1.056 | 0.804 | 1.090 | 0.806 | 2.441 | 1.190 | 2.856 | 1.329 | 0.969 | 0.742 |
| ETTm2 | 720 | 1.043 | 0.679 | 3.126 | 1.302 | 2.397 | 1.214 | 1.328 | 3.409 | 3.409 | 1.420 | 2.541 | 1.239 |
Table A6. Results of models with different prediction length for Datasets ETTh1, ETTh2, ETTm1, and ETTm2. The best results are highlighted in bold.
| Models | Metric | ETTh1 96 | ETTh1 192 | ETTh1 336 | ETTh1 720 | ETTh2 96 | ETTh2 192 | ETTh2 336 | ETTh2 720 | ETTm1 96 | ETTm1 192 | ETTm1 336 | ETTm1 720 | ETTm2 96 | ETTm2 192 | ETTm2 336 | ETTm2 720 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCF | MSE | 0.771 | 1.136 | 0.979 | 0.985 | 0.905 | 1.007 | 0.913 | 1.957 | 0.538 | 0.701 | 0.756 | 1.033 | 0.316 | 0.456 | 0.712 | 1.043 |
| SCF | MAE | 0.658 | 0.794 | 0.734 | 0.758 | 0.650 | 0.698 | 0.684 | 1.137 | 0.496 | 0.597 | 0.622 | 0.752 | 0.458 | 0.424 | 0.550 | 0.679 |
| FEDformer | MSE | 0.376 | 0.420 | 0.459 | 0.506 | 0.346 | 0.429 | 0.496 | 0.463 | 0.379 | 0.426 | 0.445 | 0.543 | 0.203 | 0.269 | 0.325 | 0.421 |
| FEDformer | MAE | 0.419 | 0.448 | 0.465 | 0.507 | 0.388 | 0.439 | 0.487 | 0.474 | 0.419 | 0.441 | 0.459 | 0.490 | 0.287 | 0.328 | 0.366 | 0.415 |
| Autoformer | MSE | 0.449 | 0.500 | 0.521 | 0.514 | 0.358 | 0.456 | 0.482 | 0.515 | 0.505 | 0.553 | 0.621 | 0.671 | 0.255 | 0.281 | 0.339 | 0.422 |
| Autoformer | MAE | 0.459 | 0.482 | 0.496 | 0.512 | 0.397 | 0.452 | 0.486 | 0.511 | 0.475 | 0.496 | 0.537 | 0.561 | 0.339 | 0.340 | 0.372 | 0.419 |
| Informer | MSE | 0.865 | 1.008 | 1.107 | 1.181 | 3.755 | 5.602 | 4.721 | 3.647 | 0.672 | 0.795 | 1.212 | 1.166 | 0.365 | 0.533 | 1.363 | 3.379 |
| Informer | MAE | 0.713 | 0.792 | 0.809 | 0.865 | 1.525 | 1.931 | 1.835 | 1.625 | 0.571 | 0.669 | 0.871 | 0.823 | 0.453 | 0.563 | 0.887 | 1.338 |
| LogTrans | MSE | 0.878 | 1.037 | 1.238 | 1.135 | 2.116 | 4.315 | 1.124 | 3.118 | 0.600 | 0.837 | 1.124 | 1.153 | 0.768 | 0.989 | 1.334 | 3.048 |
| LogTrans | MAE | 0.740 | 0.824 | 0.932 | 0.852 | 1.197 | 1.635 | 1.604 | 1.540 | 0.546 | 0.700 | 0.832 | 0.820 | 0.642 | 0.757 | 0.872 | 1.328 |
| Reformer | MSE | 0.837 | 0.923 | 1.097 | 1.257 | 2.626 | 11.12 | 9.323 | 3.874 | 0.538 | 0.658 | 0.898 | 1.102 | 0.658 | 1.078 | 1.549 | 2.631 |
| Reformer | MAE | 0.728 | 0.766 | 0.832 | 0.889 | 1.317 | 2.979 | 2.769 | 1.697 | 0.528 | 0.592 | 0.721 | 0.841 | 0.619 | 0.827 | 0.972 | 1.242 |

References

  1. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  2. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
  3. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Luxburg, U., Guyon, I., Bengio, S., Wallach, H., Fergus, R., Eds.; pp. 6000–6010. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. Presented at the International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  6. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar] [CrossRef]
  7. Yan, W.; Sun, Y.; Yue, G.; Zhou, W.; Liu, H. FVIFormer: Flow-Guided Global-Local Aggregation Transformer Network for Video Inpainting. IEEE J. Emerg. Sel. Top. Circuits Syst. 2024, 14, 235–244. [Google Scholar] [CrossRef]
  8. Cao, Y.; Yu, H.; Wu, J. Training Vision Transformers with only 2040 Images. In Proceedings of the 17th European Conference (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; pp. 220–237. [Google Scholar] [CrossRef]
  9. Woo, G.; Liu, C.; Kumar, A.; Xiong, C.; Savarese, S.; Sahoo, D. Unified Training of Universal Time Series Forecasting Transformers. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; pp. 53140–53164. [Google Scholar]
  10. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. arXiv 2022, arXiv:2201.12740. [Google Scholar] [CrossRef]
  11. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NIPS 21), Red Hook, NY, USA, 6–14 December 2021; pp. 22419–22430. [Google Scholar]
  12. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. Presented at the International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
  13. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Presented at the AAAI Conference on Artificial Intelligence (AAAI-21), Virtually, 2–9 February 2021; Available online: https://cdn.aaai.org/ojs/17325/17325-13-20819-1-2-20210518.pdf (accessed on 20 March 2025).
  14. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series Is Worth 64 Words: Long-Term Forecasting with Transformers. Presented at the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  15. Ghojogh, B.; Ghodsi, A. Recurrent Neural Networks and Long Short-Term Memory Networks: Tutorial and Survey. arXiv 2023, arXiv:2304.11461. [Google Scholar] [CrossRef]
  16. Graves, A. Generating Sequences with Recurrent Neural Networks. arXiv 2013, arXiv:1308.0850. [Google Scholar]
  17. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 20), Virtual Event, 6–10 July 2020; pp. 753–763. [Google Scholar] [CrossRef]
  18. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  19. Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A Transformer-based Framework for Multivariate Time Series Representation Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 21), Singapore, 14–18 August 2021; pp. 2114–2124. [Google Scholar] [CrossRef]
  20. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  21. Holt, C.C. Forecasting seasonals and trends by exponentially weighted moving averages. Int. J. Forecast. 2004, 20, 5–10. [Google Scholar] [CrossRef]
  22. Box, G.E.P.; Jenkins, G.M. Some Recent Advances in Forecasting and Control. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1968, 17, 91–109. [Google Scholar] [CrossRef]
23. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
  24. Yan, C.; Wang, Y.; Zhang, Y.; Wang, Z.; Wang, P. Modeling Long- and Short-Term User Behaviors for Sequential Recommendation with Deep Neural Networks. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  25. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  26. Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J. Spectral temporal graph neural networks for multivariate time-series forecasting. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS 20), Vancouver, BC, Canada, 6–12 December 2020; pp. 17766–17778. [Google Scholar]
  27. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS 20), Vancouver, BC, Canada, 6–12 December 2020; pp. 17804–17815. [Google Scholar]
  28. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5243–5253. [Google Scholar]
  29. Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A.X.; Dustdar, S. Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling. April 2022. Available online: https://openreview.net/pdf?id=0EXmFzUn5I (accessed on 17 March 2025).
  30. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. Presented at the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Liu, X.; Wang, W. Deep Time Series Forecasting Models: A Comprehensive Survey. Mathematics 2024, 12, 1504. [Google Scholar] [CrossRef]
  32. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. Presented at the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of our method: (a) Segment-Wise Embedding, (b) Positional Embedding, (c) Hierarchical Encoder–Decoder (HED), (d) MLP Head.
Figure 2. Segment-Wise Embedding: (a) Embedded vectors for the input time series after the embedding layer, (b) Segmented embedded vectors: the embedded vectors are further split into small segments given segment length L_seg. Vectors in different colors represent different dimensions.
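To make the segment-wise embedding of Figure 2 concrete, the following minimal PyTorch sketch splits each dimension of the input series into non-overlapping segments of length L_seg (seg_len in the code) and projects every segment to a model-dimension vector. The class name, the single linear projection, and the example shapes are illustrative assumptions, not the exact layer used in SCF.

```python
import torch
import torch.nn as nn

class SegmentWiseEmbedding(nn.Module):
    """Illustrative sketch: split each dimension into segments of length
    seg_len and project every segment to a d_model-dimensional vector."""
    def __init__(self, seg_len: int, d_model: int):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(seg_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, D), with T assumed divisible by seg_len
        b, t, d = x.shape
        n_seg = t // self.seg_len
        # Group the points of each dimension into non-overlapping segments:
        # (batch, D, n_seg, seg_len)
        segments = x.permute(0, 2, 1).reshape(b, d, n_seg, self.seg_len)
        # Project each segment to the model dimension: (batch, D, n_seg, d_model)
        return self.proj(segments)

# Usage with illustrative shapes (e.g., a 7-dimensional ETT-style input)
emb = SegmentWiseEmbedding(seg_len=6, d_model=64)
print(emb(torch.randn(32, 24, 7)).shape)  # torch.Size([32, 7, 4, 64])
```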
Figure 3. Architecture of TSA for one dimension: (a) Cross-time Stage, (b) Cross-dimension Stage, (c) Output of TSA Stage. Vectors in different colors represent different dimensions.
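Figure 3 shows attention applied in two stages: first across segments within each dimension (cross-time), then across dimensions within each segment (cross-dimension). The sketch below illustrates only that ordering with standard multi-head self-attention; residual connections, feed-forward sublayers, and normalization are omitted, and all class and variable names are our own assumptions rather than the actual TSA implementation.

```python
import torch
import torch.nn as nn

class TwoStageAttentionSketch(nn.Module):
    """Illustrative two-stage attention: (a) cross-time attention over the
    segments of each dimension, then (b) cross-dimension attention within
    each segment."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dim_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, D, n_seg, d_model), e.g., the output of segment-wise embedding
        b, d, n_seg, d_model = x.shape

        # (a) Cross-time stage: attend over segments; dimensions act as extra batch
        xt = x.reshape(b * d, n_seg, d_model)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, d, n_seg, d_model)

        # (b) Cross-dimension stage: attend over dimensions; segments act as extra batch
        xd = x.permute(0, 2, 1, 3).reshape(b * n_seg, d, d_model)
        xd, _ = self.dim_attn(xd, xd, xd)
        return xd.reshape(b, n_seg, d, d_model).permute(0, 2, 1, 3)

# Usage with the embedding output shape from the previous sketch
tsa = TwoStageAttentionSketch(d_model=64)
print(tsa(torch.randn(32, 7, 4, 64)).shape)  # torch.Size([32, 7, 4, 64])
```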
Table 1. Detailed information for real-world datasets in our experiment.
Dataset Name | Dimensions | Total | Training | Validation | Testing
ETTh1 | 7 | 17,420 | 10,443 | 3481 | 3481
ETTh2 | 7 | 17,420 | 10,405 | 3461 | 3461
ETTm1 | 7 | 69,680 | 41,671 | 13,913 | 13,913
ETTm2 | 7 | 69,680 | 41,797 | 13,931 | 13,931
WTH | 21 | 52,696 | 31,570 | 10,517 | 10,517
Exchange-Rate | 8 | 7588 | 4545 | 1516 | 1516
ILI | 7 | 966 | 532 | 171 | 170
Table 2. Results of the MTSF task for our method with small prediction lengths on different datasets.
Dataset | ETTh1 | | ETTh2 | | ETTm1 | | ETTm2 | | WTH | | Exchange | | ILI |
Pred. Length / Metrics | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE
4 | 0.273 | 0.342 | 0.111 | 0.217 | 0.117 | 0.209 | 0.109 | 0.202 | 0.050 | 0.084 | 0.228 | 0.290 | 1.641 | 0.777
6 | 0.343 | 0.379 | 0.126 | 0.236 | 0.158 | 0.244 | 0.140 | 0.240 | 0.060 | 0.096 | 0.324 | 0.342 | 2.305 | 0.917
12 | 0.453 | 0.447 | 0.179 | 0.282 | 0.258 | 0.324 | 0.152 | 0.245 | 0.078 | 0.127 | 0.577 | 0.492 | 2.835 | 1.097
24 | 0.464 | 0.496 | 0.259 | 0.338 | 0.342 | 0.377 | 0.211 | 0.289 | 0.132 | 0.172 | 1.413 | 0.791 | 4.287 | 1.406
48 | 0.583 | 0.561 | 0.427 | 0.405 | 0.454 | 0.472 | 0.222 | 0.303 | 0.222 | 0.264 | 2.048 | 1.109 | 5.901 | 1.704
Table 3. Results of the MTSF task with different prediction lengths. Results of our method are shown in the SCF column. Figures in bold indicate the best results.
Models | | Crossformer | | FEDformer | | Transformer | | Informer | | Autoformer | | Pyraformer | | SCF | | LSTMa | | MTGNN |
Dataset | Pred. Length | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE
ETTh1 | 24 | 0.464 | 0.496 | 0.318 | 0.384 | 0.620 | 0.577 | 0.577 | 0.549 | 0.384 | 0.425 | 0.493 | 0.507 | 0.464 | 0.496 | 0.650 | 0.624 | 0.336 | 0.393
 | 48 | 0.583 | 0.561 | 0.342 | 0.396 | 0.692 | 0.671 | 0.685 | 0.625 | 0.392 | 0.419 | 0.554 | 0.544 | 0.583 | 0.561 | 0.702 | 0.675 | 0.386 | 0.429
 | 168 | 0.410 | 0.441 | 0.412 | 0.449 | 0.947 | 0.797 | 0.931 | 0.752 | 0.490 | 0.481 | 0.781 | 0.675 | 0.464 | 0.496 | 1.212 | 0.867 | 0.466 | 0.474
 | 336 | 0.440 | 0.461 | 0.456 | 0.474 | 1.094 | 0.813 | 1.128 | 0.873 | 0.505 | 0.484 | 0.912 | 0.747 | 0.531 | 0.529 | 1.424 | 0.994 | 0.736 | 0.643
 | 720 | 0.519 | 0.524 | 0.521 | 0.515 | 1.241 | 0.917 | 1.215 | 0.896 | 0.498 | 0.500 | 0.993 | 0.792 | 0.913 | 0.695 | 1.960 | 1.322 | 0.916 | 0.750
ETTm1 | 24 | 0.211 | 0.293 | 0.290 | 0.364 | 0.306 | 0.371 | 0.323 | 0.369 | 0.383 | 0.403 | 0.310 | 0.371 | 0.342 | 0.380 | 0.621 | 0.629 | 0.260 | 0.324
 | 48 | 0.300 | 0.352 | 0.342 | 0.396 | 0.465 | 0.470 | 0.494 | 0.503 | 0.454 | 0.453 | 0.465 | 0.464 | 0.457 | 0.444 | 1.392 | 0.939 | 0.386 | 0.408
 | 96 | 0.320 | 0.373 | 0.366 | 0.412 | 0.681 | 0.612 | 0.678 | 0.614 | 0.481 | 0.463 | 0.520 | 0.504 | 0.342 | 0.377 | 1.339 | 0.913 | 0.428 | 0.446
 | 288 | 0.404 | 0.427 | 0.398 | 0.433 | 1.162 | 0.879 | 1.056 | 0.786 | 0.634 | 0.528 | 0.729 | 0.657 | 0.365 | 0.396 | 1.740 | 1.124 | 0.469 | 0.488
 | 672 | 0.569 | 0.528 | 0.455 | 0.464 | 1.231 | 1.103 | 1.192 | 0.926 | 0.606 | 0.542 | 0.980 | 0.678 | 0.340 | 0.400 | 2.736 | 1.555 | 0.620 | 0.571
WTH | 24 | 0.294 | 0.343 | 0.357 | 0.412 | 0.349 | 0.397 | 0.335 | 0.381 | 0.363 | 0.396 | 0.301 | 0.359 | 0.132 | 0.172 | 0.546 | 0.570 | 0.307 | 0.356
 | 48 | 0.370 | 0.411 | 0.428 | 0.458 | 0.386 | 0.433 | 0.395 | 0.459 | 0.456 | 0.462 | 0.376 | 0.421 | 0.203 | 0.256 | 0.829 | 0.677 | 0.388 | 0.422
 | 168 | 0.473 | 0.494 | 0.564 | 0.541 | 0.613 | 0.582 | 0.608 | 0.567 | 0.574 | 0.548 | 0.519 | 0.521 | 0.298 | 0.323 | 1.038 | 0.835 | 0.498 | 0.512
 | 336 | 0.495 | 0.515 | 0.533 | 0.536 | 0.707 | 0.634 | 0.702 | 0.620 | 0.600 | 0.571 | 0.539 | 0.543 | 0.388 | 0.376 | 1.657 | 1.059 | 0.506 | 0.523
 | 720 | 0.526 | 0.542 | 0.562 | 0.557 | 0.834 | 0.741 | 0.831 | 0.731 | 0.587 | 0.570 | 0.547 | 0.553 | 0.456 | 0.429 | 1.536 | 1.109 | 0.510 | 0.527
ILI | 24 | 3.041 | 1.186 | 2.687 | 1.147 | 3.954 | 1.323 | 4.588 | 1.462 | 3.101 | 1.238 | 3.970 | 1.338 | 4.287 | 1.406 | 4.220 | 1.335 | 4.265 | 1.387
 | 36 | 3.406 | 1.232 | 2.887 | 1.160 | 4.167 | 1.360 | 4.845 | 1.496 | 3.397 | 1.270 | 4.377 | 1.410 | 5.417 | 1.631 | 4.771 | 1.427 | 4.777 | 1.496
 | 48 | 3.459 | 1.221 | 2.297 | 1.155 | 4.476 | 1.463 | 4.865 | 1.516 | 2.947 | 1.203 | 4.811 | 1.503 | 5.901 | 1.704 | 4.945 | 1.462 | 5.333 | 1.592
 | 60 | 3.640 | 1.305 | 2.809 | 1.163 | 5.219 | 1.553 | 5.212 | 1.576 | 3.019 | 1.202 | 5.204 | 1.588 | 7.005 | 1.898 | 5.176 | 1.504 | 5.070 | 1.552
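For reference, the MSE and MAE figures reported in Tables 2 and 3 are the usual pointwise errors averaged over all prediction steps and dimensions, typically computed on the normalized series. A minimal sketch, assuming NumPy arrays of predictions and ground truth with identical shapes:

```python
import numpy as np

def mse(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean squared error averaged over all prediction steps and dimensions."""
    return float(np.mean((pred - true) ** 2))

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean absolute error averaged over all prediction steps and dimensions."""
    return float(np.mean(np.abs(pred - true)))

# Usage with illustrative shapes: (test windows, horizon, dimensions)
pred = np.random.randn(100, 24, 7)
true = np.random.randn(100, 24, 7)
print(mse(pred, true), mae(pred, true))
```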