Article

MSP-EDA: Multivariate Time Series Forecasting Based on Multiscale Patches and External Data Augmentation

1 State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
2 School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2618; https://doi.org/10.3390/electronics14132618
Submission received: 23 May 2025 / Revised: 21 June 2025 / Accepted: 27 June 2025 / Published: 28 June 2025
(This article belongs to the Special Issue Machine Learning in Data Analytics and Prediction)

Abstract

Accurate multivariate time series forecasting remains a major challenge in various real-world applications, primarily due to the limitations of existing models in capturing multiscale temporal dependencies and effectively integrating external data. To address these issues, we propose MSP-EDA, a novel multivariate time series forecasting framework that integrates multiscale patching and external data enhancement. Specifically, MSP-EDA utilizes the Discrete Fourier Transform (DFT) to extract dominant global periodic patterns and employs an adaptive Continuous Wavelet Transform (CWT) to capture scale-sensitive local variations. In addition, multiscale patches are constructed to capture temporal patterns at different resolutions, and a specialized encoder is designed for each scale. Each encoder incorporates temporal attention, channel correlation attention, and cross-attention with external data to capture intra-scale temporal dependencies, inter-variable correlations, and external influences, respectively. To fuse information from different temporal scales, we introduce a trainable global token that serves as a variable-wise aggregator across scales. Extensive experiments on four public benchmark datasets and three real-world vector database datasets that we collect demonstrate that MSP-EDA consistently outperforms state-of-the-art methods, achieving particularly notable improvements on vector database workloads. Ablation studies further confirm the effectiveness of each module and the adaptability of MSP-EDA to complex forecasting scenarios involving external dependencies.

1. Introduction

Multivariate time series (MTS) forecasting aims to predict future values of one or more target variables by leveraging multiple correlated and temporally evolving features. It has been widely applied in domains such as economics [1], traffic [2,3], healthcare [4,5], and workload forecasting [6,7,8]. Among these, workload forecasting often involves highly dynamic and non-stationary patterns, exhibiting multiscale fluctuations driven by diverse query types, variable data volumes, and burst access behaviors. For example, the workloads of vector databases typically reflect this complexity, making accurate forecasting essential for adaptive resource scheduling, performance tuning, and system scalability under volatile conditions [8,9,10].
Existing MTS forecasting methods can be broadly categorized into statistical learning and deep learning approaches. Traditional methods, such as Vector AutoRegression (VAR) [11] and extended ARIMA models, are effective for stationary, low-noise datasets but struggle with complex patterns. In recent years, deep learning models have dominated due to their ability to model nonlinear dynamics and handle noisy real-world data. Early works based on RNNs and CNNs, such as SegRNN [12] and SCINet [13], have demonstrated promising performance. More recently, Transformer-based models [14,15,16] have gained attention for their superior ability to capture long-term dependencies, although further design improvements are needed to fully exploit their potential in MTS forecasting.
Accurate time series forecasting in real-world scenarios demands the ability to capture complex and diverse temporal patterns. However, achieving this remains a significant challenge due to several key issues.
First, real-world time series often exhibit multi-modal behaviors, characterized by a mixture of long-term trends, seasonal fluctuations, and abrupt short-term changes. Effectively modeling these intertwined patterns is critical but non-trivial. Most existing approaches either overlook or only extract simple temporal patterns [17,18], which makes it difficult to capture the diverse dynamics inherent in time series data [19,20]. Related studies [21,22,23] leverage Fourier Transform to extract frequency-domain representations, capturing global periodic patterns effectively. However, these methods are generally limited by their reliance on fixed basis functions or fixed receptive fields, which makes it difficult to adapt to localized, non-stationary temporal patterns. In particular, they tend to ignore the fine-grained, transient frequency changes that are common in real-world time series, especially under conditions of noise, abrupt events, or irregular cycles. Moreover, their frequency modeling tends to be either global or statically defined, making it hard to dynamically adjust to scale-varying or time-varying features.
Second, different temporal resolutions often capture distinct and complementary information. For example, the utilization of CPU and memory resources in cloud computing reveals unique temporal patterns spanning daily, monthly, and seasonal scales [6]. Moreover, as illustrated in Figure 1, multivariate time series data exhibit strong temporal dependencies and complex inter-variable relationships [24]. To better represent these characteristics, the figure depicts sequences at three segmentation scales (shown by background colors), each containing two correlated variables. This design emphasizes the need to capture both multiscale temporal patterns and cross-variable dynamics, which are essential for accurate modeling. Effectively capturing both sequential dynamics and cross-feature interactions is critical for enhancing both accuracy and generalization ability of forecasting models [25,26]. However, many existing models treat time series at a single temporal scale, thereby failing to capture the inherent multiscale nature of real-world data [27,28,29]. Additionally, most attention-based forecasting models [30] either focus solely on temporal correlations or apply shallow inter-variable attention, lacking an integrated mechanism for capturing interactions across both time and variable dimensions at multiple scales. Furthermore, they do not provide a unified way to dynamically fuse information across resolutions, which is essential for reconciling long-term trends with short-term fluctuations.
Third, integrating external data can significantly enhance forecasting performance [15]. Approaches such as ARIMAX [31] capture the relationships between exogenous and endogenous variables, along with autoregression on the endogenous variables. TiDE [32] further suggests that forecasting models might have access to the future values of exogenous variables during the prediction of endogenous variables. However, most existing methods assume well-aligned and complete data streams, which is often unrealistic in practical scenarios [15]. In real-world applications, time series data are frequently affected by missing values, time delays, and uneven sampling, making it challenging to model the dynamic interactions between exogenous and endogenous variables accurately.
In summary, existing methods face several key limitations. Many rely on fixed or global frequency bases, which limits their ability to capture localized and time-varying patterns in non-stationary time series. Others fail to effectively model the interactions between multiscale temporal structures and inter-variable dependencies, which are crucial for understanding complex dynamics in real-world data. Additionally, most approaches assume complete and well-aligned external data, making them less robust to missing values, time delays, and irregular sampling commonly encountered in practical scenarios.
In this article, we propose MSP-EDA, a novel forecasting model that integrates multiscale patch and external data augmentation to address the aforementioned challenges. First, we combine Discrete Fourier Transform (DFT) with an adaptive multiscale wavelet transform to simultaneously extract global periodic trends and localized time–frequency variations. Unlike existing models that rely on fixed-frequency bases, our approach enables flexible and fine-grained temporal-frequency decomposition, which is essential for modeling non-stationary time series. Second, we introduce a multiscale patching strategy that segments time series into variable-length patches at different resolutions. Each patch is processed by a dedicated encoder equipped with a temporal attention module for sequential dependencies and a channel attention module for inter-variable interactions. This design enables the model to jointly capture temporal dynamics and variable relationships across multiple scales, which are often modeled separately in prior works. Third, a dedicated embedding module fuses timestamp-related external information with internal representations, and a cross-attention mechanism selectively integrates external variables based on contextual relevance. This dynamic integration mechanism offers greater robustness compared to models that assume clean and fully aligned external data. Finally, a global token aggregates multi-resolution outputs from all encoder layers, allowing the model to adaptively fuse information across temporal scales and enhance forecasting accuracy.
Experiments on several datasets show that the proposed method consistently outperforms existing approaches, achieving superior or near-optimal performance. In particular, on the three vector database datasets, MSP-EDA significantly outperforms the best baseline methods in terms of MSE. Compared to the second-best model, TimeXer [15], MSP-EDA reduces MSE by 1.4% on Dvdb, 2.1% on Mvdb, and 1.9% on Wvdb, highlighting its effectiveness in forecasting for real-world applications.
Our main contributions can be summarized as follows:
  • We design an adaptive multiscale wavelet-based representation module that automatically adjusts scale decomposition based on the characteristics of input data, enabling the model to extract rich and relevant time–frequency features.
  • We introduce a cross-attention mechanism that dynamically fuses external data, enhancing the model's ability to adapt to complex and changing input conditions.
  • We propose a novel MSP-EDA framework, which unifies multiscale patch representation and external data integration to capture complex temporal patterns and inter-variable dependencies.
  • We perform extensive experiments on seven benchmark datasets to demonstrate that MSP-EDA consistently outperforms state-of-the-art baselines, confirming its effectiveness and generalizability across diverse forecasting scenarios.
The remainder of this paper is organized as follows. We first review the related literature in Section 2. Then, we elaborate on MSP-EDA and its modules in Section 3. Finally, we show the experimental results in Section 4 and conclude the whole paper in Section 5.

2. Related Work

2.1. Multivariate Time Series Forecasting

Existing methods for MTS forecasting can be categorized into statistical learning and neural network-based approaches [24]. Statistical methods, such as ARIMA [33] and VAR [11], capture linear dependencies and are well-suited for stationary data. In contrast, deep learning models such as TimesNet [34] and DeepAR [35] treat time series as sequences of vectors and utilize CNNs or RNNs to capture temporal dependencies. Furthermore, Transformer-based architectures, including Informer [16], FEDformer [18], MedFormer [5], iTransformer [14], and Crossformer [25], are more effective at modeling nonlinear patterns and handling complex, noisy data. MLP-based methods, including MSD-Mixer [26], TiDE [32], TimeMixer [36], and LSTF-Linear [37], employ simple architectures with relatively few parameters and have also demonstrated good forecasting accuracy. However, many of these models rely on fixed-length time segments, which limits their ability to capture the full complexity of time series data, as they fail to account for patterns occurring at different temporal scales, such as long-term trends and short-term fluctuations.

2.2. Multiscale Modeling for MTS Forecasting

Multiscale modeling has shown remarkable success in enhancing correlation learning and feature extraction in domains such as computer vision [38] and multi-modal learning [39]. However, its application to time series forecasting remains relatively underexplored. Some recent efforts have attempted to incorporate multiscale representations into time series models [40,41]. MTST [42] employs a multi-branch architecture to model various temporal patterns at different resolutions, addressing the challenge of capturing both short-term high-frequency patterns and long-term seasonal trends. Pathformer [27] introduces a multiscale Transformer with adaptive pathways that integrates temporal resolution and distance to capture diverse temporal patterns. However, while these methods achieve multiscale partitioning, they generally assume independence across channels, potentially limiting their ability to capture inter-variable relationships and fully exploit the dependencies between different data features.

2.3. MTS Forecasting with External Data

Incorporating external data into time series forecasting can significantly improve predictive performance by providing additional insight beyond the intrinsic temporal patterns of the target variables [15]. Classical models like ARIMAX [33] and SARIMAX [43] explicitly model both autoregressive dependencies and the relationships between internal and external variables. In deep learning, models such as the Temporal Fusion Transformer (TFT) [44] introduce a unified attention-based mechanism to learn variable importance dynamically over time, handling both historical and known future inputs. Similarly, NBEATSx [45] and TiDE [32] incorporate future external inputs along with historical inputs from the target variables. TimeXer [15] focuses on cross-time dependencies via inter-time attention, allowing it to capture long-term dependencies between distant time steps. However, these methods often rely on strict temporal alignment, overlook localized or multiscale temporal variations, and lack explicit mechanisms for fusing external signals across different time resolutions.
Table 1 presents a comparison of various forecasting methods based on three critical components: decomposition, multiscale modeling, and external data integration. Each of these components plays a pivotal role in capturing the complex patterns and dynamics within time series data. Decomposition enhances the ability to represent the data by extracting frequency-domain and more detailed time–frequency features, which improve overall data representation. Multiscale modeling captures temporal dependencies at multiple resolutions while considering the inter-channel relationships, allowing for more effective capture of complex data features. The integration of external data strengthens predictive performance by incorporating additional contextual information, enriching the ability to address dynamic and diverse time series. MSP-EDA combines all three components, offering a comprehensive approach that more effectively captures the rich and varied information within time series data, thereby outperforming existing methods.

3. Methodology

Time series forecasting aims to predict future values of a sequence based on its historical observations. Formally, given a multivariate time series input $X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{D \times T}$, where $D$ is the number of variables and $T$ is the length of the historical window, the objective is to forecast the future values $Y = [y_{T+1}, \ldots, y_{T+H}] \in \mathbb{R}^{D \times H}$ over a prediction horizon $H$.
The key challenge in this task lies in effectively modeling complex temporal dynamics, including long-term periodic patterns, short-term variations, and potential external influences. To this end, we propose MSP-EDA, a novel model that integrates multiscale patch representations with external data enhancement to improve forecasting accuracy and robustness. In the remainder of this section, we first present an overview of the model architecture in Section 3.1. Then, we detail each module individually: data pattern enhancement in Section 3.2, multiscale patch processing in Section 3.3, and multiscale fusion in Section 3.4.

3.1. Overview

Our approach begins by enhancing the input time series through frequency- and time–frequency-based analysis to capture both global periodic trends and local variations. The enhanced data are then decomposed into patches at multiple temporal scales, where each scale is processed by dedicated encoders that incorporate external information. Finally, a global token is used to fuse multiscale representations for accurate forecasting. As illustrated in Figure 2, our model consists of three main components:
  • Data Pattern Enhancement: Extracts both global periodic trends and localized time–frequency features using signal processing techniques, specifically leveraging the Discrete Fourier Transform (DFT) to identify dominant global frequency components and the Continuous Wavelet Transform (CWT) to capture localized, non-stationary patterns across time and scale.
  • Multiscale Patch Processing: Decomposes the time series into multiple temporal resolutions to capture multiscale dependencies while simultaneously modeling temporal dynamics and inter-variable correlations. In addition, tailored embedding strategies are designed to effectively integrate external data into each scale for enhanced contextual understanding.
  • Multiscale Fusion: Combines representations from different temporal scales by introducing a global variable token, which is fused with the outputs from each encoder layer to generate the final prediction. This process effectively integrates multiscale information and external data, ensuring a comprehensive representation for accurate forecasting.

3.2. Data Pattern Enhancement

In time series forecasting, enhancing raw data representations is crucial for capturing complex temporal characteristics such as global periodic trends and local variations. To this end, we design a dedicated enhancement module that integrates the Discrete Fourier Transform (DFT) and the Continuous Wavelet Transform (CWT). The DFT provides a compact representation of long-term periodic trends by decomposing signals into global frequency components, while the CWT enables localized time–frequency analysis through scaled and shifted wavelets, capturing fine-grained, non-stationary temporal dynamics. This combination allows the model to extract multiscale features more effectively than fixed-kernel convolutions or learnable filters, which often struggle with the transient patterns common in real-world time series.
Periodic feature extraction involves transforming input data from the time domain to the frequency domain to capture frequency-based patterns. We employ the DFT, denoted as $\mathrm{DFT}(\cdot)$, which decomposes a signal into a set of sinusoidal components with different frequencies, thereby revealing dominant periodic trends while suppressing noise. The extracted frequency-domain features are then transformed back to the time domain using the inverse transform $\mathrm{IDFT}(\cdot)$ for further processing. The process is represented as follows:
$X_{\mathrm{freq}} = \mathrm{IDFT}\left(\mathrm{TopK}_{\mathrm{amp}}\left(\mathrm{DFT}(x)\right)\right),$
where $\mathrm{TopK}_{\mathrm{amp}}(\cdot)$ represents the frequency-domain filtering operation that retains the top-$k$ frequency components with the largest amplitudes.
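For illustration, a minimal PyTorch sketch of this periodic enhancement step is given below; the function name dft_topk_enhance and the default value of k are illustrative choices rather than details taken from the original implementation.

```python
import torch

def dft_topk_enhance(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Retain the k frequency components with the largest amplitude and reconstruct
    the series, suppressing the remaining (mostly noisy) frequencies.
    x: input series of shape (D, T); returns a tensor of the same shape."""
    spec = torch.fft.rfft(x, dim=-1)                       # DFT along the time axis
    amplitude = spec.abs()                                 # per-frequency amplitude
    topk_idx = torch.topk(amplitude, k, dim=-1).indices    # dominant components per variable
    mask = torch.zeros_like(amplitude, dtype=torch.bool).scatter_(-1, topk_idx, True)
    spec_filtered = torch.where(mask, spec, torch.zeros_like(spec))
    return torch.fft.irfft(spec_filtered, n=x.shape[-1], dim=-1)  # inverse DFT back to time
```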
Local time–frequency features are extracted using the CWT, which projects the input signal onto a family of scaled and shifted wavelet functions to capture localized patterns across different temporal resolutions. To enhance adaptability, we parameterize the multiscale settings of the wavelet transform as trainable variables. In particular, the time-scaling parameters of the wavelet basis functions are learned during training, allowing the model to adaptively select optimal time–frequency decomposition scales based on the characteristics of the input data. The $k$-th adaptive wavelet kernel function $\psi_k(t)$ is represented as follows:
$\psi_k(t) = \frac{1}{Z_k} \cdot C_\theta\!\left[\psi\!\left(\frac{t}{\sigma_k}\right)\right],$
where $t$ denotes the time variable, and the normalization factor $Z_k$, computed as the $L_2$ norm, ensures consistent energy across wavelets. The operator $C_\theta[\cdot]$ represents linear interpolation, which resizes the wavelet to match the input sequence length. The scale parameter is defined as $\sigma_k = \exp(\beta_k) \cdot \exp(\gamma)$, where $\beta_k$ controls the scale of the $k$-th wavelet and $\gamma$ is a global scaling factor. Both $\beta_k$ and $\gamma$ are learnable and optimized during training.
Unlike fixed-scale wavelet transforms, where scales are predefined and typically spaced logarithmically, the adaptive method learns the scale parameters $\beta_k$ and $\gamma$ during training. Initially, both parameters are set to zero, allowing the network to learn the appropriate scales from the data. Throughout the training process, these parameters are adjusted based on the specific characteristics of the input sequence, such as frequency components and signal variations. The parameter $\beta_k$ allows each wavelet kernel to adapt its scale to focus on different frequency components, enabling the network to capture features on multiple scales. The global scaling factor $\gamma$ adjusts the overall scale across all wavelets, optimizing the ability to capture broad or fine-grained patterns in the data. This dynamic adaptation enables the network to respond more effectively to non-stationary signals, improving the extraction of time–frequency features.
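To make the adaptive scaling concrete, the sketch below shows one way the learnable wavelet kernels described above could be implemented in PyTorch. The paper does not specify the mother wavelet, so the Ricker (Mexican hat) wavelet used here is an assumption, as are the class name and the base kernel length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWaveletKernels(nn.Module):
    """Generates K wavelet kernels psi_k whose scales sigma_k = exp(beta_k) * exp(gamma)
    are learned jointly with the rest of the network."""

    def __init__(self, num_kernels: int, base_len: int = 64):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(num_kernels))   # per-kernel scale, initialized to zero
        self.gamma = nn.Parameter(torch.zeros(1))            # global scaling factor, initialized to zero
        self.register_buffer("t", torch.linspace(-1.0, 1.0, base_len))

    def forward(self, seq_len: int) -> torch.Tensor:
        sigma = torch.exp(self.beta) * torch.exp(self.gamma)             # (K,)
        u = self.t.unsqueeze(0) / sigma.unsqueeze(1)                      # scaled time, (K, base_len)
        psi = (1.0 - u ** 2) * torch.exp(-0.5 * u ** 2)                   # Ricker mother wavelet (assumption)
        psi = F.interpolate(psi.unsqueeze(1), size=seq_len,               # C_theta: linear interpolation
                            mode="linear", align_corners=False).squeeze(1)
        return psi / (psi.norm(dim=-1, keepdim=True) + 1e-8)              # Z_k: L2 energy normalization
```

The returned kernels of shape (K, seq_len) can then be correlated with the input series to obtain the localized time–frequency features.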

3.3. Multiscale Patch Processing

To effectively model complex temporal patterns and cross-variable interactions in multivariate time series, we propose a multiscale patch processing module consisting of a multiscale embedding and a unified encoder. The embedding divides the input sequences into patches of varying lengths to capture temporal dependencies at multiple resolutions, enabling the model to jointly learn short-term variations and long-term trends. The encoder applies temporal attention, channel attention, and cross-attention with timestamp-derived features to extract sequential, inter-variable, and external dependencies. This design allows the model to disentangle heterogeneous patterns across time and variables, improving the adaptability to real-world, multiscale forecasting scenarios.
The multiscale patching strategy divides the input time series into multiple sets of non-overlapping subsequences with varying lengths, enabling the model to capture patterns across short- and long-term temporal scales. Formally, we define a set of $M$ temporal scales $S = \{S_1, S_2, \ldots, S_M\}$, where each $S_i$ defines a specific patch size. Given an input sequence $X \in \mathbb{R}^{D \times L}$, each scale $S_i$ splits $X$ into $P = L / S_i$ patches $(X^1, X^2, \ldots, X^P)$, where $X_{S_i}^{j} \in \mathbb{R}^{D \times S_i}$. For each patch, a temporal embedding is applied along the time dimension, producing $X_t^{j} \in \mathbb{R}^{D \times d_m}$, where $d_m$ denotes the embedding dimension. This multi-resolution embedding enables robust feature extraction across different temporal granularities.
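A minimal sketch of this patching and embedding step is given below; the class name, the use of one linear projection per scale, and the handling of lengths that are not divisible by $S_i$ (dropping the remainder) are our assumptions.

```python
import torch
import torch.nn as nn

class MultiscalePatchEmbedding(nn.Module):
    """Splits X of shape (D, L) into non-overlapping patches for each scale S_i
    and embeds each patch along the time dimension into d_m dimensions."""

    def __init__(self, scales, d_model: int):
        super().__init__()
        self.scales = scales
        # one linear embedding per patch size S_i: R^{S_i} -> R^{d_m}
        self.embed = nn.ModuleList(nn.Linear(s, d_model) for s in scales)

    def forward(self, x: torch.Tensor):
        d, length = x.shape
        outputs = []
        for s, proj in zip(self.scales, self.embed):
            p = length // s                                  # number of patches at this scale
            patches = x[:, : p * s].reshape(d, p, s)         # (D, P, S_i), remainder dropped
            outputs.append(proj(patches))                    # (D, P, d_m)
        return outputs                                       # one tensor per scale
```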
After multiscale embedding, the resulting representations from each scale are passed to a corresponding encoder. Each encoder includes two main components: (1) a dual-attention block that combines temporal attention, capturing intra-series dependencies over time, and channel attention, modeling inter-variable correlations; and (2) a cross-attention module that incorporates timestamp-derived features, such as hour of day, day of week, or other periodic temporal patterns, to enhance the temporal context of learned representations. By integrating these three attention mechanisms, the encoder is able to learn comprehensive and time-aware representations, which are crucial for accurate and robust time series forecasting under diverse temporal conditions. The detailed processing flow of this module is illustrated in Figure 3.
To model temporal dependencies across different time resolutions, we apply attention between patches rather than within each patch. Each patch, constructed at a specific temporal scale, is treated as a compressed temporal unit that retains key local dynamics. This design reduces redundancy and focuses attention on higher-level temporal transitions across patches. Given a set of sub-sequences $\{X^1, X^2, \ldots, X^P\}$ partitioned with patch size $S_i$, we denote the embedded input as $X_t \in \mathbb{R}^{C \times P \times d_m}$, where $P$ is the number of patches and $d_m$ is the embedding dimension. A multi-head self-attention mechanism is then used to model inter-patch dependencies, enabling the model to capture both short- and long-range temporal patterns across multiple scales. The attention is computed by first initializing the query and key as follows:
$Q_{\mathrm{inter}}^{i} = W_Q^{i} X_t^{i}, \quad K_{\mathrm{inter}}^{i} = W_K^{i} X_t^{i}, \quad 1 \le i \le H,$
where $W_Q^{i}$ and $W_K^{i}$ are the learnable projection matrices for the query and key, respectively; the value $V_{\mathrm{inter}}^{i} = W_V^{i} X_t^{i}$ is obtained analogously with a learnable matrix $W_V^{i}$. The attention output for each head is computed as follows:
$\mathrm{Attn}_{\mathrm{inter}}^{i} = \mathrm{Softmax}\!\left(\frac{Q_{\mathrm{inter}}^{i} \left(K_{\mathrm{inter}}^{i}\right)^{\top}}{\sqrt{d_m}}\right) V_{\mathrm{inter}}^{i},$
where $\mathrm{Softmax}(\cdot)$ normalizes the attention weights. After applying inter-patch attention, each patch is updated based on its dependencies with other patches, and the resulting attention outputs of all heads are concatenated to form the final representation, denoted as $\mathrm{Attn}_{\mathrm{inter}} \in \mathbb{R}^{P \times d_m}$, i.e.,
$\mathrm{Attn}_{\mathrm{inter}} = \mathrm{Concat}\left(\mathrm{Attn}_{\mathrm{inter}}^{1}, \ldots, \mathrm{Attn}_{\mathrm{inter}}^{H}\right) W_O,$
where $W_O \in \mathbb{R}^{H \times d_h \times d_m}$ is the output projection, and $H$ is the number of attention heads. This temporal attention enables the model to effectively learn long-range temporal patterns across the entire sequence by focusing on relationships between compressed patch-level representations at multiple time scales.
To effectively model inter-variable dependencies in multivariate time series, we introduce a channel attention module that captures the importance of each feature dimension. Instead of treating all input channels equally, this mechanism adaptively reweights feature representations by learning channel-wise interactions, allowing the model to focus on the most informative variables while suppressing irrelevant or redundant ones. This design enhances the model's capacity to extract meaningful cross-feature relationships, which are essential for accurate and robust forecasting [24,30]. Given the input representation $X \in \mathbb{R}^{C \times P \times d_m}$, where $C$ is the number of channels, $P$ represents the number of patches, and $d_m$ is the embedding dimension, we first permute the tensor to reshape it to $X \in \mathbb{R}^{d_m \times P \times C}$. This reorganization allows us to focus on the feature dimension, making it easier to apply attention mechanisms along the channel axis. A linear projection is then applied along the channel dimension $C$ to compute the attention weights, which are used to refine the feature representations. The detailed processing steps are as follows:
$Q_c = W_q X, \quad K_c = W_k X, \quad V_c = W_v X,$
where $Q_c$, $K_c$, and $V_c$ denote the query, key, and value, respectively, and $W_q$, $W_k$, and $W_v$ represent the learnable parameter matrices of $Q_c$, $K_c$, and $V_c$, respectively. Then, we compute the attention weights of each channel as follows:
$\mathrm{Attn}_c(Q_c, K_c, V_c) = \mathrm{Softmax}\!\left(\frac{Q_c K_c^{\top}}{\sqrt{d_m}}\right) V_c.$
Finally, the attention-weighted output is combined with the input through a residual operation to produce the final output.
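The sketch below combines the temporal (inter-patch) and channel attention described above in a single block using PyTorch's built-in multi-head attention; treating channels as attention tokens after permutation and adding residual connections in both branches are simplifying assumptions on our part, not the paper's exact tensor layout.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Inter-patch (temporal) attention followed by channel attention,
    each with a residual connection."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.channel_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (C, P, d_m) -- channels, patches, embedding dimension
        # 1) temporal attention: each channel attends over its own patches
        t_out, _ = self.temporal_attn(x, x, x)
        x = x + t_out
        # 2) channel attention: permute so attention runs along the channel axis
        xc = x.permute(1, 0, 2)                       # (P, C, d_m): patches as batch, channels as tokens
        c_out, _ = self.channel_attn(xc, xc, xc)
        x = x + c_out.permute(1, 0, 2)                # residual in the original layout
        return x
```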
External time-related features provide essential prior knowledge that helps forecast models disambiguate similar input patterns that occur at different times. For example, behaviors at 8 a.m. and 8 p.m. may follow different dynamics, even if other features are similar. To incorporate this inductive bias, we extract temporal features such as weekday, date, hour, and minute from raw timestamps during preprocessing. These features are embedded alongside the original input sequence, allowing the model to learn position-aware and periodic patterns that are difficult to infer from the original signal alone. This design enhances the model’s temporal awareness, particularly under scenarios with recurring seasonal trends or irregular sampling intervals. The processing steps are as follows:
$\mathrm{Ex\_Embedding} = \mathrm{ValueEmbedding}\left(\mathrm{Concat}(X, X_{\mathrm{time}})\right),$
where $\mathrm{ValueEmbedding}(\cdot)$ denotes the embedding operation, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, $X \in \mathbb{R}^{C \times L}$ represents the original input, and $X_{\mathrm{time}} \in \mathbb{R}^{T \times L}$ represents the timestamp data. Instead of applying patch division directly on these external features, which could significantly increase computational overhead and introduce redundant noise, we adopt a more efficient strategy. Each external feature sequence is directly embedded as a corresponding variable token. These embedded tokens are integrated into the model alongside the primary input channels, enabling the attention mechanisms to jointly model both intrinsic dynamics and exogenous temporal patterns. This approach allows the model to better align periodic variations with the internal temporal structure of the data, improving its capacity to make temporal-aware predictions.
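A possible preprocessing routine for these timestamp-derived covariates is sketched below; the exact feature set and the rescaling to [-0.5, 0.5] are assumptions modeled on common time-feature encodings rather than details stated in the paper.

```python
import numpy as np
import pandas as pd

def timestamp_features(timestamps: pd.DatetimeIndex) -> np.ndarray:
    """Builds external time covariates (weekday, date, hour, minute),
    rescaled to roughly [-0.5, 0.5]."""
    feats = np.stack([
        timestamps.dayofweek / 6.0 - 0.5,        # weekday
        (timestamps.day - 1) / 30.0 - 0.5,       # date within the month
        timestamps.hour / 23.0 - 0.5,            # hour of day
        timestamps.minute / 59.0 - 0.5,          # minute of hour
    ])
    return feats                                  # shape (4, L): one covariate series per row
```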
After the above processing, the embedded data are represented as $\mathrm{Ex\_Embedding} \in \mathbb{R}^{(C+T) \times D}$, which we denote as $E \in \mathbb{R}^{(C+T) \times D}$ for simplicity. By embedding external data as variable-level tokens, cross-attention is employed to effectively aggregate temporal semantics from the external information. Specifically, the input time series is used as the query ($Q$), while the embedded external variables serve as the keys ($K$) and values ($V$), establishing correlations between the two modalities. This operation is formulated as follows:
$Q = X_S^{P} \in \mathbb{R}^{C \times D}, \quad K = V = E \in \mathbb{R}^{(C+T) \times D},$
where $X_S^{P}$ denotes the representation of the last temporal patch $P$, and $S$ refers to the patch size. The reason for using only the last patch rather than all patches is that the last patch, being the most recent, contains the highest information density and has the greatest impact on the final prediction. Unlike TimeXer [15], which introduces a learnable token to bridge endogenous temporal patches and exogenous variate-level representations via cross-attention, our approach maintains temporal locality by directly utilizing the most recent patch as the query. This design preserves the interpretability of the attention mechanism and aligns better with the causal structure inherent in time series forecasting.
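The following sketch illustrates this cross-attention, with the last-patch representation as the query and the external variable tokens as keys and values; the residual fusion and the use of PyTorch's nn.MultiheadAttention are our assumptions.

```python
import torch
import torch.nn as nn

class ExternalCrossAttention(nn.Module):
    """Cross-attention in which the most recent patch representation queries
    the variable-level tokens of the external (timestamp) embedding E."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, last_patch: torch.Tensor, external: torch.Tensor) -> torch.Tensor:
        # last_patch: (C, d_m)      representation X_S^P of the most recent patch
        # external:   (C + T, d_m)  embedded endogenous + timestamp variable tokens E
        q = last_patch.unsqueeze(0)        # add a batch dimension: (1, C, d_m)
        kv = external.unsqueeze(0)         # (1, C + T, d_m)
        fused, _ = self.attn(q, kv, kv)    # each channel token attends over all external tokens
        return last_patch + fused.squeeze(0)   # residual fusion, (C, d_m)
```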
The proposed approach effectively combines multiple attention mechanisms to enhance the ability to capture complex temporal dependencies and inter-feature relationships. By employing temporal attention, the model captures sequential dependencies across patches, allowing for the modeling of long-range temporal dynamics while mitigating noise from intra-patch processing. Channel attention further improves the capability to capture inter-feature interactions by focusing on channel-level dependencies, enhancing feature representation across multiple channels. Furthermore, cross-attention with external data integrates contextual temporal information, such as weekday, hour, and minute, to further enrich the understanding of periodic patterns and temporal semantics. By leveraging these multiscale attention strategies, the model gains a comprehensive and nuanced understanding of the time series data, improving both forecasting accuracy and robustness.

3.4. Multiscale Fusion

To integrate information across multiple temporal resolutions, we introduce a trainable global token $G \in \mathbb{R}^{1 \times C \times 1 \times d_m}$ that performs variable-wise, cross-scale aggregation. Unlike simple concatenation or pooling, this learnable token serves as a dynamic query to adaptively summarize multiscale representations based on the relevance to each variable. It captures global contextual information across different patch scales, enabling the model to weigh contributions from both short-term and long-term patterns.
The token is initialized from a standard normal distribution $\mathcal{N}(0, 1)$ and expanded along the batch dimension during training, yielding $G \in \mathbb{R}^{B \times C \times 1 \times d_m}$. Given the encoder outputs $F_i \in \mathbb{R}^{C \times P_i \times d_m}$ from each scale $i$, we concatenate $F_i$ and $G$ along the temporal axis. This design maintains positional relationships between scale-specific features and the global token, allowing the model to explicitly learn cross-scale attention patterns.
Compared to conventional fusion strategies such as averaging or hierarchical stacking, our global token-based approach provides a flexible and learnable mechanism to model fine-grained interactions across scales and variables, improving adaptability to varying temporal dynamics. Specifically, we have
$H = \mathrm{Concat}(F_1, \ldots, F_N, G) \in \mathbb{R}^{C \times \left(\sum_{i=1}^{N} P_i + 1\right) \times d_m},$
where $F_i$ represents the output at the $i$-th patch length. After that, we employ multi-head attention to compute the attention weights across different scales. The results of all attention heads are concatenated and then transformed back to the $d_m$ dimension by a linear operation, formulated as follows:
$O = W_A \, \mathrm{Concat}(A_1, A_2, \ldots, A_h),$
where $W_A$ denotes the linear output projection, and $A_i$ denotes the attention result of the $i$-th head.
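A compact sketch of this fusion step is shown below, assuming the encoder outputs are batched tensors and that attention is applied per variable by folding the batch and channel dimensions; these layout choices are illustrative rather than prescribed by the paper.

```python
import torch
import torch.nn as nn

class GlobalTokenFusion(nn.Module):
    """Concatenates the per-scale encoder outputs with a trainable global token
    and lets multi-head attention mix information across scales."""

    def __init__(self, n_channels: int, d_model: int, n_heads: int):
        super().__init__()
        self.global_token = nn.Parameter(torch.randn(1, n_channels, 1, d_model))  # N(0, 1) init
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, scale_outputs):
        # scale_outputs: list of per-scale encoder outputs, each of shape (B, C, P_i, d_m)
        b = scale_outputs[0].size(0)
        g = self.global_token.expand(b, -1, -1, -1)                         # (B, C, 1, d_m)
        h = torch.cat(list(scale_outputs) + [g], dim=2)                     # (B, C, sum_i P_i + 1, d_m)
        tokens = h.reshape(h.size(0) * h.size(1), h.size(2), h.size(3))     # fold batch and channel dims
        fused, _ = self.attn(tokens, tokens, tokens)                        # attention across scale tokens
        return fused.reshape(h.shape)                                       # back to (B, C, sum_i P_i + 1, d_m)
```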
For the final output representation $O \in \mathbb{R}^{C \times \left(\sum_{i=1}^{N} P_i + 1\right) \times d_m}$, a prediction head is applied to map it to the desired forecast window. To do this, we first flatten the last two dimensions to obtain a unified vector representation, enabling the model to jointly consider both patch-level and temporal information. This flattened representation is then passed through a multi-layer perceptron (MLP) for nonlinear transformation, followed by a dropout layer to prevent overfitting and enhance generalization. The process is defined as follows:
$\hat{Y} = \mathrm{Dropout}\left(\mathrm{MLP}\left(\mathrm{Flatten}(O)\right)\right), \quad \hat{Y} \in \mathbb{R}^{C \times \mathrm{pred\_len}}.$
The prediction module bridges the high-dimensional multiscale representation and the final forecasting objective, effectively distilling rich temporal and cross-feature information into accurate future predictions.
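A minimal prediction head consistent with the description above might look as follows; the hidden width and the GELU activation inside the MLP are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ForecastHead(nn.Module):
    """Flattens the fused multiscale representation and maps it to the
    forecast window with an MLP followed by dropout."""

    def __init__(self, n_tokens: int, d_model: int, pred_len: int, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(start_dim=-2),                   # (.., n_tokens, d_m) -> (.., n_tokens * d_m)
            nn.Linear(n_tokens * d_model, d_model),     # hidden width d_model is an assumption
            nn.GELU(),
            nn.Linear(d_model, pred_len),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        # o: (B, C, n_tokens, d_m) with n_tokens = sum_i P_i + 1
        return self.dropout(self.mlp(o))                # (B, C, pred_len)
```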

4. Experiments

4.1. Experimental Setup

We conduct extensive experiments to evaluate the performance of our approach on long-term forecasting, using seven real-world benchmark datasets: four well-known public time series datasets and three real-world workload datasets that we collected from vector databases. Furthermore, we conduct detailed comparisons with seven advanced baselines. The evaluation metrics are Mean Squared Error (MSE) and Mean Absolute Error (MAE), defined as follows:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$
$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|.$
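For completeness, the two metrics can be computed as in the straightforward NumPy sketch below, averaging over all variables and forecast horizons.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error averaged over all elements."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error averaged over all elements."""
    return float(np.mean(np.abs(y_true - y_pred)))
```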
We conduct comprehensive experiments on seven datasets, covering both widely used public benchmarks and real-world vector database workloads. Detailed statistics for these datasets are provided in Table 2. The public ETTh1, ETTh2, and ETTm2 [16] datasets from power-transformer temperature monitoring and the Weather [17] dataset feature relatively long periodicity and smooth temporal variations, making them appropriate for evaluating forecasting performance. In contrast, our three proprietary vector-database workload datasets (Dvdb, Mvdb, and Wvdb) exhibit pronounced dynamic behavior with frequent abrupt shifts and spike patterns, reflecting realistic, high-variability workload scenarios. This diversity ensures that MSP-EDA is rigorously tested not only on gently varying, well-studied series but also on challenging, non-stationary workloads, thereby validating its effectiveness across both stable and highly dynamic forecasting environments.
We compare MSP-EDA with seven state-of-the-art time series forecasting models, including three Transformer-based models (TimeXer [15], PatchTST [29], and FEDformer [18]), one CNN-based model (TimesNet [34]), and two MLP-based models (TimeMixer [36] and TiDE [32]). We also include HRTCP [47], a decomposition-then-prediction method that first applies signal decomposition before performing forecasting.

4.2. Overall Performance

As shown in Table 3, we provide a comprehensive comparison of eight forecasting models across multiple prediction horizons on seven datasets. For each method, the results are obtained by running the model three times and reporting the average performance to ensure robustness and fairness. The best results are highlighted in red, while the second-best results are underlined in blue. From the table, it is evident that our proposed method consistently outperforms other approaches under various settings, demonstrating superior accuracy and generalizability.
On the three real-world vector database datasets (Dvdb, Mvdb, and Wvdb), our model significantly outperforms strong baselines like TimeXer, achieving relative MSE improvements of 1.4%, 2.1%, and 1.9%, respectively. This validates its ability to capture complex temporal patterns that involve high-dimensional metrics and dynamic workloads. By combining multiscale temporal attention, channel-specific features, and external signal integration, our model handles noisy and highly dynamic scenarios effectively.
Moreover, MSP-EDA also performs competitively on public benchmarks, often outperforming or closely matching the best baselines. However, on ETTh2, the gain is limited due to weak inter-variable correlations. Models like PatchTST, assuming channel-wise independence, perform better in such cases by avoiding spurious dependencies. In contrast, our model’s explicit modeling of cross-variable interactions may introduce noise, slightly affecting accuracy.
In addition, we include a comparison with HRTCP, a decomposition-then-prediction method that explicitly separates frequency components before modeling. While this approach can effectively isolate signals of different frequencies, its ability to capture inter-variable dependencies in multivariate settings is limited. As a result, its overall performance lags behind our proposed model on most datasets. Nonetheless, with targeted enhancements, such decomposition-based approaches may hold greater potential in future multivariate forecasting scenarios.
Despite the overall strong performance, MSP-EDA exhibits certain limitations on specific datasets. On both the ETTh1 and Weather datasets, the model achieves very competitive results in terms of MSE, obtaining the best performance on ETTh1 and either the best or second-best on Weather. However, in both cases, the MAE scores are relatively less competitive. This indicates that while the model effectively reduces large errors, it is less sensitive to smaller deviations or local fluctuations. Such a pattern suggests a trade-off in the current design: optimizing for MSE emphasizes suppressing large errors, which may come at the expense of finer-grained accuracy.
As shown in Figure 4, the proposed model exhibits varying training times across different datasets, primarily due to differences in parameter configurations. The maximum training time is approximately two hours, while the longest testing time remains within one minute, which is considered acceptable for practical applications. Furthermore, since the test sets in real-world deployments are typically smaller, the actual testing time is expected to be even shorter. These results demonstrate the model’s feasibility and efficiency for deployment in real-time or resource-constrained environments.

4.3. Ablation Study

We conducted a comprehensive ablation study on three datasets: Dvdb, Mvdb, and Wvdb, to evaluate the individual contributions of key components in our model. The baseline model (denoted as origin) incorporates all modules, while six ablation variants are constructed by removing one component at a time: DFT-based transformation (dft), continuous wavelet transform (cwt), cross-attention (cross), temporal attention (time), channel attention (channel), and global token fusion (global). The results, reported in terms of mean squared error (MSE) with a prediction length of 96, are illustrated in Figure 5.
The findings show that disabling any single module leads to an increase in MSE across all datasets, confirming the effectiveness of each component. Among the variants, the origin model consistently achieves the best performance. Notably, the absence of the cwt and global modules results in the most substantial degradation, particularly on the Dvdb dataset. This highlights the importance of multiscale temporal feature extraction and the integration of global contextual information. The cross and time attention modules also contribute significantly, as their removal causes moderate increases in MSE, indicating their role in capturing temporal and inter-series dependencies. The impact of the channel attention module appears to be more dataset-dependent: while its removal has little effect on Dvdb and Mvdb, it leads to a noticeable degradation on Wvdb, suggesting that feature refinement across variables plays a more prominent role in that scenario.
Overall, the ablation study underscores the complementary nature of all the proposed modules. In particular, the incorporation of multiscale representations, global context, and attention mechanisms is critical to enhancing predictive accuracy and ensuring robust generalization across diverse multivariate time series datasets.
In addition, we further conduct a targeted study to evaluate the impact of wavelet design. This ablation study systematically evaluates the performance differences between fixed-scale and adaptive wavelet transforms in time series modeling. As illustrated in Figure 6, the adaptive wavelet transform, by dynamically selecting decomposition scales based on input data, demonstrates a clear advantage in capturing multiscale temporal features. This adaptivity leads to consistently better performance in terms of accuracy and robustness across all three workload datasets. Notably, the improvement is most significant on the Dvdb dataset, which exhibits pronounced non-stationarity and high-frequency fluctuations, allowing the adaptive mechanism to fully leverage its ability to capture variable-scale patterns. In comparison, the Mvdb and Wvdb datasets show relatively stable load patterns with limited fluctuations, which naturally constrains the potential performance gain. However, even in these more stationary scenarios, the adaptive approach still achieves marginal but consistent improvements over the fixed-scale variant. These results highlight the effectiveness and generalizability of adaptive wavelet mechanisms in enhancing model performance, especially under dynamic and heterogeneous workload conditions.

4.4. Hyperparameter Analysis

We evaluate the effect of two hyperparameters, the multiscale patch combinations and the attention dimension d_model, on the Dvdb, Mvdb, and Wvdb datasets. The prediction length is 96. The results are shown in Figure 7. Experimental results demonstrate that multiscale input combinations significantly enhance prediction performance compared to single-scale input. On the Dvdb and Wvdb datasets, finer-grained scales yield better results. For example, in single-scale settings, increasing the patch length from 4 to 8 and then to 16 allows the model to capture longer-term dependencies but leads to a reduced ability to model local patterns, resulting in increased prediction errors. Moreover, incorporating excessively long scales in multi-scale combinations may even degrade performance. This is because Dvdb and Wvdb exhibit high fluctuations under heavy workloads, making long-range features more susceptible to noise and less effective. In contrast, the Mvdb dataset demonstrates more stable behavior, maintaining strong performance even with larger-scale inputs.
This experiment investigates the impact of the model scale on forecasting performance by varying the hidden dimension size (d_model) within the range of 64 to 1024. The results are shown in Figure 8. The results demonstrate that increasing d_model generally enhances the learning capacity, leading to lower prediction errors across most datasets. However, this improvement comes with a trade-off: as d_model grows, the model incurs significantly higher computational and memory costs, raising concerns about scalability and efficiency in practical deployments. Specifically, while performance improves consistently from 64 to 128, a further increase to 256 causes a noticeable drop in accuracy on the Mvdb dataset, likely due to feature redundancy introduced by multiscale extraction, which can cause the model to overfit noisy or irrelevant patterns. Moreover, when d_model reaches 1024, both the Dvdb and Wvdb datasets exhibit degraded performance, suggesting that excessive model capacity can hinder generalization by capturing spurious fluctuations rather than meaningful trends. Although the Mvdb dataset still benefits from a larger model size, the marginal performance gains diminish as d_model increases, highlighting the need to balance model expressiveness with computational efficiency.

4.5. Visualization

To qualitatively evaluate the forecasting capabilities of different models, we visualize the final dimension of the prediction results on a representative subset of the Wvdb dataset, as shown in Figure 9. Compared with other methods, MSP-EDA demonstrates a clearer ability to follow the overall trend. Although minor fluctuations may not be accurately captured, MSP-EDA successfully identifies the critical downward trajectory, which is essential for accurate decision-making in time series forecasting [48]. In addition to the qualitative comparison, Figure 10 presents a quantitative evaluation of all models using MSE and MAE. As depicted, MSP-EDA consistently achieves the lowest error across both metrics, clearly validating its superior performance over existing baselines.

5. Conclusions

To address the challenges of capturing multiscale temporal dependencies and effectively integrating external data in time series forecasting, we propose a novel forecasting framework based on multiscale patch decomposition and external data augmentation. The model first applies the discrete Fourier transform (DFT) and adaptive multiscale continuous wavelet transform (CWT) to extract dominant periodic components and scale-sensitive local features from the input sequence. A multiscale encoder is then employed to capture temporal patterns at multiple resolutions, with each encoder comprising temporal attention, channel attention, and cross-attention mechanisms to model intra-scale dynamics, inter-variable dependencies, and the influence of external variables. To enhance global context understanding, a learnable global token is incorporated and fused with the multiscale representations. Experimental results across multiple real-world benchmarks demonstrate that the proposed method consistently achieves superior performance compared to existing approaches.
Although the proposed model achieves strong forecasting performance, its relatively complex architecture can hinder deployment in real-time or resource-constrained environments. Moreover, the integration of multiple modules and attention mechanisms reduces interpretability, making it difficult to trace and explain prediction outcomes. In future work, we will explore simplifying the model through techniques such as model pruning and knowledge distillation to reduce computational complexity without sacrificing accuracy. To enhance interpretability, we plan to incorporate attention visualization, feature attribution approaches like SHAP or Integrated Gradients, and modular analysis to better understand how the model makes forecasting decisions.

Author Contributions

Conceptualization, S.P. and H.L.; methodology, S.P. and H.L.; software, S.P.; validation, S.P.; formal analysis, S.P., W.S., P.C. and H.L.; investigation, S.P.; resources, S.P. and H.L.; data curation, S.P.; writing—original draft preparation, S.P.; writing—review and editing, S.P., W.S., P.C., H.X., D.M., M.C., Y.W. and H.L.; visualization, S.P.; supervision, H.L.; project administration, S.P., W.S., P.C., H.X., D.M., M.C., Y.W. and H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 61562010), National Key Research and Development Program of China (No. 2023YFC3341205), Guizhou Provincial Major Scientific and Technological Program (No. [2024]003), Guizhou Provincial Program on Commercialization of Scientific and Technological Achievements (No. [2023]010, No. [2025]008), Research Projects of the Science and Technology Plan of Guizhou Province (No. [2023]276, No. [2022]261, No. [2022]271).

Data Availability Statement

The official access links for all publicly available datasets are provided in the manuscript. At the same time, all data and codes related to this study are available at https://github.com/ACMISLab/MSP-EDA (accessed on 22 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft Comput. 2020, 90, 106181. [Google Scholar] [CrossRef]
  2. Lin, Y.; Wan, H.; Guo, S.; Lin, Y. Pre-training context and time aware location embeddings from spatial-temporal trajectories for user next location prediction. Proc. AAAI Conf. Artif. Intell. 2021, 35, 4241–4248. [Google Scholar] [CrossRef]
  3. Sun, C.; Ning, Y.; Shen, D.; Nie, T. Graph Neural Network-Based Short-Term Load Forecasting with Temporal Convolution. Data Sci. Eng. 2024, 9, 113–132. [Google Scholar] [CrossRef]
  4. Miao, X.; Wu, Y.; Wang, J.; Gao, Y.; Mao, X.; Yin, J. Generative semi-supervised learning for multivariate time series imputation. Proc. AAAI Conf. Artif. Intell. 2021, 35, 8983–8991. [Google Scholar] [CrossRef]
  5. Wang, Y.; Huang, N.; Li, T.; Yan, Y.; Zhang, X. Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series Classification. Adv. Neural Inf. Process. Syst. 2024, 37, 36314–36341. [Google Scholar]
  6. Pan, Z.; Wang, Y.; Zhang, Y.; Yang, S.B.; Cheng, Y.; Chen, P.; Guo, C.; Wen, Q.; Tian, X.; Dou, Y.; et al. MagicScaler: Uncertainty-Aware, Predictive Autoscaling. Proc. VLDB Endow. 2023, 16, 3808–3821. [Google Scholar] [CrossRef]
  7. Wang, S.; Chu, Z.; Sun, Y.; Liu, Y.; Guo, Y.; Chen, Y.; Jian, H.; Ma, L.; Lu, X.; Zhou, J. Multiscale Representation Enhanced Temporal Flow Fusion Model for Long-Term Workload Forecasting. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, New York, NY, USA, 21–25 October 2024; pp. 4948–4956. [Google Scholar] [CrossRef]
  8. Zhao, F.; Lin, W.; Lin, S.; Zhong, H.; Li, K. TFEGRU: Time-Frequency Enhanced Gated Recurrent Unit with Attention for Cloud Workload Prediction. IEEE Trans. Serv. Comput. 2024, 18, 467–478. [Google Scholar] [CrossRef]
  9. Gao, Y.; Huang, X.; Zhou, X.; Gao, X.; Li, G.; Chen, G. DBAugur: An Adversarial-based Trend Forecasting System for Diversified Workloads. In Proceedings of the 39th IEEE International Conference on Data Engineering, Anaheim, CA, USA, 3–7 April 2023; pp. 27–39. [Google Scholar] [CrossRef]
  10. Guo, Y.; Ge, J.; Guo, P.; Chai, Y.; Li, T.; Shi, M.; Tu, Y.; Ouyang, J. Pass: Predictive auto-scaling system for large-scale enterprise web applications. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 2747–2758. [Google Scholar] [CrossRef]
  11. Kilian, L.; Lütkepohl, H. Structural Vector Autoregressive Analysis; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar] [CrossRef]
  12. Lin, S.; Lin, W.; Wu, W.; Zhao, F.; Mo, R.; Zhang, H. SegRNN: Segment Recurrent Neural Network for Long-Term Time Series Forecasting. arXiv 2023, arXiv:2308.11200. Available online: https://arxiv.org/abs/2308.11200 (accessed on 22 May 2025).
  13. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. Adv. Neural Inf. Process. Syst. 2022, 35, 5816–5828. [Google Scholar]
  14. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=JePfAI8fah (accessed on 22 May 2025).
  15. Wang, Y.; Wu, H.; Dong, J.; Qin, G.; Zhang, H.; Liu, Y.; Qiu, Y.; Wang, J.; Long, M. TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. Adv. Neural Inf. Process. Syst. 2024, 37, 469–498. [Google Scholar]
  16. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
  17. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  18. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the 39th International Conference on Machine Learning, 17–23 July 2022; Volume 162, pp. 27268–27286. Available online: https://proceedings.mlr.press/v162/zhou22g/zhou22g.pdf (accessed on 22 May 2025).
  19. Bandara, K.; Hyndman, R.J.; Bergmeir, C. MSTL: A seasonal-trend decomposition algorithm for time series with multiple seasonal patterns. Int. J. Oper. Res. 2025, 52, 79–98. [Google Scholar] [CrossRef]
  20. Wen, Q.; Zhang, Z.; Li, Y.; Sun, L. Fast RobustSTL: Efficient and Robust Seasonal-Trend Decomposition for Time Series with Complex Patterns. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 6–10 June 2020; pp. 2203–2213. [Google Scholar] [CrossRef]
  21. Fan, W.; Yi, K.; Ye, H.; Ning, Z.; Zhang, Q.; An, N. Deep frequency derivative learning for non-stationary time series forecasting. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 3944–3952. [Google Scholar] [CrossRef]
  22. Xu, Z.; Zeng, A.; Xu, Q. FITS: Modeling Time Series with $10k$ Parameters. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=bWcnvZ3qMb (accessed on 22 May 2025).
  23. Ye, W.; Deng, S.; Zou, Q.; Gui, N. Frequency Adaptive Normalization For Non-stationary Time Series Forecasting. Adv. Neural Inf. Process. Syst. 2024, 37, 31350–31379. [Google Scholar]
  24. Qiu, X.; Hu, J.; Zhou, L.; Wu, X.; Du, J.; Zhang, B.; Guo, C.; Zhou, A.; Jensen, C.S.; Sheng, Z.; et al. TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods. Proc. VLDB Endow. 2024, 17, 2363–2377. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=vSVLM2j9eie (accessed on 22 May 2025).
  26. Zhong, S.; Song, S.; Zhuo, W.; Li, G.; Liu, Y.; Chan, S.H.G. A Multi-Scale Decomposition MLP-Mixer for Time Series Analysis. Proc. VLDB Endow. 2024, 17, 1723–1736. [Google Scholar] [CrossRef]
  27. Chen, P.; Zhang, Y.; Cheng, Y.; Shu, Y.; Wang, Y.; Wen, Q.; Yang, B.; Guo, C. Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=lJkOCMP2aW (accessed on 22 May 2025).
  28. Lee, S.; Park, T.; Lee, K. Learning to Embed Time Series Patches Independently. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=WS7GuBDFa2 (accessed on 22 May 2025).
  29. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=Jbdc0vTOcol (accessed on 22 May 2025).
  30. Cai, W.; Liang, Y.; Liu, X.; Feng, J.; Wu, Y. MSGNet: Learning Multi-Scale Inter-Series Correlations for Multivariate Time Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2024, 38, 11141–11149. [Google Scholar] [CrossRef]
  31. Williams, B.M. Multivariate vehicular traffic flow prediction: Evaluation of ARIMAX modeling. Transp. Res. Rec. 2001, 1776, 194–200. [Google Scholar] [CrossRef]
  32. Das, A.; Kong, W.; Leach, A.; Mathur, S.K.; Sen, R.; Yu, R. Long-term Forecasting with TiDE: Time-series Dense Encoder. Trans. Mach. Learn. Res. 2023. Available online: https://openreview.net/forum?id=pCbC3aQB5W (accessed on 22 May 2025).
33. Shumway, R.H.; Stoffer, D.S. ARIMA models. In Time Series Analysis and Its Applications: With R Examples; Springer: Cham, Switzerland, 2017; pp. 75–163. [Google Scholar] [CrossRef]
  34. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=ju_Uqw384Oq (accessed on 22 May 2025).
  35. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
36. Wang, S.; Wu, H.; Shi, X.; Hu, T.; Luo, H.; Ma, L.; Zhang, J.Y.; Zhou, J. TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=7oLshfEIC2 (accessed on 22 May 2025).
  37. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? Proc. AAAI Conf. Artif. Intell. 2023, 37, 11121–11128. [Google Scholar] [CrossRef]
  38. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4794–4804. [Google Scholar] [CrossRef]
  39. Wang, J.; Wu, Z.; Ouyang, W.; Han, X.; Chen, J.; Jiang, Y.G.; Li, S.N. M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval, New York, NY, USA, 27–30 June 2022; pp. 615–623. [Google Scholar] [CrossRef]
  40. Challu, C.; Olivares, K.G.; Oreshkin, B.N.; Ramirez, F.G.; Canseco, M.M.; Dubrawski, A. NHITS: Neural Hierarchical Interpolation for Time Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2023, 37, 6989–6997. [Google Scholar] [CrossRef]
41. Shabani, M.A.; Abdi, A.H.; Meng, L.; Sylvain, T. Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=sCrnllCtjoE (accessed on 22 May 2025).
  42. Zhang, Y.; Ma, L.; Pal, S.; Zhang, Y.; Coates, M. Multi-resolution Time-Series Transformer for Long-term Forecasting. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 2–4 May 2024; Volume 238, pp. 4222–4230. Available online: https://proceedings.mlr.press/v238/zhang24l/zhang24l.pdf (accessed on 22 May 2025).
  43. Arunraj, N.S.; Ahrens, D.; Fernandes, M. Application of SARIMAX model to forecast daily sales in food retail industry. Int. J. Oper. Res. Inf. Syst. 2016, 7, 1–21. [Google Scholar] [CrossRef]
  44. Lim, B.; Arık, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  45. Olivares, K.G.; Challu, C.; Marcjasz, G.; Weron, R.; Dubrawski, A. Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with NBEATSx. Int. J. Forecast. 2023, 39, 884–900. [Google Scholar] [CrossRef]
  46. Wang, H.; Peng, J.; Huang, F.; Wang, J.; Chen, J.; Xiao, Y. MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=zt53IDUR1U (accessed on 22 May 2025).
  47. Su, Y.; Tan, M.; Teh, J. Short-Term Transmission Capacity Prediction of Hybrid Renewable Energy Systems Considering Dynamic Line Rating Based on Data-Driven Model. IEEE Trans. Ind. Appl. 2025, 61, 2410–2420. [Google Scholar] [CrossRef]
  48. Kim, J.; Kim, H.; Kim, H.; Lee, D.; Yoon, S. A Comprehensive Survey of Time Series Forecasting: Architectural Diversity and Open Challenges. arXiv 2024, arXiv:2411.05793. Available online: https://arxiv.org/abs/2411.05793 (accessed on 22 May 2025).
Figure 1. Example of multiscale time series data processing. Multivariate time series at three different segmentation scales are represented by distinct background colors. Each region shows two variables; red ellipses denote strong inter-variable correlations, and black arrows indicate temporal dependencies across time points.
Figure 2. The architecture of MSP-EDA, which consists of three components: data pattern enhancement (left), multiscale patch processing (including multiscale embedding and multi-layer encoders), and multiscale feature fusion.
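For readers who want a concrete picture of the multiscale patching step in Figure 2, the following is a minimal PyTorch sketch rather than the authors' released code: it assumes illustrative patch lengths of 4, 8, and 16 and a shared model dimension, and simply segments each variable into non-overlapping patches at every scale before a per-scale linear projection.

```python
import torch
import torch.nn as nn


class MultiscalePatchEmbedding(nn.Module):
    """Segment each variable into non-overlapping patches at several scales and
    project every patch to a shared model dimension."""

    def __init__(self, patch_lens=(4, 8, 16), d_model=128):
        super().__init__()
        self.patch_lens = patch_lens
        self.projs = nn.ModuleList(nn.Linear(p, d_model) for p in patch_lens)

    def forward(self, x):
        # x: (batch, n_vars, seq_len)
        tokens_per_scale = []
        for p, proj in zip(self.patch_lens, self.projs):
            n_patches = x.shape[-1] // p
            patches = x[..., : n_patches * p].reshape(*x.shape[:-1], n_patches, p)
            tokens_per_scale.append(proj(patches))  # (batch, n_vars, n_patches, d_model)
        return tokens_per_scale


embed = MultiscalePatchEmbedding()
scales = embed(torch.randn(2, 7, 96))   # a 96-step look-back window
print([s.shape for s in scales])        # one token sequence per scale
```

Under these assumed patch lengths, a 96-step window yields 24, 12, and 6 patch tokens per variable, one token sequence for each scale-specific encoder.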
Figure 3. The structure of the encoder module, which consists of three attention mechanisms: temporal attention for modeling intra-series temporal dependencies, channel attention for capturing inter-variable correlations, and cross attention for integrating external data.
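As an illustration of how the three attention types in Figure 3 can be composed, the sketch below stacks temporal, channel, and cross-attention using standard nn.MultiheadAttention layers. The layer order, residual and normalization placement, and tensor shapes are assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn


class TriAttentionEncoderLayer(nn.Module):
    """One encoder layer: temporal attention within each variable, channel
    attention across variables, and cross-attention to external-data tokens."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.channel_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, ext):
        # x: (batch, n_vars, n_patches, d_model); ext: (batch, n_ext_tokens, d_model)
        b, v, n, d = x.shape

        # 1) Temporal attention: each variable attends over its own patch tokens.
        t = x.reshape(b * v, n, d)
        t = self.norm1(t + self.temporal_attn(t, t, t)[0])

        # 2) Channel attention: at each patch position, variables attend to each other.
        c = t.reshape(b, v, n, d).permute(0, 2, 1, 3).reshape(b * n, v, d)
        c = self.norm2(c + self.channel_attn(c, c, c)[0])

        # 3) Cross-attention: patch tokens query the external-data tokens.
        q = c.reshape(b, n, v, d).permute(0, 2, 1, 3).reshape(b * v, n, d)
        e = ext.repeat_interleave(v, dim=0)  # align external tokens with (b * v, ...)
        out = self.norm3(q + self.cross_attn(q, e, e)[0])
        return out.reshape(b, v, n, d)


layer = TriAttentionEncoderLayer()
y = layer(torch.randn(2, 7, 12, 128), torch.randn(2, 5, 128))
print(y.shape)  # torch.Size([2, 7, 12, 128])
```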
Figure 4. Training and testing time cost statistics for each dataset.
Figure 5. Ablation studies on the Dvdb, Mvdb, and Wvdb datasets to evaluate the individual contributions of different modules in MSP-EDA.
Figure 6. Comparison of fixed-scale and adaptive-scale wavelet transforms in forecasting performance (MSE) across various forecasting lengths. Here, [4, 8, 16], [4, 8], and [8, 16] indicate the different fixed scales used for the wavelet transforms.
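The adaptive-scale variant compared in Figure 6 presupposes some way of choosing wavelet scales from the data itself. One plausible sketch, shown below, reads the dominant periods off the DFT amplitude spectrum and reuses them as candidate scales; the helper name adaptive_scales, the top-k cutoff, and the amplitude threshold are illustrative assumptions, not the paper's procedure.

```python
import numpy as np


def adaptive_scales(x: np.ndarray, k: int = 3) -> list[int]:
    """Return the dominant periods of a 1-D series as candidate wavelet scales."""
    amp = np.abs(np.fft.rfft(x - x.mean()))
    amp[0] = 0.0                                  # drop the residual DC component
    top = np.argsort(amp)[::-1][:k]               # k strongest frequency bins
    top = [f for f in top if f > 0 and amp[f] > 0.1 * amp.max()]
    periods = {len(x) // int(f) for f in top}     # convert frequency bins to periods
    return sorted(p for p in periods if p > 1)


t = np.arange(96)
series = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)
print(adaptive_scales(series))  # -> [8, 24] for this synthetic series
```

A scheme of this kind would let the scale set track each dataset's own periodic structure instead of committing to a fixed grid such as [4, 8, 16].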
Figure 7. Evaluation of different patch length combinations on the Dvdb, Mvdb, and Wvdb datasets.
Figure 8. Impact of attention dimension d_model on forecasting performance in terms of MSE on Mvdb, Wvdb, and Dvdb.
Figure 9. Prediction cases from the Wvdb dataset produced by different models under the input-96-predict-96 setting.
Figure 10. MSE and MAE results for the instances in Figure 9.
Table 1. Comparison of time series forecasting methods with respect to the incorporation of key components.
Method | Decomposition | Multiscale | External Data
FEDformer [18] | ✓ | × | ×
MICN [46] | ✓ | ✓ | ×
TiDE [32] | × | × | ✓
MSGNet [30] | ✓ | ✓ | ×
MSD-Mixer [26] | ✓ | ✓ | ×
Pathformer [27] | ✓ | ✓ | ×
TimeXer [15] | × | × | ✓
MSP-EDA (This work) | ✓ | ✓ | ✓
Table 2. Statistics of time series datasets in the experiments.
Dataset | Dimension | Timestamps | Granularity
ETTh1 | 7 | 17,420 | 1 h
ETTh2 | 7 | 17,420 | 1 h
ETTm2 | 7 | 69,680 | 15 min
Weather | 21 | 52,696 | 10 min
Dvdb | 22 | 28,278 | 1 min
Mvdb | 18 | 21,570 | 1 min
Wvdb | 15 | 18,490 | 1 min
Table 3. Long-term forecasting results on seven benchmark datasets at different prediction lengths (96/192/336/720). The bold red and blue underlined fonts indicate the best and second-best results in each setting.
Dataset | Length | MSP-EDA MSE/MAE | TimeXer MSE/MAE | TimeMixer MSE/MAE | HRTCP MSE/MAE | TimesNet MSE/MAE | PatchTST MSE/MAE | TiDE MSE/MAE | FEDformer MSE/MAE
ETTh1 | 96 | 0.3771/0.4034 | 0.3873/0.4057 | 0.3865/0.4011 | 0.3883/0.4032 | 0.3931/0.4145 | 0.3916/0.4103 | 0.3888/0.3968 | 0.3801/0.4196
ETTh1 | 192 | 0.4277/0.4332 | 0.4399/0.4278 | 0.4425/0.4323 | 0.5213/0.4873 | 0.5483/0.4958 | 0.4371/0.4298 | 0.4401/0.4260 | 0.4694/0.4693
ETTh1 | 336 | 0.4655/0.4565 | 0.4798/0.4483 | 0.4768/0.4556 | 0.5533/0.5013 | 0.5564/0.5062 | 0.4784/0.4524 | 0.4834/0.4486 | 0.5231/0.4908
ETTh1 | 720 | 0.4792/0.4803 | 0.4876/0.4747 | 0.5334/0.4920 | 0.5910/0.5325 | 0.7257/0.6046 | 0.4770/0.4753 | 0.4861/0.4735 | 0.5365/0.5153
ETTh2 | 96 | 0.2910/0.3403 | 0.2952/0.3425 | 0.3001/0.3519 | 0.2954/0.3438 | 0.3293/0.3695 | 0.2940/0.3509 | 0.2912/0.3404 | 0.3470/0.3894
ETTh2 | 192 | 0.3745/0.4013 | 0.3703/0.3930 | 0.3768/0.3964 | 0.3937/0.4008 | 0.4657/0.4516 | 0.3685/0.3937 | 0.3765/0.3919 | 0.4154/0.4251
ETTh2 | 336 | 0.4171/0.4326 | 0.4228/0.4340 | 0.4438/0.4450 | 0.4445/0.4467 | 0.5046/0.4852 | 0.4158/0.4298 | 0.4221/0.4301 | 0.4624/0.4697
ETTh2 | 720 | 0.4382/0.4504 | 0.4474/0.4553 | 0.4482/0.4584 | 0.4613/0.4699 | 0.4925/0.4842 | 0.4272/0.4483 | 0.4232/0.4417 | 0.4819/0.4866
ETTm2 | 96 | 0.1730/0.2597 | 0.1759/0.2579 | 0.1754/0.2601 | 0.1801/0.2613 | 0.1895/0.2686 | 0.1833/0.2671 | 0.1822/0.2650 | 0.2006/0.2849
ETTm2 | 192 | 0.2383/0.3016 | 0.2364/0.2999 | 0.2448/0.3051 | 0.2480/0.3080 | 0.2547/0.3110 | 0.2511/0.3107 | 0.2487/0.3057 | 0.2639/0.3221
ETTm2 | 336 | 0.2976/0.3395 | 0.2952/0.3385 | 0.3023/0.3475 | 0.3155/0.3602 | 0.3150/0.3492 | 0.3126/0.3487 | 0.3075/0.3430 | 0.3278/0.3628
ETTm2 | 720 | 0.3998/0.3999 | 0.3965/0.3980 | 0.3951/0.4013 | 0.4218/0.4179 | 0.4211/0.4080 | 0.4124/0.4038 | 0.4077/0.3984 | 0.4169/0.4132
Weather | 96 | 0.1609/0.2089 | 0.1576/0.2049 | 0.1618/0.2088 | 0.1732/0.2086 | 0.1694/0.2187 | 0.1709/0.2122 | 0.1930/0.2343 | 0.2108/0.2900
Weather | 192 | 0.2105/0.2535 | 0.2115/0.2531 | 0.2070/0.2513 | 0.2211/0.2532 | 0.2390/0.2779 | 0.2287/0.2617 | 0.2400/0.2700 | 0.2640/0.3169
Weather | 336 | 0.2645/0.2922 | 0.2668/0.2931 | 0.2911/0.3095 | 0.2968/0.3156 | 0.2897/0.3110 | 0.2832/0.3003 | 0.2919/0.3062 | 0.3144/0.3470
Weather | 720 | 0.3436/0.3438 | 0.3436/0.3433 | 0.3430/0.3439 | 0.3781/0.3724 | 0.3590/0.3539 | 0.3592/0.3497 | 0.3652/0.3540 | 0.3815/0.3828
Dvdb | 96 | 0.9751/0.7130 | 0.9889/0.7137 | 0.9817/0.7154 | 0.9892/0.7194 | 1.0789/0.7546 | 0.9924/0.7214 | 1.047/0.7455 | 1.0726/0.7686
Dvdb | 192 | 1.0007/0.7273 | 1.0191/0.7389 | 1.0080/0.7281 | 1.0273/0.7405 | 1.0808/0.7535 | 1.0359/0.7462 | 1.1658/0.7963 | 1.0964/0.7752
Dvdb | 336 | 1.0214/0.7365 | 1.0382/0.7412 | 1.0282/0.7339 | 1.0521/0.7453 | 1.0968/0.7598 | 1.0587/0.7566 | 1.1893/0.8058 | 1.1154/0.7833
Dvdb | 720 | 1.0360/0.7454 | 1.0490/0.7493 | 1.0362/0.7463 | 1.0913/0.7707 | 1.1034/0.7623 | 1.0768/0.7650 | 1.1983/0.8104 | 1.1219/0.7855
Mvdb | 96 | 0.2173/0.2968 | 0.2219/0.2978 | 0.2342/0.3060 | 0.2312/0.2997 | 0.2474/0.3101 | 0.2357/0.3008 | 0.2564/0.3131 | 0.2448/0.3143
Mvdb | 192 | 0.2391/0.3114 | 0.2403/0.3146 | 0.2781/0.3349 | 0.2638/0.3199 | 0.2547/0.3227 | 0.2676/0.3313 | 0.3072/0.3512 | 0.2736/0.3366
Mvdb | 336 | 0.2729/0.3465 | 0.2774/0.3452 | 0.3286/0.3694 | 0.3256/0.3688 | 0.2854/0.3490 | 0.3145/0.3652 | 0.3540/0.3837 | 0.3227/0.3724
Mvdb | 720 | 0.3461/0.3956 | 0.3752/0.4066 | 0.4479/0.4351 | 0.4503/0.4337 | 0.3767/0.4107 | 0.4271/0.4299 | 0.4659/0.4446 | 0.4364/0.4380
Wvdb | 96 | 0.9007/0.5913 | 0.9178/0.5978 | 0.9192/0.5989 | 0.9213/0.6007 | 0.9391/0.6041 | 0.9095/0.5974 | 0.9593/0.6141 | 0.9419/0.6128
Wvdb | 192 | 0.9111/0.6055 | 0.9196/0.6089 | 0.9509/0.6245 | 0.9324/0.6096 | 0.9347/0.6161 | 0.9239/0.6157 | 1.0970/0.6657 | 0.9386/0.6225
Wvdb | 336 | 0.9239/0.6274 | 0.9330/0.6288 | 0.9833/0.6530 | 0.9617/0.6458 | 0.9527/0.6363 | 0.9445/0.6392 | 1.1218/0.6902 | 0.9562/0.6450
Wvdb | 720 | 1.0233/0.6908 | 1.1054/0.7031 | 1.0723/0.7097 | 1.0617/0.7113 | 1.1270/0.7189 | 1.1249/0.7086 | 1.1995/0.7461 | 1.0371/0.7021