Article

FCP-Former: Enhancing Long-Term Multivariate Time Series Forecasting with Frequency Compensation

by
Ming Li
,
Muyu Yang
*,
Shaolong Chen
,
Huangyongxiang Li
,
Gaosong Xing
and
Shuting Li
School of Computer Science and Technology/School of Artificial Intelligence, China University of Mining and Technology, Xuzhou 221116, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5646; https://doi.org/10.3390/s25185646
Submission received: 22 July 2025 / Revised: 1 September 2025 / Accepted: 9 September 2025 / Published: 10 September 2025
(This article belongs to the Section Physical Sensors)

Abstract

Long-term multivariate time series forecasting is crucial for real-world applications, including energy consumption, traffic flow, healthcare, and finance. Traditionally, statistical approaches have been used to predict future observations from historical temporal data. Recently, transformer-based models with patch mechanisms have demonstrated significant potential in enhancing computational efficiency. However, their inability to fully capture intra-patch temporal dependencies often limits the accuracy of predictions. To address this issue, we propose the Frequency Compensation Patch-wise transFormer (FCP-Former), which integrates a frequency compensation layer into the patching mechanism. This layer extracts frequency-domain features via the Fast Fourier Transform (FFT) and incorporates them into patched data, thereby enriching patch representations and mitigating intra-patch information loss. To verify the feasibility of the model, FCP-Former was evaluated on eight benchmark datasets, implemented in PyTorch 2.4.0 and trained on an NVIDIA RTX 4090 GPU (Santa Clara, CA, USA). Results demonstrate that FCP-Former achieves 48 optimal and 17 suboptimal experiment results across all datasets. In particular, on the ETTh1 dataset it achieves an average MSE of 0.437 and an average MAE of 0.430, while on the Electricity dataset it achieves an average MSE of 0.186 and an average MAE of 0.277. This demonstrates that FCP-Former has better forecasting accuracy and a superior ability to capture periodic and trend patterns.

1. Introduction

Time series forecasting constitutes a statistical approach aimed at predicting future observations based on historical temporal data. This methodology has demonstrated extensive applicability across a broad spectrum of domains, including but not limited to meteorology [1,2,3], healthcare analytics [4,5,6,7], intelligent transportation systems [8,9,10,11], electrical load forecasting [12,13,14,15], financial risk assessment [16,17,18,19], and Earth sciences [20,21]. In recent years, recurrent neural network (RNN)-based architectures have been extensively employed for modeling time series data due to their capacity to learn temporal dependencies [22,23]. While these methods have yielded considerable empirical success [24,25], they are inherently constrained by several limitations, most notably the issues of vanishing and exploding gradients. These challenges significantly hinder the ability of RNNs to effectively model long-range dependencies within sequential data, thereby limiting their performance in scenarios requiring long-term forecasting accuracy.
After achieving great success in computer vision [26,27,28,29] and natural language processing [30,31,32,33], the Transformer [34] model was introduced to time series forecasting to directly model the relationships between any two time steps in a sequence. Due to its powerful attention mechanism, the transformer overcomes the gradient vanishing and gradient exploding problems that still trouble RNN and LSTM (Long Short-Term Memory)-type methods, making it a popular research topic in the field of time series forecasting [35,36].
Based on the token granularity fed into the attention mechanism in the time domain, existing Transformer-based research can be roughly divided into patch-wise models and point-wise models. Point-wise models treat each time step and its corresponding variates as a token, which gives them a stronger ability to capture internal temporal variations. Typical point-wise models include FEDformer [35], Informer [37] and Autoformer [38]. However, due to their high computational complexity, it is challenging for these models to capture long-term dependencies between time series data. In contrast, for patch-wise models, a patch is a basic module formed by concatenating multiple temporally contiguous time series data points. This enables the model to treat a patch as a token instead of treating each timestep as a token, significantly reducing the computational time. Based on different treatments of the variates, patch-wise models can be further divided into channel-independent strategy models and channel-dependent strategy models. Typical channel-independent strategy models include PatchTST [39], while channel-dependent strategy models include iTransformer [40], TimeXer [41], and Crossformer [42].
As summarized in Table 1, previous patch-wise time series forecasting models mainly focused on leveraging the patching mechanism to capture long-term dependencies. PatchTST [39] and iTransformer [40] embed each patch into a coarse token through a temporal linear projection, which leads to their inability to fully utilize the data within the patch, potentially compromising the accuracy of the final prediction [41]. Crossformer incorporates a cross-variable attention mechanism, assuming that features influence each other and leveraging historical dependencies across variables for forecasting. Similarly, TimeXer [41] introduces exogenous variables with a design concept comparable to the cross-attention mechanism in Crossformer. The proposed solutions in previous patch-wise methods were confined to the time domain, overlooking the frequency domain, where periodicity and trends of time series are often more effectively captured.
To address the information loss resulting from the model’s inability to fully utilize intra-patch data, this study proposes FCP-Former, an optimized version of PatchTST [39], which integrates corresponding frequency-domain information into the patched data to enhance forecasting accuracy. The main contributions of this paper are summarized as follows:
  • This study introduces a frequency compensation layer that integrates frequency domain features into the patching mechanism of Transformer-based models. This layer applies Fast Fourier Transform (FFT) to each patch to extract spectral components, performs representation learning in the frequency domain, and then reconstructs enriched patch representations via inverse FFT. This approach effectively mitigates intra-patch information loss by capturing periodic and trend features that are often overlooked in purely time-domain patch embeddings.
  • A cross-patch frequency fusion mechanism via overlapping patches is proposed. By using overlapping patch segmentation with reduced stride, the model effectively integrates spectral information across adjacent patches. This enhances long-term periodicity and trend modeling. The fusion occurs within the frequency compensation layer, enriching patch tokens with broader contextual awareness without modifying the core attention structure.
  • This study conducts extensive experiments on eight widely used benchmark datasets, demonstrating the superior performance of FCP-Former compared with state-of-the-art methods, and provides ablation studies and visual analyses to validate the effectiveness of the frequency compensation mechanism.
The remainder of this paper is organized as follows. Section 2 reviews the related work; Section 3 details the proposed FCP-Former; Section 4 presents the experiments and discussion; Section 5 concludes the paper.

2. Related Work

2.1. Problem Definition

Time series data is a set of data arranged in chronological order. This type of data is typically collected at specific time points, and there is a temporal dependence between the data points. In time series forecasting, future events are predicted by utilizing these time-ordered data. The historical data can be defined as $X_t = \{x_1, x_2, \ldots, x_{L-1}, x_L\} \in \mathbb{R}^{L \times D}$ and the predicted data can be defined as $\hat{X}_t = \{\hat{x}_{L+1}, \hat{x}_{L+2}, \ldots, \hat{x}_{L+T-1}, \hat{x}_{L+T}\} \in \mathbb{R}^{T \times D}$, where $D$ is the number of variables, $L$ is the length of historical data, and $T$ is the length of predicted data. The concept of time series forecasting can be expressed as follows:
$\hat{X}_t = f(X_t) + \epsilon,$
where $\hat{X}_t$ is the predicted value, $f$ is the forecasting function, $X_t$ are the historical values, and $\epsilon$ is the forecasting error.
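For concreteness, the shapes implied by this formulation can be illustrated with a minimal PyTorch sketch; the dimensions and the placeholder forecaster below are illustrative, not part of the paper.

```python
import torch

# Illustrative shapes only: L (look-back length), T (prediction length),
# and D (number of variables) follow the notation above.
L, T, D = 96, 96, 7                       # e.g., 7 variables as in the ETT datasets
x_hist = torch.randn(L, D)                # X_t in R^{L x D}
forecaster = lambda x: torch.zeros(T, D)  # placeholder for the forecasting function f
x_pred = forecaster(x_hist)               # X_hat_t in R^{T x D}
assert x_pred.shape == (T, D)
```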

2.2. Transformer-Based Time Series Forecaster

With the great success made in the field of natural language processing and computer vision, Transformer has gained the attention of researchers in the field of time series forecasting due to its powerful ability to capture long-term temporal dependencies and complex multivariate correlations. This study briefly reviews several key variants below. Informer [37] addresses the high computational complexity of transformers in time series forecasting by proposing a sparse self-attention mechanism. FEDformer [35] enhances the transformer model’s ability to capture global features of time series data by combining the transformer model with seasonal trend decomposition while retaining key frequency information of the time series data through Fourier and wavelet transforms. PatchTST [39] improves the transformer’s ability to capture historical dependencies by using a channel-independent strategy to patch the time series data, reducing computational overhead while maintaining the ability to model long-range dependencies. Crossformer [42] enhances the transformer’s ability to handle multivariate time series forecasting tasks through dimension-segment-wise embedding and a two-stage attention mechanism. Npformer [43] introduces an innovative multi-scale segmented Fourier attention mechanism to more effectively capture dependencies. TimeXer [41] enhances the Transformer model’s prediction accuracy by incorporating exogenous variables. iTransformer [40] applies the Transformer’s attention mechanism along the variate dimension instead of the time dimension.
Most of these transformer-based models either focus on designing new attention mechanisms to reduce the complexity of the original attention mechanism or process the time series data itself to better leverage the transformer and achieve better forecasting performance, especially when the prediction length is long. However, these patch-wise transformer methods face a common problem. As shown in Figure 1, compared to point-wise methods, the patch-wise approach, where the model treats a patch as a single token, cannot fully utilize each data point, which results in information loss within the patch. TimeXer introduces exogenous variables to address this issue, but the features extracted from the time domain remain inherently limited. In contrast to TimeXer, this study exploits the complementarity of frequency-domain information to time-domain data. The frequency compensation layer is proposed to extract features from the frequency domain, effectively overcoming the limitations of relying solely on time-domain feature extraction. This enables the model to better capture the periodic and trend characteristics of time series data.

2.3. Time Series Forecasting with Time–Frequency Analysis

The Fourier transform serves as a bridge for converting signals between the time and frequency domains, with the discrete Fourier transform (DFT) and discrete wavelet transform (DWT) being commonly used tools for time–frequency analysis. Current mainstream time–frequency analysis methods can be categorized into two types. The first type involves transforming time-domain data into the corresponding Fourier spectrum, analyzing the Fourier spectrum to extract frequency-domain features, and then using inverse transformations to convert the data back to the time domain to obtain prediction results. Typical examples include FreTS [44], FITS [45], and SparseTSF [46]. In contrast, the second type simultaneously extracts features from both the time and frequency domains of time series data, with the extracted features then concatenated at the network output to produce the prediction result. A typical example is FEDformer [35]. The method proposed in this paper primarily addresses the issue of data loss within patches in patch-wise models. Since the first type of time–frequency analysis typically demands relatively low resource overhead, the proposed FCP-Former adopts this approach, as sketched below.
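A minimal sketch of the first type of pipeline (transform, operate on the spectrum, transform back), using PyTorch's FFT routines; the synthetic signal and the simple top-k mode selection are illustrative assumptions, not the paper's method.

```python
import torch

# Time domain -> frequency domain -> simple spectral "analysis" -> time domain.
t = torch.arange(96, dtype=torch.float32)
x = torch.sin(2 * torch.pi * t / 24) + 0.1 * torch.randn(96)  # daily-like cycle plus noise

spec = torch.fft.rfft(x)                         # forward transform
keep = spec.abs().topk(3).indices                # keep the 3 dominant frequency components
filtered = torch.zeros_like(spec)
filtered[keep] = spec[keep]
x_back = torch.fft.irfft(filtered, n=x.numel())  # inverse transform back to the time domain
```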

3. FCP-Former Principle

As illustrated in Figure 2, FCP-Former includes three components: a patching module, a frequency compensation layer, and an encoder. Han et al. [47] demonstrated through extensive experiments that channel-independent strategies typically achieve better prediction results than channel-dependent strategies. Therefore, similar to PatchTST [39], FCP-Former adopts the channel-independent strategy. Furthermore, FCP-Former applies the frequency compensation layer to process the patched data before encoding, adding corresponding frequency features to compensate for intra-patch information loss. iTransformer [40] demonstrates that the standard attention mechanism can also yield excellent results. Therefore, instead of modifying the attention mechanism, FCP-Former focuses on enriching the information within each patch. Moreover, FCP-Former can adjust the patch length and the step size for dividing patches so that adjacent patches overlap. This overlapping portion serves as a bridge to fuse the frequency-domain features of different patches, further enhancing the model's ability to capture the periodicity and trend of time series data.

3.1. Model Structure

FCP-Former consists of a patching module, a frequency compensation layer, and an encoder.
Patching: Following a channel-independent strategy, this study divides the original time series into $D$ channels according to the data dimension and applies patching separately to each channel. The time series in the $i$-th channel is denoted as $X_t^i = \{x_1^i, x_2^i, \ldots, x_{L-1}^i, x_L^i\} \in \mathbb{R}^{1 \times L}$, where $L$ is the sequence length. Let the patch length be $P$, the patch step size be $S$, and the number of patches be $N$, which is computed as follows:
$N = \left\lfloor \dfrac{L - P}{S} \right\rfloor + 1,$
Padding is applied at the end of the series. When the final patch extends beyond the sequence length, the remaining positions are filled with the last observed value $x_L^i$, ensuring that all patches have a consistent size. After patching, the original time series in each channel is transformed into a sequence of patches $P_n^i = \{p_1^i, p_2^i, \ldots, p_{N-1}^i, p_N^i\} \in \mathbb{R}^{P \times N}$.
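A minimal sketch of this patching step in PyTorch, following a PatchTST-style unfold with end-of-series replication padding; the exact number of patches depends on the padding convention, and the function below is illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

def make_patches(x, patch_len=16, stride=8):
    """Channel-independent patching of a multivariate series.

    x: [B, D, L] (batch, variables/channels, look-back length).
    Returns patches of shape [B, D, N, patch_len].
    """
    # replicate the last observed value at the end of the series
    x = nn.ReplicationPad1d((0, stride))(x)            # [B, D, L + stride]
    # slide a window of length patch_len with the given stride
    return x.unfold(dimension=-1, size=patch_len, step=stride)

# Example: L = 96, P = 16, S = 8 yields 12 patches per channel here
x = torch.randn(32, 7, 96)
print(make_patches(x).shape)                           # torch.Size([32, 7, 12, 16])
```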
Frequency Compensation Layer: As shown in Figure 3, in the time domain, data points are arranged in chronological order, with each data point representing the observation at a specific point in time. In the frequency domain, the data is decomposed into different frequency components, with each frequency component representing the extent of a particular periodicity in the signal. This representation facilitates the identification of periodicity and underlying trends in the data.
The core function of the frequency compensation layer is to perform representation learning on each patch in the frequency domain, thereby supplementing the temporal information that may be overlooked within individual time steps of the patch. Specifically, the frequency compensation layer first applies a Fast Fourier Transform (FFT) to the patch, converting the data into its frequency-domain representation. Representation learning is then conducted in the frequency domain, after which an inverse Fourier transform is applied to map the data back into the time domain. The processed patch thus represents an enriched version of the original data, combining both time-domain characteristics and frequency-domain features such as periodicity and trends. By incorporating spectral information, the frequency compensation layer effectively mitigates intra-patch information loss and enhances the model’s ability to capture fine-grained details within each patch, ultimately improving prediction accuracy. This study analyzes the frequency compensation layer in detail in Section 3.2.
Encoder: This study uses a vanilla Transformer encoder to map the patches processed by the frequency compensation layer into latent representations. Each patch is embedded into a latent space of dimension $D$ by applying a learnable linear projection matrix $W_p \in \mathbb{R}^{D \times P}$ and a position encoding $W_{pos} \in \mathbb{R}^{D \times N}$, which serves as the input to the encoder. The embedding process is formulated as follows:
$FCL_n^i = FCL(P_n^i),$
$IN_d^i = W_p\, FCL_n^i + W_{pos},$
where $FCL(\cdot)$ is the frequency compensation layer, $FCL_n^i$ is the result obtained after applying the frequency compensation layer to each patch, and $IN_d^i \in \mathbb{R}^{D \times N}$ is the embedded result used as the input of the encoder. Then the multi-head attention transforms it into query matrices $Q_h^i$, key matrices $K_h^i$, and value matrices $V_h^i$. The attention output $OUT_h^i \in \mathbb{R}^{D \times N}$ is ultimately obtained through scaled dot-product attention. The attention process can be formulated as follows:
$Q_h^i = (IN_d^i)^{\top} W_h^Q,$
$K_h^i = (IN_d^i)^{\top} W_h^K,$
$V_h^i = (IN_d^i)^{\top} W_h^V,$
$(OUT_h^i)^{\top} = \mathrm{Attention}(Q_h^i, K_h^i, V_h^i) = \mathrm{Softmax}\!\left(\dfrac{Q_h^i (K_h^i)^{\top}}{\sqrt{d_k}}\right) V_h^i,$
where $W_h^Q, W_h^K \in \mathbb{R}^{D \times d_k}$ and $W_h^V \in \mathbb{R}^{D \times D}$. After passing through BatchNorm layers and a feed-forward network, the final predicted result is obtained from a linear layer. The encoder and attention mechanism effectively capture the dependencies among patches processed by the frequency compensation layer, including both the temporal characteristics of the patches and the integrated frequency-domain features.
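A compact sketch of this embedding-plus-encoder stage in PyTorch; the dimensions, module names, and the use of a stock nn.TransformerEncoder (which applies LayerNorm rather than the BatchNorm mentioned above) are simplifications for illustration only.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Learnable linear projection (W_p) plus learnable positional encoding (W_pos)."""

    def __init__(self, patch_len=16, n_patches=12, d_model=128):
        super().__init__()
        self.proj = nn.Linear(patch_len, d_model)                  # W_p
        self.pos = nn.Parameter(torch.randn(n_patches, d_model))   # W_pos

    def forward(self, patches):
        # patches: [B*D, N, P] -> patch tokens: [B*D, N, d_model]
        return self.proj(patches) + self.pos

embed = PatchEmbedding()
layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, dim_feedforward=256,
                                   dropout=0.05, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = embed(torch.randn(32 * 7, 12, 16))   # flattened channel-independent patches
latent = encoder(tokens)                      # [B*D, N, d_model]
head = nn.Linear(12 * 128, 96)                # flatten and project to the forecast horizon
forecast = head(latent.flatten(1))            # [B*D, T]
```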

3.2. Analysis of Frequency Compensation Layer

As depicted in Figure 2, the frequency compensation layer is divided into the following steps to process each patch.
(1)
Fast Fourier Transform (FFT): The Fourier Transform can decompose a signal in the time domain into a linear combination of a series of sine and cosine functions. Each sine and cosine function represents a specific frequency component of the signal. Thus, the Fourier transform can extract the frequency characteristics from time series data. For discrete signals, the Discrete Fourier Transform is used:
$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j \frac{2\pi k}{N} n},$
where $X_k$ is the complex value of the $k$-th frequency in the frequency domain, $x_n$ is the $n$-th sampling point of the time-domain signal, and $N$ is the length of the signal. Similarly, the inverse DFT (IDFT) is defined as follows:
$x_n = \dfrac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{j \frac{2\pi k}{N} n}.$
Equation (9) shows that for a signal of length $N$, the computational complexity of the DFT is $O(N^2)$. However, the Fast Fourier Transform (FFT) reduces the computational load by utilizing the symmetry and periodicity of the signal, breaking the computation into smaller parts, thus reducing its complexity to $O(N \log_2 N)$ and significantly improving efficiency.
When performing the Fast Fourier Transform on time series data, how to select an appropriate sampling of frequency components is an issue that must be addressed. Retaining all frequency components may inevitably introduce noise interference, while preserving only a portion of the frequencies risks missing some underlying trends in the data. FEDformer [35] demonstrates that real-world multivariate time series typically yield low-rank matrices after the Fourier transform. This low-rank property implies that representing the time series by randomly selecting a fixed number of Fourier components is reasonable. Consequently, this study adopts random sampling as the sampling method and sets the number of modes as $M$. The specific approach is as follows: First, the time series within the patch is transformed from the time domain to the frequency domain, resulting in a Fourier coefficient matrix $A \in \mathbb{R}^{a \times b}$ (where $a$ represents the number of time series and $b$ represents the total number of Fourier components). Since a channel-independent strategy is employed to process patches, $a$ is fixed at 1, and $b$ corresponds to the length of the time series data within the patch. Next, $M$ Fourier components are randomly selected from all $b$ Fourier components to construct a selection matrix $S \in \{0,1\}^{M \times b}$, where $S_{i,k} = 1$ indicates the selection of the $k$-th component and $S_{i,k} = 0$ indicates the non-selection of that component. Finally, through the matrix operation $A' = A S^{\top}$, a sparse matrix $A' \in \mathbb{R}^{1 \times M}$ is obtained, retaining only the selected components, which serves as the sampling result.
(2)
Representation Learning in the Frequency Domain: After random sampling, the selected set of frequency indices is defined as $\mathcal{I} = \{i_1, i_2, \ldots, i_M\}$. Next, this study defines two weight tensors, $\mathcal{W}^{(1)} \in \mathbb{C}^{F \times N \times N \times M}$ and $\mathcal{W}^{(2)} \in \mathbb{C}^{F \times N \times N \times M}$, where $F$ is the number of features, $N$ is the number of patches, and $M$ is the number of selected frequency components. These tensors are initialized with random values and serve as learnable weights for the frequency-domain transformations. The input patch tensor is $X \in \mathbb{R}^{B \times V \times PL \times N}$, where $B$ is the batch size, $V$ is the number of features, and $PL$ is the patch length. This study applies a Fast Fourier Transform (FFT) to the input tensor $X$ along the $PL$ dimension. A tensor $Y_{ft} \in \mathbb{C}^{B \times V \times (\lfloor PL/2 \rfloor + 1) \times N}$ is defined to store the frequency-domain data after the Fourier transform. Through representation learning, frequency-domain features such as periodicity and trends within the patches are extracted and preserved.
(3)
Inverse Fast Fourier Transform: The processed frequency-domain data is mapped back to the time domain using the Inverse Fast Fourier Transform (IFFT). This process can be simply formulated as follows:
$X_{ft} = \mathrm{FFT}(X),$
$\mathcal{W}_i = \mathcal{W}_i^{(1)} + j\,\mathcal{W}_i^{(2)}, \quad i \in \mathcal{I},$
$Y_{ft} = \sum_{i \in \mathcal{I}} X_{ft}\, \mathcal{W}_i,$
$X_{out} = \mathrm{IFFT}(Y_{ft}),$
where $X$ is the input tensor, $X_{ft}$ is the result obtained by applying the Fourier transform to $X$, $\mathcal{W}_i$ represents the learned complex weight for frequency $i$, $Y_{ft}$ is the frequency-domain output, and the final reconstructed output is $X_{out} \in \mathbb{R}^{B \times V \times PL \times N}$.
Generally, through the frequency compensation layer, the newly generated patches not only contain the time-domain features of the original patches but also include the frequency-domain features of the original patches and the fused frequency-domain features from adjacent patches. This enables the new patches to retain the original time-domain features while exhibiting more prominent periodicity and trend characteristics.
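A minimal, self-contained sketch of steps (1) to (3) above: rFFT along the patch-length dimension, a fixed random selection of M modes, one learnable complex weight per selected mode, and an inverse rFFT back to the time domain. The weight shapes, the residual-style fusion with the original patch, and the default values are simplifying assumptions for illustration and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class FrequencyCompensationLayer(nn.Module):
    """Sketch of the frequency compensation layer described in Section 3.2."""

    def __init__(self, patch_len=16, n_modes=16, seed=0):
        super().__init__()
        n_freq = patch_len // 2 + 1                      # length of the rFFT output
        n_modes = min(n_modes, n_freq)
        g = torch.Generator().manual_seed(seed)
        # fixed random set of selected frequency indices (the set I above)
        self.register_buffer("index", torch.randperm(n_freq, generator=g)[:n_modes])
        # real and imaginary parts of the learnable weights W^(1), W^(2)
        scale = 1.0 / patch_len
        self.w_real = nn.Parameter(scale * torch.randn(n_modes))
        self.w_imag = nn.Parameter(scale * torch.randn(n_modes))

    def forward(self, x):
        # x: [B, V, PL, N] -- time-domain patches (patch length on dim 2)
        x_ft = torch.fft.rfft(x, dim=2)                  # X_ft: [B, V, PL//2+1, N], complex
        w = torch.complex(self.w_real, self.w_imag)      # W_i = W^(1)_i + j W^(2)_i
        y_ft = torch.zeros_like(x_ft)
        # keep only the selected modes, reweighted by the learned complex weights
        y_ft[:, :, self.index, :] = x_ft[:, :, self.index, :] * w.view(1, 1, -1, 1)
        x_out = torch.fft.irfft(y_ft, n=x.size(2), dim=2)  # back to the time domain
        return x + x_out        # fuse with the original patch (assumed residual form)

fcl = FrequencyCompensationLayer(patch_len=16, n_modes=8)
print(fcl(torch.randn(2, 7, 16, 12)).shape)              # torch.Size([2, 7, 16, 12])
```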

4. Experiments and Discussion

To verify the effectiveness and generality of FCP-Former, this study conducted comprehensive experiments on eight real-world long-term time series forecasting datasets, which are widely used in practical applications. To ensure a fair comparison with baseline methods that typically use shorter look-back windows, this study set the input length of FCP-Former to 96 as well. This configuration forgoes the potential advantage of longer look-back windows afforded by the patching mechanism and instead focuses on the inherent capabilities of the proposed frequency compensation layer. The results demonstrate that even under this constrained setting, FCP-Former obtains competitive performance in terms of MSE and MAE compared to existing state-of-the-art methods. Moreover, this study explored the performance of FCP-Former when utilizing longer look-back windows (336 and 512 input time steps), which demonstrated FCP-Former's superior predictive capabilities.

4.1. Experimental Setup

4.1.1. Datasets

The experiments utilized eight real-world datasets, which are widely applied in time series forecasting research. The details of these datasets are as follows:
ETT (Electricity Transformer Temperature): It contains two years of data from two different electricity transformers. ETTh1 and ETTh2 are recorded every hour, and ETTm1 and ETTm2 are recorded every 15 min.
Traffic: It contains data on hourly occupancy rates from 862 sensors on San Francisco Bay Area freeways from January 2015 to December 2016.
Weather: It includes 21 meteorological factors recorded every 10 min at the Weather Station of the Max Planck Biogeochemistry Institute in 2020.
Electricity: It records the hourly electricity consumption of 321 customers.
ILI: It describes the number of patients and influenza-like illness ratio at weekly intervals, obtained from the U.S. Centers for Disease Control and Prevention between 2002 and 2021.
The statistics of those datasets are summarized in Table 2.

4.1.2. Baselines and Experimental Settings

This study chose state-of-the-art (SOTA) Transformer-based models as baselines, including PatchTST [39], iTransformer [40], TimeXer [41], FEDformer [35], Crossformer [42], and Autoformer [38]. All of the models follow the same experimental setup, with prediction length $T \in \{24, 36, 48, 60\}$ for the ILI dataset and $T \in \{96, 192, 336, 720\}$ for the other datasets. To validate the effectiveness of the proposed model, this study adopted the multivariate time series forecasting setup from the TimeXer [41] study, setting the input length to 96. This setting is widely used in the literature to ensure a fair comparison across models while maintaining the generality of patch-based methods.

4.1.3. Metrics

This study chose the mean square error (MSE) and mean absolute error (MAE) as evaluation metrics, which can be defined as follows:
$MSE = \dfrac{1}{N} \sum_{t=1}^{N} \left( Y_{t_0+t} - \hat{Y}_{t_0+t} \right)^2,$
$MAE = \dfrac{1}{N} \sum_{t=1}^{N} \left| Y_{t_0+t} - \hat{Y}_{t_0+t} \right|,$
where $N$ is the prediction length, $Y_{t_0+t}$ is the ground truth at the $t$-th step of the forecast horizon, and $\hat{Y}_{t_0+t}$ is the corresponding predicted value. A lower MSE or MAE indicates better forecasting performance.
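These metrics translate directly into code; a minimal PyTorch sketch (averaging over all forecast steps and variables) is shown below.

```python
import torch

def mse(y_true, y_pred):
    """Mean squared error over the forecast horizon."""
    return torch.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error over the forecast horizon."""
    return torch.mean(torch.abs(y_true - y_pred))

# Example with a horizon of 96 steps and 7 variables
y_true = torch.randn(96, 7)
y_pred = y_true + 0.1 * torch.randn(96, 7)
print(mse(y_true, y_pred).item(), mae(y_true, y_pred).item())
```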

4.1.4. Implementation Details

FCP-Former was implemented with PyTorch 2.4.0 and trained on an NVIDIA GeForce RTX 4090 GPU. This study used the Adam optimizer with a learning rate of 1 × 10−4 to train the model. For small datasets, such as the ETT datasets, the batch size was set to 128. For larger datasets such as Traffic, due to memory limitations, the batch size was adjusted between 8 and 32. For all datasets, the maximum number of training epochs was set to 50. To prevent overfitting and reduce training time, this study set the dropout rate to 0.05 and used an early stopping mechanism with a patience of 3 to halt training when the validation loss showed no significant decrease. The patch length, denoted as P, was set to 16. The number of frequency modes, denoted as M, was set to 16.
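A runnable sketch of this training setup (Adam with learning rate 1e-4, dropout 0.05, at most 50 epochs, early stopping with patience 3); the model and data here are simple stand-ins, not the actual FCP-Former pipeline.

```python
import torch
import torch.nn as nn

# Stand-in model and data; only the optimization settings mirror the text above.
model = nn.Sequential(nn.Linear(96, 128), nn.Dropout(0.05), nn.Linear(128, 96))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

x_train, y_train = torch.randn(512, 96), torch.randn(512, 96)
x_val, y_val = torch.randn(128, 96), torch.randn(128, 96)

patience, best_val, bad_epochs = 3, float("inf"), 0
for epoch in range(50):                      # maximum of 50 epochs
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val - 1e-6:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping with patience 3
            break
```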

4.2. Experimental Results

As shown in Table 3, for multivariate forecasting, FCP-Former outperforms the other methods on six datasets, the exceptions being the Traffic and Weather datasets. Specifically, on the ETTh1 dataset, FCP-Former achieves an average MSE of 0.437, outperforming the best baseline, PatchTST (0.460). For the ETTh2 dataset, FCP-Former records 0.365, which is lower than the second-best PatchTST (0.369) and significantly better than the other baselines. For the ETTm1 dataset, FCP-Former obtains 0.389, slightly ahead of PatchTST (0.390) and considerably lower than Crossformer (0.602). For the ETTm2 dataset, FCP-Former achieves 0.280, surpassing PatchTST (0.291) and the other methods. For the large-scale Electricity dataset, FCP-Former reaches 0.186, the best result among all methods, outperforming the second-best PatchTST (0.198). Finally, on the ILI dataset, FCP-Former yields an average MSE of 1.734, which is lower than PatchTST (1.765) and substantially better than FEDformer (3.692) and Crossformer (2.740).
On datasets with strong periodicity and trend components, such as ETT and Electricity, FCP-Former benefits from the frequency compensation layer, which makes it more sensitive to periodic and trend-related patterns, thereby enhancing forecasting accuracy. In contrast, baseline methods such as PatchTST and iTransformer suffer from limitations in fully exploiting intra-patch information due to their patching mechanisms, while TimeXer, Crossformer, and Autoformer remain confined to the time domain and thus fail to utilize the rich information available in the frequency domain. This limitation likely contributes to their comparatively lower forecasting accuracy.
The experimental results indicate that FCP-Former significantly outperforms the other baseline methods in prediction performance for multivariate long-term time series forecasting tasks. FCP-Former achieves a total of 48 optimal values and 17 suboptimal values, especially on the ETT and Electricity datasets. Although FCP-Former does not achieve the best performance on all datasets, it consistently attains near-optimal results. This demonstrates that FCP-Former exhibits significant advantages in long-term forecasting. It is worth noting that, compared to the baseline methods, FCP-Former often exhibits a smaller MAE when the MSE values are similar. For example, on the ETTm2 dataset, the average MAE of FCP-Former is 0.323, while the second-best average MAE is that of TimeXer (0.326). This indicates smaller average deviations between predictions and ground truth, reflecting higher overall prediction accuracy. In practical application scenarios such as stock price forecasting, supply chain management, and healthcare, where a lower MAE is more critical, FCP-Former demonstrates a distinct advantage.
Additionally, this study observed that FCP-Former performs poorly on the Traffic and Weather datasets. This could be attributed to noise in the data, where certain periodicities are less evident. For instance, the Traffic dataset contains not only locally periodic daily commuting data but also a substantial amount of vehicle pass data with no discernible periodicity. The design of the frequency compensation layer makes FCP-Former more sensitive to periodic and trend-based features. Therefore, the aforementioned noise limits its predictive performance in certain scenarios, preventing the model from achieving the desired forecasting accuracy. In contrast, on datasets with more pronounced periodicity and less noise, such as the ETT and Electricity datasets, the model demonstrates superior performance. This suggests that FCP-Former is better suited for applications with clear periodic patterns, particularly in industrial settings.

4.3. Model Analysis

This study analyzes FCP-Former through ablation experiments, hyperparameter sensitivity experiments, experiments with different input lengths, experiments on the ability to capture information from each time step within a patch, and robustness experiments.

4.3.1. Ablation Experiments

In this section, this study conducted an ablation study on the model to demonstrate the effectiveness of the frequency compensation layer. The variant denoted w/o FCL removes the frequency compensation layer before encoding.
This study compared the performance of the FCP-Former ablation version and the results of the full FCP-Former model in Table 4. On the ETTm2 dataset, FCP-Former achieves an average MSE of 0.280 and MAE of 0.323, representing 3.78% and 3.58% improvements compared with the model without FCL (0.291/0.335). On the Weather dataset, the average MSE and MAE decrease from 0.257/0.281 (w/o FCL) to 0.245/0.275, corresponding to 4.67% and 2.14% improvements, respectively. Similarly, on the Electricity dataset, FCP-Former achieves the lowest average MSE (0.186) and MAE (0.277), yielding 6.06% and 1.77% gains over the variant without FCL (0.198/0.282). From the results of the ablation experiment, it is evident that the application of the frequency compensation layer leads to improved prediction performance.

4.3.2. Hyperparameter Sensitivity Experiments

In the frequency compensation layer, this study employed a crucial hyperparameter: the number of modes in the frequency domain, M. This hyperparameter determines how many frequency components are selected from the frequency domain for the model to learn from. Its value directly impacts both the model’s frequency domain representation capability and computational complexity. Theoretically, a larger number of modes implies more frequency patterns are used, resulting in higher frequency domain resolution and finer data variations being captured, but at the cost of increased computational load and a higher risk of overfitting. On the other hand, a smaller number of modes compresses the frequency domain information, with the model focusing only on the main low-frequency components. This makes the model lighter and faster but may lead to the loss of high-frequency information, decreasing representational capacity while potentially improving generalization performance. In this experiment, this study set the patch length to 32 and evaluated the number of modes in the frequency domain M from the set {2, 4, 6, 8, 10, 12, 14, 16, 18}. The results are shown in Figure 4. This figure corroborates the aforementioned theoretical analysis. When the value of M is low, the model learns fewer frequency patterns, resulting in relatively lower prediction accuracy. As M increases, the MSE gradually decreases and plateaus. When M reaches 16, the model achieves its optimal performance for this hyperparameter on both the ETTh2 and Electricity datasets. However, as M continues to increase, the model’s prediction performance deteriorates due to overfitting, leading to a rise in MSE. This trend of performance deterioration due to overfitting is more pronounced on the Electricity dataset. Considering both computational costs and prediction performance, this study recommends setting the value of M to 16.

4.3.3. Different Input Lengths Experiments

In time series forecasting tasks, the input length determines the amount of historical information available to the model. A longer look-back window allows the model to capture a broader range of past observations, thereby expanding its perceptual scope. Following the approach to look-back window selection in the PatchTST [39] study, this study designed two experimental variants: FCP-Former-336, with a look-back window length of 336, and FCP-Former-512, with a look-back window length of 512. Since a longer look-back window inevitably leads to increased memory overhead, this study dynamically adjusted the batch size to balance memory consumption. Due to the limited size of the ILI dataset, with only 966 data points, increasing the input length reduces the training set size. For FCP-Former-512, using the dataset split shown in Table 2, the training set consists of only eight data points, making training impossible. Similarly, for FCP-Former-336, the training set contains only 184 data points, which is insufficient for adequate model training. Therefore, this study did not conduct these experiments on the ILI dataset. For the remaining datasets, the comparative results of FCP-Former (look-back window length of 96), FCP-Former-336, and FCP-Former-512 are presented in Table 5. For small-scale datasets, such as ETTh2, FCP-Former-512 achieves an average MSE of 0.342, compared with 0.365 for the vanilla model, while the MAE also improves from 0.395 to 0.392. For large-scale datasets, on the Weather dataset, FCP-Former-512 attains an average MSE of 0.226 and MAE of 0.270, both lower than those of the original model (0.245/0.275). Based on the work of Wang et al. [48], the performance of Transformer-based models often deteriorates as the input length increases, owing to the presence of repeated short-term patterns in the data and the difficulty Transformer models have in effectively capturing and modeling these short-term patterns. This phenomenon explains why FCP-Former-336, in a few specific cases, marginally outperformed FCP-Former-512. Overall, the performance of FCP-Former-512 surpasses that of both FCP-Former-336 and FCP-Former, particularly when the prediction length is longer, where its advantages become even more pronounced. The overall superior performance of FCP-Former-512, particularly for longer prediction horizons, suggests that FCP-Former performs well in capturing long-term temporal dependencies and in deeply extracting meaningful information from historical data.

4.3.4. Capture Information Ability Experiments from Each Timestep Within the Patches

Under a fixed-length look-back window, increasing the patch length in a patch-wise time series prediction model leads to a reduction in prediction accuracy. This is due to the model's difficulty in capturing the information of each time step within the patch and its resulting inability to extract more information from the look-back window. To investigate whether the method proposed in this study can effectively capture the information of each time step within the patch, this study conducted experiments by adjusting the patch length multiple times under a fixed look-back window. The dataset selected for the experiment is ETTh1, with PatchTST [39] used as the baseline for comparison. The look-back window is fixed at 96, and the patch lengths are set to 16, 24, and 32, respectively. In this experiment, the results with a patch length of 16 serve as the reference for investigating the impact of increasing the patch length on the MSE and MAE of the prediction results. The experimental results are shown in Table 6 and Table 7.
The results indicate that increasing the patch length from 16 to 24 has a negligible effect on FCP-Former’s prediction accuracy. For MSE, when the prediction lengths are 96 and 336, the MSE increases by only 0.26% and 2.12%, respectively, and there is almost no effect at 192 and 720. In contrast, PatchTST experiences a larger decline in prediction accuracy as the patch length increases. This decline becomes more evident at a patch length of 32, particularly for a prediction length of 720, where the MSE increase is only 1.06% for FCP-Former compared to 11.01% for PatchTST. A similar trend is observed for MAE. When the prediction lengths are 96 and 336, the MAE of FCP-Former increases by only 0.25% and 0.67%, respectively. By comparison, PatchTST shows a larger degradation, with the average MAE rising by 2.91%, whereas FCP-Former increases by only 0.46%. The difference becomes more pronounced at a patch length of 32, where for a prediction length of 720 the MAE increase is merely 0.43% for FCP-Former, in contrast to 5.21% for PatchTST. These findings demonstrate that patch-wise models generally lose accuracy due to their inability to capture each timestep, but FCP-Former effectively captures information from every time step within the patch.

4.3.5. Robustness Experiment

Robustness is an essential metric for evaluating a model’s predictive stability. To assess the robustness of FCP-Former, this study conducted 10 independent predictions for forecasting horizons of 96, 192, 336, and 720 on the ETTh1 dataset. A 90% confidence interval was applied to determine whether FCP-Former’s predictions in Table 3 are reliable. The experimental results are shown in Table 8. From these results, it can be observed that the effect of training randomness on predictive performance is minimal, confirming that FCP-Former exhibits strong robustness.
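For reference, a 90% confidence interval over repeated runs can be computed as follows; the MSE values in this sketch are placeholders, not the measurements reported in Table 8.

```python
import numpy as np
from scipy import stats

# Placeholder MSE values from 10 hypothetical independent runs
mse_runs = np.array([0.392, 0.390, 0.393, 0.391, 0.389,
                     0.392, 0.394, 0.390, 0.391, 0.393])
mean = mse_runs.mean()
# t-based 90% confidence interval for the mean over n = 10 runs
half_width = stats.t.ppf(0.95, df=len(mse_runs) - 1) * stats.sem(mse_runs)
print(f"MSE = {mean:.3f} +/- {half_width:.3f} (90% CI)")
```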

4.4. Multivariate Showcases

As illustrated in Figure 5, this study also compares the prediction results of FCP-Former with those of recently established state-of-the-art models (PatchTST, TimeXer, and iTransformer) on multiple datasets (ETTm1, Weather, and Electricity). The experimental setup is consistent with that of Table 3, with an input length of 96. This implies that the first 96 points represent the input historical data, where predictions and ground truth are identical. The subsequent points represent forecasts, each aligned with a ground-truth value for evaluation. FCP-Former produces predictions that are closest to the ground-truth values, providing a direct visual manifestation of its superior MAE performance reported in Table 3. Furthermore, owing to its enhanced ability to capture the trend of time series data, FCP-Former exhibits a significant advantage in predicting the overall trend, as evidenced by the close alignment between its predicted trends and the actual trends. In the later segments of the Weather dataset predictions, both FCP-Former and TimeXer produce relatively flat forecasts. This is likely due to the presence of noisy data within the look-back window during the prediction of the latter part of the series. Both TimeXer and the proposed FCP-Former are designed to better exploit the information within patches, enhancing their capacity to capture patch-level information; the captured noise therefore somewhat interferes with the models' ability to discern the trend of data variations, leading to flatter predictions. In contrast, PatchTST and iTransformer do not perform any special handling of patch-level data and are less influenced by noise. However, PatchTST and iTransformer predict an upward trend at time step 150, while the true values exhibit a downward trend, resulting in slightly inferior overall performance compared to FCP-Former and TimeXer.

4.5. Training Costs Evaluation

The performance of current time series forecasting models is steadily improving. However, an excessive number of parameters and prolonged training times remain significant challenges. The proposed FCP-Former processes patch data through the frequency compensation layer, adding only a minimal number of parameters to achieve an improvement in prediction performance. To specifically assess the impact of the frequency compensation layer on training overhead, this study analyzed the cost of FCP-Former, including runtime, GPU utilization, and performance in terms of MSE and MAE. Under the same experimental setup, this study compared the resource overhead of FCP-Former with the baseline models listed in Table 3 on the ETTh1 dataset. In this experiment, the prediction length was set to 96, the input length to 96, and the batch size to 32. "Iter" denotes the time required to train each iteration. To mitigate overfitting and reduce training time, this study implemented an early stopping mechanism with a patience of 3, halting training when the validation loss exhibited no significant improvement. Consequently, different models have varying numbers of epochs, with a smaller number of epochs indicating a faster convergence rate. The experimental results are presented in Table 9. For the ETTh1 dataset, the TSPE (Time Spent Per Epoch) and GPU usage of the patch-wise Transformer-based models are significantly smaller than those of the point-wise Transformer-based models.
Compared to PatchTST, FCP-Former achieves better forecasting performance while incurring only minor resource overhead: an additional 6 MiB of GPU memory and 3.6 ms per iteration. This demonstrates that the improvement in prediction performance brought by the frequency compensation layer far outweighs the resource overhead it introduces. In contrast, iTransformer employs a different approach, treating the entire time series of each variate as a single token and applying attention along the variate dimension. This architectural choice inherently minimizes the computational overhead associated with temporal patching. In addition, as shown in Figure 6, this study provides a more intuitive comparison of the training speed, GPU utilization, and performance across multiple models.
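A sketch of how the per-iteration time and peak GPU memory in this comparison can be measured with standard PyTorch utilities; profile_iteration and its arguments are illustrative helpers, not part of the paper's code.

```python
import time
import torch
import torch.nn.functional as F

def profile_iteration(model, x, y, device="cuda"):
    """Return (iteration time in ms, peak GPU memory in MiB) for one training step."""
    model = model.to(device)
    x, y = x.to(device), y.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    loss = F.mse_loss(model(x), y)    # forward pass
    loss.backward()                   # backward pass
    torch.cuda.synchronize(device)
    iter_time_ms = (time.time() - start) * 1000
    peak_mem_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return iter_time_ms, peak_mem_mib
```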

5. Conclusions

This study introduces FCP-Former, a time series forecasting method that enhances patch-wise Transformer models through a novel frequency compensation layer. This layer enables representation learning in the frequency domain, enriching the information within each patch and fusing frequency features across patches to improve the capture of periodic and trend components in long-term multivariate time series. Experiments on eight real-world datasets (ETT, Weather, Traffic, Electricity, ILI, etc.) show that FCP-Former achieves state-of-the-art performance, obtaining 48 optimal experiment results and 17 suboptimal experiment results in MSE and MAE. For instance, for ETTm1, it attains an average MSE of 0.389 and an MAE of 0.401, outperforming PatchTST and iTransformer. On Electricity, it scores an MSE of 0.186 and an MAE of 0.277, significantly improving upon FEDformer and Autoformer. However, performance is tied to data periodicity: FCP-Former excels on highly periodic, low-noise data like ETT and Electricity but lags on aperiodic or noisy datasets such as Traffic and Weather. For example, on Traffic, it is slightly surpassed by iTransformer. Limitations include reduced effectiveness on non-periodic data and sensitivity to noise. Future work will focus on improving adaptability to aperiodic signals via noise suppression mechanisms and enhanced non-periodic feature extraction, as well as optimizing computational efficiency for deployment in resource-constrained environments.

Author Contributions

Conceptualization, M.L. and M.Y.; methodology, M.L. and M.Y.; software, M.Y.; validation, M.Y., S.C., and H.L.; investigation, G.X.; data curation, S.L.; writing—original draft preparation, M.Y.; writing—review and editing, all authors; visualization, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Deep Earth Probe and Mineral Resources Exploration-National Science and Technology Major Project (Grant Number 2024ZD1003905) and the National Natural Science Foundation of China (Grant Number 51874302).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. ETT datasets: https://github.com/zhouhaoyi/etdataset (accessed on 10 January 2025), Traffic datasets: https://dot.ca.gov (accessed on 10 January 2025), Weather datasets: https://www.bgc-jena.mpg.de/wetter (accessed on 10 January 2025), Electricity datasets: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (accessed on 10 January 2025), ILI datasets: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html (accessed on 10 January 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FCP-Former: Frequency Compensation Patch-wise TransFormer
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
DFT: Discrete Fourier Transform
FFT: Fast Fourier Transform
TSPE: Time Spent Per Epoch
TRT: Total Running Time

References

  1. Stephan, K.; Jisha, G. Enhanced Weather Prediction with Feature Engineered, Time Series Cross Validated Ridge Regression Model. In Proceedings of the Control Instrumentation System Conference (CISCON), Manipal, India, 2–3 August 2024; pp. 1–6. [Google Scholar] [CrossRef]
  2. Sharma, S.; Bhatt, K.K.; Chabra, R.; Aneja, N. A Comparative Performance Model of Machine Learning Classifiers on Time Series Prediction for Weather Forecasting. In Proceedings of the 3rd International Conference on Advances in Information Communication Technology and Computing (AICTC), Bikaner, India, 20–21 December 2021; pp. 577–587. [Google Scholar] [CrossRef]
  3. Li, J.B.; Ma, L.; Li, Y.; Fu, Y.X.; Ma, D.C. Multivariate Short-Term Marine Meteorological Prediction Model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4202116. [Google Scholar] [CrossRef]
  4. Melin, P.; Monica, J.C.; Sanchez, D.; Castillo, O. Multiple Ensemble Neural Network Models with Fuzzy Response Aggregation for Predicting COVID-19 Time Series: The Case of Mexico. Healthcare 2020, 8, 181. [Google Scholar] [CrossRef]
  5. Sharma, R.R.; Kumar, M.; Maheshwari, S.; Ray, K.P. EVDHM-ARIMA-Based Time Series Forecasting Model and Its Application for COVID-19 Cases. IEEE Trans. Instrum. Meas. 2021, 70, 6502210. [Google Scholar] [CrossRef]
  6. Cui, C.S.; Xia, G.S.; Jia, C.Y.; Wen, J. A Novel Construction Method and Prediction Framework of Periodic Time Series: Application to State of Health Prediction of Lithium-Ion Batteries. Energies 2025, 18, 1438. [Google Scholar] [CrossRef]
  7. Shan, N.L.; Xu, X.H.; Bao, X.Q.; Xu, C.C.; Cui, F.R.; Xu, W. Health Status Prediction for Nonstationary Systems Based on Feature Decoupling of Time Series. IEEE Trans. Instrum. Meas. 2025, 74, 3526911. [Google Scholar] [CrossRef]
  8. Fang, Y.; Qin, Y.; Luo, H.; Zhao, F.; Zheng, K. STWave+: A Multi-Scale Efficient Spectral Graph Attention Network with Long-Term Trends for Disentangled Traffic Flow Forecasting. IEEE Trans. Knowl. Data Eng. 2024, 36, 2671–2685. [Google Scholar] [CrossRef]
  9. Elmazi, K.; Elmazi, D.; Musta, E.; Mehmeti, F.; Hidri, F. An Intelligent Transportation Systems-Based Machine Learning-Enhanced Traffic Prediction Model using Time Series Analysis and Regression Techniques. In Proceedings of the International Conference on INnovation in Intelligent SysTems and Applciations (INISTA), Craiova, Romania, 4–6 September 2024; pp. 1–6. [Google Scholar] [CrossRef]
  10. Jeong, S.; Oh, C.; Jeong, J. BAT-Transformer: Prediction of Bus Arrival Time with Transformer Encoder for Smart Public Transportation System. Appl. Sci. 2024, 14, 9488. [Google Scholar] [CrossRef]
  11. Zhang, D.L.; Xu, Y.; Peng, Y.J.; Du, C.Y.; Wang, N.; Tang, M.C.; Lu, L.Y.; Liu, J.Q. An Interpretable Station Delay Prediction Model Based on Graph Community Neural Network and Time-Series Fuzzy Decision Tree. IEEE Trans. Fuzzy Syst. 2023, 31, 421–433. [Google Scholar] [CrossRef]
  12. Iftikhar, H.; Gonzales, S.M.; Zywiolek, J.; López-Gonzales, J.L. Electricity Demand Forecasting Using a Novel Time Series Ensemble Technique. IEEE Access 2024, 12, 88963–88975. [Google Scholar] [CrossRef]
  13. Gonzales, S.M.; Iftikhar, H.; López-Gonzales, J.L. Analysis and forecasting of electricity prices using an improved time series ensemble approach: An application to the Peruvian electricity market. AIMS Math. 2024, 9, 21952–21971. [Google Scholar] [CrossRef]
  14. Qiu, X.H.; Ru, Y.J.; Tan, X.Y.; Chen, J.; Chen, B.; Guo, Y. A k-nearest neighbor attentive deep autoregressive network for electricity consumption prediction. Int. J. Mach. Learn. Cybern. 2024, 15, 1201–1212. [Google Scholar] [CrossRef]
  15. Yu, L.; Ge, X. Time-Series Prediction of Electricity Load for Charging Piles in a Region of China Based on Broad Learning System. Mathematics 2024, 12, 2147. [Google Scholar] [CrossRef]
  16. Hsu, Y.; Tsai, Y.; Li, C. FinGAT: Financial Graph Attention Networks for Recommending Top-K Profitable Stocks. IEEE Trans. Knowl. Data Eng. 2023, 35, 469–481. [Google Scholar] [CrossRef]
  17. Pal, S.S.; Kar, S. Fuzzy transfer learning in time series forecasting for stock market prices. Soft Comput. 2022, 26, 6941–6952. [Google Scholar] [CrossRef]
  18. Yin, S.; Gao, Y.W.; Nie, S.; Li, J.B. SSTP: Stock Sector Trend Prediction with Temporal-Spatial Network. Inf. Technol. Control 2023, 52, 653–664. [Google Scholar] [CrossRef]
  19. Jiang, M.R.; Chen, W.; Xu, H.L.; Liu, Y.X. A novel interval dual convolutional neural network method for interval-valued stock price prediction. Pattern Recognit. 2024, 145, 109920. [Google Scholar] [CrossRef]
  20. Shahvandi, M.K.; Mishra, S.; Soja, B. BaHaMAs: A method for uncertainty quantification in geodetic time series and its application in short-term prediction of length of day. Earth Planets Space 2024, 76, 127. [Google Scholar] [CrossRef]
  21. Shahvandi, M.K.; Belda, S.; Karbon, M.; Mishra, S.; Soja, B. Deep ensemble geophysics-informed neural networks for the prediction of celestial pole offsets. Geophys. J. Int. 2023, 236, 480–493. [Google Scholar] [CrossRef]
  22. Zhou, W.J.; Zhu, C.; Ma, J. Single-layer folded RNN for time series prediction and classification under a non-Von Neumann architecture. Digit. Signal Process. 2024, 147, 104415. [Google Scholar] [CrossRef]
  23. Murata, R.; Okubo, F.; Minematsu, T.; Taniguchi, Y.; Shimada, A. Recurrent Neural Network-FitNets: Improving Early Prediction of Student Performanceby Time-Series Knowledge Distillation. J. Educ. Comput. Res. 2023, 61, 639–670. [Google Scholar] [CrossRef]
  24. Zhang, C.; Liu, J.; Zhang, S. Online Purchase Behavior Prediction Model Based on Recurrent Neural Network and Naive Bayes. J. Theor. Appl. Electron. Commer. Res. 2024, 19, 3461–3476. [Google Scholar] [CrossRef]
  25. Monti, M.; Fiorentino, J.; Milanetti, E.; Gosti, G.; Tartaglia, G.G. Prediction of Time Series Gene Expression and Structural Analysis of Gene Regulatory Networks Using Recurrent Neural Networks. Entropy 2022, 24, 141. [Google Scholar] [CrossRef] [PubMed]
  26. Elmi, S.; Morris, B. Res-ViT: Residual Vision Transformers for Image Recognition Tasks. In Proceedings of the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Atlanta, GA, USA, 6–8 November 2023; pp. 309–316. [Google Scholar] [CrossRef]
  27. Meng, L.; Li, H.; Chen, B.; Lan, S.; Wu, Z.; Jiang, Y.; Lim, S. AdaViT: Adaptive Vision Transformers for Efficient Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12299–12308. [Google Scholar] [CrossRef]
  28. Nag, S.; Datta, G.; Kundu, S.; Chandrachoodan, N.; Beerel, P. ViTA: A Vision Transformer Inference Accelerator for Edge Applications. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [Google Scholar] [CrossRef]
  29. Yang, Z.; Wang, J.; Ye, X.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H.S. Language-Aware Vision Transformer for Referring Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5238–5255. [Google Scholar] [CrossRef]
  30. Lin, H.; Yang, L.; Wang, P. W-core Transformer Model for Chinese Word Segmentation. In Proceedings of the WorldCist’21—9th World Conference on Information Systems and Technologies (WorldCIST), Terceira Island, Azores, Portugal, 30 March–2 April 2021; pp. 270–280. [Google Scholar] [CrossRef]
  31. Nguyen, M.; Lai, V.; Pouran Ben Veyseh, A.; Nguyen, T.H. Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online, 19–23 April 2021; pp. 80–90. [Google Scholar]
  32. Sarkar, S.; Babar, M.F.; Hassan, M.M.; Hasan, M.; Santu, S.K.K. Processing Natural Language on Embedded Devices: How Well Do Transformer Models Perform? In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE), New York, NY, USA, 7–11 May 2024; pp. 211–222. [Google Scholar] [CrossRef]
  33. Molinaro, L.; Tatano, R.; Busto, E.; Fiandrotti, A.; Basile, V.; Patti, V. DelBERTo: A Deep Lightweight Transformer for Sentiment Analysis. In Proceedings of the 21st International Conference of the Italian Association for Artificial Intelligence (AIxIA), Udine, Italy, 28 November–2 December 2022; pp. 443–456. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  35. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
  36. Liu, B.J.; Li, Z.M.; Li, Z.L.; Chen, C. CL-Informer: Long time series prediction model based on continuous wavelet transform. PLoS ONE 2024, 19, 9. [Google Scholar] [CrossRef] [PubMed]
  37. Zhou, H.; Zhang, S.H.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the 35th AAAI Conference on Artificial Intelligence/33rd Conference on Innovative Applications of Artificial Intelligence/11th Symposium on Educational Advances in Artificial Intelligence (AAAI), Online, 2–9 February 2021; pp. 11106–11115. [Google Scholar] [CrossRef]
  38. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; pp. 22419–22430. [Google Scholar]
  39. Nie, Y.; Nguyen, N.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  40. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–14 May 2024. [Google Scholar] [CrossRef]
  41. Wang, Y.; Wu, H.; Dong, J.; Qin, G.; Zhang, H.; Liu, Y.; Qiu, Y.; Wang, J.; Long, M. TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar] [CrossRef]
  42. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  43. Tong, H.; Kong, L.; Liu, J.; Gao, S.; Xu, Y.; Chen, Y. Segmented Frequency-Domain Correlation Prediction Model for Long-Term Time Series Forecasting Using Transformer. IET Softw. 2024, 2024, 2920167. [Google Scholar] [CrossRef]
  44. Yi, K.; Zhang, Q.; Fan, W.; Wang, S.; Wang, P.; He, H.; Lian, D.; An, N.; Cao, L.; Niu, Z. Frequency-domain MLPs are More Effective Learners in Time Series Forecasting. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; pp. 76656–76679. [Google Scholar]
  45. Xu, Z.; Zeng, A.; Xu, Q. FITS: Modeling Time Series with 10k Parameters. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–14 May 2024. [Google Scholar]
  46. Lin, S.; Lin, W.; Wu, W.; Chen, H.; Yang, J. SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  47. Han, L.; Ye, H.; Zhan, D. The Capacity and Robustness Trade-Off: Revisiting the Channel Independent Strategy for Multivariate Time Series Forecasting. IEEE Trans. Knowl. Data Eng. 2024, 36, 7129–7142. [Google Scholar] [CrossRef]
  48. Wang, H.; Peng, J.; Huang, F.; Wang, J.; Chen, J.; Xiao, Y. MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–11. [Google Scholar]
Figure 1. The difference in time steps between the point-wise model and the patch-wise model. The red dashed lines represent time steps from point-wise models, the blue rectangular areas represent time steps from patch-wise models using channel-dependent strategies, and the green rectangular areas represent time steps from patch-wise models using channel-independent strategies.
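As a concrete illustration of the patch-wise, channel-independent view sketched in Figure 1, the following minimal PyTorch snippet splits a multivariate series into per-variable patches. The function name, shapes, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch

# Illustrative sketch (not the paper's exact code): split each variable's
# series into patches under a channel-independent strategy.
def make_patches(x: torch.Tensor, patch_len: int, stride: int) -> torch.Tensor:
    """x: [batch, n_vars, seq_len] -> [batch * n_vars, n_patches, patch_len]."""
    b, c, _ = x.shape
    patches = x.unfold(dimension=-1, size=patch_len, step=stride)  # [b, c, n_patches, patch_len]
    # Channel independence: each variable is treated as its own sample.
    return patches.reshape(b * c, patches.shape[2], patch_len)

x = torch.randn(32, 7, 96)                     # e.g., an ETT-like input with 7 variables
patches = make_patches(x, patch_len=16, stride=16)
print(patches.shape)                           # torch.Size([224, 6, 16])
```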
Figure 2. FCP-Former structure. The green rectangle in the yellow background represents time-domain data, the blue rectangle represents frequency-domain data, and the brown rectangle represents time-domain data with frequency-domain features. A channel-independent strategy and a frequency compensation layer are used to perform representation learning in the frequency domain for each patch. The ellipsis represents data from other independent feature channels. After representation learning is completed, the frequency compensation layer will fuse the frequency-domain features between patches, creating new patches with frequency-domain characteristics as learned data. The learned data is then converted back to the time domain via an inverse Fourier transform for embedding operations. The vanilla transformer encoder and linear layers are used to produce the prediction results.
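The following is a minimal, hypothetical PyTorch sketch of the frequency compensation idea described in the Figure 2 caption: each patch is mapped to the frequency domain with an FFT, re-weighted by learnable complex filters, mapped back with an inverse FFT, and added to the original patch before embedding. The class name, the per-bin filter, and the residual connection are simplifying assumptions; the actual FCP-Former layer also fuses frequency-domain features across patches.

```python
import torch
import torch.nn as nn

class FrequencyCompensation(nn.Module):
    """Hypothetical, simplified sketch of a frequency compensation step."""

    def __init__(self, patch_len: int):
        super().__init__()
        n_freq = patch_len // 2 + 1                        # rfft output length
        # Learnable complex filter per frequency bin (real + imaginary parts).
        self.filter = nn.Parameter(torch.randn(n_freq, 2) * 0.02)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: [batch * n_vars, n_patches, patch_len]
        spec = torch.fft.rfft(patches, dim=-1)             # complex spectrum per patch
        spec = spec * torch.view_as_complex(self.filter)   # frequency-domain re-weighting
        compensated = torch.fft.irfft(spec, n=patches.shape[-1], dim=-1)
        return patches + compensated                       # time-domain patches enriched with frequency features

layer = FrequencyCompensation(patch_len=16)
out = layer(torch.randn(224, 6, 16))
print(out.shape)                                           # torch.Size([224, 6, 16])
```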
Figure 3. Data points in different domains. (a) Time-domain plot of a sine wave with a frequency of 5 Hz; (b) frequency-domain plot of the same 5 Hz sine wave.
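The behaviour shown in Figure 3 can be reproduced with a few lines of NumPy: a 5 Hz sine wave sampled over one second has a single dominant peak at the 5 Hz bin of its FFT. The sampling rate and duration below are illustrative choices, not values taken from the paper.

```python
import numpy as np

fs = 100                                   # sampling rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)                # 1 s of samples
x = np.sin(2 * np.pi * 5 * t)              # 5 Hz sine wave

spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peak = freqs[np.argmax(np.abs(spectrum))]
print(peak)                                # 5.0 -> the energy concentrates at 5 Hz
```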
Figure 4. The impact of different values of the hyperparameter M on MSE for the ETTh2 (upper row) and Electricity (lower row) datasets.
Figure 5. Visualization results of forecasting sequences randomly selected from ETTm1, Weather, and Electricity. The data alignment is based on the same time steps.
Figure 6. Comprehensive performance analysis of training time, metrics, and GPU occupancy on the ETTh1 dataset. A larger marker size indicates higher GPU usage.
Table 1. Summary of time series forecasting methods.
| Type | Method | Approach | Data Domain | Train Speed | Gap |
|---|---|---|---|---|---|
| Patch-wise | PatchTST [39] | Patch mechanism | Time domain | Fast | Poor ability to capture internal information within the patch |
| Patch-wise | iTransformer [40] | Reverse-dimension patch mechanism | Time domain | Very fast | Poor ability to capture internal information within the patch |
| Patch-wise | TimeXer [41] | Exogenous variables | Time domain | Fast | Only captures internal information within the patch in the time domain |
| Patch-wise | Crossformer [42] | Cross-dimension attention | Time domain | Slow | Only captures internal information within the patch in the time domain |
| Point-wise | FEDformer [35] | Frequency-enhanced attention | Time–frequency domain | Very slow | High training overhead |
| Point-wise | Informer [37] | Sparse self-attention | Time domain | Medium | High training overhead |
| Point-wise | Autoformer [38] | Seasonal self-attention mechanism | Time domain | Slow | High training overhead |
Table 2. Details of datasets.
| Datasets | ETTh | ETTm | Traffic | Weather | Electricity | ILI |
|---|---|---|---|---|---|---|
| Timesteps | 17,420 | 69,680 | 17,544 | 52,696 | 26,304 | 966 |
| Features | 7 | 7 | 862 | 21 | 321 | 7 |
| Partitions (train/val/test) | 12/4/4 | 12/4/4 | 7/1/2 | 7/1/2 | 7/1/2 | 6/2/2 |
Table 3. Multivariate long-term forecasting results. “96”, “192”, “336”, and “720” denote the prediction lengths; “avg” denotes the average of the results across the four prediction lengths.
| Dataset | Length | FCP-Former MSE | MAE | PatchTST MSE | MAE | iTransformer MSE | MAE | TimeXer MSE | MAE | FEDformer MSE | MAE | Crossformer MSE | MAE | Autoformer MSE | MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.378 | 0.395 | 0.378 | 0.395 | 0.385 | 0.404 | 0.386 | 0.399 | 0.388 | 0.425 | 0.384 | 0.408 | 0.447 | 0.451 |
| | 192 | 0.426 | 0.421 | 0.443 | 0.435 | 0.441 | 0.438 | 0.438 | 0.432 | 0.437 | 0.450 | 0.433 | 0.435 | 0.486 | 0.475 |
| | 336 | 0.472 | 0.445 | 0.493 | 0.461 | 0.479 | 0.456 | 0.483 | 0.455 | 0.482 | 0.476 | 0.677 | 0.628 | 0.505 | 0.490 |
| | 720 | 0.471 | 0.460 | 0.527 | 0.499 | 0.489 | 0.482 | 0.491 | 0.476 | 0.502 | 0.498 | 0.670 | 0.616 | 0.517 | 0.519 |
| | avg | 0.437 | 0.430 | 0.460 | 0.447 | 0.449 | 0.445 | 0.449 | 0.440 | 0.452 | 0.462 | 0.541 | 0.522 | 0.489 | 0.484 |
| ETTh2 | 96 | 0.287 | 0.339 | 0.292 | 0.343 | 0.297 | 0.347 | 0.289 | 0.342 | 0.339 | 0.383 | 0.678 | 0.634 | 0.344 | 0.385 |
| | 192 | 0.374 | 0.394 | 0.373 | 0.399 | 0.378 | 0.398 | 0.371 | 0.394 | 0.414 | 0.427 | 1.141 | 0.745 | 0.422 | 0.433 |
| | 336 | 0.382 | 0.412 | 0.390 | 0.416 | 0.426 | 0.433 | 0.419 | 0.430 | 0.453 | 0.464 | 1.200 | 0.764 | 0.455 | 0.464 |
| | 720 | 0.417 | 0.437 | 0.422 | 0.443 | 0.430 | 0.448 | 0.416 | 0.438 | 0.480 | 0.487 | 1.384 | 0.836 | 0.465 | 0.477 |
| | avg | 0.365 | 0.395 | 0.369 | 0.400 | 0.383 | 0.407 | 0.374 | 0.401 | 0.422 | 0.441 | 1.101 | 0.745 | 0.421 | 0.440 |
| ETTm1 | 96 | 0.322 | 0.360 | 0.330 | 0.367 | 0.360 | 0.387 | 0.330 | 0.367 | 0.373 | 0.419 | 0.343 | 0.381 | 0.620 | 0.528 |
| | 192 | 0.368 | 0.386 | 0.370 | 0.387 | 0.389 | 0.405 | 0.367 | 0.387 | 0.415 | 0.440 | 0.375 | 0.403 | 0.603 | 0.519 |
| | 336 | 0.399 | 0.407 | 0.398 | 0.411 | 0.419 | 0.416 | 0.401 | 0.411 | 0.450 | 0.460 | 0.413 | 0.424 | 0.622 | 0.526 |
| | 720 | 0.467 | 0.452 | 0.461 | 0.444 | 0.493 | 0.458 | 0.467 | 0.450 | 0.509 | 0.487 | 0.530 | 0.508 | 0.565 | 0.515 |
| | avg | 0.389 | 0.401 | 0.390 | 0.403 | 0.415 | 0.417 | 0.391 | 0.403 | 0.437 | 0.452 | 0.415 | 0.429 | 0.602 | 0.522 |
| ETTm2 | 96 | 0.177 | 0.257 | 0.185 | 0.264 | 0.181 | 0.265 | 0.175 | 0.258 | 0.192 | 0.282 | 0.269 | 0.351 | 0.220 | 0.303 |
| | 192 | 0.240 | 0.298 | 0.247 | 0.307 | 0.250 | 0.310 | 0.238 | 0.300 | 0.264 | 0.324 | 0.363 | 0.419 | 0.272 | 0.330 |
| | 336 | 0.301 | 0.340 | 0.309 | 0.346 | 0.315 | 0.352 | 0.296 | 0.339 | 0.325 | 0.362 | 0.673 | 0.596 | 0.327 | 0.365 |
| | 720 | 0.401 | 0.398 | 0.422 | 0.422 | 0.411 | 0.406 | 0.405 | 0.406 | 0.421 | 0.416 | 2.652 | 1.111 | 0.421 | 0.418 |
| | avg | 0.280 | 0.323 | 0.291 | 0.335 | 0.289 | 0.333 | 0.279 | 0.326 | 0.301 | 0.346 | 0.989 | 0.619 | 0.310 | 0.354 |
| Traffic | 96 | 0.490 | 0.311 | 0.492 | 0.314 | 0.427 | 0.289 | 0.466 | 0.302 | 0.575 | 0.354 | 0.528 | 0.293 | 0.647 | 0.396 |
| | 192 | 0.486 | 0.307 | 0.482 | 0.305 | 0.456 | 0.305 | 0.485 | 0.317 | 0.647 | 0.406 | 0.544 | 0.295 | 0.666 | 0.418 |
| | 336 | 0.502 | 0.318 | 0.495 | 0.311 | 0.476 | 0.316 | 0.502 | 0.322 | 0.669 | 0.419 | 0.572 | 0.298 | 0.699 | 0.434 |
| | 720 | 0.537 | 0.335 | 0.528 | 0.330 | 0.514 | 0.341 | 0.538 | 0.340 | 0.721 | 0.444 | 0.596 | 0.311 | 0.710 | 0.440 |
| | avg | 0.504 | 0.318 | 0.499 | 0.315 | 0.468 | 0.313 | 0.498 | 0.320 | 0.652 | 0.420 | 0.560 | 0.299 | 0.680 | 0.422 |
| Weather | 96 | 0.162 | 0.209 | 0.175 | 0.217 | 0.173 | 0.211 | 0.158 | 0.204 | 0.220 | 0.299 | 0.158 | 0.235 | 0.253 | 0.323 |
| | 192 | 0.210 | 0.253 | 0.222 | 0.259 | 0.222 | 0.254 | 0.206 | 0.250 | 0.283 | 0.350 | 0.203 | 0.267 | 0.298 | 0.353 |
| | 336 | 0.265 | 0.293 | 0.276 | 0.298 | 0.281 | 0.298 | 0.263 | 0.292 | 0.347 | 0.399 | 0.254 | 0.309 | 0.357 | 0.394 |
| | 720 | 0.343 | 0.344 | 0.354 | 0.351 | 0.356 | 0.349 | 0.343 | 0.343 | 0.402 | 0.413 | 0.367 | 0.391 | 0.419 | 0.427 |
| | avg | 0.245 | 0.275 | 0.257 | 0.281 | 0.258 | 0.278 | 0.242 | 0.272 | 0.313 | 0.365 | 0.246 | 0.301 | 0.332 | 0.374 |
| Electricity | 96 | 0.156 | 0.250 | 0.167 | 0.254 | 0.158 | 0.252 | 0.162 | 0.252 | 0.215 | 0.327 | 0.219 | 0.314 | 0.207 | 0.321 |
| | 192 | 0.169 | 0.262 | 0.180 | 0.267 | 0.189 | 0.274 | 0.192 | 0.279 | 0.232 | 0.341 | 0.231 | 0.322 | 0.216 | 0.327 |
| | 336 | 0.188 | 0.280 | 0.198 | 0.284 | 0.208 | 0.294 | 0.208 | 0.295 | 0.254 | 0.359 | 0.246 | 0.337 | 0.271 | 0.368 |
| | 720 | 0.229 | 0.317 | 0.238 | 0.317 | 0.254 | 0.331 | 0.249 | 0.329 | 0.305 | 0.394 | 0.280 | 0.363 | 0.282 | 0.377 |
| | avg | 0.186 | 0.277 | 0.198 | 0.282 | 0.207 | 0.291 | 0.206 | 0.293 | 0.252 | 0.356 | 0.244 | 0.334 | 0.244 | 0.348 |
| ILI | 24 | 1.689 | 0.803 | 1.650 | 0.804 | 2.357 | 1.058 | 2.333 | 1.042 | 4.077 | 1.424 | 3.370 | 1.193 | 2.802 | 1.153 |
| | 36 | 1.573 | 0.777 | 1.714 | 0.853 | 2.236 | 1.027 | 2.192 | 0.976 | 3.865 | 1.414 | 3.533 | 1.219 | 2.734 | 1.085 |
| | 48 | 1.684 | 0.815 | 1.718 | 0.863 | 2.207 | 1.020 | 2.173 | 0.969 | 3.881 | 1.404 | 3.790 | 1.263 | 2.592 | 1.045 |
| | 60 | 1.992 | 0.905 | 1.977 | 0.934 | 2.212 | 1.036 | 2.111 | 0.961 | 3.947 | 1.409 | 4.076 | 1.327 | 2.833 | 1.127 |
| | avg | 1.734 | 0.825 | 1.765 | 0.863 | 2.253 | 1.035 | 2.203 | 0.987 | 3.943 | 1.413 | 3.692 | 1.250 | 2.740 | 1.102 |
| SOTA counts | | 48 | | 7 | | 6 | | 16 | | 0 | | 7 | | 0 | |
The best results are in bold and the second best are underlined.
Table 4. Results of the ablation study of FCP-Former.
| Dataset | Length | FCP-Former MSE | ΔMSE% | FCP-Former MAE | ΔMAE% | w/o FCL MSE | w/o FCL MAE |
|---|---|---|---|---|---|---|---|
| ETTm2 | 96 | 0.177 | 4.32% | 0.257 | 2.65% | 0.185 | 0.264 |
| | 192 | 0.240 | 2.83% | 0.298 | 2.93% | 0.247 | 0.307 |
| | 336 | 0.301 | 2.58% | 0.340 | 1.73% | 0.309 | 0.346 |
| | 720 | 0.401 | 4.98% | 0.398 | 5.69% | 0.422 | 0.422 |
| | avg | 0.280 | 3.78% | 0.323 | 3.58% | 0.291 | 0.335 |
| Weather | 96 | 0.162 | 7.43% | 0.209 | 3.69% | 0.175 | 0.217 |
| | 192 | 0.210 | 5.71% | 0.253 | 2.32% | 0.222 | 0.259 |
| | 336 | 0.265 | 3.99% | 0.293 | 1.68% | 0.276 | 0.298 |
| | 720 | 0.343 | 3.11% | 0.344 | 1.99% | 0.354 | 0.351 |
| | avg | 0.245 | 4.67% | 0.275 | 2.14% | 0.257 | 0.281 |
| Electricity | 96 | 0.157 | 5.99% | 0.251 | 1.18% | 0.167 | 0.254 |
| | 192 | 0.169 | 6.11% | 0.262 | 1.87% | 0.180 | 0.267 |
| | 336 | 0.188 | 5.05% | 0.280 | 1.41% | 0.198 | 0.284 |
| | 720 | 0.229 | 3.78% | 0.317 | 0% | 0.238 | 0.317 |
| | avg | 0.186 | 6.06% | 0.277 | 1.77% | 0.198 | 0.282 |
The best results are in bold. ΔMSE% and ΔMAE% represent the absolute percentage improvements in MSE and MAE, respectively, compared to the variant without FCL.
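As a worked check of the ΔMSE% column, the entry for ETTm2 at prediction length 96 follows directly from the two MSE columns of Table 4 (the improvement is measured relative to the variant without the frequency compensation layer):

```python
# ΔMSE% for ETTm2, horizon 96: improvement of FCP-Former over the w/o FCL variant.
mse_with_fcl = 0.177
mse_without_fcl = 0.185
delta = abs(mse_without_fcl - mse_with_fcl) / mse_without_fcl * 100
print(f"{delta:.2f}%")   # 4.32%
```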
Table 5. Multivariate long-term forecasting results with FCP-Former-336 and FCP-Former-512.
| Dataset | Length | FCP-Former MSE | MAE | FCP-Former-336 MSE | MAE | FCP-Former-512 MSE | MAE |
|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.378 | 0.395 | 0.379 | 0.400 | 0.376 | 0.403 |
| | 192 | 0.426 | 0.421 | 0.411 | 0.422 | 0.421 | 0.439 |
| | 336 | 0.472 | 0.445 | 0.482 | 0.472 | 0.438 | 0.453 |
| | 720 | 0.471 | 0.460 | 0.505 | 0.500 | 0.475 | 0.484 |
| | avg | 0.437 | 0.430 | 0.444 | 0.448 | 0.427 | 0.445 |
| ETTh2 | 96 | 0.287 | 0.339 | 0.290 | 0.349 | 0.280 | 0.343 |
| | 192 | 0.374 | 0.394 | 0.340 | 0.385 | 0.331 | 0.383 |
| | 336 | 0.382 | 0.412 | 0.353 | 0.402 | 0.361 | 0.407 |
| | 720 | 0.417 | 0.437 | 0.408 | 0.440 | 0.395 | 0.434 |
| | avg | 0.365 | 0.395 | 0.348 | 0.394 | 0.342 | 0.392 |
| ETTm1 | 96 | 0.322 | 0.360 | 0.296 | 0.350 | 0.304 | 0.350 |
| | 192 | 0.368 | 0.386 | 0.343 | 0.375 | 0.345 | 0.375 |
| | 336 | 0.399 | 0.407 | 0.382 | 0.397 | 0.376 | 0.392 |
| | 720 | 0.467 | 0.452 | 0.440 | 0.429 | 0.431 | 0.421 |
| | avg | 0.389 | 0.401 | 0.365 | 0.388 | 0.364 | 0.385 |
| ETTm2 | 96 | 0.177 | 0.257 | 0.167 | 0.256 | 0.165 | 0.254 |
| | 192 | 0.240 | 0.298 | 0.221 | 0.293 | 0.221 | 0.292 |
| | 336 | 0.301 | 0.340 | 0.279 | 0.330 | 0.276 | 0.328 |
| | 720 | 0.401 | 0.398 | 0.374 | 0.387 | 0.366 | 0.385 |
| | avg | 0.280 | 0.323 | 0.260 | 0.317 | 0.257 | 0.315 |
| Traffic | 96 | 0.490 | 0.311 | 0.419 | 0.303 | 0.419 | 0.305 |
| | 192 | 0.486 | 0.307 | 0.427 | 0.305 | 0.425 | 0.308 |
| | 336 | 0.502 | 0.318 | 0.438 | 0.307 | 0.434 | 0.313 |
| | 720 | 0.537 | 0.335 | 0.472 | 0.329 | 0.469 | 0.327 |
| | avg | 0.504 | 0.318 | 0.439 | 0.311 | 0.437 | 0.313 |
| Weather | 96 | 0.162 | 0.209 | 0.151 | 0.203 | 0.150 | 0.208 |
| | 192 | 0.210 | 0.253 | 0.195 | 0.246 | 0.194 | 0.248 |
| | 336 | 0.265 | 0.293 | 0.249 | 0.288 | 0.244 | 0.287 |
| | 720 | 0.343 | 0.344 | 0.329 | 0.340 | 0.315 | 0.337 |
| | avg | 0.245 | 0.275 | 0.231 | 0.269 | 0.226 | 0.270 |
| Electricity | 96 | 0.157 | 0.251 | 0.137 | 0.234 | 0.136 | 0.235 |
| | 192 | 0.169 | 0.262 | 0.156 | 0.250 | 0.158 | 0.255 |
| | 336 | 0.188 | 0.280 | 0.173 | 0.269 | 0.171 | 0.268 |
| | 720 | 0.229 | 0.317 | 0.208 | 0.298 | 0.222 | 0.316 |
| | avg | 0.186 | 0.277 | 0.169 | 0.263 | 0.172 | 0.268 |
The best results are in bold.
Table 6. The MSE of the prediction results of FCP-Former and PatchTST under different patch lengths in the ETTh1 dataset.
| Prediction Length | FCP-Former, patch 16 (MSE) | patch 24 (MSE) | patch 24 (ΔMSE%) | patch 32 (MSE) | patch 32 (ΔMSE%) | PatchTST, patch 16 (MSE) | patch 24 (MSE) | patch 24 (ΔMSE%) | patch 32 (MSE) | patch 32 (ΔMSE%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 0.378 | 0.379 | 0.26% | 0.381 | 0.79% | 0.378 | 0.389 | 2.91% | 0.392 | 3.70% |
| 192 | 0.426 | 0.426 | 0% | 0.432 | 1.41% | 0.443 | 0.451 | 1.81% | 0.452 | 2.03% |
| 336 | 0.472 | 0.482 | 2.12% | 0.479 | 1.48% | 0.493 | 0.508 | 3.04% | 0.507 | 2.84% |
| 720 | 0.471 | 0.471 | 0% | 0.476 | 1.06% | 0.527 | 0.542 | 2.85% | 0.585 | 11.01% |
| avg | 0.437 | 0.439 | 0.46% | 0.442 | 1.14% | 0.460 | 0.473 | 2.83% | 0.484 | 5.22% |
ΔMSE% represents the percentage change in the MSE value compared to the case with a patch length of 16.
Table 7. The MAE of the prediction results of FCP-Former and PatchTST under different patch lengths in the ETTh1 dataset.
| Prediction Length | FCP-Former, patch 16 (MAE) | patch 24 (MAE) | patch 24 (ΔMAE%) | patch 32 (MAE) | patch 32 (ΔMAE%) | PatchTST, patch 16 (MAE) | patch 24 (MAE) | patch 24 (ΔMAE%) | patch 32 (MAE) | patch 32 (ΔMAE%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 0.395 | 0.396 | 0.25% | 0.399 | 1.01% | 0.395 | 0.402 | 1.77% | 0.405 | 2.53% |
| 192 | 0.421 | 0.423 | 0.47% | 0.426 | 1.19% | 0.435 | 0.442 | 1.61% | 0.445 | 2.30% |
| 336 | 0.445 | 0.448 | 0.67% | 0.449 | 0.90% | 0.461 | 0.468 | 1.52% | 0.469 | 1.74% |
| 720 | 0.460 | 0.462 | 0.43% | 0.468 | 1.74% | 0.499 | 0.501 | 0.40% | 0.525 | 5.21% |
| avg | 0.430 | 0.432 | 0.46% | 0.435 | 1.16% | 0.447 | 0.453 | 1.34% | 0.460 | 2.91% |
ΔMAE% represents the percentage change in the MAE value compared to the case with a patch length of 16.
Table 8. Results of robustness experiment in ETTh1 dataset.
| Run | MSE (96) | MSE (192) | MSE (336) | MSE (720) |
|---|---|---|---|---|
| 1 | 0.379 | 0.435 | 0.480 | 0.473 |
| 2 | 0.380 | 0.426 | 0.472 | 0.471 |
| 3 | 0.380 | 0.434 | 0.472 | 0.477 |
| 4 | 0.376 | 0.424 | 0.474 | 0.484 |
| 5 | 0.377 | 0.429 | 0.476 | 0.485 |
| 6 | 0.380 | 0.426 | 0.480 | 0.473 |
| 7 | 0.379 | 0.426 | 0.478 | 0.484 |
| 8 | 0.379 | 0.427 | 0.472 | 0.469 |
| 9 | 0.384 | 0.427 | 0.460 | 0.460 |
| 10 | 0.378 | 0.429 | 0.482 | 0.473 |
| 90% confidence bands | [0.377, 0.380] | [0.426, 0.429] | [0.471, 0.479] | [0.471, 0.478] |
| Robustness | √ | √ | √ | √ |
“√” indicates that the forecast results shown in Table 3 fall within the confidence bands.
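The exact construction of the 90% confidence band is not stated in this excerpt; a Student-t interval on the mean of the ten runs is one plausible reading and reproduces the reported band for prediction length 96 up to rounding:

```python
import numpy as np
from scipy import stats

# Assumed construction: 90% Student-t confidence interval on the mean of 10 runs.
runs_96 = np.array([0.379, 0.380, 0.380, 0.376, 0.377,
                    0.380, 0.379, 0.379, 0.384, 0.378])
mean = runs_96.mean()
half_width = stats.t.ppf(0.95, df=len(runs_96) - 1) * runs_96.std(ddof=1) / np.sqrt(len(runs_96))
print(f"[{mean - half_width:.3f}, {mean + half_width:.3f}]")  # ~[0.378, 0.380]
```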
Table 9. Results of the training cost experiment.
All results are measured on the ETTh1 dataset.
| Method | Iter (ms/iteration) | MSE | MAE | GPU (MiB) | Epochs | TSPE (s/epoch) | TRT (s) |
|---|---|---|---|---|---|---|---|
| FCP-Former | 22.7 | 0.371 | 0.391 | 1702 | 10 | 1.50 | 15.00 |
| PatchTST | 19.1 | 0.378 | 0.395 | 1696 | 6 | 1.26 | 7.56 |
| iTransformer | 8.5 | 0.385 | 0.404 | 770 | 7 | 0.57 | 3.99 |
| TimeXer | 18.4 | 0.386 | 0.399 | 1352 | 14 | 1.23 | 17.22 |
| FEDformer | 146.6 | 0.388 | 0.425 | 4798 | 12 | 9.82 | 117.84 |
| Crossformer | 57.0 | 0.384 | 0.408 | 3936 | 6 | 3.82 | 22.92 |
| Autoformer | 68.2 | 0.447 | 0.451 | 5298 | 6 | 4.57 | 27.42 |
The best results are in bold. Iter: time per iteration (ms/iteration). MSE, MAE: evaluation results. GPU: GPU memory usage (MiB). Epochs: number of training epochs. TSPE: time per epoch (s/epoch). TRT: total running time (s).
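The quantities in Table 9 (time per iteration, time per epoch, and peak GPU memory) can be collected with standard PyTorch utilities. The helper below is an assumed measurement sketch, not the authors' benchmarking code; the function name and return layout are illustrative.

```python
import time
import torch

def benchmark_epoch(model, loader, loss_fn, optimizer, device="cuda"):
    """Run one training epoch and return (ms/iteration, s/epoch, peak MiB)."""
    model.train()
    torch.cuda.reset_peak_memory_stats(device)
    start, n_iters = time.time(), 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        n_iters += 1
    torch.cuda.synchronize(device)          # make sure all GPU work is finished before timing
    epoch_time = time.time() - start
    peak_mib = torch.cuda.max_memory_allocated(device) / 1024**2
    return epoch_time / n_iters * 1000, epoch_time, peak_mib
```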