The framework of the proposed FSTML is shown in Figure 1. The model uses the Transformer decoder structure of GPT-2 [31], retaining the positional encoding (PE) layer, multi-head attention (MHA) layer, and feed-forward network (FFN) of the pre-trained model. Because these three layers contain most of the generalized knowledge learned by the pre-trained large language model, which already includes the ability to model load data, FSTML freezes the multi-head attention, positional encoding, and feed-forward layers and allows only the residual connections and layer normalization of the Transformer decoder to participate in training. The network layers marked with a "snow" icon in the figure represent the frozen parameter layers of the pre-trained large language model, while the layers marked with a "spark" icon are the fine-tuning layers involved in training. Following the One-Fits-All model [27], the framework adopts the frozen pre-trained large language model architecture as a whole, with the corresponding fine-tuning layers specially designed for the characteristics of load data. The methodological framework consists of three parts: (1) the Frequency-domain Global Learning Module (FGLM), (2) the Temporal-Dimension Learning Module (TDLM), and (3) the Spatial-Dimension Learning Module (SDLM).
3.1. Frequency-Domain Global Learning Module
Normalizing the load data before it is input into the model is crucial for improving the model's prediction performance. Assume that the input multivariate load data is $X$; the instance normalization operation is expressed as follows:
$$\hat{X} = \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}},$$
where $\mu$ and $\sigma^2$ are the mean and variance of the input instance, respectively, and $\epsilon$ is a very small constant that prevents numerical instability when the variance is close to zero.
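As a concrete illustration, the following is a minimal PyTorch sketch of this instance normalization; the tensor layout (batch, time, variables) and the value of $\epsilon$ are assumptions made for the example, not specifications from the paper:

```python
import torch

def instance_normalize(x, eps=1e-5):
    # x: (batch, L, C) window of historical load data.
    # Normalize each instance along the time dimension, per variable.
    mu = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, keepdim=True, unbiased=False)
    x_norm = (x - mu) / torch.sqrt(var + eps)
    return x_norm, mu, var  # statistics are kept so the forecast can be de-normalized
```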
To prevent excessive and redundant information in the load data from degrading the model's prediction performance and to improve prediction efficiency, this module further applies a patching [32] operation to the normalized load data. Specifically, each input sequence is first divided into blocks by a sliding window, which can be overlapping or non-overlapping depending on the sliding step size $s$. Let $P$ denote the length of each block and $N$ the number of blocks; the block sequence $X_P$ is thereby obtained. Given the input sequence length $L$, block length $P$, and sliding step size $s$, the number of blocks is
$$N = \left\lfloor \frac{L - P}{s} \right\rfloor + 2.$$
It should be noted that the patching operation requires padding the end of the input sequence with $s$ repetitions of its last value. Through patching, the number of input sequence units is reduced from $L$ to approximately $L/s$, which means that the memory usage and computational complexity of the attention map are reduced by a factor of roughly $s^2$. Therefore, under the constraints of training time and GPU memory, the patching design reduces the number of model parameters while retaining the global and local semantic information of the load data, providing a more optimized input structure for the Transformer multi-head attention layer of the pre-trained large language model.
In addition, after obtaining the block representation of the input load data, a simple linear mapping network is used to embed each block, enhancing the representation of the input data so that it fits the structure of the pre-trained large language model.
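The patching and embedding steps can be sketched as follows in PyTorch; the tensor shapes and the use of `unfold` are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

def patch_and_embed(x, patch_len, stride, d_model):
    # x: (batch, L) one normalized load series per variable.
    # Pad the end with `stride` repetitions of the last value.
    pad = x[:, -1:].repeat(1, stride)
    x = torch.cat([x, pad], dim=1)
    # Sliding-window patching: (batch, N, patch_len) with N = floor((L - patch_len)/stride) + 2.
    patches = x.unfold(dimension=1, size=patch_len, step=stride)
    # Linear embedding of each block into the LLM input dimension (in practice a learned module).
    embed = nn.Linear(patch_len, d_model)
    return embed(patches)
```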
To address the challenge that LLM-based methods often fail to perform well when forecasting load data with a high degree of non-stationarity, we attempt to capture the frequency-domain features hidden within the load data. We attribute the shortcomings in capturing load-data patterns to the failure of LLMs to fully learn the global features of the load data. In order to capture the global knowledge of load data, the first fine-tuning layer used in FSTML is the frequency-domain global fine-tuning layer (FGFL). This layer converts the load data from the time domain to the frequency domain so that the model can capture the frequency components of the load data, thereby understanding global characteristics of the data such as periodicity and trend changes, and generating corresponding embedded representations. First, the time-domain information in each block is converted to the frequency domain using the Fast Fourier Transform (FFT). In time-series analysis, the Fourier transform is used to identify periodic components in data: it decomposes a complex signal into a sum of simple periodic signals. The Fourier transform of a continuous-time signal $x(t)$ is expressed as follows:
$$X(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-j 2\pi f t}\, dt,$$
where $X(f)$ is the complex spectrum of the signal at frequency $f$ and $j$ is the imaginary unit. This formula provides a mapping from the time domain to the frequency domain. In time-series analysis, the Fourier transform is particularly suitable for analyzing and processing periodic data, such as sales data with obvious seasonal variation or temperature data with obvious day and night changes. By analyzing the frequency spectrum, the main periodic components and their intensities can be clearly identified, which helps to build a more accurate forecasting model.
Since the load data discussed in this work are discrete, the discrete Fourier transform (DFT) is used to obtain the frequency-domain representation $\mathcal{X}$ of the original sequence $X$. The calculation formula is expressed as follows:
$$\mathcal{X}_k = \sum_{n=0}^{P-1} x_n\, e^{-j 2\pi k n / P}, \quad k = 0, 1, \ldots, P-1,$$
where $\mathcal{X}_k$ is the $k$th frequency component of the frequency-domain representation $\mathcal{X}$, $P$ is the block length, and $j$ represents the imaginary unit. In practice, the module uses the Fast Fourier Transform (FFT), which has lower time complexity, to implement the conversion from the time domain to the frequency domain more efficiently.
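As a small, hedged example of how the FFT exposes periodic structure, the sketch below applies a real-valued FFT to patched load data; the shapes and the choice of `rfft` are assumptions for illustration:

```python
import torch

# x_patch: (batch, N, P) patched, normalized load data (shapes assumed).
x_patch = torch.randn(8, 16, 32)

x_freq = torch.fft.rfft(x_patch, dim=-1)     # complex spectrum, (8, 16, P//2 + 1)
amplitude = x_freq.abs()                     # strength of each frequency component
phase = x_freq.angle()                       # phase of each frequency component

# Index of the dominant (non-DC) frequency in each block.
dominant = amplitude[..., 1:].argmax(dim=-1) + 1
```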
Next, a linear layer is used to project the obtained frequency-domain representation; then, the inverse Fast Fourier Transform (IFFT) is applied to complete the inverse transformation from the frequency domain back to the time domain. This process can be expressed as follows:
$$H = \mathrm{IFFT}\big(\mathrm{FFT}(X_P)\, W\big),$$
where $W$ is the weight matrix of the projection layer. It should be noted that $W$ uses a doubled dimension because the output of the Fourier transform of the input sequence is complex-valued: for each real input point there is a corresponding complex point (a real part and an imaginary part) in the frequency domain, so the dimensionality of the frequency-domain representation doubles and a correspondingly doubled dimension is used. Afterwards, in order to adapt to the input of the downstream Transformer multi-head attention layer, the output $H$ of the inverse FFT is further embedded through a linear embedding layer to obtain the global information prompt $G \in \mathbb{R}^{N \times D}$ that is finally fed into the pre-trained large language model, where $D$ is the input dimension required by the pre-trained multi-head attention layer. After obtaining the global prompt $G$, in order to allow the model to adaptively determine the contribution of global information to load prediction, a selective gating mechanism is used to adaptively suppress or pass the global prompt so as to maximize prediction performance.
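A minimal sketch of this frequency-domain global fine-tuning layer is given below. The layer names, dimensions, and the concatenation of real and imaginary parts (which realizes the doubled dimension) are assumptions made for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FrequencyGlobalLayer(nn.Module):
    # Sketch of the frequency-domain global fine-tuning layer (FGFL).
    def __init__(self, patch_len, d_model):
        super().__init__()
        n_freq = patch_len // 2 + 1                          # length of the rfft output
        self.freq_proj = nn.Linear(2 * n_freq, 2 * n_freq)   # doubled dimension: real + imaginary parts
        self.embed = nn.Linear(patch_len, d_model)           # maps back-transformed blocks to the LLM dimension

    def forward(self, x_patch):                              # x_patch: (batch, N, patch_len)
        xf = torch.fft.rfft(x_patch, dim=-1)                 # complex spectrum of each block
        z = torch.cat([xf.real, xf.imag], dim=-1)            # real-valued tensor with doubled dimension
        z = self.freq_proj(z)                                 # linear projection in the frequency domain
        re, im = z.chunk(2, dim=-1)
        h = torch.fft.irfft(torch.complex(re, im), n=x_patch.size(-1), dim=-1)  # back to the time domain
        return self.embed(h)                                  # global prompt G: (batch, N, d_model)
```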
3.2. Temporal- and Spatial-Dimension Learning Module
In order to ensure that the model can accurately capture the periodic pattern of a specific dataset, we propose a temporal-dimension learning module to fine-tune the pre-trained large language model. This module mainly consists of two parts: the selection gate and the Frequency-domain Temporal Fine-tuning Layer (FTFL). The FTFL uses a complex-valued multi-layer perceptron network to learn temporal patterns from the frequency-domain representation along the time dimension and outputs a feature representation. The function of the selection gate is to determine the contribution of this feature representation to the model's prediction.

The FTFL is the core part of the temporal-dimension learning module and is specifically used to capture the time dependencies of time series. Converting time-series data from the time domain to the frequency domain can effectively capture global time dependencies: the frequency representation can reveal periodicity and trend characteristics of the time series that may not be as obvious in the time domain. The fine-tuning layer operates in the frequency domain and uses a multi-layer perceptron (MLP) to learn long-term and short-term patterns in load data in an efficient way. First, the input sequence is converted to a complex form in the frequency domain through a Fast Fourier Transform, yielding a sequence consisting of real parts and a sequence consisting of imaginary parts. Together, these two sequences reveal the periodic amplitude, phase, and other information of the original data, which helps the model learn the complex periodic patterns of the sequence. A complex-valued multilayer perceptron network is then specially designed for this complex data: the network analyzes and learns the real and imaginary parts of the frequency representation and then combines the results. The single-layer structure of the frequency-domain complex multilayer perceptron network is shown in
Figure 2. Assume that the complex-valued input to the $l$th layer of the multilayer perceptron is $Y^{l-1}$; the output of layer $l$ can then be calculated as follows:
$$Y^{l} = \sigma\!\left(\mathrm{Re}(Y^{l-1})\,W_{re}^{l} - \mathrm{Im}(Y^{l-1})\,W_{im}^{l} + B_{re}^{l}\right) + j\,\sigma\!\left(\mathrm{Re}(Y^{l-1})\,W_{im}^{l} + \mathrm{Im}(Y^{l-1})\,W_{re}^{l} + B_{im}^{l}\right),$$
where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ represent the real and imaginary parts of a complex number, respectively; $\sigma$ is the sigmoid activation function; $W_{im}^{l}$ and $W_{re}^{l}$ represent the imaginary and real learning weights of the $l$th-layer network, respectively; $B_{im}^{l}$ and $B_{re}^{l}$ denote the corresponding imaginary and real biases; and $j$ is the imaginary unit. This module can fully learn temporal characteristics such as the periodicity and trend of the frequency components while preserving their complex-valued properties. The MLP architecture is highly configurable, allowing model capacity and the risk of overfitting to be balanced effectively by adjusting the number of neurons or hidden layers according to the complexity of the problem and the scale of the data. Its forward propagation is easily parallelizable, ensuring fast training and inference.
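A single layer of this complex-valued MLP might be sketched in PyTorch as follows; the parameter initialization and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ComplexMLPLayer(nn.Module):
    # One layer of the frequency-domain complex MLP: separate real/imaginary
    # weights and biases combined via the complex multiplication rule.
    def __init__(self, dim):
        super().__init__()
        self.w_re = nn.Parameter(0.02 * torch.randn(dim, dim))
        self.w_im = nn.Parameter(0.02 * torch.randn(dim, dim))
        self.b_re = nn.Parameter(torch.zeros(dim))
        self.b_im = nn.Parameter(torch.zeros(dim))

    def forward(self, y):                  # y: complex-valued tensor (..., dim)
        re, im = y.real, y.imag
        out_re = torch.sigmoid(re @ self.w_re - im @ self.w_im + self.b_re)
        out_im = torch.sigmoid(re @ self.w_im + im @ self.w_re + self.b_im)
        return torch.complex(out_re, out_im)
```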
After $L$ layers of the frequency-domain complex multilayer perceptron network, the final output $Y^{L}$ is fed into the selection gate. The selection-gate mechanism adaptively selects the feature representation output by each fine-tuning layer, thereby determining the impact of that layer on the model's prediction, with the aim of optimizing the model's adaptability and performance across different datasets. Specifically, the selection gate dynamically adjusts the activation state or weight of the adapter to achieve flexible adaptation to different datasets. The selection gate uses a learnable scaled sigmoid function as the gating mechanism, calculated as follows:
$$g(z) = \frac{1}{1 + e^{-\tau z}},$$
where $z$ is a learnable gating parameter and $\tau$ is the scaling factor, which controls the steepness of the sigmoid function and thereby determines the sensitivity of the selection gate. To provide an intuitive understanding of the working mechanism of the selection gate, the selection curves corresponding to different scaling factors are visualized in Figure 3. The advantage of using a gating mechanism is that, by selecting the scaling factor $\tau$, the model can adaptively adjust the contributions of different features based on the learned frequency-domain characteristics of the load data, thereby achieving more accurate predictions.
During the training phase, the model adaptively adjusts its parameters to determine the contributions of all feature representations to the model's prediction. At the same time, the setting of the scaling factor $\tau$ affects the response sensitivity of the selection gate: a larger value makes the gating function more sensitive, thereby enabling fine adjustment of the adapter's activation state. In general, the introduction of the selection-gate mechanism achieves precise fine-tuning for different datasets without significantly increasing the complexity of the model, further improving its generalization ability. Afterwards, the output of the frequency-domain multilayer perceptron is converted back to the time domain.
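A hedged sketch of such a learnable scaled-sigmoid gate is shown below; the exact quantity the sigmoid is applied to is an assumption for illustration:

```python
import torch
import torch.nn as nn

class SelectionGate(nn.Module):
    # Learnable scaled-sigmoid gate: scales a fine-tuning layer's output
    # according to its learned contribution.
    def __init__(self, tau=10.0):
        super().__init__()
        self.tau = tau                           # scaling factor: larger -> steeper, more sensitive gate
        self.z = nn.Parameter(torch.zeros(1))    # learnable gating parameter

    def forward(self, feature):
        gate = torch.sigmoid(self.tau * self.z)  # gate value in (0, 1)
        return gate * feature                    # adaptively suppress or pass the feature
```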
The spatial-dimension learning module is designed to capture cross-sequence dependencies in load data. In multivariate load forecasting, there may be complex interactions between different series, and these interactions are crucial for accurately predicting future values. Traditional load forecasting methods often have limitations in capturing such cross-sequence dependencies, especially when the relationships between sequences are nonlinear and change dynamically. The spatial-dimension learning module overcomes these limitations and improves prediction accuracy by learning the dependencies between sequences in the frequency domain.
The structure of the spatial-dimension learning module is consistent with that of the temporal-dimension learning module, and a similar frequency-domain spatial-dimension fine-tuning layer (FSFL) is used to extract relevant features between different variable sequences from the frequency domain. The difference is that the frequency-domain spatial-dimension fine-tuning layer operates along the variable dimension, so the output of the upstream module needs to be reshaped by channel before being input into the spatial-dimension learning module. A selection gate is also used to adaptively select the feature representation of the output of the spatial-dimension fine-tuning layer.
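The reshaping step can be illustrated as follows; all tensor shapes and the use of a permutation are assumptions made for this sketch:

```python
import torch

# h: (batch, n_vars, N, d) features from the temporal-dimension learning module (shapes assumed).
h = torch.randn(8, 7, 16, 64)

# The temporal fine-tuning layer mixes information along the patch (time) axis.
# For the spatial fine-tuning layer, the same kind of frequency-domain complex MLP
# operates along the variable axis, so the tensor is first reshaped by channel.
h_spatial = h.permute(0, 2, 3, 1)          # (batch, N, d, n_vars)
xf = torch.fft.rfft(h_spatial, dim=-1)     # spectrum across variables (cross-series view)
# ... complex MLP and selection gate are applied here, followed by an inverse FFT
# and a permutation back to (batch, n_vars, N, d).
```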
3.3. Load Forecasting Method Based on Frequency-Domain Spatial–Temporal Mixture Learning
The model is designed as an end-to-end structure, which mainly converts the input historical load data into a frequency-domain representation through the frequency-domain transformation module, effectively capturing periodic and non-periodic information. The model adopts the pre-trained large language model structure and inserts a fine-tuning layer specially designed for load data into it. After completing the pre-processing of the input load data, global learning in the frequency domain is first performed to obtain the global frequency-domain feature representation of the load data. After that, two fine-tuning layers are used to extract the time-dimension features and spatial-dimension features of the sequence so as to learn the spatial–temporal pattern of the load data. In addition, the residual connection and layer normalization modules from the pre-trained large language model are retained in the model to enhance the training stability of the network and accelerate convergence.
In the forecasting process, the data are first preprocessed, which includes two key steps: data cleaning (handling missing values, outliers, etc.) and data normalization (scaling the data to a consistent range). The preprocessed historical data are then fed into the proposed FSTML model, which captures periodic patterns and global dependencies in the frequency domain through the frequency-domain global learning module and extracts local spatial correlations and temporal dynamics through the temporal- and spatial-dimension learning modules. Through these two components, the model learns the latent features of the historical load sequence and generates the final forecasting results.
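To make the overall data flow concrete, the following is a high-level sketch of the forward pass; the module interfaces, their composition order, and the de-normalization step are assumptions based on the description above, not the authors' released code (it reuses the `instance_normalize` helper sketched earlier):

```python
import torch.nn as nn

class FSTML(nn.Module):
    # High-level sketch of the end-to-end model: frozen GPT-2 blocks with
    # frequency-domain fine-tuning modules inserted around them.
    def __init__(self, backbone, fglm, tdlm, sdlm, head):
        super().__init__()
        self.backbone = backbone   # frozen PE/MHA/FFN; residual connections and LayerNorm remain trainable
        self.fglm, self.tdlm, self.sdlm = fglm, tdlm, sdlm
        self.head = head           # linear projection to the forecast horizon

    def forward(self, x):                              # x: (batch, L, C) historical load (shapes assumed)
        x_norm, mu, var = instance_normalize(x)        # preprocessing (see earlier sketch)
        g = self.fglm(x_norm)                          # global frequency-domain prompt
        t = self.tdlm(g)                               # temporal-dimension features
        s = self.sdlm(t)                               # spatial-dimension features
        h = self.backbone(s)                           # frozen pre-trained LLM backbone
        y = self.head(h)                               # predicted sequence
        return y * (var + 1e-5).sqrt() + mu            # de-normalize with the stored statistics
```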
The loss function of the model uses the mean squared error (MSE) to calculate the error between the predicted sequence and the true sequence:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$
where $y_i$ is the value of the real sequence and $\hat{y}_i$ is the value predicted by the model.
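In PyTorch, this loss can be computed directly (a trivial sketch; the built-in `torch.nn.functional.mse_loss` is equivalent):

```python
import torch

def mse_loss(y_true, y_pred):
    # Mean squared error between the true and predicted load sequences.
    return torch.mean((y_true - y_pred) ** 2)
```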