Article

Short-Term Load Forecasting in Price-Volatile Markets: A Pattern-Clustering and Adaptive Modeling Approach

1 Engineering Training Center, Shenyang Institute of Engineering, Shenyang 110136, China
2 State Grid Liaoning Extra High Voltage Company, Shenyang 110003, China
3 School of Electric Power, Shenyang Institute of Engineering, Shenyang 110136, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(1), 5; https://doi.org/10.3390/pr14010005
Submission received: 31 October 2025 / Revised: 11 December 2025 / Accepted: 16 December 2025 / Published: 19 December 2025

Abstract

Under the ongoing electricity market reforms, short-term load forecasting (STLF) is increasingly challenged by pronounced non-stationarity driven by price fluctuations. This study proposes an adaptive STLF framework tailored to price-induced non-stationarity. Firstly, a market state identification method based on load–price joint clustering is developed to structurally model the temporal interactions between price and load. It allows the automatic extraction of typical market patterns and helps uncover how price fluctuations drive load variations. Secondly, a gated mixture forecasting network is proposed to dynamically adapt to the inertia of historical price fluctuations. By integrating parallel branches with an adaptive weighting mechanism, the model dynamically captures historical price features and achieves both rapid response and steady correction under market volatility. Finally, a Transformer-based expert model with multi-scale dependency learning is introduced to capture sequential dependencies and state transitions across different load regimes through self-attention, thereby enhancing model generalization and stability. Case studies using real market data confirm that the proposed approach delivers substantial performance improvements, offering reliable support for system dispatch and market operations. Relative to mainstream baseline models, it reduces MAPE by 1.08–2.62 percentage points.

1. Introduction

In recent years, the reform of electricity marketization has been accelerating globally, leading to the establishment of a price mechanism determined by real-time supply and demand relationships [1,2]. With the large-scale integration of renewable energy and the participation of diverse entities in market transactions, short-term electricity price fluctuations have become increasingly intense and frequent. Such volatility not only alters users’ electricity consumption behavior but also causes the load profile to deviate from the traditional stable pattern, exhibiting high levels of randomness and nonlinearity [3,4]. Against this backdrop, traditional forecasting models that rely on the stable patterns of historical loads have seen reduced applicability. Exploring novel forecasting methods capable of dynamically capturing and adapting to price signals has become crucial for ensuring real-time grid balance and the stable operation of the electricity market.
At present, the methods for short-term load forecasting [5,6,7] are divided into statistical methods [8] and data-driven methods (from traditional machine learning to modern deep neural networks) [9,10]. The existing methods on this topic can be summarized and grouped into three main categories:
(i)
Statistical load forecasting methods extract patterns of load variation from historical data, operating on the relatively simplistic assumption that past patterns will replicate in the future. Examples include the ARIMA model [11] and exponential smoothing methods [12]. Wu et al. [13] proposed a model based on a fractionally autoregressive integrated moving average with long-range dependence, incorporating a dynamically adjusted cuckoo search algorithm to optimize the parameters of the forecasting model. This method decomposes the load into three components: autoregressive, differencing, and moving average, to uncover the underlying patterns in historical load variations. Rendon-Sanchez et al. [14] proposed a forecasting approach based on a seasonal exponential smoothing model, utilizing combined forecast results for short-term load prediction. The proposed model can capture seasonality and time-varying volatility, demonstrating favorable forecasting performance. Exponential smoothing methods assign different weights to historical data, with more recent data receiving higher weights (greater importance). This approach employs an exponentially decreasing weighting scheme to smooth out random fluctuations and capture primary trends and seasonal patterns. However, this method primarily captures linear relationships and exhibits limited capability in handling sharp nonlinear fluctuations caused by factors such as abrupt electricity price changes or extreme weather conditions.
(ii)
Based on classical machine learning methods, the introduction of multi-factor correlation approaches acknowledges that future load depends not only on past load but also on factors such as weather, date, and electricity prices, constructing complex relationships between load and various influencing factors [15,16]. Zhao et al. [17] proposed a load forecasting method combining a grey model with Least Squares Support Vector Machine (LSSVM). This method employs a feature matching pattern for prediction based on each decomposed component and effectively enhances long-term load forecasting accuracy through the extraction of load characteristics. Support Vector Machine (SVM) operates in a high-dimensional space constituted by numerous features, identifying optimal parameters to fit historical data and ensuring model robustness. This method performs well with small sample sizes but suffers from slower training when dealing with large datasets. Fan et al. [18] introduced a hybrid model integrating Random Forest (RF) with the Mean Generating Function (MGF). This model first obtains predicted values from the time variable, Random Forest, and the Mean Generating Function separately. These are then used as inputs for short-term load forecasting via a multivariate response surface methodology. The model demonstrates stronger robustness and higher forecasting accuracy. Random Forest [19] is an ensemble learning method that constructs multiple decision trees, each trained on randomly sampled data and features, ultimately outputting results through collective decision-making. This approach not only effectively mitigates overfitting inherent in single trees, making the model more robust, but also quantifies the importance of each feature in the prediction.
(iii)
Deep learning-based load forecasting methods use computational models with multiple processing layers to learn and extract complex, nonlinear temporal characteristics and patterns from massive historical load and related data, thereby achieving high-precision prediction of future short-term load [20,21]. Li et al. [22] proposed a novel hybrid model named CEEMDAN-CNN-LSTM-SA-AE to enhance household electricity load forecasting accuracy. The model first decomposes the original load data using CEEMDAN, then extracts local features via CNN, and captures long- and short-term dependencies using an LSTM-AE model integrated with a self-attention mechanism to complete the forecasting. Experiments on two real-world datasets showed that this model significantly outperforms existing baselines, with marked improvements across various performance metrics. Tian et al. [23] proposed a method for short-term electric vehicle charging load forecasting that combines Temporal Convolutional Network (TCN) and Long Short-Term Memory (LSTM) networks. This method employs comprehensive similar-day identification and analyzes meteorological factors and historical load data for validation. Experimental results indicate that the model effectively improves forecasting accuracy, with a further 2% error reduction after introducing similar-day analysis. Ahmad [24] proposed a novel Transformer architecture, TFTformer, to enhance power load forecasting accuracy. The model effectively integrates multi-source data such as weather and time through multi-modal feature embedding, linear transformation layers, and temporal convolutional networks, while also enhancing its ability to capture long-range dependencies. In recent years, to overcome the limitations of single models, hybrid models combining machine learning and deep learning have emerged as a new trend in research and application. Tan et al. [25] proposed a forecasting method that combines the SVMD algorithm with an improved Informer model. This method decomposes load data using SVMD and incorporates relative position encoding, causal convolutions, and skip connections into the Informer model to enhance sequence dependency capture and local feature extraction. The model significantly outperforms several mainstream models and provides a reliable technical reference for optimizing intelligent heating systems. Incremona et al. [26] proposed a Gaussian-process-based load forecasting method with a tailored kernel to address the challenge of predicting electricity demand during the moving holiday of Easter Week. Their results on Italian data show that the proposed approach significantly outperforms GP models with canonical kernels as well as the official forecasts of the TSO Terna.
In summary, existing forecasting methods have progressed from statistical models relying on historical patterns, to machine learning approaches that incorporate multiple nonlinear factors, and further to deep learning frameworks capable of capturing complex temporal dependencies. However, under non-stationary conditions of severe price fluctuations, their performance still faces three major challenges:
(i)
Most traditional and data-driven models fail to explicitly characterize the evolving coupling between electricity price and load, which makes it difficult to identify typical market regimes or capture their temporal transitions.
(ii)
Existing forecasting approaches seldom consider the inertia of historical price fluctuations, leading to weak responsiveness and reduced stability when market dynamics change rapidly.
(iii)
Although many deep learning methods can learn temporal patterns, they often lack the ability to model hierarchical dependencies and state transitions across different load regimes, resulting in limited generalization under volatile market scenarios.
To address these limitations, this paper proposes an adaptive short-term load forecasting method tailored to price-induced non-stationarity. By constructing a price-triggered gated hybrid architecture and dedicated expert forecasting models, the proposed approach overcomes these constraints and achieves higher forecasting accuracy and robustness. The main contributions of this work are as follows:
(i)
A market state identification method based on load–price joint clustering is developed to structurally model the temporal interactions between price and load. It allows the automatic extraction of typical market patterns and helps uncover how price fluctuations drive load variations.
(ii)
A gated mixture forecasting network is proposed to dynamically adapt to the inertia of historical price fluctuations. By integrating parallel branches with an adaptive weighting mechanism, the model dynamically captures historical price features and achieves both rapid response and steady correction under market volatility.
(iii)
A Transformer-based expert model with multi-scale dependency learning is introduced to capture sequential dependencies and state transitions across different load regimes through self-attention, thereby enhancing model generalization and stability.

2. Overall Description of the Proposed Method

To achieve accurate load forecasting under the non-stationary conditions created by price fluctuations in the electricity market, this paper proposes an adaptive short-term load forecasting method tailored to price-induced non-stationarity. The model structure is shown in Figure 1. The overall framework first identifies different market states through load–price joint clustering. On this basis, a dedicated Transformer expert forecasting model is constructed for each typical state, and an adaptive gated hybrid network is further introduced to dynamically weight and fuse the outputs of the multiple experts according to the current market state and their historical performance. Through the collaborative mechanism of “state identification–expert modeling–gating fusion”, the proposed method adaptively characterizes complex time-varying behavior and achieves high-precision short-term load forecasting under continually shifting price distributions and load response relationships.
Firstly, a market state identification method based on load–price joint clustering is developed. The “market state” referred to in this paper is the typical market operating condition on a given time scale, determined by the electricity price level and its fluctuation characteristics together with the corresponding load level and load dynamics. Cluster analysis of the joint price–load time series groups samples with similar price fluctuation patterns and load response characteristics into the same category, and each category is defined as a market state. A market state is thus a structured manifestation of price non-stationarity over time; it reveals how price fluctuations drive load changes and provides the basis for the conditional partitioning and model selection used in subsequent adaptive short-term load forecasting. The price–load coupling exhibits significant non-stationarity and heterogeneity: the coupling mechanism under different fluctuation levels and load conditions can differ entirely. Fitting all samples indiscriminately with a single model allows the multiple modes to interfere with one another and weakens the ability to characterize key scenarios such as price spikes and sharp fluctuations. Clustering samples with similar price fluctuation characteristics and load response behavior into the same market state achieves pattern decomposition at the data level, making each state internally more homogeneous and its mechanism clearer. This provides a sound basis for the subsequent construction of expert models and the state-based design of the gating fusion, thereby effectively alleviating the forecasting difficulties caused by non-stationarity. Taking electricity market and load time series data as inputs, market states are clustered using a deep embedding clustering algorithm.
Specifically, electricity market prices and historical load time series are combined into a unified load–price joint feature matrix, providing a high-quality data foundation for subsequent modeling. One-dimensional convolutional and linear neural network layers are then constructed and trained in an unsupervised manner on this dataset. The high-dimensional raw variable matrix is mapped to a low-dimensional latent feature space, yielding a compact representation that comprehensively characterizes price fluctuation patterns and load dynamics. K-means clustering is then performed on the features in this latent space for initialization, and through the iterative optimization of soft labels and the target distribution, the cluster centers and encoder parameters are updated jointly so that samples within the same market state become more compact in the latent space while different market states become more separable. Finally, a stable set of market state labels with clear physical and economic meaning is obtained, providing prior information and input conditions for the subsequent state-partitioned adaptive short-term load forecasting models.
Secondly, to further characterize the load evolution driven by price fluctuations, this paper introduces a Transformer-based short-term load forecasting expert model built on the market state partition. Based on the self-attention mechanism, the model can adaptively capture the temporal dependencies and state transitions of the load series under different load conditions within a given market state, improving generalization and stability under complex operating conditions. The Transformer is chosen as the expert model because price and load series typically combine short-term disturbances with medium- to long-term evolution: loads are driven by short-term factors such as weather and user behavior, while also being shaped by daily, weekly, and seasonal cycles and by structural changes. Traditional models that rely on fixed windows or local receptive fields struggle to account for both types of features simultaneously. A self-attention Transformer requires no preset receptive field and can adaptively allocate importance to different historical moments over the entire timeline, making it better suited to characterizing complex temporal dependencies under non-stationary prices. For each identified market state, a corresponding Transformer expert model is trained to learn the response of load to price, historical load, and other influencing factors in that state, forming an adaptive “state–expert” matching framework. Unlike traditional models that use information at a single time scale, the proposed expert model adopts multi-time-scale inputs.
Price and load time series at different scales, such as hourly, daily, weekly, and even monthly, are introduced into a unified attention structure to jointly model short-term fluctuations and medium- to long-term trends. Multi-time-scale dependency learning is realized through the unified modeling and fusion of these multi-scale inputs: the model uses recent information to characterize rapid changes in prices and loads, while extracting stable periodic and trend features through longer time windows, thereby characterizing the load evolution under non-stationary prices more comprehensively.
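As a concrete illustration of combining inputs at several time resolutions, the sketch below builds one feature vector from recent 15-min samples, daily means, and weekly means. The window lengths (96 recent points, 7 days, 4 weeks) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def multiscale_features(series, steps_per_day=96, n_recent=96,
                        n_days=7, n_weeks=4):
    """Sketch of multi-time-scale inputs: recent 15-min samples,
    daily means, and weekly means concatenated into one vector.
    Window lengths are illustrative choices, not from the paper."""
    recent = series[-n_recent:]                          # short-term detail
    days = series[-n_days * steps_per_day:].reshape(n_days, -1).mean(axis=1)
    weeks = series[-n_weeks * 7 * steps_per_day:].reshape(n_weeks, -1).mean(axis=1)
    return np.concatenate([recent, days, weeks])

rng = np.random.default_rng(4)
load = rng.uniform(4000, 9800, size=60 * 96)   # 60 days of 15-min load (MW)
x = multiscale_features(load)
print(x.shape)   # (96 + 7 + 4,) = (107,)
```

In a real pipeline such a vector would be formed per sample and fed, alongside price features, to the attention-based expert model.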
Finally, to cope with the non-stationarity brought by electricity price fluctuations, this paper constructs an adaptive gating network on top of the expert models as the intelligent decision-making core of the hybrid forecasting architecture. The key idea of the gating network is to combine two types of information at each forecasting instant when weighting the Transformer experts: first, the typical daily pattern defined by the clustering results, i.e., the market state to which the current sample belongs and its associated price–load statistical characteristics; second, dynamic feedback reflecting the recent forecasting performance of each expert model, such as prediction errors or accuracy indicators within a sliding time window. Specifically, the gating network takes the current market state vector and the historical performance features of each expert as inputs; after several feedforward layers, non-negative weights for each expert are computed and normalized with a Softmax so that the predictions of the multiple experts can be fused by weighting. Through this “pattern-aware + performance-aware” adaptive weight allocation, the model can favor experts that capture long-term trends during stable market periods, and increase the weight of experts more sensitive to spikes and abrupt changes during severe price fluctuations or state transitions, achieving dynamic adaptation to price non-stationarity and timely correction of forecasting bias without frequent retraining.

3. Short-Term Load Adaptive Forecasting Method for Non-Stationary Electricity Price Fluctuations

3.1. Typical Daily Pattern Recognition Method Based on Deep Embedding Clustering

Deep embedding clustering [27,28] (DEC) is an advanced unsupervised learning method whose core idea is to nonlinearly map high-dimensional raw data into a low-dimensional embedding space through deep neural networks while simultaneously optimizing a clustering objective within that space. Unlike traditional clustering methods that operate directly in the original data space, this approach effectively overcomes high dimensionality, nonlinearity, and noise interference in the data, thereby learning more discriminative feature representations. During initialization, the K-means algorithm clusters the hidden-layer features output by the encoder to obtain initial cluster centers. This clustering result serves both as a starting point for optimization and as a source of evaluation metrics (such as the silhouette coefficient) for monitoring training. After initialization, the model is jointly optimized end-to-end according to Equations (1)–(3). Given the embedding z_i of sample i and cluster center μ_j, a Student's t-distribution is used to compute the soft assignment q_ij, the probability that sample i belongs to cluster j, as shown in Equation (1). These soft assignments are then sharpened into a target distribution P = {p_ij} to enhance cluster purity and alleviate cluster-size imbalance, by re-normalizing the squared assignments with the soft cluster frequency f_j, as shown in Equation (2). Finally, the encoder parameters and cluster centers are updated by minimizing the Kullback–Leibler (KL) divergence between the target distribution P and the current soft assignments Q = {q_ij}, as shown in Equation (3).
The Kullback–Leibler (KL) divergence measures how one probability distribution differs from a reference distribution, and equals zero only when the two distributions are identical.
The overall model structure is shown in Figure 2. The deep embedding clustering model has a two-stage “autoencode first, cluster later” structure. First, the input load–price joint features are mapped by an encoder composed of multiple nonlinear layers to a compact representation in a low-dimensional embedding space. This representation is fed into a decoder for reconstruction, and the autoencoder is pretrained by minimizing the reconstruction error to obtain a representation that preserves the key information. In the clustering stage, the trained encoder is retained and a clustering layer is attached to its output embedding space. Several learnable cluster centers perform soft assignment of the samples, and the encoder parameters and cluster centers are updated jointly by combining the target distribution with KL divergence minimization. Ultimately, well-separated, physically meaningful market state clusters are formed in the embedding space.
Based on the above advantages, this study introduces deep embedding clustering into the field of energy analysis and constructs a typical daily pattern recognition method based on electricity price time series clustering. This method uses a deep embedding model to jointly represent and learn the time-series data of electricity prices and loads, automatically capturing and enhancing the electricity price fluctuation patterns dynamically associated with loads. Through end-to-end training, the model can effectively identify typical daily operating scenarios that reflect the intrinsic correlation structure between electricity prices and loads. The calculation equations in the model are as follows:
q_{ij} = \dfrac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}{\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}}   (1)

p_{ij} = \dfrac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}   (2)

L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \dfrac{p_{ij}}{q_{ij}}   (3)
Here, q_ij is a probabilistic cluster membership that measures the relative similarity between sample i and cluster center μ_j in the embedding space; P is the target distribution; Q is the soft assignment distribution; and f_j = Σ_i q_ij is the soft cluster frequency. Squaring the assignments and re-normalizing by f_j strengthens high-confidence assignments while balancing cluster sizes.
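The three quantities in Equations (1)–(3) can be sketched numerically as follows. This is a minimal NumPy illustration of the DEC objective on toy embeddings and fixed centers (the full method also backpropagates through the encoder, which is omitted here); all array sizes and the random data are assumptions for demonstration:

```python
import numpy as np

def soft_assignments(Z, centers):
    """Student's t soft assignment q_ij of Equation (1) between
    embeddings Z (n, d) and cluster centers (K, d), alpha = 1."""
    # squared Euclidean distances, shape (n, K)
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2) ** -1.0
    return q / q.sum(axis=1, keepdims=True)   # normalize over clusters

def target_distribution(q):
    """Sharpened target p_ij of Equation (2): square the assignments
    and re-normalize by the soft cluster frequency f_j."""
    f = q.sum(axis=0)                          # soft cluster frequencies
    p = q ** 2 / f
    return p / p.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    """KL(P || Q) of Equation (3); zero iff P and Q coincide."""
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 3))          # toy embeddings, 8 samples in R^3
centers = rng.normal(size=(2, 3))    # two learnable cluster centers
q = soft_assignments(Z, centers)
p = target_distribution(q)
print(q.sum(axis=1))                 # each row sums to 1
print(kl_loss(p, q) >= 0.0)
```

In training, the gradient of this KL loss with respect to both the encoder parameters and the centers drives the joint update described above.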

3.2. Adaptive Gate Control Network Based on Typical Daily Patterns and Historical Performance Perception

To address the non-stationarity brought about by electricity price fluctuations, this paper proposes an adaptive gating network that integrates typical daily patterns with historical performance perception. As the intelligent decision-making core of the hybrid forecasting architecture, the innovation of this network lies in integrating two kinds of key information for weight allocation: (1) typical daily patterns defined by the clustering results; (2) dynamic feedback reflecting the recent performance of each expert model.
Through this multidimensional decision-making mechanism, the gating network [29] acts like an experienced dispatcher, making the most reasonable scheduling decisions based on “date type”, “current operating conditions”, and “expert status”. First, with historical days divided into N typical patterns, the current forecast day is classified into one of these patterns and fed to the gating network as a one-hot encoding, denoted f_cluster ∈ R^N (where R^N is the N-dimensional real vector space and N is the number of typical day patterns). This vector provides a high-level structured scene label, enabling the network to prioritize the expert models trained on that daily pattern. The expert recent-performance vector f_perf ∈ R^K is a key online feedback signal that quantifies the prediction reliability of each expert over the last M time points (where R^K is the K-dimensional real vector space and K is the total number of experts). This ensures that the decisions of the gating network are dynamic and adaptive.
First, the mean absolute error of expert k within the sliding window M is calculated, as shown in Equation (4). In the day-ahead horizon considered in this study, M is set to 96. Next, these errors are converted into a performance-aware score vector: the reciprocal of each MAE_k is taken (with a small constant ε > 0 added for numerical stability) and normalized with the Softmax function to obtain the performance vector f_perf, so that experts with smaller recent errors receive larger scores, as shown in Equation (5). The input to the gating network is then formed by combining the performance vector f_perf with the clustering-based feature f_cluster (which encodes the current market state or typical daily pattern), yielding the composite feature vector, as shown in Equation (6). On top of this composite representation, a gating network G(·; θ_G) is constructed as a two-layer feedforward neural network. It maps z to a vector of non-negative weights g through a ReLU hidden layer followed by a Softmax output layer, as shown in Equation (7). Finally, the prediction of the mixture-of-experts model is the gated weighted sum of the individual expert predictions, as shown in Equation (8).
The model structure is shown in Figure 3. It achieves scene-adaptive decision-making by integrating typical daily patterns with dynamic performance feedback. In summary, the gating mechanism used in this paper belongs to the soft gating paradigm [30]: it outputs a continuous set of probability weights to fuse the predictions of all expert models, rather than making a hard binary choice. This soft decision-making depicts the transitions between typical daily patterns more finely and effectively improves the robustness of the hybrid model under uncertainty, making it especially suitable for the complex non-stationary load sequences caused by electricity price fluctuations.
\mathrm{MAE}_k = \frac{1}{M} \sum_{i=t-M}^{t-1} \left| y_i^{\mathrm{true}} - y_i^{k,\mathrm{pred}} \right|   (4)

f_{\mathrm{perf}} = \mathrm{Softmax}\!\left( \frac{1}{\mathrm{MAE}_1 + \varepsilon}, \ldots, \frac{1}{\mathrm{MAE}_K + \varepsilon} \right)   (5)

z = \left[\, f_{\mathrm{cluster}} \,;\, f_{\mathrm{perf}} \,\right]   (6)

g = G(z; \theta_G) = \mathrm{Softmax}\!\left( W_2 \, \mathrm{ReLU}(W_1 z + b_1) + b_2 \right)   (7)

\hat{y}_t = \sum_{k=1}^{K} g_k \cdot \hat{y}_t^{\,k}   (8)
Here, k indexes the k-th expert model, and ε is a small positive constant that prevents division by zero. W_1, b_1 and W_2, b_2 are the weights and biases of the hidden and output layers, respectively. The ReLU activation introduces nonlinear decision-making capability, and the output-layer Softmax ensures that the final weight vector g forms a probability distribution summing to one. In Equation (6), [· ; ·] denotes concatenation of the two feature vectors.
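Equations (4)–(8) can be traced end to end in a few lines. The sketch below uses untrained random weights and toy error data purely to demonstrate the computation path (MAE window → performance Softmax → concatenation → two-layer gate → fused prediction); the dimensions M = 96, K = 3, N = 4 and the random inputs are assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_weights(errors_window, f_cluster, W1, b1, W2, b2, eps=1e-6):
    """Performance-aware gating, Equations (4)-(7).
    errors_window: (M, K) absolute errors of K experts over last M points."""
    mae = errors_window.mean(axis=0)            # Eq. (4), one MAE per expert
    f_perf = softmax(1.0 / (mae + eps))         # Eq. (5), inverse-error scores
    z = np.concatenate([f_cluster, f_perf])     # Eq. (6), composite input
    h = np.maximum(W1 @ z + b1, 0.0)            # ReLU hidden layer
    return softmax(W2 @ h + b2)                 # Eq. (7), expert weights

rng = np.random.default_rng(1)
M, K, N = 96, 3, 4                  # window length, experts, day patterns
errors = np.abs(rng.normal(size=(M, K)))
f_cluster = np.eye(N)[2]            # one-hot typical-day label
W1 = rng.normal(size=(8, N + K)); b1 = np.zeros(8)
W2 = rng.normal(size=(K, 8));     b2 = np.zeros(K)
g = gate_weights(errors, f_cluster, W1, b1, W2, b2)
preds = rng.normal(size=K)           # per-expert forecasts at time t
y_hat = g @ preds                    # Eq. (8), gated weighted fusion
print(abs(g.sum() - 1.0) < 1e-9)
```

In the trained system, W_1, b_1, W_2, b_2 are learned rather than random, so the weights g reflect both the day pattern and each expert's recent accuracy.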

3.3. Expert Model for Load Forecasting Based on Transformer

To accurately capture the complex long-term and short-term dependencies in the load sequence under the background of electricity price fluctuations, this paper adopts the Transformer architecture [31] as the unified basic model for parallel experts in the hybrid model. Compared to traditional recurrent neural networks (RNNs) and their variants, Transformer’s unique self-attention mechanism enables it to directly model the global dependency relationship between any two time points in a sequence, overcoming the problems of gradient vanishing and information bottleneck that RNNs face when processing long sequences. Since the input data are essentially one-dimensional time series, no flattening operation is required. Instead, the sequences are arranged in parallel and treated as different feature dimensions of the time axis. These multi-dimensional inputs are then passed through linear layers for dimensional expansion and further feature mixing. The model structure is shown in Figure 4.
The self-attention mechanism [32] is the cornerstone of the Transformer model. It allows the model, when processing the load at one time point, to attend to the information at all other time points in the sequence and automatically compute their importance weights. For an input sequence X = (x_1, x_2, …, x_T), each input vector is first linearly mapped to a query vector Q, a key vector K, and a value vector V. An attention score matrix is then computed via the scaled dot product, revealing the strength of correlation between any two time steps in the sequence. Specifically, for the query matrix Q, key matrix K, and value matrix V, the self-attention output is obtained by first forming the similarity scores between all query–key pairs through the scaled dot product QK^T/√d_k. These scores are normalized by a row-wise softmax to obtain the attention weights, which are then used to take a weighted sum of the value vectors V, leading to the standard scaled dot-product attention in Equation (9).
In this architecture, rather than a single Transformer model, multiple Transformer expert models with the same structure but independent parameters are constructed. Each expert is trained on the dataset of a different typical daily pattern during the training phase. Through this division of labor, each Transformer expert becomes a “domain expert” for specific operating conditions. At inference, the gating network dynamically calls upon these experts according to the real-time scenario, forming a collaborative prediction paradigm in which each expert performs its own role and is invoked as needed. This fully exploits the Transformer's strength in modeling complex temporal dependencies, while the hybrid architecture effectively resolves the limited generalization of a single model under global non-stationarity, significantly enhancing the overall model's prediction stability and generalization performance.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{9}$$
where d k is the dimension of the key vector.
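Equation (9) can be sketched directly in NumPy. The following is a generic illustration of single-head scaled dot-product attention, not the paper's full multi-head implementation; the toy shapes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in Equation (9).

    Q, K: arrays of shape (T, d_k); V: array of shape (T, d_v).
    Returns the attended values and the (T, T) attention-weight matrix.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy check: 4 time steps, d_k = d_v = 8, self-attention (Q = K = V = X)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of `w` sums to one, so the output at every time step is a convex combination of the value vectors across the whole sequence.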

4. Results

4.1. Evaluation Indicators

To rigorously evaluate the proposed adaptive framework under price-driven non-stationarity, we benchmark performance using three error metrics: Mean Absolute Percentage Error (MAPE), Normalized Root Mean Square Error (NRMSE), and Normalized Mean Absolute Error (NMAE). Models are compared along three evaluation dimensions: point accuracy, volatility adaptability, and overall stability. These metrics collectively reflect the model’s ability to track load variations under evolving market dynamics. Specifically, MAPE measures the mean relative deviation between forecasts and observations and serves as the primary accuracy indicator; NRMSE, computed on normalized data, captures the dispersion of residuals and thus the model’s robustness during pronounced price swings; NMAE reports the average magnitude of normalized errors, indicating stability and consistency across different load levels. The calculation formulas are as follows.
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{\hat{y}_t - y_t}{y_t}\right|$$
$$\mathrm{NRMSE} = \frac{1}{y_{\max} - y_{\min}}\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$
$$\mathrm{NMAE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|y_t - \hat{y}_t\right|}{y_{\max} - y_{\min}}$$
where \(\hat{y}_t\) denotes the forecasted load at time t; \(y_t\) is the observed load at time t; n is the sample size; and \(y_{\max}\), \(y_{\min}\) are the maximum and minimum load values (used for normalization).
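The three metrics are a direct transcription of the formulas above; the sketch below returns values in percent, matching the units used in the results tables:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

def nrmse(y_true, y_pred):
    """Root mean square error normalized by the load range, in percent."""
    load_range = y_true.max() - y_true.min()
    return 100.0 * np.sqrt(np.mean((y_true - y_pred) ** 2)) / load_range

def nmae(y_true, y_pred):
    """Mean absolute error normalized by the load range, in percent."""
    load_range = y_true.max() - y_true.min()
    return 100.0 * np.mean(np.abs(y_true - y_pred)) / load_range
```

For example, with observations [100, 200, 300, 400] and a uniform absolute error of 10, MAPE is about 5.21% while NRMSE and NMAE both equal 10/300 ≈ 3.33%.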

4.2. Data and Experimental Setup

The experiments utilize anonymized market operation data from a power system in Northern China. The electricity price corresponds to the 15 min settlement price of the day-ahead wholesale spot market, reflecting the centralized bidding outcomes among generators, retailers, and large industrial and commercial consumers. Residential users still participate indirectly through regulated retail tariffs and are therefore excluded from the dataset.
The data cover the period from January 2021 to December 2022, with a 15 min sampling frequency, resulting in approximately 6.5 × 10⁴ observations. Each record includes seven fields: one target variable (actual load), four meteorological forecast variables (wind speed, air temperature, solar irradiance, and relative humidity), one historical price variable, and one timestamp metadata field. During the study period, the system load had a mean of 6778.74 MW, a standard deviation of 1256.67 MW, and a range between 4167.61 MW and 9833.68 MW. The electricity price averaged 25.98 CNY/MWh, with most samples falling within the range of 10.22–84.18 CNY/MWh, and a standard deviation of 9.28 CNY/MWh. In the preprocessing stage, all variables underwent systematic screening. The missing rates for load, weather, and price variables were 0.00%, 2.71%, and 0.00%, respectively. Outlier detection based on the interquartile range (IQR) criterion identified 3.59% of abnormal records. Missing and anomalous values were corrected using a sliding moving-average smoother. To mitigate the effect of scale differences, all input features were normalized using the min–max method. Normalization parameters were estimated from the training set and consistently applied to the validation and test sets.
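A minimal sketch of the preprocessing pipeline described above (IQR outlier flagging, moving-average correction, min–max scaling fitted on training data only). The smoothing window length and the IQR multiplier k = 1.5 are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

def iqr_mask(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed multiplier k)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def smooth_fill(x, bad, window=5):
    """Replace flagged points with the mean of valid neighbors in a sliding window."""
    x = x.astype(float).copy()
    half = window // 2
    for i in np.flatnonzero(bad):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        neighbors = x[lo:hi][~bad[lo:hi]]
        if neighbors.size:
            x[i] = neighbors.mean()
    return x

def minmax_fit(train):
    """Fit min–max scaling on the training set; reuse on validation/test."""
    lo, hi = train.min(), train.max()
    return lambda v: (v - lo) / (hi - lo)
```

Fitting the normalization on the training split and reusing it downstream prevents information from the test period leaking into the model inputs.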
To ensure fair comparison, all baseline models were re-implemented and trained under identical settings. Specifically, each model used the same look-back window (96 time steps, equivalent to 24 h), the same input features (historical load, historical price, weather forecasts, and calendar/time indicators), and the same normalization scheme, with parameters derived from the training set and shared across all data splits. For evaluation, the data were divided into training, validation, and test sets in a 70%/15%/15% ratio.
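The shared look-back/horizon windowing and the chronological 70%/15%/15% split can be sketched as follows (function names are illustrative, not from the paper's code):

```python
import numpy as np

def make_windows(series, lookback=96, horizon=96):
    """Build (input, target) pairs with a 96-step (24 h) look-back and horizon."""
    X, Y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback:t])   # past 24 h as input
        Y.append(series[t:t + horizon])    # next 24 h as target
    return np.array(X), np.array(Y)

def chrono_split(n, train=0.70, val=0.15):
    """Chronological 70/15/15 split (no shuffling across time)."""
    i = int(n * train)
    j = i + int(n * val)
    return slice(0, i), slice(i, j), slice(j, n)
```

Keeping the split chronological is important for load data: shuffling would let the model see samples from the future of its own test windows.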
All models were implemented in Python 3.7.9 and PyTorch 1.7.1. Training and inference were conducted on a Linux workstation equipped with an Intel Xeon Silver 4210 CPU (2.20 GHz, 20 cores), an NVIDIA Tesla V100 GPU (32 GB memory), and 128 GB RAM. All deep learning models were trained on a single GPU using the Adam optimizer with dynamic learning-rate decay (adaptively adjusted based on validation performance) and an early-stopping strategy to prevent overfitting and reduce redundant epochs. The principal hyperparameters are summarized in Table 1.
To jointly optimize sample allocation and expert specialization under a unified loss function, the expert models were first pre-trained on cluster-specific data subsets, followed by joint training of the gating network and the expert models to achieve effective task coordination across different load–price regimes.
In terms of time complexity, let N be the number of training samples, T_x the input sequence length, T_y the forecasting horizon, d_model the embedding dimension, L the number of self-attention layers in each encoder/decoder module, M the number of experts in the gated mixture, and E the number of training epochs. For a single expert, the dominant cost of one forward pass through the encoder–decoder structure comes from multi-head self-attention and encoder–decoder attention, with time complexity \(O(L(T_x^2 + T_y^2)\,d_{\mathrm{model}})\); the position-wise feed-forward networks have complexity \(O(L(T_x + T_y)\,d_{\mathrm{model}}^2)\). The 1D convolutional layers and the gating network introduce only a linear overhead of order \(O((T_x + T_y)\,d_{\mathrm{model}}\,M)\), which is negligible compared with the quadratic attention term. Therefore, the total time complexity over the whole training set can be written as
$$O\!\left(E\,N\,M\,L\left(\left(T_x^2 + T_y^2\right)d_{\mathrm{model}} + \left(T_x + T_y\right)d_{\mathrm{model}}^2\right)\right)$$
Overall, the time complexity of the proposed method is mainly determined by the multi-head attention blocks in the expert encoder–decoder structure, and is on the order of \(O(E\,N\,M\,L\,(T_x^2 + T_y^2)\,d_{\mathrm{model}})\). Under our experimental configuration, the training and inference costs are comparable to those of standard Transformer-based forecasting models, and the additional 1D CNN and gated mixture-of-experts components introduce only a slight linear increase.

4.3. Clustering Results

To characterize the nonstationary and heterogeneous interactions between price and load in the market environment, historical time-series samples are segmented on a daily basis. The proposed method is then applied to cluster the daily price trajectories, after which the resulting groups are interpreted using load characteristics and operational attributes. This process provides structured priors for the subsequent gated architecture and expert modeling.
As shown in Figure 5, increasing the number of clusters k yields a clear elbow in the sum of squared errors (SSE), with diminishing returns beyond k = 5. The Calinski–Harabasz index reaches a high and stable level near k ≈ 5, and the Davies–Bouldin index attains a local minimum at this point. Although the silhouette coefficient remains relatively high for small k, it shows no substantial improvement as k increases. Balancing compactness and separation, we therefore set k = 5 as the number of typical-day classes.
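The four indices in Figure 5 can be computed with scikit-learn. The sketch below uses plain k-means as a stand-in for the paper's deep embedded clustering model, purely to illustrate how candidate cluster counts are scored on daily price trajectories (one row per day):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

def score_cluster_counts(daily_profiles, ks=range(2, 9), seed=0):
    """Score each candidate k with SSE (elbow), CH, DB, and silhouette."""
    results = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(daily_profiles)
        labels = km.labels_
        results[k] = {
            "sse": km.inertia_,                                     # elbow: lower with larger k
            "ch": calinski_harabasz_score(daily_profiles, labels),  # higher is better
            "db": davies_bouldin_score(daily_profiles, labels),     # lower is better
            "sil": silhouette_score(daily_profiles, labels),        # higher is better
        }
    return results
```

A k is then chosen where the SSE elbow, a high CH, a low DB, and a non-deteriorating silhouette agree, which in the paper's data is k = 5.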
Figure 6 and Table 2 summarize the five prototypes. The classes show systematic differences in intraday load peaks, price trajectories, and the strength of price–load coupling. C1 accounts for 39.1% of all cases. It represents single-peak working days with mid-level prices, an average of 30.58 ¥/MWh, and a price–load correlation of 0.57. This class reflects regular and stable weekdays. C2 represents 15.0% of the samples. All cases in this class are holidays. The load level is lower, with an average of 5645 MW. The average price is the lowest, at 16.17 ¥/MWh, while the peak–valley spread reaches 37.36. The price–load correlation is weak at 0.24, forming a holiday pattern characterized by low load and localized price lifts. C3 covers 19.8% of the observations. It exhibits higher load, averaging 7113 MW, moderately high prices at 25.12 ¥/MWh, and a correlation of 0.47. This class corresponds to high-load regular days. C4 includes 13.5% of the data. Most cases are holidays, accounting for 92.9% of the samples. It is characterized by low load, averaging 5804 MW, and the weakest price volatility, with a standard deviation of 5.98. This pattern indicates a holiday regime with stable prices and low load. C5 accounts for 12.6% of all cases. It shows the highest load, averaging 8054 MW, an average price of 26.08 ¥/MWh, and a peak–valley spread of 32.99. The price–load correlation is the strongest at 0.61, representing a reinforced working-day regime with tight coupling between price and load.
These stratified labels are consistent with operational intuition and are directly useful for modeling. At the architecture level, the gated mixture can switch adaptively between weakly coupled holiday regimes (C2/C4) and strongly coupled working-day regimes (C5), thereby shrinking tail errors under volatility. At the model level, training Transformer experts by typical-day class enables targeted learning of distinct dependency structures. For example, the model captures price-driven dynamics in class C5 and long-memory stability in classes C2 and C4. This design improves both generalization and stability.
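The switching behavior described above reduces, at inference time, to a soft-weighted combination of expert forecasts. The following is a minimal NumPy sketch of that step only; in the actual framework the gate logits come from a trained gating network, so the values here are placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_mixture(expert_preds, gate_logits):
    """Soft-gated combination of expert forecasts.

    expert_preds: (M, H) array, one H-step forecast per expert.
    gate_logits:  (M,) scores produced by the gating network for this input.
    Returns the combined forecast and the gate weights.
    """
    w = softmax(gate_logits)      # weights are non-negative and sum to 1
    return w @ expert_preds, w

# Toy example: M = 3 experts, 4-step horizon, equal gate logits
preds = np.array([[1.0, 1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0, 2.0],
                  [4.0, 4.0, 4.0, 4.0]])
yhat, w = gated_mixture(preds, np.array([0.0, 0.0, 0.0]))
```

Because the weights vary smoothly with the gate logits, the mixture transitions gradually between regimes (e.g., holiday vs. working-day experts) rather than switching abruptly.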

4.4. Comparison with Conventional Methods

To verify effectiveness, we benchmark the proposed approach against TimesNet [33], iTransformer [34], Transformer [35], Autoformer [36], GRU [37], and GP [26]. Under identical data splits, inputs, and training settings, our method achieves the best performance across all three metrics. The model attains MAPE = 4.08%, NRMSE = 8.02%, and NMAE = 5.39%, as reported in Table 3, with representative forecasts shown in Figure 7. Compared with the six baseline models, MAPE is reduced by 20.9–39.1%, NRMSE decreases by 15–32%, and NMAE drops by 22–40%. These results demonstrate that the proposed adaptive pipeline, which combines load–price clustering for structured inputs, volatility-triggered gating, and Transformer-based experts, achieves substantially higher accuracy and stability under price-driven non-stationarity.
From representative time windows (Figure 7), the proposed model tracks the ground truth most closely: both phase and amplitude mismatches at peak–valley turnarounds are visibly smaller than the baselines. By contrast, GRU shows systematic bias near peaks, and Autoformer/Transformer exhibit noticeable lag on rising ramps, indicating that a single fixed architecture struggles to capture price-driven nonstationary dependencies.
Figure 8 presents the error-distribution boxplots. Our approach attains the lowest median absolute error, the smallest interquartile range, and a markedly lower upper whisker, implying not only reduced central error but also compressed dispersion and fewer outliers. GRU shows both wider spread and more extreme values.
To characterize tail risk, we employ the Error Duration Curve (EDC), as shown in Figure 9. The curve is constructed by sorting absolute errors in descending order and plotting the coverage ratio (share of samples within the top k%). Lower curves indicate fewer large errors and a thinner tail. Across the 0–60% coverage range, our method consistently outperforms all baselines. The performance gap is most pronounced in the top 10% segment, indicating stronger robustness during periods of sharp volatility or abrupt regime shifts. Applying range normalization or using a logarithmic y-axis does not change the relative ranking.
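The EDC construction described above amounts to a descending sort of the absolute errors paired with a cumulative coverage axis; a minimal sketch:

```python
import numpy as np

def error_duration_curve(abs_errors):
    """Sort absolute errors in descending order and pair each with its
    coverage ratio (share of samples at or above that error level)."""
    e = np.sort(np.asarray(abs_errors))[::-1]          # largest errors first
    coverage = np.arange(1, len(e) + 1) / len(e)       # top-k% coverage axis
    return coverage, e

cov, e = error_duration_curve([0.1, 0.5, 0.2, 0.9])
```

Plotting `e` against `cov` for each model gives curves like Figure 9: a lower curve in the small-coverage (left) region means fewer large errors and a thinner tail.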

4.5. Ablation Study

To quantify the contribution of each component, six configurations (M1–M6) are designed using the settings in Table 4. M1 is the proposed pipeline, combining typical-day expert training, a soft gating mechanism for expert selection, and Transformer-based experts. M2 retains typical-day experts and Transformers but replaces gating with static concatenation. M3 keeps gating but swaps the expert backbone for GRU. M4 concatenates clustered features directly and uses GRU experts. M5 and M6 remove typical-day labels and employ a single forecaster, Transformer and GRU, respectively. All schemes share identical data splits and training protocols to ensure a fair comparison.
Table 5 reports the overall errors. M1 achieves the best performance on all three metrics with MAPE 4.08%, NRMSE 8.02%, and NMAE 5.39%. Relative to M2, which relies on static concatenation, M1 yields markedly lower errors, indicating that soft selection allocates weights adaptively across volatility regimes and avoids the mismatch inherent to fixed fusion. Compared with M3, the improvements in NRMSE and NMAE underscore the Transformer’s advantage in modeling long-range dependencies and price–load coupling. M6, a single GRU, performs worst, reflecting limited adaptability to multi-scale, nonstationary dynamics.
The contributions of the clustering scheme, gating mechanism, and expert models can be assessed through the following comparisons. Compared with models M5 and M6, which do not incorporate the clustering scheme, the cluster-based configurations achieve an average 1.28 percentage-point reduction in MAPE. For the gating mechanism, the soft-gated configurations (M1 and M3) reduce MAPE by an average of 1.08 percentage points relative to their direct-fusion counterparts (M2 and M4). Regarding the expert models, the Transformer-based designs (M1 and M2) outperform the GRU-based counterparts (M3 and M4), yielding an average 1.32 percentage-point improvement in MAPE. Overall, this horizontal comparison indicates that the expert backbone contributes the largest accuracy gain, followed by the clustering scheme, and then the gating mechanism.
Figure 10 illustrates several representative periods. Model M1 aligns most closely with the observed load, exhibiting smaller phase and amplitude errors during peak–valley transitions and maintaining accurate tracking across both smooth and rapidly changing intervals. In contrast, M2 and M3 display different degrees of lag during fast ramps, particularly around sudden upward or downward movements. M4 shows block-like deviations near regime switches, suggesting difficulty in adapting to structural changes in the price–load relationship. M5 performs reasonably well in steady segments but deteriorates noticeably when abrupt variations occur. M6 presents systematic bias across multiple peaks and valleys, indicating a limited ability to capture the underlying load dynamics.
Figure 11 presents the error boxplots. Model M1 achieves the lowest median absolute error, the tightest interquartile range, and a visibly shorter upper whisker, reflecting reduced dispersion and fewer extreme deviations. M2 and M3 obtain similar medians but exhibit broader spreads, while M4 and M5 show further widening in both the interquartile ranges and whiskers. M6 produces the largest outliers, confirming its weaker robustness under volatile conditions.
Figure 12 displays the expert outputs and the time-varying gating weights for M1. The soft-weighting mechanism produces smooth transitions across regimes, effectively avoiding the discontinuities commonly associated with hard switching. This dynamic allocation enables the model to adjust its reliance on different experts as operating conditions evolve, thereby enhancing adaptability to nonstationary patterns and improving overall stability.

5. Conclusions

To address price-driven non-stationarity under market reforms, this study develops an adaptive short-term load forecasting framework that integrates typical-day priors, volatility-triggered gating, and Transformer-based experts. Using 15 min operational data from a Chinese market, the framework is validated empirically. The main conclusions of this paper are summarized as follows:
(i)
The proposed clustering-based partition supplies effective prior structure for modeling. Removing this component raises MAPE from 4.08% to 6.34–7.10%, indicating a clear loss of accuracy.
(ii)
The proposed soft gating network fuses multiple experts more effectively than static concatenation or hard assignment. Relative to direct concatenation, MAPE is reduced by an average of 1.08 percentage points.
(iii)
The proposed Transformer-expert scheme achieves higher accuracy: replacing GRU experts with Transformer experts yields a 1.49 percentage-point reduction in MAPE. Against mainstream baselines, the proposed method lowers MAPE by 1.08–2.62 percentage points and exhibits smaller maximum errors.
Future work may extend the current framework by incorporating probabilistic forecasting and risk metrics, enabling a shift from point forecasts to interval or scenario forecasts and providing uncertainty information for system scheduling and market decision-making. In addition, the joint load–price modeling approach can be expanded into a multi-source driven framework by integrating factors such as weather conditions, renewable generation, and demand response, thereby capturing more complex forms of non-stationarity.

Author Contributions

Conceptualization, X.D. and Y.Y.; Methodology, X.D.; Software, J.B.; Validation, H.J., Z.H. and J.B.; Formal analysis, Y.Y.; Investigation, J.B.; Resources, X.D.; Data curation, X.D.; Writing—original draft, H.J.; Writing—review and editing, Y.Y.; Visualization, Z.H.; Supervision, Z.H.; Project administration, J.B.; Funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research and Development of Key Technologies and Equipment for Novel Virtual Power Plants, grant number 2023JH1/10400049, and the Research on Medium- and Long-Term Power Load Forecasting Method Based on Time Series Combination Model, grant number LJ212411632007.

Data Availability Statement

The datasets presented in this article are not readily available because they are proprietary operational data, and the data provider has requested that they remain confidential.

Acknowledgments

During the preparation of this manuscript, the authors used the web version of ChatGPT-4o (OpenAI) for the purpose of polishing the English writing and language. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

Author Yan Yu was employed by the State Grid Liaoning Extra High Voltage Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Srivastava, M.; Tiwari, P.K. A profit driven optimal scheduling of virtual power plants for peak load demand in competitive electricity markets with machine learning based forecasted generations. Energy 2024, 310, 133077.
2. Laitsos, V.; Vontzos, G.; Paraschoudis, P.; Tsampasis, E.; Bargiotas, D.; Tsoukalas, L.H. The State of the Art Electricity Load and Price Forecasting for the Modern Wholesale Electricity Market. Energies 2024, 17, 5797.
3. Tong, Q. A Short-Term Electricity Load Forecasting Method Based on Multi-Factor Impact Analysis and BP-GRU Model. Processes 2025, 13, 2336.
4. Yao, H.; Qu, P.; Qin, H.; Lou, Z.; Wei, X.; Song, H. Multidimensional electric power parameter time series forecasting and anomaly fluctuation analysis based on the AFFC-GLDA-RL method. Energy 2024, 313, 134180.
5. Eren, Y.; Küçükdemiral, I. A comprehensive review on deep learning approaches for short-term load forecasting. Renew. Sustain. Energy Rev. 2024, 189, 114031.
6. Hasan, M.; Mifta, Z.; Papiya, S.J.; Roy, P.; Dey, P.; Salsabil, N.A.; Chowdhury, N.-U.; Farrok, O. A state-of-the-art comparative review of load forecasting methods: Characteristics, perspectives, and applications. Energy Convers. Manag. X 2025, 26, 100922.
7. Rondón-Cordero, V.H.; Montuori, L.; Alcázar-Ortega, M.; Siano, P. Advancements in hybrid and ensemble ML models for energy consumption forecasting: Results and challenges of their applications. Renew. Sustain. Energy Rev. 2025, 224, 116095.
8. El-Keib, A.; Ma, X.; Ma, H. Advancement of statistical based modeling techniques for short-term load forecasting. Electr. Power Syst. Res. 1995, 35, 51–58.
9. Phyo, P.P.; Jeenanunta, C. Daily Load Forecasting Based on a Combination of Classification and Regression Tree and Deep Belief Network. IEEE Access 2021, 9, 152226–152242.
10. van de Sande, S.N.P.; Alsahag, A.M.M.; Ziabari, S.S.M. Enhancing the Predictability of Wintertime Energy Demand in The Netherlands Using Ensemble Model Prophet-LSTM. Processes 2024, 12, 2519.
11. Karamolegkos, S.; Koulouriotis, D.E. Advancing short-term load forecasting with decomposed Fourier ARIMA: A case study on the Greek energy market. Energy 2025, 325, 135854.
12. Taylor, J.W. Short-Term Load Forecasting with Exponentially Weighted Methods. IEEE Trans. Power Syst. 2012, 27, 458–464.
13. Wu, F.; Cattani, C.; Song, W.; Zio, E. Fractional ARIMA with an improved cuckoo search optimization for the efficient Short-term power load forecasting. Alex. Eng. J. 2020, 59, 3111–3118.
14. Rendon-Sanchez, J.F.; de Menezes, L.M. Structural combination of seasonal exponential smoothing forecasts applied to load forecasting. Eur. J. Oper. Res. 2019, 275, 916–924.
15. Khayat, A.; Kissaoui, M.; Bahatti, L.; Raihani, A.; Errakkas, K.; Atifi, Y. Hybrid model for microgrid short term load forecasting based on machine learning. IFAC-PapersOnLine 2024, 58, 527–532.
16. Forootani, A.; Rastegar, M.; Sami, A. Short-term individual residential load forecasting using an enhanced machine learning-based approach based on a feature engineering framework: A comparative study with deep learning methods. Electr. Power Syst. Res. 2022, 210, 108119.
17. Zhao, Z.; Zhang, Y.; Yang, Y.; Yuan, S. Load forecasting via Grey Model-Least Squares Support Vector Machine model and spatial-temporal distribution of electric consumption intensity. Energy 2022, 255, 124468.
18. Fan, G.-F.; Zhang, L.-Z.; Yu, M.; Hong, W.-C.; Dong, S.-Q. Applications of random forest in multivariable response surface for short-term load forecasting. Int. J. Electr. Power Energy Syst. 2022, 139, 108073.
19. Fan, G.-F.; Yu, M.; Dong, S.-Q.; Yeh, Y.-H.; Hong, W.-C. Forecasting short-term electricity load using hybrid support vector regression with grey catastrophe and random forest modeling. Util. Policy 2021, 73, 101294.
20. Irankhah, A.; Yaghmaee, M.H.; Ershadi-Nasab, S. Optimized short-term load forecasting in residential buildings based on deep learning methods for different time horizons. J. Build. Eng. 2024, 84, 108505.
21. Shen, Q.; Mo, L.; Liu, G.; Zhou, J.; Zhang, Y.; Ren, P. Short-Term Load Forecasting Based on Multi-Scale Ensemble Deep Learning Neural Network. IEEE Access 2023, 11, 111963–111975.
22. Li, C.; Shi, J. A novel CNN-LSTM-based forecasting model for household electricity load by merging mode decomposition, self-attention and autoencoder. Energy 2025, 330, 136883.
23. Tian, J.; Liu, H.; Gan, W.; Zhou, Y.; Wang, N.; Ma, S. Short-term electric vehicle charging load forecasting based on TCN-LSTM network with comprehensive similar day identification. Appl. Energy 2025, 381, 125174.
24. Ahmad, A.; Xiao, X.; Mo, H.; Dong, D. TFTformer: A novel transformer based model for short-term load forecasting. Int. J. Electr. Power Energy Syst. 2025, 166, 110549.
25. Tan, Q.; Cao, C.; Xue, G.; Xie, W. Short-term heating load forecasting model based on SVMD and improved informer. Energy 2024, 312, 133535.
26. Incremona, A.; De Nicolao, G. Short-term forecasting of the Italian load demand during the Easter Week. Neural Comput. Appl. 2022, 34, 6257–6271.
27. Zheng, Y.; Jia, C.; Yu, J.; Li, X. Deep embedded clustering with distribution consistency preservation for attributed networks. Pattern Recognit. 2023, 139, 109469.
28. Prasanthi, L.; Malyala, L.P.; Krishnan, S.B.; Prasad, K.; Chakrabarti, P. A deep embedded clustering approach for detecting trend class using time-series sensor data. Knowl.-Based Syst. 2025, 320, 113609.
29. Liu, Z.-F.; Chen, X.-R.; Huang, Y.-H.; Luo, X.-F.; Zhang, S.-R.; You, G.-D.; Qiang, X.-Y.; Kang, Q. A novel bimodal feature fusion network-based deep learning model with intelligent fusion gate mechanism for short-term photovoltaic power point-interval forecasting. Energy 2024, 303, 131947.
30. Begga, A.; Lozano, M.Á.; Escolano, F. AG-GNN: Adaptive gating mechanism for robust node classification in graph neural networks. Inf. Sci. 2025, 726, 122750.
31. Wang, Z.; Chen, L.; Wang, C. Parallel ResBiGRU-transformer fusion network for multi-energy load forecasting based on hierarchical temporal features. Energy Convers. Manag. 2025, 345, 120360.
32. Zhan, X.; Kou, L.; Xue, M.; Zhang, J.; Zhou, L. Reliable Long-Term Energy Load Trend Prediction Model for Smart Grid Using Hierarchical Decomposition Self-Attention Network. IEEE Trans. Reliab. 2023, 72, 609–621.
33. Zhao, H.; Huang, X.; Xiao, Z.; Shi, H.; Li, C.; Tai, Y. Week-ahead hourly solar irradiation forecasting method based on ICEEMDAN and TimesNet networks. Renew. Energy 2024, 220, 119706.
34. Fang, B.; Xu, L.; Luo, Y.; Luo, Z.; Li, W. A method for short-term electric load forecasting based on the FMLP-iTransformer model. Energy Rep. 2024, 12, 3405–3411.
35. Hertel, M.; Beichter, M.; Heidrich, B.; Neumann, O.; Schäfer, B.; Mikut, R.; Hagenmeyer, V. Transformer training strategies for forecasting multiple load time series. Energy Inform. 2023, 6, 20.
36. Jiang, Y.; Gao, T.; Dai, Y.; Si, R.; Hao, J.; Zhang, J.; Gao, D.W. Very short-term residential load forecasting based on deep-autoformer. Appl. Energy 2022, 328, 120120.
37. Abumohsen, M.; Owda, A.Y.; Owda, M. Electrical Load Forecasting Using LSTM, GRU, and RNN Algorithms. Energies 2023, 16, 2283.
Figure 1. Structure of short-term load forecasting model.
Figure 2. Deep embedding clustering model structure.
Figure 3. Dynamic adaptive gating network architecture.
Figure 4. Structure of load forecast expert.
Figure 5. Variation of clustering evaluation metrics with the number of cluster centers.
Figure 6. Cluster centroid profiles of typical-day patterns.
Figure 7. Forecasting results of different methods.
Figure 8. Error distribution boxplots of different forecasting methods.
Figure 9. Error duration curves of different forecasting methods.
Figure 10. Forecasting results of different ablation schemes.
Figure 11. Error distribution boxplots of ablation experiments.
Figure 12. Expert outputs and time-varying gating weights in the proposed model.
Table 1. Parameter values.

Parameter                                   Value/Type
Optimizer                                   Adam
Learning rate                               5 × 10⁻⁴ (dynamic decay)
DEC: learning rate                          4 × 10⁻⁴ (dynamic decay)
DEC: network                                Linear + 1D CNN
DEC: dimensionality of embeddings           128
Training epochs                             350
Batch size                                  64
Input sequence length                       96 (24 h)
Forecasting horizon                         96 (24 h)
Position encoding type                      Absolute positional encoding
Expert training method                      Pre-training followed by joint training
Gated mixture architecture                  5 parallel branches; adaptive weighting
Expert encoder: number of modules           2
Expert encoder: self-attention              n_layers = 3; d_model = 128; n_heads = 8
Expert encoder: feed-forward                d_model = 128, activation = ReLU
Expert decoder: number of modules           2
Expert decoder: self-attention              n_layers = 3; d_model = 128; n_heads = 8
Expert decoder: encoder–decoder attention   n_layers = 3; d_model = 128; n_heads = 8
Expert decoder: feed-forward                d_model = 128, activation = ReLU
Table 2. Comparison of clustering quality metrics and explanatory variables.

Type                  C1        C2        C3        C4        C5
Share                 39.1%     15.0%     19.8%     13.5%     12.6%
Avg. price (¥/MWh)    30.58     16.17     25.12     24.63     26.08
Peak–valley spread    30.26     37.36     29.76     28.2      32.99
Price SD (σ)          7.2       7.89      6.75      5.98      7.43
Mean load (MW)        6966.20   5645.22   7112.61   5803.93   8053.93
Price–load corr.      0.57      0.24      0.47      0.3       0.61
Holiday share (%)     0.0       100.0     14.6      92.9      0.0
Table 3. Forecasting accuracy comparison with conventional methods.

Model          MAPE/%   NRMSE/%   NMAE/%
Proposed       4.08     8.02      5.39
TimesNet       5.16     9.47      6.88
iTransformer   5.38     9.87      7.20
Transformer    5.45     10.01     7.28
Autoformer     5.59     10.29     7.46
GRU            6.70     11.75     8.98
GP             5.42     9.98      7.19
Table 4. Configuration of ablation experiments.

Setting                                    M1 (Proposed)   M2   M3   M4   M5   M6
Experts trained by typical-day classes     ✓               ✓    ✓    ✓    ×    ×
Expert fusion: soft expert selection       ✓               ×    ✓    ×    -    -
Expert fusion: direct concatenation        ×               ✓    ×    ✓    -    -
Expert backbone: Transformer               ✓               ✓    ×    ×    ✓    ×
Expert backbone: GRU                       ×               ×    ✓    ✓    ×    ✓
Table 5. Forecasting errors for ablation schemes.

Model      MAPE/%   NRMSE/%   NMAE/%
Proposed   4.08     8.02      5.39
M2         5.47     9.86      7.26
M3         5.57     10.16     7.45
M5         6.34     11.25     8.50
M4         6.63     11.79     8.88
M6         7.10     12.78     9.51

Share and Cite

MDPI and ACS Style

Dong, X.; Yu, Y.; Jin, H.; Hu, Z.; Bao, J. Short-Term Load Forecasting in Price-Volatile Markets: A Pattern-Clustering and Adaptive Modeling Approach. Processes 2026, 14, 5. https://doi.org/10.3390/pr14010005
