Article

GCML: A Short-Term Load Forecasting Framework for Distributed User Groups Based on Clustering and Multi-Task Learning

1 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 211800, China
2 State Grid Electric Power Research Institute, NARI Group Co., Ltd., Nanjing 211106, China
3 Chongqing Three Gorges Water Conservancy and Electric Power Co., Ltd., Chongqing 401120, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3820; https://doi.org/10.3390/math13233820
Submission received: 5 October 2025 / Revised: 7 November 2025 / Accepted: 17 November 2025 / Published: 28 November 2025

Abstract

Short-term load forecasting of distributed user groups is crucial for the efficient operation of electricity markets, but existing methods rely mainly on intra-group consistency while neglecting inter-group correlations, which limits the utilization of cross-group information and reduces forecasting accuracy. To overcome these limitations, this study introduces a clustering and multi-task learning-based framework for short-term load forecasting of distributed user groups. First, historical load data are clustered to form representative consumption groups. Next, a Transformer encoder serves as the shared backbone of a hard-parameter-sharing multi-task learning framework. Within this framework, we apply dynamic task weighting and task-specific prediction heads, which balance the multi-task losses while optimizing the forecasting performance of each group. Moreover, a filter-attention mechanism and an Inception convolution module are introduced into the encoder to improve local pattern extraction and multi-scale feature fusion. Experiments on two publicly available datasets show that, for the London smart meter dataset, the MAE values of the clusters are 0.2858 and 0.4312, and the RMSE values are 0.5042 and 0.5266. On the clusters of the UCI electricity load dataset, the MAE values are 0.1617, 0.1554, and 0.2608, and the RMSE values are 0.2299, 0.2130, and 0.3678, respectively. These results demonstrate that our method outperforms baseline models and significantly improves the accuracy of distributed user short-term load forecasting in electricity markets.

1. Introduction

To accelerate the development of a unified national electricity market and ensure the high-quality operation of the new power system, it is essential to address the challenge of accurately forecasting electricity consumption [1]. Short-term power load forecasting for distributed user aggregates is a critical step in formulating spot trading clearing strategies, and accurate load forecasting for different groups of distributed users plays a vital role in ensuring the stable operation of power systems, optimizing resource allocation, and enhancing economic efficiency [2,3].
Compared with traditional large-scale entities participating in the spot market, distributed users are numerous, with small individual loads and high heterogeneity, posing significant challenges for load forecasting [4]. In particular, different user groups exhibit distinct consumption patterns. Conventional models struggle to uncover true load–driver relationships amid strong nonlinearities and high-dimensional covariates [5,6]. There is an urgent need to develop aggregation methods for large-scale heterogeneous user resources within the electricity market environment, to identify the key factors influencing electricity consumption across various user types and time periods [7]. Moreover, traditional electricity demand forecasting approaches primarily perform system-level predictions uniformly across all users, failing to effectively account for the diversity in consumption habits and the complexity of influencing factors among different industries and user groups [8,9,10].
To address the higher requirements for load forecasting accuracy posed by the diversity and volatility of users’ electricity consumption behaviors, in recent years, some studies have employed cluster analysis methods to capture electricity consumption differences among various user groups, thereby achieving more accurate load forecasting [11,12]. The load clustering algorithm is used to partition users into different clusters, forming corresponding resource aggregates, and further analyzing the users’ electricity consumption characteristics [13]. By grouping and modeling users with similar electricity consumption patterns, such methods effectively reduce the prediction errors caused by the randomness of individual user behaviors and lay a foundation for subsequent refined load management [14].
In the scenario of multi-user collaborative prediction, how to leverage the correlations among users to improve overall prediction performance has become a research hotspot. Recently, multi-task learning has been applied to time-series prediction for multiple users [15]. By sharing the underlying feature representations and model parameters, such methods mine the correlations among different tasks. Meanwhile, multi-task learning trains multiple tasks simultaneously and can therefore capture a broader range of power consumption patterns [16]. This mechanism not only enhances the model's ability to characterize complex electricity consumption patterns but also alleviates the data sparsity problem through information complementarity, making it particularly suitable for practical scenarios with a large number of users but limited data per user [17].
Although these methods are effective in most application scenarios, some fundamental challenges remain unresolved: (1) The electricity market has numerous participants with distinct consumption patterns, so making overall predictions without considering these variations may obscure group-specific characteristics and increase prediction errors. (2) Predicting each user group separately ignores the inherent correlations in electricity consumption trends between different groups and fails to effectively integrate multi-source information to improve prediction accuracy. (3) Existing methods insufficiently capture the local features of electricity load data, exhibit poor dynamic adaptability, and struggle to balance short-term fluctuations and long-term trends within a single scale.
To address the above issues, our paper proposes a short-term load forecasting framework for distributed user Groups based on Clustering and Multi-task Learning, called GCML. Firstly, distributed user groups are formed by clustering load curves. Secondly, a multi-task learning framework based on the Transformer encoder is employed, combined with a dynamic weighting mechanism and lightweight prediction heads designed for specific tasks, enabling unified feature extraction and information sharing across different user groups. Finally, we employ a filter-attention mechanism and an Inception convolution module within the shared encoder to enhance the model’s ability to capture local patterns and fuse multi-scale information.
The main contributions of this paper are summarized as follows:
(1)
This paper proposes a novel load forecasting model, which forecasts the short-term load by segmenting electricity user groups and jointly optimizing multiple forecasting tasks, thereby improving prediction accuracy on the demand side of the electricity market.
(2)
An improved multi-task learning architecture is proposed to capture correlations between distributed user groups. It uses an encoder-only approach to extract common features across clusters, introduces dynamic weighting, and assigns independent task heads for multiple forecasting tasks, overcoming the limitations of single-task modeling.
(3)
The filter-attention mechanism and Inception convolution module are integrated into an encoder, significantly enhancing the model’s ability to capture local patterns and fuse multi-scale features of load data.
(4)
We conducted experiments on publicly available datasets, and the experimental results show that GCML outperforms existing baseline models.

2. Related Work

As a central issue in the development of trading strategies for electricity markets, the accuracy of short-term electricity load forecasting for distributed users has attracted considerable research attention [18]. However, the diversity of customer types and electricity consumption behaviors in the electricity market poses a significant challenge to achieving highly accurate forecasts; existing research has primarily focused on modeling within-group correlations and capturing complex patterns, while often neglecting the variations between different distributed user groups [9,19].
Current research on short-term electricity load forecasting for distributed users primarily follows two approaches: (1) clustering-based load forecasting methods and (2) multi-task learning-based forecasting methods.

2.1. Clustering-Based Load Forecasting Methods

In order to identify different patterns in load data and achieve more accurate short-term load forecasting, researchers have proposed many time-series forecasting methods based on clustering. By clustering historical data and building specialized forecasting models for each group, these methods reduce the interference of complex load patterns on a single model and improve prediction performance [20]. For example, Hyojeoung et al. [21] used a time-series clustering method based on Euclidean distance and dynamic time warping distance to cluster household electricity demand before forecasting. Dalil et al. [22] extracted consumption patterns through outlier detection and replacement, followed by cluster analysis, and improved prediction accuracy. Fang et al. [23] developed a short-term time-series prediction model based on multilinear trend fuzzy information particles, employing K-Medoids clustering and novel fuzzy association rules to enhance both data characterization and semantic representation.
Existing studies mainly apply clustering algorithms to partition users into clusters and then use prediction algorithms to forecast the electricity load of each group [13,24], but these methods have some shortcomings. First, predicting each group individually ignores possible intrinsic correlations in electricity consumption trends among different user groups; the model can only learn the consumption patterns within each cluster, which restricts its generalization ability. Second, a separate model must be trained for each group, which increases overall complexity [25].

2.2. Multi-Task Learning Methods for Time-Series Forecasting

In recent years, multi-task deep learning has been used for time-series prediction in several fields, and existing studies have proved that multi-task learning frameworks can implicitly capture the dynamic relationship between multiple time series and improve the accuracy and generalization ability of time-series prediction [26,27]. Tian et al. [28] adopted multi-task learning techniques and an end-to-end learning framework to handle multiple load forecasting tasks in parallel, achieving favorable prediction performance. Guo et al. [29] proposed a multi-task learning method based on bidirectional long short-term memory to realize the prediction of cold, heat, and electricity combined loads. Jiang et al. [30] constructed a multi-task framework of dual-level information extraction by integrating LSTM and CNN methods to realize multi-family short-term load forecasting. Zhang et al. [31] used the DBSCAN clustering algorithm to group data and proposed a multi-task graph convolutional network to learn different spatial patterns for short-term load forecasting.
However, existing multi-task learning methods still face challenges in distributed user groups, primarily due to the insufficient consideration of variations and dynamic changes between different user groups, which limits the model’s adaptability and accuracy.
To address the limitations of previous studies, this paper proposes a short-term load forecasting method that integrates clustering and multi-task learning. By applying a multi-task learning framework to distributed user groups aggregated through time-series clustering, while incorporating dynamic weighting, innovative attention mechanisms, and multi-scale feature extraction methods, our approach enhances both the forecasting accuracy and the time performance.

3. Methodology

3.1. Problem Definition

The purpose of distributed user group short-term load forecasting is to simultaneously predict the future load values of M user groups over the next S time steps. Given a total of M load forecasting tasks, the historical load data of the m-th user group is represented as follows:
X^{(m)} = \{ x^{(m)}_{:,1}, \ldots, x^{(m)}_{:,T} \} \in \mathbb{R}^{n_m \times T} \quad (1)
where $X^{(m)}$ denotes the historical load data matrix of the $m$-th user group, $n_m$ is the number of measurement channels in that group, and $T$ represents the number of time steps in the historical observation window. The forecasting target for the $m$-th user group is formulated as
Y^{(m)} = \{ x^{(m)}_{:,T+1}, \ldots, x^{(m)}_{:,T+S} \} \in \mathbb{R}^{n_m \times S}, \quad m = 1, 2, \ldots, M \quad (2)
where $Y^{(m)}$ represents the predicted load matrix of the $m$-th user group for the next $S$ steps. The forecasting results of all user groups are organized as follows:
Y_{\mathrm{all}} = \{ Y^{(1)}, Y^{(2)}, \ldots, Y^{(M)} \} \in \mathbb{R}^{M \times n_m \times S} \quad (3)
where $Y_{\mathrm{all}}$ denotes the unified output representation of multi-task forecasting.

3.2. Processing Model

In order to capture the power consumption differences in the load aggregates and achieve more accurate power load forecasting, this paper proposes a short-term power load forecasting model based on clustering and multi-task learning. The processing model of GCML is shown in Figure 1.
As illustrated in Figure 1, the proposed model consists of three main components: (1) Distributed user clustering: the historical load data of distributed users are first clustered, with users exhibiting similar consumption patterns grouped into different categories. (2) Multi-task learning framework: a hard parameter sharing strategy is adopted, where different groups share the Transformer encoder to achieve joint modeling and feature extraction across groups. In addition, each group is assigned an independent linear layer as its task-specific prediction head, ensuring that group-specific load characteristics can be accurately captured on top of the shared global representations. (3) Enhanced feature extraction module: an Inception convolution module and a filter-attention mechanism are incorporated into the encoder. These components collectively enhance the model's ability to represent and forecast complex time-series data.

3.3. Clustering for Distributed Users

The clustering of electricity users is based on their power consumption behaviors, grouping users with similar consumption patterns to form corresponding load aggregates. In this paper, we utilize a deep clustering algorithm to cluster heterogeneous electricity users. Specifically, low-dimensional features of the load curves are extracted by an autoencoder [32], and then K-means [33] is applied for clustering. The set of historical load curves of all users is $A = \{ z_1, z_2, \ldots, z_N \}$, $z_i \in \mathbb{R}^T$, where $z_i$ denotes the power load curve of the $i$-th user and $T$ is the number of time points.
First, a low-dimensional representation of the load data is extracted through the autoencoder's encoding and decoding process, as shown in Equations (4) and (5).
h_i = f_{\theta_{\mathrm{enc}}}(z_i) = \sigma_{\mathrm{enc}}(W_{\mathrm{enc}} z_i + b_{\mathrm{enc}}) \quad (4)
\hat{z}_i = f_{\theta_{\mathrm{dec}}}(h_i) = \sigma_{\mathrm{dec}}(W_{\mathrm{dec}} h_i + b_{\mathrm{dec}}) \quad (5)
where $W_{\mathrm{enc}}$ and $b_{\mathrm{enc}}$ are the encoder parameters, $\sigma_{\mathrm{enc}}$ is the encoder activation function, $W_{\mathrm{dec}}$ and $b_{\mathrm{dec}}$ are the decoder parameters, $f_{\theta_{\mathrm{enc}}}$ is the mapping function of the encoder, $\theta_{\mathrm{dec}}$ denotes the parameter set of the decoder, and $\sigma_{\mathrm{dec}}$ is the output layer activation function.
Next, the low-dimensional latent vectors $h_1, h_2, \ldots, h_N$ obtained by the autoencoder are used as inputs to K-means clustering, which partitions users into multiple distinct clusters. We randomly select $K$ latent vectors as the initial centers $\{ c_1, c_2, \ldots, c_K \}$, assign each sample to its nearest cluster $k$, and update the cluster centers, as shown in Equations (6) and (7).
k_i = \arg\min_{k \in \{1, \ldots, K\}} \| h_i - c_k \|_2^2 \quad (6)
c_k = \frac{1}{|S_k|} \sum_{i : k_i = k} h_i \quad (7)
where $k_i$ is the cluster to which the $i$-th sample belongs, $S_k = \{ i \mid k_i = k \}$ is the sample set of the $k$-th cluster, and $|S_k|$ is the number of samples in that cluster.
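The two-stage procedure above can be summarized in a short sketch. The following is a minimal illustration, assuming hypothetical layer sizes, a latent dimension of 16, and MSE reconstruction training; the paper does not specify these details, so all values here are placeholders.

```python
# Deep clustering sketch: autoencoder feature extraction (Eqs. (4)-(5))
# followed by K-means on the latent vectors (Eqs. (6)-(7)).
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class LoadAutoencoder(nn.Module):
    def __init__(self, t_steps, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(t_steps, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))   # Eq. (4)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, t_steps))      # Eq. (5)

    def forward(self, z):
        h = self.encoder(z)
        return self.decoder(h), h

def cluster_users(curves, k, epochs=100):
    """curves: (N, T) tensor of load curves; returns a cluster label per user."""
    model = LoadAutoencoder(curves.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                      # reconstruction training
        recon, _ = model(curves)
        loss = nn.functional.mse_loss(recon, curves)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # extract latent vectors h_i
        _, h = model(curves)
    return KMeans(n_clusters=k).fit_predict(h.numpy())  # Eqs. (6)-(7)
```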
By analyzing the consumption behaviors of distributed users, those with similar load patterns are grouped together, enabling more accurate predictions. However, forecasting the load behavior of these groups requires not only capturing the internal relationships within each cluster but also modeling the interactions between clusters of distributed users.

3.4. Multi-Task Learning

To extract useful information from correlated distributed user groups, this study adopts a multi-task learning (MTL) framework [34], treating the load forecasting tasks for each user group as distinct yet related, thus effectively capturing the inherent relationships between different clusters. The MTL enables the parallel training of multiple load forecasting tasks by leveraging shared parameters in the lower layers of the model to extract common features across tasks, thereby improving overall forecasting accuracy.
According to the parameter sharing strategy, multi-task learning is generally categorized into hard parameter sharing and soft parameter sharing [35]. In this paper, we employ a hard parameter sharing framework. The encoder serves as a shared backbone network for the forecasting tasks of different clusters, while each task has its own task-specific linear prediction head. In the shared layers, all tasks use identical parameters, which effectively mitigates overfitting and enhances the model's generalization ability [36]. The multi-task learning architecture of our proposed model is illustrated in Figure 2.
As shown in Figure 2, this paper employs an encoder-only Transformer architecture [37] as the backbone network to learn representative features of different user groups $T_1, T_2, \ldots, T_n$ and introduces a weighting mechanism. Additionally, to improve the prediction accuracy of the standard Transformer, a linear layer is designed as the prediction head. The MTL framework is trained end-to-end, with samples from all user clusters processed in parallel. During forward propagation, the shared encoder extracts common features, while task-specific heads generate forecasts for their respective clusters. Thus, the Transformer encoder serves as a unified feature extractor, enabling the model to capture both global consumption trends and cluster-specific patterns.
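As an illustration of this layout, the sketch below pairs a stock PyTorch Transformer encoder (standing in for the modified encoder described in Section 3.5) with one linear prediction head per cluster; the dimensions, head count, and layer depth are assumptions for illustration, not the paper's exact configuration.

```python
# Hard parameter sharing: one shared encoder, one linear head per task.
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, n_tasks, d_model=64, lookback=24, horizon=12, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                      # per-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # shared backbone
        # independent task-specific prediction heads, one per user cluster
        self.heads = nn.ModuleList(
            nn.Linear(lookback * d_model, horizon) for _ in range(n_tasks))

    def forward(self, x, task_id):
        """x: (batch, lookback) load window belonging to cluster `task_id`."""
        z = self.encoder(self.embed(x.unsqueeze(-1)))   # shared features
        return self.heads[task_id](z.flatten(1))        # cluster-specific forecast
```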
The overall loss in multi-task learning is typically formulated as a weighted sum of the individual task losses. To ensure balanced training, we employ a dynamic weighting strategy that adapts to each task's learning stage, inherent difficulty, and current performance [34,38]. Specifically, all clusters perform load forecasting, and the weights are dynamically adjusted according to the rate of change in each task's loss function, encouraging similar learning progress across tasks. This strategy prevents any single task from dominating the optimization and ensures that all tasks receive sufficient training. Assuming a total of $n$ tasks, the weight for task $d$ at iteration $t$, denoted as $\alpha_d(t)$, is computed according to Equations (8) and (9).
\alpha_d(t) = \frac{n \exp\left( r_d(t-1) / T \right)}{\sum_i \exp\left( r_i(t-1) / T \right)} \quad (8)
r_d(t-1) = \frac{\mathrm{Loss}_d(t-1)}{\mathrm{Loss}_d(t-2)}, \quad r_d(t-1) > 0 \quad (9)
where $t$ is the number of iterations; $\mathrm{Loss}_d(t-1)$ and $r_d(t-1)$ are the loss function and relative decline rate of task $d$ at iteration $t-1$, respectively; and $T$ is a constant that controls the smoothness of the task weights.
The dynamic weighting strategy adaptively adjusts task weights according to learning progress, effectively balancing training across different user groups. Compared to methods such as uncertainty weighting [39], it requires no additional learnable parameters, thereby avoiding instability in parameter estimation under small-sample or low-quality data conditions.
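A minimal sketch of this weighting rule follows, assuming the DWA-style form reconstructed in Equations (8) and (9); the temperature value is illustrative.

```python
# Dynamic task weighting: weights follow each task's relative loss decline rate.
import math

def dynamic_weights(loss_hist, t_smooth=2.0):
    """loss_hist[d] = (Loss_d(t-1), Loss_d(t-2)); returns [alpha_d(t)] per task."""
    n = len(loss_hist)
    rates = [prev / prev2 for prev, prev2 in loss_hist]   # Eq. (9)
    exps = [math.exp(r / t_smooth) for r in rates]
    denom = sum(exps)
    return [n * e / denom for e in exps]                  # Eq. (8): sums to n

# The total loss is then: sum(alpha[d] * task_loss[d] for d in range(n)).
```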
After sharing the encoder, an independent task-specific prediction head is designed for each forecasting task. Each head, implemented as a fully connected neural network, is responsible for generating the predicted values of its respective cluster. By employing separate task heads, the model produces differentiated outputs across tasks, thereby achieving joint modeling of load data from multiple user aggregates.

3.5. Encoder-Only-Based Shared Encoder Architecture

In this paper, we adopt the encoder-only architecture of the Transformer [37] as the shared component in the multi-task learning framework to extract features from different users. This design reduces the number of parameters and, by using a linear layer as the prediction head, enables faster convergence on strongly periodic electricity load data while further improving prediction accuracy. Additionally, to better capture local patterns and multi-scale features, we incorporate a filter-attention mechanism and an Inception convolution module into the model.

3.5.1. Filter-Attention Mechanism

The self-attention mechanism distributes the weights across the entire sequence, which limits its ability to capture the local temporal patterns [40,41]. To address the above issues, we propose a filter-attention mechanism that strengthens the model’s focus on short-term dependencies.
By applying a two-dimensional convolutional filter to the attention matrix and filtering the attention distribution, the model's ability to capture and represent local patterns is enhanced. This allows the model to pay more attention to locally important information when computing attention weights and to suppress irrelevant interference, thereby improving its performance on complex, variable time-series data and enabling it to capture key features and local changes more accurately. The structure of the filter-attention mechanism is illustrated in Figure 3.
In Figure 3, the values of Q, K, and V are initially obtained through a process of linear projection, which can be mathematically expressed as follows:
Q = \mathrm{Linear}(X) \in \mathbb{R}^{B \times L \times H \times d_k}, \quad K = \mathrm{Linear}(X) \in \mathbb{R}^{B \times L \times H \times d_k}, \quad V = \mathrm{Linear}(X) \in \mathbb{R}^{B \times L \times H \times d_v} \quad (10)
where $d_k = d_v = d_{\mathrm{model}} / H$; $Q$, $K$, and $V$ represent the query, key, and value, respectively; and $H$ is the number of attention heads.
The dot product between the query and the key is calculated and multiplied by the scaling factor to obtain the attention score as defined in Equation (11).
\mathrm{scores} = \frac{Q K^{T}}{\sqrt{d_k}} \quad (11)
Instead of directly proceeding with aggregation, the attention distribution is divided according to the number of heads, and a two-dimensional convolution filter with kernel size $(1, f_s)$, where $f_s$ denotes the filter window size, is applied to the attention map of each head.
The filtered-attention maps are subsequently recombined, and the resulting matrix is then used to compute the weighted sum over the value matrices, thereby producing the final attention outputs. The process of filtered-attention can be formally defined in Equations (12) and (13),
A_{\mathrm{filtered}} = \mathrm{Conv2D}\left( \mathrm{softmax}(\mathrm{scores}) \right), \quad \mathrm{kernel} = (1, f_s) \quad (12)
\mathrm{Attn}_{\mathrm{out}} = \mathrm{Linear}\left( \mathrm{concat}(A_{\mathrm{filtered}} V) \right) \quad (13)
where $A_{\mathrm{filtered}}$ denotes the filtered attention weight matrix, and $\mathrm{Attn}_{\mathrm{out}}$ represents the final attention output. By applying convolutional filters to the attention matrix, the filter-attention mechanism strengthens the expression of local patterns, allowing the model to focus on locally important information and improving its performance on complex time-series data.
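A minimal sketch of this computation, assuming a depthwise Conv2d with kernel $(1, f_s)$ applied per head to the softmaxed attention map (padded so the map keeps its shape); all sizes are illustrative, not the paper's configuration:

```python
# Filter-attention: Eqs. (10)-(13). A (1, f_s) convolution smooths each head's
# attention map before it weights the values.
import torch
import torch.nn as nn

class FilterAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, f_s=3):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_out = nn.Linear(d_model, d_model)
        # one (1, f_s) filter per head, slid along the key axis (groups=n_heads)
        self.filt = nn.Conv2d(n_heads, n_heads, kernel_size=(1, f_s),
                              padding=(0, f_s // 2), groups=n_heads)

    def forward(self, x):
        B, L, _ = x.shape
        split = lambda t: t.view(B, L, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))  # Eq. (10)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5                    # Eq. (11)
        a = self.filt(scores.softmax(dim=-1))                                 # Eq. (12)
        out = (a @ v).transpose(1, 2).reshape(B, L, -1)   # recombine heads
        return self.w_out(out)                                                # Eq. (13)
```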

3.5.2. Multi-Scale Information Fusion

The Inception convolution [42] module performs parallel convolution operations using kernels of various sizes, enabling the extraction of features at multiple scales. Concatenating and fusing features from different scales allows the model to analyze data from multiple perspectives and levels of granularity, enhancing its expressive power and significantly improving performance on complex datasets.
As shown in Figure 4, parallel 1D convolution branches are arranged within the module, and convolution kernels of different sizes slide along the time axis, as shown in Equations (14)–(16),
\mathrm{conv}_1 = \mathrm{Conv1d}\left( X_{\mathrm{attn}}^{T}, \ \mathrm{kernel} = 1, \ \mathrm{out\_channels} = d_{ff} \right) \quad (14)
\mathrm{conv}_3 = \mathrm{Conv1d}\left( X_{\mathrm{attn}}^{T}, \ \mathrm{kernel} = 3, \ \mathrm{padding} = 1, \ \mathrm{padding\_mode} = \mathrm{circular} \right) \quad (15)
\mathrm{conv}_5 = \mathrm{Conv1d}\left( X_{\mathrm{attn}}^{T}, \ \mathrm{kernel} = 5, \ \mathrm{padding} = 2, \ \mathrm{padding\_mode} = \mathrm{circular} \right) \quad (16)
where $d_{ff}$ represents the number of convolution output channels, and $\mathrm{conv}_1$, $\mathrm{conv}_3$, and $\mathrm{conv}_5$ denote the results of one-dimensional convolutions with kernel sizes 1, 3, and 5, respectively.
Through the combined use of these convolution operations, the model effectively extracts both local and global features of time-series data at multiple scales while preserving sequence integrity. Next, the temporal features at different scales are concatenated to integrate multi-scale information, as described in Equation (17),
\mathrm{conv}_{\mathrm{all}} = \mathrm{concat}\left( \mathrm{conv}_1, \mathrm{conv}_3, \mathrm{conv}_5 \right), \quad \mathrm{FFN} = \mathrm{Conv1d}\left( \mathrm{ReLU}(\mathrm{conv}_{\mathrm{all}}), \ \mathrm{kernel} = 1, \ \mathrm{out\_channels} = d_{\mathrm{model}} \right) \quad (17)
where $\mathrm{conv}_{\mathrm{all}}$ denotes the fused feature obtained after concatenation, $\mathrm{FFN}$ refers to the output of the feedforward network, and $\mathrm{out\_channels}$ indicates the number of output channels.
By concatenating data at multiple scales, the model is capable of not only accurately capturing local patterns within complex load data but also learning the underlying long-term trends. Therefore, the model’s ability is enhanced to represent the internal structure of load data while preserving computational efficiency.
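A minimal sketch of this block, with $d_{\mathrm{model}}$ and $d_{ff}$ sizes assumed for illustration; the kernel sizes and circular padding follow Equations (14)–(17):

```python
# Inception-style feedforward block: parallel 1D convolutions at three scales,
# concatenated and fused back to d_model channels (Eqs. (14)-(17)).
import torch
import torch.nn as nn

class InceptionFFN(nn.Module):
    def __init__(self, d_model=64, d_ff=128):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)              # Eq. (14)
        self.conv3 = nn.Conv1d(d_model, d_ff, kernel_size=3, padding=1,
                               padding_mode="circular")                   # Eq. (15)
        self.conv5 = nn.Conv1d(d_model, d_ff, kernel_size=5, padding=2,
                               padding_mode="circular")                   # Eq. (16)
        self.fuse = nn.Conv1d(3 * d_ff, d_model, kernel_size=1)           # Eq. (17)

    def forward(self, x_attn):
        """x_attn: (batch, length, d_model) output of the attention sublayer."""
        x = x_attn.transpose(1, 2)                  # convolve along the time axis
        multi = torch.cat([self.conv1(x), self.conv3(x), self.conv5(x)], dim=1)
        return self.fuse(torch.relu(multi)).transpose(1, 2)
```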

4. Experiments

4.1. Experimental Settings

4.1.1. Experimental Environment

The experimental platform is Windows Server 2019, with an Intel Xeon E5-2686 v4 CPU, 96 GB of server memory, and an NVIDIA Tesla P40 GPU equipped with 24 GB memory and configured with CUDA 11.6. The model is implemented using PyTorch 1.13.1 and Python 3.8.

4.1.2. Datasets

In this paper, we conduct experiments on the following publicly available datasets:
(1)
Dataset I: London smart meter dataset (https://www.kaggle.com/datasets/jeanmidev/smart-meters-in-london, accessed on 16 November 2025).
(2)
Dataset II: UCI electricity load dataset (https://archive.ics.uci.edu/dataset/321, accessed on 16 November 2025).
Dataset I contains daily and hourly electricity consumption data of 5567 London households, and this paper selects the hourly subset; dataset II contains the electricity load data of 321 customers at hourly resolution from 2011 to 2014.

4.1.3. Data Processing

To mitigate the impact of missing data on the experimental results and to handle subtle irregularities in the time series, linear interpolation [43] is used to fill the missing values and ensure the continuity and integrity of the data. Given two known data points $(x_0, y_0)$ and $(x_1, y_1)$, the missing value $y$ at position $x$ is computed by linear interpolation as in Equation (18).
y = y_0 + \frac{(y_1 - y_0)(x - x_0)}{x_1 - x_0} \quad (18)
Using linear interpolation to fill in missing values preserves the continuity and smoothness of time-series data, making it particularly effective for datasets with strong temporal patterns. For each user, the missing energy values of certain days are estimated by calculating the linear interpolation between the nearest known values before and after the interval. Additionally, it reduces the risks of overfitting and noise amplification.
To further address dimensional discrepancies among different features, the data are standardized. We apply Z-score normalization, subtracting the mean and dividing by the standard deviation so that each feature has zero mean and unit variance; this eliminates scale differences across features and improves the stability and convergence speed of model training. It also enables fair comparison of the evaluation metrics on a uniform scale and improves the reliability of the subsequent analysis. Additionally, the two datasets were divided into a training set and a test set in an 8:2 ratio.
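A minimal preprocessing sketch covering these steps, assuming an hourly load DataFrame with one column per user; fitting the scaler on the training split only is a common practice assumed here, not a step the paper states explicitly:

```python
# Preprocessing sketch: linear interpolation (Eq. (18)), Z-score scaling,
# and the 8:2 chronological train/test split.
import pandas as pd

def preprocess(df, train_ratio=0.8):
    df = df.interpolate(method="linear")      # fill gaps between known values
    split = int(len(df) * train_ratio)
    train, test = df.iloc[:split], df.iloc[split:]
    mean, std = train.mean(), train.std()     # scaler fitted on training data
    return (train - mean) / std, (test - mean) / std
```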

4.1.4. Result Evaluation

To evaluate the clustering performance of the proposed method, this paper adopts the silhouette coefficient (SC) [33] as the evaluation metric. An SC value closer to 1 indicates a better clustering result, and its calculation is given in Equations (19) and (20),
s(i) = \frac{b(i) - a(i)}{\max\left( a(i), b(i) \right)} \quad (19)
SC = \frac{1}{n} \sum_{i=1}^{n} s(i) \quad (20)
where $a(i)$ denotes the average distance from sample point $i$ to the other points in the same cluster, and $b(i)$ denotes the average distance from sample point $i$ to the points in the nearest neighboring cluster. The overall silhouette coefficient is obtained by averaging the silhouette coefficients of all individual points.
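A short sketch of how the cluster number can be selected with this metric, assuming the latent vectors from the autoencoder of Section 3.3; scikit-learn's silhouette_score implements Equations (19) and (20):

```python
# Select the number of clusters by maximizing the silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(latent, k_range=range(2, 7)):
    scores = {k: silhouette_score(latent, KMeans(n_clusters=k).fit_predict(latent))
              for k in k_range}
    return max(scores, key=scores.get)   # k with the highest SC
```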
To quantitatively assess the prediction accuracy of the GCML model, we adopt the Mean Absolute Error (MAE) [44], coefficient of determination ($R^2$) [45], and Root Mean Square Error (RMSE) [46] as evaluation metrics.
The MAE evaluates the model's performance by calculating the average absolute difference between the actual and predicted values; it is highly general and easy to compute. The RMSE highlights larger deviations while maintaining consistent dimensions, and $R^2$ reflects the model's fit to the data. The three metrics are computed as shown in Equations (21)–(23), respectively,
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (21)
\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \quad (22)
R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 } \quad (23)
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
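Minimal NumPy implementations of Equations (21)–(23) for reference:

```python
# Evaluation metrics: MAE, RMSE, and the coefficient of determination R^2.
import numpy as np

def evaluate(y_true, y_pred):
    err = y_true - y_pred
    mae = np.abs(err).mean()                                           # Eq. (21)
    rmse = np.sqrt((err ** 2).mean())                                  # Eq. (22)
    r2 = 1 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()  # Eq. (23)
    return mae, rmse, r2
```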

4.2. Comparison of Clustering Results on Two Datasets

After preprocessing the data, we chose the silhouette coefficient (SC) as the evaluation metric, as it effectively considers both intra-cluster cohesion and inter-cluster separation, providing an intuitive and reliable measure for determining the optimal number of clusters.
By comparing the SC for different numbers of clusters, the optimal clustering solutions for datasets I and II can be identified, enabling the classification of user types. Figure 5 shows the silhouette coefficients obtained for different numbers of clusters on two datasets.
As shown in Figure 5, dataset I achieves the highest silhouette coefficient when the number of clusters is 2, indicating that 2 is the optimal choice. For dataset II, the silhouette coefficient reaches its maximum value when the number of clusters is 3, suggesting that dividing users into 3 groups is more appropriate. After clustering the two datasets according to the optimal number of clusters, the electricity consumption patterns of typical users in each cluster are illustrated in Figure 6.
Figure 6 demonstrates the effectiveness of the clustering results in revealing distinct electricity consumption patterns among user groups. In dataset I, Cluster 0 corresponds to high-demand users with consistently high load levels, pronounced peak–valley variations, and strong temporal regularity, while Cluster 1 represents low-demand users exhibiting relatively low overall consumption and minor fluctuations. For dataset II, Cluster 1 consists of high-demand users characterized by intense electricity usage and prominent peaks; Cluster 2 represents medium-demand users with periodic load profiles; and Cluster 0 includes low-demand users whose consumption behavior is stable but less structured.
By partitioning users into clusters, tailored modeling for each cluster can effectively capture group-specific temporal characteristics, mitigating biases caused by holistic modeling and significantly improving short-term load forecasting accuracy. Furthermore, in electricity market clearing, cluster-based user classification enables more accurate prediction of bidding behaviors and demand response potentials across different groups, thereby optimizing the price formation mechanism and enhancing market efficiency and resource allocation.
The clustering algorithm groups users with similar electricity consumption patterns into corresponding load aggregates, thereby reducing the number of users that need to be modeled independently. This approach enables more accurate prediction of electricity demand for different user groups, enhances the model’s ability to capture and exploit patterns in the data, and contributes to the effective optimization of power market operations.

4.3. Training Parameter Tuning

The reasonable selection of training epochs is essential for preventing overfitting or underfitting, optimizing training efficiency, and ensuring prediction accuracy. As presented in Figure 7, on both dataset I and dataset II, the training and test losses of the proposed model decrease rapidly in the early stages and gradually stabilize after about 50 epochs, indicating that 50 iterations are sufficient for model convergence.
After an initial decline, the model performance stabilizes on both datasets, indicating effective learning of load data features and good generalization ability. Therefore, the number of training epochs is set to 50 in the prediction experiments on both datasets.
An appropriate look-back length is crucial for short-term power load forecasting. In our paper, historical window lengths of 24, 48, 72, and 96 h are selected for training. The MAE and RMSE of the clustered data are then compared to serve as the evaluation criteria for subsequent experiments. The experimental results are demonstrated in Figure 8.
As shown in Figure 8, with the increase in the historical window length, the MAE values obtained for different clusters in dataset I exhibit an overall upward trend, although the differences remain relatively small. At the same time, the RMSE achieves its best performance when the historical window length is 24. For dataset II, both the MAE and RMSE reach their minimum values when the historical window length is smallest, outperforming other settings. Considering the results from both datasets and the short-term electricity load forecasting scenarios in this study with prediction lengths of 12 and 24, the historical window length of 24 is selected as the optimal configuration.
In the following experiments, we set the historical input length to 24, employ the Adam optimizer with a learning rate of $1 \times 10^{-3}$, apply a dropout rate of 0.1, and use a batch size of 32. Moreover, training runs of 50 epochs are performed on both datasets.

4.4. Effectiveness of Clustering for Short-Term Forecasting

We evaluate the effectiveness of distributed user clustering by comparing the load forecasting results of each aggregated cluster with those obtained from the original unclustered data. The results for prediction horizons of 12 and 24 steps are shown in Figure 9, where Cluster i denotes the i-th cluster formed after data aggregation, and Original represents the unclustered dataset.
As demonstrated in Figure 9, when the prediction window is 12, the proposed model achieves significantly lower MAE and RMSE values for both clustered aggregates than for direct predictions on the original data. Similarly, at the 24-step prediction horizon, the MAE and RMSE of the clustered groups remain smaller than those of the original dataset. These results indicate that clustering the distributed users significantly improves forecasting performance.

4.5. Comparison with Baseline Models

To evaluate the proposed method, we compare it with representative baselines drawn from both single-task and multi-task models.
(1)
Single-task models: we adopt TimeMixer [47], ConvTimeNet [48], TimesNet [49], ETSformer [50], FEDformer [51], and DLinear [52].
(2)
Multi-task models: we adopt Multi-Transformer [53], AutoSTL [54], and Multitask-GNN [55].

4.5.1. Comparison of Experimental Results on Dataset I

Evaluation of prediction performance based on MAE and RMSE: For prediction lengths of 12 and 24, short-term load forecasting experiments were conducted on the clustered distributed users of dataset I using the proposed method and the baseline models. The experimental results are summarized in Table 1.
As illustrated in Table 1, in the 12-step short-term load forecasting task for distributed user groups, the proposed method achieves MAE values of 0.2858 and 0.4312, while the RMSE values are 0.5042 and 0.5266. When the forecasting horizon is extended to 24 steps, the MAE values are 0.3090 and 0.4433, and the RMSE values are 0.5596 and 0.5690. In both cases, the results are significantly better than those of the other baseline methods, including both single-task and multi-task models.
Across all forecasting tasks, the method achieves the most favorable results: (1) Compared to DLinear, the best-performing single-task baseline, the proposed method reduces the MAE by 38.06% and 14.03%, while the RMSE is reduced by 37.16% and 40.36% in the 12-step scenario. In the 24-step scenario, the reductions are 33.71% and 12.53% for the MAE, as well as 30.67% and 35.73% for the RMSE. (2) In the 12-step case, it yields relative MAE reductions of 19.33% and 11.59% and RMSE reductions of 22.04% and 21.99% across the two clusters. (3) In the 24-step case, the method consistently surpasses the baselines; compared with the Multi-Transformer baseline, it decreases the MAE by 5.19% and 9.3% and the RMSE by 8.38% and 19.96%. These results highlight the effectiveness of the model in capturing short-term load variations and its robust generalization capability across different forecasting horizons.
Moreover, the experimental results demonstrate that the multi-task model outperforms the single-task model in both prediction horizons. This is because many single-task baseline methods fail to capture cross-cluster dependencies and struggle with complex time-series patterns and noisy data. These limitations highlight the effectiveness of multi-task learning, which leverages the correlations and complementarities between different user groups with varying distributions, thereby enhancing the model’s ability to capture load variation patterns and improving forecasting performance.
Evaluation of model interpretability based on $R^2$: To more comprehensively evaluate the performance of each model in the short-term power load forecasting task, we also compare the $R^2$ values obtained by different models on dataset I; the results are shown in Table 2.
As illustrated in Table 2, our model demonstrates superior performance across different clusters and forecasting horizons, achieving the optimal prediction results. For the 12-step forecasting horizon, compared to the best-performing single-task model Dlinear, it shows improvements of 17.96% and 17.01%, respectively. Meanwhile, compared to the better-performing multi-task model Multitask-GNN, it achieves increases of 5.85% and 48.25%, respectively. Additionally, it also shows significant improvements over other models for the 24-step forecasting horizon, fully verifying the superiority of GCML in complex data distributions and multi-task scenarios.
In addition, when the prediction length is 4 steps, the MAE, RMSE, and $R^2$ of Cluster 0 forecasting using GCML are 0.2386, 0.4179, and 0.8254, respectively. Meanwhile, on Cluster 1, the MAE is 0.4089, the RMSE is 0.4988, and the $R^2$ is 0.7499. The proposed model outperforms the other comparison methods, demonstrating superior performance in short-term load forecasting.

4.5.2. Comparison of Experimental Results on Dataset II

Evaluation of prediction performance based on MAE and RMSE: The experimental results of short-term electricity load forecasting with prediction horizons of 12 and 24 steps, conducted on the clusters obtained from dataset II using the proposed model, are presented in Table 3.
As demonstrated in Table 3, the proposed method consistently achieves the lowest MAE and RMSE values in all forecasting scenarios, showing clear improvements over the compared models: (1) In the 12-step forecasting task, the MAE values obtained for the three clusters are 0.1617, 0.1554, and 0.2608, while the corresponding RMSE values are 0.2299, 0.2130, and 0.3678. Compared with TimeMixer, which performs relatively well among the single-task models, the MAE is reduced by 37.01%, 40.82%, and 19.16%, and the RMSE is reduced by 42.00%, 46.71%, and 30.18%, respectively. (2) In the 24-step forecasting task, the MAE values for the three clusters are 0.1796, 0.1725, and 0.2682, with RMSE values of 0.2481, 0.2353, and 0.3799. Relative to TimeMixer, the MAE decreases by 34.60%, 37.16%, and 22.17%, and the RMSE decreases by 40.93%, 44.06%, and 33.05%, respectively.
In addition, the approach also shows clear advantages over advanced multi-task learning frameworks. Using AutoSTL as a representative baseline, our method consistently outperforms it across forecasting horizons and user groups: (1) in the 12-step forecasting task, the MAE and RMSE are reduced by 33.86% and 30.46%, 36.00% and 35.36%, and 38.75% and 37.81% across the three clusters, respectively; (2) for the 24-step task, the reductions in the MAE are 33.07%, 35.52%, and 34.20%, with corresponding decreases in the RMSE of 30.23%, 35.71%, and 32.02%. These results indicate that the proposed method not only improves short-term prediction accuracy but also delivers stable gains across varying conditions.
Evaluation of model interpretability based on $R^2$: To evaluate the interpretability and fitting performance of different models in the short-term load forecasting task, we employed $R^2$ as the evaluation metric. Table 4 presents the $R^2$ values of the various models for the 12-step and 24-step forecasting tasks across different clusters, offering a clear comparison of their effectiveness at different forecasting horizons.
As shown in Table 4, our model consistently achieves higher $R^2$ values than the baseline models under different clustering scenarios, indicating superior fitting performance and stability. In the 12-step forecasting task, it outperforms the best-performing baseline across all clusters, with improvements ranging from about 1.7% to 6.2%. Similarly, in the 24-step task, the proposed model maintains its advantage, yielding gains of approximately 0.9% to 2.8% over the strongest baseline. These results highlight the model's ability to exploit inter-cluster correlations and enhance pattern learning through knowledge sharing, thereby improving prediction accuracy.
To further evaluate the model's performance in short-term forecasting, we conducted experiments using a 4-step prediction horizon. The results show that for Cluster 0, the model achieved an MAE of 0.1561, an RMSE of 0.2216, and an $R^2$ of 0.9509. On Cluster 1, it obtained even better performance, with an MAE of 0.1461, an RMSE of 0.1986, and an $R^2$ of 0.9605. For Cluster 2, the MAE, RMSE, and $R^2$ were 0.2631, 0.3738, and 0.8603, respectively. Compared with other methods, our model demonstrates superior accuracy and stronger predictive capability across all clusters.
Experimental results on multiple datasets demonstrate that the proposed model consistently achieves superior predictive performance. By employing a multi-task learning framework with a shared encoder, the model simultaneously acquires knowledge across related tasks, thereby enhancing its generalization to unseen scenarios. In addition, the integration of a filter-attention mechanism emphasizes critical local information, while multi-scale feature extraction enables the capture of complex dynamic patterns in time-series data. The improved forecasting accuracy enables aggregators to submit more reliable bids in electricity markets and enhances the operational efficiency of power systems with high penetration of distributed energy resources.

4.6. Performance Comparison

By adopting a multi-task model, multiple forecasting tasks can be addressed simultaneously, whereas single-task models handle only one objective. The multi-task approach improves prediction accuracy, reduces redundant computation, and enhances efficiency.
To assess practicality, this study evaluates time performance by recording training and testing durations of each model, reflecting computational efficiency and real-time applicability. For fairness, the reported time of single-task models corresponds to the total duration required for predicting each cluster separately. Comparative results for the 24-step forecasting horizon are shown in Figure 10.
As illustrated in Figure 10, a comparison of time performance on the two datasets indicates that the multi-task model generally outperforms the single-task baseline models, with the only exception being the linear model DLinear. This may be attributed to the relatively simple structure and small number of parameters of DLinear, which allow it to maintain high computational efficiency even in single-task scenarios.
To sum up, the results on dataset I and dataset II demonstrate that multi-task learning not only provides advantages in forecasting accuracy but also exhibits superior efficiency and practicality in terms of time performance, further validating the feasibility and effectiveness of the proposed approach in real-world applications.

4.7. Ablation Experiment

To verify the effectiveness of each module in the proposed multi-task model, ablation experiments are carried out for comparative analysis. The model proposed in this paper is regarded as the benchmark network, and the following variant models are obtained by removing or replacing the relevant modules, respectively.
(1)
w/o Multi-task: Instead of adopting the multi-task learning paradigm, load forecasting is performed independently for each cluster after clustering.
(2)
w/ Self-attention: The filter-attention module is replaced with the vanilla self-attention module from the original Transformer architecture.
(3)
w/o Inception: The Inception module is removed from the model, thus disabling multi-scale feature extraction.
(4)
w/o Dy-weighting: The dynamic weighting strategy is removed from the model.
(5)
w/ Un-weighting: The dynamic weighting is replaced with the uncertainty weighting.
(6)
w/ InceptionSize: The number of convolutional branches in the Inception module is set to 4, with an additional branch of kernel size 7.
(7)
w/ FilterSize: The window size in the filter-attention mechanism is set to 5.
The ablation experiments were conducted using the above variant models, and their results were compared with those of the proposed model. The detailed experimental findings are presented in Table 5.
As shown in Table 5, the proposed model demonstrates the following improvements across different clusters and forecasting horizons: (1) In the 12-step forecasting task, compared with the w/o Multi-task, w/ Self-attention, w/o Inception, w/o Dy-weighting, w/ Un-weighting, w/ InceptionSize, and w/ FilterSize variants, the MAE obtained by the proposed model on Cluster 0 is reduced by 48.23%, 15.24%, 25.34%, 12.41%, 6.54%, 1.69%, and 4.19%, respectively, and the RMSE is reduced by 47.01%, 21.85%, 28.65%, 3.13%, 10.59%, 3.98%, and 9.40%, respectively; (2) for Cluster 1, the proposed model outperforms the other variants, with the MAE reduced by 37.50%, 13.14%, 14.75%, 6.28%, 2.82%, 3.15%, and 3.62%, respectively, and the RMSE reduced by 49.18%, 25.86%, 31.74%, 10.09%, 5.34%, 2.57%, and 8.67%, respectively. In the 24-step forecasting scenario, the proposed model continues to yield lower values for the relevant metrics on both distributed user groups, surpassing the results of the ablation variants.
The ablation experiment was conducted using the user aggregates obtained after clustering in dataset II, and the experimental results are shown in Table 6.
From Table 6, the proposed model demonstrates superior performance across different clusters, with the key results under the 12-step forecasting horizon summarized as follows: (1) on the aggregate obtained by Cluster 0, compared with the w/o Multi-task, w/ Self-attention, w/o Inception, w/o Dy-weighting, w/ Un-weighting, w/ InceptionSize, and w/ FilterSize variants, the MAE of the proposed model is decreased by 54.35%, 13.11%, 24.58%, 18.64%, 15.74%, 7.86%, and 4.60%, respectively, and the RMSE is reduced by 50.91%, 11.71%, 22.41%, 14.28%, 16.61%, 2.00%, and 3.32%, respectively; (2) for Cluster 1, the MAE is reduced by 43.79%, 17.17%, 22.14%, 7.72%, 16.68%, 7.17%, and 2.53%, respectively, and the RMSE is reduced by 44.78%, 19.59%, 24.58%, 7.37%, 17.86%, 4.91%, and 1.43%, respectively; and (3) for Cluster 2, the MAE is decreased by 21.66%, 13.21%, 17.18%, 33.14%, 16.06%, 4.54%, and 1.36%, respectively, while the RMSE is diminished by 27.68%, 15.86%, 20.82%, 34.26%, 17.68%, 5.21%, and 2.72%, respectively. The experimental results also indicate that the proposed model outperforms the ablation variants in the 24-step forecasting scenario.
The ablation experiment results demonstrate that GCML achieves superior performance across different architectural designs. By leveraging multi-task learning, the model can capture inter-group correlations and enhance generalization ability for complex load patterns. The integration of the filter-attention mechanism and Inception convolution further strengthens local feature extraction and multi-scale representation. Meanwhile, key parameters such as the window size of the filter-attention mechanism and the number of convolution kernels in the Inception module also have a certain impact on model performance.

4.8. Visualization of Prediction Results on Different Datasets

To better illustrate the characteristics of the tested load sequences, Figure 11 presents visualization cases of short-term power load forecasting with the proposed method, covering the two clusters of dataset I at prediction lengths of 24 and 12, respectively. In each subfigure, the solid line represents the ground truth, while the dashed line shows the prediction generated by the model.
As shown in Figure 11, the proposed model demonstrates remarkable stability in prediction trends. For both forecasting horizons, the predicted loads for all types of users closely match the actual power values, indicating excellent performance in load forecasting. This result suggests that the model can provide valuable support for management and operation in the electricity market.
Figure 12 presents the visualization results of short-term electricity load forecasting on the three clusters obtained after clustering dataset II. From the visualization results, it can also be observed that our model exhibits excellent predictive performance. Meanwhile, it can effectively capture complex trends in different samples, which demonstrates that the proposed method has strong capability in extracting features of load data.

4.9. Results Discussion

Experiments were conducted on two publicly available power datasets to verify the effectiveness of the proposed short-term load forecasting algorithm based on clustering and multi-task learning. A comprehensive analysis of the experimental results was performed, taking into account the advantages, limitations, and comparisons with other methods.
(1)
Our model combines clustering algorithms with multi-task learning to accurately classify user electricity consumption patterns while overcoming the limitations of traditional models. By learning from multiple related tasks, the model captures shared patterns across different groups, improving prediction accuracy and adaptability in complex scenarios with diverse user behaviors.
(2)
Compared to other models, our approach’s main advantage lies in the integration of filter-attention mechanisms and Inception convolution modules into the encoder. This combination enhances the model’s ability to capture local patterns and fuse multi-scale features, resulting in improved prediction performance. By focusing on relevant features and extracting information at multiple scales, the model becomes more robust to variations in the data.
(3)
The effectiveness of the proposed model is systematically validated through comparative experiments and ablation studies. Experimental analysis demonstrates that the proposed method exhibits significant advantages in prediction accuracy, with each module playing a crucial role.

5. Conclusions

To effectively utilize cross-group information between distributed user groups and improve short-term load forecasting accuracy, this research proposes a clustering and multi-task learning framework. First, users are divided into consumption groups using clustering algorithms. Second, a shared encoder with dynamic weighting is employed for joint multi-task training, while independent linear task heads are assigned to each group. Furthermore, a filter-attention mechanism and an Inception convolution module are incorporated into the encoder to capture local patterns and extract multi-scale temporal features from load data. The experimental results demonstrate that GCML outperforms the comparison methods in load forecasting across various horizons, effectively improving prediction accuracy for distributed users. The model is applicable to diverse electricity usage scenarios and is valuable for efficient market clearing in the electricity spot market.
In the future, we can explore dynamic clustering methods for seasonal patterns to further enhance the adaptability and prediction performance of the model in scenarios where external variables are introduced, and we will investigate incremental learning strategies and lightweight model update mechanisms.

Author Contributions

Conceptualization, J.W.; Methodology, J.W.; Formal analysis, J.F.; Resources, Y.Z.; Data curation, J.F.; Writing—original draft, J.W.; Writing—review & editing, Y.S., Y.Z., R.Y. and P.Y.; Visualization, R.Y.; Supervision, P.Y.; Funding acquisition, P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the following projects: NSFC (Grant No. 62377012) and the Jiangsu Province Postgraduate Practice and Innovation Program (No. SJCX240221).

Data Availability Statement

The original datasets are openly available in the following repositories: The electricity load forecasting dataset is on Kaggle [https://www.kaggle.com/datasets/jeanmidev/smart-meters-in-london, accessed on 16 November 2025] and individual household electric power consumption is from the UCI Machine Learning Repository [https://archive.ics.uci.edu/dataset/321, accessed on 16 November 2025]. These publicly accessible datasets were used to validate our proposed method, and we have properly cited them in the manuscript.

Conflicts of Interest

Authors Yusen Sun and Yu Zhou were employed by the company NARI Group Co., Ltd. Author Jianguo Fan was employed by the company Chongqing Three Gorges Water Conservancy and Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. The processing model of GCML for short-term electricity load forecasting for distributed users.
Figure 2. Encoder-only-based multi-task learning architecture for extracting inter-group shared information.
Figure 3. Multi-head filter-attention mechanism structure.
Figure 4. Inception network architecture for multi-scale information extraction.
Figure 5. Comparison of SC under different numbers of clusters on the two datasets.
Figure 6. The electricity consumption trends of typical users in each cluster after clustering. (a) Dataset I. (b) Dataset II.
Figure 7. The change in loss of the multi-task Transformer model during training and testing. (a) Training and testing loss on the clusters from dataset I. (b) Training and testing loss on the clusters from dataset II.
Figure 8. Comparison of forecasting results for different clusters under various historical window lengths. (a) MAE on dataset I. (b) RMSE on dataset I. (c) MAE on dataset II. (d) RMSE on dataset II.
Figure 9. The prediction results of the clustered aggregates and the original data under different prediction windows. (a) Prediction window of 12. (b) Prediction window of 24.
Figure 10. Comparison of training and testing time of different models on the two datasets. (a) Training time. (b) Testing time.
Figure 11. Visualization of prediction cases for load aggregators under different forecasting lengths on dataset I. (a) Cluster 0, prediction length 24. (b) Cluster 1, prediction length 24. (c) Cluster 0, prediction length 12. (d) Cluster 1, prediction length 12.
Figure 12. Visualization of prediction cases for load aggregators under different forecasting lengths on dataset II. (a) Cluster 0, prediction length 24. (b) Cluster 1, prediction length 24. (c) Cluster 2, prediction length 24. (d) Cluster 0, prediction length 12. (e) Cluster 1, prediction length 12. (f) Cluster 2, prediction length 12.
Table 1. Comparison of short-term load forecasting performance of different models on dataset I.

| Task Type | Method | Cluster 0 (12-step) MAE/RMSE | Cluster 1 (12-step) MAE/RMSE | Cluster 0 (24-step) MAE/RMSE | Cluster 1 (24-step) MAE/RMSE |
|---|---|---|---|---|---|
| Single-task models | Timemixer | 0.5049 / 0.9387 | 0.4964 / 0.9749 | 0.5079 / 0.9248 | 0.5052 / 0.9929 |
| | Convtimenet | 0.5052 / 0.9928 | 0.4863 / 0.9764 | 0.5124 / 0.9937 | 0.5083 / 1.0245 |
| | Timesnet | 0.4860 / 0.8588 | 0.4898 / 0.8860 | 0.4483 / 0.8349 | 0.4981 / 0.8894 |
| | ETSformer | 0.5554 / 0.9029 | 0.6875 / 1.0601 | 0.5592 / 0.8992 | 0.7540 / 1.1250 |
| | Fedformer | 0.7412 / 1.1114 | 0.6924 / 1.1659 | 0.7511 / 1.1215 | 0.7424 / 1.1973 |
| | Dlinear | 0.4614 / 0.8024 | 0.5016 / 0.8830 | 0.4661 / 0.8072 | 0.5068 / 0.8853 |
| Multi-task models | Multi-Transformer | 0.3543 / 0.6467 | 0.4877 / 0.6750 | 0.3259 / 0.6108 | 0.4892 / 0.7109 |
| | AutoSTL | 0.4037 / 0.7412 | 0.4966 / 0.7635 | 0.4174 / 0.7603 | 0.5063 / 0.7940 |
| | Multitask-GNN | 0.2917 / 0.5323 | 0.5071 / 0.7126 | 0.4001 / 0.6246 | 0.4738 / 0.6929 |
| | GCML | 0.2858 / 0.5042 | 0.4312 / 0.5266 | 0.3090 / 0.5596 | 0.4433 / 0.5690 |
Table 2. Comparison of R² in short-term load forecasting among different models on dataset I.

| Task Type | Method | Cluster 0 R² (12-step) | Cluster 0 R² (24-step) | Cluster 1 R² (12-step) | Cluster 1 R² (24-step) |
|---|---|---|---|---|---|
| Single-task models | Timemixer | 0.2595 | 0.2445 | 0.2516 | 0.2316 |
| | Convtimenet | 0.4211 | 0.3721 | 0.3745 | 0.3624 |
| | Timesnet | 0.5636 | 0.5516 | 0.5549 | 0.5381 |
| | ETSformer | 0.5671 | 0.5238 | 0.4553 | 0.4108 |
| | Fedformer | 0.3464 | 0.3240 | 0.3266 | 0.3445 |
| | Dlinear | 0.6432 | 0.6388 | 0.6238 | 0.5812 |
| Multi-task models | Multi-Transformer | 0.5816 | 0.6269 | 0.5447 | 0.4945 |
| | AutoSTL | 0.4511 | 0.4221 | 0.4167 | 0.3706 |
| | Multitask-GNN | 0.7168 | 0.6908 | 0.4922 | 0.5200 |
| | GCML | 0.7587 | 0.6958 | 0.7297 | 0.6856 |
Table 3. Comparison of short-term load forecasting performance of different models on dataset II.

| Task Type | Method | Cluster 0 (12-step) MAE/RMSE | Cluster 1 (12-step) MAE/RMSE | Cluster 2 (12-step) MAE/RMSE | Cluster 0 (24-step) MAE/RMSE | Cluster 1 (24-step) MAE/RMSE | Cluster 2 (24-step) MAE/RMSE |
|---|---|---|---|---|---|---|---|
| Single-task models | Timemixer | 0.2564 / 0.3962 | 0.2626 / 0.3997 | 0.3226 / 0.5268 | 0.2705 / 0.4200 | 0.2745 / 0.4206 | 0.3446 / 0.5674 |
| | Convtimenet | 0.3674 / 0.4788 | 0.3764 / 0.5436 | 0.5099 / 0.9008 | 0.3824 / 0.5332 | 0.4861 / 0.7263 | 0.5467 / 0.8592 |
| | Timesnet | 0.3664 / 0.4831 | 0.277 / 0.4063 | 0.3626 / 0.5922 | 0.3801 / 0.5051 | 0.3116 / 0.4678 | 0.4007 / 0.6473 |
| | ETSformer | 0.3757 / 0.4908 | 0.3596 / 0.4594 | 0.5099 / 0.6825 | 0.4185 / 0.5546 | 0.4169 / 0.5365 | 0.5884 / 0.7911 |
| | Fedformer | 0.4581 / 0.6036 | 0.4130 / 0.5515 | 0.5393 / 0.7440 | 0.4850 / 0.6440 | 0.4434 / 0.5924 | 0.5733 / 0.7843 |
| | Dlinear | 0.3771 / 0.5231 | 0.5093 / 0.6708 | 0.6293 / 0.8688 | 0.4309 / 0.5899 | 0.5576 / 0.7306 | 0.6904 / 0.9252 |
| Multi-task models | Multi-Transformer | 0.1883 / 0.2622 | 0.2264 / 0.3177 | 0.2792 / 0.3973 | 0.1909 / 0.2648 | 0.1888 / 0.2602 | 0.2887 / 0.2887 |
| | AutoSTL | 0.2445 / 0.3306 | 0.2427 / 0.3295 | 0.4258 / 0.5914 | 0.2643 / 0.3556 | 0.2675 / 0.3660 | 0.4076 / 0.5588 |
| | MultiTask-GNN | 0.3644 / 0.4586 | 0.3249 / 0.4181 | 0.2808 / 0.3932 | 0.3767 / 0.4760 | 0.3949 / 0.4823 | 0.2920 / 0.4098 |
| | GCML | 0.1617 / 0.2299 | 0.1554 / 0.2130 | 0.2608 / 0.3678 | 0.1769 / 0.2481 | 0.1725 / 0.2353 | 0.2682 / 0.3799 |
Table 4. Comparison of R² in short-term load forecasting among different models on dataset II.

| Task Type | Method | Cluster 0 R² (12-step) | Cluster 0 R² (24-step) | Cluster 1 R² (12-step) | Cluster 1 R² (24-step) | Cluster 2 R² (12-step) | Cluster 2 R² (24-step) |
|---|---|---|---|---|---|---|---|
| Single-task models | Timemixer | 0.8798 | 0.8648 | 0.8745 | 0.8610 | 0.8745 | 0.8109 |
| | Convtimenet | 0.8363 | 0.8048 | 0.8251 | 0.7610 | 0.7759 | 0.7230 |
| | Timesnet | 0.7764 | 0.7556 | 0.8388 | 0.7853 | 0.7387 | 0.6750 |
| | ETSformer | 0.5554 | 0.9029 | 0.6875 | 1.0601 | 0.5592 | 0.8992 |
| | Fedformer | 0.7412 | 0.8014 | 0.6924 | 0.8059 | 0.7511 | 0.8215 |
| | Dlinear | 0.7379 | 0.6668 | 0.7465 | 0.6804 | 0.6804 | 0.6296 |
| Multi-task models | Multi-Transformer | 0.9313 | 0.9299 | 0.8991 | 0.9323 | 0.8421 | 0.8327 |
| | AutoSTL | 0.8907 | 0.8735 | 0.8914 | 0.8660 | 0.6501 | 0.6877 |
| | Multitask-GNN | 0.7896 | 0.7734 | 0.8252 | 0.7674 | 0.8252 | 0.8321 |
| | GCML | 0.9471 | 0.9384 | 0.9546 | 0.9446 | 0.8647 | 0.8557 |
Table 5. Ablation study of short-term load forecasting performance (GCML variants) on dataset I.

| Method | Cluster 0 (12-step) MAE/RMSE | Cluster 1 (12-step) MAE/RMSE | Cluster 0 (24-step) MAE/RMSE | Cluster 1 (24-step) MAE/RMSE |
|---|---|---|---|---|
| w/o Multi-task | 0.5521 / 0.9515 | 0.6898 / 1.0363 | 0.5785 / 0.9714 | 0.6309 / 1.0019 |
| w/ Self-attention | 0.3372 / 0.6452 | 0.4964 / 0.7103 | 0.3218 / 0.6181 | 0.4903 / 0.7131 |
| w/o Inception | 0.3828 / 0.7067 | 0.5058 / 0.7715 | 0.3539 / 0.6853 | 0.5022 / 0.7347 |
| w/o Dy-weighting | 0.3263 / 0.5205 | 0.4601 / 0.5857 | 0.3363 / 0.5687 | 0.4984 / 0.6132 |
| w/ Un-weighting | 0.3058 / 0.5639 | 0.4437 / 0.5563 | 0.3257 / 0.6088 | 0.4834 / 0.5865 |
| w/ InceptionSize | 0.2907 / 0.5251 | 0.4452 / 0.5405 | 0.3194 / 0.5992 | 0.4834 / 0.5865 |
| w/ FilterSize | 0.2983 / 0.5565 | 0.4474 / 0.5766 | 0.3146 / 0.5893 | 0.4560 / 0.5712 |
| GCML | 0.2858 / 0.5042 | 0.4312 / 0.5266 | 0.3090 / 0.5596 | 0.4433 / 0.5690 |
Table 6. Ablation study of short-term load forecasting performance (GCML variants) on dataset II.

| Method | Cluster 0 (12-step) MAE/RMSE | Cluster 1 (12-step) MAE/RMSE | Cluster 2 (12-step) MAE/RMSE | Cluster 0 (24-step) MAE/RMSE | Cluster 1 (24-step) MAE/RMSE | Cluster 2 (24-step) MAE/RMSE |
|---|---|---|---|---|---|---|
| w/o Multi-task | 0.3542 / 0.4683 | 0.2765 / 0.3857 | 0.3329 / 0.5086 | 0.3862 / 0.5091 | 0.2907 / 0.4117 | 0.3555 / 0.5632 |
| w/ Self-attention | 0.1861 / 0.2604 | 0.1876 / 0.2649 | 0.3005 / 0.4371 | 0.1936 / 0.2683 | 0.1888 / 0.2618 | 0.2890 / 0.4109 |
| w/o Inception | 0.2144 / 0.2963 | 0.1996 / 0.2824 | 0.3149 / 0.4645 | 0.2211 / 0.3007 | 0.3007 / 0.2919 | 0.2981 / 0.4293 |
| w/o Dy-weighting | 0.1919 / 0.2682 | 0.1865 / 0.2593 | 0.3107 / 0.4461 | 0.2177 / 0.2990 | 0.1902 / 0.2612 | 0.4306 / 0.6090 |
| w/ Un-weighting | 0.1985 / 0.2757 | 0.1684 / 0.2321 | 0.3901 / 0.5595 | 0.2158 / 0.2972 | 0.2077 / 0.2860 | 0.3248 / 0.4652 |
| w/ InceptionSize | 0.1755 / 0.2346 | 0.1659 / 0.2240 | 0.2732 / 0.3880 | 0.1781 / 0.2677 | 0.1747 / 0.2867 | 0.2759 / 0.3978 |
| w/ FilterSize | 0.1695 / 0.2378 | 0.1580 / 0.2161 | 0.2644 / 0.3781 | 0.1831 / 0.2565 | 0.1946 / 0.2558 | 0.2729 / 0.3869 |
| GCML | 0.1617 / 0.2299 | 0.1554 / 0.2130 | 0.2608 / 0.3678 | 0.1769 / 0.2481 | 0.1725 / 0.2353 | 0.2682 / 0.3799 |
