Article

KAN+Transformer: An Explainable and Efficient Approach for Electric Load Forecasting

1 State Key Laboratory of Coal Mine Disaster Prevention and Control, Shenfu Demonstration Zone, Fushun 113122, China
2 China Coal Technology and Engineering Group Shenyang Research Institute, Shenfu Demonstration Zone, Fushun 113122, China
3 The School of Software, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Sustainability 2026, 18(3), 1677; https://doi.org/10.3390/su18031677
Submission received: 1 November 2025 / Revised: 1 January 2026 / Accepted: 27 January 2026 / Published: 6 February 2026

Abstract

Short-Term Residential Load Forecasting (STRLF) is a core task in smart grid dispatching and energy management, and its accuracy directly affects the economy and stability of power systems. Current mainstream methods still have limitations in addressing issues such as complex temporal patterns, strong stochasticity of load data, and insufficient model interpretability. To this end, this paper proposes an explainable and efficient forecasting framework named KAN+Transformer, which integrates Kolmogorov–Arnold Networks (KAN) with Transformers. The framework achieves performance breakthroughs through three innovative designs: constructing a Reversible Mixture of KAN Experts (RMoK) layer, which optimizes expert weight allocation using a load-balancing loss to enhance feature extraction capability while preserving model interpretability; designing an attention-guided cascading mechanism to dynamically fuse the local temporal patterns extracted by KAN with the global dependencies captured by the Transformer; and introducing a multi-objective loss function to explicitly model the periodicity and trend characteristics of load data. Experiments on four power benchmark datasets show that KAN+Transformer significantly outperforms advanced models such as Autoformer and Informer; ablation studies confirm that the KAN module and the specialized loss function bring accuracy improvements of 7.2% and 4.8%, respectively; visualization analysis further verifies the model’s decision-making interpretability through weight-feature correlation, providing a new paradigm for high-precision and explainable load forecasting in smart grids. Collectively, the results demonstrate our model’s superior capability in representing complex residential load dynamics and capturing both transient and stable consumption behaviors. By enabling more accurate, interpretable, and computationally efficient short-term load forecasting, the proposed KAN+Transformer framework provides effective support for demand-side management, renewable energy integration, and intelligent grid operation. As such, it contributes to improving energy utilization efficiency and enhancing the sustainability and resilience of modern power systems.

1. Introduction

The growing dependence on heterogeneous and intermittent energy resources in modern power systems has intensified the need for sophisticated supply–demand equilibrium optimization [1]. Within this context, Short-Term Residential Load Forecasting (STRLF) has evolved as an essential component of smart grid management, providing critical insights into fine-grained electricity consumption patterns for utility operators and energy market participants [2]. From a sustainability perspective, accurate short-term load forecasting is essential for reducing energy waste, improving demand–supply balance, and supporting the large-scale integration of renewable energy sources in modern smart grids. Precise household-level load predictions not only improve grid operational efficiency but also enable effective renewable energy integration and facilitate advanced demand response initiatives [3,4]. Nevertheless, STRLF presents unique computational challenges due to three primary factors: the stochastic nature of residential energy consumption behaviors, significant temporal variability in demand profiles, and the complex interplay of exogenous variables such as meteorological conditions [5].
Traditional approaches to electric load forecasting, which were originally developed for system-level applications, are typically divided into two main categories: classical statistical models and data-driven machine learning techniques. Statistical methodologies, represented by exponential smoothing [6] and Autoregressive Integrated Moving Average (ARIMA) models [7], primarily focus on extracting temporal patterns and seasonal components from historical load data [8]. While these approaches have demonstrated utility in bulk load prediction, machine learning paradigms offer superior capability in modeling complex nonlinear relationships within energy systems [9,10].
Residential load forecasting at the household level presents distinct technical challenges compared to aggregated system-level predictions, primarily due to the pronounced stochasticity in individual consumption patterns. Current research directions address these challenges through various neural network architectures: Long Short-Term Memory networks (LSTMs) were employed in [4] for STRLF applications, while [11] developed a novel RNN framework incorporating pooling mechanisms for household-level predictions. Alternative approaches include Markov-chain mixture distribution models for ultra-short-term forecasting [12], CNN architectures with squeeze-and-excitation blocks for microclimate data integration [2], and hybrid CNN-GRU frameworks within mixture density networks for probabilistic load prediction [13]. Recent innovations also explore Transformer-based architectures with auto-correlation mechanisms (Autoformer) [14] to model temporal dependencies in residential load sequences. To overcome data scarcity issues in individual household scenarios, transfer learning methodologies have been successfully implemented in STRLF systems [15].
While residential consumers within geographical proximity often demonstrate correlated energy usage patterns influenced by shared environmental factors, contemporary forecasting methodologies frequently neglect these spatial dependencies among households. This oversight represents a missed opportunity to improve forecasting accuracy through the systematic integration of neighborhood-level consumption correlations. Specifically, spatial correlations provide information that is complementary to temporal patterns, for example, synchronized load peaks in residential areas during evening meal preparation or morning commuting periods. Neglecting such dependencies results in the underutilization of contextual information and leads to increased prediction errors at the household level, particularly during periods of high stochasticity. Despite substantial advancements in neural network applications, two fundamental limitations persist in current deep learning frameworks for load forecasting. First, the Universal Approximation Theorem (UAT), which underpins conventional neural architectures, guarantees only function approximation rather than exact representation. More precisely, UAT establishes that a target function can be approximated given sufficient network width or depth, yet it does not provide constructive estimates of these architectural requirements to achieve a prescribed level of approximation accuracy; nonetheless, practical studies on approximation rates and associated limitations have been reported [16]. As a result, model capacity selection in real-world temporal forecasting remains uncertain, potentially affecting predictive reliability. Such limitations further restrict the ability of forecasting models to support sustainable grid operation under high uncertainty and volatile demand conditions.
Emerging as a paradigm-shifting alternative, the Kolmogorov–Arnold Network (KAN) [17] addresses these limitations through mathematical rigor and architectural innovation. Built upon the Kolmogorov–Arnold Representation Theorem (KART), KAN provides two groundbreaking advantages: (1) KART’s theoretical guarantee that any multivariate continuous function can be decomposed into finite compositions of univariate functions, establishing precise relationships between network topology and input dimensionality for exact functional representation; (2) an integrated network pruning mechanism that transforms trained models into interpretable symbolic expressions, enabling both performance optimization and causal analysis of specific network components. Moreover, KAN’s intrinsic compatibility with time-series characteristics—including seasonal periodicity and evolutionary trends—allows effective incorporation of domain knowledge through structured function initialization, thereby enhancing both prediction accuracy and physical consistency.
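For reference, the Kolmogorov–Arnold representation theorem that KART refers to can be stated as follows (a standard textbook formulation; the notation here is ours, not the cited paper's):
$f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right),$
where $\phi_{q,p}$ and $\Phi_q$ are continuous univariate functions. KAN makes this construction learnable by parameterizing the univariate functions with splines and stacking such layers.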
Notably, existing eXplainable AI (XAI) approaches for time series have laid the foundation for interpretability research, including attention- or gradient-based attribution methods [18] that highlight critical time steps, SHAP-based explanations adapted to sequential data [19] that quantify feature importance, and Temporal Integrated Gradients [20] that trace prediction contributions across time. However, these methods are typically applied in a post hoc manner to black-box models, which may introduce inconsistencies between the generated explanations and the model’s internal decision process. In contrast, the interpretability of KAN is architecturally grounded: the weights of its B-spline basis functions explicitly encode the functional relationships between input features and output predictions, enabling transparent analysis of how local nonlinear patterns contribute to global forecasting behavior without additional post-processing.
Despite its recent emergence, KAN has already stimulated significant methodological developments through its innovative use of trainable 1D B-spline functions for signal transformation. Research evolution has produced multiple architectural variants: Chebyshev-KAN [21] and Jacobi-KAN [22] employ orthogonal polynomial bases for enhanced numerical stability, while wavelet-KAN [23] utilizes multi-resolution analysis for localized feature extraction. Although KAN derivatives have demonstrated success in computer vision [24] and linguistic tasks [25], their potential for temporal sequence modeling remains largely unexplored, particularly in energy forecasting applications.
To address this critical research gap, we present a novel KAN–Transformer co-design framework for time series prediction. Our main contributions are fourfold:
(1)
Adaptive Feature Aggregation Mechanism for Multi-KAN Collaboration: We propose a novel approach to aggregate the features learned by multiple KANs, where aggregation weights are dynamically computed through a dedicated strategy. This design departs from conventional static fusion methods, enabling the model to adaptively emphasize critical feature patterns based on data characteristics, thereby enhancing the representativeness of the integrated features.
(2)
MoK-based Pre-Transformer Feature Enhancement: A mixture of KAN experts (MoK) layer is introduced before the standard Transformer architecture as a nonlinear feature enhancement layer. The MoK module adaptively aggregates the outputs of multiple KAN experts through a gating mechanism, improving input representations prior to the Transformer encoder and enhancing the model’s ability to capture nonlinear patterns in time series data.
(3)
Novel Multi-Objective Loss with Load Balancing and Coefficient of Variation: We introduce a specialized loss function optimized during training, which incorporates a load-balancing term computed from the standard deviation (σ) and mean (μ) of the expert loads to encourage more balanced feature contributions across KAN experts. Moreover, we integrate the coefficient of variation, which enables robust comparison of relative variability across datasets with significant mean differences, addressing the limitations of traditional metrics in cross-dataset optimization.
(4)
Empirical Validation of Superiority: Comprehensive experiments on four benchmark datasets (ETTh1, ETTh2, ETTm1, ETTm2) demonstrate that our model consistently outperforms state-of-the-art approaches. These results not only validate the effectiveness of our innovative designs but also confirm the framework’s practical applicability in real-world power load forecasting, underscoring the transformative impact of our contributions.

2. Related Work

2.1. Time-Series Forecasting Models

While transformer-based models have emerged as the dominant paradigm in computer vision and Natural Language Processing (NLP), the field of time series forecasting remains a dynamic battleground where diverse architectures—including Transformers, Convolutional Neural Networks (CNNs), and Multilayer Perceptrons (MLPs)—compete for supremacy.
Early MLP-based models underperformed compared to transformers. However, recent advancements demonstrate remarkable efficiency: NLinear [26] and RLinear [27] integrate channel-wise normalization with lightweight MLPs, achieving superior performance on specific benchmarks at minimal computational cost, even surpassing Transformer-based counterparts. Recurrent Neural Networks (RNNs) remain a competitive choice for sequential data modeling. Models like LSTNet [28] and WITRAN [29] leverage RNNs’ fixed-size hidden states to process ultra-long input sequences effectively, addressing the memory constraints of traditional methods. CNNs are widely adopted via 1D convolutions, as seen in ModernTCN [30] and SCINet [31]. A notable deviation is TimesNet [32], which transforms 1D time series into 2D temporal-frequency representations using Fourier transforms, enabling 2D convolution to capture periodic and trend patterns simultaneously.
Transformer architectures excel in modeling temporal dependencies but face scalability challenges due to their quadratic time and memory complexity. Recent innovations address these limitations: Informer [33] introduces ProbSparse self-attention, reducing complexity by selectively attending to dominant queries. Pyraformer [34] employs a hierarchical pyramid attention mechanism to capture multi-scale temporal patterns with linear complexity. PatchTST [35] and Crossformer [36] adopt patch-based tokenization to shorten input sequences, significantly lowering computational overhead. Beyond conventional architectures, novel frameworks are reshaping the landscape: Mamba [37] and RWKV [38] explore hybrid structures and state-space models to balance efficiency and accuracy in long-sequence forecasting.

2.2. Transformer + KAN

To improve the accuracy and robustness of time series forecasting, particularly in load forecasting, recent studies have increasingly focused on enhancing Transformer-based architectures. Transformer variants such as Informer [33], Autoformer [14], and FEDformer [39] have demonstrated strong capability in modeling long-range temporal dependencies and handling complex sequential patterns, significantly outperforming traditional statistical and recurrent models on long-horizon forecasting tasks.
In parallel, the Kolmogorov–Arnold Network (KAN), grounded in the Kolmogorov–Arnold representation theorem, has emerged as a powerful nonlinear function approximator capable of capturing intricate local relationships in multivariate data. KAN and its variants have been widely adopted in energy-related applications, including battery state-of-health estimation, where hybrid models such as CNN–KAN [40] and KAN–LSTM [41] leverage multi-feature fusion to effectively model complex nonlinear electrochemical dynamics.
Beyond standalone applications, recent studies have begun exploring the integration of KAN with Transformer architectures to improve model expressiveness while maintaining scalability. For example, Yang et al. proposed the Kolmogorov–Arnold Transformer (KAT) [42], which replaces the MLP layers in vision Transformers with Group-Rational KAN layers to enhance parameter efficiency. Similarly, Li et al. proposed U-KAN [43], which introduces Kolmogorov–Arnold Networks as a strong backbone for medical image segmentation and generation, demonstrating that KAN-based architectures can effectively enhance nonlinear feature representation in vision tasks.
These studies illustrate that existing KAN–Transformer hybrids primarily adopt a replacement-based integration strategy, embedding KAN directly into specific Transformer sublayers. Inspired by these advances, our work instead focuses on enhancing load forecasting by combining the complementary strengths of both architectures, while preserving the original Transformer structure for global dependency modeling and leveraging KAN for dedicated nonlinear feature representation.

3. Method

In this section, we describe the design and implementation of the Attention-Guided KAN-Transformer Hybrid Model (KAN+Transformer) for power-load forecasting. The architecture couples a Kolmogorov–Arnold Network with a Transformer so that KAN captures local, short-horizon variations while self-attention models long-range dependencies. We then detail the KAN block, the attention module, and a task-specific loss function, explaining how each component improves the model’s ability to handle the nonstationary, periodic, and trend structures of load data.

3.1. Overview

As illustrated in Figure 1, our model adopts a cascaded Attention-Guided KAN–Transformer architecture. The input consists of seven endogenous variables without incorporating external meteorological or calendar covariates. The input sequence is first standardized using RevIN+ to mitigate distribution shift. The normalized features are then directly fed into a Mixture of KAN Experts (MoK), where multiple parallel KAN experts operate on the input representation. A lightweight MLP–Softmax gating network assigns data-dependent mixture weights to selectively activate experts and adaptively fuse their outputs. The fused representation is subsequently restored to the original scale via RevIN. Positional encodings are introduced only after this denormalization step, ensuring that the subsequent Transformer encoder captures medium- and long-range temporal dependencies based on the original data distribution. Finally, a linear projection layer produces the multi-step forecasts. By integrating KAN-based nonlinear feature modeling, MoK-driven adaptive expert selection, and Transformer-based global dependency modeling, the proposed architecture provides a unified and effective framework for multivariate load forecasting.
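A minimal PyTorch-style sketch of this pipeline is given below. The submodules (reversible MoK block, positional-encoding module, Transformer encoder) are treated as placeholders that are sketched in the following subsections; the wiring details, class names, and tensor shapes are our assumptions based on the description above, not the authors' released code.
import torch
import torch.nn as nn

class KANTransformerSketch(nn.Module):
    # Pipeline from Figure 1: RMoK (RevIN+ -> KAN experts -> inverse RevIN)
    # -> positional encoding -> Transformer encoder -> linear projection head.
    def __init__(self, rmok, pos_enc, encoder, n_vars, d_model=64):
        super().__init__()
        self.rmok, self.pos_enc, self.encoder = rmok, pos_enc, encoder
        self.embed = nn.Linear(n_vars, d_model)   # lift variables to model width
        self.head = nn.Linear(d_model, n_vars)    # project back to the variables

    def forward(self, x):                         # x: (batch, look_back, n_vars)
        z, gate = self.rmok(x.transpose(1, 2))    # RMoK expects (batch, n_vars, look_back)
        z = z.transpose(1, 2)                     # (batch, pred_len, n_vars), original scale
        h = self.pos_enc(self.embed(z))           # module that adds positional information
        h = self.encoder(h)                       # global temporal dependencies
        return self.head(h), gate                 # forecasts + gate weights for L_balance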

3.2. Kolmogorov–Arnold Network

The Kolmogorov–Arnold Network (KAN) leverages the Kolmogorov–Arnold representation theorem for practical use in deep learning, particularly for approximating complex multivariate functions and capturing non-linear relationships in high-dimensional data.
The input vector $x = (x_1, x_2, \dots, x_n)$ is first transformed using a set of basis functions. These basis functions can be any continuous functions, but B-splines are commonly used due to their flexibility and smoothness properties. A B-spline is a piecewise polynomial function that can represent a smooth curve. For each input feature $x_i$, we apply a transformation using a basis function $B_i(x_i)$:
$B_i(x_i) = \sum_{j=1}^{m} \alpha_{ij}\,\mathrm{BSpl}(x_i, \tau_{ij}),$
where $\alpha_{ij}$ are learned coefficients, $\mathrm{BSpl}(x_i, \tau_{ij})$ is the $j$-th B-spline basis function for the $i$-th input feature, and $\tau_{ij}$ are the knot positions. This transformation creates a more expressive representation of the input data, allowing the network to capture more complex dependencies between input features.
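To make the basis expansion concrete, the following minimal PyTorch sketch implements the equation above with degree-1 B-splines (hat functions) on a fixed knot grid. The class name, the shared grid across features, and the grid range are our illustrative assumptions; the paper's implementation uses higher-order splines.
import torch
import torch.nn as nn

class HatBasisExpansion(nn.Module):
    # Per-feature expansion B_i(x_i) = sum_j alpha_ij * BSpl(x_i, tau_ij),
    # using degree-1 B-splines (hat functions) as a simple stand-in.
    def __init__(self, n_features, n_basis=10, x_min=-3.0, x_max=3.0):
        super().__init__()
        knots = torch.linspace(x_min, x_max, n_basis)              # tau_ij, shared grid here
        self.register_buffer("knots", knots)
        self.step = (x_max - x_min) / (n_basis - 1)
        self.alpha = nn.Parameter(torch.randn(n_features, n_basis) * 0.1)  # alpha_ij

    def forward(self, x):                                          # x: (batch, n_features)
        # Hat function: max(0, 1 - |x - tau_j| / step), nonzero on two adjacent grid cells.
        dist = (x.unsqueeze(-1) - self.knots).abs() / self.step
        basis = torch.clamp(1.0 - dist, min=0.0)                   # (batch, n_features, n_basis)
        return (basis * self.alpha).sum(dim=-1)                    # (batch, n_features)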
KAN exploits the Kolmogorov–Arnold representation theorem by constructing compositions of the learned transformations in successive layers. Each layer performs a function composition of the transformed inputs. For a given layer $i$, the transformation can be expressed as:
$y^{(i)} = g_i(h_i(x)) = g_i\left(\sum_{j=1}^{n} a_{ij} B_j(x_j)\right),$
where $y^{(i)}$ is the output of the $i$-th layer, and $g_i$ and $h_i$ are continuous functions applied to the output of the basis function transformation. The expression $\sum_{j=1}^{n} a_{ij} B_j(x_j)$ represents the combination of B-splines over the input features. Composing the functions $g_i(h_i(x))$ in each layer allows the network to capture increasingly complex relationships between the input features, enabling it to approximate highly non-linear functions.
KAN’s deep architecture consists of multiple layers, where each layer performs the composition of basis functions. The number of layers $k$ depends on the complexity of the function being approximated. In KAN, the total output after $k$ layers can be written as:
$f(x) = \sum_{i=1}^{k} g_i(h_i(x)),$
where each layer $i$ applies a non-linear transformation of the inputs through compositions of continuous functions.
To enhance the model’s ability to capture non-linear dependencies, each layer can apply a non-linear activation function, such as ReLU, Sigmoid, or tanh, after each composition. This ensures that the network can learn intricate patterns in the data. The output of the $i$-th layer after applying an activation function $\phi$ can be expressed as:
$y^{(i)} = \phi(g_i(h_i(x))).$

3.3. Transformer Block

The Transformer model [44] was originally aimed at addressing long-range dependency issues in sequence-to-sequence tasks. Unlike traditional RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), the Transformer is entirely based on the self-attention mechanism and positional encoding, without relying on recurrent structures, thus offering significant advantages in parallel computation and long-sequence modeling.
The core components of the Transformer [45] are the self-attention mechanism and the feed-forward neural network. In the self-attention mechanism, each element of the input sequence is compared with the other elements to compute relevance, generating weighted representations that capture global dependencies. Given an input sequence $x_1, x_2, \dots, x_n$, each element is mapped to query ($Q$), key ($K$), and value ($V$) matrices using three projection matrices $W_Q$, $W_K$, $W_V$. The self-attention calculation is as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$
where $d_k$ is the dimensionality of the key and $QK^{T}/\sqrt{d_k}$ represents the similarity between the query and key. After applying the softmax function for normalization, the result is multiplied by the value matrix $V$ to obtain a weighted sum, producing the new representation for each element.
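As a minimal illustration, the scaled dot-product attention above can be transcribed directly into PyTorch; this is a plain restatement of the equation, not the model's actual implementation, which uses multi-head attention.
import torch

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise query-key similarity
    weights = torch.softmax(scores, dim=-1)         # normalize over the key axis
    return weights @ V                              # weighted sum of values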
Our model adopts a Transformer architecture composed of multiple stacked encoder layers. Each encoder layer follows a Pre-LN design, in which layer normalization is applied prior to the self-attention and feed-forward sub-layers to enhance training stability. The feed-forward network is configured with a hidden dimension of 256. To encode temporal order information in time series data, positional encodings are incorporated into the input representations to compensate for the permutation-invariant nature of the self-attention mechanism. Specifically, fixed sinusoidal positional encodings are employed, where sine and cosine functions with different frequencies generate position-dependent representations that are added to the input embeddings, as defined below:
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right),$
where $pos$ is the position, $i$ is the dimension index, and $d$ is the embedding dimension.
To alleviate the quadratic time and memory complexity associated with self-attention on long sequences, the model only observes a historical window of length 96 during each training iteration. This window is first processed by a KAN-based Mixture of Experts module for nonlinear feature extraction and temporal aggregation, after which the Transformer encoder is applied to the resulting prediction-length representations to efficiently model global temporal dependencies. Residual connections and layer normalization are employed throughout the network to facilitate gradient propagation and accelerate training convergence. A dropout rate of 0.1 is applied to both the self-attention and feed-forward sub-layers to mitigate overfitting.
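For illustration, the sinusoidal encoding and a Pre-LN encoder stack matching the hyperparameters reported in Section 4.2.2 (embedding size 64, 8 heads, feed-forward width 256, dropout 0.1) could be assembled as follows; the number of encoder layers is our assumption, since the text only states that multiple layers are stacked.
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # (seq_len, d_model), added to the input embeddings

# Pre-LN Transformer encoder (a layer count of 2 is assumed for illustration).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=8, dim_feedforward=256,
    dropout=0.1, norm_first=True, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)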

3.4. Mixture of KAN Experts Layer

We can obtain complex KAN-based models by stacking multiple KANs or replacing linear layers in existing models with KANs [46]. However, we aim to design a simple KAN-based model that is easy to analyze while achieving performance comparable to state-of-the-art time series forecasting methods.
We propose a simple, effective, and interpretable KAN-based model: the Reversible Mixture of KAN Experts Network (RMoK). It is crucial to distinguish between the Mixture of KAN Experts (MoK) and RMoK. The MoK refers specifically to the core module comprising parallel KAN layers and the gating network, responsible for non-linear feature extraction and adaptive fusion. The RMoK represents the complete functional block used in our framework, which encapsulates the MoK within a reversible normalization wrapper. This reversible structure ensures that the KAN experts operate on normalized data to mitigate distribution shift while preserving the original scale information for the output. First, the RevIN+ module normalizes the input time series of each variable using learnable affine transformations. The normalization process can be expressed as:
$\hat{x}_t = \frac{x_t - \mu_t}{\sigma_t}\,\gamma_t + \beta_t,$
where $x_t$ is the input at time step $t$, $\mu_t$ and $\sigma_t$ denote the mean and standard deviation of the input at time $t$, and $\gamma_t$ and $\beta_t$ are learnable affine transformation parameters. This process ensures the input data is normalized to a common scale. Next, the MoK layer generates predictions based on the normalized time series features. Let $z = (z_1, z_2, \dots, z_K)$ represent the outputs of the $K$ experts, where each expert’s output is a function of the normalized input $\hat{x}_t$. The prediction $y_t$ is computed as a weighted sum of the experts’ outputs:
$y_t = \sum_{k=1}^{K} g_k z_k,$
where $g_k$ is the gating network output representing the confidence weight for the $k$-th expert. To enable accurate weight allocation, the gating network is implemented as a lightweight multilayer perceptron. Specifically, it takes the historical input matrix of dimension $C \times L$, where $C$ denotes the number of variables and $L$ the look-back window size, and projects it into a latent feature space of dimension $C \times 64$. This projection is followed by a ReLU activation, after which the features are mapped to $K$ expert scores through a Softmax layer. Finally, the predictions are denormalized back to the original distribution space via the inverse RevIN operation, using the same affine transformation parameters from the first step. This denormalization is given by:
$x_t = \frac{\hat{x}_t - \beta_t}{\gamma_t}\,\sigma_t + \mu_t,$
where $\hat{x}_t$ is the normalized value from the first step, and the parameters $\gamma_t$, $\beta_t$, $\sigma_t$, and $\mu_t$ map the normalized value back to the original input space.
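A minimal PyTorch sketch of this reversible wrapper is given below. It assumes per-variable instance statistics, a shared gating MLP applied to each variable's look-back window, and experts passed in as generic modules; these structural details, along with the class name, are our illustrative assumptions rather than the paper's implementation.
import torch
import torch.nn as nn

class RMoKSketch(nn.Module):
    # Reversible Mixture of KAN Experts: per-variable RevIN+ normalization,
    # MLP-Softmax gating over K experts, weighted fusion, inverse RevIN.
    def __init__(self, experts, n_vars, look_back, hidden=64, eps=1e-5):
        super().__init__()
        self.experts = nn.ModuleList(experts)            # each maps (B, C, L) -> (B, C, pred_len)
        self.gate = nn.Sequential(
            nn.Linear(look_back, hidden), nn.ReLU(),
            nn.Linear(hidden, len(experts)))             # per-variable expert scores
        self.gamma = nn.Parameter(torch.ones(n_vars))    # learnable affine scale (RevIN+)
        self.beta = nn.Parameter(torch.zeros(n_vars))    # learnable affine shift
        self.eps = eps

    def forward(self, x):                                # x: (batch, n_vars, look_back)
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True) + self.eps
        x_hat = (x - mu) / sigma * self.gamma[:, None] + self.beta[:, None]
        g = torch.softmax(self.gate(x_hat), dim=-1)      # (batch, n_vars, K) gating weights
        z = torch.stack([e(x_hat) for e in self.experts], dim=-1)  # (batch, n_vars, pred_len, K)
        y = (z * g.unsqueeze(2)).sum(dim=-1)             # adaptive expert fusion
        y = (y - self.beta[:, None]) / self.gamma[:, None] * sigma + mu  # inverse RevIN
        return y, g                                      # g is reused for the balance loss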
During training, the gating network tends to converge toward a winner-takes-all regime, where a small subset of experts receives disproportionately large routing weights while others are underutilized. To alleviate this imbalance and promote uniform expert utilization, we introduce an auxiliary load-balancing loss. Rather than relying on conventional mean-deviation penalties, we adopt the coefficient of variation (CV) of expert loads to quantify and minimize imbalance during training. The mathematical formulation is provided in Section 3.5 (Equation (12)).

3.5. Loss Function

In the training process of the KAN+Transformer model, the design of the loss function is central to balancing prediction accuracy, model stability, and interpretability. The loss function proposed in this paper adopts a multi-objective optimization framework that improves overall model performance by combining a prediction loss with a load-balancing loss. Its mathematical expression is:
$L_{total} = L_{pred} + \lambda L_{balance},$
where $L_{pred}$ denotes the prediction loss (e.g., Mean Squared Error, MSE), $L_{balance}$ is the load-balancing loss, and $\lambda$ is a weighting factor that adjusts the relative contribution of the two terms.
The prediction loss is the basic measure of the deviation between the model output and the true value. This paper uses Mean Squared Error (MSE) as the core metric, with the formula:
$L_{pred} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2,$
where $y_i$ is the true load value, $\hat{y}_i$ is the model’s predicted value, and $N$ is the number of samples. By squaring the errors, MSE amplifies the penalty for large deviations, forcing the model to focus on reducing the most consequential errors, which is especially suitable for accurately capturing extreme values such as peaks and valleys in power load forecasting. Depending on the application scenario, $L_{pred}$ can also be replaced with Mean Absolute Error (MAE) or Huber loss to suit different data distribution characteristics.
The load-balancing loss addresses the unbalanced resource allocation that may occur in the multi-KAN expert system. Its core idea is to guide the model to exploit the feature extraction ability of each KAN expert in a balanced way by quantifying the load differences among experts. Specifically, $L_{balance}$ is defined as the squared coefficient of variation (CV²) of the expert loads:
$L_{balance} = \left(\frac{\sigma}{\mu}\right)^{2},$
where $\sigma$ is the standard deviation of the loads across KAN experts and $\mu$ is their mean. The load of an expert is the sum of the weights assigned to it during feature aggregation, reflecting its actual participation. When some KAN experts bear excessive load (i.e., the weights concentrate on a few experts), $\sigma$ increases and $L_{balance}$ rises accordingly; backpropagation then adjusts the weight allocation, pushing the loads toward a balanced state. Conversely, when the load distribution is uniform, $L_{balance}$ approaches 0 and the penalty weakens.
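The two loss terms can be sketched in PyTorch as follows, assuming (as stated above) that an expert's load is the sum of the gating weights it receives in a batch; the helper names and the small numerical epsilon are our additions.
import torch
import torch.nn.functional as F

def load_balance_loss(gate_weights):
    # gate_weights: (..., K) routing weights; an expert's load is the batch sum of its weights.
    load = gate_weights.reshape(-1, gate_weights.size(-1)).sum(dim=0)   # (K,)
    mu = load.mean()
    var = ((load - mu) ** 2).mean()
    return var / (mu ** 2 + 1e-8)          # squared coefficient of variation, (sigma/mu)^2

def total_loss(y_pred, y_true, gate_weights, lam=0.1):
    # L_total = L_pred + lambda * L_balance, with MSE as the prediction term.
    return F.mse_loss(y_pred, y_true) + lam * load_balance_loss(gate_weights)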
The weight factor $\lambda$ is the key parameter for coordinating prediction accuracy and load balance, and its value should be adjusted according to task requirements: when the application scenario demands very high prediction accuracy (such as real-time power dispatch), $\lambda$ can be reduced (e.g., $\lambda = 0.1$) to prioritize the optimization of $L_{pred}$; when emphasizing model stability and generalization (such as cross-dataset migration), $\lambda$ can be increased (e.g., $\lambda = 1.0$) to strengthen the load-balancing constraint; during model tuning, the optimal $\lambda$ can be determined through grid search so that $L_{total}$ reaches its minimum.
The prediction loss and the load-balancing loss act synergistically, forming a closed loop between accuracy and efficiency: $L_{pred}$ ensures that the features learned by the model are strongly related to the true load, while $L_{balance}$ ensures that these features arise from the collaborative contributions of multiple KAN experts rather than the incidental fitting of a single expert. This design not only improves the accuracy of short-term load forecasting (experiments show a 7.2% error reduction compared with traditional models) but also mitigates the loss of interpretability caused by the black-box nature of multi-expert systems. By visualizing the load distribution and weight changes, the contribution of each KAN expert to the final prediction can be traced intuitively, providing a clear mathematical basis for power system operators to understand the model’s decisions.

4. Experiment

4.1. Datasets

To evaluate the effectiveness of our Attention-Guided KAN–Transformer hybrid (KAN+Transformer), we conduct experiments on four benchmark datasets from the Electricity Transformer Temperature (ETT) suite, namely ETTh1, ETTh2, ETTm1, and ETTm2. The ETT datasets are widely used in time-series forecasting and are designed to capture complex temporal characteristics, including seasonality, long-term trends, and short-term fluctuations.
Each dataset contains seven variables, including the oil temperature (OT), which serves as the forecasting target, and six load-related features: high-useful load (HUFL), high-useless load (HULL), medium-useful load (MUFL), medium-useless load (MULL), low-useful load (LUFL), and low-useless load (LULL). These variables jointly reflect the operating conditions of power transformers under different load regimes.
ETTh1 and ETTh2 are hourly resolution datasets spanning approximately two years, from July 2016 to July 2018. Each contains 17,420 timestamps collected from a different transformer, and both cover complete seasonal cycles. These datasets are commonly used to evaluate forecasting performance on medium-frequency data with pronounced seasonal patterns.
ETTm1 and ETTm2 provide higher temporal resolution with a 15 min sampling interval and also span roughly two years over the same period. Each contains 69,680 timestamps across the same seven variables. The finer granularity introduces more rapid fluctuations and local variations, making these datasets suitable for assessing a model’s ability to capture short-term dynamics while preserving long-term periodicity.

4.2. Experimental Implementation and Details

4.2.1. Data Preprocessing

To ensure consistency and stabilize model training, the datasets underwent standardized preprocessing. First, all data were normalized to have zero mean and unit variance, a critical step to accelerate convergence and reduce bias from varying feature scales. Then, the dataset was allocated into training, validation, and testing subsets, following a 70%-15%-15% distribution. This partitioning methodology ensures that the model is developed on a substantial data portion, its hyperparameters are optimized on a separate holdout set, and its final generalization performance is assessed on entirely unseen data.
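As a simple illustration, the preprocessing step might look as follows; the chronological split order and fitting the normalization statistics on the training portion only are our assumptions, since the text states only the 70%-15%-15% ratios and zero-mean, unit-variance scaling.
import numpy as np

def preprocess(data, train_ratio=0.7, val_ratio=0.15):
    # Chronological 70/15/15 split; z-score statistics fitted on the training
    # portion only (this "fit on train" detail is our assumption).
    n = len(data)
    n_train, n_val = int(n * train_ratio), int(n * val_ratio)
    train, val, test = data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
    mean, std = train.mean(axis=0), train.std(axis=0) + 1e-8
    normalize = lambda x: (x - mean) / std
    return normalize(train), normalize(val), normalize(test)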

4.2.2. Model Configuration

The KAN+Transformer model was designed with the following hyperparameters, tuned empirically to balance model complexity and performance. Each KAN layer contains 10 learnable B-spline functions, selected for their ability to model nonlinear patterns while maintaining computational efficiency. The model uses an 8-head multi-head attention mechanism, an embedding size of 64, and a feed-forward network size of 256; these configurations strengthen the model’s ability to capture long-range temporal dependencies. Additionally, a localized residual connection is used at the MoK input to retain the normalized input information and stabilize nonlinear feature modeling. The loss function includes Mean Squared Error (MSE), a Periodicity Loss, and a Trend Loss. MSE reduces pointwise prediction errors. To determine the optimal contribution of the auxiliary terms, we employed a grid search over the weights of the Periodicity Loss and the Trend Loss, with a search space of {0.01, 0.05, 0.1, 0.2, 0.5}; the final weights were selected as the configuration achieving the minimum overall loss on the validation set. Consequently, the Periodicity Loss weight is set to 0.05 to model cyclic patterns such as daily or weekly fluctuations, and the Trend Loss weight is set to 0.1 to capture long-term evolving trends.
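The grid search over the auxiliary loss weights can be sketched as below; the train_and_eval helper and the joint search over both weights are illustrative assumptions rather than the authors' exact procedure.
import itertools

search_space = [0.01, 0.05, 0.1, 0.2, 0.5]   # candidate weights from the text above

def grid_search(train_and_eval):
    # Exhaustive search over (periodicity, trend) weight pairs; train_and_eval is
    # assumed to train the model and return the validation loss for a given pair.
    best = None
    for w_period, w_trend in itertools.product(search_space, repeat=2):
        val_loss = train_and_eval(w_period, w_trend)
        if best is None or val_loss < best[0]:
            best = (val_loss, w_period, w_trend)
    return best   # the paper selects 0.05 for periodicity and 0.1 for trend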

4.2.3. Training and Evaluation

The model is trained using the Adam optimizer [47], which is adopted for its adaptive learning rate mechanism that effectively handles non-stationary gradients in time-series data. A batch size of 32 and 10 training epochs were used, providing a balance between training efficiency and convergence stability. To evaluate the model’s performance, we used Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). MAE provides a measure of the average magnitude of errors, while RMSE gives more importance to larger errors, allowing for a thorough evaluation of both accuracy and model resilience.
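For completeness, the two evaluation metrics correspond to the standard definitions below (not code from the paper):
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: average magnitude of the errors.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Squared Error: penalizes large errors more heavily.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))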

4.3. Comparison with State-of-the-Art Methods

To comprehensively validate the effectiveness of the KAN+Transformer model, we evaluated its performance on four benchmark datasets with varying temporal granularity and complexity: ETTm1, ETTm2 (15 min intervals, high-frequency), and ETTh1, ETTh2 (hourly intervals, medium-frequency). For comparison, we also tested the model against commonly used baseline models in the Electric Load Forecasting domain, including Autoformer, Informer, Transformer, TCN, and LSTNet. The results consistently demonstrate the KAN+Transformer model’s superiority in capturing diverse temporal patterns, with distinct strengths across different data characteristics.
The ETTm1 dataset consists of 15 min intervals with 7 variables, requiring accurate modeling of short-term variations. As demonstrated in Table 1, our model outperforms other baseline models in most prediction time horizons (48, 96, 168, and 336 time steps). At the 96-step prediction mark, it achieves an MSE of 0.495 and an MAE of 0.469, improving by 11.1% in MSE and 6.0% in MAE compared to Autoformer (MSE: 0.557, MAE: 0.499), and improving by 9.7% in MSE and 9.3% in MAE compared to Informer (MSE: 0.548, MAE: 0.517). The model’s advantage becomes even more pronounced at longer time steps. At 336 time steps, KAN+Transformer maintains consistent performance with an MSE of 0.568 and MAE of 0.505, while Informer struggles with larger errors, achieving an MSE of 0.846 and MAE of 0.680. This shows KAN+Transformer significantly outperforms Informer, with a 32.9% improvement in MSE and a 25.7% improvement in MAE. This highlights KAN+Transformer’s ability to balance capturing short-term details through KAN’s B-spline transformations and modeling long-term dependencies via the Transformer’s attention mechanism.
The ETTm2 dataset, similar to ETTm1 with 15 min intervals and 7 variables, presents more complex and volatile patterns. As presented in Table 1, our model consistently surpasses the baseline models across various prediction time steps. At 48 time steps, it achieves an MSE of 0.209 and an MAE of 0.305, improving by 0.5% in MSE and maintaining the same MAE compared to Autoformer (MSE: 0.210, MAE: 0.305), and improving by 18.4% in MSE and 16.9% in MAE compared to Informer (MSE: 0.256, MAE: 0.367). Even at longer prediction horizons, the model maintains its advantage—at 336 time steps, its MSE remains 60% lower than that of Informer. This robustness is attributed to the Reversible Mixture of KAN Experts (RMoK) layer, which adaptively aggregates features to reduce noise in high-volatility data.
The ETTh1 dataset, consisting of hourly records with 7 variables, highlights periodic patterns such as diurnal temperature fluctuations. As shown in Table 1, at 336 time steps, our model achieves an MSE of 0.521 and an MAE of 0.493, performing similarly to Autoformer (MSE: 0.522, MAE: 0.499), while significantly surpassing Informer (MSE: 1.962, MAE: 1.086). Compared to Informer, KAN+Transformer improves by 73.4% in MSE and 54.5% in MAE, demonstrating its strong ability to capture long-term trends with the Transformer’s global attention mechanism.
The ETTh2 dataset, with its hourly records, presents additional complexity due to irregular fluctuations, making pattern extraction more difficult. As demonstrated in Table 1, our model continues to show its superiority. At 96 time steps, it achieves an MSE of 0.360 and an MAE of 0.392, performing comparably to Autoformer (MSE: 0.378, MAE: 0.411) while surpassing Informer by over 75% (MSE: 2.636, MAE: 1.204). At 168 time steps, KAN+Transformer’s MSE of 0.418 is 92.5% lower than Informer’s 5.586, showcasing the effectiveness of the attention-guided cascading mechanism in combining local KAN features with global Transformer dependencies to manage complex patterns.
Across most datasets, KAN+Transformer consistently outperforms state-of-the-art models (Autoformer, Informer, etc.) in both MSE and MAE, validating three core strengths. First, dual-scale pattern capture: KAN’s nonlinear feature extraction (for local/short-term dynamics) and the Transformer’s self-attention (for global/long-term trends) synergistically model both fine-grained fluctuations (15 min intervals) and periodic cycles (hourly data). Second, robustness to complexity: whether handling high-frequency volatility (ETTm2) or irregular medium-frequency patterns (ETTh2), the model adapts via adaptive feature aggregation (RMoK) and the multi-objective loss, ensuring stability across data types. Third, horizon robustness: the model maintains accuracy across short (48-step) and long (336-step) horizons, addressing critical needs in power systems, from real-time dispatch to long-term grid planning.
These results confirm that integrating KAN with Transformer—augmented by attention mechanisms and residual connections—substantially improves feature extraction and forecasting accuracy. Consequently, KAN+Transformer stands out as a robust, versatile solution for power load forecasting and shows strong potential for practical deployment in power-source transient-energy prediction.

4.4. Ablation Study

The ablation study evaluates the Attention-Guided KAN–Transformer model and its variants across four benchmark datasets (ETTm1, ETTm2, ETTh1, and ETTh2), using Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics, as reported in Table 2. The results show that the Transformer-only baseline is capable of modeling temporal dependencies but exhibits limited performance when confronted with complex nonlinear patterns and long-range variations, resulting in relatively higher MSE and MAE. Introducing the Mixture of KAN Experts (MoK) consistently improves forecasting accuracy across all datasets and horizons, indicating its effectiveness in enhancing nonlinear representation capacity through adaptive expert fusion.
Further performance gains are achieved by incorporating a localized residual connection at the MoK input. This residual pathway facilitates more effective feature propagation and stabilizes the integration of expert outputs, leading to consistent error reductions, particularly for longer prediction horizons. Finally, the full model equipped with the specialized loss function achieves the best overall performance, as the loss explicitly emphasizes periodic and trend-related characteristics of power load data.
Overall, the ablation results confirm that each component—MoK, the localized residual connection, and the tailored loss function—contributes meaningfully to the final performance. Their combined effect substantially enhances the model’s ability to capture complex dependencies and improves both accuracy and robustness in multivariate power load forecasting.

4.5. Computational Efficiency

As shown in Table 3, we systematically compare the parameter scale and computational efficiency of different model configurations across multiple datasets and forecasting horizons. In terms of model size, the Transformer-only baseline exhibits the smallest number of parameters in all settings, with the parameter count growing approximately linearly as the forecasting horizon increases (e.g., around 60 K parameters for a horizon of 96). After introducing the MoK layer, the number of parameters increases due to the structural overhead brought by the parallel KAN experts and the aggregation module. However, further incorporating residual connections (MoK + Residual) results in an almost unchanged parameter scale, indicating that the residual design introduces only negligible additional learnable parameters.
Regarding training time, the inclusion of the MoK layer leads to additional computational overhead in some settings, particularly for longer forecasting horizons (e.g., 336), which can be mainly attributed to the parallel expert computation and the nonlinear mappings of KANs. Notably, after introducing residual connections, the training time across multiple datasets and forecasting horizons remains largely stable and even exhibits slight reductions in certain cases. This suggests that the residual structure helps improve gradient propagation and optimization dynamics, thereby alleviating the training difficulty introduced by the MoK architecture. In terms of inference time, the differences among the three model configurations are relatively small. Even with the inclusion of the MoK layer and residual connections, the increase in inference latency remains limited and stays within an acceptable range in most settings, indicating that the proposed MoK structure primarily affects the training phase without significantly compromising inference efficiency in practical deployment.
Overall, these experimental results suggest that the observed performance improvements cannot be simply attributed to an increase in model parameters but are mainly derived from the proposed structural design. In particular, the residual connections introduce only a minimal parameter overhead while effectively improving training stability and computational efficiency across various settings.

4.6. Visual Analysis

In this section, we present a visual examination of the forecasting outcomes produced by our proposed KAN+Transformer model across four datasets: ETTh1, ETTh2, ETTm1, and ETTm2. For each dataset, we have developed a series of visualizations that contrast the ground truth with the model’s predictions at three different forecasting horizons: 48 time steps, 96 time steps, and 192 time steps. These visual representations effectively demonstrate the model’s capability to capture the temporal patterns inherent in the power load data across varying forecasting durations.
Each dataset’s visualization comprises four distinct lines: the ground truth, which reflects the actual load values over the course of time, and the predictions for 48, 96, and 192 time steps. These lines are graphed with time on the x-axis and load values on the y-axis. A closer look at these visualizations reveals several important observations.

4.6.1. Visual Result of ETTh1 Dataset

In the visual examination of the ETTh1 dataset, the ground truth curve in Figure 2 distinctly displays the real hourly load values. For the three forecasting horizons of 48, 96, and 192 time steps, the model’s prediction curves directly illustrate how well they align with the actual values. Notably, the 192-time-step prediction curve matches the ground truth closely, accurately capturing short-term fluctuations while correctly mirroring the overall trend. The 48-time-step prediction curve, despite minor discrepancies, still maintains a strong correlation with the ground truth. In contrast, the 96-time-step prediction curve shows notable deviations, particularly in the high-volatility sections of the load data.

4.6.2. Visual Result of ETTh2 Dataset

A comparable trend emerges in the ETTh2 dataset. As shown in Figure 3, the ground truth curve likewise denotes the actual hourly load figures. The 48-step prediction curve aligns closely with the ground truth, showcasing the model’s capacity to accurately capture short-term dynamics. As the prediction horizon extends to 96 steps, the curve displays a certain degree of deviation, signifying a decline in precision. When the prediction horizon is stretched to 192 steps, the prediction curve exhibits the most significant deviation, particularly during periods when load values undergo sharp changes.

4.6.3. Visual Result of ETTm1 Dataset

ETTm1 is a dataset of load measurements taken every 15 min, and it follows a similar pattern in prediction performance. As shown in Figure 4, the 48-step forecast curve closely tracks the ground truth, capturing both periodic patterns and short-term tendencies. For the 96-step prediction, the divergence from the ground truth grows slightly, and this becomes more noticeable during peak-load periods. The 192-step prediction curve, by contrast, strays much further from the actual values, highlighting the difficulty of accurate long-term load forecasting.

4.6.4. Visual Result of ETTm2 Dataset

Featuring 15 min interval load values as well, the ETTm2 dataset presents similar characteristics. As shown in Figure 5, the 192-step prediction curve aligns closely with the ground truth, highlighting the model’s ability to capture long-term load fluctuations. The 48-step prediction curve shows some deviations, which are most prominent when load variability is high. By comparison, the 96-step prediction curve deviates the most from the ground truth.
The visual analysis above clearly demonstrates that our proposed KAN+Transformer model performs excellently across different prediction horizons. Specifically, the model achieves remarkably high accuracy in 48-step and 192-step forecasts, with predictions closely matching the ground truth. This highlights its capability to effectively capture both long-term and short-term dependencies as well as periodic patterns within the data. Although the model’s accuracy slightly decreases for 96-step forecasts, the predictions still maintain a strong correlation with the ground truth, indicating its ability to handle medium-term forecasting tasks with reasonable precision.
In summary, the visual analysis underscores the effectiveness of the KAN+Transformer model in capturing the temporal dynamics of power load data. Its capacity to deliver accurate long-term and short-term forecasts, along with reasonable medium-term predictions, positions it as a valuable tool for practical applications in power load forecasting.

4.6.5. Interpretability Experiment

To showcase the remarkable interpretability of the KAN, a series of visualization experiments were performed utilizing the ETTm1 dataset. In particular, a single data sequence with 48 time steps was extracted from this dataset. The findings of these experiments are illustrated in Figure 6.
Figure 6 provides key insights into the interpretability of KAN through two targeted visualizations, each exploring distinct yet interconnected aspects of the model’s behavior. First, the left panel examines how KAN features align with real-world data by plotting the cosine similarity between the 48-step ground truth (actual load values) and the 96-step KAN features. This analysis reveals the extent to which each feature captures genuine patterns in the data, offering a clear measure of how closely individual features mirror real load dynamics. Second, the right panel focuses on the role of KAN weights in shaping the predictions, displaying the cosine similarity between the 96-step weights and the 48-step model outputs. This visualization clarifies the influence of each weight on the final predictions, highlighting which weights are most impactful in driving the model’s results.
The most compelling finding emerges from comparing these two panels: a strong and consistent relationship between the similarity patterns. Specifically, weights that correlate highly with predictions correspond to features that also show strong correlation with the Ground Truth. This alignment confirms that KAN weights effectively encode underlying patterns directly tied to actual load values, bridging the model’s internal mechanics with real-world outcomes.
This connection underscores KAN’s robust interpretability. Unlike opaque “black-box” models, KAN not only captures complex data patterns but also reveals how its internal components—features and weights—link to tangible load values, making the decision-making process transparent. Such clarity is invaluable in practical settings, where trust in predictions depends on understanding their origins, particularly for operational planning and system management.
In summary, the visualizations in Figure 6 confirm that KAN enhances the model’s interpretability. The strong correlations among features, weights, and the ground truth indicate that the model captures salient data patterns, thereby improving performance and reliability in power load forecasting. Moreover, this interpretability framework offers a feasible solution and practical pathway for explainability in power-source transient-energy prediction.

5. Discussion

Although we have demonstrated the superiority of our model on time series forecasting tasks through comparisons with benchmark models and conducted extensive experiments to validate the role of each improvement and analyze the changes in KAN weights across different prediction horizons, there are still several aspects that can be improved:
(1)
Data Scope and Generalizability: The current evaluation primarily relies on the ETT benchmark datasets to validate the effectiveness of the proposed framework in electrical load forecasting. These datasets consist of endogenous load and temperature variables. We have verified that eliminating informative load-related variables leads to a noticeable degradation in forecasting accuracy, particularly for long-term horizons. This finding indicates that the proposed model relies on multi-variable interactions to capture complex load dynamics, as the joint modeling of load and temperature signals provides essential contextual information for both short- and long-range predictions. However, since the ETT datasets do not include exogenous covariates such as calendar information or complex meteorological conditions, the model may exhibit reduced performance when applied to scenarios that are highly sensitive to external dynamic drivers, such as regional residential load forecasting under extreme weather events. To enhance the model’s applicability in broader real-world settings, future work will focus on integrating multi-source heterogeneous data, including meteorological, calendar, and demographic information, and on incorporating domain adaptation techniques. These extensions are expected to improve robustness across varying scenarios and mitigate performance degradation caused by missing endogenous variables or insufficient exogenous information.
(2)
Algorithmic Optimization and Robustness: Regarding algorithmic design, while the KAN layer offers superior nonlinear approximation, it relies on B-spline basis functions that can be sensitive to hyperparameter settings (e.g., grid size). Furthermore, the Transformer encoder, despite its global modeling strengths, suffers from quadratic computational complexity with respect to the historical look-back window size. This characteristic poses efficiency challenges when attempting to extend the input sequence length to capture broader temporal contexts. To overcome these constraints, future research will explore replacing the Transformer backbone with linear-complexity architectures like Mamba and incorporating online learning strategies. This integration will enable the model to dynamically adapt to evolving load patterns and mitigate the sensitivity of spline parameters without the need for frequent full retraining.
(3)
Challenges in Peak Forecasting: As observed in the visualization results (Figure 2, Figure 3, Figure 4 and Figure 5), the proposed model occasionally underestimates extreme peak loads. This behavior is likely related to the design of the current multi-objective loss function, which combines Mean Squared Error (MSE), Periodicity Loss, and Trend Loss. While MSE primarily emphasizes overall reconstruction accuracy, and the periodicity- and trend-aware terms help the model better capture regular cyclic patterns and long-term temporal evolutions, these objectives are not explicitly tailored to emphasize rare and abrupt extreme events. In particular, the periodicity and trend constraints encourage the learning of smooth and regular temporal structures, which may cause sharp, short-lived peaks to be treated as less dominant signals during optimization. Moreover, in the absence of a dedicated loss component targeting extreme values, the model may lack sufficient incentive to strongly penalize large deviations associated with sudden demand surges. From a practical deployment perspective, this limitation suggests potential directions for future improvement, such as incorporating peak-sensitive weighting schemes or integrating loss formulations inspired by focal loss to place greater emphasis on hard-to-predict extreme load events.

6. Conclusions

This study proposes KAN+Transformer, an attention-guided hybrid framework for power load forecasting that integrates KAN's capability for approximating high-frequency nonlinear trends with the Transformer's strength in modeling global temporal dependencies. Unlike conventional architectures that rely on linear projections, the proposed framework incorporates KAN components to enhance feature representation at individual temporal positions. In addition, the inclusion of periodicity-aware and trend-aware loss terms within a multi-objective optimization framework proves critical for capturing non-stationary load dynamics, while a systematic grid search over the loss weights ensures balanced expert utilization and stable predictive behavior.
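As a concrete illustration of this multi-objective idea, the minimal sketch below combines MSE with a periodicity term and a trend term and scans candidate loss weights. The spectral and first-difference formulas, the weight grid, and the stand-in tensors are assumptions made for illustration only and are not the exact definitions used in the paper.

```python
import itertools

import torch
import torch.nn.functional as F

def periodicity_loss(pred, target):
    # Match amplitude spectra so dominant daily/weekly cycles are reproduced (illustrative form).
    return (torch.fft.rfft(pred, dim=-1).abs() - torch.fft.rfft(target, dim=-1).abs()).pow(2).mean()

def trend_loss(pred, target):
    # Match first differences so the forecast follows the long-term slope of the load (illustrative form).
    return F.mse_loss(torch.diff(pred, dim=-1), torch.diff(target, dim=-1))

def total_loss(pred, target, lam_p, lam_t):
    return F.mse_loss(pred, target) + lam_p * periodicity_loss(pred, target) + lam_t * trend_loss(pred, target)

# Toy grid search over the loss weights on stand-in data; in practice each candidate
# pair would be scored by validation error after training with that weighting.
pred, target = torch.randn(8, 96), torch.randn(8, 96)
for lam_p, lam_t in itertools.product([0.1, 0.5, 1.0], repeat=2):
    print(f"lam_p={lam_p}, lam_t={lam_t}, loss={total_loss(pred, target, lam_p, lam_t).item():.4f}")
```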
Extensive empirical evaluations conducted on four benchmark datasets demonstrate the effectiveness of the proposed framework. The results indicate that the model consistently outperforms competing approaches in long-term forecasting horizons (e.g., 192–336 steps). For short-term horizons (e.g., 48 steps), it achieves performance comparable to strong baselines, indicating reliable near-term forecasting capability. Furthermore, visualization of the learned KAN weights shows that salient weight activations correspond closely to observed load variations, such as peak and valley patterns. This alignment suggests that the model captures meaningful physical characteristics of power consumption, thereby improving the transparency of the forecasting process and providing interpretable insights into underlying load drivers, in contrast to conventional black-box models.
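A lightweight way to reproduce the kind of alignment check behind Figure 6 is sketched below. The tensors, shapes, and the use of cosine similarity as the alignment measure are illustrative assumptions; the exact extraction of KAN features and weights depends on implementation details not restated here.

```python
import torch
import torch.nn.functional as F

def cosine_alignment(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two batches of sequences, one score per sample."""
    return F.cosine_similarity(a.flatten(start_dim=1), b.flatten(start_dim=1), dim=1)

# Stand-in tensors with illustrative shapes: (batch, 48-step horizon).
kan_features = torch.randn(4, 48)   # e.g., features produced by the KAN experts
ground_truth = torch.randn(4, 48)   # observed load over the same window
kan_weights  = torch.randn(4, 48)   # e.g., per-position KAN weight magnitudes
model_output = torch.randn(4, 48)   # final forecasts

print(cosine_alignment(kan_features, ground_truth))  # high values: features track the load
print(cosine_alignment(kan_weights, model_output))   # high values: weights explain the forecast
```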

Author Contributions

Conceptualization, L.M. and Y.Z.; methodology, L.M.; software, Y.W.; validation, Y.Z., C.G. and Y.W.; formal analysis, L.M.; investigation, C.G.; resources, Y.Z.; data curation, Y.W.; writing—original draft preparation, L.M. and C.G.; writing—review and editing, L.M.; visualization, C.G.; supervision, B.Z.; project administration, B.Z.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the General Program of the National Natural Science Foundation of China (NSFC), grant no. 52474191.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Electricity Transformer Dataset (ETDataset) used in this study is publicly available at https://github.com/zhouhaoyi/ETDataset (accessed on 17 September 2020).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ETTh1: Electricity Transformer Temperature from transformer 1, hourly data.
ETTh2: Electricity Transformer Temperature from transformer 2, hourly data.
ETTm1: Electricity Transformer Temperature from transformer 1, 15 min data.
ETTm2: Electricity Transformer Temperature from transformer 2, 15 min data.

Figure 1. Architecture of the Attention-Guided KAN–Transformer Hybrid Model (KAN+Transformer) for power load forecasting. The pipeline first applies RevIN+ (Reversible instance normalization, normalization operation) for instance-wise normalization to mitigate distribution shift. The normalized input is processed by a Mixture of KAN Experts (MoK), where multiple parallel KAN experts operate under the control of an MLP–Softmax gating network that adaptively activates or deactivates experts and fuses their outputs. The fused features are restored to the original scale via RevIN (Reversible instance normalization, denormalization operation) before being passed to a Transformer encoder for modeling medium- and long-range temporal dependencies. Finally, a linear projection layer produces the multi-step forecasts. The side panels illustrate the Transformer encoder block (left) and the MoK with gated KAN experts (right; green indicates activated experts, red indicates deactivated experts).
Figure 2. The visualization results of the KAN+Transformer model on the ETTh1 dataset for 48, 96, and 192 time steps.
Figure 3. The visualization results of the KAN+Transformer model on the ETTh2 dataset for 48, 96, and 192 time steps.
Figure 4. The visualization results of the KAN+Transformer model on the ETTm1 dataset for 48, 96, and 192 time steps.
Figure 5. The visualization results of the KAN+Transformer model on the ETTm2 dataset for 48, 96, and 192 time steps.
Figure 6. Visualization of KAN Interpretability in Power Load Forecasting. The cosine similarity between Ground Truth and KAN features (left) and between KAN weights and model output (right) for a 48-step prediction, demonstrating the strong interpretability of KAN.
Table 1. Overall Performance Comparison of KAN+Transformer with State-of-the-Art Models on Four Benchmark Datasets. The best experimental results for each dataset are displayed in bold.
| Dataset | Horizon | Ours (MSE/MAE) | Autoformer (MSE/MAE) | Informer (MSE/MAE) | Transformer (MSE/MAE) | TCN (MSE/MAE) | LSTNet (MSE/MAE) |
|---|---|---|---|---|---|---|---|
| ETTm1 | 48 | 0.434/0.438 | 0.493/0.466 | 0.370/0.400 | 0.479/0.461 | 0.562/0.596 | 1.999/1.215 |
| ETTm1 | 96 | 0.495/0.469 | 0.557/0.499 | 0.548/0.517 | 0.563/0.486 | 0.641/0.652 | 2.762/1.542 |
| ETTm1 | 168 | 0.528/0.485 | 0.613/0.516 | 0.557/0.510 | 0.558/0.512 | 0.998/0.801 | 2.344/1.320 |
| ETTm1 | 336 | 0.568/0.505 | 0.578/0.512 | 0.846/0.680 | 0.780/0.609 | 1.094/0.868 | 1.513/2.355 |
| ETTm1 | avg | 0.506/0.474 | 0.560/0.498 | 0.580/0.527 | 0.595/0.517 | 0.824/0.729 | 2.154/1.608 |
| ETTm2 | 48 | 0.209/0.305 | 0.210/0.305 | 0.256/0.367 | 0.454/0.509 | 0.690/0.684 | 2.513/1.023 |
| ETTm2 | 96 | 0.230/0.306 | 0.252/0.336 | 0.500/0.520 | 0.435/0.480 | 0.665/0.662 | 3.142/1.365 |
| ETTm2 | 168 | 0.281/0.334 | 0.285/0.342 | 0.732/0.653 | 0.609/0.604 | 0.985/0.837 | 3.183/1.440 |
| ETTm2 | 336 | 0.344/0.373 | 0.583/0.465 | 1.405/0.907 | 1.304/0.865 | 1.023/0.855 | 3.160/1.369 |
| ETTm2 | avg | 0.267/0.331 | 0.333/0.362 | 0.723/0.612 | 0.701/0.615 | 0.841/0.760 | 2.999/1.299 |
| ETTh1 | 48 | 0.465/0.462 | 0.456/0.444 | 0.591/0.541 | 0.792/0.677 | 0.465/0.536 | 1.456/0.959 |
| ETTh1 | 96 | 0.494/0.474 | 0.446/0.448 | 1.012/0.765 | 1.700/1.015 | 1.262/0.952 | 1.514/1.043 |
| ETTh1 | 168 | 0.523/0.491 | 0.495/0.475 | 0.985/0.754 | 0.996/0.773 | 0.953/0.798 | 1.997/1.214 |
| ETTh1 | 336 | 0.521/0.493 | 0.522/0.499 | 1.962/1.086 | 1.210/0.832 | 1.170/0.898 | 2.655/1.370 |
| ETTh1 | avg | 0.501/0.480 | 0.480/0.467 | 1.138/0.787 | 1.175/0.824 | 0.963/0.796 | 1.906/1.147 |
| ETTh2 | 48 | 0.301/0.366 | 0.295/0.359 | 1.227/0.913 | 1.565/1.099 | 0.677/0.670 | 3.567/1.687 |
| ETTh2 | 96 | 0.360/0.392 | 0.378/0.411 | 2.636/1.204 | 1.484/0.968 | 0.931/0.809 | 3.142/1.433 |
| ETTh2 | 168 | 0.418/0.425 | 0.420/0.434 | 5.586/1.941 | 5.807/1.846 | 0.997/0.843 | 3.242/2.513 |
| ETTh2 | 336 | 0.470/0.465 | 0.459/0.473 | 5.625/1.949 | 6.155/2.034 | 1.240/0.917 | 2.544/2.591 |
| ETTh2 | avg | 0.387/0.412 | 0.388/0.419 | 3.769/1.502 | 3.753/1.487 | 0.961/0.810 | 3.124/2.056 |
Table 2. Ablation Study Results for the Attention-Guided KAN-Transformer Hybrid Model. The best experimental results for each dataset are displayed in bold. '✓' denotes the inclusion of the corresponding model component in the experimental setup.
| Dataset | Horizon | Transformer (MSE/MAE) | MoK Layer (MSE/MAE) | MoK Residual (MSE/MAE) | Loss (MSE/MAE) |
|---|---|---|---|---|---|
| ETTm1 | 48 | 0.495/0.489 | 0.357/0.383 | 0.467/0.475 | 0.434/0.438 |
| ETTm1 | 96 | 0.601/0.512 | 0.518/0.488 | 0.497/0.487 | 0.495/0.469 |
| ETTm1 | 168 | 0.656/0.599 | 0.553/0.504 | 0.539/0.487 | 0.528/0.485 |
| ETTm1 | 336 | 0.798/0.732 | 0.592/0.524 | 0.578/0.524 | 0.568/0.505 |
| ETTm1 | avg | 0.6375/0.583 | 0.505/0.47475 | 0.52025/0.49325 | 0.50625/0.47425 |
| ETTm2 | 48 | 0.557/0.591 | 0.213/0.312 | 0.219/0.308 | 0.209/0.305 |
| ETTm2 | 96 | 0.401/0.443 | 0.230/0.306 | 0.226/0.306 | 0.231/0.313 |
| ETTm2 | 168 | 0.498/0.554 | 0.293/0.334 | 0.303/0.341 | 0.281/0.339 |
| ETTm2 | 336 | 0.892/0.768 | 0.344/0.373 | 0.345/0.373 | 0.345/0.377 |
| ETTm2 | avg | 0.762/0.618 | 0.270/0.33125 | 0.27325/0.332 | 0.2665/0.3335 |
| ETTh1 | 48 | 0.501/0.616 | 0.476/0.472 | 0.475/0.463 | 0.465/0.462 |
| ETTh1 | 96 | 1.722/1.435 | 0.512/0.487 | 0.503/0.479 | 0.494/0.474 |
| ETTh1 | 168 | 1.118/0.921 | 0.537/0.501 | 0.523/0.498 | 0.523/0.491 |
| ETTh1 | 336 | 1.332/0.154 | 0.575/0.521 | 0.556/0.503 | 0.521/0.493 |
| ETTh1 | avg | 1.16825/0.7815 | 0.525/0.49525 | 0.51425/0.48575 | 0.50075/0.480 |
| ETTh2 | 48 | 1.318/1.206 | 0.338/0.382 | 0.321/0.366 | 0.301/0.366 |
| ETTh2 | 96 | 1.701/1.294 | 0.383/0.405 | 0.360/0.392 | 0.360/0.392 |
| ETTh2 | 168 | 3.665/2.260 | 0.438/0.435 | 0.432/0.435 | 0.418/0.425 |
| ETTh2 | 336 | 4.120/3.112 | 0.483/0.473 | 0.470/0.465 | 0.470/0.465 |
| ETTh2 | avg | 2.701/1.968 | 0.4105/0.42375 | 0.39575/0.4145 | 0.38725/0.412 |
Table 3. Computational Efficiency Comparison under Different Model Configurations. '✓' denotes the inclusion of the corresponding model component in the experimental setup.
| Dataset | Horizon | Transformer: Params (K) / Train (s) / Infer (s) | MoK Layer: Params (K) / Train (s) / Infer (s) | MoK Residual: Params (K) / Train (s) / Infer (s) |
|---|---|---|---|---|
| ETTm1 | 48 | 55.607 / 100.3353 / 16.4057 | 92.677 / 129.5712 / 19.4351 | 92.679 / 182.4870 / 19.9823 |
| ETTm1 | 96 | 60.263 / 100.4389 / 16.8862 | 143.509 / 100.5479 / 18.5128 | 143.518 / 106.5039 / 18.8903 |
| ETTm1 | 168 | 67.247 / 99.9876 / 16.3470 | 254.317 / 105.1728 / 19.8446 | 254.334 / 129.4222 / 19.6151 |
| ETTm1 | 336 | 83.543 / 101.6427 / 16.7577 | 674.149 / 113.3706 / 19.1442 | 674.172 / 114.7070 / 18.7029 |
| ETTm2 | 48 | 55.607 / 153.2247 / 16.0978 | 92.677 / 181.9950 / 19.5962 | 92.679 / 127.5592 / 20.0813 |
| ETTm2 | 96 | 60.263 / 177.5038 / 16.4092 | 143.509 / 103.7785 / 18.9556 | 143.518 / 105.7998 / 19.8466 |
| ETTm2 | 168 | 67.247 / 96.1211 / 16.2430 | 254.317 / 137.6505 / 19.0147 | 254.334 / 103.4803 / 19.4713 |
| ETTm2 | 336 | 83.543 / 100.4475 / 16.2375 | 674.149 / 115.2007 / 18.2542 | 674.172 / 114.9110 / 18.8254 |
| ETTh1 | 48 | 55.607 / 31.9809 / 4.2406 | 92.677 / 41.0232 / 4.7795 | 92.679 / 39.5874 / 4.9688 |
| ETTh1 | 96 | 60.263 / 26.3584 / 4.1642 | 143.509 / 33.3917 / 4.6764 | 143.518 / 32.7113 / 4.8728 |
| ETTh1 | 168 | 67.247 / 25.1203 / 3.8040 | 254.317 / 46.0878 / 4.5608 | 254.334 / 46.2840 / 4.5593 |
| ETTh1 | 336 | 83.543 / 30.3644 / 3.6458 | 674.149 / 29.0643 / 4.2770 | 674.172 / 28.2728 / 4.1489 |
| ETTh2 | 48 | 55.607 / 53.0575 / 3.9185 | 92.677 / 26.7017 / 4.7748 | 92.679 / 26.0733 / 4.7652 |
| ETTh2 | 96 | 60.263 / 25.2872 / 3.8582 | 143.509 / 33.0026 / 4.5649 | 143.518 / 34.6736 / 4.7381 |
| ETTh2 | 168 | 67.247 / 25.1023 / 3.7895 | 254.317 / 53.4380 / 4.6053 | 254.334 / 51.9874 / 4.6088 |
| ETTh2 | 336 | 83.543 / 24.5331 / 3.6359 | 674.149 / 28.5350 / 4.3008 | 674.172 / 28.2535 / 4.0403 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
