Article

LLM-Empowered Kolmogorov-Arnold Frequency Learning for Time Series Forecasting in Power Systems

1 School of Information and Engineering, Shenyang University of Technology, Shenyang 110870, China
2 College of Information, Shenyang Institute of Engineering, Shenyang 110135, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3149; https://doi.org/10.3390/math13193149
Submission received: 24 August 2025 / Revised: 12 September 2025 / Accepted: 30 September 2025 / Published: 2 October 2025
(This article belongs to the Special Issue Artificial Intelligence and Data Science, 2nd Edition)

Abstract

With the rapid evolution of artificial intelligence technologies in power systems, data-driven time-series forecasting has become instrumental in enhancing the stability and reliability of power systems, allowing operators to anticipate demand fluctuations and optimize energy distribution. Despite the notable progress made by current methods, they are still hindered by two major limitations: most existing models are relatively small in architecture, failing to fully leverage the potential of large-scale models, and they are based on fixed nonlinear mapping functions that cannot adequately capture complex patterns, leading to information loss. To this end, an LLM-Empowered Kolmogorov–Arnold frequency learning (LKFL) method is proposed for time series forecasting in power systems, which consists of LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion. Specifically, LKFL first transforms multivariable time-series data into text prompts and leverages a pre-trained LLM to extract semantically rich prompt representations. It then applies the Fast Fourier Transform to convert the time-series data into the frequency domain and employs Kolmogorov–Arnold networks (KANs) to capture multi-scale periodic structures and complex frequency characteristics. Finally, LKFL integrates the prompt and frequency representations through an entropy-oriented cross-modal fusion strategy, which minimizes the semantic gap between different modalities and ensures full integration of complementary information. This comprehensive approach enables LKFL to achieve superior forecasting performance in power systems. Extensive evaluations on five benchmarks verify that LKFL sets a new standard for time-series forecasting in power systems compared with baseline methods.

1. Introduction

In contemporary society, as a nation's critical infrastructure, power systems are undergoing continuous expansion in scale and complexity [1,2]. Encompassing vast domains—from large-scale power stations and intricate grids to countless end-users—the extensive scope and intricate composition of these systems make time-series forecasting increasingly vital. Accurate forecasting is indispensable for the efficient operation of power systems. For daily operations, power companies rely on forecasted electricity demand to rationally schedule generator units, determining their startup/shutdown times and output levels to minimize operational costs. Grid operators depend on accurate forecasts to monitor grid status, proactively identify potential failure points, ensure safe and stable grid operation, and prevent large-scale blackouts. Simultaneously, for policymakers, long-term power time-series forecasts enable rational planning of power resource development and allocation, facilitate the formulation of scientific energy policies, and guide the sustainable development of the power industry [3,4,5]. Historically, power system time-series forecasting primarily employed traditional statistical methods, such as autoregression (AR) and moving averages (MA). These methods, leveraging their mature theoretical foundations and relatively simple model structures, could achieve reasonable forecasting performance when handling data with pronounced linear relationships. However, as power systems evolved, data inevitably exhibited significant nonlinear characteristics—such as electricity demand being influenced by the complex interplay of factors like temperature, holidays, and industrial production cycles. Traditional statistical methods proved inadequate in capturing these nonlinear relationships [6]. Furthermore, the dimensionality of data generated by power systems continues to increase. Traditional methods often require cumbersome feature selection and data preprocessing for high-dimensional data and impose demanding smoothness requirements on the data. Any data volatility or anomalies significantly degrade prediction accuracy. These limitations substantially constrain their effectiveness in modern power system time-series forecasting [7,8,9].
In recent years, machine learning methods have emerged to address the limitations of traditional statistical approaches in power system time-series prediction. Support vector machines (SVMs) have shown effectiveness in some small-scale prediction problems by mapping data to high-dimensional spaces and finding optimal classification hyperplanes [10,11]. AdaBoost, an adaptive boosting algorithm, enhances prediction accuracy by combining weak learners and focusing on previously misclassified samples. However, these machine learning methods often have high computational complexity, making their application to large-scale power system data time-consuming and resource-intensive. With the rise of deep learning, new opportunities have arisen for power system time-series prediction. Recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent units (GRUs) have gained attention [12,13,14,15]. RNNs model time dependencies in sequence data through their internal loops [16]. LSTMs improve upon RNNs with gating mechanisms to capture long-term dependencies [17], and GRUs further simplify the structure while retaining the ability to model long-term relationships [18]. These deep learning models excel at automatic feature extraction and efficiently handle nonlinear and high-dimensional data, bringing new breakthroughs to power system time-series prediction [19,20].
Despite the superior predictive performance of current deep learning methods in power system time-series prediction, several issues remain. First, most existing deep learning models have relatively small architectures and have not fully leveraged the potential of large-scale models. Power system data contains complex patterns and relationships that small-scale models, with limited parameters and simple neural connections, can only partially capture. This restricts further improvements in predictive accuracy. Second, current power system time-series prediction methods are predominantly based on multi-layer perceptrons (MLPs) [21,22,23]. Although MLPs can approximate complex function mappings through multi-layer neural networks with nonlinear transformations, these nonlinear transformations are often based on fixed activation functions like ReLU or sigmoid. Such fixed nonlinear mappings limit the model’s ability to dynamically adapt to the intricate and variable patterns present in power system data, resulting in suboptimal feature extraction and underfitting of the complex relationships within the data. This inflexibility makes it difficult for MLPs to fully capture the rich temporal dynamics and multi-scale characteristics inherent in power system time-series data, ultimately restricting their prediction accuracy and reliability.
To this end, an LLM-Empowered Kolmogorov–Arnold frequency learning (LKFL) method is proposed for time series forecasting in power systems, which consists of LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion. Specifically, LKFL constructs LLM-based prompt representation learning by transforming time-series data into text prompts and leveraging a pre-trained LLM to extract semantically rich representations. This approach captures complex patterns and relationships within the data that are difficult to identify with traditional methods, enabling the model to better understand the underlying structure of power system data and providing a more profound contextual understanding. Moreover, LKFL builds KAN-based frequency representation learning by applying the Fast Fourier Transform to convert time-series data into the frequency domain and employing Kolmogorov–Arnold networks to capture multi-scale periodic structures and complex frequency characteristics. Simultaneously, LKFL constructs entropy-oriented cross-modal fusion to bridge the semantic gap between different representation modalities. By minimizing the information entropy of frequency and prompt representations under fusion conditions, the model ensures that complementary information from both modalities is fully integrated. This fusion process, guided by information entropy minimization, enables the model to make more informed and accurate predictions by leveraging the strengths of both representations. Extensive evaluations on five benchmarks verify that LKFL sets a new standard for time-series forecasting in power systems compared with baseline methods.
The main contributions of our work are threefold:
  • We introduce a pioneering framework that transforms multivariable time-series data into structured text prompts and leverages pre-trained Large Language Models (LLMs) to extract semantically rich representations. This approach captures complex, contextually nuanced patterns within power system data that are difficult to discern with conventional methods.
  • We propose a novel frequency-domain learning module utilizing Kolmogorov–Arnold Networks. By applying Fast Fourier Transform and employing KANs with learnable activation functions, this component effectively captures multi-scale periodic structures and intricate frequency characteristics inherent in power system time-series data, overcoming the limitations of fixed-activation MLPs.
  • We develop an entropy-minimization strategy for cross-modal fusion to bridge the semantic gap between prompt-based (semantic) and frequency-based representations. This theoretically grounded approach ensures the full integration of complementary information from both modalities, significantly enhancing the model’s ability to leverage diverse data characteristics for accurate forecasting.

2. The Proposed Method

Given multivariable time-series data $X \in \mathbb{R}^{M \times L}$ with $L$ historical time steps in power systems, where $M$ denotes the number of variables, the time series forecasting task aims to learn a mapping function $f(\cdot)$ that predicts the future time-series values $Y \in \mathbb{R}^{M \times T}$ over $T$ time steps, i.e., $f: X \rightarrow Y$. To achieve this objective, this paper proposes the LLM-empowered Kolmogorov–Arnold frequency learning method for time series forecasting in power systems, which contains LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion (see Figure 1).

2.1. LLM-Based Prompt Representation Learning

LLM-based prompt representation learning aims to leverage an LLM [24,25], pre-trained on vast amounts of data, as a powerful feature extractor to learn robust representations from time-series data. To achieve this, it utilizes prompt learning to transform the time-series data into a text description and then extracts latent representations via a prompt encoder after text tokenization.
Specifically, given multivariable time-series data $X \in \mathbb{R}^{M \times L}$, we transform it into the prompt $P = \{p_1, p_2, \ldots, p_M\} \in \mathbb{R}^{M \times S}$, where $p_i$ denotes the prompt corresponding to the $i$-th variable, containing both words and time-series values, e.g., "from $t_1$ to $t_L$, the values are $x_i^1, \ldots, x_i^L$ every 5 min, and the total trend value is $\Delta T = \sum_{t=1}^{L-1}(x_i^{t+1} - x_i^t)$". Then, we utilize a non-overlapping tokenization mechanism to reshape the prompt into a sequence of tokens $\{g_1, g_2, \ldots, g_M\} \in \mathbb{R}^{M \times G}$, where $g_i = [g_i^j]_{j=1}^{G}$ and $g_i^j$ denotes the $j$-th token of the $i$-th variable. Next, we input these tokens into the pre-trained LLM to obtain initial prompt representations. The computation process of the $l$-th layer in the pre-trained LLM is as follows:
$$\bar{E}^l = \mathrm{MMSA}(E^l) + E^l$$
$$E^{l+1} = \mathrm{FFN}(\mathrm{LN}(\bar{E}^l)) + \bar{E}^l$$
where $E^0 = G + E_p$ and $E_p$ stands for the positional encoding. $\mathrm{LN}(\cdot)$ and $\mathrm{FFN}(\cdot)$ stand for layer normalization and the feed-forward network, respectively. The masked multi-head self-attention $\mathrm{MMSA}(\cdot)$ is defined as follows:
$$\mathrm{MMSA}(E^l) = \rho_o\big(\mathrm{Atten}(\rho_q E^l,\ \rho_k E^l,\ \rho_v E^l)\big)$$
where $\rho_o$, $\rho_q$, $\rho_k$, and $\rho_v$ are the learnable parameters of the attention function $\mathrm{Atten}(\cdot)$. In the experiments, the pre-trained LLM is chosen as GPT-2, and its parameters are frozen during training.
It has been empirically demonstrated that the significance of tokens in language model training is not uniform. Among all tokens, the last token in a prompt sequence emerges as a repository of the most comprehensive contextual knowledge. This is predominantly due to the masked multi-head self-attention mechanism in the pre-trained LLM, which aggregates information from preceding tokens: the representation of the last token is affected only by the representations of the tokens before it. Hence, to optimize computational efficiency, we store the last token representation $E_G$ to stand for the overall prompt $E$ and input it into a Transformer-based prompt encoder to generate the final prompt representations $Z_p$:
$$Z_p = \mathrm{PEncoder}(E_G)$$
where $\mathrm{PEncoder}(\cdot)$ denotes the Transformer-based prompt encoder.
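To make this branch concrete, the sketch below traces the path from a variable's history to $Z_p$ in PyTorch, assuming GPT-2 from the Hugging Face transformers library as in the experiments; the exact prompt template and the `PromptEncoder` hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of LLM-based prompt representation learning.
# Assumptions: Hugging Face GPT-2; illustrative prompt template and encoder sizes.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
llm = GPT2Model.from_pretrained("gpt2")
for p in llm.parameters():
    p.requires_grad = False  # the pre-trained LLM stays frozen during training

def series_to_prompt(values, interval="5 min"):
    """Render one variable's history as text, including the total trend value."""
    trend = sum(values[t + 1] - values[t] for t in range(len(values) - 1))
    vals = ", ".join(f"{v:.3f}" for v in values)
    return f"The values every {interval} are: {vals}. The total trend value is {trend:.3f}."

class PromptEncoder(nn.Module):
    """Transformer-based prompt encoder applied to the last-token representation."""
    def __init__(self, d_model=768, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, e_last):                   # e_last: (M, 1, d_model)
        return self.encoder(e_last).squeeze(1)   # Z_p: (M, d_model)

prompt = series_to_prompt([0.1, 0.3, 0.2, 0.4])       # one prompt per variable
tokens = tokenizer([prompt], return_tensors="pt")     # token sequence g_i
hidden = llm(**tokens).last_hidden_state              # (1, G, 768)
e_last = hidden[:, -1:, :]     # keep only the last token: it aggregates the prompt
z_p = PromptEncoder()(e_last)  # final prompt representations Z_p
```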

2.2. KAN-Based Frequency Representation Learning

Time-series data in power systems are rich in frequency information. Current methods mainly rely on multi-layer perceptron frameworks to extract frequency-domain features, yet they cannot effectively identify multi-scale periodic structures in time-series data, restricting prediction accuracy. Kolmogorov–Arnold networks, with their excellent data fitting, flexibility, and learnable activation functions, can capture complex time-series patterns more accurately. Inspired by them, this paper introduces KAN-based frequency representation learning to better capture frequency characteristics of time-series data and enhance the prediction performance.
Specifically, given multivariable time-series data $X \in \mathbb{R}^{M \times L}$, we leverage the Fast Fourier Transform $\mathrm{FFT}(\cdot)$ to map it into the frequency domain:
$$X_f = \mathrm{FFT}(X) = \int X(t)\, e^{-j 2 \pi w t}\, dt$$
where $j$ stands for the imaginary unit and $X_f$ denotes the frequency-domain transform of $X$.
Then, we utilize a KAN with learnable nonlinear mappings as an aggregator to extract rich frequency information:
$$\bar{X}_f = F_{\mathrm{KAN}}(X_f, W_{base}, W_{spline})$$
where $W_{base}$ and $W_{spline}$ denote the parameters of the base transformation and the B-spline transformation, respectively, which are defined as follows:
$$X_f^b = W_{base}\, \mathrm{SiLU}(X_f)$$
$$X_f^s = \sum_i W_{spline}^i \cdot B_i^k(X_f)$$
where $\mathrm{SiLU}(\cdot)$ denotes the nonlinear activation, $W_{spline}^i$ is a learnable weight, and $B_i^k(\cdot)$ is the $i$-th B-spline basis of degree $k$:
$$B_i^k(X_f) = \frac{X_f - g_i}{g_{i+k} - g_i}\, B_i^{k-1}(X_f) + \frac{g_{i+k+1} - X_f}{g_{i+k+1} - g_{i+1}}\, B_{i+1}^{k-1}(X_f),$$
where the grid $[g_{-k}, \ldots, g_0, g_1, \ldots, g_{s+k}]$, spaced uniformly over $[-1, 1]$, determines the scale of the spline interpolation. In particular,
$$B_i^0(X_f) = \begin{cases} 1 & \text{if } X_f \in [g_i, g_{i+1}) \\ 0 & \text{otherwise} \end{cases}$$
After performing the base transformation and B-spline transformation, the fusion output of KAN is as follows:
$$\bar{X}_f = X_f^b + X_f^s$$
Finally, we apply the inverse transform $\mathrm{FFT}^{-1}(\cdot)$ to map $\bar{X}_f$, now enriched with frequency features, back to the time domain:
$$Z_f = \mathrm{FFT}^{-1}(\bar{X}_f) = \int \bar{X}_f\, e^{j 2 \pi w t}\, dw$$
where $Z_f$ denotes the frequency representations.
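The sketch below illustrates this branch end to end in PyTorch, using the grid size $s = 2$ and spline degree $k = 1$ reported in the implementation details; treating the real and imaginary FFT components as separate real-valued features (squashed into the grid range by tanh before the KAN) is our assumption for handling the complex spectrum, not a detail stated above.

```python
# A minimal sketch of KAN-based frequency representation learning.
# Assumptions: real/imaginary parts handled as separate features; tanh keeps
# spline inputs inside the uniform grid over [-1, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

def bspline_basis(x, grid, k):
    """Cox-de Boor recursion: degree-k B-spline bases on a fixed knot vector.
    x: (..., 1); grid: (n_knots,); returns (..., n_knots - 1 - k) bases."""
    b = ((x >= grid[:-1]) & (x < grid[1:])).float()           # degree-0 bases
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        b = left * b[..., :-1] + right * b[..., 1:]
    return b

class KANLayer(nn.Module):
    def __init__(self, dim_in, dim_out, s=2, k=1):
        super().__init__()
        self.k = k
        self.register_buffer("grid", torch.linspace(-1, 1, s + 2 * k + 1))
        self.w_base = nn.Linear(dim_in, dim_out, bias=False)            # W_base
        self.w_spline = nn.Parameter(0.1 * torch.randn(dim_out, dim_in, s + k))

    def forward(self, x):                                   # x: (B, dim_in)
        xb = self.w_base(F.silu(x))                         # base transformation
        bases = bspline_basis(x.unsqueeze(-1), self.grid, self.k)
        xs = torch.einsum("bin,oin->bo", bases, self.w_spline)  # spline transformation
        return xb + xs                                      # fused KAN output

x = torch.randn(8, 96)                         # (batch, L) time-domain windows
xf = torch.fft.rfft(x, dim=-1)                 # FFT: to the frequency domain
feat = torch.cat([xf.real, xf.imag], dim=-1)   # complex bins as real features
feat = KANLayer(feat.shape[-1], feat.shape[-1])(torch.tanh(feat))
half = feat.shape[-1] // 2
z_f = torch.fft.irfft(torch.complex(feat[..., :half], feat[..., half:]),
                      n=96, dim=-1)            # inverse FFT: back to time domain
```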

2.3. Entropy-Oriented Cross-Modal Fusion

To eliminate the heterogeneous cross-modal semantic gap between frequency and prompt representations, we propose entropy-oriented cross-modal fusion to fully integrate their complementary information.
Specifically, given prompt representations $Z_p$ and frequency representations $Z_f$, the fusion representations can be obtained via
$$Z = Z_p \otimes Z_f$$
where $\otimes$ denotes a fusion operation such as concatenation. To ensure full aggregation of cross-modal complementary information, we minimize the information entropy of the frequency and prompt representations conditioned on the fusion representation:
$$\min\ H(Z_f \mid Z) + H(Z_p \mid Z) = -\mathbb{E}_{P_{Z_f, Z}}[\log P(Z_f \mid Z)] - \mathbb{E}_{P_{Z_p, Z}}[\log P(Z_p \mid Z)]$$
where the smaller $H(Z_f \mid Z)$ and $H(Z_p \mid Z)$ are, the better the complementary information fusion.
Taking $H(Z_f \mid Z)$ as an example, computing the conditional entropy directly is challenging due to the complexity of the underlying data distributions and the high-dimensional nature of the representations. To address this challenge, we introduce a variational distribution $Q(Z_f \mid Z)$, which serves as an approximation to the true posterior distribution $P(Z_f \mid Z)$. The goal then shifts to maximizing a lower bound on the expectation, as shown below:
$$\mathbb{E}_{P_{Z_f, Z}}[\log P(Z_f \mid Z)] \geq \mathbb{E}_{P_{Z_f, Z}}[\log Q(Z_f \mid Z)]$$
To make this optimization more tractable, we assume that the variational distribution $Q$ follows a Gaussian distribution, i.e., $\mathcal{N}(Z_f \mid G_f(Z), \sigma I)$. Here, $G_f(\cdot)$ represents a learnable cross-view mapping function that transforms fusion representations back to the frequency modality, and $\sigma I$ stands for a diagonal covariance matrix with variance $\sigma$. Substituting this Gaussian assumption into the expectation yields
$$\max\ \mathbb{E}_{P_{Z_f, Z}}[\log Q(Z_f \mid Z)] = \mathbb{E}_{P_{Z_f, Z}}\left[\log\left(\frac{1}{\sqrt{2\pi}\,\sigma I}\, e^{-\frac{(Z_f - G_f(Z))^2}{2\sigma I}}\right)\right]$$
Expanding the logarithm of the Gaussian density function results in the following optimization problem:
$$\max\ \mathbb{E}_{P_{Z_f, Z}}\left[-\frac{(Z_f - G_f(Z))^2}{2\sigma I} + \log\frac{1}{\sqrt{2\pi}\,\sigma I}\right]$$
where the term $-\frac{(Z_f - G_f(Z))^2}{2\sigma I}$ corresponds to the negative squared Mahalanobis distance between $Z_f$ and $G_f(Z)$, which encourages the model to produce representations that are close under the cross-view mapping. The term $\log\frac{1}{\sqrt{2\pi}\,\sigma I}$ acts as a normalization constant that ensures the Gaussian distribution is properly scaled. However, since this term is constant with respect to the model parameters, it can be safely ignored during optimization. Similarly, the scaling factor $2\sigma I$ can be omitted for simplicity, leading to the prediction objective
$$\max\ -\mathbb{E}_{P_{Z_f, Z}}\, \| Z_f - G_f(Z) \|_2^2$$
Finally, the loss of entropy-oriented cross-modal fusion given prompt representations Z p and frequency representations Z f is defined as follows:
$$\mathcal{L}_e = \| Z_f - G_f(Z) \|_2^2 + \| Z_p - G_p(Z) \|_2^2$$
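This loss admits a compact implementation. The sketch below assumes concatenation as the fusion operation $\otimes$ and single linear layers for the cross-view mappings $G_f$ and $G_p$; the dimensions are illustrative.

```python
# A minimal sketch of the entropy-oriented cross-modal fusion loss L_e.
# Assumptions: concatenation fusion; linear cross-view mappings G_f and G_p.
import torch
import torch.nn as nn

class EntropyFusion(nn.Module):
    def __init__(self, d_p, d_f):
        super().__init__()
        self.g_f = nn.Linear(d_p + d_f, d_f)   # G_f: fusion -> frequency modality
        self.g_p = nn.Linear(d_p + d_f, d_p)   # G_p: fusion -> prompt modality

    def forward(self, z_p, z_f):
        z = torch.cat([z_p, z_f], dim=-1)      # fusion representation Z
        # Squared reconstruction errors realize the variational bound on
        # H(Z_f | Z) + H(Z_p | Z) under the Gaussian assumption above.
        loss_e = ((z_f - self.g_f(z)) ** 2).sum(-1).mean() + \
                 ((z_p - self.g_p(z)) ** 2).sum(-1).mean()
        return z, loss_e

z_p, z_f = torch.randn(8, 256), torch.randn(8, 256)
z, loss_e = EntropyFusion(256, 256)(z_p, z_f)
```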

2.4. The Loss Optimization

Current methods employ the MSE loss to optimize time series prediction. While MSE is widely used due to its simplicity and effectiveness in penalizing large errors, it has certain limitations. For instance, MSE is highly sensitive to outliers, which can lead to suboptimal model performance when dealing with noisy data. To address these issues, this paper adopts a Smooth L1 loss function to optimize the prediction performance. The Smooth L1 loss combines the benefits of both the L1 and L2 loss functions: it uses a squared term for small errors and a linear term for larger errors, which makes it less sensitive to outliers and provides more robust performance. The mathematical formulation of the Smooth L1 loss is as follows:
$$\mathcal{L}_s = \begin{cases} 0.5\, |Y - \bar{Y}|^2 & \text{if } |Y - \bar{Y}| < 1 \\ |Y - \bar{Y}| - 0.5 & \text{otherwise} \end{cases}$$
where $Y$ denotes the true values and $\bar{Y}$ denotes the predicted values generated via a linear predictor with learnable parameters $W_{pred}$:
$$\bar{Y} = W_{pred}\, Z$$
The benefits of using Smooth L1 Loss are multifold. Firstly, it reduces the influence of outliers by switching to a linear scaling for larger errors, which results in a more balanced overall loss. Secondly, it maintains the differentiability at zero, which helps in stable gradient descent optimization. Thirdly, it provides a good balance between the robustness of L1 loss and the stability of L2 loss during optimization.
The overall loss function is a combination of $\mathcal{L}_s$ and $\mathcal{L}_e$. For the time series forecasting task, the total loss function $\mathcal{L}_{total}$ can be expressed as
$$\mathcal{L}_{total} = \mathcal{L}_s + \lambda\, \mathcal{L}_e$$
where $\lambda$ is a weighting parameter that balances the contributions of the different losses. For the optimization process, we adopt the Adam optimizer, a popular adaptive gradient-based optimization algorithm. It computes adaptive learning rates for each parameter by maintaining moving averages of the gradients and the squared gradients, which allows the model to converge more efficiently and effectively, especially in high-dimensional parameter spaces. The optimizer updates the model parameters by moving in the direction of the negative gradient of the loss function with respect to the parameters, thereby minimizing the overall loss.
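The sketch below shows one training step under this objective, building on the fusion module sketched above; PyTorch's built-in SmoothL1Loss with beta = 1 matches the piecewise form of $\mathcal{L}_s$, while the predictor width, horizon, and $\lambda$ value are illustrative assumptions.

```python
# A minimal sketch of the overall optimization L_total = L_s + lambda * L_e.
# Assumptions: illustrative dimensions; `fusion` is the EntropyFusion module
# from the earlier sketch, not a library component.
import torch
import torch.nn as nn

predictor = nn.Linear(512, 96)         # W_pred mapping Z to T = 96 future steps
smooth_l1 = nn.SmoothL1Loss(beta=1.0)  # squared for |e| < 1, linear beyond
fusion = EntropyFusion(256, 256)       # hypothetical instance from the sketch above
optimizer = torch.optim.Adam(
    list(predictor.parameters()) + list(fusion.parameters()),
    lr=1e-3,                           # learning rate searched within [1e-4, 1e-2]
)

def train_step(z_p, z_f, y, lam=0.1):
    """One Adam update of the total loss; lam is the trade-off parameter."""
    z, loss_e = fusion(z_p, z_f)
    loss = smooth_l1(predictor(z), y) + lam * loss_e
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```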

3. Experimental Evaluation

3.1. Setup

Datasets. Publicly available time-series datasets from five power systems were used to validate the performance of the model [26,27]. The ETTh1 and ETTh2 datasets contain 7 variables at a 1 h time interval, have a total length of 17,420, and follow a 6:2:2 split into training, validation, and test sets. The ETTm1 and ETTm2 datasets comprise 7 variables recorded at a 15 min interval, with a length of 69,680 and the same 6:2:2 split. Electricity includes 321 variables sampled at a 1 h interval, has a total length of 26,304, and adopts a 7:1:2 split for the training, validation, and test sets.
Evaluation Metrics. To evaluate the prediction performance, RMSE and MAE are used as metrics in the experiments [27,28]. RMSE evaluates prediction accuracy by computing the square root of the average squared differences between predicted and observed values. MAE calculates the average absolute differences between forecasted and true values, offering an intuitive measure of the typical prediction error magnitude and serving as a robust metric less affected by outliers.
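For reference, the two metrics follow directly from their standard definitions; a short sketch:

```python
# RMSE and MAE as used in the result tables (standard definitions).
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute difference."""
    return float(np.mean(np.abs(y_true - y_pred)))
```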
Comparison methods. Ten time-series prediction methods are used, including FS-TSF [1], Hybrid-net [3], IMC-net [4], DAE-TSF [14], SAC-ConvLSTM [16], LT-TSF [22], Timecma [24], Gpt4mts [26], Effformer [29], and TimeKAN [30]. To guarantee a fair comparison across all experiments, we ensured that every experiment was carried out within the same hardware and software settings. Regarding the hardware environment, all experiments were performed on identical high-performance servers equipped with Intel Xeon Gold 6248 CPUs (BIOS version 1.30), NVIDIA Tesla V100 GPUs, and 192 GB of RAM. The software environment was also kept uniform, with the same versions of the Ubuntu 20.04 LTS operating system, Python 3.8, the PyTorch 1.9.0 deep learning framework, and other related libraries and tools. For hyperparameter configurations, we strictly followed the original publications of each method: we extracted the exact hyperparameter values from the original papers and implemented them without alteration, ensuring that each method was evaluated under conditions aligned with the authors' initial setups and providing a fair and valid basis for comparison.
Implementation details. A grid dimension of s = 2 and a B-spline degree of k = 1 are used to construct the KAN, with each KAN layer having a latent dimensionality of 256. The five datasets are standardized using MinMax scaling. For optimization, we use the Adam algorithm with the learning rate searched within $[10^{-4}, 10^{-2}]$ and batch sizes ranging from 4 to 64. All architectures are trained for at most 20 epochs, with early stopping triggered when the validation loss plateaus, indicating potential overfitting.

3.2. Comparison Evaluation

In this section, a full evaluation of the proposed method and ten comparison methods is conducted on five common time-series datasets in power systems in terms of RMSE and MAE. In the evaluation, the historical window size L is set to 96 and the future window size T is chosen from {96, 192, 336, 720} for the five datasets. The evaluation results are depicted in Table 1, Table 2, Table 3, Table 4 and Table 5. The proposed method obtains the optimal results on the five datasets under different future window sizes T. These results demonstrate that our method achieves a new baseline in this task. The reasons for the performance advantage are threefold. Firstly, the integration of LLM-based prompt representation learning enables the model to leverage the vast knowledge contained within pre-trained large language models. This allows for the extraction of robust and semantically rich representations from time-series data. By transforming time-series data into text descriptions and utilizing prompt encoders to extract latent representations, the model can better understand the underlying patterns and relationships in power system data. This significantly enhances the overall forecasting accuracy and provides a more profound contextual understanding compared to traditional methods. Secondly, the KAN-based frequency representation learning component effectively captures multi-scale periodic structures and complex frequency characteristics inherent in power system time-series data. Kolmogorov–Arnold networks, with their excellent data fitting capabilities and learnable activation functions, can more accurately model the intricate patterns of time-series data. This component allows the model to exploit frequency-domain features that are often overlooked by conventional approaches, thereby improving the prediction performance. Lastly, the entropy-oriented cross-modal fusion strategy bridges the semantic gap between different representation modalities. By minimizing the information entropy of frequency and prompt representations under fusion conditions, the model ensures that complementary information from both modalities is fully integrated. This fusion process, guided by information entropy minimization, enables the model to make more informed and accurate predictions by leveraging the strengths of both representations. The combination of these three components creates a powerful framework that outperforms existing methods across various future window sizes and datasets.
Meanwhile, a comparison of cumulative relative error between our method and competitive methods is designed to assess model robustness in power systems. The cumulative relative error (CRE) is defined as follows:
$$\mathrm{CRE} = \frac{\sum_{t=1}^{T} (\bar{Y}_t - Y_t)}{\sum_{t=1}^{T} Y_t}$$
CRE calculates the ratio of the cumulative deviation of predictions to the actual values over the planning horizon, which reflects the cumulative bias of a model. The results are shown in Table 6. The results indicate that our method demonstrates superior performance by achieving the lowest cumulative relative error across all forecasting horizons and datasets. This suggests that the method possesses enhanced robustness and generalization capability. It is capable of accurately capturing the underlying patterns of the data, even when faced with complex distributions and dynamic characteristics, thereby maintaining predictive accuracy over extended forecasting periods. In contrast, other models may exhibit acceptable performance in short-term forecasting but tend to accumulate larger deviations as the forecasting horizon increases, leading to a decline in their effectiveness.
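The CRE itself is a one-line computation; a direct sketch of the definition above:

```python
# Cumulative relative error: cumulative prediction deviation over cumulative actuals.
import numpy as np

def cre(y_true, y_pred):
    """Signed ratio; the sign indicates over- or under-prediction bias."""
    return float(np.sum(y_pred - y_true) / np.sum(y_true))
```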

3.3. Ablation Analysis

This section performs an ablation analysis to verify the impact of each component of the proposed method on the prediction performance, covering the loss ablation and the architecture ablation.
Specifically, the loss ablation sets up three ablation variants. (1) Variant_1 utilizes the vanilla MSE and $\mathcal{L}_e$ losses to optimize the overall network. (2) Variant_2 utilizes only the vanilla MSE loss to optimize the overall network. (3) Variant_3 utilizes only the $\mathcal{L}_s$ loss to optimize the overall network. The ablation results on the ETTh1 dataset with four future window sizes are shown in Table 7. Results show that Variant_3 outperforms Variant_2, highlighting the effectiveness of the $\mathcal{L}_s$ loss function. Similarly, Variant_1 demonstrates superior performance compared to Variant_2, which underscores the positive impact of incorporating the $\mathcal{L}_e$ loss function. Furthermore, our proposed method surpasses all three variants. This indicates that our overall loss function design is not only rational but also advanced, as it integrates the advantages of different loss components to achieve optimal results. The ablation study thus provides strong evidence for the effectiveness of each loss component and the superiority of our comprehensive loss function design.
The architecture ablation sets up three ablation variants: Ours w/o LLM denotes the erasure of the LLM-based prompt representation learning; Ours w/o KAN denotes the erasure of the KAN-based frequency representation learning; Ours w/o Entropy denotes the erasure of the entropy-oriented cross-modal fusion. The ablation results on the ETTh1 dataset with four future window sizes are shown in Table 8. There are three observations: (1) The variant Ours w/o LLM demonstrates inferior performance compared to the full model. This indicates that the LLM-based prompt representation learning component plays a crucial role in capturing robust and semantically rich representations from the time-series data. By leveraging the vast knowledge contained within pre-trained LLMs, this component enables the model to better understand the underlying patterns and relationships in the power system data, thereby enhancing the overall forecasting accuracy. (2) Ours w/o KAN shows a significant performance drop compared to the complete model. This highlights the importance of the KAN-based frequency representation learning component. Power system time-series data inherently contains rich frequency information, and the KAN module is specifically designed to effectively capture multi-scale periodic structures and complex frequency characteristics. Without this component, the model's ability to exploit frequency-domain features is compromised, leading to less accurate predictions. (3) Ours w/o Entropy also underperforms relative to the full model. This underscores the value of the entropy-oriented cross-modal fusion strategy. This component is essential for bridging the semantic gap between different representation modalities and ensuring that complementary information from both prompt and frequency representations is fully integrated. The fusion process, guided by information entropy minimization, allows the model to make more informed and accurate predictions by leveraging the strengths of both representation types.

3.4. Parameter Analysis

This section performs parameter analysis of the trade-off parameter $\lambda$, the latent representation size $D$, and the layer number $M$ of the KAN on the ETTh1 dataset with four future window sizes $T$.
Trade-off Parameter $\lambda$: We conduct experiments with $\lambda \in \{100, 10, 1, 0.1, 0.01, 0.001\}$. The results in Figure 2 indicate that the model achieves optimal performance when $\lambda$ is set within the range of 0.1 to 1. A higher $\lambda$ tends to overemphasize the entropy-oriented fusion loss, while a lower $\lambda$ may underweight its significance. The optimal range effectively balances the prediction loss against the cross-modal fusion loss, ensuring comprehensive and accurate predictions.
Latent Representation Size D: Experiments are carried out with different sizes D { 10 , 64 , 128 , 256 , 512 , 1024 } . The findings in Figure 2 reveal that a latent representation size that is too small (e.g., D = 10 ) limits the model’s capacity to capture intricate patterns in the data, resulting in underfitting. Conversely, an excessively large D (e.g., D = 1024 ) may introduce noise and lead to overfitting. The optimal performance is observed when D is within the range of 128 to 256, which strikes a balance between capturing sufficient information and avoiding redundancy, thereby enhancing the model’s forecasting accuracy.
Layer Number M of KAN: We explore values of M from 1 to 5. The results in Figure 2 demonstrate that a single layer is insufficient to model the complex frequency characteristics of power system time-series data. As the number of layers increases, the model’s ability to capture multi-scale periodic structures improves. However, beyond three layers, the performance gains diminish and overfitting becomes a concern. The optimal performance is achieved with M set to 3, which effectively captures the essential frequency features while maintaining computational efficiency and generalization capability.

4. Conclusions

This study addresses the critical challenge of accurate time-series forecasting in complex power systems by proposing LLM-Empowered Kolmogorov–Arnold frequency learning (LKFL), which consists of LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion. The three components work together to capture rich, complementary frequency and temporal patterns for time series forecasting in power systems. Meanwhile, ablation experiments demonstrate that removing any component yields inferior results compared with the full LKFL, which validates the effectiveness of each component. Future work will explore lightweight LLM distillation and extend the framework to irregularly sampled or sparse power data. Nevertheless, LKFL establishes a new paradigm for time-series forecasting in critical infrastructure, with immediate applications in grid scheduling, fault prevention, and energy policy optimization.

Author Contributions

Conceptualization, Y.Y.; Methodology, Z.Y.; Validation, Y.Z.; Formal analysis, S.L.; Data curation, S.L.; Writing—original draft, Z.Y.; Writing—review and editing, Y.Y.; Visualization, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Xing, Q.; Huang, X.; Wang, J.; Wang, S. A Novel Multivariate Combined Power Load Forecasting System Based on Feature Selection and Multi-Objective Intelligent Optimization. Expert Syst. Appl. 2024, 244, 122970.
2. Ferkous, K.; Guermoui, M.; Menakh, S.; Bellaour, A.; Boulmaiz, T. A Novel Learning Approach for Short-Term Photovoltaic Power Forecasting-A Review and Case Studies. Eng. Appl. Artif. Intell. 2024, 133, 108502.
3. Bashir, T.; Wang, H.; Tahir, M.; Zhang, Y. Wind and Solar Power Forecasting Based on Hybrid CNN-ABiLSTM, CNN-Transformer-MLP Models. Renew. Energy 2025, 239, 122055.
4. de Azevedo Takara, L.; Teixeira, A.C.; Yazdanpanah, H.; Mariani, V.C.C.; dos Santos Coelho, L. Optimizing Multi-Step Wind Power Forecasting: Integrating Advanced Deep Neural Networks with Stacking-Based Probabilistic Learning. Appl. Energy 2024, 369, 123487.
5. Zhao, Y.; Liao, H.; Pan, S.; Zhao, Y. Interpretable Multi-Graph Convolution Network Integrating Spatio-Temporal Attention and Dynamic Combination for Wind Power Forecasting. Expert Syst. Appl. 2024, 255, 124766.
6. Gao, J.; Li, P.; Laghari, A.A.; Srivastava, G.; Gadekallu, T.R.; Abbas, S.; Zhang, J. Incomplete Multiview Clustering via Semidiscrete Optimal Transport for Multimedia Data Mining in IoT. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–20.
7. Gao, J.; Liu, M.; Li, P.; Laghari, A.A.; Javed, A.R.; Victor, N.; Gadekallu, T.R. Deep Incomplete Multiview Clustering via Information Bottleneck for Pattern Mining of Data in Extreme-environment IoT. IEEE Internet Things J. 2023, 11, 26700–26712.
8. Gao, J.; Liu, M.; Li, P.; Zhang, J.; Chen, Z. Deep Multiview Adaptive Clustering with Semantic Invariance. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 12965–12978.
9. Li, P.; Laghari, A.A.; Rashid, M.; Gao, J.; Gadekallu, T.R.; Javed, A.R.; Yin, S. A Deep Multimodal Adversarial Cycle-consistent Network for Smart Enterprise System. IEEE Trans. Ind. Inform. 2022, 19, 693–702.
10. Gao, J.; Cheng, Y.; Zhang, D.; Chen, Y. Physics-Constrained Wind Power Forecasting Aligned with Probability Distributions for Noise-Resilient Deep Learning. Appl. Energy 2025, 383, 125295.
11. Wang, J.; Kou, M.; Li, R.; Qian, Y.; Li, Z. Ultra-Short-Term Wind Power Forecasting Jointly Driven by Anomaly Detection, Clustering and Graph Convolutional Recurrent Neural Networks. Adv. Eng. Inform. 2025, 65, 103137.
12. Wang, Y.; Hao, Y.; Zhao, K.; Yao, Y. Stochastic Configuration Networks for Short-Term Power Load Forecasting. Inf. Sci. 2025, 689, 121489.
13. Hu, X.; Li, H.; Si, C. Improved Composite Model Using Metaheuristic Optimization Algorithm for Short-Term Power Load Forecasting. Electr. Power Syst. Res. 2025, 241, 111330.
14. Yang, Q.; Tian, Z. A Hybrid Load Forecasting System Based on Data Augmentation and Ensemble Learning Under Limited Feature Availability. Expert Syst. Appl. 2025, 261, 125567.
15. Gao, J.; Guo, C.; Liu, Y.; Li, P.; Zhang, J.; Liu, M. Dynamic-Static Feature Fusion with Multi-Scale Attention for Continuous Blood Glucose Prediction. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
16. Jalalifar, R.; Delavar, M.R.; Ghaderi, S.F. SAC-ConvLSTM: A Novel Spatio-Temporal Deep Learning-Based Approach for a Short Term Power Load Forecasting. Expert Syst. Appl. 2024, 237, 121487.
17. Deng, Q.; Wang, C.; Sun, J.; Sun, Y.; Jiang, J.; Lin, H.; Deng, Z. Nonvolatile CMOS Memristor, Reconfigurable Array, and Its Application in Power Load Forecasting. IEEE Trans. Ind. Inform. 2023, 20, 6130–6141.
18. Yuan, F.; Che, J. An Ensemble Multi-Step M-RMLSSVR Model Based on VMD and Two-Group Strategy for Day-Ahead Short-Term Load Forecasting. Knowl.-Based Syst. 2022, 252, 109440.
19. Zhang, S.; Chen, R.; Cao, J.; Tan, J. A CNN and LSTM-Based Multi-Task Learning Architecture for Short and Medium-Term Electricity Load Forecasting. Electr. Power Syst. Res. 2023, 222, 109507.
20. Lv, L.; Wu, Z.; Zhang, J.; Zhang, L.; Tan, Z.; Tian, Z. A VMD and LSTM Based Hybrid Model of Load Forecasting for Power Grid Security. IEEE Trans. Ind. Inform. 2021, 18, 6474–6482.
21. Xu, A.; Chen, J.; Li, J.; Chen, Z.; Xu, S.; Nie, Y. Multivariate Rolling Decomposition Hybrid Learning Paradigm for Power Load Forecasting. Renew. Sustain. Energy Rev. 2025, 212, 115375.
22. Pentsos, V.; Tragoudas, S.; Wibbenmeyer, J.; Khdeer, N. A Hybrid LSTM-Transformer Model for Power Load Forecasting. IEEE Trans. Smart Grid 2025, 16, 2624–2634.
23. Liu, P.; Guo, H.; Dai, T.; Li, N.; Bao, J.; Ren, X.; Jiang, Y.; Xia, S.T. Calf: Aligning LLMs for Time Series Forecasting via Cross-Modal Fine-Tuning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 18915–18923.
24. Liu, C.; Xu, Q.; Miao, H.; Yang, S.; Zhang, L.; Long, C.; Li, Z.; Zhao, R. Timecma: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment. Proc. AAAI Conf. Artif. Intell. 2025, 39, 18780–18788.
25. Tan, M.; Merrill, M.; Gupta, V.; Althoff, T.; Hartvigsen, T. Are Language Models Actually Useful for Time Series Forecasting? Adv. Neural Inf. Process. Syst. 2024, 37, 60162–60191.
26. Jia, F.; Wang, K.; Zheng, Y.; Cao, D.; Liu, Y. Gpt4mts: Prompt-Based Large Language Model for Multimodal Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2024, 38, 23343–23351.
27. Qiu, X.; Wu, X.; Lin, Y.; Guo, C.; Hu, J.; Yang, B. Duet: Dual Clustering Enhanced Multivariate Time Series Forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, ON, Canada, 3–7 August 2025; pp. 1185–1196.
28. Murad, M.M.N.; Aktukmak, M.; Yilmaz, Y. Wpmixer: Efficient Multi-Resolution Mixing for Long-Term Time Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2025, 39, 19581–19588.
29. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? Proc. AAAI Conf. Artif. Intell. 2023, 37, 11121–11128.
30. Huang, S.; Zhao, Z.; Li, C.; Bai, L. TimeKAN: KAN-based Frequency Decomposition Learning Architecture for Long-term Time Series Forecasting. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025; pp. 1–8.
Figure 1. The illustration of the proposed method.
Figure 2. The parameter analysis on the ETTh1 dataset. (a) Trade-off Parameter λ on the ETTh1 dataset in terms of MAE. (b) Latent representation size D on the ETTh1 dataset in terms of MAE. (c) Layer number M of KAN on the ETTh1 dataset in terms of MAE.
Table 1. The time-series prediction results on the ETTh1 dataset. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| FS-TSF | 0.121 | 0.099 | 0.145 | 0.108 | 0.169 | 0.127 | 0.189 | 0.143 |
| Hybrid-net | 0.098 | 0.076 | 0.106 | 0.080 | 0.110 | 0.089 | 0.115 | 0.094 |
| IMC-net | 0.135 | 0.102 | 0.178 | 0.146 | 0.204 | 0.174 | 0.229 | 0.176 |
| DAE-TSF | 0.108 | 0.082 | 0.146 | 0.115 | 0.156 | 0.124 | 0.159 | 0.128 |
| SAC-ConvLSTM | 0.103 | 0.074 | 0.109 | 0.078 | 0.113 | 0.082 | 0.120 | 0.088 |
| LT-TSF | 0.089 | 0.063 | 0.096 | 0.068 | 0.110 | 0.080 | 0.117 | 0.086 |
| Timecma | 0.107 | 0.077 | 0.121 | 0.089 | 0.134 | 0.101 | 0.150 | 0.119 |
| Gpt4mts | 0.091 | 0.063 | 0.096 | 0.069 | 0.117 | 0.083 | 0.121 | 0.099 |
| Effformer | 0.092 | 0.064 | 0.101 | 0.073 | 0.112 | 0.086 | 0.150 | 0.116 |
| TimeKAN | 0.092 | **0.061** | 0.097 | 0.069 | 0.115 | 0.085 | 0.119 | 0.092 |
| Ours | **0.088** | 0.062 | **0.093** | **0.067** | **0.107** | **0.079** | **0.110** | **0.083** |
Table 2. The time-series prediction results on the ETTh2 dataset. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| FS-TSF | 0.090 | 0.074 | 0.093 | 0.082 | 0.095 | 0.099 | 0.097 | 0.102 |
| Hybrid-net | 0.092 | 0.076 | 0.095 | 0.081 | 0.097 | 0.090 | 0.105 | 0.097 |
| IMC-net | 0.088 | 0.075 | 0.091 | 0.079 | 0.102 | 0.087 | 0.109 | 0.096 |
| DAE-TSF | 0.067 | 0.051 | 0.064 | 0.048 | 0.079 | 0.060 | 0.106 | 0.085 |
| SAC-ConvLSTM | 0.075 | 0.052 | 0.085 | 0.061 | 0.089 | 0.065 | 0.090 | 0.066 |
| LT-TSF | 0.076 | 0.052 | 0.084 | 0.059 | 0.093 | 0.068 | 0.094 | 0.069 |
| Timecma | 0.054 | 0.038 | 0.062 | 0.047 | 0.068 | 0.048 | 0.075 | 0.057 |
| Gpt4mts | 0.055 | 0.038 | 0.065 | 0.046 | 0.075 | 0.054 | 0.089 | 0.067 |
| Effformer | 0.054 | 0.039 | 0.062 | 0.046 | 0.087 | 0.067 | 0.099 | 0.072 |
| TimeKAN | 0.068 | 0.045 | 0.077 | 0.052 | 0.083 | 0.058 | 0.088 | 0.062 |
| Ours | **0.053** | **0.037** | **0.060** | **0.043** | **0.063** | **0.046** | **0.072** | **0.054** |
Table 3. The time-series prediction results on the ETTm1 dataset. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| FS-TSF | 0.108 | 0.086 | 0.114 | 0.097 | 0.127 | 0.108 | 0.132 | 0.116 |
| Hybrid-net | 0.120 | 0.092 | 0.127 | 0.098 | 0.134 | 0.108 | 0.155 | 0.119 |
| IMC-net | 0.117 | 0.084 | 0.132 | 0.101 | 0.157 | 0.123 | 0.185 | 0.133 |
| DAE-TSF | 0.102 | 0.070 | 0.100 | 0.071 | 0.104 | 0.072 | 0.110 | 0.089 |
| SAC-ConvLSTM | 0.096 | 0.071 | 0.101 | 0.073 | 0.110 | 0.085 | 0.122 | 0.096 |
| LT-TSF | 0.083 | 0.058 | 0.090 | 0.064 | 0.099 | 0.069 | 0.110 | 0.082 |
| Timecma | 0.088 | 0.061 | 0.801 | 0.610 | 0.756 | 0.504 | 0.770 | 0.551 |
| Gpt4mts | 0.084 | 0.062 | 0.090 | 0.069 | 0.102 | 0.077 | 0.112 | 0.093 |
| Effformer | 0.082 | 0.056 | 0.088 | 0.062 | 0.098 | 0.072 | 0.108 | 0.083 |
| TimeKAN | 0.082 | 0.055 | 0.085 | 0.063 | 0.099 | 0.074 | 0.109 | 0.082 |
| Ours | **0.079** | **0.054** | **0.080** | **0.060** | **0.095** | **0.066** | **0.106** | **0.077** |
Table 4. The time-series prediction results on the ETTm2 dataset. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| FS-TSF | 0.069 | 0.059 | 0.074 | 0.063 | 0.082 | 0.074 | 0.089 | 0.082 |
| Hybrid-net | 0.080 | 0.072 | 0.088 | 0.078 | 0.096 | 0.088 | 0.107 | 0.092 |
| IMC-net | 0.058 | 0.039 | 0.065 | 0.043 | 0.070 | 0.046 | 0.081 | 0.054 |
| DAE-TSF | 0.043 | 0.031 | 0.048 | 0.037 | 0.059 | 0.044 | 0.084 | 0.060 |
| SAC-ConvLSTM | 0.095 | 0.077 | 0.104 | 0.086 | 0.117 | 0.095 | 0.126 | 0.109 |
| LT-TSF | 0.083 | 0.065 | 0.101 | 0.079 | 0.110 | 0.084 | 0.121 | 0.082 |
| Timecma | 0.047 | 0.045 | 0.054 | 0.040 | 0.756 | 0.043 | 0.087 | 0.055 |
| Gpt4mts | 0.064 | 0.042 | 0.073 | 0.056 | 0.088 | 0.056 | 0.096 | 0.073 |
| Effformer | 0.044 | 0.033 | 0.052 | 0.036 | 0.070 | 0.047 | 0.085 | 0.056 |
| TimeKAN | 0.051 | 0.033 | 0.060 | 0.038 | 0.068 | 0.045 | 0.078 | 0.051 |
| Ours | **0.040** | **0.028** | **0.047** | **0.033** | **0.054** | **0.038** | **0.063** | **0.045** |
Table 5. The time-series prediction results on the Electricity dataset. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| FS-TSF | 0.091 | 0.067 | 0.103 | 0.074 | 0.123 | 0.077 | 0.126 | 0.079 |
| Hybrid-net | 0.083 | 0.056 | 0.094 | 0.070 | 0.112 | 0.079 | 0.120 | 0.086 |
| IMC-net | 0.077 | 0.058 | 0.081 | 0.063 | 0.087 | 0.069 | 0.097 | 0.075 |
| DAE-TSF | 0.085 | 0.071 | 0.097 | 0.068 | 0.107 | 0.075 | 0.153 | 0.092 |
| SAC-ConvLSTM | 0.088 | 0.068 | 0.096 | 0.065 | 0.106 | 0.075 | 0.135 | 0.077 |
| LT-TSF | 0.078 | 0.057 | 0.084 | 0.054 | 0.087 | 0.059 | 0.092 | 0.065 |
| Timecma | 0.082 | 0.059 | 0.087 | 0.057 | 0.079 | 0.056 | 0.096 | 0.071 |
| Gpt4mts | 0.074 | 0.045 | 0.072 | 0.044 | 0.073 | 0.047 | 0.086 | 0.058 |
| Effformer | 0.072 | 0.045 | 0.074 | 0.047 | 0.077 | 0.051 | 0.086 | 0.058 |
| TimeKAN | 0.075 | 0.048 | 0.079 | 0.052 | 0.080 | 0.055 | 0.093 | 0.063 |
| Ours | **0.069** | **0.041** | **0.071** | **0.043** | **0.072** | **0.045** | **0.084** | **0.056** |
Table 6. The results of the cumulative relative error on the ETTm1 and Electricity datasets. Bold indicates the best result.

| Method | ETTm1 (T=96) | Electricity (T=96) | ETTm1 (T=336) | Electricity (T=336) | ETTm1 (T=720) | Electricity (T=720) |
|---|---|---|---|---|---|---|
| IMC-net | −0.0018 | −0.0048 | 0.0065 | 0.0088 | 0.0139 | 0.0166 |
| LT-TSF | 0.0019 | 0.0049 | 0.0067 | 0.0090 | 0.0137 | 0.0165 |
| Timecma | 0.0022 | 0.0053 | 0.0072 | 0.0096 | 0.0146 | 0.0168 |
| Gpt4mts | 0.0016 | 0.0042 | 0.0063 | 0.0084 | −0.0133 | −0.0154 |
| Effformer | −0.0014 | −0.0042 | 0.0064 | 0.0085 | −0.0135 | −0.0155 |
| TimeKAN | 0.0017 | 0.0044 | 0.0064 | 0.0086 | 0.0138 | 0.0159 |
| Ours | **0.0015** | **0.0038** | **0.0060** | **0.0080** | **0.0130** | **0.0150** |
Table 7. Loss analysis on the ETTh1 dataset in terms of RMSE and MAE. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| Variant_1 | 0.098 | 0.072 | 0.100 | 0.076 | 0.120 | 0.097 | 0.132 | 0.107 |
| Variant_2 | 0.089 | 0.063 | 0.094 | 0.069 | 0.110 | 0.083 | 0.113 | 0.086 |
| Variant_3 | 0.092 | 0.067 | 0.099 | 0.075 | 0.117 | 0.092 | 0.127 | 0.102 |
| Ours | **0.088** | **0.062** | **0.093** | **0.067** | **0.107** | **0.079** | **0.110** | **0.083** |
Table 8. Architecture analysis on the ETTh1 dataset in terms of RMSE and MAE. Bold indicates the best result.

| Method | T=96 RMSE | T=96 MAE | T=192 RMSE | T=192 MAE | T=336 RMSE | T=336 MAE | T=720 RMSE | T=720 MAE |
|---|---|---|---|---|---|---|---|---|
| Ours w/o LLM | 0.098 | 0.071 | 0.102 | 0.075 | 0.119 | 0.081 | 0.131 | 0.103 |
| Ours w/o KAN | 0.091 | 0.065 | 0.095 | 0.070 | 0.110 | 0.085 | 0.117 | 0.096 |
| Ours w/o Entropy | 0.090 | 0.065 | 0.098 | 0.072 | 0.115 | 0.086 | 0.121 | 0.097 |
| Ours | **0.088** | **0.062** | **0.093** | **0.067** | **0.107** | **0.079** | **0.110** | **0.083** |
