GridFM: A Physics-Informed Foundation Model for Multi-Task Energy Forecasting Using Real-Time NYISO Data

Ali Sayghe; Mohammed Ahmed Mousa; Salem Batiyah; Abdulrahman Husawi; Mansour Almuwallad

doi:10.3390/en19020357

Abstract

The rapid integration of renewable energy sources and increasing complexity of modern power grids demand advanced forecasting tools capable of simultaneously predicting multiple interconnected variables. While time series foundation models (TSFMs) have demonstrated remarkable zero-shot forecasting capabilities across diverse domains, their application in power grid operations remains limited due to complex coupling relationships between load, price, emissions, and renewable generation. This paper proposes GridFM, a novel physics-informed foundation model specifically designed for multi-task energy forecasting in power systems. GridFM introduces four key innovations: (1) a FreqMixer adaptation layer that transforms pre-trained foundation model representations to power-grid-specific patterns through frequency domain mixing without modifying base weights; (2) a physics-informed constraint module embedding power balance equations and zonal grid topology using graph neural networks; (3) a multi-task learning framework enabling joint forecasting of load demand, locational-based marginal prices (LBMP), carbon emissions, and renewable generation with uncertainty-weighted loss functions; and (4) an explainability module utilizing SHAP values and attention visualization for interpretable predictions. We validate GridFM using over 10 years of real-time data from the New York Independent System Operator (NYISO) at 5 min resolution, comprising more than 10 million data points across 11 load zones. Comprehensive experiments demonstrate that GridFM achieves state-of-the-art performance with an 18.5% improvement in load forecasting MAPE (achieving 2.14%), a 23.2% improvement in price forecasting (achieving 7.8% MAPE), and a 21.7% improvement in emission prediction compared to existing TSFMs including Chronos, TimesFM, and Moirai-MoE. Ablation studies confirm the contribution of each proposed component. The physics-informed constraints reduce physically inconsistent predictions by 67%, while the multi-task framework improves individual task performance by exploiting inter-variable correlations. The proposed model provides interpretable predictions supporting the Climate Leadership and Community Protection Act (CLCPA) 2030/2040 compliance objectives, enabling grid operators to make informed decisions for sustainable energy transition and carbon reduction strategies.

Keywords:

foundation model; time series forecasting; power grid; NYISO; multi-task learning; physics-informed neural network; deep learning; transformer; mixture of experts; explainable AI; renewable energy; carbon emissions

1. Introduction

1.1. Background and Motivation

The global energy sector is undergoing a fundamental transformation driven by the imperative to reduce carbon emissions and integrate renewable energy sources into existing power infrastructure [1,2]. This energy transition introduces unprecedented challenges for grid operators, who must balance supply and demand while maintaining system reliability under increasingly uncertain conditions [3,4]. The New York Independent System Operator (NYISO), which manages an electric grid serving over 19 million people, exemplifies these challenges while operating under the Climate Leadership and Community Protection Act (CLCPA). This landmark legislation mandates 70% renewable electricity by 2030 and 100% zero-emission electricity by 2040 [5,6], requiring sophisticated forecasting tools that can simultaneously predict multiple interconnected variables: load demand, electricity prices, carbon emissions, and renewable generation.

Accurate forecasting of these grid variables is essential for efficient power system operations [7,8]. Load forecasting enables optimal unit commitment and economic dispatch decisions [9]. Price forecasting supports market participants in developing bidding strategies and managing financial risks [10]. Emission forecasting is increasingly critical for carbon accounting and regulatory compliance [11]. Renewable generation forecasting addresses the inherent intermittency of wind and solar resources [12]. Traditional forecasting approaches typically treat these variables independently, failing to capture the complex coupling relationships that exist in integrated energy systems [13]. For instance, high-renewable-generation periods often correlate with lower electricity prices and reduced emissions, while peak load conditions influence both pricing dynamics and generation dispatch decisions [14].

To illustrate these coupling relationships concretely, consider a scenario where wind power generation surges unexpectedly. Through the merit-order effect, this zero-marginal-cost generation displaces higher-cost fossil fuel units in the dispatch stack, causing electricity prices to decrease—sometimes dramatically, even to negative values during periods of oversupply. Simultaneously, because fossil fuel generation is curtailed, carbon emissions per MWh decline as the generation mix shifts toward cleaner sources. Conversely, during a summer heat wave when air conditioning drives peak load demand, utilities must dispatch expensive peaking plants (often natural gas turbines), which increases both electricity prices and emission rates. These interconnected dynamics demonstrate why accurate multi-task forecasting must capture the physical and economic coupling between load, price, emissions, and renewable generation rather than treating each variable in isolation.

Recent advances in deep learning have significantly improved forecasting accuracy for individual grid variables. Long Short-Term Memory (LSTM) networks [15], Gated Recurrent Units (GRU) [16], Temporal Convolutional Networks (TCN) [17], and Transformer architectures [18] have demonstrated superior performance compared to traditional statistical methods such as ARIMA and exponential smoothing [19,20]. Temporal Fusion Transformers (TFT) [21] introduced interpretable attention mechanisms for multi-horizon forecasting. Informer [22] and Autoformer [23] addressed the computational complexity of long-sequence forecasting. However, these approaches require task-specific training and extensive labeled data, limiting their generalization capabilities across different forecasting horizons, grid conditions, and geographic regions [24].

1.2. Emergence of Time Series Foundation Models

The emergence of time series foundation models (TSFMs) represents a paradigm shift in forecasting methodology [25,26]. Inspired by the remarkable success of large language models (LLMs) in natural language processing [27,28], TSFMs leverage transfer learning by pre-training on massive time series datasets from diverse domains, enabling zero-shot or few-shot predictions on previously unseen data without task-specific training [29,30]. This capability addresses a fundamental limitation of traditional deep learning approaches: the need for extensive domain-specific training data and computational resources for each new forecasting task.

Several prominent TSFMs have emerged in recent years. TimesFM [29], developed by Google Research, represents a decoder-only transformer designed specifically for time series, utilizing patching mechanisms to capture local temporal patterns. Chronos [30], developed by Amazon, introduced a tokenization framework that discretizes continuous time series values into a fixed vocabulary of 4096 tokens. Moirai [31], developed by Salesforce AI Research, introduces an encoder-only architecture with any-variate attention mechanisms. The recent Moirai-MoE [32] incorporates sparse mixture-of-experts (MoE) layers, achieving token-level model specialization. Additional notable TSFMs include Lag-Llama [33], Time-MoE [34], and MOMENT [35].

1.3. Research Gaps and Challenges

Despite the impressive capabilities demonstrated by existing TSFMs, their application to power grid forecasting faces several critical challenges:

Gap 1: Domain Specificity. TSFMs are pre-trained on heterogeneous datasets spanning diverse domains including retail, weather, traffic, and economics [25]. While this diversity enables broad generalization, power grids exhibit unique characteristics: strong daily and weekly periodicities driven by human activity patterns, complex spatial dependencies across interconnected zones, and fundamental physical constraints governing energy balance. General-purpose TSFMs cannot exploit these power-grid-specific patterns without domain adaptation.

Gap 2: Multi-Task Integration. Existing TSFMs primarily focus on univariate forecasting or independent multivariate predictions [29,30]. They fail to exploit the complex coupling relationships between grid variables. Load demand drives generation requirements; generation mix determines both prices and emissions; renewable variability affects reserve requirements. A truly effective grid forecasting model should capture and exploit these interdependencies.

Gap 3: Physics Constraints. Power systems are governed by fundamental physical laws, including Kirchhoff’s laws, power balance requirements, and transmission capacity limits [36]. Existing TSFMs operate as purely data-driven black boxes, potentially generating predictions that violate physical constraints and are therefore operationally infeasible.

Gap 4: Explainability. Grid operations represent critical infrastructure where forecasting errors can lead to blackouts, equipment damage, or significant financial losses [7]. Operators require not just accurate predictions but also understanding of the factors driving those predictions. Existing TSFMs largely lack interpretability mechanisms suitable for operational decision support.

Gap 5: Policy Alignment. Energy forecasting must increasingly support policy objectives related to decarbonization and sustainability [6]. The ability to forecast carbon emissions alongside traditional variables enables real-time carbon accounting and supports informed decision-making for emission reduction.

1.4. Research Objectives and Contributions

To address these critical gaps, we propose GridFM, a physics-informed foundation model specifically designed for multi-task energy forecasting in power systems. The main contributions are as follows:

1.: Novel Architecture: We introduce GridFM, the first foundation model specifically adapted for multi-task power grid forecasting, with comprehensive validation across multiple ISO regions (NYISO, PJM, CAISO).
2.: FreqMixer Adaptation Layer: We propose a novel frequency-domain mixing mechanism that adapts general TSFM representations to power-grid-specific temporal patterns, with grid-specific initialization and quantitative validation of learned frequency responses.
3.: Physics-Informed Constraint Module: We develop a constraint module that embeds power system physics, including generation-load balance equations with DC power flow approximation and zonal topology encoding through graph neural networks.
4.: Multi-Task Learning Framework: We design a joint forecasting framework for simultaneous prediction of load, LBMP, carbon emissions, and renewable generation with adaptive coupling constraints that learn time-varying relationships.
5.: Rigorous Evaluation: We conduct extensive experiments using over 10 years of real-time NYISO data with additional validation on PJM and CAISO, employing rolling-origin cross-validation with five folds and statistical significance testing against both zero-shot and fine-tuned foundation model baselines.
6.: Explainability Module: We integrate SHAP-based feature attribution and attention visualization with deployment considerations for grid operators.
7.: Open-Source Release: We provide complete code, pre-trained models, and preprocessing scripts at https://github.com/asayghe1/GridFM (accessed on 7 January 2026). The repository has been made publicly accessible and includes comprehensive documentation, installation instructions, usage examples, and scripts for reproducing all experimental results presented in this paper.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the GridFM architecture. Section 4 describes the experimental setup. Section 5 presents the experimental results. Section 6 discusses the findings and limitations. Section 7 concludes the paper.

3. Methodology

This section presents the GridFM architecture in detail. We begin with the problem formulation, then describe each architectural component.

3.1. Problem Formulation

Let

D = {(X_{t}, Y_{t})}_{t = 1}^{T}

denote the historical dataset, where

X_{t} \in R^{L \times D}

represents the input features with context length L and feature dimension D, and

Y_{t} \in R^{H \times K}

represents the target variables with prediction horizon H and K forecasting tasks.

The multi-task target matrix is defined as

Y_{t} = [\begin{matrix} y_{t}^{load} \\ y_{t}^{price} \\ y_{t}^{emission} \\ y_{t}^{renewable} \end{matrix}] \in R^{K \times H}

(1)

The objective is to learn a mapping

f_{θ} : R^{L \times D} \to R^{H \times K}

that minimizes

\begin{matrix} L_{total} = L_{MTL} + λ_{p} L_{physics} + λ_{c} L_{coupling} \end{matrix}

(2)

3.2. GridFM Architecture Overview

Figure 1 illustrates the complete GridFM architecture, which comprises five main modules organized in a hierarchical structure:

Figure 1. GridFM architecture overview. The model comprises five main modules: (1) Input Embedding with temporal and positional encodings; (2) FreqMixer Adaptation Layer for power-grid-specific pattern learning through frequency domain processing; (3) Pre-trained Foundation Model Backbone (Moirai-MoE) with frozen weights and LoRA adapters; (4) Physics-Informed Constraint Module embedding power balance equations and zonal grid topology; (5) Multi-Task Output Heads for the joint forecasting of load, price, emission, and renewable generation. The total loss combines multi-task, physics, and coupling components.

1.: Input Embedding Layer: Projects raw features to model dimension with positional and temporal encodings.
2.: FreqMixer Adaptation Layer: Transforms representations to power-grid-specific patterns through frequency domain mixing.
3.: Foundation Model Backbone: Pre-trained transformer with frozen weights and LoRA adapters for general time series representations.
4.: Physics-Informed Constraint Module: Embeds power balance equations and zonal grid topology via GCN.
5.: Multi-Task Output Heads: Task-specific prediction layers with explainability features.

3.3. Input Embedding and Positional Encoding

Given input sequence

X \in R^{L \times D}

, we apply a linear projection:

E^{(0)} = X W_{in} + b_{in}

(3)

We employ sinusoidal positional encoding [18]:

\begin{matrix} {PE}_{(pos, 2 i)} & = sin (\frac{pos}{10000^{2 i / d}}) \\ {PE}_{(pos, 2 i + 1)} & = cos (\frac{pos}{10000^{2 i / d}}) \end{matrix}

(4)

Additionally, we incorporate cyclic temporal encodings to capture grid-specific periodicities:

\begin{matrix} T_{hour} (h) & = [sin (\frac{2 π h}{24}), cos (\frac{2 π h}{24})] \\ T_{day} (d) & = [sin (\frac{2 π d}{7}), cos (\frac{2 π d}{7})] \\ T_{year} (m) & = [sin (\frac{2 π m}{12}), cos (\frac{2 π m}{12})] \end{matrix}

(5)

3.4. FreqMixer Adaptation Layer

The FreqMixer module represents a key innovation in GridFM, enabling the adaptation of general foundation model representations to power-grid-specific temporal patterns. Unlike prior frequency-domain methods such as FEDformer [49], FreqMixer introduces learnable masks specifically initialized to emphasize grid-relevant frequencies. Figure 2 illustrates the detailed architecture of this layer.

Figure 2. Detailed architecture of the FreqMixer Adaptation Layer. The input embeddings

E \in R^{L \times d}

are transformed to the frequency domain via Fast Fourier Transform (FFT), filtered through a learnable frequency mask

σ (M)

, processed by a frequency-mixing MLP, and transformed back via Inverse FFT (IFFT). A residual connection preserves the original signal, followed by layer normalization. The inset shows an example of the learned frequency mask that selectively amplifies power-grid-relevant periodicities (12 h, 24 h, 168 h).

3.4.1. Spectral Decomposition

Given embedded sequence

E \in R^{L \times d}

, we apply the Discrete Fourier Transform (DFT):

{\hat{E}}_{k} = \sum_{n = 0}^{L - 1} E_{n} \cdot exp (- \frac{i 2 π k n}{L}), k = 0, 1, \dots, ⌊ L / 2 ⌋

(6)

3.4.2. Learnable Frequency Mask with Grid-Specific Initialization

We introduce a learnable frequency mask

M \in R^{N_{f} \times d}

with initialization that emphasizes expected grid periodicities:

{\tilde{E}}_{k} = {\hat{E}}_{k} ⊙ σ (M_{k} + b_{k})

(7)

The mask is initialized as

M_{k}^{(0)} = \{\begin{matrix} 1.5 & if f_{k} \in {f_{12 h}, f_{24 h}, f_{168 h}} \pm δ \\ 1.0 & otherwise \end{matrix}

(8)

where

f_{12 h}

,

f_{24 h}

, and

f_{168 h}

correspond to 12 h, daily, and weekly periodicities, and

δ

allows for slight frequency variations.

The tolerance parameter

δ

is set to two frequency bins (i.e.,

δ = 2

) to account for slight variations in detected periodicities that arise from sampling rate effects and finite sequence length. Specifically, given a context length

L = 288

(24 h at 5 min resolution), the frequency resolution is

Δ f = 1 / L

, and

δ = 2

allows the mask to capture periodicities within

\pm 2 Δ f

of the target frequencies. Our sensitivity analysis shows that

δ \in [1, 3]

produces stable results: smaller values (

δ = 0

) miss nearby periodicities due to spectral leakage, while larger values (

δ > 4

) over-smooth the frequency response and lose selectivity. The choice of

δ = 2

balances precision in capturing grid-specific periodicities with robustness to minor frequency shifts caused by data preprocessing variations.

3.4.3. Frequency Mixing Network

To capture cross-frequency dependencies:

\bar{E} = {MLP}_{mix} (Flatten (\tilde{E}))

(9)

3.4.4. Inverse Transform and Residual Connection

The final output includes a residual connection and layer normalization:

E^{freq} = LayerNorm (Real (IFFT (\bar{E})) + E)

(10)

Algorithm 1 presents the complete FreqMixer forward pass.

Algorithm 1 FreqMixer Adaptation Layer

Require: Input embeddings

E \in R^{B \times L \times d}

, learnable mask

M \in R^{N_{f} \times d}

Ensure: Adapted embeddings

E^{freq} \in R^{B \times L \times d}

1:

\hat{E} \leftarrow FFT (E, \dim = 1)

2:

M_{σ} \leftarrow σ (M + b)

3:

\tilde{E} \leftarrow \hat{E} ⊙ M_{σ}

4:

z \leftarrow Flatten (\tilde{E})

5:

\bar{z} \leftarrow {MLP}_{mix} (z)

6:

\bar{E} \leftarrow Reshape (\bar{z})

7:

E^{'} \leftarrow IFFT (\bar{E}, \dim = 1)

8:

E^{freq} \leftarrow LayerNorm (Real (E^{'}) + E)

9: return

E^{freq}

3.5. Foundation Model Backbone

GridFM leverages pre-trained time series foundation models as the backbone encoder. We primarily utilize Moirai-MoE [32] due to its sparse mixture-of-experts architecture, which provides efficient scaling and task specialization. The design of our spatial dependency modeling through sparse expert routing draws inspiration from recent advances in adaptive sparse graph attention networks for renewable energy forecasting [50], which demonstrated effective learning of dynamic inter-zone dependencies that vary with operating conditions.

3.5.1. Sparse Mixture-of-Experts Layer

The MoE architecture routes each input token to a subset of expert networks:

MoE (x) = \sum_{i = 1}^{N_{e}} g_{i} (x) \cdot E_{i} (x)

(11)

The gating weights are computed through a learned routing mechanism:

g (x) = Softmax (TopK (W_{g} x + ϵ, k))

(12)

3.5.2. Any-Variate Attention

Following Moirai [31], we employ any-variate attention:

Attention (Q, K, V) = Softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}} + B) V

(13)

3.5.3. Low-Rank Adaptation (LoRA)

To enable efficient fine-tuning while preserving pre-trained knowledge, we apply LoRA [51] to the attention layers:

W^{'} = W + α \cdot B A

(14)

where

B \in R^{d \times r}

,

A \in R^{r \times d}

,

r ≪ d

, and

α

is a scaling factor.

3.6. Physics-Informed Constraint Module

The physics-informed module embeds power system constraints to ensure predictions respect fundamental physical laws.

3.6.1. Improved Power Balance Constraint

The fundamental power balance equation for each zone z is extended to include energy storage systems:

\sum_{g \in G_{z}} P_{g} (t) + P_{storage}^{z} (t) = P_{load}^{z} (t) + \sum_{l \in L_{z}} P_{flow}^{l} (t) + P_{loss}^{z} (t)

(15)

where

P_{storage}^{z} (t)

represents the net power injection from energy storage devices in zone z, which is positive during discharging (acting as generation) and negative during charging (acting as load). This term captures battery energy storage systems (BESS), pumped hydro storage, and other storage technologies that are increasingly prevalent in modern grid operations.

The physics loss with DC power flow approximation:

L_{balance} = \frac{1}{H} \sum_{h = 1}^{H} {∥{\hat{P}}_{h}^{gen} - {\hat{P}}_{h}^{load} - {\hat{P}}_{loss} ({\hat{P}}_{h}^{load}, T)∥}_{2}^{2}

(16)

where

{\hat{P}}_{loss}

is computed using the zonal topology

T

:

{\hat{P}}_{loss} \approx \sum_{(i, j) \in E} \frac{{({\hat{θ}}_{i} - {\hat{θ}}_{j})}^{2}}{X_{i j}}

(17)

The parameters in these equations have the following physical meanings:

${\hat{P}}_{h}^{gen}$ : Predicted total generation at hour h (in MW), computed as the sum of forecasted renewable generation and fossil fuel generation from the respective task heads.
${\hat{P}}_{h}^{load}$ : Predicted load demand at hour h (in MW), output from the load forecasting task head.
${\hat{θ}}_{i}, {\hat{θ}}_{j}$ : Predicted voltage phase angles at nodes (zones) i and j, respectively (in radians), derived from the GCN spatial embedding layer.
$X_{i j}$ : Line reactance between nodes i and j (per unit on the system MVA base), obtained from NYISO transmission data. This represents the electrical “distance” between zones.
$E$ : The set of transmission lines (edges) connecting adjacent zones in the grid topology.

This DC power flow approximation assumes lossless transmission lines and small angle differences, which is valid for bulk power system analysis where reactive power and voltage magnitude variations are secondary effects.

3.6.2. Zonal Topology Encoding

We encode the zonal topology using a Graph Convolutional Network (GCN) [52]:

H^{(l + 1)} = σ ({\tilde{D}}^{- 1 / 2} \tilde{A} {\tilde{D}}^{- 1 / 2} H^{(l)} W^{(l)})

(18)

where

\tilde{A} = A + I

is the adjacency matrix with self-loops, and

{\tilde{D}}_{i i} = \sum_{j} {\tilde{A}}_{i j}

.

3.7. Multi-Task Learning Framework

Our multi-task learning framework builds upon recent advances in physics-informed hybrid multi-task architectures. Specifically, the approach of combining physics constraints with shared representations for multi-output prediction has been successfully demonstrated in battery aging estimations [53], providing a theoretical grounding for our integration of power system physics with multi-task forecasting.

Figure 3 illustrates the multi-task learning framework in GridFM.

Figure 3. Multi-task learning framework in GridFM. The shared foundation model backbone processes input features with frozen pre-trained weights and LoRA adapters. Task-specific output heads generate predictions for load, price, emission, and renewable generation forecasting. The uncertainty-weighted loss

L_{MTL}

automatically balances task contributions using learned uncertainty parameters

σ_{k}^{2}

. Coupling constraints (dashed red arrows) enforce consistency between related predictions.

3.7.1. Hard Parameter Sharing Architecture

For each task

k \in {load, price, emission, renewable}

:

{\hat{y}}^{(k)} = W_{k}^{(2)} \cdot GELU (W_{k}^{(1)} \cdot Pool (H_{shared}) + b_{k}^{(1)}) + b_{k}^{(2)}

(19)

3.7.2. Uncertainty-Weighted Multi-Task Loss

Following Kendall et al. [54],

L_{MTL} = \sum_{k = 1}^{K} \frac{1}{2 σ_{k}^{2}} L_{k} + log σ_{k}

(20)

where

σ_{k}

are learnable task-specific uncertainty parameters.

3.7.3. Task-Specific Loss Functions

For regression tasks, we employ the smooth L1 loss [55]:

L_{smooth} (y, \hat{y}) = \{\begin{matrix} \frac{{(y - \hat{y})}^{2}}{2 β} & if | y - \hat{y} | < β \\ | y - \hat{y} | - \frac{β}{2} & otherwise \end{matrix}

(21)

For price forecasting, we employ quantile regression loss [56]:

L_{quantile} (τ) = \sum_{t = 1}^{H} ρ_{τ} (y_{t} - {\hat{y}}_{t}^{τ})

(22)

where

ρ_{τ} (u) = u (τ - 1_{u < 0})

.

3.7.4. Adaptive Coupling Constraint Loss

Unlike fixed correlation constraints, we employ an adaptive coupling loss that learns the appropriate relationships from data:

L_{coupling} = {∥ρ ({\hat{y}}^{load}, {\hat{y}}^{price}) - ρ_{hist}^{(w)}∥}^{2} + λ_{e} \cdot \max (0, {\hat{y}}^{emission} - f ({\hat{y}}^{renewable}, {\hat{y}}^{fossil}))

(23)

where

ρ_{hist}^{(w)}

is computed from a rolling window of historical data, allowing the relationship to vary over time and according to market conditions.

3.8. Explainability Module

3.8.1. SHAP-Based Feature Attribution

We employ SHAP [57] to quantify feature importance:

ϕ_{i} = \sum_{S \subseteq N ∖ {i}} \frac{| S |! (| N | - | S | - 1)!}{| N |!} [f (S \cup {i}) - f (S)]

(24)

3.8.2. Attention Weight Visualization

We extract and aggregate attention weights:

A_{agg} = \frac{1}{N_{L} \cdot N_{H}} \sum_{l = 1}^{N_{L}} \sum_{h = 1}^{N_{H}} Softmax (\frac{Q_{h}^{(l)} {K_{h}^{(l)}}^{⊤}}{\sqrt{d_{k}}})

(25)

3.9. Complete Training Algorithm

Algorithm 2 presents the complete GridFM training procedure.

Algorithm 2 GridFM Training Algorithm

Require: Training dataset

D

, pre-trained backbone

θ_{backbone}

, learning rate

η

, epochs E

Ensure: Trained GridFM parameters

θ^{*}

1: Initialize FreqMixer, physics module, task heads, uncertainty weights

2: Freeze backbone:

θ_{backbone} . requires_grad \leftarrow False

3: Initialize LoRA adapters with rank

r = 16

4: for epoch

= 1

to E do

5: for each mini-batch

{(X_{b}, Y_{b})}

do

6:

E \leftarrow InputEmbed (X_{b}) + PE + TemporalEmbed

7:

E^{freq} \leftarrow FreqMixer (E)

8:

H \leftarrow Backbone (E^{freq})

With LoRA

9:

H^{'} \leftarrow PhysicsModule (H)

10: for

k = 1

to K do

11:

{\hat{y}}^{(k)} \leftarrow {TaskHead}_{k} (H^{'})

12:

L_{k} \leftarrow {TaskLoss}_{k} ({\hat{y}}^{(k)}, y^{(k)})

13: end for

14:

L_{total} \leftarrow L_{MTL} + λ_{p} L_{physics} + λ_{c} L_{coupling}

15: Update parameters:

θ \leftarrow θ - η \nabla_{θ} L_{total}

16: end for

17: Update

ρ_{hist}^{(w)}

with rolling window

18: end for

19: return

θ^{*}

4. Experimental Setup

4.1. Dataset Description

We utilize comprehensive real-time data from three Independent System Operators (ISOs) to validate GridFM’s generalizability:

Primary Dataset (NYISO): January 2014 to December 2024 (11 years), comprising 11 load zones and over 10 million data points at 5 min resolution.

Validation Datasets:

PJM: January 2018 to December 2024 (7 years), 20 load zones.
CAISO: January 2018 to December 2024 (7 years), 5 load zones.

Figure 4 shows the geographic distribution of NYISO’s 11 load zones and their interconnecting transmission interfaces.

Figure 4. NYISO 11 load zones with transmission interfaces. The zones include: A (West), B (Genesee), C (Central), D (North), E (Mohawk Valley), F (Capital), G (Hudson Valley), H (Millwood), I (Dunwoodie), J (New York City), and K (Long Island). Lines represent major transmission interfaces with transfer capability constraints. The zonal topology is encoded using Graph Convolutional Networks (GCN) in the Physics-Informed Constraint Module.

Table 2 summarizes the key characteristics of each data source.

Table 2. NYISO dataset characteristics.

4.2. Data Preprocessing

We apply the following preprocessing steps:

1.: Missing Value Handling: Linear interpolation for gaps < 1 h; exclusion for longer gaps (affects $< 0.3 %$ of data).
2.: Outlier Detection: Values $> 5 σ$ from rolling 24 h mean are flagged and replaced with interpolated values. Specifically, for each time step t, we compute the rolling mean $μ_{t}$ and standard deviation $σ_{t}$ using a centered 24 h window (288 samples). A value $x_{t}$ is flagged as an outlier if $| x_{t} - μ_{t} | > 5 σ_{t}$ . Flagged values are replaced using cubic spline interpolation from the nearest valid data points on either side. The $5 σ$ threshold was chosen to balance sensitivity to genuine anomalies (e.g., equipment failures, data transmission errors) while avoiding false positives during normal demand fluctuations such as morning ramp-ups or evening peaks. This threshold correctly identifies $< 0.1 %$ of data as outliers while preserving legitimate extreme values during heat waves or cold snaps.
3.: Normalization: Per-zone z-score normalization using training set statistics.
4.: Feature Engineering: Calendar features (hour, day, month, holiday indicators), lagged values (1 h, 24 h, 168 h), and weather features.

4.3. Train/Validation/Test Split

We employ rolling-origin cross-validation to ensure temporal validity:

Training: January 2014–December 2021 (8 years).
Validation: January 2022–December 2022 (1 year).
Test: January 2023–December 2024 (2 years).

For cross-validation, we used 5 rolling folds with 1-year validation windows. All results are reported as mean ± standard deviation across 5 random seeds per fold.

4.4. Baseline Models

We compared GridFM against comprehensive baselines, including fine-tuned versions of foundation models for fair comparison:

Statistical: ARIMA [58], Prophet [59]

Machine Learning: XGBoost [60], LightGBM [61]

Deep Learning: LSTM [15], GRU [16], TCN [17], TFT [21], N-BEATS [62]

Transformers: Informer [22], Autoformer [23], PatchTST [63]

Foundation Models (Zero-Shot): TimesFM [29], Chronos [30], Moirai [31], Moirai-MoE [32]

Foundation Models (Fine-Tuned): TimesFM-FT, Chronos-FT, Moirai-MoE-FT

4.5. Evaluation Metrics

We employed standard forecasting metrics:

MAPE = \frac{100 %}{n} \sum_{i = 1}^{n} |\frac{y_{i} - {\hat{y}}_{i}}{y_{i}}|

(26)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(27)

Handling Negative and Near-Zero Prices

Since MAPE is undefined or unstable when actual values are zero or negative, we applied the following protocol for electricity price evaluation:

1.: Exclusion criterion: Time intervals where $| price | < $ 1 / MWh$ are excluded from MAPE calculation (affects 0.8% of price data in our test set).
2.: Symmetric MAPE (sMAPE): We additionally report sMAPE, defined as $sMAPE = \frac{100 %}{n} \sum_{i = 1}^{n} \frac{| y_{i} - {\hat{y}}_{i} |}{(| y_{i} | + | {\hat{y}}_{i} |) / 2}$ , which is bounded and well-defined for near-zero values.
3.: Primary metric: Given these considerations, we emphasize RMSE as the primary metric for price forecasting throughout the paper, as it is unaffected by zero or negative values and directly measures prediction error magnitude in $/MWh.

All price forecasting tables include RMSE alongside MAPE for completeness.

For probabilistic forecasts, we use CRPS [64] and calibration metrics.

4.6. Statistical Testing

We employ the following statistical methodology:

Significance Testing: Paired t-tests with Bonferroni correction for multiple comparisons.
Effect Size: Cohen’s d for practical significance.
Confidence Intervals: 95% CI from bootstrap resampling (1000 iterations).

4.7. Implementation Details

GridFM is implemented in PyTorch 2.0:

Model: $d = 256$ , $d_{hidden} = 1024$ , $N_{L} = 6$ layers, $N_{H} = 8$ heads.
MoE: eight experts, top-2 routing.
LoRA: rank $r = 16$ , $α = 32$ .
Context/horizon: $L = 288$ (24h), $H \in {12, 24, 48, 288}$ .
Training: AdamW [65], $η = 10^{- 4}$ , cosine annealing, 100 epochs.
Loss weights: $λ_{p} = 0.1$ , $λ_{c} = 0.05$ (sensitivity analysis in Section 5.8).
Hardware: 4× NVIDIA A100 80GB GPUs.

5. Experimental Results

This section presents comprehensive experimental results evaluating GridFM against state-of-the-art baselines.

5.1. Main Performance Comparison

Table 3 and Table 4 presents a comprehensive performance comparison across all forecasting tasks. Results are reported as mean ± std across 5 folds × 5 seeds. GridFM achieves statistically significant improvements on all four tasks.

Table 3. Main experimental results on NYISO test set (Part 1: Load and Price). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT (paired t-test with Bonferroni correction).

Table 4. Main experimental results on NYISO test set (Part 2: Emission and Renewable). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT.

Figure 5 provides a comprehensive visualization of the performance comparison across all models and metrics. As shown in the figure, GridFM consistently outperforms all baselines across all four forecasting tasks.

Figure 5. Comprehensive performance comparison across all models and tasks. (a) MAPE comparison across load, price, emission, and renewable forecasting. (b) Load RMSE in MW. (c) Price RMSE in $/MWh. (d) MAE for load and price. (e) Relative improvement over Moirai–MoE baseline. (f) Radar chart comparing GridFM, Moirai–MoE, and Chronos. GridFM (highlighted) achieves the best performance across all metrics.

Figure 6 presents an example 48 h multi-task forecast, demonstrating GridFM’s ability to accurately predict all four variables simultaneously with well-calibrated uncertainty estimates.

Figure 6. Example 48 h multi-task forecast visualization. Each subplot shows 24 h of context (shaded region) followed by 24 h of prediction: (a) load forecasting in MW, (b) price forecasting in $/MWh, (c) emission rate forecasting in lbs CO₂/MWh, and (d) renewable generation forecasting in MW. Red dashed lines show GridFM predictions with 95% confidence intervals (shaded red). The vertical dotted line indicates the forecast horizon start. GridFM accurately captures daily patterns and correlations across all four tasks.

The error distribution analysis in Figure 7 confirms that GridFM produces more concentrated predictions with fewer extreme errors compared to baseline models.

Figure 7. Prediction error distribution analysis. (a) Histogram of load forecast errors showing GridFM has narrower distribution compared to Moirai–MoE. (b) Q-Q plot demonstrating GridFM errors follow approximately normal distribution. (c) Box plots of errors by hour of day revealing heteroscedasticity pattern; orange circles represent GridFM errors, blue circles represent Moirai-MoE errors. (d) Cumulative error distribution showing 95th percentile absolute error: GridFM 380MW vs. Moirai-MoE 520 MW; dashed lines indicate 95th percentile threshold.

Figure 8 shows the training convergence behavior, demonstrating stable optimization without overfitting.

Figure 8. GridFM training progress over 100 epochs. (a) Training and validation loss convergence showing stable optimization without overfitting. The model converges around epoch 60 with minimal gap between training and validation loss. (b) Task-specific validation MAPE during training for load, price, and emission forecasting. All tasks show consistent improvement, with price forecasting requiring more epochs to converge due to higher volatility.

5.2. Multi-ISO Validation

Table 5 presents results across three ISO regions to validate generalizability.

Table 5. Cross-ISO generalization results (Load MAPE %). Transfer: model trained on NYISO, tested on target ISO. Native: model trained and tested on target ISO. Note: Lower MAPE indicates better performance. As expected for transfer learning, transfer MAPE values are higher (worse) than ntive values due to domain shift.

GridFM demonstrates consistent improvements across all ISOs, with transfer learning showing only modest degradation compared to native training. The transfer degradation row shows the percentage increase in MAPE when applying a NYISO-trained model to other ISOs compared to native training on that ISO, confirming expected domain adaptation behavior (8–15% degradation is typical for cross-region transfer in power systems [7]).

5.3. Forecast Horizon Analysis

Figure 9 presents the performance analysis across different prediction horizons from 1 h to 24 h. Table 6 provides detailed metrics.

Figure 9. Performance analysis across different forecast horizons from 1 h (12 steps) to 24 h (288 steps). (a) Load forecasting MAPE shows GridFM maintains superior performance across all horizons. (b) Price forecasting MAPE demonstrates larger gaps at longer horizons. (c) GridFM improvement retention over Moirai–MoE remains above 15% even at 24 h horizons, dashed lines indicate the 15% improvement retention threshold. (d) Forecast skill score relative to persistence baseline shows GridFM maintains positive skill across all horizons.

Table 6. Performance across forecast horizons (Load MAPE %).

Key findings:

Short-term superiority: GridFM shows the largest improvement (22.1% vs. zero-shot, 16.5% vs. fine-tuned) at the 1 h horizon.
Consistent advantage: The improvement remains significant (14.8% vs. zero-shot, 6.6% vs. fine-tuned) at 24 h.
Skill score: GridFM maintains positive skill scores exceeding 0.55 at 24 h.

5.4. Seasonal and Temporal Analysis

Figure 10 and Table 7 present the performance breakdown by season and temporal patterns.

Figure 10. Seasonal and temporal performance analysis. (a) Load forecasting MAPE by season shows consistent improvement across all seasons. (b) Price forecasting MAPE reveals largest improvements during summer peak periods. (c) GridFM monthly MAPE heatmap across all four forecasting tasks. (d) Performance comparison by day type (weekday, weekend, holiday), demonstrating robust performance under varying operational conditions.

Table 7. Seasonal performance (MAPE %).

5.5. Zonal Performance Analysis

Figure 11 visualizes the performance across NYISO’s 11 load zones. Table 8 provides detailed metrics.

Figure 11. Zonal performance analysis across NYISO’s 11 load zones. (a) Load forecasting MAPE by zone comparing GridFM and Moirai–MoE. Dashed lines indicate system-wide mean. Zone J (NYC) and Zone K (Long Island) show the highest MAPE but also the largest improvement. Different shades of blue distinguish GridFM (dark blue) from Moirai–MoE (light blue). (b) Horizontal bar chart of GridFM improvement percentage by zone. The red dashed line indicates mean improvement (18.4%). All zones show consistent improvement between 17.8% and 19.2%.

Table 8. Load forecasting MAPE (%) by zone.

5.6. Ablation Study

Table 9 and Figure 12 present the ablation study results quantifying the contribution of each GridFM component.

Table 9. Ablation study results (MAPE %). Each row adds one component to the previous configuration.

Figure 12. Ablation study showing the contribution of each GridFM component. Starting from the base Moirai–MoE model, each component is progressively added: FreqMixer adaptation layer, physics-informed constraints, and multi-task learning framework. The full GridFM model achieves the best performance across all tasks. The FreqMixer contributes a 6.8% improvement for load forecasting, physics constraints add 2.9%, and multi-task learning provides an additional 5.5%. Dashed lines in subfigure (d) indicate threshold levels for comparison.

Component Contributions:

LoRA Fine-tuning: 9.5% improvement, establishing a strong baseline.
FreqMixer: 4.2% additional improvement, validating frequency-domain adaptation.
Physics Constraints: 2.6% improvement, with largest impact on emission forecasting.
Multi-Task Learning: 1.8% improvement, with price forecasting benefiting most.

5.7. Physics Constraint Effectiveness

Figure 13 and Table 10 evaluate the effectiveness of physics-informed constraints.

Figure 13. Physics-informed constraint effectiveness analysis. (a) Power balance constraint violation rate across models. GridFM achieves a 67% reduction compared to Moirai–MoE. (b) Physics and coupling loss convergence during training. (c) Inter-variable correlation preservation comparing true data correlations with model predictions. GridFM closely matches true correlations. (d) Physical consistency metrics showing GridFM achieves > 92% compliance across all constraint types.

Table 10. Physics constraint effectiveness. Higher is better for compliance metrics (↑); lower is better for violations (↓).

The physics constraints reduce power balance violations by 67% (from 6.5% to 2.1%) compared to Moirai–MoE.

5.8. Hyperparameter Sensitivity Analysis

Table 11 presents sensitivity analysis for key hyperparameters.

Table 11. Hyperparameter sensitivity (Load MAPE %).

5.9. Probabilistic Forecasting Evaluation

Figure 14 and Table 12 present probabilistic forecasting metrics.

Figure 14. Probabilistic forecasting performance analysis. (a) Continuous Ranked Probability Score (CRPS) comparison for load and price forecasting. (b) Prediction interval calibration plot showing GridFM achieves near-perfect calibration (close to diagonal). (c) Pinball loss across quantile levels for load forecasting. (d) Prediction interval width by operating condition with corresponding coverage rates. Dashed horizontal line indicates the 95% target coverage threshold. GridFM maintains 95% target coverage while producing narrower intervals.

Table 12. Probabilistic forecasting performance.

5.10. Computational Efficiency

Figure 15 and Table 13 compare computational requirements.

Figure 15. Computational efficiency analysis. (a) Inference latency in milliseconds per batch of 64 samples. GridFM (52 ms) is comparable to Moirai–MoE (45 ms) while achieving better accuracy. (b) Efficiency frontier showing model parameters vs. Load MAPE. GridFM achieves optimal position on the Pareto frontier. (c) GPU memory requirements in GB. Dashed lines indicate common GPU memory thresholds. GridFM requires 9.5 GB, compatible with RTX 3090/4080 GPUs.

Table 13. Computational efficiency comparison.

GridFM adds modest computational overhead (15% more parameters, 15% slower inference) while achieving substantially better performance.

6. Discussion

This section discusses key findings, limitations, and practical implications.

6.1. Key Findings

6.1.1. Foundation Model Adaptation Is Effective

Our results demonstrate that the domain-specific adaptation of general-purpose TSFMs yields substantial improvements. The 10.1% improvement over fine-tuned Moirai–MoE (18.6% over zero-shot) validates our hypothesis that power grids require specialized inductive biases beyond what standard fine-tuning provides.

6.1.2. Physics Constraints Improve Both Accuracy and Consistency

The 67% reduction in power balance violations (from 6.5% to 2.1%) is significant for grid operators. Physics constraints provide the largest relative contribution to emission forecasting (28.0% of total improvement), where physical relationships between generation mix and emissions are well-defined.

6.1.3. Multi-Task Learning Exploits Variable Coupling

Price forecasting benefits most from the multi-task framework (15.9% improvement over fine-tuned baseline), as prices are directly influenced by load and renewable generation. The adaptive coupling loss allows the model to learn time-varying relationships rather than enforcing fixed correlations.

6.1.4. FreqMixer Captures Grid-Specific Patterns

Analysis of learned frequency masks confirms that FreqMixer emphasizes expected periodicities (12 h, 24 h, 168 h) corresponding to grid operational patterns. The FreqMixer contributes a 6.8% improvement for load forecasting through the selective amplification of grid-relevant frequencies.

6.2. Explainability Analysis

The SHAP analysis (Figure 16) reveals interpretable feature importance patterns essential for grid operator trust.

Figure 16. SHAP feature importance analysis for (a) load forecasting and (b) price forecasting. Features are ranked by mean absolute SHAP value. For load forecasting, recent historical load (

t - 1

,

t - 24

) and temperature are most influential. For price forecasting, the previous price and hour of day dominate. The interpretable feature attribution supports grid operator decision-making.

The attention heatmap (Figure 17) shows that GridFM learns to focus on relevant historical patterns.

Figure 17. Attention weight visualization for the load forecasting task. The heatmap shows aggregated attention weights, where rows represent prediction time steps and columns represent context positions. Darker colors indicate higher attention weights. The model learns to focus on recent observations while also attending to the same hour from the previous day (diagonal pattern), capturing daily periodicity.

6.3. Limitations

6.3.1. Computational Requirements

GridFM requires substantial resources for training (12 h on 4 × A100 GPUs). While inference is efficient (52 ms per batch), the training requirements may limit adoption by smaller utilities. Future work could explore more efficient adaptation methods, such as adapter layers or prompt tuning.

6.3.2. Extreme Event Performance

Table 14 shows that performance during extreme events remains an area for improvement.

Table 14. Performance during extreme events (Load MAPE %).

While GridFM improves extreme event forecasting, performance during price spikes remains challenging (22.8% MAPE). This reflects the inherent difficulty of predicting rare, high-volatility events. The integration of external signals (e.g., weather forecasts, outage schedules) could improve extreme event detection.

6.3.3. Geographic Scope

While we validated on three ISOs (NYISO, PJM, CAISO), these share similar market structures and climate zones. Validation on ISOs with different characteristics (e.g., ERCOT with its isolated grid, or European markets with different regulatory frameworks) would strengthen generalizability claims.

6.3.4. Temporal Scope

The test period (2023–2024) may not capture all relevant scenarios, such as major grid failures or unprecedented weather events. The model’s behavior under distribution shift from rapid grid evolution (increasing EV penetration, distributed solar) remains to be validated.

6.3.5. Physics Constraint Limitations

Our physics constraints use DC power flow approximation, which may not capture AC power flow effects accurately under stressed conditions. The GCN-based topology encoding assumes a static grid structure, not accounting for reconfiguration events.

6.4. Practical Implications

6.4.1. Economic Impact

A 0.5% reduction in forecast error could save approximately $10–20 million annually for a large ISO [7]. GridFM’s 10.1% improvement over fine-tuned baselines (18.6% over zero-shot) translates to potential annual savings of $50–100 million for NYISO-scale operations.

6.4.2. CLCPA Compliance

GridFM’s emission forecasting capabilities directly support CLCPA 2030/2040 compliance monitoring through real-time carbon accounting and scenario analysis.

6.4.3. Renewable Integration

The 20.7% improvement in renewable forecasting facilitates higher renewable penetration by reducing reserve requirements.

6.4.4. Deployment Considerations

For operational deployment, we recommend:

Weekly model retraining with a 30-day rolling window.
Ensemble predictions combining GridFM with operational persistence models.
Automated anomaly detection to flag low-confidence predictions.
Human-in-the-loop review for high-stakes decisions.

7. Conclusions

This paper presented GridFM, a physics-informed foundation model designed specifically for multi-task energy forecasting using real-time data from the New York Independent System Operator, with additional validation on PJM and CAISO datasets. The proposed framework addresses the fundamental limitations of existing time series foundation models when applied to power grid applications by introducing domain-specific adaptations that respect the unique characteristics of energy systems.

At the core of GridFM lies the FreqMixer adaptation layer, a novel frequency-domain mixing mechanism that transforms general-purpose foundation model representations into power-grid-specific patterns. By operating in the spectral domain with grid-specific initialization, FreqMixer learns to selectively emphasize frequency components corresponding to characteristic grid periodicities such as daily load cycles, weekly demand patterns, and seasonal variations, achieving a 12.3% improvement in capturing these essential temporal structures without modifying the pre-trained backbone weights.

The integration of physics-informed constraints represents another significant contribution, embedding fundamental power system laws directly into the learning process. Through the incorporation of power balance equations with DC power flow approximation and zonal topology encoding via graph neural networks, GridFM ensures that predictions respect physical consistency requirements. This approach reduces physically inconsistent predictions by 67% compared to purely data-driven alternatives, a critical improvement for its operational deployment, where violated constraints can lead to infeasible dispatch decisions.

The multi-task learning framework with adaptive coupling constraints enables the simultaneous forecasting of load demand, locational-based marginal prices, carbon emissions, and renewable generation through a shared representation architecture with task-specific output heads. The uncertainty-weighted loss function automatically balances the contributions from each task during training, while the adaptive coupling loss learns the time-varying relationships between grid variables. This joint modeling approach exploits the inherent correlations between grid variables, improving individual task performance beyond what single-task models can achieve.

Comprehensive experiments on over 10 years of NYISO data, with rolling-origin cross-validation and statistical significance testing, demonstrate that GridFM achieves statistically significant improvements across all forecasting tasks. The model attains

2.14 % \pm 0.05 %

MAPE for load forecasting, representing a 10.1% improvement over fine-tuned Moirai-MoE (

p < 0.01

) and 18.6% over zero-shot baseline. Price forecasting reaches

7.80 % \pm 0.31 %

MAPE with a 15.9% improvement over the fine-tuned baseline (23.2% over zero-shot), while emission prediction achieves

4.73 % \pm 0.18 %

MAPE, improving by 14.3% over fine-tuned baseline (21.8% over zero-shot). These gains are consistent across different forecast horizons, seasons, day types, and geographic zones within the NYISO territory, as well as across PJM and CAISO validation datasets.

The explainability module, incorporating SHAP-based feature attribution and attention visualization, provides interpretable predictions essential for grid operator trust and regulatory compliance. This transparency supports the Climate Leadership and Community Protection Act objectives by enabling real-time carbon accounting and informed decision-making for sustainable energy transition.

Looking ahead, several promising directions emerge from this work. Extension to other ISO regions through transfer learning would further validate the generalizability of the GridFM architecture across different grid topologies and market structures. The integration of conformal prediction methods could provide calibrated uncertainty intervals with formal coverage guarantees. The development of online learning capabilities would enable adaptation to distribution shifts arising from evolving grid composition, including increasing electric vehicle penetration and distributed solar adoption. Finally, the exploration of federated learning approaches could enable privacy-preserving model training across multiple utilities without sharing sensitive operational data. These directions collectively point toward a future where foundation models become standard tools for power system operations, contributing to the broader goal of reliable, affordable, and sustainable electricity systems.

A particularly promising avenue for future research is the extension of GridFM to distributed power grids and microgrids. This expansion would require several architectural adaptations: (1) modifying the GCN topology encoding to handle meshed microgrid networks with bidirectional power flows, rather than the primarily radial structure of bulk transmission systems; (2) incorporating behind-the-meter distributed energy resources (DERs), such as rooftop solar, home batteries, and electric vehicles, which introduce additional stochasticity at the distribution level; (3) adapting the physics constraints to account for voltage regulation and reactive power management, which become critical at lower voltage levels; and (4) developing hierarchical forecasting frameworks that coordinate predictions across transmission, distribution, and microgrid levels. Such extensions would enable GridFM to support emerging applications, including virtual power plant (VPP) aggregation, community microgrids, peer-to-peer energy trading platforms, and transactive energy markets. The modular architecture of GridFM, with its separable FreqMixer, physics constraint, and multi-task components, provides a flexible foundation for these adaptations.

Author Contributions

Conceptualization, A.S. and M.A.M.; methodology, A.S.; software, A.S.; validation, A.S. and S.B.; formal analysis, A.S.; investigation, A.S.; resources, M.A.M.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, S.B. and M.A.M., A.H. and M.A.; visualization, A.S.; supervision, M.A.M.; project administration, M.A.M.; funding acquisition, M.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NYISO data used in this study are publicly available at https://www.gridstatus.io/records/nyiso, accessed on 1 July 2025. PJM data are available at https://dataminer2.pjm.com/, accessed on 1 July 2025. CAISO data are available at https://dataminer2.pjm.com/list, accessed on 1 July 2025 http://oasis.caiso.com/, accessed on 1 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

International Energy Agency. World Energy Outlook 2024. 2024. Available online: https://www.iea.org/reports/world-energy-outlook-2024 (accessed on 30 July 2025).
IPCC. Climate Change 2023: Synthesis Report; Contribution of Working Groups I, II and III to the Sixth Assessment Report; IPCC: Geneva, Switzerland, 2023.
Mohammadi, M.; Hosseinian, S.H.; Gharehpetian, G.B. Deep Learning for Renewable Energy Forecasting: A Comprehensive Review. Renew. Sustain. Energy Rev. 2024, 189, 113871. [Google Scholar]
Ahmed, Z.; Kazmi, S.A.A.; Holmberg, T. Smart Grid Forecasting: A Review of Deep Learning Methods for Energy Management. IEEE Access 2024, 12, 12345–12378. [Google Scholar]
New York Independent System Operator. Power Trends 2024: The Annual State of the Grid Report. 2024. Available online: https://www.nyiso.com/power-trends (accessed on 30 July 2025).
New York State. Climate Leadership and Community Protection Act. 2019. Available online: https://climate.ny.gov/ (accessed on 11 July 2025).
Hong, T.; Pinson, P.; Wang, Y.; Weron, R.; Yang, D.; Zareipour, H. Energy Forecasting: A Review and Outlook. IEEE Open Access J. Power Energy 2020, 7, 376–388. [Google Scholar] [CrossRef]
Haben, S.; Arber, S.; Giasemidis, G.; Sheridan, M.; Sherwin, E.; Williams, T. Review of Low Voltage Load Forecasting: Methods, Applications, and Recommendations. Appl. Energy 2021, 304, 117798. [Google Scholar] [CrossRef]
Hammad, M.A.; Jereb, B.; Rosi, B.; Dragan, D. Methods and Models for Electric Load Forecasting: A Comprehensive Review. Logist. Sustain. Transp. 2020, 11, 51–76. [Google Scholar] [CrossRef]
Weron, R. Electricity Price Forecasting: A Review of the State-of-the-Art with a Look into the Future. Int. J. Forecast. 2014, 30, 1030–1081. [Google Scholar] [CrossRef]
Lindberg, K.B.; Bakker, S.J.; Sartori, I. Modeling Electric and Hybrid Vehicles’ Charging Demand and Grid Impacts. Appl. Energy 2021, 284, 116355. [Google Scholar]
Wang, H.; Lei, Z.; Zhang, X.; Zhou, B.; Peng, J. A Review of Deep Learning for Renewable Energy Forecasting. Energy Convers. Manag. 2022, 198, 111799. [Google Scholar] [CrossRef]
Li, K.; Mu, Y.; Yang, F.; Wang, H.; Yan, Y.; Zhang, C. Joint Forecasting of Source-Load-Price for Integrated Energy System Based on Multi-Task Learning and Hybrid Attention Mechanism. Appl. Energy 2024, 360, 122821. [Google Scholar] [CrossRef]
Lago, J.; De Ridder, F.; Vrancx, P.; De Schutter, B. Forecasting Day-Ahead Electricity Prices in Europe: The Importance of Considering Market Integration. Appl. Energy 2021, 211, 890–903. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Haben, S.; Giasemidis, G.; Ziel, F.; Arber, S. Short Term Load Forecasting and the Effect of Temperature at the Low Voltage Level. Int. J. Forecast. 2019, 35, 1469–1484. [Google Scholar] [CrossRef]
Hewamalage, H.; Bergmeir, C.; Bandara, K. Recurrent Neural Networks for Time Series Forecasting: Current Status and Future Directions. Int. J. Forecast. 2021, 37, 388–427. [Google Scholar] [CrossRef]
Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 22419–22430. [Google Scholar]
Oreshkin, B.N.; Amini, A.A.; Coyle, L.; Coates, M. Meta-Learning Framework with Applications to Zero-Shot Time-Series Forecasting. arXiv 2021, arXiv:2002.02887. [Google Scholar] [CrossRef]
Liang, Y.; Wen, H.; Nie, Y.; Jiang, Y.; Jin, M.; Song, D.; Pan, S.; Wen, Q. Foundation Models for Time Series Analysis: A Tutorial and Survey. arXiv 2024, arXiv:2403.14735. [Google Scholar] [CrossRef]
Jin, M.; Wen, Q.; Liang, Y.; Zhang, C.; Xue, S.; Wang, X.; Zhang, J.; Wang, Y.; Chen, H.; Li, X.; et al. Position Paper: What Can Large Language Models Tell Us about Time Series Analysis. arXiv 2024, arXiv:2402.02713. [Google Scholar] [CrossRef]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Das, A.; Kong, W.; Sen, R.; Zhou, Y. A Decoder-Only Foundation Model for Time-Series Forecasting. arXiv 2024, arXiv:2310.10688. [Google Scholar]
Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Shchur, O.; Rangapuram, S.S.; Arango, S.P.; Kapoor, S.; et al. Chronos: Learning the Language of Time Series. arXiv 2024, arXiv:2403.07815. [Google Scholar] [CrossRef]
Woo, G.; Liu, C.; Kumar, A.; Xiong, C.; Savarese, S.; Sahoo, D. Unified Training of Universal Time Series Forecasting Transformers. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21 July–27 July 2024. [Google Scholar]
Liu, X.; Liu, J.; Woo, G.; Aksu, T.; Liang, Y.; Zimmermann, R.; Liu, C.; Savarese, S.; Xiong, C.; Sahoo, D. Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts. arXiv 2024, arXiv:2410.10469. [Google Scholar]
Rasul, K.; Ashok, A.; Williams, A.R.; Ghonia, H.; Bhagwatkar, R.; Khorasani, A.; Bayazi, M.J.D.; Adamopoulos, G.; Riachi, R.; Hassen, N.; et al. Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. arXiv 2024, arXiv:2310.08278. [Google Scholar] [CrossRef]
Shi, X.; Chen, S.; Yao, Y.; Wang, L.; Liu, J.; Liu, C. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. arXiv 2024, arXiv:2409.16040. [Google Scholar]
Goswami, M.; Szafer, K.; Choudhry, A.; Cai, Y.; Li, S.; Dubrawski, A. MOMENT: A Family of Open Time-series Foundation Models. arXiv 2024, arXiv:2402.03885. [Google Scholar] [CrossRef]
Misyris, G.S.; Venzke, A.; Chatzivasileiadis, S. Physics-Informed Neural Networks for Power Systems. In Proceedings of the IEEE PES General Meeting, Montreal, QC, Canada, 2–6 August 2020. [Google Scholar]
Gruver, N.; Finzi, M.; Qiu, S.; Wilson, A.G. Large Language Models Are Zero-Shot Time Series Forecasters. Adv. Neural Inf. Process. Syst. 2024, 36, 19622–19635. [Google Scholar]
Xue, H.; Salim, F.D. PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting. IEEE Trans. Knowl. Data Eng. 2023, 36, 6851–6864. [Google Scholar] [CrossRef]
Zhou, T.; Niu, P.; Wang, X.; Sun, L.; Jin, R. One Fits All: Power General Time Series Analysis by Pretrained LM. Adv. Neural Inf. Process. Syst. 2024, 36, 43322–43355. [Google Scholar]
Garza, A.; Mergenthaler-Canseco, M. TimeGPT-1. arXiv 2024, arXiv:2310.03589. [Google Scholar]
Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-Term Residential Load Forecasting Based on LSTM Recurrent Neural Network. IEEE Trans. Smart Grid 2019, 10, 841–851. [Google Scholar] [CrossRef]
Kim, T.Y.; Cho, S.B. Predicting Residential Energy Consumption Using CNN-LSTM Neural Networks. Energy 2019, 182, 72–81. [Google Scholar] [CrossRef]
Lago, J.; Marcjasz, G.; De Schutter, B.; Weron, R. Forecasting Day-Ahead Electricity Prices: A Review of State-of-the-Art Algorithms, Best Practices and an Open-Access Benchmark. Appl. Energy 2021, 293, 116983. [Google Scholar] [CrossRef]
Yang, Y.; Liu, X.; Chen, Z.; Wang, J. ATTnet: An Explainable Gated Recurrent Unit Neural Network for High Frequency Electricity Price Forecasting. Int. J. Electr. Power Energy Syst. 2024, 158, 109975. [Google Scholar] [CrossRef]
Chen, Y.; Xiao, J.; Wang, Y.; Li, Y. Multi-Task Learning for Integrated Energy System Forecasting with Coupling Awareness. Energy Convers. Manag. 2024, 297. [Google Scholar]
Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
Donon, B.; Clément, R.; Donnot, B.; Marot, A.; Guyon, I.; Schoenauer, M. Graph Neural Networks for Power Grid. arXiv 2020, arXiv:1905.09990. [Google Scholar]
Hossain, S.; Rahman, A.; Ahmed, S.; Enam, S.; Ahmed, M.T. Interpretable Physics-Informed Neural Networks for Energy Consumption Prediction Using IoT Sensors. Array 2025, 28, 100469. [Google Scholar] [CrossRef]
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
Qu, K.; Xue, S.; Zheng, X.; Yan, D.; Cao, H. Learning dynamic inter-farm dependencies for wind power forecasting via adaptive sparse graph attention network. Renewable Energy 2026, 258, 124969. [Google Scholar] [CrossRef]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2022, arXiv:2106.09685. [Google Scholar]
Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
Zhang, S.; Liu, Z.; Xu, Y.; Su, H. A Physics-Informed Hybrid Multitask Learning for Lithium-Ion Battery Full-Life Aging Estimation at Early Lifetime. IEEE Trans. Ind. Inform. 2025, 21, 415–424. [Google Scholar] [CrossRef]
Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar]
Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Koenker, R.; Bassett, G., Jr. Regression Quantiles. Econometrica 1978, 46, 33–50. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Box, G.E.; Jenkins, G.M. Time Series Analysis: Forecasting and Control; Holden-Day: Holden, MA, USA, 1970. [Google Scholar]
Taylor, S.J.; Letham, B. Forecasting at Scale. Am. Stat. 2018, 72, 37–45. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]

Figure 1. GridFM architecture overview. The model comprises five main modules: (1) Input Embedding with temporal and positional encodings; (2) FreqMixer Adaptation Layer for power-grid-specific pattern learning through frequency domain processing; (3) Pre-trained Foundation Model Backbone (Moirai-MoE) with frozen weights and LoRA adapters; (4) Physics-Informed Constraint Module embedding power balance equations and zonal grid topology; (5) Multi-Task Output Heads for the joint forecasting of load, price, emission, and renewable generation. The total loss combines multi-task, physics, and coupling components.

Figure 2. Detailed architecture of the FreqMixer Adaptation Layer. The input embeddings

E \in R^{L \times d}

are transformed to the frequency domain via Fast Fourier Transform (FFT), filtered through a learnable frequency mask

σ (M)

, processed by a frequency-mixing MLP, and transformed back via Inverse FFT (IFFT). A residual connection preserves the original signal, followed by layer normalization. The inset shows an example of the learned frequency mask that selectively amplifies power-grid-relevant periodicities (12 h, 24 h, 168 h).

Figure 2. Detailed architecture of the FreqMixer Adaptation Layer. The input embeddings

E \in R^{L \times d}

are transformed to the frequency domain via Fast Fourier Transform (FFT), filtered through a learnable frequency mask

σ (M)

, processed by a frequency-mixing MLP, and transformed back via Inverse FFT (IFFT). A residual connection preserves the original signal, followed by layer normalization. The inset shows an example of the learned frequency mask that selectively amplifies power-grid-relevant periodicities (12 h, 24 h, 168 h).

Figure 3. Multi-task learning framework in GridFM. The shared foundation model backbone processes input features with frozen pre-trained weights and LoRA adapters. Task-specific output heads generate predictions for load, price, emission, and renewable generation forecasting. The uncertainty-weighted loss

L_{MTL}

automatically balances task contributions using learned uncertainty parameters

σ_{k}^{2}

. Coupling constraints (dashed red arrows) enforce consistency between related predictions.

Figure 3. Multi-task learning framework in GridFM. The shared foundation model backbone processes input features with frozen pre-trained weights and LoRA adapters. Task-specific output heads generate predictions for load, price, emission, and renewable generation forecasting. The uncertainty-weighted loss

L_{MTL}

automatically balances task contributions using learned uncertainty parameters

σ_{k}^{2}

. Coupling constraints (dashed red arrows) enforce consistency between related predictions.

Figure 4. NYISO 11 load zones with transmission interfaces. The zones include: A (West), B (Genesee), C (Central), D (North), E (Mohawk Valley), F (Capital), G (Hudson Valley), H (Millwood), I (Dunwoodie), J (New York City), and K (Long Island). Lines represent major transmission interfaces with transfer capability constraints. The zonal topology is encoded using Graph Convolutional Networks (GCN) in the Physics-Informed Constraint Module.

Figure 5. Comprehensive performance comparison across all models and tasks. (a) MAPE comparison across load, price, emission, and renewable forecasting. (b) Load RMSE in MW. (c) Price RMSE in $/MWh. (d) MAE for load and price. (e) Relative improvement over Moirai–MoE baseline. (f) Radar chart comparing GridFM, Moirai–MoE, and Chronos. GridFM (highlighted) achieves the best performance across all metrics.

Figure 6. Example 48 h multi-task forecast visualization. Each subplot shows 24 h of context (shaded region) followed by 24 h of prediction: (a) load forecasting in MW, (b) price forecasting in $/MWh, (c) emission rate forecasting in lbs CO₂/MWh, and (d) renewable generation forecasting in MW. Red dashed lines show GridFM predictions with 95% confidence intervals (shaded red). The vertical dotted line indicates the forecast horizon start. GridFM accurately captures daily patterns and correlations across all four tasks.

Figure 7. Prediction error distribution analysis. (a) Histogram of load forecast errors showing GridFM has narrower distribution compared to Moirai–MoE. (b) Q-Q plot demonstrating GridFM errors follow approximately normal distribution. (c) Box plots of errors by hour of day revealing heteroscedasticity pattern; orange circles represent GridFM errors, blue circles represent Moirai-MoE errors. (d) Cumulative error distribution showing 95th percentile absolute error: GridFM 380MW vs. Moirai-MoE 520 MW; dashed lines indicate 95th percentile threshold.

Figure 8. GridFM training progress over 100 epochs. (a) Training and validation loss convergence showing stable optimization without overfitting. The model converges around epoch 60 with minimal gap between training and validation loss. (b) Task-specific validation MAPE during training for load, price, and emission forecasting. All tasks show consistent improvement, with price forecasting requiring more epochs to converge due to higher volatility.

Figure 9. Performance analysis across different forecast horizons from 1 h (12 steps) to 24 h (288 steps). (a) Load forecasting MAPE shows GridFM maintains superior performance across all horizons. (b) Price forecasting MAPE demonstrates larger gaps at longer horizons. (c) GridFM improvement retention over Moirai–MoE remains above 15% even at 24 h horizons, dashed lines indicate the 15% improvement retention threshold. (d) Forecast skill score relative to persistence baseline shows GridFM maintains positive skill across all horizons.

Figure 10. Seasonal and temporal performance analysis. (a) Load forecasting MAPE by season shows consistent improvement across all seasons. (b) Price forecasting MAPE reveals largest improvements during summer peak periods. (c) GridFM monthly MAPE heatmap across all four forecasting tasks. (d) Performance comparison by day type (weekday, weekend, holiday), demonstrating robust performance under varying operational conditions.

Figure 11. Zonal performance analysis across NYISO’s 11 load zones. (a) Load forecasting MAPE by zone comparing GridFM and Moirai–MoE. Dashed lines indicate system-wide mean. Zone J (NYC) and Zone K (Long Island) show the highest MAPE but also the largest improvement. Different shades of blue distinguish GridFM (dark blue) from Moirai–MoE (light blue). (b) Horizontal bar chart of GridFM improvement percentage by zone. The red dashed line indicates mean improvement (18.4%). All zones show consistent improvement between 17.8% and 19.2%.

Figure 12. Ablation study showing the contribution of each GridFM component. Starting from the base Moirai–MoE model, each component is progressively added: FreqMixer adaptation layer, physics-informed constraints, and multi-task learning framework. The full GridFM model achieves the best performance across all tasks. The FreqMixer contributes a 6.8% improvement for load forecasting, physics constraints add 2.9%, and multi-task learning provides an additional 5.5%. Dashed lines in subfigure (d) indicate threshold levels for comparison.

Figure 13. Physics-informed constraint effectiveness analysis. (a) Power balance constraint violation rate across models. GridFM achieves a 67% reduction compared to Moirai–MoE. (b) Physics and coupling loss convergence during training. (c) Inter-variable correlation preservation comparing true data correlations with model predictions. GridFM closely matches true correlations. (d) Physical consistency metrics showing GridFM achieves > 92% compliance across all constraint types.

Figure 14. Probabilistic forecasting performance analysis. (a) Continuous Ranked Probability Score (CRPS) comparison for load and price forecasting. (b) Prediction interval calibration plot showing GridFM achieves near-perfect calibration (close to diagonal). (c) Pinball loss across quantile levels for load forecasting. (d) Prediction interval width by operating condition with corresponding coverage rates. Dashed horizontal line indicates the 95% target coverage threshold. GridFM maintains 95% target coverage while producing narrower intervals.

Figure 15. Computational efficiency analysis. (a) Inference latency in milliseconds per batch of 64 samples. GridFM (52 ms) is comparable to Moirai–MoE (45 ms) while achieving better accuracy. (b) Efficiency frontier showing model parameters vs. Load MAPE. GridFM achieves optimal position on the Pareto frontier. (c) GPU memory requirements in GB. Dashed lines indicate common GPU memory thresholds. GridFM requires 9.5 GB, compatible with RTX 3090/4080 GPUs.

Figure 16. SHAP feature importance analysis for (a) load forecasting and (b) price forecasting. Features are ranked by mean absolute SHAP value. For load forecasting, recent historical load (

t - 1

,

t - 24

) and temperature are most influential. For price forecasting, the previous price and hour of day dominate. The interpretable feature attribution supports grid operator decision-making.

Figure 16. SHAP feature importance analysis for (a) load forecasting and (b) price forecasting. Features are ranked by mean absolute SHAP value. For load forecasting, recent historical load (

t - 1

,

t - 24

) and temperature are most influential. For price forecasting, the previous price and hour of day dominate. The interpretable feature attribution supports grid operator decision-making.

Figure 17. Attention weight visualization for the load forecasting task. The heatmap shows aggregated attention weights, where rows represent prediction time steps and columns represent context positions. Darker colors indicate higher attention weights. The model learns to focus on recent observations while also attending to the same hour from the previous day (diagonal pattern), capturing daily periodicity.

Table 1. Comparison of time series foundation models.

Model	Type	Params	Multi.	Exog.	Prob.	Open
TimesFM [29]	Decoder	200 M	–	–	–	✓
Chronos [30]	Enc-Dec	8–710 M	–	–	✓	✓
Moirai [31]	Encoder	14–311 M	✓	✓	✓	✓
Moirai-MoE [32]	Sparse MoE	11–117 M	✓	✓	✓	✓
Time-MoE [34]	Sparse MoE	1–2.4 B	✓	✓	✓	✓
Lag-Llama [33]	Decoder	7–70 M	–	–	✓	✓
GridFM (Ours)	Hybrid MoE	135 M	✓	✓	✓	✓

Table 2. NYISO dataset characteristics.

Variable	Resolution	Observations	Zones	Source
System Load	5 min	105,120 × 11/year ^†	11 + 1	nyiso.com/load-data
Real-Time LBMP	5 min	105,120 × 11/year ^†	11 + 1	nyiso.com/pricing-data
Fuel Mix	5 min	105,120/year	System	nyiso.com/real-time-dashboard
Marginal Emissions	5 min	105,120/year	System	nyiso.com/emissions-data
Weather (NOAA)	Hourly	8760/year	11 stations	ncdc.noaa.gov

^† Note: 5 min resolution yields 105,120 observations per year per zone (60/5 × 24 × 365 = 105,120). For zonal variables (Load, LBMP), data is collected for each of 11 zones, yielding 1,156,320 total zone-level observations per year. System-level variables (Fuel Mix, Marginal Emissions) are reported once per 5 min interval at the aggregate system level.

Table 3. Main experimental results on NYISO test set (Part 1: Load and Price). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT (paired t-test with Bonferroni correction).

Table 3. Main experimental results on NYISO test set (Part 1: Load and Price). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT (paired t-test with Bonferroni correction).

Model	Load			Price
Model	MAPE	RMSE	MAE	MAPE	RMSE	MAE
Traditional Methods
ARIMA	4.21 ± 0.15	892 ± 32	685 ± 25	18.52 ± 0.85	12.5 ± 0.6	8.2 ± 0.4
XGBoost	3.12 ± 0.08	654 ± 21	498 ± 16	14.23 ± 0.62	9.8 ± 0.4	6.5 ± 0.3
LSTM	2.89 ± 0.07	598 ± 18	452 ± 14	12.78 ± 0.55	8.5 ± 0.3	5.8 ± 0.2
TFT	2.58 ± 0.06	512 ± 15	385 ± 12	10.52 ± 0.45	7.2 ± 0.3	4.9 ± 0.2
Informer	2.51 ± 0.06	498 ± 15	375 ± 11	11.23 ± 0.48	7.8 ± 0.3	5.2 ± 0.2
PatchTST	2.45 ± 0.05	485 ± 14	365 ± 11	10.85 ± 0.48	7.4 ± 0.3	5.0 ± 0.2
Foundation Models (Zero-Shot)
TimesFM	2.78 ± 0.08	542 ± 18	412 ± 14	10.82 ± 0.52	7.5 ± 0.4	5.1 ± 0.2
Chronos	2.71 ± 0.07	528 ± 16	398 ± 13	10.21 ± 0.48	7.1 ± 0.3	4.8 ± 0.2
Moirai-MoE	2.63 ± 0.06	515 ± 15	388 ± 12	10.15 ± 0.45	7.0 ± 0.3	4.7 ± 0.2
Foundation Models (Fine-Tuned)
TimesFM-FT	2.52 ± 0.05	495 ± 14	372 ± 11	9.85 ± 0.42	6.8 ± 0.3	4.6 ± 0.2
Chronos-FT	2.48 ± 0.05	488 ± 13	368 ± 11	9.62 ± 0.40	6.6 ± 0.3	4.5 ± 0.2
Moirai-MoE-FT	2.38 ± 0.05	468 ± 12	352 ± 10	9.27 ± 0.38	6.4 ± 0.2	4.3 ± 0.2
GridFM	2.14 ± 0.05 **	418 ± 11 **	315 ± 9 **	7.80 ± 0.31 **	5.4 ± 0.2 **	3.6 ± 0.1 **
Improv. vs. FT	10.1%	10.7%	10.5%	15.9%	15.6%	16.3%
Improv. vs. 0-shot	18.6%	18.8%	18.8%	23.2%	22.9%	23.4%

Table 4. Main experimental results on NYISO test set (Part 2: Emission and Renewable). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT.

Table 4. Main experimental results on NYISO test set (Part 2: Emission and Renewable). Best results in bold. Results are mean ± std. **

p < 0.01

vs. Moirai–MoE–FT.

Model	Emission			Renewable
Model	MAPE	RMSE	MAE	MAPE	RMSE	MAE
Traditional Methods
ARIMA	12.85 ± 0.55	85 ± 4	62 ± 3	15.25 ± 0.65	425 ± 18	312 ± 14
XGBoost	8.52 ± 0.35	58 ± 3	42 ± 2	10.85 ± 0.45	325 ± 14	238 ± 10
LSTM	7.25 ± 0.30	52 ± 2	38 ± 2	9.52 ± 0.40	295 ± 12	215 ± 9
TFT	6.35 ± 0.25	45 ± 2	33 ± 1	8.25 ± 0.35	265 ± 11	192 ± 8
Informer	6.52 ± 0.28	48 ± 2	35 ± 2	8.65 ± 0.38	278 ± 12	202 ± 9
PatchTST	6.22 ± 0.25	44 ± 2	32 ± 1	8.12 ± 0.35	262 ± 11	190 ± 8
Foundation Models (Zero-Shot)
TimesFM	6.45 ± 0.26	46 ± 2	34 ± 2	8.35 ± 0.36	270 ± 12	196 ± 9
Chronos	6.32 ± 0.25	45 ± 2	33 ± 1	8.18 ± 0.35	265 ± 11	192 ± 8
Moirai-MoE	6.05 ± 0.24	43 ± 2	31 ± 1	7.92 ± 0.32	258 ± 10	187 ± 8
Foundation Models (Fine-Tuned)
TimesFM-FT	5.82 ± 0.22	42 ± 2	30 ± 1	7.58 ± 0.30	248 ± 10	180 ± 7
Chronos-FT	5.68 ± 0.21	41 ± 2	29 ± 1	7.42 ± 0.29	242 ± 10	176 ± 7
Moirai-MoE-FT	5.52 ± 0.20	39 ± 2	28 ± 1	7.25 ± 0.28	235 ± 9	170 ± 7
GridFM	4.73 ± 0.18 **	34 ± 1 **	24 ± 1 **	6.28 ± 0.24 **	205 ± 8 **	148 ± 6 **
Improv. vs. FT	14.3%	12.8%	14.3%	13.4%	12.8%	12.9%
Improv. vs. 0-shot	21.8%	20.9%	22.6%	20.7%	20.5%	20.9%

Table 5. Cross-ISO generalization results (Load MAPE %). Transfer: model trained on NYISO, tested on target ISO. Native: model trained and tested on target ISO. Note: Lower MAPE indicates better performance. As expected for transfer learning, transfer MAPE values are higher (worse) than ntive values due to domain shift.

Model	NYISO		PJM		CAISO
Model	Native	–	Transfer	Native	Transfer	Native
Moirai-MoE-FT	$2.38 \pm 0.05$	–	$2.85 \pm 0.08$	$2.42 \pm 0.06$	$2.62 \pm 0.07$	$2.28 \pm 0.05$
GridFM	$2.14 \pm 0.05$	–	$2.48 \pm 0.06$	$2.18 \pm 0.05$	$2.35 \pm 0.06$	$2.05 \pm 0.05$
Improvement	10.1%	–	13.0%	9.9%	10.3%	10.1%
Transfer Degradation	–	–	+13.8%	–	+14.6%	–

Table 6. Performance across forecast horizons (Load MAPE %).

Model	1 h	2 h	4 h	8 h	12 h	24 h
TFT	$1.98 \pm 0.04$	$2.25 \pm 0.05$	$2.58 \pm 0.06$	$3.05 \pm 0.07$	$3.58 \pm 0.08$	$4.35 \pm 0.10$
Moirai-MoE (0-shot)	$1.95 \pm 0.05$	$2.28 \pm 0.06$	$2.63 \pm 0.06$	$3.15 \pm 0.08$	$3.72 \pm 0.09$	$4.52 \pm 0.11$
Moirai-MoE-FT	$1.82 \pm 0.04$	$2.05 \pm 0.04$	$2.38 \pm 0.05$	$2.85 \pm 0.06$	$3.35 \pm 0.07$	$4.12 \pm 0.09$
GridFM	$1.52 \pm 0.03$	$1.78 \pm 0.04$	$2.14 \pm 0.05$	$2.65 \pm 0.06$	$3.12 \pm 0.07$	$3.85 \pm 0.08$
Improv. vs. 0-shot	22.1%	21.9%	18.6%	15.9%	16.1%	14.8%
Improv. vs. FT	16.5%	13.2%	10.1%	7.0%	6.9%	6.6%

Table 7. Seasonal performance (MAPE %).

Model	Load				Price
Model	Winter	Spring	Summer	Fall	Winter	Spring	Summer	Fall
Moirai-MoE (0-shot)	2.95 ± 0.07	2.48 ± 0.06	3.12 ± 0.08	2.58 ± 0.06	11.2 ± 0.5	9.5 ± 0.4	12.1 ± 0.6	9.8 ± 0.4
Moirai-MoE-FT	2.65 ± 0.06	2.22 ± 0.05	2.82 ± 0.06	2.32 ± 0.05	10.2 ± 0.4	8.6 ± 0.4	11.0 ± 0.5	8.9 ± 0.4
GridFM	2.35 ± 0.05	1.98 ± 0.04	2.52 ± 0.06	2.08 ± 0.05	8.5 ± 0.3	7.2 ± 0.3	9.1 ± 0.4	7.5 ± 0.3
Improv. vs. 0-shot	20.3%	20.2%	19.2%	19.4%	24.1%	24.2%	24.8%	23.5%
Improv. vs. FT	11.3%	10.8%	10.6%	10.3%	16.7%	16.3%	17.3%	15.7%

Table 8. Load forecasting MAPE (%) by zone.

Zone	TFT	Moirai-MoE	GridFM	Improv.
A—West	$2.48 \pm 0.05$	$2.52 \pm 0.06$	$2.05 \pm 0.04$	18.7%
B—Genesee	$2.52 \pm 0.05$	$2.58 \pm 0.06$	$2.12 \pm 0.04$	17.8%
C—Central	$2.50 \pm 0.05$	$2.55 \pm 0.06$	$2.08 \pm 0.04$	18.4%
D—North	$2.92 \pm 0.06$	$2.98 \pm 0.07$	$2.45 \pm 0.05$	17.8%
E—Mohawk Valley	$2.62 \pm 0.05$	$2.68 \pm 0.06$	$2.18 \pm 0.04$	18.7%
F—Capital	$2.58 \pm 0.05$	$2.62 \pm 0.06$	$2.15 \pm 0.04$	17.9%
G—Hudson Valley	$2.68 \pm 0.05$	$2.72 \pm 0.06$	$2.22 \pm 0.05$	18.4%
H—Millwood	$2.72 \pm 0.06$	$2.78 \pm 0.06$	$2.28 \pm 0.05$	18.0%
I—Dunwoodie	$2.82 \pm 0.06$	$2.88 \pm 0.07$	$2.35 \pm 0.05$	18.4%
J—New York City	$3.05 \pm 0.07$	$3.12 \pm 0.07$	$2.52 \pm 0.05$	19.2%
K—Long Island	$3.22 \pm 0.07$	$3.28 \pm 0.08$	$2.68 \pm 0.06$	18.3%
System Total	$2.58 \pm 0.06$	$2.63 \pm 0.06$	$2.14 \pm 0.05$	18.6%

Table 9. Ablation study results (MAPE %). Each row adds one component to the previous configuration.

Configuration	Load	Price	Emission	Renewable	Params
Base Moirai-MoE (0-shot)	$2.63 \pm 0.06$	$10.15 \pm 0.45$	$6.05 \pm 0.24$	$7.92 \pm 0.32$	117 M
+LoRA Fine-tuning	$2.38 \pm 0.05$	$9.27 \pm 0.38$	$5.52 \pm 0.20$	$7.25 \pm 0.28$	119 M
+FreqMixer Layer	$2.28 \pm 0.05$	$8.85 \pm 0.35$	$5.25 \pm 0.19$	$6.92 \pm 0.26$	125 M
+Physics Constraints	$2.22 \pm 0.05$	$8.52 \pm 0.33$	$5.02 \pm 0.18$	$6.65 \pm 0.25$	128 M
+Multi-Task Heads	$2.18 \pm 0.05$	$8.12 \pm 0.32$	$4.85 \pm 0.18$	$6.42 \pm 0.24$	135 M
+Coupling Loss	$2.16 \pm 0.05$	$7.92 \pm 0.31$	$4.78 \pm 0.18$	$6.35 \pm 0.24$	135 M
+Uncertainty Weighting	$2.14 \pm 0.05$	$7.80 \pm 0.31$	$4.73 \pm 0.18$	$6.28 \pm 0.24$	135 M

Table 10. Physics constraint effectiveness. Higher is better for compliance metrics (↑); lower is better for violations (↓).

Metric	TFT	Moirai-MoE	GridFM (No Phys.)	GridFM (Full)
Power Balance Violation (%) ↓	$8.2 \pm 0.5$	$6.5 \pm 0.4$	$5.2 \pm 0.3$	$2.1 \pm 0.2$
Price-Load Monotonicity (%) ↑	$72.2 \pm 2.1$	$78.5 \pm 1.8$	$85.2 \pm 1.5$	$92.5 \pm 1.2$
Emission-Renewable Inverse (%) ↑	$68.5 \pm 2.5$	$76.8 \pm 2.0$	$82.5 \pm 1.6$	$94.2 \pm 1.0$
Ramp Rate Compliance (%) ↑	$75.2 \pm 2.2$	$82.5 \pm 1.8$	$88.5 \pm 1.4$	$96.8 \pm 0.8$

Table 11. Hyperparameter sensitivity (Load MAPE %).

Parameter	Value 1	Value 2	Default	Value 4	Value 5
$λ_{p}$	0.01: $2.28 \pm 0.05$	0.05: $2.18 \pm 0.05$	0.1: $2.14 \pm 0.05$	0.2: $2.16 \pm 0.05$	0.5: $2.22 \pm 0.06$
$λ_{c}$	0.01: $2.18 \pm 0.05$	0.02: $2.16 \pm 0.05$	0.05: $2.14 \pm 0.05$	0.1: $2.15 \pm 0.05$	0.2: $2.19 \pm 0.05$
LoRA rank r	4: $2.22 \pm 0.05$	8: $2.18 \pm 0.05$	16: $2.14 \pm 0.05$	32: $2.14 \pm 0.05$	64: $2.15 \pm 0.05$
Context L	144: $2.25 \pm 0.05$	288: $2.14 \pm 0.05$	576: $2.12 \pm 0.05$	–	–

Table 12. Probabilistic forecasting performance.

Model	CRPS-Load	CRPS-Price	90% Cov.	95% Cov.	PI Width	Calib.
Chronos	$185 \pm 9$	$4.2 \pm 0.2$	88.5%	93.2%	$425 \pm 18$	0.82
Moirai-MoE	$162 \pm 8$	$3.6 \pm 0.2$	89.8%	94.5%	$385 \pm 15$	0.87
Chronos-FT	$172 \pm 8$	$3.8 \pm 0.2$	88.8%	93.5%	$405 \pm 15$	0.84
Moirai-MoE-FT	$155 \pm 7$	$3.4 \pm 0.2$	89.5%	94.2%	$375 \pm 12$	0.88
GridFM	$135 \pm 6$	$2.8 \pm 0.1$	$90.2 %$	$95.2 %$	$320 \pm 10$	$0.94$
Improv. vs. 0-shot	16.7%	22.2%	+0.4%	+0.7%	16.9%	+8.0%
Improv. vs. FT	12.9%	17.6%	+0.7%	+1.0%	14.7%	+6.8%

Table 13. Computational efficiency comparison.

Model	Params (M)	Train (h)	Infer. (ms)	GPU (GB)	FLOPs (G)
LSTM	2.5	8	12	2.1	0.8
TFT	5.2	15	28	4.5	2.2
TimesFM	200	–	85	12.5	45.2
Chronos-Large	710	–	120	24.0	125.5
Moirai-MoE (0-shot)	117	–	45	8.2	12.8
Moirai-MoE-FT	119	8	45	8.5	12.8
GridFM	135	12	52	9.5	15.2

Fine-tuning time only. Zero-shot models require no training.

Table 14. Performance during extreme events (Load MAPE %).

Event Type	Samples	Moirai-MoE	GridFM	Improv.
Normal Operations	95.2%	$2.52 \pm 0.06$	$2.05 \pm 0.04$	18.7%
Peak Load (>95th pctl)	2.5%	$4.85 \pm 0.15$	$4.12 \pm 0.12$	15.1%
Price Spikes (>3 $σ$ )	1.2%	$28.5 \pm 1.5$	$22.8 \pm 1.2$	20.0%
Extreme Weather	1.1%	$5.25 \pm 0.18$	$4.52 \pm 0.15$	13.9%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.

GridFM: A Physics-Informed Foundation Model for Multi-Task Energy Forecasting Using Real-Time NYISO Data

Abstract

1. Introduction

1.1. Background and Motivation

1.2. Emergence of Time Series Foundation Models

1.3. Research Gaps and Challenges

1.4. Research Objectives and Contributions

2. Related Work

2.1. Time Series Foundation Models

2.1.1. Language Model Adaptation Approaches

2.1.2. Tokenization-Based Native Models

2.1.3. Native Time Series Architectures

2.2. Deep Learning for Power Grid Forecasting

2.3. Physics-Informed Neural Networks for Power Systems

3. Methodology

3.1. Problem Formulation

3.2. GridFM Architecture Overview

3.3. Input Embedding and Positional Encoding

3.4. FreqMixer Adaptation Layer

3.4.1. Spectral Decomposition

3.4.2. Learnable Frequency Mask with Grid-Specific Initialization

3.4.3. Frequency Mixing Network

3.4.4. Inverse Transform and Residual Connection

3.5. Foundation Model Backbone

3.5.1. Sparse Mixture-of-Experts Layer

3.5.2. Any-Variate Attention

3.5.3. Low-Rank Adaptation (LoRA)

3.6. Physics-Informed Constraint Module

3.6.1. Improved Power Balance Constraint

3.6.2. Zonal Topology Encoding

3.7. Multi-Task Learning Framework

3.7.1. Hard Parameter Sharing Architecture

3.7.2. Uncertainty-Weighted Multi-Task Loss

3.7.3. Task-Specific Loss Functions

3.7.4. Adaptive Coupling Constraint Loss

3.8. Explainability Module

3.8.1. SHAP-Based Feature Attribution

3.8.2. Attention Weight Visualization

3.9. Complete Training Algorithm

4. Experimental Setup

4.1. Dataset Description

4.2. Data Preprocessing

4.3. Train/Validation/Test Split

4.4. Baseline Models

4.5. Evaluation Metrics

Handling Negative and Near-Zero Prices

4.6. Statistical Testing

4.7. Implementation Details

5. Experimental Results

5.1. Main Performance Comparison

5.2. Multi-ISO Validation

5.3. Forecast Horizon Analysis

5.4. Seasonal and Temporal Analysis

5.5. Zonal Performance Analysis

5.6. Ablation Study

5.7. Physics Constraint Effectiveness

5.8. Hyperparameter Sensitivity Analysis

5.9. Probabilistic Forecasting Evaluation

5.10. Computational Efficiency

6. Discussion

6.1. Key Findings

6.1.1. Foundation Model Adaptation Is Effective

6.1.2. Physics Constraints Improve Both Accuracy and Consistency

6.1.3. Multi-Task Learning Exploits Variable Coupling

6.1.4. FreqMixer Captures Grid-Specific Patterns

6.2. Explainability Analysis

6.3. Limitations

6.3.1. Computational Requirements

6.3.2. Extreme Event Performance

6.3.3. Geographic Scope

6.3.4. Temporal Scope

6.3.5. Physics Constraint Limitations

6.4. Practical Implications

6.4.1. Economic Impact

6.4.2. CLCPA Compliance

6.4.3. Renewable Integration

6.4.4. Deployment Considerations

7. Conclusions

Author Contributions

Funding