3.1. Data
The NEM regions that were chosen, NSW1, QLD1, and VIC1 (as illustrated in
Figure 1), all exhibit materially different price behaviours, levels of renewable integration, market structures, and volatility profiles. For example, QLD1 is characterised by frequent high-price excursions linked to thermal generator bidding strategies, whereas VIC1 displays stronger wholesale coupling to wind-generation variability and interconnector constraints. NSW1, as the largest demand centre, presents comparatively smoother but structurally complex price dynamics.
All data used in this study was obtained from publicly available sources, ensuring full reproducibility. Electricity market observations and forecast data were sourced from the Australian Energy Market Operator (AEMO) [
34,
35], while meteorological observations and forecasts were obtained from Open-Meteo [
36]. While the evaluation focused on the NSW1, QLD1, and VIC1 NEM regions, the data collection and pipeline included all interconnected NEM regions, namely SA1 and TAS1 in addition. This comprehensive approach was taken because these regions are integral parts of the interconnected energy market [
37].
Figure 1.
Map of Australia showing the interconnected NEM regions. The main focus of this study is NSW1, which corresponds to the most populous state in Australia [
38]. Image adapted from “Australia Color Map” by Quickiebytes, Syed, Wikimedia Commons (accessed on 29 Nov 2025), licensed under CC BY-SA 3.0 [
39].
3.2. Preprocessing
Thirty-four months of data was collected, covering NEM operational information, including actual spot prices, operational demand, and net interchange, as well as the full suite of NEM thirty-minute-ahead pre-dispatch forecasts for price, demand, and net interchange. Weather variables, including temperature, humidity, wind speed, and cloud cover, were compiled for the capital city associated with each NEM region examined in this study. As no official benchmark dataset exists for NEM forecasting research, all data streams were manually merged into a consistent temporal frame.
All data was resampled or aggregated to a uniform thirty-minute resolution, ensuring strict timestamp alignment across the actual and forecast domains. Preprocessing involved parsing and flattening nested AEMO files, resolving daylight-saving irregularities, removing anomalies, interpolating missing meteorological measurements where necessary, and producing a chronologically ordered dataset. From this dataset, sequences of thirty-two steps at 30 min intervals were generated, and any sequence missing one or more of its 32 steps was dropped. The data was screened for implausibly large erroneous values, but none were detected. It was, however, important to retain genuine price spikes, caused by plant breakdowns and un-forecast weather extremes, since these are frequent features of the NEM. QuantileTransformer scalers were used to reduce the impact of such extremes on training and on the subsequent performance of the various models. The QuantileTransformer maps the empirical distribution of electricity prices to a smooth, approximately Gaussian space by transforming each value according to its quantile. This is especially useful for markets with rare extreme spikes, as it compresses heavy tails and stabilises model training while preserving the relative ordering of high-impact events. No temporal shuffling of sequences occurred prior to dataset splitting.
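The effect of the quantile scaling can be seen in a minimal sketch using scikit-learn's QuantileTransformer; the price series below is synthetic and illustrative, not the study's data:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical price series with one extreme spike, mimicking a NEM RRP event.
rng = np.random.default_rng(0)
prices = rng.normal(60.0, 15.0, size=(1000, 1))
prices[500] = 15000.0                      # spike retained rather than removed

qt = QuantileTransformer(output_distribution="normal", n_quantiles=500,
                         random_state=0)
scaled = qt.fit_transform(prices)

# The heavy tail is compressed into a bounded, approximately Gaussian range,
# while the spike remains the largest value (relative ordering preserved).
```

Because the transform is monotone, the spike still sits at the top of the scaled distribution, but its magnitude no longer dominates the loss during training.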
Feature engineering aimed to provide the models with variables known or hypothesised to influence NEM price formation [
19,
40]. These included time-based encodings capturing diurnal, weekly, and seasonal cycles. Weather features expected to have the greatest impact on regional electricity generation and consumption were chosen. RRP, demand, and net interchange features were included for all NEM regions, since all regions are interconnected and influence one another's prices. The complete list of features used is shown in
Table 1.
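The time-based encodings can be sketched as sine/cosine pairs over the diurnal cycle at 30 min resolution; the values below are illustrative, and the actual engineered feature set is the one listed in Table 1:

```python
import numpy as np
import pandas as pd

# Illustrative cyclical encoding of the daily cycle for one day of timestamps.
idx = pd.date_range("2023-01-01", periods=48, freq="30min",
                    tz="Australia/Sydney")
step = idx.hour * 2 + idx.minute // 30        # half-hour index 0..47
sin_day = np.sin(2 * np.pi * step / 48)
cos_day = np.cos(2 * np.pi * step / 48)

# The (sin, cos) pair places each half-hour on the unit circle, so 23:30 and
# 00:00 are encoded as neighbours instead of opposite ends of a linear scale.
```

Weekly and seasonal cycles follow the same pattern with periods of 336 half-hours and one year, respectively.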
3.3. Model Architectures
The primary forecasting model used in this study was a transformer architecture based on the seminal work of Vaswani et al. [
25], as illustrated in
Figure 3. Transformers make use of a multi-head self-attention mechanism in which attention heads learn to emphasise the most relevant parts of the input sequence at each timestep. This mechanism enables efficient modelling of long-range temporal dependencies, making transformers particularly effective for electricity price forecasting where patterns can span multiple hours or even days.
Transformers typically consist of an encoder and a decoder linked by a cross-attention mechanism. The encoder processes the historical sequences, such as RRP, demand, and weather actuals, while the decoder consumes the known future inputs, including AEMO pre-dispatch forecasts and weather forecasts. Both components may be stacked in multiple layers to provide increased representational capacity and enable the model to capture hierarchical temporal patterns across different forecast horizons.
In this study, the encoder was responsible for learning latent representations of past market behaviour, whereas the decoder integrated these representations with exogenous forward-looking signals to generate a full 16 h ahead forecast. Positional encodings were applied to both streams to ensure that the model retained awareness of the temporal order of the inputs, a crucial requirement given the irregular and highly dynamic nature of NEM spot prices. The combination of multi-head self-attention, cross-attention, and deeply stacked layers allowed the model to capture nonlinear interactions between historical drivers, forecast inputs, and evolving system conditions more effectively than recurrent or convolutional baselines.
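The fixed sinusoidal positional encodings of Vaswani et al. can be sketched as follows for the 96-step encoder stream and 32-step decoder stream; this is a generic illustration, not the authors' code:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings added to the inputs so the model retains
    awareness of temporal order (sketch of the standard formulation)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

# 96 historical steps for the encoder, 32 future steps for the decoder.
pe_enc = sinusoidal_positional_encoding(96, 128)
pe_dec = sinusoidal_positional_encoding(32, 128)
```

Each position receives a unique, smoothly varying signature, which self-attention can exploit to distinguish, for example, an evening peak from a morning one.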
Since the original transformer was developed for sequence-to-sequence text translation, several modifications were made to adapt the architecture for numerical time-series forecasting and to improve training stability on NEM datasets. First, a pre-layer normalisation (pre-LN) formulation was adopted, following the stabilised architecture proposed by Wang et al. [
41]. Pre-LN significantly improves gradient stability during training and removes the need for the large learning-rate warm-up schedule used in the original transformer. Consequently, the warm-up phase was omitted entirely, and the model was successfully trained using the standard Adam optimiser [
42] with a fixed learning rate.
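The difference between the two normalisation placements can be sketched in a few lines; the sublayer here is a toy feed-forward map standing in for attention or the position-wise network:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    # Pre-LN: normalise *before* the sublayer, then add an identity residual.
    # The untouched residual path keeps gradients well-scaled at depth,
    # which is what removes the need for learning-rate warm-up.
    return x + sublayer(layer_norm(x))

def post_ln_block(x, sublayer):
    # Original (post-LN) ordering, shown for contrast.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(96, 128))                   # one encoded input sequence
ff = lambda h: np.maximum(h, 0.0) @ rng.normal(0, 0.02, (128, 128))
y_pre = pre_ln_block(x, ff)
y_post = post_ln_block(x, ff)
```

In the pre-LN form the residual stream is never renormalised, so a stack of such blocks behaves close to an identity map at initialisation.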
Second, the decoder was configured to use parallel decoding, enabling the model to generate the entire 32-step forecast horizon in a single forward pass. This approach avoids the accumulation of errors inherent in autoregressive decoding and aligns with operational forecasting needs in the NEM, where full multi-horizon price trajectories must be generated simultaneously. To assess the impact of decoding strategy, an additional autoregressive (AR) decoder variant was implemented and evaluated. The AR configuration predicted each future timestep sequentially, feeding earlier predictions back into the model, enabling direct comparison between parallel and AR decoding methods.
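The two decoding strategies can be contrasted with a toy sketch; the "model" below is a trivial persistence function used only to show the calling convention, not one of the study's models:

```python
import numpy as np

HORIZON = 32

def parallel_decode(model, history, future_cov):
    # One forward pass yields all 32 steps; no prediction is fed back in,
    # so errors cannot compound across the horizon.
    return model(history, future_cov)

def autoregressive_decode(model, history, future_cov):
    # The AR variant predicts one step at a time, appending each prediction
    # to the conditioning context before predicting the next step.
    context = list(history)
    preds = []
    for t in range(HORIZON):
        y_hat = model(np.asarray(context), future_cov[: t + 1])[-1]
        preds.append(y_hat)
        context.append(y_hat)
    return np.asarray(preds)

# Toy persistence "model" that repeats the last observed value.
toy = lambda hist, fut: np.full(HORIZON, hist[-1], dtype=float)
history = np.arange(96, dtype=float)
future = np.zeros(HORIZON)
y_par = parallel_decode(toy, history, future)
y_ar = autoregressive_decode(toy, history, future)
```

With a real model the AR loop costs 32 forward passes and propagates its own mistakes, whereas the parallel decoder emits the full trajectory in one pass.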
The default “small” encoder–decoder model consisted of three layers with four attention heads per layer, a hidden dimension of 128, a feed-forward dimension of 512, and a dropout rate of 0.05. It accepted an input sequence of ninety-six 30 min time steps and produced a forecast horizon of thirty-two 30 min time steps.
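These hyperparameters can be summarised as a configuration object; the field names are illustrative, not taken from the authors' code:

```python
from dataclasses import dataclass

@dataclass
class SmallTransformerConfig:
    # "Small" encoder-decoder variant as described above.
    num_layers: int = 3        # encoder and decoder layers
    num_heads: int = 4         # attention heads per layer
    d_model: int = 128         # hidden dimension
    d_ff: int = 512            # feed-forward dimension
    dropout: float = 0.05
    input_len: int = 96        # 96 x 30 min = 48 h of history
    horizon: int = 32          # 32 x 30 min = 16 h ahead

cfg = SmallTransformerConfig()
```

Note that d_model is divisible by num_heads, giving a per-head dimension of 32.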
To evaluate robustness and generalisation, a variety of architectural variants were tested, including tiny, small, medium, and large transformer models, as well as encoder-only and decoder-only configurations. A summary of these variants and their hyperparameters is provided in
Table 2 and shown in
Figure 4.
A generic two-layer Long Short-Term Memory (LSTM), a Patch Time Series Transformer (PatchTST) [
43], a TimesFM [
44], and a Temporal Fusion Transformer (TFT) [
45] were included for comparison. These models were chosen because they are modern, state-of-the-art representatives of each transformer configuration: TimesFM is decoder-only, PatchTST is encoder-only, and TFT contains both an encoder and a decoder. Each of the architectures is shown in
Figure 5.
A simple two-layer LSTM provides a strong baseline for short-term RRP forecasting because it captures sequential dependencies in load, price, and weather while remaining computationally lightweight. The first LSTM layer learns short-term structure, such as daily cycles, while the second layer captures longer-term patterns, including weekly and seasonal changes. LSTMs remain a well-established approach to this class of time-series problem [
11,
12].
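The recurrence at the heart of the baseline can be sketched as a single numpy LSTM cell stacked twice; sizes and weights below are toy values for illustration only:

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM cell step (numpy sketch; gate order: input, forget,
    candidate, output). W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = 1.0 / (1.0 + np.exp(-z[:H]))            # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))       # forget gate
    g = np.tanh(z[2 * H:3 * H])                 # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * H:]))        # output gate
    c_new = f * c + i * g                       # gated cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Two stacked layers: layer 1 consumes the input features, layer 2 consumes
# layer 1's hidden state, mirroring the short-term / longer-term split.
rng = np.random.default_rng(0)
D, H = 8, 16                                    # toy feature / hidden sizes
params = [(rng.normal(0, 0.1, (4 * H, d)), rng.normal(0, 0.1, (4 * H, H)),
           np.zeros(4 * H)) for d in (D, H)]
h = [np.zeros(H), np.zeros(H)]
c = [np.zeros(H), np.zeros(H)]
for _ in range(96):                             # 48 h of 30 min inputs
    x = rng.normal(size=D)
    h[0], c[0] = lstm_cell(x, h[0], c[0], *params[0])
    h[1], c[1] = lstm_cell(h[0], h[1], c[1], *params[1])
```

In practice a framework implementation (e.g. a two-layer recurrent module with a dense output head) would replace this hand-rolled cell.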
The TFT architecture combines LSTM encoders with multi-head attention and explicit variable-selection layers, making it well-suited to RRP forecasting, where both past conditions and future covariates, such as pre-dispatch forecasts, weather projections, and outages, influence price formation. Its LSTM layers learn local patterns, and its attention mechanisms help identify which drivers matter at longer horizons, offering good interpretability and often strong performance when diverse feature sets are available. The traditional quantile output head was replaced with a single dense output head, since point-price predictions were required.
TimesFM applies a pretrained large-scale foundation model with stacked self-attention layers and a dedicated forecast head, allowing it to extract generalisable temporal patterns from NEM data. Its ability to transfer patterns learned from massive global time-series corpora makes it effective for medium-horizon RRP prediction, where structural noise, demand cycles, and renewable variability dominate. It was used in zero-shot mode, without any fine-tuning: it generates multi-step forecasts by conditioning only on the input sequence and forecast horizon, making it highly adaptable to unseen time series with minimal task-specific configuration. Fine-tuning nonetheless remains a potential avenue for improvement.
PatchTST excels in RRP forecasting by processing inputs as overlapping temporal patches, allowing the model to focus on localised variations, such as ramp events, solar troughs, wind lulls, and rebidding episodes, while efficiently capturing long-range structure through attention. Its channel-independent design, in which each feature is viewed in its own channel, cleanly handles multivariate NEM inputs and often improves robustness, particularly when the target series exhibits nonlinear seasonality or regime changes.
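The patching step that gives PatchTST its name can be sketched in a few lines; the patch length and stride below are illustrative defaults, not necessarily the values used in this study:

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Split a univariate series into overlapping patches (PatchTST-style);
    each patch is later embedded as one attention token."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

x = np.arange(96.0)              # one channel of the 96-step input window
patches = make_patches(x)
# 96 steps -> 11 overlapping patches of length 16: attention operates over
# 11 tokens instead of 96 timesteps, while local detail is kept inside each
# patch. Under channel independence, every feature is patched separately.
```

Tokenising patches rather than single timesteps shortens the attention sequence quadratically while preserving the localised ramp and spike structure the text describes.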