A Review of Balancing Price Forecasting in the Context of Renewable-Rich Power Systems, Highlighting Profit-Aware and Spike-Resilient Approaches
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper discusses new energy prediction and power supply and distribution prediction, aiming to optimize the electricity market price. However, there are the following problems:
This review article seems more like a discussion on the demand and balance issue in the electricity market, including new energy, which belongs to electricity market marketing and policy. It is suitable to be published in this journal. Some problems need to be addressed, as listed below:
1. Although a large number of prediction methods are presented in the paper, most of these methods are very old algorithms and do not represent the latest modern development technologies. For example, the prediction results of Kalman filtering, neural networks, and deep learning are not good. Moreover, newly proposed prediction methods in recent years, such as long-range dependence and fractal, are not mentioned.
2. The analysis of various methods is superficial; the authors should explain the advantages and disadvantages of the various algorithms from an algorithmic perspective.
3. The maximum peak error given is too high, which has no engineering application value and cannot represent the current technology at all.
4. New energy power generation and electricity load demand are closely related to seasons, weather, weekdays, and holidays. This issue is not discussed in the article.
5. There is a missing punctuation mark after the reference [12,13].
6. An explanation table for many abbreviations should be provided.
Author Response
Response to Reviewer 1:
Title: “A Review of Balancing Price Forecasting in the Context of Renewable-Rich Power Systems, Highlighting Profit-Aware and Spike-Resilient Approaches”
The author thanks the reviewer for his/her comments and suggestions, which have been carefully considered. They have greatly contributed to improving the quality of the manuscript.
Reviewer #1 (Comments to the Authors):
This paper discusses new energy prediction and power supply and distribution prediction, aiming to optimize the electricity market price. However, there are the following problems:
This review article seems more like a discussion on the demand and balance issue in the electricity market, including new energy, which belongs to electricity market marketing and policy. It is suitable to be published in this journal. Some problems need to be addressed, as listed below:
- Although a large number of prediction methods are presented in the paper, most of these methods are very old algorithms and do not represent the latest modern development technologies. For example, the prediction results of Kalman filtering, neural networks, and deep learning are not good. Moreover, newly proposed prediction methods in recent years, such as long-range dependence and fractal, are not mentioned.
Thank you. Balancing market price forecasting remains a relatively less explored area. In agreement with the reviewer’s comment, this study highlights the necessity of employing advanced and newly developed time-series forecasting methods. Table 1 shows a predominance of statistical approaches compared to hybrid deep learning and more advanced techniques. To address benchmarking, we have added a new Section (Section 6), which discusses the market, error metrics, and test periods. Section 6 also shows that most studies rely on MAE and RMSE to quantify errors, reported together with the corresponding market. This makes cross-study comparison challenging. For example, an error of 20 EUR/MWh may be acceptable during peak hours but considered high during morning hours. Moreover, since prices have been on an upward trend, MAE and RMSE values reported in 2014 cannot be easily extrapolated to the error ranges of 2024.
The journal advised us to avoid unnecessary citations without a direct link to balancing market price forecasting; nevertheless, it is evident that methods exploiting long-range dependence and fractal properties can outperform traditional statistical approaches.
- The analysis of various methods is superficial; the authors should explain the advantages and disadvantages of the various algorithms from an algorithmic perspective.
Thanks. Statistical models are simple and interpretable but weak in capturing nonlinearities; machine learning models are flexible and robust to feature heterogeneity but prone to overfitting; deep learning models capture temporal dependencies effectively but require large datasets; hybrid models combine complementary strengths at the expense of complexity; and probabilistic models quantify uncertainty, which is crucial for risk-aware bidding, but remain more demanding to calibrate and evaluate.
We have added a paragraph on a particularly important class of probabilistic methods, as follows:
“In probabilistic forecasting, the performance metrics such as Prediction Interval Coverage Probability (PICP), Average Coverage Error (ACE), Prediction Interval Normalized Average Width (PINAW), Winkler score, Coverage Width Criterion (CWC), and Continuous Ranked Probability Score (CRPS), are essential for interpreting results and for capturing the fundamental trade-off between reliability (the proportion of realizations contained within the predicted intervals) and sharpness (the narrowness of those intervals). For example, Tahmasebifar et al. [40] report improved ACE, PINAW, and CWC alongside RMSE and MAPE when comparing their hybrid model, while Mori and Nakano report empirical coverage at ±σ, ±2σ, and ±3σ (e.g., 89.8%, 100%, 100% with a Mahalanobis kernel) as an implicit calibration check [38]. Recent imbalance-forecasting work further emphasizes the reliability–sharpness trade-off using quantile loss, Winkler score, and CRPS, defining reliability as the fraction of realized values below forecast quantiles [32].”
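For illustration only (this snippet is not part of the manuscript or any surveyed study), a minimal sketch of how such interval metrics can be computed from predicted bounds is given below; the nominal coverage level, the penalty form of the Winkler score, and the example values are assumptions.

```python
import numpy as np

def interval_metrics(y, lower, upper, alpha=0.10):
    """Illustrative reliability/sharpness metrics for (1 - alpha) prediction intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    covered = (y >= lower) & (y <= upper)
    width = upper - lower
    picp = covered.mean()                           # empirical coverage (reliability)
    ace = picp - (1 - alpha)                        # average coverage error
    pinaw = width.mean() / (y.max() - y.min())      # normalized average width (sharpness)
    # Winkler score: interval width plus a penalty for observations falling outside
    penalty = (2 / alpha) * (np.maximum(lower - y, 0) + np.maximum(y - upper, 0))
    winkler = (width + penalty).mean()
    return {"PICP": picp, "ACE": ace, "PINAW": pinaw, "Winkler": winkler}

# Hypothetical balancing-price realizations and 90% interval bounds (EUR/MWh)
y  = np.array([55.0, 62.0, 180.0, 47.0])
lo = np.array([40.0, 50.0,  60.0, 35.0])
hi = np.array([70.0, 80.0, 120.0, 60.0])
print(interval_metrics(y, lo, hi))
```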
- The maximum peak error given is too high, which has no engineering application value and cannot represent the current technology at all.
We appreciate the reviewer’s observation regarding the high peak errors reported in some studies. We agree that such large errors have limited engineering application value and do not reflect the full potential of current forecasting technologies. However, as also noted in the first comment, several of the contributions relied on relatively simple methods (Table 1), and only recently have more advanced forecasting frameworks with robust pre-processing and post-processing techniques begun to appear. Spikes remain particularly challenging, and while the reported results confirm the limitations of existing approaches, they also highlight a clear research gap and potential for improvement. In our revision, we have clarified this point in the Conclusions by emphasizing that these extreme error values should be seen not as representative of the state of the art, but as evidence of the need for further development in spike-resilient and profit-aware forecasting pipelines.
- New energy power generation and electricity load demand are closely related to seasons, weather, weekdays, and holidays. This issue is not discussed in the article.
We thank the reviewer for this valuable comment. We have made the following additions to the manuscript:
“…Calendar effects, such as monthly seasonality and de-rated capacity margins, further enhance forecast accuracy by embedding system stress linked to seasonal load and availability patterns [15,43]. ….”
“For balancing markets, where seasonality (load, wind/RES profiles), scarcity episodes (tight capacity margins, high imbalance volume), and calendar effects shape the distribution, short windows are unlikely to capture regime diversity or tail behavior (spikes). As a result, point-error metrics computed on days-to-weeks or single-month samples risk optimistic generalization and unstable conclusions about model superiority. By contrast, a contiguous year (or multi-season design) is methodologically justified: it spans winter-to-summer shifts, holiday clusters, maintenance/forced-outage patterns, and structural breaks, enabling robust hyper-parameter tuning and fair comparison across models.
Recommended practice, therefore, can be to (i) prefer ≥ 1-year out-of-sample evaluation or, at minimum, multi-season blocked/rolling back-tests; (ii) report regime-conditional performance (e.g., normal vs. scarcity/spike days; holiday vs. non-holiday) alongside aggregate errors; and (iii) document inclusion/exclusion criteria (holidays, extreme events) and market splits explicitly. When only short windows are feasible, authors should complement results with rolling-origin tests across disjoint weeks/months and provide uncertainty bands or sensitivity analyses to guard against period-selection bias.”
Along with:
“Recently, a Seasonal Attention-Based Bidirectional LSTM (SA-BiLSTM) framework for forecasting electricity imbalance prices in the British balancing market, using data from 2016 to 2019, was proposed by Deng et al. [15]. The model introduces a seasonal attention mechanism to enhance temporal feature extraction from 48 half-hourly periods each day, addressing inherent seasonality and autocorrelation structures in imbalance price dynamics. ….”
- There is a missing punctuation mark after the reference [12,13].
Thank you. We have corrected this.
- An explanation table for many abbreviations should be provided.
Thank you. We have added the table below, alongside Table 1, which contains the method abbreviations.
| Abbreviation | Full Form |
| --- | --- |
| BM | Balancing Market |
| C | Classification of the sign of the price difference between DAM and BM |
| CRPS | Continuous Ranked Probability Score |
| DAM | Day-Ahead Market |
| DERs | Distributed Energy Resources |
| ISO | Independent System Operator |
| ISO-NE | Independent System Operator New England |
| IVC | Imbalance Volume Classification (Surplus/Shortage) |
| IV | (Net) Imbalance Volume |
| L | Load |
| LMP | Locational Marginal Price |
| MAE | Mean Absolute Error |
| MAP | Maximum a Posteriori |
| MAPE | Mean Absolute Percentage Error |
| MSE | Mean Squared Error |
| NMAE | Normalized Mean Absolute Error |
| NRMSE | Normalized Root Mean Squared Error |
| NRV | Net Regulation Volume |
| OPF | Optimal Power Flow |
| P | Balancing Market Premium Prediction (BM – DAM price difference) |
| PICP | Prediction Interval Coverage Probability |
| PINAW | Prediction Interval Normalized Average Width |
| PJM | Pennsylvania–New Jersey–Maryland Interconnection |
| RES | Renewable Energy Sources |
| RMSE | Root Mean Squared Error |
| R² | Coefficient of Determination |
| S | Spike Occurrence Prediction |
| sMAPE | Symmetric Mean Absolute Percentage Error |
| SOC | State of Charge |
| TSO | Transmission System Operator |
| VPP | Virtual Power Plant |
| VaR | Value-at-Risk |
| CVaR | Conditional Value-at-Risk |
| CWC | Coverage Width Criterion |
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
- The paper provides a useful topical synthesis and highlights practical gaps (e.g., spike modeling, profit-aware evaluation). However, it is not clear how this review differs in method and outcomes from recent surveys cited (e.g., day-ahead and intraday review papers). Please clarify the novelty of this review in a concise paragraph (explicitly state: unique scope, selection period, number of papers surveyed, and any quantitative synthesis you performed). Right now the novelty claim is asserted but not demonstrated.
- The abstract reports MAPE ranges (3–10% for 1–6 h, 10–20% for 12–24 h, 25% at 24–36 h), which are useful but lack context (markets, time periods, or metric variants). Please state whether these ranges are aggregated across markets / horizons / performance metrics and briefly note the basis (number of studies or median values). This will make the abstract stronger and less ambiguous.
- The "Structure of the Review" paragraph is helpful, but the paper would benefit from a short table (or list) telling the reader how many papers are covered in each section/topic (point-based, probabilistic, horizons, spike detection, etc.). That would immediately orient readers about coverage depth per topic.
- The review lists many point-based models but rarely compares them under a common performance metric or market context. Can the author (a) add a consolidated table summarizing best reported metrics per model-family and horizon, and (b) comment on comparability problems (different error metrics, region-specific spikes, different test periods)? Right now we get lists of methods but limited cross-study synthesis.
- The section correctly highlights the promise of probabilistic forecasting but misses a short discussion on calibration and sharpness (or how the surveyed works report interval quality). I recommend the author include a paragraph summarizing whether existing probabilistic studies report calibration tests (e.g., PIT, reliability diagrams) and how those results compare to point-forecast baselines. That would add depth beyond naming models.
- The text notes that preprocessing/post-processing is under-used, and Table 1 compiles methods. Still, the paper should (a) explain the criteria for listing methods in Table 1 (which papers contributed which items), (b) provide a short evaluation of which preprocessing steps consistently improved results across studies (if possible), and (c) highlight pitfalls (e.g., transformations that hide spikes). A small meta-analysis (even qualitative) here would substantially raise the paper's value.
- The review correctly emphasizes day-ahead prices, net-imbalance volumes, and meteorology. Two additions would help readers: (a) a brief taxonomy of features by availability / latency (published real-time vs delayed), and (b) a discussion of the propagation of input forecast errors (the paper mentions this later; it should be referenced and cross-linked here). Readers implementing forecasting systems need guidance on which features are reliable in practice.
- I strongly appreciate the emphasis on transaction costs, latency, and regulatory caps — these are often neglected. However, the review would be much stronger if it (a) lists which studies included transaction costs or latency in their backtests (and their numerical impact), and (b) proposes a minimal standardized profit-aware evaluation protocol (a short checklist or pseudo-algorithm). That will help future benchmarking and addresses a central gap the author identifies.
- The review currently omits whether the surveyed papers make code and datasets available. Please add a short paragraph (or table column) stating how many studies provide code/data and emphasize the importance of reproducible pipelines for this field. E.g. https://doi.org/10.1016/j.dib.2025.112042
- The manuscript is generally well written, but a careful proofreading pass is needed to correct minor grammar and punctuation slips and to improve flow in a few dense paragraphs (e.g., the long lists in Table 1 can be split for readability). Also consider consistent use of abbreviations (introduce IVC, IV, P etc. once and then use consistently).
- The conclusions list sensible gaps. To increase impact, convert the list into a prioritized research agenda (short-, medium-, long-term items), and where possible indicate measurable milestones (e.g., "Develop standard spike-label dataset by 2026", "Establish profit-aware evaluation protocol: metrics X,Y,Z"). This will help the paper serve as a rallying point for the community.
There are few grammatical errors which should be corrected.
Author Response
Response to Reviewer #2:
Title: “A Review of Balancing Price Forecasting in the Context of Renewable-Rich Power Systems, Highlighting Profit-Aware and Spike-Resilient Approaches”
The author would like to thank the reviewer; his/her comments have truly helped improve the manuscript. We have carefully taken all comments into account and amended the manuscript accordingly.
Reviewer #2 (Comments to the Authors):
- The paper provides a useful topical synthesis and highlights practical gaps (e.g., spike modeling, profit-aware evaluation). However, it is not clear how this review differs in method and outcomes from recent surveys cited (e.g., day-ahead and intraday review papers). Please clarify the novelty of this review in a concise paragraph (explicitly state: unique scope, selection period, number of papers surveyed, and any quantitative synthesis you performed). Right now the novelty claim is asserted but not demonstrated.
Thank you. We have added a Contribution sub-section, as follows:
“1.2 Contribution
This work offers the first dedicated and systematic review focused exclusively on price forecasting in balancing and real-time electricity markets, a domain often overshadowed by the much larger bodies of research on wind-power and day-ahead price forecasting. While balancing-market studies date back nearly three decades, their progress has been constrained by the inherent complexity of market mechanisms and by restricted or opaque data availability. To address this gap, our review systematically structures the literature by methodology and forecast horizon, enhancing accessibility for both academic and practitioner audiences. It catalogs 48 forecasting studies, classifying them by preprocessing, forecasting architecture, and hybrid design, while also identifying contributions that extend beyond price to alternative targets such as imbalance volume (IV), price premium (P), load (L) and classification tasks (C). It complements this by categorizing prediction horizons, explicitly linking accuracy degradation to lead time. Beyond methodological benchmarking, the review extends to revenue optimization, profit-aware evaluation, data latency constraints, and spike/ramp forecasting, where these aspects are explicitly synthesized.”
- The abstract reports MAPE ranges (3–10% for 1–6 h, 10–20% for 12–24 h, 25% at 24–36 h), which are useful but lack context (markets, time periods, or metric variants). Please state whether these ranges are aggregated across markets / horizons / performance metrics and briefly note the basis (number of studies or median values). This will make the abstract stronger and less ambiguous.
We have rewritten the sentence as:
“Typical reported accuracy spans mean absolute percentage errors of approximately 3-10% for very short-term (1-6 h ahead) horizons, 10-20% for mid-term horizons (12-24 h ahead), and around 25% for longer horizons (24-36 h ahead), with spikes and rapid ramps driving most residual error.”
- The "Structure of the Review" paragraph is helpful, but the paper would benefit from a short table (or list) telling the reader how many papers are covered in each section/topic (point-based, probabilistic, horizons, spike detection, etc.). That would immediately orient readers about coverage depth per topic.
Thank you. We have added a new section titled “Market Scope, Test-Period Design and Performance Metric Selection” and inserted the following sentence into the Structure of the Review part:
“Section 6 discusses market scope, test-period design, and metric selection, offering guidance for establishing benchmarks.”
We have written a Contribution section stating that 48 papers were examined in Tables 1 and 2. In addition, Section 6 discusses 33 studies that explicitly disclose the test period:
“1.2. Contribution
… It catalogs 48 forecasting studies, classifying them by preprocessing, forecasting architecture, and hybrid design, while also identifying contributions that extend beyond price to alternative targets such as imbalance volume (IV), price premium (P), load (L) and classification tasks (C). It complements this by categorizing prediction horizons, explicitly linking accuracy degradation to lead time. …”
- The review lists many point-based models but rarely compares them under a common performance metric or market context. Can the author (a) add a consolidated table summarizing best reported metrics per model-family and horizon, and (b) comment on comparability problems (different error metrics, region-specific spikes, different test periods)? Right now we get lists of methods but limited cross-study synthesis.
The new section is
“6. Market Scope, Test-Period Design and Performance Metric Selection
Comprehensive forecasting requires combining data from both market and system operators within a region. Table 3 lists the markets/regions considered in the surveyed studies. Many works use publicly available datasets released by market and transmission system operators. Notably, two studies, [19] and [25], openly release their code along with the dataset.
Thirty-one studies explicitly disclose the length of the test period, covering a total of 33 distinct cases, yet their lengths vary widely. Across these cases, the median test length is 2 months, and only 45% cover at least a full season (≥ 3 months). Concretely: 18% evaluate on < 1 month (e.g., 1 day, 6 days, 20 days, 3 weeks), another 15% on exactly 1 month, 30% on 1–3 months, 6% on > 3 months but < 1 year (e.g., 4–8 months), and 30% on ≥ 1 year [17,20,21,25,26,30–32,43,44] (including multi-year tests). The analysis also reveals potentially non-representative sampling choices, e.g., market-specific windows within the same study (2 months for ISO-NE vs. 3 months for PJM) [35] and non-holiday subsets (≈ 2.5 months) [42], that can inflate apparent accuracy if challenging regimes are under-sampled.
For balancing markets, where seasonality (load, wind/RES profiles), scarcity episodes (tight capacity margins, high imbalance volume), and calendar effects shape the distribution, short windows are unlikely to capture regime diversity or tail behavior (spikes). As a result, point-error metrics computed on days-to-weeks or single-month samples risk optimistic generalization and unstable conclusions about model superiority. By contrast, a contiguous year (or multi-season design) is methodologically justified: it spans winter-to-summer shifts, holiday clusters, maintenance/forced-outage patterns, and structural breaks, enabling robust hyper-parameter tuning and fair comparison across models.
Recommended practice, therefore, can be to (i) prefer ≥ 1-year out-of-sample evaluation or, at minimum, multi-season blocked/rolling back-tests; (ii) report regime-conditional performance (e.g., normal vs. scarcity/spike days; holiday vs. non-holiday) alongside aggregate errors; and (iii) document inclusion/exclusion criteria (holidays, extreme events) and market splits explicitly. When only short windows are feasible, authors should complement results with rolling-origin tests across disjoint weeks/months and provide uncertainty bands or sensitivity analyses to guard against period-selection bias.
Table 3. Market/Region
| MARKET | Ref. |
| --- | --- |
| Ontario (Canada) electricity market; NYISO, New York City; Irish balancing market (I-SEM); Dutch balancing market; Southern Norway (NO1); Polish market; UK energy market; U.S. PJM market; Queensland (Australia) spot market; Belgian imbalance market; U.S. Midcontinent independent market (MISO); U.S. ISO New England; Swedish Nord Pool market; New South Wales (Australia); Nord Pool NO2 (Norway); Turkish electricity market; Greek balancing market; England & Wales market; Romanian market; ERCOT (Texas, USA); Japanese electricity market; Austrian balancing (real-time) market; Nordic power market; German balancing market | [19] [20] [24] [27] [33] [21] [25] [22] [23] [26][43] [55] [15][28] [37] [48] [50] [61] [29] [34][35] [36][47] [29] [10][30] [32][54] [58] [31] [35][38][41] [39] [49] [40] [42] [44] [46][57] [52] [51] [56] [59] [60] [62] [17] |
Table 4. Performance metrics
| METRIC | Ref. |
| --- | --- |
| MAE | [19][21][22][27][28][29][33][34][36][40][42][46][48][50][51][56][57][58][15][17] |
| RMSE | [19][21][22][26][29][31][34][40][46][48][50][51][56][57][58][61][15][17] |
| MAPE | [20][24][33][34][35][40][41][49][50][51][52][55][56][15] |
| R-squared | [28][29][46][48][51] |
| Pinball Loss | [30][32][48][57][61] |
| Continuous Ranked Probability Score (CRPS) | [30][32][57][17] |
| sMAPE | [21][27][49] |
| Winkler Score | [40][32] |
| nMAE | [57][30] |
| Model Confidence Set (MCS) | [19] |
| Aggregate Pinball Score | [25] |
| Explained Variance Score | [28] |
| MSE | [28] |
| MRE (Mean Relative Error) | [29] |
| IAE (Integral of Absolute Error) | [29] |
| NRMSE | [30] |
| Pseudo R-squared | [37] |
| MAP (maximum a posteriori) | [41] |
| Area Under the Curve (AUC) value | [51] |
| IQR (interquartile range) | [55] |
As summarized in Table 4, most studies evaluate forecasts with unit-valued errors such as MAE and RMSE (reported in the native currency per MWh), while a non-trivial subset additionally employs percentage- or scale-free criteria (e.g., MAPE, sMAPE, NMAE/NRMSE). Unit-valued metrics are indispensable for operational interpretation, because they measure error directly in “currency per energy” and thus align with financial impacts, but they hinder comparisons across hours with different price levels or across years and markets with shifting price regimes. Scale-free metrics address this by normalizing error (to a mean, range, or to the magnitude of the observation/forecast), enabling more meaningful cross-period and cross-market benchmarking. However, MAPE suffers from two well-known issues in electricity price applications: it is undefined (or numerically unstable) when prices are zero or near zero, and it weights over- and under-estimation asymmetrically. Consistent with the studies listed in Table 4, sMAPE is therefore preferable as a percentage indicator because its symmetric denominator mitigates the zero-price pathology and balances positive/negative deviations. In practice, it could be recommended reporting both classes in tandem, MAE/RMSE for monetary interpretability and sMAPE plus NRMSE/NMAE for scale-robust comparison, while documenting the chosen normalization scheme to ensure reproducibility and fair assessment.”
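To make the MAPE/sMAPE point concrete, the following small sketch (illustrative only, with hypothetical prices) shows how a single near-zero observation inflates MAPE while sMAPE stays bounded:

```python
import numpy as np

def mape(y, yhat):
    return 100 * np.mean(np.abs((y - yhat) / y))

def smape(y, yhat):
    # symmetric denominator avoids the blow-up when prices approach zero
    return 100 * np.mean(2 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat)))

y    = np.array([60.0, 0.5, -5.0, 120.0])   # balancing prices can be near zero or negative
yhat = np.array([55.0, 8.0,  3.0, 100.0])

print(f"MAPE : {mape(y, yhat):.1f}%")   # dominated by the near-zero observation
print(f"sMAPE: {smape(y, yhat):.1f}%")  # bounded and far less sensitive to it
```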
In addition, the table below lists the test periods discussed in Section 6:
Table. Test period
| TEST PERIOD | Ref. |
| --- | --- |
| 1 day; 6 days; 20 days; 3 weeks; 1 month; 41.7 days; 2 months; 2.5 months (non-holiday); 3 months; 4 months; 8 months; 1 year; 1.5 years; 3 years | [24] [34] [36] [19][54] [10] [22] [28] [38] [40] [50] [15][61] [35] [39] [52] [58] [42] [35] [47] [50] [56] [51] [20][26] [30][31] [32] [43] [44] [17] [21] [25] |
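As a complement to the test-period discussion above, the sketch below illustrates one possible rolling-origin back-test of the kind recommended in Section 6; the window lengths, the naive placeholder forecaster, and the synthetic price series are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

def rolling_origin_splits(index, train_days=365, test_days=30, step_days=30):
    """Yield (train_idx, test_idx) pairs over contiguous, forward-rolling evaluation blocks."""
    start = index.min()
    while True:
        train_end = start + pd.Timedelta(days=train_days)
        test_end = train_end + pd.Timedelta(days=test_days)
        if test_end > index.max():
            break
        yield (index[(index >= start) & (index < train_end)],
               index[(index >= train_end) & (index < test_end)])
        start += pd.Timedelta(days=step_days)   # roll the origin forward

# Synthetic hourly balancing prices spanning two years (placeholder data)
idx = pd.date_range("2022-01-01", "2023-12-31 23:00", freq="H")
prices = pd.Series(np.random.default_rng(0).gamma(2.0, 30.0, len(idx)), index=idx)

errors = []
for train_idx, test_idx in rolling_origin_splits(idx):
    naive = prices.loc[train_idx].tail(24).mean()            # placeholder forecaster
    errors.append((prices.loc[test_idx] - naive).abs().mean())
print(f"MAE across {len(errors)} rolling-origin blocks: {np.mean(errors):.2f}")
```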
- The section correctly highlights the promise of probabilistic forecasting but misses a short discussion on calibration and sharpness (or how the surveyed works report interval quality). I recommend the author include a paragraph summarizing whether existing probabilistic studies report calibration tests (e.g., PIT, reliability diagrams) and how those results compare to point-forecast baselines. That would add depth beyond naming models.
Thanks. We have added a paragraph:
“In probabilistic forecasting, the performance metrics such as Prediction Interval Coverage Probability (PICP), Average Coverage Error (ACE), Prediction Interval Normalized Average Width (PINAW), Winkler score, Coverage Width Criterion (CWC), and Continuous Ranked Probability Score (CRPS), are essential for interpreting results and for capturing the fundamental trade-off between reliability (the proportion of realizations contained within the predicted intervals) and sharpness (the narrowness of those intervals). For example, Tahmasebifar et al. [40] report improved ACE, PINAW, and CWC alongside RMSE and MAPE when comparing their hybrid model, while Mori and Nakano report empirical coverage at ±σ, ±2σ, and ±3σ (e.g., 89.8%, 100%, 100% with a Mahalanobis kernel) as an implicit calibration check [38]. Recent imbalance-forecasting work further emphasizes the reliability–sharpness trade-off using quantile loss, Winkler score, and CRPS, defining reliability as the fraction of realized values below forecast quantiles [32].”
- The text notes that preprocessing/post-processing is under-used, and Table 1 compiles methods. Still, the paper should (a) explain the criteria for listing methods in Table 1 (which papers contributed which items), (b) provide a short evaluation of which preprocessing steps consistently improved results across studies (if possible), and (c) highlight pitfalls (e.g., transformations that hide spikes). A small meta-analysis (even qualitative) here would substantially raise the paper's value.
Thanks. We have clarified as:
“Applying an LSTM autoencoder as a preprocessing step significantly improved the performance of multiple classifiers in the Turkish balancing market case study. Without the autoencoder, individual models such as Random Forest or Logistic Regression achieved accuracies around 57–59%, whereas with the autoencoder the hybrid ensemble reached over 61% accuracy [44]. This translated into tangible reductions in imbalance costs, with yearly savings of 6–11%. Similarly, in the Romanian imbalance volume forecasting study, adding a pre-processing stage where the imbalance sign was predicted and then incorporated into the dataset markedly improved forecasts. Across daily, monthly, and 8-month test horizons, including the imbalance sign reduced MAE and RMSE by nearly half and increased R² from 0.94 to 0.98 [51]. This demonstrates that pre-processing not only stabilizes the data against price extremes but also enriches the feature space with predictive signals, thereby strengthening both statistical fit and practical forecasting accuracy. Pre-processing can also include scaling and outlier trimming (e.g., capping prices above £140) such as in [15]. In another study [56], pre-processing involves variance-stabilizing transformations (arcsinh) to mitigate the effect of price extremes. As a result, the entire framework demonstrates competitive accuracy: an RMSE of 35, outperforming strong baselines such as AE-LSTM (RMSE 49) and PatchTST (RMSE 42).”
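As a purely illustrative sketch (not the implementation used in [15] or [56]), the two pre-processing steps mentioned above can look roughly as follows; the cap level, the scaling choice, and the example prices are assumptions, and capping deliberately hides spike magnitude, which is exactly the pitfall to document.

```python
import numpy as np

def cap_prices(prices, cap=140.0):
    """Cap extreme prices (cf. the £140 example); note this hides spike magnitude by design."""
    return np.minimum(np.asarray(prices), cap)

def arcsinh_transform(prices, scale):
    """Variance-stabilizing arcsinh: behaves like log for large |x| but handles zeros/negatives."""
    return np.arcsinh(np.asarray(prices) / scale)

def arcsinh_inverse(z, scale):
    return np.sinh(z) * scale

prices = np.array([-20.0, 35.0, 80.0, 450.0])    # hypothetical balancing prices with one spike
scale = np.median(np.abs(prices))                # assumed scaling constant
z = arcsinh_transform(prices, scale)
print(z)                                         # compressed but monotone and invertible
print(arcsinh_inverse(z, scale))                 # recovers the original prices
print(cap_prices(prices))                        # the spike is truncated to 140
```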
- The review correctly emphasizes day-ahead prices, net-imbalance volumes, and meteorology. Two additions would help readers: (a) a brief taxonomy of features by availability / latency (published real-time vs delayed), and (b) a discussion of the propagation of input forecast errors (the paper mentions this later; it should be referenced and cross-linked here). Readers implementing forecasting systems need guidance on which features are reliable in practice.
We thank the reviewer for this valuable comment. We have now added:
“Wind variability and demand are highlighted as primary price drivers in the Irish I-SEM market [25] and PJM [34]. Recent studies show that particularly influential explanatory variables include the Net Imbalance Volume, which directly reflects short-term supply–demand mismatches, and the Loss of Load Probability, which captures scarcity conditions [28,46]. Calendar effects, such as monthly seasonality and de-rated capacity margins, further enhance forecast accuracy by embedding system stress linked to seasonal load and availability patterns [15,43]. As previously emphasized, the day-ahead market price again emerges as a strong source of explanatory power [35].”
- I strongly appreciate the emphasis on transaction costs, latency, and regulatory caps — these are often neglected. However, the review would be much stronger if it (a) lists which studies included transaction costs or latency in their backtests (and their numerical impact), and (b) proposes a minimal standardized profit-aware evaluation protocol (a short checklist or pseudo-algorithm). That will help future benchmarking and addresses a central gap the author identifies.
Thanks. We have clarified as:
“Recent work on the Irish market explicitly incorporates participation and per-MWh transaction fees and enforces realistic battery energy storage system constraints (capacity, ramp limits, round-trip efficiency, and state-of-charge bounds), alongside a non-negative expected-profit filter and daily trade caps [25]. However, execution latency/slippage and explicit regulatory price caps are not parameterized, and several asset life-cycle items (e.g., warranty, insurance, end-of-life, incentives) are excluded.”
….
“Latency, in this context, refers to the delay between when imbalance information becomes available and when operational decisions can actually be taken. Riveros et al. [54] demonstrate that micro-CHP aggregators could reduce weekly system costs by about 5% through near real-time optimization, but this outcome assumes immediate access to Net Regulation Volume (NRV) signals. In practice, imbalance prices and NRV are only published after the 15-minute settlement period, which prevents operators from acting instantaneously and substantially reduces the realizable benefit. Smets et al. [58] similarly show, in a Belgian balancing-market case study, that although improved forecasting approaches can increase profits for storage systems up to 176%, the perfect-foresight benchmark performs substantially better. This persistent gap arises because operators must rely on forecasts rather than actual imbalance prices. As latency increases, decisions are made further from real time, and consequently arbitrage profits decline.
Statistical accuracy in price forecasting does not directly translate into financial returns; therefore, a real-time profit-and-loss accounting module is essential. Although most of the electricity price forecasting literature still emphasizes statistical forecast accuracy, a growing but still limited body of research highlights the importance of profit-aware evaluation and explicit risk quantification. For instance, Maciejowska et al. [43] calculate yearly profits and quantify downside risk with the Value at Risk (VaR) metric in the Polish balancing market to benchmark the performance of their statistical models. Riveros et al. [54] explicitly evaluate their CHP rescheduling strategy in the Belgian market in terms of ex post profits. He and Song [53] incorporate probabilistic LMPs and rival strategies into a multi-criteria bidding model where expected payoff, market share, and risk jointly determine optimal bids. Similarly, Li and Park [69] model wind farm bidding in the U.S. PJM market by embedding penalty costs for forecast deviations, ensuring ex post profit maximization.”
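To illustrate the kind of profit-and-loss accounting referred to above, the following sketch computes ex-post profit together with historical VaR and CVaR from a trade-level P&L series; the fee level, traded volume, and P&L distribution are entirely hypothetical and are not taken from any surveyed study.

```python
import numpy as np

def var_cvar(pnl, alpha=0.95):
    """Historical Value-at-Risk and Conditional VaR on the loss side of a P&L series."""
    losses = -np.asarray(pnl)                    # positive values represent losses
    var = np.quantile(losses, alpha)             # loss exceeded with probability 1 - alpha
    cvar = losses[losses >= var].mean()          # expected loss beyond the VaR threshold
    return var, cvar

rng = np.random.default_rng(1)
gross = rng.normal(120.0, 300.0, 365)            # hypothetical daily arbitrage P&L (EUR)
fee = 2.0 * 10.0                                 # assumed 2 EUR/MWh fee on 10 MWh traded per day
pnl = gross - fee

var95, cvar95 = var_cvar(pnl, 0.95)
print(f"Total profit: {pnl.sum():.0f} EUR | VaR95: {var95:.0f} EUR | CVaR95: {cvar95:.0f} EUR")
```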
- The review currently omits whether the surveyed papers make code and datasets available. Please add a short paragraph (or table column) stating how many studies provide code/data and emphasize the importance of reproducible pipelines for this field. E.g. https://doi.org/10.1016/j.dib.2025.112042
We now write at the beginning of Section 6:
“Table 3 lists the markets/regions considered in the surveyed studies. Many works use publicly available datasets released by market and transmission system operators. Notably, two studies, [19] and [25], openly release their code along with the dataset. “
- The manuscript is generally well written, but a careful proofreading pass is needed to correct minor grammar and punctuation slips and to improve flow in a few dense paragraphs (e.g., the long lists in Table 1 can be split for readability). Also consider consistent use of abbreviations (introduce IVC, IV, P etc. once and then use consistently).
We thank the reviewer for this suggestion; the abbreviations (IV, IVC, P, etc.) are now introduced earlier in the manuscript and used consistently throughout, and minor grammar and readability issues have been carefully corrected.
- The conclusions list sensible gaps. To increase impact, convert the list into a prioritized research agenda (short-, medium-, long-term items), and where possible indicate measurable milestones (e.g., "Develop standard spike-label dataset by 2026", "Establish profit-aware evaluation protocol: metrics X,Y,Z"). This will help the paper serve as a rallying point for the community.
We now finish the manuscript with future directions:
“As a future line of work, a Spike-Aware approach is planned to be implemented for balancing price forecasting, as outlined in the pipeline shown in Figure 2. In this framework, periods of normal behavior and periods with price spikes are distinguished using statistical detection techniques such as z-score thresholds or adaptive interquartile range methods. For each regime, ensemble weights can be optimized separately so that the most effective component models receive greater emphasis when a spike is detected. Forecasting accuracy and robustness can then be evaluated using risk-sensitive metrics, including symmetric Mean Absolute Percentage Error (sMAPE) and tail-oriented measures such as Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR), thereby quantifying risk and promoting profit-aware evaluation. During inference, the method would combine the outputs of all base models with regime-specific weights, resulting in a hybrid ensemble where model blending is explicitly conditioned on the predicted spike status. To further diagnose and refine performance from a profit-aware perspective, two additional measures may be incorporated: the maximum number of consecutive hours in which forecasting errors exceed 30% or 50%, which highlights persistent failures, and the proportion of forecasts where the absolute percentage error falls below fixed thresholds (e.g., 10% or 15%), which provides an indicator of reliability of the method.”
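For concreteness, a minimal sketch of two building blocks of such a pipeline is given below: an adaptive IQR spike detector and regime-conditioned blending of base-model forecasts. The window length, the IQR multiplier, the base models, and the weights are placeholders, not the planned implementation.

```python
import numpy as np

def spike_flags(prices, window=168, k=3.0):
    """Adaptive IQR rule: flag hours whose price exceeds Q3 + k*IQR of the trailing window."""
    prices = np.asarray(prices)
    flags = np.zeros(len(prices), dtype=bool)
    for t in range(window, len(prices)):
        q1, q3 = np.percentile(prices[t - window:t], [25, 75])
        flags[t] = prices[t] > q3 + k * (q3 - q1)
    return flags

def blend(forecasts, spike_predicted, w_normal=(0.7, 0.3), w_spike=(0.2, 0.8)):
    """Combine base-model forecasts with regime-specific ensemble weights."""
    w = w_spike if spike_predicted else w_normal
    return float(np.dot(w, forecasts))

rng = np.random.default_rng(2)
prices = rng.gamma(2.0, 30.0, 500)               # synthetic hourly balancing prices
prices[480] = 900.0                              # inject one spike
print(spike_flags(prices)[478:482])              # the injected hour is flagged

# Hypothetical base-model outputs for one delivery hour (EUR/MWh)
stat_model, ml_model = 65.0, 240.0
print(blend([stat_model, ml_model], spike_predicted=False))  # normal regime: favor the smooth model
print(blend([stat_model, ml_model], spike_predicted=True))   # spike regime: favor the spike model
```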
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
You are very much like an orator; the response is very fluent. This version is OK.
Author Response
We would like to thank the reviewer for the insightful comments, which helped us refine the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have not added any new references in the updated version of the paper. It is recommended that the authors revisit the previous round's comments (#7, #8 and #9), read them carefully, and modify the manuscript accordingly. Please do not hurry this time.
Author Response
We sincerely thank the reviewer for the suggestions and constructive feedback, which have been invaluable in improving the paper.
Balancing-market price forecasting remains relatively underexplored. Very few studies conduct feature ablation (i.e., comparing errors with and without a specific feature), so the literature provides limited guidance. However, in the previous revised version, we added a paragraph on this topic, particularly emphasizing historical day-ahead price data as an important indicator or signal, supported by the quantitative results in the literature.
Cross-study synthesis is not straightforward in some contexts because most results are reported in unit-based metrics and local currencies (e.g., MAE in EUR/MWh), which vary across hours (peak vs. off-peak), seasons, years (with a pronounced upward price drift), and markets. Thus, a 20 EUR/MWh error may be acceptable during peak hours but not during low-price periods. The review also explicitly discusses this. We provided the test periods and the market on which each study focuses, but some missing details still make cross-study synthesis difficult.
In the previous revised version, we emphasized the transaction costs, latency, and regulatory caps, while noting that only a small subset of works address these factors. We also reported how many studies provide code and data. We outlined future directions, including a profit-aware approach.
Despite our best efforts, a few relevant studies may have escaped our attention. It should also be noted that we excluded papers that are not representative of balancing-price forecasting. For example, we omitted one study reporting exceptionally low errors, well below what is typically achievable, without sufficient methodological explanation. We also excluded another work discussing the reduction of imbalance costs in India through the estimation of locational marginal prices. However, such an approach must be carefully assessed, since a locational marginal price mechanism does not exist in India in reality. Hypothetical approaches and conclusions derived from multiple assumptions may not be transferable to the actual balancing context. These decisions are not value judgments about those papers but reflect our aim to synthesize results that are methodologically consistent, empirically grounded, and comparable across real markets.
We appreciate the reviewer’s suggestions. However, the journal advised us to avoid unnecessary citations without a direct link to balancing-market price forecasting. Nevertheless, it is evident that the Data in Brief resource the reviewer mentioned is valuable, especially given Europe’s practical data-access constraints. While TSOs publish several operational variables, major exchanges (e.g., Nord Pool, EPEX Spot) do not publicly release feature-rich historical feeds (such as comprehensive day-ahead price series). Such datasets are therefore important for benchmarking and reproducibility. However, because the Data in Brief resource primarily addresses flexibility rather than balancing-price forecasting, we reference it sparingly to keep the review within scope.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
It is appreciated that the author believes in the journal's standards. Hence, I suggest that the paper be excluded from the Special Issue focused on "Flexibility", as the Special Issue does not have a "direct link to balancing-market price forecasting". It is suggested to find another Special Issue or to consider the manuscript as a regular submission.