Forecasting the Occurrence of Electricity Price Spikes: A Statistical-Economic Investigation Study

: This research proposes an investigative experiment employing binary classification for short-term electricity price spike forecasting. Numerical definitions for price spikes are derived from economic and statistical thresholds. The predictive task employs two tree-based machine learning classifiers and a deterministic point forecaster; a statistical regression model. Hyperparameters for the tree-based classifiers are optimized for statistical performance based on recall, precision, and F1-score. The deterministic forecaster is adapted from the literature on electricity price forecasting for the classification task. Additionally, one tree-based model prioritizes interpretability, generating decision rules that are subsequently utilized to produce price spike forecasts. For all models, we evaluate the final statistical and economic predictive performance. The interpretable model is analyzed for the trade-off between performance and interpretability. Numerical results highlight the significance of complementing statistical performance with economic assessment in electricity price spike forecasting. All experiments utilize data from Alberta’s electricity market.


Introduction
Price spikes are unexpected and abrupt extreme prices [1] whose value can be several orders of magnitude higher than normal electricity prices [2].Moreover, due to their stochastic nature [3], price spikes are short-lived extreme price variations [4] observed in the short-term operation of all wholesale electricity markets.These extreme prices impact both, the supply and demand sides.
The occurrence of electricity price spikes in the short term can be associated with one or a combination of several factors.For example, forced outages of power generators and transmission lines [4,5]; transmission line congestion [6]; forecast errors in the production of intermittent generation (i.e., wind and solar) [7]; sudden increments of the electrical demand [8]; and strategic bidding of market participants [9].Overall, these extreme prices are shaped by the complex interactions between the technical and economic forces driving the operation of the power system and the electricity market.
Real-time and day-ahead electricity markets are impacted by the occurrence of electricity price spikes.Notably, real-time markets are more prone to observe a higher frequency of the occurrence of these spikes [10].This is because unplanned physical power system disruptions need to be handled in real time, hence, increasing the probability of high marginal cost generators to be dispatched to provide any last resort capacity to cover the system contingency [8].However, in day-ahead markets, price spikes have become more frequent in recent years [11], too.This is likely attributed to the increased penetration of variable generation resources that carry an inherent production uncertainty [12,13].More frequent and severe spikes in both types of market structures require market participants to be better prepared to hedge against the increasing volatility of electricity prices.
Price spikes are the most prominent characteristic of the wholesale electricity market prices [10]; these extreme prices have significant economic implications.For the supply side, price spikes present opportunities to generate more profits.Likewise, for the demand side, spiky prices entail high energy costs that should be avoided where possible.Moreover, for energy traders, price spikes bring opportunities for economic arbitrage.Thus, predicting the occurrence and/or magnitude of these extreme prices is important for all market participants.
Different types of price spike forecasts are reported in the literature.For example, the authors of [8,9,13,14] estimate forecasts of the probability of occurrence of price spikes.Moreover, some works in price spike forecasting use a hybrid approach, e.g., predictions of the occurrence of price spikes are used to generate deterministic spiky forecasts [2,3,15,16].Likewise, other works on price spike forecasting directly generate deterministic forecasts [17][18][19][20].Finally, using a binary classification approach, the authors of [21,22] generate class forecasts of the occurrence of price spikes.
In a wholesale electricity market, price spike forecasts serve different purposes for different types of market participants.For example, deterministic price spike forecasts can be used to optimize the bidding strategies of a generation asset in the ancillary services market [23].Likewise, a retailer of electricity may use forecasts of the probability of occurrence of spiky prices to hedge its operational portfolio against future extreme prices [8].From the modeling perspective, price spike forecasts are conditional to a certain threshold.
Selection of a threshold for price spikes usually lies on either two general categories, i.e., fixed and/or variable [24].Some authors [4] also identified these thresholds as economic or statistical, respectively.Furthermore, the selection of the threshold for high or extreme prices would vary subject to different criteria, for instance, market structure and/or modeler's expert knowledge; and is being argued to be generic [8].
A price spike forecasting model requires the definition of a threshold that would categorize the spiky prices from normal prices.This holds for the majority of the price spike forecasting approaches in the literature, i.e., deterministic [23]; occurrence as a probability [13]; and occurrence as a binary classification approach [21].For example, in the price spike occurrence forecasting literature, we have observed the proposed modeling methodologies involve one, or a combination of the two general price spike thresholds previously mentioned, and one or different models are used to generate the forecasts of the spiky prices either as probabilities or price classes.
From the multiple modeling approaches in the literature of price spike occurrence forecasting, methods can be categorized as parametric or non-parametric.Examples of parametric methods used to forecast the occurrence of price spikes are [4,8,13,24,25].Likewise, examples of non-parametric methods that have been also used to predict the occurrence of price spikes are [3,21,22,26].From this set of references, to better understand the decisions made by a predictive model, the authors of [24] map the contribution importance of different input features to predictions of the probability of occurrence of price spikes.
The notion of interpretability in machine learning can be understood as models that are inherently interpretable [27].Conversely, explainability refers to post hoc explanations that are generated on top of black box models [27].In electricity price forecasting, model interpretability has recently started to gain inertia; some of these models are based on interpretable neural networks, e.g., see the studies from [5,28].Other works, like [24], introduce a form of model explainability as a post-processing stage.Either an interpretable or an explainable model, the goal seems to be the same, i.e., a collective research effort towards enhancing the trustworthiness of the predictions generated by the AI algorithms [28].If this is achieved, it should help on reduce the barriers to better acceptability of AI in the industry [29].
In this paper, we contribute to the existing literature by proposing an investigative methodology to evaluate how alternative price spike forecasting methods perform when assessed against different fixed and variable price spike thresholds and performance evaluation metrics.We employ various metrics for both, model hyperparameter optimization and performance evaluation.In particular, we apply classification-based and regressionbased methods.The deployed classification-based methods are optimized against recall, precision, and F1-score to generate binary predictions of the occurrence of price spikes.The regression-based model is adapted from the literature on electricity price forecasting [30].This model generates deterministic point price forecasts that are converted to binary spike/non-spike predictions based on the different spike thresholds.We evaluate the performance of all models against both, data-driven and economic measures.Furthermore, inspired by the recent works in interpretable machine learning in electricity price forecasting [5,28], we explore the costs and benefits of interpretability in price spike occurrence forecasting.Here, we use the decision rules generated from an interpretable model [31] to predict the occurrence of spiky prices.Based on the obtained statistical and economic performance, we quantify the model's trade-off between performance and interpretability.In other words, the economic impact of interpretability, which, to the best of our knowledge, has not been explored yet in the price spike forecasting literature.This paper provides a comprehensive investigation of the value of predicting electricity price spikes when alternative models and performance evaluation measures are implemented.
The rest of this paper is organized as follows.Section 2 introduces the relevant literature associated with the proposed research study.The investigative experiment used to predict the occurrence of price spikes is presented in Section 3. Section 4 discusses in detail the numerical results of the analysis.Likewise, Section 5 summarizes the findings of this investigative methodology, discusses the limitations of the proposed study, and provides potential future research paths.

Literature Review
This section provides a review of the works in electricity price spike forecasting.Particularly, research methodologies on price spike occurrence forecasting.Furthermore, we review those works in the electricity price forecasting literature that incorporate economic evaluation as part of the task, and, if applicable, some form of explainability.

Forecasting the Occurrence of Price Spikes
The task of forecasting electricity price spikes requires establishing a threshold over which price spikes can be numerically defined.In other words, prices above this threshold would be mapped as price spikes.Thresholds for price spikes can be either economic [4] or statistical [3], respectively.They are also known as fixed [32] or variable [9] thresholds, too.
The definition of the proper threshold for price spikes depends on different factors.Fixed thresholds are selected based on the economic applications or practices of the different market participants [8,26].For example, common fixed threshold values observed in electricity price spike occurrence forecasting correspond to AUD 100/MWh or AUD 300/MWh [4,8,14].Authors in [25] select fixed thresholds for price spikes corresponding to AUD 100/MWh, AUD 250/MWh, and AUD 500/MWh, respectively.The authors of [21] uses fixed thresholds for price spikes of USD 150/MWh and USD 200/MWh.Similarly, the work in [26] selects fixed price spike thresholds of EUR 80/MWh, EUR 90/MWh, EUR 100/MWh, EUR 120/MWh and EUR150/MWh, respectively.Overall, fixed thresholds present the disadvantage of not considering the price fluctuations over time [1,13].
In the literature on electricity price spike occurrence forecasting, different variable thresholds have been proposed.Variable thresholds consider the electricity price movements associated with the market dynamics, e.g., daily or weekly seasonality [19,33].For example, authors in [21] define a variable threshold using the mean plus two standard deviations of the Australian electricity prices.Similarly, the work in [22] defines a variable threshold for the Australian market using a kernel function to estimate the expected value of the prices plus the adjusted standard deviation affected by a constant.In [13] a two-regime autoregressive threshold conditional to upper (AUD 400/MWh) and lower (AUD 80/MWh) limits, is used.Likewise, a variable threshold based on the 5th and 95th quantiles is proposed by [24].Finally, recursive filters, variable price thresholds, recursive filters on prices, and regime-switching classification, are used as price spike thresholds in [9].
Predicting the occurrence of price spikes is not a trivial task [9].Generally, two different approaches are observed in the literature.The first approach considers a parametric modeling; which can be used to quantify the uncertainty of the predictions [24], e.g., probabilistic price spike occurrence forecasting.The second consists of a non-parametric approach that treats the problem as a pattern recognition task [33], e.g., a binary classification [21].
Different parametric models have been used to forecast the occurrence of price spikes.For example, authors in [8,13,25] fit different econometric models to predict one-day and one-step-ahead probabilities of the occurrence of price spikes in the Australian market.These models consist of a zero-inflated Poisson autoregressive with exogenous variables, a Poisson autoregressive, and an autoregressive conditional hazard, respectively.Considering a one-step-ahead forecasting horizon, authors in [14] forecast the occurrence of price spikes in the Australian market by fitting a dynamic copula-based multivariate discrete choice model.In [4] a modified version of the model proposed by [8] and different configurations of the logit and scobit models generate one-step-ahead forecasts in the Australian market.Finally, the work in [9] uses the demand-to-capacity ratio to forecast the 2-day up to 2-week probability of occurrence of price spikes in the UK market; and finds this probability is well fitted to an exponential function.Authors in [24] forecast the probability of occurrence of day-ahead price spikes in the Australian electricity market using a multivariate logistic regression.Similarly, authors in [34] generate 14-day-ahead forecasts of the occurrence (probability and binary) of price spikes in the Japanese market using a modified form of the Hawkes model.
Non-parametric models have also been used in electricity price spike occurrence forecasting.For example, the authors of [26] generate forecasts of the probability of price spikes in the Dutch market using a fundamental model [1] consisting of the functional form of the relation between supply and demand.Authors in [21] predict the occurrence of day-ahead price spikes in the PJM and Australian markets using a probabilistic NN in a binary classification approach.Authors in [35] predict the occurrence of one-step-ahead price spikes in the region of Queensland, within the Australian market, using six different machine learning classifiers.Finally, the work in [22] forecasts the occurrence of price spikes using SVM classifiers in the Australian market.In all cases, these works rely on a variety of error metrics that are usually specific to evaluate the statistical performance of the algorithms in forecasting price spikes [3].
Electricity prices exhibit right skewness and high levels of kurtosis due to the presence of extreme price spikes [10].Thus, in price spike forecasting, it is important to consider error metrics that properly penalize forecasts based on these characteristics.At their convenience, market participants should look for the set of error metrics that take into account the asymmetric nature imposed on the spot prices time series due to the presence of extreme price spikes [25].In other words, metrics that penalize the model's decision on incorrectly assigning the value of a normal price to a price spike, or vice versa.
The literature on price spike forecasting proposes different metrics to evaluate the forecasts.For example, ref. [25] defines an error metric with attention on penalizing more the underestimation of price spikes than their overestimation.Authors in [8] use different error metrics, namely, log-probability score error, the asymmetric loss score, mean absolute error (MAE), and root mean square error (RMSE).In [4,14] the spike forecast errors are evaluated with the predictive loglikelihood and the Cramer statistic.The work from [16] proposes the spike prediction accuracy (i.e., recall) and confidence (i.e., precision) evaluation metrics; which are further used by [3,20,21,35].In [24] the area under the curve, which can be derived from the receiver operating characteristic, is used to evaluate spiky forecasts.Similarly, reference [13] uses different error metrics, e.g., true (false) positive rate, forecast score, MAE, RMSE, mean probability score error, and the asymmetric loss used in [8].The coefficient of determination R 2 is used to evaluate the good-of-fitness of the probabilities of observing price spikes by [9].In [34] the MAE, an adjusted weighted accuracy, and the Matthews correlation coefficient are used as metrics to evaluate the spiky forecasts.Overall, price spike forecast error metrics should be associated with a market participant's decision-making process [35].
The investigative studies on price spike occurrence forecasting from [4,35] propose the use of either a fixed or variable threshold, respectively.In contrast to these studies, our investigative study incorporates the use of both, fixed and variable thresholds.Moreover, in contrast to other non-parametric (i.e., machine learning) modeling approaches in the price spike occurrence forecasting literature [21,22,35], we select the optimal set of hyperparameters using different error metrics during the hyperparameter optimization process.

Economic Evaluation of Electricity Price Forecasts
An economic evaluation of electricity price forecasts allows market participants to assess the inherent benefits of these predictions, such that profits can be maximized [36].Additionally, the economic risks posed by price spikes on these market participants suggest that the evaluation of the forecasts goes beyond measuring their statistical performance, and invites to carefully design price risk management strategies [37].
Different case studies are presented in the price forecasting literature.For example, the authors of [8] evaluate the profitability of the spiky forecasts by simulating a trading strategy based on future contracts.The main goal is to simulate hedging the risk that an Australian retailer of electricity would have against price spikes.The work in [35] quantifies the cost-benefit that the forecasts of the occurrence of price spikes have on a generator aiming to optimize its bidding strategies in the Australian market.The economic impact of electricity price class forecasts for the New York market is evaluated with a demand-side management strategy applied to an industrial load in [38].Reference [39] proposes a statistical arbitrage trading strategy based on price differences forecasts for the short-term Dutch electricity markets (i.e., day-ahead, intraday, and balancing).The trading positions are managed using deep reinforcement learning agents.Overall, due to the increasing penetration of intermittent generation, modern electricity markets present a challenging environment for risk management to the market participants [40].
Accounting for the uncertainties of the price forecasts allows market participants to better evaluate the risks associated with different trading activities, e.g., managing a power portfolio [41].Recent works in electricity price forecasting evaluate the economic implications of probabilistic forecasts.For example, authors in [36,41,42] quantify the profitability of probabilistic price forecasts in the German day-ahead market by implementing a trading strategy that looks to optimally allocate bids to charge and discharge a battery storage system.
To add to an already complex operation of modern electricity markets, as previously stated, researchers have been also inspired to explore model interpretability in electricity price forecasting.One reason is that computational intelligence models [1], i.e., AI models, have been demonstrated to outperform state-of-the-art statistical methods in price forecasting applications [11].However, the decision-making process of AI models still represents a barrier to acceptability among industry stakeholders [29].
Finally, the work in [35] quantifies the benefits and cost impacts of a price spike occurrence forecasting strategy based on empirically varying a series of factors that directly affect the weights of correctly or incorrectly classified price spikes.In other words, factors that affect the weights of the true (false) positives and negatives.In contrast to these authors, our economic evaluation allows a selected price spike threshold to quantify how much the benefit or cost would be for a market participant.

Interpretability in Electricity Price Forecasting
The literature in electricity price forecasting is aware of the importance of model interpretability in AI.An interpretable model can serve different purposes.For example, to enhance the trustworthiness of an algorithm's output, to better understand the interactions between input variables and the model's output, and to improve our understanding of the phenomenon under study [28,43].
Authors have used interpretable models in electricity price forecasting.For example, the authors of [5] present an interpretable Transformer-based probabilistic forecasting model to predict the imbalance electricity prices in the Belgian market.Here, interpretability plays a key role because the Transformer-based attention mechanism allows it to learn the non-linear temporal (past and future) relationships between the input features and the target variable at specific time steps.Moreover, it allows for global and local interpretability analysis of input features associated with their temporal nature.Similarly, the work by [28] proposes a neural basis expansion analysis with exogenous variables to generate price forecasts in four (one) European (US) day-ahead markets.Here, interpretability is achieved as the combination of a stack of fully connected NN, along with a mechanism capable of decomposing the forecasts time-series into its trend and seasonal components (e.g., harmonic functions and polynomial trends), as well as the effects of exogenous features (time-varying local regression).Moreover, using a multivariate logistic regression model, authors in [24] conduct relative importance analysis to evaluate the impact the input features have on predicting the probability of occurrence of price spikes in the Australian market.Authors in [44] generate electricity price point forecasts for Ontario's market using a single-layer NN and an adaptive neuro-fuzzy inference system based on decision rules.These rules are of the form 'if-then', i.e., antecedent-consequent.
Even though interpretable AI promises to bring innovation to the field of electricity price forecasting, quantifying the economic benefits and risks of the price forecasts is of importance to the market participants.Hence, we quantify the so-called trade-off between model performance and interpretability.In this sense, performance can be understood as the different evaluations a model is subject to, i.e., statistical and economic.Moreover, to the best of our knowledge, no other work in the electricity price spike forecasting literature has explored the economic impact of interpretability.

The Proposed Methodology
The proposed investigation methodology uses a set of different forecasting models, input features, price spike thresholds and evaluation error metrics to generate forecasts of the occurrence of electricity price spikes in a binary classification task.Additionally, the value of the forecasts is also quantified with an economic evaluation.

Modeling Approach
A set of models, i.e., classification-based (interpretable and non-interpretable) and regression-based models C k , k ∈ {k 1 , k 2 , ...} generate binary predictions ĉk ∈ {0, 1} of the occurrence of normal and price spikes at the forecasting origin.Here, class 0 represents normal prices, and class 1, price spikes.Moreover, each classifier C k is trained, validated and tested using a dataset D, where {X, y} ∈ D, D ∈ R. X is the set of input features and y is a vector of historical spot prices.
The feature set X consists of a combination of features that help classifier C k to better capture the extreme price movements and the seasonal effects at the forecasting origin.Here, observed and forecasts of the features at times t and t − h are used to generate the one-step-ahead forecasts of class prices ĉk .
In electricity price forecasting is a common practice to use lagged values of the features to account for seasonal effects of the spot prices [45], e.g., daily and weekly seasonality.Furthermore, in electricity price spike forecasting, lagged values should also account for the fact that price spikes tend to cluster during short periods of time [13,34,46], e.g., no more than 24 h [3].
Different price spike thresholds τ α are used in the proposed investigation study to generate binary electricity price class predictions ĉk .Let us define τ α as follows: The set of thresholds in (1) correspond to fixed and variable thresholds, i.e., τ α is used such that ŷk is converted to ĉk .
In price spike forecasting, different thresholds serve different purposes.For example, fixed thresholds, i.e., τ α in (1), can be defined based on a market participant's future short-term operation that would satisfy a profitable economic expectation in the electricity market.Fixed thresholds can be also thought to define an economic spike [4], associated, for example, with trading activities [8].However, fixed thresholds on spiky prices do not reflect spot price dynamics.
Electricity prices are responsive to the market dynamics.For example, the seasonal effects of the electricity demand [11] and the variability of an increasing amount of renewable resources [13,47], among others.As opposed to fixed thresholds, a variable threshold for price spikes, i.e., τ in (1) accounts to reflect such dynamics [19,24].However, the definition of a variable threshold is conditional to past market price dynamics.Once defined, it can be used to forecast future price spikes.However, it is restricted to the flexibility allowed by the fixed threshold to set a value that appropriately aligns with future short-term economic operations.In any case, in price spike forecasting, a proper selection of a threshold can be adapted to the different necessities of market participants.Here, predictions ĉk from classifier C k should be evaluated based on a selected forecast error metric that accounts for the price spikes.

Statistical Evaluation Error Metrics
A forecast error metric E i evaluates the statistical performance of a classifier C k .Moreover, the training forecast error metric E * j is used during the hyperparameter optimization process of classifier C k .
The optimal set of hyperparameters is found during the training stage by using a Bayesian optimization approach and an error function E * j .Once found, predictions ĉk are generated with the test set and evaluated using E i .These error metrics would influence a classifier's decision function towards adding more weight to different sets of relationships, i.e., between correctly and incorrectly predicted price spikes.Here, incorrectly predicted price spikes mean either over or underestimation of the occurrence of a spike by classifier C k .Thus, in machine learning applications, the selection of the optimal set of hyperparameters is important.
Machine learning models require a careful fine-tuning of several hyperparameters [5]; this is typically achieved by following some heuristic optimization [30].Here, different algorithms may be used, however, in electricity price forecasting, common approaches involve Bayesian optimization algorithms [11,30].These approaches require a forecast error metric E * j where the optimal set of hyperparameters is searched within the hyperparameter space during the training stage.
In ( 2) and ( 3), TP, FP, and FN correspond to the true positives, false negatives, and false positives, respectively.To put it in context, the underestimation of the occurrence of a price spike is directly related to the FN.Conversely, overestimation is related to the FP.In other words, (3) can be seen as a measure of price spike forecasting reliability [35].Also, ( 4) is the trade-off between recall and precision, usually observed in real-life applications, e.g., price spike forecasting [35].These metrics alongside the true negatives TN can be used to evaluate the economic impact a market participant, who is looking to enhance its energy efficiency strategy, would have, based on predictions from C k using threshold τ (•) α .Here, a market participant would decide which metric E * j has a bigger positive economic impact, depending on the specific application.
In the proposed experiment, the statistical performance of an optimized classifier C k based on E i is further quantified using an economic evaluation on an electricity market application case study.

Economic Evaluation of the Forecasts
The economic impact for each classifier C k using threshold τ (•) α is evaluated.This evaluation represents the operational economic benefits (costs) of a market participant if operational decisions have to be made based on the forecasts ĉk from classifier C k .In other words, it quantifies the marginal value of electricity when class price forecasts ĉk are above or below the price spike threshold τ Evaluating what the economic impact of class price forecasts ĉk would be for a market participant is important.This is because no forecasting method is perfect, thus, statistical and economic evaluation should not be independent of each other [35].Moreover, this would allow us to analyze if operational decisions could or not be profitable.Thus, based on threshold τ In (5a) and (5b), π 1 (π 2 ) quantifies the total loss (profit), that a market participant γ would incur at each time interval t using forecasts ĉk from C k .Moreover, in (5a) and (5b), λ t and τ (•) α correspond to the observed hourly electricity price and the price spike threshold (i.e., see (1)), respectively.
Observe that the first and second elements in (5a) represent the operational cost due to forecast errors from C k .Conversely, the first and second elements in (5b) represent the profits associated with correctly predicted price classes from C k .
A simple economic benchmark that quantifies the marginal value of electricity based on threshold τ α is implemented as a comparison to evaluate the overall impact of forecasts ĉk in Π k in (5c).We refer to this as a blind operation benchmark B. Let us define B.
The expression in (6) quantifies the economic impact of threshold τ α with respect to the observed hourly electricity price λ t .Thus, for every interval t where γ t • (τ α − λ t ) > 0, the marginal value of electricity is positive.A negative marginal value of electricity occurs on interval t if γ t • (τ α − λ t ) < 0. Finally, the case where γ t • (τ α − λ t ) = 0 at time t, the marginal value of electricity is zero.

Pseudocode of the Proposed Experiment
The proposed investigation methodology pseudocode is formalized in Algorithm 1.
Here, we show the general process used to generate one-step-ahead class price forecasts ĉk by classifier C k , as described in Sections 3.1-3.
for n = 1, . . ., N do 5: 6: Π k j ← γ(TP,TN,FP,FN) 18: end for In short, the pseudocode shown in Algorithm 1 uses a rolling calibration window [13] across the M training periods for classifier C k .Here, hyperparameter optimization over N trials for model C k is performed using the training (i.e., D (train) ) and validation (i.e., D (val) ) sets, by maximizing E * j .Price class predictions ĉk are then generated using an optimized classifier C (Ω) k and the set of features, i.e., X (test) .Finally, the statistical and economic performance of classifier C k is evaluated using E ki and Π k j .

Selected Models
Our first selection corresponds to a machine learning model, i.e., XGBoost [50].The reason for our selection is that XGBoost is an enhanced version of gradient boosting machine methods, i.e., GBM.These methods have been among the top-ranked performing methods in, e.g., probabilistic electricity price forecasting, within an important International energy forecasting competition [6].Moreover, GBM methods have been widely used in different applications, and in some cases, like classification, these algorithms have provided state-of-the-art results [50].
Likewise, our second selection corresponds to Skope rules, i.e., a tree-based machine learning explainable method [31].We justify our selection because it has been recognized in the forecasting literature that tree-based methods are interpretable [51].Their versatility allowed them to be used for core forecasting tasks in the past.For example, random forest, whose design relies on the statistical principles of bagging [52], was primarily used by important companies in challenging forecasting applications [51], e.g., retailing.
Similarly, our third selection considers a statistical regression-based model adapted from the literature in electricity price forecasting [30].Our decision to use this model is because statistical methods have been demonstrated to be state-of-the-art in electricity price forecasting.Overall, for the task under study, the modeler is free to choose other models than those introduced here.A brief mathematical review of the selected models is found in the Appendix A.

Numerical Results and Discussions
Our experiments are carried out using publicly available hourly data from Alberta's electricity market, where the minimum (maximum) value of the pool price is CAD 0/MWh (CAD 999.99/MWh).However, the proposed method applies to other electricity markets, too.Even those where negative prices may occur; nevertheless, the definition of threshold α should account for the presence of negative price spikes.Some authors have modeled negative price spikes, e.g., in [49].
The simulation setup and numerical results for the models' predictive performance and generated net profits in the proposed study are presented in this section.All our experiments are carried out using the Python programming language, and a machine with Ubuntu 22.04 LTS, 32 GB of RAM, and 8 11th Gen i7 Intel cores at 2.80 GHz.Finally, for all classifiers C k , k ∈ {1, ..., 7}, the experiments follow the methodology introduced in Section 3.
The set of classifiers C k and their configurations based on an error metric E * j , quantifying their training error during the hyperparameter optimization stage, are introduced in Table 1.
In Table 1, models from C 1 to C 3 consist of a set of XGBoost [50] classifiers (i.e., XGBC) optimized with recall, precision and F1-score, i.e., OR, OP, OF1, respectively.For example, XGBC-OR corresponds to an XGBoost classifier optimized to enhance its predictive performance based on the recall metric.
The set of classifiers from C 4 to C 6 in Table 1 consists of the family of interpretable classifiers, i.e., Skope Rules [31] (i.e., SKRC).Similarly, as with XGBC-based models, these models are optimized in the training stage with recall (OR), precision (OP), and F1-score (OF1), respectively.Explainable decision rules are extracted to generate the class price predictions from each SKRC-classifier (see Algorithm 1) in Section 3.
Also, in  1 are optimized using Optuna [53] over 40 trials.The selection of the number of trials is empirically determined aiming to balance the running time while preserving classification performance as much as possible.Hereafter, we refer to each of these models as their respective classifier assignation in Table 1.
Finally, the set of features {SC t , W t , ϕ t−h } ∈ X, where {SC t−h , W t−h , p t−h } ∈ ϕ t−h and h ∈ {1, 2, 24, 48, 168}.Hence, SC t (SC t−h ), W t (W t−h ), and p t−h correspond to the estimated supply cushion [54,55], wind power forecasts [5] and observed pool prices [56], respectively; and their corresponding lagged values at intervals t − h.Using input set X in the proposed investigation study, a classifier C k generates class price forecasts ĉk based on threshold τ (•) α .

Forecasting the Occurrence of Price Spikes
Each model in Table 1 uses the same set of input features X to generate one-hour-ahead electricity price class predictions ĉk .These forecasts are generated using a rolling calibration window [13].Moreover, the training period m in Algorithm 1 in Section 3 corresponds to one year of data, starting on 1 January 2018, at 00:00.Here, the last month is used as a validation set.For example, in the first training round, the month of December 2018 is used as a validation set.The total test period starts on 1 January 2019, at 00:00 and ends on 31 December 2022, at 23:00.Following the previous example, after the first training round, January 2019 is used as the test period.
Similarly as made by [8] for the Australian market, Figure 1 shows the 10, 50, and 90% percentiles (i.e., P10, P50, and P90) of the pool price dynamics in Alberta's market for the years where class predictions ĉk are generated.Observe, for example, the tendency of the P10 and P50 for all of the years in Figure 1 to not exceed 100 CAD/MWh at any hour of the day.Conversely, the P90 shows drastic changes when comparing between the years of 2019 and 2020 (e.g., approximately between 125 CAD/MWh and 100 CAD/MWh) against 2021 and 2022 (e.g., approximately between 500 CAD/MWh and 800 CAD/MWh).The diverse high price dynamics, as shown in Figure 1, suggests the convenience of testing the models with different thresholds τ α as the prices above or equal to 100 CAD/MWh, 200 CAD/MWh, 300 CAD/MWh, 400 CAD/MWh, here, expressed as τ 100 , τ 200 , τ 300 , and τ 400 , respectively.Likewise, the set of variable thresholds [3,16] in τ (I I) α is defined by K = 1 and K = 2, i.e., µ + σ and µ + 2 • σ; and are expressed as τ µ+σ , and τ µ+2•σ , respectively.For each threshold type, i.e., τ (I) α and τ α , the corresponding number of spikes is shown in Table 2.As expected, Table 2 shows that the number of spiky samples decreases as the value of threshold τ α increases for every test period.Observe the notorious increment in the number of spikes in 2021 and 2022, compared to 2019 and 2020.Among others, an important reason for this is due to changes in the bidding strategies of some market participants looking to increase their asset profitability.Thus, resulting in higher market price offers from these market participants [57].Moreover, observe the number of hourly spiky samples in Table 2 is small compared to, e.g., one year of hourly data.The high imbalance between the spiky and normal prices is one of the reasons why modeling electricity price spikes is a non-trivial task [9].
Figure 2 shows the mean recall (2), precision (3) and F1-score (4) for each classifier C k , and threshold τ (I) α across the tests periods.Here, the value of each threshold in τ (I) α remains fixed through every training period m.For each of these performance metrics, Figure 2 shows a general tendency for classifiers C k to decrease their performance as the value of the threshold increases.For example, for most of the years, performance metrics, and threshold τ (I) α , we observe this tendency to occur for more than 75% (and up to 90%) of the time.The only exception is 2020, where this only occurs 63% of the time.α (seven classifiers C k ) introduced in Section 3 (Table 1).
Furthermore, Figure 2 shows that in 2020, classifiers C k tend to perform lower compared to other years.This is more (less) notorious for classifiers C 1 to C 3 and C 7 (C 4 to C 6 ).This may probably be associated with the impact of the COVID-19 pandemic on market price dynamics during that year.Furthermore, in general, in all years and thresholds τ (I) α , it is possible to observe that the variations between mean performance metrics are not significant for classifiers C 2 , C 3 and C 7 .In contrast, these variations in mean performance tend to be significant for classifiers C 1 and C 4 to C 6 .
Among all classifiers, C 1 tends to show the highest performance to predict the occurrence of price spikes in Figure 2, i.e., see mean recall.However, the lower precision of C 1 most times may suggest a possible tendency to overestimate predictions of the occurrence of price spikes.Moreover, in terms of precision, i.e., the capacity of classifier C k to not overestimate the predictions of the occurrence of price spikes, classifier C 7 shows the highest mean precision for most of the cases in Figure 2. Finally, C 2 has the highest mean F1-score most of the time but is not as evident as with the recall and precision cases.
Interestingly, Figure 2 shows that classifier C 4 , optimized with recall, obtains a higher average precision than recall for most of the years and thresholds τ (I) α .A possible reason for this is that the explainable model requires, among others, a minimum recall and precision as hyperparameters to select the best-performing rules (see Appendix A.2). Here, for example, when optimized with E * j where ∈ j ={recall}, we use a search range within a minimum recall (precision) of [0.3, 0.7] ([0.01, 0.02]).Observe we try to set higher limits for recall than precision, to try to find an optimal rule able to detect as much FN as possible (see (2) in Section 3.2).However, due to the challenging nature of the imbalance problem, as shown in Table 2, the rules seem to have difficulties in achieving such a task.
To corroborate our previous assumption, we run a new simulation for C 4 using τ (I) α , where minimum recall is now set using the default hyperparameter, as found in [31], i.e., equal to 0.01.Conversely, we set minimum precision to a 'high' range, i.e., [0.9, 0.95], to try to enforce the model to 'easily' find a higher recall.However, results from this simulation are similar to those in Figure 2 for C 4 , where higher average precision tends to prevail in most cases.Thus, there may be evidence that empirically shows that in our problem, the extracted rules used to generate predictions ĉ4 , tend to be better, on average, at detecting FP than FN.
The mean recall, precision, and F1-score for each classifier C k using a variable threshold in τ (I I) α are presented in Figure 3.A variable threshold τ in (1) is estimated using the spot price observed on the last month of each training period (excluding the validation period) m, i.e., month 11.Here, using the last month would allow the threshold to include the most recent possible market information and its impact on the price dynamics.
Observe from Figure 3 that the performance of the classifiers C k also tends to decrease as the threshold τ (I I) α increases.In general, this occurs on average 88% of the time for all thresholds, performance metrics, and most test periods.
Figure 3 shows that the classifier with the best mean recall across all thresholds τ (I I) α and test periods is again C 1 .Moreover, in this case, classifier C 5 shows the highest mean precision most of the time for thresholds τ Regarding the F1-score, C 3 has the highest one for most of the cases when compared to other classifiers.Again, we observe that classifiers C 1 , and C 4 to C 6 have a higher variability between mean recall, precision and F1-score.In contrast, C 2 , C 3 , and C 7 tend to show a more stable mean performance across these metrics.
Again, we observe that the classifier C 4 has a higher mean precision than recall.Here, since the hyperparameter search space for minimum recall (precision) is the same regardless of the price spike threshold τ (•) α , we believe the same idea, as previously explained for Figure 2, prevails.
General conclusions are derived from Figures 2 and 3.For any threshold type, classifier C 1 , i.e., optimized to enhance recall performance, heavily trades-off precision and, therefore, F1-score.As previously stated, this may be an indication of this classifier overestimating the class predictions of price spikes.For the generality of this conclusion, we try to investigate if in our experiment, classifier C 4 would behave similarly to C 1 .That is, if instead of using C 4 's rules to generate the predictions (i.e., see Algorithm 1 in Section 3.4), we use the model, i.e., as with any machine learning algorithm.We run our experiment for C 4 following these ideas.Results are shown in Table 3.  1).For the sake of preserving space, Table 3 shows the results of the experiment only for τ (I I) α .However, similar results are observed for τ (I) α .Observe that, when predictions ĉ4 are generated using C 4 as a model, its performance tends to be biased towards increasing the recall at the cost of decreasing precision and F1-score; as observed with C 1 .Thus, in our experiment, C 1 's derived conclusion is extended to C 4 , only if rules from C 4 , are not used to generate predictions.
In contrast, for those classifiers optimized to improve precision and F1-score performance, i.e., C 2 and C 3 , and, C 5 and C 6 , the optimal set of hyperparameters appears to be those decreasing the chances of overestimating (higher precision) the occurrence of a price spike, at the cost of underestimating (lower recall) them.This idea applies to classifier C 7 , too.
It is difficult to observe a general tendency for a classifier C k to have a significantly better statistical performance for thresholds in τ (I) α than those in τ (I I) α .Finally, in the proposed investigation study, observing a single model achieving high performance across the different error metrics, is challenging.A similar conclusion is derived in [4,8].

Economic Evaluation of the Occurrence of Price Spike Predictions
Based on the statistical performance of a classifier C k in Figures 2 and 3, we evaluate the economic implications of the forecasts ĉk based on the threshold τ (•) α in the decision making of a market participant.This is achieved following (5c) and ( 6) in Section 3.2.The example case study evaluates a demand-side management strategy [38] for a 1 MW load γ.In other words, we quantify the economic value of the forecasts using a one-sided approach [42].Finally, this application can be extended to other markets, too.
For each threshold τ (•) α and test period, we first quantify the total profit or loss for operating γ following (6) in Section 3.3.Therefore, for all thresholds in τ (I) α , the total profits B are CAD 299,165; CAD 3,805,565; CAD 7,311,965, and CAD 10,818,365, respectively.Likewise, for thresholds in τ (I I) α , total profits B correspond to CAD 3,192,417 and CAD 6,675,959 respectively.Using these profits as a base comparison, we evaluate if operating load γ based on forecasts from classifier C k , represents an economic benefit.
Table 4 shows the total profit (positive) or loss (negative) from the difference between Π k − B for each classifier C k and fixed threshold τ (I) α .Here, it is shown that load γ benefits the most using predictions from classifier C 7 .As referenced from Section 4.1, Figure 2 shows C 7 as the model with the highest mean precision performance for most of the cases.Moreover, as discussed in Section 4.1, C 7 benefits from having a more stable mean predictive performance for all metrics, most of the time.From a practical perspective, it means an optimal balance between FP and FN predictions in (2) and (3); thus, translating into less economic damage reflected in Π 7 .For example, C 7 has a lower total economic loss due to the overestimation of predictions of the occurrence of price spikes for τ 400 but not for τ 100 , τ 200 , and τ 300 , being C 5 the one with the lowest economic losses.However, for all thresholds τ (I) α , C 7 benefits from having much lower losses than C 5 due to the underestimation of predictions of the occurrence of price spikes.The total sum of these losses i.e., underestimation plus overestimation, corresponds to −CAD 631,974 and −CAD 1,644,171 for C 7 and C 5 , respectively.
Likewise, Table 4 shows that C 1 has a high economic loss when the threshold is τ 400 .Nevertheless, we also observe that there is a profit for the other thresholds for C 1 .However, to contrast with the ideas based on observations from Figure 2, our numerical results show that C 1 has the highest economic losses due to FP among all classifiers C k for τ 100 , τ 300 , and τ 400 .These losses represent −CAD 134,032; −CAD 463,566, and −CAD 555,493, respectively.For τ 200 , the highest loss is for C 4 (i.e., −CAD 379,676), but just CAD 2,182 higher than C 1 .In contrast, C 1 has the lowest economic losses due to FN (i.e., −CAD 42,190; −CAD 43,450; −CAD 36,212 and −CAD 26,892, respectively) across all thresholds.Thus, as previously mentioned, C 1 shows a tendency to trade off high levels of recall with low levels of precision.
In regards to C 4 , Table 4 shows negative economic impacts in τ 300 and τ 400 .Figure 2 shows that C 4 has a tendency that, on average, benefits precision over recall, however, this is not restrictive for all the periods.For instance, take C 4 when τ 400 in Table 4. Here, our results show that in August 2022, C 4 incurs a big economic loss associated with the overestimation of the occurrence of price spikes.Moreover, the recall is high but precision is low in this period.Thus, negatively affecting the total Π 4 − B.
Table 5 shows the economic results of operating load γ using a variable threshold τ (I I) α .These results are also obtained as the difference between (5c) and ( 6) introduced in Section 3, i.e., Π k − B. It is also possible to observe from Table 5 that the best economic results for both thresholds in τ α , correspond again to classifier C 7 .From Figure 3, we show C 5 as the model with the highest precision most of the time.However, the variability between metrics in terms of average performance is higher when compared to C 7 .For instance, the economic impact due to FP is less (bigger) for C 5 (C 7 ).Conversely, losses due to FN are smaller (bigger) for C 7 (C 5 ).The total of these losses represent −CAD 640,739 and −CAD 289,672 for C 5 and C 7 , respectively.
Evidence from Table 5 shows that, when thresholds are variable, the higher tradeoff between recall and precision in Figure 3 does not necessarily translate in the highest economic losses for model C 1 .However, numerical evidence from our results shows that C 1 has the highest (lowest) economic losses due to FP (FN) for each threshold in τ (I I) α , i.e., the total in −CAD 761,344 (−CAD 50,193), respectively.
Again, we investigate reasons for C 4 having a negative impact, i.e., observe τ µ+2•σ .Here, for example, during May 2019 and December 2020, C 4 profits Π 4 are affected by the negative economic impact due to overestimation of the occurrence of price spikes.These balances are observed to significantly trade off good recall against a bad precision score in these months.Thus, negatively impacting the difference Π 4 − B, and hence the total accounted for τ µ+2•σ .
In general, results from Tables 4 and 5 show that classifiers C 2 , C 3 , C 5 , C 6 and C 7 consistently generate predictions that economically benefit the operation of load γ when compared to the blind operation benchmark B, using threshold τ (•) α .Moreover, these results demonstrate that, in price spike occurrence forecasting, a higher performance in one statistical error metric, would not necessarily translate into a high economic benefit [35], e.g., C 1 in our study.Likewise, in contrast to [8], who concludes that a false negative of the probability of occurrence of a price spike would be more detrimental for an energy retailer than a false positive, our results suggest that a balance between recall and precision, with some bias towards the latter, brings higher economic benefits for the proposed case study, i.e., see C 7 .This conclusion extends to C 2 and C 3 , whose results are also competitive.Moreover, based on C 7 's performance, our results are better aligned to those in [35].Here, the authors show the spike prediction confidence (i.e., precision) translates as beneficial in the cost-benefit analysis for the supply bids of a market participant in the Australian market.Finally, we agree with [35], who emphasizes the importance of integrating the price spike forecasting method with the decision-making process.

The Trade-Off between Performance and Explainability
In recent years, interpretable AI in energy forecasting has started to capture the attention of researchers.The goal is to enhance the acceptability of AI models with the industry stakeholders [5,29].Additionally, interpretability allows for the comparison between expert knowledge and model decisions.This is particularly important, for example, in the engineering field [43].However, the methodological trade-offs between a model's performance and interpretability [43] remain as a challenge.Here, to the best of our knowledge, the economic impact of this trade-off on a market's participant has not been explored yet in price spike occurrence forecasting.
In this investigative study, we quantify the trade-off between model predictive performance and interpretability.Results in Tables 4 and 5 show that, for the proposed case study, C 7 is the model that brings the highest profits with respect to Π 7 − B, among all models C k and thresholds τ  6 shows the economic trade-off between the model's performance and interpretability for the demand-side management case in this study.For the sake of showing the idea behind this trade-off in price spike forecasting, we are comparing the best noninterpretable performing model, i.e., C 7 , against a competitive interpretable model, i.e., C 5 .For the demand-side management application, the economic differences between C 7 and C 5 displayed in Table 6 for each spike threshold τ (•) α , represent the trade-off due to the lost of C 5 's statistical forecasting performance.These differences indicate that, if one would pick using human-interpretable decision rules from C 5 to predict the occurrence of price spikes at any selected threshold τ (•) α , the cost of it would have been C 7 − C 5 for each case, respectively.Finally, observe from Tables 4 and 5 that forecasts from C 5 would still have economically benefited the operation of load γ.
Predictions ĉ5 are made using decision rules from the explainable model introduced in Appendix A.2. Here, we show an example of a decision rule used to generate forecasts ĉ5 during July 2022 using τ 200 [21].We select this period because it corresponds to the Summer season, where price spikes are likely to occur due to the high demand associated with extreme temperatures [8,57].
Thus, the decision rule used to generate the class price spike forecasts is pool_price_lag_1 > 379 AND wind_7d_forecast_lag_24 > 26 .This predictive rule has a recall, precision, and F1-score of 53.6, 87, and 66.3%, respectively.Observe that this rule has a cardinality of two, and is the best-performing rule selected from the semantic deduplication process introduced in Appendix A.2.In other words, two features from the total in X ∈ R (i.e., see Section 3) have been found by Skope Rules [31] as relevant, to generate one-hour-ahead forecasts of the occurrence of price spikes during July 2022 using threshold τ 200 .

Conclusions
An investigation study involving the statistical and economic performance analysis of the forecasts of the occurrence of electricity price spikes was proposed in this paper.Here, a set of different fixed and variable electricity spike thresholds were used.Moreover, interpretable and non-interpretable classification-and regression-based price spike models were trained using data from Alberta's power market.We investigated the impact that using various statistical error metrics on the hyperparameter optimization process of the models, had on predicting the occurrence of these spikes.The final forecasts were evaluated using the same set of statistical metrics, and also by quantifying their economic impact using a demand-side management strategy.We compared against a simple economic benchmark.Finally, we quantified the so-called trade-off between the model's performance and interpretability using an interpretable and a non-interpretable model.
Simulation results showed that hyperparameter optimization with recall tends to control underestimation but not overestimation of the occurrence of price spikes.Conversely, if precision or F1-score are used, this trade-off tends to be better balanced.This tendency was observed, in general, regardless of the price spike threshold.The economic results from the demand-side management case study further confirmed this balance as optimal.In general, these results reflected that is better to avoid an excessive overestimation of the spiky predictions, without heavily sacrificing their underestimation.Finally, in the proposed investigative methodology, interpretability has a high cost for the example study case.In general, we argue that interpretability adds additional difficulties to the already challenging task of predicting price spikes.However, we recognize that interpretability is an active field of research, and the exploration and integration of more robust interpretable methods could be a future research path in electricity price spike forecasting.
One limitation of this research work is that we did not conduct experiments using different cut-off probabilities for the classification-based models.Thus, we let this as future work, along with experimenting with different ensembles of models, which have proven to be competitive in electricity price forecasting.Finally, the statistical evaluation metrics in this study follow those commonly used by some other works in price spike forecasting, however, we recognize that alternative metrics can be worth investigating for this task.Therefore, we leave this as a path for future research.
The general form of the model in (A12) [63] considers each feature X j and its corresponding coefficient ϕ j .Moreover, following [30], the model in (A12) is estimated using the least absolute shrinkage and selection operator, i.e., LASSO [64].φ = arg min In (A13) [30], RSS = ∑(y − ŷ) 2 , i.e., the residual sum of squares.Moreover, λ ≥ 0 corresponds to the LASSO penalty, a hyperparameter that controls the regularization applied to the coefficients ϕ j .
α , respectively.Here, v and K are numeric constants used to define a different spike threshold value.Particularly, in τ (I I) α , µ and σ represent the mean and standard deviation of a vector of spot prices, respectively.Moreover, thresholds τ (•) α convert the observed prices y to price classes c ∈ {0, 1} such that a classifier-based model C k , generates class price predictions ĉk .Likewise, since a regression-based model C k predicts ŷk , thus τ (•)

α
and classifier's C k statistical performance, we evaluate what the marginal value of electricity is for a market participant.Let us define the net profit (loss) Π k (5c) of classifier C k as follows:

3 . 1
Algorithm Pseudocode of the proposed methodology Input: Threshold type τ (•) α ; classifier C k ; training forecast error metric E * j ; data D; training periods M; hyperparameter optimization trials N

Figure 1 .
Figure 1.The P10, P50, and P90 for the 24 h spot prices in Alberta's market.

Figure 2 .
Figure 2.Each grid corresponds to a heatmap displaying a color representation of the mean performance of each model in the proposed comparative study.The outer (inner) y-axis displays the four test periods (three evaluation metrics E i ) introduced in Section 3. Likewise, the outer (inner) part of the x-axis displays the four fixed price spike thresholds τ (I)

Figure 3 .
Figure 3.Each grid corresponds to a heatmap displaying a color representation of the mean performance of each model in the proposed comparative study.The outer (inner) y-axis displays the 4 test periods (3 evaluation metrics E i ) introduced in Section 3. Likewise, the outer (inner) part of the x-axis displays the two variable price spike thresholds τ (I I) α (7 classifiers C k ) introduced in Section 3 (Table1).
, among the explainable classifiers C 4 to C 6 , C 5 and C 6 tend to be competitive across thresholds τ (•) α in Π 5 − B and Π 6 − B. Table

Table 1 .
The set of classifiers C k with a hyperparamter optimization based on E * j (see Algorithm 1).

Table 1
, C 7 consists of the LEAR [30] regression-based classifier.Here, the numerical forecasts from C 7 are converted to price class forecasts ĉ7 based on threshold τ (•) α .All classifiers but C 7 in Table

Table 2 .
Number of price spikes for thresholds τ in (1), and each period in our analysis. α

Table 3 .
Mean statistical performance E i for C 4 and τ

Table 6 .
The total profits and differences, in thousand Canadian dollars (CAD), for the best performing not interpretable model C 7 and the interpretable model C 5 across all test periods and threshold τ α types.