Short-Term Electricity Load Forecasting with Machine Learning

: An accurate short-term load forecasting (STLF) is one of the most critical inputs for power plant units’ planning commitment. STLF reduces the overall planning uncertainty added by the intermittent production of renewable sources; thus, it helps to minimize the hydrothermal electricity production costs in a power grid. Although there is some research in the ﬁeld and even several research applications, there is a continual need to improve forecasts. This research proposes a set of machine learning (ML) models to improve the accuracy of 168 h forecasts. The developed models employ features from multiple sources, such as historical load, weather, and holidays. Of the ﬁve ML models developed and tested in various load proﬁle contexts, the Extreme Gradient Boosting Regressor (XGBoost) algorithm showed the best results, surpassing previous historical weekly predictions based on neural networks. Additionally, because XGBoost models are based on an ensemble of decision trees, it facilitated the model’s interpretation, which provided a relevant additional result, the features’ importance in the forecasting.


Introduction
The electric power system operation is a continuous work that requires real-time coordination from the power plants to distribution substations to operate within a secure range and conclusively deliver the electricity service with quality and without interruptions. Before the real-time operational job arrives, planning should be done to consider the renewable energy production sources' behavior, the power plants and grid maintenance, and weight the hydrothermal resources, so the electricity production meets a projected demand. This real-time balance between energy generation and load should be sustained to avoid damages to the grid [1].
A power system operation's planning time scope can be decomposed into three frames, and each of these frames focuses on specific tasks [2]: short-term, mid-term, and longterm. The short-term timeframe goes from 1 day to 1 week, focusing more on the power system's operational and security aspects. The mid-term timeframe typically considers several weeks to several months, focusing more on managing the production resources and avoiding the energy deficits with the existing power plants. Consequently, the long-term timeframe focuses on years to decades, intending to define the installation of new power plants or changes on the transmission system. These criteria can vary from region to region; nevertheless, the concept should remain.
Generic and multi-site electricity demand data is difficult to obtain. Since the authors are familiar with Panama's electricity infrastructure and because Panama makes its electricity load data publicly available, the authors decided to use the Panama electricity load data to build the models. In Panama, the National Dispatch Center (CND) is in charge of the power system planning and operation. According to CND methodologies [3], the goal of forecasting with an acceptable level of deviation is to anticipate and supply the authors compare traditional ML approaches with deep learning methods on the electricity forecasting field and the most trending algorithms in Scopus-indexed publications from 2005 to 2015.
Time-series analysis is considered one of the most widely discussed forecasting methodologies in which the Box-Jenkins and Holt-Winters procedures are extensively used. For example, the authors of Reference [21] used those methods to forecast the weekly load for Riyadh Power System in Saudi Arabia, concluding that these approaches give insights to decompose the electric load forecast. The autoregressive integrated moving average (ARIMA) model is proposed to forecast the next 24 h in Iran [22]. This modified ARIMA combines the estimation with temperature and load data, producing an enhancement to the traditional ARIMA model. The ARIMA model by itself does not significantly improve the forecast accuracy and is computationally more expensive, demonstrating the need to complement these models with external inputs to enhance the results.
Overall, in most recent research, these models are less used for electricity STLF since ML methods provide better results, as demonstrated in References [23,24], and more recently in Reference [25]. Particularly, in this last cited study, the authors compare the performance of six classical data-driven regression models and two deep learning models to deliver a day-ahead forecast for Jiangsu province, China, concluding that the ARIMA model had several limitations to solve the STLF problem. First, it can only consider time-series data to forecast based on the electrical load. Second, the determination of the model order is either computationally expensive or empirical. Lastly, to make residuals uncorrelated, several trials are required. At the same time, autocorrelation function (ACF) and partial autocorrelation function (PACF) graphs need to be iteratively checked to tune the model.
In contrast with the classical statistical time-series models, ML models can handle more valuable factors, such as weather conditions, to improve the STLF accuracy. Multiple linear regression (MLR) has been widely used for STLF. For example, the authors of Reference [26] used it to forecast the hourly weekly load in Thailand, obtaining an average mean absolute percentage error (MAPE) of 7.71% for 250 testing weeks and pointing out that temperature is a primary factor to predict load. Similarly, the authors of Reference [13] utilized MLR to forecast electricity consumption 24 h ahead for 14 west-African countries, considering weather variables like temperature, humidity, and daylight hours. The researchers that have implemented MLR agreed on the fast training and interpretability this model offers, although it shows poor performance for irregular load profiles.
Another approach that has been widely used for STLF during the last decades is the artificial neural network (ANN), mainly due to the algorithm flexibility. For example, in the 2006 study of Reference [10], an ANN is proposed with a Levenberg-Marquardt training algorithm to forecast hourly, daily, and weekly load in Ontario, Canada, presenting good results without comparing with other algorithms. Furthermore, the authors of Reference [9] predicted a single transformer hourly load, using quarter-hour load records and weather data with hourly records, obtaining a MAPE performance below 1% with ANN for summer and winter seasons. In more recent research, the authors of Reference [27] apply STLF for urban smart grid systems in Australia, commenting that ANN has good generalization ability for the task. However, this approach still has many disadvantages such as quickly falling into a local optimum, over-fitting, and exhibiting a relatively low convergence rate. Nevertheless, forecasting smart grid loads with increasing renewable energy sources is challenging and deserves complex solutions to obtain good results.
The support vector regression (SVR) model is another popular model for STLF, mainly with a linear kernel, due to the linearity between the inputs and the forecast, as concluded by the authors of Reference [25], who obtained a MAPE under 2.6% for the day-ahead prediction, performing better than MLR and multivariate adaptive regression splines. Similarly, the authors of Reference [17] proposed forecasting the 48 h of Portuguese electricity consumption by using SVR as a better alternative after previously submitting the ANN's use for the same task in Reference [28]. The main reason for preferring SVR was the Information 2021, 12, 50 4 of 21 efficiency of the hyperparameter tuning on the daily online forecast. The SVR achieved a MAPE between 1.9% and 3.1% for the first-day forecast and between 3.1% and 4% for the second day. A variant of SVR is compared against ANN by the authors of Reference [29] to forecast the south-Iranian day-ahead hourly load. They proposed the nu-SVR, which improves upon SVR by changing the algorithm optimization problem and automatically allowing the epsilon tube width to adapt to data. They evaluated both models for each season: the average MAPE was 2.95% for nu-SVR and 3.24% for ANN.
The random forest (RF) ensemble technique combines independent learners to improve the overall model forecasting ability. The research presented in Reference [30] took advantage of this principle to forecast the day-ahead hourly consumption in office buildings. They used many ensemble algorithms, with RF being one of them, including environmental variables such as temperature and humidity and lagged load records to improve the results. Finally, they obtained a 6.11% MAPE for RF.
Similarly, the authors of Reference [31] submitted a comparative study between many models to forecast smart buildings' electricity load. ARIMA, Seasonal ARIMA (SARIMA), RF, and extreme gradient boosting (XGB) were on this set of models. Their experiments demonstrated that RF showed decent results, but XGB outperformed the other methods, concluding that XGB gives better accuracy and better performance in terms of execution time. The study from Reference [32] compares RF solely with XGB to forecast the next 24 h load and concludes that XGB, as an emerging ensemble learning algorithm, can achieve higher prediction accuracy, producing a root mean squared error (RMSE) of 3.31 for RF and 2.01 for XGB.
The authors of Reference [33] suggest using XGB, including weather variables and historical load, to forecast the hourly weekly load of a power plant. A remark is made on the complexity of the XGB hyperparameter phase. For this reason, the fireworks algorithm is proposed as a solution to obtain the global minimum on the hyperparameter space, and for instance, getting a more accurate load forecast. STLF for holidays is one of the most challenging tasks within the field due to their irregular load profile, though, in Reference [16], the authors argue that there are many matured predictive methods for STLF such as SVR, ANN, and deep learning (DL). However, those methods have some issues: SVR is not robust to outliers, ANN has the weakness of setting the correct number of hidden layers or can be easily trapped into a local minimum, and DL approaches require massive high-dimensional datasets for good performance. XGB lacks these issues and outperforms the others for solving STLF. Their results are based on averaging the daily profile curves for similar holidays plus the use of XGB, where this averaging plus XGB outperforms RF, SVR, ANN, and even the sole-use XGB.
Despite the good XGB performance, some authors recommend training the model based on similar days to enhance the forecast [34,35]. A comparison between a traditional XGB and the similar days XGB is demonstrated in Reference [34]. The similar days approach showed a noticeable improvement, emphasizing that the accurate selection of similar days will directly affect the STLF.
Because XGB provides the feature importance property, the authors of Reference [36] proposed a hybrid algorithm to classify similar days with K-means clustering fed by XGB feature importance results. Once the classification is done, an empirical mode method is used to decompose similar days' data into several intrinsic mode functions to train separated long short-term memory (LSTM) models, and finally, a time-series reconstruction from individual LSTM model predictions. This hybrid model using LSTM performed better for STLF over 24 and 168 h horizons, after comparing with ARIMA, SVR, or a back-propagation neural network using the same similar day approach as initial input.
The authors of Reference [37] proposed a multi-step-ahead forecasting methodology using XGB and SVR to forecast hourly heat load, where a "direct" and "recursive" forecasting strategy are compared. The direct method involves an independent model to predict each period on the forecasting horizon, while the "recursive" method considers a unique model that iterates one step at a time over the forecasting horizon, using the previous predicted steps as an input variable for the following forecasting step. Performance is the main disadvantage of the direct strategy because it needs to train as many models as desired periods to forecasts. The recursive strategy is sensitive to prediction errors, meaning that prediction errors will propagate along the forecasting horizon.
A study to forecast the 10-day streamflow for a hydroelectric dam used a decompositionbased methodology to compare XGB and SVR [38]. In this study, the streamflow time-series were decomposed into seven contiguous frequency components using the Fourier Transform. Then, each component was forecasted independently by the SVR or XGB. The study results showed that SVR outperformed XGB in terms of evaluation criteria through the Fourier decomposition methodology.
Another solution joining ANN with ensemble approaches is presented in Reference [39], where the authors seek to improve ANN generalization ability using baggingboosting. When training ensembles of ANNs in parallel, each ensemble uses a bootstrapped sample of the training data and consists of training the ANNs sequentially, and this method reduces the STLF error but increases the computational time because of the several training procedures. Alternatively to training several ANN sequentially, the authors of Reference [40] propose an evolutionary novel optimization procedure for tuning an ANN. For instance, avoiding the issues related to ANN tuning like overfitting and selecting the best ANN architecture. Their results achieved a 4.86% MAPE. Based on the results from References [10,36], ANN for STLF can outperform other forecasting methods if a robust hyperparameter optimization is performed to avoid the issues related to ANN tuning.
Recurrent neural networks (RNN) are taking an important place in the STLF field from all neural network approaches, especially LSTM. Contrary to standard feedforward neural networks, LSTM feedback connections are beneficial to deal with time-series forecasting applications. An example of forecasting the next 24 h load from a smart grid, comparing LSTM results with a back-propagation ANN and SVR, demonstrates that LSTM can offer a MAPE of 1.9% against 3.3% from ANN and 4.8% of SVR [41]. Similarly, the authors of Reference [42] address the STLF for a furniture company with a method based on a multilayer LSTM and compare it to other models like ARIMA, exponential smoothing, k-nearest neighbors regressor, and ANN. Moreover, their results showed that LSTM performed better in both RMSE and MAPE, followed by SVM and ANN. According to Reference [43], deep learning methods have a superior performance in electricity STLF. However, the potential of using these methods has not yet been fully exploited in terms of the hidden layer structures. For this reason, they evaluate deep-stacked LSTM with multiple layers for both unidirectional LSTM (Uni-LSTM), bidirectional LSTM (Bi-LSTM), and SVR as the baseline. Their results showed that Bi-LSTM returned a MAPE of 0.22% against MAPE scores above 2% for Uni-LSTM and SVR.
The hybridization of the successive geometric transformations model (SGTM) neurallike structure is another promising approach for STLF, as used in Reference [44] to predict Libya's solar radiation. This approach demonstrated a higher accuracy than MLR, SVR, RF, and multilayer perceptron neural network, besides having a faster training time due to the non-iterative training procedure.

Materials and Methods
Regarding the hardware and software, this research was developed in a computer with an i5-9300H processor and 8 Gigabytes of RAM. Colab [45] hosted Jupyter notebooks service. All the experiments were developed with Python [46].

Data Sources, Data Extraction, and Data Preparation
All data sources to develop this research are publicly available, and the data is available in hourly records from January 2015 until June 2020. The data composition is the following:

1.
Historical electricity load, available on daily post-dispatch reports [47], and historical weekly forecasts available on weekly pre-dispatch reports [48], both from CND.

2.
Calendar information related to holidays, and school period, provided by Panama's Ministry of Education through Official Gazette [49] and holidays websites [50].

3.
Weather variables, such as temperature, relative humidity, precipitation, and wind speed, from three main cities in Panama, are gathered from EarthData satellite data [51].
The load datasets are available in Excel files on a daily and weekly basis, with hourly granularity. Holidays and school periods data is sparse, along with websites and PDF files. These periods are represented with binary variables, and date ranges are manually inputted into Excel files. Both Excel datasets are imported and converted into data frames [52]. Weather data is available on daily NetCDF files, which can be treated with netCDF [53] and xarray [54] to select the desired variables and subsequently merge all the datasets into a unique modeling dataset.
Finally, the results of these steps are a time-series with the historical forecast along with its date-time timestamp as the index, and a data frame with the same timestamp index and 15 columns, one for each of the following variables in Table 1. Both objects have 48,048 records. There are no missing values on the datasets, and an initial outlier's revision was made by normalizing each variable. Only a few low values on the load were detected due to hourly blackouts and damages in the power grid, but all records were kept.
The set of variables used to train the ML models, also called features, are treated in this section. New variables related to the date-time index are created to feed the ML models with this extra information about time, with this being one of the most critical steps for STLF [22,48]. The new features added to the datasets are year, month number, day of the month, week of the year, day of the week, the hour of the day, the hour of the week, and a weekend indicator. Since all are integer variables, except for the binary weekend indicator, it is essential to clarify that Saturday is considered the first day of the week. For instance, the first week of each year is the first complete week starting on Saturday, and this is respected to keep the CND reports calendar structure for further comparisons.
Load weekly lags, load weekly moving average (LMA), and weekly moving standard deviation features were newly calculated features that helped capture the most recent load changes [30], keeping the hourly granularity. These were calculated from the immediately preceding week until the fourth week, adding 12 more features to the current dataset.
The decision of which variables should be used to train the ML models is critical to obtain good results. This process, known as feature selection (FS), also reduces computation time, decreases data storage requirements, simplifies models, evades the curse of dimensionality, and enhances generalization, avoiding overfitting [55]. For these reasons, several FS techniques were performed along with the STLF state-of-theart [7,22,50]. The explored FS techniques were: feature variance, correlation with the target, redundancy among regressors [56], and feature importance [57], according to the default models multiple linear regression, decision tree regressor, random forest regressor, and extreme gradient boosting regressor. After having 35 regressors and a defined target, the FS analysis showed that 13 regressors would significantly contribute to the forecast. Given an hour, h, to forecast the load, L, at this hour, L h , and denoting lags in historical load by L h−i , where i remains in the hourly granularity, the best regressor is expressed as: holiday_indicator h , hour_o f _the_day h , temperature _2m_in_Panama_city h , and rel_humidity_2m_in_Panama_city h , with temperature being the most important weather variable due to its positive relationship with electricity load [58], as illustrated in obtain good results. This process, known as feature selection (FS), also reduces computation time, decreases data storage requirements, simplifies models, evades the curse of dimensionality, and enhances generalization, avoiding overfitting [55]. For these reasons, several FS techniques were performed along with the STLF state-of-the-art [7,22,50]. The explored FS techniques were: feature variance, correlation with the target, redundancy among regressors [56], and feature importance [57], according to the default models multiple linear regression, decision tree regressor, random forest regressor, and extreme gradient boosting regressor. After having 35 regressors and a defined target, the FS analysis showed that 13 regressors would significantly contribute to the forecast. Given an hour, ℎ, to forecast the load, , at this hour, , and denoting lags in historical load by , where remains in the hourly granularity, the best regressor is expressed as: , with temperature being the most important weather variable due to its positive relationship with electricity load [58], as illustrated in Figure 1. This figure shows the typical load range, from 800 to 1600 MWh, and a temperature range between 23 and 35 °C. The dashed line identifies the linear equation: which indicates that a 1 °C increase in temperature represents a 74.8 MWh electricity demand increase.

Modeling
Before splitting the data into training and test datasets, the hourly records at the beginning and the end of the horizon are dropped if they do not belong to a complete 168 h weekly block for consistency on training, validation, and testing. After this, 284 weeks are

Modeling
Before splitting the data into training and test datasets, the hourly records at the beginning and the end of the horizon are dropped if they do not belong to a complete 168 h weekly block for consistency on training, validation, and testing. After this, 284 weeks are available with hourly records. The dataset is split into train and test, keeping the chronological records. Records are sorted by date-time index, always leaving the last week for testing and the remaining older data for training. Based on this logic, there are 14 pairs of train-test datasets selected. Twelve pairs, having a testing week for each month of the last year of records before the 2020 quarantine started (due to the COVID-19 pandemic), and two more after the quarantine began. To note, the official lockdown in Panama began on Wednesday, 25 March of 2020 (week 12-2020) [59]. These criteria test the models under regular and irregular conditions since the quarantine period presented a lower demand with atypical hourly profiles. The selected testing weeks also included typical days and holidays to test the models on different conditions throughout the year.
Studies have shown that many decision-makers exhibit an inherent distrust of automated predictive models, even if they are proven to be more accurate than human forecasters [60]. One way to overcome "algorithm aversion" is to provide them with interpretability [61]. For these reasons, the current research explores a set of candidate ML models that have been proven as forecasters within the STLF state-of-the-art, but also models that can offer a certain level of interpretability.
The candidate ML models considered in this research are multiple linear regression (MLR), k-nearest neighbors regressor (KNN), epsilon-support vector regression (SVR), random forest regressor (RF), and extreme gradient boosting regressor (XGB). All these estimators were executed using a pipeline with a default Min-Max scaler as the first step. These ML models, the pipeline structure, and the scalers were from sci-kit learn [62], except for XGB [63].
MLR uses two or more independent variables to predict a dependent variable by fitting a linear equation. This method's assumptions are that: the dependent variable and the residuals are normally distributed, there are linear relationships between the dependent and independent variables, and no collinearity should exist between regressors. Since MLR can include many independent variables, it can provide an understanding of the relationships [55], but it presents the disadvantage of being sensitive to outliers.
KNN is not typical for STLF. Nevertheless, their results can be interpreted, and some researchers used it as a baseline model [64]. The KNN method searches for the k most similar instances; when the k most similar samples are found, the target is obtained by local interpolation of the targets associated with the k found instances [42]. The main disadvantage of this method is that it tends to overfit, and it has few hyperparameters to change this situation.
SVR is a regression version of the support vector machine (SVM), which was initially designed for classification problems [65]. In contrast to ordinary least squares of MLR, the objective function of SVR is to minimize the L2-norm of the coefficient vector, not the squared error. The error term is then constrained by a specified margin ε (epsilon). SVR is frequently used for STLF with the linear [25] or radial basis function (RBF) kernel [66], identifying load patterns better than other linear models [67].
RF is an ensemble learning method with generalization ability. It fits many decision trees on various sub-samples of the dataset and uses averaging to improve the forecast and avoid over-fitting. For these reasons, it seems suitable for STLF, but the few researchers that consider this model have demonstrated a weak performance on their results [28,68].
XGB is another ensemble ML algorithm, based on the gradient boosting library, but enhanced and designed to be highly efficient, flexible, and portable [65], providing a forward stage-wise additive model that fits regression trees on many stages while the regression loss function is minimized [69]. Due to its recent development, XGB is not a matured STLF method [16]. Nevertheless, researchers are starting to use it, showing outstanding performances against traditional methods [29,30].
Once the training and testing week pairs were defined, models were trained with the earliest train-test pair, following the forward sliding window approach [70] for time-based cross-validation [71]. The idea for time-based cross-validation is to iteratively split the training set into two folds at each step, always keeping the validation set ahead of the training set. This process of defining folds, training the model, predicting the validation fold, and evaluating the model performance while changing hyperparameters and moving the training/validation sets further into the future is illustrated in Figure 2. The sliding window characteristics are 149 weeks (2.8 years) for training/validation. Within those, 64 are validation weeks, excluding the last 72 h from each validation fold to comply with the three-day unknown gap when forecasting the weekly demand on real conditions; finally, the forward step on the training/validation process is one week.
fold, and evaluating the model performance while changing hyperparameters and moving the training/validation sets further into the future is illustrated in Figure 2. The sliding window characteristics are 149 weeks (2.8 years) for training/validation. Within those, 64 are validation weeks, excluding the last 72 h from each validation fold to comply with the three-day unknown gap when forecasting the weekly demand on real conditions; finally, the forward step on the training/validation process is one week. The hyperparameter tuning was performed using the sci-kit learn package with the sliding window attributes with randomized search and grid search cross-validation [63]. First, a randomized search was conducted to explore a vast hyperparameter space. A grid search was then conducted to explore a finer hyperparameter space based on the randomized search results. This two-step approach [14] attempts to find a global optimum without digging through all the hyperparameter grids in the first step, reducing computational time, and refining the result by searching the second step.
The hyperparameter search results are shown in Table 2, except for MLR, which does not possess a hyperparameter space to enhance the predictions. The absent parameters were considered with their default value. As exposed in the Results and Discussion Section, XGB demonstrated the best performance. For this reason, only a description of these parameters will be addressed. For the 'eval_metric' parameter, 'rmse' was the most suitable option to penalize large errors. The default 'gbtree' booster was kept to take advantage of the generalization ability of the ensemble of trees instead of the weighted sum of linear functions provided by 'gblinear'. The 'n_estimators' is the number of iterations the model will perform, in which a new tree is created per iteration. For this reason, values above hundreds of trees are considered to make a sizeable iterative process that can adapt to the problem. The 'max_depth' parameter was kept with low values to avoid overfitting by training weak tree learners. The 'learning_rate' had the most extensive search to adapt the hyperparameter search during each boosting step, preventing overfitting. The parameters 'subsample', 'colsample_bytree', 'colsample_bynode', and 'col_sample_bylevel' were reduced to provide generalization ability to the model, restricting the training process with sub-samples of the data. A range of larger values and the default zero were explored for 'gamma', to control partitions on the leaves' nodes, making the algorithm more conservative. Similarly, because 'min_child_weight' controls the number of instances needed to be in each node, for this reason, values above the zero-default setting were explored.  The hyperparameter tuning was performed using the sci-kit learn package with the sliding window attributes with randomized search and grid search cross-validation [63]. First, a randomized search was conducted to explore a vast hyperparameter space. A grid search was then conducted to explore a finer hyperparameter space based on the randomized search results. This two-step approach [14] attempts to find a global optimum without digging through all the hyperparameter grids in the first step, reducing computational time, and refining the result by searching the second step.
The hyperparameter search results are shown in Table 2, except for MLR, which does not possess a hyperparameter space to enhance the predictions. The absent parameters were considered with their default value. As exposed in the Results and Discussion Section, XGB demonstrated the best performance. For this reason, only a description of these parameters will be addressed. For the 'eval_metric' parameter, 'rmse' was the most suitable option to penalize large errors. The default 'gbtree' booster was kept to take advantage of the generalization ability of the ensemble of trees instead of the weighted sum of linear functions provided by 'gblinear'. The 'n_estimators' is the number of iterations the model will perform, in which a new tree is created per iteration. For this reason, values above hundreds of trees are considered to make a sizeable iterative process that can adapt to the problem. The 'max_depth' parameter was kept with low values to avoid overfitting by training weak tree learners. The 'learning_rate' had the most extensive search to adapt the hyperparameter search during each boosting step, preventing overfitting. The parameters 'subsample', 'colsample_bytree', 'colsample_bynode', and 'col_sample_bylevel' were reduced to provide generalization ability to the model, restricting the training process with sub-samples of the data. A range of larger values and the default zero were explored for 'gamma', to control partitions on the leaves' nodes, making the algorithm more conservative. Similarly, because 'min_child_weight' controls the number of instances needed to be in each node, for this reason, values above the zero-default setting were explored.
This research addresses many evaluation metrics to systematically evaluate the ML models' performance, including the traditional metrics for ML forecasting, RMSE, and MAPE, as well as other specific metrics for STLF. Where MAPE and RMSE metrics express the average model prediction error, both metrics can range from zero to infinity and are indifferent to the direction of errors, with MAPE being easier for interpretation and comparison, and RMSE having the benefit of penalizing large errors. The rest of the metrics focus on the weekly deviation of the forecasts, evaluating the highest load "Peak", the lowest load "Valley", and the sum of the 168 load periods "Energy". All those metrics are negatively oriented, which means that lower values are better. These evaluation metrics and their formulation are listed in Table 3. A is a weekly set of actual hourly load values, F is a weekly set of forecasted hourly values, and subindex h stands for a specific hour. For this research, all testing sets were whole weeks with 168 h, so n is 168.

Results and Discussion
The overall hourly evaluation along the 14 testing weeks is displayed in Figure 3, showing the MAPE and RMSE error distributions with box-whiskers plots by ML candidate model and the historical weekly pre-dispatch forecast. The lower end of each boxplot represents the 25th percentile, the upper end shows the 75th percentile, and the central line depicts the 50th percentile or the median value. In this case, the lower whiskers are zero for all forecasts. In contrast, the upper whiskers represent the upper boundaries for errors distribution, calculated as 1.5 times the inter-quartile range plus the 75th percentile value. The values outside these boundaries are considered outliers, which means large errors. A statistics summary to complement the error's distribution is shown in Table 4.

Results and Discussion
The overall hourly evaluation along the 14 testing weeks is displayed in Figure 3, showing the MAPE and RMSE error distributions with box-whiskers plots by ML candidate model and the historical weekly pre-dispatch forecast. The lower end of each boxplot represents the 25th percentile, the upper end shows the 75th percentile, and the central line depicts the 50th percentile or the median value. In this case, the lower whiskers are zero for all forecasts. In contrast, the upper whiskers represent the upper boundaries for errors distribution, calculated as 1.5 times the inter-quartile range plus the 75th percentile value. The values outside these boundaries are considered outliers, which means large errors. A statistics summary to complement the error's distribution is shown in Table 4.  This overall evaluation shows that the pre-dispatch forecast, also abbreviated as Pre-Disp., has a larger standard deviation on both metrics, which implies that the ML candidates' models improve the STLF task. Nevertheless, MLR shows several outliers with a high magnitude. The best ML models were XGB, RF, and SVR with RBF kernel, having a  This overall evaluation shows that the pre-dispatch forecast, also abbreviated as Pre-Disp., has a larger standard deviation on both metrics, which implies that the ML candidates' models improve the STLF task. Nevertheless, MLR shows several outliers with a high magnitude. The best ML models were XGB, RF, and SVR with RBF kernel, having a similar performance by looking at these two metrics. Considering a reasonable computational time for training and predicting, XGB is more efficient than RF and SVR but even more flexible in hyperparameter tuning. RF showed the slowest computational performance, followed by SVR, XGB, KNN, and MLR, the faster model but the least accurate.
Moreover, all models were also evaluated by testing week, knowing that each week has a different context, hence a different load profile. These models' results are displayed in Table 5. The weekly evaluation demonstrates that XGB improved MAPE and RMSE for all the testing weeks. XGB was also accurate in predicting the peak load, valley load, and weekly energy. MLR is the simplest model, thus for some holidays with low demand, it predicted high loads, adding errors to the weekly evaluation, but it also showed the smallest peak deviation overall, followed by XGB. KNN did not expose this issue, but it predicted unusual hourly load profiles on holidays. It also tends to forecast lower demands. For instance, it was the best model to predict load valleys, followed by XGB, but the worst to predict load peaks. RF showed a good performance for peaks and valleys and had the smallest energy deviation along all testing weeks, followed by XGB. The RF forecast's negative side was the irregular spikes that do not follow the typical hourly profile. Since XGB demonstrated the best performance, providing an average MAPE of 3.74%, only hourly results from this model are plotted along with the pre-dispatch forecast and the real load, illustrated in Figure 4. These testing weeks include holidays, regular days, and periods with quarantine restrictions due to the COVID-19 pandemic. Figure 5 shows another testing week, with all the candidates' ML models forecast. The weekly evaluation demonstrates that XGB improved MAPE and RMSE for all the testing weeks. XGB was also accurate in predicting the peak load, valley load, and weekly energy. MLR is the simplest model, thus for some holidays with low demand, it predicted high loads, adding errors to the weekly evaluation, but it also showed the smallest peak deviation overall, followed by XGB. KNN did not expose this issue, but it predicted unusual hourly load profiles on holidays. It also tends to forecast lower demands. For instance, it was the best model to predict load valleys, followed by XGB, but the worst to predict load peaks. RF showed a good performance for peaks and valleys and had the smallest energy deviation along all testing weeks, followed by XGB. The RF forecast's negative side was the irregular spikes that do not follow the typical hourly profile. Since XGB demonstrated the best performance, providing an average MAPE of 3.74%, only hourly results from this model are plotted along with the pre-dispatch forecast and the real load, illustrated in Figure 4. These testing weeks include holidays, regular days, and periods with quarantine restrictions due to the COVID-19 pandemic. Figure 5 shows another testing week, with all the candidates' ML models forecast.    Overall, all models distinguished between weekends and weekdays load. Since weekends had a lower demand with low variance, all models showed a decent performance for these periods. For periods with similar characteristics, like the early morning, most of the forecasts were reasonably good. The most difficult periods to forecast were daytime periods during working days due to their high variance. The most challenging was holidays due to fewer records, different contexts, and the natural randomness of consumers' demand. Examples of holiday forecasting are shown in Figure 4a, on Tuesday 24 and Wednesday 25 December 2019, and another holiday example is illustrated in Figure Overall, all models distinguished between weekends and weekdays load. Since weekends had a lower demand with low variance, all models showed a decent performance for these periods. For periods with similar characteristics, like the early morning, most of the forecasts were reasonably good. The most difficult periods to forecast were daytime periods during working days due to their high variance. The most challenging was holidays due to fewer records, different contexts, and the natural randomness of consumers' demand. Examples of holiday forecasting are shown in Figure 4a, on Tuesday 24 and Wednesday 25 December 2019, and another holiday example is illustrated in Figure 5a for Thursday 18 and Friday 19 April 2019. The quarantine period brings another challenge for STLF since the load profiles changed abruptly for this period, and fewer records are available for training. Besides, the load profiles do not follow a steady pattern. Differences between the quarantine period and no quarantine are shown in Figure 4b,c, respectively.
Beyond having an accurate forecast, it is relevant to know the factors contributing to a specific STLF task. Some of the ML candidates' models proposed in this research provide a straightforward way to check the feature importance, except for KNN and SVR, which only provide coefficients for the linear kernel. Still, feature permutation importance [72] is applied to those two models with ten permutation rounds to estimate their feature importance. For MLR, feature importance is obtained through the coefficient property by multiplying each coefficient by the feature standard deviation to reduce all coefficients to the same measurement unit. Then, the absolute value of each feature importance is scaled to get the contribution percentage. For RF and XGB, the feature importance property directly returns contribution percentage by feature. Table 6 shows the feature importance by ML model, where each value is the average from evaluating each model on the 14 testing weeks. The feature importance results show that the load lags have a strong influence on the forecasting, especially the most recent week. The hour of the day is also a crucial feature, especially for KNN; also, temperature resulted in an essential feature for SVR and a secondary feature for the rest of the models. Relative humidity does not show a significant weight for all models. Lastly, the binary indicators for holidays and weekends show minor importance but still relevant, mainly to mark the difference between a higher load peak for working days and a lower peak for non-working days. The least essential variable was the month, indicating that monthly trends are not significant for STLF. For research reasons, it is valuable to see the feature's contribution by model, but the model's less significant features can be neglected for performance and simplicity for deployment reasons.
The results obtained in the current research can be interpreted from the perspective of any time-series forecasting research using ML techniques since the standard ML methodologies for STLF were applied to train and evaluate results, as exposed in the literature review. For example, the selected features across the STLF field of study match this research's best features: the load lags, the hour of the day, and temperature. Holidays and weekends' binary indicators also contribute since they help determine a high or low range of load.
In contrast with most of the studies where researchers forecast 24 or 48 h, this research addressed a 168 h horizon, considering a 72 h gap before the first forecasting period. Besides the typical MAPE and RMSE evaluation metrics, this research proposed load peak, load valley, and energy evaluation as secondary, practical metrics that analysts can easily monitor. Another distinction of this research is the diversity for testing periods, presenting STLF for regular working days, holidays, and irregular periods during the 2020 quarantine. This research's most important distinction is comparing the official weekly pre-dispatch forecast as a baseline and validating the results.

Conclusions
This research presented a novel model that improves weekly STLF forecast with a 72 h gap. The model was developed and benchmarked with data and previous forecasts from Panama's power system. Results along 14 testing weeks confirmed the suitability of the XGB algorithm. First, for time efficiency on training and predicting, second, for flexibility due to the hyperparameter space, and third, for providing certain interpretability through its feature importance property.
For the above-exposed reasons, this research makes several important contributions to the field of study. First, it shows that models built with XGB have superior performance to models built with other algorithms. Second, it confirms that temperature plays a critical role in STLF. Third, it demonstrates the excellent performance of ML models by forecasting for a longer horizon than typical research, and even with a three-day gap of data before the forecast. Lastly, this research identifies public data that can be used by other researchers to improve STLF forecasting or be used as a benchmark.
This research also has several practical implications. First, it created a model that Panama can use to replace of the current one, thus allowing Panama to reduce costs and improve STLF performance. Second, this model could be trained and applied in other countries or regions with similar conditions. Furthermore, the positive impacts of providing a more accurate STLF will reduce the planning uncertainty added by the intermittent renewable production and subsequently be closer to the optimal hydrothermal costs scheduled in the weekly unit commitment. This third practical implication leads to a fourth: for the specific case of the Panama energy spot market, because it is a marginalist market, accurate STLF can also reduce the hourly energy price uncertainty in the wholesale electricity market.

Limitations and Future Work
Future work within this specific research can include more historical load records to train models in particular situations, like holidays. Another approach to enhance the holidays' forecast is to develop a specific model for holidays and merge the predictions with a regular days model [16] and use stacking techniques to enhance the forecast accuracy, even though it increases the training and predicting time [73].
In the particular case study of Panama's national load, there are special consumers named auto-generators (Panama Canal and Minera Panama [74]). These consumers have a significant load, but as their agent's name suggests, they supply their own demand with their own power plants most of the time, except for scheduled maintenance periods on their power plants or unforeseen unavailability on their power plants [75]. Those agents will consume energy from the national power system during those events, causing an extra increase in the national load. If this additional load can be scheduled, it is better to track the electricity load by consumer and forecast only the residential, commercial, and industrial load, then add auto-generators if needed. This load segregation avoids distortions on the load forecast, as stated in auto-generators methodology [3]. In this context, another research avenue to develop STLF by consumers emerges, but it requires data with this segregation.
Since DL models are out of this research scope, and DL models require more records than ML models, unidirectional and bidirectional LSTM were not compared in the research. However, future studies should evaluate these algorithms' performance that have proven good for STLF [44]. Similarly, non-iterative ANN-based algorithms can be explored due to their good performance and low training time, like Reference [45] compared with a set of ML models.
Although there are currently several weather forecasts available with hourly granularity [76], because the temperature is crucial for STLF, it is advisable to count with an accurate temperature forecast to feed this STLF model. Hence, a temperature forecast model can complement this research.
It is advised to do a weekly load patterns revision since consumption patterns can change in the future. An example of abrupt changes on the hourly load profile was exposed in this research during the 2020 quarantine period. However, other trending consumption patterns [77], like recharging more electric vehicles and having more solar production behind-the-meter, will produce residential and commercial hourly consumption changes. This revision implies that forecasting models should be updated more often, and even that they need to be more robust.
A possible solution to overcome these issues is to automatically enabling the models to learn with new data every week by deploying the "Champion-Challenger" approach [78]. A weekly hyperparameter tuning is executed to update the models and make them compete to ensure the best performance along time.
The final goal of STLF is to reduce the uncertainty on real-time dispatch of hydrothermal power plants because the wind and solar farms are non-dispatchable power plants.
Consequently, another way to reduce this uncertainty and complement the STLF is by providing an accurate forecast for wind and solar production and other non-dispatchable power plants, if any.
Author Contributions: E.A.M. has contributed to the conceptualization, data curation, formal analysis, methodology, software and writing the original draft. N.A. has contributed to the conceptualization, supervision, reviewing and editing the original draft. All authors have read and agreed to the published version of the manuscript.