Short-Term Load Forecasting Using Neural Networks with Pattern Similarity-Based Error Weights

: Forecasting time series with multiple seasonal cycles such as short-term load forecasting is a challenging problem due to the complicated relationship between input and output data. In this work, we use a pattern representation of the time series to simplify this relationship. A neural network trained on patterns is an easier task to solve. Thus, its architecture does not have to be either complex and deep or equipped with mechanisms to deal with various time-series components. To improve the learning performance, we propose weighting individual errors of training samples in the loss function. The error weights correspond to the similarity between the training pattern and the test query pattern. This approach makes the learning process more sensitive to the neighborhood of the test pattern. This means that more distant patterns have less impact on the learned function around the test pattern and lead to improved forecasting accuracy. The proposed framework is useful for a wide range of complex time-series forecasting problems. Its performance is illustrated in several short-term load-forecasting empirical studies in this work. In most cases, error weighting leads to a signiﬁcant improvement in accuracy.


Introduction
Time series (TS) often exhibit complex seasonal patterns. For example, hourly data may have a daily periodicity, a weekly periodicity, and a yearly periodicity. Sometimes, it also exhibits a monthly periodicity. Typical examples of this type of TS can be found in the energy sector: energy consumption TS, electricity load TS, wind and solar power production TS, and electricity prices TS in wholesale markets. Accurate forecasts of the load, energy production, and electricity prices are essential for power system planning, control, and scheduling and for energy markets. In this study, our focus is on short-term load forecasting (STLF), which uses hourly, half-hourly, or quarterly TS with daily, weekly, and yearly seasonalities. Figure 1 shows periodograms of such TS for four European countries, which we use in the experimental part of this work. As can be see from this figure, the daily seasonality dominates significantly over the other ones for PL, GB, and DE data. For FR data, the yearly seasonality is much stronger than others. Note that 12 h seasonality is also revealed and is related to the characteristic shape of the daily profile (see the lower panel of Figure 2). This seasonality is even stronger than the weekly one for PL and FR data. Due to multiple seasonality, nonlinear trend, daily shape changing significantly during the week and during the year, and random fluctuations, the STLF problem is challenging and requires the forecasting model to be highly flexible.
In addition to the energy sector, the demand variations in the transportation, telecommunications, tourist, and healthcare sectors can also be greatly influenced by multiple seasonal cycles [1][2][3][4]. In order to deal with TS expressing multiple seasonality, the forecasting model should be equipped with appropriate mechanisms. Over the years, many methods of performing this have been proposed.

Forecasting Models for TS with Multiple Seasonality
A standard autoregressive moving average (ARMA) model for single seasonality can be extended to multiple seasonal cycles [5]. In the double seasonal formulation, two new terms need to be introduced to the ARMA model. These terms are polynomial functions, involving backshifts of the seasonal period. Similarly, to include a third seasonal cycle, two new terms need to be introduced. ARMA models for triple seasonality contain 21 adjustable parameters. A drawback of the seasonal ARMA model is the assumption that the cycle shapes are all the same. In practice, the seasonal patterns can greatly differ from each other. For example, in STLF, daily cycles for workdays can be significantly different from those for weekends.
Another popular and widely used statistical forecasting procedure, the Holt-Winters method, which belongs to the class of exponential smoothing methods (ETS), was extended to incorporate a second seasonal component in [6] and a third seasonal component in [7]. The adaptation involves the inclusion of second and third seasonal indexes and two additional smoothing equations for these indexes. The triple seasonal Holt-Winters model is composed of five interdependent equations that express a smoothed level, three seasonal indexes, and an additive combination of them. It requires the selection of five smoothing parameters and initial values for the level and seasonal cycles. As with ARMA, the seasonal Holt-Winters model suffers from an important weakness. It requires the same cyclical behavior for each period. To incorporate distinct seasonal patterns into the model, first, they should be identified in some way and, then, the model formulation should be developed [7].
To deal with changing seasonal patterns, innovation state-space models for ETS were employed in [8]. They can forecast TS with either additive (linear) or multiplicative (nonlinear) seasonal patterns. Their fundamental goal for multiple seasonal processes is to allow for the seasonal terms that represent a seasonal cycle to be updated more than once during the period of the cycle. This goal may be achieved by dividing the basic cycle into several shorter subcycles and by updating the seasonal terms of one sub-cycle during the time for another subcycle. Thus, for each day of the week, a separate subcycle can be used. Having too many subcycles complicates the model and increases the number of parameters to adjust. To reduce complexity, Gould et al. [8] proposed introducing common seasonal patterns for different subcycles. The limitation of the model is that it can only be used for double seasonality, where one seasonal length is a multiple of the other.
An interesting extension of the innovation state-space modeling framework for forecasting complex seasonal TS, such as those with multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects was proposed in [9]. It combines an ETS state space model with Fourier terms, a Box-Cox transformation, and ARMA error correction. The seasonal components in this framework are represented by trigonometric functions based on Fourier series. The Fourier trigonometric representation with time-varying coefficients enables both linear and nonlinear TS with non-integer seasonal frequencies to be modeled. The Box-Cox transformation helps to transform the data into normality and to handle nonlinearities.
Among the newer statistical models for forecasting TS with multiple seasonality, it is worth mentioning the Prophet [10]. This model is optimized for business forecasting tasks, which typically have any of the following characteristics: hourly, daily, or weekly observations; strong multiple "human-scale" seasonalities, such as day of week and time of year; important holidays that occur at irregular intervals; historical trend changes, for instance due to product launches or logging changes; and trends that are nonlinear growth curves, where a trend hits a natural limit or saturates. The Prophet procedure is an additive regression model with four main components: a piecewise linear or logistic growth curve trend, a yearly seasonal component modeled using Fourier series, a weekly seasonal component using dummy variables, and a user-provided list of important holidays. According to the authors, due to its flexible formulation, Prophet can easily accommodate seasonality with multiple periods.
Machine learning (ML) methods offer an alternative to classical statistical forecasting methods. They provide forecasting models with the ability to learn about historical patterns and anomalies and to improve prediction accuracy. Among the ML methods, neural networks (NN) are most often used for forecasting. There are numerous forecasting model solutions based on different NN architectures [11]. The way to deal with multiple seasonality varies depending on the base model, the mechanisms used, and the authors' creativity. For example, the winning submission to the most reputable forecasting competition, M4 Makridakis, in a version for hourly double seasonal TS combines ETS to deal with seasonality and recurrent NN [12]. An ETS module, in the form of a double seasonal Holt-Winters model, produces two seasonal components. These are used for TS deseasonalization and adaptive normalization during on-the-fly preprocessing. The preprocessed TS is forecasted by a long short-term memory (LSTM) network that is composed of two blocks of two dilated LSTM layers. A dilation mechanism allows for the LSTM cells to take into account the earlier hidden states, which in this case, are shifted back in time by 1, 4, 24, and 168 h. This improves long-term memory performance and can be especially useful for seasonal TS, as it expresses a cyclical relationship between the series elements.
Another LSTM-based model (LSTM-MSNet) for TS with multiple seasonal patterns was proposed recently in [13]. It is a globally trained LSTM network, where a single prediction model is built across all of the available TS to exploit the cross-series knowledge. According to the authors, this enables the model to untap the common seasonality structures and behaviors available in a collection of TS. To deal with multiple seasonal cycles, LSTM-MSNet initially uses a deseasonalization strategy to detach the multiseasonal components from a TS. For deseasonalization, different statistical decomposition techniques were utilized including Fourier transformation. LSTM was trained on the seasonally adjusted TS or, in the second variant, on the original TS together with the seasonal components as exogenous variables. The choice of the learning variant was determined by the characteristics of the TS.
Although recurrent NNs, such as LSTM, gated recurrent units, and DeepAR [14], dominate today as NN architectures applied to TS forecasting, other solutions for multiple seasonal TS are also used. For example, in [15], a deep NN architecture based on backward and forward residual links and a very deep stack of fully connected layers was proposed. In this model, each stack is composed of basic building blocks. In the generic architecture, each block learns the predictive decomposition of the partial forecast in the basis matrix. This matrix expresses waveforms in the time domain, which composes, together with the learnable expansion coefficients, the partial forecast. In the interpretable architecture, each block can model a trend or a seasonal component. In the latter case, the basis function has the form of Fourier series with learnable Fourier coefficients. Thus, the partial forecast produced by each block is a periodic function mimicking typical seasonal patterns. The final forecast is the sum of partial forecasts produced by all the blocks. Thus, complex seasonal pattern can be decomposed and modeled by many blocks. Another deep architectures useful for forecasting multiple seasonal TS was proposed in [16]. It extends the canonical transformer architecture [17] by introducing a convolutional self-attention mechanism, which allows the model to better incorporate the local context of the TS. An experimental study performed on the TS exhibiting both hourly and daily seasonal patterns showed that the proposed approach achieved state-of-the-art performance in comparison with recent RNN-based methods.
The approaches presented above for multiple seasonal TS forecasting rely on incorporation into the model mechanisms, which allow it to deal with seasonal components. An alternative approach is to simplify the problem by TS decomposition. The components expressing less complexity than the original TS can be modeled independently using simpler models. The most popular methods of decomposition are additive decomposition, multiplicative decomposition, seasonal-trend decomposition using LOESS, discrete Fourier and wavelet transforms, and empirical mode decomposition [18][19][20][21].
Another method of dealing with TS with multiple seasonal periods was proposed for STLF in [22] and described further in the context of neural models in [23]. This pattern-based approach uses TS representation expressing the shapes of basic seasonal cycles. A trend and seasonal cycles longer than the basic cycle are filtered out from the TS. Thus, the relationship between TS elements is simplified and can be modeled using simple models such as nonparametric regression models based on similarity [24] or shallow NNs [25]. After forecasting the preprocessed TS, the trend and seasonal components are determined from the most recent TS history and combined with the forecasted seasonal pattern. It is worth emphasizing that this approach needs neither a complex forecasting model for multiple seasonal TS nor TS decomposition and forecasting of each component individually. Experimental research has confirmed that these models can compete in terms of accuracy with state-of-the-art deep learning models, such as the winning M4 submission [26].

Neural Networks for STLF
Due to their very attractive properties, useful for solving forecasting problems, NNs are widely used in STLF [27]. They are competitive with the state-of-the-art ML models such as support vector regression, random forest, and gradient boosting decision trees [28]. A survey of the classical NN architectures for STLF can be found in [29,30], and a comparison of their performances can be found in [25]. New NN architectures are quickly applied in STLF. Some examples are given below.

•
A hybrid model based on a deep convolutional NN is introduced for short-term photovoltaic power forecasting in [31]. In this approach, different frequency components of the TS are first decomposed using variational mode decomposition, and then, they are constructed into a two-dimensional data, which can be better learned by convolution kernels, leading to a higher prediction accuracy.
• LSTM NN was proposed in [32] for STLF for individual residential households.
Forecasting an electric load of a single energy user is fairly challenging due to the high volatility and uncertainty involved. LSTM can capture the subtle temporal consumption pattern persisting in single-meter load profile and can produce the accurate forecasts. • Deep residual networks were proposed for STLF in [33]. The proposed model is able to integrate domain knowledge and researchers' understanding of the task by virtue of different NN building blocks. Specifically, a modified deep residual network was formulated to improve the forecast results and a two-stage ensemble strategy was used to enhance the generalization capability of the proposed model. The model can be applied to probabilistic load forecasting using Monte Carlo dropout. • Stacked denoising autoencoders were used in [34] for forecasting the hourly electricity price. The model was equipped with a unsupervised pre-training process to prevent overfitting. It incorporates concepts of the random sample consensus and stochastic neighbor embedding to further improve the forecasting performance. • Pooling-based deep recurrent NN was applied in [35] for household load forecasting. A novel pooling-based deep learning was proposed, which batches a group of customers' load profiles into a pool of inputs. The model aims to directly learn the uncertainty of load profiles by applying deep learning. • A hybrid model based on randomized NN for probabilistic electricity load forecasting was proposed in [36]. This model includes generalized extreme learning machine for training an improved wavelet NN, wavelet preprocessing, and bootstrapping. It takes into account data noise uncertainties while the output of the model is the load probabilistic interval. • Second-order gray NN using wavelet decomposition was proposed in [37] to improve the accuracy of load forecasting. In this approach, the load TS is decomposed by wavelet decomposition to reduce the nonstationary load sequence. Then, a secondorder gray forecasting model is used to forecast each component of the TS. To obtain the optimal parameters, the NN mapping approach is used to build the second-order gray forecasting model. • A wavelet echo state network was applied to both STLF and short-term temperature forecasting in [38]. A wavelet transform was used as the front stage for multiresolution decomposition of the TS. Echo state networks function as forecasters for decomposed components. A modified shuffled frog leaping algorithm is used for optimizing the model. • A spiking NN for STLF was proposed in [39]. This model employs spike train models that are close to their biological counterparts. An implementation of the proposed model is divided into two phases. In the first phase, spike NN predicts a day-ahead hourly temperature profile. In the second phase, another spike NN predicts a dayahead load profile based on actual and forecasted temperatures, historical loads, day of the week, and holiday effect. • Convolutional NNs are combined with fuzzy TS for STLF in [40]. In this approach, multivariate TS data, which include hourly load data, hourly temperature TS, and fuzzified version of load TS, were converted into multi-channel images to be fed to a proposed convolutional NN model. • Deep residual network with a convolution structure was proposed to STLF in [41]. The authors analyzed and discussed the effectiveness of the convolutional residual network with different hyperparameters and architectures that include depths, widths, block structures, and shortcut connections, with/without dropout. • Convolutional NNs combined with a data-augmentation technique, which can artificially enlarge the training data, were applied for STLF a single household in [42]. This method can address issues caused by a lack of historical data and improves the accuracy of residential load forecasting, which is a fairly challenging topic because of the high volatility and uncertainty of the electric demand of households.

Summary of Contributions
In this work, we propose an innovation to the pattern-based neural model for forecasting TS with multiple seasonality such as STLF. To improve the NN learning performance, we introduce error weights to the mean square error loss function. The weights assigned to the training patterns are dependent on the similarity between the training patterns and the query test pattern. This modification makes the learned mapping function more accurate in the neighborhood of the query test pattern and leads to improved forecasting performance.
The weighted loss function in NNs was used in many different formulations and contexts. In [43], to address the class distribution imbalance in deep learning, a class re-balancing strategy based on a class-balanced dynamically weighted loss function was proposed. In this approach, the weights are assigned based on class frequency and predicted probability of ground truth class. A similar problem, data imbalance problem, was considered in [44]. In this study, the weighted loss function has two types of weights, i.e., a class balanced weight and a sample importance weight. The sample importance weights discriminate between samples of different importance considering the sample difficulty (the performance of the classifier on the sample). A dynamically weighted loss function for convolutional and recurrent deep architectures was proposed in [45]. This approach is expected to modify the learning process by augmenting the loss function with a weight value corresponding to the learning error of each data instance. The objective is to force deep learning models to focus on those instances where larger learning errors occur in order to improve their performance.
In contrast to the approaches described above, our method proposed in this work assigns weights to the training patterns depending on their similarity to the query test pattern.
The goal is to model the target function around the query pattern with greater accuracy.
The contribution of this study includes the following two points: 1.
We propose a new univariate multi-step forecasting neural model with pattern similarity-based error weighting for TS with multiple seasonality.

2.
We empirically demonstrate using challenging STLF problems the statistically significant accuracy improvement and forecasting bias reduction of the proposed approach with respect to the standard approach without error weighting.
The rest of the work is organized as follows. Section 2 describes the pattern representation of the TS. Section 3 presents the forecasting neural model with error weighting. The experimental framework used to evaluate the performance of the proposed approach is described in Section 4. Finally, Section 5 concludes the work. Table 1 lists the main symbols used in this study.

Multiple Seasonal Time Series and Their Representation
Let us consider the hourly electricity demand TS for Poland as an example of TS with multiple seasonality. This TS is shown in Figure 2 for the 4-year period 2012-2015. As can be seen from this figure, it exhibits a trend and three seasonal cycles: annual, weekly, and daily. Note that the daily shape changes significantly during the week and during the year.
To estimate the magnitudes of the seasonal components, we calculate the standard deviations as follows. For annual seasonality, we calculate the weekly means and their standard deviations in each yearly period. For weekly seasonality, we calculate the daily means and their standard deviations in each weekly period. Finally, for daily seasonality, we calculate the standard deviations of hourly demand in each daily period. These standard deviations correspond to the magnitudes of the seasonal components. Their values, averaged over the yearly periods, are shown in Figure 3. For annual variations, we can observe a steady fall in the variability of the weekly mean demands over the years. The weekly and daily components rise slightly over the years. The strongest seasonal component in this TS is the daily one.  The TS representation used in this study produces input and output patterns. The input patterns represent predictors as transformed sequences of the daily period, while the output patterns represent the forecasted daily sequences. The transformations that turn the TS into patterns are aimed at normalizing daily sequences by unifying the variance and by removing the trend as well as longer seasonal cycles (weekly and annual cycles in our case). The resulting patterns express normalized shapes of the daily sequences.
Let {E k } K k=1 be a TS and e i = [E i,1 E i,2 . . . E i,n ] T be a vector expressing a daily TS sequence of day i. This sequence of length n (24 for the hourly TS) is represented by input pattern x i = [x i,1 x i,2 . . . x i,n ] T , which is determined as follows: where e i is a mean value of sequence e i and e i = ∑ n t=1 (E i,t − e i ) 2 is a measure of sequence e i dispersion.
The x-pattern defined by this equation is a normalized version of centered vector e i . Due to the pattern representation, the TS daily sequences, which have a different mean value and dispersion, are unified. All x-patterns, representing successive daily sequences, have zero mean, the same variance, and the same unity length. However, they differ in shape.
The output pattern y i = [y i,1 y i,2 . . . y i,n ] T represents the forecasted daily sequence e i+τ = [E i+τ,1 E i+τ,2 . . . E i+τ,n ] T , where τ ≥ 1 is a forecast horizon in days. The y-pattern is defined as follows: where e i and e i are the same as in Equation (1).
Note that, in Equation (2), we use the same coding variables e i and e i as that for the input patterns. This is because the decoding operation needs the coding variables to determine the forecast: If we used the coding variables e i+τ and e i+τ in Equation (2), their values necessary for decoding would not be known because they describe the future sequence, which has just been forecasted.
From Equation (2), it follows that, to determine forecasted sequence e i+τ , we should use the forecasted y-pattern, y i , which is produced by the model in response to the query x-pattern, and the known coding variables, which are determined on the basis of historical sequence e i . The paired x-and y-patterns express two daily sequences: the sequence preceding the forecast (for day i) and the forecasted sequence (for day i + τ), respectively. Figure 4 depicts the real hourly electricity demand TS and its x-and y-patterns for τ = 1. The upper panel shows two 9-day fragments of the original TS from June and December, respectively. These fragments differ in level due to annual seasonality. The daily sequences within these fragments differ in level as well due to the weekly seasonality. The pattern representation allows the model to deal with differences in level as well as variance in TS. Note that the x-patterns are all normalized and differ only in shape. The y-patterns, due to the use of coding variables from the previous periods, reveal the weekly seasonality. The y-patterns of Mondays are much higher than the patterns of other days of the week because the Monday sequences are coded with the means of Sunday sequences, which are much lower than the means of Monday sequences. For similar reasons, y-patterns for Saturdays and Sundays are lower than y-patterns for the other days of the week. Thus, the y-patterns are not unified globally but are unified in groups composed of the same days of the week. For this reason, we construct forecasting models that learn from data representing the same days of the week. For example, when we train the model for forecasting daily sequence for Monday, a training set for it is composed of the y-patterns representing all Mondays from history and the corresponding x-patterns representing the previous days (depending on the forecast horizon; Sundays for τ = 1).

Forecasting Model
The separate forecasting neural model is constructed for each day of the test period (e.g., one-year period). It learns on a training set composed of pairs of corresponding xand y-patterns, where the x-patterns represent the same day of the week as the query test pattern: To obtain the forecasted y-pattern for the given day of the test period, the x-pattern representing previous day is presented to the trained network. We call this x-pattern a query test pattern. Query test pattern x is also used for weighting the training x-patterns, i.e., the training x-patterns that are the most similar to the query pattern have higher error weights and thus larger impacts on the loss function value. This is explained below in detail.
Note that, in our approach, we only use historical load values as inputs, i.e., vectors x, which encode the daily TS sequences (24 values that encode loads in consecutive hours of the day). We do not use exogenous variables such as weather variables (temperature, wind speed, cloud cover, humidity, and precipitation) as their values are unknown for the forecast period (day i + τ). Due to the pattern representation of TS and local approach (a specific training set and a separate model for each forecasting task), our proposed method does not require additional input variables indicating the day of the week, hour of the day, or period of the year. Thus, the model maps n-dimensional input vector space, X = R n , into n-dimensional output vector space, Y = R n .
To model the relationship between n-dimensional input and output patterns, we use a feedforward NN with a single hidden layer. The hidden layer consists of m hyperbolic tangent sigmoid nodes while the output layer consists of n linear nodes. To prevent overfitting, we use an early stopping. The number of hidden layers and nodes correspond to the target function complexity. In our case, we simplify the forecasting problem using pattern representation, which means we can limit NN architecture to a shallow, narrow one. This implies fast and more efficient training.
To improve the learning performance of NN, we introduce error weights to the mean square error loss function. The weights are assigned to the training samples. The weight value is dependent on the similarity between the training x-pattern and the query test pattern. More specifically, the weight value is proportional to the rank of the training x-pattern in the similarity ranking.
The weighted loss function is of the following form: where y i,t andŷ i,t are the real and forecasted values, respectively, and ω i is an error weight for ith training sample calculated as follows: where r i = r i −1 N−1 is a normalized rank; r i = 1, ..., N is an ith training sample position in the distance from the test sample ranking; and γ is a parameter controlling the weighting function shape.
The normalized rank takes the values r 1 = 0 for the nearest training sample to the query test sample, r 2 = 1 N−1 for the second nearest training sample, and r N = 1 for the furthest training sample. The distance between samples is calculated as an Euclidean distance between query test pattern x and training pattern x i (as x-patterns are normalized vectors, a dot product x T x i can be used instead of the Euclidean metric). Figure 5 shows the weighting function (5) (2), the error weights for all training samples are the same, i.e., equal to one. The γ hyperparameter controls the weighting of the training samples. In the experimental part of this work, we adjust its value to the test samples. A block diagram of the proposed model is presented in Figure 6. From the raw TS, {E k }, the preprocessing module produces a training set composed of x-and y-patterns. It also produces the query test pattern and coding variables for the forecasted y-pattern. The weighting function calculates the distances between training samples and the query test pattern, ranks the training samples, and calculates error weights for them. The NN learns on the training set using weights ω for weighting the individual training sample errors. It produces the forecast for the query test pattern. This forecast is decoded using Equation (3) into the real TS forecast using coding variables, e and e, determined from the most recent historical seasonal period (the period represented by the query test pattern).

Simulation Study
In this section, we present our experimental results for STLF. We use real-world data collected from www.entsoe.eu, (accessed on 6 April 2016), which details the hourly power system load in the period 2012-2015 for four countries: Poland (PL), Great Britain (GB), France (FR), and Germany (DE). We exclude atypical days such as public holidays (between 10 and 20 days a year) from these data. A one day-ahead forecasting problem is considered. We forecast the daily load profile for each day of 2015, training NN on previous data. For each forecasted day, a new NN is trained. The quality of forecasts was measured using metrics shown in Table 2.

Metric Equation
Percentage Error where E and E are the real and forecasted hourly loads, respectively.

Tuning of the Error Weights
In the first study, we assumed a fixed NN architecture: 24 inputs, 24 hidden nodes, and 24 outputs. We used a hyperbolic tangent sigmoid transfer function in the hidden layer and a linear transfer function in the output layer. Early stopping was used to improve generalization: 20% of the training data was treated as a validation set for error monitoring during the learning process and halting when no reduction in validation error was observed. The network was trained using a Levenberg-Marquardt optimizer for 100 epochs.
To assess the impact of the error weighting on the forecasting errors, for each γ ∈ Γ = {15, 4.44, 1.25, 0, −0.56, −0.82, −0.94, −1}, where the last value corresponds to the learning variant without weighting, we train NNs and record test absolute percentage errors (APE). Distributions of these errors expressed by boxplots are depicted in Figure 7. The dashed horizontal lines express medians of APE for the baseline variants without weighting (γ = −1). As can be seen from the figure, weighting of errors in almost all cases leads to lower median of APE. The highest reduction in errors is for DE and PL. The lowest is for GB.
In the next experiment, we select the γ-value for each test pattern on the training sets. Four variants of learning mode were considered: V1 learning without error weighting as a baseline for other variants; V2 learning with error weighting, where γ was selected on the training set, i.e., NN was trained for each γ ∈ Γ, and the minimal training error indicated the selected γ value. NN with this γ value was used to produce the forecast for the test pattern; V3 similar to V2, but γ was selected on the five training patterns most similar to the query test pattern. First, the subset Φ of five training patterns closest (Euclidean distance) to the test pattern was selected. After NN training with γ ∈ Γ, average errors for Φ were calculated. The lowest error indicated the selected γ value. Then, NN with this γ was used to produce the forecast for the query test pattern; and V4 learning with fixed γ = 0.
In variants V2 and V3, the value of γ was searched individually for each test pattern. Set Φ in V3 represents the neighborhood of the test pattern. We expect that the γ value selected as optimal for this neighborhood brings better results for the test pattern than the γ value selected as optimal for all training patterns. V4 represents linear error weighting.
The forecasting quality metrics for the test data are presented in Tables 3-6. They include the mean absolute percentage error (MAPE), median of APE, root mean square error (RMSE), mean percentage error (MPE), and standard deviation of percentage error (Std(PE)) as a measure of the forecast dispersion.   To confirm that the results from the four learning variants are significantly different, we performed a Wilcoxon signed-rank test with α = 0.05 for APE. The diagrams in Figure 8 depict pairwise comparisons of the learning variants. The arrow lying at the intersection of the two variants indicates which of them gave the significantly lower error. A lack of an arrow means that both variants gave statistically indistinguishable errors.
From Tables 3-6 and Figure 8, we can conclude the following: 1.
The baseline variant V1 gave significantly higher errors than other variants in 10 out of 12 cases.

2.
V4 gave significantly lower errors than the baseline variant V1 for all data sets. It also beat its competitors, V2 and V3, for PL and GB data.

3.
V2 and V3 gave lower errors than the baseline variant for PL, FR, and DE. 4.
V3 gave the significantly lowest APE for FR data.  Tables 3-6 allows us to compare the bias of the forecasts produced by the different model variants. A positive value of MPE indicates underprediction, while a negative value indicates overprediction. As can be seen from the tables, for PL data, the bias was positive, whilst for GB and FR data, it was negative. For these three data sets, error weighting yielded a positive effect on the bias reduction: by up to 15% for PL, 28% for GB, and 35% for FR. For DE data, we can observe very low bias regardless of the model variant.

Tuning of the Number of Hidden Nodes
In a further experiment, we change the NN architecture. We consider a variable number of hidden nodes from 2 to 48. Based on the results of previous experiments, we restrict this study to NN with the linear error weighting function (γ = 0). For each test sample, we select the number of the hidden nodes in two ways: (1) on the training set and (2) on the Φ set of five training patterns closest to the test pattern. The results for both cases are shown in Table 7 (due to the different training sessions, the results for m = n = 24 differ from those presented in Tables 3-6). As can be seen from this table, the selection of the optimal number of hidden nodes on the training set or on the Φ set does not improve significantly the results for the baseline variant with fixed m = n = 24.  Figure 9 shows the mean test errors depending on the number of hidden nodes m. As can be seen from this figure, for all countries, the error initially decreases with m and then begins to increase. This increase in error is due to overfitting, i.e., the complexity of the model is greater than the complexity of the relationship it approximates. The lowest mean test errors for each country were achieved for m = 16 for PL, m = 14 for GB, m = 8 for FR, and m = 10 for DE. Thus, assuming m = n = 24 is not the best choice. When deciding on the optimal number of hidden nodes for a given TS, we can consider the results for similar TS. Doing so, we can assume the hidden nodes for PL as the mean best m value for other countries, i.e., (14 + 8 + 10)/3 ≈ 11. Using this approach for other countries, we obtain m ≈ 11 for GB and m ≈ 13 for FR and DE. Thus, the number of hidden nodes is near m = n/2 = 12. Let us assume such a value of m for simplicity. The errors for m = n/2 = 12 are presented in the last row of Table 7. As you can see, this case turned out to be slightly better than the baseline variant. Figure 10 shows examples of forecasts of the daily load profiles produced by proposed model with m = 12 hidden nodes and linear error weighting. Note that the proposed model produces a multi-output response, maintaining the relationships between the output variables (y-pattern components). In the case of single-output models, these relationships are ignored because the variables are predicted independently. This may cause a lack of smoothness in the forecasted curve (zigzag effect).

Discussion
From the experimental part of this work, we can conclude that linear error weighting and m = n/2 can be considered the default settings for the multi-output pattern-based NN forecasting models. With respect to the model without weighting, such settings provide a relative gain in terms of MAPE of 13% for PL, 4% for GB, 7% for FR, and 13% for DE.
In our earlier work [25], we compared different NN architectures for STLF: multilayer perceptron (MLP), radial basis function NN, generalized regression NN (GRNN), two variants of fuzzy counterpropagation NNs, and three variants of self-organizing maps. Among them, MLP (architecture used also in this study) turned out to be one of the most accurate. It outperformed all other models except GRNN. The forecasting errors (MAPE) averaged over four STLF tasks were 1.92 for MLP, 1.85 for GRNN, and over 2.00 for other neural models (for comparison, the naive forecast error was MAPE = 4.14). MLP outperformed also classical statistical models such as ARIMA (MAPE = 2.46) and ETS (MAPE = 2.28).
In [23], the local and global variants of MLP were compared with other state-of-the-art STLF models such as principal components regression, partial least-squares regression (PLSR), Nadaraya-Watson estimator (N-WE), fuzzy neighborhood model (FNM), two models based on k-means clustering, and three models using artificial immune systems. The local variant of MLP (corresponding to the model used in this study) turned out to be one of the most accurate. It was beaten only by three models: PLSR (MAPE = 1.82), N-WE (MAPE = 1.83), and FNM (MAPE = 1.89). The empirical comparison of MLP with classical statistical models as well as state-of-the-art ML models reported in [23,25] shows its high performance in STLF. This performance can be further improved by using the pattern similarity-based error weighting, as shown in this study.
The proposed forecasting framework can be useful not only for STLF but also for a broad range of complex TS forecasting problems. To capture many seasonal cycles, the nonlinear trend, and long-term dependencies, our approach does not require a complex, deep architecture equipped with sophisticated mechanisms such as dilation, attention, dropout, and residual connections (in fact, some of these mechanisms are used not to improve the forecast accuracy but to enable effective learning of the deep NN). Unlike deep learning, which is very computationally intensive, our approach does not require powerful hardware for learning its single-hidden layer model. Moreover, our approach has a more transparent architecture than deep NNs, its working principle is more understandable, and it has far fewer parameters to fine-tune than its deep competitors. By using patterns for data representation, our approach solves many problems related to complex forecasting tasks. Note also that, in our approach, we avoid TS decomposition and forecasting each TS component separately. Thus, the construction, optimization, and learning of the forecasting model are faster. Producing a vector output, the proposed model is able to multi-step forecast without a recursive approach. The components of the output pattern are the forecasts for the successive time periods. The model is also able to produce forecasts for further horizons. In such a case, we define output patterns in an appropriate way (with τ > 1).
The forecasting model using a pattern-based representation of TS described in this study was developed for seasonal TS. An application of this model to forecasting nonseasonal series requires further research. The proposed approach with weighted loss function (4) is based on the assumption that, if pattern x a is similar to pattern x b , then pattern y a is similar to pattern y b . This assumption can be verified for a given TS using appropriate statistical tests, i.e., the chi-squared test (see [24]). If the test does not confirm this assumption, the weighted loss function may not improve the results. In such a case, we can look for another method of TS representation or coding. For example, in [46], to improve the performance of the models with pattern-based representation, the coding variables for y-patterns, e and e in Equation (2), were not determined from preceding period i but were predicted for period i + τ. This resulted in a significant reduction in forecast errors.
The proposed model is a univariate forecasting model, but exogenous variables can be easily introduced as additional NN inputs. As with endogenous variables, a key problem with exogenous variables is their appropriate representation.

Conclusions
Forecasting TS with multiple seasonality such as STLF is a challenging problem requiring the forecasting model to be equipped with appropriate mechanisms to deal with multiple seasonal cycles. In this study, we use an alternative approach to deal with multiple seasonality. It involves TS preprocessing to filter out the trend and seasonal cycles longer that the basic cycle. This simplifies the relationship between input and output variables. To model this relationship, we use multi-output shallow NN architecture. To improve the learning performance of the model, we propose using the loss function with error weights. The error weights are assigned to the training samples and express their similarity to the query test pattern. This approach leads to more accurate modeling of the target function in the neighborhood of the test pattern. The empirical study, including a challenging STLF problem for four European countries, showed the improvement in performance for our proposed approach over standard NN learning without error weighting.
The proposed approach using a simple single-hidden layer NN is an alternative to deep NN solutions, which have been developed rapidly for forecasting problems in recent years. Unlike deep architectures, it is simple, transparent, easily tuned, and fast trained.
To further simplify the model and learning process, in our future work, we plan to employ randomization-based NNs to complex forecasting problems.

Data Availability Statement:
We use real-world data collected from www.entsoe.eu (accessed on 6 April 2016).

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: