Chiller Load Forecasting Using Hyper-Gaussian Nets

Energy load forecasting for optimization of chiller operation is a topic that has been receiving increasing attention in recent years. From an engineering perspective, the methodology for designing and deploying a forecasting system for chiller operation should take into account several issues regarding prediction horizon, available data, selection of variables, model selection and adaptation. In this paper these issues are analyzed in order to develop a neural forecaster. The method combines previous ideas such as basis expansions and local models. In particular, hyper-gaussians are proposed to provide spatial support (in input space) to models that can use auto-regressive, exogenous and past-error variables, thus constituting a particular case of NARMAX modelling. Tests using real data from different world locations are presented, showing the expected performance of the proposal with respect to the objectives and allowing a comparison with other approaches.


Introduction
In the context of energy consumption, chillers are becoming a factor of great importance. In medium to large buildings, chillers are responsible for producing cold water to cover the demand for Heating Ventilation and Air Conditioning (HVAC) and other uses [1]; as a result, a significant fraction of global energy consumption is due to chillers. Recently, operation of chillers has been the subject of research with the aim of reducing energy consumption maintaining operational objectives. For instance, efficient sequencing of different units allows the sharing of cooling load among several machines; these can then operate closer to their respective optimum conditions [2]. To this end different control schemes have been proposed. When available, forecasts of the cooling demand [3] can be used to produce control sequences that are feasible and do not pose a significant strain on the chillers (such as frequent switchings on and off). Moreover, the use of thermal energy storage devices can play a significant role in energy savings if coupled with a suitable forecasting system [4].
Predicting the cooling demand can be posed as a multivariate time-series forecasting problem (see [3,5,6] and references therein). For a given building/facility, several variables play a significant role in cooling load: external temperature, wind, humidity, building capacity and occupancy, desired thermal comfort level, etc. For some of these there might be some forecasts available, for instance weather forecasts can be easily accessed from various sources. However, forecasts for the exact particular location of the building might need models specially developed to such end.
In this paper an energy demand forecasting model is developed having in mind the special features arising from chiller optimization. In particular, ease of design/deployment and adaptiveness of the predictors are key for this application. It is remarkable that social changes can produce changes in the use of the building/facility that affect the energy demand, thus adaptiveness is a sought-after quality.
Neural networks based on basis functions, such as RBFNN [7], GRNN [8,9] and Hyper-Gaussian [10], are specially suited for the problem posed above. The basis functions are such that the response of each base is significant (different from zero) in just a region of the input space. This makes the derivation of the base parameters a simpler task compared with other Neural Network (NN) approaches [11].
In this paper a methodology for developing predictors for use in the context of chiller optimization is presented. The proposal uses a Hyper-Gaussian net to provide support for local Auto Regressive Moving Average with eXogenous inputs (ARMAX) models. The concept of local linear models is thus combined with that of the Hyper-Gaussian. In the proposal, each node is allowed to have a different metric to compute the distance as well as an associated ARMAX model with its own parameters and inputs.
In this context, and regarding the contributions of the present work, please note that, although hyper-gaussians were proposed long ago, in most implementations the weighted metric employed (also referred to as "Mahalanobis-like distance" [10]) is the same for all nodes. In such cases the Hyper-Gaussian NN becomes equivalent to a regular Radial Basis Function Neural Network (RBFNN) where the input data have been subject to a special normalization using the inverse of the experimental covariance matrix. Furthermore, note that the most used basis functions are radial (unlike hyper-gaussians that are not radial).
The proposed methodology aims at the already presented key features, namely low training requirements (in terms of memory, computation, design parameters and data required). In addition, the local nature of the approximation makes it possible to train the network with fewer input-output training data. Finally, the adaptation of the resulting model can easily be performed on-line, meaning that each new observation, acquired during the predictor use, can be used to fine-tune the predictor. This is of great importance because: (1) many applications for energy management requiring energy demand prediction might not have access to large data-sets to train the NN predictor, and (2) even if past data have been gathered, they might not be that relevant due to social and environmental changes as well as possible changes in the installations.
The methodology is put to the test with various forecasting problems arising in the context of energy consumption forecasting and in particular chiller optimization. In all cases, direct multi-step ahead forecasting is considered, with horizons (number of steps) that are relevant for the chiller optimization problem. A bibliographical review of related works is presented next and commented upon in the light of the objectives sought for chiller optimization. Then, in Section 3, the forecasting problem is detailed followed by Section 4 where the proposed methodology is presented. The tests used for assessment are explained in Section 5 where the obtained results, including comparisons with other approaches, are shown and discussed.

Related Works
In this section, a review of neural models used for forecasting energy consumption is presented, paying special attention to those related to the proposed methodology and others used in the tests for comparison. For a thorough review of artificial intelligence based load forecasting the reader is advised to check [6,12,13] and their references.
Energy demand forecasting is a difficult task due to some characteristics such as nonstationarity (in the mean and/or variance), calendar and other social effects, seasonality effects (weekly cycle, annual cycle), high volatility, jumps and outliers, and, in some cases, lack of long historical data sets [5]. This motivates the research on methods that might prove useful for some particular applications such as cooling of large buildings. Arguably, the first factor to take into account is the forecasting horizon. The case of interest for chiller operation optimization is that of short-term forecasting (horizons from a few hours to days [14]). For medium and long-term forecasting see [15] and the references therein.
The problem of chiller plant optimization has been tackled in many research works. Optimization is usually carried out in static terms (based on pre-specified load profiles) or dynamically by using appropriate control techniques. In the second case, having access to accurate predictions for the variables of interest makes the control techniques more capable of delivering energy saving schemes. Variables such as temperature/humidity/solar radiation and building/facility occupancy are usually considered. Cooling demand prediction was tackled in [16] using Generalized Regression Neural Networks (GRNN), in [17] using support vector machines, in [5] using Multi Layer Perceptrons (MLP), and in [18] using genetic methods that provide simple models by means of feature selection.
GRNN belong to a class of nonparametric estimation methods that can be posed as a special case of basis functions networks. The use of the Parzen Kernel with a set of input-output training pairs produces the well known GRNN method that uses as many bases as training pairs. Training of GRNN is not iterative and just a single parameter (node spread) is usually considered [8,9].
If the Parzen Kernel is replaced with a local linear model and the number of basis functions is restricted to some pairs (or averaged values) then a fusion of linear local models is obtained. In the literature the term local linear RBFNN has been used for such a construct [19]. Unsupervised center location and global spread parameters were used in most cases. For instance, in [20] a recurrent self-organizing map is used to cluster input training data, local autoregressive (AR) models are then developed for each cluster. The case of different spreads for each node is less common but it is treated with either global optimization techniques such as particle swarm optimization [19] or gradient descent [21].
Other means of using linear models have been proposed, in particular the work of [22] is somehow reciprocal of this proposal as the authors use a functional representation for the parameters of an AR model that becomes state-dependent. This has been extended to RBFNN producing the so called RBF-AR model of [23] and other variants [24]. The idea has also been extended to wavelets as in [25] where the local linear models are combined using a wavelet NN trained using Particle Swarm Optimization.
Regarding the use of non-Euclidean metrics for the basis function, the higher flexibility of Hyper-gaussians (HG) compared with RBFNN can decrease the problem posed by correlation between components of the input vector. Furthermore, carefully selecting the nodes helps to reduce the number of bases needed for a given approximation. Although many papers remark on the possibility of basis functions using a weighted metric, few actually use them. Please notice that in some previous works, the Hyper-Gaussian NN is referred to as "generalized RBF" [26]. This terminology is confusing since there are a handful of other meanings given to the term (e.g., [27]). Another important aspect is the training of the Hyper-Gaussians. In the seminal paper [10], regularization is introduced as a means to achieve better generalization. A similar approach also appears in later works [28].
On-line adaptation of RBFNN has also been proposed in other papers such as [29] where the basis functions centers and their diagonal covariance matrix are updated by means of a multi-innovation recursive least square. Please notice that schemes for the on-line adaptation of MLP are not that straightforward.
Summarizing the pros and cons of the main NN forecasting models, and in the context of chiller optimization, it is fair to state the following:
• MLP. The pro side is the ability to deal with high-dimensional spaces (i.e., a large number of inputs). The main drawback is the difficulty of on-line adaptation.
• RBFNN. The main advantage comes from the existence of constructive algorithms for concurrent training/design and from the possibility of on-line training/adaptation. The main drawback stems from the possible growth in the number of nodes as the input space dimension increases.
The proposed model aims at retaining the good traits of RBFNN avoiding its drawbacks. To this end, Hyper-gaussian nodes are used in conjunction with local models, reducing the possible growth on the number of nodes.

Forecasting Problem
To proceed orderly, the forecasting problem of interest for chiller optimization is first presented. A number of time-series are involved, mainly external temperature and building occupancy. Hourly values are assumed in order to be of use for the control system, so the temporal index t represents hours. For each time-series the aim is to compute (at time t = h) a prediction for t = h + H based on measurements available at time t = h − 1. The notation follows a discrete-time control approach for ease of connecting the predictor with the rest of the control system driving the optimized operation of the energy system [30].
A direct multi-step forecast strategy is adopted, thus the needed prediction will be computed as

ŷ(h + H) = N(x),   (1)

where N is a function (realized by a neural model) of an input vector x whose components are derived from past values of the time-series or from other available information. The selection of the input vector components is discussed later in this Section; such selection takes into account candidate components that are presented in the following.

Candidate Input Variables
The components of vector x in (1) receive various denominations: regressors, input elements, input variables, features, etc. For the problem at hand, it is possible to make a list of signals that are candidates to be used as actual components of x. The signals in the list will be referred to as candidate input variables or candidate regressors. Please note that the list does not pursue exhaustiveness. Windows are defined using two lags: L₁ ≥ 1 and L₂ > L₁.
• Type IV. Differences of autoregressive values. Signals in this group have the general form y(t − L₂) − y(t − L₁). This group is specially relevant for the direct multi-step problem.
• Type V. Averaged differences. This group uses averages of signals of Type IV.
• Type VI. Temporal indices. Signals in this group are variables such as the day within the year, day within the week, hour within the day, etc. Often it is interesting to encode the information in a way that reflects the observed periodicities. For instance, instead of d_y (day within the year) with d_y ∈ [1, 365], one may encode it as two cyclic variables z₁(d_y) = sin(2πd_y/365), z₂(d_y) = cos(2πd_y/365). Signals z₁ and z₂ are continuous over years, unlike d_y.
• Type VII. Weather forecasts. As stated in the introduction, the meteorological forecast might be available even if for a large area (city, province, state) in which the building/facility is located.
• Type VIII. Social indicators. Signals in this group are useful to encode a priori knowledge about energy demand stemming from social considerations such as the day of the week, festivities, etc.
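The Type VI cyclic encoding can be sketched as follows (a minimal illustration; the function name and signature are ours):

```python
import numpy as np

def cyclic_encode(day_of_year, period=365):
    """Encode a temporal index as two continuous cyclic variables (Type VI)."""
    angle = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / period
    return np.sin(angle), np.cos(angle)
```

The encoded pair is continuous across year boundaries: the last day of one year and the first day of the next map to nearly the same point on the unit circle.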

Data Preprocessing
As is usually done in forecasting applications, data should first be pre-processed to take care of some traits such as trend, seasonality and outliers. Such pre-processing must be designed having in mind the specific application: short-term forecasting of variables relevant for cooling demand. It must also follow the guidelines presented in the introduction: ease of design/deployment and adaptiveness. To this end the following steps are taken.

1. Outlier detection. Outliers can be the product of erroneous measurements made by sensors or incorrect treatment of the information (for instance misplaced points in the data-set). Spectral methods can be of use since the basic periods are known (annual cycle, 24 h cycle) and an estimate of the harmonic content for each one can be obtained from data. A simpler approach, however, is to use the point-like value y(t) of the hourly series and compare it with the mean (or median) over a window of data points µ = mean(y(t − L), . . . , y(t − 1), y(t + 1), . . . , y(t + L)). In this way, if |y(t) − µ| > U then y(t) is considered an outlier. This requires the selection of just two parameters: L and U. The detected outliers can then be substituted by filtered values [31].
2. Trend removal. There are many reports on the necessity of detrending data prior to its use with NN. One of the simplest approaches is a global linear fit performed on the data available for training [31].
3. Seasonality treatment. The kind of time series of interest in this application is linked to meteorological conditions that show an annual cycle with more or less marked seasons (depending on location). Seasonality can be accommodated by NN models if one includes temporal indications as inputs. Nevertheless, a simpler procedure can be applied since (in this case) the periodicity is known in advance: the seasonal average can be computed and its value subtracted from all data points.
4. Normalization. Linear scaling to the [−1, +1] interval is performed. Please note that in the proposed approach a weighted metric is used, so this step is not that important. Nevertheless, it will be used to provide a fair ground for comparison with other approaches.
These steps are applied to all data used in the paper for model design, testing and comparisons with other approaches.
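The outlier rule of step 1 can be sketched as follows (a minimal illustration using the median variant of the rule; the function name and default parameters are ours):

```python
import numpy as np

def detect_outliers(y, L=3, U=5.0):
    """Flag y(t) as an outlier when it deviates by more than U from the
    median of its 2L neighbours (the point itself excluded)."""
    y = np.asarray(y, dtype=float)
    flags = np.zeros(len(y), dtype=bool)
    for t in range(L, len(y) - L):
        # window of L points before and L points after y(t)
        window = np.concatenate([y[t - L:t], y[t + 1:t + L + 1]])
        flags[t] = abs(y[t] - np.median(window)) > U
    return flags
```

Flagged points would then be substituted by filtered values, with detrending, deseasonalizing and normalization applied afterwards.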

Selection of Regressors
The problem of composing an input vector out of the possible choices given the candidate input variables is referred to as "input variable selection" (IVS) or "variable/feature subset selection". It has been widely reported that this is not an easy task, yet it is a step of paramount importance. In particular, in the general case of NN models, there are no assumptions regarding the structure of the model (unlike the classical linear regression problem), which couples the IVS problem with the model selection problem. At any rate, factors that complicate the selection process are: the large number of candidate variables, the correlations among them that create redundancy, and the existence of variables that have little or no predictive power but can correlate with some patterns in the historical data.
There are in the literature different types of algorithms to perform IVS for NN models. Filters, wrappers and embedded methods have been widely used for time-series forecasting [32]. Each one has its own merits and drawbacks. For the application at hand (direct multi-step forecasting for chiller optimization) and the proposed model that will be presented later, it is possible to perform IVS by applying a linear correlation time-series analysis in the vein of Box and Jenkins. This is so because, in the proposed model, locality is exploited, so there is no need for the wrappers and embedded methods that arise in the general context of NN models. Thus, step-wise regression will be used to determine which variables (from groups I to VIII above) will form the input vector x used by the predictor given by (1).
Please note that, for typical energy-related forecasting, the filter approach is much less computationally intensive than any other IVS. It also provides insight into the problem, as the selected variables can be interpreted through correlations or by means of fitted linear models. These traits are in accordance with the guidelines adopted for this particular application, as stated in the introduction. Moreover, note that some dependency (or correlation) among variables is to be expected. Removal of the redundant ones does reduce the dimension of the candidate list, so the computational burden associated with input variable selection is greatly reduced. This procedure is most advisable in many practical situations. However, to be thorough, the results shown are based on the complete list using step-wise regression. Please bear in mind that in the comparisons performed later, all models share the same input vector to provide a fair comparison.
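A forward step-wise selection loop of the kind described can be sketched as follows (an illustrative implementation; the stopping rule based on relative RSS improvement is our assumption):

```python
import numpy as np

def forward_stepwise(X, y, max_vars=5, tol=1e-3):
    """Greedily add the candidate regressor that most reduces the
    residual sum of squares (RSS) of a linear fit."""
    n, p = X.shape
    selected = []
    best_rss = np.sum((y - y.mean()) ** 2)      # RSS of the mean-only model
    while len(selected) < max_vars:
        gains = []
        for j in range(p):
            if j in selected:
                gains.append(-np.inf)
                continue
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            theta, *_ = np.linalg.lstsq(A, y, rcond=None)
            gains.append(best_rss - np.sum((y - A @ theta) ** 2))
        j_best = int(np.argmax(gains))
        if gains[j_best] <= tol * best_rss:     # stop on negligible improvement
            break
        selected.append(j_best)
        best_rss -= gains[j_best]
    return selected
```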

Proposed Model
Neural models of the MLP type have some positive traits that have made them a common choice for forecasting, in particular the ability to establish complex relationships between variables due to their flexibility and training procedure. However, there are implicit drawbacks that need attention: risk of over-fitting producing low generalization, the black-box nature of the resulting model, difficulties with short datasets, difficulties in adapting an already trained network on-line to incoming new observations [14], and the need for elaborate methods for input variable selection.
On the other hand, RBFNN are three-layer nets similar to an MLP with just one hidden layer. In a RBFNN the input vector x is not weighted; rather, it is received by each node, which computes a kernel-like function Φ() based on it. Typical basis functions are Gaussians, defined as

Φ(x) = exp(−d(x, c)²/σ²),   (2)

where σ defines the spread of the Gaussian and d(x, c) is the distance of the input vector x from the base center c, taken in most cases as the Euclidean distance

d(x, c) = ‖x − c‖.   (3)

The output of a RBFNN for input vector x is computed by aggregating the activation of each base in a weighted manner:

ŷ(x) = Σ_{n=1}^{Nn} ω_n Φ_n(x),   (4)

where ω_n is the output weight of node n and Nn is the number of nodes. In addition to Gaussians, the basis functions can take other forms such as Φ(x) = (d(x, c)² + σ²)^{1/2} (multiquadrics) and Φ(x) = (d(x, c)² + σ²)^{−1/2} (inverse multiquadrics). In all cases, the response (activation) of each base is significant (notably different from zero) in just a region of the input space close to its center. This makes the derivation of the base parameters (centers and spreads) an easy task, amenable to constructive procedures [11].
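The Gaussian RBFNN output computation can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def rbfnn_output(x, centers, sigmas, weights):
    """Output of a Gaussian RBFNN: weighted sum of node activations."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared Euclidean distances to centers
    phi = np.exp(-d2 / sigmas ** 2)           # Gaussian activations
    return float(weights @ phi)               # weighted aggregation
```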
In addition, to tune basis output weights (ω n ) fast non-iterative methods such as the pseudo-inverse can be used [33]. This feature also makes it possible to train the network with fewer input-output training data (compared with feedforward MLP, recurrent NN and other approaches). Furthermore, the adaptation of an already trained set of basis functions can easily be performed on-line using each new observation to fine-tune the predictor during its use.
However, a large number of nodes in the hidden layer might be needed to produce the same approximation as an MLP, compromising generalization. Several variations of the basic RBFNN scheme have been proposed to enhance the positive traits and diminish the negative ones. In particular, using local linear models that use the nodes for spatial support. Then each node provides a weight factor to be applied to a local model. The output of the net is the weighted sum for all local models. These can, in principle, be of any type, although simple (linear) models are used in most cases.
The output of a local linear RBFNN (with Gaussian nodes) for input vector x is computed as

ŷ(x) = Σ_{n=1}^{Nn} ω_n(x) Λ_n(x),   (5)

where Λ_n(x) is the output of the n-th local model and ω_n is defined in terms of the GRNN basis as

ω_n(x) = exp(−d(x, c_n)²/σ_n²).   (6)
With an adequate choice of basis placement (c_n), basis spreads (σ_n) and local models, the needed number of hidden nodes can be drastically reduced compared with RBFNN, while retaining the positive traits of RBFNN. In particular, on-line adaptation is possible and an interpretation of the input-output relationship is also available through the local models.
Another early modification of the basic RBFNN scheme is the Hyper-Gaussian or HyperBFNN [10]. The Hyper-Gaussian nodes are similar to those of a Gaussian RBFNN but use a weighted norm defined as

d_W(x, c)² = (x − c)ᵀ W (x − c),   (7)

where W is a positive definite matrix. For some choices of W the resulting base is radial (e.g., W a multiple of the identity matrix); in the general case the bases are Gaussians whose iso-lines define ellipses in the input space with arbitrarily rotated axes. In this paper the local linear RBFNN concept is used with two important new features: the local models are chosen as ARMAX linear regression models and the supporting bases are Hyper-Gaussians. Moreover, the weighting matrix W is independently chosen for each node. The resulting structure receives the name Local Linear Hyper-Gaussian Model (LLHGM).
According to the previous definitions, the local models are defined as

Λ_n(x) = θ_nᵀ x,   (8)

where θ_n is a vector of model parameters. The activation of each base is calculated as

ω_n(x) = exp(−(x − c_n)ᵀ W_n (x − c_n)),   (9)

and the output of the LLHGM as

ŷ(x) = Σ_{n=1}^{Nn} ω_n(x) Λ_n(x).   (10)
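The LLHGM forward pass can be sketched as follows (a minimal illustration; the affine form of the local models, with θ_n[0] acting as a bias term, and the unnormalized weighted sum are our assumptions):

```python
import numpy as np

def llhgm_output(x, centers, W_list, theta_list):
    """LLHGM output: hyper-Gaussian activations weighting local linear models."""
    x = np.asarray(x, dtype=float)
    y = 0.0
    for c, W, theta in zip(centers, W_list, theta_list):
        diff = x - c
        act = np.exp(-diff @ W @ diff)    # hyper-Gaussian activation, per-node metric W
        lam = theta[0] + theta[1:] @ x    # affine local model (assumed form)
        y += act * lam
    return y
```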

Training of the LLHGM
The design of the predictor includes the selection of the parameters of the Hyper-Gaussians (centers and weighting matrices) and the local model parameters for each HG node. Again, the local nature of the approximation produced by a LLHGM makes it possible to attack each task more or less separately. It is also customary to include a last phase of supervised training in which gradient descent fine-tunes some or all of the parameters.
A typical realization of NN also includes some hyper-parameters (or structural parameters) to be selected. In the case of RBFNN these include maximum number of bases, extreme values for spread factors and heuristics for center placement using the distance to previous nodes. These are usually set by testing using a validation set.

Node Placement
Node placement is the selection of a value for the centers c n for each n to be used in (9). In the proposed model, as happens with RBFNN and related approaches, the centers can be obtained using different strategies, such as: self-organizing, supervised training, and constructive methods such as RAN [11], GAP [34], C-RBF [26] and their variants.
In principle any method could be used for the proposed model. However, it is important to recall that, in the proposed model, the Hyper-Gaussian bases are used to provide support (in the mathematical sense of locality) to the local models. So node placement is not as tightly linked to shape in the output space as it is with RBFNN.
This almost rules out supervised training and favors self-organizing over constructive methods. In addition, self-organizing methods such as K-Means are in line with the desired trait of ease of implementation. Finally, boot-strapping K-Means yields placements that are more robust (in the sense of being less sensitive to the random choice of initial solution) than those of constructive algorithms.
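The boot-strapped K-Means placement can be sketched as a plain multi-restart variant (our interpretation; function name and parameters are illustrative):

```python
import numpy as np

def kmeans_centers(X, k, restarts=10, iters=50, seed=0):
    """K-Means with several random restarts; keep the run with lowest distortion."""
    rng = np.random.default_rng(seed)
    best_c, best_j = None, np.inf
    for _ in range(restarts):
        c = X[rng.choice(len(X), k, replace=False)]      # random initial centers
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
            for i in range(k):
                if np.any(labels == i):                  # keep empty clusters in place
                    c[i] = X[labels == i].mean(axis=0)
        j = (((X - c[labels]) ** 2).sum(-1)).sum()       # distortion of this run
        if j < best_j:
            best_j, best_c = j, c.copy()
    return best_c
```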

Weighted Metric
Once the centers have been chosen, matrices W_n for each node n are easily derived. Several paths can then be followed; perhaps the simplest one is to consider all data points x_i belonging to the n-th Voronoi region produced by the K-Means phase, arrange them in a matrix X = (x_i) and compute the covariance matrix C_n = cov(X), from which W_n is derived simply as W_n = γ_n C_n^{−1}.
Parameter γ_n is necessary because the covariance matrix contains information about the relative importance (for the weighted metric) of each component of vector x in relation to the others, but it does not provide the width or spread for the Hyper-Gaussian node. This parameter can be set as a global value for all nodes or it can be derived for each node as a function of the distance to its nearest nodes. Whatever the case, it is customary (in RBFNN) to allow some overlapping of basis support, otherwise the input space would not be totally covered by the set of all nodes. Then, instead of a fixed value for γ_n, the designer has to come up with a desired value for the overlapping (which might be easier to set as it admits a geometrical interpretation). Different values can be tested and chosen using a test set without much burden.
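The per-node metric W_n = γ_n C_n^{−1} can be sketched as follows (a small regularization term, our addition, keeps the covariance invertible for degenerate regions):

```python
import numpy as np

def node_metric(X_region, gamma=1.0, reg=1e-6):
    """Weighted metric for one node: inverse covariance of the data in its
    Voronoi region, scaled by the spread factor gamma."""
    X = np.asarray(X_region, dtype=float)
    C = np.cov(X.T) + reg * np.eye(X.shape[1])   # regularized covariance C_n
    return gamma * np.linalg.inv(C)              # W_n = gamma_n * inv(C_n)
```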

Local Models
So far just the input space has been used to determine parameters. For the local models it is necessary to include the output values as well. The task is an easy one and can be solved by performing a least-squares fit Λ_n(x_i) = y_i + ε_i, where ε_i is the fit residual.
Pseudo-inverse methods can be used with minimum computer power (in comparison with a typical ANN training) to obtain the vector of parameters θ n for each n. Please note that, as in the previous case, one might just use the Voronoi regions from the K-Means phase to decide which data points x i are used for each node.
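The pseudo-inverse fit of a local model can be sketched as follows (an affine model with a bias term is assumed for illustration):

```python
import numpy as np

def fit_local_model(X_region, y_region):
    """Least-squares fit of an affine local model via the pseudo-inverse."""
    A = np.column_stack([np.ones(len(X_region)), X_region])  # prepend constant term
    theta = np.linalg.pinv(A) @ y_region
    return theta   # theta[0] is the bias, theta[1:] the regressor weights
```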

Final Adjustment and Adaptation
The previous steps allow setting all the parameters of the proposed model. It is interesting to remark that just some iterations of the K-Means algorithm and straightforward computations have been needed. With this, the model could be put to work and the adaptation would take care of minor corrections. However, in many cases regarding RBFNN models, one might fine-tune the parameters using gradient descent.
The chain rule allows the output error to be back-propagated; in other words, to compute the derivative of the squared error with respect to the parameters of the model and use that derivative to slightly modify each parameter, seeking a minimum of the error.
Learning rates must be selected keeping in mind how much a given error level is allowed to modify each parameter. This is specially important for the adaptation phase. Moreover, bear in mind that if centers are allowed to move, care has to be taken to prevent the eventual superposition of nodes. Finally, since the design of the model involves little computation it is possible to re-run the whole process if one has doubts about the performance of the adaptation process.
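One on-line adaptation step can be sketched as follows (a conservative variant, our assumption, that updates only the local-model parameters by gradient descent on the squared error while keeping centers and metrics fixed; the affine local-model form is also assumed):

```python
import numpy as np

def adapt_local_models(x, y_true, centers, W_list, theta_list, lr=0.01):
    """One on-line gradient step on the local-model parameters theta_n."""
    x = np.asarray(x, dtype=float)
    acts = np.array([np.exp(-(x - c) @ W @ (x - c))
                     for c, W in zip(centers, W_list)])   # hyper-Gaussian activations
    xt = np.concatenate([[1.0], x])                       # affine regressor [1, x]
    y_hat = sum(a * (th @ xt) for a, th in zip(acts, theta_list))
    err = y_true - y_hat
    for a, th in zip(acts, theta_list):
        th += lr * err * a * xt                           # in-place gradient step
    return y_hat
```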

Experimental Results
The proposed model is trained on different data-sets to show its adequacy for the objectives that motivate the research. Comparison with other strategies will be presented on the basis of hold-out data posterior in time to that used for training.

Data-Sets for Design and Comparisons
Three data-sets are considered. T1 consists of hourly temperature values from Atlanta (USA). E1 consists of hourly energy consumption data from Dominion Virginia Power (DOM), with no temperature data linked to them. In these cases the data were obtained from the Kaggle website [35]. E2 consists of hourly energy consumption from a chiller in Seville, Spain, for which historical temperature values are available. Please note that several world locations are represented, thus different climates and social patterns are considered. Furthermore, the data sets correspond to situations of interest for chiller consumption forecasting, including temperature (T1), energy with no attached temperature (E1) and energy with associated temperature (E2).
Please note that, in data-driven forecasting of chiller load, meteorological forecasts might be available even if only for a large area (city, province, state) in which the building/facility is located. Yet the data-driven forecast for a particular location might be more accurate than the large-area forecasts since local factors can be included. This is one reason why set T1 is included. The other reason is that, in order to use set E2, a forecast of temperature is needed. Regarding E1, it is included to be able to compare results in the (perhaps not too common) case where temperature forecasts are not available. The same methodology is applied and, for the sake of completeness, the results are given, further validating the method.
In order to compare different models using different data sets it is convenient to define a figure of merit such as the Mean Squared Error (MSE) of the predictions. However, the physical units or scale used by the output variable can obscure the interpretation of results. A dimensionless variable that is often used is the MSE of the predictions divided by the MSE of the lazy man's prediction. This is the same as dividing the MSE by the observed variance of the variable of interest. This measure provides an idea of the inherent difficulty of predicting a time-series. Consider the following definition:

s(F, L)² = (1/T) Σ_{t=1}^{T} (y(t + F) − y(t − L))²,   (11)

where F is the prediction horizon, T the number of data points and L an appropriate lag. For the one-step-ahead prediction it makes sense to consider H = 1, L = 1 (adopting the discrete-time control approach); then s(1, 1)² is a measure of the variability of the time-series when going from a known datum y(t − 1) to a future one y(t + 1). This figure of merit s(1, 1) can be interpreted as the mean squared error that a lazy predictor would produce. The lazy predictor in this case takes the form ŷ(t + 1) = y(t − 1). Any other prediction should produce less variance, and the amount by which the variance is reduced is a measure of the goodness of the predictor. In other words, s(1, 1) provides a base-line for comparing predictors. Now, for a direct H-step-ahead predictor, and given the fact that a 24 h cycle is involved, it is natural to consider the cases (a) (F = H, L = 1), because y(t − 1) is the moment closest to the present for which measurements are available, and (b) (F = H, L = 24 − H), because y(t − 24 + H) corresponds to the same hour as y(t + F) but on the previous day. In fact, the lazy man's prediction for y(t + F) would be ŷ(t + F) = y(t − L). Then, the following figure of merit emerges:

E(F, L) = 100 · [Σ_{t=1}^{T} (y(t + F) − ŷ(t + F))²] / [Σ_{t=1}^{T} (y(t + F) − y(t − L))²],   (12)

where y is the real value and ŷ the prediction.
This measure (E) establishes the degree to which the observed variance has been reduced by the predictor, allowing a comparison of results on different data-sets (with possibly different variables and/or units). Please notice that the lazy man's prediction would generate a value of E = 100; any other predictor should produce smaller values, and in fact the lower the better.
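The baseline s(F, L)² and the figure of merit E can be sketched as follows. This is a minimal NumPy illustration of the definitions above, not the code used in the paper; array layout and index handling are assumptions.

```python
import numpy as np

def s_squared(y, F, L):
    """MSE of the lazy prediction yhat(t+F) = y(t-L) over all valid t."""
    t = np.arange(L, len(y) - F)          # indices with both t-L and t+F in range
    return np.mean((y[t + F] - y[t - L]) ** 2)

def merit_E(y, y_hat, F, L):
    """Dimensionless figure of merit E; the lazy predictor scores 100."""
    t = np.arange(L, len(y) - F)
    mse_pred = np.mean((y[t + F] - y_hat[t + F]) ** 2)
    return 100.0 * mse_pred / s_squared(y, F, L)
```

As a sanity check, feeding the lazy prediction itself into `merit_E` returns exactly 100, and any useful forecaster should score below that.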
The s(F, L) statistics are shown in Table 1 for each data set and for each of the values of H considered in the tests. Other characteristics of the data-sets are also included. The chiller used at the installation producing the E2 data-set has the following characteristics: 400 kW cooling power, 1.5 × 5 × 2.2 m in size (including casing), and 4 × 10³ kg in weight. It operates producing chilled water at about 5 °C that is sent to the installation, from which it returns at a higher temperature depending on the thermal load. In addition, a laboratory version is available that allows inspection of its elements for research purposes, as illustrated in Figure 1.

Input Variables
A thorough search over combinations of variables was performed to obtain the most promising input vector for each data-set. In order to produce a fair comparison, the same input vector is used for all methods under comparison. In Table 2 the chosen input vector is presented for each data-set; the variables come from the list of candidate variables given in Section 3.1. The particular values of the lags and other parameters are presented in Table 3.

Table 2. Composition of the input vector for each test.

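As an illustration of how such an input vector can be assembled from auto-regressive lags, an exogenous temperature and calendar/social indicators, consider the following sketch. The lag values, the cyclic hour encoding and all names here are illustrative assumptions, not the values listed in Table 3.

```python
import numpy as np

def build_input_vector(y, temp, hour, festivity, t, ar_lags=(1, 2, 24)):
    """Assemble the regressor at time t: auto-regressive lags of the load,
    an exogenous temperature value, and calendar/social indicators."""
    ar = [y[t - k] for k in ar_lags]             # auto-regressive part
    cal = [np.sin(2 * np.pi * hour[t] / 24),     # cyclic encoding of the hour
           np.cos(2 * np.pi * hour[t] / 24),
           float(festivity[t])]                  # festivity (social) indicator
    return np.array(ar + [temp[t]] + cal)
```

A cyclic sine/cosine pair is a common way to encode the hour of day so that 23:00 and 00:00 are close in input space; a one-hot or binary festivity flag plays a similar role for social effects.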

Results
The proposed predictor has been tested with the data-sets of Table 1. For comparison purposes, a feed-forward NN was also trained on each data-set. In all cases, 4 years of data were used for model design (structure and parameter estimation) and one additional year (posterior in time) for comparison. The first set is termed historical (H) and was in turn divided into training/testing/validation parts as usual. The second set is termed new (N) and was not used except for the final comparison.
The number of nodes was obtained using the hold-out-set technique. The structure of the resulting models is presented in Table 4. Two variants of the proposed algorithm are presented in Table 5: the one without adaptation (LLHGM) and the one with adaptation (LLHGM-a); in both cases the number of nodes is the same. The forecasting errors are reported in Table 5 using the figure of merit (E) previously introduced, for the particular values F = 6, L = 24 − 6, since a lead of 6 hours in the forecast is considered. Please bear in mind that (E) is a dimensionless figure of merit, independent of scale, that allows comparing different models on different data sets. A value of E = 100 is obtained by the lazy man's forecast, which simply produces ŷ(t + F) = y(t − L); hence, any value below 100 is an improvement over that simple forecast. It can be seen that the MLP is able to reduce E by 14% to 26% depending on the data-set. The proposed model obtains a slightly larger reduction, especially in the (N) section, meaning that its generalization is better. The adaptive version of the proposed model achieves even better generalization results, as expected.

Figure 2 shows the trajectories of the predicted variable y and the forecasts ŷ for the MLP and the proposed model. The temporal span is a week, and the data shown correspond to the (N) section of each data-set. It can be seen that the worst predictions correspond to the E1 data-set; please recall that no meteorological information is used in this case, just auto-regressive values. It can also be seen that in data-set E2 the use of temperature and a social indicator allows the creation of better predictions compared with the E1 case.

Figure 3 shows the histogram of errors obtained for data-set E2 using the proposed model with and without adaptation. The data points used in the histogram belong to the (N) section of the data-set and comprise 3 months of hourly data.
This allows the validation of the model using data not previously used in the design and training of the model.
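The chronological split described above can be sketched as follows. Hourly sampling and the train/test/validation fractions are assumptions made for illustration; the paper does not specify them here.

```python
import numpy as np

HOURS_PER_YEAR = 8760  # hourly data assumed

def split_data(y, years_hist=4):
    """Split into a historical (H) block, used for model design, and a
    posterior new (N) block, reserved for the final comparison."""
    n = years_hist * HOURS_PER_YEAR
    return y[:n], y[n:]

def split_historical(yh, fracs=(0.7, 0.15, 0.15)):
    """Divide the historical block chronologically into
    training/testing/validation parts (illustrative fractions)."""
    n = len(yh)
    i1 = int(fracs[0] * n)
    i2 = i1 + int(fracs[1] * n)
    return yh[:i1], yh[i1:i2], yh[i2:]
```

Keeping the splits chronological (rather than random) is what guarantees that the (N) section is genuinely posterior in time and never seen during design or training.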

Discussion
The results show how the proposed scheme compares with a classical approach such as a Multi-Layer Perceptron. Please notice that the features sought by this proposal are ease of implementation and on-line adaptation capability. Yet the results show that the proposal can also compete in terms of forecasting accuracy, especially with new data (i.e., data encountered during the use of the predictor after training). The authors believe that this result comes from two sources: a) the local nature of the approximation made by the model and b) the low complexity of the local linear models. These prevent (or limit the possibility of) the LLHGM following other approaches (such as the MLP) into poor generalization.
The ease of on-line adaptation provides a further reduction in forecasting error, which can help cope with changes in load patterns due to socio-economic effects. Such changes are likely to appear in connection with energy consumption and, in particular, with building usage and the associated HVAC needs.

Data Availability Statement: Data-sets T1 and E1 can be found at the Kaggle website [35].

Conflicts of Interest:
The authors declare no conflict of interest.