Generalised Regression Hypothesis Induction for Energy Consumption Forecasting

: This work addresses the problem of energy consumption time series forecasting. In our approach, a set of time series containing energy consumption data is used to train a single, parameterised prediction model that can be used to predict future values for all the input time series. As a result, the proposed method is able to learn the common behaviour of all time series in the set (i.e., a ﬁngerprint) and use this knowledge to perform the prediction task, and to explain this common behaviour as an algebraic formula. To that end, we use symbolic regression methods trained with both single-and multi-objective algorithms. Experimental results validate this approach to learn and model shared properties of different time series, which can then be used to obtain a generalised regression model encapsulating the global behaviour of different energy consumption time series.


Introduction
Energy efficiency in the building sector has become an important research area for two main reasons: firstly, because the residential sector represents around 25% of global energy consumption; and secondly, because the building sector is also considered as the main contributor to the energy shortage and climate change effects regarding the worldwide increased population and the large environmental impacts [1,2]. Building infrastructures include sensor technologies [3] that provide a huge amount of energy consumption data that allows researchers to address the problem of energy usage and its environmental impact from different perspectives [4], such as anomaly detection [5,6], energy consumption modelling [7,8], energy demand planning [9] or consumer profile mining [10,11].
Each of these problems has been addressed with different techniques, according to the nature of the problem and the desired output. For instance, in anomaly detection problems, Chou and Telaga [5] used a neural network to predict the energy consumption of future days and attempted to detect the energy consumption anomalies identifying the differences between the real and predicted energy data. On the other hand, Cui and Wang [6] proposed a hybrid model that combines polynomial regressions and Gaussian distributions to build data detection and visualisation systems that help to identify anomalies in electricity consumption data. Regarding consumer profile problems, Gomez et al. [10] used a data-fitting approach and a multi-class classifier to estimate the electricity needed in a building, and Capozzoli et al. [11] used pattern recognition and classification algorithms to provide knowledge about the energy usage in a building. Regarding energy consumption modelling, Zhao et al. [12] used occupant information in addition to the heating, ventilation and air conditioning (HVAC) energy consumption data to determine the pollution impact of buildings. As another example, Balaji et al. [13] built a control system that uses the WiFi network traffic in a building to estimate the occupancy and control the HVAC. Due to the heterogeneous nature of the different sources of data, every approach usually needs a preprocessing stage to provide useful data to the models, as argued in [7].
Our current work focuses on energy consumption forecasting in public buildings. Traditionally, energy consumption forecasting in residential and public buildings has been implemented by means of analysing a single energy consumption time series [14][15][16][17][18]. In these cases, this single time series is given as input to a prediction model, with the potential addition of some external data (temperature, building occupancy, etc.). These models have proven able to provide suitable results in their respective case studies, obtaining accurate prediction models. On the other hand, there are many works that combine historical energy data with statistical and machine learning techniques to provide more accurate models able to improve the energy consumption in buildings.
A good and recent review on this topic [19] summarises the state of the art in the energy consumption forecasting paradigm, and describes the most used techniques to that end. To only cite a few, Yasmin et al. developed ARIMA models (Autorregresive Intregrated Moving Average) [20] to predict electricity consumption in Pakistan and Baca Ruiz et al. [21] proposed nonlinear autoregressive artificial neural networks with exogenous inputs (NARX) to predict the energy consumption in public buildings. There is a plethora of proposals in the literature for energy consumption forecasting in the last few years, and a complete survey paper would be necessary to analyse all these research approaches.
Unlike traditional approaches in energy consumption forecasting for buildings, which use a single time series that contains historical data of energy consumption and exogenous information, we address our research from a different perspective and we assume that there are multiple buildings whose energy consumption must be predicted. We hypothesise that, if the energy consumption of different buildings is medium or highly correlated, then these buildings share the same energy consumption ground behaviour. Then, our goal is to obtain a general forecasting model able to learn structural relationships of all time series in a set, and to parameterise this model for its particular adaptation for each specific building, obtaining a more accurate prediction model that explains the overall energy consumption behaviour of the whole compound.
To address this problem, we use symbolic regression and genetic algorithms to find interpretable regression models that describe the relationships in the energy consumption data that are common to all buildings under study in a compound. More specifically, we assume that each related building provides a dataset with their energy consumption, and our motivation is to find a general regression hypothesis f able to explain all datasets of each building. These regression models will then be parameterised for its adaptation to a specific building forecasting task. As we have not found any work in the literature that solves a similar problem, we attempt to use two alternatives, namely classical genetic algorithms and multi-objective optimisation approaches, to identify the strengths and weaknesses of each type of techniques. In addition, in the experiments, we firstly validated the approach with synthetic data generated in the laboratory, and then in a real scenario. In both cases, we departed from several energy consumption time series that are medium or highly correlated, and we attempted to find a single regression model able to learn the shared ground behaviour of all these data series. The ultimate goal was to predict their future values accurately with the same model, parameterised for each data series.
To achieve these objectives, this manuscript is organised as follows. Section 2 introduces the fundamentals of symbolic regression and the different historical alternatives used to solve multi-objective problems. After that, Section 3 describes the formulation of the problem and the proposed methods. Section 4 shows the experimental results in two scenarios: synthetic data generated in the laboratory and real energy consumption data. Finally, conclusions and future works are described in Section 5.

Related Work
The literature gathers several techniques able to solve energy forecasting problems, such as neural networks [22,23], support vector machines, decision trees [24,25] or regression analysis techniques [26,27]. Since we were interested in finding not only accurate but also interpretable solutions for the final user, we focus this research in regression analysis techniques, and study their limitations and the solutions provided by different authors in the literature in Section 2.1. After that, we establish the basic concepts of multi-objective optimisation techniques required for our research in Section 2.2.

Symbolic Regression
Regression analysis is a classical tool widely used by researchers in prediction and data modelling problems. Given an algebraic expression f as model hypothesis, a set of input data (independent variables)x = (x 1 , x 2 , ..., x n ) and output data (dependent variables)ȳ = (y 1 , y 2 , ..., y m ), the regression analysis attempts to find the optimal model parametersw = (w 1 , w 2 , ..., The parametersw are usually found using numerical procedures, such as least squares optimisation, that minimise an error measurement, for instance e( f (x,w),ȳ) = || f (x,w) −ȳ||. However, the main limitation of traditional regression analysis appears when not only the parametersw are unknown, but also the model hypothesis f , or when f is very difficult to formulate manually. To solve this limitation, symbolic regression [28] combines a set of predefined atomic operators (such as +, -, *, /, and sin), independent variablesx and parametersw to build an algebraic expressionf as an approximation for the optimal model f . To do so, symbolic regression uses optimisation algorithms to explore the search space and finds the best approximatioñ f that minimises an error measure, such as ||ȳ −f (x,w)||.
Although there are many techniques designed to solve optimisation problems, we chose genetic programming due to its demonstrated potential in several areas [29], including symbolic regression. Genetic programming [30] is a supervised machine learning method based on biological evolution and is used in symbolic regression problems since it evolves a population of candidate algebraic expressionsf (x,w) and applies a set of genetic operators [31] to obtain the best candidatef . Although tree structures are highly used to encode the algebraic expressions of the population in genetic programming algorithms, our previous research [32] has demonstrated that alternative representations such as Straight Line Programs (SLP) are able to improve the solutions of symbolic regression problems in terms of not only accuracy and computational time but also obtaining more interpretable algebraic expressions.
Symbolic regression has been proposed previously as a tool to model energy consumption. The state of the art in this topic includes the work by Yang et al. [33], who used symbolic regression and evolutionary algorithms to identify the main factors in the energy consumption in China. Whereas Bhattacharya et al. [34] used linear genetic programming to perform consumer electricity demand forecasting, Behera et al. [35] used a genetic algorithm in order to develop an effective planning system able to estimate demand and energy consumption. In our works, we have also studied the use of symbolic regression for energy consumption modelling with good results [32].
Benefits of symbolic regression for energy consumption prediction are the simplicity of the resulting models, in contrast to more complex and difficult to analyse models such as neural networks or support vector machines, and also their interpretability by a non-expert in machine learning. On the other hand, limitations arise in the side of accuracy when dependencies between input and output data are difficult to find. In these cases, universal approximators such as neural networks have shown good performance [21]. Overall, we have selected symbolic regression as representation of forecasting model hypotheses due to the balance between their easy interpretability and good accuracy.

Multi-Objective Optimisation Paradigm
As previously described, our goal is to obtain a single general and parameterised forecasting model able to predict the energy consumption of different buildings. Thus, given a set of N energy consumption time series as input (one for each building), the task at hand is to find a unique parameterised regression modelf (x i ,w i ) that consistently models and predict future values of each ith time series separately. If the valuesȳ i are the desired outputs for forecasting the ith time series, then we can measure N prediction errors as e(f , i) = ||ȳ i −f (x i ,w i )||. Finding the desired modelf therefore requires the minimisation of the N error measurements separately. In this work we consider two approaches to address this problem: (1) a single-objective approach, where all error measurements are aggregated into an unique measurement as e(f ) = ∑ N i=1 e(f , i); and (2) using multi-objective optimisation, where each error measurement e(f , i) would be considered as a separate target function to be minimised.
Multi-objective optimisation problems attempt to find models that minimise/maximise a set of objective measures simultaneously, and represent the solution of each objective in a vector function e. Formally, a multi-objective problem is defined as the minimisation of multiple criteria e(x) = (e 1 (x), e 2 (x), ..., e m (x)), wherex = (x 1 , x 2 , ..., x n ) is the vector of decision variables, and e(x) = (e 1 , e 2 , ..., e m ) are the objective functions that must be minimised/maximised. An important aspect to take into account solving a multi-objective problem is related with the problem statement. As Hassan [36] argued, a multi-objective problem can be addressed from three different perspectives: converting the multi-objective problem in a single-objective one [37][38][39], using population-based algorithms [39][40][41] and studying Pareto-optimal based techniques [42][43][44][45][46].
Multi-objective optimisation techniques have been used to solve energy consumption problems (e.g., [47][48][49][50]). Therefore, in this work, we compare single-objective and NSGA-II multi-objective approaches to train symbolic regression forecasting models. The next section describes the problem statement of each approach, the representation used to solve symbolic regression and the main components of the algorithmic procedure.

Methods
Our method is based on the assumption that there is a set of several datasets (time series) composed of input/output patterns with the same structure, for which we want to model the output data with respect to the input data and obtain an algebraic expression that models the input/output relationship. We also hypothesise that there is a shared ground common behavior that is present in all these datasets, and that can explain output data regarding input data (at least partially). The model for this shared behavior could be unknown in advance. Due to this assumption, we consequently expect a medium/high correlation between the output data of all datasets.
As a toy example to understand these hypotheses, we sampled data from two linear functions y 1 = f 1 (x 1 ) = 3 * x 1 + 5 + 1 , and y 2 = f 2 (x 2 ) = 6 * x 2 + 1 + 2 (where 1 , 2 stand for an error in each dataset, respectively) and obtained two datasets (X 1 , Y 1 ) and (X 2 , Y 2 ). Both datasets come from different data distributions, but share a common ground behavior that could explain y 1 from x 1 and y 2 from x 2 , and could be written as y = f (x) = a * x + b. Thus, the regression hypothesis f can be used to explain both datasets, when it is parameterised by the coefficients a and b.
We could use traditional symbolic regression methods to solve this toy example, by means of solving each dataset separately, and probably obtaining algebraic expressions that match f 1 and f 2 , respectively. In this article, we provide an automatic method to find general regression hypotheses that could help to explain multiple datasets, providing a single parameterised algebraic expression that model the common ground behavior of all datasets. A necessary constraint of our proposal is that all datasets must have the same data structure, i.e., they are composed by a set of independent (input) variables X = (x 1 , x 2 , ..., x k ) where x m ∈ R, ∀m : 1 ≤ m ≤ k, and the number of variables k is the same for all datasets, and a single dependent (output) variable Y = (y 1 ) where y 1 ∈ R.
The problem formulation assumes a set of n datasets (X 1 , Y 1 ), (X 2 , Y 2 ), ..., (X n , Y n ) whose data have this structure, i.e., X i = (x i 1 , x i 2 , ..., x i k ) and Y i = (y i 1 ) ∀i : 1 ≤ i ≤ n are the input and output variables, respectively, of the ith dataset. We denote N(i) = (X i , Y i ) as the number of input/output patterns of the ith dataset. We remark that two datasets can have different sizes (number of input/output patterns), so that N(i) can be different to N(j) for two different datasets, without loss of generality. Thus, the main goal of this work consists of finding a general parameterised algebraic expression F(X, W) able to explain the relationships between dependent (output) and independent (input) data from all datasets simultaneously, where W = (w 1 , w 2 , ..., w l ) : w q ∈ R∀q : 1 ≤ q ≤ l is a set of parameters or coefficients estimated for each specific dataset. Finding the values for the parameters of each dataset, W i , is also a component of the proposed method and implemented as a local search procedure within a genetic evolutionary approach. We deepen into the algorithms description in the following sections.

Straight Line Programs for Time Series Prediction
A time series is a sequence of data (that we call x) evenly sampled in time (t) that is used to predict futures values x(t + k). The goal of time series prediction is to find a model f that combines h previous values of the time series data and parameters w, to forecast the next values of the time series as In this paper, we use symbolic regression to predict future values of energy consumption time series. By using symbolic regression, we assume that the prediction model f can be written as an algebraic expression. Although there are different alternatives to encode algebraic expressions, including trees and linear genetic programming, Straight Line Programs have proved their potential over classical representations [51]. Straight Line Programs (SLP) are represented as a sequence of grammar production rules, where each production rule is composed by a set of known mathematical operators O U i ∈ {o 1 , o 2 , ..., o m } (as for instance {+, −, * , /, sin, cos}), a set of terminal symbols T = {t 1 , t 2 , ..., t k } (as for instance, parametersw = {w 1 , w 2 , ..., w k } or independent variables x = {x 1 , x 2 , ..., x n }) and references to other rows R U i,1 , R U i,2 ∈ {T ∪ {U 1 , U 2 , ..., U i−1 }}. We may verify that the production rules that appear in the consequent must be references to previous rules, in order to avoid recursion.
We can build an algebraic expression from a SLP by evaluating the Nth rule, which is the starting symbol of the grammar U N . As an example, Equation (1) gathers a SLP that encodes the algebraic The following subsections gather the problem formulation for each approach and also a description of the main genetic operators to train SLP representations using genetic algorithms.

Single-Objective Problem Formulation
The problem addressed in this research is applied over multiple time series simultaneously, and it assumes that symbolic regression should find a general algebraic expression F(X, W) that models the dependent variables of several datasets simultaneously, where X represents the independent variables of the datasets and W is the set of parameters values estimated by symbolic regression. The single-objective formulation of the problem assumes that symbolic regression should minimise an error function e(F) in order to find the general algebraic expression F. We calculate this error e(F) as the sum of the errors of F of all datasets to approximate the desired output Y (Equation (2)).
where n stands for the number of datasets, N(i) is the number of data samples of the ith dataset, X i (p) is the pth input sample of the ith dataset, Y i (p) is the pth output sample of the ith dataset, and W i are the values of the algebraic function parameters of the ith dataset. The optimisation problem is formulated as finding the algebraic expressionF = min F (e(F)). Once the problem is formulated as explained, we use a genetic algorithm to find a SLP that provides the best regression hypothesisF.
Using the previous formulation, the accuracy error of the prediction model F over a time series X i is aggregated into a single measurement e(F). In contrast, the multi-objective formulation shown in the next section treats the error minimisation of each time series separately.

Multi-Objective Problem Formulation
The single objective problem formulation described in the previous section is likely to have limitations if datasets are not normalised, since e(F) is defined as the sum of squared errors in all datasets. If the data scale varies significantly from one dataset to another, then the errors could also vary significantly, and then the datasets with higher absolute errors could dominate the search in the solution space. In cases in which the data cannot be normalised, or if we do not know lower and upper bounds to perform a normalisation, then the single objective strategy could lead us to obtain undesired local optima solutions. To solve this limitation, we also provide a problem formulation from a multi-objective optimisation perspective.
In it, we define a minimisation objective per each dataset, so we assume a set of n objectives (O 1 , O 2 , ..., O n ), where each objective O i attempts to minimise the error function of the corresponding ith dataset (X i , Y i ). Therefore, in the searching process of the general algebraic expression F, the multi-objective approach attempts to minimise each objective individually, finding the function F that minimises the error of the ith objective without worsening the quality of the jth objective. In this way, this formulation may avoid local optima when the data are not normalised since bad solutions in the estimation of some datasets do not influence the exploration of the search space for the remaining objectives. Thus, the goal is to find a general algebraic expression F that minimises an error function e(F, i) for each of the objectives. The error measurement that must be minimised for each objective is shown in Equation (3).
where n stands for the number of objectives (i.e., the number of datasets), N(i) is the number of data samples of the ith dataset, X i (p) is the pth input sample of the ith dataset, Y i (p) is the pth output sample of the ith dataset, and W i is the set of parameter values for the target algebraic expression F for the ith dataset. The multi-objective formulation of the problem is to find the optimal algebraic expressionF, or the set of Pareto-optimal algebraic expressionsF, such asF = min F (e(F, i))∀1 ≤ i ≤ n.
Once the problem is formulated as explained, we train SLPs to minimise (O 1 , O 2 , ..., O n ) using the multi-objective algorithm NSGA-II [46].

Algorithm Description
We use a classic genetic algorithm [32] to optimise the single-objective approach, and the NSGA-II [46] method for the multi-objective optimisation approach. Since no changes were made to these template algorithms, in this section, we only describe the genetic operators to be used with SLP representation. We also remark that both crossover and mutation operators were proposed by Alonso et al. [51] and we summarise the procedure of each operator with the aim of improving the understanding of this work.
• Crossover operator. Two parents P 1 and P 2 are used in both single-and multi-objective approaches in order to generate two new children C 1 and C 2 . The operator starts out selecting a random rule After that, an ordered set of rules R is calculated as the set of rules U = {U 1 , U 2 , ..., U i−1 } that can be reached from the selected rule U i . Then, a random rule The offspring C 1 is created as a copy of the parent P 2 and the rules in R are copied into C 1 and renamed from U k−|R|+1 to U k . Finally, the offspring C 2 is generated with the same procedure, but exchanging the roles of both parents P 1 and P 2 . An example of this operator is shown in Figure 1, where rule r 1 = 5 was selected randomly from parent P 1 . After that, the ruleset U is created as the set of rules that can be reached from U 5 . In this case, U = {U 5 , U 2 , 1 }, and is renamed as Then, a random position r 2 = 4 is selected in P 2 , and the offspring C 1 is created as a copy of P 2 with the replacement of rules in R, starting from r 2 . • Mutation operator. Given a SLP table of an individual of the population, a random element of the consequent of a random rule is exchanged for another random symbol. If the selected element is an operator, it is exchanged by another valid operator and if the selected element is an operand, it is exchanged by a terminal symbol or a reference to other rule, as shown in Figure 2. On the other hand, if the mutation operator exchanges a binary operator by an unary operator, the second operand of the rule is left to the value ∅. Nevertheless, if the operator mutes an unary operator to a binary operator, then the second operand is randomly selected from the set of valid operands of the production rule (independent variables, parameters or references to other rules of the SLP table).
Once both single-and multi-objective algorithms have found an algebraic expression, a local search procedure is applied in order to estimate the parameters W for each objective. In this way, both algorithms are able to provide a general algebraic expression which share common behaviour from all datasets (objectives) and parameters are conveniently fitted to satisfy each objective.

Experimentation
We tested and experimentally validated both single-objective and multi-objective approaches in two experiment setups: • A first study attempted to empirically validate whether the described problem can be solved with the proposed formulations. Thus, we built different scenarios with synthetic data, with the aim of validating the performance of each approach under a controlled experimental environment that eases the analysis of performance of the approaches. To that end, Section 4.1 describes a set of benchmark algebraic expressions to be used in the experiments, and the results obtained with each approach. This section ends with a discussion of the results obtained.

•
In the second experiment (Section 4.2), we tackled a real problem about energy consumption prediction. In it, we were provided with the energy consumption time series of a set of buildings for which we assumed there is a medium or high correlation and the goal was to find a single parameterised algebraic expression F(X i , W i ) that can explain the global/common behaviour of all related energy consumption data series.

Experimentation with Synthetic Data
We used a set of benchmark algebraic expressions to create a set of synthetic datasets with the same ground behaviour. The main goal of this experimentation was to test if the proposal of this paper can be used to obtain a general regression hypothesis that explains all datasets coming from the same algebraic expression, with different parameters W for each dataset, in this controlled environment.
This section is divided into three parts: firstly, Section 4.1.1 explains the set of benchmark algebraic expressions, and how all artificial datasets were generated. Then, Sections 4.1.2 and 4.1.3 show the experimental configuration used in each approach and the results obtained, respectively.

Data Acquisition
We used six benchmark algebraic expressions (see Equations (4)- (9)) that are widely used in the literature to test the quality of symbolic regression algorithms [52]. More specifically, we selected benchmark algebraic expressions with different complexity (which include polynomial, trigonometrical and exponential expressions) and also with a varying number of parameters (from one parameter (Equations (7)-(9)) to six parameters (Equation (6))). Besides, we generated different datasets for each benchmark algebraic expression to empirically validate if both single-objective and multi-objective approaches have the same behavior, or in contrast, to find which technique carries out a better exploration of the search space when the number of datasets increases.
We generated five datasets for each benchmark algebraic expression shown in Equations (4)-(9). All five datasets regarding a single algebraic expression share the same values for the input data, i.e., for each independent variable x i of each benchmark algebraic expression, we generated 200 samples uniformly distributed in the range [0.0, 1.0]. After that, we selected different values for the parameters w i of each dataset regarding the same algebraic expression. Each value w i was randomly generated in the interval [0.0, 5.0]. Thus, for each benchmark algebraic expression, we were provided with five datasets with the same inputs but different outputs.
As an example, let us consider the dataset generation for the benchmark algebraic expression of Equation (5): firstly, we generated 200 random samples for x 1 and x 2 , which were used as inputs for all datasets. After that, we generated five different values for each parameter w i : (w 1 1 , w 1 2 ) = (3.23, 4.12) for the first dataset; (w 2 1 , w 2 2 ) = (1.09, 2.35) for the second dataset; and (w 3 1 , w 3 2 ) = (2.17, 0.59), (w 4 1 , w 4 2 ) = (3.0, 0.64), and (w 5 1 , w 5 2 ) = (3.83, 1.1) for the remaining datasets. This allowed us to obtain different datasets with the same inputs but different outputs, and with a high correlation between all datasets, which is a prerequisite of our proposal.
Both single-and multi-objective approaches were tested in two cases: considering a low number of datasets, and considering a larger number of datasets. We performed an experimentation using three of the five datasets generated for each benchmark algebraic expression, and another experimentation using all datasets, with the goal of discovering if the number of datasets has an influence in the performance of the multi-objective approach.

Experimental Settings
For the experimentation, we used 12 mathematical operators (+, -, *, /, sqrt, pow, exp, sin, cos, log, min, and max) to allow symbolic regression to build algebraic expressions with any combination of these operators. We performed a parameter tuning with different values for both single-and multi-objective approaches to find a suitable configuration that allows us to achieve a better exploration of the search space. Thus, we tested different values for SLP size, genetic operators (crossover and mutation), stopping criterion, etc. After that test, the experimental settings that provided the best results in both approaches were: the population size of 180; allowing 15 rules for each SLP; and crossover and mutation probabilities of 90% and 10%, respectively. Finally, the stopping criteria was evaluating 40,000 solutions.
Each dataset was randomly divided into train (70% of data) and test (remaining 30%). The training data were used during the algorithm execution, and the test data were used once the algorithm finished and returned an algebraic expression, so that we could prevent over-fitting and validate the results in previously unseen data. Finally, we ran 30 executions for each scenario (algorithm and problem) with different random seed numbers, in order to carry out a statistical test that helped us determine if there exist significant differences between the results obtained.

Results and Discussion
The results obtained for each approach and dataset (train and test results) are shown in Table 1. The first column of this table describes the item evaluated for each approach, i.e., Median, Best (lowest), and Worst (highest) Mean Square Error (MSE), the average execution time of each approach (item Time (s.)), the average size of the resulting algebraic expression found by the algorithms (item Size), and the average number of parameters used by the resulting algebraic expressions (item Parameters). We highlight that we used the median error as statistical analysis estimator since the resulting MSE error distributions of the algorithms do not follow a normal distribution. In these cases, the mean error cannot be considered an appropriate metric, and it is replaced by the median value.
Columns 2-7 contain the results provided by both single-and multi-objective approaches for each benchmark algebraic expression ( f 1 -f 6 ) that was used to generate three datasets, and Columns 8-13 gather the results after modelling five datasets generated with each benchmark algebraic expression. Finally, Rows 3-17 describe the results in train data, whereas Rows 18-26 gather the results in test data. To give support to the analysis, we also provide box plots about the MSE distributions in the test sets for each single-and multi-objective approach (see Figure 3).
In a preliminary analysis of the test results in Table 1 and Figure 3, we observed that the single-objective approach achieved the lowest median MSE in five of six problems with three datasets, and all problems with five datasets. In addition, the best solution found was provided by the single-objective approach in both cases. Finally, the worst MSE also suggests that the single-objective algorithm performed better in all problems with three datasets, and in five of six problems with five datasets. After this preliminary analysis, we validated this assumption statistically, using a non-parametric Kruskal-Wallis (KW) test with 95% confidence level to verify whether significant differences exist between the results found with each algorithm. Table 2 summarises the results of the test, and it contains the resulting p-value of the test for both training (Columns 2 and 3) and test data (Columns 4 and 5). If significant differences were found between the results found with the single-and multi-objective approaches (p-value < 0.05), then the results were marked: with "+" if the single-objective approach found a better solution; with "-" if the multi-objective algorithm was better; and with "x" if there were no significant differences between the solutions.
As shown in Table 2, our preliminary analysis was confirmed by results in Columns 4 and 5, since the single-objective approach improved significantly the multi-objective algorithm in all cases. After the statistical test was applied over the training error distributions, Table 2 confirms that the multi-objective approach was better in two of six problems with three datasets than the single-objective approach, considering training results; and there were no statistical differences in the remaining four problems. Regarding the problems with five datasets, the multi-objective approach was equivalent to the single-objective approach in three problems also considering training results, and the single-objective algorithm was better in the remaining three. On the other hand, the results provided in Table 1 as well as the statistical test results in Table 2 suggest that the single-objective algorithm improves performance over the multi-objective approach in the test data. This fact suggests that the multi-objective approach might be over-fitting the training data, therefore providing worse results in the test sets of each problem than the single-objective approach. This assumption is supported by the size of the resulting algebraic expressions, provided in Table 1. The single-objective algorithm could provide simpler (shorter) expressions, while the multi-objective approach provided more complex expressions that could overfit training data and could not generalise well to all datasets. This also affected the training time of each method. As the multi-objective algorithm considered larger expressions, its evaluation was computationally more expensive than the single-objective algorithm (see item Time (s.) in Table 1).    To conclude the analysis of results in this section, we note that the single-objective approach could find better solutions than the multi-objective approach in terms of generalisation, without any regards to the number of datasets for each problem. In this way, the multi-objective approach overfit the training data, especially when the number of objectives (datasets) was low.

Experimentation with Real Data
With the approaches validated under a controlled environment, this section describes the testing with real data. As problem statement, we were provided with a set of energy consumption time series of different buildings (one time series x i for each building i), together with an additional time series of exogenous data with the ambient temperature T. We name the energy consumption of a building i at time instant t as x i (t), and the temperature at time instant t as T(t). All buildings are located in the same area, so that the temperature time series is the same for all buildings.
Our goal was to find a single algebraic expression f such as x i (t + 1) = f (x i (t), x i (t − 1), · · · , x i (t − h), T(t + 1), T(t), · · · , T(t − h + 1), w i ), which can provide us with an approximation of the next energy consumption value of any building, x i (t + 1), considering previous values of the same energy consumption time series up to a time horizon h, and the h previous values of the temperature. The resulting algebraic expression that models the energy consumption time series must be the same for all buildings, except for the parameters W i , which are different for each building.
This could be possible only if the data series of all buildings are not independent and there exists a relationship among them. For this reason, it was important to check as prerequisite that the correlation coefficient of all time series in a dataset is medium or high, as a measurement of dependency (not necessarily causality) between all time series. If this hypothesis were confirmed, then the proposed methods would be expected to find a generalisation of that behavior in terms of a parameterised general algebraic expression able to predict the energy consumption of each building.

Data Acquisition
We used a dataset that contains the energy consumption of seven different buildings of University of Granada (south of Spain) from March 2013 to October 2015, named from B 1 to B 7 for confidentiality reasons. Each building is equipped with a set of sensors that monitor their energy consumption (kW/h) hourly. A Building Automation System (BAS) is responsible to monitor the energy consumption measured for each sensor of each building and store all data in a database.
The raw data stored in the database needed to be preprocessed because there could be missing data due to light cuts, sensor failures, etc. The data preprocessing process used in this experimentation is shown in Figure 4. The first step of this preprocessing stage consisted in seeking missing values and interpolating each of them (which are 5% of the data). After that, a time alignment was necessary to obtain the data consumption in the same temporal range. Besides, since this experiment attempted to predict the energy consumption of different buildings using the data of previous days, it was necessary to calculate the energy consumption for each weekday as the aggregation (addition) of the kW consumed in 24 h of the same day. Finally, to work with uniform data, we normalised the data in the interval [0.0, 1.0] (see Equation (10), where v i is the response value, v min is the minimum response observed, v max is the maximum response observed and r normalised is the normalised response value). Once all data were preprocessed, we transformed the univariate dataset into a multivariate dataset, with eight dimensions, each one for a weekday plus the weekly temperature.
Once all data were preprocessed, it was necessary to carry out a preliminary correlation study between the energy consumption of all buildings to know if energy consumption data series are medium or highly correlated as the initial hypothesis to apply the proposed models. As the labels low, medium or high correlation levels are subjective and problem dependent, in our research, we considered low correlations with values less than 0.3, whereas correlation values between 0.3 and 0.7 were considered to medium correlation and values higher than 0.7 were considered highly correlated. As a result of this study, we found two clusters with medium or high correlation coefficients. Buildings B 1 and B 2 formed a cluster, and the second cluster was composed of Buildings B 3 -B 7 . An example of the energy consumption data series of each set of buildings is shown in Figure 5.  As shown in Figures 6 and 7, there are medium (0.3 < R < 0.7) and high (R > 0.7) positive correlations between the energy consumption of the selected buildings in each cluster. Thus, we were provided with two different scenarios to test our approaches: One being composed of two datasets with Buildings B 1 and B 2 and another one with five datasets with Buildings B 3 -B 7 . To ease the explanation in the experimental section, we named each cluster of buildings as E 1 (for Buildings B 1 and B 2 ) and E 2 (for Buildings B 3 -B 7 ).
In the experiments, we divided each dataset into two subsets: training (70%) and test (30%), to avoid over-fitting in the solutions found. In this way, the training set was used to build the model of each approach and the test set was used to check the quality of the solutions found, following the same methodology presented in Section 4.1. Therefore, to verify the robustness of each approach, we show in Section 4.2.3 the results obtained in each training and test subsets.

Experimental Settings
To define the experimental settings of each approach used in a real scenario with energy consumption data, we performed a trial-and-error procedure to find the optimal parameters for the algorithm. As is described, we allowed a maximum size of 15 operations (SLP size), which can use a set of 12 mathematical operators (+, -, *, /, sqrt, pow, exp, sin, cos, log, min, and max). Thus, as mentioned above, the input variables for each approach were composed of the energy consumption registered in the previous h weekdays (h = 6 for this experimentation) and the temperature registered for the last weekday. In addition, we permitted a set of five parameters for each building, which were estimated for each building during the algorithm execution. Then, the crossover and mutation probabilities were established as 90% and 10%, respectively. Finally, both approaches ran 30 times with different random seeds to analyse the results statistically. The stopping criteria used for each algorithm was to have 40,000 solutions evaluated.

Results and Discussion
The results of this experimentation are organised in Table 3. In this table, Column 1 shows the item evaluated for each single-and multi-objective approach. Thus, rows labeled as Median describe the median Mean Square Error (MSE) obtained in the 30 experiments for both approaches. Rows named as Best and Worst gather the minimum and maximum MSE obtained in the 30 runnings. Then, the average time needed to obtain a solution by each approach is shown in rows labeled as Time (s.). Rows labeled as Size encode the average size of each algebraic expression found in each run (calculated as the number of mathematical operators used in the algebraic expression found), and the number of parameters used in the algebraic expression found with each single-and multi-objective approach are also shown in rows tagged as Parameters. Finally, Columns 2 and 3 of the table gather the results obtained by both single-and multi-objective approach for each cluster of buildings (E 1 and E 2 ), respectively, in training data and Columns 3 and 4 show the results obtained by both single-and multi-objective approach in test data. Finally, last row of Table 3 describes the results of Kruskal-Wallis test. To provide a better analysis of the results of Table 3 we have also included the box plots of the MSE distributions in the test sets for both single-and multi-objective approach in Figure 8.  In a first analysis of the results of both single-and multi-objective approach in test data in Table 3, we may observe that the single-objective approach obtained the lowest MSE median value in both clusters of buildings. Moreover, since the single-objective approach provided the best solution in terms of best MSE in both scenarios and the multi-objective approach found the worst solution in both sets of buildings E 1 and E 2 , this preliminary analysis suggests that the single-objective approach is potentially better than the multi-objective approach.
To give support of this analysis, we again performed a non-parametric Kruskal-Wallis test (KW) with a 95% confidence level to statistically validate if there are significant differences between each approach. The results of the KW test are presented in the last row of Table 3, where Columns 2-3 and 4-5 show the resulting p-value of the KW test for each cluster of buildings and approaches for both training and test datasets. If significant differences were found between the results obtained with the single-and multi-objective approach (p-value < 0.05), then the results were marked with the symbol + if the single-objective approach was better; with the symbolif the multi-objective approach provided better solutions; and with the symbol x if both approaches were equivalents.
Regarding the results of the KW test in Table 3 in test data (Columns 4 and 5), the single-objective approach provided better results than the multi-objective approach in all cases, thus confirming our hypothesis that the single-objective is better than the multi-objective approach in test data. Box plots in Figure 8 visually confirm that there are significant differences between the results found by the multi-objective approach in the problems of buildings in E 1 and buildings in E 2 , respectively. Nevertheless, regarding the results of the KW test in train data (Columns 2 and 3 of the last row of Table 3), we may observe that the multi-objective approach provided better solutions for E 1 (two objectives), and no significant differences between the single-and multi-objective approach were found in E 2 (five objectives).
If we analyse the results of each approach in train data in Table 3 regarding the median and best MSE value, the multi-objective approach provided the lowest value in the cluster of buildings E 1 , and the single-objective approach achieved better results in the second cluster of buildings E 2 . Nevertheless, in both scenarios, the multi-objective approach performed worse in terms of MSE value. This fact, together with the results of the KW test, suggests that the multi-objective approach suffered over-fitting in the training procedure, and therefore it shows the same performance we found in synthetic datasets in Section 4.1. Therefore, as an example of the solutions found by both approaches in real energy consumption data, Equations (11) and (12) gather the best SLP found by each single-and multi-objective approach, respectively. We also want to remark that, although we allowed SLPs with size equal to 15, the equations shown are the useful code for each SLP. Then, the algebraic expression encoded by each SLP can be calculated by generating the last rule of each equation. Finally, regarding the average algebraic expression size shown in Table 3, we cannot conclude that there are significant differences between both single-and multi-objective approaches, since the multi-objective approach found smaller solutions for E 2 , whereas the single-objective approach provided smaller solutions in the first cluster of buildings E 1 . Nevertheless, regarding the computational time, the multi-objective approach was considerably higher than the single-objective approach in both cases, which may be a consequence of the Pareto-optimal solution search in multi-objective algorithms that need more computational time, as argued in Section 2.
To conclude with the analysis of the results, Figures 9 and 10 show in colour purple the original dataset of the cluster of buildings E 1 (Buildings B 1 and B 2 ) and E 2 (Buildings B 3 -B 7 ), and the results of the predicted data from both single-and multi-objective approaches in red and green colours, respectively. The results plotted with each approach were obtained with the best algebraic expression found from each approach (see Equations (11) and (12)). In this way, we may conclude that not only was the single-objective approach able to find better solutions than the multi-objective approach in terms of generalisation, without regards the number of datasets for each problem, but also that the provided algebraic expression is an accurate model of energy consumption for all buildings in the same compound E 1 or E 2 . As a general summary of this experimental section, we may conclude that the approach in this manuscript is able to find a generalised regression hypothesis able to explain multiple datasets. More specifically, when applied to real energy consumption data series, the proposal could find a general explanation about how energy consumption evolves in different buildings, and to provide a single formula that is valid to explain the energy consumption behaviour of all buildings.

Conclusions
This work describes our new formulation of energy consumption forecasting in the case where multiple energy consumption time series are under study. To use such approach, we require that all the input time series have a medium or high correlation and, consequently, that a common ground behaviour that can explain (at least partially) all time series exists. As we have not found any other approach in the literature that addresses this particular problem, we developed an experimentation to test the approach under a controlled environment in the laboratory, and then using real data. According to the experimental results, we can confirm that the results in both synthetic and real data are consistent, and that the single-objective algorithm proposed is the best option to find such algebraic expression. Besides, we observed that the performance of Pareto-based multi-objective algorithms decreases as the number of objectives increases, since both experiments showed that the multi-objective algorithm was significantly worse than the single-objective approach when the number of datasets was high. A final observation is that there were significant differences between the error distribution in training and test datasets, which suggest us that the multi-objective approach may over-fit the training data.
In summary, we have provided the scientific community with a new tool to analyse and generalise multiple datasets, and to perform data mining over energy consumption modelling problems. Thus, the proposed model opens new opportunities not only in energy consumption forecasting, but also in other topics such as time series data summarisation, energy profile mining, and anomaly detection. Future works attempt to apply ontologies not only to automatically select the most affordable Pareto-solution regarding semantic knowledge, but also to reduce the search space including knowledge about algebraic expressions.