3.1. The Model
The aim of the model presented herein is to find the most essential hours: those at which power consumption best describes the consumption at the remaining hours of the same day. Finding such a set of hours in the regular (exhaustive) way poses two problems. First, since we do not know in advance how many hours are sufficient, this number would have to be assumed before the calculations. Second, the number of combinations to investigate is very large. The presented algorithm solves both problems simultaneously: it requires neither parameters nor prior assumptions, while being fast and needing only a limited number of calculations.
The whole data set consists of 24 time series of hourly electricity consumption. At the beginning, all of these hourly consumptions are described (dependent) variables. The model is constructed in steps. In the first step, the first hour is chosen and becomes a describing (independent) variable, while the remaining 23 hours are still described variables; twenty-three single-variable linear regressions are built and used to assess precision. In the second step, the hour with the lowest predictability is chosen as another describing variable, and twenty-two two-variable linear regressions are constructed. Each following step expands the model by changing the type of one variable from described to describing. During each step, another hour is chosen until the demanded accuracy of the model is reached. In this way, we obtain a model based on multiple linear regressions that is self-configuring and parameterless. The way the variables are chosen is described in the next section.
Let us consider a set of hours $H = \{1, 2, \ldots, 24\}$. Let $X_h = (x_{h,1}, x_{h,2}, \ldots, x_{h,N})$ denote a variable (time series) whose values are the total electricity consumption at hour $h \in H$ on each day $i = 1, \ldots, N$, where $N$ is the total number of analyzed days. The basis of the algorithm is a multiple-equation linear regression model of the form:

$$X_h = \alpha_{h,0} + \sum_{j \in S} \alpha_{h,j} X_j + \varepsilon_h, \qquad h \in H \setminus S, \quad (1)$$

where $S \subset H$ is the set of describing hours, $\alpha_{h,j}$ are model parameters, and $\varepsilon_h$ are random error terms. The number of equations is related to the number of describing variables: in the case of $k = |S|$ variables, the model consists of $24 - k$ equations. The covariance matrix $\Sigma$ of the error terms has dimensions $(24 - k) \times (24 - k)$, with the residual variances $\sigma_h^2$ on the main diagonal of the $\Sigma$ matrix.
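For concreteness, the following is a minimal Python sketch of how model (1) could be fitted by ordinary least squares for a given set of describing hours. The function name `fit_model` and the data layout (one row per day, one column per hour) are our own assumptions, not taken from the paper.

```python
import numpy as np

def fit_model(data, S):
    """Fit the 24 - k equations of model (1) for describing hours S.

    data -- array of shape (N, 24): consumption of each day at each hour
    S    -- list of k describing hours (column indices)
    Returns a dict mapping each described hour h to its residual vector.
    """
    A = np.column_stack([np.ones(len(data)), data[:, S]])  # shared design matrix
    residuals = {}
    for h in (h for h in range(24) if h not in S):
        coef, *_ = np.linalg.lstsq(A, data[:, h], rcond=None)
        residuals[h] = A @ coef - data[:, h]               # e_{h,i}: theoretical - empirical
    return residuals
```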
There are several calculated measures necessary to construct the model and evaluate its quality. For each model equation, we calculate the standard deviation of the residuals (the root of the mean squared error). The formula takes into account the number of independent variables $k$ in the model and has the form:

$$S_h = \sqrt{\frac{1}{N - k - 1} \sum_{i=1}^{N} e_{h,i}^2}, \quad (2)$$

where the error $e_{h,i} = \hat{x}_{h,i} - x_{h,i}$ is the difference between the theoretical and empirical values at hour $h$ on the $i$-th day. Measure (2) is used during variable selection for model (1) and to evaluate the quality of the model. The quality of the model regressions is also measured by means of two relative measures: the relative standard deviation and the relative residual standard deviation. The first one has the form:

$$V_h = \frac{S_h}{\bar{x}_h}, \quad (3)$$

where $\bar{x}_h$ is the mean consumption at hour $h$, while the second one is:

$$W_h = \frac{S_h}{s_h}, \quad (4)$$

where $s_h$ is the sample standard deviation of $X_h$. The quality of the whole model (1) is measured using the mean values of measures (2)–(4) calculated over all $24 - k$ equations of the model:

$$\mathrm{MSD} = \frac{1}{24-k} \sum_{h \in H \setminus S} S_h, \quad (5)$$

$$\mathrm{MRSD} = \frac{1}{24-k} \sum_{h \in H \setminus S} V_h, \quad (6)$$

$$\mathrm{MRRSD} = \frac{1}{24-k} \sum_{h \in H \setminus S} W_h. \quad (7)$$
Abbreviations MSD, MRSD, and MRRSD denote the mean standard deviation, the mean relative standard deviation, and the mean relative residual standard deviation, respectively. Formulas (2)–(7) provide various ways of measuring model quality: the first three evaluate the quality of individual model equations, and the next three evaluate the quality of the whole model. The smaller their values, the more consistent the equation or model is with the data. At the same time, Formulas (2)–(7) define various types of errors, and their values show the level of uncertainty of the obtained results. In the remainder of this work, we refer to these quantities by the numbers of the formulas that define them.
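The measures can be computed along the following lines; this is a minimal sketch assuming the reconstructed forms of (2)–(7) above and reusing the hypothetical `fit_model` helper. The function names are ours.

```python
import numpy as np

def equation_measures(e, y, k):
    """Measures (2)-(4), as reconstructed above, for one model equation.

    e -- residual vector of the equation, y -- empirical values,
    k -- number of describing variables in the model.
    """
    S = np.sqrt(np.sum(e**2) / (len(e) - k - 1))   # residual std. dev., measure (2)
    V = S / y.mean()                               # relative std. dev., measure (3)
    W = S / y.std(ddof=1)                          # relative residual std. dev., measure (4)
    return S, V, W

def model_measures(data, S_hours):
    """Mean measures (5)-(7) over the 24 - k equations: MSD, MRSD, MRRSD."""
    res = fit_model(data, S_hours)                 # hypothetical helper from above
    m = [equation_measures(e, data[:, h], len(S_hours)) for h, e in res.items()]
    return tuple(np.mean(m, axis=0))               # (MSD, MRSD, MRRSD)
```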
3.2. Model Construction
The model is built based on the learning data set consisting of $n = 1583$ days. In step zero, we construct 24 models (1) with one describing variable each. Each model consists of 23 regression equations $X_h = \alpha_{h,0} + \alpha_{h,j} X_j + \varepsilon_h$, where $j \in H$ and $h \in H \setminus \{j\}$. For each model, we calculate the MRSD measure (6). The values of the measure are plotted in Figure 5 versus the index of the describing variable. The lowest values are located mainly between hours 11 and 15; those models are characterized by the lowest mean relative standard deviation.
In the first step, we choose the best model: the one with the lowest MRSD. The describing variable of that model is the first chosen hour. This hour is denoted by $h_1$, and the model has the form $X_h = \alpha_{h,0} + \alpha_{h,1} X_{h_1} + \varepsilon_h$, $h \in H \setminus \{h_1\}$. During steps 2 to 23, the describing variable is chosen on the basis of another rule: the worst described hour is selected. We calculate the root mean squared error (2) for each equation. The second step starts by choosing the variable with the maximum measure (2), $h_2 = \arg\max_{h \in H \setminus \{h_1\}} S_h$; this is the worst described variable, and it now becomes one of the describing variables. Twenty-two linear regressions with variables $X_{h_1}$ and $X_{h_2}$ are evaluated, and measure (2) is calculated for each equation. The remaining steps proceed in the same manner as step 2 until a model with 23 describing variables is obtained. However, the procedure can be stopped earlier, once the assumed precision is reached; a sketch of the procedure is given below.
Table 2 summarizes the hours chosen by the algorithm in steps 1 through 23.
3.3. Statistical Evaluation of the Model
We evaluated the statistical significance of the parameters of each regression of model (1). The $F$ test was used, whose statistic is the ratio of the explained variance to the unexplained variance. The obtained values of $F$ are high, ranging from $1.9 \times 10^3$ to $8.3 \times 10^5$. The results are presented in Figure 6 on a logarithmic scale. The values of $F$ greatly exceed the critical values $F^*$, which are in the range of 0.004–0.568 for the significance level $\alpha = 0.05$. The results clearly show that all of the regressions of model (1) are statistically significant.
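As an illustration, here is a minimal sketch of the classical $F$ statistic for one regression equation, assuming the usual variance decomposition; the function name and the use of scipy's $F$ distribution for the critical value are our own additions, and the paper's convention for the reported critical values may differ.

```python
import numpy as np
from scipy import stats

def f_statistic(y, y_hat, k):
    """F statistic of one equation: explained vs. unexplained variance."""
    ess = np.sum((y_hat - y.mean())**2)            # explained sum of squares
    rss = np.sum((y - y_hat)**2)                   # unexplained (residual) sum of squares
    return (ess / k) / (rss / (len(y) - k - 1))

# Upper-tail critical value at significance level alpha = 0.05 (assumed convention):
# f_crit = stats.f.ppf(0.95, dfn=k, dfd=N - k - 1)
```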
Figure 7 shows the values of the relative standard deviation (3) for every equation of model (1). In the first step, $V_h$ does not exceed 7% of the average power consumption, and its mean value is about 5%. We observe a systematic decrease of $V_h$ through the steps; for example, already in steps 4–6, $V_h$ is less than 3%, and the following steps yield even lower values. The results indicate good to very good agreement of the models with the data.
Next, we tested the null hypothesis $H_0$ of agreement between the residual distribution and the normal distribution for every model equation and step. The Kolmogorov–Smirnov (K-S) test was used; its statistic measures the distance between the empirical cumulative distribution of the sample and the cumulative distribution function of the normal distribution. The large samples we used allowed us to evaluate the empirical distributions precisely. For comparison, we also applied Marsaglia's version of the K-S test [37]; the results of both tests were very similar. For 256 regressions, about 70.8% indicated no reason to reject the null hypothesis. The share of positive results increased with the steps: from 30.1% for $k = 1$ and 70.0% for $k = 4$ to 100.0% for $k = 12, \ldots, 23$. In about 29.2% of cases, the test indicated a rejection of $H_0$; it is important to note that even in these cases the residual distributions did not deviate significantly from the normal distribution. The values of the skewness and kurtosis of the residual distributions are shown in Figures 8 and 9.
The values of skewness are around 0, and the values of kurtosis are mostly located slightly below 3. Skewness is spread from −1 to 1, but from step 5 onward it stays between −0.4 and 0.4. Kurtosis assumes values between 2 and 4 in the first steps, concentrates between 2.5 and 3.5 during the next steps, and lies in the range of 2.6–3.0 at the end of the procedure. The observed values of skewness and kurtosis indicate an approximate agreement between the distribution of the random terms and a normal distribution. These results indicate that the linear models were estimated properly and that predictions based on them should be consistent over the whole range of the independent variables.
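These diagnostics could be reproduced along the following lines; this is a sketch assuming scipy's standard K-S test against a fitted normal distribution (the paper additionally used Marsaglia's version, which is not shown here), with kurtosis reported in the non-excess convention so that a normal distribution gives 3. The function name is ours.

```python
import numpy as np
from scipy import stats

def residual_diagnostics(e, alpha=0.05):
    """Normality diagnostics for the residual vector of one equation."""
    z = (e - e.mean()) / e.std(ddof=1)             # standardized residuals
    ks = stats.kstest(z, 'norm')                   # K-S test against N(0, 1)
    return {
        'reject_H0': ks.pvalue < alpha,            # at significance level alpha
        'skewness': stats.skew(e),
        'kurtosis': stats.kurtosis(e, fisher=False),  # normal distribution -> 3
    }
```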
The presented algorithm chooses one described variable and converts it into a describing variable in each step. In order to obtain the model at step $k$, one has to go through steps 1 to $k$ and evaluate $\sum_{j=1}^{k} (24 - j) = 24k - \frac{k(k+1)}{2}$ equations. Thus, for step $k = 23$, it is necessary to evaluate a total of 276 equations. On the other hand, when using the regular approach, the previous steps can be skipped, but all combinations of $k$ describing variables must be considered. Each combination requires the evaluation of $24 - k$ equations; therefore, in order to build a model with $k$ variables, one has to evaluate $\binom{24}{k}(24 - k)$ equations. The number of evaluated equations at each step is presented in Figure 10a for both approaches.
The number of equations calculated in the regular approach exceeds the number of equations in our model by several orders of magnitude; for example, at step 10, those values are 27,457,584 and 185, respectively. A computer equipped with an Intel Core i7-8550U CPU and 16 GB of RAM needed about 197 h to calculate the models using the regular method (see Figure 10, right plot). The presented algorithm required less than 1 min of calculations.
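Both counts are easy to verify; the short sketch below reproduces the figures quoted above (the function names are ours).

```python
from math import comb

def greedy_equations(k):
    """Equations evaluated by the presented algorithm through step k."""
    return 24 * k - k * (k + 1) // 2               # sum of (24 - j), j = 1..k

def exhaustive_equations(k):
    """Equations required by the regular approach for k describing hours."""
    return comb(24, k) * (24 - k)

print(greedy_equations(10), exhaustive_equations(10))  # 185 27457584
print(greedy_equations(23))                            # 276
```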
We now compare the accuracy of model (1) with the accuracy of the models yielded by the regular method. To evaluate the qualities of the models, we used the two measures defined by Formulas (5) and (6). The results are presented in Figure 11 for steps 3, 4, 5, and 6. The horizontal axes show the mean relative standard deviation (MRSD) and the vertical axes show the mean standard deviation (MSD). The blue points indicate the qualities of the models obtained by considering all of the $k$-element subsets. The highlighted points reflect the results for model (1) and are located among the models with the best values of both measures. A few models have lower values of the measures; however, these differences do not appear to be statistically significant. In order to assess the accuracy of the measures used, we adopted the bootstrap method: we generated 5000 bootstrap samples, evaluated a model for each sample, and calculated both measures. The 95% bootstrap confidence intervals are listed in Table 3 for steps 3 to 6.
The results are also presented in the inset plots of Figure 11. Model (1) is indicated by a green diamond, and the 95% bootstrap regions are marked in red. The bootstrap regions are reflected by the 95% confidence intervals on the axes; they extend beyond the lowest obtained values of both measures. Therefore, the obtained results show that the observed differences between the best models are not statistically significant. The presented model produces results statistically consistent with the best models.