Office Building Tenants’ Electricity Use Model for Building Performance Simulations

Abstract: Large office buildings are responsible for a substantial portion of energy consumption in urban districts. However, thorough assessments regarding the Nordic countries are still lacking. In this paper we analyse the largest dataset to date for a Nordic office building, by considering a case study located in Stockholm, Sweden, that is occupied by nearly a thousand employees. Distinguishing the lighting and occupants’ appliances energy use from heating and cooling, we can estimate the impact of occupancy without any schedule data. A standard frequentist analysis is compared with Bayesian inference, and the corresponding regression formulas are listed in tables that are easy to implement into building performance simulations (BPS). Monthly as well as seasonal correlations are addressed, showing the critical importance of occupancy. A simple method based on power drain measurements, aimed at generating boundary conditions for the BPS, is also introduced; it shows that, for this type of data and number of occupants, no further complexity is needed to obtain reliable predictions. For an average year, we overestimate the measured cumulative consumption by only 4.7%. The model can be easily generalised to a variety of datasets.


Introduction
It is well known that the energy use of buildings depends remarkably on occupant behaviour, which includes, e.g., indoor climate parameter preferences, how different systems are operated, as well as when and by how many people the buildings are used. The reviews [1,2] show a systematic performance gap between predicted and actual energy consumption of buildings; in some cases, the difference can reach even 300%. The authors conclude that this may be due to a loose implementation of the occupants' behaviour in building energy performance simulations (BPS), where climate data and physical characteristics of the building are addressed in detail whilst occupancy schedules are fixed according to generic standards.
Upon recognition of this problem, in the past decade a number of research efforts have considered how to model the impact of occupant behaviour and lifestyles on building energy use, either for direct implementation into BPS [3][4][5][6][7] or with a more general formulation [8][9][10].
A common modelling approach relies on logistic regression, which is used in statistics to analyse and model binary dependent variables; an example is the open/closed status of a window, which can depend, e.g., on a temperature increase or on the carbon dioxide concentration [11]. Another interesting procedure by Zhao et al. [3] divides the users' role into "active" versus "passive", treating the occupants as "disturbances" in the HVAC control; the individual occupant behaviour is then implemented in BPS software. Klein et al. [5] use a distributed comfort evaluation based on
In this paper we consider energy consumption modelling of a large office building in Sweden. Nordic buildings constitute an interesting case, since the literature addressing their consumption is at the moment rather scarce. Furthermore, climate and cultural differences (regarding, e.g., occupancy and lighting schedules as well as holiday seasons) in principle make this a case study with peculiar characteristics. Even the indoor temperature setpoints differ from the European values [25], being 21.5 °C for heating and 23 °C for cooling for the building considered in this paper.
Our input data are very large in number (five years of operation with a five-minute time step), covering everything from district heating to the HVAC system and the tenants' electricity consumption. The latter, on which we focus our attention, has an unusual feature: the tenants' appliance plug loads and lighting are aggregated together. Furthermore, no occupancy rate could be measured. These two features make our investigation fairly unconventional, considering the current literature.
Another research question we wish to answer here is how much we can simplify the analysis and reduce the time, resources, and effort required, so that the adopted approach is as simple as possible while still retaining modelling and predictive accuracy. In this paper we tackle the issue on the side of data analysis, by generalizing measurement patterns and simplifying a method first introduced in [26]. Our main concern is constructing a simple predictive method that can be easily and promptly implemented into building performance simulations.
The paper is structured as follows: in Section 2 we address the dataset structure and methods of analysis, while Section 3 reports our results, featuring daily and monthly energy consumption as well as the predictive model for implementation into BPS. Section 4 is devoted to a discussion of the results and Section 5 to our conclusions.

Methods
The data examined in this paper pertain to a portion of a large office building in Stockholm, Sweden, shown in Figure 1 [27]. The building block consists of office spaces with only a marginal share of technical rooms; the data have been gathered only from the office spaces. These lie on 8 floors, with a total heated area of 19,642 m². The building was constructed in 2014 and has been awarded a Triple LEED Platinum certificate; according to the building owner, the number of occupants is on the order of 1000 persons.

The energy consumption data were collected from January 2014 to March 2019, with a 5 min time step; they comprised external temperature measured in different locations, cold and hot water consumption, district heating, facility and tenant electricity load, free cooling pumps, room temperatures, ventilation devices and controllers, air pumping for humidity control, indoor air quality control, and air heating. The temperature sensors' precision was ±0.1 °C, while that of the electric consumption loggers was 0.1 kW.
Due to the high collinearity between most of these predictors (which is addressed in preprocessing, Section 3.1), in this paper we shall address only the tenants' consumption measurements, which were obtained via a common meter for the appliances' plug load and space lighting. The appliances included were those typical of an office work environment, namely personal computers, printers as well as fridges, microwaves, and coffee machines in the kitchens.

Data Preprocessing
The data analysis was performed with the software R [28], through various packages. Considering the full set of variables, we checked for the existence of near-zero variance and computed correlations between the 7 predictors, obtaining the correlation matrix in Figure 2.
Preprocessing relied on probability distributions of measurements, which were built and analysed with several methods, for instance normality checks such as the P-P plot and the Shapiro-Wilk test. We smoothed the data with a simple moving average that was computed by averaging five values at a time. The time variable was also rescaled and normalized between 0 (=0:00) and 1 (=24:00), with 12 o'clock equal to 0.5, to reduce the sensitivity to the rounding of digits.
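The two preprocessing steps above can be illustrated with a minimal sketch. The analysis in this paper was carried out in R; the plain-Python version below is only an illustration, and the function names, the sample readings, and the choice of an unpadded window (the text does not state whether the window is centred or trailing) are assumptions.

```python
from datetime import time


def moving_average(values, window=5):
    """Simple moving average over `window` consecutive values (no padding).

    Assumption: the window slides one sample at a time and the output is
    shorter than the input by window - 1 samples.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def normalise_time(t):
    """Map a clock time to [0, 1): 0:00 -> 0.0 and 12:00 -> 0.5.

    24:00 (= 1 in the text) cannot be represented by datetime.time and
    would correspond to 0:00 of the following day.
    """
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds / 86400.0


# Hypothetical 5-min readings in W/m^2, for illustration only.
readings = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
smooth = moving_average(readings)
noon = normalise_time(time(12, 0))
```

With six readings and a five-value window, the smoothed series has two entries, one per window position.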
For the daily energy consumption, we could estimate and fit with good precision the most suitable data distribution with the aid of the Cullen and Frey graph, as well as the QQ and PP plots.

Statistical Analysis of Energy Consumption
For modelling energy consumption towards a predictive model, we compared different approaches, namely common least squares regression and a Bayesian approach [10,16,24,29]. The latter used a Markov chain Monte Carlo (MCMC) method to obtain posterior predictions. In other words, we generated a random binary sequence (Monte Carlo), which was then ordered by means of a constant transition matrix where the i-th event depended only on the (i-1)-th event (homogeneous Markov chain).
First, for the regression we chose a generalized linear mixed-effects model (GLMM), which adds random effects (by means of a matrix Z that encodes deviations in the predictors across specified groups) to the fixed effects addressed by Generalized Linear Models (GLMs). In a frequentist analysis, the Z matrix coefficient b is regarded as a random error term.
For Bayesian inference, we assumed a normal distribution for the smooth data, consistently with our findings (see Section 3.2). Additional care was needed regarding the target average proposal acceptance probability during fitting, since the default value of 0.95 led to 34 divergent transitions. A larger value of 0.99 corresponds to a smaller step size used by the numerical integrator, which made the sampling more robust compared to the default value. The simulation used 30,000 iterations for increased accuracy.
The MCMC predictions were then compared with another popular approach, cross-validation (CV) [30]. Generally speaking, k-fold CV consists of randomly splitting the data into two sets: one that is used to train the model (e.g., 75% of the data) and the test set, i.e., the remaining data (25%), which are used to test the model. The process is repeated k times, and the model can be trained with a variety of methods; here we used the Metropolis algorithm, which uses MCMC as well. We thus split the data into Train (75%, 56 observations) and Test (25%, 17 observations) subsets, and then used a k-fold CV with k = 10 and a cubist fit model (i.e., a rule-based model that extends Quinlan's M5 model tree, see [31] and [32]).
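The mechanics of k-fold cross-validation can be sketched as follows. This is not the pipeline used in the paper: a straight-line least squares fit is swapped in for the cubist model purely to keep the example self-contained, and the 73 synthetic observations (matching only the total count of 56 + 17 quoted above), the function names, and the seed are assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def k_fold_mse(xs, ys, k=10, seed=0):
    """Return the held-out mean squared error of each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # k disjoint, near-equal folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        scores.append(mse)
    return scores


# Synthetic, noise-free data (hypothetical): 73 points on a straight line.
xs = [i / 72 for i in range(73)]
ys = [2.0 + 3.0 * x for x in xs]
scores = k_fold_mse(xs, ys, k=10)
```

Each observation is held out exactly once, so the average of the fold scores estimates the out-of-sample error without setting aside a fixed test set.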

Seasonal and Monthly Correlations
For addressing seasonality and weather correlations, the outdoor temperature was recorded every 5 min by a sensor installed on the outer facade of the building, while sunshine duration data were retrieved from the database [33]. In the following, every "monthly" datapoint (temperature, consumption, etc.) refers to hourly-averaged measurements covering the entire 5 year period, from 2014 to 2018.
Correlation coefficients and matrices for energy consumption versus climate were computed, together with the Pearson coefficient and an additional Kendall test. Since the latter is more sensitive than the Pearson test, we deemed it more suitable for drawing the monthly correlation matrix. The main R functions and packages that were used in this work are listed in Table 2.
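For reference, the two correlation measures can be computed as in the following sketch. It is written in plain Python for illustration (the paper's analysis used the R functions listed in Table 2); the temperature and load values are hypothetical, and the Kendall coefficient is the tau-a variant, which assumes no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs (no ties assumed)."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical monthly means: outdoor temperature [degC] and tenant load [W/m^2].
temperature = [-3.0, -2.0, 1.0, 6.0, 11.0, 16.0]
load = [9.5, 9.8, 9.0, 8.1, 7.9, 6.0]

r = pearson(temperature, load)
tau = kendall_tau(temperature, load)
```

Pearson measures linear association, whereas Kendall depends only on the ordering of the values and is therefore unaffected by monotone rescaling of either variable.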

Results
In this section we shall report the data analysis in detail, using a descriptive step-by-step approach guiding the reader through the development of a simple method for generating energy consumption profiles for application in BPS.

Data Preprocessing
First of all, we checked that the full set of seven predictors showed no problematic elements with near-zero variance. Using hourly averages of data consumption over the entire measurement period 2014-2018, we obtained the correlation matrix in Figure 2.
The mean outdoor temperature has units of [°C]; all the others have [kW] units. A value of +1 means perfect positive correlation, −1 perfect negative correlation, and 0 no correlation. Specifically, one can notice that HVAC system and facility electricity are highly correlated (of order ~0.98), as expected; correlations with tenants' consumption (0.90) and DHW (0.76) are high as well. Tenants' consumption and DHW are highly correlated too, with value 0.91.
Such high collinearity simplifies the analysis remarkably, since one can reduce the set of predictors and safely investigate only the tenants' energy consumption.
On the other hand, strong negative or null correlations do exist as well. The outdoor temperature has a highly negative correlation (−0.8) with district heating, whereas its correlation with HVAC, DHW, and tenants' consumption is very weak, ~0.02. We will address the influence of outdoor temperature and sunshine hours on tenants' plug loads in detail in Section 3.3.1.

Daily Energy Consumption
Let us begin by analysing the measured tenants' energy consumption per square meter for a single day, which will provide a benchmark with good time resolution as the data were collected every five minutes. Measurements for Wednesday (a mid-week working day) of a central February week (large occupancy) from 2014 to 2019 are plotted in Figure 3. Specifically, in this section we shall analyse Wednesday 18 February 2015, as it does not deviate too much from the general trend, yet it constitutes a nontrivial example due to some outliers (at about 14:52).

The raw data showed a large dispersion during daytime. The maximum value, 11.49 W/m², was found at about 3 PM. Since we were interested in the occupied hours, the time interval was narrowed to between about 7 AM and 10 PM. Following the classification scheme of the time periods in a typical working day by Zhou et al. [24], we divided the working day into three subperiods, namely 1. Arriving-at-work, 2. Daytime, 3. Off-work. The scatter and box plots for Arriving-at-work (06:45-09:30) are shown in Figure 4, which also features, as a reference, a smooth curve fitted by Loess (a local polynomial regression that is controlled at each point by the nearest data [42]).
We recall that an outlier is any datapoint that lies more than 1.5 times the interquartile range (IQR) below the 25th percentile or above the 75th percentile, the IQR being the distance between the 25th and 75th percentile values for that variable. By this criterion, there are no outliers here. Moreover, the correlation coefficient between time and consumption was 0.83, which is satisfactory, indicating that the trend is approximately linear.
For Daytime (09:30-15:30) we obtained instead a very scattered plot, with a low correlation of 0.13. The measurement error was computed as 0.61 W/m²; therefore, it is clear from Figure 5 that many outliers were present.
For the Off-work period (15:30-22:00), the box plot did not show any outliers; however, the correlation was 0.13. To understand the data distribution and compare our measurements with those of similar cases such as [24], we thus performed further tests on the daily dataset.
The Arriving-at-work data (06:45-09:30) in Figure 6 correspond to a distribution with median 7.97 and mean 7.996. The estimated standard deviation (sd) is 1.176, with skewness and kurtosis of 0.052 and 2.27, respectively. The low skewness is consistent with a symmetric normal distribution, while the kurtosis approaching the value of 3 means that there are no outliers, or that the tails are well-behaved. This confirms the box plot in Figure 4.
For the Daytime data (09:30-15:30), we find a median of 9.29 and a mean of 9.47. The estimated sd is 0.816, with skewness 0.569 and kurtosis 2.51. The kurtosis is still close to 3, thus we confirm the absence of outliers for this case too. The positive skewness is however about ten times larger than for the nearly symmetric Arriving-at-work distribution, i.e., this dataset corresponds to a right-skewed distribution.
Finally, the Off-work period gives a median of 6.72, a mean of 6.65, sd = 1.538, and a negative skewness −0.008, with kurtosis 2.25. The distribution is indeed very slightly left-skewed, again with no outliers.
The three normal distributions in Figures 6 and 7 are therefore symmetric, right-skewed, and left-skewed, respectively, see e.g., [43]. These differ from Zhou et al. [24], who instead found Poisson for the Arriving-at-work and Off-work periods, and Normal for Daytime. This might be due to the tenants' appliances and lighting energy usage being aggregated together, since no other qualitative differences seem to distinguish the two cases.
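The descriptive statistics quoted in this section (median, mean, sd, skewness, kurtosis, and the 1.5 × IQR outlier rule) can be reproduced with a short sketch such as the one below. It is a plain-Python illustration rather than the R code used for the paper; the sample values are hypothetical, the moments are population (biased) estimates, and the quartiles use the "inclusive" interpolation of the standard library, all of which are assumptions that may differ slightly from the conventions of the original analysis.

```python
from statistics import mean, median, pstdev, quantiles


def skewness(xs):
    """Third standardised moment; 0 for a symmetric sample."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def kurtosis(xs):
    """Fourth standardised moment (non-excess); 3 for a normal distribution."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4)


def iqr_outliers(xs, k=1.5):
    """Points lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]


# Hypothetical symmetric sample of power densities [W/m^2].
sample = [6.0, 7.0, 7.5, 8.0, 8.5, 9.0, 10.0]

summary = {
    "median": median(sample),
    "mean": mean(sample),
    "sd": pstdev(sample),
    "skewness": skewness(sample),
    "kurtosis": kurtosis(sample),
    "outliers": iqr_outliers(sample),
}
```

A positive skewness indicates a longer right tail and a negative one a longer left tail, which is the sense in which the Daytime and Off-work distributions are called right- and left-skewed above.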

Least Squares Interpolations
Aiming to generate fitting curves for predicting the measured daily energy data, we preprocessed the measurements as illustrated in Section 2.1. A comparison of smooth and original datasets is given in Figure 6. The Daytime dataset (9:30-15:30) was then cut into subperiods because it is the most difficult period to interpolate. Fitting the smooth data with least squares returned the formula for the five subperiods as $E\ [\mathrm{W/m^2}] = a + bt + ct^2 + dt^3 + et^4$, $t \in [0,1]$, with coefficients listed in Table 3.
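A least squares fit of the quartic form E = a + bt + ct² + dt³ + et⁴ on normalised time can be sketched as below. The sketch solves the normal equations with a small Gaussian elimination in plain Python, which is one possible route and not necessarily the one taken by the R routines used in the paper; the coefficients, sample times, and function names are hypothetical and serve only to check that the fit recovers a known curve.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_quartic(ts, es):
    """Least squares coefficients (a, b, c, d, e) via the normal equations."""
    powers = range(5)
    A = [[sum(t ** (i + j) for t in ts) for j in powers] for i in powers]
    rhs = [sum(e * t ** i for t, e in zip(ts, es)) for i in powers]
    return solve(A, rhs)


def quartic(coeffs, t):
    """Evaluate a + b t + c t^2 + d t^3 + e t^4 by Horner's rule."""
    a, b, c, d, e = coeffs
    return a + t * (b + t * (c + t * (d + t * e)))


# Hypothetical "true" coefficients and normalised sample times in [0, 1].
true_coeffs = (9.0, 2.0, -5.0, 3.0, 1.0)
ts = [i / 20 for i in range(21)]
es = [quartic(true_coeffs, t) for t in ts]

fitted = fit_quartic(ts, es)
```

Keeping t within [0, 1], as done in the preprocessing, limits the spread of the powers of t and helps keep the normal equations reasonably well conditioned.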

Statistical Inference of Energy Consumption
As illustrated by the R² values in Table 3, a common least squares regression could fit the smooth data for a winter day with good accuracy. Indeed, the predicted and measured cumulative values [MWh] differ by only about 4.7-5%. This was possible because of the well-known property that any curve can be approximated by n parabolas, if n is sufficiently large.
Although this is common practice, we wondered whether it was instead possible to fit the daily dataset as a whole curve. We therefore adopted a Bayesian approach and attempted to model the dataset with this alternative prescription.
A comparison of measured data with predictions from both the MCMC and cross-validation methods is given in Figure 8. The 10-fold cross-validation managed to reproduce the observations rather well, although its overall profile looks more regular; remarkably, MCMC with GLMM could instead match even the outliers with no overfitting. MCMC had an MSE of 1.13 × 10⁻⁵, corresponding to an RMSE of 0.0033; cross-validation instead reported an MSE of 1.66 × 10⁻⁴ and an RMSE of 0.0129. We obtained R² = 0.903 for the train set and R² = 0.926 for the test set, with a very small overfitting of −0.0227. CV underestimates the energy consumption by only 0.12%, while MCMC overestimates it by 0.01%.
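The error measures quoted above are related in a simple way: the RMSE is the square root of the MSE, and the overfitting figure is read here as the difference between the train and test R² (an interpretation suggested by the numbers, not stated explicitly in the text). The following plain-Python sketch defines the metrics and cross-checks the reported values; the function names and the small observed/predicted series are hypothetical.

```python
from math import sqrt


def mse(observed, predicted):
    """Mean squared error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    return sqrt(mse(observed, predicted))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Cross-check of the values reported in the text.
rmse_mcmc = sqrt(1.13e-5)          # reported RMSE: 0.0033
rmse_cv = sqrt(1.66e-4)            # reported RMSE: 0.0129
r2_gap = 0.903 - 0.926             # reported overfitting: -0.0227

# Hypothetical observed and predicted series, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
example_mse = mse(observed, predicted)
example_r2 = r_squared(observed, predicted)
```

The recomputed square roots agree with the reported RMSE values to within rounding of the published digits.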

The Structural Dataset for Daily Values
The distributions for January weekday and weekends look very similar, while for June the weekend shows a sharp difference in the peaks' ratio. From the comparison in Figure 9, clearly one cannot use the daily profile that was analysed and fitted in the previous sections, as the trend deviates qualitatively from that of other typical days.
On average, the peak consumption occurred mostly at around 9:30 (Figure 9), rather than at 15:00 (as on Wednesday 18 February 2015). As a representative day we accordingly chose Wednesday 24 January 2018 (middle of the week, out of holidays), aiming to reflect an average daily profile for weekdays without the outliers appearing for the February day in Section 3.2.1.
Interpolating the January measured values with a standard least squares method returns a 2.1% difference in total energy consumption between fit and measurements, which is satisfactory. However, for higher accuracy and for reasons that will be explained in Section 3.3.2, the coefficients listed in Table 4 according to the generic formula $E\ [\mathrm{W/m^2}] = a + bt + ct^2 + dt^3 + et^4$, $t \in [0,24)$, were obtained with a nonlinear analysis, by applying to the fit an "energy constraint" that minimizes the difference between observed and fitted values [26] (the round bracket at t = 24 h ensures that midnight is not overcounted). Moreover, we demanded that the fitted values at the joining points (at 9 AM and 3 PM for weekdays in Table 4) differ by less than 0.01 W/m². The resulting constrained fit returns an observed/fitted difference of only 0.05% for the January consumption; it will be called the "structural curve" henceforth. In Figure 9b, the structural curve is compared to the energy profiles computed by averaging the data, hour by hour for each day, over all the working weeks from 2014 to 2018. While the prediction does overestimate the peak value, the overall trend is clearly matched by the structural curve; such overestimation is expected, since averaging over large ensembles tends to reduce peak values. Predictions for two random May and September weekdays are compared with data in Figure 10.
For the weekend instead we fitted energy data from an average profile, since the Saturday and Sunday curves had a large variety of trends and thus a specific representative could not be identified. The Weekend fit reported in Table 4 matches the daily consumption within a ~5% error.
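How a piecewise structural curve of this form could be evaluated, checked for continuity at the joining points, and integrated into a daily energy is sketched below in plain Python. The segment boundaries at 9 AM and 3 PM follow the text, but the coefficients are hypothetical placeholders (they are not the values of Table 4), and the constrained fitting procedure of [26] itself is not reproduced.

```python
# (t_start, t_end, (a, b, c, d, e)) with t in hours on [0, 24).
# The coefficients are hypothetical placeholders, not those of Table 4.
SEGMENTS = [
    (0.0, 9.0, (3.0, 0.5, 0.0, 0.0, 0.0)),
    (9.0, 15.0, (7.5, 0.0, 0.0, 0.0, 0.0)),
    (15.0, 24.0, (15.0, -0.5, 0.0, 0.0, 0.0)),
]

HEATED_AREA_M2 = 19642.0  # total heated area quoted in the Methods section


def poly(coeffs, t):
    """Evaluate a + b t + c t^2 + d t^3 + e t^4."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def power(t):
    """Power density [W/m^2] at hour t; t = 24 is excluded (half-open interval)."""
    for t0, t1, coeffs in SEGMENTS:
        if t0 <= t < t1:
            return poly(coeffs, t)
    raise ValueError("t must lie in [0, 24)")


def segment_energy(t0, t1, coeffs):
    """Exact integral of the polynomial over [t0, t1], in Wh/m^2."""
    return sum(c * (t1 ** (k + 1) - t0 ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))


def joint_jumps():
    """Absolute mismatch between adjacent segments at each joining point."""
    return [abs(poly(left, t1) - poly(right, t1))
            for (_, t1, left), (_, _, right) in zip(SEGMENTS, SEGMENTS[1:])]


daily_energy = sum(segment_energy(*seg) for seg in SEGMENTS)   # Wh/m^2 per day
daily_energy_mwh = daily_energy * HEATED_AREA_M2 / 1e6         # MWh per day
```

Comparing `daily_energy` with the measured daily total gives the observed/fitted difference that the energy constraint is meant to keep small, while `joint_jumps()` expresses the 0.01 W/m² continuity requirement.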

Monthly Energy Consumption
In this section we first briefly analyse the energy consumption data on a monthly basis; then, based on the January profile in Table 4, we evaluate correlation formulas for the 11 remaining months, which are to be used as boundary conditions for BPS. The differences between months are highlighted, and the search for possible weather correlations helps in determining the impact of occupancy on energy consumption. To our knowledge, such analyses are still very scarce in energy consumption studies of large office buildings, the only exception being [44].
Averaging the hourly power [kW] over each month and then dividing by the corresponding February consumption gives the ratios in Table 5; February is the month showing the largest consumption. Since a striking, albeit expected, difference exists between autumn/winter and summer months, one might wonder whether this is correlated to weather or to occupancy.
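As a small illustration of how such ratios are formed, the sketch below averages hourly power per month and normalises by February. The three short series are hypothetical placeholders, not values from the dataset, and the dictionary names are assumptions made for the example.

```python
from statistics import mean

# Hypothetical hourly power samples [kW]; a real series would hold every hour of the month.
hourly_power_kw = {
    "Jan": [40.0, 90.0, 80.0],
    "Feb": [50.0, 100.0, 90.0],
    "Jun": [30.0, 70.0, 60.0],
}

# Monthly mean power, then the ratio to the February mean.
monthly_mean = {month: mean(values) for month, values in hourly_power_kw.items()}
ratios = {month: value / monthly_mean["Feb"] for month, value in monthly_mean.items()}
```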

Seasonal Variations
It is easy to see that weather and consumption are weakly correlated, as shown in Figures 11-13 for consumption vs. measured irradiation hours and average outdoor temperature. The measurements are averaged over the entire period 2014-2018. This is confirmed by an additional Pearson test, which returns -0.67 and -0.70, respectively, as well as by the values in Table 5. Domestic hot water (DHW) consumption is also independent of the external temperature, while only air heating shows a clear correlation, as expected and as illustrated in Figure 12. In more detail, Figure 14 considers two specific days in winter and in summer. As a reference for an average winter profile, we chose one week in January (the coldest month). The hourly consumption values are averaged over the same weekday of each week of the month.

Monthly Correlations
The density distributions of consumption data (hourly values averaged over 2014-2018) are very similar to each other, the only exception being February (see Figure 15). Interestingly, one finds a strong linear correlation between months, as shown by the correlation matrix in Figure 16. The "1" entries are due to the rounding of R-squared values that are very close, yet obviously not identical, to unity. Correlation coefficients for each month against January, which constitutes our structural dataset, are provided in Table 5 for both weekdays and weekends.
The numerical values are computed with the Kendall test, which is more sensitive than the Pearson test. The correlation is very strong; only April has a couple of outliers in Figure 17, where each month is plotted against January (the interpolation formula shown holds for May, as an example). We also wondered about the role of the Christmas holidays, which are distributed less evenly than summer vacations: Figure 18 addresses the working week starting on 21 December and ending at Christmas, which is compared to a summer week (the last working week of June).
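For readers who wish to reproduce this kind of month-against-January comparison, the following sketch computes the Pearson coefficient and a tie-free Kendall tau (tau-a) in plain Python. The two profiles are hypothetical: the second is a monotonic but nonlinear transformation of the first, chosen only to show that a rank-based measure and a linear one can differ. The function names are assumptions of this example; SciPy's `scipy.stats.pearsonr` and `scipy.stats.kendalltau` offer tested alternatives.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs; ties count as zero."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)

# Hypothetical hourly averages [W/m^2] for January and for another month.
e_jan = [2.0, 2.1, 3.5, 7.8, 8.0, 6.2, 3.0]
e_other = [v ** 1.5 for v in e_jan]     # monotonic, nonlinear relation (illustrative)

r = pearson_r(e_jan, e_other)
tau = kendall_tau_a(e_jan, e_other)
```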
At this point, let us recall that we aim to obtain formulas for a prompt implementation in BPS software. One immediate option would be to simply produce interpolation formulas for each month as in Table 4 and list them in 12 tables, with a loss in generality; here we prefer a more general and time-saving approach that is described in the next section.

The Monthly Structural Dataset and Prediction Formulas
Following Ref. [26] and using its terminology, we first identify January as our representative month with its structural dataset, then derive for each other month the corresponding correlation formula with January, written in the form $E\,[\mathrm{W/m^2}] = A \times E_{\mathrm{Jan}} + B$, from the measured consumptions as in Figure 17. Now one only needs to input the interpolation coefficients for January (Table 4) and the correlations for each month (Table 5) into any simulation software, then proceed with, e.g., annual simulations of energy consumption. This is a very simple method, based on linear interpolations, which naturally returns only a small sensitivity to the coefficients, as well as a small error (see [26] for a more detailed discussion). January is chosen because, according to Figure 16, it is the month with the highest correlation with any other month.
Let us immediately test this approach by comparing June measurements (24 hourly averages over weekdays), with an error of 0.61 $\mathrm{W/m^2}$, with the values obtained by means of the correlation coefficients in Table 5. For generating the average January weekday profiles, namely the 24 hourly $E_{\mathrm{Jan}}$ values, we shall use either linear regression (Table 4) or MCMC (Figure 8). Figure 19 shows the result for June, when using the January structural coefficients in Table 4. The agreement is excellent, with every point within the experimental error and an average percentage residual of 3.7% for weekdays and 2.46% for weekends. Using instead Bayesian inference (MCMC here) to compute the structural dataset lowers the average percentage residuals to 1.83% and 1.7%, respectively.
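The monthly prediction step can be sketched as follows, assuming an ordinary least squares estimate of $A$ and $B$ and a mean absolute percentage residual as the accuracy measure; both choices, the function names, and the six-point profiles are illustrative assumptions rather than the values behind Table 5 or Figure 19.

```python
def fit_linear_map(e_ref, e_month):
    """Least squares estimate of A and B in E_month = A * E_ref + B."""
    n = len(e_ref)
    mx, my = sum(e_ref) / n, sum(e_month) / n
    sxx = sum((x - mx) ** 2 for x in e_ref)
    sxy = sum((x - mx) * (y - my) for x, y in zip(e_ref, e_month))
    a = sxy / sxx
    return a, my - a * mx

def mean_percent_residual(predicted, measured):
    """Average of |predicted - measured| / measured, in percent."""
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical hourly averages [W/m^2]: January reference and June measurements.
e_jan = [2.0, 3.0, 6.0, 8.0, 7.0, 4.0]
e_jun_measured = [1.5, 2.1, 4.7, 6.1, 5.5, 3.0]

A, B = fit_linear_map(e_jan, e_jun_measured)
e_jun_predicted = [A * e + B for e in e_jan]
residual_pct = mean_percent_residual(e_jun_predicted, e_jun_measured)
```

In a BPS workflow the reference values would come from the January structural curve rather than from a hand-typed list, and the resulting pair $(A, B)$ would be stored once per month.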
The plot in Figure 19 also features purple crosses, i.e., the hourly consumption computed according to the occupancy profile (solid purple line) of the EU Standard EN 16798-1 [25]. The Standard estimates 12 $\mathrm{W/m^2}$ for lighting and 6 $\mathrm{W/m^2}$ for occupancy, neglecting the weekends completely and assuming null occupancy throughout. We thus conclude that in this case one cannot infer energy consumption directly from the EN 16798-1 occupancy profile. This is confirmed by comparing the cumulative measured data for an average year with the profiles obtained with the structural dataset (Tables 4 and 5), which gives a 4.69% overestimation of the total consumption when using our formulas. On the other hand, simply assuming the occupancy profile from the standard EN 16798-1 returns an underestimation of the real consumption of about 27%.
A comparison with the yearly cumulative consumption for each year is given in Table 6 (the year 2019 was not considered, as the data were recorded only until 20th March).
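A cumulative comparison of this kind can be sketched as below. The example is deliberately simplified: it covers only three months, ignores the weekday/weekend distinction, and uses a hypothetical January profile, hypothetical $(A, B)$ pairs, and a placeholder in lieu of a metered total. All names are assumptions of this illustration.

```python
from math import exp

DAYS_PER_MONTH = {"Jan": 31, "Feb": 28, "Jun": 30}   # subset of the year, for brevity

# Hypothetical January structural profile [W/m^2], one value per hour.
e_jan = [2.0 + 6.0 * exp(-((h - 9.5) / 4.0) ** 2) for h in range(24)]

# Hypothetical (A, B) pairs; real values would come from the monthly correlation table.
monthly_coeffs = {"Jan": (1.0, 0.0), "Feb": (1.05, 0.1), "Jun": (0.8, -0.2)}

def predicted_energy(e_ref, coeffs, days):
    """Cumulative energy [Wh/m^2] from E = A * E_ref + B, summed over hours and days."""
    total = 0.0
    for month, (a, b) in coeffs.items():
        daily = sum(a * e + b for e in e_ref)    # 1 h steps: W/m^2 sums to Wh/m^2
        total += days[month] * daily
    return total

def percent_deviation(predicted, measured):
    """Signed deviation in percent; positive values mean overestimation."""
    return 100.0 * (predicted - measured) / measured

predicted = predicted_energy(e_jan, monthly_coeffs, DAYS_PER_MONTH)
measured = 0.96 * predicted      # placeholder standing in for a metered total
deviation = percent_deviation(predicted, measured)
```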

Discussion
The daily energy consumption analysed in Section 3.2 showed some peculiar characteristics. Interestingly, the distributions for each period differ from those of the analogous cases of large office buildings in China [24] and in Austria [10]. While one cannot be sure whether this is due to geographical differences (lighting component) rather than to appliance usage, this combined evidence might suggest that generalizing consumption patterns of a building typology at an international level is not as straightforward as it seems.
Besides, we have shown in Section 3.2.1 that fitting the data for a nontrivial profile with the ordinary least squares method returned an error of only ~5%, in agreement with [18]. A more sophisticated Bayesian approach showed an even higher accuracy, with an error well below 1%. In other words, it is quite feasible to model this type of energy data with high accuracy even with simple methods; models should be chosen according to the specific BPS.
Regarding the seasonal variations and weather correlations, the preprocessing showed that there exists a high collinearity between the electricity consumption of the facility and that of the HVAC system. Furthermore, these are also highly correlated with the DHW consumption and with the tenants' plug load and lighting consumption. Only the district heating is (negatively) correlated with the outdoor temperature, as expected. These results accordingly point to occupancy as the main cause of energy consumption fluctuations in the building.
More specifically, one can notice the low R-squared values for a linear fit in Figure 13, showing low correlation with the climate. Consumption is indeed larger in August than in December and May, and in March compared to January. December, June, and July, i.e., the holiday months, are particularly intriguing: June and July incidentally have smaller consumption as well as higher temperatures/irradiation. Interestingly, May recorded about the same irradiation yet larger consumption. Accordingly, comparing these two plots might also suggest a more prominent role of plug load (i.e., of occupancy) than of lighting. January weekdays are almost indistinguishable from each other, the only exception being Friday (Figure 14). The weekend is slightly more interesting, as Saturday's curve is different from Sunday's. It can be shown that Monday and Tuesday have a similar density distribution, as do Wednesday, Thursday, and Friday; the daily consumption shows in any case little variation within each of the two sets (weekdays and weekends). These considerations are consistently valid for all the twelve months, with little variance. Interestingly, the corresponding distributions are all skew normal, in contrast with the profiles plotted in Wang [16] and Ding [29] for large Chinese buildings. We can speculate that this might be due to the plug loads and lighting being aggregated together.
Identifying the structural dataset and computing the structural curve followed the principles explained in [26]. The specific dataset reflected the general trend qualitatively: looking for common patterns in the full data is the most important step in the analysis. Choosing, e.g., the February day analysed in Sections 3.2.1 and 3.2.2, with its outliers and peaks at ~3 PM, would have determined a different outcome at least in terms of daily consumption distribution, if not of annual energy simulation. Since the Weekend profiles were instead highly different for all the five years considered, we used a profile that was constructed by averaging the hourly values over the full dataset. We finally remark that here we showed only one way to obtain the structural curve, for illustrative purposes: any other method can be equally valid, since a whole class of very sophisticated statistical models already exists, as summarized in Table 1.

Conclusions
As building performance simulations constitute an increasingly common practice among engineers for energy analyses at different levels, there is a need to bridge the gap between observations and predictions with simple yet reliable approaches for implementation into BPS, suitable for academics as well as practitioners. In this paper we have examined whether this is possible, by analysing what is presently the largest ensemble of energy consumption data of an office building in a Nordic country.
Considering the employees' appliances plug consumption and lighting for a calendar year, we found no evident weather correlation. Rather, holiday periods and their impact on occupancy proved to be more important: the highest consumption occurred in February and the lowest in June. All the months were highly correlated.
Daily consumption patterns showed on average a clear peak at around 9:30 AM for weekdays, while weekends exhibited a larger variance. Comparing regression curves obtained with frequentist inference to those obtained with Bayesian inference (MCMC), we obtained an extremely high accuracy with the latter; nevertheless, the least squares method returned only a 2% error, which is low enough for implementation into BPS.
These results lead to the definition of an analytical bottom-up model for predicting any daily consumption, given a benchmark daily profile, with an error smaller than or equal to 5%. The numerical coefficients reported in Tables 4 and 5 can be immediately implemented into building performance simulations addressing comparable datasets.
More generally, we believe that the procedure detailed in Sections 3.2.3 and 3.3.3 is a simple yet effective analysis blueprint that can be directly employed in energy consumption assessments and BPS modelling of a variety of building, location, and study typologies. The flexibility of the original method, which was postulated for DHW consumption data of residential buildings [26], should be evident from the present context, where it was adapted to a case study with very different datasets.
In conclusion, this paper analysed a substantially large amount of data obtained during five years of measurements, addressing tenants' plug load and lighting consumption when aggregated together. Further studies should investigate the role of tenants' plug load when it is considered separately, how different occupancy models could impact the overall energy use, and how the prediction model benefits investigations through BPS by applying the structural dataset method to, e.g., annual simulations with IDA ICE and other simulation programs.