Empirical and Comparative Validation for a Building Energy Model Calibration Methodology

The digital world is spreading to all sectors of the economy, and Industry 4.0, with the digital twin, is a reality in the building sector. Energy reduction and decarbonization in buildings are urgently required. Models are the basis for prediction and preparedness for uncertainty, and building energy models have been a growing field for a long time. This paper proposes a novel calibration methodology for a building energy model based on two pillars: simplicity, because the number of parameters to be adjusted is reduced considerably (to four), and cost-effectiveness, because the methodology reduces the number of sensors needed to perform the process by 47.5%. The new methodology was validated empirically and comparatively based on previous work carried out in Annex 58 of the International Energy Agency (IEA). The use of a tested and structured experiment adds value to the results obtained.


Introduction
With the continuing challenges posed by climate change, a growing number of countries around the world are implementing measures to reduce energy consumption and greenhouse gas emissions. The increasing deployment of energy efficiency measures in the building sector provides an important avenue for reducing energy demand and carbon dioxide emissions, and even for generating new energy production and distribution facilities. According to Navigant Research [1], the global market for energy efficiency in the construction sector will grow from USD 68.2 billion in 2014 to USD 127.5 billion in 2023. Commercial and residential buildings account for a high share of energy consumption in developed countries, around 40% [2], which necessitates a reduction in their energy use. For this reason, energy-efficient construction is now a key factor in energy policies at all levels [3].
Energy simulation has become a powerful tool for supporting the design of new buildings and for proposing Energy Conservation Measures (ECM) in existing buildings. It is used in the application of model predictive control (MPC) techniques [4][5][6][7][8][9], in actions for the operational optimization of Heating, Ventilation, and Air Conditioning (HVAC) systems [10,11], in economic strategies [12][13][14][15], and in the optimization of cost-effective building refurbishment [16,17]. In terms of building energy models (BEMs), Hensen and Lamberts [18] differentiated between the following types depending on the physical relevance of the parameters:
• White box models are based on physical models with exclusively physically meaningful parameters. They can provide the most detailed building performance characteristics, which can be applied in energy prediction for demand response applications or in establishing baseline models for ECM performance, among others.
share of variable power generation from renewable sources. The SABINA project responds to this need by targeting the cheapest possible source of flexibility: the existing thermal inertia in buildings and the resulting coupling between heat and electricity networks. The use of thermal inertia as storage capacity is known as a power-to-heat (P2H) solution [47,48]. High-quality (calibrated) models are the basis for the successful implementation of P2H to offer services to the grid [6,8]. Neymark et al. indicated that there are three ways of evaluating a whole-building energy simulation program: (1) empirical validation, which compares simulated data with monitored data from a real building or experiment; (2) analytical verification, which compares simulated data with results from verified numerical models; and (3) comparative testing, which compares a program with itself or with other programs or techniques within the same program [49]. In this paper, both empirical and comparative tests were carried out.
Empirical validation must be used if an absolute standard of truth is to be established (by comparing simulation results with a perfectly performed empirical experiment) [50]. Measurement error and uncertainty must be considered when constructing the model, and careful work is required to reduce the unknowns (geometry, materials, infiltration, ground, etc.). The National Renewable Energy Laboratory (NREL) divides empirical validation into levels depending on the degree of control over possible error sources [51].
The comparative test compares the results of the model with itself or with other energy models of the same building. This test does not generate input uncertainties and can be applied regardless of the complexity of the model; it is cheap and fast, and as many comparisons as desired can be performed. However, there are no absolute truth inputs, and only statistically based acceptance ranges are possible. Among the techniques for testing a calibration methodology by applying empirical, analytical, and comparative validation is the Building Energy Simulation Test for Existing Homes (BESTEST-EX) [52]. Additionally, the ASHRAE 140 standard covers the analytical and comparative validation tests [53]. The combination of analytical and comparative validation techniques reduces uncertainty in simulation processes where adjustment periods are short within the space in which they can be compared. By combining these options, model errors can be fixed and a more suitable solution can be reached [54][55][56].
In this study, based on previous calibration knowledge [57,58], a novel methodology was implemented with a reduced number of parameters (thermal inertia, infiltration, and thermal bridges), which dramatically simplified the technical complexity of the process. The calibrated BEM produced complies with the criteria of the ASHRAE and International Performance Measurement and Verification Protocol (IPMVP) standards: in temperature, with an hourly CV(RMSE) below 2%; in energy, with an hourly CV(RMSE) between 9% and 30%; and in both cases with an R² above 90%. The new methodology was validated empirically and comparatively based on previous work carried out in Annex 58 of the International Energy Agency (IEA). Ramos et al. [57] and Bandera et al. [58] published BEMs calibrated with a genetic algorithm that produced an hourly CV(RMSE) below 4% and an R² above 90% in temperature, but no information about energy was given; in both papers, fifteen parameters were used for calibration, and neither model was empirically validated. Tahmasebi et al. [23] produced a calibrated energy model of an office building using six parameters, with an hourly CV(RMSD) of 2.35% and an R² of 88% in temperature, without information on energy. Raftery et al. [59] presented an hourly CV(RMSE) of 8.72% in energy consumption with more than 20 parameters involved and without information about thermal zone temperature. Chaudhary et al. [60], in the Autotune project, reduced the hourly CV(RMSE) in energy to less than 5% by using 60 parameters. Chong et al. proposed a Bayesian calibration framework for an EnergyPlus model of the cooling system of a ten-story office building located in Pennsylvania, with an hourly energy CV(RMSE) of 6% using ten parameters and without information about the temperature in the building's thermal zones. Cacabelos et al. performed similar work with a small number of envelope parameters, obtaining an hourly CV(RMSE) of 5% in temperature and a monthly CV(RMSE) of 4%.
The remainder of this paper is structured as follows: Section 2 explains the methodology used for the validation of the calibration process and provides an analysis of the sensors provided and the sensors used to perform the analysis. Section 3 details the analysis of the results. Using uncertainty indices, the results were evaluated in both energy and temperature, and the final values of the objects used in the process are reported. The paper finishes with the conclusions, which are discussed in Section 4.

Empirical Validation and Comparative Test Methodology
In this paper, an envelope calibration technique was validated using empirical and comparative tests with a case study provided by Annex 58 [61]. High-quality calibration was achieved while considerably reducing the number of parameters and the number of sensors used in the original experiment: 42 of the 80 sensors were used, a reduction of 47.5%. This simplification does not reduce the quality of the calibration process because the as-built data are taken as the base model. The method of measuring the calibration quality of the models was widely studied in our previous scientific articles [62][63][64] and in other studies [65]. This novel methodology reduces the uncertainty of the error since it is able to find the optimal adjustment with a minimum number of parameters that fit the curve of the simulated data to the measured data. The methodology captures the building's thermal dynamics and therefore quantifies the available thermal mass. Thermal mass plays an important role in heat transfer within a building [66], but many simulation programs only consider the thermal mass of the envelope, neglecting what may exist inside the building (furniture, partitions, books, etc.) [67]. This means that thermal zones are treated as empty air-filled spaces [68][69][70]. Similarly, dynamic infiltration is difficult to determine because it depends on the size of the leaks in the building, the climatic conditions, the permeability of the facade, air flows, etc., all of which are complex to measure [71].
This methodology adjusts these values using real data, based on the high-quality models obtained. None of the seven tests and trials conducted in the buildings to provide useful information for the process were used. As such, the uncertainty caused by the input elements was decreased and cost-effectiveness was increased. We focused on twin houses in the German town of Holzkirchen, where all the as-built information was available, so that the energy model could be constructed with less uncertainty. The original experiment was performed by 21 other research groups with different techniques and simulation engines. To avoid commercial competition, no information was given on how to connect the model characteristics and results with specific simulation software. In this paper, all the information about the parameter estimation of the model is provided so that the experiment can be reproduced and validated. As far as the authors are aware, this is the first time this experiment has been used for empirical validation outside the Annex contest.

Selection of Data Provided by Annex 58 for Empirical Validation
The data used in this project were part of work completed by Annex 58, and all the information is available on the Internet [73]. The process of calibration started by feeding the base model the data collected from the sensors. Table 1 shows the sensors that were placed in the buildings. Two types of sensors can be distinguished: those that refer to the building (indoor temperature, energy consumption, etc.) and those that refer to weather conditions (exterior temperature, solar radiation, wind speed, etc.). In addition to the sensors, several analyses and tests were conducted in the houses to obtain additional information that may be useful for the empirical test:

• For the windows, Window 6.3 was used. This publicly available software provides a versatile heat transfer analysis method consistent with the updated rating procedure developed by the National Fenestration Rating Council (NFRC), which is consistent with the ISO 15099 standard. With Window 6.3, the optical properties of the glazing of the houses were calculated. The Fraunhofer IBP Institute (Institute for Building Physics) calculated the absorption capacity of the blinds.
• The transmittance values of the thermal bridges were calculated using the TRISCO and THERM software. The latter is a state-of-the-art program that performs two-dimensional conduction heat transfer analysis based on the finite element method. The Fraunhofer IBP Institute provided the U-values of the thermal bridges for windows similar to those built in the houses.
• For ventilation, apart from the sensors used, the PHluft program of the Passive House Institute was used. This is a free program that calculates the heat transfer between the ventilation ducts and the indoor environment.
• The infiltration of the houses was obtained with blower door tests. A blower door is a machine used to measure the airtightness of buildings. It can also be used to measure the air flow between built areas, to test the tightness of air ducts, or to help physically locate air leakage sites in the building envelope. A total of five blower door tests were performed across the two houses, covering the whole house and the rooms that were part of the experiment.
• For the ground, the short-wave reflectance was measured. Reflectivity measurements were recorded on both asphalt and gravel, and the ground temperature was recorded at various depths: 0, 0.05, 0.1, and 0.2 m.
To perform the novel calibration procedure, all the information was analyzed in depth. The sensors required for the process are listed and highlighted in grey in Table 1. This selection shows that the number of sensors was significantly reduced (42 out of 80, a 47.5% reduction) and that the sensors used were unobtrusive. Regarding the analyses and tests, none of the data provided were used to conduct the experiment. These findings show that the proposed methodology is cost-effective.
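As a quick check of the figure quoted above, and assuming the reduction is measured relative to the 80 sensors originally provided, the share of sensors left unused works out as:

```latex
\[
\frac{80 - 42}{80} = \frac{38}{80} = 0.475 = 47.5\%
\]
```

Under the same assumption, the calibration relied on the remaining 52.5% of the instrumentation installed for the original experiment.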

The Buildings
Once the data used to obtain the energy model had been analyzed and validated, the geometry of the buildings was drawn. To do this, OpenStudio [74] was chosen, and the generated volume was exported to EnergyPlus and jEPlus+EA to perform the optimization-based calibration process. EnergyPlus is an open-source building energy simulation engine developed by the U.S. Department of Energy (DOE), tested and validated through analysis and benchmarking.
This study was conducted on two prototype houses (N2 and O5) in the German town of Holzkirchen, south of Munich (Figure 1). The houses are twins and are located in a flat area without any buildings that could cast shadows on them in the summer period. They have three floors: basement, ground floor, and attic. The experiment focused on the ground floor, which includes a living room, kitchen, entrance, bathroom, corridor, and two bedrooms (Figure 2). It has a clear height of 2.495 m.
Once the geometry of the buildings was drawn, the physical properties and the thermal load of the buildings were introduced.
The documentation provided by the Annex 58 test includes all the materials and construction elements of the houses, detailed information about the windows (windows glass, frames, and dividers), and the thermal bridges.
For heating, the houses are equipped with Dimplex AKO K 810/K 811 electric radiators with an estimated rapid response time of 1 to 2 min and a radiative/convective split of 30%/70%. It was reported that the ventilation of the houses is mechanical. The supply and extraction points are located on the ceiling. Constant ventilation was provided during the whole test period, supplying 120 m³/h of air through the living room ceiling and extracting 60 m³/h through the bathroom and 60 m³/h through the child's room.

Experimental Design and Calibration Process
To carry out the study, the heat flows in the houses had to be sufficient to allow satisfactory energy models of reasonable quality to be constructed. To achieve this thermal excitation, the designers of the original experiment subjected the houses to three types of sequences: steady-state internal temperatures, a sequence of pseudo-random heat injections, and a period of free oscillation.
The experiment was conducted in August and September 2013 (from 1 August to 26 September) because the houses were only available on those dates. Although these are summer months, heating was used in the exercise. During this time, the houses underwent five periods of energization to reflect, among other things, the common conditions of the buildings and to ensure that the dynamic response was tested (Table 2).

• Period 1: In this first period, the aim was to achieve identical and well-defined starting conditions for both houses. To do this, they were heated to 30 °C for three days.
• Period 2: During the following seven days, the interior temperatures were kept constant at 30 °C using the building's control system. For the experiment, indoor temperatures were provided as inputs to the model, and the energy needed by the HVAC system to achieve those temperatures was requested.
• Period 3: In this period, a Randomly-Ordered Logarithmic Binary Sequence (ROLBS) was implemented for the activation of the living room radiator (the radiators in the other rooms were turned off, thus increasing the interaction between the units). This sequence was developed in the EC COMPASS project. The ROLBS, which aims to cover all relevant frequencies with the same weight, is a signal in which the on and off periods are chosen at logarithmically equal intervals and shuffled in a quasi-random order. This random sequence ensures that there is no correlation between the heat input by the HVAC system and the solar gains. This phase lasted two weeks, with heat pulses lasting from 1 to 90 h. The power of the radiator was limited to 500 W. During this stage, the energy consumed by the radiator was provided, and the energy model was asked to predict the interior temperatures of the rooms.
• Period 4: After Period 3, the thermal load of the houses was reset so that, in the following period, both houses started with the same temperature and internal energy conditions. To achieve this, a constant temperature of 25 °C was maintained for 7 days. As in Period 2, the indoor temperatures were provided so that they could be entered into the energy model, which was asked to predict the energy involved in raising the indoor temperature to 25 °C.
• Period 5: This was the last stage of the experiment. During this time, no energy was introduced into the buildings; they were left in free oscillation. The energy model was asked to reproduce the indoor temperatures using only the energy input provided by the external weather.
After developing the base BEM, the new calibration process began, with the aim of finding the best fit of energy and indoor temperatures for each of the designed periods. A script programmed in the EnergyPlus Runtime Language was developed. This script transferred the measured temperature to the model as a set point in Periods 2 and 4. Similarly, the real energy consumption was input to the model as an internal load in Periods 3 and 5. Under normal circumstances, one temperature sensor per thermal zone is enough, but more information can be added if available. Energy consumption is not essential in this process because the calibration can be performed in a free-float period.
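The excitation described for Period 3 can be illustrated with a short sketch. The code below is not the ROLBS generator from the EC COMPASS project; it is a simplified stand-in that assumes pulse durations equally spaced on a logarithmic scale between 1 h and 90 h, shuffled with a seeded random generator and alternated between heater on (500 W) and heater off. The function name `rolbs_like_schedule` and the parameters `n_pulses` and `seed` are hypothetical.

```python
import math
import random


def rolbs_like_schedule(min_h=1.0, max_h=90.0, n_pulses=8, power_w=500.0, seed=0):
    """Build an illustrative on/off heating schedule.

    Returns a list of (duration_h, power_w) segments. Durations are equally
    spaced on a logarithmic scale between min_h and max_h, then shuffled.
    This is a simplified stand-in, not the original ROLBS algorithm.
    """
    if n_pulses < 2:
        raise ValueError("n_pulses must be at least 2")
    log_min, log_max = math.log(min_h), math.log(max_h)
    step = (log_max - log_min) / (n_pulses - 1)
    durations = [math.exp(log_min + k * step) for k in range(n_pulses)]

    # Shuffle with a seeded generator so the pulse order has no regular
    # (for example daily) pattern but the schedule stays reproducible.
    rng = random.Random(seed)
    rng.shuffle(durations)

    # Alternate heater on / heater off, starting with the heater on.
    return [(d, power_w if k % 2 == 0 else 0.0) for k, d in enumerate(durations)]
```

Calling `rolbs_like_schedule()` returns eight segments. The total length of the schedule depends on `n_pulses` and is not tuned here to the two-week duration of Period 3.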
The calibration procedure represented in Figure 3 was repeated for the different periods described above. The methodology is similar to an optimization process, but the objective function is related to the fit between the real and simulated curves. The objective function of the calibration process, which guides the search toward the model with the best fit, is composed of the uncertainty indices proposed by the contest. In this case, the mean absolute error (MAE) and the Spearman rank correlation coefficient (ρ) were used, since they are the indices chosen in Annex 58, but other statistical indices such as CV(RMSE) are also possible. The Non-Dominated Sorting Genetic Algorithm (NSGA-II) [75] was selected as the engine for tracking the best solution. The search space is based on the combinatorial possibilities of four parameters: capacitance, thermal mass, infiltration, and thermal bridges. This is a 75% reduction compared with previous studies [57,58]. Another improvement is the introduction of a statistical index, in this case the MAE, for each thermal zone, which makes the process converge faster; in the previous studies, a maximum of two objective functions were in place, for total energy and for the CV(RMSE) of the average temperature of the thermal zones. In the proposed exercise, 21 research centers and universities from around the world participated. All members created their own energy models and reported their results for the different periods analyzed. The whole experiment was reported by Strachan et al. [61].
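The role of the per-zone MAE objectives in the search can be sketched as follows. NSGA-II ranks candidate models by Pareto dominance; the snippet below shows only that dominance test and a filter for non-dominated candidates, not the full algorithm (no crowding distance, selection, crossover, or mutation). The objective values are made-up numbers for illustration, with one MAE per thermal zone.

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (all minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical MAE values (one per thermal zone) for three candidate models.
candidates = [
    (0.30, 0.45, 0.50, 0.40),  # model A
    (0.35, 0.40, 0.55, 0.38),  # model B: better than A in two zones, worse in two
    (0.36, 0.46, 0.56, 0.41),  # model C: worse than B in every zone
]
front = non_dominated(candidates)  # A and B remain; C is dominated
```

With one objective per thermal zone, a candidate is discarded only when another candidate is at least as good in every zone and strictly better in at least one.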
For the execution of the experiment, we decided to perform two exercises: (1) the adjustment of the model for each period of analysis, which produced four models, one each for Periods 2 to 5 (Period 1 is the initialization of thermal conditions); and (2) the fitting of a single energy model adjusted to all the proposed periods (called the unique model).
To perform an equitable evaluation of the data obtained by the different energy models and to qualitatively show the degree of agreement between the different periods and between the temperature and energy predictions, we followed the organization of the Annex 58 test and used two evaluation metrics:

• To measure the magnitude of the fit, we used the mean absolute error (MAE) in Equation (1), which measures the difference between two continuous variables, considering two sets of data (one calculated and the other measured) related to the same phenomenon:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (1)$$

where $y_i$ and $\hat{y}_i$ are the real and simulated values, respectively, and $n$ is the number of values in the test sample.
• To assess the level of correspondence of the shape, we used Spearman's rank correlation coefficient, ρ, using Equation (2). This coefficient is a measure of association that uses the ranks (the order number of each value within its data set) and compares them:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)} \qquad (2)$$

where $d_i$ is the difference between the ranks of the real and simulated values for observation $i$.
For the representation of the obtained data, we chose the box plot because it is a standardized method that graphically represents a series of numerical data through its quartiles. The box plot shows the median and quartiles of the data at a glance and can also represent the outliers. The graphs (Figures 4-14) represent the results of the 21 participants in the Annex 58 contest plus the experiment in this study. They allow a clear and quick representation of the results obtained by all the participants, comparing them empirically and comparatively at the same time.
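As a minimal sketch, the two Annex 58 indices can be computed with the Python standard library. Spearman's ρ is obtained here as the Pearson correlation of the ranks, with tied values given their average rank; the helper names (`mae`, `average_ranks`, `spearman_rho`) and the sample values are illustrative assumptions, not part of the Annex 58 material.

```python
import math


def mae(measured, simulated):
    """Mean absolute error between paired measured and simulated values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(measured, simulated)) / len(measured)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(measured, simulated):
    """Spearman's rho as the Pearson correlation of the two rank series."""
    rx, ry = average_ranks(measured), average_ranks(simulated)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly indoor temperatures in degrees Celsius.
measured = [24.8, 25.1, 25.6, 26.0, 25.4]
simulated = [24.9, 25.0, 25.9, 26.3, 25.2]
zone_mae = mae(measured, simulated)
zone_rho = spearman_rho(measured, simulated)
```

The two indices are complementary: MAE reports how far the simulated curve is from the measured one, while ρ reports whether the two curves rise and fall together. Note that `spearman_rho` is undefined (division by zero) when either series is constant.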
The models were evaluated using the energy and temperature data obtained from the living room, south bedroom (children's room), kitchen, and north bedroom (Bedroom) of both houses, N2 and O5.
The standards for the evaluation of BEMs are provided by ASHRAE, IPMVP, and the Federal Energy Management Program (FEMP). As such, the indices proposed and recommended by those agencies for validation were included in this study: CV(RMSE), NMBE, and R².
CV(RMSE) (Equation (3)) is the coefficient of variation of the root mean square error, which is obtained by normalizing the RMSE by the average of the real values. This index considers the error variance as measured variability and is therefore recommended by ASHRAE Guideline 14, by FEMP, and by IPMVP.
NMBE (Equation (4)) is the normalized mean bias error. It is a normalized version of the MBE index that gives more precise information about the fit between two sets of values; in our case, the real values and those simulated by the energy models.
The coefficient of determination R² (Equation (5)) is the proportion of the variance in the dependent variable that is predictable from the independent variable. This coefficient is used to analyze how differences in one variable can be explained by differences in a second variable. R² is related to the correlation coefficient, r, which indicates how strong the linear relationship between two variables is.
The International Performance Measurement and Verification Protocol (IPMVP) states that a model can be considered calibrated if it achieves an NMBE within ±5% with a CV(RMSE) below 20% on an hourly scale. ASHRAE and the Federal Energy Management Program (FEMP) consider a model calibrated if its NMBE does not exceed ±10% and its CV(RMSE) is no more than 30% (Table 3). ASHRAE, in turn, recommends that models considered calibrated should not have an R² lower than 75%.
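A minimal sketch of how these indices and thresholds might be evaluated is given below. It assumes commonly used forms of the indices: normalization by the number of samples n (some formulations use n − p, with p the number of adjustable parameters), NMBE taken as measured minus simulated, and R² taken as the squared Pearson correlation. The forms used in Equations (3) to (5) may differ in these details. The function names and the `HOURLY_LIMITS` table are hypothetical.

```python
import math


def nmbe(measured, simulated):
    """Normalized mean bias error in percent (measured minus simulated)."""
    n = len(measured)
    mean_y = sum(measured) / n
    return 100.0 * sum(y - s for y, s in zip(measured, simulated)) / (n * mean_y)


def cv_rmse(measured, simulated):
    """Coefficient of variation of the RMSE in percent."""
    n = len(measured)
    mean_y = sum(measured) / n
    rmse = math.sqrt(sum((y - s) ** 2 for y, s in zip(measured, simulated)) / n)
    return 100.0 * rmse / mean_y


def r_squared(measured, simulated):
    """Squared Pearson correlation between measured and simulated values."""
    n = len(measured)
    my, ms = sum(measured) / n, sum(simulated) / n
    cov = sum((y - my) * (s - ms) for y, s in zip(measured, simulated))
    var_y = sum((y - my) ** 2 for y in measured)
    var_s = sum((s - ms) ** 2 for s in simulated)
    return cov ** 2 / (var_y * var_s)


# Hourly limits quoted in the text: (|NMBE| limit in %, CV(RMSE) limit in %).
HOURLY_LIMITS = {"IPMVP": (5.0, 20.0), "ASHRAE/FEMP": (10.0, 30.0)}


def is_calibrated(measured, simulated, standard):
    """Check NMBE and CV(RMSE) against the limits of the chosen standard.

    The limits are treated as inclusive here, which is an assumption.
    """
    nmbe_limit, cv_limit = HOURLY_LIMITS[standard]
    return (abs(nmbe(measured, simulated)) <= nmbe_limit
            and cv_rmse(measured, simulated) <= cv_limit)
```

For example, a simulated series that overestimates every measured value by the same amount can have an R² of 1 and still fail the NMBE limit, which is why the bias and scatter indices are checked together with the correlation. The R² recommendation of at least 75% from ASHRAE can be checked separately with `r_squared(measured, simulated) >= 0.75`.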

Analysis of Results and Discussion
To better understand the results, we introduced extra points into the box plot graphs (Figures 4-14). The green and yellow points are the best results of the Annex 58 contest; they are labeled participant 1 and participant 2. Three extra points were shown: (1) in red, the best model obtained with the proposed methodology; (2) in dark blue, the base model created from the data provided by the developers of the experiment, without any previous calibration process; and (3) in sky-blue, the unique model that is better adjusted to all the proposed periods.
Through the box plot graphs, we attempted to answer the following research questions for each analyzed period: (1) By how much is the base model (dark blue) improved by the calibration process (red or sky-blue)? (2) How far is the unique model (sky-blue) from the best model for each period (red)? (3) What is the position of the calibrated models (red and sky-blue) in relation to the best models of the contest (green and yellow)?
In this case, the base model (dark blue) was improved by the calibrated models. In MAE and ρ, almost all the thermal zones of the base model are in Quartile (Q) 3 or Q4; after the calibration process, they are consistently in Q1 or Q2 (red and sky-blue). The distance between the red and sky-blue models is small because they are systematically in the same quartile, with the red model in a better position. In relation to the best models of the contest (green and yellow), in MAE, the red model is normally in the best position, but in ρ, these models are better, especially the green model.
Figure 6 shows the MAE of the comparison of the indoor temperatures of the energy models with the interior temperatures of the houses in the period of constant set point (Period 2) in the living space. In this period, the models created with the proposed methodology (the Period 2 model, the unique model, and the base model) do not generate any uncertainty when representing the real temperature of the building, since this temperature, provided by the authors of the exercise, is introduced into the model as input data. The models reproduce the real temperature, and the resulting energy was analyzed in the previous figures. Our hypothesis is that the rest of the participants introduced error into the temperature in order to better fit the energy consumed by the model to the real energy.
The base model performed slightly better than in the previous case: MAE is in Q3 or Q2, near Q1 in some cases, and ρ is in Q2 or Q1. MAE was improved by the calibration process, in some cases from Q4 to Q1; the improvement in ρ was not so clear because most of the selected models were already in Q1 or Q2. The distance between sky-blue and red was not significant, never being more than one quartile. In relation to the position of the models (red and sky-blue) in the contest, for MAE, red consistently held a leading position, and for ρ, the amplitude of the boxes was smaller and all the models performed similarly. The kitchen was the zone where red and sky-blue performed the worst.
In Period 3, all the models that were calculated to validate the methodology (red, sky-blue, and dark blue) consumed the energy that the real building demanded to reach the requested temperature. This means that they had no uncertainty in energy, in contrast to the rest of the participants, who introduced errors in the energy to predict the measured temperatures as well as possible.
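The quartile positions (Q1 to Q4) quoted in this discussion can be obtained by locating a model's index value within the distribution of the participants' results. The sketch below assumes the "inclusive" quartile convention of Python's `statistics.quantiles`, which may differ slightly from the convention used to draw the box plots, and it assumes an index for which lower is better (such as MAE), so that Q1 is the best band. The function name and the sample values are hypothetical.

```python
import statistics


def quartile_position(value, population):
    """Return 1 to 4 for the quartile band of `population` containing `value`."""
    q1, q2, q3 = statistics.quantiles(population, n=4, method="inclusive")
    if value <= q1:
        return 1
    if value <= q2:
        return 2
    if value <= q3:
        return 3
    return 4


# Hypothetical MAE values reported by nine participants for one thermal zone.
participants_mae = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
calibrated_band = quartile_position(0.45, participants_mae)
base_band = quartile_position(1.05, participants_mae)
```

For an index where higher is better, such as ρ, the same function can be used after reversing the interpretation of the bands.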
Period 4 (fixed set point at 25 °C): The MAE and ρ of the comparison between the energy consumed by the energy model and the real energy required by the houses, for a temperature set point of 25 °C, are shown in Figures 10 and 11, respectively. Two outliers and one outlier, respectively, are not shown in the figures because they were not representative of the models.
The MAE of the base model oscillated in performance from Q4 to Q1, but it was improved in all cases by the calibration process (red and sky-blue), except in the living room of both houses, where the unique model was ranked in Q3, worse than the base model (Q1). For ρ, all the models performed similarly. In relation to the distance between red and sky-blue, we observed an anomaly in the living room of both houses because the unique model (sky-blue) performed even worse than the base model (dark blue). This is the only place where this happened; in the other thermal zones, the pattern was similar. In relation to the position in the contest, the red model was again ranked first in MAE, and for ρ, the values were similar for all the models.
Figures 6 and 12 show the MAE produced by comparing the indoor temperatures of the energy models with the interior temperatures of the houses in the living space during the constant set point periods (30 °C in Period 2 and 25 °C in Period 4). The models created for the experiment, as mentioned in previous paragraphs, do not produce any uncertainty error when representing the temperature of the building. The models provide the energy consumed when reproducing the interior temperature of the zones, as seen in the previous figure (Figure 12).

Period 5 (free oscillation): The MAE and ρ of the comparison between the interior temperatures reached by the energy models and the real indoor temperatures of the dwellings are shown in Figures 13 and 14, respectively. In this period of free oscillation, there is no artificial energy input to the buildings; the only heating is that produced by the weather. Four outliers are not shown in Figure 14 because they were not representative of the models.
In this case, the MAE of the base model was always improved except in the kitchen and bedroom of house O5. ρ was slightly improved, but the margin was small because all the models produced good results for this index. The distance between the unique model (sky-blue) and the best model of the period (red) in MAE was small; they were always in the same quartile. The same occurred for ρ, except in the kitchen of house N2, where the sky-blue model was slightly better than red, which was unusual. With respect to the position of the red and sky-blue models in the contest, the leading position was unclear; red was the best in four out of eight cases in Figure 13, and both models were always in Q1 except for the kitchen in house N2. Similar results were obtained in Figure 14.
Once the results of the models had been analyzed and compared with those of the contest, the models of both houses were checked in different periods (checking periods) in order to evaluate their robustness (Tables 4 and 5). As an example, the model calibrated in Period 2 of house N2 was also simulated in Periods 3, 4, and 5. It should be remembered that Period 1 is an initialization period and therefore was not taken into account in the calibration processes. In addition, new models were introduced for this demonstration, calibrated in several periods at the same time and tested in the rest, to see whether the combination produced better results. Tables 4 and 5 show the results obtained. The models are ordered according to their fit to the real data, with the best-fitting model listed first for each period.
Analyzing the data, the first observation is that the models calibrated in a given period are the best for that period. Secondly, as a general rule, models calibrated in energy periods are the best when checked in energy periods (2 and 4), and models calibrated in temperature periods are the best when checked in temperature periods (3 and 5). Thirdly, the models that use all the available data perform better than models that use mixed data (energy and temperature) from two or three periods. Finally, models that use mixed data perform better in the checking periods than models calibrated in a temperature period and checked in an energy period, and vice versa.
The uncertainty indices proposed by the designers of the Annex 58 experiment to assess the degree of fit of the models were MAE and ρ. In this study, the indexes from the available standards (ASHRAE Guideline 14, IPMVP, and FEMP) were also used because they offer a reference of quality. Table 6 provides a compilation of all the indexes produced by the energy models for all the zones analyzed (living room, children's room, bedroom, and kitchen). All the calibrated models met the most demanding standard (IPMVP) for energy and temperature, and, although the base model was already categorized as good quality, the calibration process increased its quality.
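As a rough illustration of how such indexes can be computed from a measured and a simulated series, the sketch below assumes that ρ denotes Spearman's rank correlation coefficient and that the standards' indexes are the normalized mean bias error, NMBE, and the coefficient of variation of the root mean square error, CV(RMSE), with one adjusted parameter in the denominator (p = 1). These are assumptions; sign and denominator conventions differ between documents, and the acceptance thresholds should be taken from the standards themselves.

```python
import math


def mae(measured, simulated):
    """Mean absolute error, in the units of the series."""
    return sum(abs(m - s) for m, s in zip(measured, simulated)) / len(measured)


def _average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(measured, simulated):
    """Spearman rank correlation (undefined if either series is constant)."""
    return _pearson(_average_ranks(measured), _average_ranks(simulated))


def nmbe(measured, simulated, p=1):
    """Normalized mean bias error in percent (measured minus simulated)."""
    n = len(measured)
    mean_m = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, simulated))
    return 100.0 * bias / ((n - p) * mean_m)


def cv_rmse(measured, simulated, p=1):
    """Coefficient of variation of the RMSE, in percent."""
    n = len(measured)
    mean_m = sum(measured) / n
    squared = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    return 100.0 * math.sqrt(squared / (n - p)) / mean_m
```

MAE and ρ describe the size of the error and the agreement in shape, respectively, whereas NMBE and CV(RMSE) are normalized by the mean of the measured data, so they allow zones and periods of different magnitude to be compared.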
The goal of the calibration process was to adjust the energy models to reality so that they reliably depict the real building. Table 7 lists the parameters that best represent reality for each energy model studied. This information could be used to replicate the whole experiment, for confirmation of the results by other researchers, or to improve those results. In the original experiment, there was no information about the parameters of the models, which complicates comparative tests. Figures 15-22 show the energy and temperature results produced by the calibrated models (unique and period models) and the base model for each of the periods of the experiment in a temporal sequence. These data were compared with the real data measured in the original experiment. The graphs show the degree of fit of the calibrated models and the unique model in each period, as well as how the calibration process brings the data produced by the energy model closer to reality.
All the graphs shown for the different periods proposed by the experiment have elements in common. The most accurate models are those calibrated in each period. The unique model produced very satisfactory results, with its graph being similar to reality, but without improving on the results of the model calibrated in each period. Finally, the model that performed the worst in all the periods was the base model, which emphasizes the importance of the calibration process.
A global review of the results produced by the models in each validation period shows that both the model adjusted in its period and the unique model were among the best. Notably, the models adjusted for each period are those that produced the best results, always positioned better than the unique model. However, the unique model provides generality; it provides good results for all the proposed study periods.
The paper that inspired this experiment [61] mentions the large amount of data needed to construct energy models that offer a good fit between reality and simulation: "Similar data sets are needed from other, larger building types, but it would require a high level of resourcing to undertake such an experiment with a similar level of detail as the experiment described in this paper." One of the achievements of the experiment was matching and, in many periods, improving the participants' fitting data using only 52% of the sensors provided for the test. Not only is the sensor reduction remarkable but also the reduction of the most intrusive or expensive sensors. The experiment mainly used temperature measurement and energy measurement sensors, which may be the most economical and least intrusive. This considerably facilitated the process of adjusting the energy model, decreasing the cost of the data needed to perform this process while not affecting the operation of the building.
The adjustment process developed in this study is quick and simple: the energy model is found by adjusting four parameters of the building, namely the experimental capacitance of the interior air, the internal mass, the infiltrations, and the thermal bridges. The right combination of these parameters produces an energy model that captures the thermal dynamics of the real building. The characteristics of the rest of the building materials are not varied. This methodology respects the building's physics by generating energy models that adjust to the constructive reality of the buildings. Therefore, these are models that can be used by anyone with a minimum knowledge of construction, since all the properties of the construction are those specified in the documents provided for the modeling process.
Many elements and people are involved in the process of building construction, which can produce discrepancies between the project or construction data and the real building data. For example, concrete is a material manufactured in a cement plant and brought to the site. The installation process involves different factors that can vary its specifications, such as climate and installation methodology. Gathering all the uncertainties that may arise in the construction phase is impractical. Some of them can be determined later, but the identification of most parameters is a task that modelers cannot take on. In the adjustment process proposed in this paper, the construction specifications are entered into the model according to the documentation provided. The differences from reality are absorbed by the elements introduced in the calibration process (capacitances, internal mass, infiltrations, and thermal bridges), depending on the uncertainty of the model. Some of these parameters, such as experimental capacitance and internal mass, are able to absorb the possible mismatches between measured and simulated data. The values used for feeding the algorithm are free; therefore, they do not have physical sense in some cases. We could consider these parameters as pure black box parameters. Thermal bridges and infiltration are a kind of mix. Thermal bridges oscillate around the values provided by the designers of the contest: 2.5 to 2.9 m²K/W. The parameter is constrained to a minimum value of zero and a maximum value of five, which tries to give it a physical sense; however, it can still be overfitted to match the measured values. For this reason, a comparison in a checking period should be conducted. This was one of the main problems with this experiment: a lack of verification periods with similar characteristics.
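The idea of searching four bounded parameters for the combination that minimizes the mismatch with the measurements can be sketched as follows. This is a minimal illustration using plain random search, not the optimization algorithm used in the study, which is not described in this section. Only the thermal-bridge range (0 to 5) and the leakage-area range (0 to 100 cm² per zone) come from the text; the parameter names, the other two ranges, and the toy objective are hypothetical.

```python
import random

# Search ranges. Only the thermal-bridge and leakage-area ranges come from
# the text; the other two ranges are hypothetical placeholders.
BOUNDS = {
    "air_capacitance_multiplier": (1.0, 50.0),  # hypothetical range
    "internal_mass_m2": (0.0, 200.0),           # hypothetical range
    "leakage_area_cm2": (0.0, 100.0),           # range stated in the text
    "thermal_bridge": (0.0, 5.0),               # range stated in the text
}


def random_search(objective, bounds=BOUNDS, n_samples=200, seed=0):
    """Sample parameter sets inside the bounds and keep the lowest error.

    `objective` takes a dict of parameter values and returns an error index
    (lower is better). In a real calibration it would run the building
    energy model and compare its output with the measured data.
    """
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(n_samples):
        candidate = {name: rng.uniform(low, high)
                     for name, (low, high) in bounds.items()}
        error = objective(candidate)
        if error < best_error:
            best_params, best_error = candidate, error
    return best_params, best_error


def toy_objective(params):
    """Stand-in for a simulation error; it only looks at one parameter."""
    return abs(params["thermal_bridge"] - 2.7)


best_params, best_error = random_search(toy_objective, n_samples=200, seed=1)
```

Because the search only sees the objective, a parameter with a wide or unconstrained range can absorb errors that have another origin, which is why the result should be verified in a checking period that was not used for the adjustment.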
For infiltration, the methodology used is the leakage area; this area is varied between 0 and 100 cm² for each zone based on the authors' previous modeling experience. This methodology gives the option for this parameter to be dynamically adjusted based on external conditions: wind speed, wind direction, and outdoor temperature. For this reason, infiltration has a lower possibility of overfitting. These elements can absorb the possible mismatches in the entered data, producing a result that could be considered a grey box model in which some parameters do not have a physical sense. However, these differences are not reflected in the documentation provided by the contest; without these adjustments, it would be unlikely that a high efficiency BEM could be constructed.
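One widely used way of making a leakage-area infiltration depend on outdoor conditions is the effective leakage area (Sherman-Grimsrud) form, in which the air flow grows with the indoor-outdoor temperature difference and with the square of the wind speed. The sketch below assumes that form; the text does not state which formulation was used. The default stack and wind coefficients are placeholders of the order of magnitude tabulated for low-rise buildings and are not taken from the paper. Wind direction does not appear in this simplified form.

```python
import math


def infiltration_flow_m3s(leakage_area_cm2, delta_t_k, wind_speed_ms,
                          stack_coeff=0.000145, wind_coeff=0.000319,
                          schedule=1.0):
    """Infiltration air flow in m3/s for an effective leakage area in cm2.

    Assumed form: Q = schedule * (A_L / 1000) * sqrt(Cs * |dT| + Cw * v^2),
    with Cs in (L/s)^2/(cm^4 K) and Cw in (L/s)^2/(cm^4 (m/s)^2).
    The coefficient defaults are illustrative placeholders.
    """
    driving_term = stack_coeff * abs(delta_t_k) + wind_coeff * wind_speed_ms ** 2
    return schedule * (leakage_area_cm2 / 1000.0) * math.sqrt(driving_term)


# Example: a 50 cm2 leakage area, 10 K temperature difference, 3 m/s wind.
example_flow = infiltration_flow_m3s(50.0, 10.0, 3.0)
```

Since the calibrated quantity is the leakage area and the weather terms come from the measured data, the resulting infiltration follows the external conditions hour by hour instead of being a fixed air change rate.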
Two exercises were undertaken in this study: the adjustment of the model for each validation period and the construction of a high quality energy model for all the proposed periods. Notably, overfitting can occur during the calibration processes. This over-adjustment may be influenced by various factors, such as the weather and the indoor conditions in the thermal zone. Each energy model was adapted to the conditions provided in its period, such as the interior temperature or the energy consumed. For a model to be viable in a period other than its adjustment stage, the objective functions created for its convergence with the reality of the different spaces must have a correlation. If this correlation is weak or does not exist, the model will not be able to reflect reality in other periods. However, the longer the adjustment period, the better the results produced by the model over time because it will identify the differences that may occur in the thermodynamic conditions of the building. If the period selected for the adjustment is short but reflects a high percentage of the average thermal reality of the building, the resulting model will be able to support other periods.

Conclusions
In this study, we used the data provided in Annex 58 for the validation of an adjustment methodology where the calibration model is based on using fewer, less intrusive, and low-cost sensors. The results obtained by the 21 participants in the study were used to conduct a comparative test of the methodology. The results were reported using the uncertainty indexes proposed by the ASHRAE, IPMVP, and FEMP standards to enable empirical validation.
The dataset provided in Annex 58 includes almost two months of data. The meteorological data supplied are continuous for this period, and, although the sensors' data for the houses had some small gaps, this was easily solved through a process of interpolation. The experiments and data provided were the same for both houses.
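The gap-filling step can be sketched as below, assuming linear interpolation between the nearest valid readings on a regularly sampled series. The interpolation method is not specified in the text, so both the linear choice and the optional `max_gap` limit are assumptions.

```python
def fill_gaps_linear(values, max_gap=None):
    """Fill runs of None by linear interpolation between valid neighbours.

    Gaps at the start or end of the series, and gaps longer than `max_gap`
    samples (when given), are left as None. Samples are assumed to be
    equally spaced in time.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first valid sample after the gap, or n
        gap = end - start
        if start == 0 or end == n:
            continue  # no valid neighbour on one side
        if max_gap is not None and gap > max_gap:
            continue  # gap too long to interpolate safely
        left, right = filled[start - 1], filled[end]
        for k in range(start, end):
            fraction = (k - start + 1) / (gap + 1)
            filled[k] = left + fraction * (right - left)
    return filled


example_filled = fill_gaps_linear([20.0, None, None, 23.0])
```

Limiting the length of the gaps that are interpolated avoids inventing data over long sensor outages, where a straight line would no longer follow the thermal dynamics of the zone.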
The experiments had to be completed in the summer months, which produces certain limitations when adjusting the models. Since a limited amount of data is available, some aspects, such as the behavior of the model in the face of abnormal climate changes, such as heat or cold waves, may produce discrepancies in the results, as no similar reference has been introduced to the model that could provide information on the behavior of the building in these special cases. The model learns from the data provided to it. The process of adjusting a building will be more successful as long as the data provided includes as many singularities as possible, regardless of the duration of the data.
The ability to adjust the envelope in periods with a wide range of temperature differences could not be tested. This also led to the energy input to the houses being limited, with only mechanical ventilation being tested. House occupation was excluded from the process to simplify the test. The period of data provided is relatively brief, and we would have preferred having a validation period independent of the calibration phases to more robustly verify the reliability of the models over time.
The two months of data were divided into five periods to test the strength of the calibration process in different circumstances. To complete the experiment, we performed two types of energy model calibration: a separate calibration for each proposed period and a joint calibration over all the periods together (unique model).
The results demonstrated that the methodology used to solve the proposed problem was valid. The models were adjusted to the requirements proposed in the Annex 58 exercise while reducing the sensors required by 47.5% and without using any of the tests and trials provided for the resolution of the experiment.
The most accurate models were those that were calibrated in each period, but the results of the joint calibration showed that the methodology is robust, since they were also among the most accurate of the set of models produced by the experiment. Analyzing the results in all periods, the models calibrated with the proposed methodology produced the best results globally. Even the unique model, calibrated considering all five periods, produced results well above the average for the exercise.
With this analysis, we demonstrated that the proposed methodology for the calibration of energy models precisely reproduces the thermal and energetic reality of the building while significantly reducing the number of sensors needed to produce good results. This led to a substantial reduction in costs, both in terms of human resources and equipment, as well as in the economics of constructing energy models, thus increasing the accessibility of these models for the building market.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: