Development, Calibration and Validation of an Internal Air Temperature Model for a Naturally Ventilated Nearly Zero Energy Building: Comparison of Model Types and Calibration Methods

In this study, a grey box (GB) model for simulating internal air temperatures in a naturally ventilated nearly zero energy building (nZEB) was developed and calibrated, using multiple data configurations for model parameter selection and an automatic calibration algorithm. The GB model was compared to a white box (WB) model for the same application using identical calibration and validation datasets. Calibrating the GB model using only one week of data produced very accurate results for the calibration periods but led to inconsistent and typically inaccurate results for the validation periods (root mean squared error (RMSE) in validation periods was 229% larger than the RMSE in calibration periods). Using three weeks of data from varying seasons for calibration reduced the model accuracy in the calibration period but substantially increased the model accuracy and generalisation abilities for the validation period, reducing the mean RMSE by over 160%. The use of one week of data increased the standard deviation in parameter selections by over 40% when compared with the three-week calibration datasets. Utilising data from multiple seasons for calibration purposes was found to substantially improve generalisation abilities. When compared to the WB model, the GB model produced slightly less accurate results (mean RMSE of the GB model was 1.5% higher). However, the authors found that employing GB modelling with an automatic model calibration technique reduced the human labour input for simulating internal air temperature of a naturally ventilated nZEB by approximately 90%, relative to WB modelling using a manually calibrated approach.


Introduction
Internal air temperature is a critical parameter for the simulation of the indoor environment in buildings. In order to simulate important metrics such as thermal comfort, occupant productivity and energy consumption, accurate internal air temperature models are required.
White box (WB), black box (BB) and grey box (GB) models have been employed in previous studies for internal air temperature prediction. WB models are physics-based and mechanistic in operation. They typically utilise a large number of building descriptive parameters such as material properties, spatial dimensions, fenestration orientations and mechanical/thermal system specifications. BB models are data driven and use regression or machine learning algorithms to map the relationship between system inputs and outputs using large amounts of empirical data without the requirement for static building descriptive parameters [1,2]. GB models, like WB models, are mechanistic in nature; however, the iterative physics-based operations are simplified and aggregated. Therefore, they are less computationally expensive and require fewer building descriptive parameters [3,4]. The most commonly utilised GB models in thermal engineering applications are resistive-capacitive (RC) models [3-17] (Table 1). An RC model uses an electrical analogy to describe model structure and function, where resistors (R) and capacitors (C) are used to simulate the thermal energy flow and storage in the building. RC models typically simplify WB model parameters by lumping parameters together into single resistors or capacitors. The simplification of RC models is achieved through order-reduction [7,18] and the lumping of parameters [4,14,16].
Typically, WB models which utilise precise building parameters were found to be reasonably accurate when validated under experimental conditions in building test-cells [19,20]. However, WB models were found to be inaccurate when compared to measured data from real buildings [21]. Applying manual or automated calibration techniques to WB models has been shown to greatly improve accuracy for internal air temperature prediction in real buildings [22-24].
Model calibration often requires parameter tuning or selection. Parameter tuning can be performed using manual [25], automated [26] and semi-automated [27] approaches. The manual calibration approach uses human intuition combined with standardised calibration metrics [28] to minimize prediction error. This approach begins with an initial un-calibrated model populated with design stage building parameters. These parameters are then tuned via a series of revisions to the initial model based on the relative error reduction after each revision [29,30]. The automated calibration approach uses mathematical or statistical methods for parameter tuning [26]. These methods typically utilise optimisation algorithms to reduce the error between model outputs and measured data [16]. Semi-automated or mixed approaches to calibration combine intuitive and mathematical methods [31]. Typically, an initial model is created manually, parameters are selected for tuning and boundary limits are manually selected for each parameter. The algorithm then selects the final values of these parameters through an iterative error minimization process [8,32]. The difference in time taken to manually or automatically calibrate models has been shown in some instances to be very substantial [26,32]. As manual calibration relies on user intuition, this process can be time intensive (e.g., 40 to 50 h), whereas automated calibration can take over 80% less time (e.g., two to seven hours). In building simulation, WB models are the most widely used models in practice [33]. Typically, WB models are manually calibrated [22,25,34]. However, previous studies have also used semi-automated calibration procedures with WB models [32,35].
BB models are developed solely from empirical data, where the internal architecture and coefficients (e.g., the number of neurons in the hidden layer of a neural network and the synaptic weightings between neurons) are selected through an automatic procedural method based on globalised error reduction [36-41]. For GB models, several studies use manual [5,11,14] and semi-automated calibration approaches [12]. However, the most common approach is automated calibration [3,4,13,15-17]. The majority of studies using RC models for internal air temperature prediction have relied on synthetic data generated by either WB models or a higher order GB model for calibration and validation [3,6-8,11,15-17]. When empirical data are employed for such purposes, they are most commonly recorded in unoccupied test-cells or rooms [5,12,14]. The number of RC models that have been calibrated and validated using recorded data from real occupied buildings is limited [10,13].
Nearly zero energy buildings (nZEBs) are now a legislative requirement for newbuilds in many parts of the world. A large body of literature has identified nZEBs and zero energy or net zero energy buildings as the target for buildings in the future [42-45]. Reaching nZEB standards requires improvements in fabric and energy performance when compared to existing buildings. Although energy performance certificates suggest that many nZEBs exist [46], the sample of well documented case study examples in published literature is limited [23,47-50]. Over 80% of the simulation studies that focused on modelling and performance of actual nZEBs identified occupant behaviour (OB) as the primary inhibiting factor for accurate simulation and prediction of both energy and internal air temperatures [23,49,51-53]. This is due to the near stochastic nature of OB in certain conditions. OB has a strong influence on the internal air temperatures in nZEBs [23], as these buildings are highly insulated and can be thermally decoupled from external climatic conditions. As a result, the internal thermal gains from OB have a more dominant effect on internal air temperature. The observed efficacy of parameter tuning for RC models may be lower if measured empirical data from nZEBs are used (as opposed to synthetic data) due to the included noise from OB. The majority of published research has focused on RC model calibration of air-conditioned buildings [3,6-8,13,16,17]. As these buildings incorporated controlled air-conditioning systems for regulating the internal environment, the influence of OB on air temperature was attenuated; therefore, the model calibration and validation processes are less susceptible to the negative effects of OB noise. Pavlak et al. (2014) trained an RC model on three weeks of synthetic data. Varying levels of noise were added to the calibration dataset to simulate the uncertainty of recorded empirical data.
The results of that study showed that least-squares based automatic model calibration error increased relative to noise level and noise type, with high levels of brown noise resulting in a substantial increase in model error. RC models calibrated and validated using automatic calibration algorithms have produced low levels of error, with an average root mean squared error (RMSE) of 0.4 °C and a maximum RMSE of 1.2 °C [6-8,13,16,17]. RC model predictions in naturally ventilated (NV) buildings are reported in the literature as being less accurate, with mean absolute errors of 1.0 °C to 1.1 °C and daily maximum errors of 1.8 °C [14]. Air temperature in NV buildings is much more sensitive to external climatic conditions (external air temperature and wind velocity) than in mechanically air-conditioned buildings [54]. OB also has a strong influence on the operation of NV buildings as many NV systems either fully or partially rely on manual occupant-controlled openings.
Table 1 presents findings from a systematic mapping (from peer-reviewed literature) of published RC models that have been calibrated and/or validated for predicting internal air temperatures in different buildings. This table includes the types of RC models that have been used, the building type (i.e., residential, non-residential or test-cell), the types of calibration data used (i.e., synthetic from other software package or empirical from a real building), the calibration method (i.e., manual, automated or mixed methods), the duration of calibration or validation in days, and the season(s) that were considered during calibration or validation. Of the studies identified in Table 1, 40% used a combined calibration and validation approach.
Of the studies that calibrate or validate GB models using measured empirical data, the data requirements for calibration purposes were typically between 6 and 26 days, depending on the application [5,10,12,13], with some studies using larger datasets broken into smaller periods [7]. Following a review of the RC model studies presented in Table 1, a number of gaps have been identified. Existing literature on GB model calibration has a limited number of validated examples of thermally decoupled environments such as nZEBs, and these examples are in test-cell environments [5]. GB models of nZEBs have yet to be calibrated in occupied conditions. The majority of GB models have used synthetic data for parameter tuning. There are few studies of GB model calibration in NV buildings [5,14]. While some examples have added noise to synthetic data [6,7], calibration or validation with measured empirical data from real buildings is very limited [10,13] and non-existent for NV nZEBs.
This paper presents the first example of GB model calibration for internal air temperature prediction in an occupied NV nZEB. The five objectives of this study are to: (1) investigate GB model performance for predicting internal air temperatures in a thermally decoupled NV building with different sources of empirical calibration and validation datasets, (2) develop a GB model of a naturally ventilated nZEB, (3) apply an automatic calibration algorithm to select optimal GB model parameters using measured internal air temperature data, (4) compare the validated GB model to a calibrated WB model and, finally, (5) investigate the potential for practical implementation of GB models. In Section 3 we present the model theory and the calibration and validation approaches. In Section 4 we analyse the accuracy of the GB model when compared to the WB model and measured empirical data, along with an analysis of the parameter selection when using different calibration and validation periods. In Section 5 we discuss the efficacy of different calibration and validation configurations as well as a practical comparison of WB and GB models. Section 6 presents the conclusions of this study.

Application
The empirical data used for this study were gathered from a test-bed building known as the zero2020 building. The zero2020 building is a 223 m² educational building that functions as a live test-bed known as the National Building Energy Retrofit Test-bed (NBERT), shown in Figure 1. The NBERT is used for research in thermal comfort, ventilative and passive cooling, energy systems, and micro-grid applications [55-59]. The NBERT uses a multi-configuration slotted louvre (MCSL) natural ventilation system for both comfort cooling and air quality. The retrofitted building also has a highly insulated external envelope with fabric and fenestration U-values that are lower than the legislative requirements for nZEBs in Ireland [60]. The NBERT has one large open plan office area, two small cellular offices, one conference room and one seminar room, with a corridor connecting all occupied zones. For information relating to the thermo-physical properties and the building geometry see [56,61,62], for information regarding the natural ventilation system see [54] and for specification of the instrumentation used in recording the empirical data see [23]. The data used in this paper as well as building information regarding the NBERT test-bed can be found at messo.cit.ie/nbert.
Figure 1 displays the interior of the open plan office, which has glazed facades to the west and south, and a roof light. This room was selected as the application in this study for the following reasons: (1) it has a highly insulated external façade; therefore, the internal air temperature is, to a higher degree, thermally decoupled from the external air temperature in comparison to a typical building. Occupancy related heat gains therefore have a stronger influence on internal air temperature. (2) The space has a highly variable occupancy schedule [23]; therefore, profiling occupancy related gains is more difficult than for a conventional open space office. (3) NBERT employs a natural ventilation system with automatic openings (which can be manually overridden by occupants) and manually operated openings, which are controlled exclusively by the occupants. Therefore, the air temperature control in the space relies on both external climatic conditions (through solar irradiation and ventilation convection exchanges) and internal OB and interactions with the space. For these reasons, the space was deemed an interesting application for a GB model with automatic calibration and validation using measured empirical data.
The GB model used in this study was an RC model. Previous studies have found that single capacitor building RC models are too simplistic and do not adequately simulate internal air temperature for high thermal mass buildings [4,11,15,63], while cognate research that utilised multiple capacitors in RC models found diminishing returns in model accuracy as the number of capacitors was increased beyond two [12-14]. Therefore, a two-capacitor model (2C) was selected for this application. The internal air mass (air capacitor) and the internal material mass (material capacitor) were both modelled as independent capacitors. Three of the six zone boundary surfaces were external (two external walls and roof) while the other three were internal (two internal walls and floor). It was assumed no heat transfer would occur through the internal surfaces. The resistors of the three external surfaces were lumped into one resistor, which was the resistance of the external envelope (U_Env). This approach was adopted as the internal air temperature and external air temperature at either side of these three resistors were uniform. A second resistor was applied to the material capacitor representing the materials such as the walls and roof (which was the resistance of the convective heat transfer coefficient (h_Cap)); therefore, the structure of the model used was 2R2C (for an equivalent RC diagram see [9]). This model structure was found to be very effective in cognate studies [9,12,13], and from the literature review in Section 1, the authors deemed this to be the simplest configuration that would be capable of simulating the dynamic internal air temperature for this application.

GB Model
The internal air temperature in the room (T_Ai(t)) at each time-step (t) is described in Equation (1). The change in internal air temperature, T_Ai, due to the balance between energy flow in (θ_A,I) and out (θ_A,O) of the air in the room is defined in Equation (2), where m_Ai is the mass of the air in the room and C_A is the specific heat capacity of air.
The sensible energy input to the room is described in Equation (3), where θ_Sol, θ_Occu and θ_Int are the solar, occupant driven and internal energy gains added to the room at each time-step. β represents the percentage of thermal gains that enter the air. It was assumed a percentage of all gains would be lost due to latent energy [64]. γ describes the fraction of energy that is convective or radiative. Two independent convective-radiative ratios were employed in this study, one for internal gains and occupant gains (γ_IO) and another for solar gains (γ_Sol). These ratios represented the percentage of convective gains that enter the air, where the remainder was assumed to be radiative and was absorbed by the surface of the material capacitor (i.e., the walls, floor and roof).
Equation (4) describes how the solar gains (θ_Sol) were broken into gains entering from the sky-light window on the horizontal, and gains entering from all vertical windows. The g-values for the skylight (g_h) and vertical windows (g_v) were independent as the glazing details varied. The solar radiation incident on the horizontal (I_h) and vertical (I_v) surfaces was derived from global horizontal irradiance data and converted to vertical values for the vertical windows using the Perez model [65], which was employed from TRNSYS 17 [66] using the integrated weather data and radiation processor (Type 99).
Equation (5) describes the totalised outward sensible energy flows from the internal air, where θ_Inf, θ_Vent, θ_Cond and θ_Mat are energy flows due to infiltration, ventilation, conductive losses and material capacitor energy losses (energy flowing from the air to the material capacitor), respectively. Equation (6) describes the energy losses due to conduction (θ_Cond), where T_Ai is the internal air temperature of the room, T_Ae is the external air temperature, U_Env is the U-value of the external envelope and A_Env is the area of the external envelope.
Equation (7) describes the change in temperature due to the energy flows entering and exiting the material capacitor (walls, floor and roof), where θ_Mat,I and θ_Mat,O are the energy flows in and out of the material capacitor, m_Mat is the mass of the material capacitor, and C_Mat is the capacitance of the materials in the room. The mass of the material capacitor is described in Equation (8) and is a function of its depth (d_Mat), area (A_Mat) and density (ρ_Mat). For this study, the material capacitance for the room was assumed to be concrete from the walls, floor and roof in the room. Therefore, all thermo-physical parameters that were defined for the material capacitor were related to concrete blocks with a maximum thickness of 0.2 m and a maximum utilisable depth of 0.1 m, based on maximum depths for calculations proposed by ISO 13790 [67]. Initial values for the density of the material capacitor were taken from CIBSE Guide A [68].
Equation (9) describes the energy flowing into the material capacitor (θ_Mat,I), where θ_Sol, θ_Occu and θ_Int are the solar, occupant driven and internal energy added to the material capacitor at each time-step, and (1 − γ_Sol) and (1 − γ_IO) describe the radiative gains entering the material capacitor. Equation (10) describes the energy flowing out of the material capacitor (θ_Mat,O), where T_Ai is the temperature of the air in the room, T_Mat is the temperature of the material, h_Mat is the convective heat transfer co-efficient for the material capacitor, and A_Matex is the heat exchange area between the material capacitor and the air capacitor.
Equation (11) describes the heat exchange due to infiltration through the building external envelope, where T_Ai is the temperature of the air in the room, T_Ae is the external air temperature, C_A is the heat capacity of the air, ρ_A is the density of the air in the open plan office, and Q_Ainf is the volume of air due to infiltration. Equation (12) describes the heat loss due to ventilation from the building's natural ventilation system, where Q_Avent is the volume of air flowing through the room due to natural ventilation. The volumetric flowrate of air passing through the room is based on whether the wind driven flow (Q_W) is greater than the buoyancy driven flow (Q_B) or vice versa, as indicated in Equation (13). Equations (14) and (15) describe the volumetric flowrate for wind driven (Q_W) and buoyancy driven (Q_B) flow according to Warren et al. [69], where F_R describes the reference flow number, α describes the opening position of openings (the opening scaling factor), A_win describes the area of openings in the room, υ_w describes the local wind speed, H is the height of the opening and C_d is the discharge co-efficient for all openings.
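The energy balance described above can be sketched as a single explicit time-step of a two-capacitor model. This is a minimal illustration only, not the authors' implementation: the exact forms of Equations (1) to (15) are not reproduced in this text, so the expressions below are assumed forms that are merely consistent with the symbol definitions. The class `Room`, the functions `wind_flow`, `buoyancy_flow` and `step`, all parameter values, the treatment of gains as powers in W, and the choice to apply β only to the air node are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Room:
    """Placeholder 2R2C parameters (illustrative values, not taken from Table 3)."""
    m_air: float = 800.0      # mass of room air, kg
    c_air: float = 1005.0     # specific heat capacity of air, J/(kg K)
    rho_air: float = 1.2      # air density, kg/m^3
    u_env: float = 0.2        # lumped envelope U-value, W/(m^2 K)
    a_env: float = 300.0      # envelope area, m^2
    m_mat: float = 40000.0    # material capacitor mass, kg
    c_mat: float = 840.0      # material specific heat capacity, J/(kg K)
    h_mat: float = 3.0        # convective coefficient, W/(m^2 K)
    a_matex: float = 400.0    # air/material exchange area, m^2
    beta: float = 0.9         # share of gains remaining sensible (assumed)
    gamma_io: float = 0.6     # convective share of occupant/internal gains
    gamma_sol: float = 0.3    # convective share of solar gains


def wind_flow(f_r, alpha, a_win, wind_speed):
    """Assumed wind-driven form: Q_W = F_R * alpha * A_win * v_w (m^3/s)."""
    return f_r * alpha * a_win * wind_speed


def buoyancy_flow(c_d, alpha, a_win, height, t_air, t_ext):
    """Assumed buoyancy form: Q_B = (C_d * alpha * A_win / 3) * sqrt(g H |dT| / T_mean)."""
    t_mean_k = 0.5 * (t_air + t_ext) + 273.15
    return c_d * alpha * a_win / 3.0 * math.sqrt(
        9.81 * height * abs(t_air - t_ext) / t_mean_k)


def step(room, t_air, t_mat, t_ext, sol, occ, internal, q_inf, q_vent, dt):
    """Advance air and material temperatures (deg C) by dt seconds.

    Gains are in W and volumetric flows in m^3/s (both assumptions).
    """
    # Convective share of gains heats the air; the radiative remainder heats the material.
    air_in = room.beta * (room.gamma_sol * sol + room.gamma_io * (occ + internal))
    mat_in = (1.0 - room.gamma_sol) * sol + (1.0 - room.gamma_io) * (occ + internal)
    # Outward flows from the air node: conduction, infiltration + ventilation, air-to-material.
    cond = room.u_env * room.a_env * (t_air - t_ext)
    air_exchange = room.rho_air * room.c_air * (q_inf + q_vent) * (t_air - t_ext)
    to_mat = room.h_mat * room.a_matex * (t_air - t_mat)
    air_out = cond + air_exchange + to_mat
    new_air = t_air + dt * (air_in - air_out) / (room.m_air * room.c_air)
    # Energy leaving the material is the reverse of the air-to-material exchange.
    new_mat = t_mat + dt * (mat_in + to_mat) / (room.m_mat * room.c_mat)
    return new_air, new_mat


# Usage: the ventilation flow is the larger of the wind and buoyancy driven flows.
room = Room()
q_vent = max(wind_flow(0.025, 0.5, 2.0, 3.0),
             buoyancy_flow(0.6, 0.5, 2.0, 1.2, 24.0, 15.0))
t_air, t_mat = step(room, 24.0, 23.0, 15.0, sol=500.0, occ=300.0,
                    internal=200.0, q_inf=0.01, q_vent=q_vent, dt=60.0)
```

With these placeholder values the air node has a time constant of several minutes, so an explicit step of 60 s is stable; a coarser time-step would need an implicit or sub-stepped scheme.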

Calibration, Testing and Validation Data
The data utilised for calibration, testing and validation in this study were recorded in an occupied open plan office in NBERT (see Section 2.1). In previous work by O'Donovan et al., a WB model was calibrated and validated using the same dataset [23]. Table 2 and Figure 3 display the periods where data were used for calibrating and validating both the GB and the WB model. The WB model was calibrated using data from week 1 to week 3 and validated using data from week 4 to week 6. For more detailed information on these data please see the supplementary data section of O'Donovan et al. [23].

Calibration and Validation Approaches and Metrics
Calibration and validation using both WB and GB models were compared. The baseline model geometry for the WB model was developed using the TRNSYS 3D Google Sketchup Plugin [70] and imported into TRNSYS 17 [66]. The WB model was calibrated using a manual approach, where the baseline model (describing the building fabric and solar heat gains) was calibrated sequentially, first using unoccupied data and later occupied data. Week 1 was used to tune parameters related to the unoccupied condition of the building (i.e., infiltration rate, external shading factor, thermal bridges). Week 2 was used to tune parameters related to the occupied condition of the building (i.e., lighting, appliances, occupant gains, window positions). Week 3 was used to test the WB model performance, while data from week 4 to week 6 were used to validate the WB model (i.e., no parameters were tuned). Changes from the baseline to the calibrated model were made through a series of iterations that were evidence-based. All changes to the baseline were founded on building information or detailed knowledge of the building performance. Each iteration was, therefore, a manual adjustment of the previous model and did not rely on automation. To assess whether the model was satisfactorily calibrated, standardised calibration metrics with reference to measurement and verification or calibration standards were applied. More information on this approach can be found in O'Donovan et al. [23].
In this study, three GB model validation/calibration approaches were investigated:
(1) The GB model was calibrated on an individual week, then validated on the remaining five weeks. This was repeated for all calibration and validation periods. The purpose of this analysis was to examine the performance of the model when calibrated on a single week with a specific occupancy level during a specific season, then validated on five individual weeks with different occupancy levels over varying seasons. The comparison of calibration and validation metrics was used as an indicator of robustness.
(2) The GB model was calibrated on three weeks, then validated on the other three weeks. Multiple simulations were carried out where the three calibration weeks and the three corresponding validation weeks contained periods of varying occupancy and different seasons. The efficacy of this calibration and validation method was examined and then compared to the previous method. The model parameters selected by the calibration algorithm for both methods were compared.
(3) A comparison of both GB and WB models was performed where both models were calibrated and validated using identical data. The GB model was simultaneously calibrated using data from weeks 1, 2 and 3. The WB model was incrementally calibrated for weeks 1 and 2 and tested using data from week 3. Both models were validated using data from week 4 to 6.
To compare the accuracy of each model, the RMSE for a given period (p) was calculated for each week using Equation (16):
$$RMSE_{p} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(T_{Ai,S,i} - T_{Ai,M,i}\right)^{2}} \quad (16)$$
where T_Ai,S is the simulated internal air temperature, T_Ai,M is the measured internal air temperature and N is the number of measurements during period p. Previous work by O'Donovan et al. noted the need to include a correlation metric when comparing air temperature predictions with empirical data, as the range of temperatures in nZEBs can be quite narrow and the RMSE alone may not accurately represent the error in a model [23]. Equation (17) shows the calculation of the Pearson correlation co-efficient (r) [71]:
$$r = \frac{\sum_{i=1}^{N}\left(T_{Ai,S,i} - \bar{T}_{Ai,S}\right)\left(T_{Ai,M,i} - \bar{T}_{Ai,M}\right)}{\sqrt{\sum_{i=1}^{N}\left(T_{Ai,S,i} - \bar{T}_{Ai,S}\right)^{2}}\,\sqrt{\sum_{i=1}^{N}\left(T_{Ai,M,i} - \bar{T}_{Ai,M}\right)^{2}}} \quad (17)$$
where T_Ai,S and T_Ai,M are the values of simulated and measured air temperatures for each instance, and T̄_Ai,S and T̄_Ai,M are the mean values for the simulated and measured datasets, respectively. The Pearson correlation co-efficient was used to measure the strength of correlation between the models' outputs and the measured empirical data. This correlation is a direct indication of the models' abilities to profile the transient behaviour of the internal air temperature, and therefore, it provides a good indication of whether the model is accurately capturing the thermodynamic characteristics of the space. When the Pearson correlation co-efficient is greater than 0.5, it can be suggested that the model is representing the dynamic behaviour of the building [72].
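The two metrics described above can be computed with a few lines of standard-library Python. This is a minimal sketch using the textbook definitions of RMSE and the Pearson correlation co-efficient; the function names `rmse` and `pearson_r` and the sample temperatures are hypothetical. The example uses a simulated series that is offset from the measurements by a constant 0.5 °C, which illustrates why both metrics are reported: the offset shows up in the RMSE, while the correlation still indicates that the temperature profile is tracked.

```python
import math


def rmse(simulated, measured):
    """Root mean squared error between two equal-length temperature series."""
    n = len(measured)
    return math.sqrt(sum((s - m) ** 2 for s, m in zip(simulated, measured)) / n)


def pearson_r(simulated, measured):
    """Pearson correlation co-efficient between simulated and measured series."""
    n = len(measured)
    mean_s = sum(simulated) / n
    mean_m = sum(measured) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(simulated, measured))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_s * var_m)


# Hypothetical sample: the simulation follows the measured profile with a 0.5 degC bias.
measured = [21.0, 21.5, 22.0, 22.5]
simulated = [21.5, 22.0, 22.5, 23.0]
error = rmse(simulated, measured)
correlation = pearson_r(simulated, measured)
```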

GB Model Calibration
The Levenberg-Marquardt algorithm [73] was used in the automatic calibration process. This algorithm was chosen for its computational efficiency and ability to avoid local minima. The Levenberg-Marquardt method has proven to be successful in cognate modelling applications [40,74,75]. The values of the 11 model parameters and their upper and lower limits are shown in Table 3. The algorithm selected the optimal configuration of the 11 parameters through an iterative process. The cost function employed was the sum of squared errors (SSE), given in Equation (18):

$$\mathrm{SSE} = \sum_{k=1}^{N}\left(T_{Ai,S}(k) - T_{Ai,M}(k)\right)^{2} \qquad (18)$$

where the error is the difference between the simulated air temperature ($T_{Ai,S}$) and the measured air temperature ($T_{Ai,M}$). This cost function was selected as it penalises large residuals. The value of $N$ depends on the length of the calibration dataset. The algorithm stopping criteria were a convergence value of $1 \times 10^{-3}$ and a maximum of 1000 iterations.
While many of the standard building parameters (such as the density of concrete blocks) were assumed to be known, the building-specific parameters were assumed to be unknown but estimable within a set range [76]. The 11 building-specific parameters shown in Table 3 could be altered during the auto-calibration process. As these parameters were unknown, the initial parameter values were estimated, and the upper and lower limits were set either by the physical constraints of the building or by maximum plausible boundary limits determined by the authors.
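The calibration loop described above can be sketched on a much smaller problem. The example below is an illustration under stated assumptions and is not the authors' implementation: it uses a hypothetical two-parameter 1R1C model (time constant `tau` and solar gain factor `gain`) instead of the 11-parameter GB model, synthetic data instead of measurements, a hand-written Levenberg-Marquardt update, and simple clipping of trial steps to the parameter limits. How the study enforced its limits, and exactly how its convergence value of 1 × 10⁻³ was applied, are not stated, so the `tol` check on the relative step is also an assumption.

```python
import math
import random

def simulate(tau, gain, t_out, solar, t_start, dt=1.0):
    """Implicit-Euler solution of dT/dt = (T_out - T)/tau + gain * solar."""
    temps = [t_start]
    for k in range(1, len(t_out)):
        forcing = t_out[k] / tau + gain * solar[k]
        temps.append((temps[-1] + dt * forcing) / (1.0 + dt / tau))
    return temps

def sse(residuals):
    """Sum of squared errors, the cost function of Equation (18)."""
    return sum(r * r for r in residuals)

def calibrate(residual_fn, x0, lower, upper, tol=1e-3, max_iter=1000):
    """Two-parameter Levenberg-Marquardt with trial steps clipped to the limits."""
    x, lam = list(x0), 1e-2
    res = residual_fn(x)
    cost = sse(res)
    for _ in range(max_iter):
        # Forward-difference Jacobian, one column per parameter.
        cols = []
        for j in range(2):
            h = 1e-6 * max(abs(x[j]), 1.0)
            shifted = list(x)
            shifted[j] += h
            cols.append([(a - b) / h for a, b in zip(residual_fn(shifted), res)])
        a11 = sum(c * c for c in cols[0])
        a22 = sum(c * c for c in cols[1])
        a12 = sum(c0 * c1 for c0, c1 in zip(cols[0], cols[1]))
        g1 = sum(c * r for c, r in zip(cols[0], res))
        g2 = sum(c * r for c, r in zip(cols[1], res))
        # Solve (J^T J + lam * diag(J^T J)) dx = -J^T r by Cramer's rule.
        d11, d22 = a11 * (1.0 + lam), a22 * (1.0 + lam)
        det = d11 * d22 - a12 * a12
        dx = [(-g1 * d22 + g2 * a12) / det, (g1 * a12 - g2 * d11) / det]
        trial = [min(max(x[j] + dx[j], lower[j]), upper[j]) for j in range(2)]
        step = max(abs(trial[j] - x[j]) / max(abs(x[j]), 1e-12) for j in range(2))
        trial_res = residual_fn(trial)
        trial_cost = sse(trial_res)
        if trial_cost < cost:  # accept the step and reduce the damping
            x, res, cost, lam = trial, trial_res, trial_cost, lam / 10.0
        else:                  # reject the step and increase the damping
            lam *= 10.0
        if step < tol:
            break
    return x, cost

# Synthetic one-week "measurement" at hourly resolution (hypothetical values).
hours = range(168)
t_out = [15.0 + 5.0 * math.sin(2.0 * math.pi * (h - 9) / 24.0) for h in hours]
solar = [max(0.0, 0.6 * math.sin(math.pi * ((h % 24) - 6) / 12.0)) for h in hours]
rng = random.Random(0)
true_tau, true_gain = 20.0, 1.0
measured = [t + rng.gauss(0.0, 0.05)
            for t in simulate(true_tau, true_gain, t_out, solar, 21.0)]

def residual_fn(params):
    """Simulated minus measured temperature for a candidate parameter set."""
    sim = simulate(params[0], params[1], t_out, solar, measured[0])
    return [s - m for s, m in zip(sim, measured)]

initial = [40.0, 2.0]
(tau_hat, gain_hat), final_sse = calibrate(
    residual_fn, initial, lower=[1.0, 0.0], upper=[200.0, 5.0])
print(f"tau = {tau_hat:.1f} h, gain = {gain_hat:.2f}, SSE = {final_sse:.3f}")
```

A step is accepted only when it lowers the SSE, and the damping factor moves the update between a Gauss-Newton step (small damping) and a short gradient-descent step (large damping). A production workflow would more likely rely on an established optimisation library than on a hand-written solver.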

Calibration and Validation of GB Model
In this section, we compared the performance of the GB model for two different calibration and validation datasets. The first dataset contained one week for calibration and five weeks for validation purposes (C1V5), while the second dataset contained three weeks for calibration and three weeks for validation purposes (C3V3). Each dataset was partitioned and folded into multiple calibration and validation subsets (Figure 4) to allow multiple calibration and validation arrangements [77]. The purpose of comparing the calibration and validation of the GB model for these two datasets was to determine the performance gap when the model was calibrated using just one week in a particular season with a particular occupancy level versus being calibrated using multiple weeks over varying seasons and occupancy levels.

Table 4 displays the calibration and validation results for the GB model for C1V5. The dataset contained six individual weeks of data. These weeks were recorded over varying seasons and contained multiple occupancy levels (see Section 3.2). The GB model was calibrated using one week and then validated on the other five weeks; this was repeated for each week. For example, when calibrating using week 2, the validation period consisted of weeks 1, 3, 4, 5 and 6 (C2_V13456) (see Figure 4). We can see from the results that the GB model correlated very strongly and produced a very low error for the weeks on which it was calibrated.
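The fold labels used for each arrangement (for example C2_V13456) follow a simple pattern that can be generated mechanically. The helper below is a hypothetical sketch; the function name `fold_label` is an assumption, and which three-week combinations were actually used for C3V3 was chosen by the authors to mix seasons and occupancy levels, so they are passed in explicitly here.

```python
WEEKS = [1, 2, 3, 4, 5, 6]

def fold_label(calibration_weeks, all_weeks=WEEKS):
    """Return (label, validation_weeks) for a set of calibration weeks."""
    calibration = sorted(calibration_weeks)
    validation = [w for w in all_weeks if w not in calibration]
    label = "C" + "".join(map(str, calibration))
    if validation:
        label += "_V" + "".join(map(str, validation))
    return label, validation

# C1V5: each week is used once for calibration, the other five for validation.
c1v5 = [fold_label([w])[0] for w in WEEKS]
print(c1v5)

# C3V3: one three-week calibration set and its complementary validation set.
print(fold_label([2, 5, 6]))
```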
Table 4. Accuracy of models calibrated on one week of data. The values in bold indicate the accuracy of the training week. The corresponding line plots for these bold values are in Figure 5.

Figure 5 shows the calibrated models' performance for each respective calibration period. The models' very tight fit to the empirical data in each calibration period demonstrates the algorithm's ability to finely tune the model parameters to produce the minimum error. However, the correlation substantially decreased and the error substantially increased for the validation periods. For the six simulations of C1_V23456 through to C6_V12345, the mean RMSE was 334%, 739%, 62%, 593%, 25% and 265% higher in the validation period compared to the calibration period, respectively. On average, the RMSE increased by 229% from the calibration period to the validation periods.

Figure 6 displays the set of parameters selected by the calibration algorithm for each calibration period (the parameters are displayed as a percentage between the upper and lower limits). For the majority of parameters, the tuned values varied substantially depending on the week selected for calibration. The mean of the standard deviations for the 11 parameters was 19%.

GB Model Performance (C3V3)
Table 5 displays the calibration and validation results for the GB model for C3V3. The GB model was calibrated using three weeks and then validated on the other three weeks. The composition of the three calibration weeks was selected and folded to ensure the model was calibrated using data containing varying occupancy levels during different seasons. The last simulation employed all six weeks for calibration purposes (C123456).

For the four simulations of C123_V456 through to C256_V134, the mean RMSE was 109%, 53%, 65% and 29% higher in the validation period compared to the calibration period, respectively. The mean increase in RMSE from the calibration periods to the validation periods was 62%. Figure 7 illustrates the final parameter selections that the calibration algorithm made for each three-week combination of calibration and validation datasets. As can be seen, the parameter values selected by the algorithm varied depending on the three weeks used for the calibration dataset. However, the variance was substantially lower than in the C1V5 results (Figure 6). The mean of the standard deviations for the 11 parameters was 13%.
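The parameter spread shown in Figures 6 and 7 can be reproduced in outline as follows. This is a sketch under assumptions: the parameter names, limits and calibrated values are hypothetical, and the population standard deviation (`statistics.pstdev`) is used because the text does not state whether a population or sample standard deviation was applied.

```python
import statistics

def percent_of_range(value, lower, upper):
    """Express a parameter as a percentage between its lower and upper limits."""
    return 100.0 * (value - lower) / (upper - lower)

def mean_parameter_spread(runs, limits):
    """Mean over parameters of the standard deviation across calibration runs.

    runs:   list of dicts mapping parameter name -> calibrated value
    limits: dict mapping parameter name -> (lower, upper)
    """
    spreads = []
    for name, (lower, upper) in limits.items():
        scaled = [percent_of_range(run[name], lower, upper) for run in runs]
        spreads.append(statistics.pstdev(scaled))
    return statistics.mean(spreads)

# Hypothetical limits and calibrated values for two parameters and three runs.
limits = {"wall_resistance": (0.0, 10.0), "air_capacitance": (100.0, 300.0)}
runs = [
    {"wall_resistance": 2.0, "air_capacitance": 150.0},
    {"wall_resistance": 4.0, "air_capacitance": 200.0},
    {"wall_resistance": 6.0, "air_capacitance": 250.0},
]
spread = mean_parameter_spread(runs, limits)
print(f"mean standard deviation = {spread:.1f}% of the allowed range")
```

Scaling each parameter to its own limits before taking the standard deviation makes parameters with very different units comparable, which is what allows a single mean value (19% for C1V5 and 13% for C3V3) to summarise all 11 parameters.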
Figure 5. Line graphs of GB model performance showing empirical and predicted air temperatures using C1V5 calibration approach for calibration weeks only.
However, when the GB model was calibrated using one week of data (C1V5), the accuracy of the model was far less consistent in comparison to the C3V3 results, with mean RMSE values of between 0.21 °C and 1.52 °C, and maximum RMSE values of between 1.46 °C and 2.43 °C. Pearson correlation coefficients ranged from 0.28 to 0.99 across the calibration and validation periods. From Figure 8 we can see that the calibration periods for C1V5 (see Figure 5 for line graphs) are highly accurate. However, the corresponding validation periods are typically far less accurate, with lower Pearson correlation and higher RMSE. In contrast, the calibration periods for C3V3 (see Figure 9 for a line graph of C123_V456) were not as accurate as those for C1V5, but the validation periods were far more accurate. The number of weeks used for the calibration of parameters in GB models therefore affects the accuracy and repeatability of results (for a breakdown of individual weeks see Figure 10). The inconsistency in the results for C1V5 is likely a result of overfitting to the conditions present in the individual calibration week. This is reflected in the mean RMSE values for each calibration week or series of weeks and in the difference in standard deviation between C1V5 and C3V3. Using three weeks of data (C3V3) rather than one week (C1V5) for calibration reduced the mean increase in RMSE between the calibration and validation periods by over 160 percentage points (from 229% to 62%). Using one week of data (C1V5) increased the mean standard deviation in the final parameter selections by 46% (from 13% to 19%).
Figure 9 displays the difference observed between recorded air temperature data, GB model predictions and WB model predictions in the NBERT open plan office for both calibration and validation periods. Table 6 displays the Pearson coefficient and the RMSE for both models for the calibration period (weeks 1-3) and the validation period (weeks 4-6). During the calibration period the GB model produced a higher Pearson coefficient for week 2 (0.82), while the WB model performed better in week 3 (0.86). In week 1 (non-occupied) both models produced an equivalent Pearson coefficient (0.99). The RMSE in the calibration period for the GB model was considerably lower in all three weeks (0.27 °C, 0.46 °C and 0.59 °C).
During the validation period the GB model produced a higher Pearson coefficient for week 3 (0.9), while the WB model performed better in weeks 2 and 3 (0.88 and 0.6). The RMSE values of the WB model over the validation period were lower than those of the GB model for weeks 1 and 3 (0.46 °C and 0.77 °C), while the GB model produced a lower RMSE in week 2 (1.18 °C). The mean RMSE of the GB model during the validation period was 109% higher than its mean RMSE during the calibration period, whereas the mean RMSE of the WB model increased by only 28% between the calibration and validation periods. The drop in GB model accuracy between the calibration and validation periods may be due to the overfitting of the tuning parameters to the calibration data as discussed in Section 3.2 (for the selected parameters see Figure 7, C123_V456).

When directly comparing the performance of the models, we can see the GB model substantially outperformed the WB model over the calibration period. The mean RMSE of the GB model (0.44 °C) was 38% less than that of the WB model (0.71 °C) over the same period. However, during the validation period the mean RMSE of the GB model (0.92 °C) was slightly higher than that of the WB model (0.91 °C), and the correlation between the measured empirical data and the model predictions was stronger for the WB model. The GB model was far more accurate during the calibration period, but this can be attributed to the automatic calibration method, whose optimisation algorithm was able to converge on a solution. The WB model relied on a manual calibration method, which used a piecemeal evidence-based approach for a larger number of parameters. It must also be noted that the WB model parameters were calibrated using data from weeks 1 and 2, while week 3 was used for testing purposes (as described in Section 3.3).
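The percentage comparisons quoted above follow directly from the reported mean RMSE values. The short check below is a sketch; the helper name `percent_change` is an assumption, and the inputs are the rounded means given in the text, so results derived from unrounded values may differ slightly.

```python
def percent_change(reference, value):
    """Relative change of `value` with respect to `reference`, in percent."""
    return 100.0 * (value - reference) / reference

# Mean RMSE values (°C) reported for calibration (weeks 1-3) and validation (weeks 4-6).
gb_cal, gb_val = 0.44, 0.92
wb_cal, wb_val = 0.71, 0.91

gb_increase = percent_change(gb_cal, gb_val)      # GB: calibration -> validation
wb_increase = percent_change(wb_cal, wb_val)      # WB: calibration -> validation
gb_vs_wb_cal = -percent_change(wb_cal, gb_cal)    # GB below WB during calibration

print(round(gb_increase), round(wb_increase), round(gb_vs_wb_cal))  # 109 28 38
```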
While the WB model is only slightly more accurate than the GB model for the validation period, this comparison suggests the WB modelling technique produced better generalisation abilities. Figure 8 provides a breakdown of the accuracy of the individual weeks used for calibration and validation in C1V5 and C3V3 for the GB model. The accuracy of the calibration and validation periods for the WB model is represented by the dashed box. From this, we can see that the GB model outperforms the WB model for the calibration periods, especially for the C1V5 configuration. For the validation periods, the GB model is more likely to produce less accurate results than the WB model, with the C1V5 configuration producing the least accurate results.

Figure 10 displays the calibration and validation results for individual weeks contained in the C1V5 and C3V3 configurations. We can see that the calibration periods for both C1V5 and C3V3 fit the winter period very well with low RMSE values (≤0.27 °C). However, when the GB model was validated on the winter season, the validation RMSE rose to 0.84 °C for the C3V3 configuration and to 2.43 °C for the C1V5 configuration, while the correlation remained very high (r ≥ 0.98). The summer calibration periods produced RMSE values between 0.21 °C and 0.49 °C, while the validation periods resulted in RMSE values between 0.53 °C and 1.46 °C, with r values between 0.76 and 0.95. The shoulder season produced the widest spread in RMSEs, with calibration values between 0.22 °C and 0.71 °C and validation values between 0.38 °C and 2.39 °C. The r values ranged between 0.28 and 0.96. From these results we can see that calibrating the GB model using one week of data from the winter or summer period results in high errors and poor correlations when the model is then validated on shoulder season periods.
Likewise, when the GB model is calibrated using one week from the shoulder season periods and validated on the summer or winter period, it generally performs poorly. Occupancy was zero in the winter period, low in the summer period and high in the shoulder season (Figure 3). These results indicate that the OB noise in the shoulder season has a negative effect on model calibration and validation. While the RMSE values for validation on the winter period vary considerably for GB models that have been trained using one week of data from another period, the correlation coefficient remains consistently very high. This may be due to the lack of occupants during this period and therefore the elimination of OB noise. When the C3V3 configuration is employed, the magnitude of the validation error for all seasons reduces substantially, as the calibration period generally contains a varied mix of data from multiple seasons and occupancy levels.

Discussion
From the results in Section 4.1.1 we can see the simple GB model employed in this study was capable of accurately capturing the dynamic characteristics of the internal air temperature profile when calibrated over one week using an automatic algorithm (Figure 5). However, depending on the week selected as the calibration period (varying season and occupancy levels), the parameters selected by the algorithm varied substantially (Figure 6), which resulted in varied and generally poor model performance when validated (Table 4). Many previous studies utilised only one week or less of empirical data (single season) for model calibration purposes. Using a similarly sized single-season calibration period (C1V5) for the RC model and calibration algorithm used in this study did not produce consistently accurate results. When additional weeks (across varying seasons) were added to the calibration period, the consistency of the model's performance substantially improved. However, the generalisation abilities of the C3V3 GB model were still below those of the WB model. The C3V3 GB model did, however, produce accuracy levels comparable to those of the WB model (mean RMSE of the GB model was 1.5% higher than mean RMSE of the WB model).
From a practitioner's perspective, the GB modelling and automatic model calibration techniques were found to be far more straightforward and time-efficient methods of simulating internal air temperature in an nZEB than a WB model manually calibrated using a piecemeal evidence-based approach. The WB model had a complex structure and accounted for many other factors which interact with internal air temperature (such as relative humidity), while the GB model had a simple RC structure, fewer interactions and fewer parameters. The manual evidence-based WB calibration method was entirely dependent on lengthy human interaction, while the automatic calibration method required minimal human interaction. The mean time for the calibration algorithm to converge on a final set of parameters for the full C3V3 configuration (six individual calibration periods made up of three weeks each) was 148 s using a six-core Intel i7 3930 3.2 GHz processor with parallel computing enabled. The authors of this study estimated that by employing the GB modelling method with an automatic model calibration technique, the human labour input to simulating internal air temperature was reduced by approximately 90% relative to WB modelling using a manually calibrated evidence-based approach. The labour reduction applies only to the model development, calibration and validation time and does not include the time required to record and process the empirical data, which would be the same for both methods. This is a significant decrease in human labour input, albeit with a slight decrease in accuracy and a drop in generalisation abilities for this application. In this study, the WB and GB modelling techniques and their corresponding calibration methods were each found to have distinct merits.

Conclusions
In this study, a GB model for a naturally ventilated nZEB was developed and calibrated using multiple data configurations and an automatic calibration algorithm. The GB model was then compared to a WB model for the same application using identical calibration and validation datasets. The following conclusions were drawn from the results:

• The GB modelling method used in this study was capable of simulating the dynamic internal air temperature profile of a naturally ventilated nZEB.
• Utilising only one week for the GB model calibration dataset resulted in overfitting. When three weeks of data from varying seasons were used, the GB model was able to consistently produce more accurate results for the validation periods.
• The season and level of occupancy in the calibration and validation data had a strong influence on the GB model's accuracy levels.
• When calibrated and validated using identical data, the WB model produced slightly more accurate results than the GB model and displayed better generalisation abilities.
• Although the GB model was slightly less accurate than the WB model (mean RMSE 1.5% higher), the authors found the development time to be significantly lower for the GB automatic calibration method in comparison to the WB manual calibration method (approx. 90% reduction in human time input).