3. Data Sources of Training Sets and Test Sets
The Grasshopper (v1)–Honeybee (v1.7.26)–EnergyPlus integrated simulation platform provided a unified technical framework for this study, covering the full process from parametric modeling to building energy simulation. Grasshopper, as a visual parametric modeling tool operating within the Rhino (v8) environment, enables the rapid generation and systematic adjustment of building geometry, envelope parameters, and operational conditions. Honeybee serves as an interface plugin linking parametric models with building performance simulation engines, allowing building geometry, material properties, weather data, and operational schedules to be effectively transferred to EnergyPlus. EnergyPlus, a mature building thermal and energy simulation engine, performs dynamic calculations of building heating loads and heating energy use intensity during the operation stage. Through the integration of these three tools, this study was able to efficiently generate large-scale simulation samples. These samples covered different climate zones, building functions, and occupant activity intensity levels. Meanwhile, consistency in the simulation workflow, flexibility in parameter control, and comparability of simulation results were ensured.
To facilitate subsequent data management and statistical analysis, a hierarchical database structure was adopted to organize the simulation samples. At the first level, the database was classified by climate zone into three categories: severe cold, cold, and hot summer and cold winter, represented by the typical cities of Harbin, Beijing, and Shanghai, respectively. The HDD65 values for Harbin, Beijing, and Shanghai are 9698, 5497, and 3250, respectively. In this study, HDD refers to heating degree days, which is used to characterize the influence of climatic conditions on building heating demand. More specifically, HDD65 refers to heating degree days calculated with a base temperature of 65 °F (approximately 18.3 °C). It is obtained by accumulating the daily temperature difference when the outdoor mean temperature falls below this base temperature. Within each climate zone, the samples were further subdivided by building function into office, residential, and commercial buildings. In addition, occupant activity intensity was classified into three scenarios: 100 W/person, 150 W/person, and 200 W/person. Through this design, the sample database formed a clear tree-like hierarchical structure, which facilitated subsequent sample retrieval, comparison, and modeling across different dimensions, including city, building function, and occupant activity intensity.
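The HDD65 definition above can be illustrated with a minimal sketch. The function name and the one-week temperature series are hypothetical and serve only to show the accumulation rule; they are not the weather data used in this study.

```python
def heating_degree_days(daily_mean_temps_f, base_temp_f=65.0):
    """Accumulate (base - daily mean) over days whose mean is below the base."""
    return sum(base_temp_f - t for t in daily_mean_temps_f if t < base_temp_f)


# Hypothetical week of daily mean outdoor temperatures in degrees Fahrenheit.
sample_week_f = [70.0, 64.0, 50.0, 65.0, 32.0, 68.5, 60.0]

# Only the days at 64, 50, 32 and 60 degF contribute: 1 + 15 + 33 + 5.
hdd65_week = heating_degree_days(sample_week_f)
print(hdd65_week)
```

Days at or above the base temperature contribute nothing, so the index grows only with the depth and duration of cold periods.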
Subsequently, the Grasshopper–Honeybee–EnergyPlus integrated simulation platform was employed to simulate the heating energy use intensity of buildings under different climate zones, building functions, and occupant activity intensity levels. A total of 1620 simulation samples were generated in this study. These samples were then divided into a training set and a test set at a ratio of 7:3, resulting in 1134 training samples and 486 test samples.
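The 7:3 partition can be sketched as follows. This assumes a simple random shuffle of sample indices; the source does not state whether the split was stratified, and the function name and seed are illustrative only.

```python
import random


def split_indices(n_samples, train_parts=7, test_parts=3, seed=0):
    """Shuffle sample indices and split them by an integer ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * train_parts // (train_parts + test_parts)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1620)
print(len(train_idx), len(test_idx))
```

With 1620 samples the integer ratio gives 1620 × 7 / 10 = 1134 training samples and leaves 486 for testing, matching the counts reported above.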
5. Model Results and Validation
Three Markov Chain Monte Carlo (MCMC) chains were initialized with dispersed starting values in the parameter space, and the thinning interval was set to 1. After 30,000 iterations (including 10,000 burn-in iterations), all chains achieved a potential scale reduction factor (PSRF) ≤ 1.1, confirming stable convergence to the target posterior distribution.
Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 show the trace plots of the three chains, excluding the burn-in phase, for the unknown parameters in the model.
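The convergence criterion can be illustrated with a minimal sketch of the classic Gelman-Rubin potential scale reduction factor. The source does not state which PSRF variant or software was used, so the formula choice, the function name, and the synthetic chains below are assumptions for illustration only.

```python
import random
from statistics import fmean, variance


def psrf(chains):
    """Classic Gelman-Rubin PSRF for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [fmean(chain) for chain in chains]
    within = fmean([variance(chain) for chain in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)                    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n            # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(42)
# Synthetic post-burn-in draws (2000 per chain here, far fewer than in the study).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Chains stuck around different levels, mimicking a failure to converge.
stuck = [[rng.gauss(offset, 1.0) for _ in range(2000)] for offset in (0.0, 5.0, 10.0)]

psrf_mixed = psrf(mixed)
psrf_stuck = psrf(stuck)
print(psrf_mixed, psrf_stuck)
```

Chains sampling the same distribution give a value close to 1, whereas chains centered on different levels give a value well above the 1.1 threshold.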
As shown in Figure 2, after the first 10,000 iterations were discarded as burn-in, good stationarity and mixing were achieved across the three Markov chains for both parameters shown. No obvious drift, monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been reached. For the first parameter, fluctuations were distributed around a relatively stable central region, and the chains overlapped substantially and crossed frequently, suggesting that the posterior space was adequately explored. For the second parameter, a similar pattern was identified, with random fluctuations confined to a narrower interval around zero and no persistent chain separation detected. The second parameter therefore exhibited a smaller fluctuation range than the first, implying lower posterior dispersion and greater estimation stability. Overall, satisfactory convergence was achieved for both parameters after burn-in removal, and the posterior samples were considered reliable for subsequent inference and model interpretation.
As shown in Figure 3, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved across the three Markov chains for both parameters shown. No obvious drift, persistent monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been reached for both parameters. For the first parameter, fluctuations were confined to a relatively narrow interval around a stable central level. The three chains overlapped substantially and crossed frequently, suggesting that the posterior region was adequately explored and that good between-chain consistency was achieved. For the second parameter, a similar pattern was identified. Although its fluctuations were slightly wider, the chains remained highly interwoven and no structural shift or abnormal jump was detected. Overall, satisfactory convergence was considered to have been attained for both parameters after burn-in removal, and the resulting posterior samples were regarded as reliable for subsequent inference and model interpretation.
As shown in Figure 4, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved across the three Markov chains for both parameters shown. For both parameters, random fluctuations were maintained around relatively stable central levels during the post-burn-in phase. No obvious drift, persistent monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been attained.
For the first parameter, the three chains overlapped substantially and crossed frequently. Although local short-term oscillations were present, no structural shift or persistent separation was detected. This pattern suggests that its posterior space was adequately explored and that good between-chain consistency was achieved. For the second parameter, a similar trace pattern was exhibited. The chains were highly interwoven and remained concentrated within a stable interval. Slightly narrower fluctuations were observed for this parameter, implying comparatively lower posterior dispersion and greater estimation stability. Overall, satisfactory convergence was considered to have been reached for both parameters after burn-in removal.
As shown in Figure 5, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved across the three Markov chains for both parameters shown. Random fluctuations were maintained around stable central levels, and substantial overlap was observed among chains. No obvious drift, persistent trend, or prolonged local sticking was detected. For both parameters, only short-term local oscillations were exhibited, without structural separation or abnormal jumps. These trace patterns indicate that the posterior distributions were adequately explored and that satisfactory convergence was reached after burn-in removal.
As shown in Figure 6, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved across the three Markov chains for both parameters shown. For the first parameter, random fluctuations were confined to a relatively narrow interval, and the chains overlapped substantially and crossed frequently, indicating stable posterior sampling and good between-chain consistency. For the second parameter, wider fluctuations were exhibited, implying comparatively greater posterior dispersion. However, no obvious drift, persistent separation, or prolonged local sticking was detected. Overall, satisfactory convergence was considered to have been reached for both parameters after burn-in removal, and the resulting posterior samples were regarded as reliable for subsequent inference.
Overall, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 indicate that the three chains reached satisfactory convergence for all parameters after the burn-in phase. The chains are well mixed, overlap closely, and fluctuate randomly around stable central values. These visual results are consistent with the PSRF values below 1.1, further confirming that the posterior estimates are stable and credible for subsequent parameter interpretation and prediction.
Predictive accuracy was quantified using the metrics recommended by ASHRAE Guideline 14-2023 [30]. The guideline recommends a CVRMSE threshold of 15% and an NMBE threshold of 5%. The formulae for these indices are presented in Formulae (5)–(7).
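As a hedged reference, the three indices are commonly written as shown below, with $\hat{y}_i$ and $y_i$ denoting the predicted and observed heating EUI of sample $i$, $n$ the number of test samples, and $\bar{y}$ the mean observed value. This is a sketch of the usual definitions and not necessarily the exact form of Formulae (5)–(7): some formulations divide by $n - p$ rather than $n$, and the sign convention of NMBE (predicted minus observed is assumed here) varies between sources.

```latex
\begin{aligned}
\mathrm{RMSE}   &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}} \\
\mathrm{CVRMSE} &= \frac{\mathrm{RMSE}}{\bar{y}} \times 100\% \\
\mathrm{NMBE}   &= \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)}{n\,\bar{y}} \times 100\%
\end{aligned}
```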
where $\hat{y}_i$ is the predicted heating EUI of sample $i$, $y_i$ is the observed heating EUI of sample $i$, $n$ is the total number of samples in the test set, and $\bar{y}$ is the mean observed EUI of the test set.
The indices include the Root Mean Square Error (RMSE), the Coefficient of Variation of the Root Mean Square Error (CVRMSE), and the Normalized Mean Bias Error (NMBE). RMSE is commonly used to evaluate the prediction error of regression models. It is obtained by averaging the squared differences between the predicted and actual values and then taking the square root of this average. A smaller RMSE indicates better predictive performance. The CVRMSE normalizes RMSE by the mean of the observed values, providing a relative measure of prediction error. As a dimensionless metric, CVRMSE is expressed as a percentage; the lower its value, the smaller the model’s prediction error and the higher its relative accuracy. CVRMSE is particularly useful when comparing models across datasets of different magnitudes, because it removes the influence of scale differences and thus enables a more equitable comparison. The NMBE measures the systematic bias of a model, that is, whether the predicted values are systematically higher or lower than the actual values. Its value expresses the mean bias between predicted and actual values relative to the mean actual value.
The proposed surrogate model demonstrated robust predictive performance, as evidenced by the statistical metrics NMBE, RMSE, and CVRMSE, which quantify deviations between predicted and observed heating EUI values in the test set. As shown in Figure 12, the model achieved minimal prediction bias, with NMBE = −1.01%, RMSE = 9.69, and CVRMSE = 12.37%. These values fall within the thresholds recommended by ASHRAE Guideline 14-2023 (CVRMSE ≤ 15%, −5% ≤ NMBE ≤ 5%), confirming the model’s accuracy and generalization capability for office building heating EUI prediction.
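The metric calculation and the threshold check can be sketched as follows. The function names and the four-sample dataset are hypothetical, the denominators use n rather than n − p, and NMBE is taken as predicted minus observed; all of these are assumptions rather than the study's confirmed implementation.

```python
import math


def evaluate(observed, predicted):
    """Return (RMSE, CVRMSE in %, NMBE in %) for paired observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    cvrmse = 100.0 * rmse / mean_obs
    nmbe = 100.0 * sum(errors) / (n * mean_obs)
    return rmse, cvrmse, nmbe


def meets_guideline(cvrmse, nmbe, cvrmse_limit=15.0, nmbe_limit=5.0):
    """Check the thresholds quoted in the text: CVRMSE <= 15%, |NMBE| <= 5%."""
    return cvrmse <= cvrmse_limit and abs(nmbe) <= nmbe_limit


# Hypothetical observed and predicted heating EUI values (illustrative only).
observed_demo = [100.0, 80.0, 60.0, 40.0]
predicted_demo = [98.0, 83.0, 59.0, 41.0]
rmse_demo, cvrmse_demo, nmbe_demo = evaluate(observed_demo, predicted_demo)

# The values reported for the test set satisfy both thresholds.
reported_ok = meets_guideline(cvrmse=12.37, nmbe=-1.01)
print(rmse_demo, cvrmse_demo, nmbe_demo, reported_ok)
```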
To further examine the stability and generalization ability of the Bayesian prediction model under different sample partitions, five-fold cross-validation was introduced as an additional robustness validation method. Specifically, the 1620 building heating energy consumption samples were divided into five mutually exclusive subsets. In each iteration, four subsets were used for model training, while the remaining subset was used for testing, ensuring that each subset was used once as the test set.
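The partitioning scheme can be sketched as below. A plain random assignment to folds is assumed; the generator name and seed are illustrative, and the source does not state whether folds were stratified by city or building function.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists so that each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1620))
for train, test in splits:
    print(len(train), len(test))
```

With 1620 samples and five folds, each iteration trains on 1296 samples and tests on the remaining 324.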
As shown in Table 1, the five-fold cross-validation results indicate that the model maintained high prediction accuracy under different sample partitions. The R² values ranged from 0.9668 to 0.9709, with a mean of 0.9683 and a standard deviation of only 0.0016. These results indicate that the model provided strong and stable explanatory power for variations in heating energy use intensity. The mean RMSE was 9.6075, with a standard deviation of 0.1935. The mean CVRMSE was 12.2788%, with a standard deviation of 0.2870%. These results suggest that the overall prediction error was relatively low and that the variation among folds was limited. The mean NMBE was −0.7826%, with a standard deviation of 0.3477%, indicating that the overall bias of the model was close to zero and that no obvious systematic overestimation or underestimation was observed.
In addition, the PSRF values of the Bayesian models in all folds were close to 1, indicating good convergence of the MCMC sampling process and confirming the reliability of the posterior parameter estimates. Overall, the five-fold cross-validation results further demonstrate the robustness and generalization ability of the Bayesian prediction model developed in this study. The model performance did not depend on a specific training–testing split, but remained stable across different sample combinations. Therefore, the validated model can provide a reliable basis for the subsequent extraction of building heating energy baselines.
The posterior distributions of the hyperparameters for the office building heating energy surrogate model are shown in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11. Overall, these posterior densities are concentrated within relatively clear high-probability regions, indicating that the corresponding hyperparameters are well identified by the data and the Bayesian inference procedure. The representative values located in the high-probability density regions were then selected as the optimal hyperparameter estimates for the surrogate model.
As shown in Figure 7, unimodal posterior distributions were obtained for both hyperparameters shown, indicating that relatively well-defined high-probability regions were identified. For the first hyperparameter, the posterior density was mainly concentrated in the positive range, with a clear peak but a comparatively wider spread, suggesting moderate posterior uncertainty. For the second hyperparameter, the density was concentrated within a narrower interval slightly below zero, and a sharper peak was exhibited, implying lower uncertainty and greater estimation stability. No obvious multimodality or irregular dispersion was observed for either parameter. Therefore, reliable representative values were considered to be available from the high-probability posterior regions.
As shown in Figure 8, unimodal posterior distributions were obtained for both hyperparameters shown, indicating that identifiable high-probability regions were established. For the first hyperparameter, the posterior density was concentrated within a narrow interval slightly below zero, and a relatively sharp peak was formed, suggesting that stronger data constraints and lower posterior uncertainty were achieved. For the second hyperparameter, the density was also centered in a mildly negative range, but a broader spread and a flatter peak were exhibited, implying comparatively greater dispersion. No obvious multimodality or irregular tail expansion was observed. Therefore, reliable representative estimates were considered to be supported by the posterior distributions.
As shown in Figure 9, unimodal posterior distributions were obtained for both hyperparameters shown, indicating that stable high-probability regions were identified. For the first hyperparameter, the posterior density was centered in a mildly negative range and was spread over a relatively wider interval, suggesting that moderate posterior variability was retained. For the second hyperparameter, the density was also concentrated in the negative region, but a sharper peak and shorter tails were exhibited, implying stronger concentration and greater estimation stability. No obvious multimodality or irregular spreading was observed. Therefore, reliable representative values were considered to be supported by the posterior high-density regions.
As shown in Figure 10, smooth and unimodal posterior distributions were obtained for both hyperparameters shown, indicating that well-defined probability concentration regions were established. For the first hyperparameter, the posterior density was centered near −2.00, and a compact distribution with a clear peak was exhibited, suggesting limited posterior variability and strong estimation stability. For the second hyperparameter, the density was concentrated near 1.08 and a similar overall shape was observed, although a slightly wider spread was retained. No obvious multimodality, irregular skewness, or excessive tail extension was detected. Therefore, reliable representative estimates were considered to be supported by the posterior high-density regions.
As shown in Figure 11, smooth and unimodal posterior distributions were obtained for both hyperparameters shown, indicating that clearly identifiable probability concentration regions were established. For the first hyperparameter, the posterior density was centered near 1.20, and a relatively symmetric and compact shape was exhibited, suggesting stable estimation and limited posterior uncertainty. For the second hyperparameter, the density was concentrated near 38, while a visibly wider spread was retained, implying greater residual variability. Nevertheless, no multimodality, discontinuity, or excessive tail extension was detected. Therefore, both hyperparameters were considered to be reliably characterized by their posterior high-density regions.
Taken together, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show that generally smooth and unimodal posterior distributions were obtained for all model parameters, with no obvious multimodality, irregular fragmentation, or uncontrolled dispersion being observed. This indicates that the parameters were well identified through the Bayesian inference procedure and that the posterior estimates can be regarded as stable and credible overall.
To further evaluate the predictive performance of the surrogate model, the trained model was applied to the test set to predict the heating EUI of 486 building samples. Figure 12 compares the actual and predicted values of heating EUI for the test samples. Overall, the predicted results agree well with the observed data, indicating that the model has good generalization ability and satisfactory predictive accuracy.
As shown in Figure 12, a clear linear correspondence was exhibited between the observed heating EUI values and the model predictions for the building samples in the test set. Most scatter points were distributed close to the 1:1 reference line, indicating that the overall variation trend of heating EUI was reproduced well by the surrogate model. The ±5% PE and ±10% PE bands are also shown in the figure, allowing the prediction deviation to be assessed more intuitively. Overall, a large proportion of the samples fell within the ±10% error bands, and a considerable number were located near the ±5% bands. These results suggest that good agreement was achieved between the predicted and observed values. When different EUI ranges were examined, a relatively compact distribution was observed in the low-EUI region, where close agreement with the reference line was maintained. In the medium-EUI range, a slight increase in scatter dispersion was exhibited; however, most samples remained within a limited deviation range, indicating that stable predictive performance was preserved. In the higher-EUI range, a broader spread was observed and several points deviated more visibly from the reference line, yet the overall clustering trend remained consistent and no severe distortion was detected. This suggests that, although the prediction uncertainty increased somewhat with increasing heating EUI, acceptable predictive accuracy was still maintained.
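The share of samples falling inside an error band can be quantified with a short sketch. It assumes that PE denotes the percentage error of a prediction relative to its observed value, which the text does not define explicitly; the function name and the four sample pairs are hypothetical.

```python
def share_within_band(observed, predicted, band_percent):
    """Fraction of predictions within +/- band_percent of the observed value."""
    inside = sum(
        1
        for o, p in zip(observed, predicted)
        if abs(p - o) <= band_percent / 100.0 * abs(o)
    )
    return inside / len(observed)


# Hypothetical observed and predicted heating EUI values (illustrative only).
observed_demo = [100.0, 100.0, 100.0, 100.0]
predicted_demo = [103.0, 96.0, 108.0, 115.0]

share_5 = share_within_band(observed_demo, predicted_demo, 5)
share_10 = share_within_band(observed_demo, predicted_demo, 10)
print(share_5, share_10)
```

Reporting these shares alongside the scatter plot would turn "a large proportion" into a concrete percentage for each band.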
In addition, the scatter points were distributed on both sides of the 1:1 line rather than being concentrated on a single side. This implies that no pronounced systematic overestimation or underestimation was introduced by the model. The observed deviations therefore appear to have been dominated mainly by random variation rather than structural bias. In summary, Figure 12 shows that the predicted values agree well with the observed values for the test samples, with most points falling within reasonable error bands and no obvious systematic bias being detected. Good accuracy, stability, and generalization ability were therefore demonstrated by the proposed model.
6. Results and Discussion
6.1. Pattern Analysis of District Heating EUI in Office Buildings: Impact of Building Scale
The surrogate model developed in this study was further applied to simulate the heating energy use intensity (EUI) of office buildings with scales ranging from 100 to 100,000 m² under different climatic conditions. For this analysis, the building function was specified as an office building, the occupant activity intensity was fixed at 100 W/person, and the WWR was set to 0.25. The HDD65 values considered were 3250, 5497, and 9698. Figure 13 presents both the heating EUI and its marginal change with increasing building scale under different HDD65 conditions.
As shown in Figure 13, a continuous decrease in heating energy use intensity (EUI) was exhibited as building scale increased under all three climatic conditions represented by HDD65 values of 3250, 5497, and 9698. At the same time, the rate of change in heating EUI was observed to move progressively from larger negative values toward zero. These two sets of curves, when interpreted together, indicate that the influence of building scale on heating EUI should not be regarded as linear. Instead, a pronounced nonlinear attenuation pattern was revealed, in which the strongest scale effect was produced in the small-building range and was then gradually weakened as floor area increased.
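The attenuation pattern can be illustrated with a finite-difference sketch of the marginal change. It assumes the rate of change is the change in EUI per unit of added floor area, which the text does not define explicitly; the EUI values are hypothetical placeholders in arbitrary units, not outputs of the surrogate model.

```python
def marginal_change(areas, eui_values):
    """Finite-difference rate of change of EUI between consecutive scales."""
    return [
        (eui_values[i + 1] - eui_values[i]) / (areas[i + 1] - areas[i])
        for i in range(len(areas) - 1)
    ]


# Floor areas in square metres at the transition points discussed in the text.
areas_m2 = [100, 1000, 3000, 5000, 20000]
# Hypothetical EUI values (arbitrary units), chosen only to mimic a decaying curve.
eui_demo = [120.0, 90.0, 80.0, 76.0, 70.0]

rates = marginal_change(areas_m2, eui_demo)
print(rates)
```

For a curve of this shape every rate is negative and each is closer to zero than the previous one, which is the nonlinear attenuation described above.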
A stable climatic ordering was maintained across the entire scale range. For any given building size, the highest EUI was associated with HDD65 = 9698, the intermediate level was associated with HDD65 = 5497, and the lowest EUI was associated with HDD65 = 3250. This pattern suggests that climatic severity remained the dominant factor controlling the absolute level of heating demand per unit floor area. Nevertheless, a gradual narrowing of the absolute gap among the three EUI curves was also observed as the building scale increased. This feature is important because it implies that scale enlargement can partially mitigate the amplification effect imposed by colder climates on unit heating demand, even though the overall climatic ranking is not altered.
Particular attention should be paid to the small-scale interval between approximately 100 and 1000 m². In this range, the decline in EUI was found to be the steepest for all three climatic conditions. The right-axis curves further show that the magnitude of the negative rate of change was largest in this stage, especially under HDD65 = 9698. This means that the marginal reduction in unit heating demand generated by scale enlargement was most pronounced in small buildings and was amplified under severe heating climates. When the building scale was expanded from approximately 1000 to 3000 m², a second regime was entered. The EUI continued to decline. However, the slope of the orange curves became visibly gentler than that observed in the first stage. Meanwhile, the green curves were still negative, although their magnitudes had already been reduced considerably. This indicates that scale remained influential in this interval, but the marginal energy-saving return per additional unit of floor area had begun to weaken. The vertical markers at 1000 and 3000 m² can therefore be interpreted as practical transition points rather than arbitrary graph annotations. A further transition was then observed in the 3000–5000 m² range. In this interval, the decrease in EUI was preserved, yet the curves became even flatter and the change-rate lines moved closer to zero. This stage may be interpreted as the boundary between the “sensitive scale–response zone” and the “stabilizing zone.” Importantly, the pace at which the change-rate curves approached zero was not identical across climates. Under HDD65 = 3250, the right-axis curve had already become very close to zero by this stage, whereas under HDD65 = 5497 and especially HDD65 = 9698, a noticeable negative rate was still retained. Thus, it was not only the EUI magnitude that was altered by climate, but also the persistence of the scale effect itself.
In the larger-scale interval from approximately 5000 to 20,000 m², the three EUI curves were observed to become substantially flatter. At this point, additional increases in building size still produced some reduction in unit heating demand, but the benefit was clearly smaller than that seen in small and medium buildings. The most meaningful insight here is that the duration of the scale effect was found to vary with climatic severity. Under the mildest of the three climates, the change-rate curve approached zero relatively early. Under the intermediate climate, the convergence was delayed. Under the coldest climate, a weak but still visible response to scale was maintained over a broader size range. This suggests that in colder regions, building enlargement continues to affect heating EUI for a longer portion of the scale spectrum.
Beyond approximately 20,000 m², all three EUI curves moved toward an almost stable plateau, and the corresponding rate-of-change curves converged very closely to zero. It may therefore be inferred that the scale effect had entered a mature diminishing-returns stage. In other words, once office buildings reach a sufficiently large size, the contribution of further scale enlargement to reducing heating EUI becomes marginal.
At the same time, the results suggest that these transition characteristics are modulated by climatic severity, in that under higher HDD65 conditions, the scale effect persists over a longer scale range and attenuates more slowly. This analysis indicates that the scale effect is reflected not only in the decline in heating EUI with increasing building size but also in the variation in response intensity across different scale intervals and in its climatic modulation.
Overall, Figure 13 demonstrates that a robust negative relationship between office building heating EUI and building scale was established, but that this relationship was not uniform across the entire scale range. A stage-wise pattern was revealed: a highly sensitive decline at small scales, a moderated but still meaningful response at medium scales, and an increasingly stable regime at larger building scales. At the same time, a clear climatic modulation of this pattern was identified. Higher HDD65 values were associated not only with higher EUI levels, but also with a stronger and more persistent scale effect. More specifically, climatic severity remained the dominant factor controlling the absolute level of heating demand per unit floor area. However, as building scale increased, the absolute differences in heating EUI among different climatic conditions gradually narrowed, indicating that scale enlargement can partially offset the amplifying effect of colder climates on unit-area heating demand. In addition, the persistence of the scale effect varied across climatic conditions. In colder regions, the influence of building enlargement on heating EUI was maintained over a broader range of building scales and attenuated more slowly. These results suggest that the scale effect remained effective for a longer portion of the scale spectrum under more severe heating climates. These findings suggest that scale-based heating benchmarks for office buildings should be established jointly with climatic stratification, rather than through a single universal reference. Such an approach would allow the physical mechanism behind observed energy differences to be represented more accurately and would improve the scientific basis of benchmark setting, comparative evaluation, and energy-efficiency-oriented design decisions.
It should be further noted that the results observed around approximately 1000 m², 3000 m², 5000 m², and 20,000 m² are not interpreted in this study as universal fixed thresholds. Instead, they are regarded as threshold-like transition ranges. These ranges were identified under the present dataset, variable setting, and modeling framework. Their main significance lies in showing that the influence of building scale on heating EUI is not uniform across the full scale spectrum, but instead exhibits a clear stage-wise sensitivity pattern. At the same time, the results show that the overall pattern of these scale-related transition ranges remains broadly consistent across different climate zones considered in this study. Although the absolute level of heating EUI varies with climatic severity, the stage-wise threshold pattern of the scale effect is generally similar across climates. This suggests that the identified threshold pattern has a certain degree of robustness within the present research framework.
6.2. Reference for the Development of Heating Energy Benchmarks in Beijing
Consequently, the findings of this study were compared with the current energy benchmark regulations for public buildings in Beijing. These findings may provide a preliminary reference for improving benchmark classification and energy management by building scale. However, they should not be regarded as a direct basis for policy adjustment. The results suggest that relatively small public buildings (≤3000 m²) may deserve greater attention in energy management and benchmark evaluation. Buildings in the range of 3000–5000 m² may require more context-specific assessment based on their actual energy consumption characteristics. In addition, benchmark values for public buildings ≤3000 m² may need to be considered separately from those of larger buildings. The benchmark setting for very large public buildings (≥20,000 m²) should also be examined with caution, especially when they are compared with medium-sized categories. In contrast, the U.S. Energy Information Administration (EIA) subdivides public buildings into 10 building-size categories. The classification thresholds are 93, 465, 929, 2323, 4645, 9290, 18,581, 46,452, and 92,903 m². These values are equivalent to 1000, 5000, 10,000, 25,000, 50,000, 100,000, 200,000, 500,000, and 1,000,000 ft², respectively. This classification could serve as a reference for refining building scale categories in urban public building energy benchmarks in China. This approach would facilitate the rapid collection of operational data. When combined with suitable energy prediction methods, it would also allow energy benchmarks to be determined across different scales, climatic conditions, building configurations, and operating scenarios.
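The correspondence between the two sets of EIA thresholds can be checked with a short sketch that converts square feet to square metres and assigns a building to one of the ten size categories. The function name is hypothetical, and treating a building exactly at a threshold as belonging to the lower category is an assumption.

```python
from bisect import bisect_left

SQFT_TO_M2 = 0.09290304  # exact conversion: (0.3048 m) squared

thresholds_sqft = [
    1_000, 5_000, 10_000, 25_000, 50_000,
    100_000, 200_000, 500_000, 1_000_000,
]
# Rounded to whole square metres, as quoted in the text.
thresholds_m2 = [round(area * SQFT_TO_M2) for area in thresholds_sqft]


def size_category(floor_area_m2):
    """Return the 0-based index of the size category (0 = smallest, 9 = largest)."""
    return bisect_left(thresholds_m2, floor_area_m2)


print(thresholds_m2)
print(size_category(800), size_category(25_000))
```

Nine thresholds define ten categories, so a 800 m² building falls in the third category and a 25,000 m² building in the seventh.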
It should be noted that this study adopts a relatively simplified variable set and simulation-based sample data in order to balance data availability, model interpretability, and the practical needs of benchmark analysis. Although variables such as climate zone, building function, occupant activity intensity, building scale, and building-form-related characteristics can capture the major sources of variation in heating energy use, factors such as envelope thermal performance, HVAC system efficiency, and fine-grained operational management were not explicitly included. Therefore, the model results are more suitable for revealing the relative variation pattern of heating EUI and its scale effect, rather than fully representing all mechanisms governing actual building energy use. In addition, although the simulation-based dataset allows the variation in heating energy consumption to be systematically examined under unified boundary conditions, the results may still differ from actual operational data. Accordingly, the findings of this study should be interpreted with caution in terms of real-world application, and are better regarded as a reference for heating energy benchmark research and classified management analysis. Future research could incorporate more measured data, together with more detailed envelope, system, and operational variables, to improve the practical applicability and interpretive depth of the model.
From a methodological perspective, the contribution of the Bayesian framework in this study lies not only in model estimation but also in providing a probabilistic basis for benchmark interpretation. The posterior results make it possible to discuss the scale effect, climate modulation, and threshold-like transition ranges together with their associated uncertainty, which strengthens the interpretability and practical relevance of the benchmark-oriented analysis.