Research on Heating Energy Benchmarks for Office Buildings Based on Bayesian Framework

Na, Wei; Li, Yinlong

doi:10.3390/buildings16101853

by Wei Na * and Yinlong Li

Beijing Key Lab of Heating, Gas Supply, Ventilating and Air Conditioning Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(10), 1853; https://doi.org/10.3390/buildings16101853

Submission received: 30 March 2026 / Revised: 30 April 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

(This article belongs to the Special Issue Recent Developments and Future Perspectives in Heating, Ventilation and Air-Conditioning Systems of Buildings)

Abstract

Establishing a reliable heating energy benchmark for urban buildings is essential for effective energy management, yet benchmark accuracy is often constrained by multiple building characteristics and uncertainty in energy prediction. This study investigated the influence of scale heterogeneity on the heating energy use intensity (EUI) of office buildings. A Bayesian surrogate model was developed, trained, and validated, yielding acceptable accuracy, with a CVRMSE of 12.37% and an NMBE of −1.02%, both within the limits recommended by ASHRAE Guideline 14-2023. The validated model was then used to simulate the heating EUI of office buildings with floor areas from 100 to 100,000 $m^{2}$ under climatic conditions ranging from 3250 to 9698 HDD65. The results showed a clear inverse relationship between building scale and heating EUI. Smaller buildings were more sensitive to scale variation, with pronounced declines around 1000 and 3000 $m^{2}$, while the decline rate weakened beyond 5000 $m^{2}$. Climatic severity remained the dominant factor controlling the absolute level of heating demand, but the climatic differences in heating EUI gradually narrowed as building scale increased. Moreover, the scale effect persisted longer under colder climatic conditions. These findings provide a reference for establishing scale-sensitive heating energy benchmarks in urban public buildings.

Keywords: building heating EUI; Bayesian model; building scale; energy consumption prediction; energy benchmarks

1. Introduction

1.1. Background

Energy is essential for human survival and sustainable socioeconomic development [1]. With the continuous growth in societal energy demand and the depletion of fossil fuel reserves, identifying efficient energy-saving approaches and reducing emissions have become increasingly important [2,3]. As one of the three largest energy-consuming sectors, the building sector accounts for approximately 32% of global total energy consumption [4]. Building energy use mainly consists of the energy consumed during the production of building materials, the construction phase, and the operation phase. Among these, operational energy consumption accounts for approximately 47% of the total life-cycle energy use of buildings. Therefore, improving energy efficiency during the building operation phase is of great significance.

A building energy consumption benchmark refers to the upper limit of total energy consumption required to maintain the operation of equipment, facilities, and sanitary conditions in a building during a specified benchmark period. It serves as an important indicator for evaluating building energy use and operational performance [5]. Many countries have implemented relevant policies or issued official documents concerning building energy consumption benchmarks. In the United States, the Energy Star benchmarking tool has been used to evaluate the energy performance of more than 30,000 commercial buildings. Using the least squares method, energy use intensity per unit floor area is regressed against explanatory variables such as building energy-use characteristics and outdoor climatic parameters to determine benchmark values [6]. In Germany, the VDI 3807 building energy evaluation system assumes that equipment operates under optimal conditions and calculates a theoretical energy consumption value based on rated power and operating parameters. This value is then compared with the actual operational energy consumption to assess building performance [7]. At present, existing studies still lack a more comprehensive classification framework and more reasonable energy consumption benchmarks for public buildings. Therefore, the rational evaluation of urban building heating energy benchmarks has become a major research focus in the field of urban building energy consumption management at the present stage.

Model- or simulation-based methods rely on tools such as EnergyPlus (v26.1.0), DOE-2 (v2.3), and similar platforms to estimate energy-use benchmarks from building parameters, enabling more detailed characterization of end-use load profiles [8,9]. However, their application is often constrained by limited data availability and discrepancies between simulated results and actual building performance [10]. Point-based rating systems, such as LEED, provide qualitative evaluations based on compliance with sustainability criteria. However, they generally lack sufficient granularity for comparing actual energy performance. They can also be labor-intensive when applied across different time scales [11]. Assessments based on end-use energy or performance metrics usually focus on individual building systems. These assessments are often conducted using sub-metered or simulated data. However, their application is restricted by high data collection costs. Their ability to capture the combined effects of multiple variables is also limited [12].

In recent years, Bayesian approaches have begun to be introduced into building energy benchmarking research to improve the treatment of parameter uncertainty and limited sample sizes. Based on monitoring data from heating substations in Beijing, Na and Wang [13] calibrated an urban-scale space heating energy model using Bayesian inference and MCMC. It was demonstrated that a probabilistic framework can be used to characterize district heating EUI benchmarks under conditions of incomplete data and uncertain inputs. Subsequently, an empirical Bayesian method was further proposed by Na and Liu for predicting urban-scale district heating energy benchmarks, and the applicability of Bayesian methods to heating energy benchmark development was further confirmed [14]. Guy et al. [15] applied a Bayesian multilevel model to the benchmarking of commercial building energy use in Europe. In their model, country, building type, and climate factors were considered simultaneously. Full energy-use distributions and uncertainty intervals could also be generated, highlighting the advantages of Bayesian methods for cross-regional and cross-type benchmark modeling. Although Bayesian methods have been applied to building energy prediction and calibration, existing studies have still been focused mainly on urban-scale building stocks or the overall energy-use distribution of multiple building types. The scale effect on district heating energy benchmarks for a single building type has received relatively limited attention.

Some recent studies have begun to focus on dynamic and context-dependent benchmarks [16]. In a study of hospital buildings, Kim et al. [17] pointed out that building energy use is influenced by physical attributes and climate conditions. They also emphasized the role of operational characteristics. These characteristics include schedules, equipment use, and occupant activities. It was therefore argued that operational variables should be incorporated into benchmark models, rather than relying solely on simple cross-sectional comparisons based on energy use per unit floor area. Er-Retby et al. [18] further proposed a framework for shifting from static to dynamic benchmarking. In that framework, the TRNSYS (v18.06.0002) simulation was combined with measured occupancy, weather, and equipment data. Seasonal operating conditions were identified through clustering, and dynamic monitoring was achieved by means of control charts. Their study indicated that future heating energy benchmarking will place greater emphasis on seasonal variation, behavioral fluctuation, and continuous updating.

A common feature of these studies is that the benchmark is no longer treated as a fixed threshold. Instead, it is defined as a reference interval or dynamic trajectory that changes with the operating context. However, three major limitations still exist in current benchmarking research. First, many studies have still focused on total building energy use or overall EUI, whereas dedicated benchmark research on district heating, which is a key end-use component, remains relatively limited. As a result, the benchmark results cannot be directly applied to heating energy-saving diagnosis or heating operation assessment. Second, although differences in climate zones and building types have been considered in some studies, insufficient attention has been given to the combined effects of building scale, building form, and geometric parameters [19,20]. In particular, systematic identification is still lacking as to whether changes in building scale may lead to regular variations in district heating energy use intensity. Third, although some data-driven and AI-based methods have improved prediction accuracy, they still face limitations related to restricted data, black-box behavior, and classification uncertainty. Meanwhile, traditional evaluation methods based on static rating levels also show limited ability to explain actual heat loss and real operating conditions.

Meanwhile, although previous studies have shown that building area or building scale can affect energy performance, the underlying patterns of this influence have not been systematically examined [21]. Wang et al. [22] investigated the relationship between building energy use and gross floor area across 16 urban regions in the United States. A clear scaling relationship was identified between energy use and gross floor area. Noticeable variations were also observed across different regions and building types. However, the study was focused on the scaling law of total energy use at the urban level rather than on district heating end-use energy. In addition, it was not extended to the development of a heating energy benchmark applicable to the evaluation of individual buildings. Kim and Park [23] analyzed the actual heating energy consumption of apartment buildings with respect to building location and floor area. Their results showed that, although the annual heating energy consumption increased with increasing building area in both 2010 and 2011, the heating energy use intensity decreased on a per-unit-area basis. Mohammadiziazi et al. [24] analyzed the annual average EUI for commercial buildings in Pittsburgh. They found that the EUI decreased with increasing floor area, demonstrating that floor area can be used as a scaling variable. Lee et al. [25] demonstrated that building size is an important factor in defining peer groups for energy benchmarking. Based on a data-driven analysis of 2754 residential buildings in Seoul, they identified gross floor area thresholds of approximately 728 $m^{2}$ and 1039 $m^{2}$. This result indicates that building energy performance exhibits distinct scale-dependent characteristics. These characteristics vary across different size ranges.

Therefore, in this study, the heating energy benchmark of buildings was taken as the research object. On the basis of climate differences and building characteristics, a Bayesian approach was introduced to investigate how heating energy use intensity responds to changes in building scale. Particular attention was given to identifying the variation characteristics of heating energy use across different scale ranges, together with their uncertainty bounds. Compared with previous studies, this work does not remain at the level of general energy-use prediction. Instead, the scale effect embedded in heating energy benchmarking is specifically addressed. In this way, more targeted theoretical support and methodological guidance can be provided for the evaluation of heating energy use, benchmark development, and energy-saving management for buildings of different scales.

1.2. The Present Research

To address the gaps of limited building features and uncertainties in energy-influencing factors, this study proposed a Bayesian framework-based surrogate model for building heating energy consumption prediction using MAP estimation [15]. The data samples used in this model were generated based on the integrated Grasshopper–Honeybee–EnergyPlus simulation platform. By constructing an MCMC algorithm for stochastic sampling in the feature space, the model inferred hyperparameters through peak probability estimation of posterior distributions. This approach enabled the surrogate model to expand the value range of building scales, systematically analyzing EUI trends across diverse building scales while quantifying uncertainty intervals caused by parameter interactions.

In summary, this study focuses on district heating energy benchmarking for public buildings, and its main contributions are as follows:

(1): To address the large number of building-related factors, their complex interactions, and the pronounced uncertainty in establishing public building energy benchmarks, a Bayesian framework-based calibration method was developed for building district heating energy use. This method provides a quantifiable and calibratable analytical tool for benchmark research.
(2): Based on data from building simulation samples, the independent effect of building size on the district heating energy use intensity of office buildings was identified. The results reveal a general pattern whereby larger buildings exhibit lower heating energy use intensity per unit floor area under different climatic conditions, while smaller buildings are more sensitive to changes in scale.
(3): Building on the analysis of heating energy-use patterns, sensitive intervals around building sizes of 1000 $m^{2}$ and 3000 $m^{2}$ were further identified, and the decline in heating energy use intensity was found to become more gradual for buildings larger than 5000 $m^{2}$. These findings provide a basis for tiered benchmark classification and threshold optimization for public building district heating energy benchmarking.
(4): The study further reveals that climatic severity remains the dominant factor controlling the absolute level of heating demand per unit floor area. Meanwhile, as building scale increases, the absolute differences in heating EUI among different climatic conditions gradually narrow, indicating that scale enlargement can partially offset the amplifying effect of colder climates on unit-area heating demand.
(5): In addition, the persistence of the scale effect was found to vary across climatic conditions. In colder regions, the influence of building enlargement on heating EUI was maintained over a broader range of building scales and attenuated more slowly, suggesting that the scale effect remains effective over a longer portion of the scale spectrum under more severe heating climates.

Section 2 describes the construction of the energy consumption prediction model, and Section 3 presents the data sources of the training and test sets. Section 4 elaborates on the prior distribution settings and provides technical details of the model-solving process. Section 5 reports the model results together with the verification procedures. Section 6 analyzes how office building scale influences district heating energy consumption patterns and the mechanisms of the scale effect. Section 7 discusses limitations and future work. Finally, Section 8 concludes the paper by summarizing key findings and highlighting practical implications.

2. Materials and Methods

2.1. Strategies to Determine the Probability-Based Energy Model

The energy model is determined and calibrated so that its output is predicted with high probability given the observed data. Two strategies are most commonly used to solve the model, and they differ in how probability is interpreted: frequentist reasoning and Bayesian reasoning. The interpretation of probability determines how the observed information is used and integrated in the inference of the energy model. The corresponding strategies for estimating the distribution and determining the simulator for the model are termed the MLE (maximum likelihood estimation) strategy and the Bayesian MAP (maximum a posteriori) strategy.

First, frequentist reasoning is used in the MLE strategy to interpret probability. No prior information is used in the estimation procedure, which is equivalent to specifying a uniform prior. This indicates that only the observed data are used to identify the model parameters by computing their likelihood. By contrast, Bayesian reasoning is adopted in the MAP strategy. A prior is specified in the estimation procedure. This allows expert domain knowledge or beliefs about the stochastic variables to be incorporated before the observed data are used to identify the unknown parameters. The uncertainty is expressed by a prior over the unknown parameter, and the knowledge in the prior and the information in the likelihood are integrated to regulate the model by computing the posterior. Taking the parameter $\beta_{0}$ as an example, the resulting posterior is expressed as Equation (1).

$$p(\beta_{0} \mid \eta) = \frac{p(\beta_{0})\, p(\eta \mid \beta_{0})}{p(\eta)} \propto p(\beta_{0})\, p(\eta \mid \beta_{0}) \tag{1}$$

where $p(\beta_{0} \mid \eta)$ represents the posterior distribution, $p(\beta_{0})$ represents the prior distribution, $p(\eta \mid \beta_{0})$ represents the likelihood, and $p(\eta)$ represents the marginal distribution.

Second, a point estimate is computed in both the MLE and MAP strategies. Aiming to predict the output with high probability, $p(\eta \mid X, \beta_{0})$, the MLE strategy finds the parameters $(\beta_{0})_{MLE}$ of the proposed model by maximizing the likelihood $p(D \mid \beta_{0})$, which maximizes the agreement between the model and the observed data $D$. The MAP strategy, by contrast, finds the parameters $(\beta_{0})_{MAP}$ of the proposed model by maximizing the posterior $p(\beta_{0} \mid D)$. Moreover, the MAP estimate reduces to the MLE estimate if the prior is specified as a uniform one, as shown in Equation (2).

$$\begin{aligned} (\beta_{0})_{MAP} &= \underset{\beta_{0}}{\arg\max}\; p(\beta_{0} \mid D) = \underset{\beta_{0}}{\arg\max}\; p(D \mid \beta_{0})\, p(\beta_{0}), \\ p(\beta_{0}) \propto 1 \;&\Rightarrow\; (\beta_{0})_{MAP} = (\beta_{0})_{MLE} = \underset{\beta_{0}}{\arg\max}\; p(D \mid \beta_{0}) \end{aligned} \tag{2}$$
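The relationship in Equation (2) can be checked numerically on a toy case. The sketch below is not the model of this study: it assumes a single unknown mean with Gaussian noise of known standard deviation and a zero-centered Gaussian prior, for which both estimates have closed forms. The data values, `sigma`, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def mle_mean(y):
    """MLE of a Gaussian mean with known noise: the sample mean."""
    return fmean(y)


def map_mean(y, sigma, prior_precision):
    """MAP of a Gaussian mean under a zero-centered Gaussian prior.

    Setting the derivative of the log-posterior to zero gives
    beta = (n / sigma^2) * mean(y) / (n / sigma^2 + prior_precision).
    """
    data_precision = len(y) / sigma ** 2
    return data_precision * fmean(y) / (data_precision + prior_precision)


# Hypothetical heating EUI observations (GJ/m^2) and noise level.
y = [0.42, 0.38, 0.45, 0.40, 0.35]
sigma = 0.05

beta_mle = mle_mean(y)
beta_map_diffuse = map_mean(y, sigma, prior_precision=1e-6)        # nearly flat prior
beta_map_informative = map_mean(y, sigma, prior_precision=400.0)   # prior sd = 0.05

print(beta_mle, beta_map_diffuse, beta_map_informative)
```

With the nearly flat prior the MAP estimate is indistinguishable from the MLE, while the tighter prior pulls the estimate toward zero. That shrinkage is the mechanism by which a prior stabilizes estimates when samples are limited.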

A practical and frequent challenge in developing and calibrating a district-scale energy model is its heavy demand for data. Because of its asymptotic properties, the MLE strategy is effective when the sample is large and the data are adequate; otherwise, it can lead to over-fitting or type-I errors in model calibration. The Bayesian modeling approach combines the estimated likelihood information with the specified prior knowledge, so that the observed data from the limited samples in the region are used more fully. This suggests that the strategy is a promising alternative for improving calibration performance when samples are limited.

2.2. Construction of Energy Model

Building energy consumption prediction models face significant challenges due to the complexity of influencing factors, the high dimensionality of urban building samples, and inherent data imbalance [26,27]. Urban building datasets often fail to comprehensively capture all relevant energy-related features, leading to training sets with imbalanced distributions and redundant variables [28,29]. Excessive inclusion of independent variables in such models can exponentially increase computational complexity and data requirements, thereby reducing generalizability. To address these limitations, this study proposed a generalized linear fixed-effects model with randomized hyperparameters, informed by prior research on multi-factor, imbalanced, and small-sample modeling in energy prediction. The model’s objective function is formulated as follows:

$$\begin{aligned} y_{i} ={}& \beta_{0} + \beta_{1}\, \mathrm{WWR}_{i} + \beta_{2} \log A_{i} + \beta_{3} D_{\mathrm{activity},2,i} + \beta_{4} D_{\mathrm{activity},3,i} \\ &+ \beta_{5} D_{\mathrm{climate},2,i} + \beta_{6} D_{\mathrm{climate},3,i} + \beta_{7} D_{\mathrm{function},2,i} + \beta_{8} D_{\mathrm{function},3,i} + \varepsilon \end{aligned} \tag{3}$$

where $y_{i}$ is the target variable of the model, the heating EUI of building $i$ ($GJ/m^{2}$); $\mathrm{WWR}_{i}$ is the window-to-wall ratio; $A_{i}$ is the building floor area, which represents building scale; $D_{\mathrm{climate},2,i}$ denotes the explanatory variable for the cold climate zone; $D_{\mathrm{climate},3,i}$ denotes the explanatory variable for the hot summer and cold winter climate zone; $D_{\mathrm{function},2,i}$ denotes the explanatory variable for the residential building function; $D_{\mathrm{function},3,i}$ denotes the explanatory variable for the commercial building function; $D_{\mathrm{activity},2,i}$ and $D_{\mathrm{activity},3,i}$ denote the explanatory variables for occupant activity intensities of 150 W/person and 200 W/person, respectively; $\beta_{0}$–$\beta_{8}$ are the regression coefficients of the fixed effects of the model; and $\varepsilon$ is the independent random error of the model ($GJ/m^{2}$).
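Because floor area enters Equation (3) only through $\log A_{i}$, the scale effect implied by the model can be written out directly. The derivation below is a sketch that holds all other explanatory variables fixed. The symbols $A_a$, $A_b$, and $\Delta y$ are introduced here for illustration, and the natural logarithm is assumed for the derivative because the base is not stated in the text.

```latex
\begin{aligned}
\Delta y &= y(A_b) - y(A_a)
          = \beta_{2}\,(\log A_b - \log A_a)
          = \beta_{2}\,\log\frac{A_b}{A_a}, \\
\frac{\partial y}{\partial A} &= \frac{\beta_{2}}{A}
          \quad \text{(natural logarithm assumed)}
\end{aligned}
```

If $\beta_{2} < 0$, as the reported inverse relationship between scale and heating EUI suggests, each doubling of floor area lowers the predicted EUI by the same amount, $|\beta_{2}| \log 2$, whereas the change per added square metre, $\beta_{2}/A$, shrinks as $A$ grows. This is consistent with smaller buildings being more sensitive to scale variation. A different logarithm base, or standardizing $\log A_{i}$ before fitting, only rescales $\beta_{2}$ by a constant factor and does not change this pattern.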

In this study, categorical variables were coded using reference-group coding, with the severe cold region as the reference category for climate zone, office building as the reference category for building function, and 100 W/person as the reference category for occupant activity intensity. Taking the explanatory variables for the building function as an example,

$$D_{\mathrm{function},2,i} = \begin{cases} 1, & \text{if } \mathrm{Function}_{i} = \text{residential building} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
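Reference-group coding as in Equation (4) can be illustrated with a few lines of code. The sketch below is a minimal example, not the authors' implementation: the function names `dummies` and `encode` are hypothetical, and the column order simply follows the order of the indicator terms in Equation (3).

```python
# Category levels; the first entry of each tuple is the reference group.
CLIMATE = ("severe cold", "cold", "hot summer and cold winter")
FUNCTION = ("office", "residential", "commercial")
ACTIVITY = (100, 150, 200)  # occupant activity intensity, W/person


def dummies(value, levels):
    """Return 0/1 indicators for levels[1:]; the reference level gives all zeros."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def encode(climate, function, activity):
    """Indicator columns ordered as in Equation (3): activity, climate, function."""
    return (
        dummies(activity, ACTIVITY)
        + dummies(climate, CLIMATE)
        + dummies(function, FUNCTION)
    )


reference_row = encode("severe cold", "office", 100)
example_row = encode("cold", "residential", 200)

print(reference_row)  # all zeros: every variable sits at its reference level
print(example_row)
```

Under this coding, the intercept $\beta_{0}$ absorbs the reference scenario (severe cold, office, 100 W/person), and each remaining coefficient measures a shift relative to that scenario.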

It should be noted that the objective of this study is not to establish a fully mechanistic model that explicitly includes all possible influencing factors, such as detailed envelope thermal properties, HVAC system performance, and operational control strategies. Rather, the aim is to develop an analytical framework that balances data availability, model interpretability, and the practical needs of baseline application. For heating energy baseline research, the main requirement is to identify the relative variation patterns of heating EUI and the building-scale effect across different scenarios, rather than to fully reproduce all thermal processes of an individual building. Therefore, priority is given to representative variables that are directly relevant to baseline analysis, including climate zone, building function, occupant activity intensity, building scale, and window-to-wall ratio.

It should be further noted that the Bayesian framework is adopted in this study not merely as a general statistical tool, but as a methodological basis for benchmark-oriented analysis under uncertainty. Given the heterogeneous simulation samples across climate zones, building functions, and occupant activity intensity scenarios, together with the simplified yet representative variable setting adopted in this study, the Bayesian framework makes it possible to estimate not only the central tendency of the scale–heating EUI relationship, but also the associated uncertainty in parameter estimation and prediction. This is particularly important for the present study, because the objective is not only to identify average variation patterns, but also to provide an uncertainty-aware basis for interpreting the scale effect, threshold-like sensitive intervals, and their relevance to heating energy benchmark analysis.

3. Data Sources of Training Sets and Test Sets

The Grasshopper (v1)–Honeybee (v1.7.26)–EnergyPlus integrated simulation platform provided a unified technical framework for this study, covering the full process from parametric modeling to building energy simulation. Grasshopper, as a visual parametric modeling tool operating within the Rhino (v8) environment, enables the rapid generation and systematic adjustment of building geometry, envelope parameters, and operational conditions. Honeybee serves as an interface plugin linking parametric models with building performance simulation engines, allowing building geometry, material properties, weather data, and operational schedules to be effectively transferred to EnergyPlus. EnergyPlus, a mature building thermal and energy simulation engine, performs dynamic calculations of building heating loads and heating energy use intensity during the operation stage. Through the integration of these three tools, this study was able to efficiently generate large-scale simulation samples. These samples covered different climate zones, building functions, and occupant activity intensity levels. Meanwhile, consistency in the simulation workflow, flexibility in parameter control, and comparability of simulation results were ensured.

To facilitate subsequent data management and statistical analysis, a hierarchical database structure was adopted to organize the simulation samples. At the first level, the database was classified by climate zone into three categories: severe cold, cold, and hot summer and cold winter, represented by the typical cities of Harbin, Beijing, and Shanghai, respectively. The HDD65 values for Harbin, Beijing, and Shanghai are 9698, 5497, and 3250, respectively. In this study, HDD refers to heating degree days, which is used to characterize the influence of climatic conditions on building heating demand. More specifically, HDD65 refers to heating degree days calculated with a base temperature of 65 °F (approximately 18.3 °C). It is obtained by accumulating the daily temperature difference when the outdoor mean temperature falls below this base temperature. Within each climate zone, the samples were further subdivided by building function into office, residential, and commercial buildings. In addition, occupant activity intensity was classified into three scenarios: 100 W/person, 150 W/person, and 200 W/person. Through this design, the sample database formed a clear tree-like hierarchical structure, which facilitated subsequent sample retrieval, comparison, and modeling across different dimensions, including city, building function, and occupant activity intensity.
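The HDD65 definition above can be written as a short calculation. The sketch below is illustrative only: the function names and the week of daily mean temperatures are hypothetical, and annual values such as those quoted for Harbin, Beijing, and Shanghai would require a full year of weather data.

```python
def heating_degree_days(daily_mean_temps_f, base_f=65.0):
    """Sum the shortfall below the base temperature over all days (in °F·days)."""
    return sum(max(0.0, base_f - temp) for temp in daily_mean_temps_f)


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from °F to °C."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical daily mean outdoor temperatures for one week, in °F.
week = [20.0, 35.0, 50.0, 65.0, 70.0, 64.0, 10.0]

hdd_week = heating_degree_days(week)
base_c = fahrenheit_to_celsius(65.0)

print(hdd_week)          # days at or above 65 °F contribute nothing
print(round(base_c, 1))  # base temperature expressed in °C
```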

Subsequently, the Grasshopper–Honeybee–EnergyPlus integrated simulation platform was employed to simulate the heating energy use intensity of buildings under different climate zones, building functions, and occupant activity intensity levels. A total of 1620 simulation samples were generated in this study. These samples were then divided into a training set and a test set at a ratio of 7:3, resulting in 1134 training samples and 486 test samples.
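The 7:3 partition can be reproduced as follows. This is a sketch under an assumption: the text does not state whether the split was random or stratified, so a seeded random shuffle is used here purely for illustration, and the function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: simple random split
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(1620)
print(len(train_idx), len(test_idx))
```

A stratified split over climate zone, building function, and activity level would be a natural alternative if every scenario needs the same share of test samples.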

4. Prior Distribution Setting and Model Solution

4.1. Prior Distribution Specified for Parameters in Energy Model

In this study, strong prior knowledge regarding the model parameters was not available. Therefore, diffuse priors were adopted so that posterior inference would be driven primarily by the observed training data rather than by subjective prior assumptions. Specifically, the regression coefficients $\beta_{0}$–$\beta_{8}$ were assigned zero-centered Gaussian priors with very low precision, corresponding to a large variance. This choice provides sufficient flexibility for the coefficients to be estimated from the data while avoiding overly restrictive prior constraints. Since the explanatory variables were standardized before model fitting, the use of zero-centered Gaussian priors is also statistically appropriate and facilitates coefficient comparability. The precision parameter of the error term was assigned a Gamma prior to guarantee its positivity. Small hyperparameter values were used so that the prior remained diffuse and had minimal influence on posterior estimation. In numerical terms, the coefficients $\beta_{0}$–$\beta_{8}$ were given Gaussian $(0, 1 \times 10^{-6})$ priors, where the second argument is the precision, and the precision of the error term $\varepsilon$ was given a Gamma $(1 \times 10^{-6}, 1 \times 10^{-6})$ prior.
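To make the diffuseness of these priors concrete, the precisions can be converted to variances. The block below is a worked sketch. It assumes the shape–rate form of the Gamma distribution, which is the convention used in BUGS, and the symbols $\tau_{\beta}$, $\sigma_{\beta}$, $\tau_{\varepsilon}$, $a$, and $b$ are notation introduced here.

```latex
\begin{aligned}
\tau_{\beta} = 10^{-6}
  \;&\Rightarrow\; \sigma_{\beta}^{2} = \frac{1}{\tau_{\beta}} = 10^{6},
  \qquad \sigma_{\beta} = 10^{3}, \\
\tau_{\varepsilon} \sim \mathrm{Gamma}(a, b),\; a = b = 10^{-6}
  \;&\Rightarrow\; \mathrm{E}[\tau_{\varepsilon}] = \frac{a}{b} = 1,
  \qquad \mathrm{Var}[\tau_{\varepsilon}] = \frac{a}{b^{2}} = 10^{6}
\end{aligned}
```

A prior standard deviation of 1000 on coefficients of standardized explanatory variables is far wider than any plausible coefficient value, so the posterior is expected to be shaped almost entirely by the likelihood, in line with the limiting case in Equation (2).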

4.2. Model Solution

The Bayesian inference framework is implemented using the BUGS (v3.2.3) software, which uses MCMC algorithms, including Gibbs sampling and the Metropolis algorithm, to sample from the posterior distribution. Three parallel MCMC chains are initialized with random seeds in the building characteristic space, iterating until convergence to a stationary distribution (assessed by the Gelman-Rubin statistic, $\hat{R} < 1.1$). The posterior expectation, corresponding to the peak probability region of the stationary distribution, serves as the point estimate for hyperparameters. The flowchart of the MCMC simulation is shown in Figure 1.

Developed jointly by the Biostatistics Unit at the University of Cambridge and the Medical Research Council, BUGS addresses the limitations of frequentist methods that rely on asymptotic approximations. By directly computing exact finite-sample posterior distributions, this approach explicitly accounts for model uncertainty without dependence on large-sample assumptions. The computational workflow integrates:

(1): Parameter Space Exploration: MCMC chains traverse high-dimensional hyperparameter spaces, balancing exploration (via Metropolis steps) and exploitation (via Gibbs updates).
(2): Convergence Diagnostics: Trace plots and Brooks-Gelman statistics verify chain mixing and stationarity.
(3): Posterior Summarization: Kernel density estimation extracts 95% credible intervals for hyperparameters, quantifying epistemic uncertainties in building energy models.
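The convergence check in step (2) can be sketched in code. The example below computes the classic variance-based Gelman-Rubin statistic for one parameter from several equal-length chains. It is an illustration only: the diagnostic reported by BUGS may be computed differently in detail, and the synthetic chains and the function name `gelman_rubin` are hypothetical.

```python
import random
from statistics import fmean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [fmean(chain) for chain in chains]
    within = fmean([variance(chain) for chain in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)                    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n            # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(1)

# Three chains drawn from the same distribution: well mixed.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains centred at different values: not converged.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 3.0, 6.0)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 indicate that the between-chain spread is no larger than the within-chain spread, which is the basis of the $\hat{R}$ threshold of 1.1 used in this study.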

This methodology overcomes the rigidity of empirical prior assignments while providing statistically rigorous uncertainty quantification, which is a critical advancement for applications in heterogeneous building stocks where operational data exhibit spatial–temporal variability. In other words, the Bayesian method does not rely on large-sample asymptotic distributions for statistical inference; it works directly with the finite-sample posterior distribution, which MCMC approximates by sampling. Because model uncertainty is carried through the posterior, the resulting inference is more comprehensive.

For the present study, the Bayesian framework is used not only for parameter estimation, but also for uncertainty-aware interpretation of heating energy benchmarks. By using MCMC sampling to obtain posterior distributions, the model makes it possible to evaluate both the estimated effects and their uncertainty ranges. This is particularly important because the benchmark-related conclusions are interpreted not only through point estimates, but also through the posterior uncertainty associated with the scale effect and its stage-wise variation. In this way, the Bayesian framework provides a more informative basis for analyzing heating EUI patterns across different building scenarios.

5. Model Results and Validation

Three Markov Chain Monte Carlo (MCMC) chains were initialized with dispersed starting values in the parameter space, and the thinning interval was set to 1. After 30,000 iterations, the first 10,000 of which were discarded as burn-in, all chains achieved a potential scale reduction factor (PSRF)

\hat{R}

≤ 1.1, indicating stable convergence to the target posterior distribution. Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 show the post-burn-in trace plots of the three chains for the unknown model parameters

β_{0} - β_{8}

and ε.
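The convergence criterion above can be illustrated with a minimal sketch of the classic Gelman-Rubin potential scale reduction factor. The function name `gelman_rubin` and the synthetic chains are assumptions made for illustration; the sampler and software actually used in this study are not specified here, and the draws below are not the study's posterior samples.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(c) for c in chains]
    w = mean(variance(c) for c in chains)   # mean within-chain variance
    b = n * variance(chain_means)           # between-chain variance
    var_hat = (n - 1) / n * w + b / n       # pooled posterior variance estimate
    return math.sqrt(var_hat / w)


# Synthetic post-burn-in draws (illustrative only).
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0, 6.0)]

rhat_mixed = gelman_rubin(mixed)   # chains exploring the same region
rhat_stuck = gelman_rubin(stuck)   # chains separated in parameter space
print(f"well-mixed chains: R-hat = {rhat_mixed:.3f}")
print(f"separated chains:  R-hat = {rhat_stuck:.3f}")
```

Chains that overlap and cross frequently give a value close to 1, whereas persistently separated chains inflate the between-chain variance and push the value well above the 1.1 cut-off.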

As shown in Figure 2, after the first 10,000 iterations were discarded as burn-in, good stationarity and mixing were achieved for both

β_{0}

and

β_{1}

across the three Markov chains. No obvious drift, monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been reached. For

β_{0}

, fluctuations were distributed around a relatively stable central region, and substantial overlap and frequent crossing among chains were observed, suggesting that the posterior space was adequately explored. For

β_{1}

, a similar pattern was identified, with random fluctuations confined to a narrower interval around zero and no persistent chain separation detected. Compared with

β_{0}

, a smaller fluctuation range was exhibited by

β_{1}

, implying lower posterior dispersion and greater estimation stability. Overall, satisfactory convergence was achieved for both parameters after burn-in removal, and the posterior samples were considered reliable for subsequent inference and model interpretation.

As shown in Figure 3, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved for both

β_{2}

and

β_{3}

across the three Markov chains. No obvious drift, persistent monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been reached for both parameters. For

β_{2}

, fluctuations were confined to a relatively narrow interval around a stable central level. Substantial overlap and frequent crossover were observed among the three chains, suggesting that the posterior region was adequately explored and that good between-chain consistency was achieved. For

β_{3}

, a similar pattern was identified. Although slightly wider fluctuations were exhibited, the chains remained highly interwoven and no structural shift or abnormal jump was detected. Overall, satisfactory convergence was considered to have been attained for both parameters after burn-in removal, and the resulting posterior samples were regarded as reliable for subsequent inference and model interpretation.

As shown in Figure 4, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved for both

β_{4}

and

β_{5}

across the three Markov chains. For both parameters, random fluctuations were maintained around relatively stable central levels during the post-burn-in phase. No obvious drift, persistent monotonic trend, or prolonged local sticking was observed, indicating that stable posterior sampling had been attained.

For

β_{4}

, substantial overlap and frequent crossover were observed among the three chains. Although local short-term oscillations were present, no structural shift or persistent separation was detected. This pattern suggests that the posterior space of

β_{4}

was adequately explored and that good between-chain consistency was achieved. For

β_{5}

, a similar trace pattern was exhibited. The chains were highly interwoven and remained concentrated within a stable interval. Slightly narrower fluctuations were observed for

β_{5}

, implying comparatively lower posterior dispersion and greater estimation stability. Overall, satisfactory convergence was considered to have been reached for both parameters after burn-in removal.

As shown in Figure 5, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved for both

β_{6}

and

β_{7}

across the three Markov chains. Random fluctuations were maintained around stable central levels, and substantial overlap was observed among chains. No obvious drift, persistent trend, or prolonged local sticking was detected. For both parameters, only short-term local oscillations were exhibited, without structural separation or abnormal jumps. These trace patterns indicate that the posterior distributions were adequately explored and that satisfactory convergence was reached after burn-in removal.

As shown in Figure 6, after the first 10,000 iterations were discarded as burn-in, satisfactory stationarity and mixing were achieved for both

β_{8}

and

ε

across the three Markov chains. For

β_{8}

, random fluctuations were confined to a relatively narrow interval, and substantial overlap and frequent crossover were observed among chains, indicating stable posterior sampling and good between-chain consistency. For

ε

, wider fluctuations were exhibited, implying comparatively greater posterior dispersion. However, no obvious drift, persistent separation, or prolonged local sticking was detected. Overall, satisfactory convergence was considered to have been reached for both parameters after burn-in removal, and the resulting posterior samples were regarded as reliable for subsequent inference.

Overall, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 indicate that the three chains for parameters

β_{0} - β_{8}

and

ε

all reached satisfactory convergence after the burn-in phase. The chains are well mixed, highly overlapped, and fluctuate randomly around stable central values. These visual results are consistent with the PSRF values below 1.1, further confirming that the posterior estimates are stable and credible for subsequent parameter interpretation and prediction.

Predictive accuracy was quantified using the metrics recommended by ASHRAE Guideline 14-2023 [30], which recommends a CVRMSE threshold of 15% and an NMBE threshold of ±5%. The formulae for these indexes are given in Formulae (5)–(7).

RMSE = \sqrt{\frac{\sum_{i = 1}^{N_{t}} {({\hat{y}}_{i} - y_{i})}^{2}}{N_{t} - 1}}

(5)

CVRMSE = \frac{\sqrt{\frac{\sum_{i = 1}^{N_{t}} {({\hat{y}}_{i} - y_{i})}^{2}}{N_{t} - 1}}}{{\bar{y}}_{i}} \times 100\%

(6)

NMBE = \frac{\sum_{i = 1}^{N_{t}} (y_{i} - {\hat{y}}_{i})}{(N_{t} - 1) \times {\bar{y}}_{i}} \times 100\%

(7)

where ${\hat{y}}_{i}$ is the predicted value of the heating EUI of the sample ($GJ/m^{2}$), $y_{i}$ is the observed value of the heating EUI of the sample ($GJ/m^{2}$), $N_{t}$ is the total number of samples in the test set, and ${\bar{y}}_{i}$ is the mean observed value of the EUI of the test set ($GJ/m^{2}$).

The indexes include the Root Mean Square Error (RMSE), the Coefficient of Variation of the Root Mean Square Error (CVRMSE), and the Normalized Mean Bias Error (NMBE). RMSE is commonly used to evaluate the prediction error of regression models. It is obtained by averaging the squared differences between the predicted and observed values and taking the square root of that average; a smaller RMSE indicates better predictive performance. CVRMSE normalizes the RMSE by the mean of the observed values, as in Formula (6), providing a dimensionless relative measure of prediction error expressed as a percentage. The lower the CVRMSE, the smaller the prediction error relative to the typical magnitude of the data. CVRMSE is particularly useful when comparing models across datasets of different magnitudes, because it removes the influence of scale differences. NMBE measures the systematic bias of a model, that is, whether the predicted values are systematically higher or lower than the observed values, relative to the mean observed value. With the sign convention of Formula (7), a negative NMBE indicates that the model over-predicts on average.
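As a minimal sketch, Formulae (5)–(7) can be written directly in Python, keeping the N_t − 1 denominators used above. The function names and the four-sample arrays are hypothetical and serve only to show the arithmetic; they are not data from this study.

```python
import math


def rmse(y_obs, y_pred):
    """Formula (5): root mean square error with an N_t - 1 denominator."""
    n = len(y_obs)
    sse = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(sse / (n - 1))


def cvrmse(y_obs, y_pred):
    """Formula (6): RMSE divided by the mean observed value, in percent."""
    y_bar = sum(y_obs) / len(y_obs)
    return rmse(y_obs, y_pred) / y_bar * 100.0


def nmbe(y_obs, y_pred):
    """Formula (7): summed (observed - predicted) bias, normalized, in percent."""
    n = len(y_obs)
    y_bar = sum(y_obs) / n
    return sum(o - p for o, p in zip(y_obs, y_pred)) / ((n - 1) * y_bar) * 100.0


# Hypothetical observed and predicted heating EUI values (illustrative only).
y_obs = [100.0, 80.0, 60.0, 40.0]
y_pred = [102.0, 78.0, 63.0, 41.0]

rmse_val = rmse(y_obs, y_pred)
cv_val = cvrmse(y_obs, y_pred)
nmbe_val = nmbe(y_obs, y_pred)

# Acceptance check against the limits quoted from ASHRAE Guideline 14-2023.
passes = cv_val <= 15.0 and abs(nmbe_val) <= 5.0
print(f"RMSE={rmse_val:.3f}, CVRMSE={cv_val:.2f}%, NMBE={nmbe_val:.2f}%, pass={passes}")
```

In this toy case the predictions are slightly higher than the observations on average, so the NMBE is negative.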

The proposed surrogate model demonstrated robust predictive performance, as evidenced by the statistical metrics NMBE, RMSE, and CVRMSE, which quantify deviations between predicted and observed heating EUI values in the test set. As shown in Figure 6, the model achieved minimal prediction bias, with NMBE = −1.02%, RMSE = 9.69, and CVRMSE = 12.37%. These values are within the thresholds recommended by ASHRAE Guideline 14-2023 (CVRMSE ≤ 15%, −5% ≤ NMBE ≤ 5%), supporting the model’s accuracy and generalization capability for office building heating EUI prediction.

To further examine the stability and generalization ability of the Bayesian prediction model under different sample partitions, five-fold cross-validation was introduced as an additional robustness validation method. Specifically, the 1620 building heating energy consumption samples were divided into five mutually exclusive subsets. In each iteration, four subsets were used for model training, while the remaining subset was used for testing, ensuring that each subset was used once as the test set.
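The partitioning described above can be sketched as follows. The helper `k_fold_indices`, the shuffling step, and the random seed are assumptions for illustration; how the samples were actually assigned to folds is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # assumed: random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # k disjoint subsets covering all samples
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


# 1620 samples split into five folds, each used exactly once as the test set.
splits = list(k_fold_indices(1620, k=5, seed=0))
for fold_no, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train)} training samples, {len(test)} test samples")
```

With 1620 samples and five folds, each iteration trains on 1296 samples and tests on the remaining 324.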

As shown in Table 1, the five-fold cross-validation results indicate that the model maintained high prediction accuracy under different sample partitions. The

R^{2}

values ranged from 0.9668 to 0.9709. The mean value was 0.9683, and the standard deviation was only 0.0016. These results indicate that the model provided strong and stable explanatory power for variations in heating energy use intensity. The mean RMSE was 9.6075, with a standard deviation of 0.1935. The mean CVRMSE was 12.2788%, with a standard deviation of 0.2870%. These results suggest that the overall prediction error was relatively low. They also indicate that the variation among different folds was limited. The mean NMBE was −0.7826%, with a standard deviation of 0.3477%, indicating that the overall bias of the model was close to zero and that no obvious systematic overestimation or underestimation was observed.

In addition, the PSRF values of the Bayesian models in all folds were close to 1, indicating good convergence of the MCMC sampling process and confirming the reliability of the posterior parameter estimates. Overall, the five-fold cross-validation results further demonstrate the robustness and generalization ability of the Bayesian prediction model developed in this study. The model performance did not depend on a specific training–testing split, but remained stable across different sample combinations. Therefore, the validated model can provide a reliable basis for the subsequent extraction of building heating energy baselines.

The posterior distributions of the hyperparameters for the office building heating energy surrogate model are shown in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11. Overall, these posterior densities are concentrated within relatively clear high-probability regions, indicating that the corresponding hyperparameters are well identified by the data and the Bayesian inference procedure. The representative values located in the high-probability density regions were then selected as the optimal hyperparameter estimates for the surrogate model.

As shown in Figure 7, unimodal posterior distributions were obtained for both

β_{0}

and

β_{1}

, indicating that relatively well-defined high-probability regions were identified. For

β_{0}

, the posterior density was mainly concentrated in the positive range, with a clear peak but a comparatively wider spread, suggesting moderate posterior uncertainty. For

β_{1}

, the density was concentrated within a narrower interval slightly below zero, and a sharper peak was exhibited, implying lower uncertainty and greater estimation stability. No obvious multimodality or irregular dispersion was observed for either parameter. Therefore, reliable representative values were considered to be available from the high-probability posterior regions.

As shown in Figure 8, unimodal posterior distributions were obtained for both

β_{2}

and

β_{3}

, indicating that identifiable high-probability regions were established for these hyperparameters. For

β_{2}

, the posterior density was concentrated within a narrow interval slightly below zero, and a relatively sharp peak was formed, suggesting that stronger data constraints and lower posterior uncertainty were achieved. For

β_{3}

, the density was also centered in a mildly negative range, but a broader spread and a flatter peak were exhibited, implying comparatively greater dispersion. No obvious multimodality or irregular tail expansion was observed. Therefore, reliable representative estimates were considered to be supported by the posterior distributions.

As shown in Figure 9, unimodal posterior distributions were obtained for both

β_{4}

and

β_{5}

, indicating that stable high-probability regions were identified. For

β_{4}

, the posterior density was centered in a mildly negative range and was spread over a relatively wider interval, suggesting that moderate posterior variability was retained. For

β_{5}

, the density was also concentrated in the negative region, but a sharper peak and shorter tails were exhibited, implying stronger concentration and greater estimation stability. No obvious multimodality or irregular spreading was observed. Therefore, reliable representative values were considered to be supported by the posterior high-density regions.

As shown in Figure 10, smooth and unimodal posterior distributions were obtained for both

β_{6}

and

β_{7}

, indicating that well-defined probability concentration regions were established for these hyperparameters. For

β_{6}

, the posterior density was centered near −2.00, and a compact distribution with a clear peak was exhibited, suggesting limited posterior variability and strong estimation stability. For

β_{7}

, the density was concentrated near 1.08 and a similar overall shape was observed, although a slightly wider spread was retained. No obvious multimodality, irregular skewness, or excessive tail extension was detected. Therefore, reliable representative estimates were considered to be supported by the posterior high-density regions.

As shown in Figure 11, smooth and unimodal posterior distributions were obtained for both

β_{8}

and

ε

, indicating that clearly identifiable probability concentration regions were established. For

β_{8}

, the posterior density was centered near 1.20, and a relatively symmetric and compact shape was exhibited, suggesting stable estimation and limited posterior uncertainty. For

ε

, the density was concentrated near 38, while a visibly wider spread was retained, implying greater residual variability. Nevertheless, no multimodality, discontinuity, or excessive tail extension was detected. Therefore, both

β_{8}

and

ε

were considered to be reliably characterized by their posterior high-density regions.

Taken together, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show that generally smooth and unimodal posterior distributions were obtained for all model parameters, with no obvious multimodality, irregular fragmentation, or uncontrolled dispersion being observed. This indicates that the parameters were well identified through the Bayesian inference procedure and that the posterior estimates can be regarded as stable and credible overall.
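One way to summarize such posterior draws is sketched below: an equal-tailed 95% credible interval and the peak of a Gaussian kernel density estimate. The function names, the normal-reference bandwidth, and the synthetic draws are all assumptions; the draws merely stand in for a parameter centred near 1.20 with an invented spread, and are not the posterior samples of this study.

```python
import math
import random


def equal_tailed_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior draws (linear interpolation)."""
    s = sorted(samples)

    def quantile(p):
        pos = p * (len(s) - 1)
        i = int(pos)
        if i + 1 >= len(s):
            return s[-1]
        return s[i] + (pos - i) * (s[i + 1] - s[i])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def kde_mode(samples, grid_size=200):
    """Grid location of the peak of a Gaussian kernel density estimate."""
    n = len(samples)
    mu = sum(samples) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in samples) / (n - 1))
    h = 1.06 * sd * n ** -0.2                 # normal-reference bandwidth (assumed)
    lo, hi = min(samples), max(samples)
    best_x, best_density = lo, -1.0
    for k in range(grid_size):
        x = lo + (hi - lo) * k / (grid_size - 1)
        # Unnormalized density; the normalizing constant does not move the peak.
        density = sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


# Synthetic draws (illustrative only): centre 1.20, invented spread 0.10.
rng = random.Random(1)
draws = [rng.gauss(1.20, 0.10) for _ in range(2000)]

ci_low, ci_high = equal_tailed_interval(draws)
mode = kde_mode(draws)
print(f"95% interval: [{ci_low:.3f}, {ci_high:.3f}], KDE mode: {mode:.3f}")
```

For a smooth, unimodal posterior, the KDE peak and the interval midpoint lie close together, which is the situation described for the parameters above.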

To further evaluate the predictive performance of the surrogate model, the trained model was applied to the test set to predict the heating EUI of 486 building samples. Figure 12 compares the actual and predicted values of heating EUI for the test samples. Overall, the predicted results agree well with the observed data, indicating that the model has good generalization ability and satisfactory predictive accuracy.

As shown in Figure 12, a clear linear correspondence was exhibited between the observed heating EUI values and the model predictions for the building samples in the test set. Most scatter points were distributed close to the 1:1 reference line, indicating that the overall variation trend of heating EUI was reproduced well by the surrogate model. The ±5% PE and ±10% PE bands were also provided in the figure, by which the prediction deviation can be assessed more intuitively. Overall, a large proportion of the samples were enclosed within the ±10% error bands. A considerable number of samples were also located near the ±5% bands. These results suggest that good agreement was achieved between the predicted and observed values. When different EUI ranges were examined, a relatively compact distribution was observed in the low-EUI region, where close agreement with the reference line was maintained. In the medium-EUI range, a slight increase in scatter dispersion was exhibited; however, most samples were still retained within a limited deviation range, indicating that stable predictive performance was preserved. In the higher-EUI range, a broader spread was observed and several points deviated more visibly from the reference line, yet the overall clustering trend remained consistent and no severe distortion was detected. This suggests that, although the prediction uncertainty increased somewhat with increasing heating EUI, acceptable predictive accuracy was still maintained.

In addition, the scatter points were distributed on both sides of the 1:1 line rather than being concentrated on a single side. This implies that no pronounced systematic overestimation or underestimation was introduced by the model. The observed deviations therefore appear to have been dominated mainly by random variation rather than structural bias. In summary, Figure 12 shows that the predicted values agree well with the observed values for the test samples, with most points falling within reasonable error bands and no obvious systematic bias being detected. Good accuracy, stability, and generalization ability were therefore demonstrated by the proposed model.
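The visual reading of the ±5% and ±10% PE bands can be made quantitative with a short sketch that counts the share of samples inside each band. Here PE is assumed to mean the percentage error of the prediction relative to the observed value, and the five value pairs are hypothetical, not test-set results from this study.

```python
def percentage_errors(y_obs, y_pred):
    """Percentage error of each prediction relative to its observed value."""
    return [(p - o) / o * 100.0 for o, p in zip(y_obs, y_pred)]


def share_within(errors, band):
    """Fraction of samples whose absolute percentage error is within the band."""
    return sum(1 for e in errors if abs(e) <= band) / len(errors)


# Hypothetical observed and predicted heating EUI values (illustrative only).
y_obs = [50.0, 80.0, 100.0, 200.0, 150.0]
y_pred = [52.0, 79.0, 108.0, 192.0, 171.0]

pe = percentage_errors(y_obs, y_pred)
within_5 = share_within(pe, 5.0)
within_10 = share_within(pe, 10.0)
over_predicted = sum(1 for e in pe if e > 0) / len(pe)

print(f"within ±5%: {within_5:.0%}, within ±10%: {within_10:.0%}")
print(f"share of over-predicted samples: {over_predicted:.0%}")
```

A share of over-predicted samples far from one half would point to points clustering on one side of the 1:1 line, which is the kind of systematic bias the paragraph above rules out.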

6. Results and Discussion

6.1. Pattern Analysis of District Heating EUI in Office Buildings: Impact of Building Scale

The surrogate model developed in this study was further applied to simulate the heating energy use intensity (EUI) of office buildings with scales ranging from 100 to 100,000

m^{2}

under different climatic conditions. For this analysis, the building function was specified as an office building, the occupant activity intensity was fixed at 100 W/person, and the WWR was set to 0.25. The HDD65 values considered were 3250, 5497, and 9698. Figure 13 presents both the heating EUI and its marginal change with increasing building scale under different HDD65 conditions.

As shown in Figure 13, a continuous decrease in heating energy use intensity (EUI) was exhibited as building scale increased under all three climatic conditions represented by HDD65 values of 3250, 5497, and 9698. At the same time, the rate of change in heating EUI was observed to move progressively from larger negative values toward zero. These two sets of curves, when interpreted together, indicate that the influence of building scale on heating EUI should not be regarded as linear. Instead, a pronounced nonlinear attenuation pattern was revealed, in which the strongest scale effect was produced in the small-building range and was then gradually weakened as floor area increased.
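The idea of a rate of change that moves from large negative values toward zero can be sketched with finite differences between neighbouring floor areas. The curve `toy_eui` is a purely hypothetical placeholder with arbitrary coefficients, not the surrogate model of this study; only the grid of floor areas is taken from the scale ranges discussed in this section.

```python
def toy_eui(area_m2):
    """Placeholder curve, NOT the paper's model: EUI falls with area and flattens."""
    return 40.0 + 400.0 * area_m2 ** -0.3


def marginal_change(areas, eui_fn):
    """Finite-difference rate of change of EUI between neighbouring floor areas."""
    rates = []
    for a0, a1 in zip(areas, areas[1:]):
        rates.append((eui_fn(a1) - eui_fn(a0)) / (a1 - a0))
    return rates


areas = [100, 1000, 3000, 5000, 20000, 100000]   # floor areas in m2
rates = marginal_change(areas, toy_eui)

for (a0, a1), r in zip(zip(areas, areas[1:]), rates):
    print(f"{a0:>6} to {a1:<6} m2: {r:.6f} EUI units per m2")
```

For any decreasing, convex curve of this kind, every interval shows a negative rate whose magnitude shrinks as floor area grows, which is the nonlinear attenuation pattern described above.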

A stable climatic ordering was maintained across the entire scale range. For any given building size, the highest EUI was associated with HDD65 = 9698, the intermediate level was associated with HDD65 = 5497, and the lowest EUI was associated with HDD65 = 3250. This pattern suggests that climatic severity remained the dominant factor controlling the absolute level of heating demand per unit floor area. Nevertheless, a gradual narrowing of the absolute gap among the three EUI curves was also observed as the building scale increased. This feature is important because it implies that scale enlargement can partially mitigate the amplification effect imposed by colder climates on unit heating demand, even though the overall climatic ranking is not altered.

Particular attention should be paid to the small-scale interval between approximately 100 and 1000

m^{2}

. In this range, the decline in EUI was found to be the steepest for all three climatic conditions. The right-axis curves further show that the magnitude of the negative rate of change was largest in this stage, especially under HDD65 = 9698. This means that the marginal reduction in unit heating demand generated by scale enlargement was most pronounced in small buildings and was amplified under severe heating climates. When the building scale was expanded from approximately 1000 to 3000

m^{2}

, a second regime was entered. The EUI continued to decline. However, the slope of the orange curves became visibly gentler than that observed in the first stage. Meanwhile, the green curves were still negative, although their magnitudes had already been reduced considerably. This indicates that scale remained influential in this interval, but the marginal energy-saving return per additional unit of floor area had begun to weaken. The vertical markers at 1000 and 3000

m^{2}

can therefore be interpreted as practical transition points rather than arbitrary graph annotations. A further transition was then observed in the 3000–5000

m^{2}

range. In this interval, the decrease in EUI was preserved, yet the curves became even flatter and the change-rate lines moved closer to zero. This stage may be interpreted as the boundary between the “sensitive scale–response zone” and the “stabilizing zone.” Importantly, the pace at which the change-rate curves approached zero was not identical across climates. Under HDD65 = 3250, the right-axis curve had already become very close to zero by this stage, whereas under HDD65 = 5497 and especially HDD65 = 9698, a noticeable negative rate was still retained. Thus, it was not only the EUI magnitude that was altered by climate, but also the persistence of the scale effect itself. In the larger-scale interval from approximately 5000 to 20,000

m^{2}

, the three EUI curves were observed to become substantially flatter. At this point, additional increases in building size still produced some reduction in unit heating demand, but the benefit was clearly smaller than that seen in small and medium buildings. The most meaningful insight here is that the duration of the scale effect was found to vary with climatic severity. Under the mildest of the three climates, the change-rate curve approached zero relatively early. Under the intermediate climate, the convergence was delayed. Under the coldest climate, a weak but still visible response to scale was maintained over a broader size range. This suggests that in colder regions, building enlargement continues to affect heating EUI for a longer portion of the scale spectrum.

Beyond approximately 20,000

m^{2}

, all three EUI curves moved toward an almost stable platform, and the corresponding rate-of-change curves converged very closely to zero. It may therefore be inferred that the scale effect had entered a mature diminishing-returns stage. In other words, once office buildings reach a sufficiently large size, the contribution of further scale enlargement to reducing heating EUI becomes marginal.

At the same time, the results suggest that these transition characteristics are modulated by climatic severity, in that under higher HDD65 conditions, the scale effect persists over a longer scale range and attenuates more slowly. This analysis indicates that the scale effect is reflected not only in the decline in heating EUI with increasing building size but also in the variation in response intensity across different scale intervals and in its climatic modulation.

Overall, Figure 13 demonstrates a robust negative relationship between office building heating EUI and building scale, although this relationship was not uniform across the entire scale range. A stage-wise pattern was revealed: a highly sensitive decline at small scales, a moderated but still meaningful response at medium scales, and an increasingly stable regime at larger building scales. This pattern was clearly modulated by climate. Higher HDD65 values were associated not only with higher EUI levels, but also with a stronger and more persistent scale effect. Climatic severity remained the dominant factor controlling the absolute level of heating demand per unit floor area. However, as building scale increased, the absolute differences in heating EUI among climatic conditions gradually narrowed, indicating that scale enlargement can partially offset the amplifying effect of colder climates on unit-area heating demand. In colder regions, the influence of building enlargement on heating EUI was also maintained over a broader range of building scales and attenuated more slowly. These findings suggest that scale-based heating benchmarks for office buildings should be established jointly with climatic stratification, rather than through a single universal reference. Such an approach would allow the physical mechanism behind observed energy differences to be represented more accurately and would improve the scientific basis of benchmark setting, comparative evaluation, and energy-efficiency-oriented design decisions.

It should be further noted that the results observed around approximately 1000

m^{2}

, 3000

m^{2}

, 5000

m^{2}

, and 20,000

m^{2}

are not interpreted in this study as universal fixed thresholds. Instead, they are regarded as threshold-like transition ranges. These ranges were identified under the present dataset, variable setting, and modeling framework. Their main significance lies in showing that the influence of building scale on heating EUI is not uniform across the full scale spectrum, but instead exhibits a clear stage-wise sensitivity pattern. At the same time, the results show that the overall pattern of these scale-related transition ranges remains broadly consistent across different climate zones considered in this study. Although the absolute level of heating EUI varies with climatic severity, the stage-wise threshold pattern of the scale effect is generally similar across climates. This suggests that the identified threshold pattern has a certain degree of robustness within the present research framework.

6.2. Reference for the Development of Heating Energy Benchmarks in Beijing

The findings of this study were compared with the current energy benchmark regulations for public buildings in Beijing. These findings may provide a preliminary reference for improving benchmark classification and energy management by building scale. However, they should not be regarded as a direct basis for policy adjustment. The results suggest that relatively small public buildings (≤3000

m^{2}

) may deserve greater attention in energy management and benchmark evaluation. Buildings in the range of 3000–5000

m^{2}

may require more context-specific assessment. This assessment should be based on their actual energy consumption characteristics. In addition, benchmark values for public buildings ≤3000

m^{2}

may need to be considered separately from those of larger buildings. The benchmark setting for very large public buildings (≥20,000

m^{2}

) should also be examined with caution. This is especially important when they are compared with medium-sized categories. In contrast, the U.S. Energy Information Administration (EIA) subdivides public buildings into 10 building-size categories. The classification thresholds are 93, 465, 929, 2323, 4645, 9290, 18,581, 46,452, and 92,903

m^{2}

. These values are equivalent to 1000, 5000, 10,000, 25,000, 50,000, 100,000, 200,000, 500,000, and 1,000,000

{ft}^{2}

, respectively. This classification could serve as a reference for refining building scale categories in urban public building energy benchmarks in China. This approach would facilitate the rapid collection of operational data. When combined with suitable energy prediction methods, it would also allow energy benchmarks to be determined across different scales, climatic conditions, building configurations, and operating scenarios.
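As a minimal sketch, the nine thresholds listed above can be turned into a ten-category lookup. The square-foot values are those quoted in the text and the conversion factor 1 ft² = 0.09290304 m² is exact. The function name `size_category`, the zero-based category index, and the choice to assign a building exactly on a threshold to the larger category are assumptions for illustration.

```python
from bisect import bisect_right

FT2_TO_M2 = 0.09290304  # exact: (0.3048 m)^2

# Size thresholds quoted in the text, in square feet.
EIA_THRESHOLDS_FT2 = [1_000, 5_000, 10_000, 25_000, 50_000,
                      100_000, 200_000, 500_000, 1_000_000]

# Converted to whole square metres.
EIA_THRESHOLDS_M2 = [round(t * FT2_TO_M2) for t in EIA_THRESHOLDS_FT2]


def size_category(area_m2):
    """Return a category index from 0 (smallest) to 9 (largest).

    Assumed convention: an area exactly equal to a threshold falls in the
    larger of the two neighbouring categories.
    """
    return bisect_right(EIA_THRESHOLDS_M2, area_m2)


print(EIA_THRESHOLDS_M2)
for area in (50, 1000, 3000, 5000, 20000, 100000):
    print(f"{area:>7} m2 -> category {size_category(area)}")
```

Nine thresholds partition the floor-area axis into ten categories, so the same lookup could be reused with any alternative threshold list considered for local benchmarks.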

It should be noted that this study adopts a relatively simplified variable set and simulation-based sample data in order to balance data availability, model interpretability, and the practical needs of benchmark analysis. Although variables such as climate zone, building function, occupant activity intensity, building scale, and building-form-related characteristics can capture the major sources of variation in heating energy use, factors such as envelope thermal performance, HVAC system efficiency, and fine-grained operational management were not explicitly included. Therefore, the model results are more suitable for revealing the relative variation pattern of heating EUI and its scale effect, rather than fully representing all mechanisms governing actual building energy use. In addition, although the simulation-based dataset allows the variation in heating energy consumption to be systematically examined under unified boundary conditions, the results may still differ from actual operational data. Accordingly, the findings of this study should be interpreted with caution in terms of real-world application, and are better regarded as a reference for heating energy benchmark research and classified management analysis. Future research could incorporate more measured data, together with more detailed envelope, system, and operational variables, to improve the practical applicability and interpretive depth of the model.

From a methodological perspective, the contribution of the Bayesian framework in this study lies not only in model estimation but also in providing a probabilistic basis for benchmark interpretation. The posterior results make it possible to discuss the scale effect, climate modulation, and threshold-like transition ranges together with their associated uncertainty, which strengthens the interpretability and practical relevance of the benchmark-oriented analysis.

7. Limitations and Future Works

Some limitations of this study should be noted and will be addressed in future work.

First, it remains to be systematically verified whether the energy systems and geometric forms of large-scale buildings offer inherent advantages in reducing building energy consumption. Second, response surfaces, factor analysis, and other improved modeling methods could further refine the model results. Third, the predictive performance of the proposed method may improve if more real (measured) samples are used; at a minimum, the potential impact of sample size on the prediction results should be examined. Fourth, a more systematic comparative analysis of benchmarks will be considered in future research. Fifth, a more detailed analysis of envelope thermal parameters, HVAC system performance, and operational management variables, and of their influence on heating energy consumption, is an important direction for future research.

8. Conclusions

This work aims to explore the correlation between heating EUI and building scale, providing scientific evidence for the accurate setting of public building energy benchmarks. The main results of this work are as follows:

(1): A Bayesian framework-based surrogate model for building energy consumption was proposed in this study, demonstrating high accuracy and robust generalization capabilities in predicting heating EUI. Evaluation on the test dataset yielded a CVRMSE of 12.37% and an NMBE of −1.02%, both within the thresholds recommended by ASHRAE Guideline 14-2023 (CVRMSE ≤ 15%, −5% ≤ NMBE ≤ 5%). These results indicate that the model meets the recommended prediction criteria and highlight its potential for large-scale application in building energy benchmarking and HVAC system optimization.
(2): Simulation results based on the surrogate model demonstrate an inverse correlation between building scale and heating EUI under different HDD65 conditions. Specifically, smaller-scale buildings exhibit a more obvious reduction in heating EUI per unit of incremental floor area. Accelerated declines in heating EUI are observed at critical thresholds of 1000 $m^{2}$ and 3000 $m^{2}$. In contrast, for buildings exceeding 5000 $m^{2}$, the decline in heating EUI decelerates significantly. This study proposes that public building energy benchmarks should be further categorized based on building scale, with small-scale public buildings (≤3000 $m^{2}$) prioritized as focal points for energy management. The observed inflection points at 1000 $m^{2}$ and 3000 $m^{2}$ (where energy intensity reduction rates intensify) underscore the necessity for granular classifications of energy benchmarks.
(3): The differences in heating EUI between heating degree day conditions also decrease continuously as building scale increases. These differences narrow rapidly for small buildings and flatten gradually for large buildings. This indicates that small office buildings are more sensitive to changes in climatic severity, whereas the influence of climatic differences on heating EUI weakens as building scale increases. These results suggest that building scale not only affects heating EUI itself but also moderates the effect of climatic conditions on heating energy demand.
(4): The persistence of the scale effect varies across climatic conditions. In colder regions, the influence of building enlargement on heating EUI is maintained over a broader range of building scales and attenuates more slowly. The scale effect therefore remains effective over a longer portion of the scale spectrum under more severe heating climates.
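The accuracy check in conclusion (1) can be reproduced in a few lines. The sketch below assumes a commonly used form of the two indices, with residuals taken as measured minus predicted and n − p degrees of freedom (p = 1 by default). Both choices are assumptions made here, so any definitions given elsewhere in this paper take precedence. The function names are illustrative.

```python
import math


def cvrmse_nmbe(measured, predicted, p=1):
    """Return (CVRMSE, NMBE) in percent.

    Assumed convention: residual = measured - predicted, and n - p degrees
    of freedom in both indices. Other sign or denominator conventions exist.
    """
    n = len(measured)
    if n != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    if n <= p:
        raise ValueError("need more observations than adjustable parameters")
    mean_measured = sum(measured) / n
    residuals = [m - s for m, s in zip(measured, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / (n - p))
    cvrmse = 100.0 * rmse / mean_measured
    nmbe = 100.0 * sum(residuals) / ((n - p) * mean_measured)
    return cvrmse, nmbe


def within_limits(cvrmse, nmbe, cvrmse_max=15.0, nmbe_abs_max=5.0):
    """Check the limits quoted in conclusion (1): CVRMSE <= 15%, |NMBE| <= 5%."""
    return cvrmse <= cvrmse_max and abs(nmbe) <= nmbe_abs_max


# Values reported for the test dataset in this paper.
reported_ok = within_limits(12.37, -1.02)
print(reported_ok)
```

The limits are keyword arguments so that a different pair of thresholds can be supplied without changing the function.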

The findings of this study provide a basis for formulating scale-sensitive energy benchmarks for public buildings. The proposed method can be used by urban construction and management departments, as well as by energy management units responsible for heating operations, to manage building energy benchmarks and to support the continuous improvement of building energy management.
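As a minimal sketch of how the scale-based categorization proposed in conclusion (2) might be applied in practice, the snippet below assigns a building to a floor-area band. The thresholds of 1000, 3000, and 5000 $m^{2}$ come from the conclusions above. The band labels, the function names, and the treatment of a building exactly on a threshold (assigned to the lower band) are assumptions made for illustration.

```python
from bisect import bisect_left

# Floor-area thresholds in m2, as named in conclusion (2).
THRESHOLDS_M2 = [1000, 3000, 5000]

# Hypothetical band labels, one more than the number of thresholds.
BAND_LABELS = [
    "up to 1000 m2",
    "over 1000 to 3000 m2",
    "over 3000 to 5000 m2",
    "over 5000 m2",
]


def scale_band(floor_area_m2):
    """Return the band label for a floor area; a value on a threshold falls in the lower band."""
    if floor_area_m2 <= 0:
        raise ValueError("floor area must be positive")
    return BAND_LABELS[bisect_left(THRESHOLDS_M2, floor_area_m2)]


def is_priority(floor_area_m2):
    """Small-scale buildings (<= 3000 m2) are the suggested focal points for energy management."""
    return floor_area_m2 <= 3000


for area in (800, 3000, 4200, 60000):
    print(area, scale_band(area), is_priority(area))
```

A benchmark value would then be set per band and per climate condition, rather than as a single figure for all office buildings.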

Author Contributions

W.N.: writing—original draft, writing—review & editing, visualization, validation, supervision, software, resources, project administration, methodology, investigation, funding acquisition, formal analysis, data curation, conceptualization. Y.L.: writing—original draft, writing—review & editing, visualization, validation, software, methodology, investigation, formal analysis, data curation, conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Beijing Municipal Natural Science Foundation, grant number 9222009.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to express gratitude to Beijing University of Civil Engineering and Architecture for their support in providing the research site and facilities.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

EUI	Energy use intensity
CFD	Computational fluid dynamics
SVM	Support vector machines
ANN	Artificial neural networks
MCMC	Markov chain Monte Carlo
HDD	Heating degree days
EIA	Energy Information Administration
RMSE	Root mean square error
CVRMSE	Coefficient of variation of the root mean square error
NMBE	Normalized mean bias error
MAP	Maximum a posteriori
MLE	Maximum likelihood estimation
BUGS	Bayesian inference using Gibbs sampling
PSRF	Potential scale reduction factor
WWR	Window-to-wall ratio

References

Li, Y.; Song, X.; Wei, J.; Bai, Y.; Lv, P.; Xu, G.; Yu, G. Study on the Mechanism of the Effect of Biomass Ash on the Erosion and Adhesion Behavior of Al₂O₃ Ceramic Particles. Fuel 2024, 358, 130188.
Gao, Y.; Xu, C.; Cui, D.; Rout, L.; Ding, K.; Shi, L.; Zhang, S.; Lv, P.; Li, B.; Yu, G.; et al. Decoupling Study on the Influence of the Interaction between Biomass Hydrochar and Coal during Co-Pyrolysis on the Char Structure Evolution. Renew. Energy 2024, 231, 120938.
Shah, M.; Prajapati, M.; Yadav, K.; Sircar, A. A Comprehensive Review of Geothermal Energy Storage: Methods and Applications. J. Energy Storage 2024, 98, 113019.
Jeong, Y.-S.; Kim, D.-W. Analysis of the Necessity of Revising the Building Energy Efficiency Certificate for Non-Residential Buildings in South Korea. J. Build. Eng. 2024, 94, 109811.
Ding, Y.; Zhang, D.; Lv, J. Comparison of the Applicability of City-Level Building Energy Consumption Quota Methods. Energy Build. 2022, 261, 111933.
Arjunan, P.; Poolla, K.; Miller, C. EnergyStar++: Towards More Accurate and Explanatory Building Energy Benchmarking. Appl. Energy 2020, 276, 115413.
Dongmei, S. Research and Application of Energy Consumption Benchmarking Method for Public Buildings Based on Actual Energy Consumption. Energy Procedia 2018, 152, 475–483.
Parrado-Hernando, G.; Herc, L.; Pfeifer, A.; Capellán-Perez, I.; Batas Bjelić, I.; Duić, N.; Frechoso-Escudero, F.; Miguel González, L.J.; Gjorgievski, V.Z. Capturing Features of Hourly-Resolution Energy Models through Statistical Annual Indicators. Renew. Energy 2022, 197, 1192–1223.
Jia, J.; Li, J.; Feng, W.; Li, Z.; Tian, Q.; Yin, L. Heat Transfer Modeling and System Performance Assessment of an Innovative Thermosyphon Radiator Integrated with Air Source Heat Pumps for Sustainable Heating Applications. Energy Convers. Manag. 2025, 326, 119444.
Amasyali, K.; El-Gohary, N.M. A Review of Data-Driven Building Energy Consumption Prediction Studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205.
Kontokosta, C. Greening the Regulatory Landscape: The Spatial and Temporal Diffusion of Green Building Policies in U.S. Cities. J. Sustain. Real Estate 2011, 3, 68–90.
Shim, J.; Park, S.; Park, S.; Song, D. Energy Signature Approach for Retrofit Prioritization: A Proposal for Building Identification Methodology. Sustain. Cities Soc. 2024, 115, 105844.
Na, W.; Wang, M. A Bayesian Approach with Urban-Scale Energy Model to Calibrate Building Energy Consumption for Space Heating: A Case Study of Application in Beijing. Energy 2022, 247, 123341.
Na, W.; Liu, S. Benchmarking Building Energy Consumption for Space Heating Using an Empirical Bayesian Approach with Urban-Scale Energy Model. Energy Build. 2024, 320, 114581.
Guy, H.; Vittoz, S.; Caputo, G.; Thiery, T. Benchmarking the Energy Performance of European Commercial Buildings with a Bayesian Modeling Framework. Energy Build. 2023, 299, 113595.
Lu, Y.; Li, T. A Review of Explainable Data-Driven Building Energy Benchmarking at Multiple Time Scales. Energy Build. 2026, 358, 117240.
Kim, H.G.; Jeong, D.W.; Kwon, S.J.; Kim, S.S. Development of Building Energy Performance Benchmark for Hospitals. Buildings 2022, 13, 12.
Er-Retby, H.; Mghazli, M.O.; El Mankibi, M.; Benzaazoua, M. Energy Performance Assessment: From Static to Dynamic Benchmarking and Cluster-Based Analysis in Residential Building. Energy Rep. 2026, 15, 108901.
Baglivo, C.; Albanese, P.M.; Congedo, P.M. Relationship between Shape and Energy Performance of Buildings under Long-Term Climate Change. J. Build. Eng. 2024, 84, 108544.
Marincu, C.; Dan, D.; Moga, L. Investigating the Influence of Building Shape and Insulation Thickness on Energy Efficiency of Buildings. Energy Sustain. Dev. 2024, 79, 101384.
Sandvall, A.F.; Ahlgren, E.O.; Ekvall, T. Cost-Efficiency of Urban Heating Strategies—Modelling Scale Effects of Low-Energy Building Heat Supply. Energy Strategy Rev. 2017, 18, 212–223.
Wang, C.; Jiang, H.; Wu, H.; Liu, Y.; Guo, S.; Xu, M. Scaling in Urban Building Energy Use and Its Influencing Factors. J. Ind. Ecol. 2023, 27, 1076–1088.
Kim, S.-J.; Park, D.-Y. Study on the Variation in Heating Energy Based on Energy Consumption from the District Heating System, Simulations and Pattern Analysis. Energies 2022, 15, 3909.
Mohammadiziazi, R.; Copeland, S.; Bilec, M.M. Urban Building Energy Model: Database Development, Validation, and Application for Commercial Building Stock. Energy Build. 2021, 248, 111175.
Lee, K.; Lim, H.; Kim, I.; Lee, S. Data-Driven Methodology for Identifying Gross Floor Area Thresholds in Building Energy Benchmarking. J. Build. Eng. 2025, 112, 113768.
Wang, J.; Zhu, Z.; Zhao, J.; Li, X.; Liu, J.; Yang, Y. Research on the Energy Consumption Influence Mechanism and Prediction for the Early Design Stage of University Public Teaching Buildings in Beijing. Buildings 2024, 14, 1358.
Wang, P.; Yang, Y.; Ji, C.; Huang, L. Influence of Built Environment on Building Energy Consumption: A Case Study in Nanjing, China. Environ. Dev. Sustain. 2023, 26, 5199–5222.
Yang, J.; Yuan, H.; Yang, J.; Zhu, R. Study on the Influencing Factors of Energy Consumption of Nearly Zero Energy Residential Buildings in Cold and Arid Regions of Northwest China. Sustainability 2022, 14, 15721.
Yu, C.; Pan, W. Inter-Building Effect on Building Energy Consumption in High-Density City Contexts. Energy Build. 2023, 278, 112632.
ASHRAE Guideline 14-2023; Measurement of Energy, Demand, and Water Savings. American Society of Heating, Refrigerating and Air-Conditioning Engineers: Atlanta, GA, USA, 2023. Available online: https://store.accuristech.com/standards/ashrae-guideline-14-2023-measurement-of-energy-demand-and-water-savings?product_id=2569793# (accessed on 6 March 2026).

Figure 1. Flowchart of main steps implementing MCMC simulation.

Figure 2. Trace plots of $\beta_{0}$ and $\beta_{1}$