An Approach to Data Acquisition for Urban Building Energy Modeling Using a Gaussian Mixture Model and Expectation-Maximization Algorithm

Abstract: In recent years, building energy performance has become increasingly uncertain because of factors such as climate change, the Covid-19 pandemic, stochastic occupant behavior and inefficient building control systems. Sufficient measurement data are essential to predict and manage a building's performance levels. Assessing the energy performance of buildings at an urban scale requires even larger data samples in order to perform an accurate analysis at an aggregated level. However, data are not only expensive to obtain; acquiring large amounts of real energy data can also be a real challenge for communities. Moreover, inadequate knowledge of the full population leads to biased learning and the failure to establish a data pipeline. Thus, this paper proposes a Gaussian mixture model (GMM) with an Expectation-Maximization (EM) algorithm to produce synthetic building energy data. The method is tested on real datasets. The results show that the parameter estimates from the model are stable and close to the true values. The bivariate model gives better performance in classification accuracy. Synthetic data points generated by the models show a consistent representation of the real data. The approach developed here can be useful for building simulations and optimizations with spatio-temporal mapping.


Introduction
Buildings account for 40% of global energy consumption [1]. Of this figure, over 60% of the energy is consumed in the form of electricity, and only 23% of it is supplied by renewable sources [2]. Studies on nearly zero-energy buildings (NZEBs) and positive energy districts (PEDs) have recently drawn much attention to possible ways of reducing this energy demand [3,4]. NZEB buildings have a very high energy performance level, with the nearly zero or very low amount of energy they require covered to a significant extent by renewable sources, and their energy performance is determined on the basis of calculated or actual annual energy usage [5][6][7]. PEDs are defined as energy-efficient and energy-flexible urban areas with a surplus of renewable energy production and net zero greenhouse gas emissions. For both NZEBs and PEDs, building energy performance is a crucial criterion for indicating their energy achievements [8].
Building energy modeling is an efficient way to predict the different possible performance levels of a building [9]. Among modeling methods, data-driven approaches have shown their advantages in building energy modeling, especially at the urban scale [10][11][12][13]. Basically, a data-driven approach is a systematic framework of data modeling techniques, comprising model selection, parameter estimation and model validation, that supports analytical decision-making. Most machine learning methods are data-driven, since the machines or models are trained by learning a mapping function that operates between the input and output of the data. The more experience a machine gains while learning, the better its performance will be. Thus, acquiring sufficient data is the basis for identifying accurate energy use patterns and deciding on the optimal actions to take in response. However, acquiring sufficient high-quality data for buildings at the urban level is a real challenge. Either random missing values or a large amount of incomplete information jeopardizes a model's validity. As data in urban building energy modeling (UBEM) are collected from different sources, it can take significant effort to integrate datasets into a standardized format for interoperability [14].
(Citation: Han, M.; Wang, Z.; Zhang, X. An approach to data acquisition for urban building energy modeling using a Gaussian mixture model and Expectation-Maximization algorithm. Buildings 2021, 11, 30. https://doi.org/10.3390/buildings11010030)
By identifying such a research gap around the acquisition of data for urban building energy modeling, this paper aims to develop a novel approach.The specific contributions of this work are as follows: (1) it proposes to use a Gaussian mixture model (GMM), trained by Expectation-Maximization (EM) algorithm, as a generative model to discover the populations where the data can be drawn from; (2) it uses real datasets to validate parameter estimation and generative performance; (3) it suggests that the bivariate Gaussian model is more robust than the univariate model; and (4) it discusses the practical ways in which the initial values of the EM algorithm can be set.
The rest of the paper is structured as follows: based on an extensive literature review, the necessities and challenges of modeling with sufficient data are discussed in Section 2. Section 3 continues with a brief summary of the different ways to acquire more data. The philosophies of GMM and EM are presented in Section 4. In Section 5, the real datasets, parameter estimation details, and model evaluations are introduced. Section 6 discusses the spatio-temporal mapping of the synthetic data. It is followed by a conclusion in the final section.

Necessities and Challenges of Building Performance Modeling with Big Data
The analysis of building energy performance is shifting from single buildings to the district and urban levels. This shift yields new research domains associated with building energy performance, such as transportation, spatial correlations, grid operations, energy market actions and so on. Together with factors such as occupant behavior and climate change that affect single-building modeling, a large amount of data is being produced in different domains, and the causal relationships between them are becoming complex. For instance, Zhang et al. reviewed a modeling technique for urban energy systems at the building cluster level that incorporated renewable-energy-source envelope solutions, in which they highlighted that a high-resolution energy profile, a spatio-temporal energy demand (both building and mobility) and detailed engineering/physical and statistical models are desirable for further development [15]. Salim et al. defined occupant-centric urban data and the pipeline to process it in a review paper that also outlined the different sources of urban data for modeling urban-scale occupant behavior: mobility and energy in buildings [16]. Perera et al. developed a stochastic optimization method to consider the impact of climate change and extreme climate events on urban energy systems across 30 cities in Sweden by considering 13 climate change scenarios [17]. Yang et al. initiated data-driven urban energy benchmarking of buildings using recursive partitioning and stochastic frontier analysis in their study of a dataset of over 10,000 buildings in New York City [18]. By examining 3640 residential buildings, Ma and Cheng estimated building energy use intensity at an urban scale by integrating geographic information system (GIS) and big data technology [19]. Pasichnyi et al. proposed a data-driven building archetype for urban building energy modeling that uses rich datasets [10]. Risch et al. presented a level scheme to study the influence of data acquisition on the individual Bayesian calibration of archetypes for UBEMs [20]. Ali et al. developed a data-driven approach to optimize urban-scale energy retrofit decisions for residential buildings while acknowledging that developing a building stock database is a time-intensive process requiring extensive data (both geometric and non-geometric) that can, in its uncollected form, be sparse, inconsistent, diverse and heterogeneous in nature [11].
When training a model, inadequacies in the data acquisition process generally mean that important information will be lacking and the underlying features will be poorly captured. For example, energy policies may be misleading, and the data derived from them can be unclear or ambiguous. Securing sufficient data from different domains and at a finer resolution at the urban level will improve the quality of a model's decision-making. Understanding the impact of insufficient data on building energy modeling allows challenges to be identified, the limits of a model's capacity to be broken and ways of improving the data situation to be found [21]. Modern machine learning techniques, especially deep learning (DL) methods, are a powerful way to model large amounts of urban energy data. The reason DL outperforms other methods is that it uses millions of parameters to create sophisticated, nonlinear relationships that map the input data to the output data. The central goal is to train the bulk of the parameters so that they not only fit the training set well but also work on a dataset that the model has not seen. The ability to perform well on previously unobserved inputs is called generalization [22]. A small dataset does not provide enough material to update the parameters: the model can be trained to map the output with extremely high accuracy, but only on the observed dataset. This leads to the problem of overfitting. Feeding sufficient data to the model enables it to discover more comprehensive mapping rules and therefore enhances the model's generalization ability.
Data capture and storage, data transformation, data curation, data analysis and data visualization are all challenges when working with big data [23]. Establishing new systems that can gather the data required to estimate building energy performance requires multidisciplinary efforts and comes with high financial and time costs. Consequently, missing data at different levels of building energy modeling is a common problem. Most existing statistical and nonlinear machine learning methods can provide reliable interpolations when the missing rate is small, for example lower than 30%, and the values can be considered missing at random. When the missing rate is as large as 50-90%, or if the sample information is completely missing from a large-scale dataset, it is unknown how well these methods will perform [24]. Alternatively, when a simplified building model handles incomplete data, its findings are usually fairly robust and efficient [25]. However, the assessment methodology depends on the situation in each specific area, and it can be difficult to generalize [26].

Methods for Acquiring Building Energy Performance Data
Traditional methods for acquiring building energy performance data include direct collection from a meter, data augmentation and simulation. Combining these sources will enrich a dataset. Despite this, statistical methods, and especially mixture models viewed as generative models, have not been seriously examined.

Collecting Energy Performance Data
Direct collection of building energy data from energy meters and sub-meters can be done in three different ways: energy readings, high-frequency energy logging and building management systems (BMS) [27]. Reading energy consumption data is easy and cheap if meters are on-site, but readers can make mistakes, and these are not easily discovered. One alternative is to apply computer vision techniques to read the meter automatically [28]. High-frequency energy data logging is the process of both collecting and storing data over relatively short time intervals. The core device is a data logger comprising an electronic sensor, a processor and storage memory. The development of cloud and edge computing makes the management of these data much smoother. These kinds of data are usually used for accurate forecasting at high time resolution and serve as the basis for real-time bidding systems. A BMS is a digital system that monitors, controls and records a building's operations. However, its automated processes, analysis capabilities and integration with other systems are still not well developed, and the data that BMS systems provide have been shown to contain errors [27]. These challenges can be tackled with the aid of the modeling approaches and integration capabilities provided by building information modeling (BIM) [29].
Direct data collection from meters is fast and precise. Where there is uncertainty about energy performance, meter data can act as a system benchmark. With the help of computers, the handling of data is becoming increasingly efficient. However, errors, including missing and incorrect values, can still be made by humans and machines. Thus, investing in data infrastructures such as sensors, meters, sub-meters and data archiving systems, and linking these to data-driven building operation applications, is still essential [30].

Data Augmentation
Data augmentation is a technique, particularly common in computer vision, that allows images to be flipped, rotated, scaled and translated in order to produce more samples. New features can also be generated to aid model generalization. Although building energy data are not directly stored in the form of images, a few studies have explored the different ways in which such data could be augmented.

Energy Consumption
Unlike face recognition, intelligent transportation, precision medicine and other AI industries where computer vision has been comprehensively developed, efficient image-based approaches to analyzing building energy are still in their early stages [31]. Traditional data augmentation methods have seldom been considered, although a deep convolutional neural network is able to detect equipment usage and predict the energy heat gains in office buildings [32]. In other research, building energy consumption data incorporating equipment use, lighting and air conditioning systems were augmented, so that more sequences could be captured, by imposing a sliding sample window on the original training data [33]. With this technique, the training sample was enlarged to roughly the length of the original sequence minus the window length.
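As a rough sketch of this sliding-window augmentation (the window length and series below are made up for illustration, not taken from the cited study):

```python
import numpy as np

def sliding_windows(series, window):
    """Augment a 1-D load series into overlapping subsequences.

    A sequence of length L yields L - window + 1 training samples,
    so the sample count grows roughly by the sequence length minus
    the window length.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - window + 1
    if n <= 0:
        raise ValueError("window longer than series")
    # Stack each shifted view as one training sample
    return np.stack([series[i:i + window] for i in range(n)])

# Hypothetical hourly consumption sequence
loads = np.arange(10.0)
samples = sliding_windows(loads, window=4)
print(samples.shape)  # (7, 4)
```

Each row of `samples` is one augmented training sequence; in practice the window would cover, say, 24 or 168 hourly readings.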

Short-Term Load Forecasting
Short-term load forecasting (STLF) for buildings, which secures operations in a smart grid, focuses on predicting the energy load of a building a certain number of hours or days ahead. Deep learning has become the principal method for securing accurate predictions within this process from the demand side [34,35]. The inclusion of multi-source data such as user profiles, efficiency and weather data has been shown to improve the performance of load forecasting [36]. Without a sufficient supply of data, deep learning modeling may fail to cope with its huge number of parameters. However, a large historical load dataset is very often unavailable. One reason is the development of distribution networks, in which the number of newly built distribution transformers is growing. Another is that the party responsible for data management frequently restricts access to it [37].
One recent study proposed a method that merged the three-phase dataset to form a larger dataset, so that the load forecasting of a particular phase could also be based on the other phases' information. The new dataset was then used, in a rolling fashion, to generate a training and testing set [37]. Another study proposed that, where a method that concatenates the series to generate larger amounts of data was viable for a single building, the load series should have less uncertain and more homogeneous information [38]. The size of the data was enlarged several times over by figuring out K centroid load profiles (k = 1, 2, …, K) and the corresponding residuals. In order to improve the concatenation method, instead of aggregating all of the historical data from previous years, a historical data augmentation method inserted one feature factor, which adopts adjacent loads as a new feature, into the original input [39].

Learning-Based Methods
Generative Adversarial Networks (GANs) have been applied in many domains [40,41]. In a GAN, a generator produces random objects that are mixed with real objects for a discriminator to discriminate. The discriminator learns to assign high scores to real objects and low scores to generated objects. The generator is then updated so that it generates new objects that are likely to be assigned a high score by the fixed discriminator. The procedure stops when the discriminator is unable to discriminate whether the objects are generated or real. In a recent work [42], a GAN was applied to one year of hourly whole-building electrical meter data from 156 office buildings so that the individual variations of each building were eliminated. The generated load profiles were close to the real ones, suggesting that GANs can further be used to anonymize data, generate load profiles and verify other generative models. A recurrent GAN, preceded by core-feature pre-processing, was also used to test the same dataset; the model trained with the synthetic data achieved a similar accuracy to the one trained with real data [43]. In another work, a conditional variational auto-encoder was developed to detect electricity theft in buildings. The method considered the shapes and the distribution characteristics of samples at the same time, with the resulting training set improving detection performance in comparison with other classifiers [44].

Simulation
Building simulation is an economical method for evaluating building performance and comparing it with real data [45,46]. With a pre-defined physical environment and generally acceptable levels of accuracy, simulation tools can rapidly generate sufficient amounts of analyzable data. Stochastic factors that incur uncertainties, such as occupant behaviors, have been integrated into building performance simulations and have improved their accuracy [47]. Combined with Geographical Information System (GIS) programs, urban energy simulations are already demonstrating that they are likely to further reduce input data uncertainty and simplify city district modeling [48]. Some of the challenges of building simulation, such as model calibration, efficient technology adoption and integrated modeling and simulation, have also been addressed at various modeling scales, from the single building to the national level, and at different stages in the building life cycle, from initial design to retrofit [49]. For building simulation, the applicability of integrated tools, not only during early design but also throughout the building operation, management and retrofit stages, should be improved to support the most effective decisions [50].

Gaussian Mixture Model
Building energy data are often a mixture of samples from different populations whose parameters differ from each other. A natural strategy is to identify these populations and generate new data points from the respective populations, which is what a Gaussian mixture model (GMM) is built for. The GMM is a simple linear superposition of Gaussian components, aimed at providing a richer class of density models than the single Gaussian [51]. The GMM is also a generative model, from which arbitrary amounts of data can be generated. While $K$-means, a special form of GMM, assigns an individual to the nearest center, the GMM gives a soft allocation for all data points. The GMM is an unsupervised method in which a data point is assumed to belong to a component with a certain probability. A categorical latent variable $\mathbf{z}$ is adopted to indicate a specific component by letting $z_k = 1$ and $z_j = 0$ for all $j \neq k$. The marginal distribution of $\mathbf{z}$ is denoted as $p(z_k = 1) = \pi_k$ with $\sum_{k} \pi_k = 1$ for $k = 1, 2, \ldots, K$. Thus, in Equation (1), the marginal distribution for the observable variable $\mathbf{x}$ is

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \quad (1)$$

where $\mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)$ is the Gaussian density. A posterior probability for $\mathbf{z}$ when $\mathbf{x}$ is given indicates how much each component contributes to the realization of $\mathbf{x}$:

$$\gamma(z_k) = p(z_k = 1 \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)}. \quad (2)$$

$\gamma(z_k)$ in Equation (2) is also known as the responsibility probability, allowing us to partition an observation into $K$ components. For a given set of independent observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ with sample size $N$, we assume that $\mathbf{z}_n$, $n = 1, 2, \ldots, N$, is the latent variable for each observation. In addition to the location and shape parameters $\mu_k$ and $\Sigma_k$, the marginal probabilities $\pi_1, \pi_2, \ldots, \pi_K$ also contribute to the parameter space of the log-likelihood function in Equation (3):

$$\ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right\}. \quad (3)$$

The graphical representation for an observation $\mathbf{x}_n$ is illustrated in Figure 1. The explicit form of the derivatives of Equation (3) is not available because of the summation term inside the logarithm. The EM algorithm introduced in Sections 4.2 and 4.3 addresses this difficulty and evaluates the likelihood in a tractable way.
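As a concrete illustration, the mixture density of Equation (1) and the responsibilities of Equation (2) can be sketched in a few lines of Python (a minimal univariate sketch; the two-component parameters below are purely illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(x, pis, mus, sigmas):
    """Equation (1): p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

def responsibilities(x, pis, mus, sigmas):
    """Equation (2): gamma_k = pi_k * N(x | mu_k, sigma_k^2) / p(x)."""
    comps = np.array([p * gaussian_pdf(x, m, s)
                      for p, m, s in zip(pis, mus, sigmas)])
    return comps / comps.sum(axis=0)

# Illustrative two-component mixture (parameters are made up here)
pis, mus, sigmas = [0.77, 0.23], [66.71, 128.33], [31.39, 39.38]
gamma = responsibilities(100.0, pis, mus, sigmas)
print(gamma.sum())  # responsibilities sum to 1
```

The soft allocation is visible directly: a point far to the left receives nearly all of its responsibility from the low-mean component, while a point in the overlap region is shared between both.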

The Expectation-Maximization (EM) Algorithm
It has been proposed that an EM algorithm can be used to iteratively compute maximum-likelihood estimates with incomplete information, with many applications such as filling in missing data, grouping and censoring, variance components, hyperparameter estimation, reweighted least-squares and factor analysis explicitly addressed [52]. EM is considered one of the top ten algorithms in data mining, and it simplifies the maximization procedure for the GMM [53]. Density estimation for the GMM via EM can also handle high-dimensional data [54]. Thus, for a parametrized probability model $p(\mathbf{X} \mid \theta)$, the joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \theta)$ is introduced to rewrite the log-likelihood as

$$\mathcal{L}(\theta) = \ln p(\mathbf{X} \mid \theta) = \ln \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \theta), \quad (4)$$

where $\mathbf{X}$ is the observable variable and $\mathbf{Z}$ is the hidden variable. It should be noted that $\ln p(\mathbf{X} \mid \theta)$ is given in the form of a random variable; it is equivalent to $\sum_{n} \ln p(\mathbf{x}_n \mid \theta)$ once the sample $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ is obtained. We denote a density function of $\mathbf{Z}$ as $q(\mathbf{Z})$ with $q(\mathbf{Z}) > 0$. Since $q(\mathbf{Z})$ is irrelevant to $\theta$, taking an expectation of $\ln p(\mathbf{X} \mid \theta)$ with respect to the distribution of $\mathbf{Z}$ does not affect the value of $\mathcal{L}(\theta)$. On the other hand, taking the same expectation on the right-hand side of Equation (4) results in

$$\mathcal{L}(\theta) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})} - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}. \quad (5)$$

In Equation (5), $-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \{ p(\mathbf{Z} \mid \mathbf{X}, \theta) / q(\mathbf{Z}) \}$, known as the Kullback-Leibler divergence and denoted as $\mathrm{KL}(q \parallel p)$, is used to measure the distance between $q(\mathbf{Z})$ and $p(\mathbf{Z} \mid \mathbf{X}, \theta)$ [55]. $\mathrm{KL}(q \parallel p)$ takes the value 0 when $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \theta)$; otherwise it is greater than 0.
If we denote $\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \{ p(\mathbf{X}, \mathbf{Z} \mid \theta) / q(\mathbf{Z}) \}$ as the evidence lower bound (ELBO), $\mathcal{L}(\theta)$ can be represented in Equation (6) as

$$\mathcal{L}(\theta) = \mathrm{ELBO}(q, \theta) + \mathrm{KL}(q \parallel p). \quad (6)$$

For fixed $\theta$ and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, $\mathcal{L}(\theta) \geq \mathrm{ELBO}$, and $\mathcal{L}(\theta)$ attains its lower bound, the ELBO, only when $\mathrm{KL}(q \parallel p) = 0$. Thus, the task becomes finding the estimate of $\theta$ while $q(\mathbf{Z})$ is fixed at one specific option,

$$q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}). \quad (7)$$

The superscript $(t)$ indicates the values obtained from the last iteration of the EM algorithm. Hence, $\sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) \ln p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})$ is independent of $\theta$ and can be treated as a constant in the estimation. It should be noted that $p(\mathbf{Z} \mid \mathbf{X}, \theta)$ is the ideal form of $q(\mathbf{Z})$ in some specific models. For an implicit $p(\mathbf{Z} \mid \mathbf{X}, \theta)$ that is hard to obtain, an approximation method is usually applied to find the best $q(\mathbf{Z})$.
Two recursive steps for updating $\theta^{(t+1)}$ in Equation (7) form the EM algorithm. In the E-step, we use the old $\theta^{(t)}$ to find the posterior distribution of the latent variable, which is used to calculate the expectation of the complete-data likelihood:

$$\mathcal{H}(\theta; \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}} \left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \theta) \right]. \quad (8)$$

In the M-step, $\theta^{(t+1)}$ is updated by maximizing Equation (8):

$$\theta^{(t+1)} = \arg\max_{\theta} \, \mathcal{H}(\theta; \theta^{(t)}). \quad (9)$$

Iterative evaluation of Equations (8) and (9) guarantees the convergence of $\mathcal{L}(\theta)$ [56]. An illustration of this process can be seen in Figure 2. The M-step searches for a new parameter that increases the value of $\mathcal{H}(\theta; \theta^{(t)})$, which in turn increases $\mathcal{L}(\theta)$ when plugged in, because the property $\mathcal{L}(\theta) \geq \mathrm{ELBO}$ always holds. Convergence is declared when $\theta$ no longer updates its value significantly or $\mathcal{L}(\theta)$ no longer improves.

Parameter Estimation for GMM
$\mathcal{H}(\theta; \theta^{(t)})$ is evaluated on the joint log-likelihood with complete data, which differs from the incomplete log-likelihood described in Equation (3). If we let the distribution of the latent variable $\mathbf{Z}$ be $p(\mathbf{Z}) = \prod_{n} \prod_{k} \pi_k^{z_{nk}}$ and the conditional distribution be $p(\mathbf{X} \mid \mathbf{Z}) = \prod_{n} \prod_{k} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)^{z_{nk}}$, the likelihood for the complete data is the product of the two distributions:

$$p(\mathbf{X}, \mathbf{Z} \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right]^{z_{nk}}, \quad (10)$$

where $z_{nk}$ indicates the $k$-th component for $\mathbf{x}_n$. Compared with Equation (3), the benefit of evaluating Equation (10) is that the logarithm no longer acts on any summation terms, and there is only a linear relationship between the latent variable $z_{nk}$ and the observed variable, namely

$$\ln p(\mathbf{x}_n, \mathbf{z}_n \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} z_{nk} \left[ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right]. \quad (11)$$
Based on Equation (11), $\mathcal{H}(\theta; \theta^{(t)})$ can be easily specified as

$$\mathcal{H}(\pi, \mu, \Sigma; \pi^{(t)}, \mu^{(t)}, \Sigma^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left[ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right], \quad (12)$$

where $\gamma_{nk} = \mathbb{E}[z_{nk}]$ is the responsibility probability for a given $\mathbf{x}_n$ to be partitioned into component $k$. Thus, the specification of $p(z_k = 1 \mid \mathbf{x})$ in Equation (2), $\gamma_{nk}$, can be evaluated by

$$\gamma_{nk} = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}. \quad (13)$$

The maximization of Equation (12) is in a tractable form when taking derivatives with respect to the parameters. Due to the constraint $\sum_{k} \pi_k = 1$, a Lagrange multiplier $\lambda$ is introduced for $\pi_k$. Finally, the E-step for the GMM is to evaluate the responsibilities in Equation (13) given $\theta^{(t)} = (\pi^{(t)}, \mu^{(t)}, \Sigma^{(t)})$, and the M-step updates the parameters:

$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n, \qquad \Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \mu_k^{(t+1)})(\mathbf{x}_n - \mu_k^{(t+1)})^{\mathsf{T}}, \qquad \pi_k^{(t+1)} = \frac{N_k}{N},$$

where $N_k = \sum_{n=1}^{N} \gamma_{nk}$. When all the parameters are updated, we return to the E-step and evaluate Equation (13). The algorithm terminates, as introduced in Section 4.2, when either $\theta$ or $\mathcal{L}(\theta)$ is observed to be stable.
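The full E-step/M-step cycle can be condensed into a short numerical sketch (univariate, two components; the synthetic sample and starting values below are illustrative and are not taken from the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

def em_gmm(x, pis, mus, sigmas, n_iter=50):
    """EM for a univariate K-component GMM: responsibilities in the
    E-step, closed-form mean/variance/weight updates in the M-step."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E-step: weighted Gaussian densities -> responsibilities gamma_nk
        dens = np.stack([p / (s * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - m) / s) ** 2)
                         for p, m, s in zip(pis, mus, sigmas)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        Nk = gamma.sum(axis=0)
        mus = gamma.T @ x / Nk
        sigmas = np.sqrt((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
        pis = Nk / len(x)
    return pis, mus, sigmas

# Synthetic two-population sample loosely mimicking two EUI groups
x = np.concatenate([rng.normal(66, 31, 600), rng.normal(128, 39, 200)])
pis, mus, sigmas = em_gmm(x, [0.5, 0.5], [40.0, 150.0], [30.0, 30.0])
```

After the run, `mus` should sit near the two population means and `pis` near the 3:1 mixing ratio; in a fuller implementation one would stop early once the log-likelihood stabilizes, as described in Section 4.2.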

Test Datasets
This paper considers two different datasets to validate the proposed method. Both datasets are well organized and free to use. The first one is the public building energy and water use data from Boston in the United States (the data are available at https://data.boston.gov/dataset/building-energy-reporting-and-disclosure-ordinance). The dataset was collected under the Building Energy Reporting and Disclosure Ordinance (BERDO), which allows different types of building stakeholders to track their energy usage and greenhouse gas emissions and provides an assessment source for local government to implement energy-saving policies. The original file from 2019 contained 28 variables in total, including general variables related to categories such as building name, location and physical information, as well as energy-related variables. The variable Energy Use Intensity (EUI) reflects total annual energy consumption divided by the gross floor area, and is an effective measurement of energy performance. By eliminating the missing values and outliers from the Boston dataset, we identified 659 buildings labeled as multifamily housing.
The second dataset was collected in Aarhus (Århus in Danish), Denmark in 2020 in order to examine the district heating (DH) efficiency of different Danish building types (the data are available at https://data.mendeley.com/datasets/v8mwvy7p6r/1) [57]. From this dataset we extracted the EUI data for multifamily housing built between the 1960s and the 1980s and identified 183 buildings. The mean EUI values for Boston and Aarhus were 66.68 and 130.46, respectively. Since the two samples each comprised homogeneous information, we merged them into one and assumed the number of populations to be two. Thus, univariate Gaussian distributions were considered.
In addition, our calculations took the variable age group from the Danish dataset, representing the period when the building was built, as a new population indicator to illustrate the bivariate case. The segmentation was determined by shifts in Danish building traditions and the tightening of energy requirements in the Danish Building Regulations. Two specific age groups were chosen to constitute the populations: '1951-1960' with 3461 buildings and 'After 2015' with 927 buildings. We then selected a secondary variable, Ga, to measure the daily heat load variation of these buildings, since we postulated that the load variation may form a representative feature for both populations. Ga is calculated as the accumulated positive difference between the hourly and daily average heat loads during a year, divided by both the annual average heat load and the number of hours (8760) in the year. Most of the values of Ga were around 0.2, indicating 20% load deviations on average. Thus, two populations (the two age groups) and two variables (EUI and Ga) were constructed from the Danish dataset alone for the bivariate case.
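The definition of Ga translates directly into code; the following is a minimal sketch (the hourly profiles below are synthetic stand-ins, not the Aarhus data):

```python
import numpy as np

def daily_variation_ga(hourly_load):
    """Ga: accumulated positive deviation of the hourly load from the
    daily average, normalized by the annual average load and the
    number of hours in the year."""
    hourly = np.asarray(hourly_load, dtype=float).reshape(-1, 24)  # 365 days x 24 h
    daily_avg = hourly.mean(axis=1, keepdims=True)
    positive_dev = np.clip(hourly - daily_avg, 0.0, None).sum()
    return positive_dev / (hourly.mean() * hourly.size)

# A perfectly flat profile has no intra-day variation
assert daily_variation_ga(np.ones(8760)) == 0.0

# Alternating profile: half the hours 20% above, half 20% below the daily mean
load = np.tile(np.r_[np.full(12, 1.2), np.full(12, 0.8)], 365)
print(round(daily_variation_ga(load), 3))  # ≈ 0.1
```

The second example shows the scaling: symmetric ±20% swings over half the day yield Ga of about 0.1, so observed values near 0.2 correspond to roughly 20% average load deviations, as stated above.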

The Univariate Case
Histograms were plotted to present the mixed and grouped distributions of EUI for both cities. In the left panel of Figure 3, the mixed distribution created two peaks, although there was no obvious separation in the overlapping part. The true distributions can, however, be seen quite clearly in the grouped (separated) distribution displayed in the right panel of Figure 3. One limitation of using a Gaussian distribution is that the value of EUI cannot be negative, so a truncated Gaussian might be a more appropriate representation. However, since all the truncated data points belonged to the Boston population, using the GMM will not affect the responsibility probability $\gamma_{nk}$ when conducting the E-step. Thus, we still follow the Gaussian assumption. If the Gaussian property is severely violated, as it is, for example, by the heat load variation among the bivariate variables, a Box-Cox transformation is required before implementing EM. Another issue for the parameter estimation is determining the initial values of $\theta$. Comparing several combinations is not a difficult task in the univariate case, but it becomes a problem when the parameter space is large. Thus, we tested a number of possible combinations by varying the initial choices of $\pi_1, \pi_2, \sigma_1, \sigma_2, \mu_1, \mu_2$ and applied the same scheme to the bivariate case. More details on this process can be found in previous work on the setting of initial values [58]. The choice was determined by considering whether the final estimation could represent both populations well. The empirical results showed that the choices of $\pi_1, \pi_2, \sigma_1, \sigma_2$ did not seem to be dominant for convergence, and we simply took $\pi_1 = \pi_2 = 0.5$ and set $\sigma_1, \sigma_2$ to the sample standard deviations. On the other hand, close initial values for $\mu_1$ and $\mu_2$ failed to separate the populations; in most of those experiments, $\mu_1$ and $\mu_2$ converged to a single value. Thus, we initialized $\mu_1$ and $\mu_2$ by constraining one to the lower 1/3 quantile and the other to the upper 1/3 quantile of the mixed population. Further, one hundred initial values for $\mu_1$ and one hundred for $\mu_2$ were randomly generated to obtain ten thousand combinations, from which we randomly chose ten for evaluating the performance.
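The initialization scheme just described can be sketched as follows (a hypothetical helper, assuming a two-component mixture; the synthetic sample only stands in for the merged EUI data):

```python
import numpy as np

rng = np.random.default_rng(42)

def initial_mean_grid(sample, n_each=100, n_pick=10):
    """Draw candidate initial means for a two-component EM run:
    one mean from below the lower-1/3 quantile, the other from
    above the upper-1/3 quantile of the mixed sample, then pick
    a few of the n_each * n_each combinations for evaluation."""
    q_low, q_high = np.quantile(sample, [1 / 3, 2 / 3])
    mu1 = rng.uniform(sample.min(), q_low, n_each)
    mu2 = rng.uniform(q_high, sample.max(), n_each)
    combos = np.array([(a, b) for a in mu1 for b in mu2])
    idx = rng.choice(len(combos), n_pick, replace=False)
    return combos[idx]

# Synthetic stand-in for the merged two-population EUI sample
sample = np.concatenate([rng.normal(66, 31, 600), rng.normal(128, 39, 200)])
picks = initial_mean_grid(sample)
print(picks.shape)  # (10, 2)
```

Keeping the two initial means in separate thirds of the mixed distribution prevents the degenerate case noted above, where both components converge to a single value.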
The ten sets of parameters are summarized in Table 1. For every parameter there was no significant variation, and the mean values can be used to represent the final estimates. We also computed the absolute errors, in percentage terms, between the mean and true values. Most of them were within 5%, while the overestimation of $\pi_2$ for Aarhus might be due to the slightly smaller estimate of $\mu_2$. Instead of fixed initial values, we also allowed random variations of up to 25% for $\pi_1$ and $\pi_2$ to validate our argument. As Table 2 shows, the result resembled Table 1. The same conclusion could be drawn for $\sigma_1$ and $\sigma_2$, which are not shown here. It is also observed that the performance of the log-likelihood values in Figure 4 is uniform. All of the experiments stopped within 15 updates and appeared to converge to the same point. In other words, it is sufficient to make inferences based on the current estimates. We created two scenarios and, for classification, assigned each sample point to the population with the larger weighted density. The two corresponding confusion matrices are presented in Table 3. We divided all the data points into four categories: true Boston, false Boston, true Aarhus and false Aarhus. The accuracy, the sum of both true classification rates, is close to 90%. We then examined the fit between the theoretical and empirical proportions. Proportion here refers to the share of the mixed density for which Boston's EUI should theoretically and empirically account. The theoretical proportion is

$$f_B(x) = \frac{0.77 \, \phi(x \mid 66.71, 31.39)}{0.77 \, \phi(x \mid 66.71, 31.39) + 0.23 \, \phi(x \mid 128.33, 39.38)},$$

where $x$ corresponds to the probability quantile segmentation on the x-axis in Figure 5 for the mixed distribution. Similarly, the empirical proportion, $\hat{f}_B(x)$, counts the number of data points from Boston divided by all data points between two adjacent values of $x$. For example, if 100% of the observations are taken from Boston, $\hat{f}_B(x)$ should be extremely close to 1 at quantile 0.
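The density-based classification rule and the theoretical Boston proportion can be sketched directly from the estimates reported above (a minimal illustration; the helper names are ours, and phi here is parametrized by the standard deviation):

```python
import numpy as np

def phi(x, mu, sigma):
    """Gaussian density with standard-deviation parametrization."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Estimated parameters reported in the text (Boston first, Aarhus second)
pi = (0.77, 0.23)
mu = (66.71, 128.33)
sd = (31.39, 39.38)

def classify(x):
    """Assign each point to the population with the larger weighted density."""
    d0 = pi[0] * phi(x, mu[0], sd[0])
    d1 = pi[1] * phi(x, mu[1], sd[1])
    return np.where(d0 >= d1, "Boston", "Aarhus")

def theoretical_proportion(x):
    """Share of the mixed density attributable to Boston at EUI value x."""
    d0 = pi[0] * phi(x, mu[0], sd[0])
    return d0 / (d0 + pi[1] * phi(x, mu[1], sd[1]))

print(classify(np.array([50.0, 150.0])))  # low EUI -> Boston, high EUI -> Aarhus
print(round(theoretical_proportion(50.0), 2))
```

Evaluating the rule on a labeled sample and cross-tabulating predicted versus true populations yields exactly the confusion matrices of Table 3.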
Both $f_B(x)$ and $\hat{f}_B(x)$ are expected to decrease because $\hat{\mu}_2$ is greater than $\hat{\mu}_1$. The results showed a good fit except for the quantiles close to 1 in the bottom right corner, a consequence of the long tail of Boston's EUI. However, there were only eight data points in total in this area, a minor deviation compared with $\hat{f}_B(x)$. The performance for Aarhus would look exactly the same if Figure 5 were reversed vertically. Given these results, we drew an additional two-step random sample from the distributions to examine the generated EUIs. In step 1, we created a vector storing a random sequence of 0s and 1s that indicates to which population a generated point belongs. The probabilities are taken as $P(I = 0) = 0.77$ and $P(I = 1) = 0.23$, where $I$ is the indicator. The length of the vector is the same as that of the observed sample, namely 842. In step 2, Gaussian random samples were drawn for each population to constitute the generated data. The quantile-quantile plot is shown for both fixed and random initial values of $\mu$. As seen in Figure 6, the quantile values were taken every 5%. It is not surprising that the two settings for the initial $\mu$ hardly differed, and almost all of the quantile values were located on the 45-degree line, meaning that the quantile values matched each other. In this sense, the sample generated under the GMM presented a reliable representation of the populations. Here we only used the true sample to validate a generated sample of the same size. When the method is used in an authentic context, the sample size will depend on the total number of buildings in the original cohort.
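The two-step sampling procedure above can be sketched as follows (the mixture parameters are the estimates quoted in the text; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_mixture(n, pis, mus, sigmas):
    """Two-step generative sampling: draw population indicators first,
    then one Gaussian draw from each point's assigned population."""
    # Step 1: indicator vector (0 -> first population, 1 -> second)
    z = rng.choice(len(pis), size=n, p=pis)
    # Step 2: component-conditional Gaussian draws
    return rng.normal(np.take(mus, z), np.take(sigmas, z)), z

synthetic, z = sample_mixture(842, [0.77, 0.23], [66.71, 128.33], [31.39, 39.38])
print(synthetic.shape, np.bincount(z))
```

Comparing the 5% quantiles of `synthetic` against those of the observed sample reproduces the quantile-quantile check of Figure 6; for a real deployment, `n` would be set to the size of the target building cohort rather than 842.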

The Bivariate Case
Testing the bivariate case requires more parameters to be estimated. The selection of the initial values followed the same paradigms adopted for the univariate case. In order to preserve the Gaussian property, as mentioned in Section 5.2.1, the daily heat load variation (Ga) was treated with a Box-Cox transformation [59]. Since the estimation then became more complex, we also increased the number of experiments for determining the estimates. We picked 20 initial values for each of the four population means, i.e., the EUI and Ga means of the two populations, so that the number of combinations became 20^4 = 160,000. Among all the experiments, we highlighted the combinations with a log-likelihood in the top 10%. We present the resulting pattern in Figure 7, where 400 combinations for the population '1951-1960' are located on the x-axis and 400 combinations for the population 'After 2015' are located on the y-axis. The combinations were arranged from {minimum EUI, minimum Ga} to {minimum EUI, maximum Ga} and then to {maximum EUI, maximum Ga}. The figure shows slight periodic patterns among the larger 20 × 20 grids. Higher log-likelihood values were slightly denser in the top left part. In almost all the larger 20 × 20 grids, however, high log-likelihood values could always be found. Thus, the selection of initial values for the EM algorithm in the bivariate case appears to be somewhat isotropic in finding estimates with high log-likelihood values. Something similar happened when we summarized the results of the top 10% of experiments in Table 4. In the bivariate case the overall errors decreased significantly compared with the univariate case; the majority were now below 3%. The reason for the bivariate model's success might be that it is able to use more of the energy performance features to separate the populations more correctly. It should be noted that the estimated mean of Ga was not equal to the transformed mean value of Ga because the Box-Cox transformation is nonlinear. Thus, we only show the results for the transformed values here; we present both transformed and non-transformed Ga in the generative model evaluation later on.

The classification accuracy was further computed in concordance with the univariate case in Table 5. Given the better parameter estimates, the true accuracy was over 99%. The estimates obtained in Table 4 were used to generate density contours, with dense and sparse areas distinguished on the two-dimensional surface. We compared these with the true distributions because they disclose the real scales of the densities. The contours for the transformed heat load variation are displayed in the left panel of Figure 8, while the result for the non-transformed distributions is shown in the right panel. The generative models are supposed to characterize the distributions in both dense and sparse areas. Both panels had obvious, observable centers, and both centers converged at the densest part of the real data. In other words, the generative models represented the real data to an acceptable degree. As discussed in the univariate case, the actual number of generated samples depends on the city's capacity to evaluate its energy performance with insufficient data.
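The Box-Cox step and the nonlinearity point above can be illustrated as follows. This is a sketch on simulated, right-skewed stand-in data for Ga (the real heat load sample is not included); it uses SciPy's `boxcox`/`inv_boxcox` pair and shows why the estimated mean on the transformed scale is not simply the transform of the raw mean:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)

# Stand-in daily heat load variation (Ga): positive and right-skewed,
# hence the Box-Cox transform to move it closer to Gaussian.
ga = rng.lognormal(mean=0.5, sigma=0.6, size=842)

# Transformed values and the fitted lambda.
ga_t, lam = boxcox(ga)

# The transform is nonlinear, so the mean of the transformed data differs
# from the transform of the raw mean -- the reason the estimated Ga mean
# is reported on the transformed scale.
transform_of_mean = (ga.mean() ** lam - 1) / lam
assert not np.isclose(ga_t.mean(), transform_of_mean)

# Back-transform recovers the original scale for the later evaluation.
ga_back = inv_boxcox(ga_t, lam)
assert np.allclose(ga_back, ga)
```

After this transformation, the bivariate EM updates proceed exactly as in the univariate case, only with vector means and covariance matrices in place of the scalar parameters.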

Discussion
Our proposed method generates synthetic data based on a distributional overview of the energy performance of all the buildings in a district or urban area. However, it does not take any spatial or temporal information into account.
Imposing spatio-temporal mapping onto the synthetic data will help to draw an even better picture of building energy performance at the urban scale. As shown in Figure 9, spatial mapping takes the geolocations of buildings into consideration. In practice, the amount of unlabeled data is far larger than that of labeled data, and the performance of supervised learning models varies when labeled data are limited. By including physical features, synthetic data can be used to select and validate the supervised learning models for each population in a much more robust way.
The GMM method discussed in this work handles temporally aggregated data. With temporal mapping, it is possible to create data with a higher time resolution (for example, hourly data) by including additional variables such as building class and hourly weather data. A set of buildings with known hourly energy consumption and building class could be taken as a set of reference buildings. By weighting the reference buildings and solving a convex optimization problem, the synthetic data could then generate a set of energy performance profiles [60].
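One plausible formulation of this reference-weighting step is a nonnegative least-squares fit: match a synthetic building's aggregate features with a convex combination of reference buildings, then apply the resulting weights to the reference buildings' hourly profiles. The sketch below uses random stand-in profiles and invented feature definitions purely for illustration; the actual formulation in [60] may differ:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)

# Hypothetical reference set: 5 buildings with known hourly load (8760 h).
profiles = rng.random((8760, 5))

# Stand-in aggregate features per reference building ("EUI"-like, "Ga"-like).
features = np.vstack([profiles.sum(axis=0) / 100.0,
                      profiles.std(axis=0)])

# A GMM-generated synthetic building only carries aggregate features;
# here the target is built from a known mixture so the fit is checkable.
target = features @ np.array([0.6, 0.1, 0.1, 0.1, 0.1])

# Nonnegative least squares with a heavily weighted sum-to-one row,
# which softly enforces a convex combination of reference buildings.
A = np.vstack([features, 100.0 * np.ones(5)])
b = np.concatenate([target, [100.0]])
w, _ = nnls(A, b)

# The synthetic building's hourly profile as a weighted reference mix.
hourly = profiles @ w
```

The weights are nonnegative by construction and sum to approximately one, so the generated profile stays within the envelope spanned by the reference buildings.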

Conclusions
Big data are costly to collect and use. Even when this process is automated, many data collection systems operate with incomplete data or, in the case of building energy performance, are only able to collect data on a limited scale. From a statistical point of view, data from different groups are mixed together with little attempt made to distinguish between their representative populations. This hinders the efficient modeling of energy performance data, particularly at a large scale, and the construction of synthetic data for target groups.
In this paper, we proposed a GMM with an EM algorithm that can model building energy performance data in both the univariate and bivariate cases. The energy performance indicators of a sample of buildings from Boston and Aarhus were used to segment the mixed populations. For the univariate case, the Energy Use Intensity data from the two cities were analyzed, and the updates were shown to converge quickly. The derived models were able to capture the distributional features and to reflect the true population. The classification rate was almost 90%, and the generated data matched the observed data quantile by quantile. For the bivariate case, we showed that the inclusion of the new variable, daily heat load variation, further increased the power of the parameter estimation, thus making the classification rate higher than in the univariate case. These data not only provide reliable density representations, but they can also be adjusted according to the real building capacity of a city with a spatio-temporal mapping.
Moreover, there are a number of topics in connection with these findings that would be interesting to explore in future studies.

•	Firstly, in this paper we assumed that the number of populations was known. An interesting investigation would be to detect the optimal number of populations by introducing an objective function. Since the penalty terms of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) have little effect on models with few parameters, selecting the number of populations remains an open issue when that number is small.

•	Secondly, we suggest that other indicators of building energy use should be considered, because overall building performance is usually affected by multiple factors.

•	Finally, it would also be appealing to study other probability distributions directly instead of relying on data transformation.
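The first point above, selecting the number of populations with an information criterion, can be sketched with an off-the-shelf GMM implementation. The example below fits mixtures with k = 1 to 5 components to a simulated stand-in sample (built from the univariate estimates reported earlier) and picks the k that minimizes the BIC; it is an illustration of the criterion, not a resolution of the small-k issue discussed above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Stand-in EUI sample with two underlying populations, roughly matching
# the paper's estimates: 77% at N(66.71, 31.39), 23% at N(128.33, 39.38).
x = np.concatenate([rng.normal(66.71, 31.39, 650),
                    rng.normal(128.33, 39.38, 192)]).reshape(-1, 1)

# Fit GMMs with k = 1..5 components and score each with the BIC
# (lower is better); the chosen k minimizes the criterion.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

AIC could be substituted by replacing `.bic(x)` with `.aic(x)`; as noted above, both criteria become less decisive when the candidate numbers of populations are all small.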

Figure 1. Graphical representation of a set of data points.

Figure 2. Parameter update in the EM algorithm.

Figure 3. Mixed and grouped distribution of EUI.

Figure 4. Log-likelihood performance for the univariate case.

Figure 6. Quantile-quantile plot for observed and generated data.

Figure 7. Distribution of top 10% log-likelihood for the combinations of the initial values.

Figure 8. Density comparisons for the bivariate generative model.

Figure 9. Spatio-temporal mapping of urban buildings from GMM synthetic data.

Table 1. Summary of the parameter estimation for fixed variance.

Table 2. Summary of the parameter estimation for random variance.

Table 3. Classification accuracy for the univariate case.

Table 4. Summary of the parameter estimation in the bivariate case.

Table 5. Classification accuracy for the bivariate case.