Article

An Interpretable Modeling Method for Occupancy in Public Buildings Based on Typical Occupancy Data

1 College of Architecture and Urban Planning, Tongji University, Shanghai 200092, China
2 College of Mechanical and Energy Engineering, Tongji University, Shanghai 201804, China
3 Beijing Key Laboratory of Green Built Environment and Energy Efficient Technology, Beijing University of Technology, Beijing 100124, China
* Authors to whom correspondence should be addressed.
Buildings 2025, 15(23), 4318; https://doi.org/10.3390/buildings15234318
Submission received: 10 October 2025 / Revised: 21 November 2025 / Accepted: 26 November 2025 / Published: 27 November 2025
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

Abstract

Occupancy, defined as the count of occupants, plays an important role in the building design and operation stages. Obtaining reliable occupancy data for public buildings remains a challenging problem due to the lack of available on-site data. With the development of information technologies, the widespread use of smartphones and social networks provides a source for collecting building occupancy data. In this paper, we collect occupancy data of 56 public buildings from social networks. Based on this database, an interpretable occupancy model is proposed, incorporating the effects of trend, day types, months, meteorological parameters, and special events, such as the COVID-19 period and discount days. The modeling process includes the following four steps: (1) extracting typical occupancy data (TOD), (2) extracting key factors through the CatBoost model and SHAP method, (3) model fitting, and (4) model transfer application. The proposed method quantifies the influence of different factors on occupancy and can be applied to simulate occupancy in public buildings without on-site data. Its performance is evaluated through a case study on four public buildings.

1. Introduction

Occupancy data plays a crucial role in both building design and operation stages [1], which can be summarized in the following aspects. (1) Architectural Design and Site Selection: Occupancy data provides critical references during the preliminary planning stage for site selection and determining building floor area or layouts. (2) HVAC System Sizing: In the design stage, occupancy data is necessary for calculating heating and cooling loads, which directly influence the size of heating, ventilation, and air-conditioning (HVAC) equipment. Accurate occupancy data would mitigate equipment oversizing problems, consequently improving HVAC system performance in the operation stage. (3) Urban Planning: Occupancy data helps planners understand the patterns of different types of buildings. It can be used in district energy system planning. (4) Building Energy Analysis: Occupancy data is a key input for building energy models to simulate energy consumption, analyze model uncertainty, and evaluate building energy performance. (5) Building Control: In the operation stage, occupancy data can be used in building automation (BA) system control to avoid energy waste.
For different application scenarios, the requirements for occupancy data are different. Referring to previous studies [2,3,4], Table 1 summarizes the temporal and spatial resolution requirements for occupancy data in different scenarios.
As shown in Table 1, hourly occupancy data at the building level is most widely used. Accurate occupancy data at these resolutions is instrumental for building design, energy simulation, control, and management.
Building occupancy data can be primarily obtained through three approaches: referring to standards or design manuals, on-site surveys or measurements, and model simulation. ASHRAE standards [5] and the subsequent research [6] provide occupancy schedules and densities for different types of buildings, considering the discrepancies between weekdays and weekends. This approach is convenient, but cannot reflect all the variation factors in building occupancy data. When such standard data are applied in an energy simulation, they can differ considerably from measured data [7]. For existing buildings, occupancy data can be collected directly by on-site survey [8], sensors [9,10,11,12,13,14,15], or mobile devices [16,17,18,19,20,21]. The on-site survey method is low-cost and requires no monitoring equipment. However, it is labor-intensive, time-consuming, and unsuitable for buildings with high occupant mobility or large spatial scales. Sensor-based methods can obtain occupant counts through sensors such as passive infrared (PIR) sensors [9,10], pressure sensors [11], cameras [12,13,14], and doors equipped with access control [15]. These approaches generally achieve higher accuracy, but they often require numerous sensors, which involves installation and maintenance costs. The widespread use of mobile devices, such as smartphones, provides location information, which makes it easy and low-cost to collect occupancy data for large-scale buildings. Such location information can be obtained from mobile service providers [16], Wi-Fi access points [17,18], or social media platforms [19,20,21]. Occupancy simulation models include deterministic models and stochastic models. The deterministic models usually extract typical occupancy patterns from on-site surveys or monitoring data via statistical analysis or machine learning techniques, such as clustering methods [19] and decision trees [22]. 
Stochastic occupancy models include sampling from random distributions [23,24], regression models [9], Markov chains [25,26,27], agent-based models [28,29], and machine learning or deep learning models [30,31,32,33,34].
Table 2 summarizes studies on occupancy models in public buildings, focusing on building type, spatial and temporal resolution, data collection methods, and model inputs. Due to the difficulty of collecting occupancy data, most occupancy models are tested on office or school buildings, and very few researchers have studied occupancy models for other types of public buildings. Most models are developed based on historical occupancy data or statistical information on occupants arriving/leaving. Some studies have also considered the influence of factors such as workdays/non-workdays, days of the week, holidays, and seasons.
For buildings with on-site measured data, both deterministic and stochastic models can be chosen to simulate occupancy according to application scenarios and aims. However, most models in existing studies are trained with historical data and require retraining for new buildings. Thus, a significant research gap remains in modeling occupancy for public buildings that lack on-site measurements.
To address this research gap, this study constructs a database of occupancy data of 56 public buildings from social networks. Based on this database, an interpretable occupancy model is developed. The model incorporates the influence of long-term trends, day types, months, weather, and some special events. Four case studies are demonstrated to evaluate model performance in buildings without on-site occupancy data.

2. Methods

In this paper, occupancy data of 56 public buildings is collected from social networks. After preprocessing, the occupancy data and building information are stored in a database. The proposed occupancy modeling method includes four steps: (1) typical occupancy data (TOD) extraction, (2) key factor selection, (3) model fitting, and (4) transfer learning. The research flowchart is illustrated in Figure 1.

2.1. Data Collection

We collected hourly occupancy data from social networks for 56 public buildings in China, with areas ranging from 600 m2 to 1,410,000 m2. Among these, data for 25 buildings were collected from 1 December 2015 to 31 December 2017, while data for the remaining 31 buildings were collected from 1 January 2017 to 31 December 2020. Table 3 shows the sample sizes for different building types.
Our cooperator is one of the largest internet companies in China with more than 120 billion daily global positioning requests covering over 1 billion people. If users turn on location services on their mobile devices and use the applications (apps) developed by our cooperative internet company, real-time positioning data can be recorded. When the target building area is delineated, as shown by the blue line in Figure 2, our cooperator provides us with the hourly occupancy data within that building.
Table 4 lists the building information stored in the database, which includes basic information (including building ID, name, area, and function), construction time (specifying the initial construction year and the most recent renovation year), and location information (including the city location and, within radii of 500 m and 1000 m, the number of bus stops, metro lines, shopping malls, and residential quarters, as well as distances to the nearest airport and train station).

2.2. Data Preprocessing

We collect occupancy data from social networks. If occupants do not use the social apps, their locations cannot be recorded. Thus, the collected data need to be corrected through the following steps:
  1. Zero value correction. A zero value in the occupancy data may simply reflect that the building is unoccupied, so zeros cannot be treated directly as anomalies. Furthermore, occupant behaviors are influenced by social factors such as weekdays, weekends, and holidays, and occupancy data follow different patterns across different day types. We therefore propose a process to deal with zero values in the collected occupancy data (shown in Figure 3).
  2. Nighttime data correction. Occupants sleep and stop using social networks at night. Nighttime data correction is therefore necessary for buildings with residential functions, such as hotels and hospital inpatient departments. To address this issue, the peak value observed between 9 p.m. and 6 a.m. the following morning is selected to replace the values for the entire nighttime period.
  3. Utilization rate correction. Not all occupants use the social networks we choose, so the occupancy data need to be adjusted by the utilization rate. In this paper, the utilization rate is 0.7, which is provided by our cooperative internet company.
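The nighttime and utilization rate corrections can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the night is assumed to cover hours 21:00 through 05:59, and the utilization rate correction is assumed to mean dividing the recorded count by the rate.

```python
def correct_nighttime(hourly, start_hour=21, end_hour=6):
    """Replace every night's values with that night's peak.

    `hourly` is a list of counts whose index 0 is hour 0 of the first day.
    Assumption: a night spans start_hour up to (not including) end_hour.
    """
    corrected = list(hourly)
    n = len(hourly)
    night_len = (24 - start_hour) + end_hour  # 9 hours with the defaults
    # The series may begin inside a night (hours 0..end_hour-1 of day one),
    # followed by one night block starting at start_hour of every day.
    blocks = [range(0, min(end_hour, n))]
    blocks += [range(s, min(s + night_len, n)) for s in range(start_hour, n, 24)]
    for block in blocks:
        if len(block) == 0:
            continue
        peak = max(hourly[i] for i in block)
        for i in block:
            corrected[i] = peak
    return corrected


def scale_by_utilization(hourly, rate=0.7):
    """Assumption: recorded counts are divided by the app utilization rate."""
    return [value / rate for value in hourly]
```

With a rate of 0.7, a recorded count of 70 would correspond to an estimated 100 occupants under this reading.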

2.3. Interpretable Occupancy Modeling Method

Occupancy data show both regular patterns and stochastic variation characteristics. Thus, the occupancy model proposed in this study consists of two parts: a set of typical occupancy data representing the regular patterns, and the effects of the key factors responsible for irregular fluctuations (shown in Equation (1)).
$$\mathrm{Occupancy}_{y,d,h} = T_h \cdot D_i \cdot M_m \cdot W_w \cdot S_s \cdot \mathrm{TOD}_{d,h}, \tag{1}$$
where $\mathrm{Occupancy}_{y,d,h}$ represents the occupancy at hour h on day d of year y, $\mathrm{TOD}_{d,h}$ is the typical occupancy data at hour h on day d, and $T_h$, $D_i$, $M_m$, $W_w$, and $S_s$ represent the effects of trend, day type, month, weather, and special event, respectively.
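As a small worked example, reading Equation (1) as a product of effect factors applied to the typical occupancy, the composition could look like the sketch below. The function name and all numeric values are hypothetical, and treating each effect as a plain multiplicative factor (1.0 meaning no effect) is an assumption.

```python
def occupancy(tod, trend=1.0, day_type=1.0, month=1.0, weather=1.0, special=1.0):
    """Scale typical occupancy data (TOD) by the five effect factors.

    Assumption: every effect is a multiplicative factor, 1.0 = no effect.
    """
    return trend * day_type * month * weather * special * tod


# Hypothetical values: 200 typical occupants, 10% upward trend,
# a quieter month (0.9) and a rainy hour (0.95).
estimate = occupancy(200, trend=1.10, month=0.90, weather=0.95)
```

Under these made-up factors the estimate is about 188 occupants.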

2.3.1. TOD Extraction

The TOD extraction process consists of four parts: detrending, building clustering, TOD extraction, and TOD generator training. The procedure is shown in Figure 4.
  1. Detrending
During collection periods, the occupancy data show an increasing trend due to the enhancement of building reputation or the prevalence of social networks. In order to eliminate the influence of trends and extract typical patterns from the occupancy data, this paper proposes the following detrending process (shown in Figure 5):
  • Apply a logarithmic transformation to the occupancy data and group the data by the hour of day;
  • Calculate the average occupancy by Equation (2):
    $$\bar{y}_{h,d,i} = \frac{\sum y_{h,d}}{N}, \tag{2}$$
    where h represents the hour of day, taking integer values from 0 to 23; d represents the day of week, ranging from 1 to 7; and $\bar{y}_{h,d,i}$ represents the average occupancy at hour h for weekday d, obtained from the i-th calculation;
  • Calculate the residual value $y_{h,d,i}$ by Equation (3):
    $$y_{h,d,i} = y_{h,d} - \bar{y}_{h,d,i}, \tag{3}$$
  • Remove outliers with the k-nearest neighbors (kNN) algorithm;
  • If the residuals are steady, move forward; otherwise, recalculate $\bar{y}_{h,d,i}$ and repeat the subsequent steps;
  • Extract the trend terms at each hour with an ensemble empirical mode decomposition (EEMD) algorithm and aggregate them into the overall trend term;
  • Calculate the detrended occupancy data by Equation (4):
    $$R = \frac{y}{e^{T}}, \tag{4}$$
    where R represents the detrended occupancy data and T represents the trend terms.
Figure 6 presents a comparison of the occupancy data for a public building before and after detrending. The original data in (a) show an obvious upward trend, which is effectively removed in the detrended data shown in (b).
  2. Building clustering
In order to analyze which building occupancies follow similar patterns, this study groups buildings according to the time-series features of the occupancy data. Firstly, the occupancy data are normalized into schedules and maximum values. Secondly, 72 time-series features of each building occupancy schedule are extracted. The feature names and extraction methods are listed in Table 5. Then, principal component analysis (PCA) is applied to reduce the dimension of extracted features. Finally, a k-means clustering algorithm is applied to the data after dimensionality reduction. Euclidean distance is used to calculate the distance between the points and the cluster center. The optimal number of clusters is determined by the Calinski–Harabasz (C–H) criterion [35].
  3. TOD extraction
The k-means algorithm is used again to extract typical occupancy schedules, and typical occupancy densities can be obtained by linear regression of maximum occupancy data and building area.
  4. TOD generator training
In order to generate TOD at any time, a C4.5 decision tree is trained by TOD clustering labels, day types, and months.
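The arithmetic of the detrending step (Equations (2) to (4)) can be sketched in a few lines. This is only an illustration under stated assumptions: the kNN outlier removal and the EEMD trend extraction are omitted, the trend is assumed to be supplied in log space, counts are assumed to be strictly positive, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def group_means(records):
    """Mean log-occupancy per (hour, weekday) group, as in Equation (2).

    `records` is an iterable of (hour, weekday, count) with count > 0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, weekday, count in records:
        sums[(hour, weekday)] += math.log(count)
        counts[(hour, weekday)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def residuals(records, means):
    """Log-occupancy minus its group mean, as in Equation (3)."""
    return [math.log(c) - means[(h, d)] for h, d, c in records]


def detrend(counts, log_trend):
    """Equation (4), read as R = y / exp(T) with T given in log space."""
    return [y / math.exp(t) for y, t in zip(counts, log_trend)]
```

For instance, a series that grows by 10% per step on top of a constant level of 100 returns to that constant level once divided by the exponentiated trend.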

2.3.2. Key Factors Selection

The selection of input variables is a critical step for modeling. Irrelevant inputs not only increase the complexity of models but might also decrease the model accuracy. Conversely, omitting key inputs would become a source of errors.
In this paper, the factors influencing occupancy are categorized into four groups: building information, time, weather, and special events. Table 6 lists all the potential factors considered in this paper.
Because the potential influential factors include both numerical and categorical data, a model-based feature selection approach is adopted here to select key factors influencing occupancy data. Due to the large data size and the imbalanced distributions in some input variables, the CatBoost model [38] is selected as a surrogate model and the SHAP (SHapley Additive exPlanations) method [39] is applied to calculate both individual variable importance and bivariate interaction importance. SHAP is a model-free, additive feature attribution method that quantifies the contribution of each input variable to the model’s output results based on Shapley values, which can be calculated by Equation (5):
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f(S \cup \{i\}) - f(S) \right], \tag{5}$$
where $\phi_i$ denotes the Shapley value for the i-th input; S represents a subset of the input variables, and $N = \{1, 2, \ldots, M\}$ is the set of all input variables. $f(S \cup \{i\})$ and $f(S)$ denote the average predictions of models including and excluding the i-th input. The final model prediction is the sum of the average prediction result $\phi_0$ and the Shapley values for each input, as shown in Equation (6):
$$g = \phi_0 + \sum_{i=1}^{M} \phi_i, \tag{6}$$
The Shapley interaction value is calculated as follows:
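$$\phi_{i,j} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(M - |S| - 2)!}{2\,(M - 1)!} \nabla_{ij}(S), \tag{7}$$
where $i \neq j$ and $\nabla_{ij}(S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$.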
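To make Equations (5) to (7) concrete, the following sketch evaluates them exactly by enumerating subsets for a tiny toy model. It is not how SHAP is computed for CatBoost in practice (tree-based explainers use a much faster algorithm); the set function `toy_f` and all function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def shapley_value(f, M, i):
    """Equation (5): weighted marginal contribution of input i."""
    others = [k for k in range(M) if k != i]
    total = 0.0
    for S in subsets(others):
        weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
        total += weight * (f(S | {i}) - f(S))
    return total


def shapley_interaction(f, M, i, j):
    """Equation (7): pairwise interaction value for inputs i != j."""
    others = [k for k in range(M) if k not in (i, j)]
    total = 0.0
    for S in subsets(others):
        weight = factorial(len(S)) * factorial(M - len(S) - 2) / (2 * factorial(M - 1))
        nabla = f(S | {i, j}) - f(S | {i}) - f(S | {j}) + f(S)
        total += weight * nabla
    return total


def toy_f(S):
    """Toy expected output: base 10, two main effects, one interaction."""
    value = 10.0
    if 0 in S:
        value += 2.0
    if 1 in S:
        value += 3.0
    if 0 in S and 1 in S:
        value += 4.0
    return value
```

In this toy case the interaction of 4 between inputs 0 and 1 is shared equally, so their Shapley values are 4 and 5, input 2 receives 0, and the values add up to the full prediction as required by Equation (6).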
In order to capture the underlying relationships between input variables and occupancy data, the trend and seasonal terms are removed from the occupancy data before training the model. The Tree-structured Parzen Estimator (TPE) algorithm, implemented in the Python package Optuna v1.4.0 [40], is used for hyperparameter optimization. The entire dataset is partitioned into training, validation, and test sets with a ratio of 64%, 16%, and 20%, respectively.

2.3.3. Model Fitting

Table 7 shows the equations and model fitting methods of each effect in Equation (1). The signal function Z t defines the active period of different effects (shown in Equation (8)). It takes the value of 1 if the time falls within the influential period L i = d i 1 d i 1 , d i 2 d i 2 , and 0 otherwise.
$$Z(t) = \begin{cases} 1, & t \in L_i \\ 0, & \text{otherwise} \end{cases} \tag{8}$$
For the trend effect, the Levenberg–Marquardt (LM) algorithm, embedded in the SciPy package [41], is employed for the fitting procedure. The other parameters are estimated by the L-BFGS algorithm available within the PyStan package [42].
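The signal function of Equation (8) is an indicator over the influential period. A minimal sketch, assuming the period can be represented as a list of inclusive (start, end) date pairs (a representation chosen here for illustration, with a hypothetical function name and made-up dates):

```python
from datetime import date


def signal(t, periods):
    """Return 1 if date `t` lies in any inclusive (start, end) interval, else 0."""
    return 1 if any(start <= t <= end for start, end in periods) else 0


# Hypothetical influential period for a week-long holiday effect.
holiday_period = [(date(2019, 10, 1), date(2019, 10, 7))]
inside = signal(date(2019, 10, 3), holiday_period)
outside = signal(date(2019, 10, 8), holiday_period)
```

An effect term is then only active on dates where the signal equals 1.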

2.3.4. Transfer Learning

Transfer learning is the process of using similarities in data, tasks, or model structures to apply models or knowledge acquired in a source domain to a target domain. Transfer learning methods can be categorized into four types: instance-transfer, feature-representation-transfer, parameter-transfer, and relational-knowledge-transfer [43].
Instance-transfer methods assign higher weights to similar instances in the source domain for use in the target domain. Feature-representation-transfer methods transform the feature representations of the models to minimize domain divergence and classification or regression model error. Parameter-transfer methods train a complex model on a large dataset and then fine-tune it with data from a new task. Relational-knowledge-transfer methods focus on mining relationships between the source domain and the target domains.
For occupancy data transfer tasks, the occupancy data of the target building is unavailable. Therefore, an instance-based transfer learning method is proposed in this paper, to assign higher weights to building data with greater similarity to the target building. A weighted summation is then performed on the occupancy schedules of these samples. The maximum occupancy is subsequently obtained from the occupancy density of similar buildings and the building area. Excluding building ID and name, the 15 building features listed in Table 4 are used to evaluate building similarity. The similarity is calculated by Equation (9):
$$\mathrm{sim}_i = 1 - \sum_{f=1}^{F} \delta_f \, d_{if}, \tag{9}$$
where $\mathrm{sim}_i$ represents the similarity of the i-th building and target building; F represents the total count of features; $\delta_f$ is the weight of the f-th feature; $d_{if}$ represents the discrepancy of the f-th feature between the i-th building and target building.
The feature weights $\delta_f$ are obtained from a surrogate XGBoost model, which is trained on the 15 features and the building clustering results from Section 2.3.1. $\delta_f$ is the normalized importance of each feature in the XGBoost model.
For categorical features, such as city and building function, $d_{if}$ is calculated by Equation (10). For other numeric features, $d_{if}$ is calculated by Equation (11) [44]:
$$d_{if} = \begin{cases} 1, & x_{if} \neq x_{tf} \\ 0, & x_{if} = x_{tf} \end{cases} \tag{10}$$
$$d_{if} = \frac{\left| x_{if} - x_{tf} \right|}{\max(x_f) - \min(x_f)}, \tag{11}$$
where $x_{if}$ is the f-th feature of the i-th building and $x_{tf}$ is the f-th feature of the target building.
Because incorporating a large number of low-similarity samples would affect the performance of transfer learning, a similarity threshold δ is introduced in this study. Only samples with a similarity exceeding this threshold are considered for weight calculation (shown as Equation (12)):
$$w_i = \frac{\mathrm{sim}_i}{\sum_i \mathrm{sim}_i}, \tag{12}$$
where $w_i$ is the weight of the i-th building for the target building, computed only over buildings with $\mathrm{sim}_i > \delta$. In this paper, the threshold is 0.8.
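The instance-based transfer in Equations (9) to (12) can be illustrated with the sketch below. The feature names, weights, ranges, and building values are made up for the example, the function names are hypothetical, and the numeric discrepancy is assumed to use an absolute difference.

```python
def feature_discrepancy(x_i, x_t, value_range=None):
    """Equation (10) for categorical features, Equation (11) for numeric ones."""
    if value_range is None:  # categorical feature
        return 0.0 if x_i == x_t else 1.0
    lo, hi = value_range     # numeric feature, range over all buildings
    return abs(x_i - x_t) / (hi - lo)


def similarity(building, target, weights, ranges):
    """Equation (9): one minus the weighted sum of feature discrepancies."""
    return 1.0 - sum(
        w * feature_discrepancy(building[f], target[f], ranges.get(f))
        for f, w in weights.items()
    )


def transfer_weights(similarities, threshold=0.8):
    """Equation (12), restricted to buildings above the similarity threshold."""
    kept = {b: s for b, s in similarities.items() if s > threshold}
    total = sum(kept.values())
    return {b: s / total for b, s in kept.items()}


def transfer_schedule(schedules, weights):
    """Weighted sum of the source buildings' occupancy schedules."""
    hours = len(next(iter(schedules.values())))
    return [sum(weights[b] * schedules[b][h] for b in weights) for h in range(hours)]


# Hypothetical feature weights (summing to 1), ranges, and buildings.
weights = {"function": 0.5, "area": 0.3, "bus_stops": 0.2}
ranges = {"area": (0, 100000), "bus_stops": (0, 10)}
target = {"function": "office", "area": 20000, "bus_stops": 4}
sources = {
    "A": {"function": "office", "area": 30000, "bus_stops": 5},
    "B": {"function": "office", "area": 60000, "bus_stops": 4},
    "C": {"function": "mall", "area": 20000, "bus_stops": 4},
}
sims = {name: similarity(b, target, weights, ranges) for name, b in sources.items()}
w = transfer_weights(sims)
```

Here buildings A and B pass the 0.8 threshold with similarities of about 0.95 and 0.88, while C is excluded because its function differs from the target.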

3. Results and Discussion

In this section, the building clustering results based on occupancy time-series features, the key factor selection results, and the model fitting results are shown and discussed. Then four buildings are demonstrated to evaluate transfer learning performance for buildings without on-site occupancy data.

3.1. Building Clustering Results

Figure 7 illustrates the clustering results of buildings based on time-series features of occupancy. Different colors in the figure denote distinct building functions, and the accompanying numbers indicate the quantity of buildings in each cluster. The following patterns are observed: Cluster0 is predominantly composed of office buildings, hospital outpatient departments, and clinics. Cluster1 consists exclusively of shopping buildings. Cluster2 includes two restaurants and one small shop. Cluster3 is primarily made up of airports, with two shopping buildings included. Cluster4 contains two hospital inpatient departments and five hotel buildings. Cluster5 is composed entirely of universities. Cluster6 consists of museums, while Cluster7 is mainly composed of train stations, along with three shopping buildings.
Table 8 summarizes the range of building areas for each clustering result. For Cluster0, with a building area range of 5000 to 70,000 m2, government buildings and clinics are generally small, while outpatient departments and commercial office buildings (particularly newer ones) tend to be much larger. The sizes of shopping buildings in Cluster1 vary from 27,500 m2 to 205,000 m2. The areas in Cluster2 are all small because this cluster consists of a small shop and fast-food restaurants. Both Cluster3 and Cluster7 exhibit large area variations, because airports and railway stations are typically much larger than the shopping buildings in the same clusters.
For buildings in Cluster0, the building functions include offices, hospital outpatient departments, and clinics. Although they differ in size, the time-series features of occupancy data for these building types are similar.
For shopping buildings, besides Cluster1, there are also 6 other buildings classified in Cluster2, Cluster3, and Cluster7. On the one hand, shopping buildings are more complex and often mixed with other function areas; on the other hand, the occupancy in shopping buildings is also related to the surrounding environment. Therefore, the occupancy patterns of these buildings tend to resemble those of nearby buildings. For example, the shopping building in Cluster2 is a small community shop, while the buildings in Cluster3 and Cluster7 are located farther from the city center but closer to airports and train stations.
Cluster2 includes a small shop and two fast-food restaurants. These buildings have lower values for mean, binarize_mean, and seasonality_strength, but larger values for crossing_points, indicating that the occupancy data is mostly at a lower level with some occasional increases. The seasonal component is small, with less regularity and more randomness. In terms of building information, these buildings are all near residential quarters and have smaller areas.
Cluster3 consists mainly of airport buildings, while Cluster7 consists mainly of train stations. The difference between them is the effect of short-term holidays such as Labor Day. The effect of short-term holidays on airports is negative, while for train stations, it is positive, indicating that there are more people traveling to train stations during short-term holidays.
Cluster4 consists of hotel buildings and hospital inpatient departments, with higher occupancy at night. Cluster5 includes university buildings, and the effects for January, February, July, and August (corresponding to winter and summer vacations) are negative. Cluster6 consists of museum buildings, with positive effects during the winter and summer vacation months.
Based on the above analysis, new building labels corresponding to the clustering results are shown in Table 9.

3.2. Key Factor Selection Results

Figure 8 shows the results of the feature importance with univariate SHAP values. The vertical axis represents the input variables, and the horizontal axis represents the SHAP values. The closer the value is to 0, the smaller the influence of the variable on the occupancy data. The color bar indicates the magnitude of the input variable values: for categorical variables (such as holidays in this model), the color is gray; for other numeric variables, the closer the color is to red, the larger the input variable, and the closer the color is to blue, the smaller the input variable.
As shown in Figure 8, time factors (year, month, day, hour) and static building information (people per area, building clustering type, renovation time) are the primary influences on occupancy. Among social factors, holidays, days to nearest holiday, and the beginning or end of a month or quarter exhibit high SHAP values. By contrast, the impacts of Christmas and Valentine’s Day are negligible. For weather factors, cloud cover exerts minimal influence and can be disregarded, whereas other weather parameters need further analysis.
Figure 9 presents the SHAP values for different holidays. It is shown that the impact of different holidays on building occupancy varies, highlighting the need to treat each holiday individually rather than collectively as a generic holiday.
The SHAP values for different weather factors, including wind level, precipitation, fog level, sand level, and effective temperature, across different building types are shown in Appendix A. Based on these results, the following conclusions can be made:
  • Wind level significantly affects Cluster1 (shopping malls), Cluster3 (airports and nearby shopping buildings), and Cluster7 buildings (train stations and shopping buildings). A wind level exceeding level 6 is observed to substantially influence building occupancy. For other building types, the impact of wind level is negligible;
  • Precipitation notably affects Cluster1, Cluster3, Cluster5 (universities), Cluster6 (museums), and Cluster7 buildings. For all buildings except Cluster7, precipitation levels greater than 0 are associated with negative SHAP values, indicating that precipitation reduces occupancy. The impact of precipitation is minimal for other building types.
  • Fog level mainly influences buildings of Cluster1, Cluster3, Cluster5, and Cluster7. Only fog with a level of 3 (visibility below 10 km) needs to be considered for these buildings.
  • Sandstorms were observed only in samples from Cluster1, Cluster3, and Cluster7 buildings. Thus, the effects of sand level on other building types remain undetermined. For these three building types, only sand density with a level of 2 should be considered.
  • The effect of effective temperature is rather more complex, as it correlates with seasonal variations. Therefore, a bivariate importance analysis between effective temperature and month should be conducted.
Figure 10 illustrates the interactive SHAP values for the start or end of the month or quarter, and the month. It is shown that the impact of the beginning or end of the month and quarter overlaps with holidays. For example, in Figure 10c, the beginning of January, April, and October align with New Year’s Day, Tomb Sweeping Day, and National Day, respectively. In Figure 10b,d, the end of September and December coincide with National Day and New Year’s Day. Therefore, while holidays are significant for occupancy modeling, the start/end of a month or quarter can be omitted.
Figure 11 presents the interactive SHAP values for wind level and hour for Cluster3 buildings. The impact of wind level on occupancy is most pronounced between 7 am and 9 pm. Outside of this period, even when higher wind speeds occur, their effect on occupancy is minimal. A similar pattern is observed for other weather factors.
Figure 12 shows the SHAP importance for effective temperature in Cluster6 buildings, both with and without interaction effects. The SHAP values decrease from 0.0125 to below 0.006, indicating that the impact of effective temperature primarily comes from its interaction with other variables (mainly the month). During warmer months, such as July and August in summer, the effective temperature tends to be higher, which increases the occupancy in Cluster6 buildings (as shown in Figure 12a). However, when interaction effects are removed (as shown in Figure 12b), a turning point is observed at an effective temperature of 27 °C. When the effective temperature exceeds 27 °C, the SHAP value becomes negative, indicating a decrease in occupancy. In the thermal comfort temperature range, effective temperatures above 27 °C are considered hot. Therefore, when people feel hot, they are less likely to visit museums. Since the impact of effective temperature significantly decreases after removing the interaction effects, only the impact of effective temperature above 27 °C should be considered for Cluster0, Cluster6, and Cluster7 buildings.
Based on the results of both individual and bivariate importance analysis, different key factors are selected for different building types, which will be used as inputs for occupancy modeling. The results are listed in Table 10.
For time factors, hour and month are selected as key input variables for all building types, as the effects of weather and other factors are closely associated with these two variables. The effect of the year is primarily attributed to the COVID-19 pandemic in 2020; even after detrending the data, occupancy patterns in 2020 differ significantly from those in other years. Therefore, this study includes COVID-19 as a distinct special factor. However, for Cluster6 buildings, no data from 2020 are available, and thus this factor is not considered. The day of the week is combined with holidays and shift days as a day type factor.
Among the weather factors, effective temperature is considered only when it exceeds 27 °C, and only for buildings in Cluster0, Cluster6, and Cluster7. Wind level is taken into account only at level 6 or above, affecting buildings in Cluster1, Cluster3, and Cluster7. Precipitation is considered only for buildings in Cluster1, Cluster3, Cluster5, Cluster6, and Cluster7. Fog level is included only at level 3 for buildings in Cluster1, Cluster3, Cluster6, and Cluster7. Sand level is considered only at level 2 (representing sandstorms) for buildings in Cluster1, Cluster3, and Cluster7.
For special factors, the Double Eleven festival primarily affects Cluster1 buildings (shopping malls). The influence of Christmas and Valentine’s Day is relatively minor. This is because Christmas often coincides with the New Year holiday, and Valentine’s Day is close to the Spring Festival; thus, their effects are largely captured by the general holiday variable and the days to the nearest holiday variable. Nevertheless, the variables for Christmas and Valentine’s Day have been retained specifically for Cluster1 buildings in this study.

3.3. Model Fitting Results

In this section, model fitting performance is evaluated by the coefficient of determination (R2) and the coefficient of variation in the root mean square error (CV). The results are shown in Figure 13.
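For reference, the two metrics can be computed as in the sketch below. It uses the common definitions (R2 as one minus the ratio of residual to total sum of squares, and CV as the RMSE divided by the mean of the observed values); the exact variant used in the paper is not stated, so this is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def cv_rmse(observed, predicted):
    """Root mean square error normalized by the mean observed value."""
    n = len(observed)
    rmse = sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)
```

Because CV is normalized by mean occupancy, buildings with low average counts can show a high CV even when the absolute errors are small.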
For Cluster0 buildings and Cluster1 buildings, the proposed occupancy models demonstrate good performance, with all R2 values exceeding 0.7. For Cluster3, Cluster5, and Cluster6 buildings, the R2 of most models ranges between 0.6 and 0.8. The high CV values observed in some buildings are primarily due to a sudden increase in occupancy during the COVID-19 period.
For Cluster2 buildings, due to the high randomness of occupancy, the proposed model performs poorly. For Cluster4 buildings, the model’s performance varies significantly. Models for the inpatient departments perform well, with R2 values all above 0.7, while models for hotels perform poorly due to their complex functions and the limited number of samples. For Cluster7 buildings, the best-performing building shows R2 and CV values of 0.89 and 0.28, respectively. The worst-performing building has only two years of data, and the monthly occupancy patterns in those two years differ significantly.

3.4. Transfer Learning Results

Due to the limited number of samples for some building types and the fact that some samples were collected only from 1 December 2015 to 31 December 2017, only four buildings (a train station, a shopping mall, an office, and a hospital) are selected to evaluate the transfer learning method proposed in this paper. For each of these buildings, the transfer learning method is applied in the following two scenarios:
  • Scenario A: Only the building information of the target building is known, and the measured occupancy data of other buildings are used for transfer learning;
  • Scenario B: Only the building information of the target building is known, and the simulated occupancy data from the proposed occupancy models of other buildings are used for transfer learning.
Figure 14 shows the results of transfer modeling that directly uses the measured data from other buildings, for each building type. Except for the hospital, the R2 values of the other three cases all exceed 0.8 and the CV values are below 0.4, indicating that the transferred models capture the actual occupancy patterns of these buildings well. For the hospital, the transfer performance is relatively poor (R2 = 0.45) due to the limited number of samples: only one case for outpatient departments and two for clinics.
Figure 15 shows the results of transfer modeling using the simulated data from the proposed occupancy models. For the train station, shopping mall, and office, transfer learning with simulated data performs worse than with the measured data, especially for the train station, where R2 falls from 0.86 to 0.48. For the shopping mall and office building, however, the drop is relatively minor: the R2 values remain at 0.76 and 0.73, respectively, and the CV values increase by only 0.04 and 0.02 compared with the results using measured data.
It is worth noting that, for the hospital, transfer using the simulated model data outperforms direct transfer using the measured data, especially in May and June. This is because the occupancy model was trained on four years of historical data (2017–2020) and therefore captures more general occupancy variation patterns. This is a double-edged sword. When the target building follows general patterns, transfer learning with model-simulated data is robust to anomalies in the measured data and thus transfers better than using measured data directly. In the hospital case, although the measured occupancy of other hospitals was relatively low in May and June of 2017, occupancy in these months returned to normal values in subsequent years. However, if a new, previously unobserved influencing factor arises, the occupancy model may fail to capture it, resulting in poorer performance than the direct transfer of measured data (as seen in the train station case).

4. Conclusions

Hourly building-scale occupancy data play a crucial role in both the design and operation phases for public buildings. In order to solve the challenge of obtaining occupancy data for public buildings, this paper proposes an interpretable occupancy modeling method. Occupancy data of 56 public buildings are collected from social networks. By analyzing the features of measured data, a concept of typical occupancy data is proposed, and key factors influencing occupancy across different public building types are identified. A weighted-instance transfer learning method is developed to simulate occupancy for buildings lacking on-site measurements.
The main contributions of the proposed method include:
  • This study proposes data preprocessing methods for occupancy data from social networks, and establishes a comprehensive database for both occupancy data and building information.
  • An interpretable public building occupancy modeling method is proposed. The typical occupancy data reflects fundamental occupancy patterns, while the incorporation of trend, day type, weather, and special event factors enables dynamic simulation of public building occupancy. While this modeling method is applicable to other types of public buildings not considered in this study, the key factors and their corresponding influence need to be re-calibrated for new types of buildings.
  • A detailed analysis is conducted on the influence of time and weather factors on occupancy across different building types.
  • A weighted-instance transfer learning method for occupancy data is proposed. This method calculates the similarity between a target building and database samples based on building information, assigns higher weights to more similar samples, and enables occupancy simulation for buildings without measured data.
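The similarity weighting summarized in the last contribution can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the distance is a standardized Euclidean distance over a few building information fields (named after the nomenclature in Table 4), the kernel exp(-distance) is one arbitrary way to turn distance into weight, and a weighted average of source profiles stands in for weighting instances during model fitting. The function names are hypothetical.

```python
from math import exp, sqrt


def similarity_weights(target, sources):
    """Normalised weights that are larger for sources closer to the target.

    target: dict mapping feature name to value.
    sources: dict mapping building id to a dict with the same features.
    """
    features = sorted(target)
    # Scale each feature by its standard deviation across the source buildings
    scale = {}
    for f in features:
        vals = [s[f] for s in sources.values()]
        mean = sum(vals) / len(vals)
        std = sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        scale[f] = std if std > 0 else 1.0
    raw = {}
    for bid, s in sources.items():
        dist = sqrt(sum(((s[f] - target[f]) / scale[f]) ** 2 for f in features))
        raw[bid] = exp(-dist)  # assumed kernel: closer buildings weigh more
    total = sum(raw.values())
    return {bid: w / total for bid, w in raw.items()}


def transfer_profile(weights, profiles):
    """Weighted average of equal-length source occupancy profiles."""
    length = len(next(iter(profiles.values())))
    return [sum(weights[b] * profiles[b][i] for b in weights) for i in range(length)]


# Hypothetical database of two shopping malls: floor area (A) and metro lines within 500 m (mn5)
sources = {"mall_a": {"A": 100000.0, "mn5": 2.0}, "mall_b": {"A": 30000.0, "mn5": 0.0}}
target = {"A": 95000.0, "mn5": 2.0}
weights = similarity_weights(target, sources)
profiles = {"mall_a": [0.2, 0.8, 1.0], "mall_b": [0.1, 0.5, 0.9]}
print(weights, transfer_profile(weights, profiles))
```

In this toy case the target is much closer to `mall_a`, so the transferred profile stays close to that building's profile; with a larger database the same weights could instead be passed as sample weights when fitting the occupancy model.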
The limitations of the proposed method and directions for future work are:
  • Collecting occupancy data from social networks requires occupants to use mobile devices and connect to the internet. Therefore, this data collection method is unsuitable for buildings like kindergartens, primary schools, and middle schools, where most occupants cannot use mobile devices. For these buildings, traditional methods (e.g., access cards, sensors) are recommended.
  • The proposed modeling method is suitable for public buildings like offices, shopping malls, hospitals, airports, universities, and so on, but not for buildings with highly stochastic occupancy (e.g., community shops, fast-food restaurants). The model considers time and environmental factors but omits subjective factors like personal psychology and individual occupant differences.
  • It is difficult to distinguish the actual growth trend in building occupancy from the increase caused by the expanding use of social networks. The true long-term occupancy trend needs further observation with more data accumulated.
  • Due to limitations of the current sample size and diversity, the transfer performance of the proposed method is limited for some buildings. Furthermore, the influence of some weather factors (such as sand level) is not pronounced in some building types, due to a lack of observed data. However, the proposed modeling process is general, and the model performance is expected to improve with a larger and more diverse sample collection in the future.

Author Contributions

Methodology, coding and original draft writing, J.G.; visualization and original draft writing, Y.Z.; writing—review and editing, Y.J.; conceptualization and supervision, P.X.; resources and writing—review and editing, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (No. 52161135202).

Data Availability Statement

Data available on request due to restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TOD: Typical occupancy data
HVAC: Heating, ventilation and air-conditioning
PIR: Passive infrared
SHAP: SHapley Additive exPlanations
BA: Building automation
GP: Gaussian process
HMM: Hidden Markov model
GAN: Generative adversarial network
RNN: Recurrent neural network
SARIMA: Seasonal autoregressive integrated moving average
ANN: Artificial neural network
LSTM: Long short-term memory
LM: Levenberg–Marquardt
L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno
ET: Effective temperature

Appendix A

This appendix shows the SHAP values of different weather factors for different building clustering results.
Figure A1. SHAP values of wind level for different building clustering results.
Figure A2. SHAP values of rain level for different building clustering results.
Figure A3. SHAP values of fog level for different building clustering results.
Figure A4. SHAP values of sand level for different building clustering results.
Figure A5. SHAP values of effective temperature for different building clustering results.

References

  1. Dong, B.; Yan, D.; Li, Z.; Jin, Y.; Feng, X.; Fontenot, H. Modeling Occupancy and Behavior for Better Building Design and Operation—A Critical Review. Build. Simul. 2018, 11, 899–921.
  2. Chong, A.; Augenbroe, G.; Yan, D. Occupancy Data at Different Spatial Resolutions: Building Energy Performance and Model Calibration. Appl. Energy 2021, 286, 116492.
  3. Jin, Y.; Yan, D.; Chong, A.; Dong, B.; An, J. Building Occupancy Forecasting: A Systematical and Critical Review. Energy Build. 2021, 251, 111345.
  4. Sangogboye, F.C.; Arendt, K.; Jradi, M.; Veje, C.; Kjærgaard, M.B.; Jørgensen, B.N. The Impact of Occupancy Resolution on the Accuracy of Building Energy Performance Simulation. In Proceedings of the 5th Conference on Systems for Built Environments, Shenzhen, China, 7–8 November 2018; ACM: New York, NY, USA, 2018; pp. 103–106.
  5. ASHRAE. 90.1-2013 User’s Manual; ASHRAE Inc.: Atlanta, GA, USA, 2013.
  6. Goel, S.; Athalye, R.A.; Wang, W.; Zhang, J.; Rosenberg, M.I.; Xie, Y.; Hart, P.R.; Mendon, V.V. Enhancements to ASHRAE Standard 90.1 Prototype Building Models; Pacific Northwest National Laboratory (PNNL): Richland, WA, USA, 2014.
  7. Duarte, C.; Van Den Wymelenberg, K.; Rieger, C. Revealing Occupancy Patterns in an Office Building through the Use of Occupancy Sensor Data. Energy Build. 2013, 67, 587–595.
  8. Niu, M.; Ji, Y.; Zhao, M.; Gu, J.; Li, A. A Study on Carbon Emission Calculation in Operation Stage of Residential Buildings Based on Micro Electricity Usage Behavior: Three Case Studies in China. Build. Simul. 2024, 17, 147–164.
  9. Sangogboye, F.C.; Arendt, K.; Singh, A.; Veje, C.T.; Kjærgaard, M.B.; Jørgensen, B.N. Performance Comparison of Occupancy Count Estimation and Prediction with Common versus Dedicated Sensors for Building Model Predictive Control. Build. Simul. 2017, 10, 829–843.
  10. Ohsugi, S.; Koshizuka, N. Delivery Route Optimization Through Occupancy Prediction from Electricity Usage. In Proceedings of the International Computer Software and Applications Conference, Tokyo, Japan, 23–27 July 2018; Volume 1.
  11. Labeodan, T.; Zeiler, W.; Boxem, G.; Zhao, Y. Occupancy Measurement in Commercial Office Buildings for Demand-Driven Control Applications—A Survey and Detection System Evaluation. Energy Build. 2015, 93, 303–314.
  12. Dong, B.; Lam, K.P. Building Energy and Comfort Management through Occupant Behaviour Pattern Detection Based on a Large-Scale Environmental Sensor Network. J. Build. Perform. Simul. 2011, 4, 359–369.
  13. Ryu, S.H.; Moon, H.J. Development of an Occupancy Prediction Model Using Indoor Environmental Data Based on Machine Learning Techniques. Build. Environ. 2016, 107, 1–9.
  14. Yang, Y.; Yuan, Y.; Pan, T.; Zang, X.; Liu, G. A Framework for Occupancy Prediction Based on Image Information Fusion and Machine Learning. Build. Environ. 2022, 207, 108524.
  15. Lian, H.; Wei, H.; Wang, X.; Chen, F.; Ji, Y.; Xie, J. Research on Real-Time Energy Consumption Prediction Method and Characteristics of Office Buildings Integrating Occupancy and Meteorological Data. Buildings 2025, 15, 404.
  16. Ahas, R.; Silm, S.; Järv, O.; Saluveer, E.; Tiru, M. Using Mobile Positioning Data to Model Locations Meaningful to Users of Mobile Phones. J. Urban Technol. 2010, 17, 3–27.
  17. Zhang, S.; Xiao, W.; Gong, J.; Yin, Y. Mobile Sensing and Simultaneously Node Localization in Wireless Sensor Networks for Human Motion Tracking. Appl. Bionics Biomech. 2012, 9, 367–374.
  18. Wang, W.; Chen, J.; Song, X. Modeling and Predicting Occupancy Profile in Office Space with a Wi-Fi Probe-Based Dynamic Markov Time-Window Inference Approach. Build. Environ. 2017, 124, 130–142.
  19. Kang, X.; Yan, D.; An, J.; Jin, Y.; Sun, H. Typical Weekly Occupancy Profiles in Non-Residential Buildings Based on Mobile Positioning Data. Energy Build. 2021, 250, 111264.
  20. Mohammadi, N.; Taylor, J.E. Urban Energy Flux: Spatiotemporal Fluctuations of Building Energy Consumption and Human Mobility-Driven Prediction. Appl. Energy 2017, 195, 810–818.
  21. Lu, X.; Feng, F.; Pang, Z.; Yang, T.; O’Neill, Z. Extracting Typical Occupancy Schedules from Social Media (TOSSM) and Its Integration with Building Energy Modeling. Build. Simul. 2021, 14, 25–41.
  22. D’Oca, S.; Hong, T. Occupancy Schedules Learning Process through a Data Mining Framework. Energy Build. 2015, 88, 395–408.
  23. Sha, H.; Xu, P.; Yan, C.; Ji, Y.; Zhou, K.; Chen, F. Development of a Key-Variable-Based Parallel HVAC Energy Predictive Model. Build. Simul. 2022, 15, 1193–1208.
  24. Gang, W.; Wang, S.; Shan, K.; Gao, D. Impacts of Cooling Load Calculation Uncertainties on the Design Optimization of Building Cooling Systems. Energy Build. 2015, 94, 1–9.
  25. Page, J.; Robinson, D.; Morel, N.; Scartezzini, J.L. A Generalised Stochastic Model for the Simulation of Occupant Presence. Energy Build. 2008, 40, 83–98.
  26. Manna, C.; Fay, D.; Brown, K.N.; Wilson, N. Learning Occupancy in Single Person Offices with Mixtures of Multi-Lag Markov Chains. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; IEEE: New York, NY, USA, 2013; pp. 151–158.
  27. Dong, B.; Lam, K.P. A Real-Time Model Predictive Control for Building Heating and Cooling Systems Based on the Occupancy Behavior Pattern Detection and Local Weather Forecasting. Build. Simul. 2014, 7, 89–106.
  28. Liao, C.; Lin, Y.; Barooah, P. Agent-Based and Graphical Modelling of Building Occupancy. J. Build. Perform. Simul. 2012, 5, 5–25.
  29. Chen, Y.; Hong, T.; Luo, X. An Agent-Based Stochastic Occupancy Simulator. Build. Simul. 2018, 11, 37–49.
  30. Chen, Z.; Jiang, C. Building Occupancy Modeling Using Generative Adversarial Network. Energy Build. 2018, 174, 372–379.
  31. Das, A.; Kjærgaard, M.B. Precept: Occupancy Presence Prediction inside a Commercial Building. In Proceedings of the UbiComp/ISWC 2019—Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, London, UK, 9–13 September 2019.
  32. Huang, W.; Lin, Y.; Lin, B.; Zhao, L. Modeling and Predicting the Occupancy in a China Hub Airport Terminal Using Wi-Fi Data. Energy Build. 2019, 203, 109439.
  33. Jin, Y.; Yan, D.; Kang, X.; Chong, A.; Sun, H.; Zhan, S. Forecasting Building Occupancy: A Temporal-Sequential Analysis and Machine Learning Integrated Approach. Energy Build. 2021, 252, 111362.
  34. Wang, Z.; Hong, T.; Piette, M.A. Data Fusion in Predicting Internal Heat Gains for Office Buildings through a Deep Learning Approach. Appl. Energy 2019, 240, 386–398.
  35. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  36. Triebe, O.; Hewamalage, H.; Pilyugina, P.; Laptev, N.; Bergmeir, C.; Rajagopal, R. NeuralProphet: Explainable Forecasting at Scale. arXiv 2021, arXiv:2111.15397.
  37. Jiang, X.; Srivastava, S.; Chatterjee, S.; Yu, Y.; Handler, J.; Zhang, P.; Bopardikar, R.; Li, D.; Lin, Y.; Thakore, U.; et al. Kats 2022. Available online: https://github.com/facebookresearch/Kats (accessed on 13 November 2025).
  38. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. In Proceedings of the Workshop on ML Systems at NIPS 2017, Long Beach, CA, USA, 8 December 2017.
  39. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777.
  40. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019.
  41. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272.
  42. Carpenter, B.; Gelman, A.; Hoffman, M.D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; Riddell, A. Stan: A Probabilistic Programming Language. J. Stat. Softw. 2017, 76, 1–32.
  43. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
  44. Han, J.; Kamber, M. Data Mining: Concepts and Techniques, 3rd ed.; Elsevier: Burlington, MA, USA, 2012; ISBN 978-0-12-381479-1.
Figure 1. Research flowchart.
Figure 2. Illustration of the building area delineated for occupancy collection.
Figure 3. Process to deal with zero value in occupancy data from social networks.
Figure 4. Process for TOD extraction.
Figure 5. Detrending process for occupancy data.
Figure 6. Comparison of the occupancy data for a public building before and after detrending. (a) Occupancy data before detrending; (b) occupancy data after detrending.
Figure 7. Building clustering results based on time-series features of occupancy.
Figure 8. Results of the feature importance with univariate SHAP values.
Figure 9. SHAP values of different holidays.
Figure 10. Interactive SHAP values for the start/end of a month or quarter and month. (a) start of a month and month; (b) end of a month and month; (c) start of a quarter and month; (d) end of a quarter and month.
Figure 11. Interactive SHAP values for wind level and hour.
Figure 12. SHAP values for effective temperature in Cluster6 buildings. (a) SHAP value with interaction effects; (b) SHAP value without interaction effects.
Figure 13. Boxplots of R2 and CV values between simulated and measured data for different building types.
Figure 14. Transfer learning results for Scenario A: (a) train station (R2 = 0.86, CV = 0.27); (b) shopping mall (R2 = 0.86, CV = 0.31); (c) office (R2 = 0.83, CV = 0.37); (d) hospital (R2 = 0.45, CV = 0.41).
Figure 15. Transfer learning results for Scenario B: (a) train station (R2 = 0.48, CV = 0.42); (b) shopping mall (R2 = 0.76, CV = 0.35); (c) office (R2 = 0.73, CV = 0.39); (d) hospital (R2 = 0.52, CV = 0.38).
Table 1. Temporal and spatial resolution requirements for occupancy data on different scenarios.
| Stage | Application Scenario | Building Type | Spatial Resolution | Temporal Resolution |
|---|---|---|---|---|
| design | HVAC source equipment sizing | all | building | hour |
| design | HVAC terminal equipment sizing | all | room | hour |
| design | building layout optimization | all | room | hour |
| design | urban planning | all | building | day/hour |
| operation | demand response | all | building | minute/hour |
| operation | building energy assessment | all | building | hour |
| operation | HVAC control | Type A ¹ | room | minute/hour |
| operation | HVAC control | Type B ² | zone/building | minute/hour |
| operation | lighting control | Type A ¹ | room | minute/hour |
| operation | lighting control | Type B ² | zone/building | minute/hour |
| operation | elevator control | all | floor | minute |
¹ Type A refers to buildings where terminal equipment could be controlled by occupants, such as offices, hotels, universities, etc. ² Type B refers to buildings where all the HVAC and lighting equipment are centrally controlled, such as shopping malls, airports, train stations, etc.
Table 2. Summary of occupancy modeling studies in public buildings.
| Method | Building Type | Spatial Resolution | Temporal Resolution | Data Collection Method | Inputs | References |
|---|---|---|---|---|---|---|
| k-means | railway station, hospital, commercial complex | building | 1 h | mobile device | historical data | [19] |
| k-means + decision tree | office | room | 10 min | lighting on/off | historical data, time of the day, day of the week, window change | [22] |
| sampling | hotel | building | 1 h | / | statistical parameter | [23] |
| sampling | office | zone | 1 h | / | statistical parameter | [24] |
| regression | school | room | min | PIR, camera | historical data, CO2 concentration, temperature, day type, season, holiday | [9] |
| Markov chains | office | room | 1 h | position sensor | occupant leaving times, occupant leaving interval, transition matrix | [25] |
| Markov chains | office | room | 10 min–1 h | PIR | historical data | [26] |
| GP+HMM | office | room | min | sensor fusion | historical data | [27] |
| Agent-based | school | room | 15 min | camera | occupant arriving time, occupant leaving time, occupied interval | [28] |
| Agent-based | office | room | min | survey | occupant type, occupant density, parameters of events | [29] |
| GAN | school | room | 15 min | camera | historical data | [30] |
| RNN | commercial building | room | / | camera | historical data | [31] |
| Bayesian model based | airport | zone | 1 h | Wi-Fi | historical data | [32] |
| SARIMA-ANN | airport, train station | building | 1 h | mobile device | historical data | [33] |
| LSTM | office | building | min | camera | historical data | [34] |
Table 3. Sample sizes for different building types.
| Category | Subcategory | Sample Size |
|---|---|---|
| office | government building | 4 |
| office | commercial office building | 3 |
| shopping area | shopping mall | 19 |
| shopping area | supermarket | 2 |
| shopping area | shop | 1 |
| hotel | luxury hotel | 2 |
| hotel | business hotel | 2 |
| hotel | budget inn | 1 |
| restaurant | quick service restaurant | 2 |
| hospital | outpatient department | 2 |
| hospital | inpatient department | 2 |
| hospital | clinic | 2 |
| education | university | 2 |
| transportation | airport | 5 |
| transportation | train station | 5 |
| art | museum | 2 |
Table 4. Building information.
| Type | Information | Nomenclature |
|---|---|---|
| basic information | building ID | Id |
| basic information | name | bName |
| basic information | area | A |
| basic information | function | bFunc |
| construction time | initial construction year | iTime |
| construction time | most recent renovation year | reTime |
| location | city | C |
| location | number of bus stops within a radius of 500 m | bn5 |
| location | number of bus stops within a radius of 1000 m | bn10 |
| location | number of metro lines within a radius of 500 m | mn5 |
| location | number of metro lines within a radius of 1000 m | mn10 |
| location | number of shopping malls within a radius of 500 m | sn5 |
| location | number of shopping malls within a radius of 1000 m | sn10 |
| location | number of residential quarters within a radius of 500 m | rn5 |
| location | number of residential quarters within a radius of 1000 m | rn10 |
| location | distance to nearest airport | na |
| location | distances to nearest train station | nt |
Table 5. Time-series features selected in this paper.
| Type | Count | Feature Name | Equation or Extraction Method |
|---|---|---|---|
| Holiday effects | 7 | ChineseNewYear, DragonBoat, LaborDay, Mid-Autumn, NationalDay, NewYearsDay, TombSweepingDay | prophet model [36] |
| Month effects | 12 | m = 1, m = 2, m = 3, m = 4, m = 5, m = 6, m = 7, m = 8, m = 9, m = 10, m = 11, m = 12 | $\frac{\mathrm{mean}_{\mathrm{month},i} - \mathrm{mean}_{\mathrm{year}}}{\mathrm{mean}_{\mathrm{year}}}$ |
| Hour effects | 24 | h = 0, h = 1, h = 2, h = 3, h = 4, h = 5, h = 6, h = 7, h = 8, h = 9, h = 10, h = 11, h = 12, h = 13, h = 14, h = 15, h = 16, h = 17, h = 18, h = 19, h = 20, h = 21, h = 22, h = 23 | $\frac{\mathrm{mean}_{\mathrm{hour},i} - \mathrm{mean}_{\mathrm{year}}}{\mathrm{mean}_{\mathrm{year}}}$ |
| Statistics | 14 | mean, var, entropy, lumpiness, stability, flat_spots, heterogeneity, crossing_points, binarize_mean, histogram_mode, level_shift_idx, firstmin_ac, firstzero_ac, linearity | Kats [37] |
| Others | 15 | trend_strength, seasonality_strength, y_acf1, y_acf5, diff1y_acf1, diff1y_acf5, diff2y_acf1, diff2y_acf5, y_pacf5, diff1y_pacf5, diff2y_pacf5, seas_acf1, seas_pacf1, holt_alpha, holt_beta | Kats [37] |
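As an illustration of the month and hour effect features in Table 5, the sketch below assumes that each feature is the relative deviation of the monthly (or hourly) mean from the annual mean, i.e., (group mean − annual mean) / annual mean, and that the input covers one year of hourly records. The function name `relative_effects` and the toy data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def relative_effects(records, key):
    """Relative deviation of each group's mean from the overall mean.

    records: iterable of (timestamp, occupancy) pairs, assumed to span one year.
    key: function mapping a timestamp to a group label (month or hour).
    """
    groups = defaultdict(list)
    values = []
    for ts, occ in records:
        groups[key(ts)].append(occ)
        values.append(occ)
    overall = sum(values) / len(values)  # stands in for the annual mean
    return {g: (sum(v) / len(v) - overall) / overall for g, v in sorted(groups.items())}


# Toy records: quiet January, busy February (illustrative values only)
records = [
    (datetime(2017, 1, 5, 9), 10.0),
    (datetime(2017, 1, 5, 15), 10.0),
    (datetime(2017, 2, 5, 9), 30.0),
    (datetime(2017, 2, 5, 15), 30.0),
]
month_effects = relative_effects(records, key=lambda ts: ts.month)
hour_effects = relative_effects(records, key=lambda ts: ts.hour)
print(month_effects, hour_effects)
```

Because the features are normalised by the annual mean, buildings of very different sizes can be compared and clustered on the shape of their occupancy pattern rather than on its magnitude.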
Table 6. Potential influential factors.
| Type | Name | Nomenclature | Data Type |
|---|---|---|---|
| building information | new building type obtained from Section 2.3.1 | newBType | multinomial categorical |
| building information | floor area per person | area_per_occ | continuous numeric |
| building information | most recent renovation year | reTime | discrete numeric |
| time factor | year | year | discrete numeric |
| time factor | month of year | month | discrete numeric |
| time factor | day of month | day | discrete numeric |
| time factor | day of week | dayofweek | discrete numeric |
| time factor | hour of day | hour | discrete numeric |
| special factor | holiday | holidays | multinomial categorical |
| special factor | shift day | is_Shifts | binary categorical |
| special factor | days to nearest holiday | days2holidays | discrete numeric |
| special factor | month start | is_month_start | binary categorical |
| special factor | month end | is_month_end | binary categorical |
| special factor | quarter start | is_quarter_start | binary categorical |
| special factor | quarter end | is_quarter_end | binary categorical |
| special factor | Valentine’s Day ¹ | is_Valentine_day | binary categorical |
| special factor | “Double Eleven” shopping festival | is_1111_day | binary categorical |
| special factor | Christmas Day ¹ | is_Christmas_day | binary categorical |
| weather factor | effective temperature | ET | continuous numeric |
| weather factor | wind level | Windy | discrete numeric |
| weather factor | precipitation level | Rainy | discrete numeric |
| weather factor | cloud cover level | Cloud | discrete numeric |
| weather factor | fog level | Foggy | discrete numeric |
| weather factor | sand level | Sandy | discrete numeric |
¹ Valentine’s Day and Christmas Day are not public holidays in China, but they are widely associated with commercial discount events.
Table 7. Model fitting method.
| Type | Equation | Parameter | Model Fitting Method |
|---|---|---|---|
| trend effect | $T_h = e^{A_2 + \frac{A_1 - A_2}{1 + (t/t_0)^p}}$; $T_h = e^{a + bt}$ | $t$: time; $t_0$, $A_1$, $A_2$, $p$, $a$, $b$: parameters to be estimated | LM |
| day type effect | $D_i = \beta_i Z_i(t)$ | $\beta_i$, $\beta_m$, $\beta_w$, $\beta_s$: effect coefficients of day type, month, weather, and special events, respectively; $Z_i$, $Z_m$, $Z_w$, $Z_s$: signal functions of day type, month, weather, and special events, respectively | L-BFGS |
| month effect | $M_m = \beta_m Z_m(t)$ | as above | L-BFGS |
| weather effect | $W_w = \beta_w Z_w(t)$ | as above | L-BFGS |
| special event effect | $S_s = \beta_s Z_s(t)$ | as above | L-BFGS |
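The two trend forms in Table 7 can be evaluated as in the sketch below. It assumes the trend equations read T_h = exp(A2 + (A1 − A2) / (1 + (t/t0)^p)) and T_h = exp(a + b·t). To keep the example free of external dependencies, the exponential form is fitted by ordinary least squares on log(T_h); this is a simple stand-in for illustration, not the Levenberg–Marquardt fitting used in the paper. All function names are hypothetical.

```python
from math import exp, log


def logistic_trend(t, a1, a2, t0, p):
    """Assumed logistic trend: exp(A2 + (A1 - A2) / (1 + (t / t0) ** p))."""
    return exp(a2 + (a1 - a2) / (1.0 + (t / t0) ** p))


def exponential_trend(t, a, b):
    """Assumed exponential trend: exp(a + b * t)."""
    return exp(a + b * t)


def fit_exponential_trend(times, values):
    """Estimate (a, b) by least squares on log(values); values must be positive."""
    logs = [log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    b = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    a = mean_l - b * mean_t
    return a, b


# Synthetic, noise-free trend generated with a = 0.5 and b = 0.1
times = [0.0, 1.0, 2.0, 3.0]
values = [exponential_trend(t, 0.5, 0.1) for t in times]
print(fit_exponential_trend(times, values))
```

Under the assumed logistic form, the trend starts at exp(A1) for t = 0 and approaches exp(A2) as t grows, which suits a data source whose user base grows and then saturates.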
Table 8. The range of building areas for each clustering result.
| Clustering Results | Count | Areas (m²) |
|---|---|---|
| Cluster0 | 11 | 5000–70,000 |
| Cluster1 | 16 | 27,500–205,000 |
| Cluster2 | 3 | 600–900 |
| Cluster3 | 7 | 8000–1,410,000 |
| Cluster4 | 7 | 20,000–90,000 |
| Cluster5 | 2 | 12,728–26,000 |
| Cluster6 | 2 | 18,695–100,600 |
| Cluster7 | 8 | 44,000–700,000 |
Table 9. Building clustering results.
| Clustering Results | Labels |
|---|---|
| Cluster0 | offices, hospital outpatient departments, and clinics |
| Cluster1 | shopping malls |
| Cluster2 | small-scale buildings near residential quarters |
| Cluster3 | airports and nearby shopping buildings |
| Cluster4 | hotels and hospital inpatient departments |
| Cluster5 | universities |
| Cluster6 | museums |
| Cluster7 | train stations and nearby shopping buildings |
Table 10. Key factors selected for different building clustering results.
Columns: Factor, Cluster0, Cluster1, Cluster2, Cluster3, Cluster4, Cluster5, Cluster6, Cluster7.
Factors (rows): hour; month; year; day type; ET (≥27 °C); windy (≥6); rainy; foggy (=3); sandy (=2); Christmas Day; Double Eleven Day; Valentine’s Day; COVID-19.
✓ means the factor is selected.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gu, J.; Zhu, Y.; Ji, Y.; Xu, P.; Li, L. An Interpretable Modeling Method for Occupancy in Public Buildings Based on Typical Occupancy Data. Buildings 2025, 15, 4318. https://doi.org/10.3390/buildings15234318
