1. Introduction
Accurate and fine-grained spatial knowledge of urban population distribution is fundamental to effective urban management and transportation planning [1]. Modern transportation systems, including optimized public transit networks and precise Travel Demand Models (TDMs), rely heavily on population density information [2,3]. However, the demographic data traditionally provided by national census reports are aggregated to coarse administrative or statistical boundaries (e.g., census tracts or townships), thus failing to capture the significant population heterogeneity within these zones [4,5]. This fundamental limitation leads to suboptimal traffic analysis, inaccurate demand forecasting, and ineffective emergency response strategies that cannot account for population at the micro-level [6]. Therefore, the task of spatializing coarse census counts into highly resolved population layers, specifically at the scale of individual buildings or residential units, remains a persistent and high-value frontier in urban transport research [6,7].
The most widely adopted technique for population disaggregation is dasymetric mapping, which uses ancillary spatial data to guide the reallocation of population from large source areas to smaller target units [8,9,10]. Early dasymetric approaches relied on simple land cover classifications or large-area indicators like Nighttime Light (NTL) imagery [10,11,12]. More recently, the field has advanced significantly by leveraging high-resolution imagery, LiDAR, and Point-of-Interest (POI) data to capture the urban form at the micro-scale. Specifically, models integrating building characteristics (e.g., footprint area, height, volume) and urban functional zoning (derived from POI/AOI data) have demonstrated superior accuracy by establishing direct, data-driven relationships between residential structures and population counts [13]. Machine learning approaches, such as Random Forest (RF) and ensemble methods, have become state-of-the-art tools for capturing the complex, non-linear correlations between these geospatial features and population distributions [14].
However, traditional machine learning models often rely on shallow architectures that may struggle to extract high-level semantic representations from high-dimensional urban features [15]. Consequently, deep learning (DL) techniques, such as Convolutional Neural Networks (CNNs) and multi-layer perceptrons (MLPs), have recently gained traction due to their superior capability in modeling non-linear urban dynamics [16]. Despite their predictive power, a critical limitation in most standard DL-based population models is the neglect of spatial dependency and spatial non-stationarity [17]. This oversight often leads to ‘spatial smoothing’ issues where local population spikes (e.g., high-density apartments) are under-predicted because the model fails to encode the specific spatial context or neighborhood effects [17].
While building-based models excel at capturing the static residential population, the growing need to understand the dynamic ambient population distribution, which is essential for real-time traffic management and dynamic OD generation, has led to the introduction of mobile phone signaling data (or Location-Based Service data) as a critical, high-frequency data source [8,18,19]. However, fusing noisy, spatiotemporally sparse mobile sensing data with static, structural building information presents a major methodological challenge [20]. Current fusion practices often involve deterministic overlay rules or rely on non-probabilistic machine learning models [21,22]. A key limitation of these non-probabilistic methods is that they can neither formally quantify and propagate the uncertainty inherent in each heterogeneous data source nor naturally integrate domain knowledge (such as a building’s likely function) as a probabilistic prior [23,24]. This rigidity often leads to suboptimal allocation and limited interpretability in fine-scale predictions. Another challenge in multi-source fusion is ensuring hierarchical consistency between micro-scale predictions and macro-scale ground truth [25]. Most existing disaggregation methods operate in a ‘bottom-up’ manner without a feedback mechanism to strictly constrain the sum of estimated building populations to match the official census data [26], which compromises the optimality of the model and prevents the error gradients from propagating back to the feature learning stage. Developing a framework that can intrinsically learn micro-level features while satisfying macro-level aggregate constraints remains an open problem.
To address the critical research gap in robust, probabilistic data fusion for micro-scale residential population estimation, the major contributions of this study are twofold.
1. Methodological innovation: We propose a Bayesian-informed hierarchical learning model framework that probabilistically fuses heterogeneous data (mobile signaling, POI, building features) [27]. Unlike deterministic approaches, this model is rigorously optimized based on both micro-level feature fidelity and macro-level statistical consistency.
2. Practical application: By effectively incorporating building functional types as probabilistic priors, the framework significantly enhances the discrimination of residential capacity at the individual building level. This yields superior prediction accuracy over state-of-the-art baselines, providing high-fidelity spatial data that are essential for city planning.
The remainder of this paper is organized as follows. Section 2 describes the data sources and data preprocessing, while Section 3 details the proposed model framework. Section 4 presents the results, including a comparative analysis against several benchmark models, and discusses the findings. Finally, Section 5 provides the conclusions and suggests avenues for future research.
3. Methodology
This study proposes a Bayesian-informed hierarchical learning (BIHL) model framework to estimate individual building populations, as shown in Figure 1, which enables fine-grained disaggregation of a population from coarse-grained census data to a building-level distribution with three layers. The first layer is a data-driven prior model that employs a LightGBM ensemble to generate initial probabilistic estimates and quantify the uncertainty of building-level populations. The second layer is a neural network-based estimator that learns the non-linear relationship between building attributes and population. The third layer acts as a deterministic proxy of Bayesian updating. It optimizes a heuristic multi-objective loss function, where the prior uncertainty from the first layer dynamically modulates the micro-level feature learning using confidence weights ($w_i$), while macroscopic census totals and spatial smoothing act as robust regularizing constraints.
3.1. First Layer: Data-Driven Prior Model
This layer aims to formulate a regression model to generate prior values of building-level population derived from some key building features. A LightGBM model is employed to train the regression model, which is an efficient machine learning framework based on Gradient Boosting Decision Trees. It offers a fast training speed, low memory consumption, and captures non-linear relationships between variables, making it well-suited for large-scale data analytics. However, before establishing the model, it is crucial to determine the underlying statistical distribution of the target variable to ensure the theoretical validity of the prior formulation.
3.1.1. Exploratory Data Analysis of Prior Distribution
To identify a suitable prior distribution for the pseudo-residential population ($M_i$) derived from area-weighted mobile signaling, exploratory data analysis was conducted. Building-level population counts typically exhibit extreme right-skewness and heavy-tail characteristics, as illustrated in Figure 2A. To mitigate this, a natural logarithmic transformation, $\log(1 + M_i)$, was applied. Figure 2B shows that this transformation effectively suppresses the heavy tail, yielding a roughly symmetric, bell-shaped distribution (with empirical fit parameters $\mu = 5.41$ and $\sigma = 1.83$). The corresponding Q–Q plot (Figure 2C) further confirms that the central quantiles of the transformed data align tightly with the theoretical normal distribution, despite minor deviations at the extreme tails caused by the deterministic outlier clipping during preprocessing.
Moreover, goodness-of-fit tests were performed to compare the adopted log-normal assumption against the log-skew-normal distribution with log-transformed data, as well as to evaluate Poisson and Negative Binomial (NB) distributions fitted to the original pseudo-residential population data. Poisson and NB models are commonly used for count data [30]. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were calculated to evaluate model fit, as summarized in Table 3.
As the initial proxy targets ($M_i$) in this study are continuous real numbers derived from spatial area-weighted disaggregation, artificial rounding was applied to force discrete fitting for Poisson and NB distributions. However, these distributions yield higher AIC and BIC values, indicating a fundamentally poor model fit, as demonstrated in Table 3. In comparing the two continuous alternatives, the log-skew-normal distribution accounts for residual skewness and achieves a better statistical fit (with the lowest AIC and BIC). However, within the proposed BIHL framework, the prior distribution needs to facilitate the propagation of the LightGBM ensemble’s mean and variance into the neural network’s loss function. Introducing complex skewness integration parameters would compromise the stability and differentiability of the deep learning optimization process. Given the negligible difference in goodness-of-fit between the two continuous models, this study utilizes the standard log-normal distribution to formulate the population prior distribution, balancing symmetric error assumptions with computational efficiency.
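The information criteria used in this comparison follow the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L. The sketch below, which is only an illustration and not the procedure behind Table 3, scores a normal fit to log(1 + M) values; the function name `gaussian_aic_bic` and the sample values are hypothetical, and the likelihood is evaluated on the transformed scale only.

```python
import math

def gaussian_aic_bic(values):
    """AIC and BIC of a normal distribution fitted by maximum likelihood."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # MLE variance (1/n)
    # Gaussian log-likelihood evaluated at the MLE.
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    k = 2  # fitted parameters: mean and variance
    aic = 2 * k - 2 * log_lik
    bic = k * math.log(n) - 2 * log_lik
    return mu, math.sqrt(var), aic, bic

# Hypothetical pseudo-residential populations (illustrative values only).
pseudo_pop = [12.4, 85.0, 230.7, 41.3, 960.2, 150.0, 33.9, 402.5]
log_pop = [math.log1p(m) for m in pseudo_pop]  # log(1 + M_i)
mu, sigma, aic, bic = gaussian_aic_bic(log_pop)
```

Lower AIC or BIC indicates a better trade-off between fit and parameter count; criteria computed on different data scales are only comparable after accounting for the change of variables.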
3.1.2. Prior Model Formulation
Informed by the log-normal distribution assumption and existing research [14,31,32], the residential area, POI count, bus stop count, residential type, building location, and number of mobile phone users are utilized as independent variables, and the log-transformed pseudo-residential population, $\log(1 + M_i)$, is utilized as the target variable.
To quantify the uncertainty associated with this initial estimation, an ensemble of $K$ models was trained (in this study, $K = 5$ was set based on multiple simulations), with each model initialized using a different random seed. The ensemble’s predictive mean and variance are defined in Equation (3).
where $\hat{y}_i^{(k)}$ is the value for building $i$ predicted by the $k$th LightGBM model, and $\mu_i$ and $\sigma_i$ are the prior mean and prior standard deviation for building $i$, respectively. To align with the log-normal assumption, the prior mean ($\mu_{\log,i}$) and variance ($\sigma_{\log,i}^2$) in the logarithmic space are derived from the ensemble statistics with Equation (4), as follows:
Consequently, the prior distribution of the log-transformed building population is defined as shown in Equation (5).
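As a minimal sketch of the ensemble statistics described above, the following computes a per-building mean and standard deviation across K model outputs. The 1/K variance normalization, the function name `ensemble_prior`, and the sample predictions are assumptions for illustration; the mapping into logarithmic space in Equation (4) is deliberately left out because its exact form is not assumed here.

```python
import math

def ensemble_prior(predictions_per_model):
    """Per-building mean and standard deviation across K model outputs.

    predictions_per_model: list of K lists, one prediction per building each.
    """
    k = len(predictions_per_model)
    n_buildings = len(predictions_per_model[0])
    means, stds = [], []
    for i in range(n_buildings):
        column = [predictions_per_model[m][i] for m in range(k)]
        mean_i = sum(column) / k
        var_i = sum((y - mean_i) ** 2 for y in column) / k  # assumption: 1/K
        means.append(mean_i)
        stds.append(math.sqrt(var_i))
    return means, stds

# Hypothetical outputs of K = 5 seeds for 3 buildings, in log(1 + M) units.
preds = [
    [5.1, 3.0, 6.8],
    [5.3, 3.2, 6.9],
    [5.0, 2.9, 7.1],
    [5.2, 3.1, 7.0],
    [5.4, 2.8, 6.7],
]
prior_mean, prior_std = ensemble_prior(preds)
```

Buildings on which the seeds disagree receive a larger standard deviation, which is the quantity the later confidence weights are said to draw on.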
3.2. Second Layer: Enhanced Neural Network Posterior Estimator
To address the spatial heterogeneity and feature sparsity inherent in building-level population estimation, this layer includes an enhanced neural network posterior estimator, as illustrated in Figure 3. This model overcomes the limitations of conventional single-path fully connected networks and introduces an innovative multi-branch fusion architecture, which takes as input a high-dimensional building feature vector and corresponding census zone information. These inputs are processed through three parallel subnetworks.
3.2.1. MLP Base Network
This branch network extracts fundamental feature representations. It consists of three hidden layers with dimensions of 256, 128, and 64, respectively. Each linear transformation is followed by batch normalization to accelerate convergence and a ReLU activation function to introduce non-linearity. To alleviate overfitting, a progressively decreasing dropout scheme is applied. The final layer adopts layer normalization, ensuring stable output distributions and enabling the extraction of generalizable building–population mapping patterns.
3.2.2. Zone Bias Embedding
To capture systematic differences across census zones (e.g., baseline density differences between core urban areas and peripheral areas), the model incorporates an embedding layer to learn a zone-specific prior intercept, as shown in Figure 4.
Let $Z$ be the total number of unique census zones within the study area. For any building located in a census zone $z$, the layer extracts a unique scalar bias term $b_z$ from matrix $E$ based on the discrete zone ID. Then, this bias $b_z$ is added to the output of the base feature extraction MLP network, allowing the model to shift the baseline of the population density estimation to that specific spatial unit. From an implementation perspective, the weights of the matrix $E$ are randomly initialized prior to training using a Gaussian distribution with a small standard deviation, set as $N(0, 0.05)$ in this study, and the final values are jointly learned by minimizing the final hierarchical loss function.
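The zone-bias lookup can be illustrated with a plain lookup table. This is a sketch under stated assumptions rather than the authors' code: the class name `ZoneBias` is hypothetical, 0.05 is read as the standard deviation of the initializer, and the learning of the table entries by gradient descent is omitted.

```python
import random

class ZoneBias:
    """One scalar intercept per census zone, stored as a lookup table."""

    def __init__(self, num_zones, init_std=0.05, seed=0):
        rng = random.Random(seed)
        # Assumption: 0.05 is the standard deviation of the Gaussian initializer.
        self.table = [rng.gauss(0.0, init_std) for _ in range(num_zones)]

    def __call__(self, base_output, zone_id):
        # Shift the base prediction by the intercept of the building's zone.
        return base_output + self.table[zone_id]

bias = ZoneBias(num_zones=4)
shifted = bias(base_output=0.30, zone_id=2)
```

In PyTorch, the same role is typically played by `torch.nn.Embedding(Z, 1)`, whose weights are updated together with the rest of the network.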
3.2.3. Zone Interaction Network
To model the interaction between building features and spatial context, such as the differing population capacities of buildings with identical floor areas in core urban areas and peripheral areas, an interaction branch is introduced. This branch adopts a structure of two linear layers with a ReLU layer and applies a scaling factor $\alpha$, which modulates the contribution of the interaction term to the final prediction, enhancing the model’s adaptability to complex spatial environments. In this study, $\alpha$ was set to 0.1 based on multiple simulations.
The outputs of the three branches are combined additively, as shown in Equation (6).
where $\hat{y}_i$ is the standardized prediction for building $i$, and $\alpha$ is the weight coefficient for the interaction term.
The final output represents the predicted population in the logarithmic space, which is defined in Equation (7). This fusion strategy effectively decouples global structural patterns (captured by the MLP branch) from local spatial deviations (captured by the embedding and interaction branches). As a result, the model maintains strong generalization capability while substantially improving estimation accuracy for specific spatial zones.
where $p_i$ is the true population of building $i$, which is a latent variable in this study, and $\mu_{\log}$ is the mean deviation of the log-transformed population, which is derived from the LightGBM prior. Therefore, the prediction on the original population scale can be obtained through an inverse transformation, as shown in Equation (8).
where $\hat{p}_i$ is the predicted population for building $i$.
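The additive fusion and the inverse transformation described above can be sketched as follows. The function names are hypothetical, the de-standardization step (multiply by a scale, add a location) is an assumption, and the values 5.41 and 1.83 are reused from the exploratory fit purely as illustrative normalization constants.

```python
import math

def fuse_branches(base_out, zone_bias, interaction_out, alpha=0.1):
    """Additive fusion of the three branch outputs (standardized scale)."""
    return base_out + zone_bias + alpha * interaction_out

def to_population(y_std, mu_log, sigma_log):
    """Undo standardization (assumed form), then invert log(1 + p)."""
    log_pop = y_std * sigma_log + mu_log
    return math.expm1(log_pop)

y_hat = fuse_branches(base_out=0.40, zone_bias=0.05, interaction_out=-0.50)
pop_hat = to_population(y_hat, mu_log=5.41, sigma_log=1.83)
```

Because the interaction output is scaled by 0.1, it acts as a small correction on top of the base and zone terms rather than a competing predictor.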
3.3. Third Layer: Hierarchical Loss Function and Spatial Constraints
To address the challenge of disaggregating population data from the coarse census level to the fine-grained building level, we propose a multi-objective optimization framework that integrates micro-level feature learning and macro-level consistency constraints into a unified differentiable loss function. The total objective function $L_{\text{total}}$ is composed of two distinct components, as shown in Equation (9).
where $L_{\text{micro}}$ and $L_{\text{macro}}$ are the loss functions at the micro- and macro-level, respectively, and $\lambda(t)$ is a time-dependent dynamic weight, which is formulated to prevent optimization instability caused by the conflicting gradients of micro- and macro-objectives. It is designed as a linear warm-up and cosine annealing strategy, calculated with Equation (10) as follows.
where $\lambda_{\max}$ is the maximum penalty weight for the macroscopic constraint, $T_{\text{warm}}$ is the designated number of warm-up epochs, $T$ is the total number of training epochs, and $t$ is the current training epoch.
During the initial phase ($t \le T_{\text{warm}}$), a linear warm-up is employed. Applying a macro-penalty from the first epoch may cause gradient shocks to the uninitialized network. The linear ramp-up gradually introduces the macroscopic constraint, allowing the network to first prioritize learning fundamental micro-level feature representations from the proxy labels before heavily incorporating regional boundaries.
Subsequently ($t > T_{\text{warm}}$), a cosine annealing schedule is applied to smoothly decay the weight. In the context of spatial population disaggregation, once the initial macro-penalty successfully constrains the optimization space, ensuring that the model’s aggregated predictions broadly align with the regional census total, maintaining a rigid penalty can over-constrain the network. As widely recognized in spatial analysis, regional census data inherently suffer from spatial aggregation biases (e.g., the Modifiable Areal Unit Problem) [33], which can conflict with the fine-grained physical realities. By smoothly annealing the macroscopic weight, the network realizes high-precision micro-level fine-tuning in the later stages of training. This dynamic relaxation ensures that the model converges to an optimal state, thus achieving fine-grained spatial micro-heterogeneity without escaping the established macroscopic demographic bounds.
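A schedule of this kind can be sketched as below. The exact form of Equation (10) is not assumed; this version ramps linearly to the maximum weight over the warm-up epochs and then follows a half-cosine down to zero at the final epoch, and it combines the two losses additively. The function names and the values of `lam_max` and `t_warm` are hypothetical; only the 80-epoch budget is taken from the implementation settings.

```python
import math

def macro_weight(t, lam_max, t_warm, t_total):
    """Linear warm-up to lam_max, then cosine decay (assumed to reach 0 at t_total)."""
    if t <= t_warm:
        return lam_max * t / t_warm
    progress = (t - t_warm) / (t_total - t_warm)
    return lam_max * 0.5 * (1.0 + math.cos(math.pi * progress))

def total_loss(l_micro, l_macro, t, lam_max, t_warm, t_total):
    # Assumed additive form: L_total = L_micro + lambda(t) * L_macro.
    return l_micro + macro_weight(t, lam_max, t_warm, t_total) * l_macro

# Hypothetical settings: lam_max = 1.0 and 10 warm-up epochs out of 80.
schedule = [macro_weight(t, lam_max=1.0, t_warm=10, t_total=80) for t in range(81)]
```

The weight is continuous at the end of the warm-up, so the macro penalty never jumps between consecutive epochs.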
3.3.1. Micro-Level Proxy Loss ($L_{\text{micro}}$)
Since ground-truth population data at the building level are typically not available, this study utilized the soft labels generated by the ensemble-LightGBM model as a proxy supervision signal. To ensure robustness against noise and outliers in these proxy labels, the Smooth L1 loss, a piecewise loss function that combines the advantages of both L1 and L2 loss and improves training stability, was employed rather than the mean squared error (MSE).
Let $N$ be the number of buildings in a mini-batch, $\hat{y}_i$ the predicted value, and $y_{\text{proxy},i}$ the standardized proxy label for building $i$. The micro-level proxy loss is defined in Equation (11).
where $w_i$ is the confidence weight derived from the ensemble variance, ensuring that the network prioritizes learning from high-confidence samples while down-weighting ambiguous predictions.
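A confidence-weighted Smooth L1 loss can be sketched as follows. The piecewise form with threshold `beta` matches the common definition of Smooth L1; the plain 1/N averaging and the function names are assumptions, and the weights are taken as given inputs because the mapping from ensemble variance to weights is not specified here.

```python
def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    a = abs(error)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta

def micro_loss(predictions, proxy_labels, weights):
    """Confidence-weighted mean of per-building Smooth L1 losses."""
    n = len(predictions)
    total = 0.0
    for y_hat, y_proxy, w in zip(predictions, proxy_labels, weights):
        total += w * smooth_l1(y_hat - y_proxy)
    return total / n  # assumption: plain 1/N averaging

# Hypothetical batch of two buildings; the second has a low-confidence label.
example = micro_loss([0.2, 1.5], [0.0, -0.5], [1.0, 0.5])
```

The linear branch caps the gradient magnitude for large residuals, which is why an outlying proxy label pulls the network less than it would under MSE.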
3.3.2. Macro-Level Census Constraint ($L_{\text{macro}}$)
The core of the proposed hierarchical framework is the differentiable aggregation constraint. This study enforces a constraint which ensures that the sum of predicted populations for all buildings within a specific census zone must approximate the true census count.
Let $Z$ be the number of census zones. For a given zone $z$, let $B_z$ denote the set of buildings belonging to the zone. The aggregated prediction $\hat{P}_z$ is calculated by summing the exponentiated outputs of the network, as shown in Equation (12).
where $\sigma_y$ and $\mu_y$ are the normalization parameters. To handle the large variance in population magnitudes across different zones, this study minimizes the mean squared relative error (MSRE) rather than the absolute error, as shown in Equation (13).
where $L_{\text{macro}}$ is the macro-level loss, $Z$ is the number of zones, and $P_{\text{census},z}$ is the census population for zone $z$.
This constraint acts as a posterior regularization term, correcting the bias in the micro-level predictions by grounding them to the official census statistics.
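The aggregation constraint can be illustrated with a short sketch: back-transform each standardized output, sum within zones, and penalize the squared relative gap to the census count. The function names are hypothetical, and subtracting 1 after exponentiation (via `expm1`) is an assumption that mirrors the log(1 + M) target.

```python
import math

def zone_totals(y_std, zone_ids, num_zones, mu_y, sigma_y):
    """Sum back-transformed building predictions within each census zone."""
    totals = [0.0] * num_zones
    for y, z in zip(y_std, zone_ids):
        # Assumption: invert log(1 + p) with expm1 after de-standardizing.
        totals[z] += math.expm1(y * sigma_y + mu_y)
    return totals

def macro_loss(pred_totals, census_totals):
    """Mean squared relative error between zone totals and census counts."""
    z = len(census_totals)
    return sum(((p - c) / c) ** 2 for p, c in zip(pred_totals, census_totals)) / z

# Two hypothetical zones, each off by 10% in opposite directions.
example = macro_loss([110.0, 90.0], [100.0, 100.0])
```

Dividing by the census count makes a 10% miss cost the same in a small zone as in a large one, which is the motivation given above for using relative rather than absolute error.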
The training process employs an alternating optimization strategy. In each training epoch, mini-batch data are used to optimize the micro-level proxy loss. Subsequently, the full-batch dataset is utilized to optimize the macro-level census constraint.
3.4. Evaluation Metrics
Based on the BIHL model constructed above, the population of each building was estimated. As the ground-truth residential population at the building level is difficult to obtain, the predicted building-level population was aggregated at the subdistrict level. The aggregated results are then compared with the census population of each subdistrict. The evaluation metrics include the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Symmetric Mean Absolute Percentage Error (SMAPE), as shown in Equations (14)–(16).
where $n$ is the number of communities in the test set, $y_i$ denotes the true population of community $i$, and $f_i$ denotes its estimated population.
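The three metrics can be computed as in the sketch below. MAE and MAPE follow their standard definitions; SMAPE has several variants in the literature, and the one used here (denominator equal to the mean of the absolute true and estimated values) is an assumption.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(f - y) for y, f in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(f - y) / abs(y) for y, f in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent."""
    # Assumed variant: denominator is the mean of |y| and |f|.
    return 100.0 * sum(
        2.0 * abs(f - y) / (abs(y) + abs(f)) for y, f in zip(y_true, y_pred)
    ) / len(y_true)

# Hypothetical subdistrict totals.
truth = [100.0, 200.0]
estimate = [110.0, 180.0]
scores = (mae(truth, estimate), mape(truth, estimate), smape(truth, estimate))
```

A single large over-prediction inflates MAPE more than SMAPE (a true value of 100 estimated as 200 gives 100% versus about 66.7%), so a wide gap between the two points to a few extreme estimates.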
3.5. Model Implementation
The proposed BIHL framework was implemented on a workstation equipped with an Intel Xeon W-2235 CPU (3.80 GHz), 32.0 GB of RAM, and an NVIDIA RTX A2000 GPU (12 GB). The deep learning framework was implemented using PyTorch 2.4.1; the complete configuration and hyperparameter settings are detailed in Table 4. The implementation pseudo-code of the BIHL model framework is shown in Appendix A.
Regarding the network architecture, the micro-level feature extraction module was constructed with three hidden layers, progressively decreasing in dimensionality [256, 128, 64]. To mitigate overfitting and ensure stable gradient propagation across deep layers, a combination of Batch Normalization and Layer Normalization was utilized. Furthermore, to regularize the network while maintaining the precision required for continuous population estimation, we implemented a layer-decaying dropout strategy [34]. The dropout probability starts at 0.2 for the first hidden layer, to effectively break the co-adaptation of raw input features, and decays by a factor of $0.8^{i}$ for subsequent layers (where $i$ is the layer index). This progressive decay ensures a smooth transition from robust feature extraction at the bottom layer to stable, deterministic representations at the final regression head.
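The layer-decaying dropout rates can be written out explicitly. The sketch assumes a zero-based layer index, so that the first hidden layer keeps the base rate of 0.2; the function name is hypothetical.

```python
def dropout_schedule(num_layers=3, p0=0.2, decay=0.8):
    """Dropout probability per hidden layer: p0 * decay**i for i = 0, 1, ..."""
    return [p0 * decay ** i for i in range(num_layers)]

rates = dropout_schedule()
```

Under this indexing, the three hidden layers would use rates of approximately 0.2, 0.16, and 0.128.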
During the training phase, the model was optimized using the AdamW optimizer with a batch size of 128. To prevent the previously discussed gradient shocks—especially when the macroscopic penalty is introduced—gradient clipping was performed with a maximum norm empirically set to 0.5 to ensure stable parameter updating. The maximum number of training epochs was set to 80; to prevent potential overfitting, a standard early stopping mechanism was employed with a patience of 30 epochs.
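Gradient clipping by global norm rescales all gradients by a common factor whenever their joint L2 norm exceeds the threshold. The pure-Python sketch below illustrates the idea on a flat list of gradient values; the function name is hypothetical, and in PyTorch the same operation is available as `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm=0.5, eps=1e-6):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

# A hypothetical gradient with norm 5.0 is shrunk to norm 0.5.
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
```

Because every component is multiplied by the same factor, the update direction is preserved and only the step length is bounded.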
4. Results and Discussions
4.1. Study Area Overview
To empirically validate the proposed BIHL model framework, a comprehensive case study was conducted in Haidian District, Beijing, with a specific focus on the Beixiaguan Subdistrict, as shown in Figure 5. Located in the northwestern part of Beijing’s urban core, Haidian District is the city’s primary academic and technological hub. The selected study area comprises a dense mixture of building typologies, including old residential compounds, modern high-rise apartment complexes, student dormitories, and mixed-use commercial–residential structures. This architectural diversity, coupled with a remarkably high population density, makes it an ideal example for micro-scale population estimation.
The modeled region includes a total of 2763 individual buildings. The official macroscopic census population for this target area is 146,366 residents, with an average regional population density of approximately 24,340 persons per square kilometer. The initial grid-based mobile signaling data used to formulate the probabilistic priors have a spatial resolution of 250 m × 250 m. A comprehensive statistical summary of the dataset is presented in Table 5, which details the distributions of key building-level variables (such as footprint area, gross floor area (GFA), and number of floors), reporting their mean values, standard deviations, and extreme deciles (10th and 90th percentiles).
4.2. Ablation Experiment
Given that this study introduces additional variables (such as building-level spatial location and residential category information), which have received limited attention in prior work, an ablation experiment was conducted to assess their incremental contribution to the predictive performance. The prediction metrics are reported in Table 6.
The results show that omitting building-level spatial information leads to a substantial rise in MAPE, reaching 23.40%, along with a pronounced increase in the discrepancy between MAPE and SMAPE. This indicates that spatial location is a key determinant of prediction accuracy and stability in building-level population estimation. This effect is consistent with Beijing’s ring-structured urban form: the central districts were developed earlier and are dominated by compact, high-density residential stock, whereas peripheral areas were developed later and typically feature larger dwelling units to meet higher living-quality expectations. Such spatial heterogeneity directly influences population density patterns, and thus affects model performance when spatial variables are excluded.
Similarly, removing residential category information results in a significant increase in MAPE to 16.42%, again accompanied by a notable widening of the MAPE–SMAPE gap. This underscores the strong contribution of residential category information to reducing prediction errors and improving output stability. The underlying mechanism is that different residential types—such as standard apartments, villas, and dormitories—exhibit substantial variation in per capita floor area, which directly affects population allocation at the building scale. Incorporating residential type therefore provides essential structural information that enhances the predictive accuracy of building-level population models.
Moreover, to justify the proposed dynamic weighting strategy for the macroscopic constraint $\lambda(t)$, which combines a linear warm-up phase with a cosine annealing schedule, an ablation experiment was conducted by comparing it against a constant $\lambda$ baseline. Figure 6 illustrates the training loss convergence trajectories for both strategies.
As observed for the constant weighting strategy (yellow dots), applying the maximum macroscopic penalty from the very first epoch induces gradient shock. The uninitialized network is forced to resolve conflicting gradients between micro-feature learning and macro-aggregation. This results in a high initial loss spike (exceeding 0.3) and a highly unstable, inefficient early descent trajectory. Furthermore, the conflicting gradients trap the model in a suboptimal state, converging to a noticeably higher final loss.
In contrast, our proposed dynamic weighting strategy (blue dots) effectively eliminates this optimization bottleneck. During the linear warm-up phase, the near-zero initial penalty allows the network to undergo a robust initialization, rapidly and smoothly minimizing the loss to capture essential spatial micro-heterogeneity. As the training progresses into the cosine annealing phase, the network calibrates the multi-objective optimization without disrupting the already learned feature representations.
Consequently, the dynamic strategy not only converges significantly faster but also achieves a lower and more stable final loss. This ablation test demonstrates that the curriculum learning approach (linear warm-up followed by cosine annealing) is a necessary mechanism to ensure stable and efficient gradient descent in our BIHL framework.
4.3. Model Verification
To verify the advancement of the BIHL model framework proposed in this study, some commonly used approaches in the existing literature, including multiple linear regression (MLR) and random forest (RF) models, were implemented for comparison, and the prediction metrics are shown in Table 7.
The results reveal that the MLR model performs the worst, with a MAPE as high as 30.20%. Moreover, it shows the largest gap between MAPE and SMAPE, indicating poor stability in the error distribution and the presence of extreme prediction values. In contrast, the BIHL model proposed in this study demonstrates superior performance across all three subdistrict-level error metrics, achieving a MAPE and SMAPE of 11.36% and 11.26%, respectively, and exhibits the smallest gap between them, suggesting a more stable error distribution.
To validate the fine-grained disaggregation performance of the proposed model, an independent manual survey was conducted. Recognizing that absolute building-level ground truth is inherently unavailable, we established an estimated proxy ground truth for validation. Specifically, a stratified random sampling strategy was employed to select 20 residential buildings within each subdistrict. This stratification ensured that buildings of varying residential types (e.g., general residential housing, mixed-use building) and building location (e.g., core urban area or peripheral area) were proportionally represented. For each selected building, the exact number of households was determined through manual field surveys. The building-level residential population was then estimated by multiplying the household count by the average household size reported in the Haidian District Statistical Yearbook (2023). This surveyed dataset was only utilized as an independent hold-out validation set and was completely excluded from the training phase of all evaluated models.
These estimated validation values were compared with the building-level predictions produced by the four methods (MLR, RF, LightGBM, and the proposed framework), as visually presented in the scatter plots in Figure 7. To quantify the performance and address the uncertainty of building-level estimations, we calculated the Coefficient of Determination ($R^2$), Mean Bias Error (MBE), and 95% Confidence Intervals (95% CIs) of the absolute errors for each model, which are given in Table 8.
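The building-level diagnostics can be sketched as follows. $R^2$ and MBE use their standard definitions, with MBE taken as predicted minus actual so that negative values indicate underestimation. How the 95% CI was constructed is not stated, so the normal-approximation interval for the mean absolute error used here is an assumption, as are the function names.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def mean_bias_error(actual, predicted):
    # Negative values mean the model underestimates on average.
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

def abs_error_ci95(actual, predicted):
    """Normal-approximation 95% CI for the mean absolute error (assumed method)."""
    errs = [abs(p - a) for a, p in zip(actual, predicted)]
    n = len(errs)
    mean_e = sum(errs) / n
    sd = math.sqrt(sum((e - mean_e) ** 2 for e in errs) / (n - 1))
    half = 1.96 * sd / math.sqrt(n)
    return mean_e - half, mean_e + half

# Hypothetical surveyed versus predicted building populations.
surveyed = [100.0, 200.0, 300.0]
predicted = [90.0, 210.0, 280.0]
r2 = r_squared(surveyed, predicted)
mbe = mean_bias_error(surveyed, predicted)
ci_low, ci_high = abs_error_ci95(surveyed, predicted)
```

With only a handful of surveyed buildings, a bootstrap or t-based interval may be more appropriate than the normal approximation shown here.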
As illustrated in Figure 7, the blue scatter points represent the predicted versus actual estimated populations, with the orange diagonal line denoting perfect alignment. Scatter points closer to this reference line indicate higher consistency. The MLR model exhibits highly dispersed scatter points and a low $R^2$, indicating a poor capture of micro-level spatial heterogeneity. The RF and LightGBM models show similar, improved patterns; however, they exhibit a noticeable negative bias (MBE < 0), consistently underestimating populations, particularly for medium-to-high density buildings (e.g., actual populations of 400–500).
In contrast, the proposed framework demonstrates superior disaggregation fidelity. By integrating dynamic curriculum weighting and total population constraints, the proposed method effectively mitigates the underestimation bias observed in the tree-based models, yielding the MBE closest to zero and the highest $R^2$. Furthermore, the narrower 95% CI of the proposed method indicates significantly reduced predictive uncertainty at the micro-level. This quantitative and visual evidence solidifies the effectiveness of the proposed multi-objective optimization approach in maintaining spatial micro-heterogeneity without violating macro-level boundaries.
4.4. Result Visualization
To intuitively evaluate the spatial disaggregation performance of the proposed BIHL model framework, we visualize the building-level population estimation results for the Beixiaguan Subdistrict in Haidian District, Beijing. Figure 8 presents the fine-grained residential population distribution from a 2D perspective, providing a clear overview of the population density variations across the urban fabric, while Figure 9 displays the results from a 3D perspective, where the color gradient is proportional to the estimated population count and each building is drawn at its real height.
As illustrated in Figure 8, the BIHL framework successfully captures the high degree of spatial heterogeneity inherent in complex urban environments, effectively discriminating between different building typologies. High-rise apartment complexes and student dormitories, depicted in deep red, are accurately identified as population hotspots. Even within the same census block, Figure 8 reveals substantial variations among adjacent buildings, as the model integrates building geometry and POI-based typology to assign higher values to residential buildings while maintaining low values for mixed-use buildings.
Figure 8 highlights the Jiaoda community (comprising the academic and residential areas of Beijing Jiaotong University), delineated by a green dotted line, and the China Meteorological Administration (CMA) community (encompassing its working and residential zones), enclosed by a blue dashed line. Driven by the high concentration of high-rise apartment blocks and dense dormitories, the population distribution in the Jiaoda community exhibits a dark red color. Conversely, the residential areas within the CMA community are dominated by low-rise residential buildings, resulting in a more orange representation of its population distribution. These findings are highly consistent with the 3D building characteristics depicted in
Figure 9.
5. Conclusions
In this study, a Bayesian-informed hierarchical learning (BIHL) model framework was designed to bridge the gap between data-driven deep learning and formal probabilistic constraints for building-level population estimation. By treating residential population as a latent variable constrained by mobile signaling priors and hard macro-level census data, the proposed method provides a robust solution for fine-grained urban modeling.
To validate the model’s performance, a case study was conducted using data from Haidian District, Beijing. The experimental results demonstrated that the proposed framework significantly outperforms traditional models (MLR, RF), achieving superior predictive accuracy and robustness with a MAPE of 11.36%. The minimal divergence observed between the MAPE and Symmetric MAPE (SMAPE) metrics further corroborates the model’s stability in handling complex error distributions compared to deterministic approaches. Through rigorous ablation studies, this research confirmed the critical role of micro-spatial features, underscoring that the integration of building-level spatial coordinates with residential category priors is indispensable for mitigating ‘spatial smoothing’ effects and enhancing model generalization. Furthermore, unlike standard machine learning approaches that typically exhibit pronounced under-prediction in high-density residential areas, the BIHL model framework substantially reduces systematic bias and ensures physical consistency by employing a dynamic curriculum learning strategy coupled with hierarchical consistency constraints. This mechanism balances micro-level feature extraction with macro-level census regularization. Ultimately, this work not only provides a robust methodology for generating high-fidelity, residential-level population layers but also establishes a solid foundation for next-generation Travel Demand Models (TDMs), offering significant theoretical and practical implications for optimizing urban public transit networks and formulating precise emergency response strategies in complex urban environments.
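The relationship between MAPE and SMAPE referred to above can be illustrated with a short Python sketch. The SMAPE form used here, with the mean of |actual| and |predicted| in the denominator, is one common definition and is an assumption, as are the function names and the sample values; the exact formulation used in the experiments may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero.
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def smape(actual, predicted):
    """Symmetric MAPE, in percent, using (|a| + |p|) / 2 as the denominator.

    Assumes actual and predicted are never both zero for the same item.
    """
    ratios = [
        abs(a - p) / ((abs(a) + abs(p)) / 2.0)
        for a, p in zip(actual, predicted)
    ]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical building populations and estimates (illustrative only).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print("MAPE  (%):", mape(actual, predicted))
print("SMAPE (%):", smape(actual, predicted))
```

When estimates are close to the observed values the two denominators are nearly equal and the metrics almost coincide, whereas large relative errors pull them apart, which is why a small gap between them is read as a sign of stable errors.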
While the proposed BIHL model framework achieved high accuracy in the empirical validation within Haidian District, certain limitations remain. First, the model is relatively sensitive to the quality of multi-source inputs; specifically, the inherent noise and spatiotemporal sparsity of mobile signaling data may constrain the stability of micro-level predictions. Second, the model’s reliance on site-specific Zone Embeddings hinders its direct transferability across different cities or regions, typically necessitating parameter recalibration when deployed in new environments. Third, from a methodological perspective, the current neural estimator provides deterministic point predictions optimized by gradient descent and lacks full posterior sampling to quantify the uncertainty of the final outputs.
To address these challenges, future research could integrate multi-temporal mobile signaling data to extend the current static estimation into dynamic spatiotemporal forecasting. In addition, to enable robust model transferability, future work should leverage richer urban spatial datasets and employ advanced methods, such as graph neural networks, to establish a generalized mapping between urban spatial features and population biases. Finally, a critical direction is the development of a fully probabilistic Bayesian inference system, for example one integrating Bayesian Neural Networks or Markov Chain Monte Carlo sampling, to rigorously provide posterior uncertainty bounds alongside the micro-level population estimates.