1. Introduction
The built environment accounts for about 37% of energy-related carbon dioxide emissions [1,2,3], and the embodied carbon of building materials and construction processes makes up a growing share of the overall carbon footprint [4,5,6]. As operational carbon emissions fall through improved energy efficiency and the uptake of renewable energy, the relative weight of embodied carbon in life cycle assessment has risen sharply [7,8,9]. In developed economies, residential construction represents over 60% of existing floor area and offers a substantial opportunity to cut carbon through evidence-based material selection and design optimization [10,11,12]. Setting sound embodied carbon benchmarks, however, requires detailed data collection across diverse building typologies and material supply chains, which poses serious coordination challenges at the industry level.
The construction industry is a highly fragmented ecosystem in which construction companies, material suppliers, architects, and municipalities each hold building performance data separately [13,14,15]. This fragmentation makes it difficult to build robust carbon benchmarking systems that can inform policy-making and support meaningful comparisons across building portfolios. Conventional centralized approaches face major obstacles, including competitive concerns among construction companies, regulatory limits on cross-jurisdictional data exchange, and privacy concerns of building owners [16,17,18]. These barriers have constrained the development of holistic building stock models that could accelerate the transition to low-carbon construction practices [19,20,21].
From an urban science perspective, cities and regions increasingly need credible portfolio-level embodied carbon benchmarks to support evidence-based urban planning, retrofit prioritization, and climate action plans. Municipal governments must estimate and compare the carbon performance of residential building stocks at the city and regional scales for sustainability reporting, for tracking progress toward climate goals, and for designing low-carbon development policies. Putting such benchmarking into practice is difficult, however, because information about residential buildings is dispersed among private and public stakeholders, including construction companies, home distributors, material suppliers, and municipal authorities. Detailed building and material data are rarely centralized, and aggregation is often infeasible because of privacy laws, commercial sensitivity, and fragmented ownership structures. The federated learning model proposed here is positioned precisely as an enabler of cross-city and cross-region embodied carbon benchmarking, allowing urban stakeholders to obtain comparable, city-scale insights without exchanging raw building data.
Federated learning has emerged as a promising paradigm for collaborative machine learning, allowing multiple stakeholders to train predictive models jointly without centralizing raw data [13]. It addresses the underlying privacy concerns by distributing model training to participating clients and aggregating model parameters rather than sensitive data, enabling knowledge sharing across organizational boundaries [16]. Recent work in smart building settings has demonstrated the feasibility of federated methods for energy consumption prediction, thermal comfort modeling, and building performance optimization [5]. Nonetheless, several challenges specific to embodied carbon benchmarking demand specialized framework adaptations that have not been sufficiently addressed in the literature [9].
Existing federated learning methods successful in operational energy prediction and thermal comfort modeling cannot be directly applied to embodied carbon assessment due to three domain-specific challenges:
- (1) Extreme Data Heterogeneity in Material Composition—embodied carbon datasets exhibit significantly higher feature heterogeneity than operational energy datasets; while operational energy depends on relatively standardized building geometry and HVAC configurations, embodied carbon is determined by material composition (concrete types, steel grades, timber species, and insulation), varying dramatically across construction traditions, local material availability, and supply chain structures, creating non-IID distributions fundamentally different from the energy forecasting settings where standard FedAvg and FedProx were validated.
- (2) Multi-Scale Feature Sensitivity—embodied carbon features span multiple orders of magnitude in sensitivity to privacy-preserving noise (concrete volume 50–500 m3 vs. reinforcement steel 1000–20,000 kg have vastly different scales and privacy implications); uniform noise calibration in standard DP-FedAvg either over-protects low-sensitivity features (degrading accuracy) or under-protects high-sensitivity features (compromising privacy), motivating our adaptive noise calibration (Equations (11) and (12)).
- (3) Assessment Boundary Heterogeneity—different stakeholders use different LCA boundaries (A1–A3 cradle-to-gate, A1–A5 cradle-to-site, and A1–C4 cradle-to-grave), creating systematic differences in target variable definitions across clients; this challenge is absent in operational energy benchmarking, where energy consumption (kWh) is universally defined, and standard federated aggregation methods do not account for it, producing biased global models.
Federated learning coupled with differential privacy mechanisms can offer formal mathematical guarantees on the privacy of individual data while still allowing meaningful aggregate analysis [22,23,24]. Differential privacy injects carefully calibrated noise into the model training process so that the presence (or absence) of any single data point cannot be confidently inferred from the published model parameters [14]. In embodied carbon benchmarking applications, such guarantees are essential for motivating construction stakeholders to participate, since many are unwilling to disclose competitive information about material costs or proprietary construction methods [17].
Figure 1 provides a conceptual map of privacy-preserving embodied carbon benchmarking, illustrating the data sovereignty challenges and the proposed federated learning solution architecture. The framework allows multiple categories of stakeholders, such as construction companies, municipalities, and material suppliers, to participate in joint model training without surrendering their proprietary data assets.
The primary contributions of this research are summarized as follows:
Novel Hierarchical Federated Architecture: To address heterogeneous building data and the multi-stakeholder coordination requirements of different life cycle assessment methods, we propose FedCarbon, a hierarchical federated learning architecture with attention-based client weighting tailored to the needs of embodied carbon assessment.
Adaptive Differential Privacy Mechanism: We introduce a dynamic noise calibration scheme that adapts privacy settings to the sensitivity of embodied carbon features, providing formal (ϵ, δ) differential privacy guarantees while achieving a prediction accuracy of over 94%.
Momentum-Enhanced Gradient Compression: We propose a momentum-based error-feedback sparsification method that reduces communication overhead by 82.6% relative to standard federated averaging and enables stakeholders with low-bandwidth connectivity to participate.
Comprehensive Empirical Validation: We perform comprehensive tests on two publicly available datasets covering 3108 residential building configurations across diverse geographic locations, demonstrating practical utility for embodied carbon benchmarking in real-world settings.
The rest of this paper is structured as follows: Section 2 reviews related work on federated learning, differential privacy, and construction carbon assessment. Section 3 presents the FedCarbon methodology and mathematical model. Section 4 reports the experimental results and analysis. Section 5 provides discussion, and Section 6 concludes the paper.
2. Related Work
This section reviews the literature in three interrelated areas: federated learning applications in smart buildings, privacy-preserving mechanisms in distributed machine learning, and embodied carbon assessment methods.
2.1. Federated Learning in Smart Building Environments
Federated learning has received much interest in smart building systems because it enables collaborative model development without sacrificing data privacy [1]. Wang et al. [5] proposed a personalized federated learning model for energy consumption forecasting that handles non-IID data distributions across building typologies. Abbas et al. [4] proposed a privacy-aware thermal comfort prediction model for smart buildings based on federated learning, demonstrating that distributed machine learning is viable for building performance prediction [25,26,27,28,29,30,31,32,33,34].
Amangeldy et al. [6] provided a thorough review of artificial intelligence and deep learning techniques for resource management in smart buildings and identified federated learning as a promising direction for privacy-preserving analytics. Shan et al. [7] discussed AI-based multi-objective optimization methods for the energy retrofit of urban buildings and noted the potential of machine learning to accelerate carbon reduction decisions [28,29]. Hinterstocker et al. [15] applied federated learning to building energy performance prediction across over 25,000 residential buildings, incorporating differential privacy techniques and demonstrating that privacy-preserving FL achieves accuracy comparable to centralized approaches in building stock-level energy assessment. Rizwan et al. [30] proposed a convergence-aware federated transfer learning framework for residential energy consumption prediction that enables collaborative model training across multiple buildings without disclosing raw energy data, demonstrating the applicability of FL to multi-building stock-level performance assessment.
2.2. Privacy-Preserving Mechanisms for Distributed Learning
Federated learning has been widely combined with differential privacy [9]. Mohammadi et al. [10] demonstrated the integration of federated learning with differential privacy for secure anomaly detection in smart grid infrastructure, showing that DP-enhanced FL can achieve an effective privacy–utility balance in distributed energy systems. Lazaros et al. [13] conducted a detailed study of collaborative intelligence in federated learning, analyzing different aggregation strategies and their implications for privacy and utility. The survey of scalable and secure edge AI systems by Rourke and Leclair [14] explored privacy-preserving mechanisms applicable to resource-constrained deployment environments. Folino et al. [17] created a scalable vertical federated learning system demonstrating privacy-preserving analytics in cybersecurity, with generalizable methodological insights. Deng et al. [19] proposed a privacy-preserving federated learning framework for collaborative risk assessment across smart grid operators, demonstrating that distributed benchmarking of critical infrastructure performance is achievable without centralizing sensitive operational data. Yang et al. [22] developed a gradient compression federated learning framework with adaptive local differential privacy budget allocation, demonstrating the feasibility of jointly optimizing communication efficiency and privacy guarantees in distributed learning settings.
2.3. Embodied Carbon Assessment and Building Stock Modeling
Proper embodied carbon evaluation demands in-depth life cycle assessment of building materials [2]. Feng et al. [8] showed how digital twins and edge intelligence can be used for more precise decarbonization, with methodological findings applicable to the building sector. Bahadori-Jahromi et al. [11] discussed the applicability of artificial intelligence to advancing civil engineering, including sustainable building. Siakas et al. [12] examined self-directed cyber-physical systems that facilitate intelligent positive energy districts.
Gupta et al. [18] studied federated learning in smart farming applications, showing that privacy-preserving distributed learning can serve sustainability applications. Goktas and Ibrahim [20] examined the energy management and communication systems of smart grids and their role in building energy optimization. El Hafdaoui et al. [21] demonstrated that supervised machine learning models can estimate embodied carbon throughout the building life cycle with average errors of approximately 15.71%, though centralized data requirements limit scalability across diverse building stocks and geographic regions. Zhang et al. [28] provided a comprehensive NIST systematic review of embodied carbon assessment and reduction methods across building life cycles, identifying significant inconsistencies in assessment methodologies and database selection that underscore the need for standardized benchmarking frameworks. Li et al. [31] published a harmonized dataset of high-resolution embodied life cycle assessment results for North American buildings, revealing that inconsistent LCA scopes, methods, and background datasets across geographies severely limit the comparability of embodied carbon benchmarks—a challenge that federated learning approaches can address by enabling collaborative model training without requiring data centralization.
2.4. Research Gap Analysis
A systematic comparison of current methods across seven key capabilities, shown in Table 1, indicates that although prior literature addresses individual elements such as federated learning for smart buildings [1,5], differential privacy mechanisms [4,13], or embodied carbon assessment [7,8], none integrates all of the components necessary for privacy-preserving carbon benchmarking. The analysis reveals a substantial research gap: no existing framework combines federated learning, embodied carbon modeling, differential privacy, gradient compression, and multi-stakeholder coordination for building stock applications. FedCarbon bridges this gap by offering the first end-to-end solution covering all seven capability dimensions, enabling practical collaborative carbon benchmarking across the construction industry ecosystem.
Table 1 evaluates related work against seven capability dimensions:
Federated—uses distributed model training across multiple clients without centralizing raw data.
Embodied Carbon—explicitly models embodied carbon, life cycle carbon, or material-related CO2 emissions (not limited to operational energy).
Diff. Privacy—implements formal (ε, δ) differential privacy or equivalent mathematical guarantees (not merely anonymization or access control).
Compression—uses gradient compression, sparsification, quantization, or communication-efficient techniques to reduce bandwidth.
Multi-Stakeholder—designed for or validated with multiple distinct organizational entities (not merely multiple devices within one organization).
Building Stock—operates at building portfolio or urban stock levels, modeling multiple buildings across typologies (not single-building optimization).
Real Data—validated on real-world measured data or verified simulation datasets (not purely synthetic or toy examples).
Table 2 shows the quantitative operationalization of the capability dimensions.
3. Proposed Methodology
This section presents the FedCarbon framework for privacy-preserving embodied carbon benchmarking. We begin with a system overview and then develop the mathematical models of the federated learning structure, the differential privacy mechanism, and the gradient compression scheme.
3.1. System Overview
Figure 2 shows the overall FedCarbon design, illustrating the hierarchical arrangement of construction stakeholders, local training, and the privacy-preserving aggregation server. The system consists of three main layers: the client layer, which hosts local building data and model training; the aggregation layer, which applies differential privacy and gradient compression; and the application layer, which delivers carbon benchmarking services.
Let 𝒦 = {1, …, K} denote the set of K participating clients, where each client k maintains a local dataset D_k = {(x_i, y_i)}_{i=1}^{n_k} containing n_k building records. The feature vector x_i ∈ ℝ^p encodes building characteristics relevant to embodied carbon estimation. The target variable y_i ∈ ℝ represents embodied carbon intensity in kgCO2e/m2.
The FedCarbon hierarchical architecture employs a three-level aggregation structure:
Level 1—Client-Level Training: Each client k (construction firm, municipality, or material supplier) trains the model locally on its private dataset D_k for E local epochs. Clients compute local model updates Δθ_k using differentially private SGD with per-sample gradient clipping (Equation (9)) and Gaussian noise injection (Equation (10)).
Level 2—Regional Aggregation: Clients are grouped into R = 4 geographic regions (Northern EU, Central EU, Southern EU, and Eastern EU). Each region r has a designated regional aggregator that collects compressed updates from its member clients and performs intra-region aggregation:
Δθ_r^(t) = Σ_{k ∈ S_t ∩ R_r} α_k^(t) Δθ̃_k^(t),
where α_k^(t) are attention-based weights computed within the region using Equation (6). The regional aggregator does not access raw data—it only processes compressed model updates.
Level 3—Global Aggregation: A global server collects the regionally aggregated updates and computes the global model:
θ^(t+1) = θ^(t) + Σ_{r=1}^{R} w_r Δθ_r^(t),
where w_r = n_r/N represents the proportion of total samples held in region r.
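Under the definitions above, the two-level aggregation can be sketched in a few lines of NumPy. The attention weights and region sizes below are illustrative inputs, not the trained values from the paper:

```python
import numpy as np

def regional_aggregate(updates, alphas):
    """Level 2: attention-weighted sum of client updates within one region."""
    alphas = np.asarray(alphas) / np.sum(alphas)  # normalize the attention weights
    return sum(a * u for a, u in zip(alphas, updates))

def global_aggregate(theta, regional_updates, region_sizes):
    """Level 3: sample-proportional combination of regional updates, w_r = n_r / N."""
    N = sum(region_sizes)
    delta = sum((n / N) * u for n, u in zip(region_sizes, regional_updates))
    return theta + delta

# Toy example: 2 regions, 2 clients each, a 3-dimensional model
theta = np.zeros(3)
r1 = regional_aggregate([np.ones(3), 3 * np.ones(3)], alphas=[0.5, 0.5])
r2 = regional_aggregate([np.zeros(3), np.zeros(3)], alphas=[0.5, 0.5])
theta_new = global_aggregate(theta, [r1, r2], region_sizes=[100, 100])
```

The regional step sees only model updates, never raw data, matching the aggregator's access restriction described above.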
Communication Timing and Synchronization Protocol:
FedCarbon adopts fully synchronous aggregation at both hierarchical levels using the following protocol.
Intra-Region Synchronization (Synchronous):
Within each region r, all clients k ∈ R_r perform E = 5 local epochs and send compressed updates to the regional aggregator. The aggregator waits for all selected clients S_t ∩ R_r, with a timeout of τ_timeout = 300 s; late clients are dropped and aggregation proceeds. With K = 20 clients (5 per region), no timeouts were observed.
Inter-Region Synchronization (Synchronous):
The global server waits for all R = 4 regional aggregators before computing the global model θ^(t+1), resulting in fully synchronous global aggregation.
DP Noise Application Order (Critical Design Choice):
Differential privacy is applied at the client before communication, following this order:
1. Local update Δθ_k^(t) via SGD;
2. Gradient clipping and Gaussian noise injection;
3. Top-K gradient compression;
4. Transmission of the compressed, DP-sanitized update Δθ̃_k^(t);
5. Attention weight computation at the regional aggregator;
6. Attention-weighted regional aggregation and forwarding to the global server.
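The client-side portion of this ordering (clip, then noise, then compress) can be sketched as follows. For simplicity, the sketch uses a fixed clipping bound and noise multiplier rather than the paper's adaptive per-feature calibration, and the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def client_sanitize(delta, C=1.0, sigma=1.2, rho=0.1):
    """Client pipeline in the prescribed order: clip -> DP noise -> top-k compress."""
    # 1. Clip the update to L2 norm at most C
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, C / (norm + 1e-12))
    # 2. Inject Gaussian noise calibrated to the clipping bound (before compression)
    noised = clipped + rng.normal(0.0, sigma * C, size=delta.shape)
    # 3. Keep only the top rho fraction of coordinates by magnitude
    k = max(1, int(rho * delta.size))
    idx = np.argsort(np.abs(noised))[-k:]
    sparse = np.zeros_like(noised)
    sparse[idx] = noised[idx]
    return sparse

update = rng.normal(size=100)
sanitized = client_sanitize(update)  # what actually leaves the client
```

Because noise is injected before compression, the aggregator only ever sees DP-sanitized values, which is the property the protocol relies on.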
Implication for Attention–Noise Interaction:
Since attention operates on DP-noisy updates, it may reweight noise across clients; in practice, however, it learns to downweight low signal-to-noise updates. Empirically, this yields higher performance (R2 = 0.942) than uniform-weighted DP-FedAvg (R2 = 0.924).
Regional Aggregators and Raw Data Access:
Regional aggregators never access raw data or pre-DP updates; they use only compressed, DP-sanitized updates, and their attention parameters are updated using global validation feedback rather than raw client information.
3.2. Embodied Carbon Prediction Model
The embodied carbon prediction problem is formulated as a regression problem in which the objective is to learn a mapping function f_θ: ℝ^p → ℝ parameterized by weights θ:
ŷ = f_θ(x). (1)
The local loss function for client k is defined using the mean squared error with L2 regularization:
L_k(θ) = (1/n_k) Σ_{i=1}^{n_k} (f_θ(x_i) − y_i)^2 + λ‖θ‖_2^2. (2)
The global objective function aggregates the local losses:
L(θ) = Σ_{k=1}^{K} (n_k/N) L_k(θ), where N = Σ_{k=1}^{K} n_k. (3)
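As a concrete illustration, the local and global objectives can be evaluated for a simple linear predictor; the linear model and the toy data are illustrative assumptions, not the paper's MLP architecture:

```python
import numpy as np

def local_loss(theta, X, y, lam=1e-4):
    """Client-k loss: mean squared error plus L2 regularization."""
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2) + lam * np.sum(theta ** 2))

def global_loss(theta, client_data, lam=1e-4):
    """Sample-size-weighted aggregation of local losses: sum_k (n_k / N) L_k."""
    N = sum(len(y) for _, y in client_data)
    return sum(len(y) / N * local_loss(theta, X, y, lam) for X, y in client_data)

# Toy check with f_theta(x) = x @ theta
theta = np.array([1.0])
clients = [
    (np.array([[1.0], [2.0]]), np.array([1.0, 2.0])),  # perfect fit: loss = lam
    (np.array([[1.0]]), np.array([0.0])),              # unit residual: loss = 1 + lam
]
loss = global_loss(theta, clients)  # (2/3)*lam + (1/3)*(1 + lam) = 1/3 + lam
```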
3.3. Federated Learning Framework
Algorithm 1 presents the complete FedCarbon training procedure.
| Algorithm 1: FedCarbon: Federated Learning for Embodied Carbon |
| Require: Clients K, rounds T, epochs E, learning rate η, privacy (ϵ, δ), compression ρ |
| Ensure: Global model θ^(T) |
| 1: Initialize global model θ^(0) and attention parameters for each region |
| 2: for t = 0 to T − 1 do |
| 3:  Server broadcasts θ^(t) to all regional aggregators |
| 4:  for each region r ∈ {1, …, R} in parallel do |
| 5:   Regional aggregator r broadcasts θ^(t) to clients in region r |
| 6:   S_t ← random subset of m clients in region r |
| 7:   for each client k ∈ S_t in parallel do |
| 8:    θ_k^(t,0) ← θ^(t) |
| 9:    for e = 0 to E − 1 do |
| 10:     Sample mini-batch B_k from D_k |
| 11:     g_k ← ∇_θ L_k(θ_k^(t,e); B_k) |
| 12:     ĝ_k ← ClipGradient(g_k, C) ▷ Adaptive per-feature clipping |
| 13:     g̃_k ← AddNoise(ĝ_k, σ) ▷ Adaptive per-feature noise |
| 14:     θ_k^(t,e+1) ← θ_k^(t,e) − η g̃_k ▷ Momentum update (Equations (4) and (5)) |
| 15:    end for |
| 16:    Δθ_k^(t) ← θ_k^(t,E) − θ^(t) |
| 17:    Δθ̃_k^(t) ← Compress(Δθ_k^(t), ρ) |
| 18:    ▷ Updates transmitted to the regional aggregator are DP-sanitized and compressed |
| 19:   end for |
| 20:  end for |
| 21:  θ^(t+1) ← θ^(t) + Σ_{k∈S_t} (n_k / Σ_{j∈S_t} n_j) Δθ̃_k^(t) |
| 22: end for |
| 23: return θ^(T) |
The local update rule with momentum is
v_k^(t,e+1) = μ v_k^(t,e) + g̃_k^(t,e), (4)
θ_k^(t,e+1) = θ_k^(t,e) − η v_k^(t,e+1). (5)
The attention-weighted aggregation is
α_k^(t) = exp(e_k^(t)) / Σ_{j ∈ S_t ∩ R_r} exp(e_j^(t)), with e_k^(t) = v^T tanh(W_a Δθ̃_k^(t)). (6)
Attention parameters W_a ∈ ℝ^{d_a×p} and v ∈ ℝ^{d_a} (d_a = 64) operate exclusively on the compressed model updates Δθ̃_k^(t), not raw client data, and are maintained at the regional aggregator level. This ensures that no aggregator accesses raw building data; attention learns to assign higher weights to informative and consistent updates as a proxy for quality, without direct data inspection.
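A minimal NumPy sketch of this attention scoring, assuming a standard softmax over tanh-projected updates; W_a and v are randomly initialized here for illustration rather than trained:

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_weights(updates, W_a, v):
    """Score each compressed update with v^T tanh(W_a u), then softmax-normalize."""
    scores = np.array([v @ np.tanh(W_a @ u) for u in updates])
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

p, d_a = 16, 64
W_a = rng.normal(size=(d_a, p))
v = rng.normal(size=d_a)
updates = [rng.normal(size=p) for _ in range(5)]  # compressed client updates only
alphas = attention_weights(updates, W_a, v)
```

Note that the function consumes only model updates, mirroring the constraint that aggregators never inspect raw building data.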
Training Procedure:
1. Clients transmit compressed model updates Δθ̃_k^(t) (no raw data are transmitted);
2. The regional aggregator computes attention scores α_k^(t) using Equation (6), based solely on the model updates;
3. After distributing the regionally aggregated model, the aggregator updates W_a and v using validation performance feedback—reinforcing the current weighting if the loss decreases, and otherwise adjusting via a gradient step on the attention parameters.
Attention Weight Properties:
Attention weights moderately correlate with data share (Pearson r = 0.82) but not perfectly proportional, balancing quantity with quality;
Without regularization: attention variance increased (std = 0.058 vs. 0.031), while the lowest-weighted region’s R2 dropped to 0.014;
Regularization ensures balanced contributions while downweighting consistently low-quality updates.
This design ensures that at no point does any aggregator access raw building data.
3.4. Differential Privacy Mechanism
We use the Rényi Differential Privacy (RDP) accountant [34] via Opacus for tight privacy composition, providing tighter bounds than basic composition or the moments accountant. We adopt per-record adjacency (datasets D and D′ differ by at most one building record), protecting individual building-level information. Subsampling amplification with mini-batches of size B drawn from the client dataset of size n_k yields the subsampling rate q = B/n_k; by the privacy amplification lemma, a mechanism satisfying (α, ε(α))-RDP on the full dataset satisfies (α, log(1 + q(exp(ε(α)) − 1)))-RDP on the subsampled dataset. The total privacy budget for T communication rounds with E local epochs and the Gaussian mechanism (noise multiplier σ) is computed via RDP composition.
Per-sample gradients are clipped and noised via the Gaussian mechanism:
ḡ_i = g_i / max(1, ‖g_i‖_2 / C), (9)
g̃ = (1/|B|) (Σ_{i∈B} ḡ_i + 𝒩(0, σ^2 C^2 I)). (10)
The adaptive clipping threshold for feature group j is
C_j = C_0 · s_j / ((1/J) Σ_{j′=1}^{J} s_{j′}), (11)
and the adaptive noise variance is
σ_j^2 = σ^2 C_j^2, (12)
where s_j denotes the sensitivity score of feature group j.
Intuitive Explanation of Adaptive Noise Calibration:
The adaptive noise calibration mechanism adjusts the clipping threshold Cj and noise variance σj2 per-feature-group based on the sensitivity score sj, quantifying how much information a feature group reveals about individual building projects. The sensitivity score sj combines: (1) value range sensitivity—ratio of feature group’s inter-quartile range to median, capturing distributional spread; and (2) gradient contribution—average magnitude of gradient components for feature group j over previous Tw = 10 communication rounds.
Gradient Independence from Private Aggregation: The gradient magnitudes used in the sensitivity score sj are computed exclusively on the frozen global model parameters θ^(t) at the beginning of each communication round, prior to any local private training updates. Specifically, at the start of round t, each client computes ∇θ Lk(θ^(t); Bcalibration) on a designated public calibration subset Bcalibration (10% of each client’s local data, held out from training and pre-registered before FL training begins). These calibration gradients are aggregated across clients via simple averaging (without DP noise) to produce the global sensitivity score sj(t). Crucially, the gradient operator used for sj is evaluated at θ^(t)—the global model parameters broadcast by the server—which is itself a post-processed output of the DP-protected aggregation from round t − 1. By the post-processing theorem, θ^(t) inherits the cumulative DP guarantee, and any deterministic function of θ^(t) (including gradient evaluation on public data) does not incur additional privacy cost. The calibration subset is excluded from the private training mini-batches to ensure no double-dipping between sensitivity estimation and private model updates.
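The two ingredients of the sensitivity score and the resulting calibration can be sketched as below. The multiplicative combination of the spread and gradient terms, and the normalization in the calibration step, are illustrative assumptions; the exact forms are fixed by Equations (11) and (12):

```python
import numpy as np

def sensitivity_scores(features_by_group, grads_by_group):
    """Per-feature-group sensitivity: distributional spread (IQR / median)
    combined with the mean recent gradient magnitude. The product form is
    an illustrative assumption, not the paper's exact formula."""
    scores = []
    for values, grads in zip(features_by_group, grads_by_group):
        q25, q50, q75 = np.percentile(values, [25, 50, 75])
        spread = (q75 - q25) / (abs(q50) + 1e-12)
        scores.append(spread * np.mean(np.abs(grads)))
    return np.array(scores)

def calibrate(scores, C0=1.0, sigma0=1.2):
    """Scale clipping thresholds by normalized sensitivity; noise std then
    follows the Gaussian mechanism, proportional to each group's threshold."""
    C = C0 * scores / scores.mean()
    sigma = sigma0 * C
    return C, sigma

# Group 0: constant feature (zero spread); group 1: spread-out feature
scores = sensitivity_scores([[10.0, 10.0, 10.0, 10.0], [1.0, 2.0, 3.0, 4.0]],
                            [[1.0], [2.0]])
C, sigma = calibrate(scores)  # the more sensitive group gets a larger C and sigma
```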
For analytical reference, the privacy budget under advanced composition can be upper-bounded as
ε_total ≤ √(2TE ln(1/δ′)) · ε_step + TE · ε_step (e^{ε_step} − 1), (13)
where ε_step is the per-iteration budget of the subsampled Gaussian mechanism. Equation (13) provides a loose analytical upper bound based on the advanced composition theorem [12] and is included for interpretive reference only. The actual privacy budget reported in all experiments (ε = 0.97 at δ = 10^−5) is computed using the Rényi Differential Privacy (RDP) accountant implemented via Opacus 1.4.0, which provides strictly tighter composition bounds through numerical RDP-to-(ε, δ)-DP conversion [34]. The RDP accountant tracks privacy expenditure across T = 200 communication rounds × E = 5 local epochs with subsampling rate q = B/n_k per client, yielding ε = 0.97, confirming that Equation (13) is indeed a conservative upper bound.
3.5. Gradient Compression
Top-k components are selected by the compression operator:
[C_ρ(u)]_j = [u]_j if |[u]_j| ≥ τ, and 0 otherwise, (14)
where τ is the ⌈ρp⌉-th largest magnitude in u. Error feedback accumulates the residual:
u^(t) = Δθ^(t) + e^(t), e^(t+1) = u^(t) − C_ρ(u^(t)). (15)
Algorithm 2 details the compression procedure.
| Algorithm 2: Gradient Compression with Error Feedback |
| Require: Update ∆θ, ratio ρ, error buffer e |
| Ensure: Compressed ∆, updated e′ |
| u ← e + ∆θ |
| τ ← top-⌈ρp⌉ magnitude threshold in u |
| ∆ ← 0 |
| for j = 1 to p do |
| if |[u]j | ≥ τ then |
| [∆]j ← [u]j |
| end if |
| end for |
| e′ ← u − ∆ |
| return ∆, e′ |
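A runnable NumPy rendering of Algorithm 2, with toy values chosen for illustration:

```python
import numpy as np

def compress_with_feedback(delta, error, rho=0.1):
    """Top-k sparsification with error feedback: the residual mass dropped
    this round is carried into the next round's update (Algorithm 2)."""
    u = error + delta                         # u <- e + delta_theta
    k = max(1, int(np.ceil(rho * u.size)))    # keep ceil(rho * p) components
    tau = np.sort(np.abs(u))[-k]              # k-th largest magnitude threshold
    compressed = np.where(np.abs(u) >= tau, u, 0.0)
    return compressed, u - compressed         # compressed update, new error buffer

delta = np.array([0.1, -2.0, 0.05, 3.0, -0.2])
compressed, err = compress_with_feedback(delta, np.zeros(5), rho=0.4)
# ceil(0.4 * 5) = 2 components survive: the two largest magnitudes (-2.0 and 3.0)
```

The invariant `compressed + err == delta` shows why error feedback preserves convergence: dropped information is deferred, not discarded.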
3.6. Convergence Analysis
Under Assumptions 1–3 (L-smoothness of the local losses, bounded gradient variance, and bounded compression error), FedCarbon achieves the standard O(1/√T) non-convex convergence rate up to additive variance terms induced by the DP noise and the compression error.
Relationship to Analytical Optimality Bounds: We note that the privacy–utility trade-off in FedCarbon is characterized empirically rather than through closed-form analytical bounds. Information-theoretic frameworks such as Sankar et al. [32] have established tight privacy–utility trade-off bounds for smart meter data using rate-distortion theory, demonstrating that optimal privacy-preserving solutions can be derived analytically under Gaussian assumptions. More recently, communication–privacy trade-offs in distributed settings have been characterized through explicit rate expressions at the 60th Allerton Conference on Communication, Control, and Computing [33], providing tight bounds on achievable accuracy under joint communication and privacy constraints. FedCarbon does not claim analytical optimality in the information-theoretic sense; rather, it demonstrates empirical near-optimality by achieving R2 = 0.942 with a (ε = 1.0, δ = 10^−5)-DP guarantee and an 82.6% communication reduction—within 2.6% of the non-private centralized upper bound (R2 = 0.968). Our ‘first comprehensive’ claim refers specifically to the integration of all seven capability dimensions within a single operational framework for embodied carbon benchmarking, not to the theoretical optimality of any individual component. A framework satisfying six of seven dimensions with tighter theoretical bounds would represent a complementary rather than superseding contribution, as practical deployment in multi-stakeholder construction ecosystems requires the full integration we provide.
4. Results and Evaluation
4.1. Datasets
The UCI Energy Efficiency Dataset is included to (i) leverage building geometry and envelope features that are strongly linked to material quantities and embodied carbon, (ii) enable reproducible comparison with prior federated learning studies in smart building research, and (iii) evaluate the generalizability of the FedCarbon framework across different building performance targets. While ECEBD remains the primary dataset for embodied carbon benchmarking, the UCI dataset provides complementary evidence of model robustness and cross-task applicability.
We use two publicly available datasets:
Table 3 summarizes dataset characteristics.
4.2. Experimental Setup
Our implementation of FedCarbon is based on PyTorch 2.1, PySyft 0.8.7, and Opacus 1.4.0. Hyperparameters: K = 20 clients, R = 4 regions, E = 5 local epochs, batch size B = 32, learning rate η = 0.001, clipping threshold C = 1.0, noise multiplier σ = 1.2, compression ratio ρ, and momentum coefficient μ.
Baseline Hyperparameter Tuning and Fairness Protocol:
To ensure fair comparison, all baseline methods were evaluated under a standardized protocol.
Standardization:
Identical Data Partitions: All methods use the same client data partitions from Dirichlet allocation (α = 0.5 and seed = 42), an identical 80/20 train–test split per client.
Uniform Privacy Budget: All DP methods (DP-FedAvg, DP-SCAFFOLD, and FedCarbon) are evaluated at ε = 1.0, δ = 10−5, and the same clipping threshold C = 1.0 and noise multiplier σ = 1.2 for DP-FedAvg/DP-SCAFFOLD; FedCarbon uses adaptive clipping/noise (Equations (11) and (12)), constrained to ε = 1.0 via composition bound (Equation (13)).
Architecture: All baselines use the same three-layer MLP (hidden dimensions [128, 64, 32], ReLU activations, and L2 regularization λ = 10−4).
Infrastructure: NVIDIA A100 GPU (40GB VRAM), PyTorch 2.1, PySyft 0.8.7, and Opacus 1.4.0.
Table 4 summarizes the hyperparameter optimization (grid search).
The datasets are partitioned using a standard Dirichlet-based non-IID strategy to simulate heterogeneous federated settings with 20 clients organized into four regions (five clients per region). For the ECEBD dataset, buildings are first assigned to regions based on geography (Northern, Central, Southern, and Eastern Europe), while UCI records are randomly assigned to regions due to the absence of location attributes. Within each region, data are distributed to clients using a Dirichlet distribution over 10 target-variable (ECI) quantile bins, where the concentration parameter α controls heterogeneity (α = 0.1 for highly skewed, α = 0.5 for moderate non-IID, and α = 10 for near-IID partitions). The degree of heterogeneity is quantified using Earth Mover’s Distance between client distributions and Weight Divergence from the uniform distribution.
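The quantile-bin Dirichlet partitioning described above can be sketched as follows; the synthetic ECI values are illustrative, and minor details (e.g., tie handling at bin edges) are assumptions rather than the exact protocol:

```python
import numpy as np

rng = np.random.default_rng(42)

def dirichlet_partition(y, n_clients=5, n_bins=10, alpha=0.5):
    """Assign records to clients with Dirichlet-skewed shares over
    target-variable (ECI) quantile bins; smaller alpha -> more non-IID."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    clients = [[] for _ in range(n_clients)]
    for b in range(n_bins):
        idx = rng.permutation(np.where(bins == b)[0])
        shares = rng.dirichlet(alpha * np.ones(n_clients))  # per-bin client shares
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for c, part in enumerate(np.split(idx, cuts)):
            clients[c].extend(part.tolist())
    return clients

y = rng.normal(300, 80, size=1000)  # synthetic ECI values in kgCO2e/m2
parts = dirichlet_partition(y, alpha=0.5)
```

Sweeping α (0.1, 0.5, 10) reproduces the highly skewed, moderate, and near-IID regimes evaluated in the heterogeneity study.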
Table 5 shows the effect of the Dirichlet concentration parameter (α) on federated data heterogeneity, measured by Earth Mover’s Distance (mean ± std) and Weight Divergence, while Table 6 shows the impact of heterogeneity on FedCarbon performance (ECEBD).
The experimental setting in the present study is a simulated multi-stakeholder urban environment. Every client is associated with an urban stakeholder, such as a construction business, residential developer, material distributor, or local government, operating within a particular city or administrative area. The regional parameter R captures different urban or regional settings, allowing city-level heterogeneity in residential building features to be modeled. The non-IID data distributions among clients reflect realistic variations in building typologies, material selections, and construction processes across urban environments. In this setup, the proposed framework can be evaluated under conditions close to real-world intra-city and inter-city benchmarking scenarios while data sovereignty is maintained across the participating stakeholders.
4.3. Training Convergence
Figure 3 shows the training loss curves as a function of communication rounds.
4.4. Prediction Accuracy
Figure 4 shows how the prediction accuracy (R2) increases over 200 communication rounds on both the UCI Energy Efficiency and ECEBD datasets. FedCarbon (red squares) reaches final accuracies of 0.921 and 0.942, respectively, surpassing the DP-FedAvg and Local Only baselines and approaching the non-private Centralized upper bound. The convergence curves confirm that FedCarbon's adaptive differential privacy and attention-based aggregation mechanisms effectively balance privacy preservation with model utility, converging to a stable point within 150 rounds despite the noise injection and gradient compression overhead.
Figure 5 presents predicted versus actual values.
Table 7 presents detailed performance comparisons.
4.5. Urban-Scale Embodied Carbon Benchmarking Demonstration
Although predictive accuracy metrics such as R2, MAE, and RMSE are needed to verify model performance, urban sustainability applications demand interpretable benchmarking outputs that allow cities and regions to be compared. To illustrate how the proposed framework supports decision making at the urban scale, we derive percentile-based embodied carbon benchmarks from the federated predictions on the ECEBD dataset. These benchmarks show how residential building stocks can be positioned against one another without central access to raw building data. Percentile-based indicators are typical of city sustainability reporting and enable municipalities to identify high-carbon segments, track progress over time, and target retrofit or policy interventions.
The proposed federated learning approach yields portfolio-level embodied carbon benchmarks that can be interpreted directly at the urban scale, as illustrated in
Table 8. Cities or regions can compare their residential building stocks against the percentile thresholds to determine whether they fall into the low-, medium-, or high-carbon categories relative to other jurisdictions. Notably, these benchmarks can be generated without access to individual building records or proprietary data, enabling cross-city comparison and joint climate action under demanding data sovereignty and privacy constraints.
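The percentile-based benchmarking step can be sketched in a few lines. The band names, percentile choices, and synthetic predictions below are illustrative assumptions, not the values reported in Table 8:

```python
import numpy as np

def carbon_benchmarks(predicted_eci, percentiles=(25, 50, 75)):
    """Derive portfolio-level benchmarks (kgCO2e/m2) from federated model
    predictions; raw building records never leave the participating clients."""
    return {f"P{p}": float(np.percentile(predicted_eci, p)) for p in percentiles}

def classify(eci, bm):
    """Place a building stock value into low/medium/high carbon bands
    relative to the P25/P75 thresholds (band names are illustrative)."""
    if eci < bm["P25"]:
        return "low-carbon"
    if eci > bm["P75"]:
        return "high-carbon"
    return "medium-carbon"

preds = np.random.default_rng(0).normal(450.0, 80.0, 2000)  # hypothetical predictions
bm = carbon_benchmarks(preds)
```

A municipality would evaluate its own stock against `bm` without ever seeing other jurisdictions' building records, since only the model-derived thresholds are exchanged.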
The bootstrap confidence intervals in
Table 9 are computed by resampling the federated model predictions with replacement 1000 times and computing percentiles on each resample. The CI widths (14.8–23.5 kgCO2e/m2) represent 4.0–5.2% of the respective benchmark values, indicating moderate stability. Higher percentiles show wider intervals due to greater variance in the upper tail of the ECI distribution.
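The percentile bootstrap behind Table 9 can be sketched as follows; the predictions are synthetic and the implementation is a generic textbook version, so the study's exact resampling details may differ:

```python
import numpy as np

def bootstrap_percentile_ci(preds, q, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap CI for a benchmark: resample the federated
    predictions with replacement and recompute the percentile each time."""
    rng = np.random.default_rng(seed)
    stats = [np.percentile(rng.choice(preds, size=len(preds), replace=True), q)
             for _ in range(n_boot)]
    # Take the empirical (1-level)/2 and (1+level)/2 quantiles of the resampled stats.
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lo), float(hi)

preds = np.random.default_rng(2).normal(450.0, 80.0, 600)  # hypothetical predictions
lo, hi = bootstrap_percentile_ci(preds, 75)
```

Because upper-tail percentiles are estimated from fewer effective samples, their resampled statistics vary more, which is why the higher benchmarks show the wider intervals noted above.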
Percentile Boundary Crossing Analysis:
To quantify the practical impact of benchmark uncertainty on building classification, we analyze how many test-set buildings would be reclassified when the percentile thresholds are adjusted to their CI bounds.
Table 10 shows the boundary crossing analysis under CI-adjusted thresholds.
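The boundary crossing computation can be sketched as below. The threshold and CI bounds are hypothetical placeholders, not the values in Table 10:

```python
import numpy as np

def boundary_crossing_rate(preds, threshold, ci_lo, ci_hi):
    """Fraction of buildings whose classification flips when a percentile
    threshold is moved to either bound of its confidence interval."""
    base = preds < threshold          # classification under the point estimate
    flips = (preds < ci_lo) != base   # flips when threshold drops to the lower bound
    flips |= (preds < ci_hi) != base  # flips when threshold rises to the upper bound
    return float(flips.mean())

preds = np.random.default_rng(3).normal(450.0, 80.0, 1000)   # synthetic test set
rate = boundary_crossing_rate(preds, 504.0, 496.0, 512.0)    # hypothetical P75 and CI
```

Only buildings whose predicted ECI falls inside the CI band around the threshold can flip, so the crossing rate is exactly the mass of the prediction distribution within that band.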
Leave-One-Country-Out (LOCO) Percentile Robustness Analysis:
To demonstrate robustness of the percentile benchmarks across geographic jurisdictions, we performed leave-one-country-out validation, where the federated model is retrained excluding all buildings from one country, and percentile benchmarks are recomputed on the remaining test set.
The LOCO analysis reveals that the percentile benchmarks are moderately robust to the exclusion of any single country, with mean absolute shifts of 2.6–4.7 kgCO2e/m2 (0.9–1.0% of benchmark values) and a maximum shift of 8.4 kgCO2e/m2 (1.9%) when Poland is excluded. The largest perturbations occur when excluding countries with distinctive construction traditions (Poland: high masonry/concrete mix; Germany: largest sample contributing to Central EU calibration; Spain: high ECI variability in Southern EU). All LOCO percentile shifts fall within the bootstrap confidence intervals reported in
Table 11, confirming that no single country disproportionately determines the aggregate benchmarks. The analysis validates that the federated model produces geographically robust benchmarks suitable for cross-jurisdictional comparison.
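The LOCO shift computation can be sketched as follows. Note the simplification: the study retrains the federated model for each exclusion, whereas this sketch only recomputes percentiles on pooled predictions; the country means and sample sizes are synthetic:

```python
import numpy as np

def loco_percentile_shifts(preds_by_country, q=75):
    """Leave-one-country-out: recompute a percentile benchmark with each
    country's predictions excluded and report the absolute shift."""
    all_preds = np.concatenate(list(preds_by_country.values()))
    full = np.percentile(all_preds, q)  # benchmark with every country included
    shifts = {}
    for country in preds_by_country:
        rest = np.concatenate([v for c, v in preds_by_country.items()
                               if c != country])
        shifts[country] = abs(float(np.percentile(rest, q) - full))
    return shifts

rng = np.random.default_rng(4)
data = {c: rng.normal(m, 70.0, 300) for c, m in
        [("DE", 430.0), ("PL", 510.0), ("ES", 460.0)]}  # hypothetical country means
shifts = loco_percentile_shifts(data)
```

Countries whose ECI distribution sits far from the pooled distribution (the synthetic "PL" here) produce the largest shifts, mirroring the Poland result above.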
Recommendation for Municipal Use: Given the boundary crossing rates (6.6–9.4%) and LOCO variability (max 1.9%), we recommend that municipalities adopt the following protocol: (1) use CI-adjusted thresholds (lower bound of the CI for P25 and upper bound for P75) to ensure conservative classification; (2) apply "buffer zones" of ±15 kgCO2e/m2 around each percentile threshold; (3) require project-level LCA verification for buildings within buffer zones before policy classification.
Privacy Status of Released Benchmarks: The percentile benchmarks in
Table 8 inherit the (ε = 1.0, δ = 10−5)-DP guarantee from the global model θ^(T) via the post-processing theorem [
20], since the benchmarks are deterministic functions of θ^(T) applied to test inputs rather than raw data. However, if benchmarks were computed on training data in which a single city dominates a regional partition, information leakage risks would exist. Mitigation: (i) compute benchmarks on a separate held-out public building stock survey; (ii) the reported Table 8 benchmarks use the test set (20% held-out); (iii) the DP guarantee applies regardless, as the model itself is DP-protected. For policy deployment, municipalities should treat the benchmarks as approximate reference ranges with bootstrap confidence intervals rather than hard regulatory thresholds, acknowledging the inherent uncertainty in model-derived statistics.
4.6. Privacy–Utility Trade-Off
Figure 6. The trade-off curve shows that prediction accuracy (R2) rises as the privacy budget (ϵ) increases, and FedCarbon (square markers) consistently performs best on both datasets while providing a strong privacy guarantee (ϵ ≤ 1, green region). The visual encoding of R2 values (blue for low, 0.65, to red for high, ∼0.96) shows that FedCarbon performs almost at the level of non-private FedAvg (gray dashed line) even under strict privacy settings, demonstrating the effectiveness of its adaptive noise calibration scheme.
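For orientation, the classic Gaussian-mechanism calibration that underlies such privacy–utility curves can be sketched as follows. This is the standard textbook bound (valid for ϵ ≤ 1), not FedCarbon's adaptive per-feature noise scheme:

```python
import math

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Gaussian-mechanism noise scale: sigma >= sqrt(2 ln(1.25/delta)) * S / epsilon,
    where S is the L2 sensitivity (e.g., the clipping norm of a client update)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

# Tighter privacy budgets demand proportionally more noise, degrading utility.
sigma_strict = gaussian_sigma(epsilon=0.5, delta=1e-5)
sigma_loose = gaussian_sigma(epsilon=1.0, delta=1e-5)
```

Halving ϵ doubles the required noise scale, which is why accuracy climbs as the budget is relaxed along the trade-off curve.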
4.7. Communication Efficiency
Figure 7 shows the communication analysis. FedCarbon's momentum-enhanced gradient compression achieves an 82.6% bandwidth reduction at a compression ratio of 0.1 (0.38 MB per round instead of 2.18 MB) while maintaining a prediction accuracy of R2 = 0.942, above the 0.90 acceptability threshold. FedCarbon preserves accuracy better than the TopK-Basic and Random-Sparse compression techniques because its error feedback accumulation mechanism requires only 165 convergence rounds, compared to 195 and 232 rounds for the baseline methods at the same compression levels. The trade-off analysis confirms that FedCarbon enables practical deployment by construction industry stakeholders with limited network connectivity without compromising model performance.
Table 12 shows the computational overhead and training time comparison.
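Error-feedback Top-K sparsification, the general mechanism this compression scheme builds on, can be sketched as follows. The class name and ratio are illustrative, and the momentum term is omitted for brevity:

```python
import numpy as np

class TopKCompressor:
    """Top-K gradient sparsification with error feedback: coordinates dropped
    this round are accumulated and carried into the next round's update."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.residual = None

    def compress(self, grad):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        corrected = grad + self.residual        # add back past truncation error
        k = max(1, int(self.ratio * corrected.size))
        idx = np.argpartition(np.abs(corrected), -k)[-k:]  # k largest magnitudes
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]
        self.residual = corrected - sparse      # remember what was dropped
        return sparse

comp = TopKCompressor(ratio=0.1)
g = np.random.default_rng(5).normal(size=1000)
sparse = comp.compress(g)
```

Because the residual re-injects dropped mass in later rounds, no gradient information is permanently lost, which is what allows aggressive ratios such as 0.1 to converge in fewer rounds than plain Top-K.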
4.8. Regional Performance
Figure 8 visualizes regional performance variations. The regional analysis demonstrates FedCarbon's effectiveness across heterogeneous geographic regions: the COMSOL-style heatmaps indicate that the highest accuracy (R2 = 0.959) is obtained in Region 2 (Central EU), whereas slower convergence (R2 = 0.923) is observed in Region 3 (Southern EU), where data heterogeneity is more pronounced, yet all regions converge to a satisfactory performance level above 0.92. The client attention weight evolution heatmap shows how the attention-based aggregation process dynamically adapts client contributions over 200 communication rounds, with higher attention (red) indicating a larger client contribution to the aggregation. The polar attention distribution confirms balanced regional contributions of 23.1% to 26.8%, demonstrating that FedCarbon's hierarchical aggregation is effective under the non-IID data distributions of diverse European building stocks.
4.10. Error Decomposition Analysis
As shown in
Table 14,
Table 15 and
Table 16, the model performs best on (i) single-family detached buildings (R2 = 0.951), which have the most standardized construction methods and material palettes; (ii) the Central EU region (R2 = 0.959), which has the largest sample count and most consistent building standards; and (iii) reinforced concrete structures (R2 = 0.948), which dominate the training data. The model performs worse on (i) high-rise apartments (R2 = 0.912), which have complex structural systems and more variable material quantities; (ii) Southern EU (R2 = 0.923), which exhibits the highest intra-regional construction practice variability; and (iii) timber and mixed/hybrid structures (R2 = 0.908–0.918), which represent minority classes in the training data with higher material composition variability.
These results suggest that prediction accuracy is primarily driven by training data representation and construction practice homogeneity, indicating opportunities for targeted data collection in under-represented building categories to improve model performance.
4.11. Robustness to Data Partitioning
The headline R2 = 0.942 has an expected variance of σ2 = 1.6 × 10−5 (std = 0.004) across 10 independent non-IID partitions;
Table 17 indicates high stability, with a 95% CI of [0.937, 0.945], confirming representativeness. The attention mechanism is the most partition-sensitive component (std increases from 0.004 to 0.009 when it is removed), as the attention weights adapt to client update distributions that are directly affected by data partitioning; without attention, fixed sample-size weighting provides less adaptation but inherent stability. Adaptive DP is the second most sensitive (std = 0.007 without it), as feature sensitivity scores interact with client data distributions; compression is the least sensitive (std = 0.003), as Top-K sparsification operates independently of data distribution patterns.
Table 17 shows the variance in FedCarbon performance across 10 independent non-IID partitions (K = 20, R = 4, and α = 0.5), and
Table 18 shows the component sensitivity analysis (variance in R2 across 10 partitions).
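The stability statistics above (mean, standard deviation, and a normal-approximation 95% CI across partitions) can be computed as sketched below; the ten R2 scores are illustrative stand-ins, not the measured values:

```python
import math
import statistics

def partition_stability(r2_scores, z=1.96):
    """Mean, std, and normal-approximation 95% CI of R^2 across
    independent data partitions."""
    mean = statistics.mean(r2_scores)
    std = statistics.stdev(r2_scores)          # sample standard deviation
    half = z * std / math.sqrt(len(r2_scores)) # CI half-width for the mean
    return mean, std, (mean - half, mean + half)

# Illustrative per-partition scores (not the paper's measurements).
scores = [0.938, 0.944, 0.941, 0.946, 0.939, 0.943, 0.940, 0.945, 0.942, 0.944]
mean, std, ci = partition_stability(scores)
```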
5. Discussion
Table 19 provides a detailed comparison of FedCarbon with ten state-of-the-art methods on seven evaluation criteria, showing that, although existing methods excel on individual criteria, such as FL-SmartBuilding with R2 = 0.948 for energy prediction [
1] or DP-Thermal with thermal comfort privacy guarantees [
4], none of them combine all the capabilities needed for privacy-preserving embodied carbon benchmarking. The comparison indicates that only FedCarbon combines federated learning with differential privacy (ϵ = 1.0), gradient compression (82.6% reduction), and hierarchical aggregation in the carbon assessment domain while achieving competitive accuracy (R2 = 0.942). Interestingly, approaches such as QFL-IoT [
9] and VFL-Cyber [
17] include privacy and compression but lack the hierarchical structure characteristic of multi-stakeholder construction ecosystems spanning different geographic locations. FedCarbon fills this gap by offering the first end-to-end solution for practical collaborative carbon benchmarking without centralizing building-related data, making it the most appropriate framework for deployment in real construction industry settings.
Beyond methodological performance, the proposed framework is directly applicable to urban policy and governance. By enabling privacy-preserving, portfolio-level embodied carbon benchmarking, it can support municipal decision making in urban planning, housing strategies, and retrofit prioritization. Local authorities can use the derived benchmarks to identify high-carbon segments of residential building stocks and to guide low-carbon procurement and material selection policies, even without access to proprietary or sensitive project-level data. Moreover, the ability to produce comparable benchmarks across cities and regions strengthens urban sustainability reporting and the tracking of progress toward climate objectives without violating the data sovereignty constraints of the public and private parties involved. The federated design also enables cities to participate in joint benchmarking programs without sharing central data, supporting coordinated climate action within fragmented urban governance and regulatory frameworks.
5.1. Limitations and Deployment Challenges
Simulation-Based Evaluation Limitations: Dirichlet-based partitioning (α = 0.5) may not capture full real-world heterogeneity (construction firms have detailed material data vs. municipalities with aggregate building permit records). Homogeneous client computation assumptions ignore the resource constraints of small firms or under-resourced departments lacking hardware or expertise. Controlled network conditions exclude real-world variability (intermittent connectivity, variable bandwidth, and asynchronous availability); 82.6% communication savings assume reliable, synchronous rounds. Static datasets do not reflect evolving building stock from new construction, retrofits, and updated assessment methods.
Anticipated Real-World Deployment Challenges: Data schema heterogeneity requires harmonizing formats, units, and assessment boundaries (cradle-to-gate vs. cradle-to-grave) through standardized ontologies. Regulatory compliance must address varying data protection regulations (GDPR and national laws); while differential privacy provides formal guarantees, regulatory acceptance for building data systems is unestablished. Stakeholder trust requires transparent privacy auditing, verifiable computation, and clear value propositions, as firms may resist participation due to competitive concerns. Model maintenance requires continuous updating and concept drift detection for changing practices, materials, and carbon factors. Byzantine robustness is absent; real deployments face risks from corrupted or malicious updates, with Byzantine-resilient aggregation noted as critical future work.
Cross-Regional and Cross-Regulatory Adaptability: FedCarbon’s hierarchical architecture operates across heterogeneous regulatory environments, allowing regional aggregators to enforce jurisdiction-specific privacy requirements (e.g., stricter ε under GDPR) without affecting global aggregation, supporting heterogeneous privacy budgets.
Model Update Mechanism: FedCarbon supports incremental updates through: (1) rolling training windows incorporating new data without retraining from scratch, (2) concept drift detection via CUSUM test monitoring client update magnitudes to trigger re-initialization, and (3) version control tagging each global model with timestamps and privacy budgets for audit trails.
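The CUSUM-based drift trigger in step (2) can be sketched as follows; the reference mean, slack parameter k, and threshold h are illustrative, not FedCarbon's configured values:

```python
class CusumDriftDetector:
    """One-sided CUSUM monitor on client update magnitudes: accumulate
    deviations above a reference mean mu0 and flag drift past threshold h."""
    def __init__(self, mu0, k=0.5, h=5.0):
        self.mu0, self.k, self.h = mu0, k, h
        self.s = 0.0

    def update(self, x):
        # Standard CUSUM recursion: S_t = max(0, S_{t-1} + x - mu0 - k).
        self.s = max(0.0, self.s + x - self.mu0 - self.k)
        return self.s > self.h  # True -> trigger model re-initialization

det = CusumDriftDetector(mu0=1.0)
in_control = [det.update(1.0) for _ in range(20)]  # stable update magnitudes
drifted = [det.update(3.0) for _ in range(5)]      # sustained shift upward
```

Because the statistic resets toward zero while updates stay near the reference, isolated spikes are tolerated, whereas a sustained shift in update magnitudes accumulates until the threshold fires.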
Malicious Client Defense (Limitations): The current framework lacks explicit Byzantine-resilient aggregation but provides implicit robustness through the attention mechanism, which downweights adversarial updates that diverge from majority patterns. Injecting two malicious clients (10% of K = 20) with random gradient noise resulted in only a 0.008 R2 degradation under attention weighting, versus 0.031 with FedAvg, although this implicit defense is insufficient against sophisticated poisoning attacks. Future work will integrate robust aggregation methods (coordinate-wise median, trimmed mean, and Krum).
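Among the robust aggregators named as future work, a coordinate-wise trimmed mean can be sketched as follows, using synthetic honest and poisoned updates for illustration:

```python
import numpy as np

def trimmed_mean_aggregate(client_updates, trim=0.2):
    """Coordinate-wise trimmed mean: at each coordinate, drop the largest and
    smallest `trim` fraction of client values before averaging, bounding the
    influence any single (possibly malicious) client can exert."""
    u = np.sort(np.stack(client_updates), axis=0)  # sort per coordinate
    t = int(trim * u.shape[0])                     # number trimmed from each end
    return u[t:u.shape[0] - t].mean(axis=0)

honest = [np.ones(4) * 0.1 for _ in range(8)]
malicious = [np.ones(4) * 100.0, np.ones(4) * -100.0]  # two poisoned clients
agg = trimmed_mean_aggregate(honest + malicious, trim=0.2)
```

With 20% trimming, both extreme updates are discarded at every coordinate, so the aggregate recovers the honest consensus regardless of the poisoned magnitudes.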
6. Conclusions
In this paper, FedCarbon, a federated learning framework for privacy-preserving embodied carbon benchmarking in residential construction, was presented. The framework combines hierarchical federated learning with attention-based client weighting, adaptive differential privacy, and momentum-based gradient compression. Experiments on two datasets (3108 buildings) demonstrated 94.2% prediction accuracy while providing (ε, δ)-differential privacy with ε = 1.0 and reducing communication by 82.6%. Future work will involve Byzantine-resilient aggregation and pilot deployments with construction industry partners.