1. Introduction
Europe’s building stock accounts for roughly 40% of final energy consumption in the European Union, a share large enough to place buildings at the centre of the European Green Deal and the revised Energy Performance of Buildings Directive (EPBD 2024) [1,2]. The Directive identifies Building Energy Management Systems (BEMS), software platforms that monitor, control, and optimise building energy flows in real time, as a key instrument for demand reduction and renewable integration. Over the past decade, machine learning (ML) has established itself as the leading computational framework for BEMS optimisation across three functionally distinct sub-domains: short-term load forecasting and energy monitoring; adaptive HVAC setpoint control and building optimisation; and grid demand response and flexibility [3].
Yet whether, and to what extent, the transition from research prototype to operational deployment has been achieved in the BEMS ML literature has not been systematically assessed using a standardised readiness framework [4].
This deployment gap is directly relevant to the EU Artificial Intelligence Act (Regulation 2024/1689) [5], in force since August 2024, which establishes risk-based compliance obligations for high-risk AI applications. These may include ML-enabled BEMS functions that act as safety components in critical energy infrastructure, as well as demand-response platforms that process live individual metering data. Systems validated only in simulation or single-building pilots have not yet been systematically assessed against these obligations.
Prior reviews have addressed each sub-domain individually without a shared cross-domain readiness framework: for load forecasting, prediction models, XAI applications, and data barriers [6,7,8,9,10]; for HVAC control, AI-assisted strategies, hybrid control, and reinforcement learning [11,12,13,14,15]; and for demand response, demand forecasting and energy flexibility [16,17].
Within this broader context, three dimensions remain insufficiently addressed in the literature reviewed here: (i) the quantification of deployment maturity using a standardised Technology Readiness Level (TRL) rubric; (ii) the mapping of Trustworthy AI (TAI) compliance across the privacy, robustness, and transparency dimensions of the Assessment List for Trustworthy AI (ALTAI); and (iii) the evaluation of alignment with the EU AI Act risk-classification framework.
This study addresses these gaps through a scoping review following PRISMA 2020 guidelines, covering the three BEMS sub-domains outlined above and structured around three research questions (RQs):
RQ1—Deployment Maturity: What is the distribution of Technology Readiness Levels across the three BEMS sub-domains, and what deployment-context patterns are associated with different maturity levels?
RQ2—Trustworthy AI: What is the distribution of TAI coverage across the three BEMS sub-domains, and how consistently are the ALTAI dimensions of privacy, robustness, and transparency represented at publication level?
RQ3—Regulatory Readiness: What is the distribution of EU AI Act risk proximity across the three BEMS sub-domains, and to what extent do the included papers document engagement with Annex III classification criteria at publication level?
2. Materials and Methods
2.1. Review Protocol and Eligibility
This scoping review is designed for comparative cross-domain charting rather than prevalence estimation over the full screened population.
Item reporting follows the PRISMA Extension for Scoping Reviews (PRISMA-ScR) [18]; the PRISMA 2020 flow diagram structure [19] is adopted solely for record-flow reporting.
The review protocol is structured around three BEMS sub-domains: (A) Load Forecasting and Energy Monitoring, (B) HVAC Control and Building Optimisation, and (C) Demand Response and Flexibility.
Eligibility criteria follow the Population–Concept–Context (PCC) framework recommended for scoping reviews [18]: Population = buildings; Concept = ML-based BEMS across the three sub-domains; Context = peer-reviewed literature published between 2020 and 2026 and indexed in Scopus or IEEE Xplore.
Sources are eligible if they: (i) propose, evaluate, or review an ML-based system in at least one of the three BEMS sub-domains; (ii) are published in English in a peer-reviewed journal or conference proceedings indexed in Scopus or IEEE Xplore; and (iii) appear between 2020 and 2026 inclusive.
Sources are excluded if they address wind, solar, or grid-only applications without an explicit building-level BEMS component.
2.2. Information Sources and Study Selection
A structured search was conducted across two primary databases: Scopus and IEEE Xplore. Six independent queries, one per database and sub-domain combination, were formulated using title-field constraints to maximise precision.
Table A1 (Appendix A) reports the query strings. The search was last executed in March 2026.
Records were deduplicated and screened using a deterministic pipeline combining exact DOI matching, fuzzy title matching (token-sort ratio, threshold 88; thefuzz v ≥ 0.20 [20]), and rule-based relevance scoring against a fixed score cutoff.
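The deduplication step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record fields (`doi`, `title`) and the function names are hypothetical, and the token-sort ratio is approximated with the standard library's `difflib.SequenceMatcher` rather than computed with thefuzz, so scores near the threshold may differ slightly from those of the library.

```python
import re
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> float:
    """Stdlib approximation of a token-sort ratio on a 0-100 scale."""
    def normalise(text: str) -> str:
        # Lowercase, keep alphanumeric tokens, and sort them alphabetically
        # so that word order does not affect the comparison.
        return " ".join(sorted(re.findall(r"[a-z0-9]+", text.lower())))
    return 100.0 * SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def deduplicate(records: list[dict], threshold: float = 88.0) -> list[dict]:
    """Keep the first record of each duplicate group.

    Two records are duplicates if their DOIs match exactly (case-insensitive)
    or if their titles reach the fuzzy-match threshold.
    """
    kept: list[dict] = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        duplicate = False
        for prev in kept:
            prev_doi = (prev.get("doi") or "").strip().lower()
            same_doi = bool(doi) and doi == prev_doi
            similar_title = token_sort_ratio(rec["title"], prev["title"]) >= threshold
            if same_doi or similar_title:
                duplicate = True
                break
        if not duplicate:
            kept.append(rec)
    return kept


# Hypothetical records, for illustration only.
records = [
    {"doi": "10.1000/abc", "title": "Deep learning for building load forecasting"},
    {"doi": "10.1000/ABC", "title": "Something else entirely"},
    {"doi": None, "title": "Building load forecasting for deep learning"},
    {"doi": None, "title": "Reinforcement learning for HVAC control"},
]
unique = deduplicate(records)
```

In this toy input the second record is dropped on the DOI match and the third on the title match, since sorting the tokens makes the two titles identical.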
A stratified sampling design was adopted, targeting approximately 10% of the 614 screened records. Records were stratified by sub-domain (A: 285, B: 192, C: 137), ranked by composite screening score within each stratum, and allocated using a Hamilton largest-remainder quota subject to a minimum floor of 12 papers per sub-domain, which minimises rounding error and avoids under-representation of the smallest sub-domain (C).
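For readers unfamiliar with the Hamilton (largest-remainder) method, the sketch below shows a purely proportional allocation with a floor check. The function name and the choice of 61 as the target total are assumptions made for illustration. The sketch is not expected to reproduce the reported corpus split, which also depends on score ranking and eligibility screening.

```python
from math import floor


def hamilton_allocation(strata: dict[str, int], total: int, minimum: int = 0) -> dict[str, int]:
    """Allocate `total` seats to strata in proportion to their sizes."""
    population = sum(strata.values())
    # Exact (fractional) quota of each stratum.
    quotas = {name: total * size / population for name, size in strata.items()}
    # Every stratum first receives the integer part of its quota.
    seats = {name: floor(q) for name, q in quotas.items()}
    leftover = total - sum(seats.values())
    # Remaining seats go to the strata with the largest fractional remainders.
    by_remainder = sorted(strata, key=lambda name: quotas[name] - seats[name], reverse=True)
    for name in by_remainder[:leftover]:
        seats[name] += 1
    # Assumption: the floor is treated as a constraint to verify, not to enforce.
    if any(n < minimum for n in seats.values()):
        raise ValueError("proportional allocation violates the minimum floor")
    return seats


# Screened records per sub-domain as reported in the text.
allocation = hamilton_allocation({"A": 285, "B": 192, "C": 137}, total=61, minimum=12)
```

The exact quotas here are about 28.31, 19.08, and 13.61, so the single leftover seat goes to sub-domain C, which has the largest remainder.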
This procedure yielded a final corpus of N = 61 papers: sub-domain A (load forecasting and energy monitoring, n = 20), sub-domain B (HVAC control and building optimisation, n = 24), and sub-domain C (demand response and flexibility, n = 17). The PRISMA flow diagram is presented in Figure 1.
2.3. Data Charting Framework
For each included source of evidence, the following items were charted: primary BEMS sub-domain, ML technique, TRL, TAI coverage, and key performance indicators (KPIs). The ML technique is coded according to the main algorithmic approach (e.g., LSTM, DQN, federated learning).
Papers using Model Predictive Control (MPC) or Mixed-Integer Linear Programming (MILP) without an embedded ML component are retained in the corpus as non-ML comparators; they are excluded from ML taxonomy counts and sub-domain ML breakdowns, but retained as a reference group in the cross-category TRL distribution analysis (Section 4.2).
TRL is assigned using the rubric detailed in Section 2.4; TAI coverage is coded as present or absent for each of the three ALTAI dimensions; and KPIs are charted as reported in the abstract. For sub-domain synthesis, additional items include deployment context (simulation/offline/pilot/production) and dataset type (public benchmark/simulation/smart meter/aggregate), which are used for cross-domain analyses.
The primary data charting protocol is based on title and abstract only, consistent with the need for cross-domain comparability and the fact that deployment maturity and TAI coverage are often signalled at publication level through the abstract. Findings are therefore interpreted as properties of the mapped abstracts rather than as claims about the underlying deployed systems, and reported percentages refer to this analytic sample rather than to prevalence across all screened records.
In all ambiguous cases, a conservative lower-bound rule is applied: a TRL level or TAI dimension is assigned as present only when the abstract provides an explicit positive signal. Absence of a signal is recorded as absence of evidence at publication level, not as evidence of absence in the underlying system.
This protocol follows an established precedent in PRISMA-ScR scoping reviews, where cross-domain comparability requirements justify abstract-level coding [18], and deliberately yields conservative lower-bound estimates of TAI coverage, reducing the risk of false positives in compliance gap identification. Following PRISMA-ScR Item 12, the TRL rubric is also the critical appraisal instrument.
A secondary validation was applied to a targeted subset of 12 deployment-adjacent papers to estimate the conservative bias introduced by abstract-only coding; this validation informs the robustness discussion in Section 6.4 and does not alter the primary charting results.
2.4. Technology Readiness Level Rubric
TRLs are assigned using a nine-point rubric (Table 1) adapted from EU/Horizon Europe guidelines [21], applying the lower-bound rule defined in Section 2.3.
Three domain-specific adaptations are introduced: (i) TRL 1–2 are collapsed into a single Research band, since abstract-only signals do not distinguish between basic-principles and concept-formulation stages; (ii) TRL 4 and TRL 5 are distinguished by evaluation environment (public benchmark for TRL 4; building physics simulation for TRL 5); and (iii) TRL 6 maps to offline evaluation on real building data, consistent with the EU “demonstrated in relevant environment” criterion for ML/software-intensive systems [4]. Rule-based classifier signals for each level are detailed in Table 1.
For cross-domain synthesis, paper-level TRL assignments are additionally aggregated into three deployment bands: Research (TRL 1–3), Development (TRL 4–6), and Demonstration (TRL 7–9).
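The band aggregation can be expressed as a small helper. This is a minimal sketch: the function names are hypothetical and the example TRL list is illustrative, not corpus data.

```python
from collections import Counter

BANDS = ("Research", "Development", "Demonstration")


def trl_band(trl: int) -> str:
    """Map a paper-level TRL (1-9) to its deployment band."""
    if not 1 <= trl <= 9:
        raise ValueError(f"TRL must be between 1 and 9, got {trl}")
    if trl <= 3:
        return "Research"
    if trl <= 6:
        return "Development"
    return "Demonstration"


def band_shares(trls: list[int]) -> dict[str, float]:
    """Percentage of papers in each band, rounded to one decimal place."""
    counts = Counter(trl_band(t) for t in trls)
    return {band: round(100 * counts[band] / len(trls), 1) for band in BANDS}


# Illustrative TRL assignments for six hypothetical papers.
shares = band_shares([2, 4, 4, 5, 6, 7])
```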
2.5. Trustworthy AI (TAI) Assessment
Each paper is evaluated against three dimensions of the Assessment List for Trustworthy AI (ALTAI) [22]: (i) Privacy & Data Governance, (ii) Robustness, and (iii) Transparency.
For each dimension, coverage is coded at paper level as present or absent on the basis of title and abstract only, consistent with the conservative lower-bound rule (Section 2.3). The BEMS-ML-specific operationalisation and abstract-level classifying signals are detailed in Table 2.
For each paper, the three binary indicators are aggregated into a four-level ALTAI coverage class. The four classes are: No coverage (0 dimensions), Partial coverage (1 dimension), Multiple coverage (2 dimensions), and Full coverage (3 dimensions). In cross-domain visualisation (Section 5), this variable is the vertical axis of the Deployment Readiness Map.
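A minimal sketch of this coding and aggregation step follows. The keyword lists are hypothetical placeholders, since the actual classifying signals are those of Table 2, and the function names are illustrative. A dimension is coded as present only on an explicit positive signal, in line with the lower-bound rule.

```python
# Hypothetical keyword signals; the rubric's real signals are defined in Table 2.
SIGNALS = {
    "privacy": ("federated", "differential privacy", "privacy-preserving"),
    "robustness": ("robust", "distribution shift", "adversarial"),
    "transparency": ("explainab", "interpretab", "xai"),
}

CLASSES = ("No coverage", "Partial coverage", "Multiple coverage", "Full coverage")


def code_tai(abstract: str) -> dict[str, bool]:
    """Binary indicator per ALTAI dimension; absence of a signal is coded False."""
    text = abstract.lower()
    return {dim: any(kw in text for kw in keywords) for dim, keywords in SIGNALS.items()}


def coverage_class(indicators: dict[str, bool]) -> str:
    """Aggregate the three binary indicators into the four-level class."""
    return CLASSES[sum(indicators.values())]


# Illustrative abstract fragment, not taken from the corpus.
example = "A privacy-preserving federated LSTM that is robust to distribution shift."
indicators = code_tai(example)
label = coverage_class(indicators)
```

Substring matching of this kind is deliberately crude. It mirrors the deterministic, signal-based spirit of the rubric, but a real implementation would need the signals listed in Table 2.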
TAI coverage coding is applied to all 61 included papers regardless of ML/non-ML status, as ALTAI dimensions, particularly transparency and robustness, are relevant to any automated decision system deployed in a building context. Non-ML comparators are therefore included in sub-domain TAI counts (Section 4.3).
2.6. EU AI Act Risk Classification
This EU AI Act risk classification is used solely as an analytical tool within the review and does not constitute formal legal advice. Each paper is screened against Article 6 and Annex III of Regulation 2024/1689 [5].
A BEMS system is treated as potentially high-risk when its ML function may qualify as a safety component in critical energy infrastructure (e.g., electricity, gas, or heat). By contrast, supportive or advisory algorithms that do not perform a safety function, and whose failure would not directly endanger infrastructure operation, are treated as non-high-risk. Cases that appear to fall under Annex III(2) [5] are assigned to a conservative high-risk candidacy category.
Systems evaluated only in simulation (TRL < 6) are not confirmed as high-risk, because abstract-level evidence is insufficient for a formal rebuttal analysis under Article 6(3). The review therefore provides a conservative screening of regulatory proximity rather than a full legal compliance assessment.
A three-level analytical scheme is applied consistently across the 61 papers.
High-risk candidacy applies to systems that (i) control HVAC equipment with autonomous setpoint authority in a confirmed critical-infrastructure deployment context and are validated at TRL ≥ 6, or (ii) issue automated demand-response or load-curtailment signals with grid-stability operational effects in a confirmed critical-infrastructure demand-response context.
Borderline candidacy applies when the abstract reports autonomous HVAC setpoint authority or grid-stability demand-response signals (frequency regulation, binding curtailment) but does not confirm or exclude a critical-infrastructure deployment context. In these cases, the assignment records unresolved ambiguity rather than a positive Annex III identification.
Minimal risk covers all remaining papers for which no plausible Annex III trigger is identifiable at abstract level. Risk tier depends primarily on deployment context rather than building typology: the same technical configuration may fall into different categories depending on whether critical-infrastructure qualification is established.
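The three-level scheme can be read as a small decision function. The sketch below is one possible encoding, not a legal test: the parameter names are hypothetical, `critical_infra` takes `True` (confirmed), `False` (excluded), or `None` (unresolved), and the treatment of a confirmed context with TRL < 6 as borderline is an assumption not stated explicitly in the text.

```python
from typing import Optional


def risk_tier(autonomous_hvac: bool, grid_dr_signal: bool,
              critical_infra: Optional[bool], trl: int) -> str:
    """Assign an analytical EU AI Act risk tier from abstract-level signals."""
    # No plausible Annex III trigger at abstract level.
    if not (autonomous_hvac or grid_dr_signal):
        return "minimal risk"
    if critical_infra is True:
        # (ii) grid-stability demand-response signals in a confirmed context, or
        # (i) autonomous setpoint authority in a confirmed context at TRL >= 6.
        if grid_dr_signal or trl >= 6:
            return "high-risk candidacy"
        # Assumption: confirmed context but simulation-only evidence (TRL < 6).
        return "borderline candidacy"
    if critical_infra is None:
        # A trigger is present but the deployment context is unresolved.
        return "borderline candidacy"
    # The critical-infrastructure context is explicitly excluded.
    return "minimal risk"


# Illustrative case: RL-based HVAC control in simulation, context unspecified.
tier = risk_tier(autonomous_hvac=True, grid_dr_signal=False, critical_infra=None, trl=5)
```

The function makes the final point of the scheme concrete: the same technical configuration changes tier only through the `critical_infra` argument.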
Non-ML comparators are excluded from EU AI Act risk-tier classification (Section 4.4), since Regulation 2024/1689 Art. 3(1) [5] does not classify deterministic optimisation algorithms as AI systems unless embedded within a learning pipeline.
4. Cross-Domain Analysis
4.1. ML Taxonomy
The corpus distributes across ML categories as shown in Figure 2: supervised learning accounts for 23 papers (37.7%), reinforcement learning for 26 (42.6%), federated learning for 10 (16.4%), and model predictive control (non-ML) for 2 (3.3%).
Three taxonomic notes guide interpretation. First, FL is treated as a training architecture orthogonal to learning paradigm: conflating the two would obscure the distinction between optimisation logic and training topology. In Figure 2, FL is nonetheless shown as a discrete bar for descriptive comparability and should be read accordingly. Second, the single physics-informed hybrid paper spans both axes and is counted once on each independently. Third, MPC is treated as a non-ML baseline: it is excluded from ML taxonomy counts and sub-domain breakdowns, consistent with the definitional boundary of Regulation 2024/1689 Art. 3(1) [5], which does not classify deterministic optimisation algorithms as AI systems unless embedded within a learning pipeline.
Figure 2 breaks down ML technique categories across the three BEMS sub-domains, with bars showing the percentage of papers in each sub-domain assigned to each ML category and absolute counts (n) on the bars. Supervised learning encompasses LSTM-based, Transformer, and other supervised architectures. Consistent with the architectural distinction above, FL is displayed as a separate category and excluded from supervised counts, while MPC (non-ML) is included as a comparator baseline in Figure 2 and in the TRL distribution analysis (Section 4.2).
Load forecasting and energy monitoring (sub-domain A) is dominated by supervised learning, primarily LSTM-based and Transformer architectures (n = 15, 75%), with a notable FL component (n = 4, 20%) driven by privacy-preserving NILM systems, and one RL paper (n = 1, 5%). HVAC control (sub-domain B), the most control-critical segment of the corpus given the autonomous setpoint authority of its dominant RL configurations, is overwhelmingly RL-driven (n = 20, 83%), with one MPC paper retained as a non-ML baseline, three supervised papers (n = 3, 13%), and no FL adoption. Demand response and flexibility (sub-domain C) presents the most balanced profile: supervised (n = 5, 29%), RL (n = 5, 29%), and FL (n = 6, 35%) are nearly co-equal, with one additional MPC baseline paper (n = 1, 6%), reflecting the structural requirement to train on distributed metering data without centralising raw traces.
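As a consistency check on the taxonomy, the sketch below rebuilds the cross-tabulation. The per-cell counts are inferred here from the reported percentages and the sub-domain sizes (20, 24, and 17 papers) rather than read from Figure 2, so they should be treated as an assumption. Under that assumption, the column totals match the corpus-level figures of 23, 26, 10, and 2 papers.

```python
# Counts per sub-domain, inferred from the reported percentages (assumption).
COUNTS = {
    "A": {"Supervised": 15, "RL": 1, "FL": 4, "MPC (non-ML)": 0},
    "B": {"Supervised": 3, "RL": 20, "FL": 0, "MPC (non-ML)": 1},
    "C": {"Supervised": 5, "RL": 5, "FL": 6, "MPC (non-ML)": 1},
}


def row_shares(row: dict[str, int]) -> dict[str, int]:
    """Within-sub-domain percentages, rounded half-up to whole numbers."""
    total = sum(row.values())
    # int(x + 0.5) rounds half-up; built-in round() would turn 12.5 into 12.
    return {cat: int(100 * n / total + 0.5) for cat, n in row.items()}


def column_totals(counts: dict[str, dict[str, int]]) -> dict[str, int]:
    """Corpus-level count per ML category."""
    totals: dict[str, int] = {}
    for row in counts.values():
        for cat, n in row.items():
            totals[cat] = totals.get(cat, 0) + n
    return totals


totals = column_totals(COUNTS)
corpus_size = sum(totals.values())
corpus_shares = {cat: round(100 * n / corpus_size, 1) for cat, n in totals.items()}
```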
4.2. TRL Landscape Across Sub-Domains
Table 6 disaggregates the cross-domain TRL counts by sub-domain. Across the three sub-domains, the TRL distribution is skewed toward the development band. Five papers (8.2%) fall in the research stage (TRL 1–3), 55 (90.2%) in the development stage (TRL 4–6), and only one (1.6%) reaches the demonstration stage (TRL 7–9); no source documents a multi-site production deployment at TRL 8–9.
The aggregate picture masks marked sub-domain-level heterogeneity. Load forecasting shows the narrowest spread: all 20 papers fall within TRL 4–6, with a bimodal distribution concentrated at TRL 4 and TRL 6. HVAC control covers the widest range (TRL 2–7): of 24 papers, 20 are at TRL 4–5, 3 at TRL 1–3, and 1 at TRL 7. Demand response presents a split profile, with 2 papers at TRL 1–3 and 15 at TRL 4–6, and no study advancing to pilot deployment.
The temporal dimension of this distribution is examined in Figure 3. No upward shift toward the Demonstration band is visible across any year of the 2020–2026 window. The apparent decrease in 2026 reflects partial corpus observability: records indexed after March 2026 are not represented in the sample. The interpretation of these cross-domain and temporal patterns is developed in Section 6.1.
Figure 4 compares median TRL across ML and training categories. MPC is retained as a reference group in the TRL distribution analysis. A Kruskal–Wallis test indicates significant differences across categories: RL yields a median TRL of 4.0 (n = 26), non-FL supervised learning 4.0 (n = 23), MPC 6.5 (n = 2), and FL 6.0 (n = 10). The parity between RL and supervised learning reflects a shared Development-band ceiling: RL-based HVAC papers are constrained by simulation environments (EnergyPlus, Sinergym), while non-FL supervised approaches rely on public benchmarks (UK-DALE, REFIT), both without live-data validation. Among ML categories, FL yields the highest median (6.0), above both RL and supervised learning (both 4.0); MPC reaches 6.5 as a non-ML reference, but with only n = 2 papers this value is not interpretable as a distributional estimate. FL’s higher median relative to RL and supervised learning reflects a different constraint: because privacy-sensitive applications require training on real metering data, FL studies more often enter deployment contexts consistent with TRL 6. Overall, deployment context, more than algorithmic complexity, is associated with TRL advancement in the corpus.
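For reference, the Kruskal–Wallis H statistic can be computed from pooled ranks as in the sketch below, which uses average ranks for ties and the standard tie correction. The TRL values are invented for illustration and are not the corpus data. In practice `scipy.stats.kruskal` returns both the statistic and the p-value; the pure-Python version here returns H only, which would then be compared against a chi-square distribution with k − 1 degrees of freedom.

```python
from collections import Counter
from statistics import median


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(groups: dict[str, list[float]]) -> float:
    """Tie-corrected Kruskal-Wallis H statistic."""
    pooled = [v for g in groups.values() for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups.values():
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)
    correction = 1.0 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    # If every observation is identical the statistic is undefined.
    return h / correction if correction > 0 else float("nan")


# Invented TRL values per category, for illustration only.
illustrative = {
    "RL": [4, 4, 5, 5, 4, 3],
    "Supervised": [4, 4, 6, 4, 5],
    "FL": [6, 6, 5, 6],
}
medians = {name: median(vals) for name, vals in illustrative.items()}
h_stat = kruskal_h(illustrative)
```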
4.3. TAI Coverage Summary
Figure 5 shows an asymmetric pattern across ALTAI dimensions and sub-domains. For comparison across dimensions and sub-domains, binary TAI indicators are aggregated within each sub-domain and recoded into three gap levels according to the number of papers addressing each ALTAI dimension: HIGH gap, MEDIUM gap, and LOW gap.
The heatmap indicates that Privacy & Data Governance reaches the only LOW-gap result in demand response (9/17), robustness remains at MEDIUM gap in load forecasting (7/20) and HVAC control (5/24) while it is absent in demand response (0/17), and transparency is the weakest dimension overall with only 3 of 61 papers across the corpus.
4.4. EU AI Act Risk Distribution
Figure 6 shows the three-level EU AI Act distribution across the corpus. Applying the analytical scheme introduced in Section 2 across all 61 papers, the high-risk candidacy category is empty (no abstract confirms a critical-infrastructure deployment context sufficient to trigger Annex III obligations at abstract level), borderline candidacy captures 23 papers (37.7%; 20 RL-based HVAC control papers with autonomous setpoint authority and 3 demand-response papers issuing automated curtailment signals), and the remaining 38 papers (62.3%) fall in the minimal-risk category.
The same 23 borderline-candidacy papers are concentrated in HVAC control and demand response, while all 20 load-forecasting papers fall in the minimal-risk category.
5. Deployment Readiness Map
Figure 7 maps each of the 61 papers as a single marker in a categorical multivariate scatterplot combining five encoding channels: TRL level (x-axis), ALTAI coverage class (y-axis), sub-domain (marker shape), ALTAI dimension(s) addressed (fill/hatch pattern), and EU AI Act risk tier (marker colour: green = minimal risk, orange = borderline candidacy, red = high-risk candidacy). Row totals appear in the right margin.
Of the 61 papers, 36 (59.0%) cluster at TRL 4–6 with No coverage (zero ALTAI dimensions). One Research-band paper [66] reports both robustness and transparency (Multiple coverage), a profile not observed in Development-band studies.
Borderline-candidacy papers are concentrated in the Development band: 19 of 23 fall at TRL 4–6, and none reaches the Demonstration band. No paper falls into the high-risk candidacy category at abstract level. The single TRL 7 paper [43] remains minimal risk because its MPC-based architecture is outside the EU AI Act definition of an AI system (Art. 3(1)) unless embedded in an ML pipeline.
Along the vertical axis, TAI coverage does not increase with maturity: TRL 6 papers do not show stronger concentration in higher ALTAI classes than TRL 4 papers. The hatch patterns confirm this reading: transparency markers (horizontal fill) appear in only three papers across the entire scatterplot, none of them in the Development band above the Partial coverage row. The colour encoding similarly places most borderline-candidacy papers in the No coverage and Partial coverage rows of the Development band.
No paper occupies the upper-right region (Demonstration with Multiple or Full coverage), the target profile for deployment-ready and governance-documented BEMS systems. Taken together, TAI class and risk-tier colour make the double readiness gap immediately visible: limited documented TAI provisions coexist with a substantial borderline-candidacy cohort nearing the Annex III compliance horizon without publication-level evidence of preparation.
This multi-channel encoding enables direct cross-reading of deployment maturity, TAI coverage, and regulatory exposure, and provides a practical framework for future literature updates.
6. Discussion
All findings in this section refer to the stratified analytic sample of 61 papers extracted from 614 screened records; they should therefore be read as properties of the analytic corpus rather than as prevalence estimates for the entire screened literature.
6.1. RQ1—Deployment Maturity: The TRL Ceiling Problem
The TRL distribution reported in Table 6 directly addresses RQ1: with 90.2% of papers at the development stage and only one documented pilot at TRL 7, the corpus remains below verified production deployment. This development-band ceiling appears across sub-domains with different algorithmic profiles and evaluation traditions.
The field has consolidated effective development-stage practices, but the institutional and technical conditions needed for verified production deployment remain largely unmet. The primary barriers differ by sub-domain. For HVAC control, simulation-to-real transfer barriers help explain why 20 of the 24 papers, most of them RL-based, remain confined to the development band. For load forecasting and demand response, by contrast, the ceiling reflects the absence of standardised operational commissioning protocols and the difficulty of obtaining long-term live operational data, rather than algorithmic immaturity. Across all three sub-domains, the scarcity of open-source implementations limits independent TRL verification and may hinder the community-level coordination needed to move systems from offline evaluation to sustained deployment. These barriers are compounded by the TAI and regulatory gaps documented in previous sections: systems that do not document robustness under distribution shift (RQ2) or engage with EU AI Act obligations (RQ3) may face additional obstacles to deployment beyond TRL advancement alone.
Three mechanisms remain plausible contributors to this lack of temporal progression rather than demonstrated causes.
1. Infrastructure lock-in: the absence of standardised multi-building testbeds with open live-inference APIs creates a structural ceiling at TRL 6 for all three sub-domains, because TRL 7 assignments require documented KPIs under live conditions.
2. Simulation-to-real transfer barriers in RL: safe exploration, distribution shift, absent formal safety guarantees, and opaque reward design help explain why simulation-validated RL papers rarely advance to pilot deployment. This pattern is consistent with the domain-transfer fragility documented in RL-based physical control systems more broadly [15,64], with no RL-based HVAC paper in the charted corpus advancing beyond TRL 5 after 2020.
3. Regulatory uncertainty as a deployment inhibitor: the EU AI Act timeline to August 2026–2027 [5], combined with the 23 borderline-candidacy papers in Section 4.4, suggests that limited explicit compliance engagement may reduce incentives to invest in operational testbed access.
These mechanisms should therefore be read as co-determining and plausible, not conclusively causal.
6.2. RQ2—Trustworthy AI Gap
The structural interpretation of the TRL ceiling is mirrored by the TAI coverage pattern, which is uneven across dimensions and sub-domains.
Privacy and data governance show a differentiated, but still incomplete, profile. In demand response, federated aggregation reduces direct data exposure; however, formal differential privacy guarantees appear in only 1 of the 6 FL papers, and no source reports a third-party privacy audit. Architectural intent and verifiable compliance therefore remain distinct in this evidence base. In load forecasting, coverage remains moderate (4/20), yet many NILM-oriented applications rely on fine-grained household metering traces, where data-governance implications can remain material even when privacy safeguards are not explicitly documented at abstract level. In HVAC control, privacy coverage is absent (0/24), broadly consistent with the use of aggregated zone signals rather than personal data.
Robustness is somewhat better represented overall, but it remains uneven across sub-domains. In practice, the mapped robustness signals are mainly technical reliability proxies under experimental conditions (for example benchmark generalisation and distribution-shift sensitivity), especially in load forecasting and HVAC abstracts. This evidence is informative for model behavior, but it does not yet amount to full operational assurance under sustained deployment conditions.
Transparency remains the most structurally absent dimension, with no sub-domain reaching MEDIUM-gap coverage and only 3 of 61 papers (4.9%) addressing it [36,43,66]. In BEMS applications, transparency includes both operator-facing interpretability and the auditability of automated decisions; however, where transparency appears in the mapped corpus, the evidence is often model-internal (for example attention-related cues) rather than operator-facing explanation, traceability support, or audit-ready reporting. This distinction is formalised in the XAI taxonomy of Arrieta et al. [84] between post-hoc model diagnostics and actionable decision explanations for end users. This gap is particularly notable in sub-domain B, where 22 of 24 papers report no operator-facing explanation and none documents a formal specification of the reward function’s objective weights. The distinction matters in BEMS settings, where human operators need interpretable rationale for control actions beyond internal model diagnostics.
This gap carries direct regulatory implications under the revised Energy Performance of Buildings Directive [2], which mandates that Building Automation and Control Systems (BACS) installed in non-residential buildings above a threshold capacity include technical documentation and audit trail capabilities consistent with operator-facing transparency, precisely the dimension most structurally absent from the mapped corpus.
Under the abstract-level charting protocol, the observed asymmetry should be interpreted as a reporting profile of the charted literature rather than as a definitive statement on full-system implementation depth. The current evidence therefore supports a clear gap diagnosis while leaving open how much additional TAI evidence may emerge under full-text assessment. The abstract-level charting protocol deliberately yields conservative lower-bound estimates of TAI coverage, reducing the risk of false positives in compliance gap identification (Section 2). A targeted micro-validation, detailed in Section 6.4, confirms that the gap diagnosis holds even after accounting for the estimated under-reporting bias: transparency remains the weakest dimension across sub-domains under both conservative and upper-bound estimates.
6.3. RQ3—Regulatory Readiness: EU AI Act Engagement Gap
The limited and asymmetric TAI coverage becomes more salient when considered in relation to the EU AI Act, especially for systems whose deployment context may fall near Annex III thresholds. Within the abstract-level charting protocol, no included source explicitly refers to the EU AI Act, its risk categories, or the ALTAI framework, including the six deployment-adjacent papers involving real buildings or live metering data [43,67,68,70,71,74]. This pattern holds across the 22 papers published after August 2024, when the Act entered into force.
Crossing the risk-category distribution against TRL band sharpens this picture: of the 23 borderline-candidacy papers, 19 fall within the Development band (TRL 4–6), 4 in the Research band (TRL 1–3), and none has crossed into the Demonstration band. Regulatory exposure is therefore not an artefact of early-stage research activity, but it tracks the development ceiling documented in Section 4, in line with the structural mechanisms identified under RQ1.
The evidence supports a regulatory engagement gap at publication level. The same inferential caveat identified under RQ1 applies here: the co-occurrence of regulatory exposure and deployment stalling is consistent with a causal link but does not establish it under the current abstract-level evidence base.
The compliance timeline of Regulation 2024/1689 [5] increases the practical urgency of this finding. Under Article 113 [5], prohibited AI practices became applicable in February 2025. For high-risk AI systems under Annex III, the category most relevant to the borderline-candidacy cohort identified here, obligations apply from August 2026 [5]. For the 19 borderline-candidacy papers currently at TRL 4–6, this is an immediate horizon, not a distant one. Systems moving toward pilot deployment in the next 12–18 months would need to meet conformity-assessment requirements before commissioning, including technical documentation (Article 11 [5]), logging and traceability (Article 12 [5]), and transparency measures (Article 13 [5]). The absence of any publication-level engagement with these obligations, including among the six papers involving real buildings or live metering data, suggests that this compliance horizon is not yet integrated into the research design cycle of BEMS-ML studies. This is a structural gap, distinct from but compounded by the TRL ceiling documented under RQ1. Regulatory readiness requires proactive design choices, such as logging architecture, reward-function documentation, and data-governance frameworks, that cannot be retrofitted at the deployment stage without significant re-engineering cost.
6.4. Limitations
These regulatory considerations must nonetheless be interpreted within the methodological boundaries of the review. Four methodological constraints shape the scope of inference of this review.
First, the analytical corpus comprises 61 papers drawn from Scopus and IEEE Xplore through a stratified quota design. Databases such as ACM Digital Library and Web of Science were not searched, and the findings should therefore be read as properties of this analytic sample rather than as prevalence estimates for the full screened population.
Second, TRL and TAI assignments were performed by a single rater using a deterministic rubric; inter-rater validation was not conducted, which leaves residual uncertainty, particularly at the TRL 5–6 boundary and in EU AI Act risk-tier classification. The rubric is nevertheless designed to minimise subjectivity by tying each assignment to observable binary abstract signals: for example, a named co-simulation environment indicates TRL 5, live sensor KPI values indicate TRL 6, and an operational pilot with a named building indicates TRL 7 (Table 1). This rule-based design reduces interpretive leeway in ambiguous cases, although it does not replace replicated inter-rater validation. Cohen’s kappa on a replicated subset therefore remains an appropriate extension for future work.
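The suggested extension could be computed as in the following sketch of Cohen’s kappa for two raters assigning categorical labels, such as TRL levels, to the same papers. No second rating exists in the review, so the ratings shown are hypothetical, as is the function name.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical TRL assignments by two raters on six papers.
kappa = cohens_kappa([4, 5, 6, 6, 7, 4], [4, 5, 5, 6, 7, 4])
```

Because TRL is ordinal, a weighted kappa that penalises a TRL 5 versus 6 disagreement less than a TRL 4 versus 7 disagreement may be a better fit; the unweighted form is shown for simplicity.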
Third, as established by the conservative lower-bound protocol (
Section 2.3), all charting relied on title and abstract only, so reported coverage figures remain conservative lower-bound estimates. Following PRISMA-ScR Item 12 [
18], this interpretation reduces false positives in compliance gap identification but systematically understates the true TAI coverage of the mapped systems.
To estimate the conservative bias, a targeted micro-validation recoded the full texts of 12 papers (19.7% of the corpus) against the same TAI rubric applied to abstracts. In 9 of the 12 validated papers (75%), at least one TAI dimension absent from the abstract was present in the full text (robustness: 6/12; transparency: 4/12; privacy: 1/12). If the same under-reporting rates held across the corpus, aggregate coverage figures would rise to an estimated upper bound of 29.5% for robustness, 11.5% for transparency, and 23.0% for privacy, indicative of the magnitude of abstract-level underestimation rather than revised prevalence estimates. For TRL, two papers show deployment-context signals in the full text consistent with an upgrade from TRL 4 to TRL 6.
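The sample share and per-dimension discovery rates quoted for the micro-validation follow directly from the reported counts, as the short Python check below shows. The variable names are illustrative, and the sketch deliberately does not attempt to reproduce the extrapolated upper bounds, since the extrapolation formula is not spelled out in this section.

```python
CORPUS_SIZE = 61        # papers in the analytic corpus
VALIDATED_PAPERS = 12   # papers recoded against the full text
PAPERS_WITH_NEW_DIMENSION = 9

# Papers in which a dimension absent from the abstract appeared in the full text.
NEWLY_FOUND = {"robustness": 6, "transparency": 4, "privacy": 1}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


sample_share = pct(VALIDATED_PAPERS, CORPUS_SIZE)
any_dimension_rate = pct(PAPERS_WITH_NEW_DIMENSION, VALIDATED_PAPERS)
per_dimension_rate = {dim: pct(k, VALIDATED_PAPERS) for dim, k in NEWLY_FOUND.items()}

if __name__ == "__main__":
    print(f"Validated share of corpus: {sample_share}%")
    print(f"Papers with at least one newly found dimension: {any_dimension_rate}%")
    for dim, rate in per_dimension_rate.items():
        print(f"{dim}: {rate}% of validated papers")
```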
Fourth, the EU AI Act risk classification applied here is an analytical framework intended for descriptive research purposes; it does not constitute a legal compliance assessment and should be treated as indicative, subject to revision as legal interpretations and regulatory guidance evolve.
7. Conclusions
This scoping review mapped 61 peer-reviewed papers on ML for BEMS (2020–2026) against a three-axis analytical framework: Technology Readiness Level, ALTAI-derived Trustworthy AI dimensions, and EU AI Act risk proximity. The aggregate pattern is one of persistent imbalance. Strong methodological output has not translated into verified production deployment, documented trustworthy-AI provisions, or explicit regulatory engagement. In the analytic corpus, 90.2% of papers occupy the Development band (TRL 4–6), with no source crossing into multi-site production deployment; transparency (the ALTAI dimension most directly tied to operator accountability) is addressed in only 3 of 61 papers; and EU AI Act engagement is absent from all publication years, including the 22 papers that appeared after the Regulation entered into force in August 2024. Overall, these figures indicate a field-wide pattern rather than a sub-domain artefact: the double readiness gap characterises the publication-level evidence across algorithmic paradigms and evaluation traditions alike. From a sustainability standpoint, this imbalance matters because the energy and carbon savings promised by ML-based BEMS remain largely unverified under real operating conditions.
Taken together, these findings suggest that the current ML-BEMS literature is advancing along a technically capable but deployment-constrained trajectory. The core finding is therefore an accountability ceiling rather than a performance ceiling: the evidentiary record does not yet demonstrate that algorithmic sophistication has been matched by operational validation, governance documentation, or regulatory awareness.
The cross-domain comparison further shows that this gap is expressed differently across sub-domains. Load forecasting and energy monitoring appear methodologically mature but governance-light; HVAC control is the most operationally consequential area yet remains strongly constrained by simulation-to-real transfer; demand response shows the strongest privacy orientation, but still limited robustness and transparency reporting. These asymmetries suggest that future progress is unlikely to come from a single technical improvement alone, and will instead require sub-domain-specific advances in testbed access, deployment reporting, assurance practice, and compliance-aware system design.
A further contribution of this review is methodological. By combining TRL assignment, ALTAI-based coding, and EU AI Act proximity screening into a single Deployment Readiness Map, the paper provides a reusable framework for monitoring how the ML-BEMS field evolves beyond proof-of-feasibility claims toward more operationally and institutionally mature forms of evidence. The framework is intended as a comparative literature-mapping instrument rather than a substitute for full technical audit or legal assessment, but it can support future reviews and longitudinal updates of the field. In this sense, the Deployment Readiness Map is also a sustainability monitoring instrument: it tracks how AI technologies progress toward verifiable contributions to building energy efficiency.
These conclusions should nevertheless be interpreted within the limits of the study design. The findings refer to a stratified analytic sample of 61 papers drawn from 614 screened records, and the primary coding protocol relies on titles and abstracts only, with conservative lower-bound coding (
Section 2.3) in ambiguous cases. Accordingly, the reported percentages should be read as properties of the mapped publication-level evidence rather than as prevalence estimates for the full screened literature or as definitive claims about the underlying deployed systems. This limitation is especially relevant for TAI coverage and regulatory-readiness interpretation, where some system-level provisions may not be visible in abstracts even when they exist in the full text.
On this basis, three priorities emerge for future work.
First, the field would benefit from more operationally explicit reporting, including clearer statements on deployment context, commissioning conditions, live performance horizons, and whether inference occurs in closed-loop building operation.
Second, TAI reporting should move beyond isolated references to privacy-preserving architectures or generic explainability claims, toward more verifiable documentation of robustness testing, operator-facing transparency, auditability, and data-governance mechanisms.
Third, as AI governance requirements begin to affect deployment environments more directly, future BEMS studies would benefit from reporting practices that make regulatory context legible without overstating legal status. Concretely, researchers should improve publication-level reporting on deployment context, TRL evidence, and application-relevant TAI dimensions. Industry should focus on the transition from TRL 6 to TRL 7 through access to operational testbeds, live-condition KPI documentation, and early regulatory pre-screening. Regulators and policymakers should clarify when ML-enabled BEMS configurations move from research and development settings into use contexts where AI Act obligations become practically relevant, especially in borderline cases in HVAC control and demand response.
Overall, the review indicates that ML for BEMS is constrained less by modelling capability or methodological sophistication than by the weaker connection between technical development, operational validation, and governance-ready documentation. Whether the field closes this double readiness gap will determine whether it moves from laboratory-grade results toward deployment that is operationally validated and governance-documented in real building and energy-system contexts, and whether it can deliver its expected contribution to building decarbonisation and Sustainable Development Goal 7 (SDG 7).