Article

LoopRAG: A Closed-Loop Multi-Agent RAG Framework for Interactive Semantic Question Answering in Smart Buildings

Shanghai Advanced Research Institute, University of Chinese Academy of Sciences, Shanghai 201210, China
* Author to whom correspondence should be addressed.
Buildings 2026, 16(1), 196; https://doi.org/10.3390/buildings16010196
Submission received: 27 November 2025 / Revised: 22 December 2025 / Accepted: 25 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue AI in Construction: Automation, Optimization, and Safety)

Abstract

With smart buildings being widely adopted in urban digital transformation, interactive semantic question answering (QA) systems serve as a crucial bridge between user intent and environmental response. However, they still face substantial challenges in semantic understanding and dynamic reasoning. Most existing systems rely on static frameworks built upon Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), which suffer from rigid prompt design, breakdowns in multi-step reasoning, and inaccurate generation. To tackle these issues, we propose LoopRAG, a multi-agent RAG architecture that incorporates a Plan–Do–Check–Act (PDCA) closed-loop optimization mechanism. The architecture formulates a dynamic QA pipeline across four stages: task parsing, knowledge extraction, quality evaluation, and policy feedback, and further introduces a semantics-driven prompt reconfiguration algorithm and a heterogeneous knowledge fusion module. These components strengthen multi-source information handling and adaptive reasoning. Experiments on HotpotQA, MultiHop-RAG, and an in-house building QA dataset demonstrate that LoopRAG significantly outperforms conventional RAG systems in key metrics, including context recall of 90%, response relevance of 72%, and answer accuracy of 88%. The results indicate strong robustness and cross-task generalization. This work offers both theoretical foundations and an engineering pathway for constructing trustworthy and scalable semantic QA interaction systems in smart building settings.

1. Introduction

With the continued expansion of smart urban infrastructure, intelligent buildings have become a central component of next-generation urban ecosystems [1]. By integrating key technologies, including the Internet of Things (IoT), sensor networks, edge computing, and Building Information Modeling (BIM), building systems are increasingly endowed with environmental sensing, user-intent understanding, and autonomous control. This shift highlights the strong potential of intelligent buildings in energy management, environmental comfort, and personalized services. Nevertheless, within these highly complex perception–decision systems, delivering natural, semantically rich, and accurately responsive human–computer interaction—particularly efficient question answering (QA) across multi-space and multitask settings—remains a major research challenge.
Although current systems commonly employ speech recognition, semantic tag matching, and gesture detection to enhance user experience, clear limitations persist in processing complex natural-language instructions, uncovering deeper user intent, and capturing contextual dependencies [2]. For instance, when users express vague requests such as “create a more focused working environment” or “the room feels a bit stuffy,” systems often fail to infer the intended task and its contextual cues, leading to inappropriate responses. These issues become especially evident in intelligent building QA scenarios that involve multi-turn dialogue, dynamic task chains, and cross-scenario transitions.
In recent years, large language models (LLMs) have opened new avenues for addressing these challenges. Built on Transformer architectures and pretrained on massive corpora, representative models such as the Generative Pre-trained Transformer (GPT) family [3], Large Language Model Meta AI (LLaMA) [4], and Gemini have demonstrated strong performance in understanding natural language, generating dialog, and reasoning. In the building domain, previous work has investigated LLMs for design generation, construction document analysis, and fault-diagnosis QA systems [5,6], indicating their promise for cross-modal semantic understanding and human–computer interaction. However, LLMs remain closed-form text generators without real-time environmental grounding. They may produce hallucinated outputs that undermine controllability and safety, and they offer limited support for user-level personalization. As a result, when dealing with core intelligent building tasks—such as multimodal QA, multi-turn context tracking, and energy-state control—LLMs face notable stability and reliability constraints, which limit their broader use in semantic QA interaction settings [7].
To mitigate these limitations, the retrieval-augmented generation (RAG) framework [8] has been proposed as a hybrid paradigm that couples external knowledge retrieval with language generation. By inserting a semantic retriever before generation, RAG supplements the model with external knowledge sources (e.g., device-state data, user interaction logs, and BIM ontologies), substantially improving factual grounding, contextual adaptability, and controllability. RAG has already been applied successfully in medical diagnosis, legal consultation, and preservation of cultural heritage [9,10], demonstrating robust cross-domain transfer and semantic generalization.
In intelligent building research, RAG has also attracted growing attention. Zhang et al. employed GPT-4 to detect heating, ventilation, and air conditioning (HVAC) energy anomalies and optimize system responses using data-driven strategies [6]. Xu et al. combined RAG with knowledge graphs (KGs) to develop a semantic QA system for the “Nanjing Yunjin” cultural heritage, enabling domain-specific semantic explanation [11]. Yan et al. proposed a ReAct-based agent framework that integrates LLMs with IoT systems for the regulation of the indoor environment, offering early evidence of the practical feasibility of coupling RAG with multi-agent systems [12]. Despite these advances, three fundamental bottlenecks remain, as illustrated in Figure 1. First, dynamic reasoning adjustment is lacking: most systems rely on a static, single-pass “retrieve–generate” pattern, making iterative refinement and contextual memory difficult [2,8]. Second, adaptive prompt design is insufficient: fixed prompt templates cannot be updated in real time as scenarios or user intent evolve, causing semantic drift and interaction breakdowns [13,14]. Third, heterogeneous knowledge fusion is not robust: intelligent building QA must unify structured BIM knowledge with unstructured sources such as text and logs, yet current systems lack unified semantic alignment and multi-hop reasoning, which constrains reasoning depth and interpretability [15,16].
To address these issues, we propose LoopRAG, a closed-loop multi-agent optimization architecture for semantic QA interaction in intelligent buildings. LoopRAG incorporates the Plan–Do–Check–Act (PDCA) closed-loop optimization mechanism and uses multi-agent collaboration to build a dynamic, adaptive QA workflow. It enables cyclic interactions among task decomposition, knowledge retrieval, quality evaluation, and policy feedback. In this way, LoopRAG advances traditional static RAG into a closed-loop cognitive system with self-optimization and feedback-driven learning. LoopRAG adopts a PDCA-inspired multi-agent design in which Plan, Do, Check, and Act are explicitly decoupled at the engineering semantic level, rather than performing self-correction only at the text generation stage as in most existing frameworks. By separately operating on intent modeling, evidence selection, and generation strategies, the system can identify the source of deviations and apply targeted corrections. In addition, quality evaluation and strategy updating are handled by distinct agents, improving interpretability and controllability. This task-internal, bounded closed-loop design enables more stable and practical reasoning in complex smart building scenarios.
Specifically, our main contributions are threefold:
  • We present a PDCA-inspired multi-agent RAG architecture that introduces quality-control principles into semantic QA, endowing the system with self-perception, self-evaluation, and self-correction, thus maintaining stability and controllability under multi-turn dialogue and dynamic scenarios.
  • We develop a dynamic prompt reconfiguration algorithm that combines semantics-driven guidance with Monte Carlo sampling; by steering prompt reconstruction using task goals and user intent, it alleviates semantic drift induced by fixed templates.
  • We design a heterogeneous knowledge fusion module that integrates knowledge-graph reasoning, semantic vector enhancement, and context compression to achieve unified semantic modeling of structured and unstructured knowledge, thereby improving information coverage, content consistency, and reasoning depth.
The remainder of this paper is organized as follows: Section 2 reviews related literature and theoretical foundations; Section 3 introduces the system architecture and key modules of LoopRAG; Section 4 presents the experimental design and performance evaluation; Section 5 discusses the experimental results and implications; and Section 6 concludes the paper and outlines future research directions.

2. Related Work

Semantic question-answering interaction systems for smart buildings emerge from the tight integration of building engineering knowledge, IoT operational data, and advanced natural language processing. Unlike open-domain QA, building-oriented QA exhibits strong engineering characteristics: questions typically center on space–component–device chains, and answers must be not only semantically correct but also physically/functionally consistent with building structures and executable for operation and maintenance. Meanwhile, supporting evidence is distributed across heterogeneous sources—BIM/ontologies, device manuals, O&M records, real-time sensor logs, and industry standards—whose representations and granularities vary substantially [1,5,17,18]. Accordingly, prior studies have progressed in a staged manner from building semantic-interaction needs, to the capability boundaries of LLMs, to RAG-based retrieval enhancement, further toward multi-stage or multi-agent closed-loop control, and finally to adaptive prompting and graph-structured knowledge fusion. We review related advances from six perspectives: (i) domain characteristics and requirements of smart building QA, (ii) the suitability of LLMs and RAG, (iii) multi-stage closed-loop reasoning, (iv) multi-agent collaboration, (v) dynamic prompt construction, and (vi) heterogeneous graph knowledge fusion. Based on this review, we identify the research gap addressed by LoopRAG.

2.1. Domain Characteristics and System Requirements of Smart Building Semantic Question Answering

Natural language interaction is especially important in smart building scenarios such as O&M, energy management, comfort regulation, and fault diagnosis [1,18,19]. Users often express requests through perceptual and intent-driven utterances (e.g., “the room feels stuffy,” “meeting mode should be more focused,” “the corridor on this floor is too cold”), which implicitly encode multiple targets, constraints, and multi-stage task chains. Typical building QA follows a “diagnose–attribute–recommend” chain: the system must first identify the relevant spaces/devices, then trace causes using sensor states and historical logs, and finally output actionable control strategies according to standards or manuals [17,19]. This workflow inherently requires cross-entity, multi-hop reasoning supported by auditable evidence chains.
Regarding knowledge representations, BIM/facility-management studies characterize building knowledge around a “component–space–system topology,” implying strong structural dependencies [5,17]. IoT interaction frameworks emphasize the temporal coherence of “sensing–event–control” and the functional constraints embedded in device chains [20]. Together, these strands indicate that smart building QA must map natural language into executable semantic plans over device/space chains, rather than relying on surface-level text matching [5,17].
Drawing on intelligent interaction and BIM-QA research, three core challenges for smart building semantic QA can be summarized as follows:
(1) Knowledge heterogeneity and inconsistent granularity: structured BIM/ontology/device graphs coexist with unstructured manuals, logs, and standards, making retrieval and semantic alignment difficult [5,17,18];
(2) Structural dependence and chain-style multi-hop reasoning: queries often involve device chains, spatial coupling, and control dependencies, requiring cross-entity, multi-step reasoning to reach structurally valid and explainable conclusions [17,21,22];
(3) Dynamic context and interaction continuity: building states evolve over time, and user intent shifts across multi-turn dialogue, demanding in-task correction and adaptive control strategies [1,20].
These properties imply that building QA must go beyond single-pass retrieval or one-shot generation and instead adopt a controllable reasoning process over task chains [1,17].

2.2. Application Potential and Boundaries of LLM in Smart Building Question Answering

Transformer-based LLMs provide strong general capabilities in semantic understanding, dialogue generation, and complex reasoning, offering a powerful cognitive engine for smart building interaction [3,4,23]. Prior work has explored LLMs for building document analysis, requirement understanding, and O&M recommendation, highlighting their promise for cross-modal semantic integration and explanation generation [5,17].
However, the high-stakes engineering nature of building tasks magnifies LLM limitations. First, LLMs rely on static parametric knowledge without real-time environmental grounding, which can yield responses inconsistent with actual operating conditions in control-related QA [7]. Second, long-horizon reasoning may suffer from weak compositional generalization and logical drift, leading to off-target answers or missing steps in multi-turn interactions [24]. Third, hallucinated recommendations in HVAC, energy-management, or fault-diagnosis settings can introduce safety risks [7]. Hence, smart building QA typically augments LLMs with external, traceable knowledge and explicit process constraints to improve factual grounding and engineering controllability [1,7].

2.3. Retrieval-Augmented Generation Paradigm for Smart Buildings

RAG enhances LLMs by retrieving external knowledge prior to generation, and has become a mainstream approach for smart building QA [7,8,25,26]. Its standard pipeline—semantic indexing, Top-K retrieval, context aggregation, and generative answering—has been shown to improve factual consistency in knowledge-intensive NLP tasks [7,8]. Follow-up studies introduce proactive or iterative retrieval–generate strategies to handle evidence sparsity and retrieval noise in multi-hop QA, for example, by triggering additional retrieval during generation to broaden evidence coverage, or by iterating retrieval and generation to stabilize chain reasoning [27,28].
Nevertheless, RAG remains insufficient for building contexts. First, many systems still adopt a single-pass “retrieve–generate” paradigm; for chain tasks such as “alarm localization–cause tracing–standard constraints–strategy output,” retrieval noise and generation bias can accumulate without internal correction [7,8,28]. Second, the strong structural dependencies of building knowledge mean that purely text-based vector retrieval cannot guarantee physically/functionally valid chains; semantically similar yet structurally invalid evidence may cause structure-level hallucinations [15,17,29]. Thus, smart building RAG requires task-chain-oriented dynamic reasoning control together with structured-knowledge constraints [8,17].

2.4. Multi-Stage Reasoning, Reflective Error Correction, and In-Task Closed-Loop RAG

To address static RAG limitations, recent studies model complex QA as explicit multi-stage reasoning and incorporate reflective correction during generation [30,31]. Multi-step reasoning prompts decompose a question into verifiable sub-questions to increase interpretability, while reflective RAG unifies “retrieve–generate–self-critique/revise” into an iterative in-task loop to reduce hallucinations and incomplete reasoning [32,33,34,35].
These advances suggest that in-task feedback loops are more effective for robust multi-hop reasoning than merely enlarging retrieval [28]. However, current reflective RAG largely focuses on self-checking generated text and lacks (i) explicit intent-state modeling tailored to building task chains, (ii) attributable coordination between retrieval and generation, and (iii) dedicated assessment of structural consistency and engineering constraints [7]. As a result, it is difficult to obtain a convergent closed-loop reasoning path in smart building QA [1,17].

2.5. Multi-Agent Collaborative RAG and Building Task Control

Multi-agent collaboration offers another path toward process-level control. By assigning roles and enabling coordinated dialogue, multi-agent systems can separately handle planning, retrieval, verification, and summarization, improving controllability and interpretability for complex QA [36,37,38]. Their “plan–execute–verify” organization aligns naturally with chain-style building QA, where retrieval paths, evidence chains, and control recommendations must be checked to avoid structural errors and non-executable policies [1,17,19].
Yet existing multi-agent frameworks are mostly designed for general tasks and are not fully aligned with building semantics. On the one hand, they lack customized role division for task chains centered on “space–device chains–standard constraints.” On the other hand, they rarely incorporate PDCA-like closed-loop quality-control logic, making it hard to ensure semantic convergence and stable correction across iterative reasoning [36,37,38]. Therefore, smart building QA still needs a unified design that couples multi-agent role division with a convergent closed-loop quality mechanism [1,17].

2.6. Prompt Adaptivity and Dynamic Construction Mechanisms

Prompts constitute the key control interface for guiding LLM/RAG behavior. Parameter-efficient prompt learning and automated prompt construction can reduce domain-adaptation costs in tasks with stable semantic structures while keeping model parameters frozen [13,14,39,40,41]. However, prompt effectiveness depends critically on task–prompt alignment; once a prompt deviates from the true intent, both retrieval and generation may drift [42,43].
In smart building contexts, user inputs often involve vague perceptual descriptions, implicit constraints, and cross-space references, making static prompt templates particularly prone to semantic drift [1,19]. Although multi-step or reflective prompting can expand reasoning, such methods frequently rely on fixed decomposition patterns and lack robust exploration of expression variants or low-variance initialization for ambiguous building inputs [30,31]. Hence, smart building QA requires a dynamic prompt generation strategy that can automatically reconfigure prompts according to task structure and semantic uncertainty, stabilizing task expression before retrieval and closed-loop reasoning [1,42].

2.7. Research Gap and Positioning of This Paper

Overall, prior research has provided a rich toolkit for smart building semantic QA, including LLMs, RAG, reflective correction, multi-agent coordination, prompt learning, and graph augmentation. However, in engineering-ready building QA, several specific gaps remain. First, RAG in building QA is often single-pass or weakly feedback-driven, lacking lightweight in-task correction chains. Second, existing iterative/reflective RAG targets general domains; its feedback typically stays at the level of self-evaluating generated text, making it difficult to trace and locally fix concrete deviations in intent parsing, evidence selection, or answer formatting. Consequently, retrieval noise and semantic drift can accumulate in multi-hop or multi-turn building tasks. Third, prompts are not sufficiently stabilized under ambiguous building inputs, weakening retrieval specificity. Building instructions frequently encode implicit constraints and perceptual cues. Static templates or generic dynamic prompting may drift semantically, narrowing or misdirecting retrieval and degrading subsequent answering. Current prompt learning emphasizes parameter-efficient adaptation, rather than reducing prompt expression variance before the task begins. Finally, general graph-augmented RAG provides limited structural-consistency constraints for heterogeneous building knowledge. Although Graph/KG-RAG works well on open text graphs, evidence in building QA often mixes BIM topology, device chains, and O&M text/logs. Existing methods lack unified alignment and filtering tailored to building-entity dependencies, allowing context to include evidence that is structurally related but semantically weak, or semantically related but structurally invalid.
To fill these gaps, we propose LoopRAG, an in-task closed-loop optimized RAG architecture for smart building semantic QA. LoopRAG introduces an auditable “plan–execute–check–revise” workflow within a single QA instance via PDCA-style multi-agent role division, reducing error accumulation from retrieval and generation. It further combines a one-shot Monte Carlo Prompt Optimization (MCPO) strategy for prompt stabilization with a lightweight heterogeneous knowledge-fusion module to improve evidence coverage and structural consistency. Notably, this work focuses on task-internal optimization and feedback control rather than long-term online learning or a fully automated global multi-agent ecosystem. The goal is to provide a more robust, controllable, and engineering-friendly path for improving smart building QA.

3. Methodology

Semantic QA in smart buildings is not merely a “retrieve–generate” pipeline; it is a reasoning process that must preserve semantic consistency amid noisy knowledge, ambiguous intent, and dynamic contexts. Conventional RAG models typically execute all steps in a single forward pass, without internal mechanisms to correct misunderstandings, retrieval errors, or generation hallucinations. Consequently, answer quality relies heavily on the initial prompt and Top-K retrieval results, making robustness difficult to sustain in complex scenarios.
LoopRAG addresses this limitation by organizing each QA task as a closed loop in which planning, execution, evaluation, and correction are iterated within the task (Section 3.2). Within this closed-loop framework, we further incorporate two mechanisms to enhance adjustability and knowledge adaptation:
(1) Adaptive prompt optimization, which treats prompts as evolvable structures and improves alignment between generation instructions and task requirements via semantic decomposition and evaluation-driven refinement;
(2) Heterogeneous knowledge fusion, which integrates textual vector semantics with structured building knowledge so that retrieved evidence is both more relevant and more useful for reasoning.
Accordingly, LoopRAG does not expand model capacity. Instead, it achieves stable and interpretable semantic reasoning within a single QA task through the synergy of closed-loop control, prompt evolution, and knowledge fusion. Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5 detail the theoretical modeling of these components and their coordinated workflow.

3.1. Theoretical Foundation: Semantic Closed-Loop Model

In intelligent building settings, semantic QA must simultaneously handle linguistic ambiguity, heterogeneous multi-source knowledge, and a dynamically changing environment. User queries are rarely complete, one-shot commands; rather, true intent often emerges through multi-turn interaction. Meanwhile, relevant evidence may be distributed across BIM models, device documentation, monitoring data, and control logs. If QA is treated only as a static input–output mapping, it becomes difficult to explain how semantic deviations are identified and corrected during reasoning, or how evidence selection and generation strategies evolve over time. We therefore model smart building semantic QA as a semantic dynamical system with internal feedback, providing a unified theoretical basis for the ensuing multi-agent architecture.
In this semantic dynamical system, the internal state at interaction round t is defined as
S_t = \{\, q_t,\ k_t,\ p_t \,\},
where q_t denotes the system’s current representation of user intent, including task type, target entities, and constraints; k_t denotes the knowledge state available for reasoning, i.e., an evidence set retrieved and filtered from the global knowledge base K; and p_t captures generation-time strategy configurations such as prompt structure, reasoning expansion patterns, and output format requirements. Given S_t and user input u_t, the system produces an answer formalized as
y_t = f_{\theta}(u_t, S_t),
where f_θ is the reasoning mapping determined jointly by model parameters and system strategies. Traditional RAG methods typically terminate at this point, treating y_t as the final output without further internal assessment. In contrast, we regard y_t as an intermediate element in the semantic closed loop. Specifically, y_t is passed to an internal quality evaluation mapping
e_t = \Phi_{\mathrm{eval}}(y_t, k_t),
yielding a feedback signal e_t. This signal summarizes answer performance in semantic relevance, evidence consistency, and contextual coherence, effectively serving as a semantic residual for the current reasoning state. When the evaluation reveals deviations from user intent or supporting evidence, the system uses e_t to adjust subsequent reasoning rather than accepting the output as final. The adjustment is carried out by a state update operator Φ that performs constrained, adaptive refinements of the reasoning state within the semantic closed loop. Φ takes as input the current state S_t = {q_t, k_t, p_t}, the evaluation feedback e_t, and the global knowledge base K. Rather than performing a full reset, it applies localized refinements to the user-intent representation, the selected knowledge evidence, and the generation strategy configuration. When the evaluation reveals semantic deviations, Φ reconstructs task sub-goals, reorders or expands the retrieved evidence, and adjusts prompts and reasoning strategies accordingly, thereby reducing semantic errors in subsequent iterations. Through this feedback-driven state update, question answering is transformed from one-shot forward inference into a dynamically convergent reasoning process. The update is expressed by the state update function
S_{t+1} = \Phi(S_t, e_t, K),
which jointly updates q_t, k_t, and p_t under the constraint of the global knowledge base K. For instance, if the evaluation indicates inconsistency between the generated answer and retrieved evidence, k_t can be reordered or augmented; if the answer does not sufficiently address user-relevant sub-questions, q_t and p_t can be reorganized to reshape task structure and prompting. Through the “generate–evaluate–update” cycle, QA shifts from a one-shot forward pass to a dynamical process that progressively converges around the semantic state.
Under this formulation, the optimization goal is no longer direct fitting to a presumed “gold answer,” but control over the temporal decay of semantic error. The loss function L(·) compresses the feedback signal e_t produced by the internal evaluation mapping into an optimizable scalar cost that characterizes the degree of semantic deviation under the current reasoning state. In practice, L(e_t) is composed of multiple weighted sub-terms that respectively measure the semantic-relevance deviation between the generated response and the user intent, the consistency deviation between the response content and the retrieved evidence, and the coherence deviation across multi-turn contexts; these sub-terms can be approximated via semantic similarity scores, evidence-coverage metrics, or consistency discrimination models. By minimizing L(e_t), the system does not aim for pointwise matching with a fixed “gold” answer, but instead constrains the semantic error to decay gradually over the closed-loop iterations, guiding the state update function toward stable convergence across intent understanding, knowledge selection, and generation strategy. Letting L(·) map e_t to a scalar cost, the global objective is
\min_{\Theta}\ \mathbb{E}\big[\, L(e_t) \,\big],
where Θ comprises the parameters of the reasoning mapping f_θ, the evaluation mapping Φ_eval, and the state update function Φ. Ideally, as interaction proceeds, the system satisfies
\lim_{t \to \infty} e_t = 0,
meaning that the semantic deviations reflected by internal evaluation diminish and S_t converges to a region where intent understanding, evidence selection, and generation strategies are mutually consistent. This convergence criterion offers a testable standard for closed-loop QA: stable answering behavior indicates that a semantic-level dynamic equilibrium has been reached under the current knowledge base and task distribution.
By explicitly introducing state representations, internal evaluation, and feedback-driven updates, the semantic closed-loop model turns an implicit reasoning process into an analyzable and controllable dynamical system and provides a unified framework for multi-agent role assignment and interaction logic in LoopRAG.
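To make the formalism concrete, the following minimal Python sketch mirrors the generate–evaluate–update cycle above. The callables reason, evaluate, and update_state are illustrative placeholders for f_θ, Φ_eval, and Φ, and the scalar residual stands in for the feedback signal e_t; this is a sketch under those assumptions, not the implementation released with the paper.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SemanticState:
    """Internal state S_t = {q_t, k_t, p_t} of the semantic closed loop."""
    intent: Dict      # q_t: task type, target entities, constraints
    evidence: List    # k_t: evidence retrieved and filtered from K
    strategy: Dict    # p_t: prompt structure, reasoning and format settings

def closed_loop_answer(user_input: str,
                       state: SemanticState,
                       knowledge_base,
                       reason: Callable,         # stands in for f_theta
                       evaluate: Callable,       # stands in for Phi_eval
                       update_state: Callable,   # stands in for Phi
                       max_rounds: int = 3,
                       tolerance: float = 0.05):
    """Run the generate-evaluate-update cycle until the semantic residual
    is small enough or the round budget is exhausted."""
    answer = None
    for _ in range(max_rounds):
        answer = reason(user_input, state)            # y_t = f_theta(u_t, S_t)
        residual = evaluate(answer, state.evidence)   # e_t = Phi_eval(y_t, k_t)
        if residual <= tolerance:                     # deviation has decayed
            break
        state = update_state(state, residual, knowledge_base)  # S_{t+1}
    return answer, state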

3.2. PDCA Multi-Agent Collaborative Architecture

Building on the semantic closed-loop model in Section 3.1, LoopRAG extends semantic QA from a single forward-generation pass to an adjustable dynamic system with internal feedback. To make the state update function
S_{t+1} = \Phi(S_t, e_t, K)
operational in real reasoning, we instantiate the four key mappings of the semantic dynamical system—semantic planning Φ_P, knowledge execution Φ_D, quality evaluation Φ_C, and policy updating Φ_A—as four coordinated agents: Plan, Do, Check, and Act. These agents are coupled through the state variables q_t, k_t, and p_t and the deviation signal e_t, so that each QA instance follows a within-task closed-loop trajectory of “plan–execute–evaluate–correct.” Figure 2 summarizes the end-to-end workflow of this multi-agent architecture. Below, we detail the functions of the four agents and their correspondence to the semantic closed-loop model.

3.2.1. Plan Agent: Intent Modeling and Structured Task Planning

The Plan Agent corresponds to the planning function in the semantic loop,
z_t = \Phi_P(u_t;\ \theta_P),
and is responsible for converting user input u_t from unstructured language into an executable semantic plan, while initializing the intent state q_t and strategy configuration p_t for subsequent reasoning. Upon receiving u_t, the Plan Agent performs semantic parsing to identify the task type, key entities, implicit constraints, and potential sub-goals. It then builds a lightweight task graph that decomposes complex requirements into executable semantic subtasks. This graph both delineates the knowledge scope for retrieval and specifies the logical levels that generation should cover. Next, the Plan Agent derives p_t from the task graph, including prompt structure, output-format requirements, and necessary reasoning constraints. These settings enforce clear semantic boundaries and controllable reasoning depth during execution. Finally, the planned structure z_t, together with q_t and p_t, is passed to the Do Agent as the input for the next stage of the loop.
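As a rough illustration only, the sketch below shows one possible shape for the plan output z_t, the intent state q_t, and the strategy configuration p_t. The field names and the hard-coded example query are assumptions made for exposition, not the schema used by the system, and the LLM-assisted parsing step is omitted.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SemanticPlan:
    """Executable semantic plan z_t produced by the Plan Agent (illustrative)."""
    task_type: str              # e.g. "fault_diagnosis" or "device_control"
    entities: List[str]         # core building entities mentioned or implied
    constraints: Dict[str, str] # implicit constraints (zone, comfort goal, ...)
    subtasks: List[str]         # lightweight task graph as an ordered list

def plan_stage(user_input: str):
    """Toy planner for one example query; parsing of user_input is omitted here."""
    z = SemanticPlan(
        task_type="fault_diagnosis",
        entities=["fresh-air unit", "filter", "CO2 sensor"],
        constraints={"zone": "meeting room, 3F", "goal": "restore air quality"},
        subtasks=["locate affected devices",
                  "trace causes from sensor logs",
                  "recommend an executable control action"],
    )
    q = {"task_type": z.task_type, "entities": z.entities,
         "constraints": z.constraints}                        # intent state q_t
    p = {"output_format": "cause + evidence + recommended action",
         "max_reasoning_depth": 2}                            # strategy p_t
    return z, q, p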

3.2.2. Do Agent: Evidence Retrieval, Integration, and Initial Generation

The Do Agent realizes the execution mapping in the closed-loop model,
(k_t,\ y_t) = \Phi_D(z_t, K, p_t),
and retrieves task-relevant evidence from the global knowledge base K under the semantic plan provided by the Plan Agent, assembling it into contextual input for the language model. Concretely, the Do Agent queries semantic-vector or rule-based indices using the entities, relations, and constraints encoded in z_t, recalls candidate evidence, and reorders/filters it according to plan objectives to form k_t. Because knowledge fusion is modeled separately in Section 3.4, the Do Agent here emphasizes executability and controllability: it retrieves evidence aligned with task needs and constructs context according to p_t.
Afterward, it assembles the task description, retrieved text snippets, and prompt configuration p_t into a model input so that generation proceeds within explicit semantic bounds. The language model then produces a candidate answer y_t. This answer is not final; it is an intermediate loop state that will be audited by the Check Agent in the next stage. By explicitly separating task planning from evidence retrieval, the Do Agent provides a dependable entity-level basis for the loop, enabling later stages to detect and correct deviations.
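A hedged sketch of the Do stage under the plan structure assumed above: retrieve and rerank evidence, assemble a bounded context, and produce a candidate answer. The retrieve and generate callables are placeholders for the vector/graph index and the language-model call, and the candidate record fields are illustrative.

def do_stage(plan, strategy, retrieve, generate, top_k: int = 5):
    """Illustrative Do stage: build k_t and a candidate answer y_t from plan z_t."""
    # Query the index with the entities and constraints encoded in the plan.
    candidates = retrieve(terms=plan.entities, constraints=plan.constraints)
    # Reorder/filter so the context stays aligned with the plan objectives.
    evidence = sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]
    # Assemble task description, evidence snippets, and prompt settings p_t.
    context = "\n".join(item["text"] for item in evidence)
    prompt = (f"Task: {plan.task_type}\n"
              f"Subtasks: {'; '.join(plan.subtasks)}\n"
              f"Evidence:\n{context}\n"
              f"Answer format: {strategy['output_format']}")
    candidate_answer = generate(prompt)   # y_t, audited by the Check Agent next
    return evidence, candidate_answer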

3.2.3. Check Agent: Evaluation of Retrieval and Answer Quality

Within the PDCA loop, the Check Agent converts model outputs into quantifiable deviation signals. Its function is formalized as
e_t = \Phi_C(y_t, k_t, q_t),
where y_t is the candidate answer generated by the Do Agent, k_t is the retrieved evidence, and q_t is the intent representation from the Plan Agent. The Check Agent evaluates deviations along three dimensions—semantic alignment, knowledge faithfulness, and satisfaction of task constraints—and constructs an error signal for feedback control.
First, it encodes y_t and q_t into a shared embedding space via a semantic encoder and computes the semantic-alignment deviation as
\mathrm{Align}(y_t, q_t) = 1 - \mathrm{CosSim}\big( E(y_t),\ E(q_t) \big),
where E(·) denotes the embedding function. This term measures how far the answer diverges from the intended task; for example, in device-regulation queries, failing to capture the operational target specified by Plan lowers the cosine similarity and thus increases the alignment deviation.
Second, to assess faithfulness to retrieved knowledge, the Check Agent introduces a knowledge-deviation term based on evidence coverage. Let δ_i ∈ {0, 1} mark truly relevant items within the Top-K retrieval. The knowledge deviation is expressed as
\mathrm{Faith}(y_t, k_t) = 1 - \frac{1}{K} \sum_{i=1}^{K} \delta_i .
A larger value indicates that the answer contains unsupported claims or hallucinated content. In practice, δ_i is determined via an automated evidence–answer attribution process: a retrieved item is marked as truly relevant if it can semantically support or verify at least one factual assertion in the generated answer, as determined by entailment-based or similarity-based alignment with a predefined threshold. This definition emphasizes answer-grounded relevance rather than topical similarity, enabling Faith to reflect evidence coverage for the generated content.
Third, the Check Agent verifies that y_t respects the task constraints set by Plan, such as required fields, output order, length, and token limits. Let Ψ(y_t, p_t) denote the degree to which y_t satisfies p_t; the format deviation is
\mathrm{Constraint}(y_t, p_t) = 1 - \Psi(y_t, p_t),
and increases when mandatory fields are missing, logical steps are violated, or length constraints are exceeded.
To unify these deviations, the Check Agent aggregates them via a weighted loss
L_t = \alpha\,\mathrm{Align}(y_t, q_t) + \beta\,\mathrm{Faith}(y_t, k_t) + \gamma\,\mathrm{Constraint}(y_t, p_t),
where α, β, and γ are scenario-dependent weights. We set α, β, and γ following three principles: task-priority-driven configuration, experience-based initialization, and stability validation. In terms of configuration, the three weights correspond to semantic alignment, knowledge faithfulness, and constraint satisfaction, whose relative importance varies with task type. In smart building semantic QA, we treat knowledge faithfulness (β) as a fundamental safety constraint and typically assign it a weight no lower than the others, preventing generated content from deviating from device states or control logic; semantic alignment (α) ensures that responses cover the user’s true intent and therefore takes higher priority in multi-turn interaction and task-decomposition scenarios; constraint satisfaction (γ) mainly affects output format and execution controllability, so its weight is relatively smaller and is increased only for tasks involving structured outputs or control instructions. In practice, we adopt scenario-specific weight templates (e.g., information retrieval, device control, fault diagnosis) rather than learning weights per sample, avoiding additional instability. In terms of the balancing mechanism, the weights are not used for backpropagation to optimize model parameters; they serve only for deviation attribution and for deciding the direction of strategy updates. Consequently, the system is insensitive to the absolute weight values and relies mainly on the relative magnitudes of the three loss components. In implementation, each sub-term is normalized so that their numerical ranges remain consistent, reducing the impact of scale differences on weight selection. For device-control scenarios, we empirically set (α, β, γ) = (0.3, 0.5, 0.2), prioritizing knowledge faithfulness over semantic alignment and format constraints. A sensitivity analysis with ±20% perturbations shows stable convergence behavior, indicating robustness to the weight selection. The Check Agent then packages both the loss and its gradient into the feedback output
e_t = \big( L_t,\ \nabla L_t \big),
providing the Act Agent with interpretable signals for correction. This structured deviation representation makes quality assessment not merely a static verdict but an actionable, propagatable basis for stable closed-loop optimization, enabling LoopRAG to perform systematic self-checking and reflection.
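The sketch below shows how the three deviation terms and the weighted loss L_t could be computed. The embed, supports, and format_score arguments are assumed helper callables standing in for the semantic encoder E(·), the evidence–answer attribution test behind δ_i, and the constraint measure Ψ(y_t, p_t); the default weights follow the device-control template (0.3, 0.5, 0.2) reported above. This is a minimal sketch, not the system’s evaluator.

import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def check_stage(answer, intent, evidence, strategy,
                embed, supports, format_score,
                weights=(0.3, 0.5, 0.2)):
    """Compute the deviation signal e_t = (L_t, per-term contributions)."""
    alpha, beta, gamma = weights

    # Semantic alignment deviation: 1 - CosSim(E(y_t), E(q_t)).
    align = 1.0 - _cosine(embed(answer), embed(intent))

    # Knowledge faithfulness deviation: 1 - (1/K) * sum(delta_i).
    delta = [1 if supports(item, answer) else 0 for item in evidence]
    faith = 1.0 - sum(delta) / max(len(delta), 1)

    # Constraint deviation: 1 - Psi(y_t, p_t).
    constraint = 1.0 - format_score(answer, strategy)

    loss = alpha * align + beta * faith + gamma * constraint
    contributions = {"align": alpha * align,
                     "faith": beta * faith,
                     "constraint": gamma * constraint}
    return loss, contributions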

3.2.4. Act Agent: Attribution Analysis and Optimization Feedback

The Act Agent sits at the end of the PDCA loop. It maps the deviation signal e_t from Check into corrective updates for the Plan and Do stages, thereby realizing
S_{t+1} = \Phi(S_t, e_t, K),
where S_t = {q_t, k_t, p_t} denotes the current intent state, retrieval state, and generation strategy. The Act Agent performs deviation attribution, strategy update, and iterative convergence so that multi-round reasoning progressively approaches both user intent and factual evidence.
During attribution, Act first identifies the dominant source of L_t. If the deviation is dominated by Align(y_t, q_t), the Plan’s intent representation is likely unclear or under-decomposed, requiring refined entity identification or deeper task structuring. If the deviation is dominated by Faith(y_t, k_t), Do’s retrieval scope or policy is insufficient, suggesting expansion of retrieval, adjustment of recall thresholds, or enhanced entity expansion over the knowledge graph. If the deviation concentrates on Constraint(y_t, p_t), generation formatting, output constraints, or the reasoning-chain structure should be revised. Formally, attribution is written as
\theta_t^{*} = \arg\max_{\theta \in \{ q_t,\, k_t,\, p_t \}} \left| \frac{\partial L_t}{\partial \theta} \right|,
meaning the deviation is chiefly attributed to the state variable yielding the largest loss gradient. In practical implementation, when computing L_t, the Check Agent retains the numerical contributions of each sub-loss term (Align, Faith, Constraint) and interprets them as first-order influence strengths on the corresponding state dimensions: the semantic-alignment term primarily affects the intent representation q_t, the knowledge-faithfulness term mainly acts on the retrieval state k_t, and the constraint-deviation term mainly influences the generation policy p_t. As a result, ∂L_t/∂θ is approximated in engineering practice via normalized sub-loss magnitudes or their discrete variations, rather than through backpropagation-based differentiation. This procedure is equivalent to performing a finite-difference-style sensitivity analysis in the semantic state space to determine which category of state adjustment is most likely to yield a loss reduction. Such surrogate gradients are used solely for deviation attribution and strategy selection, rather than for model-parameter training, thereby preserving interpretability while avoiding the forced embedding of discrete semantic states into a continuous optimization framework.
Act then updates the state according to the deviation type. The intent update is
q_{t+1} = q_t - \eta_t \frac{\partial L_t}{\partial q_t},
making the new intent representation clearer and more executable. Retrieval is updated as
k_{t+1} = \Phi_D(q_{t+1}, K),
strengthening evidence acquisition aligned with refined intent. Generation strategies and prompting constraints are updated by
p_{t+1} = p_t - \eta_t \frac{\partial L_t}{\partial p_t},
improving reasoning logic, format consistency, and behavioral boundaries of the model.
By feeding S_{t+1} back into Plan, Act completes an explainable internal feedback chain, turning LoopRAG into a controllable multi-round reasoning system with progressive convergence. Unlike conventional RAG, which terminates after a single pass, this closed-loop structure performs iterative semantic convergence within a task, transforming “intent–retrieval–generation–evaluation” into a dynamically optimizable process in which each iteration moves closer to semantically accurate, evidence-grounded, and rigorously formatted outputs.
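A minimal sketch of the attribution-and-update logic, assuming the Check Agent returns the per-term contributions from the earlier sketch. The refine_intent, expand_retrieval, and adjust_strategy callables are hypothetical helpers for the localized refinements of q_t, k_t, and p_t described above.

def act_stage(state, loss, contributions, knowledge_base,
              refine_intent, expand_retrieval, adjust_strategy):
    """Illustrative Act stage: attribute the dominant deviation and apply a
    targeted, local correction to q_t, k_t, or p_t before the next round."""
    # Surrogate attribution: the largest normalized sub-loss contribution
    # plays the role of argmax_theta |dL_t / d(theta)|.
    dominant = max(contributions, key=contributions.get)

    if dominant == "align":       # intent unclear or under-decomposed -> refine q_t
        state.intent = refine_intent(state.intent, loss)
    elif dominant == "faith":     # retrieval scope/policy insufficient -> rebuild k_t
        state.evidence = expand_retrieval(state.intent, knowledge_base)
    else:                         # formatting or reasoning-chain drift -> adjust p_t
        state.strategy = adjust_strategy(state.strategy, loss)
    return state                  # S_{t+1}, fed back into the Plan/Do stages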

3.3. Adaptive Prompt Optimization Mechanism Based on Monte Carlo Methods

In the Plan Agent’s task-planning stage, the prompt not only conveys semantics but also directly governs the controllability of retrieval and generation. Natural-language inputs in smart building settings are often characterized by vague perceptual expressions, missing context, and ambiguous references to target entities. As a result, an unrefined prompt may exhibit high variance in semantic space, which in turn destabilizes downstream reasoning. To mitigate such noise-induced instability, LoopRAG embeds a Monte Carlo Prompt Optimization (MCPO) module within the Plan Agent. Its goal is to produce, before the PDCA loop begins, a semantically stable, structurally well-formed, and executable task expression.

3.3.1. Semantic Modeling Perspective of Prompt Optimization

Given user input u_t, the Plan Agent constructs a prompt p that minimizes downstream reasoning error in task space. Because the semantic landscape of the prompt space P is highly non-convex, directly solving for a globally optimal expression is difficult. We therefore model prompt optimization as a one-shot stochastic search over a latent semantic neighborhood:
p^{*} = \arg\max_{p \in N(u_t)} Q(p),
where N(u_t) denotes an expression neighborhood centered on the input semantics, and Q(·) is a prompt quality function measuring task executability, semantic clarity, and structural conformity. MCPO does not aim to find a global optimum. Instead, it selects from N(u_t) the prompt that is most interpretable for the task and most compatible with subsequent modules, thereby improving the initial conditions of the PDCA process.

3.3.2. Monte Carlo-Based Semantic Neighborhood Exploration

To construct N(u_t), we apply a lightweight Monte Carlo procedure that performs controlled perturbations on the input. Let {T_i} be a set of perturbation operators, including semantically equivalent rewrites, syntactic reordering, focus adjustment, and explicit articulation of constraints. MCPO samples
p_k = T_{i_k}(u_t), \quad k = 1, \dots, K,
which has two properties:
  • Semantic preservation: perturbations operate on surface form while keeping core meaning unchanged;
  • Structural diversity: different operators yield prompts with varied structures, allowing the Plan Agent to explore improved task formulations.
The perturbation operators {T_i} are designed as constrained, semantics-preserving textual transformation functions, whose objective is to introduce structural and expressive diversity without altering the core task semantics. For example, focus adjustment explicitly foregrounds or backgrounds key sub-goals (e.g., “prioritize energy efficiency” or “emphasize ventilation control”), thereby changing the salience ordering of information in the prompt; semantically equivalent rewriting rules perform meaning-preserving substitutions and syntactic reformulations of the original input based on predefined prompt templates or controlled LLM rewriting instructions, while keeping the task actions and target entities unchanged. As illustrated in Figure 3 (left), the four-stage cycle—input parsing → local perturbation → semantic validation → format regularization—constitutes a semantic processing chain that can be viewed as a one-shot Monte Carlo sampling over the prompt space.
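As a sketch of what a small operator set {T_i} and the one-shot sampling p_k = T_{i_k}(u_t) might look like, the example below uses hard-coded string rewrites. The paper itself relies on prompt templates and controlled LLM rewriting instructions, so the operators shown here are purely illustrative assumptions.

import random

def emphasize_subgoal(text: str) -> str:
    # Focus adjustment: explicitly foreground an implicit sub-goal.
    return text + " Prioritize restoring indoor air quality."

def make_entities_explicit(text: str) -> str:
    # Explicit articulation of targets/constraints implied by the input.
    return "Regarding the fresh-air system and its filter: " + text

def reorder_as_request(text: str) -> str:
    # Syntactic reordering into an explicit task request.
    return "Diagnose the cause and recommend an action: " + text

OPERATORS = [emphasize_subgoal, make_entities_explicit, reorder_as_request]

def sample_prompts(user_input: str, k: int = 4, seed: int = 0):
    """One-shot Monte Carlo sampling: p_k = T_{i_k}(u_t), k = 1..K."""
    rng = random.Random(seed)
    return [rng.choice(OPERATORS)(user_input) for _ in range(k)]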

3.3.3. Prompt Quality Function and Selection Criteria

To select the best prompt from the candidates {p_k}, the Plan Agent defines an internal quality function Q(p) that is computed independently of model gradients. It evaluates prompts along three dimensions:
Q(p) = \lambda_1\, C_{\mathrm{sem}}(p) + \lambda_2\, C_{\mathrm{str}}(p) + \lambda_3\, C_{\mathrm{exec}}(p),
where semantic clarity C_sem measures how explicitly the prompt states task actions, target objects, and constraints; structural conformity C_str assesses whether the prompt satisfies the system-defined format; and executability C_exec evaluates how readily the prompt can be parsed by downstream retrieval and generation. In practice, the three components of Q(p) are obtained by combining computable heuristic indicators with lightweight evaluation, rather than end-to-end training. Specifically, semantic clarity C_sem(p) is computed by detecting whether the prompt explicitly includes task actions, target entities, and constraint conditions, for example via semantic parsing or keyword/slot coverage, to measure whether the task intent is fully expressed; structural compliance C_str(p) is realized through rule checking or template matching to determine whether the prompt satisfies system-defined structural requirements, such as field completeness and ordering constraints; and executability C_exec(p) evaluates how easily downstream retrieval and generation modules can correctly parse and utilize the prompt, approximated via model-based scoring. All three indicators are normalized so that Q(p) provides stable and comparable scales across candidate prompts. The prompt templates and Python code used in this study are available in the EAGLE repository (see Supplementary Materials) and correspond to the repository state at the time of the experiments. The monotonicity and local stability of Q(p) ensure that the chosen prompt lies in a low-risk region of task semantics under finite Monte Carlo samples. Finally, the Plan Agent selects
p^{*} = \arg\max_{k} Q(p_k),
and applies structured regularization to form the final task expression passed to the Do Agent.
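A hedged sketch of the quality function Q(p) and the selection step, using simple keyword-coverage and rule checks as the heuristic indicators. The required terms, weights, and the optional model-based scorer are illustrative assumptions rather than the scoring rules used by the system.

def prompt_quality(p: str,
                   required_terms=("fresh-air", "air quality", "recommend"),
                   weights=(0.4, 0.3, 0.3),
                   exec_scorer=None) -> float:
    """Q(p) = lambda1*C_sem + lambda2*C_str + lambda3*C_exec, each in [0, 1]."""
    l1, l2, l3 = weights
    text = p.lower()
    # C_sem: keyword/slot coverage of task actions, targets, and constraints.
    c_sem = sum(term in text for term in required_terms) / len(required_terms)
    # C_str: rule-based structural check (minimum length, terminal punctuation).
    c_str = 1.0 if len(text.split()) >= 5 and text.endswith((".", "?")) else 0.5
    # C_exec: lightweight model-based score of downstream parseability.
    c_exec = exec_scorer(p) if exec_scorer is not None else 0.5
    return l1 * c_sem + l2 * c_str + l3 * c_exec

def select_prompt(candidates):
    """p* = argmax_k Q(p_k) over the sampled candidates."""
    return max(candidates, key=prompt_quality)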

3.3.4. Module Role and Closed-Loop Consistency

MCPO serves as semantic pre-processing within the Plan stage. It runs before the system enters the Do–Check–Act feedback loop, does not depend on external deviation signals, and does not update prompts across iterations. Accordingly, it has the following properties:
  • Non-loop component: participates only in the Plan stage and does not alter the PDCA convergence path;
  • No semantic feedback: uses neither Check’s deviation signal nor Act’s update mechanism;
  • One-shot optimization: Monte Carlo sampling is executed once, without forming a multi-round loop;
  • Stabilized initialization: it improves the quality of the initial state S_1 = {q_1, k_1, p_1} entering the PDCA loop.

3.4. Heterogeneous Knowledge Expansion and Fusion Mechanism for Smart Buildings

Smart building environments are highly heterogeneous. Device manuals are expressed in natural language, sensor and O&M logs are time-dependent, and BIM/IoT systems encode spatial and functional dependencies as structured entities and topological relations. In real QA tasks, user questions often depend on multiple such sources simultaneously. For example, querying “causes of abnormal air quality” may require combining monitoring data, HVAC structure, filter-module records, and maintenance manuals. Pure text vector retrieval struggles to capture these cross-modal relations, whereas graph-only reasoning may overlook crucial descriptive evidence. To address this, LoopRAG introduces a Heterogeneous Knowledge Expansion and Fusion Mechanism (H-KEFM) within the Do Agent to systematically integrate multi-source knowledge, ensuring that retrieved evidence is semantically relevant and structurally consistent with building-system logic.
As shown in Figure 4, the mechanism first ingests multimodal content from distributed knowledge sources, then performs structure-aware expansion around core entities in a knowledge graph. It next scores candidate knowledge blocks along multiple dimensions and finally selects those with the highest reasoning value as input to the language model. The following subsections describe the guiding principles, key steps, and role of this mechanism within the closed loop.

3.4.1. Necessity and Design Principles of Unified Representation

Given the diversity of sources and representations, directly mixing heterogeneous knowledge as model input may cause imbalance, semantic inconsistency, and contextual conflicts. Hence, different knowledge types must be unified before retrieval. LoopRAG adopts a lightweight semantic alignment approach, embedding textual content, graph nodes, and time series into a shared semantic space so that they can be compared fairly during similarity computation, scoring, and ranking. This unified representation follows three principles:
(1) Semantic preservation: descriptive text should retain key entities, states, and operational meanings;
(2) Structural dependence: BIM-graph nodes should retain functional relations among devices and spatial containment relations;
(3) Comparability: all sources must ultimately be scoreable and selectable within the same vector space.
This joint semantic-and-structural representation allows retrieval to reflect both content relevance and the true structural logic of building systems.

3.4.2. Knowledge Graph-Based Related-Entity Expansion

After the Plan Agent identifies core task entities (e.g., “fresh-air system,” “filter,” “electrical control cabinet”), the Do Agent performs a one-step structure-aware expansion centered on those entities in the knowledge graph (KG). The goal is not merely to broaden retrieval scope, but to actively uncover system chains that are tightly coupled to the task. For instance, diagnosing indoor environmental anomalies may require jointly considering related nodes such as temperature/humidity sensors, fresh-air units, valve openings, and filter pressure differentials.
Expansion follows topological relations, functional dependencies, or spatial affiliations, typically within a 1–2 hop neighborhood to balance coverage against over-expansion. The resulting expanded subgraph yields a more complete device chain and functional path, substantially improving evidence coverage for subsequent retrieval. As shown on the left of Figure 4, the algorithm expands outward from a core node along structural relations, progressively collecting knowledge sources likely to be task-relevant. In multi-hop causal reasoning, this expansion reduces retrieval gaps caused by missing structural links.
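A minimal sketch of the 1–2 hop structure-aware expansion using networkx, assuming the building knowledge graph stores one node per device, space, or component connected by topological, functional, and spatial edges; the graph schema is an assumption for illustration.

import networkx as nx

def expand_entities(kg: nx.Graph, core_entities, max_hops: int = 2):
    """Collect the 1-2 hop neighborhood of the core task entities and return
    the expanded subgraph (device chain plus functional/spatial path)."""
    expanded = set(e for e in core_entities if e in kg)
    for entity in list(expanded):
        # Nodes reachable within max_hops along structural relations.
        reachable = nx.single_source_shortest_path_length(kg, entity, cutoff=max_hops)
        expanded.update(reachable)
    return kg.subgraph(expanded)

# Example: if kg links "fresh-air unit" - "filter" - "pressure sensor", expanding
# from ["fresh-air unit"] with max_hops=2 also recovers the filter and the sensor.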

3.4.3. Multi-Dimensional Knowledge Scoring and Fusion Strategy

After expansion, multiple candidate knowledge fragments are obtained, but some may be semantically irrelevant, structurally invalid, or unhelpful for the task. LoopRAG therefore adopts a multi-dimensional scoring scheme that evaluates each candidate along three aspects:
(1) Semantic relevance measures content matching between a candidate and the user query, using text embeddings to assess whether relevant devices, states, or concepts are mentioned.
(2) Structural consistency assesses whether a candidate is topologically close and functionally dependent on the core task entities; candidates far along the device chain or with weak functional ties receive lower scores. This avoids evidence that is semantically similar yet physically/functionally invalid, which can induce structure-level hallucinations.
(3) Task alignment checks whether a candidate supports the Plan Agent’s task structure—for example, whether it provides needed parameters, enables causal explanation, or can serve as an evidence snippet.
The right side of Figure 4 illustrates scoring and ranking. The system then selects the top-scoring fragments, merges them with the task description, and forms a semantically coherent context for this reasoning round.
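The sketch below scores each candidate fragment along the three dimensions and keeps the top fragments. Each candidate is assumed to carry an embedding vector, the graph node it is attached to, and a set of task tags; these conventions, the weights, and the inverse-distance proxy for structural consistency are illustrative assumptions rather than the system’s actual data model.

import networkx as nx
import numpy as np

def score_and_fuse(candidates, query_vec, kg, core_entities, task_needs,
                   weights=(0.4, 0.4, 0.2), top_n: int = 5):
    """Rank candidates by semantic relevance, structural consistency, and
    task alignment, then keep the top_n fragments as context evidence."""
    w_sem, w_str, w_task = weights
    scored = []
    for c in candidates:
        # (1) Semantic relevance: cosine similarity with the query embedding.
        sem = float(np.dot(c["vec"], query_vec) /
                    (np.linalg.norm(c["vec"]) * np.linalg.norm(query_vec) + 1e-9))
        # (2) Structural consistency: inverse graph distance to a core entity.
        dists = [nx.shortest_path_length(kg, c["node"], e)
                 for e in core_entities
                 if c["node"] in kg and e in kg and nx.has_path(kg, c["node"], e)]
        struct = 1.0 / (1.0 + min(dists)) if dists else 0.0
        # (3) Task alignment: overlap with what the plan still needs.
        task = len(set(c["tags"]) & set(task_needs)) / max(len(task_needs), 1)
        scored.append((w_sem * sem + w_str * struct + w_task * task, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]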

3.4.4. Role in the PDCA Closed Loop and Contribution to Stability

Although located within the Do Agent, heterogeneous knowledge expansion and fusion affect not only retrieval but also closed-loop interaction with Check and Act. During checking, the Check Agent audits factual citations and evidence support in the generated answer. If issues such as insufficient evidence, missing chains, or cross-source inconsistency are detected, the feedback signal directly adjusts this module—for instance, by increasing expansion depth, reweighting scoring dimensions, or strengthening structural-consistency constraints.
During correction, the Act Agent adapts expansion policies based on the deviation type. If device dependencies are ignored, structural weights are increased; if required evidence is missing, graph expansion scope is enlarged; if the answer over-relies on minor textual cues, semantic-similarity weights are reduced. Through such feedback, H-KEFM progressively improves retrieval precision and stabilizes semantic convergence over iterative reasoning. Overall, this mechanism enables LoopRAG to obtain complete, credible, and task-aligned evidence in smart building scenarios where knowledge is fragmented and structurally constrained. It is a core component for improving answer quality, reducing hallucinations, and enhancing controllability.

3.5. System Integration and Overall Workflow

Together, the above modules form a task-level closed loop centered on semantic states. The Plan Agent structures user input; the Do Agent retrieves knowledge and generates responses under the plan; the Check Agent audits semantic consistency and constraint satisfaction between answers and evidence; and the Act Agent updates planning policies and retrieval scope based on deviation signals. Through a bounded number of “plan–execute–check–revise” iterations, the system can automatically correct semantic drift, complete missing evidence, and stabilize the generation of interpretable, building-oriented answers within a single QA task. Working in concert with adaptive prompt optimization and heterogeneous knowledge fusion, this closed-loop design grants LoopRAG higher robustness and controllability in complex smart building semantic tasks.
The next section validates LoopRAG on multiple datasets with respect to context recall, relevance, faithfulness, and accuracy.

4. Experimental Results and Analysis

This section evaluates the effectiveness and robustness of LoopRAG in smart building semantic question-answering settings. To ensure that experiments address concrete research objectives rather than merely reporting outcomes, we design and analyze experiments around four research questions (RQs):
RQ1: Under an identical knowledge base and dataset configuration, does LoopRAG outperform mainstream RAG variants overall?
RQ2: Does the PDCA closed-loop feedback mechanism substantially enhance system stability and task completion?
RQ3: To what extent does the Monte Carlo Prompt Optimization (MCPO) mechanism improve semantic alignment and answer quality?
RQ4: Compared with purely vector-based retrieval, does the smart building H-KEFM yield clear gains in retrieval quality and answer trustworthiness?
We first describe the experimental setup, then present comparative results and mechanism-level analyses for RQ1–RQ4, and finally discuss overall findings and limitations.

4.1. Experimental Setup

4.1.1. Datasets and Task Types

To assess LoopRAG across diverse scenarios and difficulty levels, we combine public multi-hop QA benchmarks with an in-house building domain dataset. We select five representative QA datasets spanning academic and industry-oriented tasks, summarized in Table 1.
Building QA. The in-house Building QA dataset is constructed in this work and includes four categories of questions:
  • Factual QA tasks: target clearly defined low-carbon energy knowledge, e.g., “What is the principle of solar photovoltaic power generation?” and “Is wind energy renewable?”, where answers are deterministic facts.
  • Explanatory QA tasks: require causal reasoning and deeper explanation of concepts or technologies, e.g., “Why is nuclear energy important in low-carbon energy?” and “What are the key technologies for bioethanol production?”
  • Comparative QA tasks: probe analytical capability through multi-dimensional comparisons of technologies or policy options, e.g., “How do solar and wind power differ in generation efficiency?” and “What are the pros and cons of nuclear energy versus hydropower?”
  • Predictive QA tasks: involve forecasting and judgment of future trends, e.g., “What is the outlook for hydrogen energy in transportation over the next decade?” and “How will the cost of solar power generation evolve?”
Together, these tasks cover the four most common “query–explain–compare–predict/judge” needs in smart building semantic QA, serving as a core testbed for validating the LoopRAG design.

4.1.2. Knowledge Base and Heterogeneous Data

LoopRAG relies on an integrated heterogeneous knowledge base comprising text, structured graphs, and temporal logs. It includes industry standards, specifications, and technical guidelines; technical manuals and maintenance documents for key equipment; and operational logs of key devices. Before ingestion, all data are cleaned, deduplicated, normalized in format, and processed for entity extraction to preserve semantic consistency during vector indexing and graph construction. The same knowledge base is used for all baselines (Naive RAG, GraphRAG, LightRAG, NodeRAG), ensuring fair comparability.
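As a rough illustration of the ingestion steps named above (cleaning, deduplication, and format normalization), the sketch below shows a minimal text pipeline; entity extraction, vector indexing, and graph construction would follow and are omitted here. The implementation details are assumptions for exposition.

```python
import hashlib
import re

def preprocess(docs):
    """Minimal sketch of knowledge-base ingestion: normalize formatting and drop
    exact duplicates before indexing (entity extraction would follow)."""
    seen, cleaned = set(), []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()                  # normalize whitespace
        digest = hashlib.sha1(text.lower().encode()).hexdigest() # fingerprint for dedup
        if digest in seen:                                       # skip exact duplicates
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```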

4.1.3. Evaluation Metrics

We adopt the open-source evaluation framework ragas to measure output quality from multiple perspectives, including generation quality, semantic matching, and factual grounding; a brief usage sketch follows the list below. The metrics are:
  • Context Recall: whether answers sufficiently leverage retrieved key context.
  • Relevance: semantic alignment between system output and reference answers.
  • Faithfulness/Factual consistency: whether generated content is supported by correct semantics, mitigating hallucinations.
  • Avg Retrieved Tokens: the amount of retrieved context consumed, reflecting dependence on external evidence.
  • Accuracy: whether the final output matches the gold answer.
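For reference, the sketch below shows how the first three metrics can be computed with ragas on a single toy record. The example question, column names, and metric imports are illustrative and follow an earlier ragas release, so the exact API may differ in current versions.

```python
# Toy example of scoring one QA record with ragas; column and metric names follow an
# earlier ragas release (0.1.x) and may differ in newer versions. ragas metrics are
# LLM-judged, so an LLM/embedding backend must be configured (e.g., API credentials).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, answer_relevancy, faithfulness

records = {
    "question":     ["Which sensor drives the meeting-room HVAC setpoint?"],   # hypothetical
    "answer":       ["The VAV-3 zone temperature sensor drives the setpoint."],
    "contexts":     [["VAV-3 serves meeting room 204 and reports zone temperature."]],
    "ground_truth": ["The VAV-3 zone temperature sensor."],
}

result = evaluate(Dataset.from_dict(records),
                  metrics=[context_recall, answer_relevancy, faithfulness])
print(result)   # per-metric scores, e.g. {'context_recall': ..., 'faithfulness': ...}
```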

4.1.4. Experimental Environment and Implementation Details

All experiments are conducted on a local server (Inspur NF5468M6) running Ubuntu 22.04 (kernel 6.8.0-59-generic, x86_64) with NVIDIA CUDA Toolkit 11.8 (nvcc 11.8.89), enabling fully GPU-accelerated inference, and Python 3.12. The LLM backend is a locally deployed Llama 3.1 70B model, which requires no external API calls and provides strong semantic understanding and reasoning capability. Under these unified settings, the subsequent sections perform comparative and ablation studies for RQ1–RQ4, validating LoopRAG's methodological contributions and engineering suitability along four dimensions: overall performance, closed-loop control, prompt optimization, and heterogeneous knowledge modeling.
In practical implementation, the PDCA loop in each QA instance adopts a bounded-iteration mechanism to ensure reasoning stability and controllable computational overhead. Specifically, we set the maximum number of iterations to 3–5 rounds (with 3 rounds as the default in experiments) and combine this with adaptive early-stopping criteria: if the reduction in the composite deviation loss L_t falls below a predefined threshold for two consecutive rounds, or if the Check Agent determines that semantic alignment, knowledge faithfulness, and constraint satisfaction have all reached acceptable levels, the system is considered converged and the loop terminates early; if the maximum number of rounds is reached without satisfying the stopping conditions, the current best result is returned to avoid over-reasoning.
For MCPO, we adopt fixed, small-scale sampling to control computational cost, with the sampling size K typically set to 4–8 (default 6), corresponding to random combinations of different perturbation operators. This configuration achieves a stable balance between performance improvement and computational efficiency and is insensitive to small variations in K.
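The early-stopping rule can be expressed compactly. The threshold and patience values below mirror the settings described above, but the function itself is an illustrative assumption rather than the exact implementation.

```python
def should_stop(loss_history, checks_passed, eps=0.01, patience=2):
    """Illustrative early-stopping rule: stop when the Check Agent reports that all
    quality checks pass, or when the composite deviation loss L_t improves by less
    than `eps` for `patience` consecutive rounds."""
    if checks_passed:
        return True
    if len(loss_history) <= patience:          # not enough rounds to judge stagnation
        return False
    recent = loss_history[-(patience + 1):]    # last patience+1 loss values
    gains = [recent[i] - recent[i + 1] for i in range(patience)]
    return all(g < eps for g in gains)         # loss no longer decreasing meaningfully
```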

4.2. Overall Performance Comparison

Under a unified knowledge base, retrieval configuration, and language model, this section compares the overall performance of LoopRAG against four representative RAG architectures. The baselines span common RAG design paradigms, including a basic pipeline (Naive RAG), structure-enhanced retrieval (GraphRAG), lightweight retrieval–generation (LightRAG), and heterogeneous graph modeling (NodeRAG). In this comparison, LoopRAG is evaluated in its full configuration, with the PDCA closed-loop process, dynamic prompt optimization, and heterogeneous knowledge fusion all enabled.
Table 2 reports results on five metrics: Context Recall, Response Relevance, Faithfulness, Avg Retrieved Tokens, and Accuracy. Overall, LoopRAG achieves the best performance on all quality-oriented indicators: Context Recall rises to 90% and Accuracy reaches 88%, a clear advantage over all baselines. Notably, LoopRAG consumes substantially more retrieved tokens when constructing its prompt context. This higher input budget goes hand in hand with more exhaustive evidence extraction, improved response precision, and stronger evidence consistency, reflecting a trade-off in which richer context supports more reliable outputs.
LoopRAG leads (or ties) on all four quality metrics. Its Context Recall of 90% exceeds Naive RAG by 15 percentage points, indicating stronger evidence acquisition and context integration, particularly for multi-hop and multi-turn tasks. Response Relevance reaches 72%, substantially higher than GraphRAG (54%) and NodeRAG (61%), suggesting that improvements in semantic matching and prompt design translate effectively into better alignment with reference answers.
For Faithfulness, LoopRAG matches LightRAG at 87% and markedly surpasses Naive RAG (50%) and GraphRAG (45%), implying that multi-source semantic compression and constraint mechanisms improve factual grounding. Although LoopRAG retrieves an average of 3046 tokens (far more than the other models), this enables a larger and deeper contextual window, mitigating the context-length bottleneck of conventional RAG systems. As a result, LoopRAG attains an Accuracy of 88%, at least 19 percentage points higher than any baseline.
To further evaluate the different RAG methods in practical semantic question answering from a human perspective, we conducted a subjective user study. Seven Ph.D. students with relevant research backgrounds, covering intelligent buildings, artificial intelligence, and information systems, independently scored the answers generated by each model (Naive RAG, GraphRAG, LightRAG, NodeRAG, and the proposed LoopRAG). All evaluations were conducted on a unified question set under the same knowledge background to ensure fairness. Answers were rated along three dimensions: expertise, which assesses the accuracy and depth of domain-specific knowledge; fluency, which reflects the naturalness and readability of language expression; and consistency, which measures the coherence of logical structure, semantic continuity, and alignment with the question intent. Scores were assigned on a 1–10 scale in a blind setting, with reviewers unaware of the model sources.
As shown in Figure 5, LoopRAG achieves the highest median scores on all three dimensions, with more concentrated score distributions than the baselines. This suggests that LoopRAG not only delivers higher subjective quality but also elicits greater inter-rater agreement and stability. From a human evaluation perspective, these findings further support the effectiveness of closed-loop reasoning, dynamic prompt optimization, and heterogeneous knowledge integration in improving semantic question answering quality.

4.3. Impact of the PDCA Closed-Loop Mechanism

In the following ablation tables, upward arrows (↑) indicate metric increases and downward arrows (↓) indicate metric decreases for the ablated variant relative to the full LoopRAG configuration.
We next evaluate the contribution of PDCA feedback during task execution by removing the feedback step (i.e., disabling Act-stage adjustment and retaining only a single Plan–Do pass). Both settings share the same knowledge base, retrieval scheme, base prompts, and decoding configuration, so that performance differences isolate the effect of closed-loop feedback.
As shown in Table 3, removing feedback reduces Accuracy from 88% to 71%; Context Recall and Response Relevance drop by 4 and 5 percentage points, respectively. This indicates that without error detection and policy correction, the system cannot effectively handle semantic drift in complex tasks. Although Avg Retrieved Tokens decreases by 23.6% and Faithfulness rises slightly, this gain mainly stems from shorter inputs and more conservative answers rather than a genuine improvement in generation quality.
Disabling the Act Agent leads to a sharp reduction in overall Accuracy, confirming that Act-driven attribution and strategy adjustment are essential for sustained optimization and for resolving complex semantic instructions within the PDCA loop.

4.4. Impact of the Dynamic Prompt Optimization Module (MCPO)

This section examines how MCPO improves task formulation and retrieval coverage. We compare the full LoopRAG system with a variant that removes MCPO, thereby assessing its impact on recall, semantic alignment, and overall accuracy.
As reported in Table 4, removing MCPO causes Accuracy to fall from 88% to 68%, Response Relevance to decrease by nine percentage points, and Context Recall to decrease by five percentage points. Avg Retrieved Tokens drops from 3046 to 1132 (a reduction of nearly 63%), indicating that without prompt optimization, prompts provide weaker guidance for retrieval and fail to elicit sufficient evidence.
These results show that MCPO—through semantic sampling, structural reformation, and self-evaluation—substantially improves alignment between prompts and user intent, making it a key contributor to both high recall and high accuracy.

4.5. Impact of the Heterogeneous Knowledge Fusion Mechanism

We further assess the heterogeneous knowledge fusion module, focusing on its ability to improve evidence consistency and reduce structure-induced deviations. The control variant relies only on vector-based retrieval and excludes graph-based entity expansion and structural constraints.
Table 5 demonstrates that removing heterogeneous fusion causes Faithfulness to drop from 87% to 50% (37 percentage points) and Accuracy to fall from 88% to 64%. While Avg Retrieved Tokens decreases substantially, the system suffers from weaker evidence coverage and missing structural constraints, which increases the risk of structure-level hallucinations, such as incorrectly linking evidence across unrelated device chains. Context Recall and Response Relevance also decline by 15 and 14 percentage points, underscoring the importance of structured constraints in multi-source retrieval.
From an engineering perspective, building tasks involve tightly coupled space–device–control chains, and text similarity alone cannot guarantee structural validity. Incorporating graph entities and structural-consistency scoring enables more accurate identification of task-relevant entities and improves evidence quality.
In addition, seven Ph.D. students were invited to conduct subjective evaluations under a unified question set and consistent evaluation criteria. The assessment was performed along three dimensions: expertise, fluency, and consistency. Expertise evaluates the accuracy of domain-specific knowledge in the responses and reflects retrieval quality; fluency measures the naturalness and readability of language expression; consistency focuses on the coherence of logical structure, causal relationships, and alignment with the question intent. The experimental results, as shown in Figure 6, indicate that after enabling heterogeneous knowledge integration, the system achieves significantly higher median scores and more concentrated score distributions across all three metrics, demonstrating substantial improvements in retrieval quality and answer credibility.

4.6. Summary and Discussion

Overall, LoopRAG consistently outperforms representative RAG baselines, validating the utility of closed-loop control, dynamic prompt optimization, and heterogeneous knowledge fusion in smart building semantic QA. The main cost is higher computation, reflected in larger contextual token budgets and additional loop iterations, which may constrain latency-sensitive applications. A practical extension is to incorporate task-difficulty recognition, using single-pass processing for simple queries and the closed-loop mode only for complex tasks.
Moreover, the knowledge base used here is relatively standardized and structurally complete. In real deployments, BIM, IoT, and O&M data often contain missing values, inconsistencies, and higher noise levels. Maintaining performance under such low-quality data and designing more efficient knowledge-update workflows remain open problems.
In summary, LoopRAG excels in context coverage, semantic alignment, evidence consistency, and final accuracy. The PDCA loop improves stability on complex tasks; MCPO enhances task formulation and retrieval guidance; and heterogeneous knowledge fusion strengthens structural consistency across sources. These properties suggest strong engineering potential for LoopRAG in highly heterogeneous settings such as smart buildings.

5. Discussion

This study addresses key challenges in semantic question answering for intelligent buildings, including unstable reasoning for complex tasks, fragmented evidence organization, and limited interpretability. We propose LoopRAG, a multi-agent semantic reasoning framework based on a PDCA closed loop, which models reasoning as a controllable, task-internal iterative process. By decoupling the Plan, Do, Check, and Act roles, LoopRAG jointly optimizes intent modeling, evidence selection, and generation strategies. Extensive experiments show that LoopRAG consistently outperforms baseline methods across multiple benchmarks and intelligent building scenarios in terms of accuracy, factual consistency, and subjective credibility. Ablation results further demonstrate the effectiveness of closed-loop feedback, dynamic prompt optimization, and heterogeneous knowledge integration, highlighting LoopRAG as a practical and robust paradigm for reliable semantic question answering in complex engineering environments.
Nevertheless, this study still has several limitations that merit further exploration in future work. First, at the data and evaluation level, the current experiments are mainly conducted on relatively well-structured knowledge bases and offline question-answering settings, whereas real-world building environments involve BIM, IoT, and operation-and-maintenance data that are often noisy, incomplete, and inconsistent, posing greater challenges to the stability of closed-loop reasoning and evidence fusion. Moreover, while existing automated metrics can capture overall trends, they remain limited in characterizing structural soundness and engineering executability. Second, at the methodological level, LoopRAG achieves improved stability and credibility by introducing closed-loop iterations and richer evidence organization, but this also incurs higher computational overhead and system complexity. Exploring lightweight alternatives, such as NodeRAG, and further investigating how to achieve adaptive trade-offs between performance and efficiency remain important directions for future research. Looking ahead, we plan to incorporate more comprehensive long-term interaction evaluations, real-world operational scenario testing, and human-in-the-loop mechanisms (e.g., manual verification and policy-constrained feedback) to advance LoopRAG from experimental validation toward large-scale, sustainable deployment, and to further explore the potential of closed-loop semantic reasoning in human–machine co-managed intelligent systems.

6. Conclusions

To address three persistent bottlenecks in smart building semantic question-answering interaction systems—limited dynamic adaptivity, inflexible prompt design, and weak heterogeneous knowledge integration—we propose LoopRAG, a PDCA-driven multi-agent RAG architecture. LoopRAG coordinates four agents (Plan, Do, Check, and Act) to enable continual within-task optimization from intent understanding to response generation. With a semantics-driven prompt reconfiguration mechanism and a multi-source knowledge fusion module, the system achieves improved accuracy, robustness, and semantic coverage.
Experiments demonstrate that LoopRAG consistently surpasses mainstream RAG baselines across key metrics, including Context Recall, semantic relevance, and answer accuracy, confirming both its effectiveness in complex building contexts and its extensibility. In particular, LoopRAG shows strong adaptive behavior and cross-task generalization in high-complexity settings such as multi-turn interaction, task-chain transitions, and multi-source heterogeneous knowledge integration. These results indicate that LoopRAG offers a systematic and practically grounded solution for constructing smart building semantic QA interaction systems.
Future work may further extend LoopRAG toward cross-modal information modeling, tighter coupling with real-time environmental perception, and stronger safety and controllability guarantees, supporting the development of a general-purpose smart building agent platform with enhanced contextual understanding and knowledge collaboration.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/13bbjxwz/EAGLE/ (accessed on 24 December 2025).

Author Contributions

Conceptualization, J.B. and Y.Y.; methodology, J.B. and Y.Y.; software, J.B. and Y.Y.; formal analysis, J.B.; investigation, J.B. and J.C.; data curation, J.B., Y.Y., and J.C.; writing—original draft preparation, J.B. and Y.Y.; writing—review and editing, J.B., D.N., and J.C.; visualization, J.B. and Y.Y.; supervision, D.N.; resources, D.N.; project administration, D.N.; funding acquisition, D.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Urban Digital Transformation Special Fund, grant number 202301050.

Data Availability Statement

The data supporting the findings of this study are not publicly available due to privacy and ethical considerations but are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

Abbreviation | Full Term
QA | Question Answering
LLMs | Large Language Models
RAG | Retrieval-Augmented Generation
PDCA | Plan–Do–Check–Act
IoT | Internet of Things
BIM | Building Information Modeling
HVAC | Heating, Ventilation, and Air Conditioning
GPT | Generative Pre-trained Transformer
LLaMA | Large Language Model Meta AI
KG | Knowledge Graph
MCPO | Monte Carlo Prompt Optimization
H-KEFM | Heterogeneous Knowledge Expansion and Fusion Mechanism
Symbol | Description
t | Index of interaction or closed-loop iteration step
u_t | User input at iteration t
y_t | Generated answer at iteration t
S_t | Internal semantic state of the system at iteration t
q_t | User intent representation (task type, entities, constraints)
k_t | Retrieved knowledge/evidence state at iteration t
p_t | Prompt and generation strategy configuration
𝒦 | Global heterogeneous knowledge base
f_θ(·) | Generation and reasoning function parameterized by θ
Φ_eval(·) | Quality evaluation mapping function
Φ(·) | State update function in the semantic closed loop
e_t | Evaluation feedback signal at iteration t
L_t | Composite deviation loss at iteration t
Θ | Parameter set of the closed-loop reasoning system
Φ_P | Planning function (Plan Agent)
Φ_D | Execution and retrieval function (Do Agent)
Φ_C | Quality checking function (Check Agent)
Φ_A | Policy update and correction function (Act Agent)
z_t | Structured task plan generated by the Plan Agent
α, β, γ | Weights for semantic alignment, faithfulness, and constraint losses
η_t | Step size for state adjustment at iteration t
E(·) | Semantic embedding function
CosSim(·,·) | Cosine similarity between embeddings
δ_i | Indicator of whether retrieved evidence i supports the answer
K | Number of retrieved evidence items (Top-K retrieval)
Ψ(y_t, p_t) | Constraint satisfaction function for generated output
𝒫 | Prompt space
𝒩(u_t) | Semantic neighborhood of prompts around input u_t
T_i | Prompt perturbation or rewriting operator
p_k | k-th sampled prompt candidate
Q(p) | Prompt quality evaluation function
λ_1, λ_2, λ_3 | Weights of prompt quality components

References

  1. Duarte, C.; Carrilho, J.; Vale, Z. Natural language interfaces for smart buildings: A systematic review. J. Build. Eng. 2023, 77, 107312. [Google Scholar]
  2. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL 2024), St. Julian’s, Malta, 17–22 March 2024; pp. 150–158. [Google Scholar]
  3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  4. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LlaMA: Open and efficient foundation language models. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  5. Zhang, J.; Lu, W.; Li, H. Natural-language-driven BIM information retrieval for construction and facility management. Adv. Eng. Inform. 2021, 48, 101276. [Google Scholar]
  6. Gemini Team. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  7. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Cui, B. Retrieval-augmented generation for AI-generated content: A survey. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar]
  8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  9. Toro, S.; Anagnostopoulos, A.V.; Bello, S.M.; Blumberg, K.; Cameron, R.; Carmody, L.; Diehl, A.D.; Dooley, D.M.; Duncan, W.D.; Fey, P.; et al. Dynamic retrieval augmented generation of ontologies using artificial intelligence (DRAGON-AI). J. Biomed. Semant. 2024, 15, 19. [Google Scholar] [CrossRef]
  10. Zakka, C.; Shad, R.; Chaurasia, A.; Dalal, A.R.; Kim, J.L.; Moor, M.; Fong, R.; Phillips, C.; Alexander, K.L.; Ashley, E.A.; et al. Almanac—Retrieval-augmented language models for clinical medicine. NEJM AI 2024, 1, AIoa2300068. [Google Scholar] [CrossRef] [PubMed]
  11. Xu, L.; Lu, L.; Liu, M.; Song, C.; Wu, L. Nanjing Yunjin intelligent question-answering system based on knowledge graphs and retrieval augmented generation technology. Herit. Sci. 2024, 12, 118. [Google Scholar] [CrossRef]
  12. Yan, Z.; Tang, Y.; Liu, M.; Zhang, H.; Zhu, Q. A ReAct-based intelligent agent framework for ambient control in smart buildings. Sensors 2024, 24, 2517. [Google Scholar]
  13. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
  14. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online, 1–6 August 2021; pp. 4582–4597. [Google Scholar] [CrossRef]
  15. Cahoon, J.; Singh, P.; Litombe, N.; Larson, J.; Trinh, H.; Zhu, Y.; Mueller, A.; Psallidas, F.; Curino, C. Optimizing open-domain question answering with graph-based retrieval augmented generation. arXiv 2025, arXiv:2503.02922. [Google Scholar]
  16. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  17. Chen, Y.; Lu, W.; Xue, F.; Webster, C. BIM-based question answering for facility management: A review and future directions. Autom. Constr. 2022, 139, 104319. [Google Scholar]
  18. Gao, X.; Zhong, B.; Ding, L. Ontology-driven semantic question answering for smart buildings: A review. Build. Environ. 2022, 219, 109184. [Google Scholar]
  19. Kim, J.; Hong, T.; Jeong, K.; Lee, M. Conversational interfaces for HVAC control using deep NLP: A multi-zone case study. Energy Build. 2021, 250, 111294. [Google Scholar]
  20. Alavi, A.; Forcada, N.; Serrat, C. Natural language processing–enabled interactions in IoT-based smart buildings: A framework and case study. Autom. Constr. 2023, 149, 104799. [Google Scholar]
  21. Yasunaga, M.; Ren, H.; Bosselut, A.; Liang, P.; Leskovec, J. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021), Online, 6–11 June 2021; pp. 535–546. [Google Scholar]
  22. Song, Q.; Chen, X.; Dong, L.; Jiang, X.; Zhu, G.; Yu, L.; Yu, Y. Multi-hop knowledge graph question answering method based on query graph optimization. In Proceedings of the 2025 International Conference on Artificial Intelligence and Computational Intelligence, Kuala Lumpur, Malaysia, 14–16 February 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 40–45. [Google Scholar] [CrossRef]
  23. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021. [Google Scholar]
  24. Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N.A.; Lewis, M. Measuring and Narrowing the Compositionality Gap in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 5687–5711. [Google Scholar] [CrossRef]
  25. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
  26. Izacard, G.; Grave, E. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL 2021), Online, 19–23 April 2021; pp. 874–880. [Google Scholar] [CrossRef]
  27. Jiang, Z.; Xu, F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 7969–7992. [Google Scholar] [CrossRef]
  28. Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 9248–9274. [Google Scholar] [CrossRef]
  29. Linders, J.; Tomczak, J.M. Knowledge graph-extended retrieval augmented generation for question answering. Appl. Intell. 2025, 55, 1102. [Google Scholar] [CrossRef]
  30. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.; et al. Least-to-most prompting enables complex reasoning in large language models. In Proceedings of the International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting for multi-hop reasoning. In Proceedings of the International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  32. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. SELF-RAG: Learning to retrieve, generate, and critique through self-reflection. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  33. Huang, S.; Wang, L.; Chen, J.; Liu, Y. CRITIC: Large language models can self-correct with tool-interactive critique. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7 May 2024. [Google Scholar]
  34. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative refinement with self-feedback. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Conference Proceedings, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  35. Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  36. Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  37. Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), San Francisco, CA, USA, 29 October–1 November 2023; pp. 2:1–2:22. [Google Scholar] [CrossRef]
  38. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In Proceedings of the ICLR 2024 Workshop on LLM Agents, Vienna, Austria, 11 May 2024. [Google Scholar]
  39. Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-Tuning: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 61–68. [Google Scholar] [CrossRef]
  40. Shin, T.; Razeghi, Y.; Logan, R.L., IV; Wallace, E.; Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, 16–20 November 2020; pp. 4222–4235. [Google Scholar] [CrossRef]
  41. Wang, Q.; Mao, Y.; Wang, J.; Yu, H.; Nie, S.; Wang, S.; Feng, F.; Huang, L.; Quan, X.; Xu, Z.; et al. APrompt: Attention prompt tuning for efficient adaptation of pre-trained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 9147–9160. [Google Scholar] [CrossRef]
  42. Petrov, A.; Torr, P.H.S.; Bibi, A. When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  43. Marvin, G.; Nakayiza, H.; Jjingo, D.; Nakatumba-Nabende, J. Prompt engineering in large language models. In Data Intelligence and Cognitive Informatics; Jacob, I.J., Piramuthu, S., Falkowski-Gilski, P., Eds.; Springer Nature: Singapore, 2024; pp. 387–402. ISBN 978-981-99-7962-2. [Google Scholar]
Figure 1. Three key challenges of RAG-based semantic QA interaction in intelligent buildings: (1) lack of dynamic reasoning adjustment, (2) insufficient adaptive prompt design, and (3) weak heterogeneous knowledge fusion.
Figure 2. Overall PDCA multi-agent collaborative architecture of LoopRAG. The Plan, Do, Check, and Act agents form a within-task closed loop over intent state q_t, knowledge state k_t, prompt/strategy state p_t, and deviation signal e_t.
Figure 3. Monte Carlo prompt optimizer (MCPO) embedded in the Plan Agent. The module performs one-shot local perturbation and selection to obtain a semantically stable task prompt before entering the PDCA loop.
Figure 4. H-KEFM. Multimodal sources are unified into a shared space, expanded via graph structure, scored along multiple dimensions, and fused into a reasoning context. Solid arrows denote primary data and control flows, while dashed arrows indicate auxiliary or feedback interactions. Green knowledge blocks represent knowledge items that pass the evaluation, whereas gray knowledge blocks indicate items that do not pass the evaluation and are filtered out.
Figure 5. Human evaluation of RAG methods on expertise, fluency, and consistency.
Figure 6. Human evaluation of heterogeneous knowledge fusion versus vector-only retrieval.
Table 1. Datasets used in experiments and their evaluation purposes.
Dataset | Type | Usage Description
HotpotQA | Multi-hop QA | Assess multi-paragraph semantic composition.
MuSiQue | Synthetic multi-hop tasks | Test adaptability to structured complex tasks.
MultiHop-RAG | Multi-document reasoning | Evaluate cross-document retrieval and semantic aggregation.
RAG-QA Arena | Multi-turn conversational QA | Measure context retention and historical coherence.
Building QA | Multi-type QA tasks | Validate generalization and controllability in practical settings.
Table 2. Overall performance comparison under a unified knowledge base and LLM.
Model | Context Recall | Response Relevance | Faithfulness | Avg Retrieved Tokens | Accuracy
Naive RAG | 75% | 58% | 50% | 1096 | 64%
GraphRAG | 34% | 54% | 45% | 694 | 69%
LightRAG | 81% | 62% | 87% | 978 | 61%
NodeRAG | 85% | 61% | 74% | 1004 | 66%
LoopRAG | 90% | 72% | 87% | 3046 | 88%
Note: Bold values indicate the best performance across all compared models for each metric.
Table 3. Performance comparison with/without PDCA feedback.
Metric | With Feedback Loop | Without Feedback | Relative Change
Context Recall | 90% | 86% | ↓ 4%
Response Relevance | 72% | 67% | ↓ 5%
Faithfulness | 87% | 89% | ↑ 2%
Avg Retrieved Tokens | 3046 | 2325 | ↓ 23.6%
Accuracy | 88% | 71% | ↓ 17%
Table 4. Impact of the MCPO prompt optimization module.
Metric | With Prompt Optimization | Without MCPO | Relative Change
Context Recall | 90% | 85% | ↓ 5%
Response Relevance | 72% | 63% | ↓ 9%
Faithfulness | 87% | 86% | ↓ 1%
Avg Retrieved Tokens | 3046 | 1132 | ↓ 62.8%
Accuracy | 88% | 68% | ↓ 20%
Table 5. Impact of the heterogeneous knowledge fusion module (H-KEFM).
Metric | Fusion Enabled | Vector-Only Retrieval | Relative Change
Context Recall | 90% | 75% | ↓ 15%
Response Relevance | 72% | 58% | ↓ 14%
Faithfulness | 87% | 50% | ↓ 37%
Avg Retrieved Tokens | 3046 | 1096 | ↓ 64%
Accuracy | 88% | 64% | ↓ 24%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
