1. Introduction
Building energy consumption is shaped by the interplay of physical building characteristics, climate conditions, and occupant behaviors [1]. Structural and thermal properties set a baseline for energy use, climate conditions drive heating and cooling loads, and occupant behaviors can dramatically sway actual consumption. These factors not only determine energy consumption but also critically influence energy retrofit performance. For instance, adding insulation to exterior walls reduces heat loss in buildings with poor insulation but provides only marginal benefits in buildings that already have moderate insulation. A cool roof retrofit helps lower indoor temperatures and reduce cooling demand in hot climates, but can inadvertently increase heating demand in colder zones, offsetting its benefits. Occupant behaviors further complicate outcomes: aggressive thermostat settings or irregular occupancy patterns can negate the expected savings from heating, ventilation, and air conditioning (HVAC) upgrades. These interacting technical, climatic, and behavioral factors make retrofit performance difficult to generalize, underscoring the need for decision-support approaches that can balance multiple objectives and adapt to diverse contexts.
Existing decision-making approaches for building retrofits rely on either physics-based or data-driven methods [2,3]; however, they have inherent limitations in handling the aforementioned three factors. Physics-based methods rely on heat and mass transfer principles as well as energy balance to simulate building energy performance and retrofit outcomes. These models are valued for their accuracy and remain a standard benchmark in engineering practice [4,5]. However, they face several challenges, including heavy input-data requirements, limited scalability, and static behavior assumptions. First, physics-based simulation requires technically detailed and complex input data [5]. Building characteristics, HVAC system efficiency, and domestic hot water system efficiency can typically be obtained from building audits, blueprints, or energy performance certificates [6,7]. However, reliably collecting these parameters and accurately inputting them into building energy simulations becomes increasingly complex and often impractical as the number of buildings grows in large-scale applications [3]. Second, scalability is limited not only by data demands but also by the neglect of between-building interactions, such as shared energy infrastructure, district systems, and multi-stakeholder coordination [8]. Even when these models are effective at the individual building level, extending them to the community or city scale remains challenging without system-level integration. Third, climate conditions, such as outdoor temperature, solar radiation, and humidity levels, are commonly obtained from empirical meteorological data of reference cities within the same climate zone, for instance, those provided by the EnergyPlus Weather Data Sources [9]. However, limited sensitivity to local microclimatic variations—such as urban heat island effects—can lead to discrepancies between simulated and actual building performance [10]. Finally, occupant behavior, such as thermostat settings, appliance usage patterns, and lighting setups, is often represented using static schedules and rules, overlooking dynamic interactions and behavioral uncertainty. This static behavior assumption can introduce significant discrepancies in energy predictions, particularly for small-scale buildings, where a smaller sample size magnifies the relative error between simulated and actual energy use [11].
Data-driven methods rely on historical energy data, statistical models, and machine learning algorithms to assess and predict the performance of retrofit measures [3]. While these approaches offer empirical learning and flexible modeling, they face three major challenges: missing baseline data, limited generalizability, and low interpretability. First, the lack of baseline data is particularly evident in older renovated buildings without reliable pre- and post-retrofit records. Even if upgrades such as improved insulation meet current building codes, the absence of prior data makes it difficult to accurately assess actual energy savings [12]. Second, ensuring generalizability is challenging [13]. A model trained in one climate or building stock may fail in another with different usage patterns. Finally, many models operate as black boxes, offering limited interpretability, which undermines stakeholder trust and hinders policy or investment adoption [14,15].
Smart and Connected Communities (S&CC) offer a promising environment for addressing key challenges in building energy retrofits by leveraging technologies such as Digital Twin, Multi-Agent Systems (MAS), and Artificial Intelligence (AI), as summarized in Figure 1. Digital Twins, enabled by Building Information Modeling (BIM) and the Internet of Things (IoT), provide real-time, high-resolution data on building operations, local climate, and occupant behavior, reducing reliance on static assumptions and incomplete baselines while improving responsiveness to microclimatic variations [16,17,18,19]. However, their deployment is often constrained by high implementation costs, data integration challenges, and the need for robust sensor infrastructure. MAS approaches, including Energy System Modeling and Social Network Modeling, improve scalability beyond physics-based methods by simulating inter-building dynamics and supporting coordinated decision-making at larger scales [8]. Yet, their accuracy depends heavily on assumptions about agent behaviors and can be computationally intensive, limiting real-time applicability. Recent advances in Generative AI, particularly large language models (LLMs), offer promise for improving generalizability by learning contextual patterns from vast datasets [20], enabling performance predictions across varied building types and climates. Additionally, their natural language capabilities offer new pathways for improving interpretability by translating complex model outputs into human-readable explanations [21]. At the same time, LLMs face well-documented challenges—including hallucinations, inconsistency, and shallow contextual awareness—that must be overcome for their reliable use in high-stakes retrofit decision-making.
Recent advances in LLM-based AI promise to overcome the “last-mile” challenge by translating and synthesizing complex technical data into clear, actionable, and personalized recommendations [22,23]. Within an S&CC ecosystem, LLMs can serve as conversational interfaces that connect expert knowledge with everyday decision-making, engaging homeowners, building managers, and community members [24,25]. For instance, they may propose tailored retrofit measures—such as insulation upgrades, HVAC improvements, or solar system integration—while considering comfort, budget, and climate. Through natural language, homeowners can interactively explore various retrofit options with explanations of savings, payback periods, and comfort impacts [26]. Yet most models are trained on broad, general-purpose data rather than domain-specific knowledge, limiting their ability to reflect diverse building characteristics, climate conditions, and occupant behaviors. This lack of domain grounding, combined with the absence of systematic evaluations, raises concerns about reliability, accuracy, and potential misguidance in real-world retrofit decisions [23,27].
To address the lack of domain-specific evaluation, this study examines the ability of leading LLMs to support residential building energy retrofit decision-making, focusing on environmental benefits and economic feasibility. We selected six widely discussed LLMs and LLM-powered applications released by early 2025 (ChatGPT o3, DeepSeek R1, Grok 3, Gemini 2.0, Llama 3.2, and Claude 3.7), representing a mix of proprietary and open-source platforms, multimodal and text-based architectures, and varied cost structures. OpenAI’s ChatGPT o3 is recognized for balanced reasoning and coherent multi-step responses [28]. DeepSeek R1 offers strong coding and mathematical performance while maintaining relatively low deployment costs [29]. Grok 3 from xAI integrates real-time data from X (formerly Twitter), enabling up-to-date responses [30]. Gemini 2.0 from Google DeepMind demonstrates advanced multimodal capabilities, particularly in processing images and long documents [31]. Llama 3.2 from Meta is a lightweight, open-access model that supports flexible fine-tuning [32]. Claude 3.7 from Anthropic combines fast and deep reasoning with reliable instruction-following and robust data understanding [33]. For simplicity, these LLMs and LLM-powered applications are hereafter collectively referred to as LLMs. Importantly, this study evaluates the intrinsic capabilities of LLMs without any domain-specific fine-tuning or retraining, isolating how well general-purpose models can perform retrofit decision-making tasks out of the box.
For building energy retrofit decision-making, LLMs show promise for improving generalizability and interpretability compared with physics-based and data-driven approaches; however, systematic domain-specific evaluations remain scarce. Accordingly, we evaluate off-the-shelf LLMs for residential retrofits with a focus on environmental benefits and economic feasibility, examining performance across diverse residential contexts. Specifically, we evaluate these LLMs along four key dimensions: accuracy against established baselines, consistency across models, sensitivity to contextual features (e.g., building characteristics, climate conditions, and occupant behaviors), and the quality of their reasoning logic. This multi-faceted assessment offers a foundation for understanding both the strengths and limitations of current LLM-based AI in supporting retrofit decisions.
This research is presented as follows. The Methodology section introduces the data, prompt design, and evaluation framework. The Results section presents accuracy, consistency, sensitivity, and reasoning findings. The Discussion first summarizes LLM opportunities and limitations, then follows the LLM workflow to identify where performance can be improved, and finally notes the limitations of this study. The Conclusions summarize the key insights.
3. Results
3.1. Accuracy
Figure 4 presents the Top-1 (blue), Top-3 (orange), and Top-5 (green) accuracy rates for the LLMs, including their 95% confidence intervals. The top panel shows accuracy rates for maximizing CO2 reduction, while the bottom panel shows accuracy rates for minimizing payback period. “Overall” bars, displayed alongside individual models, represent the average accuracy across all LLMs for each objective.
LLMs demonstrate stronger performance on the technical task of maximizing CO2 reduction than on the sociotechnical task of minimizing payback period. As expected, accuracy improves across all models as the evaluation criteria broaden from Top-1 to Top-5. For the CO2 objective, the overall Top-1 accuracy is 39.9%, increasing to 70.1% at Top-3 and 79.9% at Top-5. In contrast, for the payback period objective, accuracy remains low across all tiers, with an overall Top-1 accuracy of only 11.0%, rising to just 28.0% at Top-3 and 34.8% at Top-5.
In the CO2 reduction task, leading LLMs produced highly effective recommendations. Specifically, Top-1 accuracy ranges from 25.5% (Grok 3) to 54.5% (Gemini 2.0); Top-3 accuracy ranges from 53.8% (Grok 3) to 86.3% (Gemini 2.0); and Top-5 accuracy ranges from 68.5% (DeepSeek R1) to 92.8% (ChatGPT o3). Top performers like ChatGPT o3 and Gemini 2.0 achieve over 45% Top-1 accuracy, exceed 80% at Top-3, and surpass 90% at Top-5. These results suggest that while pinpointing the single best measure remains challenging, most models consistently fall within a near-optimal range.
In contrast, performance on payback year minimization remains uniformly poor. Even under the most lenient Top-5 criterion, most models fail to achieve high accuracy. Top-1 accuracy ranges from 6.5% (Gemini 2.0) to 14.3% (DeepSeek R1); Top-3 accuracy ranges from 15.0% (Grok 3) to 44.0% (Gemini 2.0); and Top-5 accuracy ranges from 19.5% (Grok 3) to 52.5% (Gemini 2.0). Gemini 2.0 is the only model to surpass 50% at Top-5 accuracy, underscoring the persistent difficulty of identifying economically optimal retrofit measures.
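As a point of reference for how such figures can be tallied, the minimal sketch below computes Top-k accuracy with a normal-approximation 95% confidence interval from ranked LLM recommendations; the data layout and helper names are illustrative assumptions, not the study’s actual evaluation code.

```python
import math

def top_k_accuracy(ranked_recommendations, optimal_measures, k):
    """Fraction of homes whose true optimal retrofit appears in the model's top-k list.

    ranked_recommendations: list of per-home lists ordered best-first (assumed format).
    optimal_measures: list of the EnergyPlus-derived optimal measure per home.
    """
    hits = sum(opt in recs[:k] for recs, opt in zip(ranked_recommendations, optimal_measures))
    return hits / len(optimal_measures)

def accuracy_ci95(p, n):
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Illustrative usage with hypothetical rankings for three homes
ranked = [["heat pump", "attic insulation", "windows"],
          ["attic insulation", "air sealing", "heat pump"],
          ["windows", "heat pump", "attic insulation"]]
optimal = ["heat pump", "heat pump", "attic insulation"]
for k in (1, 3):
    acc = top_k_accuracy(ranked, optimal, k)
    print(f"Top-{k}: {acc:.1%}, 95% CI {accuracy_ci95(acc, len(optimal))}")
```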
3.2. Consistency
3.2.1. Overall Agreement
Figure 5 reports the leave-one-out Fleiss’ Kappa for each of the six LLMs, along with the overall Fleiss’ Kappa across all models. Two types of agreement are shown: (1) selection-based agreement (red), measuring whether models chose the same retrofit package as optimal; and (2) correctness-based agreement at Top-1, Top-3, and Top-5, measuring whether models agreed in correctly identifying the optimal retrofit within their top-k results. The upper chart shows these metrics for the CO2 reduction objective, while the lower chart shows results for minimizing the payback period. Bars labeled with model names indicate the leave-one-out Fleiss’ Kappa values when excluding the corresponding model, while “Overall” represents the Fleiss’ Kappa when all models are included. A higher leave-one-out Fleiss’ Kappa value relative to the overall value suggests that the excluded model was lowering the agreement among the other models, whereas a lower leave-one-out value indicates that the excluded model made a positive contribution to the overall agreement.
Across both objectives, selection-based agreement is consistently negative, indicating that the models’ retrofit selections were less consistent than would be expected by chance. Correctness-based agreement reaches, at best, the “fair” agreement range. The overall Fleiss’ Kappa for CO2 reduction at Top-1 is negative (−0.015), in contrast to that for the payback year (0.203). Correctness-based agreement at Top-3 and Top-5 becomes more comparable across the two objectives: the values are 0.184 and 0.280 for CO2, versus 0.195 and 0.285 for payback.
For CO2 reduction, correctness-based agreement increases steadily from Top-1 to Top-5, starting below chance and rising to 0.335 (fair agreement). This trend mirrors the accuracy gains previously shown in Figure 4, suggesting that LLMs show greater agreement in judging the correct optimal retrofit measures as accuracy improves. In addition, top performers in accuracy, such as ChatGPT o3 and Gemini 2.0, also show the highest leave-one-out Fleiss’ Kappa values (both larger than the overall), indicating that their correct judgments diverged from others and therefore lowered overall agreement.
For payback minimization, correctness-based Fleiss’ Kappa values remain within the fair range. Agreement peaks at Top-1 (0.296), dips at Top-3 (0.227), and rises again at Top-5 (0.315). This pattern, together with the accuracy results in Figure 4, suggests that at Top-1, most models perform poorly in a similar manner (raising agreement); at Top-3, their improved but divergent judgments reduce alignment; and at Top-5, more lenient criteria allow partial convergence.
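For readers unfamiliar with the agreement metric, the sketch below illustrates how overall and leave-one-out Fleiss’ Kappa can be computed from a homes-by-models matrix of correctness labels using statsmodels; the matrix shown is hypothetical, and the study’s exact aggregation may differ.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def fleiss(ratings):
    """Fleiss' Kappa for a (subjects x raters) matrix of categorical ratings."""
    counts, _ = aggregate_raters(ratings)   # subjects x categories count table
    return fleiss_kappa(counts, method="fleiss")

# Hypothetical correctness labels (1 = optimal retrofit identified) for 6 homes x 6 LLMs
ratings = np.array([
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
])
models = ["ChatGPT o3", "DeepSeek R1", "Grok 3", "Gemini 2.0", "Llama 3.2", "Claude 3.7"]

print("Overall:", round(fleiss(ratings), 3))
for i, name in enumerate(models):
    loo = np.delete(ratings, i, axis=1)      # exclude one model at a time
    print(f"Leave-out {name}:", round(fleiss(loo), 3))
```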
3.2.2. Pairwise Agreement
Figure 6 details the pairwise agreement between models using Cohen’s Kappa, offering a granular view of inter-model alignment. Figure 6a,b display selection-based agreement for the two objectives; Figure 6c–e show correctness-based agreement for the CO2 objective at each Top-k level; and Figure 6f–h show correctness-based agreement for the payback objective at each Top-k level.
The pairwise analysis confirms the findings from the overall agreement analysis above. Selection-based agreement (Figure 6a,b) is near-zero or negative across almost all model pairs. The only exception is a slight positive agreement between Grok 3 and Claude 3.7 for CO2-based selection.
For CO2 correctness-based agreement (Figure 6c–e), Cohen’s Kappa values increase from Top-1 to Top-5, reinforcing the trend observed in Fleiss’ Kappa (Figure 5). ChatGPT o3 and Gemini 2.0 consistently show the lowest agreement with other models, supporting earlier findings that their correctness judgments diverge from the rest. Conversely, ChatGPT o3 and Grok 3 demonstrate the highest pairwise agreement across all Top-k ranks, suggesting moderate alignment in their correctness patterns—an insight not captured by Fleiss’ Kappa.
For payback year correctness-based agreement (Figure 6f–h), Cohen’s Kappa values generally mirror the overall pattern observed in Fleiss’ Kappa, with a slight dip at Top-3 and a modest rise at Top-5. Notably, Gemini 2.0 and Llama 3.2 exhibit the weakest pairwise agreement with other models, further underscoring their distinct correctness profiles identified in the leave-one-out Fleiss’ Kappa values (Figure 5). Meanwhile, ChatGPT o3 and Grok 3 again show moderate agreement from Top-1 to Top-5, confirming their consistent alignment across retrofit decisions.
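A pairwise matrix like the one in Figure 6 can be assembled with Cohen’s Kappa from scikit-learn, as in the minimal sketch below; the per-model label vectors are hypothetical stand-ins for the study’s actual correctness (or selection) records.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-home labels for each model (e.g., 1 = correct at Top-3, 0 = not)
labels = {
    "ChatGPT o3":  [1, 0, 1, 1, 0, 1, 1, 0],
    "DeepSeek R1": [1, 0, 0, 1, 0, 1, 0, 0],
    "Grok 3":      [1, 0, 1, 1, 0, 1, 1, 1],
    "Gemini 2.0":  [0, 1, 1, 0, 1, 1, 1, 0],
}

# Cohen's Kappa for every model pair; values near zero indicate chance-level agreement
for (name_a, a), (name_b, b) in combinations(labels.items(), 2):
    print(f"{name_a} vs {name_b}: {cohen_kappa_score(a, b):.2f}")
```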
3.3. Sensitivity
Figure 7 presents feature importance values for 28 input variables in determining optimal retrofit measures under two objectives: maximizing CO2 reduction (left panel) and minimizing payback period (right panel). Each column corresponds to the baseline, one of the six LLMs, or the overall average across LLMs. In this context, feature importance represents how sensitive a model is to each feature—darker shading in the heatmap indicates stronger reliance. Gray diagonal hatching denotes cases where the LLM recommended the same retrofit for all 400 homes, making sensitivity analysis impossible due to a lack of variation.
For both objectives, most LLMs exhibit feature importance patterns that are broadly consistent with the EnergyPlus-derived baseline. “County name”, “State name”, and “Space combination” consistently emerge as the most influential features, followed by a secondary group including “Orientation”, “Vintage”, “Wall insulation level”, “Window type”, “Water heater efficiency”, “Cooling setpoint”, and “Heating setpoint”. The remaining features play a limited role, with only a few LLMs showing deviations. These similarities suggest that most LLMs, like the baseline, prioritize location and architectural characteristics when recommending retrofits.
However, notable deviations were observed. Grok 3 places disproportionate weight on “Heating fuel”, “Cooking range type”, and “Water heater fuel” under the CO2 objective—deviations that may explain its poor accuracy (Figure 4), where it ranked lowest at Top-1 and Top-3 and near the bottom at Top-5. Claude 3.7 also places a relatively high weight on “Heating fuel”, but otherwise aligns with the baseline and most LLMs. Across both objectives, most LLMs assign higher importance to “County name” than the baseline, suggesting they rely heavily on location as a proxy for climate conditions, which strongly shape their judgment on energy use and retrofit selection.
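Although the study’s exact sensitivity procedure is described in the Methodology, one common way to obtain feature importance scores like those in Figure 7 is to fit a surrogate classifier that predicts each model’s recommended retrofit from the 28 input features and then apply permutation importance; the sketch below assumes that approach and uses hypothetical column names.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import OrdinalEncoder

def feature_sensitivity(df: pd.DataFrame, feature_cols, target_col="recommended_retrofit"):
    """Permutation importance of input features for predicting a model's recommendations."""
    X = OrdinalEncoder().fit_transform(df[feature_cols])   # encode categorical features
    y = df[target_col]
    surrogate = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    result = permutation_importance(surrogate, X, y, n_repeats=20, random_state=0)
    return pd.Series(result.importances_mean, index=feature_cols).sort_values(ascending=False)

# Hypothetical usage: one column per ResStock-style feature plus the LLM's recommendation
# df = pd.read_csv("llm_recommendations.csv")
# print(feature_sensitivity(df, ["County name", "Vintage", "Wall insulation level"]))
```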
3.4. Reasoning
The models followed a consistent five-step logic that mirrors a simplified engineering workflow: (1) baseline establishment, (2) envelope impact adjustment, (3) system energy calculation, (4) appliance energy assumption, and (5) outcome comparison.
Figure 8 presents representative examples from ChatGPT o3 and DeepSeek R1 corresponding to each reasoning step. ChatGPT o3 expressed its logic through explanatory text supplemented with embedded Python code, whereas DeepSeek R1 relied entirely on descriptive narrative. Despite these differences in presentation, both models applied the same underlying progression. This multi-step logic mirrors engineering principles, but it remains simplified and accounts for only a narrow set of contextual dependencies.
The reasoning processes begin with a baseline assumption, where the models estimate a building’s original energy load using basic characteristics such as floor area and location. This baseline serves as a reference for evaluating retrofit impacts. Then, in the envelope impact adjustment step, the models apply estimated reductions to account for retrofits such as improved insulation or air sealing. These adjustments are typically expressed as a percentage reduction in baseline energy demand. The system energy calculation step then modifies consumption levels based on the efficiency of newly installed mechanical systems, such as heat pumps or furnaces, estimating demand by dividing adjusted energy loads by system efficiency. During the appliance energy assumption step, savings are estimated by comparing assumed consumption levels of appliances before and after retrofits, reflecting performance differences between technologies. Finally, in the outcome comparison step, results are aggregated to evaluate and prioritize retrofit packages. Although ChatGPT o3 and DeepSeek R1 follow a similar structured logic, their reliance on assumed input values—such as baseline loads and percentage reductions—introduces variability in estimated outcomes and, ultimately, in recommended retrofit selections.
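To make the five-step logic concrete, the sketch below reproduces the style of calculation the models narrated; every numeric value (baseline intensity, envelope reduction, system efficiencies, appliance savings) is an illustrative assumption rather than a figure taken from any model’s actual output.

```python
# Step 1: baseline establishment - rough annual load from floor area and a location-based intensity
floor_area_m2 = 150
baseline_intensity_kwh_per_m2 = 120          # assumed climate-dependent value
baseline_load = floor_area_m2 * baseline_intensity_kwh_per_m2   # 18,000 kWh/yr

# Step 2: envelope impact adjustment - assumed percentage reduction for insulation or air sealing
envelope_reduction = 0.15
adjusted_load = baseline_load * (1 - envelope_reduction)

# Step 3: system energy calculation - divide the adjusted load by the new system's efficiency
old_efficiency, new_efficiency = 0.85, 3.0   # e.g., gas furnace vs. heat pump COP (assumed)
system_energy_before = baseline_load / old_efficiency
system_energy_after = adjusted_load / new_efficiency

# Step 4: appliance energy assumption - compare assumed appliance consumption before and after
appliance_savings = 800 - 450                # e.g., standard vs. efficient water heater, kWh/yr

# Step 5: outcome comparison - aggregate savings to rank the retrofit package
total_savings = (system_energy_before - system_energy_after) + appliance_savings
print(f"Estimated annual savings: {total_savings:,.0f} kWh")
```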
4. Discussion
The overall performance of LLMs in retrofit decision-making depends on their ability to process information effectively across the workflow—prompt understanding, context representation, multi-step inference, and response generation. Weaknesses at any stage can lead to oversimplified or incomplete recommendations. As demonstrated in our evaluation, these weaknesses appeared in the form of simplified reasoning, inconsistent alignment across models, and trade-offs between accuracy and consensus. Recognizing and addressing these limitations is essential to enhancing the performance of LLMs in building energy retrofit applications.
4.1. Opportunities and Limits of LLMs for Practice
With regard to accuracy, the stark contrast between LLM performance in maximizing CO2 reduction and minimizing payback period underscores a fundamental limitation of current models. While LLMs demonstrate competence in clear, single-objective, technical problems, they struggle with sociotechnical trade-offs that require balancing costs, savings, and contextual variability. This suggests that LLMs may be suitable as technical advisors for identifying high-impact retrofit measures but remain unreliable for guiding economic decisions that directly influence adoption and the distribution of costs and benefits across stakeholders. Accordingly, their most effective role in practice may be within a human-in-the-loop framework, where LLM-generated insights on environmental performance are complemented by expert or data-driven evaluation of financial feasibility.
While some LLMs, such as ChatGPT o3 and Gemini 2.0, demonstrate relatively high accuracy, their divergence from other models underscores a lack of consensus across platforms. This suggests that LLMs are not yet reliable as stand-alone decision-makers for retrofit planning. However, the patterns of disagreement also reveal opportunities: ensemble approaches, human–AI collaboration, or domain-specific fine-tuning may help capture the strengths of high-performing models while mitigating inconsistency. In practice, this means LLMs may be most valuable today as supportive advisors—offering alternative retrofit options, explaining trade-offs, and broadening the decision space.
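One concrete form such an ensemble could take is a simple majority vote over the six models’ per-home recommendations, deferring to a human reviewer when no clear winner emerges; the sketch below is an illustrative scheme under that assumption, not a workflow evaluated in this study.

```python
from collections import Counter

def ensemble_recommendation(per_model_picks, min_votes=3):
    """Majority vote across model recommendations; returns None to flag human review."""
    winner, votes = Counter(per_model_picks).most_common(1)[0]
    return winner if votes >= min_votes else None

# Hypothetical picks from six LLMs for a single home
picks = ["heat pump", "heat pump", "attic insulation", "heat pump", "windows", "attic insulation"]
print(ensemble_recommendation(picks))   # "heat pump" (3 votes meets the threshold)
```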
Pairwise comparisons reveal that while some models, such as ChatGPT o3 and Grok 3, align closely and consistently, others like Gemini 2.0 and ChatGPT o3 achieve higher accuracy but diverge from their peers—highlighting a trade-off between accuracy and consensus across LLMs.
The sensitivity analysis shows that LLMs generally mirror baseline reasoning by emphasizing location and architectural features, but their tendency to overweight location (e.g., county) and mis-prioritize certain technical features, as seen with Grok 3, highlights both their potential to approximate domain logic and their risk of producing biased or distorted judgments.
The emergence of structured, engineering-style reasoning in models like ChatGPT o3 and DeepSeek R1 suggests that LLMs can approximate professional decision logic, but their reliance on simplified assumptions and limited contextual awareness highlights the risk of oversimplification. For practice, this means LLMs may be most effective as tools for communicating reasoning steps transparently, rather than as definitive sources of accurate retrofit calculations.
4.2. Prompt Input and Understanding
Effective prompt design is a foundational determinant of LLM performance. Prior studies show that even slight changes in prompt wording or structure can greatly influence outcomes [37,38], and our results confirm similar sensitivity. When prompted to “identify the retrofit measure with the shortest payback period”, most models selected the option with the lowest upfront investment, disregarding how energy cost savings substantially influence payback outcomes. However, simply adding the phrase “considering both initial investment and energy cost savings” led models to adjust their reasoning and incorporate both factors. While the final recommendations often remained unchanged, the revised prompt more consistently triggered comprehensive reasoning—highlighting that explicit guidance is necessary because LLMs may not fully recognize the relevance of key variables unless directly prompted.
Prompt phrasing also shaped response generation. When asked to “identify the measure with the greatest CO2 reduction and the one with the shortest payback period”, models interpreted the task inconsistently: some returned only one option per objective, while others provided two or three. Notably, even for prompts implying a single optimal solution—such as “the greatest” or “the shortest” —some models still listed multiple alternatives. This pattern suggests that LLMs do not always fully infer the implicit expectations behind user prompts, and that clearer phrasing is needed to ensure alignment between user intent and model understanding.
These findings reinforce the importance of prompt engineering as a practical strategy for improving LLM performance in retrofit decision tasks. Unlike physics-based simulation tools, which rely on numerical inputs and fixed equations, LLMs generate responses based on both the prompt and their learned knowledge. In this context, the prompt plays a role analogous to setting initial conditions in a simulation: clear parameters, objectives, and constraints guide the model toward more accurate results. Our study shows that prompt sensitivity can directly affect not only the completeness of reasoning but also the consistency of outputs across models—making prompt design an essential part of quality control for AI-assisted retrofit analysis. Crafting effective prompts often requires iterative refinement and a solid understanding of how the model behaves in domain-specific contexts [37,38]. By structuring prompts to include necessary building parameters, performance targets, and even hints for step-by-step reasoning, practitioners can significantly enhance the decision accuracy of LLM-generated retrofit advice. In short, prompt engineering emerges as a key enabler for harnessing LLMs in retrofit analyses, ensuring that these models remain grounded in context—just as well-defined initial conditions steer a physics-based simulation toward reliable outcomes.
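As an illustration of the kind of prompt structuring discussed above, the template below embeds the building parameters, names the objective explicitly, and adds the clarifying clause about investment and savings; the field names and wording are assumptions for demonstration, not the exact prompts used in this study.

```python
# Hypothetical building record; keys are illustrative, not the study's exact 28 parameters
home = {
    "State name": "Texas",
    "County name": "Travis County",
    "Vintage": "1980s",
    "Wall insulation level": "R-7",
    "Heating fuel": "Natural gas",
    "Cooling setpoint": "76F",
}

PROMPT_TEMPLATE = """You are advising on residential energy retrofits.
Building characteristics:
{characteristics}

Task: From the candidate retrofit measures, identify the measure with the shortest payback period,
considering both initial investment and energy cost savings.
Explain your reasoning step by step (baseline load, retrofit impact, cost, savings, payback),
then state a single final recommendation."""

characteristics = "\n".join(f"- {key}: {value}" for key, value in home.items())
prompt = PROMPT_TEMPLATE.format(characteristics=characteristics)
print(prompt)
```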
4.3. Context Representation
After interpreting a prompt, LLMs internally form a contextual representation of the problem—organizing the scenario, relevant variables, and their relationships to support reasoning. This stage is critical because it determines what information the model actively retains and uses. However, unlike physics-based models, which explicitly define variable interdependencies through equations, LLMs construct their context using language-based associations. As a result, contextual elements with lower salience may be excluded if their relevance is not explicitly signaled.
Our sensitivity analysis showed that most LLMs assigned importance scores to input features similar to those of the physics-based baseline. Yet, many of these features were not effectively integrated into the models’ context representation and thus were unavailable during reasoning. For example, “usage level” consistently received moderate importance—around 2.5% on average across models—but played no role in the reasoning that produced the final retrofit recommendation. This suggests that while LLMs can identify which features are generally relevant, they often fail to embed them into a coherent contextual understanding of the task. Unlike physics-based models, which systematically account for all input variables, LLMs tend to anchor their context representations around only a few salient cues. Even when feature importance aligns with the baseline, an incomplete or fragmented context representation can limit reasoning and reduce overall accuracy.
Improving context representation, therefore, requires not only well-structured prompts but also explicit strategies to emphasize the necessity of key input features. In our study, although the prompts included information such as occupancy levels, we did not emphasize the importance of these variables within the prompt. As a result, models often overlooked them during reasoning. If prompts had more clearly directed the model’s attention—for example, by stating that certain features directly impact energy performance—LLMs might have incorporated them more consistently into their reasoning and generated more accurate recommendations.
4.4. Inference and Response Generation
LLM inference and response generation in our study exhibited three major limitations: oversimplified decision logic, inconsistency across responses, and unintended context-driven bias. These issues were not always evident from the final answers but became clear when examining the model’s step-by-step reasoning traces.
First, models frequently relied on overly simplified reasoning patterns. For instance, some assumed that the retrofit measure with the lowest upfront cost would automatically yield the shortest payback period, or that fossil-fuel systems should always be replaced by electric alternatives. Such rules were applied uncritically, without evaluating performance trade-offs. To mitigate this, prompts can be refined to explicitly emphasize multi-criteria evaluation, such as balancing upfront investment with long-term energy savings. Incorporating counterexamples into prompts may also discourage overgeneralization. Providing structured reasoning templates (e.g., requiring explicit steps for cost, energy savings, and payback) can also help enforce more nuanced inference chains.
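The simple payback relationship the models often skipped can be written in one line: payback equals incremental cost divided by annual energy cost savings, so a cheap measure with small savings can still pay back more slowly than a costlier one. The example below illustrates this with assumed numbers only.

```python
def simple_payback_years(upfront_cost, annual_energy_cost_savings):
    """Simple (undiscounted) payback period in years."""
    return upfront_cost / annual_energy_cost_savings

# Assumed costs and savings for two candidate measures (illustrative only)
air_sealing = simple_payback_years(upfront_cost=600, annual_energy_cost_savings=40)    # 15.0 years
heat_pump = simple_payback_years(upfront_cost=6000, annual_energy_cost_savings=900)    # ~6.7 years

# The lowest upfront cost does not give the shortest payback in this example
print(f"Air sealing: {air_sealing:.1f} yr, Heat pump: {heat_pump:.1f} yr")
```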
Second, we observed a level of variability and uncertainty in LLM-generated outputs atypical of physics-based methods. Because LLMs generate text probabilistically, repeated queries or slight rephrasing produced inconsistent answers, leading to conflicting retrofit recommendations, even when inputs were identical. Such inconsistency limits the suitability of LLMs for high-precision, reliability-critical tasks [39,40], such as hourly energy consumption prediction. Several strategies could address these challenges: (i) fine-tuning LLMs with domain-specific datasets to reduce variability [41]; (ii) distilling large LLMs into smaller, domain-specific models to achieve faster and more consistent inference while retaining reasoning capacity [42]; (iii) employing retrieval-augmented generation to ground responses in validated sources, thereby improving consistency [43]; and (iv) hybrid modeling that uses LLMs for interpreting inputs and generating hypotheses while physics-based simulations validate results, ensuring traceable and repeatable outcomes [25].
Third, in dialogue-based LLM interactions, context carryover may introduce unintended bias when similar but independent queries are asked sequentially. Because models retain elements of prior responses, subsequent answers can be influenced by earlier reasoning, even when each query deserves independent consideration [44]. This undermines reliability and complicates comparisons across scenarios. To counter these effects, users can reset or isolate the conversation context, rephrase questions to stand alone, or provide explicit directives instructing the model to ignore previous content. Conversely, a structured chain-of-thought approach can turn sequential prompting into an advantage: guiding the model step by step breaks complex tasks into smaller logical units, each of which can be checked or refined before proceeding. Users can reinforce this by requesting intermediate reasoning, introducing checkpoints to correct errors, and encouraging the model to “think aloud”. In this way, chain-of-thought prompting leverages the model’s sequential nature for clarity and accuracy rather than allowing context to create unwanted bias [45,46].
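A structured chain-of-thought prompt with explicit checkpoints might look like the sketch below; the step wording is an illustrative assumption rather than a template validated in this study.

```python
# Illustrative chain-of-thought prompt that isolates the query and adds checkpoints
COT_PROMPT = """Ignore any previous conversation; treat this as a standalone analysis.

Work through the following steps, showing intermediate results before moving on:
1. Estimate the baseline annual energy use from the building characteristics provided.
2. Estimate the energy impact of each candidate retrofit measure separately.
3. Estimate upfront cost and annual energy cost savings for each measure.
4. Checkpoint: list any assumptions made so far and flag values you are unsure about.
5. Compute the simple payback period for each measure and rank them.
6. State the final recommendation and the two strongest reasons for it."""

print(COT_PROMPT)
```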
Although LLMs can emulate structured decision logic, their tendencies toward oversimplified reasoning, inconsistent outputs, and context-driven bias mean they should currently be used as transparent, assistive tools rather than as stand-alone, reliability-critical systems for retrofit decision-making.
4.5. Study Limitations and Future Work
While this study systematically evaluated LLMs in household retrofit decision-making tasks, several limitations remain. First, although we selected 28 key parameters from the original ResStock dataset to capture the most influential factors in retrofit decisions, the data still omits important real-world variables such as occupant preferences or region-specific policy incentives. These factors can significantly influence retrofit decisions but are difficult to capture in a standardized prompt. Second, this study did not include a quantitative comparison between the LLM-generated recommendations and those derived from traditional, established engineering standards. While our approach emphasizes accessibility and flexible reasoning through natural language, traditional frameworks offer more deterministic outputs and established validation protocols. Future work could integrate LLM-based and rule-based systems with stakeholder values to balance generalizability, interpretability, and performance in practical applications.
5. Conclusions
This study investigated the capability of six leading large language models (LLMs) to support building energy retrofit decisions, a task where conventional methods often struggle with generalizability and interpretability. Our evaluation, summarized in Table 3, reveals a critical duality in current LLM performance: while they demonstrate a promising ability to reason through technical optimization problems, they fall short when faced with the sociotechnical complexities of real-world economic decisions.
The findings underscore that without any domain-specific fine-tuning, LLMs can produce effective retrofit recommendations, particularly for a technical objective like maximizing CO2 reduction. The high Top-5 accuracy in this context (reaching 92.8%) suggests that while pinpointing the single best measure is challenging, most models consistently identify a near-optimal solution. This competence, however, contrasts sharply with their limited effectiveness in minimizing the payback period, where complex trade-offs between costs, savings, and contextual factors proved difficult for the models to navigate.
This performance dichotomy is reflected across other dimensions. The low consistency among models—especially the tendency for higher-accuracy models to diverge from the consensus—highlights the lack of a standardized reasoning process. The sensitivity analysis confirmed that LLMs correctly identified location and architectural features as primary drivers of performance, approximating the logic of physics-based models. Yet the reasoning analysis revealed that this logic, while structured, remains oversimplified and reliant on generalized assumptions.
The results suggest that LLMs, in their current state, are best suited as technical advisory tools rather than standalone economic decision-makers. Their most effective near-term role is likely within a human-in-the-loop framework, where they can generate a set of technically sound retrofit options that are then evaluated by human experts for feasibility and contextual appropriateness. Such workflows can pre-screen plausible retrofit options and streamline early-stage analysis, reducing the time and effort required while keeping final investment decisions under expert control.
To unlock their full potential, future work should focus on overcoming the limitations identified in this study. Domain-specific fine-tuning is a critical next step to imbue models with a deeper understanding of building physics, regional construction practices, and local economic conditions. Furthermore, developing hybrid AI frameworks that integrate LLMs with physics-based simulation engines or robust cost databases could bridge the gap between simplified logic and rigorous, data-driven analysis. By addressing these gaps, generative AI can evolve into a truly transformative tool for accelerating equitable and effective building decarbonization.