Article

Modular Chain-of-Thought (CoT) for LLM-Based Conceptual Construction Cost Estimation

1 Durham School of Architectural Engineering & Construction, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
2 Civil & Environmental Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
* Author to whom correspondence should be addressed.
Buildings 2026, 16(2), 396; https://doi.org/10.3390/buildings16020396
Submission received: 14 October 2025 / Revised: 29 December 2025 / Accepted: 13 January 2026 / Published: 18 January 2026
(This article belongs to the Special Issue Knowledge Management in the Building and Construction Industry)

Abstract

The traditional cost estimation process in construction involves extracting information from diverse data sources and relying on human intuition and judgment, making it time-intensive and error-prone. While recent advancements in large language models (LLMs) offer opportunities to automate these processes, their effectiveness in cost estimation tasks remains underexplored. Prior studies have investigated LLM applications in construction, but few have systematically evaluated their performance in cost estimation or proposed frameworks for improving their accuracy and reliability through prompt engineering. This study evaluates the performance of pre-trained LLMs (GPT-4o, LLaMA 3.2, Gemini 2.0, and Claude 3.5 Sonnet) for conceptual cost estimation, comparing zero-shot prompting with a modular chain-of-thought (CoT) framework. The results indicate that zero-shot prompting produced incomplete responses with an average confidence score of 1.91 (64%), whereas the CoT framework raised the average confidence score to 2.52 (84%) and achieved significant gains across BLEU, ROUGE-L, METEOR, content overlap, and semantic similarity metrics. The proposed modular CoT framework enhances structured reasoning, contextual alignment, and reliability in estimation workflows. This study contributes by developing a conceptual cost estimation framework for LLMs, benchmarking baseline model performance, and demonstrating how structured prompting improves estimation accuracy, offering a scalable foundation for integrating AI into construction cost estimation workflows.

1. Introduction

Construction cost estimation is a systematic process to project the total expenses required to complete a project for its defined scope. For contractors, precise estimates are essential to remain competitive in bidding, whereas, for owners, these estimates inform critical budgeting, scheduling, and resource allocation decisions [1,2]. An accurate estimation involves a clear understanding of quantities, technical specifications, and prices. At its core, cost estimation involves identifying necessary materials, labor, and equipment, and then assigning suitable prices to the tasks outlined in the project documentation. Through the alignment of financial expectations with primary project goals, this process acts as an important element of effective planning and decision-making, ensuring that stakeholders have a clear understanding of budgetary demands [3]. One of the most significant components of cost estimation is the quantity takeoff—often referred to as the material takeoff—where estimators compile a detailed inventory of all required materials, labor activities, and equipment [4]. This procedure can follow either two-dimensional or three-dimensional approaches, and it generally benefits from standardized practices such as breaking projects into smaller, more manageable segments to improve both speed and accuracy. However, in large or complicated projects, it is easy to overlook even minor tasks or items, which can lead to significant cost overruns or delays [5,6]. Consequently, frequent cross-checking of measurements and calculations is essential. After determining the necessary quantities, estimators move on to assigning prices. Although this step can be time-intensive, it can draw upon several resources: supplier quotations, subcontractor bids, historical data, industry databases, and professional experience. By extracting cost information from multiple sources, estimators can develop comprehensive and reliable forecasts, ultimately helping stakeholders make informed financial decisions and supporting more predictable project outcomes [4,7].
Currently, estimators manually extract cost data from various sources. This not only demands extensive human effort and time but also risks introducing errors that can propagate through estimates [8]. Despite advancements in digital construction tools and AI in construction, estimators still lack an integrated, modular pipeline that can systematically process and extract insights from diverse data inputs such as project specifications, quantities, and historical cost data, which often exist in varied formats and structures. This lack of standardization and automation forces estimators to rely on manual, intuition-based, and subjective approaches, making cost estimation slow, inconsistent, and error-prone. LLMs have already demonstrated immense potential in extracting, synthesizing, and interpreting complex, heterogeneous data across various domains [9,10]. However, the construction industry still lacks a framework that integrates LLMs into the cost estimation process, limiting opportunities for AI-driven estimation solutions. To date, there is a lack of comprehensive studies evaluating how pre-trained LLMs perform when applied to cost estimation workflows in construction. This absence of empirical benchmarking and structured evaluation represents a significant research gap that limits understanding of their practical feasibility and accuracy, and it constrains the industry's ability to integrate AI to automate preconstruction cost planning. To address these challenges, this study evaluates the performance of existing LLMs and proposes a modular CoT prompting framework that improves LLM performance for cost estimation tasks. By structuring LLM interactions into task-specific modules, this framework allows estimators to ingest, analyze, and cross-reference data from multiple sources. This study provides general contractor companies with a flexible and scalable LLM-based framework, supporting their existing workflow automation while improving productivity and accuracy in construction cost estimation.
This study makes noteworthy contributions across theoretical, practical, and methodological dimensions. From a theoretical perspective, it explores how pre-trained LLMs can be leveraged for construction cost estimation tasks, presenting innovative ways in which careful prompt engineering can guide these models to produce targeted outcomes. The developed scenario, detailed tasks, instructions, and CoT prompting framework serve as a practical guide for estimators, offering the flexibility to be adapted as needed. On the methodological front, the study proposes an LLM-integrated modular framework aimed at standardizing and scaling the cost estimation process. This framework is designed to reduce the reliance on manual efforts through task automation, thereby making estimation processes faster and more accurate compared to traditional methods. This paper is organized as follows: Section 2 reviews the relevant literature, and Section 3 outlines the study's methodology. Section 4 details the case study, followed by results and discussions in Section 5. Section 6 addresses the study's limitations and future research directions, and Section 7 summarizes the conclusion.

2. Literature Review

2.1. Existing Cost Estimation Approaches in Commercial Construction

Construction cost estimation methods are broadly categorized into four general groups, which vary in detail, accuracy, and pertinence relative to the project development stage. The overview of major construction cost estimation methods is shown in Table 1.
Construction cost estimation methods vary in detail, precision, and relevance depending on the project phase. The choice of method should align with the project’s level of design completion, available data, and the degree of cost certainty required. As the project evolves, more detailed methods, such as assemblies or unit cost estimating, offer higher accuracy but demand greater time and data resources.
Construction cost estimation is an important and considerably complex activity that demands accuracy, speed, and flexibility at different phases of a project. However, there are a number of challenges to its effectiveness, including time-consuming procedures, a lack of consistency in data, manual errors, and the integration of diverse sources of cost information [13]. These difficulties are compounded by the need to aggregate quantities from diverse formats (2D drawings and 3D models), refer to past data and outside pricing databases, cross-check estimates with project specifications, and review subcontractor bids. Lack of standardization and automation often results in duplication of efforts, omissions, and unmatched units of measure, which can translate into large cost variances and budget issues [14]. The majority of estimators still rely on spreadsheets and manual input, which renders the process slower and more error-prone [15]. A key challenge involves reconciling quantities derived from heterogeneous sources, particularly 2D drawings and BIM (Building Information Modeling) models [16]. Differences in calculation methods or levels of detail can lead to conflicting quantities and substantial cost discrepancies. Inconsistent measurement units and differing classification systems, such as Uniformat II for assemblies and MasterFormat for unit costs, further complicate estimate consolidation. Additionally, cost estimation depends on data from internal historical records and external pricing guides such as RSMeans and Sage [17]. To maintain accuracy, estimators must account for inflation, market fluctuations, and regional labor variations [18], while carefully cross-checking estimates against project specifications to avoid costly assumptions and procurement-stage overruns [19].

2.2. Application of Generative Pre-Trained LLMs in Construction Cost Estimation

Recent advances in generative AI have enabled the development of large language models such as GPT, Gemini, and Llama, which leverage deep learning architectures to generate human-like text based on large-scale pre-training [9,20,21,22]. Transformer-based architectures allow these models to interpret inputs and generate coherent, contextually relevant outputs [9]. GenAI models are commonly categorized as text-to-text, text-to-image, text-to-video/3D, and text-to-task systems, each supporting different application needs [23,24,25,26]. Based on generative mechanisms, these models include Generative Adversarial Networks (GANs), Variational AutoEncoders (VAEs), autoregressive models, diffusion models, and flow-based models, as shown in Figure 1 [9,25,27,28,29,30,31,32,33].
Within the construction sector, emerging exploratory studies highlight the potential of LLMs across various project stages—including feasibility evaluation, design, procurement, on-site execution, and operations/maintenance—by capitalizing on zero-shot learning, few-shot learning, chain-of-thought reasoning, retrieval augmented generation, and fine-tuning [9,10,34]. For example, researchers have proposed frameworks that use prompts to extract detailed information from complex BIM models [22], integrated GPT-based approaches for material selection through few-shot or zero-shot prompting [35], and experimented with ChatGPT-enabled project scheduling [36]. Other applications include identifying potential safety risks via a BERT sentence-pair model tuned on the OSHA database [37] and employing LLMs to plan tasks for construction robots [38,39]. Despite these advancements, only a few published works directly address cost-related tasks. For instance, ref. [40] examined budget forecasting and bill of quantities generation using LLMs, while ref. [41] evaluated a prompt-driven framework employing the Mistral-7b language model to query cost-oriented data in IFC format. However, to the best of the authors’ knowledge, no comprehensive studies have yet explored how pre-trained LLMs might be harnessed to tackle construction estimation in real-world scenarios or integrated seamlessly into existing workflows. Further, there is a lack of literature offering practical guidelines and frameworks on how to use prompt-based techniques efficiently for routine estimation tasks in industry practice. Filling this void is critical to translating LLMs’ potential on paper into actual productivity gains and improved cost estimation practices for construction projects. Therefore, the present study addresses these gaps by systematically evaluating pre-trained LLMs under both zero-shot and modular CoT prompting to measure accuracy, contextual reliability, and structured reasoning in conceptual cost estimation tasks.

2.3. Chain-of-Thought (CoT) Prompting

The recent advances in large language models have given rise to several prompting techniques to improve the quality, correctness, and relevance of output responses. Among the most widely used are zero-shot, few-shot, CoT, and tree-of-thought (ToT) prompting [42,43,44]. Among these, CoT prompting decomposes complex problems into explicit intermediate reasoning steps, enabling models to perform multi-step tasks more effectively [45,46]. This approach improves accuracy, interpretability, and transparency by making reasoning processes explicit, thereby mitigating common “black-box” concerns in AI systems [47,48]. The design of effective CoT prompts generally involves a systematic procedure to ensure logical coherence and completeness. The first task is to define the overarching goal and pinpoint the structural logic necessary for a correct solution [45]. Next, one must decompose the larger problem into smaller, interconnected steps, making sure each intermediate stage is explicitly laid out so the model does not have to infer these steps on its own [49]. Providing detailed exemplars is crucial here, as showcasing the correct reasoning flow helps the model internalize more accurate patterns [50]. A recent study also suggests that multiple, carefully chosen examples in a single prompt boost generalization, enhancing both the accuracy and consistency of CoT-based solutions [51]. Finally, iterative refinement—guided by prompt engineering and feedback loops—enables domain-specific tuning that can further improve structured reasoning [51]. When implementing CoT prompting, users often begin by selecting an appropriate large language model, such as GPT-4, Llama, Claude, or any LLM, given that higher parameter counts generally correlate with more successful CoT outcomes [52]. After choosing the model, practitioners develop a tailored CoT framework that outlines step-by-step prompts relevant to their particular application domain. Multiple rounds of testing and iterative updating are then employed to ensure that the model consistently adheres to the prescribed logical structure [53]. Methods such as programmatic prompt generation and self-consistency sampling can be applied to enhance the reliability of CoT output, such that models are able to provide multiple reasoning pathways and select the most coherent response [45,46]. By taking these implementation steps, businesses can leverage CoT prompting to significantly improve AI-based decision-making, reasoning, and problem-solving across various industries. Therefore, CoT prompting holds high potential for construction cost estimation, where tasks are made up of numerous interdependent elements, including quantities, cost references, specifications, calculations, and formatting. By dividing each phase of the estimation into discrete, unequivocal steps, a CoT-based system closely mimics the logical process of a human estimator. This step-by-step process not only maintains accuracy and transparency of explanation but also generates a traceable record of all intermediate actions.
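To make this concrete, the sketch below shows how a single estimation question could be decomposed into explicit intermediate reasoning steps before being sent to a model. This is a minimal, hypothetical illustration rather than the prompts used in this study; the step wording, fallback order, and example item are assumptions for illustration only.

```python
# Illustrative CoT prompt construction (not the authors' exact prompts).
# The reasoning steps mirror a human estimator's workflow: classify, look up
# historic cost, fall back to an external database, multiply, and report.
COT_STEPS = [
    "Step 1: Identify the Uniformat II assembly code for the item.",
    "Step 2: Look up the item in the enterprise historic cost data; note if it is missing.",
    "Step 3: If missing, fall back to the external cost database and record the source used.",
    "Step 4: Multiply the matched unit cost by the takeoff quantity.",
    "Step 5: State the total cost, the cost source, and any assumptions made.",
]

def build_cot_prompt(item: str, quantity: float, unit: str) -> str:
    """Compose a chain-of-thought prompt that makes each reasoning step explicit."""
    steps = "\n".join(COT_STEPS)
    return (
        "You are a construction cost estimator. Work through the steps below "
        "one at a time and show your reasoning for each step.\n"
        f"{steps}\n\n"
        f"Item: {item}\nQuantity: {quantity} {unit}\n"
        "Answer with one short paragraph per step, then a final line 'Total: $...'."
    )

# Hypothetical usage with an invented item; in practice the item and quantity
# would come from the aggregated takeoff data.
print(build_cot_prompt("B1010 Floors - 4 in. concrete slab on grade", 12500, "SF"))
```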
Construction cost estimation is highly dependent on expert judgment and intuitive processes [54]. Although these dependencies have been practiced for decades, they remain time-consuming, highly subjective, and prone to human error [55]. Also, estimators often contend with multiple software platforms and data formats, adding layers of inefficiency. Construction projects are time-sensitive in nature, and even minor delays in producing estimates negatively impact overall schedules and budgets. A major challenge arises in quantity takeoffs, where methods can differ substantially between 2D and 3D models. Estimators often face inconsistent naming conventions, document formats, and file structures, which increase error risks. In current industry practice, historical cost data is underutilized, and estimators generally rely on personal judgment instead of centralized databases, slowing the estimation process and reducing accuracy [56]. Without a systematic data integration approach, aligning estimates with benchmarks is difficult and hampers competitiveness. When internal data falls short, estimators manually search external cost databases like RSMeans or Sage, extending timelines. The verification of project specifications requires manual cross-referencing, while Work Breakdown Structure (WBS) variations add complexity, delaying conceptual estimates and increasing errors. In addition, evaluating subcontractor estimates is a time-intensive process, as it requires manual verification for completeness, unit rates, and template consistency. The selection often relies on subjective factors like past relationships rather than standardized criteria. Compiling bids and preparing final estimates can take weeks, introducing additional delays. The lack of standard version control practice makes it difficult to track changes or maintain a history of costs, which often leads to repeated rework. These estimation challenges highlight the lack of consistency in cost estimation practice, which can result in overall process inefficiencies, ultimately leading to delays and budget overruns [57]. The rise of LLMs presents an opportunity to automate processes, integrate data more effectively, and improve accuracy, offering a scalable solution to longstanding industry inefficiencies. Generative Pre-trained Transformers (GPT), among the most prominent LLM architectures, demonstrate advanced natural language processing and content generation capabilities. While initial studies have applied LLMs to various construction activities, there has been no investigation of how existing pre-trained models perform under zero-shot conditions in cost estimation or whether their outputs can be improved through prompt engineering. A significant gap remains in providing industry professionals with a clear, evidence-based framework for integrating this generative AI technology into current cost estimation practices.
The main objective of this study is to evaluate the LLMs’ performance on conceptual cost estimation by establishing a zero-shot baseline and then investigating the effectiveness of the modular CoT framework in improving accuracy. To achieve the main objective, the research is structured around several guiding Research Questions (RQs):
  • RQ1: To what extent can state-of-the-art general-purpose LLMs perform construction cost estimation workflow tasks under zero-shot prompting, without additional instructions or data?
  • RQ2: How effectively does CoT prompting improve the performance of LLMs in executing construction cost estimation tasks?

3. Methodology

This study followed a structured methodology to develop and evaluate the application of LLMs in modular data extraction for construction cost estimation. The methodology, as shown in Figure 2, was divided into three sequential steps: (i) scenario development, (ii) zero-shot testing for the scenario with existing pre-trained LLMs, and (iii) an experiment applying a modular CoT approach. The first step involved designing a conceptual estimation scenario to test the performance of existing LLMs in addressing the identified tasks. The detailed tasks for the scenario were based on the burdens, tasks, and sub-tasks of the current estimation process identified in our previous study [58]. The detailed table is presented in Appendix A. The experimental scenario was then used to test LLMs' existing capabilities, specifically their ability to learn and respond accurately without prior training (zero-shot learning). This step provided insights into the current strengths and limitations of LLMs in executing cost estimation tasks and identified areas requiring refinement or additional prompting strategies. The final step applied an experimental approach to evaluate the effectiveness of modular CoT prompting. This involved structuring LLM interactions using a systematic reasoning framework to enhance their ability to generate accurate, context-aware responses. The outcome of this experiment provided proof of concept for the implementation of modular sequential CoT prompting in construction cost estimation, demonstrating its potential to reduce estimation burdens and improve the existing process, ultimately enhancing efficiency. The findings from each step contributed to refining LLM-driven approaches, demonstrating their practical applicability in real-world cost estimation workflows within the construction industry. This study ultimately aimed to validate the potential of LLMs to support estimation process automation and improve efficiency in the construction sector.

3.1. Proposed LLM Framework

The proposed framework streamlines the construction cost estimation process by integrating LLMs into a structured workflow, enhancing automation, accuracy, and adaptability. As illustrated in Figure 3, the proposed framework serves as an intelligent intermediary between various construction data sources, file types, and estimators, helping to reduce the burdens associated with manual estimation methods. Traditional cost estimation often involves labor-intensive data extraction, analysis, and interpretation, leading to inconsistencies and potential errors. By leveraging AI-driven automation, this framework processes vast and complex datasets with improved efficiency, ensuring standardization and precision in cost estimation. At the framework's core lies the LLM, which dynamically interacts with multiple data sources, including Building Information Modeling (BIM)-generated quantity takeoffs, multiple cost databases, estimation templates, LLM instructions, and project specifications. The multiple cost databases provide options to select preferred material and labor rates, while templates and specifications define consistent output and project-specific requirements. By aggregating and interpreting these diverse inputs, the system generates accurate estimates per requirements. In this framework, the zero-shot testing serves as the baseline scenario for evaluating model performance, providing a reference point against which improvements from the modular CoT prompting approach are measured. The interactive capability allows estimators to engage directly with the LLM through prompts. Estimators can tune prompt parameters, set priorities, request detailed explanations, show tables, generate files, and request insights during the estimation process. The LLM automates not only cost-related calculations but also allows for interactive Q&A, context-aware recommendations, and instant adjustments. This dynamic, two-way interaction keeps the process transparent and enhances confidence in decisions while reducing the possibility of errors. Another strength is the framework's adaptability to an ever-evolving landscape of LLMs: because it is built on a modular prompt structure, it can incorporate more advanced models as they emerge, improving reasoning, contextual understanding, and accuracy, and it remains open to adaptation through model updates and fine-tuning. The system's outputs are structured estimates, completed estimation tasks and sub-tasks, and visualized cost insights that make decisions easier and more efficient. By embedding AI in the estimation workflow, the framework minimizes manual effort and costly errors and accelerates the process as a whole, ultimately yielding a fast, dependable, and scalable cost estimation system that adapts to future changes in AI technology.

3.2. Cost Estimation Scenario for Existing LLMs

To evaluate the effectiveness of pre-trained LLMs in construction cost estimation, a conceptual cost estimation scenario was developed based on the detailed tasks identified in a previous study [58]. The previous study identified cost estimation burdens in three categories, each representing a key aspect of the construction estimation workflow: (1) Conceptual Estimation, (2) Evaluating Subcontractor Estimates, and (3) Change Management, Version Control, and Data Recycling. The study [58] then mapped burdens to tasks, transforming abstract challenges into smaller tasks and processes that could be analyzed for automation potential. Following this, the study further decomposed all identified tasks into granular sub-tasks, refining the workflow into smaller, manageable components that LLMs could understand. This hierarchical breakdown—from broad burdens to actionable tasks and detailed sub-tasks—provided a structured foundation for integrating LLMs into cost estimation workflows. That study [58] drew on construction industry insights through direct engagement with industry practitioners, ensuring that the identified tasks and sub-tasks were not only theoretically relevant but also practically applicable to real-world construction operations. This involved digging deeper to understand the mechanics behind each burden and identifying where processes break down, as well as the specific details that make certain tasks more cumbersome. In this study, we developed Scenario 1—conceptual estimation—which focuses on initial cost estimation for general contractors, where estimators must work with limited project information to generate a reliable cost projection. This scenario includes tasks such as Aggregating Quantities, Referencing Enterprise Historic Cost, Referencing External Cost Databases, and Cross-Verification with Project Specifications. Figure 4 illustrates how each task and sub-task is mapped to specific user queries and expected LLM responses (Appendix B). The structured sequential dialogue between the user and LLM simulates realistic estimator interactions, ensuring that the evaluation measures logical consistency, task interconnections, and accuracy of AI-generated responses.
To evaluate the LLMs’ performance, this study tested Scenario 1 using zero-shot prompting and then implemented a CoT framework to assess performance improvements. By developing task-driven scenarios for evaluation, this study replicates real-world estimation processes, allowing for analysis of existing pre-trained LLMs’ effectiveness in construction cost estimation. Specifically, the prompts were manually developed based on the identified tasks and corresponding subtasks within the proposed scenario. Each subtask was translated into a natural language prompt designed to evaluate the model’s ability to perform specific estimation-related functions. We conducted three experimental iterations for each LLM. After each round, the prompts were refined based on model responses to enhance clarity, task specificity, and alignment with expected outcomes. The cost classification and quantity take-off process in this study followed the Uniformat II standard [58], which organizes building elements by functional systems (e.g., substructure, superstructure, finishes) rather than by materials or trades. This system provides a consistent structure for comparing costs across different project types and ensures alignment between estimation modules and standardized building components.

3.3. Evaluation of Existing LLMs

Five common evaluation metrics—BLEU, ROUGE-L, METEOR, Content Overlap, and Semantic Similarity—measure whether LLM outputs are not only numerically correct but also contextually accurate, linguistically coherent, and semantically aligned with expected responses [59,60,61,62,63]. A single metric cannot fully evaluate LLM performance on cost estimation text generation, where technical accuracy, structured reasoning, and factual completeness are equally important. The Bilingual Evaluation Understudy (BLEU) score [64] is a standard metric for evaluating text generation models by comparing their outputs to reference texts. It operates on n-gram precision and penalizes overly short responses through a brevity penalty factor:
BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
where $p_n$ represents the precision of n-grams, $w_n$ is a weight distribution, and $BP$ is a brevity penalty accounting for length differences. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) [65] extends beyond n-gram precision by evaluating the longest common subsequence (LCS) between generated and reference texts:
ROUGE\text{-}L = \frac{LCS(X, Y)}{|Y|}
where $X$ is the generated response, and $Y$ is the reference text. Unlike BLEU and ROUGE, the Metric for Evaluation of Translation with Explicit Ordering (METEOR) [66] incorporates stemming, synonym matching, and function word weighting, offering a more nuanced reflection of semantic alignment. It is computed as:
METEOR = F_{mean} \times (1 - penalty)
where $F_{mean}$ is the harmonic mean of precision and recall, and the penalty term discourages fragmented matches. Content overlap measures the Jaccard similarity [67] between words in generated and reference texts:
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
where $A$ and $B$ are the sets of words in the generated and expected responses, respectively. Semantic similarity, unlike content overlap, focuses on the meaning preservation between texts rather than exact word matches [62,68]. A common approach is cosine similarity over embedding vectors:
\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}
where $X$ and $Y$ are the embedding representations of the generated and output texts.
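As a concrete illustration of the last two formulas, the sketch below computes content overlap (Jaccard similarity) and a bag-of-words cosine similarity for a pair of responses. It is a minimal stand-in assuming whitespace tokenization; in practice BLEU, ROUGE-L, and METEOR would be computed with standard libraries (e.g., nltk, rouge-score), and semantic similarity with a sentence-embedding model rather than word counts.

```python
# Minimal sketch of the content-overlap (Jaccard) and cosine-similarity formulas
# above, computed over whitespace tokens / bag-of-words counts.
from collections import Counter
from math import sqrt

def jaccard(generated: str, reference: str) -> float:
    """Content overlap: |A intersect B| / |A union B| over word sets."""
    a, b = set(generated.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine_bow(generated: str, reference: str) -> float:
    """Cosine similarity over bag-of-words vectors (a crude stand-in for embeddings)."""
    x, y = Counter(generated.lower().split()), Counter(reference.lower().split())
    dot = sum(x[w] * y[w] for w in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

# Hypothetical generated/reference pair, for illustration only.
gen = "Total exterior wall cost is $182,000 based on historic unit rates"
ref = "The exterior wall cost totals $182,000 using enterprise historic unit rates"
print(round(jaccard(gen, ref), 3), round(cosine_bow(gen, ref), 3))
```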
To evaluate the performance of pre-trained LLMs in construction cost estimation tasks using Scenario 1, this study employed three widely used natural language (NL) evaluation metrics—BLEU, ROUGE-L, and METEOR—and compared four leading models available as of January 2025: GPT-4o, LLaMA 3.2, Gemini 2.0, and Claude 3.5 Sonnet. These models were selected for their state-of-the-art architectures, their strong performance across various benchmarks, and their demonstrated capabilities in technical reasoning, retrieval-augmented generation, and structured output generation. GPT-4o, developed by OpenAI, is recognized for its superior multimodal reasoning and domain adaptation capabilities [69]. LLaMA 3.2, an open-source model from Meta, is optimized for efficient fine-tuning and domain-specific applications [70]. Gemini 2.0 from Google DeepMind leverages retrieval-augmented generation (RAG) for complex knowledge-intensive tasks, while Claude 3.5 Sonnet is designed for long-context reasoning and structured interpretability [71,72]. BLEU, originally developed for machine translation, measures n-gram precision between generated and reference texts, making it suitable for assessing word-level accuracy [64]. However, BLEU has been criticized for not considering semantic meaning or fluency, which is why recent research integrates additional contextual similarity measures alongside it [73,74]. To address BLEU's limitations, ROUGE-L was incorporated to evaluate phrase-level recall and longest common subsequence (LCS) matches, making it particularly relevant for cost estimation, where key terms and structured phrases must align with expected outputs [65,75]. Additionally, METEOR was used, as it improves upon BLEU by incorporating synonym matching, stemming, and precision-recall balancing, offering a more linguistically robust assessment of generated text [66]. Applying several metrics ensures a comprehensive evaluation, capturing both lexical and semantic accuracy, which is essential for AI-generated responses in structured domains like construction cost estimation. The results, as shown in Table 2, indicate that GPT-4o performed best across all metrics, with a BLEU of 0.023, a ROUGE-L of 0.185, and a METEOR of 0.196, indicating stronger word correctness, structural coherence, and semantic accuracy. Based on these comparative results, which showed consistently higher accuracy, contextual coherence, and reasoning quality, GPT-4o was selected as the baseline platform for implementing and testing the modular CoT framework.
To assess how comprehensive, accurate, and consistent GPT-4o-generated answers are on construction cost estimation tasks, a human evaluation was conducted. Compared to metrics-based evaluation measures, human evaluation captures contextual appropriateness and logical cohesion in LLM-generated answers, allowing the model's domain-specific reasoning to be traced more adequately [45,76]. This study conducted a qualitative evaluation using confidence scores of 1, 2, and 3, reflecting the quality of the response and how closely responses matched the expected output outlined in the given scenario. A score of 1 (low confidence) was assigned if the response was incomplete or inaccurate. A score of 2 (medium confidence) reflected a partially correct response but lacked full accuracy or completeness. In addition, a score of 3 (high confidence) was given when the response was fully accurate, complete, and aligned with the expected response. The completeness criteria for human evaluation are shown in Appendix C.
As shown in Figure 5 and Figure 6, the human evaluation highlighted notable deficiencies in zero-shot prompting. Figure 5 compares an expected structured response against GPT-4o's actual zero-shot output. The expected response follows a logical approach, prioritizing historical cost data before falling back on database cost information. However, the LLM's zero-shot response lacks predefined priority logic, failing to specify how costs should be matched when multiple sources are available. Such shortcomings reduce the reliability of outcomes in the cost estimation workflow, indicating the necessity of structured prompting strategies like CoT [42,45]. Similarly, Figure 6 illustrates another zero-shot limitation, where the model fails to follow a step-by-step reasoning process for mapping quantities and defining conditions to use external cost data. Instead, it provides a generic explanation without prioritization or alignment with the expected Uniformat II-based workflow, further emphasizing the need for structured reasoning. The human evaluation yielded an average confidence score of 1.906 (64%), indicating that responses frequently lacked completeness and correctness. This aligns with a previous study that emphasizes the limitations of LLMs in domain-specific reasoning without explicit reasoning prompts [77].

4. Case Study: Modular Chain-of-Thought Prompting for Conceptual Estimation

To investigate the practical applicability of the proposed framework, a case study was conducted by implementing a modular chain-of-thought prompting approach within a commercial construction project in Omaha, Nebraska. This experiment sought to evaluate the effectiveness of an LLM in facilitating structured reasoning and task execution within conceptual cost estimation. In practice, cost estimators adhere to standard classification systems such as Uniformat II and proceed through a sequence of integrated activities, i.e., data extraction, structuring, and calculation of material quantities and costs. By combining modular components in a systematic manner, this research proposes an NL query-based modular prompting methodology to automate the workflow, standardize estimation procedures, reduce cognitive overload, and improve decision-making in construction cost management, optimizing both workflow accuracy and efficiency.

4.1. Data

The Revit model of a commercial construction project in Omaha, Nebraska, which contained all building components with their associated parameters, was utilized to extract quantity information. Using Revit 2024’s Schedule/Quantities tool, a structured dataset was generated for Level 3 individual elements classified under Level 1 major group elements of UniFormat II, specifically B-Shell components such as B1010 Floors, B1020 Roofs, B2010 Exterior Walls, B2020 Exterior Windows, and B2030 Exterior Doors, as illustrated in Figure 7. This process involved selecting relevant element categories, specifying essential parameters—including assembly code, family & type, unit, and area—and formatting the schedule accordingly. Once the schedule was generated, it was exported into a tabular dataset containing detailed quantity takeoffs. The final quantities file, comprising 46 distinct items, organized the extracted data into key attributes: element type, assembly code, family & type, unit, and quantity. Floors and roofs were quantified in square feet, such as concrete slabs and truss framing, whereas doors and windows were recorded as individual units with detailed specifications.
To support the cost estimation scenario, two distinct datasets were developed: (1) an external cost dataset extracted from RS Means and (2) a historic cost dataset representing an enterprise cost database. These datasets were created to introduce variations and complexities commonly faced in real-world construction cost estimation, to evaluate the LLM's interaction with diverse data sources, and to pose challenges similar to those faced by professional estimators. The data used in the case study are provided as Supplementary Materials submitted along with the manuscript.
The external cost dataset was generated by extracting cost information from RS Means 2024 for “Commercial New Construction”, specifically for the “Assembly” cost category, in “Omaha”, using the “2023 Quarter 4” release. The dataset consisted of 100 items corresponding to Level 3 individual components under the Level 1 major group elements of B-Shell, including B1010 Floors, B1020 Roofs, B2010 Exterior Walls, B2020 Exterior Windows, and B2030 Exterior Doors. However, this dataset did not include all items listed in the quantities file but covered the majority of them. Additionally, variations were introduced by incorporating (i) items with exact descriptions matching those in the quantities dataset, (ii) missing items, (iii) completely different items not present in the quantities dataset, and (iv) items with descriptions that were similar but not identical. To introduce further complexity, column names in the external cost dataset were intentionally modified to differ slightly from those in the quantities dataset. For example, the column containing item descriptions was labeled as “Description” in the external dataset, whereas it was named “family & type” in the quantities dataset.
A historic cost dataset was created to simulate a construction enterprise cost database, which is commonly used in the construction industry. Unlike the external dataset, this database was manually curated to reflect real-world variations and inconsistencies typically encountered by estimators. The dataset contained 125 items from the same Level 3 component categories of B-Shell: B1010 Floors, B1020 Roofs, B2010 Exterior Walls, B2020 Exterior Windows, and B2030 Exterior Doors. To replicate real-world estimation challenges, this dataset included additional items beyond those in the external dataset and introduced discrepancies in material costs, installation costs, and total unit costs for identical items. As with the external dataset, variations were introduced in the form of (i) items with exact descriptions, (ii) missing items, (iii) entirely different items absent in the quantities dataset, and (iv) items with closely related but non-identical descriptions. These two cost datasets (Figure 8) were designed to expose the LLM to multiple data sources with variations in content, formatting, and descriptions, requiring it to interpret and integrate cost information while navigating inconsistencies, missing data, and variations in terminology. The structured, yet intentionally inconsistent nature of the datasets aimed to mimic the real-world complexities of a construction cost estimation scenario, where estimators generally refer to different data formats, account for variations in cost parameters, and identify the most relevant pricing information. By simulating these challenges, the study aimed to evaluate the LLM's ability to process and synthesize cost data effectively, improving its reliability in practical estimation workflows. In addition, this study used construction specification data that was synthetically generated to reflect standard practices for commercial building projects, with a focus on the concrete floor system. The dataset included material properties, installation methods, quality assurance measures, and testing requirements, following industry standards such as ACI and ASTM. The major parameters covered were concrete mix design, reinforcement details, curing methods, and compliance criteria. The specification data was created to be fed into the LLM whenever needed for cross-referencing.
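To make the data handling concrete, the sketch below shows one way the quantities file and the two cost datasets could be combined, prioritizing enterprise historic cost and falling back to the external database. The file names, any column labels beyond those quoted in the text ("family & type", "Description"), and the assumption of one-to-one item matches are hypothetical; in the case study it is the LLM itself, guided by the modular prompts, that reconciles these sources rather than a fixed script.

```python
# Hypothetical sketch of joining the three datasets described above with pandas.
# Assumes unique item descriptions (one-to-one matches) and an assumed
# "total unit cost" column in both cost files.
import pandas as pd

qty = pd.read_csv("quantities.csv")        # element type, assembly code, family & type, unit, quantity
hist = pd.read_csv("historic_costs.csv")   # family & type, total unit cost (assumed columns)
ext = pd.read_csv("external_costs.csv")    # Description, total unit cost (assumed columns)

# Match takeoff items against historic costs first, then against external costs.
merged = qty.merge(hist[["family & type", "total unit cost"]],
                   on="family & type", how="left")
fallback = qty.merge(ext.rename(columns={"Description": "family & type"})
                        [["family & type", "total unit cost"]],
                     on="family & type", how="left")

# Use the historic cost when available, otherwise fall back to the external database.
merged["unit cost"] = merged["total unit cost"].fillna(fallback["total unit cost"])
merged["cost source"] = merged["total unit cost"].notna().map({True: "historic", False: "external"})
merged["total cost"] = merged["unit cost"] * merged["quantity"]
print(merged[["family & type", "quantity", "unit", "cost source", "total cost"]].head())
```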

4.2. Modular Framework for the Scenario

The modular framework for conceptual cost estimation, as shown in Figure 9, was designed to systematically address the complexity of estimation tasks while ensuring adaptability and scalability for AI-assisted workflows. Traditional cost estimation often suffered from fragmented data structures, inconsistencies in cost references, and challenges in retrieving quantity and cost information. These challenges emerged with pre-trained LLMs when zero-shot prompting was used for multi-step tasks. These tasks needed to be presented in a clear order; otherwise, LLMs lost track of relationships and context, leading to incomplete reasoning and unreliable estimates. To overcome these limitations, a structured modular approach was developed for the scenario of creating the conceptual estimate, breaking down the cost estimation burdens, tasks, and sub-tasks identified earlier into four core modules: (1) aggregating quantities, (2) referencing enterprise historic cost, (3) referencing external cost database, and (4) cross-verification with project specifications. Each module was also decomposed into tasks and sub-tasks so that all interdependencies were well established, enabling a stepwise execution process that is logically coherent. One benefit of the modular structure is flexibility: each module can be executed independently, so users can work on a specific module without needing to execute all the others. For example, if a user simply wants to look at historical cost data, they can enter keywords or invoke Module 2 without having to perform quantity aggregation or external cost referencing. This modular execution capability accommodates different project requirements without restricting users to a rigid workflow. For instance, when estimating a commercial building project in Omaha, Nebraska, an estimator may only need Module 2 (Enterprise Historic Cost) and Module 4 (Cross-Verification with Project Specifications) while excluding external cost referencing altogether. This modularity allows individuals to tailor the framework to their specific estimation needs without affecting the integrity of the workflow, for example, by adding, removing, or adjusting modules, tasks, and sub-tasks, which supports various project types, data availability, and workflow variations without major reconfigurations. For example, in the event a new cost database is introduced, it can be integrated as another module without affecting the existing workflow. Beyond straightforward applications, the framework is adaptable to various LLMs and can evolve with future innovations in LLMs. Since current LLMs struggle with multi-step reasoning in unstructured workflows, this structured approach ensures that activities are logically dependent and sequenced, reducing reliance on open-ended inference. With the evolution of LLMs, particularly long-context reasoning, the system can leverage these advances by enhancing inter-task dependency monitoring.
This modular architecture is an AI-enhanced, scalable solution that enables construction professionals to streamline and automate estimation processes and to customize execution according to project-specific needs. As LLM technology evolves, the structured framework ensures AI-aided cost estimation remains future-proof and adaptable while supporting various construction situations without disrupting the core estimation logic.

4.3. CoT Instructions & Architecture

The instruction set for the modular cost estimation framework was developed to provide a structured and flexible approach to estimation. It follows a modular CoT system, where tasks and sub-tasks are arranged hierarchically to maintain logical consistency and task dependencies. The structured modular design aimed to address a key limitation identified in zero-shot prompting, in which LLMs struggle with multi-step reasoning when instructions are not explicitly sequenced [46]. Each step was designed using user queries and system outputs to give this structure a logical sequence rather than relying on free-form AI responses. This modular structure, as shown in Figure 10, also makes it possible for users to interact with individual modules independently while offering flexibility in the overall estimation process.
The instructions, as shown in the example in Figure 11, were framed as a dialogic interaction between the user and the system. They comprise 40 user prompt examples, 40 system response examples, and 35 sub-tasks linked by 35 task connectors. The user submits project details, quantities, cost data references, and estimation preferences, and the system responds with guided prompts, explanations, and formatted outputs. Each module consists of a chain of interrelated tasks carried out conditionally based on the previous inputs. For example, in Module 1 (Aggregating Quantities), the system first collects estimation details, confirms the work breakdown structure (WBS), and compiles quantity data before moving to Module 2 (historic cost referencing). If historical cost data is unavailable, the system automatically redirects execution to Module 3 (External Cost Database) to retrieve the missing information. The modular design ensures the workflow follows a sequential pattern, reducing mapping errors and inconsistencies while improving process transparency. One of the advantages of the modular CoT instruction system is that users can add, remove, or modify modules, tasks, and sub-tasks based on project needs [42]. This makes the framework scalable for different construction projects, estimation approaches, and data sources. For instance, if a new database of costs is released, it can be added as another module without any changes to the current workflow. Similarly, as LLMs improve in long-context reasoning and task dependency tracking, the framework can incorporate more advanced AI-assisted decision-making. The structured modular CoT approach aligns with AI advancements, demonstrating that explicitly defined reasoning pathways improve AI performance in complex multi-step tasks [48].
The modular CoT architecture illustrated in Figure 12 is a systematic and flexible framework for cost estimation. It ensures that tasks are executed in a logical, hierarchical sequence while permitting modular flexibility. Explicit task ordering is particularly vital in an AI-integrated workflow for maintaining logical progression and awareness of context. This architecture is specified through a multi-layered module structure in which the estimation process is divided into modules, tasks, and sub-tasks with clear execution steps and dependencies. Each module (e.g., Module 1: Aggregating Quantities) is subdivided into tasks (e.g., 1.1: Specify Estimation Type, 1.2: Collect Quantities) and further into sub-tasks (e.g., 1.1.1: Confirm WBS, 1.1.2: Validate Scope), ensuring sequential and conditional execution. This numbering structure prevents logical drift by maintaining strict task dependencies, requiring lower-level sub-tasks (e.g., 1.2.3: Remove Duplicates) to be completed before higher-level transitions occur. The organized segmenting allows the system to behave sequentially or conditionally so that each module can be treated as a standalone process or an integrated workflow. At the highest level, this architecture consists of modules representing major estimation components of conceptual estimation scenarios, such as aggregating quantities, cost referencing, and specification cross-verification. Each module contains tasks that define specific cost estimation objectives, which are then broken down into sub-tasks for data collection, validation, and decision-making. This structure enables LLMs to carry out tasks step by step, based on instructions, reducing errors and enabling transparency in cost estimation. The user's input is an NL query, and the LLM converts it into the relevant module through the module calling function. After completing a module, the system proceeds to the next module according to the instructions provided beforehand. This approach keeps each step of the estimation process connected to the others.
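The sketch below renders this module/task/sub-task numbering as a plain data structure with a strict execution order. The entries are abbreviated from the examples cited above; module and task names not mentioned there, and the absence of listed tasks for Modules 2–4, are simplifications for illustration only.

```python
# Illustrative module -> task -> sub-task hierarchy; numbering keys mirror the
# dependency ordering described in the text (not the full instruction set).
MODULES = {
    "1": {"name": "Aggregating Quantities", "tasks": {
        "1.1": {"name": "Specify Estimation Type",
                "sub_tasks": {"1.1.1": "Confirm WBS", "1.1.2": "Validate Scope"}},
        "1.2": {"name": "Collect Quantities",
                "sub_tasks": {"1.2.3": "Remove Duplicates"}},
    }},
    "2": {"name": "Referencing Enterprise Historic Cost", "tasks": {}},
    "3": {"name": "Referencing External Cost Database", "tasks": {}},
    "4": {"name": "Cross-Verification with Project Specifications", "tasks": {}},
}

def execution_order(module_id: str):
    """Yield task and sub-task IDs in strict numeric order so lower-level
    sub-tasks always complete before higher-level transitions occur."""
    tasks = MODULES[module_id]["tasks"]
    for task_id in sorted(tasks):
        yield task_id
        for sub_id in sorted(tasks[task_id]["sub_tasks"]):
            yield sub_id

print(list(execution_order("1")))  # ['1.1', '1.1.1', '1.1.2', '1.2', '1.2.3']
```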

4.4. Module Calling Function

The module calling function is designed to dynamically select and execute modules based on user input. By enforcing a consistent decision-making approach, this function prevents arbitrary execution of multi-step processes and maintains task dependencies. The module calling function is illustrated in Figure 13. It processes NL prompts, parses them, maps queries to the appropriate modules, confirms selections, and executes the chosen module while verifying that all prerequisite tasks are completed. The execution process begins with user input analysis, where the system extracts key terms and matches them to a predefined module dictionary. When a user submits a prompt, the system first conducts a task analysis and module selection process, evaluating the request against predefined categories. If an exact match is not identified, the system asks for further clarification, either by prompting the user for additional details or by suggesting the most suitable modules based on the detected keywords. This step is designed to prevent misclassification by providing the user with a list of current modules and subsequent steps with detailed descriptions for confirmation. This confirmation also supports active user involvement in the process. After the user's confirmation, the function executes the module by carrying out its tasks and sub-tasks in logical order. To further support task automation, this function utilizes a dictionary-based keyword mapping system that links specific terminology to predefined modules. For example, commonly used terms such as “Estimation Type” and “WBS” are automatically assigned to Module 1: Aggregating Quantities. This dynamic module calling function eliminates ambiguity and allows the system to map user queries to the correct estimation module consistently across different estimation scenarios.
The main workflow, which includes analyzing the request, confirming module selection, and executing the corresponding module, is shown in Figure 14.
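A minimal sketch of the dictionary-based keyword mapping is shown below. The keyword list is an assumption extrapolated from the two examples given in the text (“Estimation Type” and “WBS” mapping to Module 1); the actual module calling function additionally confirms the selection with the user and verifies prerequisite tasks before execution.

```python
# Hypothetical keyword-to-module dictionary; only the Module 1 keywords are
# taken from the text, the rest are illustrative assumptions.
KEYWORD_TO_MODULE = {
    "estimation type": 1, "wbs": 1, "quantities": 1,
    "historic": 2, "historical cost": 2,
    "external": 3, "rsmeans": 3, "database": 3,
    "specification": 4, "cross-verification": 4,
}

def select_module(user_prompt: str):
    """Map a natural-language prompt to candidate modules by keyword matching.
    An empty list means no match was found and the system should ask the user
    for clarification or suggest modules."""
    text = user_prompt.lower()
    hits = {module for kw, module in KEYWORD_TO_MODULE.items() if kw in text}
    return sorted(hits)

print(select_module("Confirm the WBS and aggregate quantities for Level 3 items"))  # [1]
print(select_module("What does the spec say about curing?"))                        # [] -> ask to clarify
```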

5. Results & Discussion

This study employed two evaluation approaches to assess GPT-4o's overall performance in conceptual cost estimation, combining qualitative and quantitative methods. The qualitative evaluation involved human ratings of confidence in the LLM's output and of module following, where confidence reflects the rated quality of the response and module following assesses the model's ability to follow predefined workflows. Meanwhile, the quantitative evaluation was conducted using five widely used NLP metrics—BLEU, ROUGE-L, METEOR, Content Overlap, and Semantic Similarity—to measure response accuracy, fluency, and contextual alignment.

5.1. Qualitative Evaluation

Human evaluation is important in assessing the effectiveness of LLMs, particularly in complex and connected tasks such as cost estimation. Unlike automated metrics, human evaluation captures reasoning quality, contextual accuracy, and real-world applicability, which are essential for evaluating estimation-related tasks [45,78,79,80]. This qualitative assessment was intentionally designed as a systematic human evaluation framework to assess the reasoning and contextual accuracy of LLMs, particularly since cost estimation involves multi-step decision-making and contextual interpretation that current quantitative metrics cannot fully capture. The evaluation process followed a structured and replicable scoring protocol, where responses were rated independently across three well-defined criteria: completeness, accuracy, and module alignment. This study rated confidence scores on a scale of 1 to 3, based on how complete, accurate, and aligned each response was with the correct module. A score of 1 represented low confidence, i.e., the response was incomplete, incorrect, or did not correspond to the expected module. A score of 2 represented medium confidence, in which the response was partially correct and corresponded to the correct module, but was not completely accurate or complete. Finally, a score of 3 represented high confidence, i.e., the response was completely accurate, complete, and corresponded to the correct module. The average confidence score of 2.52 under modular CoT prompting (2.52/3 ≈ 84%) marked a 20-percentage-point improvement over the zero-shot evaluation (1.91, or 1.91/3 ≈ 64%) for the provided cost estimation scenario, confirming that structured, step-by-step reasoning enhances both response reliability and alignment with estimation logic. Figure 15 illustrates comparative examples of zero-shot vs. CoT responses, highlighting improvements in response quality and module adherence.

5.2. Quantitative Evaluation

In evaluating the performance of GPT-4o on cost estimation tasks, this study compared zero-shot and CoT prompting across five key metrics: BLEU, ROUGE-L, METEOR, Content Overlap, and Semantic Similarity. Table 3 presents a comparison of the average CoT performance scores across all metrics, evaluated on 24 sub-tasks from Modules 1 and 2 in Scenario 1—Conceptual Estimation. These results are compared with the previously conducted zero-shot evaluation, providing a quantitative assessment of how CoT prompting influences the quality of generated responses. Each metric captures a different aspect of text generation: BLEU measures exact word match precision, ROUGE-L evaluates phrase-level recall, METEOR incorporates synonym and stemming considerations, while content overlap and semantic similarity assess factual consistency and conceptual alignment, respectively.
In the conceptual estimation scenario, the zero-shot BLEU score was 0.0234, indicating little direct word alignment between the generated and expected responses. With CoT prompting, BLEU rose to 0.3824, a 1536.43% increase. This sharp improvement suggests that CoT guides the model through a more structured generation process, reducing errors in word choice and improving syntactic accuracy. The zero-shot ROUGE-L score of 0.1852 indicates relatively weak phrase alignment and coherence; CoT prompting boosted ROUGE-L to 0.6228, a 236.28% improvement, showing that CoT-driven responses overlap more closely with the expected responses and that the logical connections embedded in the prompt help produce more contextually aligned outputs. The zero-shot METEOR score of 0.1968 indicated poor conceptual resemblance between expected and generated texts; with CoT prompting, METEOR rose to 0.6109, a 210.43% improvement, suggesting that CoT enhances the model’s ability to select appropriate word choices, improving fluency, synonym integration, and contextual awareness. The zero-shot content overlap was just 0.1091, indicating that most words in the generated text were not directly present in the reference; CoT increased this value to 0.4970, a 355.75% improvement, underscoring how CoT prompting improves the inclusion of key factual elements and helps the model generate responses that align more precisely with the expected content. Finally, the zero-shot semantic similarity score was 0.2452, reflecting a substantial gap in conceptual alignment between model output and reference responses; CoT prompting increased this score to 0.5970, a 143.48% improvement, suggesting that CoT enables GPT-4o to generate responses that are more semantically meaningful and better aligned with the intended concepts rather than merely syntactically similar.
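For clarity on how such scores can be produced, the sketch below (not the authors’ evaluation code) computes the five metrics for a single reference/candidate pair; the tokenization choices, BLEU smoothing, the Jaccard-style overlap formula, and the embedding model are illustrative assumptions that may differ from the study’s exact configuration:

```python
# Minimal sketch of the five evaluation metrics for one reference/candidate pair.
# Requires: nltk (with 'punkt' and 'wordnet' data), rouge-score, sentence-transformers.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # newer NLTK expects pre-tokenized input
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Ok. Here is the list of your items and respective quantities."
candidate = "Here is the list of items with their quantities."

ref_tok = nltk.word_tokenize(reference.lower())
cand_tok = nltk.word_tokenize(candidate.lower())

bleu = sentence_bleu([ref_tok], cand_tok, smoothing_function=SmoothingFunction().method1)
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)["rougeL"].fmeasure
meteor = meteor_score([ref_tok], cand_tok)

# Content overlap here is a Jaccard word overlap (cf. [67]); the study's exact formula may differ.
overlap = len(set(ref_tok) & set(cand_tok)) / len(set(ref_tok) | set(cand_tok))

# Semantic similarity as cosine similarity of sentence embeddings; the model choice is illustrative.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_cand = embedder.encode([reference, candidate])
semantic = float(util.cos_sim(emb_ref, emb_cand))

print(f"BLEU={bleu:.3f} ROUGE-L={rouge_l:.3f} METEOR={meteor:.3f} "
      f"Overlap={overlap:.3f} Semantic={semantic:.3f}")
```

In the study, per-pair scores of this kind would then be averaged over the 24 sub-task responses to obtain values such as those reported in Table 3.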
As shown in Figure 16, CoT prompting substantially improves model outputs across all metrics, suggesting that structured reasoning enhances coherence, factual retention, and fluency in GPT-4o’s responses on cost estimation scenarios and tasks. The bar chart visualizes the comparison between zero-shot GPT-4o and CoT GPT-4o across the five evaluation categories.
To further investigate whether the observed improvements were statistically significant, a paired t-test was conducted:
t = \frac{\bar{d}}{s_d / \sqrt{n}}
where \bar{d} is the mean difference, s_d is the standard deviation of the differences, and n is the number of evaluation categories. The test returned a t-statistic of 24.02 and a p-value of 0.000018, well below the 0.05 significance threshold, confirming that the gains achieved through CoT prompting are statistically significant. These findings demonstrate that CoT prompting not only improves linguistic metrics but also enhances logical consistency, contextual awareness, and reasoning traceability, all key factors in construction cost estimation. The results indicate that structured prompting enables GPT-4o to better replicate human-like estimation reasoning, reducing ambiguity and increasing reliability across tasks. Furthermore, the statistical validation reinforces that the observed gains are systematic rather than incidental, supporting the robustness of the proposed framework. These findings strongly support the adoption of CoT prompting strategies for improving LLM outputs on estimation tasks, yielding greater accuracy, contextual understanding, and meaningfulness in responses.
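Using the category-level scores reported in Table 3, the paired test can be reproduced with a few lines of Python (a sketch for illustration; SciPy’s ttest_rel implements the same formula given above):

```python
# Paired (related-samples) t-test over the five metric categories in Table 3.
# The values are the reported Zero-Shot vs. CoT scores for GPT-4o.
from scipy import stats

zero_shot = [0.023365, 0.185215, 0.196798, 0.109057, 0.245202]
cot       = [0.382353, 0.622845, 0.610922, 0.497031, 0.597011]

# t = mean(d) / (std(d, ddof=1) / sqrt(n)), where d are the per-category differences
t_stat, p_value = stats.ttest_rel(cot, zero_shot)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")  # roughly t = 24.02, p = 0.000018
```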

6. Limitations and Future Work

While this study evaluated the performance of existing LLMs on construction cost estimation tasks and demonstrated the effectiveness of the CoT prompting approach, several limitations must be acknowledged. First, although construction cost estimation involves several components, including conceptual estimation, evaluating subcontractor estimates, and change management, the scope of this study was limited to a conceptual estimation scenario using pre-trained LLMs. Future research can extend the evaluation to other scenarios to assess how LLMs handle subcontractor bid analysis, estimation completeness checks, and version control in cost estimation workflows. Second, this study focused on the Uniformat II classification, specifically the Shell category, limiting the range of cost estimation components analyzed. Future research can apply this analysis to all Uniformat II categories for a more comprehensive evaluation of LLMs’ performance in handling the diverse cost elements of various building systems. In addition, future research can extend the CoT prompting approach to other estimation techniques, such as unit cost estimating. Third, while CoT prompting significantly improved LLM performance compared to zero-shot prompting, its practical integration into real-world estimation workflows remains unexplored. Future research should focus on developing and deploying AI-driven cost estimation assistants that integrate CoT-enhanced LLMs into existing industry workflows.

7. Conclusions

Construction cost estimation has long been a manual and time-consuming process, heavily dependent on expert judgment, intuition, and past experience. Estimators generally work with fragmented data sources and varied formats, making the process labor-intensive and prone to errors. With tight project deadlines, even small miscalculations can lead to costly budget and schedule overruns. While LLMs are increasingly explored for various construction applications, their zero-shot performance in cost estimation and the potential benefits of structured prompting remain underexplored. This study addressed these gaps by developing and testing a modular CoT framework tailored for cost estimation workflows. The key findings and contributions are summarized below:
  • Among four pre-trained LLMs tested—GPT-4o, LLaMA 3.2, Gemini 2.0, and Claude 3.5 Sonnet—GPT-4o demonstrated the highest performance across BLEU, ROUGE-L, METEOR, Content Overlap, and Semantic Similarity metrics.
  • The CoT approach achieved significant quantitative gains, including a 1536% BLEU increase, 236% ROUGE-L improvement, and 210% METEOR enhancement, with notable rises in Content Overlap (355%) and Semantic Similarity (143%) when compared with zero-shot prompting.
  • The modular CoT prompting approach significantly enhanced model accuracy and completeness, raising the human-evaluated confidence score from 1.91 (64%) to 2.52 (84%), a 20-percentage-point improvement over zero-shot prompting.
  • Results confirm that pre-trained LLMs alone are insufficient for detailed cost estimation, but structured reasoning through CoT substantially improves the performance.
The developed CoT framework provides a replicable foundation for integrating AI into early-stage estimation workflows, reducing manual workload, minimizing human error, and improving estimation consistency. This study highlights opportunities for future research, including fine-tuning LLMs with construction-specific datasets, integrating multimodal LLMs, and developing human-AI collaborative systems to advance automation in preconstruction planning. This study thus makes theoretical, methodological, and practical contributions by demonstrating how pre-trained LLMs can be adapted for cost estimation and by introducing a standardized modular CoT framework to guide industry adoption.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/buildings16020396/s1.

Author Contributions

Conceptualization, P.G.; Methodology, P.G.; Software, P.G.; Validation, P.G. and K.K.; Formal analysis, P.G.; Resources, K.K., T.S. and T.R.; Data curation, P.G.; Writing—original draft, P.G.; Writing—review & editing, P.G. and K.K.; Visualization, P.G.; Supervision, K.K., T.S. and T.R.; Project administration, P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Burdens to Tasks to Sub-Tasks Mapping [58]

Estimation Burdens | Tasks | Sub-Tasks
Aggregating Quantities | Specify Estimation Type and WBS
Task initiation
Specify Estimation Type
Confirm WBS
Collect Quantities
Collect Data—Quantities, Cost data, Specs, Output template
Determine Scope
Summarize, display
Check duplications
Finalize the Quantities
Format Quantities
Suggest Template
Upload Template and Map Quantities
Provide Download Link
Referencing Enterprise Historic Cost | Referencing Multiple Cost Data Sources
Explain Next Steps
Collect Cost Data
Prioritize Cost References
Define Priority and Calculations
Set Conditions
Set Common Identifier
Handle Missing Items
Provide Closest Options
Summarize Work Done
Explain Next Steps
Referencing External Cost Database | Referencing External Cost Database
Set Priority for External Cost
Define Condition to use External Cost Data
Handle No Match & Multiple matches
List of Multiple Matched Items
List No Match Items with Close Options
Follow Output Format
Load Output Template
Read and Understand Output Template
Map Quantities into Output Template
Select Cost item
Calculating
Cross-verification with Project Specification | Cross-Verification with Specification
List Specific Information to Cross Verify
Provide discrepancies summary
Export in requested name and file format
Handle Changes
If required; Add, Remove, Format on Exported File
Work Resuming
Summarize Past Work and Data
Suggest Future Steps based on where user left
Provide Estimation File Last Produced
Planning Evaluation | Initialize Subcontractor Evaluation
Context Understanding
Plan context-based Evaluation
Handle Inconsistencies & Mapping
Review Submissions
Plan Mapping into Standard Template
Handle Multiple Submissions in Trade
Compile Estimates
Evaluating Completeness | Check Completeness
Understand Completeness Criteria
Cross Check with Conceptual Estimate
Flag Incomplete Submissions
Evaluating Discrepancies
List Missing Items
List Additional Items
Remove Incomplete Submissions
Identify Options
Identify Trades with Multiple Proposals
Metrics Based Evaluation | Metrics Based Evaluation
Understand Evaluation Metrics
Evaluation
Explain Calculations and Priority List
Create Negotiation List
Find Outstanding Items
List of Outstanding Items
Create Negotiation Table
Visualizing and compiling | Calculate and visualize
Calculate Differences
Visualize Differences
Compile Estimates
List Selected Subs
Compile Selected Subs Estimates
Add Self-Performing Trades
Generate Final Estimate
Element Breakdown Visualization
Compare Final vs. Conceptual Estimate
Generate Final Output File
Managing Changes | Navigate Recent Estimation
Open Recent Estimation
Update Changes
Integrate Meeting Updates
List Changes
Version Control | Naming Standard
Automatic Naming
Changes between Versions
Summary of changes
Data Re-Cycling | Recycle to Historic database
Add Metadata to Historic Data
Ensure Formatting Consistency
Appending

Appendix B. Scenario 1-Conceptual Cost Estimation Scenario

ID | Main Task | Estimation Sub-Task | Query from the “User” | Expected Answer from “LLM”
1 | Specify Estimation Type and WBS | Task Initiation | User: I have a new project for which I need to create an estimate. Can you help me? | LLM: Yes, I can. Could you provide some details about the project, such as the type of construction, location of the project, and any specific type of estimation—Assembly or Item based—that you want to follow?
2 | Specify Estimation Type and WBS | Specify Estimation Type | User: Yes. This is a new commercial building construction in Omaha, Nebraska. I want to do an Assembly cost estimation. | LLM: Ok. For assembly estimation, you need to know the assembly breakdown in Uniformat II before starting the estimation. Do you know this?
3 | Specify Estimation Type and WBS | Confirm Work Breakdown Structure (WBS) | User: No, can you explain the Assembly breakdown structure in Uniformat II for me? | LLM: Of course! Uniformat II divides the building into major functional areas or “assemblies” based on the work they perform.
Here’s an overview of the Uniformat II structure, broken down into its primary categories:
A. Substructure
This section includes all the elements related to the building’s foundation and below-grade construction:
A10 Foundations: Includes standard footings, foundation walls, slab-on-grade, and pile systems.
A20 Basement Construction: Covers items like basement excavation, retaining walls, and waterproofing.
B. Shell
The building’s exterior envelope or shell:
B10 Superstructure: Includes vertical elements (columns, beams) and floor/roof framing systems.
B20 Exterior Enclosure: Covers walls, windows, doors, and exterior cladding.
B30 Roofing: Includes roofing materials, insulation, and any associated components like flashing and drainage systems.
C. Interiors
This section covers everything inside the building:
C10 Interior Construction: Partitions, interior walls, and doors.
C20 Stairs: Includes staircases, railings, and related structures.
C30 Interior Finishes: Flooring, ceilings, paint, and other wall finishes.
D. Services
Systems that provide utilities and services to the building:
D10 Conveying Systems: Elevators, escalators, and lifts.
D20 Plumbing: Water supply, drainage, and fixtures.
D30 HVAC: Heating, ventilation, and air conditioning systems.
D40 Fire Protection: Sprinkler systems, fire alarms, and smoke detectors.
D50 Electrical: Lighting, power distribution, and wiring.
E. Equipment & Furnishings
This category covers the built-in equipment and furnishings in the building:
E10 Equipment: Major fixed equipment such as kitchen appliances or medical equipment in a hospital.
E20 Furnishings: Built-in furniture and seating systems.
F. Special Construction & Demolition
Non-standard or specialized construction:
F10 Special Construction: Includes items like greenhouses, clean rooms, or pools.
F20 Selective Demolition: Any necessary demolition work.
G. Building Sitework
Anything outside the building footprint:
G10 Site Preparation: Grading, clearing, and demolition of structures.
G20 Site Improvements: Pavements, curbs, gutters, and landscaping.
G30 Site Mechanical Utilities: Storm drainage, water distribution, and gas supply.
G40 Site Electrical Utilities: Exterior lighting, power distribution, and communications.
4 | Collect Quantities | Collect Quantities | User: Ok, understand. Now how can I start? | LLM: Now, you need to provide quantities, cost data, project specifications, and a particular format or template for the final output of the estimate.
5 | Collect Quantities | Collect Quantities | User: OK. Let’s go one by one with the data. | LLM: Great. Provide me quantities.
6 | Collect Quantities | Determine Scope | User: OK. I only have quantities for the shell category. And want to estimate only that portion. | LLM: I got it! I will focus on the Shell (B) category for now. Please provide the quantities.
7 | Collect Quantities | Summarize and Display Collected Quantities | User: Here’s the quantity excel file I have uploaded. Please show me a list of items and quantities in tabular format. | LLM: Ok. Here is the list of your items and respective quantities. Do you want to know more about, or do you want to move ahead to the next steps?
8 | Collect Quantities | Check Duplications | User: Can you check if there are any duplicate items based on the description and assembly code? | LLM: These are items that have exactly the same description and assembly code. Do you want to add them to a single item?
9 | Collect Quantities | Check Duplications | User: Yes, please add. Also, can you find potential duplication that doesn’t match exactly but potentially? | LLM: Here are these group items that look potentially similar based on size and descriptions.
10 | Collect Quantities | Finalize Quantities | User: No, they are not the same items. Keep separate. | LLM: Ok. We have now total 25 items. What do you want to do next with quantities? Should we move to cost data?
11 | Format Quantities | Format Quantities | User: Now, I want you to help formatting before starting estimate. Can you do that? | LLM: Yes, I can. Do you have any template of format or you want me to suggest some format?
12 | Format Quantities | Suggest Template | User: Please suggest a template format suitable for this estimation. | LLM: Here is the suggested format suitable for the assembly estimate… Do you want to use this? Or do you want me to create other templates?
13 | Format Quantities | Upload Template and Map Quantities | User: I have one standard template format for quantities. Here I have uploaded. Please put quantities into this new format and follow these instructions to do mapping: (i) Read Instructions inside Template file, each description under every title is the description for you understand what it means (ii) Look at example output and understand (iii) map all quantities to suitable section of this template’s yellow highlighted area, and you are free to add remove rows as required for items. Show the table and save. | LLM: I have arranged the original quantities into the new template that you provided as per instructions i, ii, and iii. See the table below. Let me know if you want to modify or change anything.
14 | Format Quantities | Provide download link | User: Also, can you export the new table in excel format? | LLM: You can click on the link below to download the new quantities table.
15 | Referencing Multiple Cost Data Sources | Referencing Multiple Cost Data Sources | User: Now, we are good for quantities. What’s our next task? | LLM: We will work on cost data in the next step. Please provide me with cost data.
16 | Referencing Multiple Cost Data Sources | Collect Cost Data | User: Ok. I have uploaded two datasets for cost data that I want to use: (1) Historic cost—enterprise database (2) External cost database. | LLM: Great! I have stored both datasets. How do you want to set up the estimation to use both your historic cost and external cost?
17 | Prioritize Cost References | Define Priority and calculations | User: I want to use the historical database as much as possible. | LLM: Ok. I can structure the estimation to prioritize historic cost and only use the database cost if the historic cost for an item is not available.
18 | Prioritize Cost References | Define Priority and calculations | User: In the historic cost data add a column and get the average of all available commercial projects, for all cost items. | LLM: Ok. The new column is added to the historic dataset, and I have saved the updated data. I have calculated the average for items in past commercial projects. Here is the table:
19 | Prioritize Cost References | Set Conditions | User: I want to use historic cost for all items if it is available. If not, then I want to use database cost. | LLM: Ok. I can structure the estimation to prioritize historic cost and only use the database cost if the historic cost for an item is not available.
20 | Prioritize Cost References | Set Common Identifier | User: How would you prioritize? | LLM: Are there any common identifiers in the two cost data sets? On what basis do you want to find items?
21 | Prioritize Cost References | Set Common Identifier | User: We have the same Assembly Code but slight variations in descriptions. So, consider Assembly Code as the common Identifier. | LLM: I will first load both the historic and database cost data from separate files. Let’s assume both files contain Assembly Code as a common identifier that can be used to match items.
I first check the historic cost data. If a historic cost is available, I use that.
If no historic cost is found, I search for the cost in the database cost file.
22 | Handle Missing Items | Handle Missing Items | User: If an item is missing from both datasets, what would you do? | LLM: I will ask you with the closest options based on description.
23 | Handle Missing Items | Summarize Work Done | User: Briefly summarize in bullet points: what you have done so far, what data you have, what data you need now. | LLM: Summary of What Has Been Done:
  • loaded quantities, formatted to match provided quantity template
  • loaded historic and added average cost column for all items
  • loaded database cost.
Data I Have:
  • Quantities, Historic cost data (with average commercial costs), Database cost
Data I Need:
  • Project Specification
  • Final output format or template
24 | Handle Missing Items | Explain Next Steps | User: What’s the next steps? | LLM: Next Steps:
  • Reading output template
  • Calculate total costs using provided quantities and cost data sets
  • Verify with specification
  • Export the final estimate
25 | Handle Missing Items | Output Template Recall | User: Here, I have uploaded the output template file and specification file. Please read and store both files. | LLM: Thank you! Now that I have the output template and project specification files, can I go ahead and start cost calculations?
26 | Referencing External Cost Database | Define Priority for External Cost & Define Condition to use External Cost Data | User: How would you utilize the template and start calculation? Explain to me before doing it. | LLM: First, I’ll carefully review the output template to understand its structure, column names and section layout. There may be sections based on Uniformat II categories (e.g., B10 Superstructure, B20 Exterior Enclosure).
Second, I will map the quantities (description, unit, and quantity) to the output template.
Third, based on the description of quantities, I will search for each item to match in cost data, first to ‘historical cost’, where available, and then to the ‘database cost’ if needed.
Fourth, multiply the quantity by the unit cost to calculate the total cost
Fifth, I will keep the cost data assembly code as the unique identifier in the ‘Assembly Code’ column
27 | Handle No Match & Multiple Matches | Handling No Match Items | User: What if you don’t find an exact match? | LLM: If I don’t find an exact match in the cost data, I will ask you to select from the closest options (maximum 5 options).
28 | Handle No Match & Multiple Matches | Handling Multiple Matched Items | User: What if you find multiple matches? | LLM: If I find multiple matching items in the cost data, I will provide you with options and ask you to select one out of those options.
29 | Follow Output Format | Follow Output Format | User: Can you export the estimate in excel format with the file name ‘conceptual estimation_[today’s date]’? | LLM: Yes, I can export the estimate with the file name “conceptual estimation_[today’s date].”
30 | Cross-Verification with Specification | Cross-Verification with Specification | User: Ok. If I want to verify anything with specifications, how would you help? | LLM: I can help you in multiple ways:
I can provide any specific information that you want to look at.
I can also provide a summary of any discrepancies in item or installation description, between specification and items in estimation, that need clarification.
And, If adjustments are required, I can incorporate them into the estimate.
31 | Handle Changes | Handle Changes | User: If something needs to be modified, added, removed, or corrected, would you do it for me? | LLM: Absolutely! If any modifications, additions, removals, or corrections are needed, just let me know the specific details, and I’ll make the necessary changes for you. Whether it’s adjusting costs, updating quantities, or correcting formatting, I’m happy to assist with all aspects of the estimate.
32 | Work Resuming | Work Resuming | User: If I stop here, and come back to continue work, could you remind me with a summary of what we have done, and what’s the next step, and provide me the last file? | LLM: Yes. When you come back, I’ll provide a quick summary of the work we’ve completed so far, including details like the data we’ve loaded, the calculations performed, and any adjustments made. I’ll outline the next steps based on where we left off, so you can pick up exactly where you need to. I’ll also share the last file we were working on.

Appendix C. Human Evaluation Criteria

Task ID | Estimation Sub-Task | Completeness Criteria for LLM Response | User Confidence Level on Response (1 = Low, 2 = Medium, 3 = High)
1Task Initiation
Readiness to help
Asking for project details
Asking for Estimation Type
3
2Specify Estimation Type
Providing workflow or process understanding of the estimation process for the specified specification type
Confirm with the user if the user knows the Assembly estimation.
2
3Confirm Work Breakdown Structure (WBS)
Providing Work Breakdown Structure (WBS)
3
4Collect Quantities
Asking for required information—quantities, project specifications
Asking for the final outcome and template format
1
5Collect Quantities
Asking for quantities first
1
6Determine Scope
Knowing the scope for quantities
Asking again for the dataset
3
7Summarize and Display Collected Quantities
Providing a list of items and their summary
Confirming before moving to the next step
3
8Check Duplications
Looking at the exact match items
Asking criteria matched items
1
9Check Duplications
Presenting, if found, the potential duplications list
1
10Finalize Quantities
Informing about total items
Asking what’s next with quantities now
Follow up if the user wants to move to work with the next dataset
1
11Format Quantities
Asking for a template
Confirming if users don’t have a template
3
12Suggest Template
Suggesting formatting
Confirming with the user
3
13Upload Template and Map Quantities
Mapping out quantities in the provided template
Showing table
Confirming if modifications are needed
Saving table
2
14Provide download link
Providing the link to download
2
15Referencing Multiple Cost Data Sources
Informing the next step
Asking for cost data for the next step
3
16Collect Cost Data
Clarifying with users how they want to use multiple cost datasets
2
17Define Priority and calculations
Understanding user’s priority order
Understanding user’s calculation rule
3
18Define Priority and calculations
Executing priority-based calculations
2
19Set Conditions
Structuring the estimation reference with historic cost as the first priority for referencing
2
20Set Common Identifier
Asking the user about a common identifier in different data sets
Asking the user on what basis they want to find the cost items
2
21Set Common Identifier
Explaining to the user step-by-step on how the system prioritizes and use the historic and other cost database
1
22Handle Missing Items
Informing the user about handling the missing items
Providing the user with the closest options in the cost databases
1
23Summarize Work Done
Presenting summary of work done so far in bullet points
Presenting separate sections for what has been done, data system has, data system need
1
24Explain Next Steps
Explaining next steps
1
25Output Template Recall
Confirming that the remaining files or data from the user have been received
Recalling estimation template
Confirming for the next step
2
26Define Priority for External Cost & Define Condition to use External Cost Data
Understanding output template structure and section layout for Uniformat II mapping
Mapping quantities to output template
Searching items in the cost database, with historic cost data as the first priority
Calculating total cost by multiplying unit cost with quantity
Asking the user if they want to incorporate the inflation percentage in the calculation
Keeping the cost database’s assembly code in the final output
2
27Handling No Match Items
Providing user with closest options in case of no exact match
1
28Handling Multiple Matched Items
Providing the user with a matched items list and asking to select suitable cost item
1
29Follow Output Format
Executing file export in the provided format and naming convention
1
30Cross-Verification with Specification
Providing specific information from the specification that the user wants to look at
Providing a summary of discrepancies for any, some, or all items that need clarification
Incorporating adjustments to estimate if requested
2
31Handle Changes
Explaining how the system handles modifications—adding, removing, and changing—for cost, quantities, and even formatting.
3
32Work Resuming
Providing a summary of past work,
Providing past data,
Outlining the next steps based on the point the work stopped last time
2

References

  1. Ali, Z.H.; Burhan, A.M. Hybrid machine learning approach for construction cost estimation: An evaluation of extreme gradient boosting model. Asian J. Civ. Eng. 2023, 24, 2427–2442. [Google Scholar] [CrossRef]
  2. Swei, O.; Gregory, J.; Kirchain, R. Construction cost estimation: A parametric approach for better estimates of expected cost and variation. Transp. Res. Part. B Methodol. 2017, 101, 295–305. [Google Scholar] [CrossRef]
  3. Hashemi, S.T.; Ebadati, O.M.; Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2020, 2, 1703. [Google Scholar] [CrossRef]
  4. Holm, L.; Schaufelberger, J.E. Construction Cost Estimating; Routledge: London, UK, 2021. [Google Scholar] [CrossRef]
  5. Ahiaga-Dagbui, D.D.; Smith, S.D. Rethinking construction cost overruns: Cognition, learning and estimation. J. Financ. Manag. Prop. Constr. 2014, 19, 38–54. [Google Scholar] [CrossRef]
  6. Ghimire, P.; Pokharel, S.; Kim, K.; Barutha, P. Machine learning-based prediction models for budget forecast in capital construction. In Proceedings of the 2nd International Conference on Construction, Energy, Environment & Sustainability; Itecons: Funchal, Portugal, 2023; pp. 27–30. [Google Scholar]
  7. Messner, J. Introduction to Construction Cost Estimating, August 2022. Available online: https://psu.pb.unizin.org/buildingconstructionmanagement/chapter/introduction-to-construction-cost-estimating/ (accessed on 8 March 2025).
  8. Abanda, F.H.; Kamsu-Foguem, B.; Tah, J.H.M. BIM—New rules of measurement ontology for construction cost estimation. Eng. Sci. Technol. Int. J. 2017, 20, 443–459. [Google Scholar] [CrossRef]
  9. Ghimire, P.; Kim, K.; Acharya, M. Opportunities and Challenges of Generative AI in Construction Industry: Focusing on Adoption of Text-Based Models. Buildings 2024, 14, 220. [Google Scholar] [CrossRef]
  10. Rane, N. Role of ChatGPT and Similar Generative Artificial Intelligence (AI) in Construction Industry. Soc. Sci. Res. Netw. Rochester 2023. [Google Scholar] [CrossRef]
  11. Charette, R.P.; Marshall, H.E. UNIFORMAT II Elemental Classification for Building Specifications, Cost Estimating, and Cost Analysis; NIST IR 6389; National Institute of Standards and Technology: Gaithersburg, MD, USA, 1999. [Google Scholar] [CrossRef]
  12. Sayed, M.; Abdel-Hamid, M.; El-Dash, K. Improving cost estimation in construction projects. Int. J. Constr. Manag. 2023, 23, 135–143. [Google Scholar] [CrossRef]
  13. Juszczyk, M. The Challenges of Nonparametric Cost Estimation of Construction Works with the use of Artificial Intelligence Tools. Procedia Eng. 2017, 196, 415–422. [Google Scholar] [CrossRef]
  14. Lim, C.; Hong, W.-K.; Lee, D.; Kim, S. Automatic Rebar Estimation Algorithms for Integrated Project Delivery. J. Asian Archit. Build. Eng. 2016, 15, 411–418. [Google Scholar] [CrossRef]
  15. Elfaki, A.O.; Alatawi, S.; Abushandi, E. Using Intelligent Techniques in Construction Project Cost Estimation: 10-Year Survey. Adv. Civ. Eng. 2014, 2014, 107926. [Google Scholar] [CrossRef]
  16. Babatunde, S.O.; Perera, S.; Ekundayo, D.; Adeleye, T.E. An investigation into BIM-based detailed cost estimating and drivers to the adoption of BIM in quantity surveying practices. J. Financ. Manag. Prop. Constr. 2019, 25, 61–81. [Google Scholar] [CrossRef]
  17. Mubarak, S.A. How to Estimate with RSMeans Data: Basic Skills for Building Construction; John Wiley & Sons: Hoboken, NJ, USA, 2020. [Google Scholar]
  18. Mahamid, I. Factors affecting cost estimate accuracy: Evidence from Palestinian construction projects. Int. J. Manag. Sci. Eng. Manag. 2015, 10, 117–125. [Google Scholar] [CrossRef]
  19. Akanbi, T.; Zhang, J. Design information extraction from construction specifications to support cost estimation. Autom. Constr. 2021, 131, 103835. [Google Scholar] [CrossRef]
  20. Feuerriegel, S.; Hartmann, J.; Janiesch, C.; Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 2024, 66, 111–126. [Google Scholar] [CrossRef]
  21. Baidoo-anu, D.; Ansah, L.O. Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning. J. AI 2023, 7, 52–62. [Google Scholar] [CrossRef]
  22. Zheng, J.; Fischer, M. Dynamic prompt-based virtual assistant framework for BIM information search. Autom. Constr. 2023, 155, 105067. [Google Scholar] [CrossRef]
  23. Li, C.; Su, Y.; Liu, W. Text-To-Text Generative Adversarial Networks. In 2018 International Joint Conference on Neural Networks (IJCNN); Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar] [CrossRef]
  24. Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image Diffusion Models in Generative AI: A Survey. arXiv 2023, arXiv:2303.07909. [Google Scholar] [CrossRef]
  25. Liu, V.; Long, T.; Raw, N.; Chilton, L. Generative Disco: Text-to-Video Generation for Music Visualization. arXiv 2023, arXiv:2304.08551. Available online: http://arxiv.org/abs/2304.08551 (accessed on 27 August 2023).
  26. Lei, T.; Barzilay, R.; Jaakkola, T. Rationalizing Neural Predictions. arXiv 2016, arXiv:1606.04155. Available online: http://arxiv.org/abs/1606.04155 (accessed on 27 August 2023).
  27. Wu, A.N.; Stouffs, R.; Biljecki, F. Generative Adversarial Networks in the built environment: A comprehensive review of the application of GANs across data types and scales. Build. Environ. 2022, 223, 109477. [Google Scholar] [CrossRef]
  28. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  29. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. MAL 2019, 12, 307–392. [Google Scholar] [CrossRef]
  30. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  31. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 6840–6851. Available online: https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html (accessed on 16 September 2023).
  32. Kumar, M.; Babaeizadeh, M.; Erhan, D.; Finn, C.; Levine, S.; Dinh, L.; Kingma, D. VideoFlow: A Flow-Based Generative Model for Video. arXiv 2019, arXiv:1903.01434. [Google Scholar]
  33. Lee, J.; Kim, H.; Shim, J.; Hwang, E. Cartoon-Flow: A Flow-Based Generative Adversarial Network for Arbitrary-Style Photo Cartoonization. In Proceedings of the 30th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1241–1251. [Google Scholar] [CrossRef]
  34. Wan, H.; Zhang, J.; Chen, Y.; Xu, W.; Feng, F. Generative AI Application for Building Industry. arXiv 2024, arXiv:2410.01098. [Google Scholar] [CrossRef]
  35. Saka, A.; Taiwo, R.; Saka, N.; Salami, B.; Ajayi, S.; Akande, K.; Kazemi, H. GPT Models in Construction Industry: Opportunities, Limitations, and a Use Case Validation. arXiv 2023, arXiv:2305.18997. [Google Scholar] [CrossRef]
  36. Prieto, S.A.; Mengiste, E.T.; de Soto, B.G. Investigating the Use of ChatGPT for the Scheduling of Construction Projects. Buildings 2023, 13, 857. [Google Scholar] [CrossRef]
  37. Hassan, H.A.M.; Marengo, E.; Nutt, W. A BERT-Based Model for Question Answering on Construction Incident Reports. In Natural Language Processing and Information Systems; Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2022; pp. 215–223. [Google Scholar] [CrossRef]
  38. Kim, K.; Ivashchenko, M.; Ghimire, P.; Huang, P.-C. Context-Aware and Adaptive Task Planning for Autonomous Construction Robots Through Llm-Robot Communication. Soc. Sci. Res. Netw. Rochester 2024. [Google Scholar] [CrossRef]
  39. Kim, K.; Ghimire, P.; Huang, P.-C. Framework for LLM-Enabled Construction Robot Task Planning: Knowledge Base Preparation and Robot–LLM Dialogue for Interior Wall Painting. Robotics 2025, 14, 117. [Google Scholar] [CrossRef]
  40. Parsafard, P.; Elezaj, O.; Ekundayo, D.; Vakaj, E.; Parmar, M.; Wani, M.A. Automation in Construction Cost Budgeting using Generative Artificial Intelligence. In Proceedings of the International Conference on Industrial Engineering and Operations Management; IEOM Society International: Dubai, United Arab Emirates, 2024. [Google Scholar] [CrossRef]
  41. Gatto, C.; Cassandro, J.; Mirarchi, C.; Pavan, A. LLM Based Automatic Relation Between Cost Domain Descriptions and IFC Objects. Available online: https://re.public.polimi.it/handle/11311/1280791 (accessed on 23 February 2025).
  42. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  43. Lazaridou, A.; Gribovskaya, E.; Stokowiec, W.; Grigorev, N. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv 2022, arXiv:2203.05115. [Google Scholar] [CrossRef]
  44. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar]
  45. Nong, Y.; Aldeen, M.; Cheng, L.; Hu, H.; Chen, F.; Cai, H. Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities. arXiv 2024, arXiv:2402.17230. [Google Scholar] [CrossRef]
  46. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  47. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 1877–1901. Available online: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html (accessed on 23 February 2025).
  48. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv 2023, arXiv:2205.10625. [Google Scholar] [CrossRef]
  49. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y. Self-Refine: Iterative Refinement with Self-Feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 46534–46594. [Google Scholar]
  50. Conklin, H.; Wang, B.; Smith, K.; Titov, I. Meta-Learning to Compositionally Generalize. arXiv 2021, arXiv:2106.04252. [Google Scholar] [CrossRef]
  51. Creswell, A.; Shanahan, M.; Higgins, I. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. arXiv 2022, arXiv:2205.09712. [Google Scholar] [CrossRef]
  52. Zelikman, E.; Wu, Y.; Mu, J.; Goodman, N. STaR: Bootstrapping Reasoning with Reasoning. Adv. Neural Inf. Process. Syst. 2022, 35, 15476–15488. [Google Scholar]
  53. Havrilla, A.; Du, Y.; Raparthy, S.C.; Nalmpantis, C.; Dwivedi-Yu, J.; Zhuravinskyi, M.; Hambro, E.; Sukhbaatar, S.; Raileanu, R. Teaching Large Language Models to Reason with Reinforcement Learning. arXiv 2024, arXiv:2403.04642. [Google Scholar] [CrossRef]
  54. Cheng, M.-Y.; Tsai, H.-C.; Hsieh, W.-S. Web-based conceptual cost estimates for construction projects using Evolutionary Fuzzy Neural Inference Model. Autom. Constr. 2009, 18, 164–172. [Google Scholar] [CrossRef]
  55. Elmousalami, H.H. Artificial Intelligence and Parametric Construction Cost Estimate Modeling: State-of-the-Art Review. J. Constr. Eng. Manag. 2020, 146, 03119008. [Google Scholar] [CrossRef]
  56. Walton, J.R.; Stevens, J.D. Improving Conceptual Estimating Methods Using Historical Cost Data. Transp. Res. Rec. 1997, 1575, 127–131. [Google Scholar] [CrossRef]
  57. Ji, S.-H.; Park, M.; Lee, H.-S. Cost estimation model for building projects using case-based reasoning. Can. J. Civ. Eng. 2011, 38, 570–581. [Google Scholar] [CrossRef]
  58. Ghimire, P. Framework for Integrating Industry Knowledge into a Large Language Model to Assist Construction Cost Estimation. Ph.D. Thesis, The University of Nebraska—Lincoln, Lincoln, NE, USA, 2025. Available online: https://www.proquest.com/docview/3198872319/abstract/D556793967F749FCPQ/1 (accessed on 12 October 2025).
  59. Liu, R.; Li, M.; Zhao, S.; Chen, L.; Chang, X.; Yao, L. In-Context Learning for Zero-shot Medical Report Generation. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; pp. 8721–8730. [Google Scholar] [CrossRef]
  60. Merkus, B. An Assessment of Zero-Shot Open Book Question Answering Using Large Language Models. Master’s Thesis, Utrecht University, Utrecht, The Netherlands, 2023. Available online: https://studenttheses.uu.nl/handle/20.500.12932/44625 (accessed on 2 March 2025).
  61. Salvador, J.; Bansal, N.; Akter, M.; Sarkar, S.; Das, A.; Karmaker, S.K. Benchmarking LLMs on the Semantic Overlap Summarization Task. arXiv 2024, arXiv:2402.17008. [Google Scholar] [CrossRef]
  62. Xu, S.; Wu, Z.; Zhao, H.; Shu, P.; Liu, Z.; Liao, W.; Li, S.; Sikora, A.; Liu, T.; Li, X. Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis. arXiv 2024, arXiv:2402.11398. [Google Scholar] [CrossRef]
  63. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-Thought in Neural Code Generation: From and for Lightweight Language Models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
  64. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ’02; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; p. 311. [Google Scholar] [CrossRef]
  65. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013/ (accessed on 2 March 2025).
  66. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Goldstein, J., Lavie, A., Lin, C.-Y., Voss, C., Eds.; Association for Computational Linguistics: Ann Arbor, MI, USA, 2005; pp. 65–72. Available online: https://aclanthology.org/W05-0909/ (accessed on 2 March 2025).
  67. Niwattanakul, S.; Singthongchai, J.; Naenudorn, E.; Wanapu, S. Using of Jaccard Coefficient for Keywords Similarity; IAENG: Hong Kong, China, 2013. [Google Scholar]
  68. Sitikhu, P.; Pahi, K.; Thapa, P.; Shakya, S. A Comparison of Semantic Similarity Methods for Maximum Human Interpretability. In 2019 Artificial Intelligence for Transforming Business and Society (AITB); Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 1–4. [Google Scholar] [CrossRef]
  69. Islam, R.; Moushi, O.M. GPT-4o: The Cutting-Edge Advancement in Multimodal LLM. TechRxiv 2024. [Google Scholar] [CrossRef]
  70. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. Available online: http://arxiv.org/abs/2302.13971 (accessed on 26 August 2023).
  71. Islam, R.; Ahmed, I. Gemini-the most powerful LLM: Myth or Truth. In 2024 5th Information Communication Technologies Conference (ICTC); IEEE: Piscataway, NJ, USA, 2024; pp. 303–308. [Google Scholar] [CrossRef]
  72. Kurokawa, R.; Ohizumi, Y.; Kanzawa, J.; Kurokawa, M.; Sonoda, Y.; Nakamura, Y.; Kiguchi, T.; Gonoi, W.; Abe, O. Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology’s ‘Diagnosis Please’ cases. Jpn. J. Radiol. 2024, 42, 1399–1402. [Google Scholar] [CrossRef]
  73. Diab, N. Out of the BLEU: An Error Analysis of Statistical and Neural Machine Translation of WikiHow Articles from English into Arabic. CDELT Occas. Pap. Dev. Engl. Educ. 2021, 75, 181–211. [Google Scholar] [CrossRef]
  74. Lee, S.; Lee, J.; Moon, H.; Park, C.; Seo, J.; Eo, S.; Koo, S.; Lim, H. A Survey on Evaluation Metrics for Machine Translation. Mathematics 2023, 11, 1006. [Google Scholar] [CrossRef]
  75. Ganesan, K. ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks. arXiv 2018, arXiv:1803.01937. [Google Scholar] [CrossRef]
  76. Holzinger, A.; Zatloukal, K.; Müller, H. Is human oversight to AI systems still possible? New Biotechnol. 2025, 85, 59–62. [Google Scholar] [CrossRef] [PubMed]
  77. Gallegos, I.O.; Rossi, R.A.; Barrow, J.; Tanjim, M.M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; Ahmed, N.K. Bias and Fairness in Large Language Models: A Survey. Comput. Linguist. 2024, 50, 1097–1179. [Google Scholar] [CrossRef]
  78. Jung, J.; Brahman, F.; Choi, Y. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv 2024, arXiv:2407.18370. [Google Scholar] [CrossRef]
  79. Prolego Team. pdd; GitHub repository; GitHub, Inc.: San Francisco, CA, USA. Available online: https://github.com/prolego-team/pdd (accessed on 3 March 2025).
  80. Virk, Y.; Devanbu, P.; Ahmed, T. Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores. arXiv 2024, arXiv:2404.19318. [Google Scholar] [CrossRef]
Figure 1. GenAI models.
Figure 2. Research methodology.
Figure 3. Proposed LLM integrated framework for construction cost estimation.
Figure 4. Scenario creation steps.
Figure 5. Example 1 of expected vs. actual response in zero-shot.
Figure 6. Example 2 of expected vs. actual response in zero-shot.
Figure 7. Quantities from the Revit model of a commercial building.
Figure 8. Unit cost from cost databases.
Figure 9. Modular framework for a conceptual estimation scenario.
Figure 10. Modular flow structure.
Figure 11. Chain-of-thoughts instruction example.
Figure 12. Modular CoT architecture for conceptual estimation.
Figure 13. Module calling function architecture.
Figure 14. Module calling function workflow.
Figure 15. Example of expected vs. actual response in modular CoT.
Figure 16. Performance comparison: zero-shot vs. CoT.
Table 1. Overview of Major Construction Cost Estimation Methods.
Estimation Method | Description | Typical Accuracy | Common Application
Rough Order of Magnitude (RoM) | Relies on past project data and analogous estimating techniques to provide preliminary estimates when design details are limited [4,7]. | ±25% | Early feasibility and concept development stages
Square Footage Estimating | Uses cost-per-square-foot benchmarks derived from similar completed projects to provide a broad cost range for labor, materials, and services [4,7]. | ±20% | Schematic design and early planning
Assemblies Estimating | Breaks down costs into specific building systems and assemblies (e.g., plumbing or mechanical installations) using a functional classification [11]. | ±15% | Design development & conceptual estimates
Unit Cost Estimating | The most detailed approach, itemizing materials, labor, and equipment at the lowest quantifiable level to produce precise cost reports [12]. | −5% to +10% | Bidding phase
Table 2. LLMs’ Performance on Estimation Tasks with Zero-Shot Learning (The asterisk “*” denotes the best-performing model for each evaluation metric).
Model | BLEU | ROUGE-L | METEOR
GPT-4o | 0.023 * | 0.185 * | 0.196 *
LLaMA 3.2 | 0.0126 | 0.112 | 0.157
Gemini 2.0 | 0.010 | 0.095 | 0.122
Claude 3.5 Sonnet | 0.0135 | 0.170 | 0.168
Table 3. GPT-4o Performance Evaluation—Zero-Shot vs. CoT.
Evaluation Category | Zero-Shot with GPT-4o | CoT with GPT-4o
BLEU | 0.023365 | 0.382353
ROUGE-L | 0.185215 | 0.622845
METEOR | 0.196798 | 0.610922
Content Overlap | 0.109057 | 0.497031
Semantic Similarity | 0.245202 | 0.597011
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
