Article

Large Language Models for Construction Risk Classification: A Comparative Study

Department of Civil, Environmental, and Geospatial Engineering, Michigan Technological University, Houghton, MI 49931, USA
* Author to whom correspondence should be addressed.
Buildings 2025, 15(18), 3379; https://doi.org/10.3390/buildings15183379
Submission received: 23 August 2025 / Revised: 12 September 2025 / Accepted: 15 September 2025 / Published: 18 September 2025

Abstract

Risk identification is a critical concern in the construction industry. In recent years, there has been a growing trend of applying artificial intelligence (AI) tools to detect risks from unstructured data sources such as news articles, social media, contracts, and financial reports. The rapid advancement of large language models (LLMs) in text analysis, summarization, and generation offers promising opportunities to improve construction risk identification. This study conducts a comprehensive benchmarking of natural language processing (NLP) and LLM techniques for automating the classification of risk items into a generic risk category. Twelve model configurations are evaluated, ranging from classical NLP pipelines using TF-IDF and Word2Vec to advanced transformer-based models such as BERT and GPT-4 with zero-shot, instruction, and few-shot prompting strategies. The results reveal that LLMs, particularly GPT-4 with few-shot prompts, achieve a competitive performance (F1 = 0.81) approaching that of the best classical model (BERT + SVM; F1 = 0.86), all without the need for training data. Moreover, LLMs exhibit a more balanced performance across imbalanced risk categories, showcasing their adaptability in data-sparse settings. These findings contribute theoretically by positioning LLMs as scalable plug-and-play alternatives to NLP pipelines, offering practical value by highlighting how LLMs can support early-stage project planning and risk assessment in contexts where labeled data and expert resources are limited.

1. Introduction

The complex and dynamic nature of the transportation construction industry often leads to project delivery challenges, including delays, cost overruns, and disputes [1,2]. As a best practice for ensuring successful project delivery, risk management is now widely adopted in the construction industry to proactively identify potential risks during the planning and design phases, allowing for effective mitigation and response strategies [3,4,5,6,7]. Risk identification [8], which is the first step in the risk management cycle, is often considered the most critical phase, as unrecognized risks cannot be proactively addressed or mitigated. Erfani et al. [9], through a review of transportation project risk registers after project completion, found that more than 50% of the risks that occurred during project execution were not initially identified during the planning phase, indicating a significant issue of incomplete risk identification. Consequently, there has been significant interest in both research and practice in developing a wide range of tools and resources to improve risk identification in construction projects [10,11,12,13,14].
While current risk management practices heavily rely on subject-matter expert judgment through risk workshops [15,16], there has been substantial research interest in advancing construction risk identification by leveraging artificial intelligence (AI) techniques and historical data from comparable past projects. For example, Erfani and Cui [17] proposed a data-driven framework for risk identification that leverages historical data and AI techniques. The model analyzes risk items from past projects by capturing the semantic meaning of words to identify risks with high frequency and impact. When tested on new projects, the model achieved a risk identification accuracy of over 60%. However, the limited availability of comprehensive historical data often hinders the full implementation of data-driven approaches for risk identification [18]. As a result, there is a growing trend toward identifying risks from unstructured data sources such as news articles [18], social media [19], contracts [20,21,22,23], and financial reports [24]. The application of risk identification from unstructured data sources relies heavily on large volumes of annotated data and the capabilities of advanced AI techniques, such as natural language processing (NLP) and text mining [25].
Despite the growing application of NLP and text mining techniques for extracting and classifying risks from unstructured data, their effectiveness remains heavily dependent on large volumes of labeled data—a resource that is not widely available in construction projects [26]. This gap has limited the advancement of automated risk identification in real-world, data-sparse contexts. To address this challenge, the present study introduces large language models (LLMs) as a comparative tool against traditional NLP approaches. Unlike conventional methods that require task-specific training data, LLMs can deliver competitive classification performance through prompt engineering alone, while also providing more balanced results across imbalanced risk categories. This study contributes to the body of knowledge by systematically benchmarking LLMs against established NLP pipelines for construction risk classification, positioning LLMs as scalable, plug-and-play solutions that are capable of supporting risk assessment and project planning when labeled data and expert resources are limited.

2. Literature Review

The Project Management Institute [27] defines risk as “an uncertain event or condition that, if it occurs, has a positive or negative effect on a project’s objectives.” Simply put, risk is anything that might change how a project turns out—for better or worse [28,29,30,31]. Risk identification is where risk management begins—it sets the stage for everything that follows [32]. Past studies have worked to organize the risk identification process by creating frameworks that list, group, and rank risks based on their likelihood and potential impact [4]. In transportation infrastructure projects, these frameworks often include 8 to 12 categories and up to 50–100 common risks, which are commonly referred to as risk breakdown structures (RBSs) [33]. Researchers have also examined various risk identification methods, comparing their strengths and limitations [34].
For decades, risk identification has relied on traditional tools such as expert judgment, risk workshops, checklists, and predefined historical risk registers. While these methods have shaped the foundation of practice, recent advances in data storage, artificial intelligence [35,36,37], and computational power have sparked growing interest in enhancing risk identification [38]. Emerging approaches now seek to leverage not only structured historical data but also unstructured sources like social media, news articles, contracts, and financial documents to uncover risks more comprehensively and proactively [39]. This approach is based on the understanding that while every project is unique, not all construction risks are [40]. Many risks are common across projects; recognizing these shared risks allows project teams to address them proactively, freeing up time and focus to identify and manage more project-specific or emerging risks.
Historical risk events are often dispersed across diverse reports and unstructured data sources, making the collection of such information both time-consuming and costly [18]. As a result, augmenting risk identification using unstructured data has emerged as a prominent research direction. As shown in Table 1, a wide range of studies have employed various NLP techniques to extract and identify risk items from text. These techniques range from early text analysis methods, such as the bag-of-words model—which matches words based purely on surface-level similarity—to more advanced semantic similarity approaches that represent text as vectors, capturing underlying meaning [41,42]. While these approaches have demonstrated success, the rapid evolution of NLP—particularly through LLMs and generative AI—has introduced new capabilities that go beyond traditional extraction, enabling human-like reasoning, summarization, and contextual analysis [43,44,45,46,47]. This advancement underscores the need to explore the integration of LLMs into construction risk management frameworks.
In the past three years, there has been a surge in research exploring the use of LLMs across a wide range of construction management tasks—including scheduling [49], safety inspection [50], legal and compliance analysis [51,52,53], design [54], and risk management [55], among others. Among the growing body of research on LLM applications in risk management, prior studies have employed models such as the GPT family to support various tasks, including generating risk registers, conducting both qualitative and quantitative risk analyses, and formulating risk response strategies [56,57]. While these efforts aim to automate the risk management process, significant challenges remain before generative AI can deliver fully autonomous, high-accuracy solutions that are capable of identifying and analyzing both common and project-specific risks. Existing findings suggest that GPT-based tools may offer value to less-experienced practitioners by providing broad risk awareness and structured outputs. However, seasoned professionals often critique these models for lacking contextual sensitivity and domain-specific nuance [44,56,57]. As such, user experience with LLMs in risk management reveals a promising yet uneven landscape—highlighting both the transformative potential and the persistent gaps of generative AI in this domain.
While ongoing efforts aim to tailor general-purpose LLMs for specific risk management applications—through fine-tuning or retrieval-augmented approaches leveraging external sources such as historical risk databases [58,59]—another critical dimension lies in their capacity to enhance core NLP tasks in risk management, particularly the extraction and identification of risks from unstructured textual sources. This paper contributes to the body of knowledge by addressing this gap through the development of a comparative framework that evaluates the application of LLMs—including various prompt engineering techniques—against state-of-the-art NLP methods.

3. Research Methodology

This study contributes to the growing body of knowledge on risk identification from unstructured data sources by establishing a comparative benchmark between LLM-based approaches and a comprehensive suite of traditional and modern NLP methods. The LLM strategies explored include a spectrum of prompt engineering techniques—such as zero-shot, few-shot, and instruction-based prompting—while the baseline models span from classical bag-of-words representations to state-of-the-art semantic embeddings. As illustrated in Figure 1, the proposed methodology is structured into two primary steps: (1) data acquisition and preprocessing to construct a high-quality ground truth dataset and (2) benchmarking of model performance using that dataset. Each step of the framework is detailed in the following sections, outlining the design choices and implementation procedures that underpin this comparative analysis.

3.1. Data Collection and Preprocessing

The Information Source for Major Transportation Projects (ISMP) database [60] served as the primary data source for this study. As illustrated in Figure 2, the dataset comprises 1000 individual risk statements extracted from 30 major transportation projects spanning 20 U.S. states. These projects encompass a diverse range of infrastructure types, including highway reconstructions, bridge and tunnel developments, new roadway constructions, and interchange upgrades. In terms of size, the projects typically range from USD 500 million to USD 2 billion in value and utilize various delivery methods, including Design–Bid–Build, Public–Private Partnerships, and Design–Build. This dataset offers a comprehensive and objective compilation of real-world risk information, sourced directly from the risk registers of these large-scale infrastructure projects.
The collected risk statements, sourced from project risk registers, reflect the language and terminology used by individual project teams—often containing project-specific phrasing, contextual details, and technical nuances. To evaluate the capability of NLP and LLMs in accurately identifying standardized risk categories from such unstructured data, a well-defined ground truth dataset is essential for benchmarking and analysis. To support data preprocessing, this study aligns unstructured risk statements with the RBS developed by Erfani et al. [33], which was constructed through a data-driven synthesis of existing risk documentation. The resulting RBS offers a standardized and comprehensive classification framework specifically tailored for major transportation projects (see Figure 3). The developed RBS for major transportation projects comprises 11 primary risk categories, encompassing a total of 71 distinct risk items. In a related effort, Erfani et al. [33] conducted a content analysis to systematically map project-specific risk statements from the ISMP database to this RBS framework. This mapping serves as a high-quality baseline dataset for evaluating the effectiveness of NLP and LLM approaches in identifying and classifying risks from unstructured textual sources.
Aligning the project-specific risk statements (1000 items) with the Level 1 RBS categories produces a labeled ground truth dataset for subsequent analysis (see examples in Table 2). This enables a structured evaluation of model performance in automating the classification process. In essence, the task is framed as a multi-class risk classification problem, where the objective is to assign each unstructured risk item to its most appropriate generic risk category—analogous to prior studies (see Table 1) that aimed to classify risks from unstructured sources such as news articles, social media content, or financial documents. As illustrated in Figure 4, the resulting dataset is highly imbalanced across classes, presenting a significant challenge for both traditional machine learning and LLM-based approaches. The risk items were lightly normalized by lowercasing, trimming whitespace, and cleaning punctuation while preserving meaningful tokens such as hyphens (e.g., “design-build”). Stop-word removal was applied only in TF-IDF pipelines, whereas stemming and lemmatization were avoided since pilot tests showed no improvement and occasionally degraded performance by truncating domain-specific terms (e.g., “geotechnical”). For Word2Vec, BERT, and LLM approaches, raw but normalized text was retained to preserve semantic and contextual cues.
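The light normalization described above can be sketched in a few lines of Python. The exact rules used by the authors are not published, so this helper is an illustrative approximation of the stated behavior (lowercasing, whitespace trimming, punctuation removal with hyphens preserved):

```python
import re

def normalize_risk_text(text: str) -> str:
    """Lightly normalize a risk statement, preserving meaningful hyphens."""
    # Lowercase and trim leading/trailing whitespace.
    text = text.lower().strip()
    # Collapse internal runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Strip punctuation except hyphens, which carry meaning in
    # domain terms such as "design-build".
    text = re.sub(r"[^\w\s-]", "", text)
    return text
```

Stop-word removal would then be applied only inside the TF-IDF pipeline, while this normalized text feeds the Word2Vec, BERT, and LLM approaches directly.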

3.2. Benchmarking NLP and LLM Approaches for Risk Classification

With the growing interest in data-driven approaches within the construction domain [61,62], numerous studies have demonstrated the application of machine learning techniques to automate key project management tasks—ranging from cost estimation [63,64] and scheduling to resource allocation [65,66,67] and risk forecasting [68,69]. The textual nature of risk management data requires machine learning approaches capable of understanding, analyzing, and comparing unstructured language—making NLP [70,71,72,73,74] a critical tool in this context. This study includes a comprehensive set of NLP methods to serve as baseline models for comparative evaluation. Figure 5 illustrates a two-stage process comprising text vectorization and machine learning classifier selection. At each stage, a comprehensive set of techniques has been incorporated to ensure robust baseline comparisons.
A fundamental step in NLP involves text vectorization—the process of converting unstructured text into numerical representations that preserve meaningful linguistic patterns. Traditional methods like Term Frequency–Inverse Document Frequency (TF-IDF) capture word importance based on frequency statistics (Equation (1)), while neural embedding techniques such as Word2Vec learn semantic relationships by modeling word co-occurrence contexts [59,75]. More recently, Transformer-based models like BERT produce context-aware embeddings that capture nuanced meaning based on surrounding text, enabling deeper semantic understanding [76]. These diverse approaches form the foundation for downstream tasks such as classification, clustering, and similarity analysis.
$$\mathrm{TFIDF}_{score} = \frac{n_t}{N} \times \left(1 + \log \frac{k}{k_t}\right) \tag{1}$$
  • where
  • $n_t$: number of occurrences of term $t$ in the document.
  • $N$: total number of terms in the document.
  • $k$: total number of documents.
  • $k_t$: number of documents containing the term $t$.
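Equation (1) can be implemented directly in a few lines; the helper below is an illustrative sketch using the paper's notation (note that library implementations such as scikit-learn's use slightly different TF-IDF variants):

```python
import math

def tfidf_score(n_t: int, N: int, k: int, k_t: int) -> float:
    """TF-IDF score per Equation (1): term frequency times smoothed IDF."""
    # Term frequency: occurrences of term t over total terms in the document.
    tf = n_t / N
    # Inverse document frequency with the +1 smoothing used in Equation (1).
    idf = 1 + math.log(k / k_t)
    return tf * idf
```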
Once each risk item is transformed into a vectorized representation, a set of machine learning classifiers is applied to build the risk classification pipeline, as shown in Figure 5. Leveraging the scikit-learn library, this study evaluates multiple widely used algorithms—including Logistic Regression (LR), Support Vector Machines (SVMs), and Random Forest (RF) [77,78]. These classifiers were selected to represent a diverse range of learning paradigms—linear models (LR), margin-based classifiers suitable for high-dimensional spaces (SVM), and ensemble-based decision trees capable of capturing non-linear patterns (RF). This diversity enables a robust baseline comparison and helps reveal how different model types handle the challenges of semantic variation and class imbalance in textual risk data.
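A minimal version of this two-stage pipeline can be sketched with scikit-learn. The risk statements and category labels below are hypothetical toy stand-ins (the ISMP data are not reproduced here), and the hyperparameters are illustrative defaults rather than the study's tuned settings:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical risk statements with Level 1 RBS category labels.
texts = ["utility relocation delays", "design change during review",
         "right of way acquisition cost", "geotechnical conditions differ"] * 10
labels = ["Utilities", "Design", "Right of Way", "Structure and Geotechnical"] * 10

# 80/20 split, stratified to keep all categories in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in classifiers.items():
    # Stage 1: TF-IDF vectorization; Stage 2: classifier.
    pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                     ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))
```

Swapping the `tfidf` step for averaged Word2Vec or BERT embeddings yields the remaining baseline configurations.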
To assess the predictive performance of the classification models, this study employs standard statistical evaluation metrics widely used in classification tasks. Specifically, four key metrics are utilized—accuracy, precision, recall, and F1-score [79,80]. These metrics (Equations (2)–(5)) provide a balanced view of overall correctness (accuracy), the model’s ability to avoid false positives (FP, precision), its sensitivity to true positives (TP, recall), and the harmonic mean of precision and recall (F1-score), offering a comprehensive evaluation framework for multi-class text classification.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
$$F1 = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{5}$$
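In practice these metrics need not be computed by hand; the snippet below is a small illustration on hypothetical labels, using the weighted averaging reported for the multi-class results:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted category labels for five risk items.
y_true = ["Design", "Design", "Utilities", "Traffic", "Utilities"]
y_pred = ["Design", "Utilities", "Utilities", "Traffic", "Utilities"]

accuracy = accuracy_score(y_true, y_pred)
# Weighted averaging accounts for class support; macro averaging
# (average="macro") would treat all classes equally instead.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```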

3.3. LLMs: Prompt Engineering Approaches

LLMs generate responses to input prompts, which may consist of natural language questions, task instructions, or multimodal content such as text and images. The prompt functions as the core interface through which users guide the model’s behavior and constrain its output (Figure 6). In this study, the prompt is carefully designed to instruct the model to perform a risk classification task—specifically, to assign project-specific risk statements to one of several predefined high-level risk categories. The clarity and structure of the prompt play a critical role in aligning the model’s output with the intended classification objective [81,82,83,84]. A recent literature review on the application of LLMs in civil engineering [43] highlights three common adaptation strategies for repurposing general-purpose models such as GPT for specialized tasks like risk classification in construction projects—zero-shot learning, few-shot learning, and instruction-based prompting.
In zero-shot learning, the model makes predictions based solely on a well-crafted prompt without seeing any task-specific examples. Few-shot learning introduces the model to a small number of labeled examples within the prompt to demonstrate the task, improving contextual understanding and classification accuracy. Instruction-based prompting provides a directive or task description that clearly defines the expected output, helping guide the model’s behavior without requiring training or extensive fine-tuning [85]. These strategies enable LLMs to be applied effectively in domain-specific contexts despite being trained on broad, general-purpose data. Accordingly, task-specific prompts were developed for the risk classification objective. The complete set of prompts utilized in this study is presented in Table 3.
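The three strategies can be illustrated as simple template functions. The category list and wording below are illustrative stand-ins, not the exact prompts of Table 3:

```python
# Level 1 RBS categories used as the label space (per Figure 3).
CATEGORIES = ["Design", "Construction", "Utilities", "Right of Way",
              "Environmental", "Management and Funding", "Organizational",
              "Procurement and Contracting", "Stakeholder",
              "Structure and Geotechnical", "Traffic"]

def zero_shot_prompt(risk: str) -> str:
    """Zero-shot: task statement only, no examples."""
    return ("Classify the following construction risk statement into one of "
            f"these categories: {', '.join(CATEGORIES)}.\n"
            f"Risk: {risk}\nCategory:")

def few_shot_prompt(risk: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend one labeled demonstration per category."""
    demos = "\n".join(f"Risk: {r}\nCategory: {c}" for r, c in examples)
    return ("Classify each construction risk statement into one of "
            f"these categories: {', '.join(CATEGORIES)}.\n"
            f"{demos}\nRisk: {risk}\nCategory:")
```

An instruction-based prompt would extend the zero-shot template with an explicit role description and output-format constraints.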
For the few-shot setting, a single example was randomly selected from each Level 1 risk category within the RBS framework. The goal was to provide representative and diverse samples reflecting common phrasing styles across categories, rather than to optimize for performance. This approach ensured coverage of all categories while avoiding performance tuning through hand-picked cases. It should be noted that different selections may affect outcomes, as LLM performance is sensitive to example choice. In this study, we employed GPT-4 as the representative LLM to explore its capability in automating the classification of construction risk statements into predefined high-level categories. GPT-4 was selected due to its state-of-the-art performance in natural language understanding and its accessibility through an API, which enables scalable deployment for practical applications. To ensure a consistent and fair evaluation, all LLM prompting strategies—zero-shot, instruction-based, and few-shot—were tested using the same held-out testing dataset employed in the traditional NLP classifier pipeline. Although a key advantage of LLM-based approaches lies in their ability to operate without training data, this controlled setup allowed for a direct, comparable assessment of performance across methods under identical data conditions. LLM performance was evaluated using the same standard classification metrics—accuracy, precision, recall, and F1-score—applied to the NLP classifiers. All GPT runs were executed with temperature = 0 and max_tokens = 20. Setting the temperature to zero forces the model to generate deterministic outputs, which is critical for consistent benchmarking. The max_tokens parameter was limited to 20 to constrain responses to short category labels, avoiding unnecessary elaboration. Other parameters (e.g., top_p, penalties, and stop sequences) were left at their default values. Full prompt templates and code are provided in the GitHub repository.
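A minimal sketch of the call configuration described above, using the OpenAI chat-completions interface; the request-building helper is a hypothetical wrapper, not code from the study's repository:

```python
def build_classification_request(prompt: str) -> dict:
    """Assemble keyword arguments for a deterministic GPT-4 classification call."""
    return {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": prompt}],
        # temperature=0 forces deterministic outputs for benchmarking;
        # max_tokens=20 constrains responses to short category labels.
        "temperature": 0,
        "max_tokens": 20,
    }

# Sending the request requires the openai client and an API key, e.g.:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**build_classification_request(prompt))
# label = response.choices[0].message.content.strip()
```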

4. Results

All NLP classifiers and LLM-based approaches were evaluated using a consistent dataset split, with 80% of the data allocated for training the traditional models and the remaining 20% reserved for testing. Unlike conventional classifiers, LLM-based methods require no training data, offering a significant advantage in settings where labeled data are limited or unavailable. Table 4 presents the weighted average performance of all classification models across the eleven risk categories, enabling a direct comparison of predictive effectiveness. Among all evaluated models, the combination of BERT embeddings with SVM achieved the highest classification performance with an accuracy and F1-score of 0.86. This demonstrates the effectiveness of Transformer-based language representations paired with robust classifiers for nuanced text classification tasks in risk management. GPT-4 with few-shot prompting achieved an F1-score of 0.81, closely matching the performance of the best traditional models (e.g., BERT + LR) without requiring model training. This finding highlights the remarkable potential of LLMs to perform high-quality classification in data-constrained scenarios, reinforcing their utility as flexible, zero-training alternatives in construction risk management applications. Compared to zero-shot prompting, the instruction-based GPT prompt improved the F1-score from 0.79 to 0.82. This indicates that carefully crafted task framing and prompt structure can significantly enhance LLM prediction quality even in the absence of fine-tuning.
Both TF-IDF paired with Random Forest and Word2Vec combined with Logistic Regression achieved a strong performance (F1-score = 0.83), demonstrating that traditional feature-based NLP methods remain competitive—particularly when trained on high-quality, well-labeled, domain-specific datasets. However, their effectiveness is heavily dependent on the availability of extensive labeled data and careful preprocessing, which can be resource-intensive and time-consuming. In contrast, LLM-based approaches offer a compelling alternative, with the potential to match or exceed traditional methods when augmented with minimal supervision—such as few-shot examples or well-crafted instructions—thereby significantly reducing the burden of data labeling while maintaining a robust performance.
Figure 7 presents the confusion matrix for GPT-4’s few-shot classification across eleven high-level construction risk categories. Overall, the model demonstrates a strong classification accuracy, with most predictions concentrated along the diagonal—indicating correct label assignments. Categories such as Design, Utilities, Right of Way, and Traffic exhibit near-perfect alignment between true and predicted labels, underscoring GPT-4’s ability to distinguish clear semantic boundaries when provided with well-structured prompts and contextual examples. However, categories with broader conceptual overlap, particularly Construction, display a higher degree of misclassification. Several Construction-related risks were mistakenly categorized as Management and Funding, Stakeholder, or Structure and Geotechnical, revealing challenges in disambiguating risks that involve operational, financial, or technical dimensions simultaneously. Despite being underrepresented in the test data, classes like Organizational and Structure and Geotechnical achieved highly precise predictions, showcasing the strength of few-shot prompting in low-data regimes. Notably, the model confused some Procurement and Contracting risks with Design or Environmental, suggesting potential ambiguity in how procedural and regulatory language is interpreted across domains. These findings reinforce the utility of LLM-based classification strategies in managing domain-specific taxonomies and highlight both the opportunities and limitations of language-based reasoning in unstructured construction risk data.
Figure 8 presents a detailed class-wise comparison between the best-performing traditional model (BERT with SVM) and the GPT-4 few-shot prompting approach for classifying construction risk statements into predefined categories. While BERT-SVM achieves high precision on many classes (often nearing 1.0), its recall values vary widely—dropping as low as 0.12 for the “Stakeholder” category and 0.20 for “Procurement and Contracting”—indicating challenges in capturing minority or nuanced classes despite its high precision. In contrast, GPT-4 few-shot demonstrates a significantly more balanced performance across both frequent and infrequent classes. It delivers strong recall in critical categories like “Management and Funding” (0.91), “Organizational” (0.83), and “Structure and Geotechnical” (0.89), while maintaining comparable precision. This balanced behavior is reflected in GPT-4’s higher macro-average recall (0.82 vs. 0.62) and macro-average F1-score (0.80 vs. 0.68) compared to BERT-SVM. The balanced recall observed in GPT-4’s predictions carries practical significance for risk management. High recall in categories such as Management and Funding, Organizational, or Structure and Geotechnical means fewer critical risks are overlooked, thereby reducing false negatives. In practice, missing such risks could result in major schedule delays, cost escalations, or safety incidents.
These results suggest that LLMs, particularly in a few-shot setting, are more effective in real-world applications where labeled training data are limited and performance across all classes—not just dominant ones—is critical. While Figure 8 highlights notable differences in class-wise recall and F1-score between GPT-4 few-shot and BERT-SVM, these comparisons are descriptive and not statistically validated. Formal tests such as McNemar’s test, paired t-tests, or effect size measures could provide stronger evidence of significance. As the primary aim of this study was benchmarking [86] across methods, statistical validation was not included here but represents an important direction for future research.
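For reference, an exact two-sided McNemar's test needs only the discordant counts from paired predictions on the same test items. The stdlib sketch below illustrates how such a comparison could be run; the counts shown are hypothetical, not results from this study:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts:
    b = items model A got right and model B got wrong,
    c = items model B got right and model A got wrong."""
    n = b + c
    if n == 0:
        return 1.0  # No disagreements: the models are indistinguishable.
    # Exact binomial tail under H0 (discordant pairs split 50/50).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```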

5. Discussion

To rigorously evaluate the comparative performance of traditional NLP classifiers and LLMs, this study designed an experiment involving 12 model configurations spanning TF-IDF, Word2Vec, BERT, and GPT-based approaches (including zero-shot, instruction, and few-shot prompting). This comprehensive benchmarking framework offers a holistic view of both classic machine learning pipelines and modern LLM capabilities for the task of risk classification. The findings reveal that LLM-based solutions deliver a competitive performance without the need for model-specific training data. Notably, GPT-4 few-shot demonstrated more balanced classification across categories—avoiding the common tendency of traditional classifiers to overfit dominant classes.
Although the task was conducted on a sample risk register—where project-specific risk statements were mapped to a fixed set of generic risk categories—the LLM approach proved highly effective as an out-of-the-box tool for identifying risk types from unstructured textual inputs. This underscores the transformative potential of LLMs in automating risk analysis with minimal manual labeling. While the goal remains the development of a fully autonomous, high-accuracy LLM tool capable of generating comprehensive risk assessments for new construction projects from raw inputs, achieving this vision will require the integration of rich external domain knowledge. Encouragingly, LLMs already offer foundational capabilities that pave the way toward this future.
The instruction-based GPT-4 prompt yielded notable improvements over the basic zero-shot configuration, highlighting the critical role of well-structured task descriptions and contextual constraints in aligning LLM outputs with domain-specific objectives. When representative examples were added to the prompt (i.e., few-shot prompting), performance further increased, achieving an F1-score of 0.81—closely matching that of the best-performing traditional model (BERT + SVM). These results underscore the effectiveness of few-shot prompting as a lightweight yet powerful strategy for adapting general-purpose LLMs to specialized classification tasks without the need for model fine-tuning or retraining. In essence, prompt engineering emerges as the primary lever for shaping LLM behavior and optimizing task-specific performance. Beyond the performance results reported here, it is important to note that LLMs may also introduce bias into risk classification. Such biases can arise from the underlying pre-training corpus (which may not reflect construction-specific language), from tendencies to oversimplify nuanced risk descriptions, or from prompt framing effects. While these issues were not the focus of this benchmarking study, they represent critical challenges for future work. Approaches such as domain-adapted fine-tuning, bias audits, and hybrid workflows that combine automated classification with expert validation will be necessary to ensure that LLM-based risk management tools are effective in practice.
To illustrate the practical contributions, consider two hypothetical deployment scenarios. First, within an engineering consulting firm, GPT-4 could be applied to preliminary design documents, meeting transcripts, or contract drafts to automatically flag potential risks and provide an initial classification aligned with standard taxonomies. This would allow consultants to provide clients with rapid, evidence-based insights during early project planning, even before detailed data are available. Second, within a construction unit, GPT-4 could be embedded into project management workflows to review draft risk registers, helping teams identify underrepresented issues. These scenarios demonstrate how LLMs can function as practical plug-and-play tools, enabling more timely and balanced risk assessments in real project environments. While this study benchmarked GPT-4 as a representative commercial LLM, future work should investigate whether more compact or open-source models can achieve comparable performance. If so, the trade-offs would involve balancing accuracy with accessibility and practicality; open-source models may offer benefits in cost, transparency, and local deployment, while commercial models may provide a more stable performance and broader domain generalization. Understanding these trade-offs will be critical for guiding adoption in practice.
Unlike supervised ML models that require labeled training data, LLMs deliver strong results with only prompt-based interaction. This scalability is particularly beneficial in early-stage project planning or organizations lacking labeled datasets. Incorporating structured risk taxonomies, glossaries, or past project documentation via Retrieval-Augmented Generation (RAG) could enhance LLM understanding and reduce misclassification in closely related categories (e.g., “Stakeholder” vs. “Organizational”). It should be noted that this benchmarking was conducted on a curated dataset of project risk registers, which—while representative of real project documentation—differs from fully unstructured sources such as contract clauses, meeting minutes, or press releases. This controlled setup allowed for consistent evaluation across NLP and LLM approaches; however, it does not capture the complexity of extracting risks from raw, unstructured text streams or real-time project data. As such, the results should be interpreted within the scope of structured registers, with future work focused on extending the framework to more diverse and dynamic sources.
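The retrieval step suggested above (RAG over past project documentation) can be sketched minimally. In this illustrative sketch, lexical Jaccard overlap stands in for the embedding-based similarity a production system would use, and the corpus entries, function names, and prompt wording are all hypothetical rather than taken from the study's pipeline:

```python
# Sketch of a retrieval step for RAG-style prompting: pull the most
# similar past risk entries (with their RBS labels) into the prompt.
# Jaccard token overlap is a stand-in for embedding similarity; the
# corpus below is hypothetical.
RISK_CORPUS = [
    ("utility relocation delays on the north corridor", "Utilities"),
    ("public opposition to night-time construction", "Stakeholder"),
    ("change in agency leadership mid-project", "Organizational"),
]

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_context(query: str, k: int = 2):
    """Return the k corpus entries most similar to the query."""
    ranked = sorted(RISK_CORPUS, key=lambda e: jaccard(query, e[0]),
                    reverse=True)
    return ranked[:k]

def augmented_prompt(query: str) -> str:
    """Prepend retrieved (risk, label) pairs to the classification prompt."""
    ctx = "\n".join(f'- "{t}" -> {c}' for t, c in retrieve_context(query))
    return (f"Classify the risk below. Similar past risks:\n{ctx}\n"
            f'Risk: "{query}"\nReturn only the category name.')
```

Grounding the prompt in labeled neighbors like this is one way to disambiguate closely related categories such as "Stakeholder" versus "Organizational".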
This study is not without limitations. First, while the classification task focused on mapping project-specific risk statements to Level 1 generic categories within an RBS, further validation is needed at more granular tiers where semantic overlap is greater. Second, the evaluation was conducted on a curated, static dataset, which enabled fair benchmarking but does not reflect the dynamic nature of real-world projects. Future work should extend the framework to streaming and real-time data sources (e.g., contracts in progress, social media, or incident reports), where risks may emerge outside predefined categories. Third, although this study revealed a strong performance from LLMs across imbalanced classes, we did not apply augmentation techniques to underrepresented categories when testing traditional NLP models. Future research could explore text-based augmentation methods (e.g., back-translation, paraphrasing, or synthetic oversampling) to determine whether conventional approaches can narrow the gap with LLMs. Finally, the scope of this paper was limited to risk classification. Future directions include leveraging the generative capabilities of LLMs to not only identify emerging risks beyond fixed taxonomies but also to propose potential mitigation strategies by synthesizing domain knowledge and contextual project information. Together, these avenues highlight promising opportunities to build upon the current study and advance the integration of LLMs into proactive and adaptive construction risk management.
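Of the augmentation options mentioned, back-translation and paraphrasing require external models, but the simplest baseline, random oversampling of minority categories, can be sketched in a few lines. Everything here is illustrative and was not part of the study's benchmarking pipeline:

```python
import random
from collections import Counter

def oversample(texts, labels, seed=0):
    """Random oversampling: duplicate minority-class examples until every
    class matches the majority-class count. A simple stand-in for richer
    text augmentation (back-translation, paraphrasing), which requires
    external models."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for t, l in zip(texts, labels):
        by_class.setdefault(l, []).append(t)
    out_t, out_l = list(texts), list(labels)
    for cls, items in by_class.items():
        for _ in range(target - counts[cls]):
            out_t.append(rng.choice(items))  # duplicate a minority example
            out_l.append(cls)
    return out_t, out_l
```

A follow-up experiment could compare traditional pipelines trained on such balanced data against the LLM results reported here.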

6. Conclusions

This paper conducted a broad and systematic comparison of twelve classification pipelines spanning traditional NLP approaches (TF-IDF, Word2Vec), Transformer-based models (BERT), and large language models (GPT-4 with zero-shot, instruction-based, and few-shot prompting) for the task of construction risk classification. Rather than identifying a singular “best” model, this exploratory analysis aimed to uncover the performance landscape across diverse methodologies and use-case scenarios. The results demonstrate that GPT-4, particularly with instruction and few-shot prompting, offers a competitive performance (F1-score = 0.81–0.82) that approaches the benchmark set by the top-performing BERT + SVM model (F1-score = 0.86). This finding is significant; it suggests that LLMs, even without fine-tuning, can serve as effective plug-and-play tools for domain-specific tasks—especially when labeled data are limited or unavailable. Their ability to generalize in low-data settings offers a practical advantage for early-stage analysis or rapid deployment in new projects.
These findings highlight that model selection should be guided by use-case constraints; LLMs may be best suited for scenarios with limited data or expert access, while classical pipelines remain strong when high-quality training data are available. The performance of LLMs reinforces their promise as versatile risk identification tools, but their success hinges on careful prompt design and an understanding of their behavior across diverse risk categories. Future research should explore hybrid pipelines, prompt optimization strategies, and model adaptation using domain-specific corpora to bridge the remaining performance gaps and enable more robust, explainable AI in construction risk management. Beyond transportation megaprojects, the framework presented here has the potential to be extended to other construction domains such as residential and commercial projects. While terminology and risk profiles may differ across sectors, the use of high-level risk categories provides a foundation for cross-domain adaptation. Future studies should validate these findings in diverse contexts to ensure robustness and adaptability.

Author Contributions

Conceptualization: A.E.; methodology: A.E. and H.K.; investigation: A.E. and H.K.; data curation: A.E.; writing—original draft preparation: A.E. and H.K.; writing—review and editing: A.E. and H.K.; visualization: A.E.; supervision: A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study and the Python code supporting the findings are openly available in the GitHub repository at https://github.com/MTUresearch/LLM-Risk (accessed on 1 August 2025).

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT-5 for the purposes of enhancing the language and readability. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Taroun, A. Towards a better modelling and assessment of construction risk: Insights from a literature review. Int. J. Proj. Manag. 2014, 32, 101–115. [Google Scholar] [CrossRef]
  2. Lafhaj, Z.; Rebai, S.; AlBalkhy, W.; Hamdi, O.; Mossman, A.; Alves Da Costa, A. Complexity in construction projects: A literature review. Buildings 2024, 14, 680. [Google Scholar] [CrossRef]
  3. Osei-Kyei, R.; Jin, X.; Nnaji, C.; Akomea-Frimpong, I.; Wuni, I.Y. Review of risk management studies in public-private partnerships: A scientometric analysis. Int. J. Constr. Manag. 2023, 23, 2419–2430. [Google Scholar] [CrossRef]
  4. Siraj, N.B.; Fayek, A.R. Risk identification and common risks in construction: Literature review and content analysis. J. Constr. Eng. Manag. 2019, 145, 03119004. [Google Scholar] [CrossRef]
  5. Al-Mhdawi, M.K.S.; O’Connor, A.; Qazi, A.; Rahimian, F.; Dacre, N. Review of studies on risk factors in critical infrastructure projects from 2011 to 2023. Smart Sustain. Built Environ. 2025, 14, 342–376. [Google Scholar] [CrossRef]
  6. Arijeloye, B.T.; Ramabodu, M.S.; Chikafalimani, S.H.P. Application of Fuzzy Risk Allocation Decision Model for Improving the Nigerian Public–Private Partnership Mass Housing Project Procurement. Buildings 2025, 15, 2866. [Google Scholar] [CrossRef]
  7. Alba-Rodríguez, M.D.; Lucas-Ruiz, V.; Marrero, M. Systematic Methodology for Estimating the Social Dimension of Construction Projects—Assessing Health and Safety Risks Based on Project Budget Analysis. Buildings 2025, 15, 2313. [Google Scholar] [CrossRef]
  8. El-Sayegh, S.M.; Manjikian, S.; Ibrahim, A.; Abouelyousr, A.; Jabbour, R. Risk identification and assessment in sustainable construction projects in the UAE. Int. J. Constr. Manag. 2021, 21, 327–336. [Google Scholar] [CrossRef]
  9. Erfani, A.; Ma, Z.; Cui, Q.; Baecher, G.B. Ex post project risk assessment: Method and empirical study. J. Constr. Eng. Manag. 2023, 149, 04022174. [Google Scholar] [CrossRef]
  10. Dicks, E.P.; Molenaar, K.R. Causes of Incomplete Risk Identification in Major Transportation Engineering and Construction Projects. Transp. Res. Rec. 2024, 2679, 619–628. [Google Scholar] [CrossRef]
  11. Bepari, M.; Narkhede, B.E.; Raut, R.D. A comparative study of project risk management with risk breakdown structure (RBS): A case of commercial construction in India. Int. J. Constr. Manag. 2024, 24, 673–682. [Google Scholar] [CrossRef]
  12. Serpell, A.; Ferrada, X.; Rubio, L.; Arauzo, S. Evaluating risk management practices in construction organizations. Procedia-Soc. Behav. Sci. 2015, 194, 201–210. [Google Scholar] [CrossRef]
  13. Yousri, E.; Sayed, A.E.B.; Farag, M.A.; Abdelalim, A.M. Risk identification of building construction projects in Egypt. Buildings 2023, 13, 1084. [Google Scholar] [CrossRef]
  14. Bahamid, R.A.; Doh, S.I.; Khoiry, M.A.; Kassem, M.A.; Al-Sharafi, M.A. The current risk management practices and knowledge in the construction industry. Buildings 2022, 12, 1016. [Google Scholar] [CrossRef]
  15. Tavakolan, M.; Mohammadi, A. Risk management workshop application: A case study of Ahwaz Urban Railway project. Int. J. Constr. Manag. 2018, 18, 260–274. [Google Scholar] [CrossRef]
  16. Goh, C.S.; Abdul-Rahman, H.; Abdul Samad, Z. Applying risk management workshop for a public construction project: Case study. J. Constr. Eng. Manag. 2013, 139, 572–580. [Google Scholar] [CrossRef]
  17. Erfani, A.; Cui, Q. Predictive risk modeling for major transportation projects using historical data. Autom. Constr. 2022, 139, 104301. [Google Scholar] [CrossRef]
  18. Gao, N.; Touran, A.; Wang, Q.; Beauchamp, N. Construction risk identification using a multi-sentence context-aware method. Autom. Constr. 2024, 164, 105466. [Google Scholar] [CrossRef]
  19. Diao, C.; Liang, R.; Sharma, D.; Cui, Q. Litigation risk detection using Twitter data. J. Leg. Aff. Disput. Resolut. Eng. Constr. 2020, 12, 04519047. [Google Scholar] [CrossRef]
  20. Pham, H.T.; Han, S. Natural language processing with multitask classification for semantic prediction of risk-handling actions in construction contracts. J. Comput. Civ. Eng. 2023, 37, 04023027. [Google Scholar] [CrossRef]
  21. Wong, S.; Zheng, C.; Su, X.; Tang, Y. Construction contract risk identification based on knowledge-augmented language models. Comput. Ind. 2024, 157, 104082. [Google Scholar] [CrossRef]
  22. Kazemi, M.H.; Alvanchi, A. Application of NLP-based models in automated detection of risky contract statements written in complex script system. Expert Syst. Appl. 2025, 259, 125296. [Google Scholar] [CrossRef]
  23. Kim, J.; Kwon, B.; Lee, J.; Mun, D. Inherent risks identification in a contract document through automated rule generation. Autom. Constr. 2025, 172, 106044. [Google Scholar] [CrossRef]
  24. Jallan, Y.; Ashuri, B. Text mining of the securities and exchange commission financial filings of publicly traded construction firms using deep learning to identify and assess risk. J. Constr. Eng. Manag. 2020, 146, 04020137. [Google Scholar] [CrossRef]
  25. Erfani, A.; Cui, Q. Natural language processing application in construction domain: An integrative review and algorithms comparison. In Proceedings of the ASCE International Conference on Computing in Civil Engineering 2021, Orlando, FL, USA, 12–14 September 2021; Available online: https://ascelibrary.org/doi/10.1061/9780784483893.004 (accessed on 11 September 2025).
  26. Mohamed, M.A.H.; Al-Mhdawi, M.K.S.; Ojiako, U.; Dacre, N.; Qazi, A.; Rahimian, F. Generative AI in construction risk management: A bibliometric analysis of the associated benefits and risks. Urban. Sustain. Soc. 2025, 2, 196–228. [Google Scholar] [CrossRef]
  27. Project Management Institute (PMI). The Project Management Body of Knowledge (PMBOK Guide), 5th ed.; Project Management Institute: Newtown Square, PA, USA, 2013. [Google Scholar]
  28. Zhao, X. Construction risk management research: Intellectual structure and emerging themes. Int. J. Constr. Manag. 2024, 24, 540–550. [Google Scholar] [CrossRef]
  29. Al-Mhdawi, M.K.S.; Brito, M.; Onggo, B.S.; Qazi, A.; O’Connor, A.; Namian, M. Construction risk management in Iraq during the COVID-19 pandemic: Challenges to implementation and efficacy of practices. J. Constr. Eng. Manag. 2023, 149, 04023086. [Google Scholar] [CrossRef]
  30. Su, G.; Khallaf, R. Research on the influence of risk on construction project performance: A systematic review. Sustainability 2022, 14, 6412. [Google Scholar] [CrossRef]
  31. Hanna, A.S.; Thomas, G.; Swanson, J.R. Construction risk identification and allocation: Cooperative approach. J. Constr. Eng. Manag. 2013, 139, 1098–1107. [Google Scholar] [CrossRef]
  32. Dicks, E.P.; Molenaar, K.R. Analysis of Washington State Department of Transportation risks. Transp. Res. Rec. 2023, 2677, 1690–1700. [Google Scholar] [CrossRef]
  33. Erfani, A.; Cui, Q.; Baecher, G.; Kwak, Y.H. Data-driven approach to risk identification for major transportation projects: A common risk breakdown structure. IEEE Trans. Eng. Manag. 2023, 71, 6830–6841. [Google Scholar] [CrossRef]
  34. Duijm, N.J. Recommendations on the use and design of risk matrices. Saf. Sci. 2015, 76, 21–31. [Google Scholar] [CrossRef]
  35. Pan, Y.; Zhang, L. Roles of artificial intelligence in construction engineering and management: A critical review and future trends. Autom. Constr. 2021, 122, 103517. [Google Scholar] [CrossRef]
  36. Abioye, S.O.; Oyedele, L.O.; Akanbi, L.; Ajayi, A.; Delgado, J.M.D.; Bilal, M.; Ahmed, A. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 2021, 44, 103299. [Google Scholar] [CrossRef]
  37. Erfani, A.; Shayesteh, N.; Adnan, T. Data-augmented explainable AI for pavement roughness prediction. Autom. Constr. 2025, 176, 106307. [Google Scholar] [CrossRef]
  38. Tian, K.; Zhu, Z.; Mbachu, J.; Moorhead, M.; Ghanbaripour, A. Artificial intelligence in construction risk management: A decade of developments, challenges, and integration pathways. J. Risk Res. 2025, 1–33. [Google Scholar] [CrossRef]
  39. Chung, S.; Kim, J.; Baik, J.; Chi, S.; Kim, D.Y. Identifying issues in international construction projects from news text using pre-trained models and clustering. Autom. Constr. 2024, 168, 105875. [Google Scholar] [CrossRef]
  40. Erfani, A.; Cui, Q.; Cavanaugh, I. An empirical analysis of risk similarity among major transportation projects using natural language processing. J. Constr. Eng. Manag. 2021, 147, 04021175. [Google Scholar] [CrossRef]
  41. Zhang, F. A hybrid structured deep neural network with Word2Vec for construction accident causes classification. Int. J. Constr. Manag. 2022, 22, 1120–1140. [Google Scholar] [CrossRef]
  42. Ye, Y.X.; Shan, M.; Gao, X.; Li, Q.; Zhang, H. Examining causes of disputes in subcontracting litigation cases using text mining and natural language processing techniques. Int. J. Constr. Manag. 2024, 24, 1617–1629. [Google Scholar] [CrossRef]
  43. Erfani, A.; Mansouri, A. Applications of Multimodal Large Language Models in Construction Industry. 2025. Available online: https://ssrn.com/abstract=5278215 (accessed on 11 September 2025).
  44. Martin, H.; James, J.; Chadee, A. Exploring Large Language Model AI tools in construction project risk assessment: ChatGPT limitations in risk identification, mitigation strategies, and user experience. J. Constr. Eng. Manag. 2025, 151, 04025119. [Google Scholar] [CrossRef]
  45. Chen, G.; Alsharef, A.; Ovid, A.; Albert, A.; Jaselskis, E. Meet2Mitigate: An LLM-powered framework for real-time issue identification and mitigation from construction meeting discourse. Adv. Eng. Inform. 2025, 64, 103068. [Google Scholar] [CrossRef]
  46. Jeon, K.; Lee, G. Hybrid large language model approach for prompt and sensitive defect management: A comparative analysis of hybrid, non-hybrid, and GraphRAG approaches. Adv. Eng. Inform. 2025, 64, 103076. [Google Scholar] [CrossRef]
  47. Yao, D.; de Soto, B.G. Enhancing cyber risk identification in the construction industry using language models. Autom. Constr. 2024, 165, 105565. [Google Scholar] [CrossRef]
  48. Dikmen, I.; Eken, G.; Erol, H.; Birgonul, M.T. Automated construction contract analysis for risk and responsibility assessment using natural language processing and machine learning. Comput. Ind. 2025, 166, 104251. [Google Scholar] [CrossRef]
  49. Prieto, S.A.; Mengiste, E.T.; García de Soto, B. Investigating the use of ChatGPT for the scheduling of construction projects. Buildings 2023, 13, 857. [Google Scholar] [CrossRef]
  50. Tsai, W.L.; Le, P.L.; Ho, W.F.; Chi, N.W.; Lin, J.J.; Tang, S.; Hsieh, S.H. Construction safety inspection with contrastive language-image pre-training (CLIP) image captioning and attention. Autom. Constr. 2025, 169, 105863. [Google Scholar] [CrossRef]
  51. Gao, Y.; Gan, Y.; Chen, Y.; Chen, Y. Application of large language models to intelligently analyze long construction contract texts. Constr. Manag. Econ. 2025, 43, 226–242. [Google Scholar]
  52. Liu, C.Y.; Chou, J.S. Automated legal consulting in construction procurement using metaheuristically optimized large language models. Autom. Constr. 2025, 170, 105891. [Google Scholar] [CrossRef]
  53. He, C.; He, W.; Liu, M.; Leng, S.; Wei, S. Enriched construction regulation inquiry responses: A hybrid search approach for large language models. J. Manag. Eng. 2025, 41, 04025001. [Google Scholar] [CrossRef]
  54. Qin, S.; Guan, H.; Liao, W.; Gu, Y.; Zheng, Z.; Xue, H.; Lu, X. Intelligent design and optimization system for shear wall structures based on large language models and generative artificial intelligence. J. Build. Eng. 2024, 95, 109996. [Google Scholar] [CrossRef]
  55. Sonkor, M.S.; García de Soto, B. Using ChatGPT in construction projects: Unveiling its cybersecurity risks through a bibliometric analysis. Int. J. Constr. Manag. 2025, 25, 741–749. [Google Scholar] [CrossRef]
  56. Nyqvist, R.; Peltokorpi, A.; Seppänen, O. Can ChatGPT exceed humans in construction project risk management? Eng. Constr. Archit. Manag. 2024, 31, 223–243. [Google Scholar] [CrossRef]
  57. Aladağ, H. Assessing the accuracy of ChatGPT use for risk management in construction projects. Sustainability 2023, 15, 16071. [Google Scholar] [CrossRef]
  58. Isah, M.A.; Kim, B.S. Question-answering system powered by knowledge graph and generative pretrained transformer to support risk identification in tunnel projects. J. Constr. Eng. Manag. 2025, 151, 04024193. [Google Scholar] [CrossRef]
  59. Johnson, S.J.; Murty, M.R.; Navakanth, I. A detailed review on word embedding techniques with emphasis on Word2Vec. Multimed. Tools Appl. 2024, 83, 37979–38007. [Google Scholar] [CrossRef]
  60. Zhang, K.; Erfani, A.; Beydoun, O.; Cui, Q. Procurement benchmarks for major transportation projects. Transp. Res. Rec. 2022, 2676, 363–376. [Google Scholar] [CrossRef]
  61. You, Z.; Wu, C. A framework for data-driven informatization of the construction company. Adv. Eng. Inform. 2019, 39, 269–277. [Google Scholar] [CrossRef]
  62. AlTalhoni, A.; Alwashah, Z.; Liu, H.; Abudayyeh, O.; Kwigizile, V.; Kirkpatrick, K. Data-driven identification of key pricing factors in highway construction cost estimation during economic volatility. Int. J. Constr. Manag. 2025, 1–16. [Google Scholar] [CrossRef]
  63. Abu-Mahfouz, E.; Al-Dahidi, S.; Gharaibeh, E.; Alahmer, A. A novel feature engineering-based hybrid approach for precise construction cost estimation using fuzzy-AHP and artificial neural networks. Int. J. Constr. Manag. 2025, 1–11. [Google Scholar] [CrossRef]
  64. Paik, Y.; Chung, F.; Ashuri, B. Preliminary cost estimation of pavement maintenance projects through machine learning: Emphasis on trees algorithms. J. Manag. Eng. 2025, 41, 04025027. [Google Scholar] [CrossRef]
  65. ElAlem, M.A.; Mahdi, I.M.; Mohamadien, H.A.; Hosny, S. Forecasting scope creep in Egyptian construction projects: An evaluation using artificial neural network (ANN) and random forest models. Int. J. Constr. Manag. 2025, 1–20. [Google Scholar] [CrossRef]
  66. Adnan, T.; Erfani, A.; Cui, Q. Paving equity: Unveiling socioeconomic patterns in pavement conditions using data mining. J. Manag. Eng. 2025, 41, 04025041. [Google Scholar] [CrossRef]
  67. Zeberga, M.S.; Haaskjold, H.; Hussein, B.; Lædre, O.; Wondimu, P.A. Artificial intelligence–driven contractual conflict management in the AEC industry: Mapping benefits, practice, readiness, and ethical implementation strategies. J. Manag. Eng. 2025, 41, 04025016. [Google Scholar] [CrossRef]
  68. Gondia, A.; Siam, A.; El-Dakhakhni, W.; Nassar, A.H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 2020, 146, 04019085. [Google Scholar] [CrossRef]
  69. Moussa, A.; Ezzeldin, M.; El-Dakhakhni, W. Machine learning and optimization strategies for infrastructure projects risk management. Constr. Manag. Econ. 2025, 43, 557–582. [Google Scholar] [CrossRef]
  70. Shamshiri, A.; Ryu, K.R.; Park, J.Y. Text mining and natural language processing in construction. Autom. Constr. 2024, 158, 105200. [Google Scholar] [CrossRef]
  71. Wu, C.; Li, X.; Guo, Y.; Wang, J.; Ren, Z.; Wang, M.; Yang, Z. Natural language processing for smart construction: Current status and future directions. Autom. Constr. 2022, 134, 104059. [Google Scholar] [CrossRef]
  72. Ding, Y.; Ma, J.; Luo, X. Applications of natural language processing in construction. Autom. Constr. 2022, 136, 104169. [Google Scholar] [CrossRef]
  73. Erfani, A.; Hickey, P.J.; Cui, Q. Likeability versus competence dilemma: Text mining approach using LinkedIn data. J. Manag. Eng. 2023, 39, 04023013. [Google Scholar] [CrossRef]
  74. Cheng, M.Y.; Kusoemo, D.; Gosno, R.A. Text mining-based construction site accident classification using hybrid supervised machine learning. Autom. Constr. 2020, 118, 103265. [Google Scholar] [CrossRef]
  75. Patil, R.; Boit, S.; Gudivada, V.; Nandigam, J. A survey of text representation and embedding techniques in NLP. IEEE Access 2023, 11, 36120–36146. [Google Scholar] [CrossRef]
  76. Gardazi, N.M.; Daud, A.; Malik, M.K.; Bukhari, A.; Alsahfi, T.; Alshemaimri, B. BERT applications in natural language processing: A review. Artif. Intell. Rev. 2025, 58, 166. [Google Scholar] [CrossRef]
  77. Li, L.; Erfani, A.; Wang, Y.; Cui, Q. Anatomy into the battle of supporting or opposing reopening amid the COVID-19 pandemic on Twitter: A temporal and spatial analysis. PLoS ONE 2021, 16, e0254359. [Google Scholar] [CrossRef] [PubMed]
  78. Mohammadi, P.; Rashidi, A.; Malekzadeh, M.; Tiwari, S. Evaluating various machine learning algorithms for automated inspection of culverts. Eng. Anal. Bound. Elem. 2023, 148, 366–375. [Google Scholar] [CrossRef]
  79. Mansouri, A.; Erfani, A. Machine learning prediction of urban heat island severity in the Midwestern United States. Sustainability 2025, 17, 6193. [Google Scholar] [CrossRef]
  80. Vujović, Ž. Classification model evaluation metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606. [Google Scholar] [CrossRef]
  81. Zheng, J.; Fischer, M. Dynamic prompt-based virtual assistant framework for BIM information search. Autom. Constr. 2023, 155, 105067. [Google Scholar] [CrossRef]
  82. Yong, G.; Jeon, K.; Gil, D.; Lee, G. Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 1536–1554. [Google Scholar] [CrossRef]
  83. Sun, Y.; Gu, Z.; Yang, S.B. Probing vision and language models for construction waste material recognition. Autom. Constr. 2024, 166, 105629. [Google Scholar] [CrossRef]
  84. Uhm, M.; Kim, J.; Ahn, S.; Jeong, H.; Kim, H. Effectiveness of retrieval augmented generation-based large language models for generating construction safety information. Autom. Constr. 2025, 170, 105926. [Google Scholar] [CrossRef]
  85. Jiang, G.; Ma, Z.; Zhang, L.; Chen, J. Prompt engineering to inform large language model in automated building energy modeling. Energy 2025, 316, 134548. [Google Scholar] [CrossRef]
  86. Erfani, A.; Frias-Martinez, V. A fairness assessment of mobility-based COVID-19 case prediction models. PLoS ONE 2023, 18, e0292090. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall research framework.
Figure 2. Overview of project distribution by state, delivery method, type, and contract value.
Figure 3. RBS framework used for categorizing unstructured project risk statements.
Figure 4. Percentage of labeled risk items in each level 1 RBS category.
Figure 5. Natural language processing pipeline using machine learning classifiers.
Figure 6. Large language modeling pipeline using prompt engineering.
Figure 7. Confusion matrix for GPT-4 few-shot risk classification.
Figure 8. Class-wise performance comparison of BERT-SVM and GPT-4 Few-Shot.
Table 1. Studies employing unstructured data to enhance risk identification.
| Reference | Data Source | Methodology | Key Findings |
| Dikmen et al. (2025) [48] | Contract | Word2Vec, GloVe (NLP: embedding technique); BERT (NLP: Transformers) | Detect risk exposures in contract clauses |
| Gao et al. (2024) [18] | News articles | BERT (NLP: Transformers) | Identify and extract risk-related sentences from news articles |
| Erfani and Cui (2022) [17] | Risk registers | Word2Vec (NLP: embedding technique) | Predictive models can offer an initial step by capturing more than 50% of risks |
| Diao et al. (2020) [19] | Social media | Bayesian network | Estimate litigation risk using Twitter |
| Jallan and Ashuri (2020) [24] | Financial reports | FastText (NLP: embedding technique) | Identified 18 categories of risk using 10-K filings |
Table 2. Examples of mapping project-specific risks to level 1 RBS categories.
| Project-Specific Risk Item | Assigned RBS Level 1 |
| Changes to structural element design required | Design |
| Rock excavation in I-285/SR400 Interchange | Structure and geotechnical |
| Opportunity to use existing I-15 pavement for traffic detours because of the shifting of I-15 to the east | Traffic |
| Additional Right of Way may need to be acquired in east section | Right of Way |
| Section 4(f) resources affected National Environmental Policy Act Review | Environmental |
Table 3. Prompt engineering strategies for risk classification using LLMs.
Zero-shot: You are a risk management expert. Given the name of a construction project risk, assign it to the most appropriate high-level category from the list below.
Return only one category name from this list: construction, design, environmental, utilities, right of way, management and funding, traffic, stakeholder, organizational, and procurement and contracting.
Instruction: You are a construction risk classification expert. Your task is to assign the following unstructured risk statement to the most appropriate high-level risk category, based on the definitions below:
- Construction: Issues related to construction access, safety, materials, subcontractor performance, buried objects, construction methods, or weather-related impacts.
- Design: Problems involving design changes, delays, incompleteness, exceptions, or esthetic concerns.
- Environmental: Risks related to permitting, NEPA processes, endangered species, hazardous materials, water/air quality, or archeological constraints.
- Utilities: Issues with utility coordination, conflicts, requirements, relocation, or funding gaps.
- Right of Way: Challenges acquiring or relocating right of way, railroad access, or right-of-way planning.
- Management and Funding: Delays in decisions, scope changes, cash flow problems, economic conditions, labor disruptions, or force majeure.
- Traffic: Risks involving traffic growth, tolling, mobility impacts, land use, or pedestrian/bicycle access.
- Stakeholder: Public opposition, stakeholder changes, late requests, or communication breakdowns with external parties.
- Organizational: Internal changes in leadership, policy, resources, or organizational priorities.
- Procurement and contracting: Delays or disputes related to procurement methods, contract terms, or change orders.
- Structure and geotechnical: Issues involving excavation, soil/geotech conditions, structural vibration, or foundation design.
Classify the following risk statement into the most appropriate category from the list above. Return only the category name.
Few-shot: You are a risk management expert. Given the name of a construction project risk, assign it to the most appropriate high-level category. Use the following examples to guide your reasoning. Examples:
Risk Name: “developer schedule exposes owner to risk of unsubstantiated schedule delays”
Category: Construction
Risk Name: “cost savings opportunity for redecking bridges”
Category: Design
Risk Name: “additional sound walls required due to new development”
Category: Environmental
Risk Name: “damage to unknown utilities during construction”
Category: Utilities
Risk Name: “delay in right of way document internal approval process”
Category: Right Of Way
Risk Name: “labor strike”
Category: Management And Funding
Risk Name: “coordination issues between DB and tolling contractors”
Category: Traffic
Risk Name: “timely railroad collaboration (railroad undertaking own construction scope)”
Category: Stakeholder
Risk Name: “ongoing involvement of owner staff”
Category: Organizational
Risk Name: “management experience with alternative procurement delivery”
Category: Procurement And Contracting
Risk Name: “rock excavation”
Category: Structure And Geotechnical
Now, classify the following risk.
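The few-shot strategy in Table 3 can be assembled programmatically. The sketch below builds the prompt from a handful of the example pairs shown in the table; the wiring to the OpenAI Python SDK (shown commented out, since it requires an API key) and the helper names are illustrative assumptions, not the authors' exact implementation:

```python
# A few (risk, category) pairs taken from the Table 3 few-shot prompt.
EXAMPLES = [
    ("rock excavation", "Structure And Geotechnical"),
    ("labor strike", "Management And Funding"),
    ("damage to unknown utilities during construction", "Utilities"),
]

HEADER = ("You are a risk management expert. Given the name of a construction "
          "project risk, assign it to the most appropriate high-level category. "
          "Use the following examples to guide your reasoning. Examples:")

def build_few_shot_prompt(risk_name: str) -> str:
    """Interleave example pairs, then append the risk to classify."""
    lines = [HEADER]
    for name, category in EXAMPLES:
        lines.append(f'Risk Name: "{name}"')
        lines.append(f"Category: {category}")
    lines.append("Now, classify the following risk.")
    lines.append(f'Risk Name: "{risk_name}"')
    return "\n".join(lines)

# Hypothetical call via the OpenAI SDK (openai>=1.0 interface; needs a key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user",
#                "content": build_few_shot_prompt("rock excavation")}],
#     temperature=0,
# )
# print(resp.choices[0].message.content)
```

Keeping temperature at zero and requesting only the category name makes the output easy to parse and score against the labeled registers.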
Table 4. Comparison of weighted average performance across all classification models.
| Approach | Accuracy | Precision | Recall | F1-Score |
| TF-IDF (LR) | 0.77 | 0.82 | 0.77 | 0.73 |
| TF-IDF (SVM) | 0.80 | 0.86 | 0.80 | 0.78 |
| TF-IDF (RF) | 0.83 | 0.87 | 0.83 | 0.83 |
| Word2Vec (LR) | 0.83 | 0.85 | 0.83 | 0.83 |
| Word2Vec (SVM) | 0.78 | 0.84 | 0.78 | 0.76 |
| Word2Vec (RF) | 0.73 | 0.73 | 0.73 | 0.70 |
| BERT (LR) | 0.82 | 0.82 | 0.82 | 0.81 |
| BERT (SVM) | 0.86 | 0.88 | 0.86 | 0.86 |
| BERT (RF) | 0.80 | 0.83 | 0.80 | 0.77 |
| GPT (Zero-shot) | 0.79 | 0.82 | 0.79 | 0.79 |
| GPT (Instruction) | 0.82 | 0.84 | 0.82 | 0.82 |
| GPT (Few-shot) | 0.81 | 0.86 | 0.81 | 0.81 |
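The weighted averages in Table 4 weight each class's metric by its support (its true-label count), which matters for the imbalanced RBS categories. A minimal pure-Python sketch of the weighted F1 computation, equivalent to scikit-learn's `f1_score(..., average='weighted')`:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by each class's
    true-label count, matching scikit-learn's average='weighted'."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        pred_pos = sum(p == cls for p in y_pred)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1  # weight by class support
    return score
```

Because weighting follows support, a model can score well overall while underperforming on rare categories, which is why the class-wise comparison in Figure 8 complements these aggregates.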
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Erfani, A.; Khanjar, H. Large Language Models for Construction Risk Classification: A Comparative Study. Buildings 2025, 15, 3379. https://doi.org/10.3390/buildings15183379
