Article

A Multimodal Agent Framework for Construction Scenarios: Accurate Perception, Dynamic Retrieval, and Explainable Hazard Reasoning

School of Civil Engineering, Nanjing Tech University, Nanjing 211816, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(24), 4439; https://doi.org/10.3390/buildings15244439
Submission received: 1 November 2025 / Revised: 28 November 2025 / Accepted: 5 December 2025 / Published: 9 December 2025

Abstract

Construction sites are complex environments where traditional safety monitoring methods often suffer from low detection accuracy and limited interpretability. To address these challenges, this study proposes a modular multimodal agent framework that integrates computer vision, knowledge representation, and large language model (LLM)–based reasoning. First, the CLIP model fine-tuned with Low-Rank Adaptation (LoRA) is combined with YOLOv10 to achieve precise recognition of construction activities and personal protective equipment (PPE). Second, a construction safety knowledge graph integrating Retrieval-Augmented Generation (RAG) is constructed to provide structured domain knowledge and enhance contextual understanding. Third, the FusedChain prompting strategy is designed to guide large language models (LLMs) to perform step-by-step safety risk reasoning. Experimental results show that the proposed approach achieves 97.35% accuracy in activity recognition, an average F1-score of 0.84 in PPE detection, and significantly higher performance than existing methods in hazard reasoning. The modular design also facilitates scalable integration with more advanced foundation models, indicating strong potential for real-world deployment in intelligent construction safety management.

1. Introduction

Construction is a high-risk industry characterized by highly complex and dynamic site environments. According to 2023 data, fatal accidents in the U.S. construction industry accounted for 18.7% of all occupational fatalities nationwide, the injury rate in the U.K. construction sector reached 1.87 per 10,000 workers, and 962 deaths occurred in housing and municipal engineering accidents in China. These figures highlight the serious safety risks in the construction industry and the urgent need to deploy efficient and accurate safety monitoring systems on job sites. Among various measures, accurate identification of construction activity types and verification of the use of workers’ PPE are still key challenges for intelligent safety monitoring.
Existing safety monitoring technologies have evolved from manual supervision to sensor-based Internet of Things (IoT) monitoring and computer vision–based approaches, contributing to improved safety management in construction [1]. However, these methods still have critical limitations: they can only alert to explicit hazards and fail to generate interactive scene descriptions. They cannot dynamically adapt safety regulations to different construction activities (e.g., requiring safety belts for work at height or insulating gloves and protective face shields for electricians), and they lack multimodal reasoning capabilities [2]. These limitations highlight the need for an LLM-based multimodal agent capable of integrating visual perception with structured domain knowledge to perform more accurate and interpretable construction safety analysis.
In recent years, agent systems based on LLMs have offered new opportunities to address these challenges. Leveraging multimodal perception, tool utilization, and advanced prompting strategies (e.g., chain-of-thought (CoT) and tree-of-thought), such systems demonstrate autonomous planning and dynamic reasoning capabilities. When combined with RAG, they enable memory augmentation and knowledge evolution, and have been applied in fields such as intelligent building interaction [3], energy optimization [4], and structural analysis [5]. These applications, however, focus on functional interaction or numerical optimization in static building environments, whereas construction safety hazard identification must address dynamic, complex construction sites and requires fine-grained visual recognition, alignment with safety codes, and causal reasoning about unsafe behaviors.
However, directly applying LLM-based agent systems to construction safety hazard identification remains challenging. Current general-purpose LLMs lack sufficient understanding of domain-specific terminology (e.g., ‘cantilever safety nets’ and ‘edge opening protection’) and struggle to align such terms with engineering codes. The diversity, concealment, and dynamic nature of construction hazards require integrating multimodal information from both text and images, yet existing models often exhibit visual–language alignment bias and produce hallucinated knowledge. Moreover, complex hazard reasoning requires domain-specific expertise, which current general-purpose LLMs lack.
To address these issues, this study proposes a modular multimodal agent framework integrating construction scene understanding, knowledge retrieval, and hazard reasoning. The main contributions are as follows:
(1) We propose a construction scene recognition method that combines YOLOv10 with LoRA-fine-tuned CLIP, enabling accurate detection of construction activities and PPE usage.
(2) We construct a construction safety knowledge graph and integrate it with RAG to provide structured domain knowledge to the LLM, enabling dynamic knowledge acquisition and fusion and thereby improving the domain correctness and reliability of hazard reasoning.
(3) We design a FusedChain prompting strategy to guide the LLM in stepwise reasoning, enhancing both the accuracy and interpretability of hazard identification.
Accordingly, this study addresses the following core research question: How can a modular LLM-based multimodal agent effectively integrate visual perception, domain knowledge, and structured reasoning to achieve accurate and interpretable construction safety hazard identification?

2. Related Work

2.1. Construction Scene Understanding

Construction scene understanding involves tasks such as object detection, object tracking, and pose estimation, which can be categorized into three types: entity-based methods, relation-based methods, and image-description-based methods [6]. In this study, the scene understanding module (YOLOv10 + CLIP) extends prior approaches by integrating multimodal inputs and employing parameter-efficient fine-tuning to overcome their limitations.
Entity-based methods primarily rely on computer vision techniques to detect workers [7], PPE [8], and construction machinery [9]. With breakthroughs in deep learning, both detection accuracy and real-time performance have been significantly improved. However, such methods are constrained by predefined label systems and struggle to capture complex semantic relationships.
In recent years, relation-based perception technologies have made notable progress. Chen et al. integrated object detection and pose estimation to construct a triplet relationship graph linking workers and PPE [10]. Xiong et al. proposed a multi-class PPE detection and scene perception framework that leverages the correspondence between PPE and body parts to assess wearing correctness [11]. Zhang et al. employed transformer networks to capture inter-entity relationships [2], while Kim et al. proposed the Pixel2Graph framework to analyze interactions between workers and machinery [12]. Although these methods can represent complex relationships, their results are usually limited to single actions or work states.
Image captioning–based methods attempt to enhance semantic representation by integrating computer vision with natural language generation (NLG). Kim et al. combined Faster R-CNN and LSTM to generate spatiotemporal descriptions of construction resources [13]. Liu et al. integrated YOLOv3 with BERT to detect unsafe behaviors [14]. Several other studies have employed attention-based and transformer-based models to extract worker actions and environmental semantics [15,16,17]. Despite their improved generative capabilities, these models still produce relatively simple descriptions and fail to comprehensively capture the complex safety details of construction sites.

2.2. Construction Safety Hazard Reasoning

Existing methods for construction safety hazard identification can be broadly categorized into rule-based, scene graph-based, and LLM-based approaches. In this study, our reasoning module advances prior work by improving domain adaptability and reasoning interpretability.
Rule-based methods identify hazards by using manually defined rules and structured knowledge representations [18,19]. They offer good interpretability but suffer from limited robustness and scalability.
Scene graph-based methods perform reasoning by constructing graphs of construction entities and their relationships. Chen et al. proposed a vision-language framework to generate semantic triplets [10], and in 2022 further incorporated NLP and ontology techniques to improve scene graph construction, enhancing the machine interpretability of safety regulations [20]. However, these methods heavily depend on the accuracy of visual detection and are prone to misjudgment in complex scenes.
With the rapid development of NLP and LLMs, construction hazard identification has entered a new stage. Zhang et al. proposed a multimodal hazard recognition method based on C-BERT [2]. Gil et al. assessed PPE compliance by comparing image captions with predefined embeddings using similarity metrics [21]. Pu et al. developed the AutoRepo framework, which uses unmanned vehicles to collect data and employs LLMs to automatically generate construction inspection reports [22]. Xiao et al. combined ChatGPT (ChatGPT 3.5) and computer vision to produce daily construction site reports [23], while Wang et al. integrated multimodal LLMs with domain knowledge for construction safety detection and validated their method on the Wuhan Metro project [24]. These studies indicate that LLMs have strong potential for semantic reasoning and multimodal integration. However, due to the lack of domain-specific construction corpora in general-purpose models, the generated results may still suffer from insufficient professional accuracy and hallucination issues.

3. Methodology

3.1. Framework Overview

This study proposes a modular multimodal intelligent agent framework for construction site safety monitoring and hazard reasoning. The framework integrates three core components—perception, knowledge, and reasoning—establishing a closed-loop workflow of “scene perception, regulation matching and hazard reasoning.” It comprises three submodules, with the overall architecture illustrated in Figure 1.
The scene understanding module focuses on collecting basic scene data. By integrating YOLOv10 with the CLIP model fine-tuned by LoRA, it accurately captures the activity types of construction personnel and the wearing status of PPE, and outputs structured scene descriptions to ensure reliable perception in complex construction site environments. The knowledge retrieval module builds a construction safety knowledge graph to represent entities such as construction objects and activity types. By integrating RAG, it retrieves relevant information from the knowledge graph and injects it into the LLM, achieving dynamic mapping between visual scenes and safety regulations. In the reasoning and decision-making module, a reasoning model equipped with the scene perception results and the retrieved safety knowledge is guided by the FusedChain prompting strategy to perform step-by-step hazard identification and compliance evaluation. This process ultimately determines whether potential safety hazards exist and identifies their specific types.
This framework integrates three core modules and involves key technologies such as CLIP, YOLOv10, RAG, and LoRA. Detailed definitions of these technologies can be found in Appendix A (Glossary of Key Technical Terms). The implementation logic of each module will be elaborated in detail below.

3.2. Target Detection and Preprocessing

To accurately locate workers and extract regions of interest, the YOLOv10 model is adopted as the object detector [25]. YOLOv10 employs an end-to-end detection architecture consisting of a backbone network, a neck structure, and a detection head. Given an input image of a construction scene, the model extracts multi-level visual features and outputs a set of bounding boxes $B = \{b_i\}_{i=1}^{N}$, each parameterized by the normalized center coordinates $(x_c, y_c)$, width and height $(w, h)$, a confidence score $s$, and the class label $c$ of the detected object. In this study, the YOLOv10 model is used with an input resolution of 640 × 640, and the confidence threshold is set to 0.6.
To mitigate the negative impact of truncated detection boundaries on subsequent tasks, this study introduces an adaptive bounding box expansion strategy. Specifically, the normalized bounding box parameters are mapped back to the original image coordinate system, and the boundaries are dynamically expanded according to Equation (1) to preserve the contextual information around the detected objects. As illustrated in Figure 2, the original bounding box (red) truncates the worker’s local features, whereas the adaptively expanded box (green) preserves the worker and contextual scene, yielding more effective crops for semantic matching.
$$
\begin{aligned}
x_1 &= \max\!\left(0,\ \left(x_c - \tfrac{w}{2}\right) W - \delta_w\right), \qquad &
y_1 &= \max\!\left(0,\ \left(y_c - \tfrac{h}{2}\right) H - \delta_h\right),\\
x_2 &= \min\!\left(W,\ \left(x_c + \tfrac{w}{2}\right) W + \delta_w\right), \qquad &
y_2 &= \min\!\left(H,\ \left(y_c + \tfrac{h}{2}\right) H + \delta_h\right)
\end{aligned}
$$
where $(x_1, y_1, x_2, y_2)$ are pixel coordinates in the original image, $\delta_w = \alpha_w \times W$, $\delta_h = \alpha_h \times H$, and the expansion coefficient $\alpha$ is set to 0.1 after optimization through grid search.
For the expanded regions of interest (ROIs), a combination of bilinear interpolation and zero-padding is applied for size normalization. Each region $R_i$ is first resized with its aspect ratio preserved so that it fits within an $S \times S$ canvas, and the remaining border is then filled with zero-valued pixels to reach the target $S \times S$ resolution. Through this process, all ROIs are standardized to a uniform resolution before being fed into downstream tasks. This design helps reduce background noise and improves the robustness of semantic matching in subsequent modules.
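The following Python sketch illustrates the expansion and normalization steps described above (Equation (1) with α = 0.1); the function names, the OpenCV/NumPy tooling, and the canvas size S = 224 are illustrative assumptions rather than the authors' implementation.

```python
import cv2
import numpy as np

def expand_and_crop(image, box, alpha=0.1):
    """Expand a normalized YOLO box (xc, yc, w, h) by alpha*W / alpha*H, then crop."""
    H, W = image.shape[:2]
    xc, yc, w, h = box
    dw, dh = alpha * W, alpha * H
    x1 = max(0, int((xc - w / 2) * W - dw))
    y1 = max(0, int((yc - h / 2) * H - dh))
    x2 = min(W, int((xc + w / 2) * W + dw))
    y2 = min(H, int((yc + h / 2) * H + dh))
    return image[y1:y2, x1:x2]

def normalize_roi(roi, S=224):
    """Resize so the ROI fits inside an S x S canvas (aspect ratio kept), then zero-pad."""
    h, w = roi.shape[:2]
    scale = S / max(h, w)
    resized = cv2.resize(roi, (max(1, int(w * scale)), max(1, int(h * scale))),
                         interpolation=cv2.INTER_LINEAR)   # bilinear interpolation
    canvas = np.zeros((S, S, 3), dtype=resized.dtype)       # zero-padding to S x S
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas
```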

3.3. Semantic Alignment

After obtaining worker-region images from construction scenes, the CLIP model based on contrastive learning is used for scene understanding [26]. To address the high computational cost and overfitting risk of full fine-tuning, LoRA is applied to the self-attention layers of both the image and text encoders in CLIP [27]. Specifically, the pretrained weights $W$ are frozen, and two low-rank matrices $A$ and $B$ are introduced and trained to approximate the parameter updates efficiently [28]. This approach reduces the number of trainable parameters while preserving model performance, as illustrated in Figure 3.
During fine-tuning, the LoRA rank is set to $r = 2$, and a dropout rate of $p = 0.25$ is applied to the LoRA inputs to mitigate overfitting. The model is optimized using the Adam optimizer with a learning rate of $2 \times 10^{-4}$, scheduled with a cosine annealing strategy to ensure stable convergence. The total number of training iterations is defined as $500N/K$, where $N$ is the total number of samples and $K$ is the number of categories, ensuring sufficient learning under few-shot conditions. A batch size of 32 is adopted to balance memory usage and training efficiency, allowing the entire fine-tuning process to run on a single 24 GB GPU.
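A minimal sketch of such a LoRA set-up, assuming Hugging Face transformers and peft as the tooling (the paper does not name its libraries); the CLIP checkpoint, the LoRA scaling factor, the targeted module names, and the example values of N and K are placeholders, while the rank, dropout rate, learning rate, and cosine schedule follow the values stated above.

```python
import torch
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder checkpoint

lora_cfg = LoraConfig(
    r=2,                                            # LoRA rank, as reported above
    lora_alpha=4,                                   # scaling factor (not stated; assumed)
    lora_dropout=0.25,                              # dropout on LoRA inputs, as reported
    target_modules=["q_proj", "k_proj", "v_proj"],  # self-attention projections
    bias="none",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                  # only the low-rank matrices A and B train

N, K = 20_094, 7                                    # illustrative sample/category counts
total_steps = 500 * N // K                          # training iterations = 500 N / K
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```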
As shown in Figure 4, construction activity recognition and PPE detection are jointly performed based on the CLIP model. For activity recognition, textual descriptions of construction activities and scene images are separately encoded by the pretrained text and image encoders, and the cosine similarity between each text–image pair is computed. The activity description with the highest similarity score is selected as the final prediction. For PPE detection, positive and negative descriptions of each item of protective equipment are compared with the image features in the same way; if the similarity with the positive description exceeds that with the negative description, the worker is determined to be wearing the corresponding PPE. This unified similarity-based inference, applied to both tasks, outputs per-worker activity labels and PPE statuses, enabling integrated scene understanding.
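A condensed sketch of this similarity-based inference, assuming the Hugging Face CLIP interface; the prompt wording and the checkpoint are illustrative and do not reproduce the fine-tuned CS-CLIP weights or the exact text templates used in the study.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # placeholder
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

activity_prompts = ["a worker laying bricks", "a worker welding steel",
                    "a worker tying rebar"]                                # illustrative
ppe_prompts = {"helmet": ("a worker wearing a safety helmet",
                          "a worker not wearing a safety helmet")}         # positive/negative

@torch.no_grad()
def analyse(roi_image):
    texts = activity_prompts + [p for pair in ppe_prompts.values() for p in pair]
    inputs = processor(text=texts, images=roi_image, return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image[0]             # scaled cosine similarities
    activity = activity_prompts[int(sims[:len(activity_prompts)].argmax())]
    ppe_status, offset = {}, len(activity_prompts)
    for i, name in enumerate(ppe_prompts):
        pos, neg = sims[offset + 2 * i], sims[offset + 2 * i + 1]
        ppe_status[name] = bool(pos > neg)                  # worn if positive > negative
    return activity, ppe_status
```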

3.4. Knowledge Graph Construction and Hybrid Retrieval

Identifying construction safety hazards requires not only accurate interpretation of scene information but also combining this information with relevant regulations to provide interpretable, professional analysis. To achieve this goal, we constructed a construction safety knowledge graph and proposed a knowledge retrieval method based on it. This approach ensures that the retrieval process maintains semantic relevance while strictly adhering to regulatory rules and constraints, thereby effectively integrating scene understanding information with domain knowledge.
The core knowledge sources of the knowledge graph include construction safety standards and historical safety incident reports. The safety standards, which define behavioral requirements for workers and specify conditions for determining rule violations, are based on Chinese national construction safety regulations such as GB 50870–2013 [29] and GB 51210–2016 [30]. Historical incident reports document accidents caused by such violations and provide information about causes, hazard types, and the consequences of missing PPE or unsafe behaviors.
For the construction safety domain, we defined four types of core entities: Worker, Behavior, Safety_Rule, and Hazard. The entity set of the construction safety knowledge graph can be formalized as:
$$E = W \cup B \cup S \cup H$$
where $W = \{w \mid w \in \mathrm{Worker}\}$ denotes the worker entity set, representing on-site workers or generalized roles performing operations at construction sites; $B = \{b \mid b \in \mathrm{Behavior}\}$ refers to the set of construction activity types, corresponding to the specific work tasks conducted by workers; $S = \{s \mid s \in \mathrm{Safety\_Rule}\}$ represents the safety rule set; and $H = \{h \mid h \in \mathrm{Hazard}\}$ stands for the set of potential safety incident types.
The relations are designed to capture the causal chains linking work scenarios, safety rules, and potential risks. Centered on the entities, we define six types of relationships, and the relationship set can be expressed as:
$$R = \{\mathit{perform},\ \mathit{require},\ \mathit{associated\_with},\ \mathit{mitigate},\ \mathit{comply\_with},\ \mathit{exposed\_to}\}$$
The specific constraints can be formalized as follows:
$$
\begin{aligned}
\mathcal{C}_{\mathit{perform}} &\subseteq W \times B, \qquad &
\mathcal{C}_{\mathit{require}} &\subseteq B \times S, \qquad &
\mathcal{C}_{\mathit{associated\_with}} &\subseteq B \times H,\\
\mathcal{C}_{\mathit{mitigate}} &\subseteq S \times H, \qquad &
\mathcal{C}_{\mathit{comply\_with}} &\subseteq W \times S, \qquad &
\mathcal{C}_{\mathit{exposed\_to}} &\subseteq W \times H
\end{aligned}
$$
where $\mathit{perform}$ indicates that a worker carries out a specific behavior, representing the particular task a worker is engaged in; $\mathit{require}$ specifies that a behavior must adhere to a given safety rule, defining the safety norm corresponding to the behavior; $\mathit{associated\_with}$ links a behavior to a specific risk, identifying potential hazards that the behavior may lead to; $\mathit{mitigate}$ signifies that a safety rule can reduce or prevent a certain risk, explaining how rule compliance lowers risk exposure; $\mathit{comply\_with}$ reflects whether a worker has followed a specific safety rule, used to check the worker's adherence to regulations; $\mathit{exposed\_to}$ denotes that a worker is exposed to a certain risk, inferred from the worker's behavior and rule compliance status; and $\mathcal{C}$ denotes the corresponding fact sets.
The primary focus of the proposed knowledge graph is to capture the causal relationships between worker behaviors and associated hazards, which are represented using an unweighted graph. It is assumed that behavior entities serve as the endpoints of accident chains, meaning that no causal links exist between accidents themselves. If a single behavior is associated with multiple accidents, it is considered that this behavior independently triggers each of these accidents. Accordingly, the triples in the knowledge graph can be represented as:
$$\tau = \{(h, r, t) \in E \times R \times E \mid (h, t) \in \mathcal{C}_r\}$$
Here, $\tau$ denotes the knowledge graph, where each element is a triple $(h, r, t)$; $E = \{e_1, e_2, e_3, \ldots, e_m\}$ represents the set of entities, containing head and tail entities, with a total size of $m$; $R = \{r_1, r_2, r_3, \ldots, r_n\}$ denotes the set of relations between entities, with a total size of $n$; and $\mathcal{C}$ represents the fact set, where each triple corresponds to a fact.
We constructed a construction safety knowledge graph comprising 5 types of agents, 43 types of behaviors, 86 safety regulations, 97 risk categories, and 227 types of construction activities, covering various unsafe behaviors, accident causes, and safety regulations. The completed knowledge graph was stored and managed using Neo4j, which enables efficient processing of complex entity–relation queries [31].
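For illustration, the sketch below shows how such triples could be written to Neo4j with the official Python driver; the connection URI, credentials, example entities, relation labels, and property names are placeholders rather than the authors' configuration.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triples(tx, behavior, rule, hazard):
    # MERGE keeps the graph free of duplicate entities and relations
    tx.run(
        """
        MERGE (b:Behavior {name: $behavior})
        MERGE (s:Safety_Rule {text: $rule})
        MERGE (h:Hazard {name: $hazard})
        MERGE (b)-[:REQUIRE]->(s)
        MERGE (b)-[:ASSOCIATED_WITH]->(h)
        MERGE (s)-[:MITIGATE]->(h)
        """,
        behavior=behavior, rule=rule, hazard=hazard,
    )

with driver.session() as session:
    session.execute_write(add_triples,
                          "work at height on scaffolding",          # example Behavior
                          "workers at height must wear a harness",  # example Safety_Rule
                          "fall from height")                       # example Hazard
```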
During the retrieval process, we integrated knowledge graph reasoning with text embedding techniques to form a hybrid retrieval strategy, as illustrated in Figure 5. First, a general embedding model (BAAI General Embedding, BGE) is used to embed information about workers’ construction activities and activity entities from the knowledge graph into the same vector space [32]. The semantic similarity between the construction activity information and all activity entities in the knowledge graph is then calculated. Based on high similarity scores, the relevant activity type is selected, and after matching with the corresponding entity, all related regulatory information associated with that entity is retrieved. Given the context length limitations of LLMs, not all relevant regulations can be used as prompt input. Subsequently, the retrieved regulatory entities are embedded into the same vector space, similarities are computed, and the regulations most relevant to the worker’s behavior are selected by ranking. This retrieval strategy ensures that the results are not only semantically relevant but also comply with regulatory requirements, while maintaining efficiency and traceability.
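A compact sketch of this hybrid retrieval step, assuming the BGE embeddings are served through sentence-transformers; the model name, the top-k budget, and the rules_of helper (standing in for the Neo4j query that returns regulations linked to a Behavior entity) are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

bge = SentenceTransformer("BAAI/bge-base-zh-v1.5")   # assumed BGE checkpoint

def retrieve_rules(activity_text, behavior_entities, rules_of, top_k=3):
    # 1) match the recognized activity to the most similar Behavior entity
    act_emb = bge.encode(activity_text, convert_to_tensor=True, normalize_embeddings=True)
    ent_emb = bge.encode(behavior_entities, convert_to_tensor=True, normalize_embeddings=True)
    best = behavior_entities[int(util.cos_sim(act_emb, ent_emb).argmax())]

    # 2) pull regulations linked to that entity from the knowledge graph, then keep
    #    only the top_k most relevant ones so the prompt stays within context limits
    candidate_rules = rules_of(best)
    rule_emb = bge.encode(candidate_rules, convert_to_tensor=True, normalize_embeddings=True)
    ranked = util.cos_sim(act_emb, rule_emb)[0].argsort(descending=True)
    return best, [candidate_rules[int(i)] for i in ranked[:top_k]]
```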

3.5. Reasoning and Decision-Making

After retrieving relevant knowledge, the next step is to use this information for safety hazard reasoning. To achieve this, we propose a FusedChain prompting method, which integrates In-Context Learning (ICL) [33], few-shot demonstrations, and the CoT [34] approach. This method combines CoT reasoning and few-shot demonstrations directly within the ICL framework to form a multi-layered prompt structure. Figure 6 illustrates the FusedChain prompt template: it defines a safety expert role and rules, demonstrates Chain-of-Thought reasoning through a sample Q&A, and then applies the template to real scene inputs for formal safety analysis.
First, the system prompt establishes a behavioral framework and operational guidelines for the LLM. Then, an in-context learning example is constructed through the User and Assistant turns to help the LLM understand the current task and to ensure that its output is not only factually accurate but also contextually appropriate. Finally, the CoT technique guides the LLM to reason step by step, generate conclusions, and explicitly cite the relevant safety regulations. By decomposing the reasoning process into progressive steps, the FusedChain strategy enhances the analytical capability and interpretability of the model, ensuring that the generated conclusions are reasonable and evidence-based.
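This structure can be illustrated with a hedged sketch that assembles the three layers into a chat-style message list; the wording is paraphrased from Figure 6 and is not the authors' exact template.

```python
def build_fusedchain_messages(scene, rules):
    """Assemble system role, one worked CoT demonstration, and the real query."""
    example_q = ("Scene: a worker is welding; helmet worn, face shield not worn. "
                 "Relevant rule: welders must wear protective face shields.")
    example_a = ("Step 1: the activity is welding. "
                 "Step 2: the retrieved rule requires a face shield for welding. "
                 "Step 3: the face shield is not worn, so the rule is violated. "
                 "Conclusion: UNSAFE - missing face shield (per the cited rule).")
    return [
        {"role": "system", "content": ("You are a senior construction safety expert. "
                                       "Analyse the worker's safety status step by step "
                                       "and cite the relevant regulations.")},
        {"role": "user", "content": example_q},        # few-shot demonstration
        {"role": "assistant", "content": example_a},   # chain-of-thought style answer
        {"role": "user", "content": f"Scene: {scene}\nRelevant rules: {rules}"},
    ]
```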
Considering the deployment cost and resource constraints in practical engineering applications, we prioritized open-source models with relatively small parameter sizes. Specifically, we adopted the lightweight DeepSeek-R1-Distill-Qwen-7B (DeepSeek-R1-7B) as the core reasoning model. The detailed parameter settings of the model are presented in Table 1.

4. Experimental Design

4.1. Dataset Construction

This study selected seven common types of construction activity scenarios for image collection and processing, including scaffolding work, masonry work, and steel reinforcement work, among others, comprising a total of 6815 images (Table 2).
A construction site is a complex system involving personnel, locations, tasks, and processes. Image descriptions must therefore accurately identify the key elements in the scene while using professional terminology. This study divides construction scene elements into three levels: entities, interactions, and attributes. The entity level includes the subject (workers), the object (materials, equipment, etc.), and PPE. The interaction level focuses on the dynamic relationships between the subject and the object or PPE, using verbs to describe specific actions and the usage status of PPE. The attribute level concentrates on the detailed features of entities to refine scene descriptions [35]. Based on the syntactic pattern “attribute + subject + interaction + object,” preliminary structured descriptions are generated [36]. However, such simple linear structures often fail to fully capture the complex object interactions and multi-layered safety information in construction scenes [37]. Therefore, we adopted a hierarchical description method to generate corresponding PPE, behavior, and scene descriptions for each image. These descriptions were then manually verified and revised by annotators, ultimately yielding a dataset of 20,445 high-quality image-text pairs.
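As a toy illustration of the preliminary four-slot pattern (the vocabulary below is invented for the example and is not drawn from the dataset):

```python
def describe(attribute, subject, interaction, obj):
    """Render one preliminary description from the 'attribute + subject + interaction + object' pattern."""
    return f"{attribute} {subject} is {interaction} {obj}"

print(describe("a helmet-wearing", "mason", "laying", "bricks on a scaffold"))
# -> "a helmet-wearing mason is laying bricks on a scaffold"
```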
To ensure fair and representative evaluation, the dataset was divided into separate training and testing subsets. Before LoRA fine-tuning, a number of image-text pairs were selected from each of the seven construction activity categories, resulting in a test set comprising 351 image-text pairs. This prevents an imbalance in any single construction activity category within the test set from affecting the final evaluation results. Additionally, all selected test images contain only a single worker and were verified with the YOLO model to confirm that the workers could be accurately detected and extracted. This preparatory step facilitates the subsequent assessment of how the YOLO model enhances CLIP's understanding of construction scenes. The remaining data were used entirely for model fine-tuning, which also helps to further mitigate overfitting.

4.2. Evaluation Metrics and Experimental Design

To comprehensively assess the performance of the proposed framework, we evaluated the identification of construction activities, PPE detection, and hazard inference. The evaluation adopted standard indicators, including Precision, Recall, and F1-score. The tests were conducted on a dataset containing 351 samples of construction activities and 565 PPE samples, and comparative experiments were carried out with the most advanced methods in this field to objectively evaluate the advantages of our method.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Specifically, Precision is defined as the ratio of correctly identified entities (TP) to all identified entities (TP + FP). Recall is defined as the ratio of correctly identified entities (TP) to all actual entities present in the relevant scenes (TP + FN). The F1-score is the harmonic mean of Precision and Recall, providing a balanced measure of both metrics.
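As a quick sanity check of these definitions, the metrics can be reproduced with scikit-learn on toy labels (the values below are illustrative only, not experimental data):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 0, 1, 0, 1]   # 1 = entity present / hazard, 0 = absent / safe (toy labels)
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
```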
In the ablation study, we analyzed the impact of key components on the overall performance of the framework by designing three control groups. To validate the performance improvement of YOLOv10 cropping preprocessing on downstream tasks, we compared scenarios where the model was used with and without the object detection model. To evaluate the effect of fine-tuning the CLIP model, we compared the performance of CLIP models with and without LoRA fine-tuning on construction activity type recognition and PPE detection tasks using the same prompts. To investigate the influence of prompt engineering on our methodological framework, we tested the performance of hazard identification by removing the content related to the FusedChain technology from the prompts, while ensuring the model could still output the correct format.

5. Experimental Results and Analysis

5.1. Construction Activity Recognition

As shown in Figure 7, our method correctly identified 331 out of 351 construction activity samples, achieving an overall precision of 97.35% and an F1-score of 0.9671. In concrete work, welding work, masonry work, scaffolding work, and excavator-driving work, there were no false positives (FP = 0), demonstrating extremely high predictive reliability. In contrast, auxiliary excavation work and rebar work showed slightly lower precision, but maintained high recall rates, ensuring stable detection performance.
To further verify the effectiveness of the proposed method, the identification results of the construction activities in this study were compared with the approach by Wang et al. [36], as shown in Table 3. Wang’s study only involved the recognition of masonry work, concrete work, and scaffolding work. When comparing comprehensive performance on the same activities, our method outperformed Wang’s approach by 21.1% in precision, 5.0% in recall, and 13.3% in F1 score, demonstrating higher accuracy and reliability. This highlights the significant advantage of our proposed framework in construction activity recognition tasks.

5.2. PPE Detection

As shown in Figure 8, based on the method proposed in this study, 472 out of 565 PPE entities were successfully detected, achieving an average precision of 82.72%, an average recall of 84.78%, and an average F1 score of 0.8358, demonstrating strong PPE detection capability. The model performed excellently on key protective equipment such as face shields, safety helmets, and safety harnesses. However, some false positives and missed detections occurred in categories like gloves, safety boots, and reflective vests. False alarms for safety boots and gloves may be attributed to visual confusion with ordinary footwear or hand movements. The low recall rate for reflective vests might be related to lighting conditions, angle variations, or occlusions. Overall, the model proves effective for PPE detection tasks and meets basic detection requirements.
To further validate the effectiveness of the proposed method, the PPE detection results were compared with the study by Zhang et al. [2], as shown in Table 4. The comparison included four PPE categories: safety helmets, face shields, gloves, and reflective vests. Our method demonstrated high average precision and recall. Particularly in face shield detection, the model exhibited significant advantages with substantially improved precision and recall, further verifying the high efficiency of our proposed method in detecting critical personal protective equipment items.

5.3. Hazard Reasoning Evaluation

Based on the Deepseek-r1-7B model, the proposed safety hazard reasoning method shows excellent performance in safety status detection. The model achieved 100% in precision, recall, F1 score, and overall accuracy, indicating extremely high reliability in assessing the safety status of workers and accurately distinguishing between safe and unsafe conditions. It is worth noting that these results were obtained under the assumption of correct upstream scene understanding and may not reflect full end-to-end system performance in real-world deployments. To further validate its effectiveness, we compared this method with several existing hazard identification methods [2,24,38], as shown in Table 5. Traditional methods mainly combine computer vision technology with language models, usually relying on large-scale annotated datasets and requiring separate training of visual and language models. In contrast, our method decouples image recognition from hazard reasoning and requires no additional training of language models. The evaluation focuses on exploring the reasoning capability of LLM integrated with construction safety knowledge graphs, under the assumption of correctly identified scene information. Compared with existing studies, the proposed method comprehensively outperforms existing methods on comparable indicators, achieving higher precision, recall, and accuracy, and also enabling more detailed descriptions of construction sites. This allows for the inspection of a broader range of visual content in the image scene. Even the smaller Deepseek-r1-1.5B model outperforms the methods reported in Study 2 and Study 3 in all metrics. Although its accuracy is slightly lower than that of Study 1, it achieves higher recall, which is a critical factor for safety hazard identification.

5.4. Ablation Study

5.4.1. Evaluation of the Object Detection Module

After removing the YOLOv10 module, the model showed a significant decline in construction scene comprehension. For PPE detection across all categories, the overall precision dropped from 82.8% to 80.2%, and the F1-score decreased from 84% to 82.3%. This indicates that, without the region extraction capability of YOLOv10, the model generated more false positives, thereby reducing overall reliability and confirming the role of YOLOv10 in minimizing the interference of irrelevant information. The overall recall rate remained relatively stable, suggesting that the YOLOv10 module primarily affects the model's precise localization ability, with limited impact on the detection capability for global targets.
In PPE detection tasks, the integration of YOLOv10 significantly improved the confidence in detecting genuinely worn equipment and reduced the likelihood of misclassifying worn items as unworn. Figure 9 presents a visual comparison of PPE detection results with and without the YOLO model: in Figure 9b (without YOLOv10 integration), the model failed to detect gloves, face shields, and safety vests. The enhancement was particularly notable for small or inconspicuous PPE items such as face shields and gloves. As the accuracy of the object detection model in identifying workers improves, the overall system performance is expected to further advance, especially in complex multi-person scenarios.

5.4.2. Evaluation of LoRA Fine-Tuning Effectiveness

To validate the effectiveness of parameter-efficient fine-tuning with LoRA in construction scene understanding, we compared the fine-tuned model with the baseline CLIP model. For clarity, the fine-tuned version is denoted as CS-CLIP (Construction Safety-CLIP). As shown in Figure 10a, CS-CLIP demonstrated consistently stable and superior performance across all seven categories of construction activities. In contrast, the baseline CLIP model exhibited considerable variability across categories, with particularly poor performance in auxiliary excavation and rebar works. The wider confidence intervals of the baseline model further indicate unstable behavior across samples within the same activity class. Figure 10b presents the results of PPE detection. The baseline CLIP model maintained relatively high recall, especially for safety helmets, consistent with Gil et al.’s [21] findings on zero-shot safety helmet detection using CLIP. However, the fine-tuned CS-CLIP achieved significant improvements in both precision and F1. The recall showed no significant change, and even a slight decrease in some PPE categories. This indicates that LoRA fine-tuning enables the model to focus more on typical features in construction scenes, but reduces its detection capability for some edge cases or atypical samples.

5.4.3. Impact of Prompt Design

Considering that different types of models rely on prompts to varying degrees, the prompt ablation study evaluated both reasoning models and general instruction-following models. Specifically, DeepSeek-R1-7B, DeepSeek-R1-1.5B, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct were selected for testing. To ensure comparability, the average values of each metric were used as the evaluation criterion.
As shown in Table 6, after removing the FusedChain prompt structure, the performance of all models in hazard identification declined significantly. This effect was most pronounced in smaller-scale models (e.g., DeepSeek-R1-1.5B and Qwen2.5-7B), where both recall and F1 score dropped markedly, highlighting the critical role of prompts in guiding the reasoning process of lightweight models. Larger models (DeepSeek-R1-7B and Qwen2.5-14B) maintained relatively high performance but still showed noticeable degradation without FusedChain. This demonstrates that structured prompting strategies not only enhance the stability of small models but also improve the consistency and accuracy of large models in hazard reasoning tasks. From a deployment perspective, these results indicate that without a structured prompting mechanism, lightweight models may produce unstable or inconsistent reasoning outputs, which could pose risks in real-world safety monitoring systems. This performance gap underscores the importance of robust prompt design in ensuring the reliability of the hazard identification pipeline.

6. Discussion

Our proposed modular multimodal agent framework demonstrates significant advantages in construction safety monitoring. Unlike traditional methods that rely solely on computer vision detection, this study integrates perception, knowledge, and reasoning stages to achieve efficient multimodal collaboration. In the reasoning and decision-making module, the DeepSeek-R1-7B model demonstrates extremely high accuracy in identifying potential safety hazards, although this performance may be influenced by the model's inherent capabilities. To further investigate model-related variations, we examined two aspects: model architecture and parameter scale. For the architectural comparison, we selected Qwen2.5-7B-Instruct as a baseline; regarding parameter scale, we tested DeepSeek-R1-1.5B, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct. Evaluation was conducted using averaged metrics across all indicators to ensure comparability of results. As shown in Table 7, the accuracy of hazard identification varies with model architecture: under the same parameter scale, models from the DeepSeek series (DeepSeek-R1) outperform those from the Qwen series, primarily because the self-reasoning architecture of the R1 models is better suited to this task than that of standard instruction-tuned language models [39]. Within the same model series, models with larger parameter counts demonstrate better hazard identification performance under this methodological framework, indicating that as base models become more powerful, the reliability of the proposed hazard identification method will correspondingly improve. In practical applications, these differences suggest that models with more advanced reasoning-oriented architectures are better suited for tasks requiring safety hazard interpretation. Moreover, models with larger parameter scales reduce the likelihood of misclassification in complex or ambiguous scenarios, which is critical for reliable decision-making in high-risk construction environments.
By integrating YOLOv10 with LoRA-enhanced CLIP, the proposed framework demonstrates strong performance in both construction activity recognition and PPE detection. However, a closer examination of the results reveals several category-specific challenges. In construction activity recognition, auxiliary excavation and rebar-tying tasks exhibit relatively lower precision (around 90%), mainly due to factors such as overlapping worker postures, visually similar tool operations, and cluttered backgrounds, all of which make fine-grained distinctions more difficult. This suggests that future work may benefit from expanding category-specific training samples, incorporating more discriminative feature representations, or leveraging temporal information to improve model stability and generalization. For PPE detection, small items such as gloves and protective eyewear remain challenging due to their limited pixel footprint and frequent occlusion. Although YOLO-based region extraction effectively reduces background interference and improves text–image alignment, the inherent limitations of small-object detection continue to constrain overall performance. Future research could explore multi-scale detection architectures, higher-resolution inputs, multi-camera viewpoints, or temporal feature integration to further enhance robustness in complex construction environments.
This study still has certain limitations. Although the constructed dataset covers common construction scenarios, its coverage of rare and specialized construction environments and activities remains insufficient, which may affect the model's generalization ability in these scenarios. Future work could incorporate samples from different work types, extreme environments, and severely occluded construction scenes to enhance generalization. The experiments were mainly conducted on static images, whereas construction sites are highly dynamic environments; future research could incorporate temporal information through video sequence analysis to improve the accuracy of dynamic construction activity recognition. Furthermore, the proposed framework involves serial processing of multiple modules, which may prevent the overall inference speed from meeting real-time monitoring requirements, especially on resource-constrained edge devices. To meet these requirements, techniques such as model compression and knowledge distillation could be explored to reduce model complexity and improve reasoning speed. In addition, parallel module processing and edge-cloud collaborative computing architectures could enhance the system's responsiveness while maintaining accuracy.

7. Conclusions

This study presents a modular multimodal agent framework for construction safety identification, integrating scene understanding, knowledge retrieval, and hazard reasoning. The framework combines YOLOv10 with a LoRA-enhanced CLIP model for accurate perception, employs a construction safety knowledge graph with hybrid retrieval to align scene information with safety regulations, and utilizes a FusedChain prompting strategy to support interpretable hazard reasoning.
Across multiple tasks, the proposed framework demonstrates strong performance: construction activity recognition achieves an accuracy of 97.35%, PPE detection attains an average F1-score of 0.84, and hazard reasoning reaches 100% precision and recall under the assumption of correct upstream scene understanding. Ablation studies verify the critical contributions of precise detection, parameter-efficient fine-tuning, and structured prompting. Compared with existing methods, the framework provides not only higher accuracy but also improved interpretability and scalability.
Despite its promising results, the proposed framework still has several limitations, including limited dataset diversity, reliance on static images, and potential latency introduced by the multi-module processing pipeline. Future work will explore video-based temporal modeling, multi-view fusion, and UAV-assisted inspections, as well as conduct broader validation in large-scale and complex construction environments. With the continued advancement of large language models and computer vision technologies, the proposed approach offers a feasible and extensible path toward building intelligent and interpretable construction safety management systems.

Author Contributions

Conceptualization, Y.Q. and R.W.; methodology, R.W.; software, R.W.; validation, S.C. and R.W.; formal analysis, R.W.; investigation, S.C.; data curation, S.C., R.W. and Y.G.; writing—original draft preparation, S.C.; writing—review and editing, Y.G.; visualization, S.C.; supervision, Y.Q.; project administration, Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Glossary of Key Technical Terms

  • CLIP (Contrastive Language–Image Pretraining)
A vision–language model that learns to align images and text within a shared embedding space using contrastive learning. It enables zero-shot recognition by measuring similarity between image features and textual descriptions.
  • YOLOv10 (You Only Look Once Version 10)
A fast and accurate object detection model that identifies objects in images with a single pass, suitable for real-time construction site monitoring.
  • RAG (Retrieval-Augmented Generation)
A technique that enhances large language models by retrieving relevant documents from an external knowledge source before generating responses. In this study, RAG is used to align scene information with safety regulations and historical accident data through knowledge graph retrieval.
  • LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that injects low-rank matrices into selected layers of a pretrained model, allowing task-specific adaptation with minimal trainable parameters. LoRA improves CLIP’s ability to understand construction-specific visual features without requiring full model retraining.

References

  1. Paneru, S.; Jeelani, I. Computer Vision Applications in Construction: Current State, Opportunities & Challenges. Autom. Constr. 2021, 132, 103940. [Google Scholar] [CrossRef]
  2. Zhang, L.; Wang, J.; Wang, Y.; Sun, H.; Zhao, X. Automatic Construction Site Hazard Identification Integrating Construction Scene Graphs with BERT Based Domain Knowledge. Autom. Constr. 2022, 142, 104535. [Google Scholar] [CrossRef]
  3. Yan, X.; Yang, X.; Jin, N.; Chen, Y.; Li, J. A General AI Agent Framework for Smart Buildings Based on Large Language Models and ReAct Strategy. Smart Constr. 2025, 2, 4. [Google Scholar] [CrossRef]
  4. Zhang, L.; Fu, X.; Li, Y.; Chen, J. Large Language Model-Based Agent Schema and Library for Automated Building Energy Analysis and Modeling. Autom. Constr. 2025, 176, 106244. [Google Scholar] [CrossRef]
  5. Ni, B.; Buehler, M.J. MechAgents: Large Language Model Multi-Agent Collaborations Can Solve Mechanics Problems, Generate New Data, and Integrate Knowledge. Extrem. Mech. Lett. 2024, 67, 102131. [Google Scholar] [CrossRef]
  6. Li, H.; Deng, H.; Deng, Y. Towards Worker-Centric Construction Scene Understanding: Status Quo and Future Directions. Autom. Constr. 2025, 171, 106005. [Google Scholar] [CrossRef]
  7. Tran, S.V.-T.; Lee, D.; Bao, Q.L.; Yoo, T.; Khan, M.; Jo, J.; Park, C. A Human Detection Approach for Intrusion in Hazardous Areas Using 4D-BIM-Based Spatial-Temporal Analysis and Computer Vision. Buildings 2023, 13, 2313. [Google Scholar] [CrossRef]
  8. Park, M.; Tran, D.Q.; Bak, J.; Kulinan, A.S.; Park, S. Real-Time Monitoring Unsafe Behaviors of Portable Multi-Position Ladder Worker Using Deep Learning Based on Vision Data. J. Saf. Res. 2023, 87, 465–480. [Google Scholar] [CrossRef]
  9. Kim, D.; Xiong, S. Enhancing Worker Safety: Real-Time Automated Detection of Personal Protective Equipment to Prevent Falls from Heights at Construction Sites Using Improved YOLOv8 and Edge Devices. J. Constr. Eng. Manag. 2025, 151, 04024187. [Google Scholar] [CrossRef]
  10. Chen, S.; Demachi, K. Towards On-Site Hazards Identification of Improper Use of Personal Protective Equipment Using Deep Learning-Based Geometric Relationships and Hierarchical Scene Graph. Autom. Constr. 2021, 125, 103619. [Google Scholar] [CrossRef]
  11. Xiong, R.; Tang, P. Pose Guided Anchoring for Detecting Proper Use of Personal Protective Equipment. Autom. Constr. 2021, 130, 103828. [Google Scholar] [CrossRef]
  12. Kim, D.; Goyal, A.; Lee, S.; Kamat, V.R.; Liu, M. Single-Shot Visual Relationship Detection for the Accurate Identification of Contact-Driven Hazards in Sustainable Digitized Construction. Sustainability 2024, 16, 5058. [Google Scholar] [CrossRef]
  13. Bang, S.; Kim, H. Context-Based Information Generation for Managing UAV-Acquired Data Using Image Captioning. Autom. Constr. 2020, 112, 103116. [Google Scholar] [CrossRef]
  14. Liu, J.; Fang, W.; Love, P.E.D.; Hartmann, T.; Luo, H.; Wang, L. Detection and Location of Unsafe Behaviour in Digital Images: A Visual Grounding Approach. Adv. Eng. Inform. 2022, 53, 101688. [Google Scholar] [CrossRef]
  15. Zhai, P.; Wang, J.; Zhang, L. Extracting Worker Unsafe Behaviors from Construction Images Using Image Captioning with Deep Learning–Based Attention Mechanism. J. Constr. Eng. Manag. 2023, 149, 04022164. [Google Scholar] [CrossRef]
  16. Zhong, B.; Shen, L.; Pan, X.; Lei, L. Visual Attention Framework for Identifying Semantic Information from Construction Monitoring Video. Saf. Sci. 2023, 163, 106122. [Google Scholar] [CrossRef]
  17. Song, K.; Chen, L.; Wang, H. Style-Enhanced Transformer for Image Captioning in Construction Scenes. Entropy 2024, 26, 224. [Google Scholar] [CrossRef]
  18. Chen, S. Hybrid Visual Information Analysis for On-Site Occupational Hazards Identification: A Case Study on Stairway Safety. Saf. Sci. 2023, 159, 106043. [Google Scholar] [CrossRef]
  19. Johansen, K.W.; Teizer, J.; Schultz, C. Automated Rule-Based Safety Inspection and Compliance Checking of Temporary Guardrail Systems in Construction. Autom. Constr. 2024, 168, 105849. [Google Scholar] [CrossRef]
  20. Chen, S.; Demachi, K.; Dong, F. Graph-Based Linguistic and Visual Information Integration for on-Site Occupational Hazards Identification. Autom. Constr. 2022, 137, 104191. [Google Scholar] [CrossRef]
  21. Gil, D.; Lee, G. Zero-Shot Monitoring of Construction Workers’ Personal Protective Equipment Based on Image Captioning. Autom. Constr. 2024, 164, 105470. [Google Scholar] [CrossRef]
  22. Pu, H.; Yang, X.; Li, J.; Guo, R. AutoRepo: A General Framework for Multimodal LLM-Based Automated Construction Reporting. Expert Syst. Appl. 2024, 255, 124601. [Google Scholar] [CrossRef]
  23. Xiao, B.; Wang, Y.; Zhang, Y.; Chen, C.; Darko, A. Automated Daily Report Generation from Construction Videos Using ChatGPT and Computer Vision. Autom. Constr. 2024, 168, 105874. [Google Scholar] [CrossRef]
  24. Wang, Y.; Luo, H.; Fang, W. An Integrated Approach for Automatic Safety Inspection in Construction: Domain Knowledge with Multimodal Large Language Model. Adv. Eng. Inform. 2025, 65, 103246. [Google Scholar] [CrossRef]
  25. Hussain, M.; Khanam, R. In-Depth Review of YOLOv1 to YOLOv10 Variants for Enhanced Photovoltaic Defect Detection. Solar 2024, 4, 351–386. [Google Scholar] [CrossRef]
  26. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR. Volume 139, pp. 8748–8763. [Google Scholar]
  27. Zanella, M.; Ayed, I.B. Low-Rank Few-Shot Adaptation of Vision-Language Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; IEEE: Seattle, WA, USA; pp. 1593–1603. [Google Scholar]
  28. Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-Rank Adaptation of Large Language Models. In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  29. GB 50870–2013; Unified Code for Construction Safety Technology. China Planning Press: Beijing, China, 2013.
  30. GB 51210–2016; Unified Standard for Safety Technology of Construction Scaffolds. China Architecture & Building Press: Beijing, China, 2016.
  31. Zhu, J.; Wong, M.O.; Nisbet, N.; Xu, J.; Kelly, T.; Zlatanova, S.; Brilakis, I. Semantics-Based Connectivity Graph for Indoor Pathfinding Powered by IFC-Graph. Autom. Constr. 2025, 171, 106019. [Google Scholar] [CrossRef]
  32. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.-Y. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 641–649. [Google Scholar]
  33. Mao, H.; Liu, G.; Ma, Y.; Wang, R.; Johnson, K.; Tang, J. A Survey to Recent Progress Towards Understanding In-Context Learning. In Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025. [Google Scholar]
  34. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  35. Xiao, B.; Wang, Y.; Kang, S.-C. Deep Learning Image Captioning in Construction Management: A Feasibility Study. J. Constr. Eng. Manag. 2022, 148, 04022049. [Google Scholar] [CrossRef]
  36. Zhang, M.; Zhu, M.; Zhao, X. Recognition of High-Risk Scenarios in Building Construction Based on Image Semantics. J. Comput. Civ. Eng. 2020, 34, 04020019. [Google Scholar] [CrossRef]
  37. Yoon, S.; Kang, W.Y.; Jeon, S.; Lee, S.; Han, C.; Park, J.; Kim, E.-S. Image-to-Image Retrieval by Learning Similarity between Scene Graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 10718–10726. [Google Scholar] [CrossRef]
  38. Wang, Y.; Xiao, B.; Bouferguene, A.; Al-Hussein, M. Proactive Safety Hazard Identification Using Visual–Text Semantic Similarity for Construction Safety Management. Autom. Constr. 2024, 166, 105602. [Google Scholar] [CrossRef]
  39. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
Figure 1. Overall Framework of the Proposed Multimodal Intelligent Agent for Construction Safety Monitoring. The framework comprises three interconnected modules: (1) Scene Understanding: YOLOv10 detects workers and PPE, while LoRA-fine-tuned CLIP performs semantic matching of construction activities and PPE; (2) Knowledge Retrieval: A knowledge graph integrated with RAG retrieves safety regulations and historical accident information associated with detected activities; (3) Reasoning and Decision-Making: An LLM incorporates domain knowledge and a FusedChain prompt to conduct step-by-step hazard reasoning and generate a final risk assessment report. Dashed lines denote inter-module information flow during multimodal perception and reasoning.
Figure 2. Adaptive bounding box expansion for worker detection: (1) Original YOLOv10 detection (red) truncates contextual features; (2) Comparison with adaptively expanded box (green); (3) Cropped output preserves complete worker and scene information.
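The adaptive expansion shown in Figure 2 can be sketched in a few lines. This is a minimal illustration assuming a fixed expansion ratio (0.2 here) and clamping to the image bounds; the paper's actual expansion rule and parameters may differ.

```python
def expand_box(box, img_w, img_h, ratio=0.2):
    """Expand a detection box by a margin and clamp it to the image.

    `box` is (x1, y1, x2, y2) in pixels; `ratio` is an assumed expansion
    factor for illustration, not the value used in the paper.
    """
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, int(x1 - dw)), max(0, int(y1 - dh)),
            min(img_w, int(x2 + dw)), min(img_h, int(y2 + dh)))

# Example: a 100 x 200 px worker box inside a 1920 x 1080 frame
print(expand_box((500, 300, 600, 500), 1920, 1080))  # (480, 260, 620, 540)
```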
Figure 3. Application of LoRA in the CLIP Model. This figure illustrates CLIP’s encoder with LoRA applied to the self-attention layer and the LoRA update design, where frozen pretrained weights are combined with trainable low-rank matrices, reducing trainable parameters while maintaining performance.
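The LoRA update illustrated in Figure 3 amounts to adding a trainable low-rank product to a frozen linear projection. The PyTorch-style sketch below is illustrative only; the rank, scaling factor, and initialization are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x.

    Rank r=8 and alpha=16 are illustrative choices, not the paper's settings.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, init to zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Example: wrap a 512-wide attention projection, as in a CLIP encoder block
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # shape (4, 512)
```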
Figure 4. CLIP-based workflow for joint activity recognition and PPE detection: (1) Input textual descriptions and worker images; (2) Encode text and images separately; (3) Predict activities and PPE status via text-image similarity, integrating per-worker results for scene understanding.
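The similarity-matching step in Figure 4 follows the standard CLIP zero-shot pattern. The sketch below uses the public Hugging Face CLIP checkpoint and illustrative prompt texts as stand-ins for the paper's LoRA-fine-tuned model and prompt templates.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint used as a stand-in for the fine-tuned model described in the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative activity descriptions (not the paper's exact text templates).
activity_prompts = [
    "a worker doing scaffolding work",
    "a worker doing welding work",
    "a worker doing rebar work",
]

image = Image.open("worker_crop.jpg")  # a cropped worker region from the detector
inputs = processor(text=activity_prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # text-image similarity per prompt
print(activity_prompts[probs.argmax().item()])
```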
Figure 5. Hybrid Retrieval Process.
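Figure 5 is only summarized here, so the following sketch shows one generic form a hybrid retrieval step can take: a weighted fusion of dense (embedding-similarity) and keyword-based relevance scores. The fusion weight, normalization, and toy vectors are assumptions for illustration, not the paper's retrieval algorithm.

```python
import numpy as np

def hybrid_scores(query_vec, doc_vecs, keyword_scores, weight=0.5):
    """Fuse dense (cosine) and keyword-based relevance scores.

    `weight=0.5` and min-max normalization are illustrative choices; the
    hybrid retrieval in Figure 5 may combine its signals differently.
    """
    dense = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return weight * norm(dense) + (1 - weight) * norm(keyword_scores)

# Example: three safety-regulation snippets pre-embedded into toy 4-d vectors
docs = np.array([[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0], [0.4, 0.4, 0.4, 0.4]])
query = np.array([1.0, 0.0, 0.0, 0.1])
print(hybrid_scores(query, docs, keyword_scores=[2, 0, 1]).argsort()[::-1])  # ranked doc indices
```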
Figure 6. FusedChain Prompt Structure for Construction Safety Reasoning. This figure illustrates the FusedChain reasoning prompt framework: (1) Role & task definition: The prompt sets the model as a senior construction safety expert, with clear instructions to analyze workers’ safety status; (2) Few-shot demonstration: A sample Q&A pair (first User-Assistant interaction) serves as an example, where the Assistant’s response integrates the full Chain-of-Thought reasoning process; (3) Formal task application: The second User inputs real scene information, and the model follows the prompt template to generate structured safety analysis results.
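Following the three-part structure described in the Figure 6 caption, a FusedChain-style prompt can be assembled as an OpenAI-style message list. The role text, the worked example, and its chain-of-thought wording below are hypothetical placeholders, not the authors' exact template.

```python
def build_fusedchain_messages(scene_description: str) -> list[dict]:
    """Assemble the three prompt parts from Figure 6 as chat messages.

    The role text, the demonstration scene, and the reasoning steps are
    illustrative placeholders, not the paper's FusedChain template.
    """
    return [
        # (1) Role & task definition
        {"role": "system", "content": "You are a senior construction safety expert. "
                                      "Analyze each worker's safety status step by step."},
        # (2) Few-shot demonstration whose answer embeds the chain-of-thought reasoning
        {"role": "user", "content": "Scene: a worker performing welding work, no face shield detected."},
        {"role": "assistant", "content": "Step 1: The activity is welding, which requires a face shield "
                                         "and insulating gloves. Step 2: The face shield is missing. "
                                         "Step 3: Risk - arc-flash burns. Conclusion: unsafe."},
        # (3) Formal task application with the real scene information
        {"role": "user", "content": f"Scene: {scene_description}"},
    ]

messages = build_fusedchain_messages(
    "a worker doing rebar work at height, safety helmet worn, no safety belt")
```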
Figure 7. Experimental results on different categories of construction activities.
Figure 8. Experimental results of PPE detection.
Figure 9. YOLO ablation experiment. (a) With YOLO detection results; (b) Without YOLO detection results. This figure compares PPE detection similarity scores (y-axis: PPE types; x-axis: similarity score) with and without YOLOv10 integration. Blue and yellow bars indicate ‘Worn’ and ‘Not Worn’ PPE, respectively. Integrating YOLOv10 increases similarity scores for correctly detected ‘Worn’ items and reduces misclassification, thereby enhancing detection reliability.
Figure 10. Comparison between the CLIP model and the CS-CLIP model. This figure compares precision, recall, and F1 across two tasks: (a) construction activity recognition; (b) PPE detection. Red curves denote CS-CLIP (LoRA-fine-tuned for construction safety), and blue curves denote the baseline CLIP.
Table 1. Parameter settings of the LLM.
Parameter Type      Value   Description
stream              False   Streaming output
max_tokens          512     Maximum output length
temperature         0.1     Controls randomness
top_p               0.5     Nucleus sampling threshold
frequency_penalty   0       Repetition penalty
n                   1       Number of generated outputs
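For reference, the Table 1 settings map directly onto a standard chat-completion request. The sketch below assumes an OpenAI-compatible client with a placeholder endpoint and model identifier; it is not the paper's deployment interface.

```python
from openai import OpenAI

# Placeholder endpoint and model identifier (assumptions for illustration only).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",
    messages=[{"role": "user", "content": "Assess the safety of a welder working without a face shield."}],
    stream=False,            # Table 1: streaming output disabled
    max_tokens=512,          # Table 1: maximum output length
    temperature=0.1,         # Table 1: low randomness
    top_p=0.5,               # Table 1: nucleus sampling threshold
    frequency_penalty=0,     # Table 1: repetition penalty
    n=1,                     # Table 1: number of generated outputs
)
print(response.choices[0].message.content)
```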
Table 2. Overview of collected construction activity images.
Construction Activity   Number of Images
Scaffolding work        1430
Welding work            398
Excavation work         1388
Masonry work            864
Concrete work           1103
Rebar work              1632
Total                   6815
Table 3. Comparative results of construction activity recognition.
Study              Activity Category   Precision (%)   Recall (%)   F1 (%)
This study         Masonry work        100.0           98.0         99.0
                   Concrete work       100.0           100.0        100.0
                   Scaffolding work    100.0           88.2         93.8
                   Overall             100.0           95.4         97.6
Wang et al. [36]   Masonry work        97.0            94.1         95.5
                   Concrete work       56.1            95.8         70.8
                   Scaffolding work    95.2            80.0         87.0
                   Overall             78.9            90.4         84.3
Table 4. Comparative study on PPE detection performance.
Study              Category           Precision (%)   Recall (%)
This study         Safety helmets     90.0            84.8
                   Face shields       85.2            97.9
                   Gloves             80.1            80.1
                   Reflective vests   82.7            78.2
                   Average            84.5            85.3
Zhang et al. [2]   Safety helmets     86.5            88.7
                   Face shields       80.1            81.2
                   Gloves             82.4            84.2
                   Reflective vests   83.2            78.3
                   Average            83.1            83.1
Table 5. Comparison of hazard identification performance between existing methods and this study.
Study / Model      Metric              Unsafe   Fall    Collision   Unsafe Driving   Struck by Moving Objects   Arc-Flash Burns   Average
Study 1 [2]        Accuracy (%)        100      100     100         100              100                        82.6              97.1
Study 2 [36]       Precision (%)       -        61.4    76.2        -                -                          -                 -
                   Recall (%)          -        81.8    82.8        -                -                          -                 -
                   F1-score            -        70.1    79.3        -                -                          -                 -
Study 3 [24]       Precision (%)       -        72.0    79.3        -                -                          -                 -
                   Recall (%)          -        95.6    81.2        -                -                          -                 -
                   F1-score            -        82.1    84.2        -                -                          -                 -
Deepseek-r1-7B     Macro average (%)   100      100     100         100              100                        100               100
Deepseek-r1-1.5B   Precision (%)       77.01    78.57   82.14       57.14            66.67                      82.14             74.44
                   Recall (%)          94.12    91.67   95.83       100              94.12                      95.83             85.88
                   F1-score            84.71    84.62   88.46       72.73            78.05                      88.46             77.28
                   Accuracy (%)        77.09    96.48   92.07       94.71            96.04                      97.36             92.22
Table 6. Ablation results of prompt design.
Model                               Precision (%)   Recall (%)   F1 (%)   Accuracy (%)
Deepseek-r1-1.5B + FusedChain       74.44           85.88        77.28    92.22
Deepseek-r1-1.5B                    74.54           40.02        44.93    82.67
Deepseek-r1-7B + FusedChain         100             100          100      100
Deepseek-r1-7B                      92.72           85.68        88.83    95.89
Qwen2.5-7B-Instruct + FusedChain    93.36           84.58        86.75    94.86
Qwen2.5-7B-Instruct                 92.12           75.65        80.16    91.92
Qwen2.5-14B-Instruct + FusedChain   100             100          100      100
Qwen2.5-14B-Instruct                97.35           95.37        96.07    97.94
Table 7. Average Metrics of Different Models for Hazard Identification.
Model                  Precision (%)   Recall (%)   F1 (%)   Accuracy (%)
Deepseek-r1-7B         100             100          100      100
Deepseek-r1-1.5B       74.44           85.88        77.28    92.22
Qwen2.5-7B-Instruct    92.12           84.58        86.75    94.86
Qwen2.5-14B-Instruct   100             100          100      100
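Assuming the standard definitions of the metrics reported in Tables 3–7 (with TP, FP, TN, FN denoting true/false positives and negatives, and macro averages taken as unweighted means over categories):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```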
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
