Multimodal Generative AI for Construction-Site Management and Monitoring: A Field-Based Evaluation

Urlainis, Alon; Haronian, Eran; Mitelman, Amichai

doi:10.3390/smartcities9070114


by Alon Urlainis *, Eran Haronian and Amichai Mitelman

Department of Civil Engineering, Ariel University, Ariel 40700, Israel

* Author to whom correspondence should be addressed.

Smart Cities 2026, 9(7), 114; https://doi.org/10.3390/smartcities9070114

Submission received: 17 May 2026 / Revised: 25 June 2026 / Accepted: 30 June 2026 / Published: 2 July 2026

(This article belongs to the Special Issue Leveraging AI and Deep Learning for Smart Cities: Challenges, Opportunities, and Applications to Sustainable Development)


Highlights

What are the main findings?

Multimodal GenAI showed stronger performance in descriptive construction-management tasks, such as activity identification and progress tracking, than in judgment-intensive tasks such as execution defect detection and safety hazard identification.
The field-based evaluation using 1186 images from 17 active construction sites indicates that current general-purpose GenAI tools remain limited when professional context, technical interpretation, and engineering judgment are required.

What are the implications of the main findings?

Multimodal GenAI can support smart-city construction and urban infrastructure management by transforming visual site data into structured, decision-support information, but it should not replace professional engineering verification.
The results highlight the need for human-in-the-loop workflows and AI literacy in civil engineering education to ensure responsible, reliable, and context-aware use of GenAI in smart-city development.

Abstract

Modern construction sites generate large volumes of visual, spatial, and operational data that can support data-driven project delivery, improved monitoring, and reliable decision-making within the smart-city built environment. However, construction management still relies heavily on human observation and manual interpretation, limiting the transformation of field data into structured information for sustainable urban infrastructure delivery. Multimodal generative artificial intelligence (GenAI) offers a promising approach for interpreting construction-site data, yet its performance under real site conditions remains insufficiently examined, particularly across tasks requiring different levels of visual recognition, contextual reasoning, and professional judgment. This paper presents a field-based evaluation of multimodal GenAI models using 1186 images collected from 17 active construction sites. The evaluation considered three widely available general-purpose multimodal GenAI assistants: Gemini, ChatGPT, and Microsoft Copilot. Four major construction management tasks were assessed: construction activity identification, progress tracking, execution defect detection, and safety hazard identification. The GenAI outputs were compared against ground-truth evaluations established by human experts. The results suggest that GenAI performs more reliably in descriptive and visually explicit tasks than in judgment-intensive tasks requiring engineering interpretation. Activity identification achieved the strongest performance, whereas execution defect detection was the most challenging. The findings indicate that GenAI can support visual site interpretation and improve construction management efficiency, while highlighting the need for human oversight and verification in smart-city infrastructure delivery.

Keywords:

multimodal generative artificial intelligence (GenAI); smart cities; construction monitoring; construction management; human-in-the-loop; urban infrastructure; field-based evaluation

1. Introduction

Smart cities depend not only on intelligent infrastructure operation, but also on reliable, data-driven infrastructure delivery. Construction sites are temporary yet essential components of the urban system, generating large volumes of visual, spatial, and operational data through site images, inspections, progress documentation, and field observations. These data can support construction monitoring, quality assurance, safety management, and future asset information. However, much of this field data remains underused because construction management still relies heavily on manual observation and professional interpretation. This gap creates an opportunity to examine whether multimodal generative artificial intelligence (GenAI) can transform construction-site visual data into structured information for smart-city infrastructure delivery. This question is also aligned with the United Nations Sustainable Development Goals, particularly SDG 11, which emphasizes inclusive, safe, resilient, and sustainable cities and communities [1,2]. In this context, AI-supported construction monitoring may contribute to more transparent, resource-efficient, and evidence-based delivery of urban infrastructure.

GenAI tools are increasingly capable of interpreting visual information, generating textual explanations, and supporting decision-making processes, making them relevant to both professional practice and engineering education. In construction management (CM) and civil engineering (CE), image-based GenAI tools are particularly important because many routine monitoring tasks rely on visual evidence, including construction activity identification, progress tracking, execution defect detection, and safety hazard identification [3,4,5,6]. Such tools may improve the efficiency of visual documentation and preliminary site interpretation, especially when large image sets are collected from active construction projects.

Despite the growing relevance of multimodal GenAI for construction monitoring, its performance under real construction-site conditions remains insufficiently established. Unlike controlled benchmark datasets, construction-site images often contain incomplete visual evidence, occlusions, temporary works, cluttered backgrounds, variable lighting, and activities occurring at different stages of completion. As a result, GenAI systems may produce incomplete, inconsistent, or misleading outputs, including confident narratives that exceed the available visual evidence [7,8]. This limitation reinforces the need for explicit verification and uncertainty-aware reporting in construction-management tasks. Field-based evaluation is therefore required to determine which tasks can be supported reliably by multimodal GenAI and which still require substantial professional interpretation.

The need for verification also has implications for engineering education and professional training. GenAI tools are increasingly used by early-career engineers and engineering students [9], yet this use is often informal and only weakly integrated into structured curricula. Evidence from higher-education studies indicates frequent use and high self-reported confidence alongside uneven formal knowledge, ethical preparedness, and verification practices [10,11,12]. This is particularly important in construction management, where AI-generated interpretations may relate to safety, quality, and progress assessment. In such contexts, uncritical reliance on artificial intelligence (AI) outputs can reinforce automation bias and overconfidence, leading users to accept incomplete or misleading interpretations as reliable evidence [13,14,15]. Prior studies further suggest that scaffolded and task-based learning can strengthen critical thinking and reduce uncritical reliance on AI-generated outputs [16,17]. Therefore, responsible GenAI integration requires structured workflows that combine AI-supported analysis with human verification, uncertainty awareness, and domain-based judgment. Accordingly, while the primary focus of this study is the technical evaluation of GenAI performance, the field-based seminar setting also provides a secondary educational contribution by demonstrating how a verification-first workflow can be embedded in authentic construction-management training to strengthen students’ AI literacy, verification practice, and professional judgment.

Despite extensive research on AI in construction and the rapid emergence of GenAI tools, four gaps motivate the present study: (i) limited field-based validation under real construction-site conditions; (ii) limited evidence on how general-purpose multimodal GenAI performs across tasks that differ in visual explicitness, contextual reasoning, and required engineering judgment; (iii) insufficient analysis of recurring failure modes and unsupported interpretations; and (iv) weak integration of verification-first, human-in-the-loop workflows, particularly when applied by students or early-career engineers using real site data. These gaps are elaborated in Section 2.4.

In response to these needs, the present study examines the use of multimodal GenAI tools for construction-site image interpretation through a field-based evaluation conducted within a construction-management seminar. The dataset includes 1186 images collected from 17 active construction sites and analyzed by 31 undergraduate civil engineering students using a structured evaluation protocol. The evaluation used three widely available, general-purpose multimodal GenAI assistants (Gemini, ChatGPT, and Microsoft Copilot), with the choice of tool left to each research group. Four construction-management tasks were examined: construction activity identification, progress tracking, execution defect detection, and safety hazard identification. The AI-generated outputs were assessed against engineering ground truth to evaluate task-dependent performance, recurring limitations, and the feasibility of a verification-first human-in-the-loop workflow.

The primary contribution of this work is a technical, field-based evaluation of multimodal GenAI performance under real construction-site conditions. The educational dimension, namely the operationalization of a verification-first workflow within engineering training, is treated as a complementary secondary contribution. In this study, verification-first refers to treating AI-generated outputs as preliminary interpretations that require domain-based human review before being accepted or used for decision-making. The study makes three main contributions: (1) empirical evidence on the task-dependent strengths, limitations, and recurring failure modes of multimodal GenAI when analyzing construction-site imagery from active projects; (2) comparative assessment of GenAI performance across four main construction-management tasks, namely activity identification, progress tracking, defect detection, and safety hazard identification; and (3) demonstration of a human-in-the-loop verification workflow that supports responsible GenAI use within smart-city construction and infrastructure monitoring.

2. Background

2.1. AI in Civil Engineering and Construction Management

AI research in construction engineering and management has grown rapidly, with AI increasingly used for planning, scheduling, risk and quality control, and site monitoring [18,19,20,21]. In parallel, Building Information Modeling (BIM) has evolved into a digital backbone for project information, and multiple studies argue that integrating BIM and AI enables new capabilities such as automated rule checking, as-built reconstruction, event-log mining, performance analysis, and digital twin development [22,23]. Figure 1 illustrates the rapid growth of AI-related research in construction engineering and construction management since 2010, reflecting the increasing maturity and diversification of AI applications. The trend was generated from a Scopus search using title, abstract, and keyword fields with AI-related terms, including “artificial intelligence,” “machine learning,” “deep learning,” “computer vision,” “object detection,” “generative AI,” and “large language model,” combined with construction-related terms, including “construction management,” “construction engineering,” “construction project,” “construction industry,” “construction site,” “building construction,” and “AEC industry.” The search covered publications from 2010 to 2025.

Within construction management, visual data analytics have received substantial attention. Prior work demonstrates feasibility for worker and equipment detection, scene understanding, and personal protective equipment (PPE) compliance monitoring [24,25,26]. However, the literature also emphasizes persistent barriers to real-world adoption that extend beyond technical accuracy, including lack of trust in AI outcomes, data security and privacy, fragmented data ecosystems, and uncertainty around system behavior under changing site conditions [27,28]. These concerns motivate the growing focus on trustworthy AI frameworks and ethical guidance tailored to construction settings [29].

More recently, GenAI and large language models (LLMs) have expanded attention from recognition and prediction toward language-intensive workflows, including early-stage planning outputs such as schedule generation and broader discussion of adoption opportunities and limitations in the construction sector [30,31,32]. This shift strengthens the need for educational approaches that develop not only tool familiarity, but also verification habits and calibrated trust, particularly when AI outputs are persuasive yet may be incorrect or insufficiently grounded for safety- and quality-critical decisions.

Table 1 summarizes recent studies on AI applications in construction management, highlighting the capabilities employed and tasks addressed. Notably, none of the reviewed studies combine multimodal GenAI with image-based field data in a multi-task evaluation framework.

2.2. Multimodal GenAI, Verification, and Human-in-the-Loop Construction Monitoring

Multimodal GenAI represents a shift from task-specific computer vision models toward general-purpose systems that can interpret visual information and generate explanatory text. In construction monitoring, this capability is important because site images often require more than object recognition; they may involve interpretation of activities, progress status, execution defects, and safety conditions. Unlike conventional computer vision systems trained for predefined labels, multimodal GenAI can produce flexible descriptions and reasoning-like outputs, making it useful for construction documentation and preliminary site interpretation [47,48].

However, this flexibility also introduces risks. Construction-site images often provide partial evidence, as important elements may be outside the frame, concealed by temporary works, affected by lighting, or obscured by site clutter. In such cases, GenAI may produce plausible but unsupported interpretations. Therefore, GenAI-based interpretations should be treated as preliminary outputs that require human verification rather than as final evidence [49,50,51].

In this study, a verification-first approach refers to treating AI-generated outputs as preliminary interpretations that require domain-based human review before being accepted or used for decision-making. Verification includes comparing AI outputs with engineering ground truth, identifying unsupported claims, assessing whether the visual evidence is sufficient, and documenting uncertainty where the image does not allow a reliable conclusion. This is particularly important for judgment-intensive tasks such as execution defect detection and safety hazard identification [52,53].

Human-in-the-loop monitoring positions GenAI as a support tool rather than an autonomous decision-maker. GenAI can assist by producing initial descriptions, identifying visible activities, highlighting possible issues, and organizing observations, while the human evaluator remains responsible for checking the output against available evidence and applying domain knowledge. This framing is consistent with construction-management practice, where decisions must remain evidence-based, explainable, and professionally accountable [54,55].

2.3. Engineering Education and Verification-First GenAI Use

Engineering education is experiencing a rapid diffusion of GenAI tools, with student engagement often driven by individual experimentation rather than structured instructional design [9]. Higher-education studies report frequent use and relatively high confidence in AI tools, alongside uneven levels of AI literacy and ethical preparedness [10]. This mismatch is especially important in engineering domains where AI-supported outputs may influence safety, reliability, and professional accountability.

For construction management, responsible GenAI use should emphasize verification, risk awareness, and professional judgment. Prior work suggests that scaffolded and experiential learning can strengthen critical thinking when students reflect on errors, document verification steps, and interrogate AI outputs rather than treating them as final answers [10,16]. This is particularly relevant when students work with field-based construction images, where evidence quality varies and domain constraints matter.

Accordingly, construction-focused GenAI education should develop three core competencies: (1) AI literacy for engineering tasks, (2) verification practice through ground-truth comparison and uncertainty documentation, and (3) professional responsibility, recognizing that final accountability remains with the engineer even when AI is used as a support tool [12]. These competencies support the need for field-grounded educational designs that combine authentic construction-management tasks with explicit verification and judgment.

2.4. Research Gaps and Motivation

Despite extensive research on AI in construction and the rapidly increasing availability of GenAI tools, four important gaps remain that motivate the present study:

(i) Limited field-based validation: Empirical evaluation of multimodal GenAI under real construction-site conditions remains limited, particularly for core construction-monitoring tasks such as activity identification, progress tracking, defect detection, and safety hazard identification.
(ii) Limited multi-task performance evidence: Existing studies often focus on a single task, specialized model, or controlled dataset. Less is known about how general-purpose multimodal GenAI performs across tasks that differ in visual explicitness, contextual reasoning, and required engineering judgment.
(iii) Insufficient analysis of failure modes: Prior studies often emphasize aggregate performance metrics, while less attention is given to recurring error patterns, unsupported interpretations, and conditions under which GenAI outputs become unreliable.
(iv) Weak integration of verification-first workflows: Few studies operationalize human-in-the-loop workflows that combine AI-generated outputs with ground-truth comparison, uncertainty-aware reporting, and professional verification, especially when applied by students or early-career engineers using real construction-site data.

Motivated by these gaps, this study evaluates the use of multimodal GenAI in realistic construction monitoring tasks through a field-based undergraduate seminar, combining task-level performance analysis with pre- and post-intervention measures of student trust and verification awareness.

3. Methodology

This study adopts a field-based evaluation design to assess the performance of multimodal GenAI on construction-management tasks using images collected from real construction sites. The study was implemented within an undergraduate construction-management seminar, enabling structured human-in-the-loop evaluation, ground-truth comparison, and supporting assessment of students’ verification awareness. The methodology combines quantitative assessment of AI task performance with qualitative analysis of recurring errors, while educational outcomes related to trust, confidence, and verification practices are examined as a secondary component.

The core methodological principle is a verification-first (human-in-the-loop) workflow, in which GenAI outputs are systematically compared against engineering ground truth and professional judgment rather than being treated as authoritative results. Figure 2 illustrates the overall research framework, highlighting the sequential and parallel interactions between preliminary research, real-world site investigation, AI-assisted task execution, mandatory verification procedures, and educational assessment. The figure also emphasizes the parallel development of technical evaluation and learning outcomes across all study phases.

3.1. Study Context and Participants

The study was conducted within an undergraduate construction management seminar offered to civil engineering students. A total of 31 students participated, organized into 15 independent research groups working under a unified methodological framework. The seminar engaged students in structured, task-oriented activities involving the analysis of construction site images using multimodal GenAI tools, while explicitly emphasizing professional responsibility, verification practices, and critical evaluation of AI-generated outputs.

3.2. Construction Activities and Site Selection

Each group selected a construction activity that is visually observable and relevant to construction management practice. The activities covered structural works, finishing works, mechanical, electrical, and plumbing (MEP) systems, and infrastructure works. Sites were chosen to ensure that selected activities were actively occurring during the study period and that realistic field variability would be represented. Overall, the study included visits to 17 active construction sites, covering diverse project types (residential, infrastructure, data center, and mixed-use), execution stages, and environments (indoor and outdoor conditions).

3.3. Experimental Design and Data Collection

A structured experimental plan was developed for each construction activity. The plan specified the number of required site visits, the range of execution stages to be captured, and basic guidelines for image acquisition. Students were instructed to collect site images during multiple visits where possible, in order to capture temporal variation and progress-related changes. During each site visit, students collected photographic data together with complementary execution information. The image-acquisition guidelines, applied consistently across all groups, required students to photograph each activity from multiple viewpoints and working distances, ensure adequate lighting, focus, and framing, minimize occlusions where possible, and capture distinct execution stages across repeated visits. For each image, structured metadata were recorded to support traceability, verification, and subsequent analysis. These metadata included the date and time of capture, site location, construction activity, observed execution stage, image resolution, image source, and additional contextual information relevant to the activity. A summary of the data-collection and evaluation guidelines is provided in Appendix C.
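The per-image metadata described above can be pictured as one structured record per photograph. The following is a minimal sketch in Python, not the authors' actual schema: the class name, field names, identifier scheme, and example values are all hypothetical, and only the list of recorded items comes from the text.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass(frozen=True)
class ImageMetadata:
    """One metadata record per site image (field names are illustrative)."""
    image_id: str                   # hypothetical identifier scheme
    captured_at: datetime           # date and time of capture
    site_location: str
    activity: str                   # construction activity
    execution_stage: str            # observed execution stage
    resolution_px: tuple[int, int]  # (width, height) in pixels
    source: str                     # image source, e.g. a phone camera
    context_notes: str = ""         # additional contextual information


def missing_fields(record: ImageMetadata) -> list[str]:
    """Return the names of required text fields that were left empty."""
    required = ("image_id", "site_location", "activity", "execution_stage", "source")
    values = asdict(record)
    return [name for name in required if not str(values[name]).strip()]


# Made-up example values, for illustration only.
example = ImageMetadata(
    image_id="site03_visit2_img014",
    captured_at=datetime(2026, 1, 12, 9, 30),
    site_location="Site 03",
    activity="formwork",
    execution_stage="",             # deliberately left empty
    resolution_px=(4032, 3024),
    source="smartphone",
)
print(missing_fields(example))  # ['execution_stage']
```

A completeness check of this kind is one simple way to support the traceability goal, since an image without a recorded execution stage cannot later be compared against a progress assessment.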

3.4. Multimodal GenAI Task Execution

Collected images were analyzed using three widely available, general-purpose multimodal GenAI assistants: Gemini 1.5 Pro, ChatGPT-4o, and Microsoft Copilot. These tools were selected because they were accessible to students during the study period, supported image-based input, generated textual explanations, and represented common GenAI assistants available to non-specialist engineering users. Tool selection was left to the discretion of each research group, reflecting authentic, self-directed adoption of publicly available GenAI tools rather than a researcher-imposed configuration. Accordingly, the study evaluates realistic use of multimodal GenAI tools under field-based construction-management training conditions, rather than a controlled benchmark of fixed API-based computer vision models.

The analyses were conducted between December 2025 and March 2026. The tools were accessed through their public web or app interfaces under default settings. The evaluated model versions, to the extent identifiable from each interface, were Gemini 1.5 Pro, ChatGPT-4o, and Microsoft Copilot. Because these tools were accessed through commercial consumer interfaces, not all configuration parameters were directly controllable or visible to the users. Parameters such as temperature, sampling settings, model-version pinning, and backend model updates could not be fixed across all tools. This limitation was documented because web-based GenAI systems may be updated dynamically over time.

To support comparability across groups and tools, all evaluations followed a common task-oriented prompting structure. The prompts were designed to generate outputs related to construction activity identification, progress assessment, execution defect detection, and safety hazard recognition. Students were instructed to ask the model to base its response on visible evidence, describe the supporting visual cues, avoid unsupported assumptions, and indicate uncertainty or missing visual information where applicable. The same prompt structure was used consistently across the images evaluated within each task, and students were instructed not to edit the model output before recording it.
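A common task-oriented prompting structure of this kind can be sketched as a fixed instruction block combined with a task-specific question. The Python sketch below is illustrative only: the task keys, question wording, and instruction wording are assumptions written for this example and are not the prompts used in the study.

```python
# Hypothetical task questions; the study's actual prompt wording is not reproduced here.
TASK_QUESTIONS = {
    "activity_identification": "Which construction activity is shown in the image?",
    "progress_tracking": "What stage of execution does the image show?",
    "defect_detection": "Are any execution defects visible in the image?",
    "safety_hazard_identification": "Are any safety hazards visible in the image?",
}

# Shared instructions mirroring the four requirements named in the text:
# visible evidence, supporting cues, no unsupported assumptions, stated uncertainty.
COMMON_INSTRUCTIONS = (
    "Base your answer only on what is visible in the image. "
    "Describe the visual cues that support your answer. "
    "Do not make assumptions that the image does not support. "
    "State any uncertainty or missing visual information."
)


def build_prompt(task: str) -> str:
    """Combine the task-specific question with the shared instructions."""
    if task not in TASK_QUESTIONS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{TASK_QUESTIONS[task]} {COMMON_INSTRUCTIONS}"


print(build_prompt("defect_detection"))
```

Keeping the shared instructions in a single constant means every image within a task receives the same wording, which is the property the protocol relies on for comparability across groups.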

AI outputs were recorded in a structured JSON-based evaluation format to enable systematic comparison and analysis. The recorded information included the image identifier, GenAI tool used, task category, prompt category, generated answer, supporting explanation, confidence statement when provided by the model, evaluator comments, reference assessment, verification notes, and final evaluation score. The same verification-first scoring framework and 0–100 scoring anchors were then applied across tools and tasks. A representative example of the evaluation sheet and category-constrained prompt for task R1 (construction activity identification; task labels are defined in Section 4.1) is provided in Appendix B, and the general GenAI execution guidelines are summarized in Appendix C.
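To make the recorded fields concrete, the following Python sketch shows one possible JSON record together with a basic completeness check. The key names and all example values are hypothetical; the study's actual format is the one shown in its Appendix B, which is not reproduced here.

```python
import json

# Key names are assumptions derived from the list of recorded items in the text.
REQUIRED_KEYS = {
    "image_id", "tool", "task_category", "prompt_category", "generated_answer",
    "supporting_explanation", "confidence_statement", "evaluator_comments",
    "reference_assessment", "verification_notes", "final_score",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks complete."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    if "final_score" in record:
        score = record["final_score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0 <= score <= 100:
            problems.append("final_score must be a number from 0 to 100")
    return problems


# Made-up example values, for illustration only.
example_record = {
    "image_id": "site03_visit2_img014",
    "tool": "Gemini",
    "task_category": "R1",
    "prompt_category": "activity_identification",
    "generated_answer": "Formwork installation for a slab.",
    "supporting_explanation": "Timber panels and props are visible.",
    "confidence_statement": None,   # the model does not always provide one
    "evaluator_comments": "Activity correct; stage not stated.",
    "reference_assessment": "Slab formwork, partially installed.",
    "verification_notes": "Checked against site notes.",
    "final_score": 80,
}

serialized = json.dumps(example_record, indent=2)
restored = json.loads(serialized)
print(validate_record(restored))  # []
```

Storing the raw model answer next to the reference assessment and the verification notes keeps each score traceable to the evidence that justified it.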

No formal prompt-sensitivity analysis was conducted because the study focused on field-based use of GenAI tools within an educational construction-management setting. However, the use of a structured task-specific prompt template reduced uncontrolled variation across groups. Systematic prompt-robustness testing is therefore identified as an important direction for future work, using fixed image sets, repeated prompt variants, controlled model configurations, and larger balanced samples across construction-management tasks.

3.5. Verification-First Evaluation Workflow

For each AI-generated output, the reference assessment was established through a supervised verification-first procedure. The ground-truth reference was not developed by students in isolation. For each task, the research group that collected the images first defined the expected reference assessment based on direct site observations, recorded metadata, observed execution stage, visible materials and equipment, site conditions, construction logic, and relevant professional requirements where applicable.

Key assumptions were then verified through consultation with the site engineer, site manager, or another responsible construction-site professional during or shortly after the site visit. The role of the site professional was to confirm the technical interpretation and contextual assumptions, rather than to serve as the sole author of the reference assessment. Therefore, statements in this paper that final judgment remains the responsibility of the engineer refer to professional accountability for the use of AI-supported outputs in practice, not to sole authorship of the ground-truth labels.

Additional quality control was provided by the course instructor, who independently audited a subset of evaluations to check the consistency of ground-truth definitions, scoring logic, and interpretation of ambiguous cases. Where necessary, the instructor served as an adjudicator. Verification actions included cross-image triangulation, checking whether AI claims were supported by visible evidence, identifying unsupported inferences, documenting uncertainty when visual information was insufficient, and flagging warning signs such as overconfident statements, missing visual justification, or contradictions with known execution practices.

Accordingly, GenAI outputs were treated as decision-support inputs rather than final engineering conclusions. The ground-truth procedure therefore combined student field observation, site-professional consultation, instructor audit, and adjudication of ambiguous cases, forming a multi-layer quality-control chain intended to reduce single-rater bias and improve the credibility of the reference assessments. Scoring itself followed the pre-specified, anchored 0–100 rubric (Table 2, as detailed in Section 3.6), so that the evaluation criteria were fixed in advance rather than defined case by case. The verification and scoring steps followed a common set of instructions applied uniformly across all groups. A representative structured GenAI evaluation output is provided in Appendix B to illustrate the traceability format used in the evaluation, while the data-collection, GenAI execution, verification, and evaluation guidelines are summarized in Appendix C.

3.6. Performance Analysis

AI performance was evaluated by comparing model outputs with the verified ground-truth references using a standardized 0–100 scoring scale. Scoring was guided by general evaluation anchors to reduce interpretation differences across groups (Table 2). The use of rubric-based, anchored human scoring to evaluate generative-model outputs follows established evaluation practice [56]. To support scoring consistency, the scoring approach was discussed in class, and the course instructor guided each group through scoring a small set of approximately 3–5 images.

Quantitative analysis used one-way analysis of variance (ANOVA) to examine task-type differences and two-way mixed-model ANOVA to examine tool-by-task interaction patterns. Effect sizes were reported alongside significance tests in line with standard statistical reporting conventions [57]. Qualitative analysis examined recurring error patterns linked to site variability, image quality, lighting, camera angle, occlusions, insufficient visual evidence, and unsupported AI inferences.
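For readers less familiar with the test, the one-way ANOVA on task type can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the scores are synthetic, eta squared is used as the effect size although the text does not name the measure reported, and the two-way mixed-model analysis is not covered. A statistics package would normally supply the p-value, for example `scipy.stats.f_oneway`.

```python
from statistics import fmean


def one_way_anova(groups: dict[str, list[float]]) -> dict[str, float]:
    """Compute the F statistic and eta squared for a one-way ANOVA."""
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = fmean(all_scores)

    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(scores) * (fmean(scores) - grand_mean) ** 2 for scores in groups.values()
    )
    # Within-group variation: scores around their own group mean.
    ss_within = sum(
        (s - fmean(scores)) ** 2 for scores in groups.values() for s in scores
    )

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    if df_within <= 0 or ss_within == 0:
        raise ValueError("not enough within-group variation to compute F")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return {
        "F": f_stat,
        "df_between": df_between,
        "df_within": df_within,
        "eta_squared": eta_squared,
    }


# Synthetic 0-100 scores per task type; these are NOT the study's data.
synthetic_scores = {
    "activity_identification": [90, 85, 95, 88],
    "progress_tracking": [80, 75, 85, 78],
    "defect_detection": [55, 60, 50, 65],
    "safety_hazard_identification": [70, 65, 72, 60],
}
print(one_way_anova(synthetic_scores))
```

A large F with a large eta squared would indicate that task type explains much of the score variation, which is the pattern the task-dependence question is probing.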

3.7. Educational Assessment

Students completed paired pre–post questionnaires measuring self-reported AI literacy competencies on a six-point Likert scale (Appendix A). Four competencies were assessed: understanding of AI contribution to civil engineering, error recognition and validation capability, professional application skills, and verification-first orientation. The questionnaire also collected baseline GenAI usage patterns and post-course trust ratings. Changes were analyzed using paired-samples t-tests with Cohen’s d effect sizes.
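The paired pre–post comparison can be sketched as follows. This is a minimal Python illustration with synthetic ratings. The text does not say which paired-samples variant of Cohen's d was used, so the sketch assumes the mean difference divided by the standard deviation of the differences (often written d_z).

```python
from math import sqrt
from statistics import fmean, stdev


def paired_t_and_d(pre: list[float], post: list[float]) -> dict[str, float]:
    """Paired-samples t statistic and Cohen's d (d_z variant, an assumption)."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("pre and post need the same length, with at least two pairs")
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = fmean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t and d are undefined")
    n = len(diffs)
    return {
        "t": mean_diff / (sd_diff / sqrt(n)),
        "df": n - 1,
        "cohens_d": mean_diff / sd_diff,
    }


# Synthetic six-point Likert ratings for one competency; NOT the study's data.
pre_ratings = [2, 3, 4, 3]
post_ratings = [3, 5, 5, 5]
print(paired_t_and_d(pre_ratings, post_ratings))
```

Because each student serves as their own baseline, the test works on the per-student differences rather than on the two sets of ratings separately.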

3.8. Integration of Technical and Educational Outcomes

The final stage integrated technical findings with supporting educational outcomes. AI performance results and identified failure modes were analyzed alongside changes in student trust, confidence, and verification awareness. This integrated analysis enabled evaluation of multimodal GenAI capabilities for construction-management tasks, while also examining how a verification-first workflow supports responsible interpretation and training.

4. Results

4.1. Overview of Dataset

A total of 31 undergraduate civil engineering students, organized into 15 independent research groups, participated in the study. Each group captured and analyzed construction site imagery from active construction projects. Each image was analyzed by the research group that collected it; on average, each group analyzed approximately 79 images. In total, 17 construction sites were visited, representing a wide range of project types, execution stages, and environmental conditions. The surveyed projects included residential construction (urban renewal and high-rise developments), infrastructure works (road interchanges and sewer systems), data center construction, and mixed-use developments. Image collection covered diverse site environments, ranging from controlled indoor finishing stages to outdoor structural works, rooftop installations, and enclosed underground pipelines. Collectively, the research groups analyzed 1186 images documenting a broad spectrum of construction activities. Figure 3 provides representative examples from the image dataset, demonstrating the variety of construction stages and indoor/outdoor viewpoints used in the evaluation. The distribution of analyzed images by construction category and activity is summarized in Table 3.

Construction management task types were classified according to a research question (RQ) framework. In this framework, the identifiers R1–R15 refer to construction-management evaluation tasks and are distinct from the study’s research aims stated in Section 1; the prefix “R” is retained only as a consistent task label across the text, tables, and figures. Each research group evaluated between two and four construction management tasks. The four primary tasks assessed were Construction Activity Identification (R1), Construction Progress Tracking (R2), Execution Defect Detection (R3), and Safety Hazard Identification (R4). In addition, some groups examined exploratory tasks, including Impact of Lighting and Camera Angle (R11), Field Applicability Assessment (R12), Automated Construction Report Generation (R13), Image Resolution Impact (R14), and Construction Area Classification (R15), as presented in Table 4. In total, 47 task-level analyses were conducted across all research groups.

The evaluation process was conducted using Gemini, ChatGPT, and Copilot, with tool selection left to the discretion of each research group. Gemini accounted for 70% of the evaluations, followed by ChatGPT (26%) and Copilot (4%). The predominance of Gemini reflects its greater availability and adoption during the study period, primarily due to the provision of a free one-year trial. The overall research workflow and verification-first evaluation process are summarized in Figure 4.

4.2. Technical Results: Performance of Multimodal GenAI in Construction Management Tasks

This section presents the quantitative results of the field-based evaluation of multimodal GenAI tools for core construction management tasks, based on 47 individual evaluations and a total of 1186 site images. Performance was assessed using a standardized 0–100 scoring scale, where higher scores indicate greater task success. For each evaluation, the assigned research group determined the score by comparing the AI output to the ground truth reference established from the site visit context and the captured images. Scores reflected the extent to which the AI output correctly addressed the assigned construction management task. When outputs were partially correct (e.g., correct trade but incorrect stage, or correct hazard type with missing critical details), groups assigned intermediate scores to reflect partial success.

4.2.1. Performance by Construction Activity

Figure 5 presents the AI performance scores across 15 distinct construction activities, revealing substantial variation in model effectiveness depending on the type of work being analyzed. The overall mean performance across all activities was 75.3, with scores ranging from 56.5 to 96.0. The highest-performing activities were Earthworks (M = 96.0), External works and environmental development (M = 94.5), and Hollow-core slabs (M = 92.5). These activities typically involve large-scale, visually distinctive features and clearer stage cues, which multimodal GenAI models interpreted reliably. Mid-range performance was observed in trade activities such as Flooring (tiling) (M = 83.5) and Retaining wall construction (M = 81.3), where stage cues exist but can be partially occluded or visually similar across sub-stages. The lowest-performing activities included Deep foundation works (bored piles) (M = 56.5) and Interior plastering and painting (M = 61.7), which often require interpretation under limited visibility, reduced scale cues, and higher dependence on contextual engineering knowledge that is not fully observable in single images.

These findings indicate that multimodal GenAI performs better for activities with large-scale, visually distinct elements and clear stage cues, particularly under favorable lighting conditions. Performance decreases for activities requiring fine material discrimination, specialized construction knowledge, or interpretation of partially visible construction components.

4.2.2. Performance by Task Type

Analysis of AI performance across the four primary construction-management task types revealed a clear hierarchy associated with task complexity (Figure 6). A one-way ANOVA indicated a statistically significant effect of task type on performance (F(3, 35) = 2.96, p = 0.046, η² = 0.20). Because this analysis consisted of a single omnibus ANOVA rather than post hoc pairwise comparisons, no family-wise correction for multiple comparisons was applied. Given the modest number of task-level evaluations, particularly for Safety Hazard Identification, the inferential result is interpreted conservatively, with emphasis placed on the observed performance hierarchy, recurring task-level patterns, and effect size rather than on the borderline p-value alone. Activity Identification (R1) achieved the strongest results (M = 83.5, SD = 16.07, n = 12), indicating that multimodal GenAI can generally classify trades and recognize dominant materials and equipment from site imagery. Progress Tracking (R2) exhibited moderate performance (M = 74.1, SD = 20.36, n = 13), reflecting challenges in distinguishing adjacent execution stages when temporal context is limited. Safety Hazard Identification (R4) also showed moderate performance (M = 73.1, SD = 7.96, n = 5), suggesting that models can detect some visible hazards but lack consistent reliability across scenarios. Defect Detection (R3) was the weakest task (M = 61.6, SD = 14.80, n = 9), highlighting that fine-grained quality assessment remains difficult, particularly when defects are subtle, partially concealed, or require verification against workmanship standards rather than purely visual appearance.
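The omnibus result can be approximately re-derived from the summary statistics reported above. The following Python sketch is an illustrative check, not the authors' analysis script; the function name is hypothetical, and the inputs are the rounded means, standard deviations, and group sizes quoted in the text.

```python
def anova_from_summary(groups):
    """One-way ANOVA from (mean, sd, n) triples; returns F, df1, df2, eta squared."""
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(m * n for m, _, n in groups) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)
    df1 = len(groups) - 1
    df2 = total_n - len(groups)
    f_stat = (ss_between / df1) / (ss_within / df2)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, df1, df2, eta_sq


# (mean, SD, n) as reported in the text for R1, R2, R4, and R3.
reported = [(83.5, 16.07, 12), (74.1, 20.36, 13), (73.1, 7.96, 5), (61.6, 14.80, 9)]

f_stat, df1, df2, eta_sq = anova_from_summary(reported)
print(f"F({df1}, {df2}) = {f_stat:.2f}, eta squared = {eta_sq:.2f}")
```

Because the inputs are rounded, the recomputed F statistic may differ slightly from the reported value, while the degrees of freedom and η² agree with the text.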

To verify that this effect was not an artifact of the unequal and, in places, small task-level samples, the omnibus test was repeated using Welch’s ANOVA, a choice consistent with guidance favoring Welch-type methods when group sizes are unbalanced and variances may differ [58,59]. The effect of task type remained statistically significant, F(3, 17.8) = 3.28, p = 0.045. As an additional sensitivity check, the analysis was repeated after excluding the smallest task group (Safety Hazard Identification, n = 5), and the task-type effect remained significant. Pairwise comparisons using Welch’s t-tests with Holm correction indicated that Activity Identification (R1) scored significantly higher than Defect Detection (R3) (p_adj = 0.027), whereas finer distinctions among adjacent task types were not statistically resolved at the available sample size and are therefore interpreted descriptively. A less biased effect-size estimate, ω² = 0.13, is reported to support cautious interpretation of the task-type effect.
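As a hedged illustration of the robustness check, the sketch below applies the standard Welch one-way ANOVA formulas to the reported group means, standard deviations, and sample sizes, and computes ω² from the classical F statistic. The function names are hypothetical, and the rounded inputs mean the output only approximates the published values.

```python
def welch_anova_from_summary(groups):
    """Welch's one-way ANOVA from (mean, sd, n) triples; returns F, df1, df2."""
    k = len(groups)
    weights = [n / sd ** 2 for _, sd, n in groups]  # precision weights n / s^2
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, (m, _, _) in zip(weights, groups)) / total_w
    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, (m, _, _) in zip(weights, groups)
    ) / (k - 1)
    lam = sum(
        (1 - w / total_w) ** 2 / (n - 1) for w, (_, _, n) in zip(weights, groups)
    )
    f_stat = numerator / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


def omega_squared(f_stat, df1, total_n):
    """Omega squared for a one-way design, computed from the classical F."""
    return df1 * (f_stat - 1) / (df1 * (f_stat - 1) + total_n)


# (mean, SD, n) as reported in the text for R1, R2, R4, and R3.
reported = [(83.5, 16.07, 12), (74.1, 20.36, 13), (73.1, 7.96, 5), (61.6, 14.80, 9)]

f_welch, df1, df2 = welch_anova_from_summary(reported)
omega_sq = omega_squared(2.96, 3, 39)  # classical F and total n from the text
print(f"Welch F({df1}, {df2:.1f}) = {f_welch:.2f}, omega squared = {omega_sq:.2f}")
```

Welch's test weights each group by n/s², so the small but low-variance Safety Hazard Identification group does not distort the pooled error term, which is why the denominator degrees of freedom fall well below the classical value of 35.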

Overall, multimodal GenAI models are more effective for identification-oriented tasks than for quality assessment tasks. The significant performance gap between activity identification and defect detection (ΔM = 22.0 points) underscores that task complexity is a key driver of AI effectiveness. These findings suggest that deployment should be calibrated by task type: higher-confidence applications in trade/activity identification and documentation, with verification-first workflows and sustained human oversight for defect and quality assessment.

4.2.3. Task Performance by Model

Table 5 reports mean performance scores for R1–R4 by model, and Figure 7 visualizes the corresponding task-level profiles. Overall performance was comparable between Gemini (M = 73.88) and ChatGPT (M = 72.07), and no significant differences were observed between models across the evaluated tasks. Both tools achieved their highest scores in Construction Activity Identification (R1) (overall mean = 82.96) and moderate performance in Construction Progress Tracking (R2) and Safety Hazard Identification (R4) (overall means of 73.63 and 73.10, respectively). The lowest performance was observed for Execution Defect Detection (R3) (overall mean = 61.56), where Gemini exhibited higher mean performance than ChatGPT.

To test whether tool selection influenced performance outcomes, a two-way mixed-model ANOVA was conducted with tool (Gemini, ChatGPT) and task type (R1, R2, R3, R4) as factors. Results revealed no significant main effect of tool, indicating comparable performance between Gemini and ChatGPT. In addition, no statistically significant tool-by-task interaction was observed, indicating that performance differentials across task types were consistent regardless of which tool was employed. This pattern suggests that current performance limitations reflect shared constraints of general-purpose multimodal systems rather than tool-specific architectural weaknesses.

4.2.4. Exploratory Tasks

Beyond the four primary task types, several exploratory applications were examined but were not included in the main statistical analysis due to limited sample sizes. Automated documentation achieved promising results across three activities (M = 84.5, range = 77.5–96), particularly for daily work-log generation from fixed camera footage, although double-counting across consecutive frames remained a concern. Area classification achieved 85% accuracy across 95 images, with errors mainly occurring between visually similar spaces. Quality screening showed sensitivity to visible masonry irregularities (80%) but also produced over-detection of non-critical deviations. Environmental sensitivity testing indicated strong performance under optimal lighting (90%) but confirmed reduced reliability under shadows, glare, and unfavorable camera angles. Overall, these exploratory findings suggest potential for automated documentation and quality screening, while reinforcing the need for standardized image-capture protocols and professional feedback.

4.3. Educational Results: Student Trust, Verification Behavior, and AI Literacy

4.3.1. Baseline GenAI Usage Patterns

Educational outcomes were assessed using questionnaire responses from 31 undergraduate civil engineering students. Baseline results showed high general GenAI adoption but limited engineering-specific use. As shown in Figure 8a, 83.9% of students reported using GenAI tools weekly, while Figure 8b shows that most used them for general information search (74.2%), summarization (71.0%), and writing improvement (61.3%). In contrast, only 25.8% reported using GenAI for engineering tasks (Figure 8b). These findings indicate that the main educational challenge was not tool adoption, but professional literacy: helping students understand what GenAI can and cannot reliably support in engineering contexts and how outputs should be verified.

As shown in Figure 9 and Table 6, students’ self-reported AI literacy improved across all four assessed competencies after the verification-first activity. The largest gains were observed in validation importance and professional application skills, suggesting that the field-based GenAI exercise strengthened students’ awareness of the need to critically evaluate AI outputs before applying them in construction-management contexts.

4.3.2. Trust Calibration Relative to Observed AI Performance

Effective human–AI collaboration requires trust that reflects actual system capabilities, avoiding both automation bias and underutilization [60]. In construction, this issue is especially important because AI-supported outputs may influence safety, quality, and progress-related decisions [61]. To examine this issue, post-course trust ratings were compared with observed AI performance across the four primary construction-management tasks (Figure 10). Trust ratings were converted to a percentage scale for comparison.

As shown in Figure 10a, student trust was systematically lower than AI performance across all four tasks, with gaps ranging from 16 to 22 percentage points. Figure 10b shows that all four tasks fall below the perfect-calibration diagonal, indicating a conservative trust orientation rather than overreliance on AI outputs. In safety-critical engineering contexts, such caution may help reduce automation bias and support responsible human–AI collaboration.
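The calibration comparison can be sketched as follows. The performance means are those reported in Section 4.2.2, whereas the trust ratings are hypothetical placeholders, and the linear rescaling of the six-point scale to 0–100 is an assumption, because the text does not specify the conversion used.

```python
def likert_to_percent(rating, scale_min=1, scale_max=6):
    """Linearly rescale a Likert rating to 0-100 (assumed conversion)."""
    return (rating - scale_min) / (scale_max - scale_min) * 100


def calibration_gap(performance, trust_percent):
    """Positive values mean trust sits below observed performance (conservative)."""
    return performance - trust_percent


# Observed mean performance from the text; trust ratings are hypothetical.
performance = {"R1": 83.5, "R2": 74.1, "R3": 61.6, "R4": 73.1}
hypothetical_trust = {"R1": 4.3, "R2": 3.8, "R3": 3.2, "R4": 3.7}

gaps = {
    task: calibration_gap(score, likert_to_percent(hypothetical_trust[task]))
    for task, score in performance.items()
}
for task, gap in gaps.items():
    label = "conservative" if gap > 0 else "over-trusting"
    print(f"{task}: gap = {gap:.1f} points ({label})")
```

A point on the perfect-calibration diagonal corresponds to a gap of zero; positive gaps place a task below the diagonal, matching the conservative orientation described above.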

5. Discussion

5.1. Technical Performance: Capabilities and Limitations

Within the smart-city context that frames this study, construction sites are temporary but data-rich components of the urban system. The reliable conversion of construction-site visual data into structured information is therefore important for sustainable urban infrastructure delivery, quality assurance, safety management, and future asset information. Accordingly, the findings are interpreted not only as task-level performance metrics, but also in terms of how dependably multimodal GenAI can support smart-city construction monitoring. The field-based evaluation of multimodal GenAI across 47 task-level assessments and 1186 construction site images reveals that current GenAI tools demonstrate functional but circumscribed capabilities for construction management applications. The overall mean performance score of 75.3 indicates that these systems can provide meaningful analytical support, yet fall short of the reliability threshold required for autonomous deployment in professional practice. These findings align with recent observations that, while AI technologies show considerable promise for construction applications, significant gaps remain between laboratory performance and field deployment readiness [62,63].

The most consequential finding is the clear performance hierarchy across task types. Construction Activity Identification (R1) achieved the highest performance (M = 83.5), followed by Progress Tracking (R2, M = 74.1), Safety Hazard Identification (R4, M = 73.1), and Execution Defect Detection (R3, M = 61.6). This 22-point differential between Activity Identification and Execution Defect Detection tasks reflects a fundamental distinction between classification-oriented reasoning and quality assessment reasoning. Activity identification relies primarily on recognizing visually distinctive elements such as equipment, materials, and spatial configurations. In contrast, defect detection requires fine-grained discrimination of workmanship quality, interpretation against construction standards, and inference from subtle visual cues that may not be fully observable in single images. These findings corroborate prior work indicating that general-purpose models perform well on categorical tasks but struggle with domain-specific quality judgments requiring specialized knowledge [64,65,66,67].

Activity-specific patterns (Section 4.2.1) confirm that high-performing activities share characteristics such as large-scale visual elements and favorable lighting, while low-performing activities involve confined spaces and contextual knowledge not directly observable in imagery. These patterns are consistent with findings from computer vision research in construction contexts, where environmental factors significantly affect recognition accuracy [25,68].

No statistically significant performance differences were observed between Gemini and ChatGPT across the evaluated tasks. However, because tool use was unbalanced across the dataset, this result should be interpreted cautiously and not as definitive model benchmarking. Rather, the findings suggest that the main performance limitations are associated more strongly with task type and construction-context complexity than with the specific commercial tool used. Accordingly, future improvements are likely to depend on domain-specific adaptation, enhanced prompting strategies, integration with structured project data, and multimodal context enrichment rather than simply switching between general-purpose commercial platforms.

5.2. Educational Outcomes: From General Adoption to Professional Literacy

The educational findings point to a central insight. The disparity between high general adoption and low engineering-specific use (see Section 4.3.1) indicates that the primary educational need is to close a professional literacy gap rather than an adoption gap: frequent GenAI users need help understanding what these tools can and cannot reliably support in engineering contexts and how to deploy them responsibly through structured verification. This aligns with broader findings that student engagement with AI is often driven by individual experimentation rather than structured instructional design [9,10].

The verification-first pedagogical framework demonstrated substantial effectiveness in addressing this gap. Across all four assessed competencies, students exhibited significant gains from pre- to post-course assessments. Understanding of AI contribution to civil engineering increased by 65.3%, while professional application skills improved by 56.4% (M = 3.30 to M = 5.16). Most importantly, validation importance awareness exhibited the strongest gain at 71.3%, providing quantitative evidence that the learning process successfully promoted a verification-first viewpoint. These outcomes support prior research demonstrating that scaffolded interventions can effectively strengthen critical thinking and reduce uncritical reliance on AI tools, even within relatively short instructional timeframes [16,17].
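The reported percentage gains appear to be relative changes with respect to the pre-course mean. As a brief check, the sketch below reproduces the professional application skills figure from the two means given in the text; the function name is hypothetical.

```python
def relative_gain_percent(pre_mean, post_mean):
    """Percentage change from the pre-course mean to the post-course mean."""
    return (post_mean - pre_mean) / pre_mean * 100


gain = relative_gain_percent(3.30, 5.16)
print(f"Professional application skills: {gain:.1f}% gain")
```

On a bounded six-point scale, relative gains of this kind depend strongly on the baseline, so they are best read together with the absolute change in means.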

Beyond mean-level improvements, reduced post-course variability indicates pedagogical convergence: students with diverse baseline competencies converged toward a shared standard of AI literacy, consistent with competency-based engineering education principles [12,69].

5.3. Workforce Efficiency and Sustainable Infrastructure Delivery

The findings have practical implications for workforce efficiency in construction management and infrastructure delivery, both of which are central to sustainable smart-city development. In a sector increasingly affected by shortages of skilled workers, site engineers, and experienced construction managers [70], GenAI-based image interpretation can support routine documentation, preliminary progress checks, safety screening, and initial quality observations. By reducing repetitive manual tasks, such tools may allow engineers and supervisors to focus on verification, decision-making, and intervention in areas requiring professional judgment.

AI-supported visual interpretation may also contribute to sustainable infrastructure delivery by improving the use of construction-site images as urban visual data. Remote analysis of site images can help prioritize inspection needs and may reduce repeated physical visits to the site. Earlier identification of potential safety, progress, or quality issues can also reduce rework, delays, material waste, and unnecessary transportation associated with repeated inspections or late-stage corrections. These implications are particularly relevant to SDG 11 and to smart-city agendas that promote resource-efficient, data-driven, and resilient urban development [2,71,72].

Nevertheless, these benefits depend on responsible implementation. Since GenAI was more reliable in descriptive and visually explicit tasks than in judgment-intensive tasks such as execution defect detection, it should be viewed as a tool for augmenting human supervision rather than replacing professional inspection. This aligns with the broader need to address ethical and practical challenges in AI-supported sustainability applications, particularly where automated outputs may influence safety, quality, and infrastructure performance.

5.4. Implications for Construction Practice and Engineering Education

The integration of technical and educational findings suggests practical implications for both construction deployment and engineering training. From an industry perspective, the observed task-complexity hierarchy supports a stratified deployment strategy. For activity identification and documentation, GenAI can serve as a first-pass analytical tool with moderate oversight. For progress monitoring and safety screening, outputs should be verified against temporal references, site conditions, and safety checklists. For execution defect detection and quality assessment, GenAI should be limited to preliminary screening, with professional judgment remaining authoritative.
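The stratified strategy above could be encoded as a simple lookup that a site-monitoring workflow consults before acting on a GenAI output. This is a hypothetical sketch: the class, keys, and field names are illustrative assumptions, and only the tier wording paraphrases the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OversightPolicy:
    role: str  # how the GenAI output may be used
    verification: tuple  # checks required before the output is relied upon


# Tier wording paraphrases the text; the data structure itself is hypothetical.
POLICIES = {
    "activity_identification": OversightPolicy(
        "first-pass analysis", ("moderate human oversight",)
    ),
    "progress_tracking": OversightPolicy(
        "verified support", ("temporal references", "site conditions")
    ),
    "safety_screening": OversightPolicy(
        "verified support", ("site conditions", "safety checklists")
    ),
    "defect_detection": OversightPolicy(
        "preliminary screening only", ("authoritative professional judgment",)
    ),
}


def policy_for(task):
    """Return the oversight policy for a task; unknown tasks are rejected."""
    try:
        return POLICIES[task]
    except KeyError:
        raise ValueError(f"No oversight policy defined for task: {task}") from None


print(policy_for("defect_detection"))
```

Rejecting unknown task types, rather than falling back to a permissive default, keeps the workflow verification-first when a new use case has not yet been assessed.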

A possible practical implication concerns the use of accessible GenAI tools by small and medium-sized construction enterprises (SMEs). Because general-purpose AI tools are relatively low cost and require limited technical infrastructure, they may help SMEs strengthen activity documentation, progress tracking, and preliminary safety or quality screening. However, such use should be accompanied by verification and control mechanisms, particularly given the resource and capability gaps that often exist between SMEs and large contractors [73,74,75].

Construction sites can be viewed as temporary sensing environments within the smart-city lifecycle. During infrastructure delivery, site images, inspection records, progress observations, safety documentation, and field reports generate valuable information on how urban assets are actually constructed. When structured, verified, and linked to project information systems, these data can support not only construction management but also later operation, maintenance, and asset management. From this perspective, multimodal GenAI is not only a tool for construction documentation, but also a means of transforming fragmented field observations into structured information for smart-city data ecosystems. Verified GenAI outputs can document activity status, visible site conditions, potential safety issues, and preliminary quality observations, thereby improving continuity between construction-phase monitoring and digital-twin-based operation. However, the findings also show that such integration must remain task-calibrated and verification-based. GenAI is more reliable for visually explicit tasks, such as activity identification, than for judgment-intensive tasks, such as defect detection. Therefore, construction-site sensing workflows should position GenAI as a human-in-the-loop interpretation layer, while professional engineers remain responsible for validation and decision-making.

As introduced in Section 2, Building Information Modeling (BIM) has become a digital backbone for project information, and integrating BIM with AI enables capabilities such as automated rule checking and as-built reconstruction. The integration of GenAI with BIM and digital twin environments also presents opportunities for more reliable construction-monitoring workflows. BIM can provide structured project information for design specifications, schedules, and quality benchmarks [23,76]. Linking GenAI-based image interpretation with BIM data could reduce reliance on single-image analysis by allowing AI outputs to be checked against structured project context, thereby improving both accuracy and verification efficiency. This direction is consistent with recent smart-city sensing research emphasizing that multimodal data integration can support intelligent urban decision-making, while also introducing challenges related to data alignment, scalability, modality-specific noise, privacy, and responsible deployment [77].

From an educational perspective, the verification-first framework offers a replicable model for responsible GenAI integration in engineering programs. Key design principles include authentic field-based tasks, explicit ground-truth comparison, uncertainty documentation, red-flag identification, and clear emphasis on professional accountability. This is particularly important because GenAI tools evolve faster than formal curriculum structures, creating a risk that students adopt powerful systems before domain-specific verification habits are established [11,78]. Verification-first training can therefore be embedded within existing courses as a practical bridge between immediate AI adoption and longer-term curricular reform.

5.5. Limitations and Future Research

Several limitations should be considered when interpreting the findings. First, the study was designed as a field-based evaluation of realistic GenAI use in a construction-management seminar rather than as a controlled benchmark of commercial multimodal models. Tool selection was left to the research groups, reflecting authentic self-directed adoption of widely available GenAI assistants. Accordingly, the same image and prompt were not systematically tested across Gemini, ChatGPT, and Microsoft Copilot. This limits model-to-model generalization, and the tool-level findings should be interpreted as evidence of practical task-level feasibility and recurring failure modes rather than as a definitive ranking of platforms. Future work should include balanced cross-tool benchmarking using the same image set, prompt templates, and scoring procedure across all selected tools.

Second, the task distribution was not fully balanced. The number of task-level evaluations differed across activity types and construction-management tasks, with relatively few evaluations for safety hazard identification and several exploratory tasks. This imbalance reflects the field-based structure of the study, where each group worked on activities available at its selected construction site. Consequently, statistical comparisons should be interpreted cautiously, with emphasis placed on observed task-level patterns, effect sizes, and the descriptive performance hierarchy rather than definitive population-level estimates. To confirm that the principal task-type effect was not an artifact of this imbalance, the omnibus comparison was repeated using Welch’s ANOVA and a sensitivity analysis excluding the smallest cell; the effect remained significant in both cases. Future studies should adopt more balanced sampling across activity identification, progress tracking, defect detection, and safety hazard identification.

Third, the study did not employ a fully independent blinded annotation scheme or calculate formal inter-rater reliability metrics. The evaluation is a verification task in which each score reflects the agreement between an AI output and an engineering ground-truth reference that the group established from first-hand site observation; the evaluator therefore cannot be blinded to that reference, which is the standard against which the output is judged. Scoring followed a pre-specified, anchored rubric (Table 2) fixed before evaluation and applied uniformly, so criteria were not generated case by case. The conclusions are therefore limited to broad, directional findings regarding task-level feasibility, recurring failure modes, and differences between visually explicit and judgment-intensive tasks, rather than fine-grained score differences that would require stronger inter-rater validation. To mitigate potential bias, the reference assessments were supported by site-professional consultation, instructor audit, and evidence checks. Nevertheless, future benchmarking studies should incorporate independent blinded expert annotation, multiple raters per image, and formal agreement metrics.

Fourth, construction-site images represent snapshots of dynamic construction processes. A single photograph may not capture prior work sequences, temporary constraints, hidden elements, future planned steps, or contextual information needed for complete engineering interpretation. This limitation affects both AI systems and human evaluators, although experienced professionals may compensate through site knowledge and construction logic. The issue is therefore not only technical but also epistemic: the image provides partial evidence of a broader construction process. Future research should examine temporally aware, multi-image, video-based, or sequence-based evaluation, supported by BIM models, schedules, inspection records, sensor data, and site documentation.

Finally, the findings are time-stamped and context-dependent because multimodal GenAI systems are updated frequently and their backend configurations are not always transparent. The more durable contribution of this study is therefore the verification-first evaluation protocol. Future research should examine updated models, domain-adapted systems, retrieval-augmented workflows, structured prompting, BIM/digital-twin integration, and longitudinal educational designs to assess whether trust calibration and verification-first behavior persist as students transition into professional practice. In particular, a systematic prompt-sensitivity analysis was not undertaken in this study and is identified as a priority for future controlled evaluation.

6. Conclusions

This paper presented a field-based evaluation of multimodal GenAI for construction and urban infrastructure monitoring using 1186 images collected from 17 active construction sites. The study assessed GenAI performance across four main construction-management tasks: construction activity identification, progress tracking, execution defect detection, and safety hazard identification. In parallel, the study examined how a verification-first human-in-the-loop workflow can support responsible interpretation of AI-generated outputs by civil engineering students.

The findings show that multimodal GenAI can provide meaningful support for visual documentation and preliminary site interpretation, but its reliability is strongly task-dependent. Activity identification achieved the strongest performance, indicating that current GenAI tools can generally recognize visually explicit construction activities, materials, and equipment. Progress tracking and safety hazard identification showed moderate performance, reflecting the need for contextual information and professional verification. Execution defect detection was the most challenging task, highlighting the limitations of general-purpose GenAI when quality assessment requires subtle visual discrimination, construction standards, and engineering judgment.

These results suggest that GenAI deployment in construction monitoring should be task-calibrated. For activity identification and routine documentation, GenAI may serve as a first-pass analytical tool. For progress assessment and safety screening, AI outputs should be verified against site context, temporal references, and professional checklists. For execution defect detection and quality assessment, GenAI should be limited to preliminary screening, with final judgment remaining the responsibility of qualified professionals.

The educational findings provide supporting evidence that verification-first workflows can strengthen responsible GenAI use. Students entered the seminar as frequent but mostly general GenAI users, while the post-course results showed improved domain-specific AI literacy, professional application skills, and awareness of validation requirements. This indicates that structured field-based training can help future engineers treat AI outputs as preliminary interpretations rather than final conclusions.

Overall, the central contribution of this study is the empirical demonstration that multimodal GenAI has practical but bounded value for construction-site monitoring within the smart-city infrastructure lifecycle. Its effective use requires task-specific validation, human verification, and professional accountability. By combining real construction-site imagery with a verification-first evaluation workflow, the study provides both practical evidence for current GenAI capabilities and a replicable framework for responsible adoption in construction management practice and education.

Author Contributions

Conceptualization, A.U., E.H. and A.M.; methodology, A.U. and A.M.; investigation, A.M. and A.U.; writing—original draft preparation, A.U.; writing—review and editing, A.M., E.H. and A.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT 5 for the purpose of improving the style of writing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AAC	Autoclaved Aerated Concrete
AI	Artificial Intelligence
ANOVA	Analysis of Variance
BIM	Building Information Modeling
CE	Civil Engineering
CI	Confidence Interval
CM	Construction Management
CV	Computer Vision
GenAI	Generative Artificial Intelligence
LLM	Large Language Model
M	Mean
MEP	Mechanical, Electrical, and Plumbing
n	Sample Size
NLP	Natural Language Processing
PPE	Personal Protective Equipment
RQ	Research Question
SD	Standard Deviation
SME	Small and Medium-sized Enterprise

Appendix A. Questionnaire

Table A1. Questionnaire structure for assessing AI use, AI literacy, image-based trust, and verification-first competencies.

Questions

Background and baseline use (pre)

Prior to the seminar, for what primary purposes did you use AI tools? [Multiple choice]

How frequently did you use AI tools prior to the seminar? [Single choice]

AI literacy for civil engineering (pre/post) [6-point Likert (agreement)]

Prior to the seminar: I understood how AI tools can support civil engineering and construction management, with an emphasis on image-based applications.

After the seminar: I understand how AI tools can support civil engineering and construction management, with an emphasis on image-based applications.

Prior to the seminar: I was able to identify situations in which AI tools may be misleading or may generate inaccurate information.

After the seminar: I am able to assess the reliability of AI outputs and recognize specific warning signs associated with misleading information.

Prior to the seminar: I rated my proficiency in using AI tools for engineering problem-solving as high.

After the seminar: I know how to apply AI tools in a professional and controlled manner to solve complex engineering problems.

Prior to the seminar: I tended to rely on AI outputs without a clearly defined verification procedure.

After the seminar: I understand when professional verification is required and the importance of systematic quality control of AI outputs.

Trust in image-based AI capabilities (Post) [6-point Likert (agreement)]

I trust AI tools to identify the type of construction activity being performed on site based on images.

I trust AI tools to assess the progress stage of on-site work based on images.

I trust AI tools to identify visible workmanship defects on a construction site based on images.

I trust AI tools to identify visible safety hazards on a construction site based on images.

I trust AI tools to support engineering decision-making based on image analysis, provided that professional verification is performed.

Verification-first competencies and responsible use (Post) [6-point Likert (agreement)]

I understand when the use of ground truth is required and its importance for validating and controlling AI-generated outputs.

I understand when supplementary data are required and the importance of selecting appropriate supporting information to justify an engineering conclusion derived from AI-assisted analysis.

I can formulate prompts that reduce uncertainty and elicit outputs that are amenable to verification and quality control.

I can identify “red flags” and indicators of unreliable output, such as logical inconsistencies or internal contradictions.

I understand that different models may produce different results for the same task and recognize the importance of cross-tool comparison to obtain a more reliable output.

Following practice with field images, I can identify how physical factors (e.g., lighting, viewpoint) influence output reliability.

I understand when independent professional judgment is required and the critical importance of my professional accountability as an engineer for any AI-assisted output.

I feel confident in my ability to integrate AI tools in an ethical and appropriate manner, and I understand how to ensure transparency and professionalism throughout the process.

Note: This questionnaire is an English translation of the original Hebrew version. Wording was translated for clarity and academic reporting; item meaning and response formats were preserved.

Appendix B. A Representative GenAI Evaluation and the Prompt Used

Appendix B presents a representative example of the GenAI image-based evaluation table, including the associated prompt used to classify construction activities from site imagery.

Figure A1. Representative examples from the activity identification dataset, including activity labels, confidence scores, visual evidence, detected elements, and corresponding site images.

Figure A2. Associated prompt used to classify construction activities from site imagery.

Appendix C. Summary of Data-Collection, GenAI Execution, and Evaluation Guidelines

The following general guidelines were provided to all student groups to support consistency in image acquisition, GenAI execution, and verification of results.

C.1 Image-acquisition guidelines

Each group selected a construction activity that was visually observable and relevant to construction-management practice.
Images were collected from active construction sites, after coordination with the site manager, site engineer, or another responsible professional where required.
Students were instructed to photograph each activity from multiple viewpoints and working distances, including both general contextual views and closer task-specific views.
Images were to be captured under adequate lighting, focus, and framing conditions whenever possible.
Students were instructed to minimize occlusions, avoid unnecessary clutter in the frame where feasible, and ensure that the relevant construction activity or component was visible.
Where possible, repeated site visits were conducted to capture distinct execution stages and temporal variation in the same activity.
Images were organized by site visit, construction activity, and execution stage to support traceability and later comparison.
Students were instructed to follow site safety instructions and avoid photographing sensitive personal information or identifiable workers unless permission and ethical conditions were satisfied.

C.2 Required metadata for each image

For each image, students recorded structured metadata to support verification and reproducibility. The metadata included:

Image identifier or file name.
Date and time of image capture.
Construction site and location within the site, such as floor, room, area, or zone.
Construction activity or activity code.
Observed execution stage.
Camera/source type, such as mobile phone, camera, drone, or fixed camera where applicable.
Viewpoint, distance, and camera angle where relevant.
Lighting conditions and visible environmental constraints.
Image resolution or image quality notes.
Additional contextual information obtained during the site visit, including materials, equipment, workmanship conditions, or site constraints where relevant.
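
The metadata fields listed above could be captured as one structured record per image. The following is a minimal Python sketch; the class name `ImageMetadata`, the field names, and the sample values are illustrative assumptions and not the schema used in the study.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class ImageMetadata:
    """One record per site image; all field names are hypothetical."""
    image_id: str                 # image identifier or file name
    captured_at: datetime         # date and time of image capture
    site: str                     # construction site
    location: str                 # floor, room, area, or zone
    activity: str                 # construction activity or activity code
    execution_stage: str          # observed execution stage
    source_type: str              # mobile phone, camera, drone, fixed camera
    viewpoint: Optional[str] = None       # viewpoint, distance, camera angle
    lighting: Optional[str] = None        # lighting and environmental constraints
    quality_notes: Optional[str] = None   # resolution or image quality notes
    context_notes: Optional[str] = None   # materials, equipment, site constraints

    def missing_fields(self) -> list[str]:
        """Return the names of optional fields that were left empty."""
        return [name for name, value in asdict(self).items() if value is None]


# Made-up example values, for illustration only.
record = ImageMetadata(
    image_id="IMG_0001.jpg",
    captured_at=datetime(2025, 1, 15, 10, 30),
    site="Site A",
    location="Floor 3, Zone B",
    activity="Flooring works (tiling)",
    execution_stage="Adhesive application",
    source_type="mobile phone",
    lighting="indoor, artificial light",
)
print(record.missing_fields())
```

Keeping the "where relevant" items optional mirrors the wording of the list, and a helper such as `missing_fields` makes gaps visible before verification starts.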

C.3 GenAI execution guidelines

Each group selected one or more multimodal GenAI tools according to the research question.
The same prompt structure was used consistently across the images evaluated within each task.
Prompts were designed to request task-specific outputs, such as activity identification, progress assessment, defect detection, or safety hazard identification.
Students were instructed to ask the model to base its answer only on visible evidence and to avoid unsupported assumptions.
Where applicable, prompts requested a structured output including the identified activity or condition, a confidence statement, a short explanation, and the visible evidence supporting the answer.
The GenAI output was copied into a structured results table without post-editing the model response.
When a model provided a confidence score or explanatory reasoning, this information was recorded as part of the evaluation record.
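
A consistent prompt structure of the kind described above could be generated from a single template. The sketch below is illustrative only: the wording is hypothetical and is not the prompt shown in Figure A2, and the function name `build_prompt`, the task keys, and the output field names are assumptions.

```python
# Hypothetical task instructions; the study's actual prompts are not reproduced here.
TASK_INSTRUCTIONS = {
    "activity": "Identify the construction activity shown in the image.",
    "progress": "Assess the execution stage of the work shown in the image.",
    "defect": "List any visible workmanship defects in the image.",
    "safety": "List any visible safety hazards in the image.",
}

# Assumed names for the structured output fields requested from the model.
OUTPUT_FIELDS = ["finding", "confidence", "explanation", "visible_evidence"]


def build_prompt(task: str) -> str:
    """Assemble the same prompt structure for every image within a task."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"Unknown task: {task!r}")
    lines = [
        TASK_INSTRUCTIONS[task],
        "Base your answer only on what is visible in the image.",
        "Do not make assumptions that the image does not support.",
        "Respond with the following fields: " + ", ".join(OUTPUT_FIELDS) + ".",
    ]
    return "\n".join(lines)


print(build_prompt("activity"))
```

Generating the prompt from one template keeps the wording identical across images, so differences in output can be attributed to the image rather than to prompt drift.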

C.4 Verification and ground-truth guidelines

For each image or task, students defined an expected reference assessment based on direct site observations, recorded metadata, construction logic, and relevant professional requirements where applicable.
Key assumptions were verified with the site engineer, site manager, or another responsible construction-site professional during or shortly after the site visit where possible.
AI outputs were compared with the reference assessment to determine whether the output was correct, partially correct, unsupported, or incorrect.
Students checked whether AI claims were supported by visible evidence in the image.
Students documented missing visual evidence, unsupported inferences, contradictions, or overconfident statements.
Where multiple images from the same activity or visit were available, students used cross-image comparison to support verification.
Ambiguous cases were discussed with the course instructor, who served as an academic reviewer and adjudicator where needed.
The final evaluation treated GenAI outputs as preliminary decision-support information, rather than final engineering conclusions.

C.5 Results documentation and qualitative analysis

All model outputs and evaluation results were recorded in a structured spreadsheet.
The results table included, at minimum, the image name, task type, GenAI tool, model output, confidence statement where available, short explanation, reference assessment, verification note, and evaluation score.
Students marked success or failure for each image or task according to the defined scoring criteria.
Students calculated summary performance measures, such as success rate or average score, according to the task design.
Students provided qualitative analysis of successful and unsuccessful cases.
Error analysis considered factors such as lighting, viewpoint, camera angle, occlusion, image resolution, visual similarity between construction stages, missing contextual information, and the need for professional judgment.
Students were asked to discuss the conditions under which the model performed better or worse and to propose improvements for future field-based GenAI evaluations.
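
The summary measures mentioned above (success rate and average score) could be computed from the results table along the following lines. This is a minimal sketch; the row fields, the sample values, and the success threshold of 70 are assumptions for illustration, since the scoring criteria were defined per task.

```python
from statistics import mean

# Made-up rows standing in for the structured results spreadsheet.
results = [
    {"image": "img_01", "task": "activity", "tool": "Gemini", "score": 92},
    {"image": "img_02", "task": "activity", "tool": "Gemini", "score": 75},
    {"image": "img_03", "task": "activity", "tool": "Gemini", "score": 40},
    {"image": "img_04", "task": "activity", "tool": "Gemini", "score": 65},
]

SUCCESS_THRESHOLD = 70  # assumed cut-off, not taken from the study


def summarize(rows, threshold=SUCCESS_THRESHOLD):
    """Return the count, success rate, and average score for result rows."""
    scores = [row["score"] for row in rows]
    successes = sum(score >= threshold for score in scores)
    return {
        "n": len(scores),
        "success_rate": successes / len(scores),
        "average_score": mean(scores),
    }


summary = summarize(results)
print(summary)
```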

References

1. Leal Filho, W.; Mbah, M.F.; Dinis, M.A.P.; Trevisan, L.V.; de Lange, D.; Mishra, A.; Rebelatto, B.; Ben Hassen, T.; Aina, Y.A. The Role of Artificial Intelligence in the Implementation of the UN Sustainable Development Goal 11: Fostering Sustainable Cities and Communities. Cities 2024, 150, 105021.
2. Kochskämper, E.; Glass, L.M.; Haupt, W.; Malekpour, S.; Grainger-Brown, J. Resilience and the Sustainable Development Goals: A Scrutiny of Urban Strategies in the 100 Resilient Cities Initiative. J. Environ. Plan. Manag. 2025, 68, 1691–1717.
3. Braun, A.; Borrmann, A. Combining Inverse Photogrammetry and BIM for Automated Labeling of Construction Site Images for Machine Learning. Autom. Constr. 2019, 106, 102879.
4. Pal, A.; Lin, J.J.; Hsieh, S.H.; Golparvar-Fard, M. Activity-Level Construction Progress Monitoring through Semantic Segmentation of 3D-Informed Orthographic Images. Autom. Constr. 2024, 157, 105157.
5. Choi, S.M.; Cha, H.S.; Jiang, S. Hybrid Data Augmentation for Enhanced Crack Detection in Building Construction. Buildings 2024, 14, 1929.
6. Kim, H.; Yi, J.S. Image Generation of Hazardous Situations in Construction Sites Using Text-to-Image Generative Model for Training Deep Neural Networks. Autom. Constr. 2024, 166, 105615.
7. Xu, W.; Yi, W.; Tan, Y. Generative AI-Driven Data Augmentation and Object-Guided Vision-Language Reasoning for PPE Compliance Analysis in Work-at-Height. Adv. Eng. Inform. 2026, 71, 104364.
8. Taiwo, R.; Bello, I.T.; Abdulai, S.F.; Yussif, A.M.; Salami, B.A.; Saka, A.; Ben Seghier, M.E.A.; Zayed, T. Generative Artificial Intelligence in Construction: A Delphi Approach, Framework, and Case Study. Alex. Eng. J. 2025, 116, 672–698.
9. Capogna, S.; Pellegrini, S.; Sebastiani, R. Transition and Artificial Intelligence: The Case of Student Professionalisation. Eur. J. Educ. 2025, 60, e12866.
10. Medina-Gual, L.; Parejo, J. Perceptions and Use of AI in Higher Education Students: Impact on Teaching, Learning, and Ethical Considerations. Eur. J. Educ. 2025, 60, e12919.
11. Vettori, O.; Warm, J. The Race for AI Skills as an Obstacle Course: Institutional Challenges and Low Threshold Suggestions. Proj. Leadersh. Soc. 2025, 6, 100183.
12. Zou, X.; Su, P.; Li, L.; Fu, P. AI-Generated Content Tools and Students’ Critical Thinking: Insights from a Chinese University. IFLA J. 2024, 50, 228–241.
13. Feuerriegel, S.; Hartmann, J.; Janiesch, C.; Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 2024, 66, 111–126.
14. Banh, L.; Strobel, G. Generative Artificial Intelligence. Electron. Mark. 2023, 33, 63.
15. Monteith, S.; Glenn, T.; Geddes, J.R.; Whybrow, P.C.; Achtyes, E.; Bauer, M. Artificial Intelligence and Increasing Misinformation. Br. J. Psychiatry 2024, 224, 33–35.
16. Ngo, C.-L.; Nguyen, T.T.; Nguyen, K.N.H. Critical Thinking in the Age of Generative AI: Effects of a Short-Term Experiential Learning Intervention on EFL Learners. Int. J. TESOL Stud. 2025, 250522, 1–21.
17. Znamenskiy, V.; Niyazov, R.; Hernandez, J. Integrating Universal Generative AI Platforms in Educational Labs to Foster Critical Thinking and Digital Literacy. Int. J. Cybern. Inform. 2025, 14, 14–25.
18. Pan, Y.; Zhang, L. Roles of Artificial Intelligence in Construction Engineering and Management: A Critical Review and Future Trends. Autom. Constr. 2021, 122, 103517.
19. Abioye, S.O.; Oyedele, L.O.; Akanbi, L.; Ajayi, A.; Davila Delgado, J.M.; Bilal, M.; Akinade, O.O.; Ahmed, A. Artificial Intelligence in the Construction Industry: A Review of Present Status, Opportunities and Future Challenges. J. Build. Eng. 2021, 44, 103299.
20. Baduge, S.K.; Thilakarathna, S.; Perera, J.S.; Arashpour, M.; Sharafi, P.; Teodosio, B.; Shringi, A.; Mendis, P. Artificial Intelligence and Smart Vision for Building and Construction 4.0: Machine and Deep Learning Methods and Applications. Autom. Constr. 2022, 141, 104440.
21. Urlainis, A.; Giat, Y.; Mitelman, A. Harnessing Large Language Models for Digital Building Logbook Implementation. Buildings 2025, 15, 3399.
22. Heidari, A.; Peyvastehgar, Y.; Amanzadegan, M. A Systematic Review of the BIM in Construction: From Smart Building Management to Interoperability of BIM & AI. Archit. Sci. Rev. 2024, 67, 237–254.
23. Urlainis, A.; Mitelman, A. Implementation of a BIM Workflow for Building Permit Coordination in Urban Metro Projects. J. Inf. Technol. Constr. 2025, 30, 319–334.
24. Fang, W.; Ding, L.; Zhong, B.; Love, P.E.D.; Luo, H. Automated Detection of Workers and Heavy Equipment on Construction Sites: A Convolutional Neural Network Approach. Adv. Eng. Inform. 2018, 37, 139–149.
25. Khallaf, R.; Khallaf, M. Classification and Analysis of Deep Learning Applications in Construction: A Systematic Literature Review. Autom. Constr. 2021, 129, 103760.
26. Pal, A.; Hsieh, S.H. Deep-Learning-Based Visual Data Analytics for Smart Construction Management. Autom. Constr. 2021, 131, 103892.
27. Regona, M.; Yigitcanlar, T.; Xia, B.; Li, R.Y.M. Opportunities and Adoption Challenges of AI in the Construction Industry: A PRISMA Review. J. Open Innov. Technol. Mark. Complex. 2022, 8, 45.
28. Singh, A.; Dwivedi, A.; Agrawal, D.; Singh, D. Identifying Issues in Adoption of AI Practices in Construction Supply Chains: Towards Managing Sustainability. Oper. Manag. Res. 2023, 16, 1667–1683.
29. Weber-Lewerenz, B. Corporate Digital Responsibility (CDR) in Construction Engineering—Ethical Guidelines for the Application of Digital Transformation and Artificial Intelligence (AI) in User Practice. SN Appl. Sci. 2021, 3, 801.
30. Prieto, S.A.; Mengiste, E.T.; García de Soto, B. Investigating the Use of ChatGPT for the Scheduling of Construction Projects. Buildings 2023, 13, 857.
31. Ghimire, P.; Kim, K.; Acharya, M. Opportunities and Challenges of Generative AI in Construction Industry: Focusing on Adoption of Text-Based Models. Buildings 2024, 14, 220.
32. Alwashah, Z.; Xiao, B.; Liu, H.; Mueller, S.T.; Shao, X. Generative Artificial Intelligence for Construction: Use Cases, Trends, Challenges, and Opportunities. J. Build. Eng. 2025, 112, 113802.
33. Ding, Y.; Zhang, M.; Pan, J.; Hu, J.; Luo, X. Robust Object Detection in Extreme Construction Conditions. Autom. Constr. 2024, 165, 105487.
34. Mannem, K.R.; Prieto, S.A.; de Soto, B.G.; Bacao, F. Weighted Adaptive Active Transfer Learning for Imbalanced Multi-Object Classification in Construction Site Imagery. Autom. Constr. 2025, 176, 106297.
35. Meng, Q.; Wang, S.; Zhu, S. Semi-Supervised Deep Learning for Recognizing Construction Activity Types from Vibration Monitoring Data. Autom. Constr. 2023, 152, 104910.
36. Pal, A.; Lin, J.J.; Hsieh, S.-H.; Golparvar-Fard, M. Automated Vision-Based Construction Progress Monitoring in Built Environment through Digital Twin. Dev. Built Environ. 2023, 16, 100247.
37. Pour Rahimian, F.; Seyedzadeh, S.; Oliver, S.; Rodriguez, S.; Dawood, N. On-Demand Monitoring of Construction Projects through a Game-like Hybrid Application of BIM and Machine Learning. Autom. Constr. 2020, 110, 103012.
38. Lin, Z.-H.; Chen, A.Y.; Hsieh, S.-H. Temporal Image Analytics for Abnormal Construction Activity Identification. Autom. Constr. 2021, 124, 103572.
39. Ashurov, A.; Qu, H. An Efficient Industrial Defect Detection Based on Hybrid Residual Attention with Modified Generative Adversarial Network and Convolutional Neural Network Model. Comput. Electr. Eng. 2025, 127, 110580.
40. García-Pérez, A.; Gómez-Silva, M.J.; de la Escalera-Hueso, A. Improving Automatic Defect Recognition on GDXRay Castings Dataset by Introducing GenAI Synthetic Training Data. NDT E Int. 2025, 151, 103303.
41. Cui, D.; Xu, S.; Wang, S.; Zhang, K. Beyond the Images: Comprehensible Unsafe Behaviour Recognition Boosted by Joint Inference Graph with Multi-Hop Reasoning. Adv. Eng. Inform. 2025, 66, 103454.
42. Mei, X.; Xu, F.; Zhang, Z.; Tao, Y. Unsafe Behavior Identification on Construction Sites by Combining Computer Vision and Knowledge Graph–Based Reasoning. Eng. Constr. Archit. Manag. 2024, 32, 8360–8389.
43. Sivanraj, S.; Uduwage, D.N.L.S.; Tripathi, M. Real-Time Safety Detection on Construction Sites Using a Vision-Language and NLP-Based Model. Adv. Eng. Inform. 2026, 69, 103889.
44. Shao, C.; Zheng, S.; Xu, Y.; Gu, H.; Qin, X.; Hu, Y. A Visualization System for Dam Safety Monitoring with Application of Digital Twin Platform. Expert Syst. Appl. 2025, 271, 126740.
45. Martin, H.; James, J.; Chadee, A. Exploring Large Language Model AI Tools in Construction Project Risk Assessment: Chat GPT Limitations in Risk Identification, Mitigation Strategies, and User Experience. J. Constr. Eng. Manag. 2025, 151, 04025119.
46. Lee, J.; Jang, K.; Sparkling, A.E.; Kang, K. Large Language Model–Based Construction Site Management for Severe Weather Preparedness. J. Constr. Eng. Manag. 2026, 152, 04025220.
47. Ersoz, A.B. Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring. In Proceedings of the 8th International Project and Construction Management Conference (IPCMC2024), Istanbul, Turkey, 6–8 June 2024.
48. Chen, Q.; Yin, X. Tailored Vision-Language Framework for Automated Hazard Identification and Report Generation in Construction Sites. Adv. Eng. Inform. 2025, 66, 103478.
49. Zhang, M.; Shi, R.; Yang, Z. A Critical Review of Vision-Based Occupational Health and Safety Monitoring of Construction Site Workers. Saf. Sci. 2020, 126, 104658.
50. Song, S.; Hong, J.; Jeoung, J.; Ahn, J.; Hong, T. Data-Centric Enhancement of Site-Specific Automated Construction Equipment Detection in Wide-Angle Site Images. Autom. Constr. 2025, 179, 106483.
51. Ekanayake, B.; Wong, J.K.W.; Fini, A.A.F.; Smith, P.; Thengane, V. Deep Learning-Based Computer Vision in Project Management: Automating Indoor Construction Progress Monitoring. Proj. Leadersh. Soc. 2024, 5, 100149.
52. Jiang, F.; Ma, J.; Jin, Y. Unleashing the Potential of Large Language Models in Urban Data Analytics: A Review of Emerging Innovations and Future Research. Smart Cities 2025, 8, 201.
53. Zhan, H.; Hwang, B.-G.; Yin, J.C.X. Is AI the Solution or the Problem? Empirical Evidence on Risk Evolution in Construction Project Management. J. Manag. Eng. 2026, 42, 04025064.
54. Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, Á. Human-in-the-Loop Machine Learning: A State of the Art. Artif. Intell. Rev. 2023, 56, 3005–3054.
55. Hu, M.; Gao, H.; Mi, Q.; Wu, B.; Lu, J.; Liu, Y. Bridging the Information Gap in Smart Construction: An LLM-Based Assistant for Autonomous TBM Tunneling. Smart Cities 2025, 8, 212.
56. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
57. Cohen, J. Statistical Power Analysis for the Behavioral Sciences; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988.
58. Zimmerman, D.W. A Note on Preliminary Tests of Equality of Variances. Br. J. Math. Stat. Psychol. 2004, 57, 173–181.
59. Delacre, M.; Lakens, D.; Leys, C. Why Psychologists Should by Default Use Welch’s t-Test Instead of Student’s t-Test. Int. Rev. Soc. Psychol. 2017, 30, 92–101.
60. Bach, T.A.; Khan, A.; Hallock, H.; Beltrão, G.; Sousa, S. A Systematic Literature Review of User Trust in AI-Enabled Systems: An HCI Perspective. Int. J. Hum. Comput. Interact. 2024, 40, 1251–1266.
61. Emaminejad, N.; Kath, L.; Akhavian, R. Assessing Trust in Construction AI-Powered Collaborative Robots Using Structural Equation Modeling. J. Comput. Civ. Eng. 2024, 38, 04024011.
62. Emedo, C.; Wada, O.Z.; Clement David-Olawade, A.; Ling, J.; Esan, D.T.; Ijiwade, J.; Olawade, D.B. AI-Driven Transformations in Smart Buildings: A Review of Energy Efficiency and Sustainable Operations. Digit. Eng. 2025, 7, 100068.
63. Yusuf, B.O.; Aliyu, M.; Azeez, M.O.; Taialla, O.A.; Lateef, S.; Sulaimon, R.; Akinpelu, A.A.; Ganiyu, S.A. Comprehensive Technologies for Heavy Metal Remediation: Adsorption, Membrane Processes, Photocatalysis, and AI-Driven Design. Desalination 2025, 615, 119261.
64. Vilakati, S. Prompt Engineering for Accurate Statistical Reasoning with Large Language Models in Medical Research. Front. Artif. Intell. 2025, 8, 1658316.
65. Wang, P.; Karigiannis, J.; Gao, R.X. Ontology-Integrated Tuning of Large Language Model for Intelligent Maintenance. CIRP Ann. 2024, 73, 361–364.
66. Kim, D.-K. Artifact Validity under Varying Agent Configurations in LLM-Assisted Software Development: A Comparative Analysis. Inf. Softw. Technol. 2026, 192, 108022.
67. Arafin, P.; Billah, A.M.; Issa, A. Deep Learning-Based Concrete Defects Classification and Detection Using Semantic Segmentation. Struct. Health Monit. 2024, 23, 383–409.
68. Kim, J.; Gong, Y.; Chi, S.; Kim, J.I.; Seo, J. Towards Transparent Object Detection Models for Construction Sites: Explainable AI and Error Classification. Adv. Eng. Inform. 2026, 71, 104245.
69. Zhao, T.; Lin, X.; Na, R. Integrating AI in Construction Estimation Education: A Comparative Study of Togal AI and Bluebeam Revu 20. Eur. J. Educ. 2025, 60, e70287.
70. Elbashbishy, T.; El-adaway, I.H. Skilled Worker Shortage across Key Labor-Intensive Construction Trades in Union versus Nonunion Environments. J. Manag. Eng. 2024, 40, 04023063.
71. Almulhim, A.I.; Yigitcanlar, T. Understanding Smart Governance of Sustainable Cities: A Review and Multidimensional Framework. Smart Cities 2025, 8, 113.
72. Mishra, P.; Singh, G. Smart Governance in Sustainable Smart Cities. In Sustainable Smart Cities 2.0: Enabling Research Toward SDG 11; Springer: Cham, Switzerland, 2025.
73. Kima, O.; Urlainis, A.; Wang, K.-C.; Shohet, I.M. Safety Climate in Small and Medium Construction Enterprises. Smart Sustain. Built Environ. 2024, 15, 508–532.
74. Adzivor, E.K.; Emuze, F.; Das, D.K. Indicators for Safety Culture in SME Construction Firms: A Delphi Study in Ghana. J. Financ. Manag. Prop. Constr. 2023, 28, 293–316.
75. Bachar, R.; Urlainis, A.; Wang, K.-C.; Shohet, I.M. Optimal Allocation of Safety Resources in Small and Medium Construction Enterprises. Saf. Sci. 2025, 181, 106680.
76. Guo, M.; Wang, Y.; Wang, G.; Zhang, X. Intelligent Quality Inspection of Modular Steel Ring Trusses Based on BIM and Large-Scale TLS Point Cloud. Measurement 2026, 264, 120137.
77. Sadiq, T.; Omlin, C.W. Sensing in Smart Cities: A Multimodal Machine Learning Perspective. Smart Cities 2025, 9, 3.
78. Gruenhagen, J.H.; Sinclair, P.M.; Carroll, J.-A.; Baker, P.R.A.; Wilson, A.; Demant, D. The Rapid Rise of Generative AI and Its Implications for Academic Integrity: Students’ Perceptions and Use of Chatbots for Assistance with Assessments. Comput. Educ. Artif. Intell. 2024, 7, 100273.

Figure 1. Growth trend of AI-related publications in construction engineering and construction management between 2010 and 2025. The search was conducted in June 2026.

Figure 2. Integrated technical and educational workflow for verification-first evaluation of multimodal GenAI in construction management using real construction site imagery.

Figure 3. Representative examples from the construction site images, illustrating diverse site conditions and construction stages, including structural works and interior works.

Figure 4. Study workflow and evaluation pipeline.

Figure 5. Performance by construction activity. Values indicate mean performance scores on the 0–100 evaluation scale. The value of n denotes the number of task-level evaluations contributing to each construction activity.

Figure 6. AI performance by task type. Bars represent mean performance scores (0–100 scale); error bars indicate 95% confidence intervals. Dashed red line denotes overall mean (M = 74.0). The value of n denotes the number of task-level evaluations per task.

Figure 7. Average performance score by task (R1–R4): Gemini vs. ChatGPT.

Figure 8. AI usage patterns before the course: (a) frequency (84% weekly users) and (b) types (74% general vs. 26% engineering-specific), revealing a literacy gap. N = 31; multiple selections were allowed in panel (b).

Figure 9. Pre–post changes in AI literacy competencies. Top row: distribution of individual responses showing convergence toward higher ratings; dashed lines indicate the mean values. Bottom row: mean scores (±SD) with percentage improvement. All changes are significant at p < 0.001 (N = 31).

Figure 10. Trust calibration analysis. (a) Student trust compared to actual AI performance across CM tasks. (b) Trust–performance relationship.

Table 1. Recent AI-related studies in construction management and adjacent domains, categorized by the primary capabilities employed and the main task addressed.

CM Task	Data & AI Approach	Study
Object/Activity identification	Field Data; CV/Deep Learning	Ding et al. (2024) [33]
		Mannem et al. (2025) [34]
		Meng et al. (2023) [35]
Progress monitoring	BIM; Digital Twin (concept); CV; Machine Learning; VR; Field Data; CV/Deep Learning	Pal et al. (2023) [36]
		Pour Rahimian et al. (2020) [37]
		Lin et al. (2021) [38]
Scheduling/planning	GenAI (ChatGPT); Text-based	Prieto et al. (2023) [30]
Defect detection	GenAI (GAN); Synthetic Data; CV/Deep Learning	Ashurov and Qu (2025) [39]
Defect detection	GenAI (GAN); Synthetic Data; CV/Deep Learning	García-Pérez et al. (2025) [40]
Safety hazard detection	GenAI (Text-to-Image); Synthetic Data; CV/Deep Learning; Field Data; CV/Deep Learning; Knowledge Graph; Natural Language Processing (NLP)	Kim and Yi (2024) [6]
		Cui et al. (2025) [41]
		Mei et al. (2024) [42]
		Sivanraj et al. (2026) [43]
Structural health monitoring	Digital Twin; Field; AI; Deep Learning	Shao et al. (2025) [44]
Structural health monitoring	Digital Twin; Field; AI; Deep Learning	Ding et al. (2024) [33]
Risk assessment/mitigation	GenAI (LLM); Text-based	Martin et al. (2025) [45]
Risk assessment/mitigation	GenAI (LLM); Text-based	Lee et al. (2026) [46]

Table 2. Scoring anchors for the 0–100 evaluation scale.

Score Range	Scoring Anchor
90–100	Correct, supported by visible evidence, no contradictions
70–89	Mostly correct, minor missing detail or small omissions
50–69	Partially correct, notable gaps or weak evidence
0–49	Incorrect, hallucinated elements, or contradicts the image
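
The scoring anchors in Table 2 can be read as a simple lookup from a score to its band. The sketch below illustrates this; the function name `anchor_for` is hypothetical, and the code assumes integer scores, since the table does not say how fractional values between bands (for example 89.5) would be treated.

```python
# Score bands copied from Table 2: (lower bound, upper bound, anchor text).
SCORING_ANCHORS = [
    (90, 100, "Correct, supported by visible evidence, no contradictions"),
    (70, 89, "Mostly correct, minor missing detail or small omissions"),
    (50, 69, "Partially correct, notable gaps or weak evidence"),
    (0, 49, "Incorrect, hallucinated elements, or contradicts the image"),
]


def anchor_for(score: int) -> str:
    """Return the Table 2 anchor whose band contains an integer score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for low, high, anchor in SCORING_ANCHORS:
        if low <= score <= high:
            return anchor
    # Reached only for non-integer scores that fall between two bands.
    raise ValueError("score does not fall into any band")


print(anchor_for(82))
```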

Table 3. Distribution of analyzed images by construction category and activity.

Category	Activities	Analyzed Images
Civil Engineering and Infrastructure	Asphalt works	15
	Earthworks	222
	External works and environmental development	17
	Infrastructure (sewer pipeline)	21
Finishes and MEP	Flooring works (tiling)	285
	Gypsum works	45
	Interior plastering and painting works	42
	Roof thermal insulation and waterproofing	30
	Waterproofing of wet rooms and balconies	60
	Apartment electrical works	33
Masonry and partitions	AAC (Autoclaved Aerated Concrete) block masonry (Ashkalit blocks)	69
	AAC block masonry (Ytong)	111
Structural works	Deep foundation works (piles)	40
	Hollow Core Slabs	16
	Retaining wall construction	180
Total		1186

Table 4. Distribution of evaluated construction management tasks by research question (RQ).

RQ	Construction Management Task	Task Description	Number of Evaluations
R1	Construction Activity Identification	Identification of construction trades or activities in site imagery	12
R2	Construction Progress Tracking	Recognition of execution stage and progress level within construction activities	13
R3	Execution Defect Detection	Detection of visible workmanship defects and quality-related deviations	9
R4	Safety Hazard Identification	Identification of safety hazards and unsafe site conditions from images	5
R11	Impact of Lighting and Camera Angle	Evaluation of the influence of lighting conditions and camera angles on AI performance	1
R12	Field Applicability Assessment	Assessment of practical usability of AI outputs in site conditions	2
R13	Automated Construction Report Generation	Generation of structured construction or inspection reports from site imagery	3
R14	Image Resolution Impact	Assessment of image resolution effects on AI performance	1
R15	Construction Area Classification	Identification and classification of site areas or functional zones	1
Total			47

Table 5. Performance scores by primary construction management task for ChatGPT and Gemini. Here, n denotes the number of task-level evaluations contributing to each tool–task cell.

RQ	Task	ChatGPT n	ChatGPT Mean	Gemini n	Gemini Mean	Overall n	Overall Mean
R1	Construction Activity Identification	4	83.67	7	82.56	11	82.96
R2	Construction Progress Tracking	2	75.00	10	73.35	12	73.63
R3	Execution Defect Detection	3	53.33	6	65.67	9	61.56
R4	Safety Hazard Identification	1	76.00	4	72.38	5	73.10
Total/Mean		10	72.07	27	73.88	37	73.39
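
The Overall column of Table 5 is consistent with an n-weighted (pooled) mean of the ChatGPT and Gemini cells. The sketch below recomputes it from the values printed in the table. The names `TABLE5` and `pooled_mean` are hypothetical, and the weighting scheme is inferred from the numbers rather than stated by the authors.

```python
# (n, mean) pairs per tool, copied from Table 5: ChatGPT first, then Gemini.
TABLE5 = {
    "R1": [(4, 83.67), (7, 82.56)],
    "R2": [(2, 75.00), (10, 73.35)],
    "R3": [(3, 53.33), (6, 65.67)],
    "R4": [(1, 76.00), (4, 72.38)],
}

# Overall means as reported in Table 5, used only for comparison.
REPORTED_OVERALL = {"R1": 82.96, "R2": 73.63, "R3": 61.56, "R4": 73.10}


def pooled_mean(cells):
    """Weight each cell mean by its number of evaluations."""
    total_n = sum(n for n, _ in cells)
    return sum(n * mean for n, mean in cells) / total_n


if __name__ == "__main__":
    for rq, cells in TABLE5.items():
        recomputed = pooled_mean(cells)
        print(f"{rq}: recomputed {recomputed:.3f}, reported {REPORTED_OVERALL[rq]:.2f}")
```

Because the inputs are themselves rounded to two decimals, the recomputed values can differ from the reported ones in the last digit.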

Table 6. Pre–post changes in AI literacy competencies.

Competency	Pre (M ± SD)	Post (M ± SD)	t(df)	p	Cohen’s d	[95% CI]
Understanding AI Contribution in CE	3.16 ± 1.57	5.23 ± 0.72	6.36(30)	<0.001	1.14	[0.69, 1.60]
Error Recognition and Validation	4.13 ± 1.31	5.19 ± 0.83	4.42(30)	<0.001	0.79	[0.39, 1.20]
Professional Application Skills	3.29 ± 1.16	5.16 ± 0.64	8.46(30)	<0.001	1.52	[1.00, 2.04]
Validation Importance	3.26 ± 1.61	5.58 ± 0.72	7.52(30)	<0.001	1.35	[0.86, 1.84]

Note: n = 31. Responses were measured on a 6-point Likert scale (1 = strongly disagree, 6 = strongly agree).
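
The effect sizes in Table 6 are consistent with the paired-samples form of Cohen's d, d = t/√n with n = df + 1. The sketch below checks this against the reported t statistics. The formula choice is inferred from the printed values and is an assumption, not a statement of the authors' procedure; `cohens_d_paired` and `TABLE6` are hypothetical names.

```python
import math

# (t statistic, degrees of freedom, reported Cohen's d), copied from Table 6.
TABLE6 = {
    "Understanding AI Contribution in CE": (6.36, 30, 1.14),
    "Error Recognition and Validation": (4.42, 30, 0.79),
    "Professional Application Skills": (8.46, 30, 1.52),
    "Validation Importance": (7.52, 30, 1.35),
}


def cohens_d_paired(t: float, df: int) -> float:
    """Paired-samples effect size d = t / sqrt(n), where n = df + 1."""
    n = df + 1
    return t / math.sqrt(n)


if __name__ == "__main__":
    for name, (t, df, reported) in TABLE6.items():
        d = cohens_d_paired(t, df)
        print(f"{name}: d = {d:.2f} (reported {reported:.2f})")
```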

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Urlainis, A.; Haronian, E.; Mitelman, A. Multimodal Generative AI for Construction-Site Management and Monitoring: A Field-Based Evaluation. Smart Cities 2026, 9, 114. https://doi.org/10.3390/smartcities9070114

