1. Introduction
Smart cities depend not only on intelligent infrastructure operation, but also on reliable, data-driven infrastructure delivery. Construction sites are temporary yet essential components of the urban system, generating large volumes of visual, spatial, and operational data through site images, inspections, progress documentation, and field observations. These data can support construction monitoring, quality assurance, safety management, and future asset information. However, much of this field data remains underused because construction management still relies heavily on manual observation and professional interpretation. This gap creates an opportunity to examine whether multimodal generative artificial intelligence (GenAI) can transform construction-site visual data into structured information for smart-city infrastructure delivery. This question is also aligned with the United Nations Sustainable Development Goals, particularly SDG 11, which emphasizes inclusive, safe, resilient, and sustainable cities and communities [1,2]. In this context, AI-supported construction monitoring may contribute to more transparent, resource-efficient, and evidence-based delivery of urban infrastructure.
GenAI tools are increasingly capable of interpreting visual information, generating textual explanations, and supporting decision-making processes, making them relevant to both professional practice and engineering education. In construction management (CM) and civil engineering (CE), image-based GenAI tools are particularly important because many routine monitoring tasks rely on visual evidence, including construction activity identification, progress tracking, execution defect detection, and safety hazard identification [3,4,5,6]. Such tools may improve the efficiency of visual documentation and preliminary site interpretation, especially when large image sets are collected from active construction projects.
Despite the growing relevance of multimodal GenAI for construction monitoring, its performance under real construction-site conditions remains insufficiently established. Unlike controlled benchmark datasets, construction-site images often contain incomplete visual evidence, occlusions, temporary works, cluttered backgrounds, variable lighting, and activities occurring at different stages of completion. As a result, GenAI systems may produce incomplete, inconsistent, or misleading outputs, including confident narratives that exceed the available visual evidence [7,8]. This limitation reinforces the need for explicit verification and uncertainty-aware reporting in construction-management tasks. Field-based evaluation is therefore required to determine which tasks can be supported reliably by multimodal GenAI and which still require substantial professional interpretation.
The need for verification also has implications for engineering education and professional training. GenAI tools are increasingly used by early-career engineers and engineering students [9], yet this use is often informal and only weakly integrated into structured curricula. Evidence from higher-education studies indicates frequent use and high self-reported confidence alongside uneven formal knowledge, ethical preparedness, and verification practices [10,11,12]. This is particularly important in construction management, where AI-generated interpretations may relate to safety, quality, and progress assessment. In such contexts, uncritical reliance on artificial intelligence (AI) outputs can reinforce automation bias and overconfidence, leading users to accept incomplete or misleading interpretations as reliable evidence [13,14,15]. Prior studies further suggest that scaffolded and task-based learning can strengthen critical thinking and reduce uncritical reliance on AI-generated outputs [16,17]. Therefore, responsible GenAI integration requires structured workflows that combine AI-supported analysis with human verification, uncertainty awareness, and domain-based judgment. Accordingly, while the primary focus of this study is the technical evaluation of GenAI performance, the field-based seminar setting also provides a secondary educational contribution by demonstrating how a verification-first workflow can be embedded in authentic construction-management training to strengthen students’ AI literacy, verification practice, and professional judgment.
Despite extensive research on AI in construction and the rapid emergence of GenAI tools, four gaps motivate the present study: (i) limited field-based validation under real construction-site conditions; (ii) limited evidence on how general-purpose multimodal GenAI performs across tasks that differ in visual explicitness, contextual reasoning, and required engineering judgment; (iii) insufficient analysis of recurring failure modes and unsupported interpretations; and (iv) weak integration of verification-first, human-in-the-loop workflows, particularly when applied by students or early-career engineers using real site data. These gaps are elaborated in Section 2.4.
In response to these needs, the present study examines the use of multimodal GenAI tools for construction-site image interpretation through a field-based evaluation conducted within a construction-management seminar. The dataset includes 1186 images collected from 17 active construction sites and analyzed by 31 undergraduate civil engineering students using a structured evaluation protocol. The evaluation used three widely available, general-purpose multimodal GenAI assistants, Gemini, ChatGPT, and Microsoft Copilot, each selected at the discretion of the research groups. Four construction-management tasks were examined: construction activity identification, progress tracking, execution defect detection, and safety hazard identification. The AI-generated outputs were assessed against engineering ground truth to evaluate task-dependent performance, recurring limitations, and the feasibility of a verification-first human-in-the-loop workflow.
The primary contribution of this work is a technical, field-based evaluation of multimodal GenAI performance under real construction-site conditions. The educational dimension, namely the operationalization of a verification-first workflow within engineering training, is treated as a complementary secondary contribution. In this study, verification-first refers to treating AI-generated outputs as preliminary interpretations that require domain-based human review before being accepted or used for decision-making. The study makes three main contributions: (1) empirical evidence on the task-dependent strengths, limitations, and recurring failure modes of multimodal GenAI when analyzing construction-site imagery from active projects; (2) comparative assessment of GenAI performance across four main construction-management tasks, namely activity identification, progress tracking, defect detection, and safety hazard identification; and (3) demonstration of a human-in-the-loop verification workflow that supports responsible GenAI use within smart-city construction and infrastructure monitoring.
3. Methodology
This study adopts a field-based evaluation design to assess the performance of multimodal GenAI on construction-management tasks using images collected from real construction sites. The study was implemented within an undergraduate construction-management seminar, enabling structured human-in-the-loop evaluation, ground-truth comparison, and supporting assessment of students’ verification awareness. The methodology combines quantitative assessment of AI task performance with qualitative analysis of recurring errors, while educational outcomes related to trust, confidence, and verification practices are examined as a secondary component.
The core methodological principle is a verification-first (human-in-the-loop) workflow, in which GenAI outputs are systematically compared against engineering ground truth and professional judgment rather than being treated as authoritative results.
Figure 2 illustrates the overall research framework, highlighting the sequential and parallel interactions between preliminary research, real-world site investigation, AI-assisted task execution, mandatory verification procedures, and educational assessment. The figure also emphasizes the parallel development of technical evaluation and learning outcomes across all study phases.
3.1. Study Context and Participants
The study was conducted within an undergraduate construction management seminar offered to civil engineering students. A total of 31 students participated, organized into 15 independent research groups working under a unified methodological framework. The seminar engaged students in structured, task-oriented activities involving the analysis of construction site images using multimodal GenAI tools, while explicitly emphasizing professional responsibility, verification practices, and critical evaluation of AI-generated outputs.
3.2. Construction Activities and Site Selection
Each group selected a construction activity that is visually observable and relevant to construction management practice, including structural, finishing, infrastructure, and mechanical, electrical, and plumbing (MEP) works. Sites were chosen to ensure that selected activities were actively occurring during the study period and that realistic field variability would be represented. Overall, the study included visits to 17 active construction sites, covering diverse project types (residential, infrastructure, data center, and mixed-use), execution stages, and environments (indoor and outdoor conditions).
3.3. Experimental Design and Data Collection
A structured experimental plan was developed for each construction activity. The plan specified the number of required site visits, the range of execution stages to be captured, and basic guidelines for image acquisition. Students were instructed to collect site images during multiple visits where possible, in order to capture temporal variation and progress-related changes. During each site visit, students collected photographic data together with complementary execution information. The image-acquisition guidelines, applied consistently across all groups, required students to photograph each activity from multiple viewpoints and working distances, ensure adequate lighting, focus, and framing, minimize occlusions where possible, and capture distinct execution stages across repeated visits. For each image, structured metadata were recorded to support traceability, verification, and subsequent analysis. These metadata included the date and time of capture, site location, construction activity, observed execution stage, image resolution, image source, and additional contextual information relevant to the activity. A summary of the data-collection and evaluation guidelines is provided in Appendix C.
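The per-image metadata described above can be represented as a small typed record. The following is a minimal sketch only: the class name, field names, and all example values are hypothetical and do not reproduce the format given in Appendix C.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass(frozen=True)
class ImageMetadata:
    """One record per site image. Field names are illustrative assumptions."""
    image_id: str                   # hypothetical identifier used for traceability
    captured_at: datetime           # date and time of capture
    site_location: str
    construction_activity: str
    execution_stage: str            # observed execution stage
    resolution_px: tuple[int, int]  # (width, height) in pixels
    image_source: str               # e.g. the device used to take the photo
    context_notes: str = ""         # additional contextual information


# Placeholder values for illustration only (not taken from the study dataset).
record = ImageMetadata(
    image_id="G07-IMG-0012",
    captured_at=datetime(2026, 1, 14, 10, 30),
    site_location="Site A, level 2",
    construction_activity="slab formwork",
    execution_stage="formwork installed, reinforcement not yet placed",
    resolution_px=(4032, 3024),
    image_source="smartphone camera",
)

as_row = asdict(record)  # plain dict, convenient for export to JSON or CSV
```

Freezing the dataclass prevents accidental edits after capture, which suits a record whose purpose is traceability.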
3.4. Multimodal GenAI Task Execution
Collected images were analyzed using three widely available, general-purpose multimodal GenAI assistants: Gemini 1.5 Pro, ChatGPT-4o, and Microsoft Copilot. These tools were selected because they were accessible to students during the study period, supported image-based input, generated textual explanations, and represented common GenAI assistants available to non-specialist engineering users. Tool selection was left to the discretion of each research group, reflecting authentic, self-directed adoption of publicly available GenAI tools rather than a researcher-imposed configuration. Accordingly, the study evaluates realistic use of multimodal GenAI tools under field-based construction-management training conditions, rather than a controlled benchmark of fixed API-based computer vision models.
The analyses were conducted between December 2025 and March 2026. The tools were accessed through their public web or app interfaces under default settings. The evaluated model versions, to the extent identifiable from each interface, were Gemini 1.5 Pro, ChatGPT-4o, and Microsoft Copilot. Because these tools were accessed through commercial consumer interfaces, not all configuration parameters were directly controllable or visible to the users. Parameters such as temperature, sampling settings, model-version pinning, and backend model updates could not be fixed across all tools. This limitation was documented because web-based GenAI systems may be updated dynamically over time.
To support comparability across groups and tools, all evaluations followed a common task-oriented prompting structure. The prompts were designed to generate outputs related to construction activity identification, progress assessment, execution defect detection, and safety hazard recognition. Students were instructed to ask the model to base its response on visible evidence, describe the supporting visual cues, avoid unsupported assumptions, and indicate uncertainty or missing visual information where applicable. The same prompt structure was used consistently across the images evaluated within each task, and students were instructed not to edit the model output before recording it.
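A common prompting structure of this kind can be sketched as a small template builder. The wording of every task question and evidence rule below is an assumption made for illustration; it is not the prompt text used in the seminar (a representative prompt is given in Appendix B).

```python
# Task questions keyed by the task codes R1-R4. The wording is hypothetical.
TASK_QUESTIONS = {
    "R1": "Identify the construction activity shown in the image.",
    "R2": "Assess the progress or execution stage of the work shown in the image.",
    "R3": "Identify any visible execution defects in the work shown in the image.",
    "R4": "Identify any visible safety hazards in the image.",
}

# Evidence rules mirroring the four instructions students were asked to give.
EVIDENCE_RULES = (
    "Base your answer only on what is visible in the image.",
    "Describe the visual cues that support each statement.",
    "Do not make assumptions that the image does not support.",
    "State any uncertainty and note visual information that is missing.",
)


def build_prompt(task_code: str) -> str:
    """Return the fixed prompt for one task so every image gets the same text."""
    if task_code not in TASK_QUESTIONS:
        raise ValueError(f"unknown task code: {task_code!r}")
    return "\n".join([TASK_QUESTIONS[task_code], *EVIDENCE_RULES])


defect_prompt = build_prompt("R3")
```

Generating the prompt from a fixed table, rather than typing it per image, is one way to keep the structure constant within each task.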
AI outputs were recorded in a structured JSON-based evaluation format to enable systematic comparison and analysis. The recorded information included the image identifier, GenAI tool used, task category, prompt category, generated answer, supporting explanation, confidence statement when provided by the model, evaluator comments, reference assessment, verification notes, and final evaluation score. The same verification-first scoring framework and 0–100 scoring anchors were then applied across tools and tasks. A representative example of the R1 evaluation sheet and category-constrained prompt is provided in Appendix B, and the general GenAI execution guidelines are summarized in Appendix C.
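To make the recorded fields concrete, the sketch below assembles one evaluation record and serializes it to JSON. The key names are assumptions derived from the field list above, and all values are placeholders rather than study data; the actual sheet layout is the one shown in Appendix B.

```python
import json

# Key names are hypothetical; they follow the order of the field list above.
REQUIRED_FIELDS = (
    "image_id", "tool", "task_category", "prompt_category",
    "generated_answer", "supporting_explanation", "confidence_statement",
    "evaluator_comments", "reference_assessment", "verification_notes",
    "final_score",
)


def make_record(**fields) -> dict:
    """Build one evaluation record, checking completeness and the 0-100 range."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not 0 <= fields["final_score"] <= 100:
        raise ValueError("final_score must lie on the 0-100 scale")
    return {name: fields[name] for name in REQUIRED_FIELDS}


# Placeholder content for illustration only.
example = make_record(
    image_id="G07-IMG-0012",
    tool="Gemini",
    task_category="R1",
    prompt_category="activity identification",
    generated_answer="<model answer, recorded unedited>",
    supporting_explanation="<visual cues cited by the model>",
    confidence_statement=None,  # None when the model gave no confidence statement
    evaluator_comments="<evaluator comments>",
    reference_assessment="<verified ground-truth reference>",
    verification_notes="<checks performed>",
    final_score=80,
)

serialized = json.dumps(example, indent=2)
```

Storing a missing confidence statement as `null` rather than omitting the key keeps every record the same shape, which simplifies later aggregation.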
No formal prompt-sensitivity analysis was conducted because the study focused on field-based use of GenAI tools within an educational construction-management setting. However, the use of a structured task-specific prompt template reduced uncontrolled variation across groups. Systematic prompt-robustness testing is therefore identified as an important direction for future work, using fixed image sets, repeated prompt variants, controlled model configurations, and larger balanced samples across construction-management tasks.
3.5. Verification-First Evaluation Workflow
For each AI-generated output, the reference assessment was established through a supervised verification-first procedure. The ground-truth reference was not developed by students in isolation. For each task, the research group that collected the images first defined the expected reference assessment based on direct site observations, recorded metadata, observed execution stage, visible materials and equipment, site conditions, construction logic, and relevant professional requirements where applicable.
Key assumptions were then verified through consultation with the site engineer, site manager, or another responsible construction-site professional during or shortly after the site visit. The role of the site professional was to confirm the technical interpretation and contextual assumptions, rather than to serve as the sole author of the reference assessment. Therefore, statements in this paper that final judgment remains the responsibility of the engineer refer to professional accountability for the use of AI-supported outputs in practice, not to sole authorship of the ground-truth labels.
Additional quality control was provided by the course instructor, who independently audited a subset of evaluations to check the consistency of ground-truth definitions, scoring logic, and interpretation of ambiguous cases. Where necessary, the instructor served as an adjudicator. Verification actions included cross-image triangulation, checking whether AI claims were supported by visible evidence, identifying unsupported inferences, documenting uncertainty when visual information was insufficient, and flagging red-flag indicators such as overconfident statements, missing visual justification, or contradictions with known execution practices.
Accordingly, GenAI outputs were treated as decision-support inputs rather than final engineering conclusions. The ground-truth procedure therefore combined student field observation, site-professional consultation, instructor audit, and adjudication of ambiguous cases, forming a multi-layer quality-control chain intended to reduce single-rater bias and improve the credibility of the reference assessments. Scoring itself followed the pre-specified, anchored 0–100 rubric (Table 2, as detailed in Section 3.6), so that the evaluation criteria were fixed in advance rather than defined case by case. The verification and scoring steps followed a common set of instructions applied uniformly across all groups. A representative structured GenAI evaluation output is provided in Appendix B to illustrate the traceability format used in the evaluation, while the data-collection, GenAI execution, verification, and evaluation guidelines are summarized in Appendix C.
3.6. Performance Analysis
AI performance was evaluated by comparing model outputs with the verified ground-truth references using a standardized 0–100 scoring scale. Scoring was guided by general evaluation anchors to reduce interpretation differences across groups (Table 2). The use of rubric-based, anchored human scoring to evaluate generative-model outputs follows established evaluation practice [56]. To support scoring consistency, the scoring approach was discussed in class, and the course instructor guided each group using a small set of approximately 3–5 images.
Quantitative analysis used one-way analysis of variance (ANOVA) to examine task-type differences and two-way mixed-model ANOVA to examine tool-by-task interaction patterns. Effect sizes were reported alongside significance tests in line with standard statistical reporting conventions [57]. Qualitative analysis examined recurring error patterns linked to site variability, image quality, lighting, camera angle, occlusions, insufficient visual evidence, and unsupported AI inferences.
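For readers unfamiliar with the one-way ANOVA, the sketch below computes the F statistic and the eta-squared effect size from first principles using only the Python standard library. It is illustrative rather than the analysis code used in this study: the function name and the toy scores are assumptions, eta-squared is shown as one common effect-size choice, and the p-value (which requires the F distribution, available for example through `scipy.stats.f_oneway`) is omitted.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    all_scores = [x for g in groups for x in g]
    grand_mean = fmean(all_scores)
    group_means = [fmean(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Toy scores for three hypothetical task types (not study data).
toy_scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w, eta_sq = one_way_anova(toy_scores)
```

For the toy data the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F(2, 6) = 13 and eta-squared = 0.8125.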
3.7. Educational Assessment
Students completed paired pre–post questionnaires measuring self-reported AI literacy competencies on a six-point Likert scale (Appendix A). Four competencies were assessed: understanding of AI contribution to civil engineering, error recognition and validation capability, professional application skills, and verification-first orientation. The questionnaire also collected baseline GenAI usage patterns and post-course trust ratings. Changes were analyzed using paired-samples t-tests with Cohen’s d effect sizes.
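The paired-samples computation can be sketched as follows. This is a minimal illustration with assumed names and toy ratings, not the study's analysis script. Note that several variants of Cohen's d exist for paired designs; the sketch uses the mean difference divided by the standard deviation of the differences (often written d_z), which is an assumption because the variant is not specified here.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_t_and_dz(pre, post):
    """Return (t, df, d_z) for paired pre/post ratings of the same students."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("pre and post must be paired and have at least 2 pairs")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    mean_diff = fmean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / sqrt(n))
    d_z = mean_diff / sd_diff  # assumed effect-size variant, see lead-in
    return t_stat, n - 1, d_z


# Toy Likert ratings for four hypothetical students (not study data).
toy_pre = [3, 3, 4, 2]
toy_post = [5, 4, 6, 4]
t_stat, df, d_z = paired_t_and_dz(toy_pre, toy_post)
```

With the toy ratings the differences are 2, 1, 2, 2 (mean 1.75, standard deviation 0.5), so t(3) = 7.0 and d_z = 3.5. The p-value would again come from the t distribution, for example via `scipy.stats.ttest_rel`.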
3.8. Integration of Technical and Educational Outcomes
The final stage integrated technical findings with supporting educational outcomes. AI performance results and identified failure modes were analyzed alongside changes in student trust, confidence, and verification awareness. This integrated analysis enabled evaluation of multimodal GenAI capabilities for construction-management tasks, while also examining how a verification-first workflow supports responsible interpretation and training.
5. Discussion
5.1. Technical Performance: Capabilities and Limitations
Within the smart-city context that frames this study, construction sites are temporary but data-rich components of the urban system. The reliable conversion of construction-site visual data into structured information is therefore important for sustainable urban infrastructure delivery, quality assurance, safety management, and future asset information. Accordingly, the findings are interpreted not only as task-level performance metrics, but also in terms of how dependably multimodal GenAI can support smart-city construction monitoring. The field-based evaluation of multimodal GenAI across 47 task-level assessments and 1186 construction site images reveals that current GenAI tools demonstrate functional but circumscribed capabilities for construction management applications. The overall mean performance score of 75.3 indicates that these systems can provide meaningful analytical support, yet fall short of the reliability threshold required for autonomous deployment in professional practice. These findings align with recent observations that, while AI technologies show considerable promise for construction applications, significant gaps remain between laboratory performance and field deployment readiness [62,63].
The most consequential finding is the clear performance hierarchy across task types. Construction Activity Identification (R1) achieved the highest performance (M = 83.5), followed by Progress Tracking (R2, M = 74.1), Safety Hazard Identification (R4, M = 73.1), and Execution Defect Detection (R3, M = 61.6). This 22-point differential between Activity Identification and Execution Defect Detection tasks reflects a fundamental distinction between classification-oriented reasoning and quality assessment reasoning. Activity identification relies primarily on recognizing visually distinctive elements such as equipment, materials, and spatial configurations. In contrast, defect detection requires fine-grained discrimination of workmanship quality, interpretation against construction standards, and inference from subtle visual cues that may not be fully observable in single images. These findings corroborate prior work indicating that general-purpose models perform well on categorical tasks but struggle with domain-specific quality judgments requiring specialized knowledge [64,65,66,67].
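The ranking and the size of the gap can be checked directly from the four task means reported above. The snippet is a small arithmetic sketch; the dictionary and variable names are illustrative.

```python
# Mean scores per task type, as reported in the text (0-100 scale).
TASK_MEANS = {
    "R1 Activity Identification": 83.5,
    "R2 Progress Tracking": 74.1,
    "R3 Execution Defect Detection": 61.6,
    "R4 Safety Hazard Identification": 73.1,
}

# Sort from highest to lowest mean to obtain the performance hierarchy.
ranked = sorted(TASK_MEANS.items(), key=lambda item: item[1], reverse=True)
ranking = [name for name, _ in ranked]

# Gap between the best and worst task, rounded to one decimal place.
gap = round(ranked[0][1] - ranked[-1][1], 1)
```

The difference between R1 and R3 is 21.9 points, which the text rounds to 22.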
Activity-specific patterns (Section 4.2.1) confirm that high-performing activities share characteristics such as large-scale visual elements and favorable lighting, while low-performing activities involve confined spaces and contextual knowledge not directly observable in imagery. These patterns are consistent with findings from computer vision research in construction contexts, where environmental factors significantly affect recognition accuracy [25,68].
No statistically significant performance differences were observed between Gemini and ChatGPT across the evaluated tasks. However, because tool use was unbalanced across the dataset, this result should be interpreted cautiously and not as definitive model benchmarking. Rather, the findings suggest that the main performance limitations are associated more strongly with task type and construction-context complexity than with the specific commercial tool used. Accordingly, future improvements are likely to depend on domain-specific adaptation, enhanced prompting strategies, integration with structured project data, and multimodal context enrichment rather than simply switching between general-purpose commercial platforms.
5.2. Educational Outcomes: From General Adoption to Professional Literacy
The educational findings point to one central insight. The disparity between high general adoption and low engineering-specific use (see Section 4.3.1) indicates that the primary educational need is not to close an adoption gap but to close a professional literacy gap: frequent GenAI users need help understanding what these tools can and cannot reliably support in engineering contexts, and how to deploy them responsibly through structured verification. This aligns with broader findings that student engagement with AI is often driven by individual experimentation rather than structured instructional design [9,10].
The verification-first pedagogical framework demonstrated substantial effectiveness in addressing this gap. Across all four assessed competencies, students exhibited significant gains from pre- to post-course assessments. Understanding of AI contribution to civil engineering increased by 65.3%, while professional application skills improved by 56.4% (M = 3.30 to M = 5.16). Most importantly, validation importance awareness exhibited the strongest gain at 71.3%, providing quantitative evidence that the learning process successfully promoted a verification-first viewpoint. These outcomes support prior research demonstrating that scaffolded interventions can effectively strengthen critical thinking and reduce uncritical reliance on AI tools, even within relatively short instructional timeframes [16,17].
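The percentage gains quoted above appear to be relative changes in the mean rating, an interpretation that matches the one pair of means reported in the text. The sketch below reproduces that figure; the function name is illustrative.

```python
def relative_gain_percent(pre_mean: float, post_mean: float) -> float:
    """Relative change from the pre-course mean to the post-course mean, in %."""
    if pre_mean == 0:
        raise ValueError("pre-course mean must be non-zero")
    return (post_mean - pre_mean) / pre_mean * 100.0


# Professional application skills: M = 3.30 before, M = 5.16 after.
application_gain = round(relative_gain_percent(3.30, 5.16), 1)
```

The result is 56.4%, in agreement with the reported gain for professional application skills.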
Beyond mean-level improvements, reduced post-course variability indicates pedagogical convergence: students with diverse baseline competencies converged toward a shared standard of AI literacy, consistent with competency-based engineering education principles [12,69].
5.3. Workforce Efficiency and Sustainable Infrastructure Delivery
The findings have practical implications for workforce efficiency in construction management and infrastructure delivery, both of which are central to sustainable smart-city development. In a sector increasingly affected by shortages of skilled workers, site engineers, and experienced construction managers [70], GenAI-based image interpretation can support routine documentation, preliminary progress checks, safety screening, and initial quality observations. By reducing repetitive manual tasks, such tools may allow engineers and supervisors to focus on verification, decision-making, and intervention in areas requiring professional judgment.
AI-supported visual interpretation may also contribute to sustainable infrastructure delivery by improving the use of construction-site images as urban visual data. Remote analysis of site images can help prioritize inspection needs and may reduce repeated physical visits to the site. Earlier identification of potential safety, progress, or quality issues can also reduce rework, delays, material waste, and unnecessary transportation associated with repeated inspections or late-stage corrections. These implications are particularly relevant to SDG 11 and to smart-city agendas that promote resource-efficient, data-driven, and resilient urban development [2,71,72].
Nevertheless, these benefits depend on responsible implementation. Since GenAI was more reliable in descriptive and visually explicit tasks than in judgment-intensive tasks such as execution defect detection, it should be viewed as a tool for augmenting human supervision rather than replacing professional inspection. This aligns with the broader need to address ethical and practical challenges in AI-supported sustainability applications, particularly where automated outputs may influence safety, quality, and infrastructure performance.
5.4. Implications for Construction Practice and Engineering Education
The integration of technical and educational findings suggests practical implications for both construction deployment and engineering training. From an industry perspective, the observed task-complexity hierarchy supports a stratified deployment strategy. For activity identification and documentation, GenAI can serve as a first-pass analytical tool with moderate oversight. For progress monitoring and safety screening, outputs should be verified against temporal references, site conditions, and safety checklists. For execution defect detection and quality assessment, GenAI should be limited to preliminary screening, with professional judgment remaining authoritative.
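The stratified deployment strategy described above can be expressed as a simple lookup from task type to required oversight. This is a sketch under stated assumptions: the enum, the task keys, and in particular the rule that unknown tasks default to the strictest tier are illustrative choices, not part of the study.

```python
from enum import Enum


class Oversight(Enum):
    FIRST_PASS = "first-pass analysis with moderate oversight"
    VERIFY = "verify against temporal references, site conditions, and safety checklists"
    SCREEN_ONLY = "preliminary screening only; professional judgment is authoritative"


# Task keys are hypothetical labels for the task types discussed in the text.
DEPLOYMENT_POLICY = {
    "activity_identification": Oversight.FIRST_PASS,
    "documentation": Oversight.FIRST_PASS,
    "progress_monitoring": Oversight.VERIFY,
    "safety_screening": Oversight.VERIFY,
    "execution_defect_detection": Oversight.SCREEN_ONLY,
    "quality_assessment": Oversight.SCREEN_ONLY,
}


def oversight_for(task: str) -> Oversight:
    """Look up the oversight tier; unlisted tasks get the strictest tier (assumption)."""
    return DEPLOYMENT_POLICY.get(task, Oversight.SCREEN_ONLY)


tier = oversight_for("progress_monitoring")
```

Defaulting to the most conservative tier means that a task nobody has assessed is never treated as safe for first-pass use.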
A possible practical implication concerns the use of accessible GenAI tools by small and medium-sized construction enterprises (SMEs). Because general-purpose AI tools are relatively low cost and require limited technical infrastructure, they may help SMEs strengthen activity documentation, progress tracking, and preliminary safety or quality screening. However, such use should be accompanied by verification and control mechanisms, particularly given the resource and capability gaps that often exist between SMEs and large contractors [73,74,75].
Construction sites can be viewed as temporary sensing environments within the smart-city lifecycle. During infrastructure delivery, site images, inspection records, progress observations, safety documentation, and field reports generate valuable information on how urban assets are actually constructed. When structured, verified, and linked to project information systems, these data can support not only construction management but also later operation, maintenance, and asset management. From this perspective, multimodal GenAI is not only a tool for construction documentation, but also a means of transforming fragmented field observations into structured information for smart-city data ecosystems. Verified GenAI outputs can document activity status, visible site conditions, potential safety issues, and preliminary quality observations, thereby improving continuity between construction-phase monitoring and digital-twin-based operation. However, the findings also show that such integration must remain task-calibrated and verification-based. GenAI is more reliable for visually explicit tasks, such as activity identification, than for judgment-intensive tasks, such as defect detection. Therefore, construction-site sensing workflows should position GenAI as a human-in-the-loop interpretation layer, while professional engineers remain responsible for validation and decision-making.
As introduced in Section 2, Building Information Modeling (BIM) has become a digital backbone for project information, and integrating BIM with AI enables capabilities such as automated rule checking and as-built reconstruction. The integration of GenAI with BIM and digital twin environments also presents opportunities for more reliable construction-monitoring workflows. BIM can provide structured project information for design specifications, schedules, and quality benchmarks [23,76]. Linking GenAI-based image interpretation with BIM data could reduce reliance on single-image analysis by allowing AI outputs to be checked against structured project context, thereby improving both accuracy and verification efficiency. This direction is consistent with recent smart-city sensing research emphasizing that multimodal data integration can support intelligent urban decision-making, while also introducing challenges related to data alignment, scalability, modality-specific noise, privacy, and responsible deployment [77].
From an educational perspective, the verification-first framework offers a replicable model for responsible GenAI integration in engineering programs. Key design principles include authentic field-based tasks, explicit ground-truth comparison, uncertainty documentation, red-flag identification, and clear emphasis on professional accountability. This is particularly important because GenAI tools evolve faster than formal curriculum structures, creating a risk that students adopt powerful systems before domain-specific verification habits are established [11,78]. Verification-first training can therefore be embedded within existing courses as a practical bridge between immediate AI adoption and longer-term curricular reform.
5.5. Limitations and Future Research
Several limitations should be considered when interpreting the findings. First, the study was designed as a field-based evaluation of realistic GenAI use in a construction-management seminar rather than as a controlled benchmark of commercial multimodal models. Tool selection was left to the research groups, reflecting authentic self-directed adoption of widely available GenAI assistants. Accordingly, the same image and prompt were not systematically tested across Gemini, ChatGPT, and Microsoft Copilot. This limits model-to-model generalization, and the tool-level findings should be interpreted as evidence of practical task-level feasibility and recurring failure modes rather than as a definitive ranking of platforms. Future work should include balanced cross-tool benchmarking using the same image set, prompt templates, and scoring procedure across all selected tools.
Second, the task distribution was not fully balanced. The number of task-level evaluations differed across activity types and construction-management tasks, with relatively few evaluations for safety hazard identification and several exploratory tasks. This imbalance reflects the field-based structure of the study, where each group worked on activities available at its selected construction site. Consequently, statistical comparisons should be interpreted cautiously, with emphasis placed on observed task-level patterns, effect sizes, and the descriptive performance hierarchy rather than definitive population-level estimates. To confirm that the principal task-type effect was not an artifact of this imbalance, the omnibus comparison was repeated using Welch’s ANOVA and a sensitivity analysis excluding the smallest cell; the effect remained significant in both cases. Future studies should adopt more balanced sampling across activity identification, progress tracking, defect detection, and safety hazard identification.
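Welch's ANOVA relaxes the equal-variance assumption by weighting each group by its size divided by its variance, which makes it a natural robustness check for unbalanced cells. The sketch below implements the standard Welch statistic with the Python standard library, together with a helper for the "exclude the smallest cell" sensitivity check. Function names and toy scores are assumptions, and the p-value step is omitted.

```python
from statistics import fmean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA on lists of scores."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least two groups with at least 2 scores each")
    sizes = [len(g) for g in groups]
    means = [fmean(g) for g in groups]
    # Precision weights n_i / s_i^2 (fails if a group has zero variance).
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_total = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / w_total

    numerator = sum(w * (m - weighted_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / w_total) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    return numerator / denominator, k - 1, (k ** 2 - 1) / (3 * lam)


def drop_smallest(groups):
    """Return the groups without the one that has the fewest scores."""
    smallest = min(range(len(groups)), key=lambda i: len(groups[i]))
    return [g for i, g in enumerate(groups) if i != smallest]


# Toy scores (not study data): three equal cells plus one small cell.
toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [4, 6]]
f_all, df1_all, df2_all = welch_anova(toy)
f_sens, df1_sens, df2_sens = welch_anova(drop_smallest(toy))
```

For the three remaining toy cells (equal sizes and unit variances) the statistic works out to F = 78/7, approximately 11.14, with df1 = 2 and df2 = 4.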
Third, the study did not employ a fully independent blinded annotation scheme or calculate formal inter-rater reliability metrics. The evaluation is a verification task in which each score reflects the agreement between an AI output and an engineering ground-truth reference that the group established from first-hand site observation; the evaluator therefore cannot be blinded to that reference, which is the standard against which the output is judged. Scoring followed a pre-specified, anchored rubric (Table 2) fixed before evaluation and applied uniformly, so criteria were not generated case by case. Because formal agreement metrics were not computed, the conclusions were limited to broad, directional findings regarding task-level feasibility, recurring failure modes, and differences between visually explicit and judgment-intensive tasks, rather than fine-grained score differences that would require stronger inter-rater validation. To mitigate potential bias, the reference assessments were supported by site-professional consultation, instructor audit, and evidence checks. Nevertheless, future benchmarking studies should incorporate independent blinded expert annotation, multiple raters per image, and formal agreement metrics.
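As one example of the formal agreement metrics recommended for future work, the following minimal Python sketch computes unweighted Cohen's kappa for two raters scoring the same set of AI outputs. The ratings are synthetic placeholders, not study data, and the function name `cohens_kappa` is hypothetical. For an ordinal rubric, a weighted kappa or an intraclass correlation may be more appropriate; the unweighted form is shown only because it is the simplest case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2

    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one label only")
    return (observed - expected) / (1 - expected)


# Synthetic, illustrative rubric levels only (NOT the study's data).
rater_1 = [3, 2, 3, 1, 2, 3, 1, 2]
rater_2 = [3, 2, 2, 1, 2, 3, 1, 3]
print("kappa = %.3f" % cohens_kappa(rater_1, rater_2))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when one rubric level dominates the scores.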
Fourth, construction-site images represent snapshots of dynamic construction processes. A single photograph may not capture prior work sequences, temporary constraints, hidden elements, future planned steps, or contextual information needed for complete engineering interpretation. This limitation affects both AI systems and human evaluators, although experienced professionals may compensate through site knowledge and construction logic. The issue is therefore not only technical but also epistemic: the image provides partial evidence of a broader construction process. Future research should examine temporally aware, multi-image, video-based, or sequence-based evaluation, supported by BIM models, schedules, inspection records, sensor data, and site documentation.
Finally, the findings are time-stamped and context-dependent because multimodal GenAI systems are updated frequently and their backend configurations are not always transparent. The more durable contribution of this study is therefore the verification-first evaluation protocol. Future research should examine updated models, domain-adapted systems, retrieval-augmented workflows, structured prompting, BIM/digital-twin integration, and longitudinal educational designs to assess whether trust calibration and verification-first behavior persist as students transition into professional practice. In particular, a systematic prompt-sensitivity analysis was not undertaken in this study and is identified as a priority for future controlled evaluation.
6. Conclusions
This paper presented a field-based evaluation of multimodal GenAI for construction and urban infrastructure monitoring using 1186 images collected from 17 active construction sites. The study assessed GenAI performance across four main construction-management tasks: construction activity identification, progress tracking, execution defect detection, and safety hazard identification. In parallel, the study examined how a verification-first human-in-the-loop workflow can support responsible interpretation of AI-generated outputs by civil engineering students.
The findings show that multimodal GenAI can provide meaningful support for visual documentation and preliminary site interpretation, but its reliability is strongly task-dependent. Activity identification achieved the strongest performance, indicating that current GenAI tools can generally recognize visually explicit construction activities, materials, and equipment. Progress tracking and safety hazard identification showed moderate performance, reflecting the need for contextual information and professional verification. Execution defect detection was the most challenging task, highlighting the limitations of general-purpose GenAI when quality assessment requires subtle visual discrimination, construction standards, and engineering judgment.
These results suggest that GenAI deployment in construction monitoring should be task-calibrated. For activity identification and routine documentation, GenAI may serve as a first-pass analytical tool. For progress assessment and safety screening, AI outputs should be verified against site context, temporal references, and professional checklists. For execution defect detection and quality assessment, GenAI should be limited to preliminary screening, with final judgment remaining the responsibility of qualified professionals.
The educational findings provide supporting evidence that verification-first workflows can strengthen responsible GenAI use. Students entered the seminar as frequent but mostly general GenAI users, while the post-course results showed improved domain-specific AI literacy, professional application skills, and awareness of validation requirements. This indicates that structured field-based training can help future engineers treat AI outputs as preliminary interpretations rather than final conclusions.
Overall, the central contribution of this study is the empirical demonstration that multimodal GenAI has practical but bounded value for construction-site monitoring within the smart-city infrastructure lifecycle. Its effective use requires task-specific validation, human verification, and professional accountability. By combining real construction-site imagery with a verification-first evaluation workflow, the study provides both practical evidence for current GenAI capabilities and a replicable framework for responsible adoption in construction management practice and education.