Article

Construction of an Intelligent Risk Identification System for Highway Flood Damage Based on Multimodal Large Models

1 Department of Highway Traffic Management, Research Institute for Road Safety of MPS, Beijing 100024, China
2 School of Traffic and Transportation Engineering, Central South University, Changsha 410083, China
3 College of Metropolitan Transportation, Beijing University of Technology, Beijing 100124, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12782; https://doi.org/10.3390/app152312782
Submission received: 8 October 2025 / Revised: 5 November 2025 / Accepted: 11 November 2025 / Published: 3 December 2025
(This article belongs to the Special Issue Autonomous Vehicles and Robotics—2nd Edition)

Abstract

Under the increasing threat of extreme weather events, road infrastructure faces significant risks of flood-induced damage. Traditional manual inspection methods are insufficient for modern highway emergency response, which requires higher efficiency and accuracy. To enhance the precision and accuracy of flood damage identification, this study proposes an intelligent recognition system that integrates a multimodal large language model with a structured knowledge base. The system constructs a professional repository covering eight typical categories of flood damage, including roadbed, pavement, and bridge components, with associated attributes, visual features, and mitigation strategies. A vectorized indexing mechanism enables fine-grained semantic retrieval, while task-specific templates and prompt engineering guide the multimodal model (e.g., Qwen-VL-Max) to extract risk elements from image–text inputs and generate structured identification results with expert recommendations. The system is evaluated on a real-world highway flood damage dataset. The results show that the knowledge-enhanced model performs better than the baseline and prompt-optimized models, reaching 91.5% average accuracy, a semantic relevance score of 4.58 out of 5, and 85% robustness under difficult conditions. These results highlight the system's strong domain adaptability and practical value for real-time flood damage assessment and emergency response.

1. Introduction

Highway flood damage refers to the severe deterioration of roadbeds, pavements, bridges, culverts, drainage systems, and auxiliary facilities caused by natural events [1]. Typical manifestations include roadbed erosion, slope damage near bridges, pavement burial by debris, and failure of protective structures. Severe cases of flood damage can lead to traffic disruptions or even paralysis of regional road networks, making it one of the most destructive types of highway disasters.
Highway flood disasters result from the combined effects of natural forces and engineering deficiencies, and typically follow a chain-like evolution mechanism. Under continuous hydrological processes, roadbed slopes first experience progressive cumulative damage, characterized by surface erosion, crack initiation, and soil weakening [2]. Once hydrodynamic conditions exceed critical thresholds, sudden failures such as slope instability and roadbed collapse may occur. The transformation from gradual deterioration to sudden failure underscores the need for a full-chain prevention and control system that integrates monitoring, early warning, and emergency response in high-risk areas. Therefore, systematic hazard identification is a critical step in breaking the disaster transmission chain. By developing dynamic risk profiles, implementing targeted monitoring and early warning, and strengthening preventive maintenance and mitigation, the risks of major casualties and economic losses can be effectively reduced. This approach enhances the resilience of highway infrastructure and contributes to advances in disaster prevention and control.
At present, the field of highway flood disaster prevention and control is undergoing a paradigm shift driven by emerging technologies. However, integrating traditional inspection models with advanced monitoring techniques still faces several challenges. The manual inspection systems used by road administration and traffic police departments rely primarily on subjective judgment and face four critical limitations [3]: (1) inspection effectiveness is constrained by the expertise of personnel; (2) the average daily inspection coverage is less than 80 km, which is over 60% less efficient than mechanized methods; (3) inspector safety risks increase significantly under extreme weather conditions; (4) conventional documentation methods fail to comprehensively capture the characteristics of flood-damaged road segments. Although artificial intelligence technologies have made significant progress, their large-scale deployment remains limited due to high costs and technical challenges such as signal occlusion. This contradiction between high-precision equipment and the demand for lightweight applications has become a major bottleneck restricting the large-scale implementation of intelligent monitoring and early warning systems.
With the rapid advancement of general artificial intelligence, multimodal large language models (MLLMs) have shown substantial potential in transportation management. These models possess capabilities such as cross-modal understanding, complex reasoning, and multi-turn question answering, providing novel technical solutions for intelligent transportation management and decision support [4]. In response, national and regional transportation authorities have introduced policies promoting the integration of intelligent equipment and AI-based recognition algorithms into highway inspection workflows to enhance both efficiency and safety. For example, one study applied MLLMs to integrate traffic video images with real-time sensor data for the automatic detection and classification of traffic accidents, enabling timely alerts and supporting emergency response decisions [5]. In addition, by fusing visual and textual inputs, traffic monitoring systems can automatically generate incident reports, assisting law enforcement in post-incident analysis and documentation [6]. Other studies have explored the use of MLLMs in traffic congestion analysis, where traffic flow data are correlated with weather conditions to provide optimized scheduling recommendations for urban traffic management [7].
Beyond traffic flow and incident detection, MLLMs have also shown promise in infrastructure monitoring. The integration of satellite imagery and ground sensor data has played an increasingly important role in intelligent transportation surveillance. MLLM-based approaches have been employed to detect roadway damage and infrastructure degradation [8]. In urban environments in particular, these models can rapidly identify localized surface defects such as cracks and potholes, and automatically annotate and classify them for further action.
Although MLLMs have been introduced into the transportation domain, their application to highway flood damage recognition and potential hazard identification remains largely unexplored. Current large language models (LLMs) are not inherently designed for structured visual disaster recognition and continue to exhibit issues such as hallucinations and instability when processing visual inputs [9]. Moreover, existing methods fail to address the specific challenges of structured analysis and reasoning in dynamic disaster scenarios, especially considering the complexity and variability of flood damage in highway environments [10].
Therefore, this study proposes a Road Damage Intelligent Detection System (R-DIDS) based on MLLMs. The system integrates image recognition, a structured flood damage knowledge base, and a multimodal large model to construct a multi-dimensional, cross-modal intelligent recognition framework. To address the limitations of current MLLMs in disaster image recognition, this study incorporates prompt engineering and retrieval-augmented generation (RAG) techniques, with a locally deployed MLLM serving as the core engine. The retrieval capabilities of the knowledge base are leveraged to improve the model’s recognition accuracy and output stability. To validate the effectiveness of the proposed system, a test dataset comprising various types of flood damage events is developed. By comparing model outputs with expert-annotated event classifications, the system’s performance is comprehensively evaluated in terms of semantic similarity and classification accuracy.
The rest of this paper is organized as follows. The following section discusses the related research on MLLM methods. Section 3 describes the design of an intelligent identification system for highway flood damage and related distresses. Section 4 presents the results and discussions. Section 5 summarizes the conclusions.

2. Core Technical Methodologies

2.1. Multimodal Large Models and Fine-Tuning Principles

With the continued advancement of MLLMs, researchers in artificial intelligence have increasingly applied this technology to complex tasks. Compared with traditional models, MLLMs are capable of processing information from multiple modalities, such as text, images, and audio, enabling more accurate reasoning and prediction through cross-modal representation learning. This capability has enabled significant applications in areas such as visual question answering, speech recognition, and image generation, particularly in complex scenarios requiring the integration of diverse information sources [11,12,13].
One prominent family of MLLMs is the Qwen-VL series, which is built on a transformer-based architecture and integrates both image recognition capabilities and natural language processing. This integration enables the model to process and generate accurate outputs from both textual and visual inputs. The training process for such models typically involves two stages: pre-training and fine-tuning. Pre-training consists of training the model on large-scale, general-purpose datasets to develop robust, universal representation capabilities, forming a strong foundation for further adaptation. Fine-tuning, in turn, refines the model’s performance by retraining it on a smaller, task-specific annotated dataset, enhancing its ability to adapt to particular domains or applications [12].
However, fine-tuning has certain limitations. First, fine-tuning requires a considerable amount of labeled data, which poses a major challenge in domains where data are scarce or annotation is expensive [14,15]. Second, fine-tuning modifies the model’s internal parameters, requiring substantial computational resources and prolonged training time. Finally, as fine-tuning only modifies a few parameters, it may not fully leverage the comprehensive knowledge learned during pre-training, thereby limiting the model’s generalizability across tasks [16].
To address these challenges, new techniques such as prompt engineering and retrieval-augmented generation (RAG) have gained increasing attention as effective alternatives to fine-tuning [17]. These methods enhance model performance and adaptability without altering internal parameters by augmenting input prompts or incorporating external knowledge sources. Therefore, they offer a more efficient and flexible approach for domain-specific task adaptation.

2.2. Prompt Engineering

Prompt engineering is a technique that guides MLLMs to generate task-specific outputs by designing precise and targeted input instructions. The core principle is to effectively guide the model toward multi-step reasoning in complex tasks while leveraging external knowledge bases to generate high-quality responses. Unlike traditional fine-tuning methods, prompt engineering focuses on optimizing the input formulation without modifying the model’s internal structure, thereby enhancing its performance in specific scenarios [18].
In the context of MLLMs, prompt engineering extends beyond issuing simple queries or commands. Instead, it involves carefully crafted inputs that prompt the model to extract relevant information from multimodal sources, such as images, text, and videos, to generate more accurate outputs [19,20]. The prompt design directly influences the model’s task comprehension and the effectiveness of its responses. Common strategies include Chain-of-Thought (CoT) prompting, which decomposes complex problems into a sequence of intermediate reasoning steps, enabling the model to follow a logical path toward the final answer. Furthermore, Self-Consistency with Chain-of-Thought (CoT-SC) enhances result reliability by aggregating outputs from multiple independent reasoning paths. The Tree of Thought (ToT) approach constructs a tree-like reasoning structure to help the model explore multiple solution paths and select the optimal outcome [21].
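The aggregation step of CoT-SC described above amounts to a majority vote over answers sampled from independent reasoning paths. A minimal sketch (the damage labels are illustrative, not drawn from the study's dataset):

```python
from collections import Counter

def self_consistency(answers):
    """Aggregate answers from independent reasoning paths by majority vote."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled reasoning paths classifying the same flood-damage scene;
# the consensus answer is the label that most paths converge on.
paths = ["slope damage", "slope damage", "subgrade damage",
         "slope damage", "retaining wall damage"]
consensus = self_consistency(paths)  # -> "slope damage"
```

In practice, each element of `paths` would come from a separate sampled generation of the model under the same prompt, so that idiosyncratic reasoning errors in any single path are outvoted.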
Prompt engineering depends on a deep understanding of the task domain. Effective prompt design requires careful consideration of the task background, objectives, and expected output format. As real-world application scenarios continue to evolve, prompt templates often undergo iterative refinement and empirical validation. Therefore, prompt engineering is not merely a technical trick but a continuously optimized process. Compared with fine-tuning, prompt engineering offers significant advantages in terms of efficiency and flexibility. Fine-tuning typically requires extensive parameter adjustment and large-scale annotated datasets. In contrast, prompt engineering enhances model performance solely through input manipulation, without modifying the model’s parameters, making it a practical and scalable solution for real-world applications.

2.3. Retrieval-Augmented Generation (RAG)

RAG is an innovative approach that integrates external knowledge with generative models to enhance accuracy and reasoning capabilities [22]. Unlike conventional generative models that rely solely on pre-trained internal data, RAG dynamically incorporates relevant external information to expand the knowledge context and improve the factual reliability of generated content. This technique is well-suited for tasks that demand large-scale, domain-specific knowledge and complex reasoning, such as disaster monitoring and intelligent question answering.
The RAG workflow typically consists of two main stages: knowledge retrieval and text generation. Upon receiving a user query, the retrieval module searches for relevant information from external sources, such as structured knowledge bases, textual corpora, tabular datasets, or visual feature repositories, based on task-specific requirements. The retrieved information is combined with the original input to construct an augmented context, which is then fed into the generative model. Based on this enriched input, the model performs reasoning and generates the final output [23].
In contrast to traditional generation methods, RAG enables models to overcome the limitations of internal memory and fixed training data. Standard generative models are prone to hallucinations—producing unreliable or incorrect outputs—especially when faced with unfamiliar or out-of-distribution queries [24]. By retrieving task-relevant external content in real time, RAG significantly improves the accuracy, relevance, and reliability of model outputs, making it highly valuable for professional applications in domains such as public safety and transportation management.
Another advantage of RAG is its operational efficiency. By decoupling knowledge retrieval from generation, RAG eliminates the need for continuous fine-tuning across different tasks. This approach not only reduces computational overhead but also enhances adaptability across diverse application scenarios [24]. In real-world deployments, RAG systems can dynamically update their knowledge base with newly retrieved information, enabling rapid adaptation to evolving environments while maintaining high accuracy.
Importantly, RAG is not limited to textual tasks. In the domain of highway flood damage detection, RAG facilitates the fusion of multimodal data—including road surface imagery, historical disaster records, and meteorological information—to support precise inference and decision-making. By integrating enriched external knowledge into the model’s reasoning process, RAG significantly improves the accuracy and efficiency of hazard identification under complex conditions.

3. Intelligent Identification System for Highway Flood Damage and Related Distresses

To meet the requirements of intelligent highway flood damage recognition, this study develops an advanced recognition system that integrates multimodal large models with external knowledge bases using Retrieval-Augmented Generation (RAG) technology [25]. The system is designed to analyze and identify multiple types of flood-related damage in real time. It not only processes on-site visual inputs but also retrieves and reasons over historical knowledge to rapidly and accurately assess damage severity. Furthermore, it delivers targeted early warnings and response recommendations, significantly enhancing the automation and intelligence of highway flood management.

3.1. Workflow

The intelligent recognition system for highway flood damage supports multimodal inputs, including images, timestamps, and geolocation data. It leverages a knowledge retrieval module and a prompt scheduling module to enable end-to-end automation of hazard identification and decision support. The main workflow is structured as follows:
D = R(Q, K)
A = MLLM(P(Q, D), D)
where Q denotes the input image and its associated metadata (e.g., timestamp and geolocation); K denotes the pre-constructed flood damage knowledge base, which contains damage categories, structural features, and related information; R(Q, K) represents the retrieval function that searches knowledge base K for relevant items based on the input Q, returning a set of candidate knowledge entries D; P(Q, D) represents the process of combining the input Q with the retrieved knowledge D and embedding them into a structured prompt template; MLLM denotes the locally deployed Multimodal Large Language Model used for visual question answering and causal reasoning; and A is the final structured recognition output, including the type of damage, location, risk level, supporting rationale, and recommended mitigation measures. Figure 1 illustrates the overall workflow of the proposed highway flood damage recognition system.
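The two-step workflow can be sketched in Python. Keyword overlap here stands in for the system's vector similarity search, all field names are illustrative, and the final step A = MLLM(P(Q, D), D) would be a call to the deployed model:

```python
def R(Q, K, k=3):
    """Retrieval R(Q, K): rank knowledge entries against the query and return
    the top-k candidates. Keyword overlap is a placeholder for the vector
    similarity search used in the real system."""
    q_terms = set(Q["description"].lower().split())
    ranked = sorted(K, key=lambda e: len(q_terms & set(e["text"].lower().split())),
                    reverse=True)
    return ranked[:k]

def P(Q, D):
    """Prompt construction P(Q, D): embed the input and the retrieved
    evidence into a structured prompt template."""
    evidence = "\n".join("- " + e["text"] for e in D)
    return (f"Time: {Q['timestamp']}  Location: {Q['location']}\n"
            f"Observation: {Q['description']}\n"
            f"Reference knowledge:\n{evidence}\n"
            "Report the damage type, risk level, rationale, and measures.")

# Illustrative input and a three-entry toy knowledge base
Q = {"timestamp": "2025-07-14 09:30", "location": "K18+200",
     "description": "subgrade edge fracture with exposed soil"}
K = [{"text": "subgrade collapse shows edge fractures and exposed soil layers"},
     {"text": "tunnel portal water accumulation blocks the outlet"},
     {"text": "pavement potholes appear as reflective puddles"}]
D = R(Q, K, k=2)
prompt = P(Q, D)  # A = MLLM(prompt, D) would then be the model call
```

The retrieval step surfaces the subgrade entry first because it shares the most terms with the observation, mirroring how the vector search grounds the prompt in the most relevant knowledge fragments.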

3.2. System Architecture

To support efficient identification and intelligent diagnosis of flood-induced road hazards, this study proposes and implements an integrated system architecture, as illustrated in Figure 2. The system is functionally divided into four core modules: Input and Perception module, Knowledge Base and Retrieval Augmentation module, Reasoning and Output module, and Human–Machine Interaction module. These components operate collaboratively to form a closed-loop processing chain that integrates multi-source input, semantic analysis, knowledge scheduling, generative reasoning, and feedback evaluation, thereby enabling intelligent recognition of complex flood risks and supporting emergency response decision-making.
The Input and Perception module is responsible for ingesting user-uploaded images, geolocation, timestamp, and other basic metadata. Through the embedded submodule of Spatiotemporal Causal Reasoning on Weather and Map Data, it automatically links urban locations with historical meteorological records to infer potential triggering factors and geographical context. This module primarily performs the task of problem understanding, laying the semantic foundation for constructing query vectors and retrieving relevant knowledge.
The subsequent Knowledge Base and Retrieval Augmentation module organizes structured knowledge into three layers (structure, mechanism, and strategy) by constructing three corresponding sub-repositories: the Attribute Knowledge Base, the Feature Knowledge Base, and the Remediation Knowledge Base. These collectively cover road structural components, typical failure mechanisms, and corresponding mitigation strategies. The system defines eight typical categories of flood damage in advance: subgrade, pavement, bridge, tunnel, slope, retaining wall, drainage facilities, and traffic safety infrastructure. Each category is bound to vectorizable fragments encompassing structural features, visual manifestations, causal factors, and remediation paths. During the retrieval phase, the system generates a query vector Q from the parsed input and matches it against the pre-encoded vector representations K in the knowledge base. A matching function D = R(Q, K) retrieves the Top-k relevant semantic fragments to form an evidence set that grounds and constrains the downstream reasoning process.
The Reasoning and Output Module employs locally deployed MLLMs, such as Qwen-VL-Max, to conduct multi-round semantic reasoning and generate structured outputs. Based on the task context, the system automatically selects a matching prompt template that incorporates role definitions, task objectives, output format, and linguistic style. This template is combined with the Top-k evidence fragments and original user input to form a complete structured prompt, which is fed into the model for multimodal reasoning. The output includes structured recognition results, such as damage type, location, risk level, causal factors and recommended actions to support operational implementation. The results are simultaneously routed to a feedback channel, where domain experts validate outputs, revise knowledge fragments, and augment training samples to support iterative system improvement.
Finally, the Human–Machine Interaction and Evaluation Module offers an intuitive user interface and systematic evaluation mechanisms to ensure the transparency and traceability of model outputs. The system supports bidirectional interaction at both the input and output ends, offering visual displays of inference results, associated prompt templates, and knowledge sources. For complex tasks, it supports multi-turn question-answering, follow-up queries, and error correction. The evaluation submodule assesses system performance across multiple dimensions—including accuracy, response efficiency, and robustness—and feeds the results back to optimize the knowledge base and prompt library, thereby establishing a closed-loop learning cycle. This ensures the system remains adaptive and capable of evolution.
In summary, the system architecture is built around the core principles of multimodal perception, vector-enhanced knowledge retrieval, prompt-based reasoning, and closed-loop feedback, enabling a complete intelligent recognition pipeline—from input understanding to semantic retrieval, large-model reasoning, and system evaluation. The design demonstrates high generalizability and practical value, offering an intelligent solution for flood risk warning and emergency response in road infrastructure scenarios.

3.3. System Implementation

3.3.1. Knowledge Base Construction

To enhance the reasoning capability of the multimodal large model in highway flood damage identification tasks, this study constructs a dedicated Highway Flood Damage Knowledge Base. The knowledge base is structured around three key dimensions, including roadway damage types, disaster mechanisms, and mitigation strategies, enabling the model to perform semantic retrieval and strategy generation during the recognition process. The construction process comprises five stages: source data curation, structured information extraction, semantic vectorization, hierarchical organization, and knowledge integration. The detailed workflow is illustrated in Figure 3.
Step 1: Source Data Preparation
To address the knowledge demands in the domain of highway flood damage, this study collected raw knowledge resources from four representative sources:
  • authoritative industry publications and technical books;
  • high-quality research literature published within the past five years (approximately 50–80 articles);
  • national and regional standards and specifications (e.g., Technical Specifications for Highway Maintenance);
  • historical case reports and on-site inspection documentation.
These data were manually screened and standardized to form the foundational corpus for subsequent knowledge extraction.
Step 2: Structured Extraction
Based on chapter structure, headings, and semantic boundaries, the text was segmented into logically independent knowledge units. Each unit was assigned a specific knowledge attribute to support structured organization.
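A simplified version of this segmentation, splitting at blank-line boundaries rather than performing full heading and semantic-boundary analysis, and tagging each unit with a knowledge attribute, might look like:

```python
import re

def segment_units(text, label):
    """Split source text into logically independent knowledge units at
    blank-line boundaries and tag each with a knowledge attribute (a
    simplification of the segmentation described above)."""
    units = [u.strip() for u in re.split(r"\n\s*\n", text) if u.strip()]
    return [{"attribute": label, "text": u} for u in units]

# Illustrative two-paragraph source fragment
doc = ("Subgrade collapse typically shows edge fractures and exposed soil.\n\n"
       "Mitigation: reinforce the shoulder and restore drainage.")
units = segment_units(doc, "subgrade")
# Each unit now carries its attribute label for structured organization.
```

In the real pipeline, chapter structure and headings would supply finer-grained boundaries, but the output shape (labeled, independent text units) is the same.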
Step 3: Semantic Vectorization
A language model encoded the extracted knowledge fragments into high-dimensional vectors, forming unified semantic representations. These vectors were indexed into the retrieval system to support semantic similarity matching and information recall. The encoding preserved key elements—such as risk category, damage mechanism, and geographic context—thereby enhancing the model’s ability to interpret complex real-world scenarios. In the knowledge-base construction stage, textual content from uploaded documents was converted into vector representations using the BAAI/bge-m3 embedding model provided by SiliconFlow. The model encodes semantic information into 1024-dimensional dense vectors, which are stored and indexed by the FAISS retrieval module to enable efficient similarity search and retrieval-augmented reasoning.
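The normalize-then-inner-product search performed over the 1024-dimensional bge-m3 vectors can be sketched in NumPy; random vectors stand in for real embeddings, and a production index would use `faiss.IndexFlatIP` after `faiss.normalize_L2` to the same effect:

```python
import numpy as np

DIM = 1024  # bge-m3 dense embedding dimension

def normalize(mat):
    """L2-normalise rows so that inner product equals cosine similarity."""
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def top_k(index_vectors, query_vector, k=3):
    """Return indices of the k fragments most similar to the query
    (the search that a flat inner-product index performs)."""
    sims = normalize(index_vectors) @ normalize(query_vector[None, :]).T
    return np.argsort(-sims.ravel())[:k]

rng = np.random.default_rng(42)
fragments = rng.standard_normal((8, DIM))          # stand-ins for encoded units
query = fragments[3] + 0.1 * rng.standard_normal(DIM)  # near-duplicate of unit 3
hits = top_k(fragments, query)
# hits[0] == 3: the perturbed query recalls its source fragment first
```

Because the query is a lightly perturbed copy of fragment 3, cosine similarity ranks that fragment first, illustrating how semantically close knowledge units are recalled.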
Step 4: Hierarchical Organization and Management
The knowledge base adopts a hierarchical structure for organization and management. At the first level, knowledge is divided into three repositories based on functional roles: the Attribute Knowledge Base, the Feature Knowledge Base, and the Remediation Knowledge Base. At the second level, fine-grained labels, such as "Definition," "Type," "Vulnerable Segments," "Engineering Cases," "Risk Levels," "Evaluation Indicators," and "Remediation Plans," are assigned to support controlled retrieval granularity.
Step 5: Knowledge Integration and Update Mechanism
All vectorized fragments are linked to their corresponding labels and integrated into a unified database. The system allows incremental updates and expert review to ensure scalability and timeliness in responding to emergent flood damage events. The detailed structure of the knowledge base is shown in Figure 4.
The constructed Highway Flood Damage Knowledge Base supports both the model’s comprehension and question-answering capabilities for complex structural hazards. It also provides semantic grounding for downstream prompt engineering and retrieval-augmented generation (RAG) modules.
At the content level, the system constructs structured knowledge entries for eight typical categories of highway flood damage, including:
  • Subgrade damage: Structural failures such as collapse, sliding, and suspension; typically manifested as edge fractures, elevation differences, and exposed soil layers.
  • Pavement damage: Surface defects including cracking, potholes, and peeling; characterized by alligator cracks, reflective puddles, and rutting depressions.
  • Bridge damage: Covers pier scouring, deck rupture, and slope instability; visual signs include misaligned bridge structures, displaced guardrails, and scour pits.
  • Tunnel damage: Involves portal water accumulation, lining delamination, roof leakage, and outlet blockage; image features are often distinct.
  • Slope damage: Includes landslides, detachment, and fissures, commonly occurring in mountainous sections; features include brown exposed slopes and fault lines.
  • Retaining wall damage: Exhibits as wall bulging, cracking, forward tilting, or foundation scouring, directly affecting slope stability.
  • Drainage facility damage: Such as culvert collapse, ditch overflow, and manhole cover loss, which may trigger pavement waterlogging and structural erosion.
  • Traffic safety facility damage: Includes fractured guardrails, toppled anti-glare panels, and damaged isolation fences, impairing vehicle protection and road visibility.
To establish a systematic dimension description for highway water damage, we define typical disaster forms based on structural failure modes observed in field inspections and engineering standards. These include:
  • Erosion: Loss of surface material due to water scouring along shoulders, slopes, and drainage ditches.
  • Subsidence: Localized settlement of pavement or subgrade caused by seepage and loss of bearing capacity.
  • Fracture: Cracking or breakage of structural components such as pavements, retaining walls, or culverts.
  • Sliding: Mass movement of slope or embankment due to saturation or instability.
  • Scouring and Collapse: Severe erosion around bridge piers, culverts, or embankment toes leading to partial or complete structural failure.
Each damage entry comprises six core fields: definition, typical damage forms, visual recognition features, auxiliary identification cues, key risk indicators, and structured classification labels. This unified knowledge template is designed to support semantic retrieval, model injection, and multimodal alignment with image annotation results.
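The six-field template for a damage entry might be represented as a dataclass; the example values below are illustrative and are not taken from the actual knowledge base:

```python
from dataclasses import dataclass

@dataclass
class DamageEntry:
    """Unified knowledge template with the six core fields described above."""
    definition: str            # what the damage type is
    typical_forms: list        # typical damage forms
    visual_features: list      # visual recognition features
    auxiliary_cues: list       # auxiliary identification cues
    risk_indicators: list      # key risk indicators
    labels: dict               # structured classification labels

entry = DamageEntry(
    definition="Structural failure of the subgrade such as collapse or sliding",
    typical_forms=["collapse", "sliding", "suspension"],
    visual_features=["edge fractures", "elevation differences", "exposed soil"],
    auxiliary_cues=["recent heavy rainfall", "adjacent drainage overflow"],
    risk_indicators=["settlement rate", "crack width"],
    labels={"category": "subgrade", "repository": "attribute"},
)
```

Keeping every entry in this fixed shape is what makes semantic retrieval, model injection, and alignment with image annotations straightforward: each field can be vectorized or filtered independently.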

3.3.2. Prompt Template Design

To ensure the efficient application of multimodal large models in highway flood damage identification tasks, this study develops a set of prompt templates tailored for hazard detection in real-world road inspection scenarios. The prompt system employs a role-playing and task-driven strategy to guide the model through multi-step and multi-perspective reasoning. The templates are designed around core tasks, including damage classification, structural risk assessment, and environmental factor correlation, with contextual reasoning chains embedded in the prompts.
The prompt content covers multiple dimensions, including visual detail extraction, structural assessment, and contextual risk interpretation, providing a generalizable semantic instruction framework for the model. This significantly enhances the accuracy of identification as well as consistency with professional reasoning logic. The detailed structure of the prompt design is shown in Table 1.
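A prompt template in this style, combining role definition, task decomposition, and output format specification, might look as follows; the wording and placeholder names are illustrative, not the actual templates in Table 1:

```python
# Illustrative role-playing, task-driven prompt template with an
# embedded reasoning chain and a structured output specification.
PROMPT_TEMPLATE = """\
Role: You are a highway maintenance engineer specialising in flood damage.
Task: 1) Extract visual details from the image.
      2) Classify the damage into one of the eight defined categories.
      3) Assess structural risk and correlate environmental factors.
Output format (JSON): {{"type": "...", "location": "...", "risk_level": "...",
                        "rationale": "...", "measures": "..."}}
Context: {context}
Observation: {observation}
"""

prompt = PROMPT_TEMPLATE.format(
    context="Rainfall 120 mm in 24 h; mountainous section K32+400",
    observation="Brown exposed slope with fault lines beside the carriageway",
)
```

The doubled braces keep the JSON schema literal while `{context}` and `{observation}` are filled per request, so the same template serves every inspection scenario.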

3.3.3. Construction of the RAG-Based System

To enhance contextual understanding and knowledge retrieval in flood damage hazard identification, this study develops an intelligent recognition system based on Retrieval-Augmented Generation (RAG). As shown in Figure 5, the system integrates knowledge retrieval and large model inference, supported by a local semantic vector database and multimodal pipelines, enabling coherent image-text reasoning grounded in external knowledge.
On the input side, the system supports the ingestion of image data. When the target image is uploaded, the system activates the prompt module to retrieve a suitable task template, directing the model’s attention to critical structural features and risk-related cues. These prompt templates originate from the prompt engineering module and incorporate role definition, task decomposition, and output format specification.
Next, the image and the associated prompt are jointly input into the Qwen multimodal large model. Through image-text understanding, the model extracts descriptive textual information that serves as a semantic cue for subsequent knowledge retrieval.
In the knowledge retrieval module, the system constructs a RAG-based knowledge base using domain-specific literature on highway flood damage. Original textual content is vectorized via a text embedding model, and the resulting fragments are stored in a local vector database to enable efficient similarity-based retrieval. Upon receiving textual outputs from the model, the system performs a cosine similarity search to retrieve the most relevant knowledge entries, including definitions of damage types, influencing factors, representative cases, and mitigation strategies.
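The fragment-embedding and cosine-search flow can be sketched as follows. The hashed bag-of-words embedder is a deterministic toy stand-in for the real text-embedding model, and the `VectorStore` class is illustrative; only the retrieval logic (embed, store, rank by cosine similarity) mirrors the described pipeline.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashed bag-of-words embedding, standing in for the real
    text-embedding model purely to illustrate the retrieval flow."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

class VectorStore:
    """Minimal local vector store with cosine-similarity search."""
    def __init__(self):
        self.vecs, self.docs = [], []

    def add(self, fragment: str):
        self.vecs.append(embed(fragment))
        self.docs.append(fragment)

    def search(self, query: str, k: int = 3):
        q = embed(query)
        m = np.vstack(self.vecs)
        sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-9)
        order = np.argsort(-sims)[:k]
        return [(self.docs[i], float(sims[i])) for i in order]
```

A query sharing terms with a stored fragment (e.g., "collapse of slope after rainfall") ranks that fragment first, which is the behavior the knowledge retrieval module relies on.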
Subsequently, the retrieved expert knowledge fragments are combined with the initial textual outputs and re-input into the Qwen model for final reasoning and generation. Through multi-round semantic understanding and knowledge augmentation, the system generates logically complete and structurally rigorous outputs, including event classification, risk level estimation, and engineering recommendations.
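The second-pass prompt assembly, which merges the first-round description with the retrieved fragments, can be sketched as a simple string builder; the wording and function name are illustrative.

```python
def augment_prompt(initial_description: str, fragments: list[str]) -> str:
    """Merge the model's first-pass image description with retrieved expert
    knowledge for the second reasoning round (wording is illustrative)."""
    knowledge = "\n".join("- " + f for f in fragments)
    return (
        "Initial visual description:\n" + initial_description + "\n\n"
        "Relevant expert knowledge:\n" + knowledge + "\n\n"
        "Combining both sources, output the event classification, "
        "risk level estimation, and engineering recommendations."
    )
```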

3.3.4. System Visualization and Overall Interface

Figure 6 shows the system interface deployed on the Cherry Studio platform, which supports intelligent flood damage recognition in road infrastructure. The interface includes six key functional modules that correspond to different stages of model execution. In the Model Selection and Configuration Module, users can select models from the Qwen series (e.g., Qwen-VL-Max), with support for hot swapping to meet various scenario needs. The Knowledge Base Construction Module allows users to upload structured texts, which the system automatically parses, encodes, and classifies to build the RAG knowledge base for downstream reasoning. The Prompt Module embeds task-driven prompt templates to guide the model in structured image understanding and task alignment. In the Image and Weather Module, users can upload images and retrieve auxiliary contextual data, such as time, coordinates, and weather conditions, through the Map and MCP weather APIs. The Execution Module uses multimodal fusion reasoning to generate structured recognition outputs, including flood damage type, location, severity, and repair recommendations, with appended confidence scores and knowledge indices. Finally, the System Settings and Control Module provides access to system logs, API records, and control panels for deployment and maintenance.
In summary, the interface realizes a closed-loop process, from image acquisition to structured output, based on image-text interaction and knowledge augmentation. Its user-friendly design, task-oriented modules, and real-time responsiveness make it a practical and dependable tool for supporting frontline road inspection in flood damage scenarios.

4. Experimental Validation

To validate the proposed system’s recognition capability and reasoning performance in practical scenarios, a series of comparative experiments were designed and conducted. A comprehensive evaluation was performed from multiple perspectives, including model accuracy, domain-specific expression, task adaptability, and user feedback. The experiments systematically assessed the feasibility and advantages of the proposed approach through model output quality, semantic reasoning performance, usability, and expert evaluation.

4.1. Experimental Setup and Data Sources

To ensure objectivity and representativeness, a dataset containing 200 road flood damage images was constructed. The dataset covers a range of common disaster types, including slope collapses, pavement subsidence, bridge structural damage, tunnel leakage, and drainage system blockages. The images were collected from field inspections, historical case records, and UAV-based patrol footage. Each image is annotated with structural labels, concise descriptions, location metadata, and a corresponding human-verified reference answer, thereby forming a structured evaluation set [26].
The dataset composition shown in Figure 7 ensures representative coverage of eight major types of highway water damage, supporting the comprehensive evaluation of the proposed recognition system.
Moreover, the core reasoning component of the proposed system is the Qwen-VL-Max multimodal large language model (LLM) developed by Alibaba Cloud. Qwen-VL-Max integrates a Vision Transformer (ViT)-based image encoder with a transformer-based language decoder, enabling end-to-end understanding of visual–textual inputs. It supports image resolutions up to 448 × 448 pixels and generates structured textual outputs through autoregressive decoding. In this work, the model accepts three types of inputs (field images, task-specific prompts, and retrieved knowledge fragments from the RAG database) and performs multimodal reasoning to produce interpretable water-damage recognition results.
The Qwen-VL-Max model contains approximately 7 billion parameters, following the Qwen-7B architecture. It combines a ViT-based visual encoder with a transformer-decoder language head, whose typical hidden and embedding dimensions are around 1024 and 4096, respectively.
For the retrieval-augmented generation (RAG) module, the system employs a vector-based semantic retrieval mechanism implemented with the FAISS (Facebook AI Similarity Search) library, version 1.11.0, using cosine-similarity ranking. Each text fragment in the knowledge base is embedded into a 1536-dimensional vector using text-embedding-ada-002, and the Top-5 most relevant fragments are retrieved to enhance the contextual reasoning of the model.
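The ranking step can be reproduced with plain NumPy: cosine similarity over L2-normalized vectors reduces to an inner product, which is the standard way a FAISS inner-product index is used for cosine search. The sketch below omits FAISS itself and is not the system's actual code.

```python
import numpy as np

def top_k_cosine(kb_vecs: np.ndarray, query: np.ndarray, k: int = 5):
    """Top-k cosine retrieval via inner product over L2-normalized vectors,
    mirroring the usual FAISS inner-product setup. kb_vecs has shape (n, d);
    in the paper d = 1536 (text-embedding-ada-002)."""
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = kb @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```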
All experiments were executed on the same local server (NVIDIA RTX 5060 GPU, 128 GB RAM) to ensure consistency in inference time and computational performance. The response time per sample remained under 3 s, supporting real-time interactive application scenarios.

4.2. Instruction-Level Reasoning Accuracy Evaluation

In complex scenarios, relying solely on image input is often insufficient for achieving comprehensive semantic recognition. Prompt engineering plays a crucial role in guiding the model’s reasoning pathway and reinforcing task objectives. This section presents a comparative analysis of the three model schemes in terms of complex instruction processing, task recognition accuracy, and semantic reasoning chain construction. This evaluation aims to demonstrate the practical benefits of incorporating prompt templates and retrieval-augmented knowledge in improving the model’s understanding and response capabilities.

4.2.1. Sample-Based Comparative Analysis

To provide an intuitive comparison of model performance under real-world conditions, five representative road damage images were selected and processed using the three model configurations. Scheme A represents the baseline multimodal large model operating without any external guidance. Scheme B introduces prompt templates to enhance task alignment, while Scheme C integrates both prompt engineering and retrieval-augmented knowledge to provide semantic context during inference. The output results generated by each configuration were analyzed and compared in detail, with key differences summarized in Table 2.
As illustrated in Table 2, Scheme C consistently generated outputs that were semantically complete and practically actionable across all five representative samples. The responses were judged as correct by domain experts, with an average semantic score of 4.72, significantly surpassing Scheme B (3.76) and Scheme A (2.24).
Key observations include:
The baseline model (Scheme A) primarily generated vague or overly generalized expressions such as “damaged road”, “blurry image”, or “warning issued”, lacking the ability to accurately identify the structure, location, or damage mechanism involved, and failing to provide fine-grained disaster classification.
The prompt-enhanced model (Scheme B) showed moderate improvement in certain cases (e.g., identifying “embankment anomalies”), yet most outputs remained superficial or logically inconsistent, limiting their usefulness in operational decision-making.
The RAG-enhanced model (Scheme C) consistently delivered dual-layered responses that included abnormal structure detection alongside recommendations for mitigation. Moreover, the model demonstrated the ability to incorporate contextual details and provide actionable guidance such as “install retaining wall” or “clean drainage outlet”, showing strong potential for supporting field decision-making.
In terms of response time, the differences among the three schemes were minor. Scheme C maintained an average response time of 3.56 s, which, although slightly higher than Scheme A, remained within acceptable limits and did not affect system responsiveness.

4.2.2. Reasoning Performance of the Model

To evaluate the reasoning capabilities of the models under various instruction complexities, we designed 40 task sets across three categories: single-round, nested, and multi-round instructions. Each task required the model to generate a multi-modal reasoning output based on image content. Examples include:
Single-round instruction:
“Identify the disaster type shown in the image and explain its cause.”
Nested instruction:
“The image indicates a landslide. Please assess whether there is a risk of secondary disasters and provide the reasoning basis.”
Multi-round instruction:
“Identify the disaster→Evaluate its severity level→Propose countermeasures.”
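The multi-round pattern above (identify → evaluate → propose) amounts to feeding each answer into the next instruction as context. A minimal control-flow sketch, using a stub in place of real model calls (the stub's canned answers are purely illustrative):

```python
def run_multi_round(ask, steps):
    """Execute a chained instruction, feeding each answer into the next
    step as context. `ask` is any callable wrapping the multimodal model;
    a stub suffices to show the control flow."""
    answers = []
    for step in steps:
        prompt = step if not answers else step + "\nPrevious findings: " + answers[-1]
        answers.append(ask(prompt))
    return answers

def stub_model(prompt: str) -> str:
    # Canned answers standing in for real model calls.
    if prompt.startswith("Identify"):
        return "Slope collapse"
    if prompt.startswith("Evaluate"):
        return "High severity"
    return "Install retaining wall and clear drainage"
```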
Each task was completed using all three model schemes. Human-annotated reference answers were prepared in advance. A panel of domain experts scored the model outputs based on three criteria: Task Completion Accuracy, Reasoning Validity, and Semantic Coherence. The results are summarized in Table 3.
In terms of task completion accuracy, the baseline model frequently failed to fully execute the given instructions, often omitting essential components such as causal analysis or recommended countermeasures. Instead, its responses were typically limited to superficial descriptions or generic disaster type labels. In contrast, Scheme C demonstrated a clear understanding of the multi-step task requirements and completed all instruction components accurately.
Regarding reasoning validity, Scheme C effectively leveraged retrieved cases and structural mechanism knowledge to construct plausible reasoning chains, for example:
“Cracks were observed at the rear edge of the slope in the image; combined with recent continuous rainfall data, this suggests shear strength reduction leading to slope failure.”
Scheme B occasionally presented incomplete reasoning steps, while Scheme A typically produced oversimplified statements such as “Probably due to heavy rainfall.”
In terms of Semantic Coherence, Scheme C exhibited clear and logically structured outputs that could be directly integrated into technical reports. Scheme B, while reasonably fluent, often lacked causal linkages, and Scheme A generally produced fragmented sentences or template-based expressions, making its output less practical for real-world applications.

4.3. Model Bias and Robustness Analysis

To evaluate the model’s robustness and potential bias, we conducted tests under various unseen conditions, including illumination changes, different weather types (sunny, rainy, foggy), and diverse backgrounds. The recognition accuracy of the RAG-enhanced system remained stable, with less than 5% fluctuation compared with normal conditions. The dataset was balanced across eight categories (20–30 images each), minimizing class imbalance bias. We further tested the model on 40 unseen highway images collected from different provinces and devices, where the accuracy drop was 3.2%, indicating strong generalization. Overall, the results confirm that the proposed system exhibits good robustness and no significant bias toward specific categories or environments.

4.4. User Feedback and Expert Evaluation

To assess the system’s practicality, we invited 10 transportation infrastructure experts and 50 patrol personnel to conduct comprehensive evaluations. The evaluation focused on five dimensions: output professionalism, operational value, text readability, user interaction experience, and system robustness [27].
A double-blind comparative scoring mechanism was adopted. Each participant was presented with the outputs of the three models (Schemes A/B/C) for the same image samples, without revealing the source model of each output to avoid subjective bias. The evaluation included both quantitative scoring and qualitative feedback. The results are summarized in Table 4.
In terms of output professionalism, experts highlighted that Scheme C consistently employed technical terms, such as “structural anomaly in bridge components” or “gully erosion in development phase”, demonstrating higher domain specificity than the colloquial expressions (e.g., “the road is broken”) seen in Schemes A and B.
For Operational Advice Value, Scheme C stood out by offering actionable recommendations, such as “install gabion retaining walls” or “construct blind drains for redirection”. In contrast, Schemes A and B often resorted to vague suggestions like “enhance monitoring” or “pay attention to safety”.
Regarding Robustness, Scheme C maintained an accuracy rate above 85% in challenging samples with occlusions or image blur. However, Scheme A dropped below 30% accuracy and frequently returned either vague statements or incorrect identifications.
In terms of User Interaction, participants appreciated the system’s concise interface and logical information flow. Although Scheme C’s outputs were longer, they were more readable and informative, not perceived as a burden.
Experts unanimously agreed that Scheme C exhibits the potential to partially replace manual pre-screening and assist in drafting technical reports, thereby saving time and enhancing consistency in patrol scenarios. It demonstrates strong practical value for real-world engineering applications.

4.5. Ablation Studies

This section presents an ablation study assessing the contribution of each system component to the multimodal recognition task. Three configurations were compared:
Scheme A (Baseline Model): The image and associated description are directly input into a locally deployed large model (e.g., Qwen-VL-Max), with no external knowledge enhancement.
Scheme B (Prompt-Enhanced Model): Prompt templates are introduced to guide the model’s task orientation and input structure, without integrating a structured knowledge base.
Scheme C (Proposed System): Built on a RAG architecture, this system integrates prompt engineering and external knowledge retrieval. The image, prompt, and semantically retrieved content are jointly fed into the multimodal model for reasoning and generation.
Each of the 200 image samples was tested using all three schemes, with evaluation metrics covering disaster type identification accuracy, structural completeness of the generated outputs, and the occurrence of hallucinations (i.e., incorrect or fabricated content). Structural completeness was assessed based on whether the outputs included key components such as damage description, impact analysis, and mitigation recommendations. Hallucination detection focused on identifying instances where the model produced outputs inconsistent with the visual input or factual knowledge. The comparative results are presented in Table 5.
As shown in Table 5, Scheme C outperforms both Scheme A and Scheme B across all key performance indicators. It achieves an accuracy of 91.5%, marking a 37% improvement over the baseline model, while Scheme B (prompt-only) yields a more modest 13.5% gain, indicating that prompt engineering alone provides limited task alignment without external knowledge integration. In terms of hallucination reduction, Scheme C substantially mitigates erroneous or fabricated outputs, such as misclassifying slope collapses as ponding risks or inventing conditions like abutment settlement, by introducing semantic constraints and factual grounding, reducing the hallucination rate to just 2%. Structural completeness also improves markedly: Scheme C adheres to the expected output format in 96% of cases, supporting its applicability in formal documentation and decision support. Moreover, Scheme C exhibits superior command of domain-specific terminology, accurately using technical expressions such as “shear instability of slope”, “abutment scour”, and “asphalt layer delamination”, in contrast to the vague or generic language produced by the baseline model.
In summary, while prompt engineering alone (Scheme B) enhances task alignment to a certain extent, it fails to address the inherent knowledge gaps and semantic drift issues of large models. In contrast, the RAG-enhanced approach proposed in this study shows significant advantages in structural coherence, accuracy, and domain-specific language generation.
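The structural-completeness criterion used in this ablation can be sketched as a simple section-presence check. The section labels below are illustrative stand-ins for the components named above (damage description, impact analysis, mitigation recommendations); the actual assessment combined such checks with expert review.

```python
# Hypothetical section labels; the paper names the required components
# but not their exact output tags.
REQUIRED_SECTIONS = (
    "[Damage Description]", "[Impact Analysis]", "[Mitigation Recommendations]"
)

def is_structurally_complete(output: str) -> bool:
    """True when every required section label appears in the output."""
    return all(tag in output for tag in REQUIRED_SECTIONS)

def completeness_rate(outputs: list[str]) -> float:
    """Fraction of outputs that contain all required sections."""
    return sum(is_structurally_complete(o) for o in outputs) / len(outputs)
```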

5. Conclusions and Future Work

To improve the capability of highway management agencies in rapid response and intelligent recognition during extreme weather and emergency disaster scenarios, this study explores the integrated application of multimodal large language models (MLLMs) in highway flood damage hazard identification. An intelligent recognition system based on local deployment was developed, integrating large-scale multimodal models with a structured domain knowledge base. Targeting the limitations of conventional approaches, such as delayed response, weak knowledge alignment, and poor semantic generalization, this study proposes a system framework that integrates prompt engineering with retrieval-augmented generation (RAG) techniques. The system was deployed and verified on the Cherry Studio (software version v1.5.8-rc.1) platform, leading to the following key findings:
A multi-level structured knowledge base was constructed based on typical water-damage cases, industry standards, design specifications, and field documentation. This knowledge system encompasses definitions and types, evolutionary features, inspection mechanisms, and restoration methods, and is organized into a property-feature-strategy hierarchy. All knowledge fragments were processed using embedding encoding and semantic tagging, achieving high structural adaptability and semantic retrieval capabilities, which provide explainable and traceable support for model reasoning.
By deploying the Qwen-series multi-modal large models and integrating task-specific prompt templates with image input pathways, this study established an “Image-Prompt-Knowledge” tripartite fusion recognition framework. Enhanced by the RAG mechanism, the system demonstrated improved capabilities in semantic understanding of complex visual inputs, knowledge recall, and accurate output of disaster type, risk assessment, and remediation recommendations. The experiments validated the practical effectiveness of the local deployment and knowledge enhancement paradigm for disaster identification in transportation contexts.
Future research will focus on the co-optimization of prompt templates and knowledge fragments, with the aim of improving retrieval efficiency and enhancing the model’s responsiveness to contextual variation. Additionally, reasoning-chain control mechanisms such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) frameworks will be explored to enhance multi-step reasoning capabilities. To further meet the needs of generation stability and risk explainability, the system will incorporate visualized traceability and feedback modules, thereby enhancing controllability and engineering adaptability of the recognition outputs.

Author Contributions

Conceptualization, H.Z. and J.Z.; methodology, E.L.; software, H.Z.; validation, J.Z. and Z.L.; formal analysis, J.Z.; investigation, J.Z.; resources, J.Z.; data curation, E.L. and B.X.; writing—original draft preparation, J.Z.; writing—review and editing, H.Z., Y.L., C.L. and B.X.; visualization, E.L.; supervision, H.Z.; project administration, J.Z. and Z.L.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Public Security Science and Technology Plan Project, grant number 2024YY5; Special Project of the National Key R&D Program: “Integrated Application of Real-Time Automatic Release of Early Warning Information and Vehicle Warning & Interception” (2024YFC3017104-5), Ministry of Education Foundation for Humanities and Social Sciences (No. 24YJCZH460), Natural Science Foundation of Hunan Province, China (No. 2024JJ7624).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are not publicly available due to privacy and confidentiality restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qi, H.-L.; Tian, W.-P.; Li, J.-C. Regional risk evaluation of flood disasters for the trunk-highway in Shaanxi, China. Int. J. Environ. Res. Public Health 2015, 12, 13861–13870. [Google Scholar] [CrossRef] [PubMed]
  2. Zhou, Y.; Liu, K.; Wang, M. River flood risk assessment for the Chinese road network. Transp. Res. Part D Transp. Environ. 2023, 121, 103818. [Google Scholar] [CrossRef]
  3. Glago, F.J. Flood Disaster Hazards; Causes, Impacts and Management: A State-of-the-Art Review. In Natural Hazards-Impacts, Adjustments and Resilience; IntechOpen: London, UK, 2021. [Google Scholar]
  4. Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
  5. Ahmed, A.; Farhan, M.; Eesaar, H.; Chong, K.T.; Tayara, H. From detection to action: A multimodal AI framework for traffic incident response. Drones 2024, 8, 741. [Google Scholar] [CrossRef]
  6. Abu Tami, M.; Ashqar, H.I.; Elhenawy, M.; Glaser, S.; Rakotonirainy, A. Using multimodal large language models (MLLMs) for automated detection of traffic safety-critical events. Vehicles 2024, 6, 1571–1590. [Google Scholar] [CrossRef]
  7. Zhou, W.; Yang, L.; Zhao, L.; Zhang, R.; Cui, Y.; Huang, H.; Qie, K.; Wang, C. Vision technologies with applications in traffic surveillance systems: A holistic survey. ACM Comput. Surv. 2025, 58, 1–47. [Google Scholar] [CrossRef]
  8. Karim, M.M.; Shi, Y.; Zhang, S.; Wang, B.; Nasri, M.; Wang, Y. Large Language Models and Their Applications in Roadway Safety and Mobility Enhancement: A Comprehensive Review. arXiv 2025, arXiv:2506.06301. [Google Scholar] [CrossRef]
  9. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 42. [Google Scholar] [CrossRef]
  10. Hassan, M.; Kabir, M.E.; Jusoh, M.; An, H.K.; Negnevitsky, M.; Li, C. Large Language Models in transportation: A comprehensive bibliometric analysis of emerging trends, challenges and future research. IEEE Access 2025, 13, 132547–132598. [Google Scholar] [CrossRef]
  11. Alasmary, F.; Al-Ahmadi, S. Sbvqa 2.0: Robust end-to-end speech-based visual question answering for open-ended questions. IEEE Access 2023, 11, 140967–140980. [Google Scholar] [CrossRef]
  12. Chen, Z.; Xu, L.; Zheng, H.; Chen, L.; Tolba, A.; Zhao, L.; Yu, K.; Feng, H. Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models. Comput. Mater. Contin. 2024, 80, 1753. [Google Scholar] [CrossRef]
  13. Huynh, N.D.; Bouadjenek, M.R.; Razzak, I.; Hacid, H.; Aryal, S. SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation. arXiv 2025, arXiv:2503.24164. [Google Scholar]
  14. Ruan, S.; Dong, Y.; Liu, H.; Huang, Y.; Su, H.; Wei, X. Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-Training Models. In Proceedings of the 18th European Conference on Computer Vision ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 309–327. [Google Scholar]
  15. Zhang, Y.; Ji, Z.; Pang, Y.; Han, J.; Li, X. Modality-experts coordinated adaptation for large multimodal models. Sci. China Inf. Sci. 2024, 67, 220107. [Google Scholar] [CrossRef]
  16. Huang, D.; Yan, C.; Li, Q.; Peng, X. From large language models to large multimodal models: A literature review. Appl. Sci. 2024, 14, 5068. [Google Scholar] [CrossRef]
  17. Chaubey, H.K.; Tripathi, G.; Ranjan, R. Comparative analysis of RAG, fine-tuning, and prompt engineering in chatbot development. In Proceedings of the 2024 International Conference on Future Technologies for Smart Society (ICFTSS), Kuala Lumpur, Malaysia, 7–8 August 2024; pp. 169–172. [Google Scholar]
  18. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt Engineering in Large Language Models. In Proceedings of the 4th International Conference on Data Intelligence and Cognitive Informatics (ICDICI 2023), Tirunelveli, India, 27–28 June 2023; Springer: Cham, Switzerland, 2023; pp. 387–402. [Google Scholar]
  19. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  20. Cheng, H.; Zhang, R.; Zhang, R.; Li, Y.; Lei, Y.; Zhang, W. Intelligent Detection and Description of Foreign Object Debris on Airport Pavements via Enhanced YOLOv7 and GPT-Based Prompt Engineering. Sensors 2025, 25, 5116. [Google Scholar] [CrossRef] [PubMed]
  21. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  22. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. In Proceedings of the 12th CCF Conference, BigData 2024, Qingdao, China, 9–11 August 2024; Springer: Cham, Switzerland, 2024; pp. 102–120. [Google Scholar]
  23. Jung, T.; Joe, I. An Intelligent Docent System with a Small Large Language Model (sLLM) Based on Retrieval-Augmented Generation (RAG). Appl. Sci. 2025, 15, 9398. [Google Scholar] [CrossRef]
  24. Tyndall, E.; Wagner, T.; Gayheart, C.; Some, A.; Langhals, B. Feasibility Evaluation of Secure Offline Large Language Models with Retrieval-Augmented Generation for CPU-Only Inference. Information 2025, 16, 744. [Google Scholar] [CrossRef]
  25. Areerob, K.; Nguyen, V.Q.; Li, X.; Inadomi, S.; Shimada, T.; Kanasaki, H.; Wang, Z.; Suganuma, M.; Nagatani, K.; Chun, P.j. Multimodal artificial intelligence approaches using large language models for expert-level landslide image analysis. Comput. Aided Civ. Infrastruct. Eng. 2025, 40, 2900–2921. [Google Scholar] [CrossRef]
  26. Kadiyala, L.A.; Mermer, O.; Samuel, D.J.; Sermet, Y.; Demir, I. The implementation of multimodal large language models for hydrological applications: A comparative study of GPT-4 vision, gemini, LLaVa, and multimodal-GPT. Hydrology 2024, 11, 148. [Google Scholar] [CrossRef]
  27. Wang, L.; Liu, X.; Liu, Y.; Li, H.; Liu, J.; Yang, L. Multimodal knowledge graph construction for risk identification in water diversion projects. J. Hydrol. 2024, 635, 131155. [Google Scholar] [CrossRef]
Figure 1. Workflow of the Proposed Highway Water-Damage Recognition System.
Figure 2. Overall Research Framework.
Figure 3. Workflow of Knowledge Base Construction.
Figure 4. Detailed Design of the Highway Flood Damage Knowledge Base.
Figure 5. Workflow of the RAG-Based Recognition System.
Figure 6. System Visualization Interface.
Figure 7. Distribution of Highway Water Damage Dataset (8 Categories).
Table 1. Detailed Design of Prompt Templates.
Assigned Role: You are an expert assistant in intelligent identification of road flood damage, specializing in image content understanding and professional recognition of flood-related events. You should possess knowledge of civil engineering structures, hydrodynamic erosion, geological mechanisms, and disaster latency, along with cross-modal reasoning capabilities. Users will upload road inspection images, which may contain metadata such as capture time, geographic coordinates, or embedded watermarks. Your task is to analyze the image content, identify the type of flood damage, and infer the disaster cause and risk level by integrating external knowledge. Using the Amap Map Cloud Platform (MCP), first extract the GPS coordinates from the image, ensuring the format is longitude (E) first and latitude (N) second. If the coordinates are already in Amap’s format, no conversion is required; proceed directly to location identification. Convert the extracted coordinates to Amap-compatible format, e.g., (106.79415, 39.655048). Then, use Amap MCP to retrieve the corresponding administrative division address. If the image does not contain geographic coordinates, fill in “Location: None”. Next, extract the capture date from the image. Based on the city name obtained from the administrative address, query the weather forecast for that city on the capture date and for the following three days using the Amap MCP weather service.
Primary Core Task: Based on the image content, please complete the following tasks:
(1) Determine whether a flood damage event is present in the image.
If such an event exists, classify it into one of the following eight typical categories of road flood damage:
  • Roadbed damage
  • Pavement damage
  • Bridge damage
  • Tunnel damage
  • Slope damage
  • Retaining wall damage
  • Drainage facility damage
  • Traffic safety facility damage
(2) For each identified type of flood damage, provide a detailed analysis according to the following four dimensions:
Dimension→Description Requirement
Typical Disaster Forms→Clearly specify the structural failure mode, such as erosion, subsidence, fracture, sliding, etc.
Visual Manifestations→Guide the model to focus on visual details, such as surface abrasion, water reflection, displacement, fault lines, etc.
Auxiliary Identification Cues→Instruct the model to examine whether the structure shows deformation or deviates from the design alignment.
Risk Indicators→Evaluate the disaster likelihood and potential risk evolution based on engineering context.
Spatiotemporal Causal Reasoning Task: (1) Retrieve historical weather records (not limited to the previous day): utilize the knowledge base or external APIs (e.g., historical meteorological interfaces) to check whether heavy rainfall or consecutive precipitation occurred within the past 1–7 days;
Examine whether any extreme weather events occurred within the past 30 days; Assess the likelihood of indirect contributing factors, such as water accumulation, prolonged subgrade softening, or pre-existing erosion.
(2) Infer causal relationships: Determine whether the current flood damage is potentially linked to prior weather events with delayed effects. Even if the image was captured on a clear day, the model should consider the possible structural impact of earlier meteorological conditions.
Final Output Format (Structured + Explanatory): Please adhere strictly to the specified output format. Do not make any unauthorized modifications.
[Hazard Type]: Slope Flood Damage
[Subtype]: Landslide + Drainage Ditch Blockage
[Visual Features]: Mud flow traces at slope toe, vegetation stripping, clearly defined sliding interface
[Geographical Location]: Jinshui District, Zhengzhou City, Henan Province
[Structural Location]: Right-side upper slope
[Disaster Mechanism]: Concentrated surface runoff and failure of drainage facilities causing saturated shear failure
[Associated Weather]: Continuous heavy rainfall on June 3 and 6 led to delayed water accumulation and slope instability. Weather forecasts show showers and light rain on July 3 and 4, and moderate rain on July 5 and 6. The prolonged rainfall likely caused water retention and long-term weakening of the foundation, further eroding and softening the subgrade, which compromised structural stability and created conditions for delayed flood damage.
[Risk Level]: Medium-to-high; closure and emergency treatment recommended
[Recognition Confidence]: 0.93
[Matched Case]: SP_BS_004 (Similarity: 89%)
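Because the output format is fixed, downstream systems can parse it mechanically. The following is a minimal sketch of such a parser; the field names follow the example above, but the parser itself and the abbreviated sample text are illustrative assumptions, not part of the published system.

```python
import re

# Abbreviated sample of the structured "[Field]: value" output shown above.
SAMPLE_OUTPUT = """\
[Hazard Type]: Slope Flood Damage
[Subtype]: Landslide + Drainage Ditch Blockage
[Risk Level]: Medium-to-high; closure and emergency treatment recommended
[Recognition Confidence]: 0.93
[Matched Case]: SP_BS_004 (Similarity: 89%)
"""


def parse_structured_output(text: str) -> dict:
    """Extract "[Field]: value" pairs, one per line, into a dict."""
    fields = dict(re.findall(r"^\[([^\]]+)\]:\s*(.+)$", text, flags=re.M))
    # Coerce the confidence score so downstream thresholds can use it directly.
    if "Recognition Confidence" in fields:
        fields["Recognition Confidence"] = float(fields["Recognition Confidence"])
    return fields


result = parse_structured_output(SAMPLE_OUTPUT)
```

A strict, line-anchored format like this is what makes the "do not make unauthorized modifications" instruction enforceable: any deviation simply fails to parse.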
Model Usage Notes
The output should be structured and use accurate terminology in accordance with civil engineering and transportation industry standards.
In cases of blurry images or insufficient information, indicate “Uncertain” and recommend supplementary inputs.
Avoid subjective speculation; reasoning must be evidence-based, grounded in image content and domain knowledge.
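The classification categories, analysis dimensions, and usage notes above can be assembled into a single task prompt. The sketch below shows one way to do this; the exact template wording is an assumption, not the production prompt used with Qwen-VL-Max.

```python
# Eight damage categories and four analysis dimensions, as defined in the
# task description above.
CATEGORIES = [
    "Roadbed damage", "Pavement damage", "Bridge damage", "Tunnel damage",
    "Slope damage", "Retaining wall damage", "Drainage facility damage",
    "Traffic safety facility damage",
]

DIMENSIONS = {
    "Typical Disaster Forms":
        "specify the structural failure mode (erosion, subsidence, fracture, sliding)",
    "Visual Manifestations":
        "focus on visual details (surface abrasion, water reflection, displacement, fault lines)",
    "Auxiliary Identification Cues":
        "check for deformation or deviation from the design alignment",
    "Risk Indicators":
        "evaluate disaster likelihood and risk evolution from engineering context",
}


def build_prompt() -> str:
    """Assemble the classification-and-analysis prompt from the templates above."""
    lines = [
        "Determine whether a flood damage event is present in the image.",
        "If so, classify it into one of the following categories:",
    ]
    lines += [f"  - {c}" for c in CATEGORIES]
    lines.append("For each identified type, analyze the following dimensions:")
    lines += [f"  - {name}: {req}" for name, req in DIMENSIONS.items()]
    lines.append('If the image is blurry or information is insufficient, answer "Uncertain".')
    return "\n".join(lines)


prompt = build_prompt()
```

Keeping the categories and dimensions as data rather than hard-coded prose makes the template easy to extend when new damage types are added to the knowledge base.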
Table 2. Comparison of Model Outputs Across Representative Samples.
| Sample ID | Damage Type | Scheme A Output | Scheme B Output | Scheme C Output | A Correct? | B Correct? | C Correct? | Semantic Score (A/B/C, out of 5) | Response Time (A/B/C, s) |
| 001 | Bridge Collapse | Bridge Collapse | Bridge Collapse | Bridge Collapse | × | × | ✓ | 2.3/3.6/4.8 | 3.2/3.4/3.6 |
| 002 | Roadbed Erosion | Roadbed Erosion | Roadbed Erosion | Roadbed Erosion | × | ✓ | ✓ | 2.7/4.2/4.6 | 3.0/3.3/3.5 |
| 003 | Slope Collapse | Slope Collapse | Slope Collapse | Slope Collapse | × | × | ✓ | 1.9/4.0/4.7 | 2.5/3.0/3.3 |
| 004 | Culvert Flooding | Culvert Flooding | Culvert Flooding | Culvert Flooding | × | × | ✓ | 2.5/4.1/4.9 | 3.4/3.4/3.8 |
| 005 | Drainage Failure | Drainage Failure | Drainage Failure | Drainage Failure | × | × | ✓ | 1.8/2.9/4.6 | 2.6/3.1/3.6 |
Table 3. Reasoning Performance of Different Model Schemes across Task Types (Average Score/5-point scale).
| Model Scheme | Task Completion Accuracy | Reasoning Validity | Semantic Coherence |
| Scheme A | 2.3 | 2.0 | 2.1 |
| Scheme B | 3.7 | 3.2 | 3.3 |
| Scheme C | 4.6 | 4.8 | 4.7 |
Table 4. Comprehensive Evaluation Scores from Users and Experts (Average Score/5-point Scale).
| Evaluation Dimension | Scheme A (Baseline Model) | Scheme B (Prompt-Enhanced) | Scheme C (RAG-Enhanced) |
| Output Professionalism | 2.3 | 3.5 | 4.7 |
| Value of Operational Advice | 2.0 | 3.3 | 4.8 |
| Text Readability | 3.1 | 3.8 | 4.5 |
| Interaction Experience (Users) | 4.0 | 4.1 | 4.3 |
| Robustness (Experts) | 2.4 | 3.2 | 4.5 |
Table 5. Instruction Response Performance Comparison.
| Metric | Scheme A (Baseline Model) | Scheme B (Prompt-Enhanced) | Scheme C (RAG-Augmented) |
| Accuracy of Disaster Type Identification | 54.5% | 68.0% | 91.5% |
| Hallucination Rate | 21.0% | 9.5% | 2.0% |
| Output Structural Completeness (Three-Section Format) | 35% | 61% | 96% |
| Proportion of Domain-Specific Terminology | 28% | 55% | 93% |
