Article

Application of LMM-Derived Prompt-Based AIGC in Low-Altitude Drone-Based Concrete Crack Monitoring

by Shijun Pan 1,2,*, Zhun Fan 1,*, Keisuke Yoshida 3, Shujia Qin 2, Takashi Kojima 4 and Satoshi Nishiyama 3
1 Shenzhen Institute for Advanced Study, UESTC, University of Electronic Science and Technology of China, Shenzhen 518110, China
2 Shenzhen Academy of Robotics, Shenzhen 518000, China
3 Graduate School of Environmental and Life Science, Okayama University, Okayama 700-8558, Japan
4 TOKEN C.E.E. Consultants Co., Ltd., Tokyo 170-0004, Japan
* Authors to whom correspondence should be addressed.
Drones 2025, 9(9), 660; https://doi.org/10.3390/drones9090660
Submission received: 1 August 2025 / Revised: 12 September 2025 / Accepted: 15 September 2025 / Published: 21 September 2025

Abstract

In recent years, large multimodal models (LMMs), such as ChatGPT 4o and DeepSeek R1—artificial intelligence systems capable of multimodal (e.g., image and text) human–computer interaction—have gained traction in industrial and civil engineering applications. Concurrently, insufficient real-world drone-view data (specifically close-distance, high-resolution imagery) for civil engineering scenarios has heightened the importance of AI-generated content (AIGC), or synthetic data, as supplementary input. AIGC is typically produced via text-to-image generative models (e.g., Stable Diffusion, DALL-E) guided by user-defined prompts. This study leverages LMMs to interpret key parameters for drone-based image generation (e.g., color, texture, scene composition, photographic style) and applies prompt engineering to systematize these parameters. The resulting LMM-generated prompts were used to synthesize training data for a You Only Look Once version 8 segmentation model (YOLOv8-seg). To address the need for detailed crack-distribution mapping in low-altitude drone-based monitoring, the trained YOLOv8-seg model was evaluated on close-distance crack benchmark datasets. The experimental results confirm that LMM-prompted AIGC is a viable supplement for low-altitude drone crack monitoring, achieving >80% classification accuracy (images with/without cracks) at a confidence threshold of 0.5.

1. Introduction

A large multimodal model (LMM), an extension of the large language model (LLM), is a type of generative artificial intelligence that supports human–computer communication through natural language and other modalities (e.g., images and sounds). In recent years, LMMs (e.g., ChatGPT from OpenAI and the Tencent-interfaced DeepSeek) [1,2] have been applied in different engineering fields, including remote sensing and civil engineering [3,4,5]. When applying an LMM, the way of "communicating with computers" differs from ordinary human conversation and requires more detailed and specific keywords (prompts) to help the LMM understand the intended content. The practice of crafting such inputs to obtain the desired output is termed "prompt engineering".
Besides LMMs, image generation models (e.g., DALL-E and Stable Diffusion) have been applied in practical projects and have made it possible to generate images (i.e., AI-generated content, AIGC, or synthetic data) for data augmentation, even when researchers have never encountered such data before [6,7,8,9]. Features that AIGC shares with real-world data, such as color, shape, and texture, enable this application. Nevertheless, the generated data sometimes contain implausible features that do not match real-world standards and rules (i.e., hallucinations), and avoiding such outputs has become an important research theme in AI-related fields [10,11,12].
Considering the potential in civil engineering, the authors have explored applying this new technology in practical drone-based applications (e.g., waste detection from the drone view) with the assistance of You Only Look Once (YOLO) [6,7,8,9]. Nevertheless, more experiments are needed to apply AIGC effectively, wider application scopes should be considered, and their feasibility must be demonstrated.
Until now, crack detection and segmentation have evolved significantly, from manual inspections and traditional image processing techniques to advanced deep learning approaches utilizing both real-world and synthetic data. Manual methods are time-consuming and costly, while early automated techniques such as edge detection and thresholding struggle with environmental variability such as shadows, dust, and low contrast. Deep learning models, particularly CNNs (e.g., U-Net), now dominate the field, offering high accuracy in pixel-level segmentation [13,14,15]. To address scarce real-world data, synthetic data generated via the CutMix technique [16,17] has been used to improve the generalization capability of CNN classifiers. Here, it is repurposed to generate a dataset that mixes a public labeled dataset with the background information of the target dataset. Key unresolved challenges include computational inefficiency, dataset scarcity, dataset bias, and insufficient photorealism in synthetic data.
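For illustration only, the following minimal Python sketch (using NumPy and PIL; the file paths and beta parameter are placeholders, not taken from the cited studies) shows the basic CutMix idea of pasting a patch from a labeled crack image onto a background image from the target domain:

# Minimal CutMix-style sketch (illustrative, not the pipeline of [16,17]): paste a random
# patch from a labeled crack image onto a target-domain background image so that the
# synthetic sample mixes both data sources.
import numpy as np
from PIL import Image

def cutmix_pair(crack_path, background_path, beta=1.0, seed=None):
    rng = np.random.default_rng(seed)
    crack = np.asarray(Image.open(crack_path).convert("RGB"))
    bg = np.asarray(Image.open(background_path).convert("RGB").resize(crack.shape[1::-1]))

    h, w = crack.shape[:2]
    lam = rng.beta(beta, beta)                   # mixing ratio, as in the CutMix paper
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)    # random patch centre
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)

    mixed = bg.copy()
    mixed[y1:y2, x1:x2] = crack[y1:y2, x1:x2]    # crack patch pasted onto the background
    return Image.fromarray(mixed)

In segmentation use, the corresponding label mask would be cut and pasted with the same patch coordinates.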
In previous studies [18,19,20,21,22,23], as shown in Figure 1a (upper), the drone-based crack detection process always requires close-distance on-site crack image data collection to train the YOLO model. Correspondingly, manual annotation requires engineers, researchers, and operators to go to the field to observe cracks and annotate them for correct model training. Figure 1a (upper) includes eight general steps, as follows:
  • Step 1 (a, b): Engineers, researchers, and operators go to the field to observe and confirm the crack distribution situation (on-site close-distance, drone view).
  • Step 2 (a, b): Collect crack image data (on-site close-distance, drone view).
  • Step 3: Annotate the crack image data.
  • Step 4: Train the YOLO model using paired crack images and annotations.
  • Step 5: Save the trained YOLO model.
  • Step 6: Transfer the drone-view data and run inference using the trained YOLO model.
  • Step 7: After inference, engineers can determine the location and size of cracks, and consider corresponding repairs.
  • Step 8: Researchers can save crack distribution maps for further research.
The novel process in this work, as shown in the lower part of Figure 1a, includes two steps that differ from the previous work, as follows:
  • Step 2 (a): Apply Stable Diffusion to generate images. During this process, systematic prompt application should be discussed, and reasonable prompts can reduce on-site work.
  • Step 3: Use the trained model to annotate the generated crack image data.
Correspondingly, the authors aim to demonstrate the feasibility of the novel process in this research. As shown in Figure 1b, there are three sections as follows:
  • Section 1: Train YOLOv8 on the ‘Visible Crack Dataset’ [24] (MIT License) as the annotation generator, and use low-altitude drone-view crack monitoring datasets, namely the Concrete Crack Image for Classification Dataset (CCI4CD) [25,26] (CC BY 4.0 License) and the Concrete Crack Segmentation Dataset (CCSD) [27] (CC BY-SA 4.0 License), for evaluation.
  • Section 2: Use the LMM (DeepSeek R1 with MIT license) to generate prompts for Stable Diffusion (v1.4 with CreativeML Open RAIL-M License) and collect the AIGC dataset (images and annotations).
  • Section 3: Train on the AIGC dataset and use CCI4CD and CCSD for evaluation.
The authors apply the LMM to identify the scope of image generation (i.e., color, texture, scene, and photography style) and use prompt engineering to summarize and categorize these parameters. With the assistance of LMM-generated prompts for AIGC, the synthetic data can be used as input for training the YOLOv8-seg model [28]. Based on the requirements of low-altitude crack monitoring, the authors tested the trained YOLOv8-seg models (i.e., Trained Model-1 and -2) on crack benchmark datasets (CCI4CD and CCSD) as a supplement.

2. Methods

2.1. YOLOv8 (Visible Cracks)

You Only Look Once (YOLO) is a one-stage model known for its fast inference speed and comparatively high detection accuracy. The YOLOv8 series comprises five scaled variants (nano-n, small-s, medium-m, large-l, and extra large-x), progressing from smallest to largest in size and complexity. It supports three task modes: detection, segmentation, and classification, each applicable to training, validation, and inference. For crack segmentation (i.e., precise localization and background separation), the YOLOv8-seg model was selected, and the Visible Crack Dataset (MIT License) [24] was used as its training data. As shown in Table 1 and Table 2, the authors used the following parameter settings to train YOLOv8x-seg, referred to as "Trained Model-1". Figure 2 depicts the training and validation process, with early stopping applied partway through. Trained Model-1 achieved 0.6 mAP50 (M) and 0.3 mAP50-95 (M) on the validation set, respectively.
Figure 3 shows the label distribution statistics of the training data, including the number of instances, central point locations (x-y), and crack sizes (width-height). Figure 4 shows samples of the batch images used for validation, which include different types of cracks, i.e., longitudinal, vertical, and alligator cracks. Among these types, alligator cracks are the most difficult to identify or annotate with the instance segmentation approach; especially when the cracks are close together, the instance segmentation result can only extract the whole area without separating the individual cracks. After training, Trained Model-1 was saved as a best.pt file to be used as the annotation generator.
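For reference, a training call matching the settings in Tables 1 and 2 could look roughly as follows with the Ultralytics Python API; the dataset YAML path is an assumed placeholder rather than a file named in this paper:

# Sketch of training "Trained Model-1" with the Ultralytics API, using the settings in Table 2.
from ultralytics import YOLO

model = YOLO("yolov8x-seg.pt")          # segmentation variant, extra-large scale
results = model.train(
    data="visible_cracks.yaml",         # assumed YAML describing the Visible Crack Dataset
    imgsz=640,                          # Imgsz, Table 2
    epochs=500,                         # Epochs, Table 2
    batch=16,                           # Batch size, Table 2
    patience=100,                       # early-stopping patience, Table 2
    optimizer="auto",                   # Optimizer, Table 2
    lr0=0.01,                           # Lr0, Table 2
    momentum=0.937,                     # Momentum, Table 2
)
# The best weights (best.pt) are saved automatically under the run directory and are
# later reused as the annotation generator.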

2.2. LMM

After training the annotation generator (YOLOv8-seg), large multimodal models (LMMs) were employed to analyze image attributes such as color, texture, scene, and photography style. As shown in Figure 5, the authors used YuanBao [2] from Tencent as the LMM in this research. Before the conversation with the LMM, the user should prepare one image and a corresponding question (prompt) about that image.
It is preferable to provide an image without any cracks, which also makes the necessary background information available. The more detailed and theme-related the question, the closer the answer is to what the user actually needs. In this research, the authors provided one crack-free image selected from the negative part of CCI4CD, with the prompt "Firstly, please tell me what is this picture, from texture and color perspective?".
If the "deep think" option is chosen, the LMM considers the purpose of the question more thoroughly, which also leads the user to provide more detailed information. In this mode, the LMM poses and answers intermediate questions on its own, which helps the user refine the answer. Because "deep think" exposes the step-by-step reasoning process, the user can also draw inspiration from it and improve the quality of subsequent questions.
As shown in Table 3, after "deep think", four main prompt-related sections were generated, categorized as follows: object (straight cracks perpendicular to road), texture (character, surface quality, likely material, detail and visual effect), color (hue, tone, saturation and variation), and summary. The scene (a coated or plastered wall) and photography style (clean, neutral, and minimalist appearance) are included in the summary. Table 3 was derived from the follow-up prompt "Please summarize and categorize the mentioned content using object, texture, color and summary sections in one table". The content of Table 3 reflects the understanding of the LMM shown in Figure 5, but it depends on the model version, whether the model is open or closed source, and the dataset used for training.
If a different LMM were used, similar prompt output could not be guaranteed. Therefore, the main sections should be confirmed first, which constrains the model to provide information at least within the specified scope. The user can adjust the content for easier understanding if something unusual appears (e.g., hallucinations, a common class of error in large language models), after which the prompts can be applied directly to image generation.
Depending on the generation results, negative prompts can be considered as a way to improve image generation accuracy. If the images generated from the provided prompts already match the requirements, negative prompts are not needed.
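The authors conducted this dialogue through the YuanBao web interface; purely as an illustration, a similar exchange could be scripted against a hypothetical OpenAI-compatible multimodal endpoint as sketched below (the endpoint URL, model name, API key, and image file name are all placeholders, not values used in this study):

# Hypothetical sketch of asking an LMM the texture/color question programmatically.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://example-lmm-endpoint/v1", api_key="YOUR_KEY")

with open("cci4cd_negative_sample.jpg", "rb") as f:           # crack-free reference image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="example-multimodal-model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Firstly, please tell me what is this picture, "
                     "from texture and color perspective?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)   # answer to be summarized into Table 3-style sections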

2.3. Stable Diffusion

As shown in Table 4, the authors applied the parameter settings of the Stable Diffusion WebUI txt2img function to generate the images. The Stable Diffusion checkpoint used for image generation is sd-v1-4.ckpt [fe4efff1e1]. Figure 6a shows a generated crack image sample with vertical, horizontal and alligator cracks. The background of the cracks was correctly generated and derived from the prompts. At the same time, the crack distributions are random and the crack sizes are not uniform, which increases the diversity available for img2img image generation.
As shown in Table 5, the authors adjusted several parameter settings from Table 4; as shown in Figure 6b, a grid image containing 800 individual images was generated under the guidance of the prompts in Table 3, and all individual images were saved. In the generated grid image of Figure 6b, both the background and the cracks are well rendered.
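The generation itself was run in the Stable Diffusion WebUI; as a rough, approximate sketch of comparable settings, a diffusers-based script could look as follows (the prompt string is a condensed paraphrase of Table 3, and the output file names are placeholders):

# Rough diffusers-based approximation of the txt2img settings in Table 4; the authors
# actually used the Stable Diffusion WebUI with the sd-v1-4.ckpt checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Prompt assembled from the Object/Texture/Color/Summary sections of Table 3.
prompt = ("straight cracks perpendicular to road, 3-10 mm width, clean edges, "
          "rough irregular light gray coated or plastered wall, uniform tone, "
          "clean neutral minimalist appearance")

for i in range(800):                        # 800 individual images, as in Figure 6b
    image = pipe(prompt,
                 num_inference_steps=150,   # Sampling steps, Table 4
                 guidance_scale=7.0,        # CFG Scale default, Table 5
                 height=512, width=512).images[0]
    image.save(f"aigc_crack_{i:03d}.png")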

2.4. YOLOv8 (ConcreteCrackImage4Classification, CCI4C)

Based on Trained Model-1 and the parameter settings shown in Table 6, the authors ran inference on the images in Figure 6b and saved the labels. As shown in Figure 7, three samples were chosen for clear observation of the annotation (from left to right: original, with bounding box, and without bounding box, respectively).
Figure 8 provides a general view of the annotation (mask mode) of the grid image (Figure 6b). The authors used the parameter settings shown in Table 7 together with the AIGC images and annotations to train the model (i.e., Trained Model-2).
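A sketch of this auto-annotation and retraining step with the Ultralytics API is given below; the directory names and the dataset YAML are assumed placeholders:

# Sketch of the automatic annotation step (Table 6) with Trained Model-1, followed by
# training "Trained Model-2" on the AIGC images (Table 7).
from ultralytics import YOLO

# 1. Use Trained Model-1 as the annotation generator on the generated images.
annotator = YOLO("trained_model_1/best.pt")
annotator.predict(
    source="aigc_images/",   # the 800 Stable Diffusion outputs
    imgsz=512,               # Imgsz, Table 6
    conf=0.01,               # conf, Table 6 (low threshold, favours recall)
    save_txt=True,           # saves YOLO-format labels under the run directory
)

# 2. Train Trained Model-2 on the AIGC images paired with the generated labels.
model2 = YOLO("yolov8l-seg.pt")
model2.train(
    data="sd_cracks.yaml",                            # assumed YAML pairing images and labels
    imgsz=512, epochs=500, batch=32, patience=100,    # Table 7
    optimizer="auto", lr0=0.01, momentum=0.937,       # Table 7
)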
As shown in Figure 9, Trained Model-2 maintained mAP50 (M) and mAP50-95 (M) values of around 0.06 and 0.015, respectively. As shown in Figure 10, the crack instances are mainly concentrated below 0.2 in normalized width and height, i.e., 512 pixels × 0.2 = 102.4 pixels. Compared with the instances in Trained Model-1, the instances in Trained Model-2 are much smaller. In other words, for the task of segmenting small cracks in detail, Trained Model-2 has an advantage over Trained Model-1.

3. Results

After collecting the trained models, the authors used open-source datasets to evaluate them individually. The first open-source dataset, "Concrete Crack Images for Classification" (CCI4CD) [25,26], includes two groups, positive and negative (with and without cracks), each containing 20,000 images. Each image has a uniform size of 227 × 227 pixels. As shown in Figure 11, Figure 12, Figure 13 and Figure 14, the authors applied Trained Model-1 and -2 to positive and negative crack images using different confidence values (0.1, 0.25 and 0.5). If any crack was detected in a positive crack image, the image was classified as correct, without considering the detected crack area. If any crack was detected in a negative crack image, the image was classified as incorrect.
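A minimal sketch of this image-level classification rule is shown below (it assumes the CCI4CD images are stored as JPEG files in per-class folders; the paths are placeholders):

# Image-level classification rule: an image counts as "crack" if the segmentation model
# returns at least one detection at the given confidence threshold.
from pathlib import Path
from ultralytics import YOLO

def classification_accuracy(model_path, image_dir, has_cracks, conf):
    model = YOLO(model_path)
    images = sorted(Path(image_dir).glob("*.jpg"))
    correct = 0
    for img in images:
        result = model.predict(source=str(img), conf=conf, verbose=False)[0]
        detected = len(result.boxes) > 0          # any crack instance found?
        correct += int(detected == has_cracks)    # positive set: detection is correct;
                                                  # negative set: no detection is correct
    return correct / len(images)

# Example: Trained Model-2 on the CCI4CD positive split at confidence 0.5.
acc = classification_accuracy("trained_model_2/best.pt", "CCI4CD/Positive", True, 0.5)
print(f"accuracy: {acc:.3f}")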
Notably, as shown in Figure 12 and Figure 14, negative crack image 00361 was detected as positive by both Trained Model-1 and -2 at confidence values of 0.1, 0.25 and 0.5. From a visual perspective, some parts of image 00361 appear to be shallow cracks. In other words, the standard for classifying cracks applied by Trained Model-1 and -2 is stricter than that of the CCI4CD annotators.
Table 8 and Table 9 show the accuracy results of classifying positive or negative images using Trained Model-1 and -2, respectively. Trained Model-1 performs well on both positive and negative crack images; its accuracy is stable, with all values above 0.95. In contrast, although Trained Model-2 shows some imbalance between positive and negative detections as the confidence value changes, its accuracy remains above 0.8 for both positive and negative crack image classification.
The second open-source dataset, the "Concrete Crack Segmentation Dataset" (CCSD) [27], released under the CC BY-SA 4.0 license, includes 458 images, each with a uniform size of 4032 × 3024 pixels. As shown in Figure 15, three samples were chosen for comparison between Trained Model-1 and -2. As shown in Figure 16 and Table 10, the Recall value is derived from the TP and FN entries of the confusion matrix, i.e., Recall = TP/(TP + FN). Both models achieve Recall values above 0.8, and for samples No. 19 and 30, Trained Model-2 achieves a higher Recall than Trained Model-1.
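As a small worked sketch, the pixel-wise Recall can be computed from a predicted and a ground-truth binary mask as follows (the mask file names are placeholders):

# Recall = TP / (TP + FN), comparing a predicted binary crack mask with the ground truth.
import numpy as np
from PIL import Image

def pixel_recall(pred_mask_path, true_mask_path, threshold=128):
    pred = np.asarray(Image.open(pred_mask_path).convert("L")) >= threshold
    true = np.asarray(Image.open(true_mask_path).convert("L")) >= threshold
    tp = np.logical_and(pred, true).sum()         # crack pixels correctly predicted
    fn = np.logical_and(~pred, true).sum()        # crack pixels missed by the model
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(pixel_recall("pred_19.png", "gt_19.png"))   # e.g., sample No. 19 in Table 10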

4. Discussion

The authors applied LMMs in this research to interpret specific images and confirmed the scope of image generation, which included but was not limited to object (straight cracks perpendicular to road), texture (character, surface quality, likely material, detail and visual effect), color (hue, tone, saturation and variation), scene (a coated or plastered wall) and photography style (clean, neutral, and minimalist appearance). After summarizing and categorizing these prompts using prompt engineering [29,30,31,32], the extracted information can be used directly for image generation. With the assistance of AIGC, Trained Model-2 showed advantages over Trained Model-1 in some cases of low-altitude drone-view crack monitoring.
Notably, the real-world-data-derived Trained Model-1 shows no significant advantage over the AIGC-derived Trained Model-2; that is, the AIGC data share similar features with real-world data. As shown in Figure 3 and Figure 10, Trained Model-1 has around 1400 instances, far fewer than Trained Model-2 (over 70,000 instances), which is a potential reason why Trained Model-2 performs as well as Trained Model-1. To some degree, increasing the number of instances can compensate for a lack of diversity. Moreover, if necessary, AIGC can not only increase the quantity of training instances but also expand the diversity seen by the model. If a dataset for a specific scene has not been collected before, or if the number of instances for a new scene is limited, AIGC can help fill the gap on purpose. In particular, when the prompts are well designed and controlled through prompt engineering, it is much easier to obtain the necessary data than to perform on-site collection, which requires drone flight operation and control of lighting conditions.
Although AIGC showed its advantages in increasing instance quantity and diversity, it is difficult to guarantee that the quality of the prompt outputs can be maintained each time the LMM is used. Therefore, more prompt-related experiments in other domains should be considered to understand the common features of prompts and LLMs/LMMs [33,34,35,36,37,38,39].
In this research, the benchmark datasets are derived from concrete cracks; nonetheless, other types of surface texture (e.g., asphalt pavement, white walls) also exist in practical crack monitoring applications. To overcome the feature gap between different surface textures, AIGC can assist by increasing data diversity and quantity for further model training.
Supplementary Materials are also provided for clearer understanding. Based on the comparison between manual annotation (True Label) and automatic annotation (Predicted) detailed in Table S1 and Figure S1, the high recall but low precision (attributed to the low confidence threshold of 0.01 used for mask generation) led to over-annotation and primarily affected model training. As shown in Table S2 and Figure S2, results from Trained Model-2 indicated minimal differences in close-distance segmentation/classification across confidence thresholds. The authors emphasized annotation quality over hyperparameter tuning and, through prompt ablation studies (Figures S3 and S4), found Object + Texture + Color prompts most effective. Finally, the synthetic-image diversity evaluation against real reference data (Figure S5 and Table S3) confirmed excellent perceptual diversity based on the reported metrics.

5. Conclusions

Trained Model-2, obtained from the AIGC dataset, achieved a classification accuracy of over 0.9 at a confidence of 0.25 on the "Concrete Crack Images for Classification" dataset and a segmentation Recall of around 0.8 at a confidence of 0.01 on the "Concrete Crack Segmentation Dataset". This suggests that the approach explored in this work has practical applications in low-altitude drone-view crack monitoring. In conclusion, the approach described in this study has the potential to reduce the resources required for on-site data collection and annotation in low-altitude drone-view crack monitoring applications, making crack monitoring systems more deployable in a variety of locales.

6. Future Works

Originally, the method in this research was designed for crack monitoring observed from a low-altitude drone view. Having demonstrated the feasibility of prompt engineering for generating high-resolution, low-altitude drone-view crack images, researchers can apply this technology on edge AI mini-computers to produce datasets that directly include images and automatically generated annotations. Such edge AI deployment helps indoor operators reduce the most time-consuming workloads, i.e., on-site image collection and annotation labeling. In the near future, researchers can combine prompt engineering with other novel technologies (e.g., LLMs, text-to-speech, and automatic speech recognition) on drone-platform edge AI devices without manual operation. All data collection and dataset generation could then be completed on the edge AI device, and more close-distance, high-resolution real-world drone-view data are needed for validation.
Based on the possible application issues mentioned above, relevant prompts should be accumulated and applied in more practical scenes [35,36], especially in drone-related applications.
Besides the mentioned points, there are also several bulleted points that warrant further research, as follows:
  • Crack-related prompts;
  • Texture-related prompts;
  • Color-related prompts;
  • Scene-related prompts;
  • Photography style-related prompts;
  • Scalability and automation of crack image generation;
  • Performance of AIGC across different environments;
  • Integration of AIGC with existing crack classification/detection/segmentation systems;
  • Multiple crack classification/detection/segmentation algorithms for verification;
  • Multiple study sites/locations/backgrounds for verification.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/drones9090660/s1. Figure S1: The True Positive, Predicted and True Label results derived from automatic and manual annotation; Figure S2: The Original, True Label and different-confidence segmentation results derived from Trained Model-2; Figure S3: Ablation experiments on different specification combinations using constructed prompts; Figure S4: A generated sample image derived from human-written prompts (I want to generate a wall with crack using close-distance); Figure S5: 2-D (t-SNE/PCA) and 3-D (PCA) visualizations of the comparison between AIGC (synthetic data) and CCSD (reference data), with feature extraction based on ResNet50 and Inception v3 models; Figure S6: The Trained Model-2-derived result, the True Label of CCSD and the corresponding overlay obtained using color processing; Table S1: The Precision and Recall of automatic annotation (Predicted) compared with manual annotation (True Label), pixel-wise; Table S2: The Precision and Recall of automatic annotation (Predicted) compared with manual annotation (True Label); Table S3: Synthetic-image diversity evaluation report; Table S4: The Precision and Recall of the Trained Model-2-derived result (Predicted) compared with the manual annotation (True Label) of CCSD, pixel-wise; this is an approximate result obtained using the RGB selection tool in the open-source image software GIMP.

Author Contributions

Conceptualization, S.P. and K.Y.; methodology, S.P., S.Q. and T.K.; software, S.P., S.Q. and T.K.; validation, S.P., S.Q. and T.K.; formal analysis, S.P.; investigation, S.P.; resources, S.P.; data curation, S.P.; writing—original draft preparation, S.P.; writing—review and editing, S.P. and S.Q.; visualization, S.P.; supervision, Z.F., K.Y. and S.N.; project administration, S.P. and Z.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data related to this research can be made available by request from the corresponding author.

Conflicts of Interest

Takashi Kojima was employed by TOKEN C.E.E. Consultants Co., Ltd. The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence
AIGC: Artificial Intelligence-Generated Content
CCI4CD: Concrete Crack Image for Classification Dataset
CCSD: Concrete Crack Segmentation Dataset
cls: Classification
conf: Confidence Value
det: Detection
dfl: Distribution Focal Loss
FN: False Negative
FP: False Positive
img2img: Image-to-Image
IoU: Intersection over Union
LLM: Large Language Model
LMM: Large Multimodal Model
mAP50: Mean Average Precision calculated at an IoU threshold of 0.50
mAP50-95: Mean Average Precision calculated at IoU thresholds from 0.50 to 0.95
SD (sd): Stable Diffusion
seg: Segmentation
TL: True Label
TN: True Negative
TP: True Positive
txt2img: Text-to-Image
YOLOv8: You Only Look Once version 8

References

  1. OpenAI. Available online: https://openai.com/ (accessed on 6 June 2025).
  2. YuanBao. Available online: https://yuanbao.tencent.com/ (accessed on 6 June 2025).
  3. Ren, Y.; Zhang, T.; Han, Z.; Li, W.; Wang, Z.; Ji, W.; Qin, C.; Jiao, L. A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing. Remote Sens. 2025, 17, 1748. [Google Scholar] [CrossRef]
  4. Pan, S.; Yoshida, K.; Yamada, Y.; Kojima, T. Monitoring human activities in riverine space using 4K camera images with YOLOv8 and LLaVA: A case study from Ichinoarate in the Asahi River. Intell. Inform. Infrastruct. 2024, 5, 89–97. [Google Scholar] [CrossRef]
  5. Pan, S.; Yoshida, K.; Yamada, Y.; Kojima, T. Trials of night-time 4K-camera-based human action recognition in riverine environments with multimodal and object detection technologies. Intell. Inform. Infrastruct. 2024, 5, 87–94. [Google Scholar] [CrossRef]
  6. Pan, S.; Yoshida, K.; Kojima, T. Application of the prompt engineering-assisted generative AI for the drone-based riparian waste detection. Intell. Inform. Infrastruct. 2023, 4, 50–59. [Google Scholar] [CrossRef]
  7. Shimoe, D.; Pan, S.; Yoshida, K.; Nishiyama, S.; Kojima, T. Application of image generation AI in model for detecting plastic bottles during river patrol using UAV. Jpn. J. JSCE 2025, 81, 24-16180. [Google Scholar] [CrossRef]
  8. Pan, S.; Yoshida, K.; Shimoe, D.; Kojima, T.; Nishiyama, S. Generating 3D Models for UAV-Based Detection of Riparian PET Plastic Bottle Waste: Integrating Local Social Media and InstantMesh. Drones 2024, 8, 471. [Google Scholar] [CrossRef]
  9. Pan, S.; Shimoe, D.; Yoshida, K.; Kojima, T. Local low-altitudes drone-based riparian waste benchmark dataset (LAD-RWB): A case study on the Asahi River Basin. Intell. Inform. Infrastruct. 2025, 6, 39–50. [Google Scholar] [CrossRef]
  10. Sun, Y.; Sheng, D.; Zhou, Z.; Wu, Y. AI hallucination: Towards a comprehensive classification of distorted information in artificial intelligence-generated content. Humanit. Soc. Sci. Commun. 2024, 11, 1278. [Google Scholar] [CrossRef]
  11. Lee, M.A. Mathematical Investigation of Hallucination and Creativity in GPT Models. Mathematics 2023, 11, 10. [Google Scholar] [CrossRef]
  12. Kumar, M.; Mani, U.; Tripathi, P.; Saalim, M.; Roy, S.; Kumar, M.; Mani, U.; Tripathi, P.; Saalim, M.; Sr, S. Artificial Hallucinations by Google Bard: Think Before You Leap. Cureus J. Med. Sci. 2023, 15, e43313. [Google Scholar] [CrossRef]
  13. Kompanets, A.; Duits, R.; Leonetti, D.; van den Berg, N.; Snijder, H.H. Segmentation tool for images of cracks. In Advances in Information Technology in Civil and Building Engineering; Skatulla, S., Beushausen, H., Eds.; Springer International Publishing: Cham, Switzerland, 2024; pp. 93–110. [Google Scholar] [CrossRef]
  14. Kompanets, A.; Leonetti, D.; Duits, R.; Snijder, B. Cracks in Steel Bridges (CSB) Dataset; 4TU.ResearchData: Leiden, The Netherlands, 2024. [Google Scholar] [CrossRef]
  15. Song, Y.; Su, Y.; Zhang, S.; Wang, R.; Yu, Y.; Zhang, W.; Zhang, Q. CrackdiffNet: A Novel Diffusion Model for Crack Segmentation and Scale-Based Analysis. Buildings 2025, 15, 1872. [Google Scholar] [CrossRef]
  16. Jamshidi, M.; El-Badry, M.; Nourian, N. Improving Concrete Crack Segmentation Networks through CutMix Data Synthesis and Temporal Data Fusion. Sensors 2023, 23, 504. [Google Scholar] [CrossRef]
  17. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. arXiv 2019, arXiv:1905.04899. [Google Scholar] [CrossRef]
  18. Li, H.-Y.; Huang, C.-Y.; Wang, C.-Y. Measurement of Cracks in Concrete Bridges by Using Unmanned Aerial Vehicles and Image Registration. Drones 2023, 7, 342. [Google Scholar] [CrossRef]
  19. Cao, H.; Gao, Y.; Cai, W.; Xu, Z.; Li, L. Segmentation Detection Method for Complex Road Cracks Collected by UAV Based on HC-Unet++. Drones 2023, 7, 189. [Google Scholar] [CrossRef]
  20. Humpe, A. Bridge Inspection with an Off-the-Shelf 360° Camera Drone. Drones 2020, 4, 67. [Google Scholar] [CrossRef]
  21. Shokri, P.; Shahbazi, M.; Nielsen, J. Semantic Segmentation and 3D Reconstruction of Concrete Cracks. Remote Sens. 2022, 14, 5793. [Google Scholar] [CrossRef]
  22. Yuan, Q.; Shi, Y.; Li, M. A Review of Computer Vision-Based Crack Detection Methods in Civil Infrastructure: Progress and Challenges. Remote Sens. 2024, 16, 2910. [Google Scholar] [CrossRef]
  23. Inácio, D.; Oliveira, H.; Oliveira, P.; Correia, P. A Low-Cost Deep Learning System to Characterize Asphalt Surface Deterioration. Remote Sens. 2023, 15, 1701. [Google Scholar] [CrossRef]
  24. Liu, F.; Liu, J.; Wang, L. Asphalt pavement crack detection based on convolutional neural network and infrared thermography. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22145–22155. [Google Scholar] [CrossRef]
  25. Özgenel, Ç.F.; Gönenç Sorguç, A. Performance comparison of pretrained convolutional neural networks on crack detection in buildings. In Proceedings of the 35th International Symposium on Automation and Robotics in Construction (ISARC 2018), Berlin, Germany, 20–25 July 2018; pp. 693–700. [Google Scholar]
  26. Zhang, L.; Yang, F.; Zhang, Y.D.; Zhu, Y.J. Road crack detection using deep convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016. [Google Scholar] [CrossRef]
  27. Özgenel, Ç.F. Concrete Crack Segmentation Dataset. Mendeley Data, V1, 2019. Available online: https://data.mendeley.com/datasets/jwsn7tfbrp/1 (accessed on 6 June 2025).
  28. Ultralytics YOLOv8 Documentation. Available online: https://docs.ultralytics.com/zh/models/yolov8/ (accessed on 6 June 2025).
  29. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  30. Lester, B.; Yurtsever, J.; Shakeri, S.; Constant, N. Reducing retraining by recycling parameter-efficient prompts. arXiv 2022, arXiv:2208.05577. [Google Scholar]
  31. Prompt Engineering Guide. Available online: https://www.promptingguide.ai/jp/papers (accessed on 6 June 2025).
  32. Wang, J.; Liu, Z.; Zhao, L.; Wu, Z.; Ma, C.; Yu, S.; Dai, H.; Yang, Q.; Liu, Y.; Zhang, S.; et al. Review of large vision models and visual prompt engineering. Meta-Radiology 2023, 1, 100047. [Google Scholar] [CrossRef]
  33. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  34. Choi, H.S.; Song, J.Y.; Shin, K.H.; Chang, J.H.; Jang, B.-S. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat. Oncol. J. 2023, 41, 209. [Google Scholar] [CrossRef]
  35. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A prompt pattern catalog to enhance prompt engineering with Chatgpt. arXiv 2023, arXiv:2302.11382. [Google Scholar]
  36. Chang, Y.-C.; Huang, M.-S.; Huang, Y.-H.; Lin, Y.-H. The influence of prompt engineering on large language models for protein–protein interaction identification in biomedical literature. Sci. Rep. 2025, 15, 15493. [Google Scholar] [CrossRef]
  37. Zhao, F.; Zhang, C.; Zhang, R.; Wang, T. Visual Prompt Learning of Foundation Models for Post-Disaster Damage Evaluation. Remote Sens. 2025, 17, 1664. [Google Scholar] [CrossRef]
  38. Liu, H.; Yang, S.; Long, C.; Yuan, J.; Yang, Q.; Fan, J.; Meng, B.; Chen, Z.; Xu, F.; Mou, C. Urban Greening Analysis: A Multimodal Large Language Model for Pinpointing Vegetation Areas in Adverse Weather Conditions. Remote Sens. 2025, 17, 2058. [Google Scholar] [CrossRef]
  39. Li, H.; Zhang, X.; Qu, H. DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark. Remote Sens. 2025, 17, 719. [Google Scholar] [CrossRef]
Figure 1. (a) Comparison of the previous manual-annotation-based drone-view crack detection process (upper, red square) and the novel automatic-annotation-based process (lower, green square). The upper side includes 8 steps, i.e., Step 1: engineers, researchers and operators go to the field to observe and confirm the crack distribution situation; Step 2: collect the data; Step 3: annotate the data; Step 4: train the YOLO model; Step 5: save the trained YOLO model; Step 6: transfer the drone-view data and infer using the trained YOLO model; Step 7: after inference, engineers can understand the location and size of the cracks and consider corresponding repairs; Step 8: researchers can save the crack distribution maps for further research. The lower side changes 2 steps, i.e., Step 2 (a): apply Stable Diffusion to generate the images; Step 3: use the trained model to annotate the generated crack image data. (b) Workflow of applying the AIGC-based crack dataset in low-altitude drone crack inspection. The research comprises 3 sections: 1. Training one YOLOv8 model on the Visible Crack Dataset as the annotation generator. 2. Applying prompt engineering with the LMM to generate prompts and corresponding AIGC images, and collecting the AIGC dataset. 3. Using the AIGC dataset to train the model and testing on CCI4CD/CCSD for evaluation.
Figure 2. Results of training the model, which include train/valid-based box_loss/cls_loss/dfl_loss, precision/recall, and mAP50/50-95 (image-size 640, epoch 500, batch-size 16, patience 100).
Figure 3. Label distribution statistics for training (Trained Model-1: number of instances, x-y, width-height, and bounding-box size).
Figure 4. Label (transferred from the original mask image to instance segmentation annotation) and prediction samples of the batch images used for validation: valid_batch 0 label, valid_batch 0 pred.
Figure 5. A sample image as the reference for prompt generations and the mind flow of YuanBao (Tencent-interfaced DeepSeek).
Figure 6. (a) A generated crack image sample (512 × 512 pixels). (b) A grid image of all generated images (800 images).
Figure 7. Generated image, with bounding box results, and without bounding box (mask only) results.
Figure 8. Annotation of the grid image in Figure 6b.
Figure 9. Results of training the model, which include train/valid-based box_loss/cls_loss/dfl_loss, precision/recall, and mAP50/50-95 (image-size 512, epoch 500, batch-size 32, patience 100).
Figure 10. Label distribution statistics for training (Trained Model-2: number of instances, x-y, width-height, and bounding-box size).
Figure 11. Positive crack segmentation results (Trained Model-1).
Figure 12. Negative crack segmentation results (Trained Model-1).
Figure 13. Positive crack segmentation results (Trained Model-2).
Figure 14. Negative crack segmentation results (Trained Model-2).
Figure 15. Crack segmentation results (4032 × 3024 pixels) derived from Trained Model-1 and -2, with yellow color marking undetected Trained Model-1 inference results.
Figure 16. Confusion matrix.
Table 1. Parameter setting (hardware).
Specification | Details
OS | Ubuntu 22.04.4 LTS
CUDA Version | 12.4
NVIDIA-SMI | 550.144.03
GPU | NVIDIA GeForce RTX 3090
Table 2. Parameter setting (YOLOv8x-seg, Visible_Cracks).
Specification | Details
Task | Segment
Model | yolov8x-seg
Early Stopping Patience | 100
Batch size | 16
Imgsz | 640 pixels
Epochs | 500
Optimizer | Auto
Lr0 | 0.01
Momentum | 0.937
Table 3. Prompts (texture, color and summary).
Specifications | Details
Object | Straight cracks perpendicular to road, 3–10 mm width, clean edges, no branching, spaced 5–10 m apart
Texture | Character: Rough and irregular. Surface Quality: It appears tactile and uneven, suggesting a dimensional surface rather than being flat and smooth. Likely Material: The description suggests it resembles a coated, painted, or plastered wall. Detail: There are no visible patterns or specific features (like grooves or tiles) beyond the overall roughness and irregularity; it is a relatively fine-grained roughness, not large pebbles or deep cracks. Visual Effect: Creates a sense of understated depth and physicality due to the light interacting with the uneven surface.
Color | Hue: Light gray. Tone: Uniform; the description emphasizes "overall tone uniform, no obvious color difference or pattern." Saturation: Implied to be low saturation (true gray, not blue-gray or green-gray), reinforcing the neutrality described. Variation: Lack of significant color variation, striations, fading, or stains contributes strongly to the described simplicity and neutrality.
Summary | The image depicts a surface with a rough, irregular texture resembling a coated or plastered wall. The texture provides subtle depth without prominent patterns. This surface is uniformly covered in a light gray color, lacking variations in hue, tone, or saturation, resulting in a clean, neutral, and minimalist appearance.
Table 4. Parameter settings (txt2img).
Specifications | Details
Sampling steps | 150
Width/Height | 512/512 pixels
Batch count | 100 (maximum, default = 1)
Batch size | 8 (maximum, default = 1)
Table 5. Parameter settings (img2img).
Specifications | Details
Resize mode | Just resize (latent upscale)
Sampling steps | 150
Switch at | 0.8
Sampling method | DDIM
Schedule type | DDIM
Width/Height | 512/512 pixels
Batch count | 100 (maximum, default = 1)
Batch size | 8 (maximum, default = 1)
CFG Scale | 7 (default)
Denoising strength | 0.75 (default)
Table 6. Parameter settings (annotation generation).
Specifications | Details
Imgsz | 512 pixels
conf | 0.01
Table 7. Parameter settings (YOLOv8l-seg, SD_Cracks).
Specification | Details
Task | Segment
Model | yolov8l-seg
Patience | 100
Batch size | 32
Imgsz | 512 pixels
Epochs | 500
Optimizer | Auto
Lr0 | 0.01
Momentum | 0.937
Table 8. Accuracy results of detecting positive and negative cracks under individual confidence using Trained Model-1.
Confidence | Accuracy (%) (Positive, Cracks) | Accuracy (%) (Negative, No_Cracks)
0.1 | 99.9 (19,992/20,000) | 95.6 (19,123/20,000)
0.25 | 99.8 (19,953/20,000) | 97.5 (19,493/20,000)
0.5 | 97.7 (19,545/20,000) | 98.9 (19,776/20,000)
Table 9. Accuracy results of detecting positive and negative cracks under individual confidence using Trained Model-2.
Confidence | Accuracy (%) (Positive, Cracks) | Accuracy (%) (Negative, No_Cracks)
0.1 | 99.9 (19,996/20,000) | 88.0 (17,598/20,000)
0.25 | 99.5 (19,899/20,000) | 93.5 (18,698/20,000)
0.5 | 82.2 (16,442/20,000) | 97.7 (19,531/20,000)
Table 10. Recall values derived from Figure 15 using a 0.01 confidence value and imgsz 4032 pixels.
Sample No. | Trained Model-1 | Trained Model-2
17 | 0.899 | 0.834
19 | 0.915 | 0.976
30 | 0.810 | 0.946
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
