Article

Power Field Hazard Identification Based on Chain-of-Thought and Self-Verification

1
Electric Power Research Institute of State Grid Ningxia Electric Power Co., Ltd., Yinchuan 750001, China
2
School of Electrical Engineering, Chongqing University, Chongqing 400044, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 556; https://doi.org/10.3390/electronics15030556
Submission received: 1 December 2025 / Revised: 5 January 2026 / Accepted: 19 January 2026 / Published: 28 January 2026
(This article belongs to the Special Issue AI Applications for Smart Grid)

Abstract

The complex environment of electrical work sites presents hazards that are diverse in form, easily concealed, and difficult to distinguish from their surroundings. Due to poor model generalization, most traditional visual recognition methods are prone to errors and cannot meet the current safety management needs in electrical work. This paper presents a novel framework for hazard identification that integrates chain-of-thought reasoning and self-verification mechanisms within a visual-language large model (VLLM) to enhance accuracy. First, typical hazard scenario data for crane operation and escalator work areas were collected. The Janus-Pro VLLM model was selected as the base model for hazard identification. Then, a chain-of-thought was designed to enhance the model’s capacity to identify critical information, including the status of crane stabilizers and the zones where personnel are located. Simultaneously, a self-verification module was designed. It leveraged the multimodal comprehension capabilities of the VLLM to self-check the identification results, outputting confidence scores and justifications to mitigate model hallucination. The experimental results show that integrating the self-verification method significantly improves hazard identification accuracy, with average increases of 2.55% in crane operations and 4.35% in escalator scenarios. Compared with YOLOv8s and D-FINE, the proposed framework achieves higher accuracy, reaching up to 96.3% in crane personnel intrusion detection, and a recall of 95.6%. It outperforms small models by 8.1–13.8% in key metrics without relying on massive labeled data, providing crucial technical support for power operation hazard identification.

1. Introduction

Power construction sites are among the most widely distributed and most strictly safety-managed work environments in the world, and the accurate judgment of hazards at these sites remains a pressing problem. This is especially true for power ground operations, where both equipment operation safety and the personal safety of workers are prominent concerns during crane truck and escalator operations.
Traditional safety management and detection methods for power ground operation scenes mostly depend on the manual supervision of on-site safety officers [1]. In recent years, with the development of AI and computer vision, hazardous behavior monitoring based on image detection and target recognition has been widely applied at power operation sites, helping managers judge hazards and issue reasonable, standard suggestions [2].
However, traditional hazard detection technologies mostly rely on deep learning algorithms such as convolutional neural networks, which need large amounts of labeled data for training to obtain good results. Most domestic research on scene hazard recognition uses small models such as YOLO (You Only Look Once) and ResNet (Residual Network) for detection. For example, Li Hai et al. proposed a lightweight-model-based method for detecting hidden dangers in UAV-monitored distribution networks [3], using an improved YOLOv5s algorithm to monitor distribution network safety in real time. Zhao Jiangping et al. proposed an intelligent dynamic detection method for hidden dangers in external scaffolding based on YOLOv5s [4], which monitors safety hazards in scaffolding projects in real time. Li Zhenkun et al. presented a footbridge damage detection and classification framework using smartphone-recorded responses of micromobility and deep learning techniques [5], applied to the detection of structural defects in pedestrian bridges.
However, power operations pose specific difficulties: data are hard to collect, and labeling requires a high level of professional expertise, so large-scale manual labeling is impractical. Moreover, model training is typically narrow: a trained model can usually recognize only specific scenes, its generalization ability is weak, and overfitting or underfitting occurs when the scene changes [6].
More recently, large language models (LLMs), led by ChatGPT 4.0 [7], have emerged in quick succession and provide a new way to approach on-site safety management. By exploiting the strong generalization ability of LLMs, a model supplied with relevant data can perform hazard recognition across many kinds of scenes.
Large Vision-Language Models (VLLMs) are a branch of LLMs [8]. They inherit the strong generalization ability of LLMs and add visual interaction, supporting the input and analysis of multimodal data. Without extra conditions, the model extracts the overall features of the image; its ViT (Vision Transformer) module can effectively capture global dependencies. This way of automatically extracting image features differs from visual models such as YOLO: a VLLM extracts the relevant features it detects and can output the main content of a picture without labeling. This avoids the problems of dataset construction, long labeling time, and weak model generalization in the traditional mode, and greatly improves safety monitoring efficiency in power scenes.
However, in hazard detection and recognition tasks for crane truck and escalator operation scenes, a large model given no prior information will not perform targeted detection of a particular object and will produce disorganized outputs. This is because, without external conditioning, the large model simply generates aimlessly on the basis of its original training.
To solve this problem, Leng Shuo et al. suggested giving the large model an “expert” identity in safety supervision scenes [2], combined with prompts so that the large model can recognize hazardous scenes such as post-leaving and regional intrusion. However, their prompt design is not bound to industry safety rules and relies only on general reasoning, failing to meet the requirements of complex tasks. At the same time, large models suffer from hallucination, which makes them output results that are “made up out of nothing”. Hallucination arises because large-model pre-training takes “predicting the next token” as its goal and optimizes language fluency first, so the model may fabricate content to keep sentences smooth. Current self-verification schemes in the industrial field lack “visual-rule dual verification” and are vulnerable to hallucinations, resulting in far from satisfactory recognition results.
Based on the above problems, we propose a hazard recognition method for power on-site operations based on a VLLM. To address hazardous situations caused by incorrect equipment placement and worker carelessness in power operations, we collected videos and pictures of hazards from crane truck operation scenes (including crane truck stabilizers and the area around crane trucks) and escalator operation scenes. We cleaned and augmented the collected dataset through operations such as deletion and random rotation, and then input it into the large model. We used the pre-trained VLLM to understand hazards at the operation site and workers’ regional intrusion behavior, taking prompts and the corresponding chain-of-thought as prior information to help the model recognize and judge parts with potential safety hazards. At the same time, we added a self-verification step to reduce model hallucination. This realizes fast and accurate recognition of hazards at the operation site.
In contrast to other current industrial application fields, the proposed method has the following innovations: (1) Customized Chain-of-Thought (CoT) Design for Vertical Scenarios: The designed CoT focuses on the generality of logical reasoning and embeds industry safety rules within it. Targeting the high safety requirements of power operations, a three-stage CoT template of “hazard localization–key element decomposition–safety rule binding” is developed. (2) Safety-Oriented Self-Verification Mechanism: Unlike the “logical consistency verification” of general self-verification, we construct a three-dimensional credibility matrix of “visual matching degree–safety rule adaptability–judgment basis”. Hard requirements in power safety manuals are adopted as the verification benchmark, instead of merely relying on the model’s semantic logic, which reduces the misjudgment risk in safety scenarios. (3) Zero-Shot Recognition Scheme: This technology does not need manual labeling and can perform hazard recognition in both specific and generalized scenes. It is an intelligent recognition and monitoring solution with low development cost that is easy to put into use.

2. Materials and Methods

2.1. Hazard Recognition in Operation Sites Based on VLLM

2.1.1. Large Vision-Language Model

The advantage of large vision-language models over small models is that they can understand and process multimodal information. They understand images and then catch the relative relationships and semantic information between people and related equipment in the images. The model uses the multimodal fusion of text semantic vectors and image feature vectors. It learns information like human actions, postures, and semantics from images so it can better understand and judge hazardous behaviors in power on-site scenes.
The hazardous recognition method for power operation sites proposed in this paper is realized based on the chain prompt technology and self-verification method of the Janus-Pro large model. It breaks down complex tasks into explainable step-by-step reasoning processes. The model is no longer limited to outputting answers directly based on questions. Instead, it reasons step by step according to the chain-of-thought prompts at each step. This enhances the reasoning authenticity and reliability of the model. For crane truck operation and escalator operation scenes, we input the collected dataset into the large model. The large model splits the image into fixed patches (such as 16 × 16). Then it maps each patch to a low-dimensional vector. It adds positional encoding to provide position information and keep spatial information. At this time, the text encoder breaks down words like “crane truck” and “stabilizer bracket” into token vectors. The fusion module aligns the encoded content of the text encoder with the encoded content of the image encoder through modal attention. The input text prompt information serves as supplementary content. It helps the model understand the user’s needs, so as to complete multimodal fusion. Finally, the decoder decodes all content to generate text. This finally realizes image understanding and outputs the corresponding text. Its reasoning framework is shown in Figure 1.
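The patch-splitting step can be illustrated with a quick calculation. The 16 × 16 patch size is the example value given above; the 384 × 384 input resolution below is an assumption for illustration only:

```python
def count_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of fixed-size patches a ViT-style image encoder produces
    when the image sides are multiples of the patch size."""
    assert height % patch == 0 and width % patch == 0, "sides must divide evenly"
    return (height // patch) * (width // patch)

# A 384x384 input split into 16x16 patches yields 24 * 24 = 576 patch tokens,
# each then mapped to a low-dimensional vector with positional encoding added.
print(count_patches(384, 384))  # → 576
```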

2.1.2. Prompt-Based Reasoning Strategy

The original VLLM needs a lot of data for training. But most of this data comes from public datasets. It lacks the knowledge needed for vertical field reasoning. So this paper introduces prompt technology. It uses prompt engineering to supplement the scenario knowledge of power scene monitoring. This improves the accuracy of VLLM in recognizing hazards in crane truck areas and escalator operations under power scenes. This paper puts forward a prompt text strategy suitable for hazard recognition in different power scenes.
The process of the proposed method is shown in Figure 2. First, we process the collected data. We perform operations like deleting, rotating, flipping, and scaling. We select pictures that meet the experimental requirements. After inputting the target pictures into the large vision-language model Janus-Pro, the model will recognize the corresponding image information according to the given prior information (that is, prompts like “Is there anyone around the crane truck in the picture?”). It outputs the corresponding text. The text output for the first time will be returned to the model together with the original picture for “correctness verification”. The model will give a credibility level and judgment reasons. The final output text includes content related to the prompts. Finally, the model outputs the hazard level and alarm prompts based on the power operation safety manual.
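The recognize-then-verify loop described above can be sketched as follows. Here `generate` and `fake_generate` are hypothetical stand-ins for the actual Janus-Pro inference call, and all prompt wording is illustrative, not the paper's template:

```python
def recognize_with_self_check(image, prompt, generate):
    """Two-pass pipeline from the text: (1) ask the VLLM the hazard
    question; (2) feed the first answer back, with the same image, for
    a 'correctness verification' pass that yields a credibility level
    and judgment reasons. `generate(image, text) -> str` stands in
    for the real Janus-Pro call."""
    first_answer = generate(image, prompt)
    verify_prompt = (
        "Verify the following judgment against the image and the power "
        "operation safety manual. Output a credibility level "
        "(high/medium/low) and the judgment reasons.\n"
        "Judgment: " + first_answer
    )
    return first_answer, generate(image, verify_prompt)

# Hypothetical stub standing in for the real model, for illustration only.
def fake_generate(image, text):
    if "Verify" in text:
        return "Credibility: high. The image matches the judgment."
    return "A worker is near the crane truck without a safety helmet."

answer, check = recognize_with_self_check(
    "site.jpg", "Is there anyone around the crane truck in the picture?",
    fake_generate)
```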

2.1.3. Hazard Judgment Guidance Based on Chain Prompts

When the set prompts are too simple, the model cannot focus on the area that the user really wants it to pay attention to. The model will make wrong judgments, miss judgments, and give answers that do not match the question. So we need to set more detailed prompts for different scenes. At the same time, to enhance the logic of the model’s reasoning process, we use the chain prompt method. Chain prompts are the chain-of-thought technology. They were proposed by the Google team [9]. The core idea of this technology is to design prompts carefully. This makes the model have “human thinking” and reason step by step through step-by-step prompts [10,11,12,13,14,15]. We let the model think and reason according to the prompt questions at each stage. And we use causal relationships and progressive relationships to make the model reason out more accurate and correct answers.
The design of the chain-of-thought directly determines the quality of the model’s output, so we need to design different chain prompts for different scenes. Take hazard recognition in crane truck and escalator operation scenes as examples. We first need to specify the task scenario for the model, e.g., “This is an image of an escalator operation scenario in the power industry”. The model will then conduct chain reasoning with more rigorous and professional thinking by assuming the identity of a “power safety expert” to identify potential hazards in the scenario. Meanwhile, we embed power operation safety rules into the chain-of-thought (CoT) content, enabling the model to understand the target task scenario while performing general logical reasoning. At this point, the model is not limited to a single hazardous point but reasons by integrating safety rules and the target scenario. The designed logical framework of “task scenario embedding + industry rule constraint” is more conducive to improving the reasoning quality of VLLMs. For example, the chain prompts mention that in crane truck operation scenes, the crane truck stabilizer is a “long metal structure with telescopic sections”. The model uses this description to identify whether there are long metal-structure-like objects around the crane truck, and judges whether the stabilizer is fully extended based on the length of the object. With chain prompts, the model no longer reasons or outputs without basis; instead, it makes reasonable inferences grounded in the given chain prompt content, achieving relatively accurate recognition of hazards. In this paper, we have built an effective prompt template for hazard recognition in crane truck operation and escalator operation scenes of power scenarios, shown in Table 1.
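A minimal sketch of assembling the three-stage “hazard localization–key element decomposition–safety rule binding” template with scenario embedding and the expert identity. The function name, wording, and example rules are illustrative assumptions, not the paper's exact Table 1 template:

```python
def build_cot_prompt(scenario: str, key_elements: list[str],
                     safety_rules: list[str]) -> str:
    """Assemble a three-stage chain-of-thought prompt: task-scenario
    embedding, key-element decomposition, and safety-rule binding."""
    lines = [
        f"You are a power safety expert. This is an image of {scenario}.",
        "Step 1 (hazard localization): locate regions that may contain hazards.",
        "Step 2 (key element decomposition): examine each key element: "
        + "; ".join(key_elements) + ".",
        "Step 3 (safety rule binding): judge each element against these rules: "
        + "; ".join(safety_rules) + ".",
        "Reason step by step, then state whether a hazard exists.",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    "a crane truck operation scenario in the power industry",
    ["stabilizer (a long metal structure with telescopic sections)",
     "personnel around the crane truck"],
    ["stabilizers must be fully extended before lifting",
     "personnel near the truck must wear safety helmets"],
)
```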

2.1.4. Self-Verification

To address the issue that model hallucination is prone to causing erroneous output results in the hazard recognition task of power operation sites, inspired by the model self-verification research from the Chinese Academy of Sciences research team [16,17,18,19,20,21,22,23], this paper proposes a self-verification scheme suitable for power scenarios. Model hallucination is one of the common problems in large language models (LLMs). To solve this problem, we designed a self-verification method of “visual extraction + industry safety binding”. Unlike self-verification schemes in general domains, our scheme constructs a three-dimensional credibility matrix of “visual feature matching degree, industry safety binding, and evidence sufficiency” by providing scenario-specific safety rules, dividing credibility into three levels: high, medium, and low. When the credibility is high, the system outputs according to the recognition results; when it is medium or low, it broadcasts safety rules to remind operators to enhance protection. This scheme is an extension of the designed chain-of-thought (CoT) framework, which ensures that the core content proposed by the CoT remains unchanged and prevents “contextual conflicts” when the model recognizes text content. As shown in Figure 2, we performed experiments using the self-verification method in crane truck operation scenes and escalator operation scenes, which prove the feasibility of the scheme. The core logic of this scheme is as follows: it uses the multimodal understanding ability of the large model to build a “dual-verification” mechanism. It guides the model to analyze the first text output through “verification-specific prompts”, conducts cross-modal alignment verification between the output text and the original image, and finally makes a comprehensive evaluation and outputs the recognition results with credibility grades.
This reduces model hallucinations and improves judgment accuracy. This self-verification does not need extra labeled data or external tools. It can realize self-verification only by activating the model’s own capabilities.
Take the crane truck operation scene as an example. After using the self-verification scheme, the model re-reads the text result generated for the first time and the corresponding original image. It analyzes the image features and the corresponding text to judge whether the crane truck stabilizer is really consistent with the text content output for the first time. If they are consistent, it outputs the final result, high credibility and judgment reasons. If they are not consistent, it outputs the corresponding text result, low credibility and judgment reasons. The templates for self-verification are shown in Table 2 and Table 3.
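The credibility grading can be sketched as a simple decision rule. The 0–1 scores, the 0.8/0.5 thresholds, and the function names below are illustrative assumptions (the paper reports only the three levels, not numeric thresholds):

```python
def credibility_level(visual_match: float, rule_fit: float,
                      evidence: float) -> str:
    """Collapse the three-dimensional credibility matrix from the text
    (visual feature matching degree, safety-rule adaptability, evidence
    sufficiency) into high/medium/low."""
    worst = min(visual_match, rule_fit, evidence)  # weakest dimension governs
    if worst >= 0.8:
        return "high"
    if worst >= 0.5:
        return "medium"
    return "low"

def final_output(level: str, recognition: str) -> str:
    """High credibility: output the recognition result directly;
    medium/low: broadcast safety rules, as the scheme specifies."""
    if level == "high":
        return recognition
    return f"Credibility {level}: broadcasting safety rules; enhance protection."
```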
Image encoders and text encoders are used to convert natural language text and images into vector representations that can be calculated and processed. This lets them fuse with visual information, achieve semantic alignment, and interact [24,25,26]. Then they output the corresponding content. The text first output by the model and the corresponding original image are sent back to the model for self-verification. Finally, the model outputs judgment reasons and credibility. The self-verification process inside the model is shown in Figure 3.
In Figure 3, the model first splits the image and encodes it to extract features. Then it fuses the text and image features through cross-modal attention. And it outputs text through the decoder. The text output for the first time is sent back to the model together with the original image. The model determines the credibility by judging the consistency between the image and the text. It also gives the judgment reasons. Finally, it outputs the corresponding text to confirm the hazards.

3. Results

3.1. Experimental Platform and Dataset Preparation

To ensure the rigor of the experiment, all dataset samples needed for the experiment are taken on-site in power scenes. They include pictures and videos of crane truck stabilizers, the area around crane trucks, single-person escalator operations, and two-person escalator operations. The equipment involved mainly includes graphics cards, UAVs, cameras, crane trucks, and escalators. We collected images and videos from the side and top of crane trucks, and from scenes where escalators are used correctly and incorrectly. When processing the dataset (including images and videos), we performed the following screening: We removed invalid samples that are damaged, have wrong formats, are blurry, or are irrelevant to the operation scenes. This ensures that crane truck samples clearly show key parts like booms and outriggers. And escalator samples fully show scenes where people use escalators correctly and incorrectly—such as people using escalators, walking backwards on escalators, or leaning out of escalators. Then we extracted key frames from the screened valid videos: The original video frame rate is 30 frames per second, and the target frame rate is 10 frames per second. So we extracted one image every three frames when extracting images. The frame splitting formula is shown in Equation (1):
Frame interval = Native frame rate of the video / Target extraction frame rate
After splitting the video into frames, we obtained initial images. Then we removed the blurry and repeated images from these initial images. Next, we performed rotation and cropping operations on the screened images. Finally, we formed an expanded dataset that meets the experimental requirements. The flow chart for screening and expanding the dataset is shown in Figure 4.
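The key-frame extraction step governed by Equation (1) can be sketched as follows; the function name is illustrative:

```python
def frame_indices(total_frames: int, native_fps: int = 30,
                  target_fps: int = 10) -> list[int]:
    """Key-frame selection per Equation (1): keep one frame every
    (native_fps / target_fps) frames. With a 30 fps source and a
    10 fps target, every third frame is kept, as in the text."""
    interval = native_fps // target_fps  # Equation (1): 30 / 10 = 3
    return list(range(0, total_frames, interval))

print(frame_indices(10))  # → [0, 3, 6, 9]
```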
The image acquisition equipment, experimental platform, and dataset quantity are shown in Table 4.

3.2. Evaluation Metrics

In the task of hazardous recognition in power operations, a single metric cannot fully measure the system’s reliability and practicality: it is necessary to ensure effective recognition of hazardous scenarios to prevent accidents and control false alarms. Thus, this study selects four classic metrics that match scenario requirements: accuracy, recall, precision, and F1-score.
Since our hazardous recognition is a binary classification task (i.e., distinguishing between “hazardous” and “safe” cases), the four core elements of the confusion matrix in binary classification are clearly defined as follows: True Positive (TP): Number of samples that are actually hazardous and classified as “hazardous” by the model; True Negative (TN): Number of samples that are actually safe and classified as “safe” by the model; False Positive (FP): Number of samples that are actually safe but incorrectly classified as “hazardous” by the model; False Negative (FN): Number of samples that are actually hazardous but missed and classified as “safe” by the model.
Accuracy reflects the model’s overall classification correctness for “hazardous/safe” samples, calculated as Equation (2):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall is a core priority metric for power safety scenarios, reflecting the model’s recognition coverage of hazardous samples (i.e., the ability of “no missing reports”), calculated as Equation (3):
Recall = TP / (TP + FN)
Precision reflects the proportion of samples classified as “hazardous” by the model that are actually hazardous (i.e., the ability of “no false reports”), calculated as Equation (4):
Precision = TP / (TP + FP)
F1-score, the harmonic mean of precision and recall, is used to balance their trade-off and reflect the model’s comprehensive performance, calculated as Equation (5):
F1-score = (2 × Precision × Recall) / (Precision + Recall)
False Negative Rate: FNR refers to the proportion of missed judgments specifically for hazardous samples in binary classification tasks, calculated as Equation (6):
FNR = FN / (TP + FN)
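Equations (2)–(6) can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Evaluation metrics of Equations (2)-(6) from the binary
    confusion matrix (hazardous = positive class)."""
    precision = tp / (tp + fp)   # Eq. (4): no false reports
    recall = tp / (tp + fn)      # Eq. (3): no missing reports
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. (2)
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (5)
        "fnr": fn / (tp + fn),                                # Eq. (6)
    }

# Illustrative counts only, not the paper's experimental data.
m = metrics(tp=8, tn=1, fp=1, fn=0)
```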

3.3. Hazard Recognition in Crane Truck Operation Scenes

This paper takes electric crane truck operation scenarios and escalator operation scenarios as examples to verify the effectiveness of the proposed framework. In crane truck operation scenes, the most important requirement is that personnel must wear safety helmets when approaching crane trucks. This is because safety accidents sometimes happen when personnel approach and operate bucket trucks without wearing safety helmets. This paper uses the Janus-Pro large vision-language model as the experimental model. It detects whether there are workers approaching the crane truck nearby. And it gives timely early warnings for dangerous behaviors according to the power operation safety manual. Examples of output results are shown in Figure 5.
To verify the effectiveness of the proposed “Chain-of-Thought (CoT) + Self-Verification (SV)” framework in hazardous recognition for electric crane truck operation scenarios, this experiment selects three multimodal large models—Janus-Pro, Deepseek-vl2, and Deepseek-R1—as experimental subjects, and conducts tests around two core tasks: first, personnel intrusion recognition around electric crane trucks; second, recognition of the extension state of electric crane truck stabilizing supports. Regarding the experimental settings, the positive-to-negative sample ratio is 8:2 for the personnel recognition task and 7:3 for the stabilizing support recognition task. All models are enabled with the same chain-of-thought strategy and self-verification module to ensure a single experimental variable. The experimental results are shown in Table 5 and Table 6.
The experimental results show that Janus-Pro performs the best in both tasks: it achieves 96.3% accuracy, the lowest misjudgment rate (3.7%), the highest recall (95.6%), and a comprehensive F1-score of 94.2% in person recognition; its accuracy (94.7%) and recall (92.8%) in stabilizing support recognition also outperform Deepseek-vl2 and Deepseek-R1, while Deepseek-R1 has the highest misjudgment rate (7.4%) among the three models.

3.4. Hazard Identification in Escalator Operation

Escalator operation is the most common construction method in current power scenarios. Its safety management has always lacked strict rules and safety supervision. This paper judges whether workers have dangerous behaviors by detecting their postures and relationships with the escalator during operation. We provide well-designed prompts to make the model focus on whether workers use the escalator safely while working. We selected several typical dangerous behaviors. These include only one person working on the escalator without someone holding it, and workers leaning out or climbing down the escalator backwards during operation. Based on the prompts, the model identifies the postures, number, and behaviors of workers in the corresponding area. Finally, it outputs the relevant text. The experimental results are shown in Table 7 and Table 8.
In Table 7 and Table 8, the three large models all use well-designed chain prompts and the self-verification method. In identifying dangerous behaviors in escalator operation scenarios, Janus-Pro is still better than Deepseek-vl2 and Deepseek-R1 models. This proves the effectiveness of our well-designed prompts and the self-verification method.

3.5. Comparative Experiments

To demonstrate the advantages of the proposed hazard recognition scheme, we also compared the results of classic visual object detection models on the same tasks. The selected comparative models are YOLOv8s [27] and D-FINE [28]. It should be noted that our scheme does not require collecting and annotating scene images, but classic visual object detection models need a certain number of images and annotations of key targets for model training. Therefore, for this experiment, we tried our best to obtain 1378 images from electric crane truck operation scenarios and escalator operation scenarios as training data, and 344 images as test data. The experimental results are shown in Table 9.
In the personnel intrusion detection task for electric crane trucks, the recall of the proposed Janus-Pro (with CoT and SV) reaches 95.6%, significantly higher than the 83.5% of YOLOv8s and 89.1% of D-FINE; meanwhile, its F1-score (94.2%) and mAP (0.945) are also the best among the three models. This is because traditional small models require a large amount of labeled data for training to achieve relatively satisfactory results; under limited samples, the proposed prompt and self-verification scheme is far superior to classic object detection models.

3.6. Ablation Experiment

(1) Ablation Experiments on Prompts and CoT: When we removed the prompt strategy, the model’s output became disorganized, hallucination was serious, non-existent results were fabricated, and the error rate was very high. When we removed the chain-of-thought component, the model’s output accuracy was slightly lower than that of the full method, and some hallucination problems remained unsolved. Without self-verification, the model only output detection results, with no credibility levels or verification reasons. With the well-designed prompts and the self-verification method together, the model achieved the highest accuracy in identifying hazards in crane operation and escalator operation scenarios, with the least hallucination and the best experimental effect. The final experimental results in crane and escalator operation scenarios are shown in Table 10 and Table 11.
As indicated in Table 10 and Table 11, when the large model does not utilize any prior information, its discrimination accuracy is extremely low—especially the excessively high miss rate—preventing the model from achieving ideal performance. CoT reduces feature misjudgment through “semantic definition of hazardous points + step-by-step reasoning,” while self-verification (SV) captures ambiguous features via “visual-rule dual verification.” The synergy between the two reduces the miss rate by more than 85% compared with the “No Prompts + No SV” configuration.
(2) Prompt Variance Experiment: To confirm the effectiveness of the well-designed prompts, we rewrote the prompt content while retaining the core information and only adjusting the expression style. Taking the detection of stabilizing supports for electric crane trucks as an example, the experimental results are shown in Table 12.
As indicated in Table 12, when adjusting the expression style of the prompts while keeping the core task unchanged, the accuracy variance of the prompts is only 0.29%. This demonstrates that the designed prompt template exhibits strong robustness—minor changes in expression style do not affect the model’s judgment results as long as the core task remains consistent.
(3) Prompt Adversarial Experiment: To verify the effectiveness of the proposed self-verification (SV) scheme in alleviating model hallucination, we input incorrect prompts to the model in this experiment. Adversarial experiments were conducted from three aspects, namely, incorrect definition, scenario interference, and ambiguous expression attacks, aiming to evaluate the model’s discrimination accuracy with the self-verification module. The experimental results are shown in Table 13.
As indicated in the prompt adversarial experiment in Table 13, our method identifies the “conflict between incorrect definitions and visual features” through the self-verification (SV) module. The accuracy remains above 85% under incorrect prompts, which is significantly higher than the range of 62.3% to 68.5% without the self-verification module. This demonstrates that the self-verification module can effectively resist incorrect prompt attacks.

4. Discussion and Conclusions

This study demonstrates the application potential of large model prompt strategies in power scenarios. The strong generalization ability of large models suggests that they will become mainstream models for auxiliary safety management. They address the issue that small models require extensive annotation and training, and under the zero-shot prompt strategy the model can efficiently process batches of images. With the support of this generalization ability, the model can be rapidly deployed for image recognition tasks across various scenarios with only appropriate prompts, significantly reducing model construction, time, and R&D costs, and promoting the application of large models in various industries. In extreme scenarios, such as when the electric crane truck’s stabilizing supports are occluded or when personnel in hazardous areas have only part of their body visible in the image, the model no longer attempts to recognize hazardous points; instead, it directly issues an alarm and simultaneously provides voice announcements of the safety regulations and hazardous factors for the scenario, reminding operators to pay attention to potential dangers at all times and take self-protection measures.
The practical application of large models still faces many difficulties. General-purpose large models suffer from hallucination, knowledge lag, high computing costs, and uneven dataset quality, problems the industry has yet to solve. The hazard identification method proposed in this paper, based on a chain prompt strategy and self-verification, achieves strong results in power scenarios: the model first outputs text according to the prompts; the first output and the original image are then returned to the model, which performs a self-check to reduce hallucinations and produces the final judgment, achieving end-to-end conversion from image to text.
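The generate-then-verify loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `vllm_generate` is a placeholder for the Janus-Pro inference call, and its signature is invented for the sketch, not the actual API.

```python
def identify_hazard(image, cot_prompt, sv_template, vllm_generate):
    """Two-pass hazard identification: CoT generation, then self-check."""
    # Pass 1: the chain-of-thought prompt yields an initial judgment
    # together with its visual basis.
    initial = vllm_generate(image=image, prompt=cot_prompt)
    # Pass 2: the first answer is fed back with the original image so the
    # model can check its own output against the visual evidence and
    # attach a credibility level.
    final = vllm_generate(
        image=image,
        prompt=sv_template.format(initial_result=initial),
    )
    return initial, final
```

Because both passes see the original image, the second pass can reject a first-pass answer that contradicts the visual features, which is the mechanism behind the hallucination reduction reported above.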

Author Contributions

B.G. acquired funding, curated data, and administered the project. X.X. drafted the original manuscript. Y.L. reviewed/edited the manuscript and developed the study’s conceptualization. S.Z. conducted investigation and handled software tasks. X.B. designed the study’s methodology. Q.C. validated the work and assisted in project administration. W.K. performed formal data analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Scientific and Technological Project of State Grid Ningxia Electric Power Co., Ltd., grant number 5229DK24001P.

Data Availability Statement

The dataset presented in this paper contains critical power infrastructure images protected by industrial data security regulations, so it is not publicly accessible. To access this dataset, a request must be submitted to the corresponding author, and formal approval from the Ningxia Electric Power Research Institute is required.

Conflicts of Interest

Authors B.G., X.X., Y.L., S.Z., X.B. and W.K. were employed by the Electric Power Research Institute of State Grid Ningxia Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

CoT-SV for Power Operation Hazard Identification
Input: Power operation image X; Scene type (crane/escalator); Safety rules R (industry safety manuals);
CoT prompt template P; Self-verification criteria C (visual matching + rule adaptability + evidence sufficiency)
Output: Hazard identification result (hazardous/safe) Res; Credibility level (High/Medium/Low) Cred; Alarm prompt Alarm
/* Data preprocessing */
1 Preprocess X: Remove blur/redundancy, adjust size/rotation. If X is a video, perform frame extraction first (Equation (1)): frame interval = native video frame rate / target extraction frame rate
2 Extract image features F via ViT module (patch splitting + vector mapping + positional encoding)
/* Chain-of-Thought (CoT) reasoning */
3 Embed scene type + safety rules into P to generate task-specific prompt P_task (follow “hazard localization—key element decomposition—safety rule binding” logic)
4 Encode P_task into text vector T via text encoder
5 Fuse F and T through cross-modal attention to get fused feature F_mix
6 Generate initial hazard judgment Res_init via decoder (conclusion + visual basis; refer to Table 1, Table 2 and Table 3 for output format)
/* Self-Verification (SV) */
7 Construct three-dimensional credibility matrix M:
a Visual matching degree: Align Res_init with F to check feature consistency
b Safety rule adaptability: Verify Res_init against R to ensure compliance with industry standards
c Evidence sufficiency: Evaluate whether Res_init’s visual basis is sufficient
8 Calculate credibility Cred based on M (High: all 3 dimensions met; Medium: 2 dimensions met; Low: ≤1 dimension met)
9 Revise Res_init if Cred is Medium/Low (supplement uncertainty explanation) to get final Res
/* Result generation */
10 Generate Alarm based on Res and R:
a If Res = hazardous: Output safety rule reminder + hazard level
b If Res = safe: Output normal operation confirmation
11 Return Res, Cred, Alarm
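Steps 1, 7, and 8 of the pseudocode above can be made concrete with a short executable sketch; the function names and boolean inputs are illustrative assumptions, and the credibility thresholds follow step 8 directly.

```python
def frame_interval(native_fps, target_fps):
    """Step 1's video preprocessing: sample one frame every `interval` frames."""
    return native_fps / target_fps

def credibility_level(visual_match, rule_adapt, evidence_ok):
    """Steps 7-8: map the three credibility dimensions
    (visual matching, safety-rule adaptability, evidence sufficiency)
    to a High/Medium/Low level."""
    met = sum([visual_match, rule_adapt, evidence_ok])
    if met == 3:
        return "High"
    if met == 2:
        return "Medium"
    return "Low"  # at most one dimension met; Res_init is revised (step 9)
```

For example, a 30 fps video sampled at 5 fps keeps every 6th frame, and a judgment that matches the image but lacks sufficient visual evidence for one dimension is graded Medium.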

References

  1. Zhu, Y.; Ling, Z.G.; Zhang, Y.Q. Research progress and prospect of machine vision technology. J. Graph. 2020, 41, 871–890. [Google Scholar] [CrossRef]
  2. Leng, S.; Wang, W.; Ou, J.Y.; Xue, Z.G.; Song, Y.L. On-Site construction safety monitoring based on large vision language models. J. Graph. 2025, 46, 960–968. [Google Scholar] [CrossRef]
  3. Li, H.; He, S.; Tan, R.; Liang, C.; Huang, Z.; Li, C. UAV distribution network hidden danger target detection based on lightweight model. Autom. Appl. 2025, 66, 73–78. [Google Scholar] [CrossRef]
  4. Zhao, J.P.; Liu, X.X.; Zhang, X.Z. Intelligent dynamic detection of external scaffold hidden danger based on YOLOv5s. Ind. Saf. Environ. Prot. 2023, 49, 14–19. [Google Scholar]
  5. Li, Z.K.; Lan, Y.F.; Lin, W.W. Footbridge damage detection using smartphone-recorded responses of micromobility and convolutional neural networks. Autom. Constr. 2024, 166, 105587. [Google Scholar] [CrossRef]
  6. Jiang, J.J.; Liu, D.W.; Liu, Y.F.; Ren, Y.G.; Zhao, Z.B. Few-shot object detection algorithm based on siamese network. J. Comput. Appl. 2023, 43, 2325–2329. [Google Scholar] [CrossRef]
  7. Zhang, L.L.; Huang, W.L. Construction of patent knowledge graph based on ChatGPT API and prompt engineering. J. Intell. 2025, 44, 180–187. [Google Scholar] [CrossRef]
  8. Ba, Z.Z.; Zhang, H.; Xie, Z.G.; Zuo, X.D.; Hou, J.W. Automatic Prompt Engineering Technology for Large Language Models: A Survey. J. Front. Comput. Sci. Technol. 2025, 19, 3131–3152. [Google Scholar] [CrossRef]
  9. Zhao, S.X.; Li, Y.; Su, S.M. Construction safety monitoring method based on multiscale feature attention network. Sci. Sin. Technol. 2023, 53, 1241–1252. [Google Scholar] [CrossRef]
  10. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature 2025, 645, 633–638. [Google Scholar] [CrossRef]
  11. Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large Language Models are Better Reasoners with Self-Verification. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 2550–2575. [Google Scholar] [CrossRef]
  12. Zhang, H.J.; Zhang, H.; Yan, W.; Zhuo, S.; Jing, Z.G. A precise extraction method of remanufacturing process knowledge based on chain-of-thought prompting in large language models. Manuf. Technol. Mach. Tool 2025, 10, 90–98. [Google Scholar] [CrossRef]
  13. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  14. Wang, D.; Lu, F.; Zhang, B. A review of prompt engineering in large language models. Comput. Syst. Appl. 2025, 34, 1–10. [Google Scholar] [CrossRef]
  15. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.Y. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  16. Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2024; Volume 38, pp. 17682–17690. [Google Scholar]
  17. Zelikman, E.; Wu, Y.; Mu, J.; Goodman, N. Star: Bootstrapping reasoning with reasoning. Adv. Neural Inf. Process. Syst. 2022, 35, 15476–15488. [Google Scholar]
  18. Madaan, A.; Tandon, N.; Gupta, P. Self-Refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 46534–46594. [Google Scholar]
  19. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H. Let’s Verify Step by Step. Adv. Neural Inf. Process. Syst. 2023. [Google Scholar] [CrossRef]
  20. Qi, Z.; Ma, M.; Xu, J.; Zhang, L.L.; Yang, F.; Yang, M. Mutual reasoning makes smaller LLMs stronger problem-solvers. arXiv 2024, arXiv:2408.06195. [Google Scholar] [CrossRef]
  21. Zhang, D.; Zhoubian, S.; Hu, Z.; Yue, Y.; Dong, Y.; Tang, J. ReST-MCTS*: LLM self-training via process reward guided tree search. Adv. Neural Inf. Process. Syst. 2024, 37, 64735–64772. [Google Scholar] [CrossRef]
  22. Tian, Y.; Peng, B.; Song, L.; Jin, L.; Yu, D. Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. Adv. Neural Inf. Process. Syst. 2024, 37, 52723–52748. [Google Scholar] [CrossRef]
  23. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  24. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  25. Wu, Z.; Chen, X.; Pan, Z.; Liu, X.; Liu, W.; Dai, D. DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding. arXiv 2024, arXiv:2412.10302. [Google Scholar]
  26. Wang, Y.; Li, Q.; Dai, Z.; Xu, Y. Current status and trends in large language modeling research. Chin. J. Eng. 2024, 46, 1411–1425. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Liang, S.; Li, J.; Pan, H. Yolov8s-DDC: A Deep Neural Network for Surface Defect Detection of Bearing Ring. Electronics 2025, 14, 1079. [Google Scholar] [CrossRef]
  28. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024. [Google Scholar] [CrossRef]
Figure 1. Reasoning framework of large vision-language model.
Figure 2. Hazard detection process of vision-language large models.
Figure 3. Model self-verification flow.
Figure 4. Experimental dataset processing flow.
Figure 5. Examples of hazard recognition results using the chain prompt and self-verification method.
Table 1. Chain prompt template in power operation scenarios (taking crane sites as an example).

| Module Type | Core Thinking |
| --- | --- |
| Task Description | Detection: "Is there anyone near the crane?"; "Are all four stabilizing supports of the crane fully extended?" |
| Reasoning Requirements | As a power operation safety supervisor, complete the following two safety detections based on the provided crane operation image. For each, first state a clear conclusion, then explain the visual basis by linking to image features (e.g., position, shape). |
| Example Prompt | Presence of personnel: Confirm clear human outlines (head/torso/limbs) in the image. "None" is reasonable if absent; "Present" if present. Full extension of supports: Distinguish supports (telescopic metal structures on the crane's longer side) from tires (round, non-telescopic). "Fully extended" is reasonable if all 4 are extended; otherwise "Not fully extended". |
| Output Format | Is there anyone under the crane? Conclusion + Basis; Are all four stabilizing supports of the crane fully extended? Conclusion. |

Note: All prompts and self-verification schemes were originally developed in Chinese for power scenarios. To maintain linguistic consistency, they have been translated into English in this paper.
Table 2. Self-verification template for crane operation scenarios.

| Module Type | Core Thinking |
| --- | --- |
| Task Description | Detection: "Are there any people under the crane?"; "Are all four stabilizing supports of the crane fully extended?" |
| Verification Standards | "Whether there are people": confirm whether there are clear human outlines in the image. If none, the original result "No" is reasonable; if any, "Yes" is reasonable. "Whether the supports are fully extended": distinguish supports (metal structures extending from the longer sides of the crane body, with telescopic sections) from tires (round, no telescopic sections). If all 4 supports are extended, "Fully extended" is reasonable; otherwise "Not fully extended" is reasonable. |
| Reliability | Return the reliability level and a brief reason. Format: High/Medium/Low |
| Reason | Explain the matching degree between the original result and image features. For example: "The original result 'Yes, Not fully extended' matches the image features 'person on the right side + rear-left support not extended' with high reliability." |
Table 3. Self-verification template for escalator operation scenes.

| Module Type | Core Thinking |
| --- | --- |
| Task Description | Detection: "How many people are in the scenario?"; "Are there behaviors of people climbing down the ladder backwards or leaning out?" |
| Reasoning Requirements | Number judgment: whether all visible people in the image are accurately identified. Relationship between people and ladders: whether the relative positions and interactions between people and ladders are accurately described. Dangerous behavior judgment: whether "climbing down the ladder backwards" or "leaning out" behavior is accurately identified. |
| Example Prompt | Return the reliability level and a brief reason. Format: High/Medium/Low |
| Output Format | Explain the matching degree between the original judgment and image content, and point out possible deviation points. |
Table 4. Acquisition equipment, experimental platform, and dataset quantity.

| Experimental Platforms, Equipment, and Datasets | Value/Model/Quantity |
| --- | --- |
| Drone | DJI Mini 2 (SZ DJI Technology Co., Ltd., Shenzhen, China) |
| GPU | RTX 4080 Super |
| Mobile Acquisition Device | Xiaomi 13 (Xiaomi Corporation, Beijing, China) |
| Experimental Environment | Linux Ubuntu 18.04 |
| Dataset Quantity | Crane operation: 567 images; Escalator operation: 1146 images |
Table 5. Person recognition for hazardous points in electric crane truck scenarios (positive-to-negative sample ratio = 8:2).

| Model | Accuracy | Misjudgment Rate | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro (with CoT and SV) | 96.3% | 3.7% | 95.6% | 92.8% | 94.2% |
| Deepseek-vl2 (with CoT and SV) | 95.8% | 4.2% | 94.3% | 91.5% | 92.9% |
| Deepseek-R1 (with CoT and SV) | 95.2% | 4.8% | 90.1% | 93.1% | 91.6% |
Table 6. Stabilizing support recognition results for electric crane trucks (positive-to-negative sample ratio = 7:3).

| Model | Accuracy | Misjudgment Rate | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro (with CoT and SV) | 94.7% | 5.3% | 92.8% | 90.7% | 91.7% |
| Deepseek-vl2 (with CoT and SV) | 93.1% | 6.9% | 91.5% | 89.2% | 90.3% |
| Deepseek-R1 (with CoT and SV) | 92.6% | 7.4% | 90.7% | 88.5% | 91.6% |
Note: All model experiments adopted the chain-of-thought (CoT) prompts and self-verification (SV) scheme designed by us. Additionally, we have provided the relevant pseudocode in Appendix A.
Table 7. Model performance in person-related hazard recognition of escalator operation scenes.

| Model Name | Scene Type | Accuracy | Misjudgment Rate | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- | --- |
| Janus-Pro (with CoT and SV) | Escalator Single-Person Scene | 98.5% | 1.5% | 96.8% | 97.4% | 97.1% |
| Janus-Pro (with CoT and SV) | Escalator Two-Person Scene | 92.0% | 8.0% | 90.5% | 89.3% | 89.9% |
| Deepseek-vl2 (with CoT and SV) | Escalator Single-Person Scene | 99.2% | 0.8% | 97.9% | 98.1% | 98.0% |
| Deepseek-vl2 (with CoT and SV) | Escalator Two-Person Scene | 91.2% | 8.8% | 89.8% | 88.7% | 89.2% |
| Deepseek-R1 (with CoT and SV) | Escalator Single-Person Scene | 98.5% | 1.5% | 96.5% | 97.2% | 96.8% |
| Deepseek-R1 (with CoT and SV) | Escalator Two-Person Scene | 90.8% | 9.2% | 89.3% | 88.1% | 88.7% |
Table 8. Detection of workers’ reversing down escalators and leaning out behaviors in escalator operation scenarios.

| Model Name | Scene Type | Accuracy | Misjudgment Rate | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- | --- |
| Janus-Pro (with CoT and SV) | Leaning Out Behavior | 94.3% | 5.7% | 92.3% | 89.2% | 90.7% |
| Janus-Pro (with CoT and SV) | Walking Backwards Behavior | 90.2% | 9.8% | 90.1% | 85.6% | 87.8% |
| Deepseek-vl2 (with CoT and SV) | Leaning Out Behavior | 92.8% | 7.2% | 91.1% | 87.9% | 89.5% |
| Deepseek-vl2 (with CoT and SV) | Walking Backwards Behavior | 89.2% | 10.8% | 88.9% | 84.9% | 86.8% |
| Deepseek-R1 (with CoT and SV) | Leaning Out Behavior | 93.2% | 6.8% | 91.5% | 88.5% | 90.0% |
| Deepseek-R1 (with CoT and SV) | Walking Backwards Behavior | 88.2% | 11.8% | 87.7% | 84.2% | 85.9% |
Table 9. Comparative test results for electric crane truck scenarios.

| Model | Hazard | Precision | Recall | F1-Score | mAP |
| --- | --- | --- | --- | --- | --- |
| Janus-Pro (with CoT and SV) | Personnel intrusion detection | 92.8% | 95.6% | 94.2% | 0.945 |
| YOLOv8s | Personnel intrusion detection | 88.2% | 83.5% | 85.8% | 0.842 |
| D-FINE | Personnel intrusion detection | 82.5% | 85.0% | 83.7% | 0.850 |
| Janus-Pro (with CoT and SV) | Stabilizing support recognition | 90.7% | 92.8% | 91.7% | 0.923 |
| YOLOv8s | Stabilizing support recognition | 84.5% | 78.3% | 81.3% | 0.805 |
| D-FINE | Stabilizing support recognition | 82.1% | 76.7% | 79.3% | 0.782 |
Table 10. Ablation experiment (crane truck scenes).

| Hazard | Experimental Configuration | Accuracy | FNR | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- | --- |
| Person Intrusion | No Prompts + No SV | 30.7% | 47.7% | 52.3% | 65.2% | 58.0% |
| Person Intrusion | With Prompts + No SV | 93.6% | 7.9% | 92.1% | 89.5% | 90.8% |
| Person Intrusion | With Prompts + With SV | 94.2% | 6.5% | 93.5% | 90.3% | 91.9% |
| Person Intrusion | With CoT + No SV | 94.0% | 7.0% | 93.0% | 90.1% | 91.5% |
| Person Intrusion | With CoT + With SV | 96.3% | 4.4% | 95.6% | 92.8% | 94.2% |
| Stabilizing Supports | No Prompts + No SV | 18.4% | 74.3% | 25.7% | 65.0% | 36.5% |
| Stabilizing Supports | With Prompts + No SV | 88.5% | 9.8% | 90.2% | 85.3% | 87.7% |
| Stabilizing Supports | With Prompts + With SV | 89.2% | 8.7% | 91.3% | 86.1% | 88.6% |
| Stabilizing Supports | With CoT + No SV | 91.9% | 8.0% | 92.0% | 88.4% | 90.2% |
| Stabilizing Supports | With CoT + With SV | 94.7% | 7.2% | 92.8% | 90.7% | 91.7% |
Table 11. Ablation experiment (escalator scenes).

| Hazard | Experimental Configuration | Accuracy | FNR | Recall | Precision | F1-Score |
| --- | --- | --- | --- | --- | --- | --- |
| Single-/Double-Person Operation | No Prompts + No SV | 30.5% | 51.1% | 48.9% | 66.1% | 56.3% |
| Single-/Double-Person Operation | With Prompts + No SV | 89.0% | 9.2% | 90.8% | 86.5% | 88.6% |
| Single-/Double-Person Operation | With Prompts + With SV | 89.5% | 8.4% | 91.6% | 87.2% | 89.4% |
| Single-/Double-Person Operation | With CoT + No SV | 89.3% | 8.8% | 91.2% | 86.9% | 89.0% |
| Single-/Double-Person Operation | With CoT + With SV | 94.2% | 5.5% | 94.5% | 89.8% | 92.1% |
| Leaning Out/Reversing Down Escalators | No Prompts + No SV | 27.1% | 57.7% | 42.3% | 65.5% | 51.2% |
| Leaning Out/Reversing Down Escalators | With Prompts + No SV | 88.1% | 10.3% | 89.7% | 85.2% | 87.4% |
| Leaning Out/Reversing Down Escalators | With Prompts + With SV | 89.1% | 9.2% | 90.8% | 86.0% | 88.3% |
| Leaning Out/Reversing Down Escalators | With CoT + No SV | 88.4% | 9.9% | 90.1% | 85.6% | 87.8% |
| Leaning Out/Reversing Down Escalators | With CoT + With SV | 93.2% | 6.7% | 93.3% | 88.9% | 91.0% |
Table 12. Ablation experiment on adjusting partial prompt content (with electric crane trucks’ stabilizing supports as an example).

| Prompt Type | Expression | Accuracy |
| --- | --- | --- |
| Original Prompt | Distinguish stabilizing supports (telescopic metal structures on the crane’s longer sides) from tires (round, non-telescopic). | 94.7% |
| Rewritten 1 | Are all 4 telescopic metal supports of the electric crane truck fully extended? | 93.9% |
| Rewritten 2 | Confirm if all 4 telescopic stabilizing supports at the electric crane truck’s bottom are fully deployed. | 94.3% |
| Rewritten 3 | Determine if all 4 telescopic metal stabilizing supports of the electric crane truck in operation are fully extended. | 93.7% |
| Rewritten 4 | Are all 4 telescopic metal stabilizing supports on both sides of the electric crane truck fully deployed? | 93.1% |
Table 13. Prompt adversarial experiment.

| Prompt Interference Category | Interference Content | Accuracy (with SV) | Accuracy (without SV) |
| --- | --- | --- | --- |
| Incorrect Definition | Stabilizing supports are round rubber structures (similar to tires); check if all 4 are fully extended. | 85.2% | 62.3% |
| Scenario Interference | Ignore the crane’s stabilizing supports: first count the trees in the background, then briefly check their status. | 87.5% | 66.8% |
| Ambiguous Expression | Might the crane’s supporting components be fully extended? | 88.7% | 68.5% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gao, B.; Xia, X.; Zhang, S.; Bai, X.; Li, Y.; Cui, Q.; Kang, W. Power Field Hazard Identification Based on Chain-of-Thought and Self-Verification. Electronics 2026, 15, 556. https://doi.org/10.3390/electronics15030556

Chicago/Turabian Style

Gao, Bo, Xvwei Xia, Shuang Zhang, Xingtao Bai, Yongliang Li, Qiushi Cui, and Wenni Kang. 2026. "Power Field Hazard Identification Based on Chain-of-Thought and Self-Verification" Electronics 15, no. 3: 556. https://doi.org/10.3390/electronics15030556
