A Coarse-to-Fine Intelligent Inspection Framework for Building Fire Hazard Recognition

Ye, Song; Liu, Yuting; Yu, Chunjin; Chen, Jialei; Wan, Xili; Wang, Lu; Zhang, Guangming

doi:10.3390/buildings16101958

Open Access Article

by Song Ye ¹, Yuting Liu ², Chunjin Yu ², Jialei Chen ², Xili Wan ²,*, Lu Wang ¹ and Guangming Zhang ¹

¹ School of Civil Engineering, Nanjing Tech University, Nanjing 210037, China

² College of Computer and Information Engineering, Nanjing Tech University, Nanjing 210037, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(10), 1958; https://doi.org/10.3390/buildings16101958

Submission received: 9 April 2026 / Revised: 7 May 2026 / Accepted: 10 May 2026 / Published: 15 May 2026

(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

Building fire safety inspection is a knowledge-intensive engineering task that requires reliable hazard recognition under complex visual conditions, limited labeled data, and strict regulatory accountability. To address these challenges, this paper proposes a coarse-to-fine intelligent inspection framework for building fire hazard recognition and regulation-grounded reporting. The framework first performs binary hazard screening and then refines positive or uncertain cases into specific hazard categories, thereby aligning the inference process with practical inspection workflows. A self-supervised DINOv2 Vision Transformer is adopted as the visual backbone, and a small-sample adaptation strategy is developed by combining staged fine-tuning, a lightweight SE-based classification head, and task-aligned knowledge distillation. In addition, an Agentic RAG compliance layer is introduced to retrieve, verify, and present clause-level regulatory evidence while suppressing hallucinated or unverifiable citations. Experiments on a real-world building fire hazard image dataset show that the proposed framework achieves stable recognition performance, outperforms representative CNN-, supervised Transformer-, and self-supervised Transformer-based baselines, and improves the faithfulness of regulation-grounded reporting. The results suggest that the proposed framework provides a feasible prototype-level pathway toward intelligent and auditable fire safety inspection, while broader multi-site validation and robustness evaluation remain necessary for future deployment.

Keywords: building fire safety inspection; fire hazard recognition; coarse-to-fine framework; self-supervised vision transformer; engineering decision support

1. Introduction

Building fire safety inspection is a representative knowledge-intensive engineering task, in which inspectors are required not only to identify hazardous visual conditions in complex environments, but also to support subsequent review, reporting, and rectification decisions with reliable evidence [1]. As one of the most direct threats to human life and property during both the construction and operational stages of buildings, fire hazards have long been identified through manual inspection, checklist-based assessment, and experience-driven judgment [2,3,4]. As a critical final step in project delivery, fire safety acceptance still faces a severe “last-mile” challenge: how to efficiently and accurately align relatively structured design-stage requirements with the unstructured and dynamically changing realities of the physical construction site. In current practice, on-site acceptance remains highly labor-intensive, requiring inspectors to visually examine the scene, collect photographic evidence, produce written descriptions, and manually retrieve regulatory clauses to determine whether a situation constitutes a hazard and whether it violates applicable codes. Although this workflow offers flexibility, it is difficult to ensure high-frequency monitoring, comprehensive coverage, and consistent decision quality in complex environments such as large commercial complexes, underground spaces, and high-rise buildings. Consequently, missed hazards and inconsistent judgments may still occur, thereby introducing long-term safety risks into building operation [5].

To improve inspection efficiency and engineering information management, the construction industry has actively explored digital technologies such as Building Information Modeling (BIM). By establishing lifecycle-aware digital representations of buildings, BIM has significantly improved design coordination, clash detection, and safety planning [6]. However, during fire safety acceptance, BIM often serves only as a static design reference, because it cannot reliably capture real-time site changes, temporary installations, deviations in construction quality, or on-site clutter emerging during implementation. Therefore, BIM alone is insufficient for automated compliance verification in dynamic acceptance scenarios. In parallel, rapid progress in computer vision and deep learning has provided a new technical pathway for intelligent fire safety inspection [7,8]. From early handcrafted-feature methods to modern convolutional neural networks (CNNs) [9] and vision Transformers [10,11], the robustness and generalization of visual recognition under complex scene conditions have been substantially improved. In particular, self-supervised visual foundation models represented by DINO and DINOv2 have demonstrated strong semantic representation capability and transferability under limited labeled data [12,13,14], making them especially attractive for small-sample hazard recognition in cluttered engineering environments.

Despite these advances, two fundamental challenges remain unresolved in practical fire safety acceptance. First, building fire safety inspection is conducted in highly variable on-site environments, where images are frequently affected by poor illumination, occlusion, viewpoint variation, scale changes, and background clutter. These factors lead to subtle inter-class differences and ambiguous decision boundaries, which place stringent demands on the robustness of visual models [15]. Second, fire safety acceptance is not merely a visual recognition problem, but also a regulation-driven compliance judgment task. In safety-critical settings, the system must not only identify and classify hazards accurately, but also cite the correct regulatory basis and provide truthful, auditable explanations that are strictly consistent with the original clauses. Recent studies have begun exploring vision-language or LLM-driven hazard recognition and compliance checking in the building safety domain [3]. However, general-purpose large language models are known to exhibit hallucinations in knowledge-intensive applications, such as fabricating clause numbers, misattributing standards, or generating explanations that deviate from authoritative text [16,17,18]. Such behavior is unacceptable in fire safety acceptance, where the output must be regulation-grounded, legally defensible, and operationally trustworthy.

Motivated by these challenges, this study formulates intelligent fire safety acceptance as two tightly coupled objectives: (1) hazard recognition and classification, i.e., determining whether an image contains a fire safety hazard and assigning it to a predefined category; and (2) precise regulatory grounding, i.e., producing authoritative and traceable clause-level evidence to support acceptance conclusions. To this end, we propose a coarse-to-fine intelligent inspection framework for building fire safety hazard analysis. Instead of treating the task as a single flat classification problem, the proposed framework decomposes the recognition process into two hierarchically organized stages: hazard presence screening and hazard-type classification. This design better matches practical inspection workflows, in which rapid screening is followed by refined interpretation for positive or uncertain cases. A self-supervised vision Transformer backbone is employed to extract transferable semantic representations from hazard images, and a staged fine-tuning strategy is introduced to stabilize adaptation under small-sample conditions by progressively unfreezing high-level semantic layers. In addition, an SE-based lightweight classification head is adopted to enhance discriminative feature reweighting, while a task-aligned knowledge distillation strategy is incorporated to improve prediction stability and generalization across both binary and multi-class tasks [19]. To further support regulation-grounded acceptance, we incorporate an Agentic RAG compliance layer that decomposes explanation generation into planning, retrieval, and verification steps, thereby enabling clause-level evidence alignment and suppressing hallucinated or mis-cited compliance outputs [20,21,22].

Compared with conventional one-stage recognition pipelines, the proposed framework offers two advantages from an engineering informatics perspective. First, it introduces a hierarchical visual decision process that is more compatible with inspection-oriented review logic, making the recognition results more actionable for follow-up verification and intervention. Second, it couples robust small-sample visual recognition with regulation-grounded evidence generation, thereby moving from pure hazard detection toward intelligent acceptance support for building fire safety.

The main contributions of this paper are summarized as follows:

We propose a coarse-to-fine intelligent inspection framework for building fire safety hazard recognition, which decomposes the task into hazard screening and hazard-type refinement to better match practical inspection workflows.
We introduce a self-supervised DINOv2-based visual adaptation strategy for small-sample fire hazard recognition, combining staged fine-tuning, partial backbone unfreezing, and task-aligned knowledge distillation to improve training stability.
We design a lightweight SE-based classification head to enhance channel-wise discriminative feature recalibration under complex backgrounds, weak hazard cues, and ambiguous visual conditions.
We incorporate a regulation-grounded Agentic RAG compliance layer to retrieve, verify, and present clause-level regulatory evidence, supporting traceable and auditable fire safety acceptance conclusions.

The remainder of this paper is organized as follows: Section 2 reviews related studies on vision-based fire safety hazard recognition, self-supervised visual representation learning, and regulation-grounded compliance reporting. Section 3 presents the proposed coarse-to-fine intelligent inspection framework, including the two-stage inference pipeline, the SE-based classification head, the staged fine-tuning and knowledge distillation strategy, and the Agentic RAG compliance layer for verifiable regulation alignment. Section 4 reports the experimental setup and evaluation results on both binary and multi-class tasks, together with ablation studies and qualitative analyses. Finally, Section 5 concludes this paper and outlines future research directions.

2. Background and Related Work

Building fire safety acceptance requires the coordinated support of two tightly coupled capabilities: reliable hazard recognition from complex visual scenes and regulation-grounded compliance alignment for audit-ready conclusions. Accordingly, prior research has evolved along two main directions. The first concerns vision-based fire hazard recognition, with recent emphasis on robust representation learning under limited labeled data and visually complex construction-site conditions. The second concerns regulation knowledge retrieval and verifiable generation, where retrieval-augmented generation (RAG) and related workflows are increasingly used to map recognition results to authoritative clauses, construct traceable evidence chains, and suppress hallucinations in professional compliance reporting.

2.1. Vision-Based Fire Hazard Recognition for Intelligent Inspection

Traditional fire hazard recognition has long relied on manual inspection and expert judgment based on visual observation [23]. Although flexible, this approach is highly dependent on inspector experience and is difficult to scale in large, visually cluttered, or dynamically changing building environments. As summarized in Figure 1, vision-based fire safety hazard detection has evolved through several representative stages. In the 2000s and early 2010s, early automated methods mainly relied on handcrafted descriptors, such as HOG [24], SIFT [25], and LBP [26], combined with classical classifiers such as support vector machines [27]. Representative studies include the color model-based fire detector of Celik et al. [28], the vision-sensor and SVM-based framework of Ko et al. [29], and the SIFT-based static-image fire detection method of Ghassempour et al. [30]. These methods are computationally lightweight and interpretable, but their reliance on manually designed features makes them sensitive to illumination variation, viewpoint changes, occlusion, and background clutter.

Since around 2012, deep learning has substantially advanced fire and smoke detection. CNN-based models can learn hierarchical features directly from raw images and have demonstrated strong performance, as shown by the DNCNN model of Yin et al. [31] and the real-time full-scale forest-fire smoke detector of Zheng et al. [32]. For practical deployment, YOLO-based methods have become widely used because of their favorable speed–accuracy trade-off, and recent variants such as YOLOv11-CHBG [33] further improve robustness to occlusion and small targets. Since approximately 2020, Vision Transformer and DeiT-style models have introduced global attention mechanisms into visual recognition, providing stronger contextual modeling capability for cluttered scenes [10,11]. However, supervised Transformer models are often data-hungry and may involve higher training costs, which limits their direct applicability in small-sample engineering scenarios.

More recently, since around 2021, self-supervised visual foundation models represented by DINO and DINOv2 have provided a promising pathway for limited-label visual inspection by learning transferable representations from large-scale unlabeled images. In parallel, CLIP-like vision-language models and recent vision-LLM systems have introduced a new multimodal paradigm for visual inspection [34,35]. By aligning visual representations with natural-language prompts, these models support open-vocabulary recognition, prompt-based category expansion, and image-level semantic interpretation. Since around 2023, multimodal and engineering-oriented inspection systems have further begun to integrate visual recognition with vision-language models, regulatory knowledge retrieval, and closed-loop reporting workflows, as exemplified by the vision-LLM-driven compliance-checking framework of Chen et al. [3]. Despite these advances, existing vision-only and CLIP/VLM-based methods still face engineering challenges, including costly annotation, prompt sensitivity, weak hazard cues, borderline cases, and limited cross-scene generalization [7,8,36]. Moreover, many studies emphasize flat classification, detection accuracy, or open-ended visual-language responses, while giving limited consideration to hierarchical decision logic and verifiable regulatory grounding in practical inspection workflows. These limitations motivate the development of a more robust and inspection-oriented visual recognition framework.

2.2. Self-Supervised Visual Representation Learning for Small-Sample Engineering Scenes

The rapid development of self-supervised visual representation learning has provided a new opportunity for hazard recognition in engineering scenarios with limited labeled data. Compared with conventional supervised learning, self-supervised methods learn transferable semantic representations from large-scale unlabeled image collections, thereby reducing dependence on task-specific annotations. This property is particularly important for building fire safety inspection, where data collection and expert annotation are expensive, yet strong semantic understanding of complex scenes is required.

Among recent self-supervised vision models, DINO and DINOv2 are especially relevant. DINO introduced a teacher–student self-distillation paradigm that enables Vision Transformers to learn semantically meaningful visual representations without manual labels [13,14]. Building on this line of research, DINOv2 further improves the robustness and transferability of learned features through large-scale training and architectural refinements [12]. Based on a ViT backbone [10,11], DINOv2 supports multiple model scales and has demonstrated strong performance across downstream classification, detection, and dense prediction tasks. In small-label regimes, it is particularly attractive because low-level representations can be largely preserved while only lightweight task heads or higher-level Transformer blocks are adapted to the target task [12]. This makes self-supervised vision foundation models more suitable than conventional supervised backbones for fire hazard recognition under limited labeled data and complex backgrounds.

Recent visual inspection systems increasingly integrate Transformer modules with task-specific detection or classification architectures, such as the Dual-Attention YOLO proposed by Song et al. [37], to enhance recognition performance in dense and cluttered environments. However, directly transferring foundation models to fire hazard recognition remains challenging under small labeled datasets and ambiguous class boundaries, where unstable adaptation and representation drift may occur.

2.3. Regulation-Grounded Compliance Reasoning and Agentic RAG

Although visual recognition is indispensable for identifying potential hazards, fire safety acceptance is ultimately a regulation-driven compliance judgment task. In practice, recognition results must be connected to authoritative standards and clause-level evidence to produce conclusions that are reviewable, auditable, and operationally trustworthy. This requirement distinguishes fire safety acceptance from generic image classification or detection tasks and highlights the importance of knowledge retrieval and verifiable generation.

Retrieval-augmented generation (RAG) has emerged as an important paradigm for knowledge-intensive tasks by coupling external retrieval with conditional generation [20]. In recent years, RAG systems have been enhanced along multiple dimensions, including dense retrieval, reranking, multimodal retrieval, and robustness-oriented workflow design [17,18,21]. For professional domains, the key requirements are not only retrieval relevance, but also citation faithfulness and verifiable generation. In other words, the output must strictly preserve the triplet of {standard name, clause identifier, official text}, while avoiding fabricated clause numbers, incorrect standards, or plausible but unverifiable regulatory statements. This issue is particularly critical because general-purpose language models are known to hallucinate in specialized domains, and retrieval augmentation has been shown to help reduce such hallucinations by grounding responses in retrieved evidence [16].

Recent research has therefore shifted from simple "generate given context" pipelines toward more controllable and auditable generation workflows. In compliance-oriented scenarios, this trend emphasizes deterministic consistency checks, refusal or fallback strategies under insufficient evidence, and multi-step planning–retrieval–verification mechanisms to improve reliability. Within this context, Agentic RAG can be viewed as a regulation-alignment layer for fire safety acceptance: conditioned on visually predicted hazard candidates, the system retrieves relevant official clauses from a structured regulatory knowledge base, verifies citation consistency, and then produces clause-grounded compliance explanations. Compared with conventional single-shot generation, such a workflow is better suited to professional inspection scenarios where faithfulness, traceability, and audit readiness are essential.

However, existing studies on fire safety inspection still rarely couple robust visual hazard recognition with regulation-grounded, hallucination-suppressed compliance reporting in a unified framework. Chen et al. [3] made an important step by exploring a vision-LLM-driven pipeline for hazard recognition and compliance checking, but the broader problem of building a closed-loop pipeline from hazard perception to audit-ready clause citation remains insufficiently studied. Our work addresses this gap by integrating a coarse-to-fine visual recognition pipeline with an Agentic RAG compliance layer, thereby supporting both accurate hazard identification and verifiable regulation alignment for intelligent fire safety acceptance.

3. Methodology

This study proposes a coarse-to-fine intelligent inspection framework for building fire safety acceptance. As illustrated in Figure 2, the framework consists of two tightly coupled components: (1) a visual hazard recognition module for hazard screening and hazard-type refinement; and (2) a regulation-grounded compliance module for clause-level evidence retrieval and verification. The former provides robust perception under small-sample and visually complex conditions, while the latter transforms recognition results into audit-ready compliance outputs.

At the visual level, we adopt DINOv2 with a ViT-S/14 backbone as the feature extractor because of its strong transferability and robustness under limited labeled data [12]. Without modifying the backbone architecture, two lightweight task heads are attached in parallel: a binary head for hazard presence screening and a multi-class head for hazard-type recognition. Instead of formulating fire hazard recognition as a single flat classification problem, the proposed framework follows a hierarchical decision logic that better matches practical inspection workflows. The binary head first determines whether a hazard is present, and the multi-class head is subsequently used to refine the hazard type for positive or uncertain cases. The compliance layer is activated only when the sample is predicted as Hazard or Review, thereby avoiding unnecessary retrieval and preserving consistency with human-in-the-loop inspection procedures.

3.1. Framework Overview and Hierarchical Decision Strategy

Let $x$ denote an input image. The binary head produces a hazard presence probability $p_{b}(x)$, and the multi-class head outputs a category distribution $p_{m}(x) \in \mathbb{R}^{C}$ over $C$ predefined hazard types. To support inspection-oriented decision making, we introduce a three-state gating strategy:

$$\hat{y}_{gate}(x) = \begin{cases} \text{No\_Hazard}, & p_{b}(x) \leq T_{low}, \\ \text{Review}, & T_{low} < p_{b}(x) < T_{high}, \\ \text{Hazard}, & p_{b}(x) \geq T_{high}, \end{cases} \tag{1}$$

where $T_{low}$ and $T_{high}$ are the lower and upper confidence thresholds, respectively.

The gating thresholds were determined using the validation set and fixed before independent test-set evaluation. The lower threshold was selected to preserve hazard recall and avoid directly suppressing visually ambiguous samples as No_Hazard, whereas the upper threshold was used to identify high-confidence hazard cases. Samples falling between the two thresholds were routed to the Review state. This setting reflects the asymmetric risk of fire safety inspection, where false negatives are more safety-critical than false positives. Therefore, the thresholds are used as prototype-level confidence-routing parameters rather than universally optimal deployment thresholds, and more systematic calibration under larger multi-site datasets will be conducted in future work.
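One way such validation-based threshold selection could look is sketched below. The recall and precision targets, the function name `select_thresholds`, and the search over observed scores are all assumptions made for illustration; the paper does not specify the selection procedure beyond the goals stated above.

```python
def select_thresholds(probs, labels, min_recall=0.98, min_precision=0.95):
    """Pick (t_low, t_high) from validation scores (hypothetical procedure).

    probs:  binary-head probabilities p_b(x) on the validation set
    labels: 1 = hazard, 0 = no hazard
    """
    candidates = sorted(set(probs))
    n_pos = sum(labels)

    # t_low: largest candidate for which hazards with p > t_low still reach
    # the recall target (recall is non-increasing in t, so the last hit wins).
    t_low = 0.0
    for t in candidates:
        kept = sum(1 for p, y in zip(probs, labels) if y == 1 and p > t)
        if n_pos > 0 and kept / n_pos >= min_recall:
            t_low = t

    # t_high: smallest candidate above t_low whose flagged set (p >= t_high)
    # reaches the precision target; falls back to 1.0 if none qualifies.
    t_high = 1.0
    for t in candidates:
        if t <= t_low:
            continue
        flagged = [y for p, y in zip(probs, labels) if p >= t]
        if flagged and sum(flagged) / len(flagged) >= min_precision:
            t_high = t
            break
    return t_low, t_high
```

Choosing the lower threshold from a recall constraint and the upper one from a precision constraint mirrors the asymmetric-risk argument in the text: the first protects against missed hazards, the second limits the samples that bypass human review.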

This hierarchical gating mechanism provides two practical benefits. First, it separates clear negative samples from positive or uncertain ones, thereby reducing false triggering in visually ambiguous scenes. Second, it explicitly routes borderline cases to a Review state, which is more consistent with engineering inspection workflows than forcing a hard decision for every image. For samples categorized as Hazard or Review, the multi-class head further predicts the most likely hazard category:

$$\hat{y}_{cls}(x) = \arg\max_{c \in \{1, \dots, C\}} p_{m,c}(x) \tag{2}$$

Therefore, the visual module does not merely output a class label, but instead provides a hierarchical inspection result consisting of hazard presence, confidence-aware review state, and refined hazard category, which forms the input to the downstream compliance-verification layer.
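The hierarchical result described above can be sketched in a few lines. This is a minimal illustration of Equations (1) and (2), not the authors' implementation: the threshold values 0.3 and 0.7, the function names, and the category names are placeholders chosen for the example.

```python
def gate(p_b, t_low, t_high):
    """Three-state gating of Equation (1)."""
    if p_b <= t_low:
        return "No_Hazard"
    if p_b >= t_high:
        return "Hazard"
    return "Review"


def inspect(p_b, p_m, class_names, t_low=0.3, t_high=0.7):
    """Combine the binary gate with the multi-class argmax of Equation (2).

    p_b: hazard presence probability from the binary head
    p_m: list of class probabilities from the multi-class head
    t_low / t_high: placeholder values, not the thresholds used in the paper
    """
    state = gate(p_b, t_low, t_high)
    result = {"state": state, "p_b": p_b, "category": None}
    if state != "No_Hazard":
        # Refine the hazard type only for Hazard or Review samples.
        best = max(range(len(p_m)), key=p_m.__getitem__)
        result["category"] = class_names[best]
    return result


# Hypothetical category names, used only for this example.
NAMES = ["blocked_exit", "exposed_wiring", "missing_extinguisher"]
report = inspect(0.55, [0.2, 0.5, 0.3], NAMES)
```

Keeping the category empty for No_Hazard samples makes it explicit that the downstream compliance layer has nothing to retrieve for them.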

3.2. Small-Sample Adaptation via Two-Stage Fine-Tuning

Although DINOv2 provides strong transferable visual representations, directly fine-tuning all parameters on a small fire hazard dataset may lead to unstable optimization and representation drift. To address this issue, we adopt a two-stage fine-tuning strategy for stable small-sample adaptation.

Stage I: task-head adaptation. In the first stage, all backbone parameters are frozen and only the lightweight classification heads are optimized. This stage rapidly adapts the output layers to the target label space while preserving the generic semantic representations learned during large-scale pretraining. Let $L_{sup}$ denote the supervised task loss. The stage-I model is obtained as

$$M^{(1)} = \arg\min L_{sup}, \tag{3}$$

where model selection is performed using validation-based early stopping.

Stage II: high-level semantic refinement. Based on $M^{(1)}$, we unfreeze the last several Transformer blocks (two blocks in this study), together with the normalization layers and task heads, for joint optimization. This partial unfreezing strategy enables the model to refine task-relevant high-level semantics while avoiding excessive destruction of the pretrained representation space. To further stabilize adaptation, we adopt layer-wise learning rates:

$$\eta_{head} \gg \eta_{bb}, \tag{4}$$

where $\eta_{head}$ is the learning rate for the task heads and $\eta_{bb}$ is the learning rate for the unfrozen backbone layers.
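The partial unfreezing and layer-wise learning rates can be illustrated with a framework-agnostic sketch that sorts parameter names into groups. Everything concrete here is an assumption: the parameter naming scheme (`blocks.<i>.`, `norm.`, `head...`) only loosely follows common ViT implementations, "normalization layers" is read as the final backbone norm, and the learning-rate values are placeholders that merely satisfy Equation (4).

```python
import re


def build_param_groups(param_names, num_blocks, n_unfrozen=2,
                       lr_head=1e-3, lr_bb=1e-5):
    """Assign parameters to head / backbone groups or leave them frozen.

    param_names: iterable of parameter names (hypothetical naming scheme)
    num_blocks:  total number of Transformer blocks in the backbone
    n_unfrozen:  number of trailing blocks to unfreeze (two in the paper)
    lr_head, lr_bb: placeholder learning rates with lr_head >> lr_bb
    """
    first_trainable = num_blocks - n_unfrozen
    groups = {
        "head": {"lr": lr_head, "params": []},
        "backbone": {"lr": lr_bb, "params": []},
    }
    frozen = []
    for name in param_names:
        block = re.match(r"blocks\.(\d+)\.", name)
        if name.startswith("head"):
            groups["head"]["params"].append(name)
        elif block and int(block.group(1)) >= first_trainable:
            groups["backbone"]["params"].append(name)
        elif name.startswith("norm."):
            # Assumed to be the final backbone normalization layer.
            groups["backbone"]["params"].append(name)
        else:
            frozen.append(name)
    return groups, frozen
```

In a deep-learning framework, the two groups would typically be passed to the optimizer as separate parameter groups, while the frozen names would have gradient tracking disabled.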

A practical challenge in Stage II is that the feature space may drift substantially after partial unfreezing. If strong knowledge distillation is imposed from the beginning, the student model may suffer from gradient conflict between hard-label supervision and soft-target alignment, because the teacher predictions are still tied to the Stage I feature space. To alleviate this issue, we introduce a progressive warm-up schedule for the distillation weight:

$$\alpha_{now} = \alpha \cdot \min\left(1, \frac{\mathrm{ep}}{\mathrm{warmup\_epochs}}\right), \tag{5}$$

where $\alpha$ is the target distillation weight, $\mathrm{ep}$ is the current epoch, and $\mathrm{warmup\_epochs}$ is the warm-up duration. This strategy allows the model to first establish stable task-specific decision boundaries using ground-truth labels, and then gradually strengthen soft-target alignment after the feature space becomes more stable.
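Equation (5) translates directly into a small helper. The sketch below assumes that epochs are counted from zero, so distillation is switched off in the first epoch; the paper does not state the indexing convention, and the guard for a non-positive warm-up length is an added safety check.

```python
def distill_weight(epoch, alpha, warmup_epochs):
    """Distillation weight alpha_now of Equation (5).

    epoch:         current epoch (assumed to start at 0)
    alpha:         target distillation weight
    warmup_epochs: number of epochs over which the weight ramps up linearly
    """
    if warmup_epochs <= 0:
        # No warm-up requested: use the target weight immediately.
        return alpha
    return alpha * min(1.0, epoch / warmup_epochs)


# Linear ramp during warm-up, constant afterwards.
schedule = [distill_weight(ep, alpha=0.5, warmup_epochs=5) for ep in range(8)]
```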

Overall, the two-stage design improves training stability, suppresses overfitting, and enhances reproducibility under small-sample conditions, which is particularly important for engineering inspection scenarios with limited annotations and high visual uncertainty.

3.3. Regulation-Grounded Agentic RAG for Compliance Verification

Visual recognition alone is insufficient for fire safety acceptance, because acceptance conclusions must be supported by authoritative regulations and clause-level evidence. To address this requirement, we append a regulation-grounded compliance layer to the visual recognition module. This layer does not modify the DINOv2 backbone or the lightweight classification heads; instead, it transforms recognition outputs into compliance-ready results through retrieval, verification, and citation control.

Knowledge assets. We first build a structured regulation knowledge base in which each entry is stored as a triplet

$$r = \{s, c, t\}, \tag{6}$$

where $s$ denotes the standard name, $c$ the clause identifier, and $t$ the official clause text. In parallel, we maintain a hazard–regulation index that maps each predefined hazard category to relevant clauses and standardized remediation suggestions, thereby forming a compact acceptance-oriented knowledge repository.
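A possible in-memory layout for these two knowledge assets is sketched below. The class name `Clause`, the dictionary layout, the hazard category, and all standard names and clause texts are placeholders invented for the example; they do not quote any real regulation or the authors' data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One knowledge-base entry r = {s, c, t} from Equation (6)."""
    standard: str   # s: standard name
    clause_id: str  # c: clause identifier
    text: str       # t: official clause text


# Placeholder records keyed by (standard, clause identifier).
KNOWLEDGE_BASE = {
    ("EXAMPLE-STD", "1.2.3"): Clause("EXAMPLE-STD", "1.2.3", "Placeholder clause text A."),
    ("EXAMPLE-STD", "4.5.6"): Clause("EXAMPLE-STD", "4.5.6", "Placeholder clause text B."),
}

# Hazard-regulation index: hazard category -> clause keys + remediation text.
HAZARD_INDEX = {
    "blocked_exit": {
        "clause_keys": [("EXAMPLE-STD", "1.2.3")],
        "remediation": "Placeholder remediation suggestion.",
    },
}


def evidence_for(hazard):
    """Return the evidence package for a hazard category, or None if unmapped."""
    entry = HAZARD_INDEX.get(hazard)
    if entry is None:
        return None
    clauses = [KNOWLEDGE_BASE[key] for key in entry["clause_keys"]]
    return {"hazard": hazard, "clauses": clauses, "remediation": entry["remediation"]}
```

Storing clauses as immutable records and referencing them by key means that any citation in a report can be traced back to exactly one stored triplet.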

Activation policy. For an input image, the visual module outputs the gating result $\hat{y}_{gate}(x)$ and, when necessary, the predicted hazard category $\hat{y}_{cls}(x)$. The compliance layer is activated only when $\hat{y}_{gate}(x) \in \{\text{Hazard}, \text{Review}\}$, which avoids unnecessary retrieval for clear negative samples and remains consistent with practical review workflows.

Agentic workflow. To suppress hallucinated citations and improve auditability, we adopt an Agentic RAG workflow [20,21,22], which decomposes compliance reporting into three controllable steps:

1. Planning (candidate selection): Conditioned on the visual predictions, the agent selects top-k candidate hazard categories strictly from the predefined hazard taxonomy.
2. Retrieval (evidence acquisition): For each candidate, the agent retrieves a compact evidence package from the regulation knowledge base, including the standard name, clause identifier, official clause text, and standardized remediation description.
3. Verification (anti-hallucination constraints): The agent is not allowed to freely generate regulatory citations. Instead, it must output clause identifiers and quoted text that exactly match the retrieved records. Deterministic consistency checks are enforced on clause IDs and official text, and multi-round self-consistency voting is used to improve decision stability. Candidates that fail verification are either removed or downgraded to a Review-only output without any citation.

Compliance-ready output. The final output contains the confirmed hazard category, a standardized hazard description, and verified regulatory citations in the form of {standard name, clause identifier, official text}. If no candidate passes verification, the system returns Review required or a hazard-only output without citation. In this way, the framework ensures that acceptance reports do not contain fabricated, misattributed, or unverifiable regulatory references.
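The deterministic part of the verification step and the fallback behaviour can be sketched as an exact-match filter. This is an illustrative simplification: the dictionary keys, the function names, and the "Confirmed" status label are assumptions, and the multi-round self-consistency voting mentioned above is omitted.

```python
def verify_citations(drafted, retrieved):
    """Keep only drafted citations that exactly match a retrieved record.

    Both arguments are lists of dicts with the keys
    'standard', 'clause_id' and 'text' (hypothetical schema).
    """
    allowed = {(r["standard"], r["clause_id"], r["text"]) for r in retrieved}
    return [
        d for d in drafted
        if (d["standard"], d["clause_id"], d["text"]) in allowed
    ]


def compliance_output(hazard, drafted, retrieved):
    """Build the final report entry, falling back when nothing verifies."""
    verified = verify_citations(drafted, retrieved)
    if verified:
        return {"status": "Confirmed", "hazard": hazard, "citations": verified}
    # No citation survived verification: report without any regulatory reference.
    return {"status": "Review required", "hazard": hazard, "citations": []}
```

Because the check is a set-membership test on the full triplet, a citation with a correct clause identifier but altered text is rejected just like a fabricated identifier.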

3.4. Lightweight Classification Head (SEHead)

Given the characteristics of construction and building-site images, such as complex backgrounds, large variations in hazard-region scales, and dispersed discriminative cues, this study introduces a lightweight SEHead classification head to enhance discriminative capability without modifying the backbone architecture.

The core idea of SEHead is to perform global statistical aggregation on the token features output by the backbone and to recalibrate key discriminative dimensions through channel-wise attention. As illustrated in Figure 3, the proposed SEHead first removes prefix tokens, then extracts global mean- and max-pooled statistics from patch tokens, and finally applies channel-wise recalibration before linear classification.

Let the backbone output token features be

$$X \in \mathbb{R}^{B \times N \times C}, \tag{7}$$

where $B$ denotes the batch size, $N$ is the number of tokens, and $C$ represents the channel dimension. After removing prefix tokens (e.g., CLS or Reg tokens), the remaining patch tokens are denoted as $X_{p}$. Two types of global statistics are then computed over the token dimension:

$$f_{avg} = \mathrm{MeanPool}(X_{p}) \in \mathbb{R}^{B \times C}, \tag{8}$$

$$f_{\max} = \mathrm{MaxPool}(X_{p}) \in \mathbb{R}^{B \times C}. \tag{9}$$

The two statistics are concatenated along the channel dimension:

$$f = \mathrm{Concat}(f_{avg}, f_{\max}) \in \mathbb{R}^{B \times 2C} \tag{10}$$

Channel-wise recalibration is subsequently performed via the squeeze-and-excitation (SE) mechanism:

w = σ (W_{2} δ (W_{1} f)) \in R^{B \times 2 C}

(11)

where δ (\cdot) denotes the ReLU activation function and σ (\cdot) represents the sigmoid function. The recalibrated feature is obtained as \tilde{f} = f ⊙ w, which is then fed into a linear classifier to produce the logits:

z = W_{c} \tilde{f} + b_{c}

(12)

SEHead introduces only a small number of fully connected parameters and negligible computational overhead, yet it effectively strengthens discriminative channels relevant to fire hazards. As a result, it improves both the expressive capacity and stability of the classification head, particularly under small-sample conditions.
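To make the data flow of Equations (8)–(12) concrete, the following dependency-free sketch runs the SEHead forward pass for a single image using nested lists. It is an illustration under assumptions rather than the authors' code: the hidden width of the excitation layers, the omission of biases in W_1 and W_2, and the toy weights are all chosen here for readability.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def se_head(tokens, num_prefix, W1, W2, Wc, bc):
    """Forward pass for one image; tokens is an N x C nested list."""
    patches = tokens[num_prefix:]                        # drop CLS/Reg tokens
    C = len(patches[0])
    f_avg = [sum(t[c] for t in patches) / len(patches) for c in range(C)]  # Eq. (8)
    f_max = [max(t[c] for t in patches) for c in range(C)]                 # Eq. (9)
    f = f_avg + f_max                                    # Eq. (10), length 2C
    hidden = [max(0.0, dot(row, f)) for row in W1]       # ReLU(W1 f)
    w = [sigmoid(dot(row, hidden)) for row in W2]        # Eq. (11), length 2C
    f_tilde = [fi * wi for fi, wi in zip(f, w)]          # element-wise gating
    return [dot(row, f_tilde) + b for row, b in zip(Wc, bc)]  # Eq. (12)


# Toy input: one prefix token followed by three patch tokens, C = 2.
tokens = [[100.0, 100.0], [1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
W1 = [[0.0] * 4 for _ in range(2)]   # hidden x 2C (hidden width is an assumption)
W2 = [[0.0] * 2 for _ in range(4)]   # 2C x hidden
Wc = [[1.0] * 4]                     # a single output logit
bc = [0.0]
logits = se_head(tokens, 1, W1, W2, Wc, bc)
print(logits)
```

With all-zero excitation weights every gate equals 0.5, so the toy logit is half the sum of the pooled statistics; the large prefix token has no influence because it is removed before pooling.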

3.5. Distillation-Enhanced Training Strategy (Task-Aligned KD)

To mitigate overfitting risks caused by limited sample size, class imbalance, and annotation noise in fire hazard images, we introduce knowledge distillation (KD) during the two-stage fine-tuning process. The core idea is that a teacher model provides a soft target distribution, and the student model, under hard-label supervision, further aligns its outputs to the teacher distribution to smooth the decision boundary and improve generalization stability. Both the binary and multi-class tasks follow the same “soft-distribution alignment” paradigm, but differ in how the distributions are constructed and how the distillation loss is formulated: for the binary task, the sigmoid probability is explicitly expanded into a two-class distribution before KL-based alignment, whereas for the multi-class task, KL alignment is applied directly to the softmax distribution. To reduce alignment conflicts in the early stage of Phase-II fine-tuning, the distillation weight follows the warm-up schedule described in Section 3.2, enabling KD to serve as an effective regularizer and knowledge-transfer mechanism while maintaining stable convergence.

3.5.1. Binary Knowledge Distillation (Binary KD)

The student and teacher models output scalar logits z_{s}, z_{t} \in R, respectively. To obtain soft targets suitable for distribution alignment, temperature scaling is applied to smooth the logits, followed by a sigmoid function to produce temperature-scaled probabilities:

p_{s}^{(T)} = σ (\frac{z_{s}}{T}), p_{t}^{(T)} = σ (\frac{z_{t}}{T})

(13)

where T denotes the temperature coefficient, which controls the smoothness of the probability distribution. A larger T yields softer probabilities that approach a uniform distribution, thereby reducing inter-class sharpness.

Since the KL divergence requires complete probability distributions as input, the binary probabilities are explicitly expanded into two-class distributions:

q_{s}^{(T)} = [1 - p_{s}^{(T)}, p_{s}^{(T)}], q_{t}^{(T)} = [1 - p_{t}^{(T)}, p_{t}^{(T)}]

(14)

The supervised loss is defined as binary cross-entropy with respect to the ground-truth label y \in \{0, 1\}, which encourages the student model to assign samples to the correct class. In practice, BCEWithLogitsLoss is employed for numerical stability:

L_{\sup}^{bin} = BCEWithLogits (z_{s}, y)

(15)

The distillation loss constrains the student distribution to approximate the teacher distribution via the KL divergence, promoting smoother and more generalizable predictions. A scaling factor of T^{2} is applied to compensate for gradient magnitude attenuation caused by temperature scaling:

L_{KD}^{bin} = T^{2} D_{KL} (q_{t}^{(T)} ∥ q_{s}^{(T)})

(16)

The final objective for binary classification is formulated as a weighted sum of the supervised and distillation losses:

L^{bin} = (1 - α) L_{\sup}^{bin} + α L_{KD}^{bin}, α \in [0, 1]

(17)

where α denotes the distillation weight that balances hard-label supervision and soft-target alignment. A smaller α emphasizes ground-truth labels, whereas a larger α encourages closer adherence to the teacher’s predictions.
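The binary objective in Equations (13)–(17) can be traced for a single sample with the sketch below. It uses only the standard library; the default values T = 2.0 and α = 0.5 are placeholders chosen for illustration and are not the settings used in the experiments.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_with_logits(z, y):
    # Stable form of binary cross-entropy on a raw logit:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def binary_kd_loss(z_s, z_t, y, T=2.0, alpha=0.5):
    p_s, p_t = sigmoid(z_s / T), sigmoid(z_t / T)        # Eq. (13)
    q_s, q_t = [1.0 - p_s, p_s], [1.0 - p_t, p_t]        # Eq. (14)
    l_sup = bce_with_logits(z_s, y)                      # Eq. (15)
    l_kd = T * T * kl_divergence(q_t, q_s)               # Eq. (16)
    return (1.0 - alpha) * l_sup + alpha * l_kd          # Eq. (17)


# When student and teacher agree, the KD term vanishes and only the
# down-weighted supervised loss remains.
agree = binary_kd_loss(z_s=0.0, z_t=0.0, y=1)
disagree = binary_kd_loss(z_s=-1.0, z_t=3.0, y=1)
print(agree, disagree)
```

Setting `alpha=0.0` recovers plain binary cross-entropy, which is a convenient sanity check when wiring the loss into a training loop.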

3.5.2. Multi-Class Knowledge Distillation (Multi-Class KD)

For the multi-class task, the student and teacher models output logit vectors z_{s}, z_{t} \in R^{C}, where C denotes the number of classes. Temperature-scaled softmax is applied to obtain smoothed class probability distributions:

p_{s}^{(T)} = softmax (\frac{z_{s}}{T}), p_{t}^{(T)} = softmax (\frac{z_{t}}{T})

(18)

The supervised loss is defined using the standard cross-entropy loss with respect to the ground-truth label y:

L_{\sup}^{mul} = CE (z_{s}, y)

(19)

Similarly, the distillation loss measures the discrepancy between the student and teacher distributions via KL divergence, with a T^{2} scaling factor to compensate for the effect of temperature scaling on gradient magnitude:

L_{KD}^{mul} = T^{2} D_{KL} (p_{t}^{(T)} ∥ p_{s}^{(T)})

(20)

The overall loss for the multi-class task is as follows:

L^{mul} = (1 - α) L_{\sup}^{mul} + α L_{KD}^{mul}, α \in [0, 1]

(21)

By explicitly expanding the binary probability into a two-class distribution [1 - p, p], binary and multi-class distillation are unified into a common KL-divergence-based alignment framework, where the student distribution is aligned with the teacher distribution. This unified formulation enables shared distillation hyperparameters (α, T) and evaluation protocols across different stages of the two-stage task, reducing comparison bias introduced by task-specific distillation designs and improving the interpretability of ablation analyses.

From an implementation perspective, knowledge distillation introduces only an additional forward pass of the teacher model during training and does not alter the original inference pipeline. During deployment, only the student model is retained for inference, ensuring no additional computational overhead at test time.
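The multi-class objective of Equations (18)–(21), and the reason the binary case fits the same framework, can be checked numerically. The sketch below is illustrative only: T = 2.0 and α = 0.5 are placeholder values, and `two_class_from_logit` is a helper named here for the demonstration. It shows that the expanded distribution [1 - p, p] with p = σ(z/T) coincides with a temperature-scaled softmax over the logits [0, z].

```python
import math


def softmax(z, T=1.0):
    # Temperature-scaled softmax with max-subtraction for stability.
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(z, y):
    # log-sum-exp(z) - z[y], computed on the unscaled logits.
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]


def multiclass_kd_loss(z_s, z_t, y, T=2.0, alpha=0.5):
    p_s, p_t = softmax(z_s, T), softmax(z_t, T)          # Eq. (18)
    l_sup = cross_entropy(z_s, y)                        # Eq. (19)
    l_kd = T * T * kl_divergence(p_t, p_s)               # Eq. (20)
    return (1.0 - alpha) * l_sup + alpha * l_kd          # Eq. (21)


def two_class_from_logit(z, T):
    # Binary expansion [1 - p, p] with p = sigmoid(z / T), as in Eq. (14).
    p = 1.0 / (1.0 + math.exp(-z / T))
    return [1.0 - p, p]


loss = multiclass_kd_loss([0.0, 0.0], [0.0, 0.0], y=0)
expanded = two_class_from_logit(1.3, T=2.0)
via_softmax = softmax([0.0, 1.3], T=2.0)
print(loss, expanded, via_softmax)
```

Because the two constructions agree, the same `kl_divergence` routine and the same (α, T) pair can serve both stages.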

4. Experiments and Evaluation

This section evaluates the proposed coarse-to-fine intelligent inspection framework from six complementary perspectives. First, the dataset preparation and quality-control process are reported. Second, the visual recognition module is evaluated on binary hazard screening and multi-class hazard-type recognition using an independent test set. Third, component-level ablation analysis is conducted to isolate the contributions of the SE-based classification head and the knowledge distillation strategy. Fourth, the Agentic RAG compliance layer is evaluated using system-level metrics. Fifth, comparative experiments are conducted against representative CNN-based, supervised Transformer-based, and self-supervised Transformer-based baselines under a unified small-sample evaluation protocol. Finally, prototype-level engineering case studies are used to illustrate the end-to-end behavior and practical limitations of the proposed framework.

Since building fire safety inspection is a safety-critical task, the evaluation does not rely on aggregate accuracy alone. Instead, special attention is given to class-level behavior, false-positive and false-negative risks, category-level confusion, baseline comparison fairness, parameter efficiency, confidence-aware review decisions, and the faithfulness of regulation-grounded outputs. The current experiments are intended to provide an offline and prototype-level evaluation of the framework rather than evidence of full deployment readiness.

4.1. Dataset Preparation and Data Quality Control

The fire hazard image dataset used in this study was collected from real-world building fire safety inspection scenarios. To improve the transparency of the dataset scale and representativeness, this section reports the data sources, scene coverage, annotation procedure, class distribution, and split protocol in detail. The images cover typical inspection scenes, including electrical equipment areas, fire-lane environments, and general non-hazard building scenes. They include both indoor and outdoor inspection conditions, with visual variations in illumination, viewpoint, background clutter, object scale, and partial occlusion. To protect project confidentiality and personal privacy, all images were anonymized before annotation, and no personal identities, project names, or sensitive location information were retained in the experimental records. Representative examples from the dataset are shown in Figure 4.

The complete dataset preparation workflow is illustrated in Figure 5. The workflow consists of raw data collection, format standardization, data cleaning, label standardization, data augmentation, and dataset splitting. This procedure was designed to ensure that the experimental data were traceable, consistently labeled, and suitable for small-sample model training and evaluation.

To improve data quality, a multi-step cleaning pipeline was performed before model training. First, low-resolution images smaller than 256 px were removed to ensure sufficient visual detail for feature learning. Second, exact duplicates were eliminated using SHA-256 hashing, thereby reducing sample redundancy. Third, perceptual hashing was applied to identify and remove highly similar near-duplicate images, which helped mitigate internal correlation within the dataset. Finally, the variance of the Laplacian was used as a blur indicator, and low-information blurry samples were discarded.
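Two of these cleaning steps can be sketched with the standard library alone. The example below removes byte-identical files via SHA-256 and computes the variance of a 4-neighbour Laplacian on a grayscale grid as a blur score. It is a simplified illustration: images are represented as raw bytes and nested lists rather than decoded files, the resolution filter and perceptual-hash step are omitted because they require an image library, and no blur threshold is shown because the text does not state one.

```python
import hashlib


def drop_exact_duplicates(files):
    """files: iterable of (name, raw_bytes). Keeps the first copy of each file."""
    seen, kept = set(), []
    for name, data in files:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    gray is a 2D nested list of intensities; low values suggest blur.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


kept = drop_exact_duplicates([("a.jpg", b"x"), ("b.jpg", b"x"), ("c.jpg", b"y")])
flat = [[7] * 4 for _ in range(4)]                                  # no edges
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]
print(kept, laplacian_variance(flat), laplacian_variance(checker))
```

A featureless patch scores zero while a high-contrast pattern scores high, which is the property that makes the measure usable as a blur indicator.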

All images were annotated according to a predefined hazard taxonomy. The initial labels were assigned by a trained annotator based on visible fire safety hazard cues. The annotations were then reviewed by a domain expert with experience in building fire safety inspection. Ambiguous samples, such as images with weak hazard cues, partial occlusion, or uncertain hazard boundaries, were further checked during the review process. Only samples with confirmed labels were retained for model training and evaluation.

After the annotation process, data augmentation was applied to improve robustness under practical inspection conditions. The augmentation operations included random cropping, color perturbation, geometric transformation, and random erasing, which were used to simulate viewpoint variation, illumination fluctuation, and partial occlusion frequently encountered in real fire safety inspection scenes. These operations increased dataset diversity and reduced the risk of overfitting during training.

After the above procedures were completed, the entire dataset was partitioned into three subsets using stratified sampling: a training set, a validation set, and an independent test set. This splitting procedure was performed separately for the binary hazard-screening task and the multi-class hazard recognition task to ensure that class distributions were preserved consistently across all subsets. The training set was used for model optimization, the validation set was used for early stopping and hyperparameter selection, and the independent test set was used exclusively for final performance reporting. No test samples were used at any stage of model training, model selection, or threshold determination.
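A stratified split with a fixed seed can be sketched as follows. The split ratios (0.70 / 0.15 / 0.15) and the seed value are assumptions made for this example; the actual split sizes are those reported in Table 1.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions kept."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)          # fixed seed makes the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):     # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Class counts of the binary task as reported in the text.
labels = ["Hazard"] * 111 + ["No_Hazard"] * 51
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately is what keeps the Hazard to No_Hazard ratio roughly constant across the three subsets.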

The dataset was organized for two related tasks. For the binary hazard-screening task, the dataset contained 162 images in total, including 111 Hazard images and 51 No_Hazard images. For the multi-class hazard recognition task, 127 hazard images were used, including 63 Confused_Wiring images and 64 Fire_Lane_Blocked images.

The class distribution indicates that the binary hazard-screening task is moderately imbalanced, with Hazard samples accounting for 68.52% of the binary dataset and No_Hazard samples accounting for 31.48%. In contrast, the multi-class hazard recognition task is nearly balanced, with Confused_Wiring and Fire_Lane_Blocked accounting for 49.61% and 50.39% of the multi-class dataset, respectively. The detailed dataset composition and split statistics are summarized in Table 1. Although the dataset comprises authentic inspection images covering a representative range of hazardous and non-hazardous cases, its scale remains limited. Consequently, the dataset is intended to support small-sample experimental evaluation rather than to provide comprehensive coverage of all building fire safety scenarios. This limitation will be addressed in future work.

4.2. Implementation Details and Evaluation Protocol

4.2.1. Experimental Environment

All experiments were conducted on a Windows platform using PyTorch 2.8.0 and the timm library version 1.0.19. The visual backbone of the proposed framework is DINOv2 ViT-S/14, initialized with official pretrained weights. To ensure reproducibility, all experiments used a fixed random seed and fixed train/validation/test splits throughout model training, model selection, and final evaluation.

4.2.2. Input Processing and Data Augmentation

For main-model training, the input images are randomly cropped and resized to 518 × 518 with bicubic interpolation. The training pipeline includes the following augmentations:

Random Resize Crop: Each input image is randomly cropped and resized to a resolution of 518 × 518. The scale range of the cropped region is set to (0.75, 1.0), with an aspect ratio range of (0.9, 1.1). Bicubic interpolation is used for resampling.

Random Horizontal Flip: Images are horizontally flipped with a probability of 0.5.

Random Rotation: Images are randomly rotated within a range of 0° to 15°.

Color Jitter: Random perturbations with a magnitude of ±0.2 are applied independently to brightness, contrast, and saturation.

ToTensor: Images are converted into tensor format to be compatible with subsequent model inputs.

Random Erasing: With a probability of 0.2, random erasing is applied to the image. The erased region occupies a scale range of (0.02, 0.10) of the image area, with an aspect ratio range of (0.3, 3.3).

For validation and test sets, a deterministic preprocessing pipeline (Resize → ToTensor → Normalize) was used without stochastic augmentation, ensuring that evaluation results were reproducible and comparable across models.

4.2.3. Model Construction

The proposed model is built upon the DINOv2 ViT-S/14 backbone and instantiated through timm.create_model. Instead of a standard linear classification layer, we adopt the proposed lightweight channel-reweighting head for both binary hazard screening and multi-class hazard recognition. During weight loading, non-strict parameter matching is used to accommodate the task-specific heads. To avoid additional stochastic effects during fine-tuning, the dropout rate and stochastic depth rate are both set to 0.0.

4.2.4. Evaluation Metrics

Model performance is evaluated using the following metrics:

Accuracy: The proportion of correctly predicted samples among all samples, reflecting the overall classification performance.

Classification Report: Precision, Recall, and F1-score are reported for each class, providing a fine-grained evaluation of the model’s recognition ability across different categories. When the model produces no predictions for a certain class in a given evaluation round, resulting in undefined evaluation metrics, the corresponding metrics for that class are set to zero to avoid warnings during the evaluation process and to ensure stable result analysis.

Confusion Matrix: The confusion matrix summarizes the counts of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), offering an intuitive view of misclassification patterns. It is further used to analyze the sources of false negatives (Hazard → No Hazard) and false positives (No Hazard → Hazard).
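The per-class metrics and the zero-fallback rule for undefined values can be sketched as below. The function name and the toy labels are illustrative; the example deliberately includes a class that is never predicted so that the fallback is exercised.

```python
def per_class_report(y_true, y_pred, classes):
    """Return (accuracy, {class: {"precision", "recall", "f1"}})."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, report


# Toy case: the model never predicts No_Hazard.
y_true = ["Hazard", "Hazard", "No_Hazard"]
y_pred = ["Hazard", "Hazard", "Hazard"]
accuracy, report = per_class_report(y_true, y_pred, ["Hazard", "No_Hazard"])
print(accuracy, report)
```

In this toy case the No_Hazard precision has a zero denominator and is therefore reported as 0.0 instead of raising an error.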

4.3. Task-Specific Settings

The proposed framework contains two coupled classification tasks: binary hazard screening and multi-class hazard-type recognition. Although both tasks share the same DINOv2 backbone and the same two-stage fine-tuning paradigm, they differ in output structure, supervision loss, and evaluation emphasis.

For the binary task, the model outputs a single scalar logit corresponding to hazard presence and is trained using BCEWithLogitsLoss. When class imbalance is present, a pos_weight term can be introduced to place greater emphasis on positive hazard samples. For the multi-class task, the model outputs a C-dimensional logits vector over hazard categories and is trained with cross-entropy loss.

The two tasks also differ in the implementation of knowledge distillation. In the binary setting, the sigmoid probability must be explicitly expanded into a two-class distribution before KL-based alignment with the teacher output can be performed. In the multi-class setting, the softmax category distribution can be aligned directly. Despite these implementation differences, both tasks are unified under the same teacher-to-student soft-distribution alignment principle and share the same high-level distillation schedule. A summary of the differences between the two tasks is provided in Table 2.

4.4. Results on Binary Hazard Screening

The binary classifier serves as the first-stage screening module of the proposed framework. Its main role is to determine whether an image contains a fire safety hazard and to provide a reliable routing signal for the downstream hazard-type classifier and compliance layer.

Table 3 reports the performance of the binary hazard-screening model on the independent test set. The model correctly classified 24 out of 25 samples, achieving an overall accuracy of 0.960. For the Hazard class, the precision reaches 1.000 and the recall is 0.941, while the No_Hazard class obtains a recall of 1.000. These results indicate that the binary screening module can separate most hazardous and non-hazardous samples under the current test split.

More importantly, the error pattern is more informative than the aggregate accuracy. The only misclassification is a false negative, where one true Hazard sample is predicted as No_Hazard, corresponding to a hazard false-negative rate of 5.88% in the test set. In a fire safety inspection, such an error is more critical than a false positive because it may cause an actual hazard to be missed, whereas a false positive mainly increases manual review workload. This asymmetric risk supports the use of confidence-aware routing in the proposed framework: uncertain cases should be directed to the Review state rather than forced into a hard negative decision.

The remaining false-negative case also reflects the visual difficulty of real inspection scenes. Weak hazard cues, partial occlusion, low contrast, and complex backgrounds may reduce the reliability of purely visual judgment. Although data augmentation was used to simulate illumination, viewpoint, and partial-occlusion variations, this result suggests that visual ambiguity cannot be fully eliminated by augmentation alone. Therefore, the binary module should be understood as a screening component within a human-in-the-loop inspection workflow rather than a fully autonomous replacement for expert judgment.

4.5. Results on Multi-Class Hazard-Type Recognition

The multi-class model is triggered for samples that require finer semantic interpretation and is responsible for distinguishing specific hazard categories. Compared with binary screening, this task is more demanding because it requires fine-grained discrimination among visually similar hazard patterns.

As shown in Table 4, the proposed model achieves an overall accuracy of 0.947 on the independent test set, with 18 out of 19 samples correctly classified. The two hazard categories show relatively balanced F1-scores: 0.941 for Confused_Wiring and 0.952 for Fire_Lane_Blocked. This indicates that the multi-class recognition module can provide stable category-level discrimination after binary screening under the current controlled split.

The class-level results reveal a more nuanced behavior. For Confused_Wiring, the precision is 1.000, meaning that all samples predicted as confused wiring are correct; however, its recall is 0.889, indicating that one true confused-wiring sample is misclassified as Fire_Lane_Blocked. In contrast, Fire_Lane_Blocked achieves a recall of 1.000 but a lower precision of 0.909 due to this single cross-category error. This pattern suggests that the model is more conservative when assigning the Confused_Wiring label and that local wiring cues may be partially confused with broader scene-level obstruction patterns when the hazard evidence is incomplete or visually cluttered.

Therefore, the multi-class results support the feasibility of fine-grained hazard interpretation after coarse screening, but they should not be overinterpreted as proof of broad generalization. Given the limited test-set size, larger multi-site datasets with more diverse hazard categories are still required to evaluate cross-scene robustness and minority-class behavior more reliably.

4.6. Ablation Analysis

After evaluating the proposed framework on the binary hazard-screening and multi-class hazard recognition tasks, we further conducted an ablation study to clarify the contribution of the key components in the visual recognition module. This analysis aims to answer whether the performance improvement mainly comes from the self-supervised DINOv2 backbone itself or from the proposed task-specific adaptation components. Unlike the baseline comparison in Section 4.9, which compares the proposed method with external model families, this ablation study focuses on the internal contribution of the proposed visual adaptation design.

Specifically, we compared three variants under the same train/validation/test protocol: (1) DINOv2 with a standard linear classification head, which serves as the backbone-only variant; (2) DINOv2 with the proposed SEHead, which evaluates the effect of channel-wise feature recalibration; and (3) the full model with SEHead and knowledge distillation (KD), which evaluates the additional effect of distillation-enhanced adaptation. All variants were trained on the same training set, selected using the validation set, and finally evaluated on the same independent test set.

As shown in Table 5, the ablation results demonstrate the incremental contribution of each component. The DINOv2 backbone with a standard linear head provides the backbone-only baseline, achieving 0.720 binary Accuracy and 0.713 binary Macro-F1. After replacing the linear head with the proposed SEHead, the binary Accuracy increases from 0.720 to 0.880, and the binary Macro-F1 increases from 0.713 to 0.873, an absolute improvement of 0.160 in each metric. The multi-class Accuracy also improves from 0.842 to 0.895, while the multi-class Macro-F1 increases from 0.840 to 0.894. These improvements indicate that SEHead effectively enhances task-relevant discriminative channels, which is particularly useful for fire hazard images with complex backgrounds, local hazard cues, and large variations in object scale.

With the addition of knowledge distillation (KD), the full model further improves the binary Accuracy to 0.960 and the binary Macro-F1 to 0.956. The multi-class Accuracy and Macro-F1 also increase to 0.947 and 0.947, respectively. This suggests that KD provides additional regularization under small-sample conditions by smoothing the decision boundary and transferring soft prediction information from the teacher model. Therefore, the performance gain of the full model is not solely attributable to the DINOv2 backbone, but also benefits from the proposed task-specific feature recalibration and distillation-enhanced adaptation strategy.

The two components play complementary roles. SEHead mainly improves feature discrimination by recalibrating channel-wise responses derived from global mean and max statistics, whereas KD improves training stability and reduces overfitting by providing soft-target supervision. From the linear-head baseline to the full model, binary Macro-F1 increases by 0.243 and multi-class Macro-F1 increases by 0.107. These results suggest that the proposed adaptation strategy substantially strengthens the self-supervised backbone for small-sample fire hazard recognition.

In terms of computational trade-off, SEHead introduces only a small number of additional parameters in the classification head and does not modify the DINOv2 backbone. KD increases training-time computation because a teacher model is required during training, but it does not introduce extra inference-time parameters because only the student model is retained for deployment. Therefore, the final model achieves improved recognition performance and stability with limited additional inference cost, which is important for practical fire safety inspection scenarios.

4.7. System-Level Evaluation of the Agentic RAG Compliance Layer

The proposed framework contains two coupled but functionally different modules. The DINOv2-based visual module is responsible for hazard screening and hazard-type recognition, whereas the Agentic RAG module is responsible for transforming visual recognition outputs into regulation-grounded compliance reports. Therefore, the Agentic RAG module is not evaluated by visual classification accuracy alone. Instead, it is evaluated using system-level compliance metrics, including hazard-reporting F1-score, clause-level provenance accuracy, and hallucination or unverifiable-output rate.

In the complete pipeline, the Agentic RAG layer is activated only when the visual module predicts a sample as Hazard or Review. The predicted hazard category and confidence score are used as structured inputs for candidate generation and targeted retrieval. The retrieved evidence is then verified through clause-level consistency checks before being included in the final report.

Provenance accuracy measures whether the system links a correctly recognized hazard to the correct regulatory clause. The hallucination rate measures the proportion of outputs containing unsupported, mis-cited, or unverifiable regulatory statements, including false triggering on irrelevant scenes.
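One way to compute these two metrics from per-sample evaluation records is sketched below. The record fields (`hazard_correct`, `clause_correct`, `unsupported`) are hypothetical names introduced for this example, and the choice of denominators follows a literal reading of the definitions above rather than a confirmed implementation detail.

```python
def compliance_metrics(records):
    """Return (provenance_accuracy, hallucination_rate).

    Each record is a dict with three boolean fields (hypothetical names):
      hazard_correct - the hazard was recognized correctly
      clause_correct - the cited clause is the correct one
      unsupported    - the output contains an unsupported, mis-cited,
                       or unverifiable regulatory statement
    """
    recognized = [r for r in records if r["hazard_correct"]]
    # Provenance accuracy is measured only over correctly recognized hazards.
    provenance = (sum(r["clause_correct"] for r in recognized) / len(recognized)
                  if recognized else 0.0)
    # Hallucination rate is measured over all outputs.
    hallucination = (sum(r["unsupported"] for r in records) / len(records)
                     if records else 0.0)
    return provenance, hallucination


# Toy records for illustration only; not experimental data.
toy = [
    {"hazard_correct": True, "clause_correct": True, "unsupported": False},
    {"hazard_correct": True, "clause_correct": False, "unsupported": True},
    {"hazard_correct": False, "clause_correct": False, "unsupported": False},
    {"hazard_correct": True, "clause_correct": True, "unsupported": False},
]
provenance, hallucination = compliance_metrics(toy)
print(provenance, hallucination)
```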

As shown in Table 6, the full Agentic RAG workflow achieves an F1-score of 0.88, a clause-level provenance accuracy of 94.1%, and a hallucination/unverifiable-output rate of 1.8%. Compared with the Zero-shot VLM baseline, the proposed workflow improves the F1-score from 0.61 to 0.88 and reduces the hallucination rate from 28.6% to 1.8%. Compared with Standard RAG, the proposed method improves provenance accuracy from 68.9% to 94.1%, showing that simple retrieval alone is insufficient for reliable compliance reporting.

The ablation variants further clarify the contribution of the Agentic design. Removing the verification stage decreases provenance accuracy from 94.1% to 75.4% and increases the hallucination rate from 1.8% to 8.5%, indicating that multimodal verification and textual consistency checking are critical for suppressing unverifiable regulatory outputs. Removing the scene gate increases the hallucination rate from 1.8% to 11.2%, suggesting that early filtering of irrelevant images is important for avoiding false triggering in open inspection environments.

It should be noted that this evaluation focuses on the downstream compliance layer rather than the visual backbone itself. The DINOv2-based two-stage visual model provides structured hazard candidates and confidence-aware routing decisions, while the Agentic RAG module evaluates whether these visual outputs can be converted into faithful and traceable regulatory evidence. Therefore, the visual recognition experiments and the Agentic RAG evaluation together assess the complete pipeline from hazard perception to regulation-grounded reporting.

4.8. Prototype-Level Engineering Case Study and Practical Limitations

To illustrate the end-to-end behavior of the proposed framework, we conducted a prototype-level engineering case study using previously unseen real-world inspection images that were not used in training, validation, or model selection. These examples include both hazard and non-hazard cases and cover visually diverse conditions, such as different indoor/outdoor scenes, lighting conditions, and background complexity. It should be emphasized that this section is intended to demonstrate workflow feasibility and qualitative system behavior, rather than to provide long-term field deployment validation.

Figure 6 shows representative qualitative results of the two-stage visual recognition module. During inference, the binary classifier first estimates the hazard probability p (hazard). Based on two predefined thresholds, T_{low} = 0.35 and T_{high} = 0.80, each image is assigned to one of three states: No_Hazard, Review, or Hazard. These thresholds were selected on the validation set and fixed before testing, with the purpose of reducing missed hazards while routing uncertain cases to manual review. Only samples predicted as Hazard or Review are further processed by the multi-class recognition head for hazard-type refinement.

The qualitative results show a clear separation in binary probabilities for the illustrated samples. Clean non-hazard images produce low hazard probabilities, such as p (hazard) = 0.019, 0.028, and 0.080, while clear hazard images produce high probabilities, such as p (hazard) = 0.864, 0.948, 0.967, and 0.986. In addition, one borderline sample with p (hazard) = 0.795 is assigned to the Review state rather than directly treated as either positive or negative. This behavior suggests that the proposed framework can provide confidence-aware routing for ambiguous inspection cases, instead of forcing every input into a hard decision.
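The three-state routing rule can be written as a short function. The thresholds are the values stated above; how the exact boundary values are handled (a probability equal to a threshold) is an assumption made here, since the text does not specify it.

```python
T_LOW, T_HIGH = 0.35, 0.80  # thresholds reported in the text


def route(p_hazard, t_low=T_LOW, t_high=T_HIGH):
    """Map a hazard probability to No_Hazard, Review, or Hazard."""
    if p_hazard >= t_high:       # boundary handling is an assumption
        return "Hazard"
    if p_hazard < t_low:
        return "No_Hazard"
    return "Review"


def needs_refinement(state):
    # Only Hazard and Review samples are passed to the multi-class head.
    return state in ("Hazard", "Review")


# Probabilities quoted from the illustrated samples.
states = {p: route(p) for p in (0.019, 0.080, 0.795, 0.864, 0.986)}
print(states)
```

The borderline sample at 0.795 falls between the two thresholds and is therefore routed to Review, matching the behavior described above.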

For Hazard/Review samples, the multi-class head produces explicit hazard categories with confidence scores ranging from approximately 0.645 to 0.872 in the illustrated cases. These outputs are consistent with the operational logic of a coarse-to-fine inspection workflow, in which hazard screening is followed by finer hazard-type interpretation. However, these examples are qualitative demonstrations and should not be interpreted as systematic robustness evaluation under domain shift, severe noise, or unseen hazard categories.

To further demonstrate the implemented engineering prototype, Figure 7 presents two interface examples of the fire-safety inspection assistant. The prototype was developed for Chinese building fire-safety inspection scenarios and regulation-based compliance checking; therefore, the interface is shown in Chinese. The left panel accepts inspection images as input, while the right panel presents a structured analysis report. In particular, the “Analysis Report” section provides the intermediate inspection results generated by the system, including the predicted risk level, candidate hazard points, brief hazard descriptions, supporting reference images, and initially matched regulatory evidence. The “Final Conclusion” section further consolidates these intermediate findings into a structured compliance-oriented summary. It reports the number and type of identified hazards, explains the corresponding fire-safety concerns in natural language, and presents the verified regulatory clauses used to support the final compliance judgment. These examples are intended to demonstrate the implemented end-to-end workflow from image input to regulation-grounded reporting, rather than to provide evidence of long-term field deployment validation.

It should be noted that the visual recognition experiments and the implemented Agentic RAG prototypes are evaluated at different levels of the overall framework. The DINOv2-based experiments in the preceding sections focus on the visual recognition module and evaluate representative hazard categories under a controlled small-sample setting. By contrast, the implemented Agentic RAG assistant is designed with a broader fire safety hazard taxonomy and is used here to demonstrate the downstream compliance-reporting workflow. Therefore, the interface examples should be interpreted as a prototype-level demonstration of the regulation-grounded reporting layer and workflow extensibility, rather than as additional quantitative evidence for the same visual classification categories.

In the illustrated system outputs, the compliance layer does not freely generate regulatory citations. Instead, it retrieves evidence tuples from the structured regulation knowledge base and enforces deterministic consistency checks before output. When retrieval is insufficient or verification fails, the framework falls back to Review or hazard-only output, thereby reducing the risk of mis-citations and fabricated regulatory content.

Overall, these prototype-level examples demonstrate that the proposed framework can support an end-to-end workflow from image input to hazard screening, hazard-type refinement, and regulation-grounded reporting. Nevertheless, the current evidence should be interpreted as qualitative evidence of prototype-level feasibility rather than proof of deployment readiness. Long-term field validation in operational inspection workflows has not yet been conducted. In addition, robustness under cross-site domain shift, severe noise, extreme low-light conditions, and unseen hazard categories remains to be systematically evaluated. Future work will therefore focus on multi-site field testing, domain-shift evaluation, noisy-condition robustness analysis, unseen-scenario testing, and runtime evaluation under realistic operational constraints.

4.9. Comparative Evaluation Against Baseline Models

To improve the fairness and interpretability of the baseline comparison, we selected representative models from different architecture families and learning paradigms. ResNet-50 and EfficientNet-B0 were included as CNN baselines, the latter as a lightweight variant; ViT-B/16 and DeiT-Small were used as supervised Transformer baselines; and MAE-ViT-B/16 was added as a self-supervised Transformer baseline. This design allows us to compare convolutional models, supervised Transformer models, and self-supervised Transformer models under the same small-sample fire hazard recognition setting.

We acknowledge that pretrained and self-supervised models may be naturally advantageous in small-data regimes. Therefore, the comparative results should be interpreted within the intended limited-data fire safety inspection setting rather than as a universal claim across all dataset scales. To further distinguish the contribution of the DINOv2 backbone from the proposed task-specific components, a component-level ablation analysis is provided in Section 4.6.

4.9.1. Fair Comparison Protocol

All models were trained, validated, and tested on the same cleaned dataset using task-consistent supervision, fixed random seeds, and aligned training configurations. The training set was used for parameter optimization, the validation set was used for early stopping and hyperparameter selection, and the independent test set was used exclusively for final performance reporting. To ensure fairness, all compared models adopted the same input resolution of $224 \times 224$, data augmentation strategy, optimizer type, and evaluation metrics. The final comparison was conducted on the independent test set. Therefore, the comparative results should not be directly compared with the engineering case study examples that use the $518 \times 518$ setting. The main training configurations of the baseline models are summarized in Table 7. For completeness, Table 8 summarizes the main architectural characteristics of these baselines.
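The selection rule in this protocol (early stopping on validation macro-F1) can be illustrated with a short sketch. The numeric settings are the ones listed in Table 7 and the $224 \times 224$ input size stated above; the `TrainConfig` and `select_checkpoint` names are hypothetical, and interpreting "patience = 5" as five consecutive epochs without improvement is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Settings as listed in Table 7 and the protocol text.
    batch_size: int = 32
    max_epochs: int = 60
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 0.05
    patience: int = 5      # assumed: epochs without validation improvement
    image_size: int = 224


def select_checkpoint(
    val_macro_f1: Sequence[float], cfg: TrainConfig = TrainConfig()
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_score, last_epoch_run) for a per-epoch
    history of validation macro-F1 scores (epochs are 0-indexed)."""
    best_epoch, best_score, waited, last_epoch = -1, float("-inf"), 0, -1
    for epoch, score in enumerate(val_macro_f1[: cfg.max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= cfg.patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch, best_score, last_epoch
```

Only validation scores enter the selection, so the independent test set stays untouched until the final report, as the protocol requires.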

4.9.2. Quantitative Comparison

The overall quantitative comparison is reported in Table 9 and visualized in Figures 8 and 9. The proposed DINOv2-ViT-S/14 framework achieves the best accuracy and Macro-F1 on both the binary and multi-class tasks while using 21 M parameters, the second-smallest parameter count among the compared models.

The comparison reveals three main observations. First, Transformer-based models generally outperform CNN baselines, suggesting that global context modeling is beneficial for visually complex fire hazard scenes. As shown in Table 9, ResNet-50 and EfficientNet-B0 obtain lower performance on both binary and multi-class tasks, with multi-class Macro-F1 scores of 0.675 and 0.700, respectively. This indicates that conventional CNN baselines may be less effective in capturing long-range contextual cues and dispersed hazard evidence in cluttered inspection images.

Second, supervised Transformer baselines improve over CNN-based models but remain less stable than the proposed framework under the small-sample setting. ViT-B/16 achieves a binary Macro-F1 of 0.872 and a multi-class Macro-F1 of 0.832, while DeiT-Small obtains 0.724 and 0.783, respectively. These results suggest that the Transformer architecture alone is not sufficient; stable adaptation and transferable pretraining are both important for limited-data fire hazard recognition.

Third, the proposed DINOv2-based framework achieves the best overall performance, with 0.960 binary accuracy, 0.956 binary Macro-F1, 0.947 multi-class accuracy, and 0.947 multi-class Macro-F1. Compared with MAE-ViT-B/16, the proposed method improves multi-class Macro-F1 from 0.894 to 0.947 while using only 21 M parameters rather than 86 M. This comparison suggests that the observed improvement is not merely due to model scale, but to the combination of transferable self-supervised representations, task-specific feature recalibration, and distillation-enhanced adaptation.
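The figures quoted in these three observations can be re-derived directly from Table 9. The snippet below is only a bookkeeping sketch: the rows are copied from Table 9, while the helper names are hypothetical.

```python
# Rows copied from Table 9:
# (binary acc., binary Macro-F1, multi acc., multi Macro-F1, params in millions)
TABLE_9 = {
    "ResNet-50": (0.680, 0.643, 0.684, 0.675, 25),
    "EfficientNet-B0": (0.720, 0.715, 0.737, 0.700, 5),
    "ViT-B/16": (0.800, 0.872, 0.842, 0.832, 86),
    "DeiT-Small": (0.760, 0.724, 0.789, 0.783, 22),
    "MAE-ViT-B/16": (0.920, 0.899, 0.895, 0.894, 86),
    "DINOv2-ViT-S/14 (Ours)": (0.960, 0.956, 0.947, 0.947, 21),
}
METRICS = ("binary_acc", "binary_macro_f1", "multi_acc", "multi_macro_f1")


def best_model_per_metric(table):
    """Map each metric name to the model with the highest value."""
    return {
        name: max(table, key=lambda model: table[model][i])
        for i, name in enumerate(METRICS)
    }


def compare(table, ours, baseline):
    """Return (multi-class Macro-F1 gain, parameter ratio ours/baseline)."""
    gain = table[ours][3] - table[baseline][3]
    ratio = table[ours][4] / table[baseline][4]
    return round(gain, 3), round(ratio, 3)


best = best_model_per_metric(TABLE_9)
gain, ratio = compare(TABLE_9, "DINOv2-ViT-S/14 (Ours)", "MAE-ViT-B/16")
print(best)
print(f"Macro-F1 gain: {gain}, parameter ratio: {ratio}")
```

With these inputs the multi-class Macro-F1 gain over MAE-ViT-B/16 is 0.053 at roughly a quarter of the parameter count (21 M versus 86 M).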

4.10. Comparison with CLIP/VLM-Based Inspection Pipelines

As indicated by the trends reviewed in Section 2, intelligent fire safety inspection is increasingly evolving from a closed-set, vision-centric approach toward multimodal and language-assisted inspection. Against this backdrop, CLIP/VLM-based methods have emerged as a significant direction of development, owing to their inherent capabilities in providing open-vocabulary recognition, prompt-based category expansion, and natural language interaction.

However, the primary objective of this study is not open-ended image description, but fire safety hazard recognition and regulation-grounded acceptance support under a predefined hazard taxonomy. In this setting, the system must produce stable category-level decisions and traceable clause-level evidence. CLIP/VLM-based methods were therefore not selected as the primary recognition mechanism for three reasons. (1) Their prediction results may be sensitive to prompt templates and category descriptions, which is problematic for visually ambiguous fire hazard categories. (2) Open-ended VLM outputs are generally less deterministic than task-specific classifiers, making it difficult to guarantee repeatable inspection decisions. (3) Compliance reporting requires exact regulatory references, whereas end-to-end VLM generation may produce plausible but unsupported explanations or mis-cited clauses.

The framework proposed in this paper adopts a fundamentally different design philosophy. A detailed comparison between CLIP/VLM-based methods and the proposed framework is presented in Table 10. Although this design sacrifices part of the open-vocabulary flexibility of CLIP/VLM-based methods, it improves the controllability, reproducibility, and auditability of the system outputs, which are particularly critical for compliance-oriented fire safety inspection. Nevertheless, CLIP/VLM-based methods are not contradictory to the proposed framework. They can serve as complementary modules for open-vocabulary hazard discovery, semantic image description, prompt-based pre-screening, and human-in-the-loop report assistance. Future work may incorporate VLMs as auxiliary components while preserving the deterministic evidence-verification mechanism for final compliance conclusions.

4.11. Summary of Experimental Findings and Limitations

The experimental results provide three main implications. First, the independent test-set evaluation indicates that the proposed framework can perform stable hazard screening and hazard-type recognition under the current limited-data setting. The error analysis further shows that false negatives are more safety-critical than false positives, supporting confidence-aware review routing rather than purely hard classification. However, given the limited data scale and category coverage, these results should be interpreted as evidence of prototype-level feasibility under controlled conditions, rather than as sufficient proof of broad generalizability across all building fire safety scenarios.
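Confidence-aware review routing can be sketched as a three-way decision on the binary hazard probability, followed by hazard-type refinement for Hazard and Review cases. The thresholds `low=0.2` and `high=0.7` are hypothetical values chosen for illustration (the operating thresholds are not restated here); the asymmetric choice reflects the point that a missed hazard is costlier than a false alarm.

```python
from typing import Dict, Optional


def route_binary(p_hazard: float, low: float = 0.2, high: float = 0.7) -> str:
    """Map p(hazard) to No_Hazard, Review, or Hazard.

    `low` and `high` are illustrative thresholds, not values from the paper.
    A small `low` keeps borderline images out of the No_Hazard bucket.
    """
    if not 0.0 <= p_hazard <= 1.0:
        raise ValueError("p_hazard must lie in [0, 1]")
    if p_hazard >= high:
        return "Hazard"
    if p_hazard < low:
        return "No_Hazard"
    return "Review"


def coarse_to_fine(
    p_hazard: float,
    class_probs: Dict[str, float],
    low: float = 0.2,
    high: float = 0.7,
) -> Dict[str, Optional[object]]:
    """Run the multi-class refinement only for Hazard or Review cases."""
    decision = route_binary(p_hazard, low, high)
    if decision == "No_Hazard":
        return {"decision": decision, "hazard_type": None, "confidence": None}
    hazard_type = max(class_probs, key=class_probs.get)
    return {
        "decision": decision,
        "hazard_type": hazard_type,
        "confidence": class_probs[hazard_type],
    }
```

For example, `coarse_to_fine(0.45, {"Confused_Wiring": 0.8, "Fire_Lane_Blocked": 0.2})` yields a Review decision with Confused_Wiring as the candidate type, so an inspector sees the uncertain case instead of having it silently cleared.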

Second, the ablation and comparative experiments suggest that the observed improvement is not attributable to the DINOv2 backbone alone. Instead, the gain comes from the combined effect of transferable self-supervised representations, SE-based feature recalibration, and distillation-enhanced adaptation. This finding is important for small-sample engineering scenarios, where increasing model scale alone does not necessarily lead to better task alignment.

Third, the Agentic RAG evaluation shows that regulation-grounded reporting benefits from explicit retrieval and verification. In Table 6, the full pipeline reduces the hallucination or unverifiable-output rate to 1.8, compared with 8.5 without the verification step and 15.4 for standard RAG, which indicates that deterministic evidence checking is necessary for compliance-oriented inspection tasks. Nevertheless, the current evidence remains limited to offline experiments and prototype-level demonstrations. Larger external test sets, cross-site domain-shift evaluation, noisy-condition testing, runtime analysis, and field validation are still required before making strong claims about deployment readiness.

5. Conclusions

This study addresses two tightly coupled requirements in building fire safety acceptance: reliable hazard recognition under complex visual conditions and regulation-grounded compliance support for audit-ready reporting. To this end, we propose a coarse-to-fine intelligent inspection framework that integrates hierarchical visual recognition with a regulation-grounded compliance layer. Instead of treating fire hazard analysis as a single flat classification problem, the proposed framework decomposes the task into hazard presence screening and hazard-type refinement, thereby better matching practical inspection workflows in which rapid screening is followed by refined interpretation and review.

The experimental results demonstrate that the proposed framework provides a feasible solution for small-sample fire hazard recognition. Compared with representative CNN-based, supervised Transformer-based, and self-supervised Transformer-based baselines, the proposed method shows a more favorable balance between recognition performance, parameter efficiency, and task stability. The ablation analysis further confirms that the improvement is not solely due to the DINOv2 backbone, but also benefits from the proposed SE-based feature recalibration and knowledge distillation strategy. These findings suggest that effective adaptation of self-supervised representations is particularly important for engineering inspection tasks where annotated data are limited and visual evidence is often ambiguous.

Beyond visual recognition, the proposed framework also addresses the compliance-oriented nature of fire safety acceptance. By introducing retrieval, verification, and fallback mechanisms into the Agentic RAG layer, the system reduces the risk of unsupported or hallucinated regulatory citations and improves the traceability of generated inspection reports. This design extends the framework from image-level hazard recognition toward auditable decision support, which is essential for safety-critical engineering applications.

Nevertheless, the current study should be interpreted as an offline experimental evaluation and prototype-level demonstration rather than proof of deployment readiness. The dataset remains limited in scale and category coverage, and large-scale external validation has not yet been conducted. In addition, robustness under cross-site domain shift, severe noise, low-light conditions, heavy occlusion, and unseen hazard categories still requires systematic evaluation. Future work will focus on three directions. First, we will expand the dataset to include more diverse building types, inspection environments, and hazard categories. Second, we will construct larger external and multi-site test sets to evaluate cross-scene generalization and domain-shift robustness. Third, CLIP/VLM-based modules will be explored as complementary tools for open-vocabulary hazard discovery, semantic image description, and human-in-the-loop report assistance. These efforts will further strengthen the reliability, scalability, and practical applicability of intelligent fire safety acceptance systems.

Author Contributions

Conceptualization, S.Y., X.W. and G.Z.; Methodology, S.Y., Y.L., J.C., X.W. and L.W.; Validation, S.Y., J.C., X.W. and G.Z.; Formal analysis, S.Y., X.W. and L.W.; Investigation, C.Y., J.C. and X.W.; Resources, S.Y., X.W., L.W. and G.Z.; Data curation, S.Y., Y.L., C.Y. and J.C.; Writing—original draft, Y.L.; Writing—review & editing, S.Y., Y.L., C.Y., J.C., X.W. and L.W.; Visualization, Y.L. and J.C.; Supervision, X.W.; Project administration, S.Y., X.W. and G.Z.; Funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Nanjing Urban and Rural Construction Commission (KS2505).

Data Availability Statement

The data presented in this study are not available due to privacy restrictions.

Acknowledgments

This paper is supported in part by Nanjing Urban and Rural Construction Commission Project (KS2505).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Evolution of visual fire safety hazard detection technologies. The paradigm has evolved from hand-crafted features to CNN/Transformer-based end-to-end learning, and further to self-supervised foundation models with stronger generalization under limited labeled data. Recent trends increasingly emphasize multimodal integration and engineering deployment to support closed-loop inspection workflows.

Figure 2. Overview of the proposed coarse-to-fine intelligent inspection framework. A self-supervised vision Transformer performs hazard screening and hazard-type refinement, while the regulation-grounded Agentic RAG layer retrieves and verifies clause-level evidence for compliance-ready reporting.

Figure 3. SEHead: mean/max pooling and channel recalibration pipeline. Patch-token features are aggregated using global mean and max pooling, concatenated, recalibrated through channel-wise attention, and then passed to a linear classifier for prediction.

Figure 4. Representative examples from the fire hazard image dataset. Subfigures (a,b) show No_Hazard samples, subfigure (c) shows a Confused_Wiring sample, and subfigure (d) shows a Fire_Lane_Blocked sample.

Figure 5. Dataset preparation and data cleaning workflow.

Figure 6. Prototype-level qualitative examples of the two-stage visual recognition module. The binary stage estimates $p(\mathrm{hazard})$ and assigns images to No_Hazard, Review, or Hazard. Images predicted as Hazard or Review are further processed by the multi-class head to obtain the refined hazard category and confidence score. These examples illustrate the coarse-to-fine inference behavior, but do not constitute field deployment validation.

Figure 7. Prototype-level interface demonstration of the implemented fire safety inspection assistant. The interface is shown in Chinese because the prototype is designed for Chinese building fire-safety inspection and regulation-based compliance checking. The examples illustrate the implemented workflow from inspection-image input to structured hazard reporting and clause-level regulatory citation.

Figure 8. Performance comparison of different models on the independent test set.

Figure 9. Parameter efficiency comparison of different models on the independent test set.

Table 1. Dataset composition and split statistics after data cleaning.

Task	Class	Train	Validation	Test	Total	Percentage
Binary hazard screening	Hazard	77	17	17	111	68.52%
Binary hazard screening	No_Hazard	35	8	8	51	31.48%
Binary total	–	112	25	25	162	100.00%
Multi-class hazard recognition	Confused_Wiring	45	9	9	63	49.61%
Multi-class hazard recognition	Fire_Lane_Blocked	44	10	10	64	50.39%
Multi-class total	–	89	19	19	127	100.00%

Table 2. Comparison of the binary and multi-class tasks.

Comparison Dimension	Binary Classification	Multi-Class Classification
Recognition Granularity	Coarse-grained	Fine-grained
Task Objective	Hazard screening	Hazard subtype identification
Model Output Form	Single scalar logit	C-dimensional logits vector
Supervision Loss	BCEWithLogitsLoss	Cross-Entropy
Key Difference in Distillation Implementation	Expand `sigmoid` output into a two-class distribution for alignment	Directly align `softmax` class distributions
Focus of Evaluation Metrics	Missed hazards, false alarms, threshold sensitivity, recall stability	Category confusion, minority-class behavior, macro-level discrimination
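The distillation row of Table 2 can be made concrete with a small sketch. It assumes a Hinton-style loss, that is, a temperature-scaled KL divergence from teacher to student multiplied by $T^2$; the temperature value and the function names are hypothetical, and the exact loss used in the paper may differ. The point illustrated is only the difference between the two heads: the binary head expands its sigmoid output into a two-class distribution, while the multi-class head aligns softmax distributions directly.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def binary_kd_loss(student_logit: float, teacher_logit: float,
                   temperature: float = 2.0) -> float:
    """Expand each sigmoid output into [1 - p, p], then align the pair."""
    ps = sigmoid(student_logit / temperature)
    pt = sigmoid(teacher_logit / temperature)
    return temperature ** 2 * kl_divergence([1.0 - pt, pt], [1.0 - ps, ps])


def multiclass_kd_loss(student_logits: Sequence[float],
                       teacher_logits: Sequence[float],
                       temperature: float = 2.0) -> float:
    """Align temperature-softened softmax distributions directly."""
    pt = softmax(teacher_logits, temperature)
    ps = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(pt, ps)
```

A useful sanity check on the expansion is that `binary_kd_loss(z_s, z_t)` equals `multiclass_kd_loss([0.0, z_s], [0.0, z_t])` up to floating-point error, since a sigmoid over one logit is a two-way softmax with the other logit fixed at zero.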

Table 3. Performance of the binary hazard-screening model.

Class	Precision	Recall	F1-Score	Accuracy
Hazard	1.000	0.941	0.970	0.960
No_Hazard	0.889	1.000	0.941	0.960
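The entries of Table 3 can be reproduced from confusion counts. The counts below are inferred, not reported: with 17 Hazard and 8 No_Hazard test images (Table 1), a Hazard recall of 0.941 and precision of 1.000 correspond to 16 detected hazards, one missed hazard, and no false alarms. The helper name is hypothetical.

```python
from typing import Tuple


def per_class_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Inferred counts for the 25-image binary test split (17 Hazard, 8 No_Hazard).
hazard = per_class_metrics(tp=16, fp=0, fn=1)
# From the No_Hazard perspective, the single missed hazard is a false positive.
no_hazard = per_class_metrics(tp=8, fp=1, fn=0)
accuracy = (16 + 8) / 25

print([round(v, 3) for v in hazard])
print([round(v, 3) for v in no_hazard])
print(accuracy)
```

The single error in this reconstruction is a missed hazard rather than a false alarm, which is the safety-critical direction on a test split this small.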

Table 4. Performance of the multi-class hazard recognition model.

Category	Precision	Recall	F1-Score	Accuracy
Confused_Wiring	1.000	0.889	0.941	0.947
Fire_Lane_Blocked	0.909	1.000	0.952	0.947

Table 5. Ablation analysis of the proposed components on the independent test set.

Variant	Binary Acc.	Binary Macro-F1	Multi Acc.	Multi Macro-F1
DINOv2 + Linear Head	0.720	0.713	0.842	0.840
DINOv2 + SEHead	0.880	0.873	0.895	0.894
Full model	0.960	0.956	0.947	0.947

Table 6. Quantitative comparison of baseline methods and ablated variants for compliance reporting.

Method	Precision	Recall	F1	Provenance Acc. (%)	Hallucination Rate (%)
Zero-shot VLM	0.58	0.65	0.61	35.2	28.6
Standard RAG	0.72	0.70	0.71	68.9	15.4
Ours w/o Scene Gate	0.91	0.86	0.88	92.5	11.2
Ours w/o Verification	0.82	0.80	0.81	75.4	8.5
Full Agentic RAG	0.90	0.87	0.88	94.1	1.8

Table 7. Hyperparameter and optimization settings of the compared models.

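Model	Backbone	Classification Task	Batch	Epochs	Optimizer	LR	WD	Selection
ResNet-50	resnet50	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5
EfficientNet-B0	effb0	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5
DeiT-Small	deit_s	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5
ViT-B/16	vit_b16	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5
MAE-ViT-B/16	mae_vit_b16	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5
DINOv2-ViT-S/14	dinov2_vits14	Binary/Multi-class	32	60	AdamW	$1 \times 10^{-4}$	0.05	Val macro-F1, patience = 5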

Table 8. Characteristics of the compared baseline models.

Model	Architecture	Pretraining Type	Advantages	Limitations
ResNet-50	CNN	Supervised	Stable optimization and strong convolutional inductive bias	Limited long-range dependency modeling
EfficientNet-B0	CNN	Supervised	Compact and parameter-efficient	Sensitive to small-data optimization
ViT-B/16	ViT	Supervised	Global context modeling	Large parameter scale and data-hungry training
DeiT-Small	ViT	Supervised/distilled	Data-efficient Transformer training	Still relies on supervised/distilled pretraining
MAE-ViT-B/16	ViT	Self-supervised	Learns representations through masked image reconstruction	Reconstruction-oriented features may not always align with fine-grained hazard discrimination
DINOv2-ViT-S/14	ViT	Self-supervised	Strong transferable visual representation with frozen backbone	Does not include task-specific feature reweighting or staged adaptation

Table 9. Performance comparison of different models on the independent test set. The proposed framework achieves the best overall performance, with 0.960 binary accuracy, 0.956 binary Macro-F1, 0.947 multi-class accuracy, and 0.947 multi-class Macro-F1.

Model	Binary Acc.	Binary Macro-F1	Multi Acc.	Multi Macro-F1	Params (M)
ResNet-50	0.680	0.643	0.684	0.675	25
EfficientNet-B0	0.720	0.715	0.737	0.700	5
ViT-B/16	0.800	0.872	0.842	0.832	86
DeiT-Small	0.760	0.724	0.789	0.783	22
MAE-ViT-B/16	0.920	0.899	0.895	0.894	86
DINOv2-ViT-S/14 (Ours)	0.960	0.956	0.947	0.947	21

Table 10. Comparison between CLIP/VLM-based inspection and the proposed framework for building fire safety acceptance.

Paradigm	Representative Methods	Advantages	Limitations in Fire-Safety Acceptance
CLIP/VLM-based inspection	CLIP, BLIP, vision-LLM systems	Open-vocabulary recognition; natural-language interaction; flexible semantic description; zero-shot or few-shot transfer	Prompt sensitivity; output variability; possible unsupported or mis-cited regulatory explanations
Proposed framework	DINOv2 + coarse-to-fine recognition + Agentic RAG	Stable small-sample recognition; confidence-aware review routing; verified clause-level evidence; auditable output	Requires predefined hazard taxonomy and structured regulation knowledge base; broader field validation is still needed
