1. Introduction
As urbanization in China accelerates, the scale and complexity of building infrastructures continue to expand. In high-rise mixed-use complexes, commercial centers, and large healthcare institutions, the quantity and intricacy of fire protection facilities have grown markedly. Because these facilities serve as a critical “line of defense” for lives and property, their reliable operation during a fire is paramount. However, recent safety incidents resulting from insufficient maintenance have highlighted substantial risks to public safety [1]. Consequently, conducting regular, professional, and standardized maintenance is vital for preventing fires and minimizing casualties and economic losses [2,3].
Under the dual drivers of regulatory policies and market demand, the fire maintenance sector is undergoing a significant digital transformation. From a policy perspective, the Ministry of Emergency Management’s “Administrative Provisions on Social Fire Technical Services” [4] requires maintenance companies to ensure work orders are traceable and processes supervisable. The “14th Five-Year Plan for Emergency Management Standardization Development” [5] further emphasizes improving the intelligence of fire protection equipment. Similarly, the National Fire and Rescue Administration’s “14th Five-Year National Fire Protection Work Plan” [6] aims to promote the integration of digital and intelligent fire protection facilities. On the market side, with China’s fire maintenance market surpassing one trillion yuan in 2024, major providers are deploying SaaS- or PaaS-based smart management platforms. Against this backdrop, leveraging emerging technologies to enhance maintenance standardization and efficiency has become a focal point for the industry.
Nevertheless, even with strong demand, today’s maintenance management still struggles at the most critical step: generating maintenance work orders that strictly follow regulatory standards. A series of standards—including the national “General Code for Fire Protection Facilities” (GB 55036-2022) [7] and multiple local specifications—have been issued to systematically define maintenance frequencies, inspection items, and technical requirements for major systems such as fire alarm systems, sprinkler systems, hydrant systems, gas suppression systems, fire pumps and associated electrical equipment, smoke exhaust systems, and emergency lighting and evacuation signage. These clauses cover the full life cycle of fire protection facilities and function as the industry’s “gold standard,” providing authoritative guidance for maintenance practice. Representative documents are summarized in Table 1. The large “number of clause lines” and “number of clause words” highlight how rigorous and detailed these standards are—underscoring both the operational complexity of maintenance work and the difficulty of interpreting regulatory clauses.
Currently, the vast majority of enterprises rely on manual consultation of regulations and manual extraction of clauses to formulate maintenance plans. This severely constrains the efficiency of fire protection maintenance and poses risks to the safe operation of buildings. This traditional model faces an inherent “cognitive load” contradiction, stemming from the gap between the “multi-source heterogeneity” of maintenance regulations and the requirement that maintenance work orders be “highly directive” and “actionable.” The building fire protection system is extensive and fragmented; national standards alone comprise thousands of clauses. Coupled with the subtle differences or even logical conflicts often found among various local and enterprise standards, relying on subjective human cognition or manual lookups easily leads to the omission of “implicit constraints” within maintenance clauses—such as adjustments to maintenance frequency for fire facilities in specific scenarios or special enterprise-level maintenance requirements. Furthermore, manual compilation of work-order libraries by professional maintenance personnel is time-consuming and prone to lags in clause updates, making it difficult for fire operations management to adapt promptly to newly issued regulatory requirements. Fire protection maintenance service companies currently lack an efficient and highly adaptable method for constructing work-order libraries that cater to different regions, building types, and operational natures. Given these issues, any missing or erroneous work order could directly result in the failure of critical facilities during a fire, causing irreversible loss of life and property. Therefore, there is an urgent need for a technical solution capable of automatically parsing complex regulatory texts and intelligently generating standardized maintenance work orders to bridge the gap between “regulatory texts” and “maintenance work orders”.
In recent years, Large Language Models (LLMs) have made major advances in natural language processing. With strong text generation and contextual understanding, they have been widely adopted in vertical domains such as healthcare, law, and finance [14,15,16,17,18,19,20]. These advances suggest a promising path for tackling text interpretation in fire maintenance. However, directly applying general-purpose LLMs remains challenging. Fire-safety standards are highly specialized and logically intricate, involving extensive equipment hierarchies and multiple constraint conditions. As a result, generic LLMs are prone to “Hallucination,” producing outputs that look professional yet contradict engineering common sense. In addition, the field lacks high-quality annotated datasets, which makes large-scale supervised fine-tuning difficult.
To overcome these challenges, we propose an LLM-based intelligent method for generating maintenance work orders for building fire protection facilities. The method introduces the Fire Services–Retrieval-Augmented Generation (FS-RAG) framework and integrates it with a domain knowledge base, Fire Services Knowledge Base (FSKB), thereby tightly coupling retrieval-augmented generation with the reasoning capability of LLMs.
The main contributions of this paper are as follows.
1. We develop an LLM-driven approach for intelligently generating maintenance work orders for building fire protection facilities, addressing the low efficiency and high error rates of manual work-order preparation.
2. By incorporating the FS-RAG framework together with a specialized knowledge base, we design an accurate recall mechanism for clause context, guiding the model to generate content within a controlled scope and effectively solving the hallucination problem. Experiments show that, compared with conventional LLM-based work-order generation, our method increases the F1 score by 12.62% and the line-level compliance rate by 5.7%.
3. We further demonstrate an efficient strategy for low-resource settings. By relying on an LLM API and combining few-shot In-Context Learning (ICL), the proposed approach lowers the barrier of local computing requirements while maintaining both efficiency and extraction quality, enabling high-accuracy information extraction (IE) and work-order generation. The generated work orders achieve a line-level compliance rate of 97.3% and an F1 score of 90.42%. Overall, the method establishes an automated pipeline from regulatory standards to a structured maintenance work-order library, offering a new solution for the building fire protection domain.
2. Related Work
Maintenance management for building fire protection facilities is a knowledge-intensive, standards-driven system-level task. In this study, the “maintenance work order library” is more than a checklist; it serves as an actionable, highly instructive execution record generated strictly from relevant standards and specifications. As illustrated in Figure 1, a standard maintenance work order must precisely transform unstructured regulatory clauses—covering essential elements such as equipment type, maintenance frequency, required actions, and technical requirements—into structured work order entries.
Currently, maintenance management for fire protection facilities remains largely manual and still relies on traditional practices such as paper-based work orders and telephone dispatches. Most service providers organize maintenance information in Excel spreadsheets, indicating a relatively low level of informatization. In recent years, as smart fire protection initiatives have accelerated, a wide range of fire maintenance management platforms introduced by service companies have rapidly proliferated and have gradually replaced conventional management approaches in the market. Nevertheless, the most critical step—work-order preparation—still depends on manual effort: practitioners must extract maintenance tasks and technical requirements from national standards, local standards, and internal company specifications, and then populate work-order templates clause-by-clause to build a corresponding maintenance work-order library. Key processes such as clause interpretation, task decomposition, and element categorization rely heavily on the subjective judgment and experience of maintenance staff. As a result, it is difficult to guarantee compliance from work-order drafting to on-site execution, potentially creating safety risks. In addition, when regulatory authorities release or revise relevant clauses, platforms often require manual updates by operations teams, making timely alignment with new requirements challenging and introducing compliance risks due to delayed updates [21].
As artificial intelligence advances rapidly, information extraction and LLMs have introduced a new paradigm for addressing the challenges described above. IE aims to convert unstructured or semi-structured text into structured representations and typically covers tasks such as named entity recognition, relation extraction, and event extraction, with proven effectiveness across many application domains [22]. However, when facing more demanding settings—such as cross-standard adaptation, lengthy clauses, multiple entities listed in parallel, and consistency requirements under multiple constraints—traditional IE pipelines often suffer from limited portability, high maintenance overhead, slow updates, and incomplete coverage. LLMs demonstrate strong performance in question answering, writing, code generation, and semantic/contextual analysis. Thanks to their deep semantic understanding and contextual reasoning, they have achieved notable progress in automatically parsing regulatory-style texts and extracting structured information in various domains. Yet, in the building fire protection domain, issues such as hallucinations, missing fields, type confusion, and non-compliant structured outputs remain common. Fire-safety clauses are rich in domain terminology, interwoven logical dependencies, and implicit engineering assumptions. As a result, rule-based approaches and lightweight fine-tuned models struggle to achieve both broad coverage and sufficient flexibility. This has motivated an integrated direction that combines “retrieval augmentation” with “structured constraints” to make extraction results more verifiable. Recent studies have systematically summarized the latest advances in generative information extraction, offering methodological baselines and a research framework for practical deployment in specialized domains [23].
In recent years, generative LLMs—exemplified by the GPT family and DeepSeek—have shown remarkable strength in both language understanding and text generation. In the LRML (LegalRuleML) study, Fuchs et al. [24] leveraged GPT-3.5 to convert building regulations into a semantic, structured representation end to end, producing outputs that can be directly consumed by downstream rule-execution engines. Their findings suggest that LLMs can translate regulatory text efficiently even with limited samples, although stability benefits from dynamic example selection and careful alignment of domain terminology. Despite their rapid deployability, general-purpose LLMs still struggle to fully capture specialized terms and the underlying relational logic embedded in regulatory clauses. To mitigate these issues, Zhong and Goodfellow [25] performed domain-adaptive secondary pre-training of BERT, RoBERTa, and related models on a construction management system (CMS) corpus. This adaptation helped address terminology ambiguity and long-context dependencies typical of standards documents, substantially improving the recognition and extraction of technical nouns, and offering a practical route for later LLM-based fire-code parsing with few-shot examples. For multi-source, multi-version standards, a single model often cannot achieve both high coverage and high accuracy. Rayo et al. [26] therefore combined classical information retrieval with LLM-based extraction: relevant clauses are retrieved first, and an LLM then extracts key fields. Their experiments indicate that this retrieval-augmented generation paradigm outperforms retrieval-only or generation-only approaches, making it well-suited for clause localization and in-depth interpretation in fire-safety standards. Focusing on the linguistic characteristics of building–engineering standards, Lin Jiarui et al. [27] introduced ARCBERT, the first pre-trained model for the AEC domain. Pre-trained on corpora including standards, civil engineering regulations, and domain-related encyclopedic entries, ARCBERT learns industry-specific style and syntax, and it delivers at least a 7% gain over general models such as BERT and Baidu’s ERNIE in downstream text classification and named entity recognition. Meanwhile, frequent updates to local and enterprise standards—alongside regional variation—can lead to conflicts and inconsistent terminology across documents. Kumar and Roussinov [28] used GPT-4 to automatically detect regulatory passages containing injected “conflicts” and “terminology ambiguity,” achieving an F1 score at an industry-usable level and providing a reference for terminology consistency checks and compliance validation when building fire maintenance work-order libraries.
Beyond the fire domain, similar low-resource and high-specialization settings offer useful methodological parallels. Pei Bingsen et al. [29] proposed an LLM-based few-shot knowledge extraction approach for public-security law-enforcement texts, where expertise demands are high and labeled data are scarce. By combining knowledge editing (MEMIT), low-resource fine-tuning (LoRA), and multi-round prompt templates, they built a public-security vertical LLM that better understands policing terminology and case structures, outperforming conventional baselines on entity and relation extraction. Zhang Baoyi et al. [30] introduced a KG-RAG framework for geological prospecting, coupling knowledge graphs with LLMs. Under geological ontology constraints, prompt engineering and chain-of-thought techniques enabled high-quality automated knowledge extraction, while knowledge-graph multi-hop retrieval replaced conventional document retrieval to improve the accuracy and trustworthiness of question answering; the approach outperformed baselines in both knowledge construction and QA tasks. Finally, Zhai Dongsheng et al. [31] presented a TRIZ+AI Agent strategy for domain knowledge base construction, addressing the complexity of TRIZ theory and its heavy reliance on experts. Their multi-agent system handled tasks such as contradiction identification, parameter mapping, and principle matching, together with prompt optimization, to automatically extract technical contradictions and solutions from patent texts; effectiveness was validated on hydrogen-storage patents, improving both the intelligence of patent analysis and the efficiency of knowledge construction.
Taken together, these studies indicate that directly applying general-purpose LLMs to the fire protection maintenance domain still faces the challenge of “Hallucination”: the model may generate maintenance instructions that appear reasonable but violate engineering common sense (e.g., incorrect matching of equipment components, or complex clauses involving multiple frequencies that are not matched correctly).
To mitigate this, retrieval-augmented generation (RAG) has been introduced. RAG couples an external, non-parametric knowledge base with an LLM [32] and follows a “retrieve-then-generate” workflow that injects an explicit evidence trail. This approach can substantially reduce hallucinations while improving the traceability and timeliness of knowledge-intensive tasks. High-quality retrieval, evidence assembly, and passage alignment are central to RAG: evidence can be shared across the entire sequence or tracked dynamically at the token level. In industrial practice, RAG is widely used to strengthen provenance and real-time updating, which closely matches our requirement in this study that “clauses must be verifiable” [33]. By treating the external knowledge base as a “reference book” during generation, RAG can significantly improve the factual correctness of model outputs. To date, RAG has been broadly adopted in vertical domains including healthcare, geological exploration, and legal consulting [34,35,36].
Despite these successes, most existing RAG work targets question answering (QA), where the goal is to retrieve relevant passages to answer a natural language question. Work-order generation, however, is fundamentally different from open-ended QA. Here, the model must not only interpret regulatory clauses, but also decompose them—strictly according to engineering standards—into predefined fields such as equipment, actions, frequency, and thresholds, while satisfying strict constraints on output format (e.g., JSON Schema) and logical consistency. To address this gap, our method goes beyond snippet retrieval. We build a structured FSKB as an intermediate layer and shift the RAG retrieval objective from “similar text” to “directed text,” enabling structured, maintenance-actionable outputs. In doing so, we help fill the long-standing gap of generic RAG methods in engineering-oriented, standardized generation.
Therefore, in low-resource specialized scenarios, constructing a lightweight and iterable domain knowledge base, and coordinating it with prompt templates, serves as an effective approach to improve recall, disambiguate subordination relationships, and mitigate “fabricated terms” resulting from model hallucinations. Existing research indicates that in complex data processing tasks, the adoption of feature enhancement and progressive interaction mechanisms can significantly boost a model’s ability to perceive and parse key information [37,38]. Similarly, practices involving natural science and engineering texts have demonstrated [39] that synergizing LLMs with structured knowledge effectively improves the quality and usability of extracting complex entities and their relationships. This provides both theoretical backing and empirical experience for the design of the Highly Relevant Knowledge Cards within our proposed FSKB.
Overall, when parsing building fire protection standards, LLMs can enable end-to-end conversion from “regulatory clauses” to “computable data structures,” and there are clear technical routes for retrieval augmentation, conflict detection, and ambiguity resolution. Building on these foundations, we tailor an integrated solution for the key challenges of “building fire maintenance clauses”: fine-grained alignment supported by FSKB; strong-relevant clause evidence injected via FS-RAG; improved structural compliance and consistency through ESS Schema together with whitelist/blacklist rules and anti-trigger words; and an end-to-end work-order generation pipeline realized via API calls under low compute and low labeling cost. The proposed method achieves rapid deployment while maintaining controllability and accuracy, thereby supporting intelligent fire maintenance work-order generation systems and providing robust technical backing for smart fire protection.
3. Materials and Methods
Given the complexity of regulatory texts in building fire protection maintenance—including intricate logic, highly specialized terminology, and limited labeled data—traditional rule-based or conventional machine-learning approaches struggle to achieve the level of accuracy required for high-precision information extraction. To address this, we propose an FS-RAG driven information extraction framework that enables an end-to-end automated transformation from unstructured standards documents to structured maintenance work orders. The key idea is to build a FSKB that supplies external evidence to the LLM, while leveraging retrieval-augmented generation to mitigate hallucinations commonly observed in general-purpose models. Unlike full-parameter fine-tuning, our study adopts In-Context Learning. By embedding high-quality exemplars through prompt engineering, we activate the model’s contextual reasoning capabilities and achieve efficient work-order generation under low-resource constraints. In this work, we define a “Fire Services Maintenance Work Order” as the smallest actionable unit used to guide technicians in carrying out standardized maintenance procedures for fire protection facilities. Rather than a free-form textual description, each work order is represented as an actionable and verifiable structured six-tuple:

(D, S, A, R, F, N)

Here, D denotes the fire protection facility (Fire Services Device, e.g., “fire extinguisher”), i.e., the maintenance target specified in the work order; S denotes a subpart or accessory of that facility (e.g., the “nameplate” of a “fire extinguisher”); A denotes the maintenance action (Action, e.g., “inspect”); R denotes the technical requirement (Requirement, e.g., “pressure is normal”); F denotes the maintenance frequency constraint (Frequency, e.g., “once per quarter”); and N denotes the quantity constraint (Number, e.g., “10%” of “smoke detectors”), i.e., the required proportion/amount of the maintenance targets to be inspected or serviced.
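For concreteness, the six-tuple can be sketched as a small data structure; the field names below are our illustrative choices for exposition, not the exact schema used in this work:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkOrder:
    """One maintenance work order as the six-tuple (device, subpart, action,
    requirement, frequency, number)."""
    device: str                # fire protection facility, e.g., "fire extinguisher"
    subpart: Optional[str]     # subcomponent/accessory, e.g., "nameplate"
    action: str                # maintenance action, e.g., "inspect"
    requirement: str           # technical requirement, e.g., "pressure is normal"
    frequency: str             # frequency constraint, e.g., "once per quarter"
    number: Optional[str]      # quantity constraint, e.g., "10%", or None if absent

order = WorkOrder("fire extinguisher", "nameplate", "inspect",
                  "pressure is normal", "once per quarter", None)
```

Optional fields capture the fact that not every clause specifies a subpart or a quantity constraint.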
This paper presents an intelligent work-order generation approach for building fire protection maintenance, enabling automated interpretation of regulatory clauses and standardized population of work-order fields. The overall architecture is illustrated in Figure 2. Given China’s substantial regional variation and industrial diversity, maintenance procedures may differ across provinces, cities, and enterprises. We therefore begin by collecting maintenance standards from multiple sources, including national, local, and enterprise-level specifications. After manual pre-processing, these documents form a maintenance text dataset. We further derive few-shot mappings for work-order generation by extracting representative sample knowledge. In parallel, we build a domain knowledge base and construct the strong-relevant knowledge cards FSKB required for retrieval-augmented generation, which serve as the knowledge foundation of the framework. Within the LLM-based information extraction pipeline, the FS-RAG module uses clause text and trigger terms to retrieve supporting evidence from both the maintenance-text database and the FSKB knowledge cards, assembling a verifiable context. An enhanced Prompt—incorporating In-Context Learning and constrained, format-aware generation—then guides the LLM to produce controlled structured outputs, yielding verifiable work orders. Next, JSON/Schema checks and rule-based consistency validation are applied for error detection, correction, and field backfilling. Finally, the outputs of named entity recognition and relation extraction are stored in the maintenance work-order repository, together with evidence provenance fields for traceability. Evaluation feedback is used to iteratively refine the knowledge base, prompt design, and few-shot exemplars, enabling stable improvements in extraction quality and database-ready work-order generation under limited compute and labeling resources.
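The JSON/Schema checking and field-backfilling step can be sketched as follows; the required and optional field names here are assumptions for illustration, not the exact schema defined in this work:

```python
import json

# Assumed minimal field set for illustration only.
REQUIRED_FIELDS = {"device", "action", "requirement", "frequency"}
OPTIONAL_FIELDS = {"subpart", "number", "source_clause"}

def validate_order(raw):
    """Parse one LLM output as JSON and report schema violations.

    Returns (order_dict_or_None, list_of_error_strings). Optional fields
    are backfilled with None so downstream storage sees a uniform record.
    """
    try:
        order = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(order, dict):
        return None, ["top-level JSON must be an object"]
    errors = []
    missing = REQUIRED_FIELDS - order.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    unknown = order.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    for field in OPTIONAL_FIELDS:
        order.setdefault(field, None)
    return order, errors
```

Records that fail validation can be routed back through the correction step rather than being stored directly.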
3.1. Selection of Information Extraction LLM
In maintenance work-order generation for building fire protection facilities, the LLM functions as a “knowledge bridge.” It must comprehend complex natural language clauses, resolve the logical relations embedded in those clauses, and produce a formatted work order that strictly follows a predefined template. Selecting an LLM therefore requires balancing multiple factors, including reasoning capability, context-window length, response latency, compute cost, and overall stability. Based on these technical and practical considerations, we adopt DeepSeek-V3.1 as the backbone LLM and deploy it via an API service [40,41]. DeepSeek-V3.1 is built on a Mixture-of-Experts (MoE) architecture with 671B total parameters, while activating only 37B parameters per token during inference. This design provides both broad knowledge capacity and efficient inference, allowing the model to accurately handle cross-disciplinary terminology frequently found in fire-safety standards—spanning electrical systems, water supply and drainage, and ventilation engineering. It also shows strong performance in Chinese comprehension, technical term handling, and technical-document parsing [42,43], making it well suited to information extraction in the specialized setting of building fire maintenance and enabling scalable extraction from large volumes of regulatory text.
DeepSeek-V3.1 further supports a 128K long context window, which allows full handling of lengthy standards documents and clauses containing long or information-dense fields. This makes it possible to inject detailed strong-relevant knowledge cards directly into the Prompt, reducing the risk of partial-context interpretation introduced by conventional RAG Chunking. The model also performs reliably in structured generation, providing normalized JSON outputs that simplify downstream automation and system integration for work-order data. Moreover, under low-resource conditions, DeepSeek-V3.1 exhibits strong zero-shot and few-shot learning ability—an important advantage for the building fire maintenance domain, where high-quality labeled data are limited.
For deployment, we compare API-based access with on-premise deployment. While local deployment can offer privacy benefits, it is often impractical for complex standards processing. Lightweight models such as DeepSeek-7B, Qwen-14B, and Llama-8B have modest hardware requirements, but their F1 performance drops markedly when faced with long, difficult clauses involving multiple devices and intertwined constraints, making it hard to guarantee work-order quality. Conversely, deploying heavy models above 70B parameters to pursue higher accuracy demands substantial hardware investment, which is a major cost barrier for most fire maintenance providers. By contrast, the DeepSeek-V3.1 API currently offers a favorable Price-Performance Ratio, and a detailed comparison of model selection and deployment choices is summarized in Table 2.
Overall, the DeepSeek-V3.1 API-based approach preserves high extraction quality while significantly reducing dependence on on-premise computing resources. This markedly lowers the barrier to real-world deployment, and currently represents the most practical and effective technical pathway for intelligent generation of building fire protection maintenance work orders.
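Since the DeepSeek API is OpenAI-compatible, an extraction request might be assembled roughly as in the sketch below; the model name, prompt wording, and message layout are our assumptions for illustration and should be checked against the provider's current documentation:

```python
def build_request(clause, cards, exemplars):
    """Assemble a chat-completion payload for clause-to-work-order extraction.

    `cards` are FSKB knowledge-card snippets retrieved by FS-RAG;
    `exemplars` are few-shot pairs {"clause": ..., "output": ...} for ICL.
    """
    system = ("You are a fire-maintenance extraction assistant. "
              "Output only JSON that matches the work-order schema.")
    messages = [{"role": "system", "content": system}]
    for ex in exemplars:  # few-shot in-context examples
        messages.append({"role": "user", "content": ex["clause"]})
        messages.append({"role": "assistant", "content": ex["output"]})
    evidence = "\n".join(cards)
    messages.append({"role": "user",
                     "content": f"Evidence:\n{evidence}\n\nClause:\n{clause}"})
    return {"model": "deepseek-chat", "messages": messages,
            "response_format": {"type": "json_object"}, "temperature": 0}

# Actual call (requires the `openai` package and an API key), sketched only:
# from openai import OpenAI
# client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
# resp = client.chat.completions.create(**build_request(clause, cards, exemplars))
```

Setting temperature to 0 and requesting a JSON-object response are common choices for constrained structured extraction.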
3.2. Construction of Dataset and FSKB
To ensure that the model is well aligned with the building fire protection maintenance work-order generation task—both in domain suitability and in the reliability of its supporting sources—we first build a high-quality, domain-specific text dataset. Building on this dataset, we further construct the strong-relevant knowledge cards, FSKB, which serve as the evidence base for retrieval-augmented generation.
3.2.1. Construction of a Dataset for Maintenance of Building Fire Protection Systems
Data quality is a decisive factor in overall model performance. Accordingly, our dataset aggregates maintenance standards from multiple sources—National Standard, Local Standard, and Enterprise Standard—with the goal of covering building fire protection maintenance procedures across different regions in China and at varying levels of regulatory granularity.
The resulting corpus includes representative maintenance specifications such as General code for fire protection facilities (GB55036-2022), Service specification for testing of building fire protection facilities (DB11/T3034-2023), and Service specification for testing of building fire protection facilities (DB12/T3034-2023). In total, the maintenance-text dataset comprises 7168 clauses and 147,405 words.
During pre-processing, we clean the collected documents by removing redundant whitespace, special symbols, and invalid characters. We also normalize clauses with inconsistent formatting to ensure a unified representation, as illustrated in Figure 3. “DB” denotes a local (provincial) standard, while “DB11” indicates a Beijing local standard, meaning the clause applies to the Beijing region. “6.2.3.4” is the clause identifier, and the text highlighted by the blue box corresponds to the original maintenance clause. With this pre-processing design, the model can explicitly capture a clause’s applicable region during extraction, while also enabling evidence-based provenance tracing for each work-order requirement.
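A minimal sketch of this kind of cleaning and region tagging follows; the specific rules and the returned labels are illustrative, not the exact pre-processing used here:

```python
import re

def clean_clause(text):
    """Strip control characters and collapse redundant whitespace."""
    text = re.sub(r"[\u0000-\u0008\u000b-\u001f]", "", text)  # invalid chars
    text = re.sub(r"\s+", " ", text)                          # collapse whitespace
    return text.strip()

def parse_source(doc_id):
    """Map a standard's ID prefix to its scope, e.g., DB11 -> Beijing local."""
    if doc_id.startswith("GB"):
        return "national"
    if doc_id.startswith("DB11"):   # check the more specific prefix first
        return "local:Beijing"
    if doc_id.startswith("DB"):
        return "local"
    return "enterprise"
```

Keeping the document ID alongside each cleaned clause is what enables the provenance tracing described above.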
To ensure dataset quality and maintain fairness in downstream evaluation, we split the corpus into training and test sets using an 8:2 ratio. Both subsets cover full-spectrum maintenance scenarios and include entities across 14 major categories of fire protection equipment, such as fire power supply and distribution facilities and automatic fire alarm systems. Overall, the dataset provides comprehensive coverage of building fire protection maintenance contexts, offering a robust foundation for subsequent model training and evaluation.
3.2.2. Building Highly Relevant Knowledge Cards
To mitigate “knowledge hallucination” and “terminology drift” that commonly occur when general-purpose LLMs are applied to specialized domains, we construct FSKB. FSKB is organized around key elements required for work-order generation—device types, subcomponents/accessories, maintenance actions, and maintenance frequencies—which are jointly used with regulatory clauses and interfaced with the generation model via the FS-RAG module. Unlike conventional unstructured document stores, FSKB is a curated collection of structurally organized strong-relevant knowledge cards. Through in-depth analysis of standards documents and domain knowledge in fire maintenance, we compile a domain-oriented “professional lexicon,” thereby ensuring both the accuracy and the strong logical consistency of the knowledge.
FSKB comprises five categories of core knowledge cards, as illustrated in Figure 4:
1. KB-Devices—Lexicon of device types and accessories: We extract from maintenance standards a hierarchy of “System Category” (e.g., fire power supply and distribution facilities, automatic fire alarm systems, automatic sprinkler systems), “Device Type” (e.g., fire alarm devices, emergency broadcast loudspeakers, gas suppression controllers), and “Device Subtype” (e.g., gauges, indicator lights, valve lead screws, power modules). Because the same device or subpart may be described in multiple ways across standards, we include an “Alias” column to capture synonyms and alternative names. This helps the LLM learn equivalent terms used in regulatory text and allows it to produce acceptable variant expressions. These entries form the foundational knowledge used by FS-RAG to recognize device-related information during both retrieval and generation.
2. KB-Actions—Maintenance actions and requirement phrases: From clause descriptions, we extract common “Action” terms (e.g., inspection, cleaning, servicing, functional testing) and “Requirement Type” phrases (e.g., securely installed, dust removal, replacement, lubrication). To make these cues easier for the model to detect, we also curate “Trigger Terms” (e.g., corrosion, damage, abnormality, pressure gauge). In addition, we introduce “Bind Devices Type” to explicitly link each action/requirement/trigger set to relevant device types, so the model can ground maintenance actions in the correct equipment context.
- 3.
KB-Frequency—Templates for frequency and threshold expressions: This card consolidates frequency-related expressions found in standards and organizes them into three representations (Pattern, Frequency Text, Normalize). With these normalized templates, the model can more reliably identify frequency and threshold constraints (e.g., monthly, quarterly, annually, not less than once).
- 4.
KB-Section Map—Mapping between document sections and system categories: During pre-processing, we standardize the structure of the standards and organize their sections and corresponding systems. We then store this alignment as “Prefixed Section”–“System Category” mappings, which helps prevent the model from pulling evidence from irrelevant sections and supports provenance tracing for generated work orders.
- 5.
KB-Templates—Templates for relation and parallel-structure splitting: This card captures trigger and anti-trigger knowledge for handling relational and parallel constructions. When “Trigger Terms” are detected, the model consults this card during retrieval-augmented generation. Because standards often use diverse conjunctions and variable descriptions of devices and subparts, models can struggle to correctly separate and categorize them. The KB-Templates card constrains hallucinations and guides the model to properly split and classify relationships among “system–device,” “device–device,” and “device–subcomponent.”
Through this design, we ensure that the knowledge cards are both highly relevant and accurate, providing FS-RAG with strong external support and thereby improving the accuracy and actionability of generated work orders.
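As an illustration, a KB-Devices entry and its alias lookup might be represented as follows. This is a minimal sketch: the field names (`system_category`, `device_type`, `aliases`, etc.) are assumptions based on the columns described above, not the actual FSKB schema.

```python
from dataclasses import dataclass, field

@dataclass
class DeviceCard:
    """One KB-Devices knowledge card (illustrative field names)."""
    system_category: str      # e.g., "automatic fire alarm system"
    device_type: str          # e.g., "manual alarm button"
    device_subtype: str = ""  # e.g., "indicator light"
    aliases: list = field(default_factory=list)  # the "Alias" column

def match_alias(cards, mention):
    """Resolve a textual mention to a canonical device_type via the alias column."""
    for card in cards:
        if mention == card.device_type or mention in card.aliases:
            return card.device_type
    return None

cards = [DeviceCard("automatic fire alarm system", "manual alarm button",
                    aliases=["manual call point"])]
print(match_alias(cards, "manual call point"))  # -> manual alarm button
```

The alias column lets variant regulatory wordings resolve to one canonical device name before the card is injected into the generation context.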
3.3. The FS-RAG Framework
To mitigate hallucinations that arise when general-purpose LLMs are applied to fire maintenance tasks, as well as to reduce format drift under the strict constraints of work-order generation, we propose a retrieval-augmented generation framework—FS-RAG, which combines the FSKB. By using retrieval-augmented generation (RAG) [
44], this approach leverages a custom-built knowledge base to enhance the model’s extraction capability, as shown in
Figure 5.
With FS-RAG, the LLM is able to retrieve stronger evidence and merge it with the original clause text, which improves comprehension of complex regulatory clauses and ensures that the resulting work orders are not only accurate but also evidence-grounded and traceable.
3.3.1. Context-Aware Knowledge-Matching Mechanism Based on Knowledge Cards
Unlike traditional RAG frameworks that rely solely on dense-vector retrieval, this study exploits the highly standardized normative terminology of the fire maintenance domain and designs a context-aware knowledge-matching mechanism built on the strong-relevant knowledge cards. Within the FS-RAG framework, FSKB supplies the external domain knowledge required during retrieval-augmented generation (RAG); the context-aware precise matching mechanism empowered by the strong-relevant knowledge cards is shown in
Figure 6.
The strong-relevant knowledge cards (FSKB) provide effective support for the large model to accurately identify key information in information extraction tasks, such as complex devices, subcomponents/accessories, maintenance actions, and affiliated systems, specifically as follows:
- 1.
Mapping clause identifiers to system categories: FSKB’s KB-Section Map infers the “System Category” by recognizing the prefixed clause identifier. This enables each generated work order to be precisely anchored to the relevant building fire protection system. When similar maintenance requirements appear across different systems, the resulting work orders remain system-specific rather than ambiguous.
- 2.
Linking clauses to devices: When FS-RAG extracts device-related information from a clause, KB-Devices provides a rich set of candidates containing extensive domain terminology. This helps the model accurately identify both “Device Type” and “Device Subtype” mentioned in the text.
- 3.
Accurate frequency alignment: KB-Frequency and KB-Actions capture the frequency expressions and maintenance actions that commonly appear in standards. They help the model detect and map time schedules and threshold constraints across different maintenance requirements, ensuring that temporal and frequency fields in the generated work orders are correct.
With this mechanism, FS-RAG can “lock in” the standardized device names, maintenance action definitions, and aligned frequency expressions before the LLM generates the work order. It then assembles an evidence-rich Augmented Context and injects it into the generation process, thereby improving both the accuracy and the internal consistency of the final structured output.
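The section-to-system anchoring in point 1 can be sketched as a longest-prefix lookup against the KB-Section Map. The prefix table below is a toy assumption for illustration, not the actual card content:

```python
# Illustrative KB-Section Map: clause-identifier prefix -> system category.
SECTION_MAP = {
    "5.2": "automatic fire alarm system",
    "6.1": "automatic sprinkler system",
}

def infer_system_category(clause_id: str) -> str:
    """Longest-prefix match of a clause identifier against the section map."""
    for prefix in sorted(SECTION_MAP, key=len, reverse=True):
        if clause_id.startswith(prefix):
            return SECTION_MAP[prefix]
    return "unknown"

print(infer_system_category("5.2.3"))  # -> automatic fire alarm system
```

Trying the longest prefixes first means a more specific section mapping, if present, always wins over a broader chapter-level mapping.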
3.3.2. Strategy for Tuning the Retrieval Hyperparameter “Top-k”
FS-RAG uses a Top-k retrieval scheme, where the system retrieves the k most relevant evidence items for the current clause from both the fire maintenance text database and FSKB. The retrieved evidence is then appended to the LLM context and serves as grounding references for work-order generation.
In the RAG framework, the retrieval quantity Top-k is a key hyperparameter that affects generation quality [
45]. In our experiments, we found that when k is too small, key contextual information is lost and retrieval recall drops; when k is too large, noisy evidence unrelated to the current maintenance work order is introduced. The latter not only prolongs the information extraction task but also distracts the model's attention, amplifying hallucination and thereby reducing the accuracy of generated work orders.
To identify an appropriate k, we performed a Top-k sensitivity analysis during the B4 stage. We sampled 500 representative clauses with complex structure and evaluated a range of k values (including k = 10, 30, and 50). We focused on work-order accuracy and retrieval recall, using the F1 score as an intuitive indicator of extraction performance under each setting. The results are summarized in
Figure 7.
Results indicate that with k = 10, the model often misses required fields when handling complex clauses that involve multiple subcomponents/accessories or multiple maintenance requirements, mainly because insufficient evidence is retrieved to resolve device affiliation—leading to low work-order completeness. As k increases to 30, the model reaches its peak in correctly extracted entities and maintains stable reasoning performance. However, when k exceeds 30 (e.g., k = 50), the larger evidence pool does not translate into better performance; instead, the F1 score shows a slight decline. Further analysis suggests that excessive irrelevant knowledge cards introduce noise that disrupts the model’s identification of target entities, causing entity mix-ups in a small number of generated work orders.
In the end, we set the retrieval hyperparameter Top-k to 30. This choice enables correct extraction for most complex clauses while keeping contextual noise under control, thereby striking an optimal trade-off between inference efficiency and generation quality.
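As a minimal illustration of Top-k evidence selection, the sketch below scores candidate knowledge cards against the clause and keeps the k best, with k defaulting to the chosen value of 30. The bag-of-words overlap scorer is a toy stand-in for the real retriever:

```python
def top_k_evidence(clause: str, cards: list[str], k: int = 30) -> list[str]:
    """Return the k candidate cards with the highest word overlap with the clause."""
    def overlap(card: str) -> int:
        return len(set(clause.split()) & set(card.split()))
    return sorted(cards, key=overlap, reverse=True)[:k]

cards = ["smoke detector functional testing monthly",
         "sprinkler head cleaning annually",
         "smoke detector cleaning"]
picked = top_k_evidence("test smoke detector monthly", cards, k=2)
print(picked[0])  # -> smoke detector functional testing monthly
```

With a real dense retriever, only the scoring function changes; the truncation to k items, and hence the recall/noise trade-off discussed above, is the same.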
3.4. In-Context Learning Strategy Based on Prompt Engineering
Conventional LLM-based information extraction typically incorporates domain knowledge via full-parameter fine-tuning. This approach is expensive, often yields limited generalization, and makes rapid updates difficult when domain knowledge evolves. In contrast, our study exploits the in-context learning capability of LLMs. We design an ESS (Extractable Subset Schema)-based contextual constraint pattern together with a few-shot prompting mechanism, enabling the model to perform a dynamic, inference-time transformation from natural-language clauses to structured work orders.
3.4.1. ESS Example Design and Expansion Logic
To enhance information extraction under low-resource conditions in this specialized domain—and to ensure that generated work orders are directly database-ready—we introduce an ESS embedding mechanism. Rather than asking the LLM to produce a complete “formal” work order upfront, ESS explicitly defines the model’s responsibility boundary: the model is required to extract only what is objectively present in the clause text, including elements such as “System Category”, “Device Type”, accessories, maintenance actions, and frequency requirements.
ESS specifies a strict JSON structure that covers all critical fields in the work-order template. With a few-shot learning setup, this design both helps curb hallucinations and enforces a clear one-to-one alignment between extracted fields and the original clause content, which simplifies downstream validation and provenance tracing.
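For illustration, a hypothetical ESS-filled output for one simple clause might look as follows. The field names follow the schema described later in the prompt design; the clause and values are invented examples, not data from the corpus:

```python
import json

# Hypothetical ESS record for a single, simple clause (invented content).
ess_record = {
    "clause_id": "5.2.3",
    "clause_text": "Smoke detectors shall be functionally tested monthly.",
    "jurisdiction": "National Standard",
    "system_category": "automatic fire alarm system",
    "requirements": [
        {
            "device_type": "smoke detector",
            "device_subtype": "",
            "maintenance_action": "functional testing",
            "frequency": "monthly",
            "threshold_values": [],
        }
    ],
}
print(json.dumps(ess_record, indent=2))
```

Because every slot is either filled from the clause or left empty, each record round-trips through JSON unchanged and maps one-to-one onto the work-order table columns.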
We first group the key ESS fields into categories, and the results are summarized in
Table 3.
We then selected 90 clauses as representative exemplars. These examples capture frequent extraction challenges encountered in real deployments and are well suited for stress-testing the model on complex clauses with blurred boundaries. Specifically, they span 14 major categories of fire protection equipment and include difficult patterns such as parallel mentions of multiple devices, affiliation relations between devices and accessories, frequency-specific bindings, and quantity-based sampling requirements. We embed each clause together with its corresponding ESS-filled output into the Prompt as few-shot examples.
In our experimental setup, the ESS-based contextual strategy is applied consistently across stages B0–B4, following a progressive few-shot refinement pathway. In B0, we provide no exemplars and only supply the work-order table headers to evaluate the model’s inherent extraction ability in a zero-shot setting. In B1, we introduce ESS few-shot exemplars built from 90 clauses. In B2, we further combine FS-RAG with the ESS exemplars. Continuous refinements from B2 to B4 lead to substantial improvements in extraction performance.
Standards documents frequently contain clauses where (i) a single clause covers multiple devices (or accessories)—for example, “functional testing shall be performed for smoke detectors, heat detectors, and manual alarm buttons”—or (ii) the same device is associated with different actions under different frequencies—such as “inspect monthly and clean annually.” Conventional generation approaches often collapse such content into a single work-order field, creating redundant, stacked information that weakens the work order’s actionability and specificity. To address this, we introduce a Decoupling constraint.
We formalize the constraint in two cases as follows:
- 1.
Let the device entity set appearing in the input clause $C$ be $E = \{e_1, e_2, \dots, e_n\}$. If all devices share the same maintenance action set $A$ and the same frequency constraint $F$, the model is required to learn an expansion mapping $\phi$ that converts $C$ into a record set $R$ consisting of $n$ separate entries:
$$\phi(C) = R = \{r_1, r_2, \dots, r_n\}, \quad r_i = (e_i, A, F), \quad i = 1, \dots, n.$$
Here, the $n$ devices must be mapped one-to-one into $n$ separate work-order entries. In other words, each record $r_i$ must contain exactly one device entity $e_i$. The model is not allowed to produce an aggregated record such as $(e_1, e_2, \dots, e_n, A, F)$.
- 2.
Assume that there is only a single device entity $E$ in the input clause $C$, and that this device has actions $A$ under different frequencies $F$ (i.e., cases where actions and frequencies appear in pairs). The set of “action–frequency pairs” is $P = \{(a_1, f_1), (a_2, f_2), \dots, (a_n, f_n)\}$. As in the previous case, $C$ is expanded into a set $R$ containing $n$ independent records:
$$\phi(C) = R = \{r_1, r_2, \dots, r_n\}, \quad r_i = (E, a_i, f_i), \quad i = 1, \dots, n.$$
In other words, each $r_i$ must include exactly one action–frequency pair $(a_i, f_i)$, and the model must not output an aggregated record such as $(E, a_1, f_1, a_2, f_2, \dots, a_n, f_n)$.
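Both expansion cases can be written directly in code; the function and field names below are illustrative, not the system's actual implementation:

```python
def expand_devices(devices, actions, frequency):
    """Case 1: n devices sharing the same actions/frequency -> n records."""
    return [{"device": d, "actions": actions, "frequency": frequency}
            for d in devices]

def expand_pairs(device, pairs):
    """Case 2: one device with n (action, frequency) pairs -> n records."""
    return [{"device": device, "action": a, "frequency": f} for a, f in pairs]

records = expand_devices(
    ["smoke detector", "heat detector", "manual alarm button"],
    ["functional testing"], "quarterly")
print(len(records))  # -> 3

pairs = expand_pairs("air filter", [("inspect", "monthly"),
                                    ("clean", "annually")])
print(len(pairs))  # -> 2
```

Each output record carries exactly one device (case 1) or one action–frequency pair (case 2), which is precisely the aggregation the Decoupling constraint forbids the model to produce.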
3.4.2. Enhanced Prompt Design and Anti-Trigger Word Principle
Based on the few-shot in-context learning strategy and the FS-RAG/FSKB mechanism, we further apply enhanced prompt engineering to constrain the LLM’s output behavior at a fine-grained level [
46]. Since our target corpus consists of Chinese building fire maintenance standards, the system is deployed with Chinese prompts in practice, as illustrated in
Figure 8.
First, at the role and task level, the Prompt explicitly constrains the model to be a “building fire maintenance standard extraction assistant,” and clearly specifies that its input includes only two types of information: (i) the original clause text with a clause identifier, and (ii) the strong-relevant knowledge cards retrieved by FS-RAG. The Prompt explicitly requires the model to “extract only objectively existing information in the clause based solely on these contents,” and prohibits supplementing information from external resources, thereby tightening the model’s responsibility boundary at the instruction level.
Second, the Prompt enforces a strictly hierarchical JSON output. Top-level fields include clause_id, clause_text, jurisdiction, and system_category, and a requirements array is defined such that each element corresponds to one device-level maintenance requirement. Each device-level entry is further decomposed into fields such as device_type, device_subtype, and maintenance_action, while threshold_values is represented as a structured array. By explicitly fixing this schema, the Prompt turns unconstrained text generation into a controlled “slot-filling” extraction process, which simplifies downstream consistency checks and metric computation.
Third, to improve extraction granularity and constrain model behavior, the Prompt provides explicit guidance on how to split clauses and populate slots. It requires the model to “list devices and accessories first, then fill slots device by device,” ensuring that multiple devices or multiple subcomponents/accessories in a single clause are represented as separate device-level work-order entries. It also enforces the rules “leave missing information empty, do not invent,” and “output strict JSON only,” thereby preventing two common failure modes: inflating vague statements into unnecessary fields, and mixing free-form text into structured outputs. Together with the ESS few-shot exemplars and the strong-relevant knowledge cards returned by FS-RAG, this Prompt design improves both the completeness of device-level requirements and the line-level compliance rate in our experiments, providing a solid basis for building a reliable work-order knowledge base.
Finally, we incorporate templated “trigger” and “anti-trigger” rules in KB-Templates to address recurring error patterns observed in stages B2–B4, such as over-generating subcomponents/accessories like “battery,” “air filter,” or “check valve.” These patterns are encoded as explicit textual constraints: When a trigger term is present but its anti-trigger counterpart is absent, the Prompt instructs the model not to generate certain fields. For typical subcomponents/accessories (e.g., gauges, indicator lights, status displays, handwheels, rotating parts), templates further guide the model to prioritize assigning them to device_subtype.
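The trigger gating described above can be sketched as a simple whitelist check: a subcomponent is only eligible for output when one of its trigger terms actually appears in the clause. The trigger-term table below is an invented example, not the actual KB-Templates content:

```python
# Illustrative trigger table: subcomponent -> terms that must appear in the clause.
TRIGGERS = {
    "battery": ["battery", "storage battery"],
    "check valve": ["check valve"],
}

def allowed_subcomponents(clause: str) -> set[str]:
    """Permit a subcomponent field only if one of its trigger terms is present."""
    return {sub for sub, terms in TRIGGERS.items()
            if any(term in clause for term in terms)}

print(allowed_subcomponents("Inspect the storage battery monthly."))
# -> {'battery'}
```

Encoding this check as explicit prompt text (rather than post-filtering) steers the model away from over-generating subcomponents in the first place.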
3.5. Experimental Setup
To comprehensively verify the effectiveness of the FS-RAG framework proposed in this paper for the task of generating maintenance work orders for building fire protection facilities, we designed a rigorous experimental environment and evaluation system.
3.5.1. Dataset Preparation
We build a standardized fire-safety corpus spanning heterogeneous sources. Specifically, we collect National Standard, Local Standard, and Enterprise Standard documents relevant to fire maintenance, and the overall scale of the resulting standards corpus is summarized in
Table 4.
After data cleaning and standardized pre-processing, we obtained 7168 distinct regulatory clauses. We then partitioned the corpus into training and test sets using an 8:2 split, with dataset statistics reported in
Table 5.
The training set contains 5729 clauses (117,930 words) and is used for developing the method and constructing few-shot exemplars; since the model is used purely in-context, these clauses never participate in gradient updates. The test set contains 1439 clauses (29,475 words) and is used for the final evaluation of model accuracy.
3.5.2. Model Environment and Parameter Settings
All experiments in this study were conducted with the DeepSeek-V3.1 LLM, with inference performed remotely via the official API to avoid the computational overhead of local deployment. The experiments ran on Python 3.11, using Pandas for data-flow processing. For inference hyperparameters, we adopted a “low randomness, strong constraints” configuration to minimize hallucinations during extraction. To ensure consistent generation results while retaining parameter flexibility, we fixed the model's parameters in code; these settings are listed in
Table 6.
- 1.
Temperature (temperature coefficient): We fix the model temperature at 0.0 to control the stability of generation results and avoid the impact of excessive randomness on work-order generation quality;
- 2.
Max Tokens (maximum generation length): We set max_tokens to 1200 to ensure that, under ESS constraints, key information such as maintenance requirements can be fully output and will not be truncated due to length limits;
- 3.
Timeout (timeout and retry mechanism): We set the API-call timeout to 120 s and allow at most one retry. When environmental fluctuations or server-side response-parsing failures occur, the script automatically retries and records the failure reason, ensuring the stability and traceability of the experimental process.
- 4.
API-Key: In the experiments, the API key is passed in via command-line arguments, and extraction tasks for all experimental stages are completed under the same inference environment.
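The settings above can be sketched as a thin retry wrapper. The parameter dictionary mirrors Table 6, but the model name and the keyword-style call signature are assumptions about the OpenAI-compatible client, and `fake_api` below is a stand-in for the real API call:

```python
# Assumed inference parameters (model name is an assumption).
PARAMS = {"model": "deepseek-chat", "temperature": 0.0,
          "max_tokens": 1200, "timeout": 120}

def call_with_retry(call_api, prompt, max_retries=1):
    """Invoke the API; on failure, retry at most once and keep the failure reason."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return call_api(prompt, **PARAMS)
        except Exception as exc:  # timeout or response-parsing failure
            last_error = exc      # recorded failure reason
    raise RuntimeError(f"extraction failed after retry: {last_error}")

# Simulated flaky call: fails once (timeout), then succeeds.
attempts = []
def fake_api(prompt, **params):
    attempts.append(params)
    if len(attempts) == 1:
        raise TimeoutError("simulated timeout")
    return '{"clause_id": "5.2.3"}'

print(call_with_retry(fake_api, "clause text"))
```

Keeping temperature at 0.0 and the retry count at one bounds both output variance and wall-clock cost per clause.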
3.5.3. Evaluation Metrics
To objectively evaluate the performance of the proposed method in this study on the task of generating maintenance work orders for building fire protection facilities, we design evaluation metrics from two dimensions: work-order structural compliance and key information extraction accuracy, namely the work-order level compliance rate and the F1 score.
- 1.
Work-order Level Compliance Rate
The work-order level compliance rate is used to measure whether each generated work-order row meets the structured requirements. In the post-processing stage, the script performs consistency validation for each row of extraction results: if and only if the work order passes JSON/Schema structure validation, the device type and subcomponents/accessories satisfy whitelist constraints, and there is no field conflict, the script marks the status of this work order as compliant, i.e., [status = “ok”]; otherwise, it is marked as extraction failure.
The work-order level compliance rate $CR$ is defined as
$$CR = \frac{N_{\mathrm{ok}}}{N_{\mathrm{total}}} \times 100\%.$$
Here, $N_{\mathrm{ok}}$ denotes the number of work-order rows judged as compliant, and $N_{\mathrm{total}}$ denotes the total number of generated work-order rows. This metric intuitively reflects the stability of the model during the work-order generation process.
- 2.
F1 Score
To quantify the model’s ability to extract key information from clauses, we treat an extraction as correct when the value in a work-order slot is correct and the slot itself is matched correctly. We conduct global statistics over all test clauses and adopt standard evaluation metrics, including Precision, Recall, and F1 score.
Precision, Recall, and F1 score are defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Here, $TP$ (True Positive) denotes the number of field entries correctly extracted by the model; $FP$ (False Positive) denotes the number of field entries incorrectly output by the model; and $FN$ (False Negative) denotes the number of ground-truth field entries that the model fails to extract.
Overall, $CR$ reflects the model's capability to generate compliant work orders, while the F1 score reflects its extraction accuracy. Used together, they comprehensively characterize the overall performance of the proposed method in building fire maintenance scenarios.
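Both metrics can be computed directly from the post-processing results; a minimal sketch (the `status` field name is an assumption):

```python
def compliance_rate(rows: list[dict]) -> float:
    """Share of generated work-order rows whose post-processing status is 'ok'."""
    ok = sum(1 for r in rows if r.get("status") == "ok")
    return ok / len(rows) if rows else 0.0

def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Micro-averaged Precision, Recall, and F1 over extracted field entries."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

rows = [{"status": "ok"}, {"status": "ok"}, {"status": "failed"}]
print(round(compliance_rate(rows), 3))  # -> 0.667
p, r, f = prf1(tp=90, fp=10, fn=20)
print(round(p, 2), round(r, 3), round(f, 3))  # -> 0.9 0.818 0.857
```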
4. Result
4.1. Analysis of Experimental Results
To systematically verify the effectiveness of the “DeepSeek-V3.1 + FS-RAG + ICL” method proposed in this paper (i.e., the experimental group B4), as well as to demonstrate the contribution of each module to work-order generation quality, this study designs a five-stage stepwise ablation experiment from B0 to B4 and further compares the performance differences of different large language model backbones.
Table 7 presents the results of each experimental stage and verifies the iteration and advantages of the proposed method through specific data analysis.
In B0, we run the LLM-based extractor without any exemplars, relying solely on the raw clause text and the ESS output format. The B0 results suggest that the model can capture some device types and maintenance requirements, but overall performance is limited due to the absence of domain grounding and structured demonstrations. Typical issues include failure to identify the jurisdiction, inability to determine the associated system category, and inaccurate mapping of many clauses into structured work orders, resulting in low-quality outputs.
In stage B1, we introduced an in-context learning strategy based on prompt engineering. We embedded 90 typical “clause–work-order” pairs in the Prompt as few-shot examples, enabling the model to refer to existing structured samples during extraction. The experimental results in this stage improved significantly, with the work-order level compliance rate reaching 91.6% and the F1 score being 77.8%. The main improvements in this stage are reflected in the model’s ability to identify more key fields and reduce extraction errors, showing a clear improvement compared with B0.
To further address biases in understanding vertical-domain terminology, stage B2 formally introduced the FS-RAG mechanism, using the strong-relevant knowledge cards in FSKB to enhance the model’s knowledge recall capability. In this stage, the model’s extraction performance was further improved: the work-order level compliance rate reached 96.7%, and the F1 score increased to 82.18%. The FS-RAG mechanism enables the model to retrieve relevant device information and maintenance requirements from the knowledge base, thereby effectively improving the accuracy of maintenance work-order generation, especially performing better in complex scenarios such as device types, accessories, and frequency requirements.
In stage B3, we adjusted FS-RAG by adding whitelist constraints and an anti-trigger mechanism in FSKB. Specifically, for devices and subcomponents/accessories whose trigger terms are not matched in the clause text, the model is prohibited from outputting them. This effectively suppresses over-generation and hallucination, making the output more rigorous and precise. Experimental results show that in stage B3 the work-order level compliance rate is 97.1% and the F1 score reaches 85.1%, indicating that the model can output high-quality work orders more stably under these constraints.
Finally, in stage B4, we performed the final optimization of FS-RAG and designed an enhanced Prompt. Combined with the final version of the strong-relevant knowledge cards in FSKB, this further improved extraction performance: the work-order level compliance rate reached 97.3% and the F1 score was 90.42%. Compared with stage B1, which used only ICL, the compliance rate increased by 5.7 percentage points and the F1 score by 12.62 percentage points. This progress demonstrates the synergistic effect of the FS-RAG framework and the ICL strategy after multiple rounds of iterative optimization, enabling the model to accurately handle extremely complex clause structures.
Figure 9 shows the trend of the work-order level compliance rate and the F1 score across the experimental stages. It can be clearly seen that extraction performance improves steadily with each optimization. Specifically, the introduction of few-shot examples first improves the model's performance on complex clauses; the FS-RAG mechanism combined with the strong-relevant knowledge cards in FSKB further improves recall and accuracy in information extraction; and the whitelist constraints and anti-trigger mechanism effectively suppress hallucination, making the generated work orders more consistent with actual normative requirements. By stage B4, the model can complete the automatic generation of maintenance work orders for building fire protection facilities with high accuracy and stability, laying a solid foundation for subsequent applications and practical deployment.
While the above results validate our overall strategy, we additionally evaluate the rationality of the chosen backbone model. Under the same B4-stage setup, we swap the LLM API to benchmark three widely used models—Qwen-2.5, GPT-4o, and Llama-3.1—on the structured output task considered in this study. The comparative results are reported in
Table 8.
The comparison results show that this structured-generation task imposes a clear threshold on model reasoning capability. Qwen-2.5 and Llama-3.1 achieve relatively low F1 scores on complex nested logic and struggle to meet the strong-constraint requirements. DeepSeek-V3.1 demonstrates SOTA-level performance: its understanding of Chinese professional terminology is significantly better than Llama-3.1's, and its F1 score is about 1% higher than GPT-4o's. Moreover, DeepSeek's API calling cost is the lowest among the four models. In summary, the method proposed in this paper is not only effective as an algorithmic strategy but also achieves the best balance between performance and cost in engineering terms.
4.2. Robustness Verification
In real-world settings, fire maintenance standards come from heterogeneous sources, and the input may include OCR errors, non-standard punctuation introduced during manual transcription, and clauses with ambiguous wording or semantic conflicts. To evaluate how robust our FS-RAG framework is under such low-quality inputs, we construct an adversarial test set of 100 clauses and manually inject three categories of noise.
- 1.
Character-level noise: Artificially simulating pinyin input-method errors, as shown in
Figure 10.
- 2.
Format-level noise: Removing or scrambling the original punctuation in clauses.
- 3.
Semantic-level interference: Constructing clauses containing “exclusion logic” (e.g., constructing “Except for the end-of-line test device…” to test whether the model will incorrectly extract the excluded device) or clauses with “requirement conflicts.”
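The first two noise categories can be simulated with a short script; the typo-substitution table below is an illustrative assumption, not the actual injection procedure:

```python
import re

# Illustrative substitution table mimicking input-method typos.
PINYIN_TYPOS = {"detector": "detecter"}

def char_noise(text: str, table=PINYIN_TYPOS) -> str:
    """Character-level noise: inject typo substitutions from the table."""
    for good, bad in table.items():
        text = text.replace(good, bad)
    return text

def format_noise(text: str) -> str:
    """Format-level noise: strip punctuation from the clause."""
    return re.sub(r"[,.;:、，。；：]", "", text)

print(char_noise("Test the smoke detector monthly."))
# -> Test the smoke detecter monthly.
print(format_noise("Inspect monthly, clean annually."))
# -> Inspect monthly clean annually
```

Semantic-level interference (exclusion logic, conflicting requirements) cannot be generated mechanically in this way and was constructed by hand.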
The results are summarized in
Table 9, where we focus primarily on the drop in F1 score after noise injection.
The experimental data show that after noise is introduced, the performance of B0 (evaluated under the same conditions as the B0-stage experiment above) drops sharply, falling by 13.3% under character-level noise (misspelled characters). In contrast, the proposed B4 method shows strong robustness, with performance drops kept within 2.5% in both the misspelling and punctuation-disorder scenarios. For the most challenging semantically conflicting clauses, B4, relying on the logical reasoning capability of DeepSeek-V3.1, still maintains an accuracy above 84%, demonstrating that the method can cope with the data-noise challenges of real engineering scenarios.
4.3. Generalization Verification
Although the standards documents used in our experiments come from National Standard, Local Standard, and Enterprise Standard sources, and we constructed a training set with comprehensive clause coverage, to evaluate the model’s practical application capability, we selected an external standards document that did not participate in training (Gansu provincial standard DB62/T4727-2023 [
47], “Technical code maintenance of building fire protection facilities”). This document guides building fire maintenance work in Gansu Province. Owing to regional differences and the compilers' choices, some proper nouns, maintenance actions, and other items in the document differ from those in our training set. Nevertheless, drawing on the knowledge accumulated in the earlier stages, the model can still stably complete the maintenance work-order generation task, demonstrating strong generalization ability.
Specifically, we compared the model’s extraction performance on this external standards document (Gansu provincial standard DB62/T4727-2023) with the results in stage B4, as shown in
Table 10.
The experimental results demonstrate that, when evaluated against the Gansu provincial standard, the model achieves an F1 score of 87.76% and a work-order level compliance rate of 97.1%. These results are notably similar to the model’s performance on our self-built dataset in stage B4, suggesting that the model is capable of maintaining high extraction performance even on previously unseen external standard documents. This further validates the effectiveness of both the in-context learning strategy and the FS-RAG mechanism in enhancing the model’s generalization capability.
By introducing the FS-RAG mechanism, the model can retrieve relevant domain knowledge from the strong-relevant knowledge cards in FSKB, thereby effectively coping with unseen regional standards. Specifically, the model maintains high adaptability across different regional standards, ensuring that the automatic generation of maintenance work orders for building fire protection facilities can meet diverse scenario requirements. The above experimental results show that the model can not only learn general extraction patterns but can also, when facing new regions and new standards, still accurately generate maintenance work orders that comply with regulatory requirements.
Although this study takes Chinese standards as an example, the framework (FS-RAG + FSKB) is also applicable to complex fire maintenance standard systems such as NFPA (National Fire Protection Association) standards and EN (European Norms) standards. By constructing appropriately matched strong-relevant knowledge cards in FSKB and an FS-RAG mechanism, this framework can flexibly adapt to regulatory differences across countries and regions, and it has good cross-regional adaptability.