A Retrieval-Augmented Generation Method for Question Answering on Airworthiness Regulations
Abstract
1. Introduction
- Constructing a civil aviation airworthiness certification QA dataset: Developing an LLM-based three-stage “generation–enhancement–correction” framework, producing 4688 high-quality questions in multiple-choice, fill-in-the-blank, and true/false formats.
- Designing a hierarchical retrieval pipeline for certification regulations: Incorporating a two-stage “semantic retrieval–cross-encoder re-ranking” mechanism into the RAG framework, balancing retrieval efficiency and re-ranking accuracy.
- Effectiveness validation: Experiments on the self-constructed dataset demonstrate significant improvements over the baselines in both answer accuracy and clause retrieval.
2. Related Work
2.1. Large Language Models
2.2. Retrieval-Augmented Generation
2.3. Question Answering
3. Materials and Methods
3.1. Dataset
3.1.1. Stage One: Initial Question Generation
3.1.2. Stage Two: Semantic Enhancement and Reverse Tracing
3.1.3. Stage Three: Manual Correction
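To make the three-stage flow concrete, the following is a minimal orchestration sketch in Python. It is illustrative only: the `call_llm` stub, the prompt strings, and the JSON format are assumptions rather than the prompts or tooling used in the paper, and stage three (manual correction) is a human review pass that is not automated here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAItem:
    question: str
    answer: str
    source_clause: str


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; not part of the paper's tooling."""
    raise NotImplementedError


def stage_one_generate(clause: str) -> QAItem:
    # Stage one: draft an initial question/answer pair grounded in a single clause.
    raw = call_llm(
        "Return JSON with keys 'question' and 'answer' for one exam-style question "
        f"based strictly on this airworthiness clause:\n{clause}"
    )
    data = json.loads(raw)
    return QAItem(data["question"], data["answer"], clause)


def stage_two_enhance_and_trace(item: QAItem) -> Optional[QAItem]:
    # Stage two: rewrite the question for clarity, then reverse-trace the answer
    # back to the source clause and drop the item if the trace fails.
    improved = call_llm(
        "Rewrite this question more precisely without changing its answer:\n" + item.question
    )
    verdict = call_llm(
        f"Does the following clause support the answer '{item.answer}'? Reply yes or no.\n"
        f"Clause: {item.source_clause}"
    )
    if not verdict.strip().lower().startswith("yes"):
        return None
    return QAItem(improved, item.answer, item.source_clause)


def build_candidates(clauses: list[str]) -> list[QAItem]:
    # Stage three (manual correction) then reviews the surviving items by hand.
    drafts = (stage_two_enhance_and_trace(stage_one_generate(c)) for c in clauses)
    return [d for d in drafts if d is not None]
```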
3.2. RAG Framework
3.2.1. Data Processing
3.2.2. Retriever
Stage One: Semantic Vector Encoding and Embedding Matching
Stage Two: Cross-Encoder Re-Ranking
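As a minimal sketch of such a two-stage retrieve-and-re-rank pipeline, the snippet below uses the sentence-transformers bi-encoder and cross-encoder APIs. The model names, recall depth, and top-k are placeholders, not the paper's exact configuration; in a real system the clause embeddings would be precomputed and held in a vector index rather than re-encoded per query.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder checkpoints; substitute the encoders actually used in deployment.
bi_encoder = SentenceTransformer("BAAI/bge-base-zh-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")


def retrieve(query: str, clauses: list[str], recall_k: int = 20, top_k: int = 3) -> list[str]:
    """Stage one: embed the query and clauses, recall candidates by cosine similarity.
    Stage two: re-score (query, clause) pairs with a cross-encoder and keep top_k."""
    clause_emb = bi_encoder.encode(clauses, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)

    # Coarse semantic recall over the clause collection.
    hits = util.semantic_search(query_emb, clause_emb, top_k=recall_k)[0]
    candidates = [clauses[h["corpus_id"]] for h in hits]

    # Fine-grained re-ranking of the recalled candidates.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```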
3.2.3. Generator
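A minimal sketch of the generation step is shown below, assuming a Hugging Face transformers causal-LM checkpoint. The Qwen2-7B-Instruct name and the system prompt are assumptions; the decoding settings mirror the top-k = 3 retrieved clauses and temperature = 0.3 listed in the experiment settings but are otherwise illustrative. It would be combined with the retriever above as `answer(question, retrieve(question, clauses))`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint corresponding to the Qwen2-7B backbone used in the experiments.
model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")


def answer(question: str, clauses: list[str]) -> str:
    # Pack the retrieved regulatory clauses into the prompt as numbered context.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(clauses))
    messages = [
        {"role": "system",
         "content": "Answer the airworthiness question using only the cited clauses."},
        {"role": "user",
         "content": f"Clauses:\n{context}\n\nQuestion: {question}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    # Strip the prompt tokens and decode only the newly generated answer.
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```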
3.2.4. Metrics
- (1) Accuracy: Accuracy measures how closely the answers generated by the question answering system agree with the reference answers. It serves as the key metric for the airworthiness certification comprehension task and is defined below.
- (2) Recall, Precision, and F1 Score: In the task of understanding civil aviation airworthiness certification specifications, the retrieval module’s ability to locate the correct regulatory provisions is a critical factor in evaluating RAG performance. These metrics are defined below.
- (3) Mean Reciprocal Rank (MRR): This metric evaluates the system’s ability to rank the correct regulatory provisions near the top of the retrieved list; it is defined below.
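The formulas themselves do not survive in this outline; the standard definitions consistent with the descriptions above are given here, where $N$ is the number of questions, $\mathrm{rank}_i$ is the position of the first correct provision retrieved for question $i$, and TP, FP, FN are counted by comparing retrieved provisions against the reference provisions.

```latex
\mathrm{Accuracy} = \frac{\#\{\text{correctly answered questions}\}}{N}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}
```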
4. Experiment
4.1. Dataset
4.2. Experiment Settings
4.3. Experimental Results
4.3.1. Benchmark Testing
Model | LLM | Choice Accuracy | Fill-in Accuracy | Judgement Accuracy | Precision | Recall | F1 Score | MRR |
---|---|---|---|---|---|---|---|---|
Vanilla RAG | Qwen2-7B | 0.5724 | 0.1394 | 0.5058 | 0.3215 | 0.4522 | 0.3758 | 0.4108 |
ColBERT | Qwen2-7B | 0.6831 | 0.2166 | 0.7657 | 0.4944 | 0.6636 | 0.5666 | 0.5508 |
FiD | Qwen2-7B | 0.8418 | 0.2457 | 0.7879 | 0.4337 | 0.6014 | 0.5039 | 0.5508 |
Ours | ChatGLM3-6B | 0.7886 | 0.2552 | 0.7621 | 0.5783 | 0.7369 | 0.6480 | 0.6594 |
Ours | Baichuan2-7B | 0.8151 | 0.2670 | 0.8012 | 0.6137 | 0.7423 | 0.6719 | 0.5836 |
Ours | Qwen2-7B | 0.8747 | 0.3418 | 0.8376 | 0.6422 | 0.8041 | 0.7140 | 0.7148 |
4.3.2. Ablation Study
4.4. Generalization Experiments
5. Discussion
5.1. Discussion and Summary
5.2. Case Analysis
5.3. Limitations and Future Directions
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 20 July 2025).
- Zhang, B.; Yang, H.; Zhou, T.; Babar, M.A.; Liu, X.Y. Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models. In Proceedings of the Fourth ACM International Conference on AI in Finance (ICAIF ’23), New York, NY, USA, 27–29 November 2023; ACM: New York, NY, USA, 2023; pp. 349–356. [Google Scholar] [CrossRef]
- Zakka, C.; Shad, R.; Chaurasia, A.; Dalal, A.R.; Kim, J.L.; Moor, M.; Fong, R.; Phillips, C.; Alexander, K.; Ashley, E.; et al. Almanac—Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 2024, 1, AIoa2300068. [Google Scholar] [CrossRef] [PubMed]
- Yepes, A.J.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv 2024, arXiv:2402.05131. [Google Scholar] [CrossRef]
- Reizinger, P.; Ujváry, S.; Mészáros, A.; Kerekes, A.; Brendel, W.; Huszár, F. Position: Understanding LLMs Requires More than Statistical Generalization. arXiv 2024, arXiv:2405.01964. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Kumar, Y.; Marttinen, P. Improving Medical Multi-Modal Contrastive Learning with Expert Annotations. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 468–486. [Google Scholar]
- Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; et al. CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering. In Proceedings of the International Conference on Case-Based Reasoning Research and Development, Merida, Mexico, 1–4 July 2024; Springer: Cham, Switzerland, 2024; pp. 445–460. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022; pp. 1–3. [Google Scholar]
- Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating Exposure Bias in Large Language Model Distillation: An Imitation Learning Approach. Neural Comput. Appl. 2025, 37, 12013–12029. [Google Scholar] [CrossRef]
- Ekin, S. Prompt Engineering for ChatGPT: A Quick Guide to Techniques, Tips, and Best Practices. Authorea Preprints. 2023. Available online: https://www.researchgate.net/publication/370554061_Prompt_Engineering_For_ChatGPT_A_Quick_Guide_To_Techniques_Tips_And_Best_Practices (accessed on 20 July 2025).
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Yu, D.; Zhu, C.; Fang, Y.; Yu, W.; Wang, S.; Xu, Y.; Ren, X.; Yang, Y.; Zeng, M. KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering. arXiv 2021, arXiv:2110.04330. [Google Scholar]
- Wang, Y.; Ren, R.; Li, J.; Zhao, W.X.; Liu, J.; Wen, J.-R. REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Singapore, 6–10 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 4213–4225. Available online: https://aclanthology.org/2024.emnlp-main.321.pdf (accessed on 25 July 2025).
- Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; Chen, E. CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. ACM Trans. Inf. Syst. 2024, 43, 1–32. [Google Scholar] [CrossRef]
- Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. J. Empir. Leg. Stud. 2025, 22, 216–242. [Google Scholar] [CrossRef]
- Wang, C.; Long, Q.; Xiao, M.; Cai, X.; Wu, C.; Meng, Z.; Wang, X.; Zhou, Y. BioRAG: A RAG-LLM Framework for Biological Question Reasoning. arXiv 2024, arXiv:2408.01107. [Google Scholar]
- Li, H.; Chen, Y.; Hu, Y.; Ai, Q.; Chen, J.; Yang, X.; Yang, J.; Wu, Y.; Liu, Z.; Liu, Y. LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), Washington, DC, USA, 13–17 July 2025; ACM: New York, NY, USA, 2025; pp. 3606–3615. [Google Scholar] [CrossRef]
- Möller, T.; Reina, A.; Jayakumar, R.; Pietsch, M. COVID-QA: A Question Answering Dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Virtual, 9 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
- Vasantharajan, C. SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents. Ph.D. Thesis, McMaster University, Hamilton, ON, Canada, 2025. [Google Scholar]
- Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv 2017, arXiv:1705.03551. [Google Scholar] [CrossRef]
- Zhao, Y.; Shi, J.; Chen, F.; Druckmann, S.; Mackey, L.; Linderman, S. Informed Correctors for Discrete Diffusion Models. arXiv 2024, arXiv:2407.21243. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 1–4 November 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
- He, Y.; Li, S.; Liu, J.; Tan, Y.; Wang, W.; Huang, H.; Bu, X.; Guo, H.; Hu, C.; Zheng, B.; et al. Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models. arXiv 2024, arXiv:2411.07140. [Google Scholar] [CrossRef]
- Deng, Y.; Zhang, W.; Xu, W.; Shen, Y.; Lam, W. Nonfactoid Question Answering as Query-Focused Summarization with Graph-Enhanced Multihop Inference. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11231–11245. [Google Scholar] [CrossRef]
- Zhang, W.; Zhou, Z.; Zheng, Z.; Gao, C.; Cui, J.; Li, Y.; Chen, X.; Zhang, X.-P. Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space. arXiv 2025, arXiv:2502.03091. [Google Scholar]
- Zhou, Z.; Wang, R.; Wu, Z. Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities. arXiv 2025, arXiv:2502.01952. [Google Scholar]
- Clark, J.H.; Choi, E.; Collins, M.; Garrette, D.; Kwiatkowski, T.; Nikolaev, V.; Palomaki, J. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Trans. Assoc. Comput. Linguist. 2020, 8, 454–470. [Google Scholar] [CrossRef]
- Vilares, D.; Gómez-Rodríguez, C. HEAD-QA: A Healthcare Dataset for Complex Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar] [CrossRef]
- Manes, I.; Ronn, N.; Cohen, D.; Ber, R.I.; Horowitz-Kugler, Z.; Stanovsky, G. K-QA: A Real-World Medical Q&A Benchmark. arXiv 2024, arXiv:2401.17218. [Google Scholar]
- Lai, V.D.; Krumdick, M.; Lovering, C.; Reddy, V.; Schmidt, C.; Tanner, C. SEC-QA: A Systematic Evaluation Corpus for Financial QA. arXiv 2024, arXiv:2406.14394. [Google Scholar] [CrossRef]
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2021, arXiv:2009.03300. [Google Scholar] [CrossRef]
- Liu, Z. SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security. arXiv 2023, arXiv:2310.12281. [Google Scholar]
Question Type | Count | Answer Distribution |
---|---|---|
Multiple-Choice | 2119 | A/B/C/D: 530/530/530/529 |
Fill-in-the-Blank | 1442 | - |
True/False | 1127 | True/False: 563/564 |
Question Type | Question | Options/Answer | Reference |
---|---|---|---|
Multiple-Choice | After a crash impact, within how much time should the cockpit voice recorder’s erase function cease operation? | A. Within 5 min B. Within 10 min C. Within 15 min D. Within 20 min Answer: B | AC-25.1457 Cockpit Voice Recorder The cockpit voice recorder must cease operation and disable all erase functions within 10 min after a crash impact. |
Multiple-Choice | In the electrical distribution system design documentation, which characteristic of the bus bar should be given special attention? | A. Temperature variation range B. Voltage variation range C. Current variation frequency D. Resistance variation curve Answer: B | AC-25.1355 Electrical Systems Attention should be given to the voltage variation range of the bus bar. |
Fill-in-the-Blank | In the design process based on statistical analysis of material strength properties, the probability of structural failure caused by ____ can be minimized. | Answer: Material Variability | AC-25.613 Material Strength Properties and Design Values This can minimize probability of failure caused by material variability. |
True/False | In the accelerate–stop braking transition process, a 2 s period is considered part of the transition phase. | Answer: False | AC-25.109 Accelerate–Stop Distance The 2 s period is for distance calculation and is not part of the braking transition process. |
Item | Configuration |
---|---|
Operating System | Ubuntu 20.04 |
GPU | NVIDIA Tesla T4 × 3 (NVIDIA Corporation, Santa Clara, CA, USA) |
CUDA Version | 12.2
PyTorch Version | 2.6.0 |
Top-k | 3 |
Temperature | 0.3 |
Model | Choice | Fill-in-the-Blank | Judgement | Overall |
---|---|---|---|---|
Vanilla RAG | 0.5724 | 0.1394 | 0.5058 | 0.4232 |
+Metadata & MdChunk | 0.7918 | 0.2572 | 0.6859 | 0.6019 |
+Hybrid Retriever | 0.8339 | 0.3058 | 0.8066 | 0.6649 |
+Re-rank | 0.8747 | 0.3418 | 0.8376 | 0.7019 |
Datasets | Domain | QA Type | Accuracy (%) |
---|---|---|---|
MMLU-Econometrics | Finance | Choice | 54 |
MMLU-US_foreign_policy | Policy | Choice | 89 |
SecQA | Cybersecurity | Choice | 92 |
Element | Content |
---|---|
Question | 8000 feet is the maximum cabin pressure altitude limit mandated by the International Civil Aviation Organization. |
Reference Answer | False |
Model Prediction | True |
Retrieved Passage (excerpt) | “… under normal operating conditions, the aircraft shall maintain a cabin pressure altitude not exceeding 2438 m (8000 feet) at its maximum operating altitude …” |
Element | Content |
---|---|
Question | Fatigue tests of windshields and window panels must account for the effects of ____ and the life factor. |
Reference Answer | load amplification factor |
Model Prediction | load factor |
Retrieved Passage (excerpt) | “… in fatigue tests of windshields and window panels considering the load amplification factor and the life factor, the durability of the metal parts of the installation structure must be taken into account …” |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).