TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions

Pan, Weijun; Zheng, Yao; Wang, Yidi; Chen, Sheng; Zuo, Qinghai; Luan, Tian; Zeng, Chen

doi:10.3390/app16073461

Open AccessArticle

TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions

by

Weijun Pan

¹,

Yao Zheng

^1,*,

Yidi Wang

²,

Sheng Chen

³,

Qinghai Zuo

³,

Tian Luan

³

and

Chen Zeng

⁴

¹

Caac Academy of Flight Technology and Safety, Civil Aviation Flight University of China, Guanghan 618307, China

²

School of Transportation and Logistics, Southwest Jiaotong University, Chengdu 611756, China

³

College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan 618307, China

⁴

Xinjin Flight College, Civil Aviation Flight University of China, Guanghan 618307, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3461; https://doi.org/10.3390/app16073461

Submission received: 13 February 2026 / Revised: 1 April 2026 / Accepted: 1 April 2026 / Published: 2 April 2026

(This article belongs to the Topic AI-Enhanced Techniques for Air Traffic Management)

Download

Browse Figures

Versions Notes

Abstract

To enhance flight safety and operational efficiency, extracting key actions, flight parameters, and status information from civil aviation flight crew instructions generated during pre-flight and in-flight procedures is crucial. However, such texts are highly condensed and involve complex multi-role interactions, easily leading to entity boundary drift and category misclassification. To address this, this paper proposes a joint key information extraction framework based on a lightweight pre-trained language model (TinyBERT) and a Role-Aware Fusion mechanism, abbreviated as TR-BiGRU-CRF. This framework introduces the Role-Aware Fusion mechanism to resolve semantic ambiguity caused by multi-party interactions, utilizes TinyBERT for semantic representation that balances accuracy and computational efficiency, and employs BiGRU-CRF for robust sequence feature modeling and decoding. Experiments on a flight crew instruction dataset show that the proposed method achieves 92.2% precision, 91.8% recall, a 92.0% F1 score, and an overall prediction accuracy of 92.6%. Compared to the BiGRU-CRF baseline, it significantly improves accuracy, precision, and F1 score by 11.4, 13.3, and 13.5 percentage points, respectively. These results prove that the proposed method effectively mitigates boundary drift and category confusion, providing strong support for flight crew instruction understanding and safety decision-making.

Keywords:

key information extraction; flight crew instructions; TinyBERT; sequence labeling

1. Introduction

With the expansion of civil aviation transportation and the increasing complexity of operational environments, higher demands have been placed on the timeliness, consistency, and traceability of information processing within the flight operations system. Global air passenger demand reached a historic high in 2024, and the increasing operational load highlights the urgent need to improve information processing efficiency and intelligent support capabilities under high-density operating conditions [1]. At the same time, risk management under complex operational conditions relies on more granular data analysis and monitoring to enhance risk identification, early warning, and mitigation capabilities, thereby supporting operational safety and decision-making [2,3]. In this context, flight crew instruction texts contain key information such as action intent, object elements, and parameter constraints, serving as a crucial data source for operational monitoring, event review, and risk analysis. However, these texts often include specialized terminology, colloquial omissions, and intertwined multi-role interactions, making automatic understanding and parsing more challenging. To achieve refined management and intelligent support of the operational process, it is essential to automatically parse and transform these unstructured instruction texts into structured key information, improving the consistency and reusability of information retrieval, statistical analysis, and automated reasoning while reducing manual processing costs and enhancing the reliability of process monitoring. Therefore, key information extraction from flight crew instructions has become one of the foundational technologies in the field of intelligent civil aviation operations.

Key information extraction aims to identify and extract core semantic segments from instruction texts, such as object elements, action descriptions, and parameters or constraints. The extraction results not only affect the accuracy of structured representations but also directly impact subsequent tasks such as instruction consistency validation, execution link tracking, and safety situation analysis. Unlike general texts, the same word or phrase may carry different semantic functions in different contexts, which adds complexity to the extraction task. Although Named Entity Recognition (NER) has achieved significant success in multiple NLP tasks [4], domain-specific instruction texts often feature dense technical terminology, highly condensed sentence structures, prevalent omissions, and ambiguous boundaries of key segments. Therefore, the model needs not only strong semantic representation capabilities but also the ability to consistently model sequence dependencies and maintain label prediction consistency.

In existing research, traditional methods often rely on rules or statistical features, which are interpretable but highly sensitive to template and vocabulary coverage, making it difficult for them to handle the diverse expression variants in real-world applications. In recent years, deep learning methods have typically used pre-trained language models combined with sequence modeling and structured decoding to improve extraction performance. In instruction-based text scenarios, end-to-end frameworks that integrate pre-trained representations, attention mechanisms, and sequence labeling structures have been applied to air traffic control instruction key element extraction, achieving good performance improvements [5]. However, general sequence labeling frameworks still face significant performance bottlenecks in the flight crew instruction domain, especially in terms of semantic overlap, insufficient semantic representation, and inconsistent label paths, which often result in errors such as missing key elements, boundary shifts, and label confusion.

To further characterize the essence of the problem and identify the technical breakthroughs, this paper summarizes the challenges faced by this task into the following three aspects:

The implicit representation of role information and semantic ambiguity lead to label confusion. In flight crew instructions, the frequent colloquial omission of executing subjects often causes identical trigger words to correspond to disparate entity boundaries and categories across various role contexts. However, standard models such as BERT typically process input text as role-agnostic continuous sequences, lacking the necessary prior perception of multi-party interactive identities. Consequently, these models are highly susceptible to semantic confusion and boundary drift when handling such specialized instructions.
The pervasive use of domain-specific terminology and compact expressions exacerbates the difficulty of semantic modeling. Flight crew instructions are replete with highly condensed sentence structures and abundant abbreviations, exacerbating the difficulty of semantic modeling. Although increasing model parameters can capture deeper contextual features, such massive models struggle to satisfy the extremely stringent real-time deployment constraints of aviation terminals. Therefore, when processing such texts, existing conventional sequence labeling models often lose fine-grained semantic features, making it difficult to achieve an effective balance between inference latency and representational precision [6].
The complex sequence dependencies and label consistency requirements impose higher standards on the structural decoding of models. The arrangement of key elements in aviation instructions implies strict execution logic, accompanied by intricate nesting and cross-span dependencies. However, standard BERT or standalone TinyBERT architectures typically lean toward independent predictions at local positions during sequence labeling, lacking prior constraints on global label transition patterns. When handling highly condensed texts, this localized decision mechanism not only struggles to ensure label sequence consistency but is also highly susceptible to the omission of key elements and boundary segmentation errors [7].

To address the aforementioned challenges, this paper proposes a key information extraction method tailored for flight crew operational instructions. First, to resolve role semantic ambiguity, a role-awareness module is introduced, integrating role condition information into the representation learning process, thereby enhancing the separability and recognition accuracy of key elements. Second, TinyBERT is adopted as a lightweight pre-trained encoder to balance domain semantic representation with deployment efficiency, providing robust contextual representation capabilities while optimizing inference efficiency through two-stage distillation [8,9]. Finally, BiGRU is used for bidirectional context modeling and CRFs are applied for global decoding constraints to improve label transition patterns and boundary prediction stability [10]. The overall process is as shown in Figure 1.

Based on this, the main contributions of this paper can be summarized in the following three points:

A Role-Aware Fusion mechanism is proposed to resolve multi-party semantic ambiguity. By explicitly injecting speaker identities into the feature space, this mechanism compensates for the deficiency of traditional models regarding prior knowledge of multi-party interactions. Compared to conventional undifferentiated encoding methods, it significantly enhances the model’s capability to distinguish role-conditioned semantics, thereby effectively improving keyword localization precision.
A lightweight semantic encoding framework based on TinyBERT is constructed to overcome computational bottlenecks. Diverging from the conventional reliance on computationally heavy standalone pre-trained models, this framework effectively bridges the gap between stringent aviation deployment constraints and the need for deep semantic representation. It achieves an optimal trade-off between computational overhead and representational accuracy, providing a robust and deployable solution specifically tailored for constrained aviation scenarios.
A joint structural decoding strategy integrating BiGRU and CRFs is incorporated to guarantee label sequence consistency. By imposing rigorous global consistency constraints, this strategy enables a global optimal path search during the decoding phase. It effectively overcomes the localized independent predictions inherent in standalone pre-trained models, significantly enhancing the model’s boundary stability and mitigating boundary segmentation errors.

The structure of the remaining parts of this paper is as follows: Section 2 provides a categorical review of deep learning-based key information extraction methods; Section 3 introduces the model and key modules proposed in this paper; Section 4 presents the experimental setup and result analysis, validated through ablation experiments and case studies; Section 5 concludes the paper, summarizing the work; and Section 6 offers a discussion, analyzing the advantages and disadvantages of the method and outlining future research directions.

2. Related Work

Research in key information extraction primarily focuses on three directions: (1) enhancing semantic representation through multi-level feature fusion; (2) utilizing lightweight pre-trained language models to balance performance and inference cost; (3) incorporating sequence dependency modeling and label consistency constraints to improve the stability of extraction results.

2.1. Multi-Level Feature Fusion to Enhance Semantic Representation

In the development of Named Entity Recognition (NER), early research primarily relied on manual feature engineering and statistical learning methods, such as Conditional Random Fields (CRFs), which improved recognition performance by combining features like word forms, part-of-speech, and dictionaries. With the rise of deep learning, researchers began to automatically learn features through neural networks and integrate multiple information sources to enhance representation capabilities. Li et al. reviewed the evolution from feature engineering to modern neural models, emphasizing the importance of integrating multiple input representations, such as character-level, word-level, and contextual embeddings, for improving NER performance [11]. Particularly in specialized fields like biomedicine, methods that combine context-aware embeddings with local character and lexical features have significantly improved recognition performance [12].

With the advent of deep neural networks, sequence-based model architectures became the core framework for NER. To further enhance model performance, researchers designed multi-level and multimodal feature fusion mechanisms. Shi et al. proposed a multi-layer semantic fusion network that combines morphological, character, word, and syntactic features and integrates syntactic dependencies via Graph Neural Networks (GNNs), which effectively improved NER accuracy for Chinese medical texts [13]. Ke et al. proposed a multi-feature fusion approach based on glyph, pinyin, and character information, which, when combined with pre-trained models, significantly enhanced the expression of character semantics and performed excellently in nested entity recognition tasks [14].

With the development of pre-trained language models, NER research entered a new phase, integrating powerful contextual representations with other features. A typical approach combines pre-trained models with attention mechanisms and structured features. For example, Zheng et al. proposed a biomedical NER model that integrates multiple rounds of cross-attention fusion, enhancing the model’s ability to focus on key semantic information [15]. In the field of Chinese NER, Guo et al. proposed a deep neural network model with a multi-network structure that fuses word vectors and multi-scale lexical features, significantly improving NER performance [16]. Additionally, Tan et al. proposed a multi-source fusion model based on BERT, BiLSTM, and CRFs which combined character shapes and dictionary features to significantly improve the F1 score of Chinese electronic medical record NER [17]. Wang [18] developed a GNN-based text classification framework (specifically BERT-GCN) to identify fire-door defects. This model leverages the synergy between Transformer’s global context and GCNs’ structural modeling, leading to substantial improvements in robustness and feature representation within the domain of industrial safety inspection. Zhang et al. proposed a dialog model that integrates role embeddings to capture identity information between interlocutors, and by combining multiple self-supervised learning tasks it significantly improves the accuracy of Named Entity Recognition in complex conversational scenarios [19]. Overall, in increasingly complex tasks such as Multimodal Named Entity Recognition (MNER), the mere stacking of coarse-grained features has become insufficient to meet performance requirements. Consequently, excavating the fine-grained semantic alignment between crossmodal semantic units—such as textual tokens and visual objects—and utilizing a unified multimodal graph structure to model their intricate internal interaction logic has emerged as a critical pathway for further optimizing multimodal representation learning. While multi-level feature fusion has significantly enhanced performance by integrating multi-dimensional semantics, major findings suggest that existing mechanisms rely heavily on coarse-grained feature stacking or structural constraints, which, despite deepening representation, still face limitations. A critical research gap remains: these methods often introduce redundant noise and exacerbate ambiguity when processing information-dense, highly abbreviated, or tightly coupled professional texts, leading to boundary drift and category confusion. To address this, this paper introduces a role-awareness module that identifies and mitigates redundant information through dynamic modeling, enabling the precise extraction of key elements in complex environments. While ensuring efficiency, this approach significantly bolsters model robustness against non-standard texts and improves both the accuracy and stability of entity recognition

2.2. Lightweight Pre-Trained Language Models

In recent years, pre-trained language models have achieved revolutionary breakthroughs in the field of natural language processing. However, their massive parameter scales and high computational costs have constrained their deployment in resource-constrained environments, making model compression and lightweight modeling pivotal research directions. Dantas et al. [20] systematically evaluated various model compression techniques, such as pruning, distillation, and quantization, from the perspectives of efficiency enhancement and energy reduction, providing a comprehensive framework and evaluation criteria for PLM compression. Building upon this, within industrial applications, Wang [21] conducted a systematic evaluation of several lightweight Transformer variants, including RoBERTa, ALBERT, and DistilBERT, demonstrating that optimized lightweight architectures could achieve recognition accuracy comparable to large-scale models in complex text classification tasks, such as identifying building safety defects, while significantly reducing computational costs. Furthermore, Kong et al. [22] proposed a hierarchical BERT model with an adaptive fine-tuning strategy, which effectively integrates the global semantics of long documents through a hierarchical attention mechanism and optimizes fine-tuning efficiency via a dynamic layer selection strategy. In addition, Silva Barbon et al. [23] investigated the transfer learning efficacy of BERT, DistilBERT, and their language-specific versions for automated text classification in multilingual environments, validating that lightweight models can significantly enhance inference efficiency for domain-specific tasks while maintaining robust multilingual semantic representation capabilities.

To further improve the balance between model compression and inference efficiency, distillation and lightweight strategies have continuously evolved. The classic DistilBERT model introduced distillation during the pre-training phase, significantly reducing model parameters and improving inference speed while maintaining high performance across various NLP tasks [24]. In contrast, task-specific adaptive distillation methods guide the student model to learn important information by aggregating knowledge from internal layers, thereby improving its stability and generalization across different tasks [25]. Domain-specific lightweight strategies have also significantly enhanced model practicality. For example, in clinical NLP tasks, lightweight Transformer models trained with knowledge distillation and continual learning achieved performance comparable to large models in Named Entity Recognition and relation extraction tasks, with a substantial reduction in parameters, showcasing the potential of lightweight strategies in real-world applications [26].

From a macro perspective, lightweight pre-trained language model research is shifting from single compression strategies to multi-strategy integration. Survey studies on compression technologies provide comprehensive support for theoretical frameworks and technical choices, enabling researchers to make informed comparisons and combinations between pruning, distillation, and neural architecture search (NAS) methods [27]. Additionally, for specific languages or domains, customized lightweight models (e.g., small distillation models for low-resource languages) have demonstrated strong transferability and compression effects. These models optimize distillation strategies and reduce parameters to achieve performance that is close to or even surpasses that of the teacher models [28]. Recent applied research shows that lightweight large-scale language models (LLMs) based on knowledge distillation can significantly reduce memory consumption and shorten response times, providing clear pathways for real-world applications [29]. Overall, lightweight modeling has become a key trend driving the widespread application of AI models in edge computing and mobile scenarios, and it will continue to evolve to meet more practical deployment needs.

In summary, lightweight modeling research—leveraging techniques such as distillation and pruning—has achieved significant breakthroughs in balancing inference overhead with domain-specific performance, confirming the practical potential of lightweight models in edge computing scenarios. Meanwhile, developments based on Large Language Models (LLMs) and Transformer architectures have also demonstrated advantages in reducing memory consumption and response times. However, a research gap still remains: when processing complex instructions characterized by dense terminology and highly abbreviated expressions, overly compressing models or relying solely on parameter-heavy generic models often leads to the loss of deep representation capabilities. This, in turn, impairs the model’s recognition and fine-grained differentiation of professional texts. To address this issue, this paper adopts the lightweight TinyBERT encoder and introduces a task-structure-aligned method to compensate for the representation loss caused by compression. This approach effectively enhances the model’s stability and accuracy in complex tasks while ensuring efficient inference.

2.3. Sequence Dependency Modeling and Label Consistency Constraints

Named Entity Recognition (NER), as a fundamental task in natural language processing, has long faced the challenge of effectively capturing label dependencies and consistency constraints in sequences. Early approaches, such as Conditional Random Fields (CRFs), addressed the issue of modeling dependencies between labels and achieved significant success. For example, Munoz-Ortiz et al. [30] demonstrated that by treating entity recognition as a single-pass sequence labeling task, CRFs can effectively mitigate boundary ambiguity in nested entities, maintaining the logical rigor of the label chain through global optimal path decoding. Similarly, Jeong et al. [31] emphasized in their study of document-level NER that since sentence-level models often overlook long-range context, incorporating a consistency enhancement mechanism alongside a CRF structure can significantly reduce instances of conflicting labels assigned to the same entity within a document. Models based on LSTM and CRFs have also been widely used. Huang et al. proposed a multi-task learning approach that shared information across tasks, strengthening label consistency modeling and improving entity recognition [32].

With the development of deep learning, more and more neural network models have been introduced to NER tasks, demonstrating powerful sequence modeling capabilities. The BiLSTM-CRF architecture, by combining bidirectional LSTM and CRFs, effectively captures contextual information in text and models label dependencies, showing excellent performance in complex tasks such as medical and legal texts [33]. Furthermore, models incorporating domain knowledge graph fusion—such as the BiGRU-CRF architecture proposed by Fan et al. [34]—leverage global label consistency constraints to significantly enhance both the precision and robustness of entity recognition. For medical and biomedical texts, the integration of shared layers and task-specific layers within a BiLSTM-CRF framework for multi-task transfer learning has effectively enhanced recognition accuracy [35]. Further improvements in CRF models, such as the introduction of higher-order label transition factors, have also enhanced entity boundary precision [36].

As research in NER progresses, models combining sequence dependencies and label consistency constraints have evolved into more complex strategies. Particularly with the aid of multi-feature fusion and knowledge graphs, the accuracy and robustness of entity recognition have been further enhanced. Studies indicate that structure consistency training methods help maintain label prediction consistency, which is crucial for small-sample and complex domain tasks [37]. For example, Wei et al. proposed an entity recognition model for power equipment maintenance texts that combined location and similarity-enhanced attention structures with CRFs, improving performance in industry-specific texts [38]. Chen et al. introduced a knowledge graph-enhanced Graph Convolutional Network (KGGCN) for Chinese NER, which integrated knowledge graph information and graph convolutional architectures to enhance label consistency and sequence dependency modeling, achieving high recognition performance on multiple Chinese NER benchmark datasets [39].

In summary, major findings indicate that the introduction of CRF global optimal path decoding and Knowledge Graph-enhanced Graph Convolutional Networks (KGGCNs) for deep modeling of long-range label dependencies and global consistency has significantly bolstered the logical rigor and precision of entity recognition in complex industrial texts. However, a critical research gap remains: when processing professional texts characterized by high information density, frequent abbreviations, or tightly coupled logic, local prediction errors can still lead to entity mismatches, omissions, or inconsistent label paths. To address this issue, this paper combines bidirectional sequence modeling with CRF global decoding, further leveraging global sequence optimization to ensure consistency between label paths and boundary predictions. This approach effectively enhances the stability and interpretability of the extraction results.

To provide a direct and professional comparison of the technical features across the aforementioned research directions, Table 1 systematically summarizes relevant research findings in dimensions such as representative technologies, advantages, and limitations

3. Methodology

Figure 2 illustrates the overall architecture of our proposed TR-BiGRU-CRF, a joint key information extraction framework based on a lightweight pre-trained language model (TinyBERT) and a role-aware mechanism. The model takes flight crew instruction text as input, first obtaining contextual semantic representations through TinyBERT. In the role-aware interaction module, speaker identity information is introduced as meta-information input, which is fused with the text representation to form role-conditioned representations. This module utilizes multi-head attention to model cross-position dependencies, enhancing the model’s ability to capture global semantic associations. Subsequently, the fused feature sequence is fed into BiGRU for bidirectional encoding, capturing sequence context dependencies and extracting discriminative clues related to key information. During the decoding stage, CRFs are used to model label transitions and apply global consistency constraints, thereby outputting the optimal label sequence and reducing illegal label paths. During training, a sub-center loss is introduced into the feature representation space as auxiliary supervision, optimized jointly with the CRF objective to enlarge inter-class margins and reduce intra-class dispersion, further enhancing class separability and prediction stability. Overall, the architecture mitigates the issues of multi-role semantic ambiguity and boundary drift through collaborative role-conditioned representations and structured decoding, thereby improving key information extraction performance in complex flight crew communication scenarios.

3.1. TinyBert Module

The TinyBERT module forms the core encoding component of the model architecture proposed in this paper. As a lightweight pre-trained language model developed based on knowledge distillation techniques, TinyBERT aims to retain the dense semantic knowledge and generalization capabilities of the BERT model while significantly optimizing computational efficiency. This feature is of significant advantage in civil aviation business scenarios, as it can maintain low latency and high throughput in resource-constrained device environments, thus meeting the real-time processing requirements for flight crew operation instructions. In this study, the TinyBERT module is responsible for providing high-quality, context-aware representations for the key information extraction task of civil aviation flight crew operation instructions, thereby supporting the precise identification of core semantic slots such as operational actions, flight parameters, system equipment, and status feedback. The specific algorithmic structure is illustrated below.

Algorithm 1 illustrates the algorithmic logic of the TinyBERT module. In this process, the input instruction

X_{t e x t} \in R^{B \times T}

, serving as a tensor of tokenized integer indices, is mapped through the embedding layer and accumulated to obtain the initial input matrix

X^{(0)} \in R^{B \times T \times d}

. Its values are determined by

E_{s u m}

, the sum of the word, segment, and position embedding vectors. The dimensions

B

and

T

are defined by the batch size and sequence length, respectively.

Algorithm 1 TinyBERT Network

1 : procedure TINYBERT_Network (X_{t e x t}, Θ)

2 : E_{s u m} \leftarrow EmbeddingLayer (X_{t e x t}, Θ_{e m b})

3 : X^{(0)} \leftarrow E_{s u m}

4 : for l = 1

to L

do

5 : Z^{(l)} \leftarrow MultiHeadAttention (X^{(l - 1)})

6 : H^{(l)} \leftarrow LayerNorm (Z^{(l)} + X^{(l - 1)})

7 : O^{(l)} \leftarrow FeedForwardNetwork (H^{(l)})

8 : X^{(l)} \leftarrow LayerNorm (O^{(l)} + H^{(l)})

9: end for

10 : T_{T i n y B E R T} \leftarrow {(X^{(L)})}^{⊺}

11 : return T_{T i n y B E R T}

12: end procedure

Subsequently, the sequence enters a feature extraction chain composed of

L

stacked Transformer Encoder Blocks. This chain extracts intermediate features

Z^{(l)}

via a multi-head self-attention mechanism and employs the mathematical operator

Z^{(l)} + X^{(l - 1)}

to perform a residual connection, generating the normalized state

H^{(l)}

. Building upon this, the feature

O^{(l)}

, processed through a feedforward neural network, undergoes a second residual calculation to output the final state

X^{(l)}

of that layer.

Throughout this process, the model achieves a feature mapping from discrete instruction tokens to a high-dimensional semantic space. The hidden representation

T_{i}

of each token corresponds to a column vector in

R^{d}

, representing the numerical mapping of that token after fusing global contextual information. Finally, the model arranges these token vectors sequentially via a transpose operation to output the feature matrix

T_{T i n y B E R T} \in R^{B \times D \times T}

, providing a solid computational foundation for subsequent sequence labeling tasks

3.2. Role-Aware Fusion

The Role-Aware Fusion (RAF) module proposed in this paper is a targeted enhancement based on the standard TinyBERT architecture. Unlike traditional pre-trained models that rely on a single encoding stream, the RAF module employs a dual-stream parallel processing mechanism, aiming to capture both global semantic dependencies and explicit speaker role constraints. This module consists of a context-aware branch, a role-aware branch, and a feature fusion unit, and is designed to effectively meet the practical needs of eliminating semantic ambiguity in civil aviation instruction parsing.

To simultaneously capture deep semantic relationships and pragmatic role information, the RAF module is designed to meet the core requirements of civil aviation instruction parsing by leveraging global contextual features from TinyBERT for syntactic structure and long-distance dependencies, while incorporating speaker identity for personalized role constraints. As illustrated in Figure 3, the module employs a dual-stream parallel architecture consisting of a context-aware branch and a role-aware branch. The left branch utilizes multi-head attention with residual connections to enhance long-range context modeling while preserving TinyBERT’s knowledge representation, whereas the right role-aware branch employs a role mapping and synchronization mechanism to transform discrete identity labels into guidance signals aligned with every token. The core value of this design lies in providing a clear basis for identity discrimination; for instance, the model can identify differences in the expression of true intentions between the Captain and the co-pilot within similar contexts, facilitating the filtering and calibration of semantics to effectively eliminate ambiguity in highly abbreviated instructions. Finally, features from both branches undergo LayerNorm and Channel Concatenation to generate a unified representation with identity constraints, laying a solid foundation for downstream BiGRU sequence modeling.

The detailed computational process of the RAF module consists of multiple progressive steps, which progressively extract and fuse multi-scale contextual and identity features. First, the first branch of the RAF focuses on processing the externally input role identity information. Considering that role information exists in the form of discrete labels, the model first transforms this information into a dense distributional representation via linear mapping. Let the input role one-hot vector be

r \in R^{B \times C}

, where

B

denotes the batch size and represents the number of role categories. We map this vector into a low-dimensional real-valued vector

e_{role}

using a learnable weight matrix

W_{r} \in R^{d_{role} \times K}

and a bias term

b_{r}

:

e_{r o l e} = W_{r} r + b_{r}

(1)

In the feature alignment stage, to address the dynamic text sequence features output by TinyBERT, the model employs a broadcasting mechanism to expand the static role identity features along the time axis. This process constructs a role feature matrix that is strictly aligned with the text features in the temporal dimension. Subsequently, layer normalization is introduced following the broadcasting output. This operation aims to map features from heterogeneous modalities into a unified numerical distribution space. This not only ensures the balance of weights during the multimodal fusion process—preventing any specific feature dimension from dominating gradient updates—but also significantly enhances the training stability and convergence efficiency of the model.

E = Broadcast (e_{r o l e}, T)

(2)

H_{r o l e} = LayerNorm (E)

(3)

where

H_{role}

is the normalized role feature sequence, which provides numerically stable identity priors for subsequent fusion. The second branch of RAF receives the contextual output sequence

X_{in} = {x_{1}, x_{2}, \dots, x_{T}} \in R^{T \times d_{model}}

from the TinyBERT module. To further capture long-distance dependencies within the instruction and focus on key semantic units, this branch employs a multi-head attention mechanism, supplemented by residual connections and layer normalization. For the input sequence

X_{in}

, the model maps it to query (Query), key (Key), and value (Value) matrices. For the

i

-th attention head (

i = 1, \dots, h

), the computation process is as follows:

Q_{i} = X_{i n} W_{Q}^{i}, K_{i} = X_{i n} W_{K}^{i}, V_{i} = X_{i n} W_{V}^{i}

(4)

{head}_{i} = Softmax (\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}) V_{i}

(5)

where

W_{Q}^{i}, W_{K}^{i}, W_{V}^{i} \in R^{d_{model} \times d_{k}}

are the projection parameter matrices and

d_{k}

is the scaling factor. The multi-head mechanism enables the model to simultaneously attend to diverse information facets across different representation subspaces. To formally aggregate the features captured by individual heads, the output of each head is concatenated and subsequently projected through a learnable weight matrix

W^{O}

to obtain the fused feature representation. To preserve the original contextual information from TinyBERT and prevent gradient vanishing or feature degradation in deep networks, we introduce a residual connection after the attention output, followed by normalization, yielding the final output

H_{ctx}

of the context branch.

\tilde{X} = Concat ({head}_{1}, \dots, {head}_{h}) W_{O}

(6)

H_{c t x} = LayerNorm (X_{i n} + Dropout (\tilde{X}))

(7)

Here,

H_{ctx} \in R^{T \times d_{model}}

integrates both the global contextual attention information and maintains the stability of the underlying semantics through the residual structure. Finally, the RAF module integrates the processed context features

H_{c t x}

with the normalized role features

H_{r o l e}

via the Concat function, which performs a head-to-tail concatenation of these two heterogeneous vectors along the feature dimension. This operation achieves a deep fusion of semantic information and identity features, thereby generating the final role-aware feature representation

X_{R A F}

.

X_{R A F} = Concat (H_{c t x}, H_{r o l e})

(8)

The output feature

X_{RAF} \in R^{B \times (d + d_{r}) \times T}

, representing the batch size, fused dimension, and sequence length, not only contains fine-grained semantic dependencies but also explicitly integrates the numerically stable role identity vector. This feature is then fed as input into the BiGRU module, enabling the sequence labeling network to dynamically adjust the prediction probabilities for key terms based on the speaker’s identity, significantly improving the robustness of intent recognition in complex interaction scenarios.

3.3. BiGRU-CRF Structure

BiGRU, through its bidirectional propagation mechanism, is capable of simultaneously capturing contextual information from both the preceding and following contexts in the sequence. This addresses the shortcomings of traditional RNNs in handling long-term dependencies. Its bidirectional structure is particularly well-suited for processing complex sequential data, enhancing the model’s ability to understand long-distance dependencies and context. When combined with the role-aware module, BiGRU further improves the model’s precision in capturing key information, especially when processing complex instructions, as it effectively focuses on the contextual relationships of important parts within the input sequence. With the help of forward and backward propagation paths, the model accurately captures forward sequence dependencies and deeply mines backward sequence association information. Additionally, the introduction of CRFs for structured decoding models the transition constraints between adjacent labels, enabling a global optimal path search, thus improving the consistency of the label sequence and the stability of boundary predictions. This further enhances the model’s contextual understanding and the accuracy of instruction recognition. The structure is shown in the Figure 4:

The forward GRU processes

X_{RAF}

in sequential time order, capturing the associations between the current moment and previous features, generating hidden states that reflect the forward sequence dependencies. The equation is as follows:

h_{t}^{f o r w a r d} = GRU (X_{R A F, t}, h_{t - 1}^{f o r w a r d})

(9)

where

h_{t}^{forward}

is the forward hidden state at time step

t

,

h_{t - 1}^{forward}

is the forward hidden state at the previous time step, and

X_{RAF, t}

is the role-aware feature at the current time step.

The backward GRU processes

X_{RAF}

in reverse time order, capturing the associations between the current moment and subsequent features, generating hidden states that reflect the backward sequence dependencies. The equation is as follows:

h_{t}^{b a c k w a r d} = GRU (X_{R A F, t}, h_{t + 1}^{b a c k w a r d})

(10)

where

h_{t}^{backward}

is the backward hidden state at time step

t

,

h_{t + 1}^{backward}

is the backward hidden state at the next time step, and

X_{RAF, t}

is consistent with the input used in the forward computation.

To fuse the semantic information from both the forward and backward hidden states, the concat function is applied to concatenate the two types of hidden states along the feature dimension, generating a fused feature that contains bidirectional sequence information:

F_{t} = concat (h_{t}^{f o r w a r d}, h_{t}^{b a c k w a r d})

(11)

where

F_{t}

is the bidirectional fused feature at time step

t

and concat is the feature concatenation function along the feature dimension.

Subsequently, using the bidirectional fused feature

F_{t}

generated by BiGRU as the input, the CRF layer performs inference and prediction of the corresponding label sequence for the input sequence. During the inference phase, the trained and converged model computes the conditional probability distribution of the output sequence based on the feature representation of the input sequence. Finally, the Viterbi algorithm is used to execute the decoding process, generating the key information labeling results for the flight crew instructions that are aligned with the input text sequence. The specific computation formula is as follows:

Score (y, F) = \sum_{t = 1}^{T} (T_{mat, y_{t - 1}, y_{t}} + F_{t, y_{t}})

(12)

where

Score (y, F)

is the CRF global scoring function;

y = y_{1}, y_{2}, \dots, y_{T}

is the label sequence to be predicted;

T_{mat, y_{t - 1}, y_{t}}

is the transition score from label

y_{t - 1}

to

y_{t}

(where

T_{mat}

is the transition matrix);

F_{t, y_{t}}

is the score of the fused feature assigned to label

y_{t}

at time step

t

; and

T

is the total length of the flight crew instruction sequence. The final output of the CRF layer is a prediction matrix

\hat{y} \in {1 \dots K}^{B \times T}

with dimensions

B \times T

. Each element

\hat{y_{b, t}}

represents the optimal category index predicted for the

b

-th sequence at time step

t

, selected from a label set containing

K

distinct states

4. Results

4.1. Experimental Dataset

The data collection for this study was conducted using the IPT integrated procedure trainer at the Guanghan campus of the Civil Aviation Flight University of China, equipped with the Dutch-imported AXA320 FTD static simulator, certified by the Civil Aviation Administration of China (CAAC). The objective was to reproduce the aircraft’s real operating conditions with high fidelity. Following the standardized operation procedures of the CAAC, data collection covered flight simulation tasks on fifty different routes, focusing on capturing the crew interaction speech during key flight phases such as climb, cruise, and approach. The recording tasks were carried out by flight trainees and instructors holding valid pilot licenses, with dialogs covering core flight stages, including pre-start preparation, post-start checks, taxiing, post-takeoff checks, air traffic control communications, approach, and post-landing checks.

Regarding data annotation, this study strictly adhered to the ‘Civil Aviation Radio Communication Terminology’ standards (MH/T 4014-2003) [40] issued by the CAAC, alongside the air traffic management procedures and telecommunications standards of the International Civil Aviation Organization (ICAO). The collected voice signals were processed, recognized, and converted into text data for database storage. To ensure professional accuracy and technical rigor, the annotation task was performed by four flight trainees holding valid aviation licenses, under the direct supervision of a senior flight instructor. The process was conducted in two distinct stages: first, the four annotators independently labeled the utterances using the BIO schema on the doccano platform; subsequently, the flight instructor conducted a cross-verification of all labeled results and resolved any ambiguous tags through final arbitration. This systematic approach establishes a solid data foundation for subsequent model training and feature analysis. The specific process is shown in Figure 5 and Figure 6.

The dataset constructed for this study comprises 8564 cleaned and denoised flight crew command entries, totaling 78,789 tokens with an average length of nine words per command. The distribution of entity categories exhibits distinct task-driven characteristics, with System Equipment accounting for 31%, Action for 27%, Flight Parameters for 24%, and Feedback for 18% of the total entities. As shown in Table 2. To ensure the stability of model training and the objectivity of evaluation, the dataset was divided into training, testing, and validation sets with a ratio of approximately 7:2:1. Specifically, the training set accounts for 69.6%, the testing set for 20.7%, and the validation set for 9.8%. The balanced distribution across these subsets provides robust support for the model’s generalization capability.

4.2. Experimental Configuration

All experiments in this study were conducted on a high-performance computing platform to ensure the efficiency of model training and the reproducibility of results. In terms of hardware configuration, the experimental environment was equipped with a 16-core AMD EPYC 9354 processor and 60.1 GB of RAM, providing ample computational power for large-scale data preprocessing. To accelerate the training and inference process of deep learning models, the system was equipped with an NVIDIA GeForce RTX 4090 GPU, with 25.2 GB of VRAM, effectively meeting the computational demands for high-dimensional feature extraction and complex network structures. Additionally, the system was configured with 751.6 GB of high-speed storage to ensure the read and write efficiency of large datasets. In terms of software, the experiments were conducted on the Windows 11 operating system, using Python 3.9 as the programming language. The models were built and deployed based on the PyTorch 2.4.0 deep learning framework. This synergistic software and hardware environment provided a robust foundation for the large-scale data training and model validation in this study.

Building upon this experimental platform, the hyperparameters listed in Table 3 were determined by referencing the classic empirical configurations of mainstream pre-trained models, and further adapted based on the specific characteristics of the aviation instruction data alongside the model’s actual performance on the validation set. Specifically, the maximum sequence length was set to 128, which completely encompasses the actual maximum length of the short flight crew instructions within the dataset, thereby effectively avoiding unnecessary computational overhead. The adoption of a six-layer Transformer architecture aimed to ensure the precision of deep semantic feature extraction while satisfying the low-latency deployment requirements inherent to aviation scenarios. Furthermore, training parameters—including the learning rate, batch size, and dropout rate—were established based on standard fine-tuning setups. Their exceptional convergence stability and effectiveness in preventing overfitting were subsequently validated on our validation set.

4.3. Performance Evaluation Metrics

During the model training process, we used the AdamW optimizer to ensure the stability of parameter updates and set the initial learning rate to

3 \times 10^{- 5}

to accommodate the fine-tuning requirements of the TinyBERT pre-trained model. Based on the GPU memory advantages of the experimental platform, the batch size was set to 32 to balance computational efficiency with the randomness of gradient descent. To enhance the model’s generalization ability and mitigate overfitting, the training process was set to 40 epochs, with early stopping applied based on the F1 score on the validation set to preserve the optimal model weights.

In the key information extraction task for civil aviation flight crew interaction instructions, four core metrics were selected to comprehensively evaluate the model’s overall performance: precision, recall, F1 score, and accuracy. Through a comprehensive analysis of these four metrics, this study is able to fully assess the model’s performance in extracting key information from crew interaction instructions and provide quantitative support and technical assistance for optimizing real-time monitoring capabilities in aviation command systems.

Precision is mainly used to quantify the confidence of the model’s prediction results, and its calculation formula is defined as follows:

Precision = \frac{T P}{T P + F P}

(13)

In the formula, TP represents true positives, i.e., the number of samples correctly identified as positive by the model; FP represents false positives, i.e., the number of negative samples incorrectly labeled as positive by the model. The higher this metric, the lower the model’s misclassification rate of negative samples, indicating stronger resistance to noise interference.

Recall focuses on evaluating the model’s coverage or ability to retrieve the target information, and its calculation formula is defined as follows:

Recall = \frac{T P}{T P + F N}

(14)

In the formula, FN represents false negatives, i.e., the number of true positive samples that were missed and not identified by the model. A high recall indicates that the model missed as few relevant pieces of information as possible, reflecting its ability to capture positive samples.

The F1 score is the harmonic mean of precision and recall, providing a comprehensive evaluation perspective. Its calculation formula is defined as follows:

F 1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(15)

Since precision and recall often have a trade-off in practical tasks, the F1 score provides a more balanced evaluation perspective. The higher the F1 score, the better the model has achieved a balance between precision and recall, reflecting the model’s overall robustness in handling complex features.

Accuracy measures the overall proportion of correct classifications made by the model across all samples, and its calculation formula is defined as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(16)

In the formula, TN represents true negatives, i.e., the number of samples correctly identified as negative by the model. This metric encompasses both the model’s correct identification of positive samples and its correct exclusion of negative samples. As an intuitive metric for measuring the model’s global judgment ability, accuracy reflects the effectiveness of the classifier on the overall dataset.

4.4. Ablation Experiment

This study uses the BiGRU-CRF model as the baseline model and optimizes its architecture based on the task characteristics. To comprehensively evaluate the effects of these adjustments, ablation experiments for each enhancement module were conducted. Table 4 presents a quantitative comparison of the detection performance with different optimization strategies.

The experimental results demonstrate that, first, after incorporating the role-aware module, the model’s F1 score increased from 0.801 to 0.836, marking an improvement of 3.5%. This indicates that clarifying the speaker’s identity information effectively assists the model in disambiguating dialog meanings, thereby avoiding intent misinterpretation caused by different roles using the same terminology during crew communications. Second, the introduction of the lightweight TinyBERT model further elevated the F1 score to 0.868, yielding a 6.7% increase over the baseline model. Relying on its robust language understanding capabilities, the model can accurately identify dense civil aviation abbreviations within context, overcoming the limitations of traditional methods in processing complex domain-specific vocabulary. Finally, the complete TR-BiGRU-CRF model effectively integrates these advantages to achieve optimal performance, with an accuracy of 0.895 and an F1 score of 0.888. The overall F1 score is improved by 8.7% compared to the baseline. Notably, the model’s high recall rate of 0.885 minimizes the omission of key operational commands, effectively reducing the risk of missed detections for critical instructions. This fully substantiates the reliability and practicality of the proposed architecture in complex aviation scenarios.

4.5. Statistical Significance Analysis

To verify the robustness of performance improvements and mitigate the impact of stochastic factors, this section introduces statistical significance analysis to complement the ablation study. Given the requirement to evaluate paired sample differences on the test set, McNemar’s Test is employed to provide rigorous statistical validation of the effectiveness and robustness of the TR-BiGRU-CRF architecture and its core components

As illustrated in Table 5, TR-BiGRU-CRF demonstrates a superior capability to rectify baseline errors across all comparison groups. In the evaluation against the foundational BiGRU-CRF, the result of

p < 0.001

indicates that the performance improvement of the proposed model is statistically highly significant. Furthermore,

p

values below 0.05 for Group 2 and Group 3 provide statistical evidence for the synergistic effect between the lightweight encoder and the role-aware module, mitigating the potential impact of random initialization and data variance on the experimental conclusions. These findings suggest that the proposed architecture effectively enhances feature extraction capabilities within complex aviation contexts.

4.6. Comparative Performance Analysis of Different Models

To systematically evaluate the performance advantages of the proposed TR-BiGRU-CRF model compared to existing mainstream methods, we constructed a standardized comparative evaluation framework based on the civil aviation flight crew interaction instruction dataset. Under identical experimental conditions, this study trained and evaluated ten different sequence labeling models with varying architectures. These models include: the traditional baselines BiLSTM-CRF and BiGRU-CRF; strong baselines integrating large-scale pre-trained language models such as BERT-CRF, BERT-BiLSTM-CRF, BERT-BiGRU-CRF, RoBERTa-BiGRU-CRF, and DeBERTa-BiGRU-CRF; and lightweight and attention-based models including TinyBERT-BiGRU-CRF and TinyBERT-Att-BiGRU-CRF. To comprehensively assess the comprehensive performance of each model in terms of accuracy, lightweight characteristics, and real-time efficiency, six core metrics—accuracy, precision, recall, F1 score, storage size, and inference time—were utilized for evaluation. Table 5 presents the detailed quantitative comparison results and performance analysis of the aforementioned models on the test set.

Table 6 provides a detailed quantitative comparison of model performance on the flight crew instruction dataset. Experimental results demonstrate that the proposed TR-BiGRU-CRF model achieves favorable performance across all evaluation metrics, reaching an accuracy of 0.926, precision of 0.922, recall of 0.918, and an F1 score of 0.920. Compared to the baseline BiGRU-CRF model with an F1 score of 0.785 without the pre-training mechanism, TR-BiGRU-CRF shows a significant absolute improvement of 0.135, highlighting the importance of deep semantic representations in handling complex instructions. Further horizontal comparisons reveal that TR-BiGRU-CRF maintains a demonstrable advantage over mainstream large-parameter models such as BERT-BiGRU-CRF with an F1 score of 0.859, and even the larger DeBERTa-BiGRU-CRF with an F1 score of 0.883. Notably, compared to the strongest lightweight baseline TinyBERT-Att-BiGRU-CRF with an F1 score of 0.885, our model’s F1 score is further improved by 3.5 percentage points. This performance gain suggests that the integrated Role-Aware Fusion mechanism plays a positive role in mitigating semantic ambiguity caused by unclear speaker identities in crew communication. Moreover, while maintaining favorable performance, TR-BiGRU-CRF requires only 58.5 MB of storage and an inference latency of 11.2 ms. Compared to the massive parameters of BERT-based models, our model achieves a superior balance between lightweight deployment and real-time performance requirements, which is better suited for actual civil aviation operation scenarios As shown in Figure 7.

4.7. Confusion Matrix

Although the overall evaluation metrics indicate that the TR-BiGRU-CRF model performs well at the macro level, to gain a deeper understanding of its fine-grained classification performance across different entity categories, this paper further analyzes the confusion matrix on the test set. By visualizing the distribution of prediction errors, the matrix reveals the specific confusion patterns of the model when handling certain flight instruction terms. This analysis helps objectively assess the model’s discriminative boundaries in distinguishing key entities such as altitude and heading, providing more specific evidence to support its practical effectiveness in flight safety applications

Based on the confusion matrix in the figure, the model demonstrates relatively accurate prediction ability across most categories. As shown in Figure 8. Specifically, in the Action and Feedback categories, the number of correctly predicted samples is significantly higher, indicating the model’s strong performance in these categories. However, although the model performs well in overall recognition, the confusion matrix reveals that a certain proportion of misclassifications still occurs between specific high-similarity categories, leading to the erroneous merging of adjacent independent commands. For instance, in consecutive action commands such as “Gear down” and “Turn on landing lights,” due to the lack of speech pauses or highly condensed semantic expressions, the model is prone to entity boundary recognition drift when capturing multi-objective concurrent features within extremely short time sequences, thereby omitting critical action parameters of the subsequent command. From the perspective of human–computer interaction, should a pilot inadvertently miss an operation under a high workload and the system fail to provide effective status cross-checking and error-prevention alerts due to the aforementioned classification biases, it would negatively impact the overall reliability of the flight safety monitoring system. These instances objectively reflect that the current model still faces certain bottlenecks in fine-grained semantic feature capture and long-sequence logical segmentation. Consequently, this further clarifies the rational application positioning of the system within an actual cockpit: it should not serve as a singular, absolute safety decision-maker, but rather be defined as an intelligent assistive tool that provides redundant support and cross-validation.

5. Conclusions

This paper proposes a key information extraction method that introduces a role-aware module to address the challenges of key information extraction, role semantic ambiguity, and the diverse expression of complex instructions in civil aviation flight crew commands. The method uses TinyBERT as a lightweight semantic encoder, combining the sequence modeling capabilities of BiGRU and the global decoding constraints of CRFs to jointly model entity boundaries and label dependencies within instruction sequences. Unlike traditional extraction methods that often rely on a single feature or shallow semantic representation, the role-aware module designed in this paper explicitly captures the semantic differences and relationships between different roles (such as crew members, control instruction objects, and action elements) in the instructions. This enhances the model’s understanding and discriminative ability regarding role-conditioned information, significantly improving the accuracy of key information identification in instructions with long spans, strong noise, or multi-element coupling.

To further leverage the multi-level representation information during the encoding phase, this paper introduces a multi-level fusion strategy that dynamically models the interactions between multi-granularity features. This enables effective complementarity between local context and global semantics, thus improving the overall robustness and adaptability of the model. Experimental results demonstrate that after integrating the role-aware module, the proposed model outperforms the baseline methods across all evaluation metrics, particularly exhibiting significant performance gains in the key information extraction task, with a precision of 92.2%, a recall of 91.8%, and an F1 score of 92.0%, alongside an overall prediction accuracy of 92.6%. Ablation studies further confirm the critical role of the role-aware module and multi-level fusion strategy in enhancing system performance. In real-world civil aviation operations, this capability to accurately capture complex, multi-dimensional instruction information can substantially mitigate safety hazards arising from semantic ambiguity or role identification errors. Compared to the traditional single-feature processing paradigm, this research not only improves algorithmic robustness but also provides a reliable technical foundation for intelligent cockpit voice monitoring, human-error prevention, and flight decision support systems. Future work will explore more refined role annotation and constraint mechanisms while further optimizing the model’s generalization ability by incorporating the data distribution characteristics of real-world operational scenarios

6. Discussion

The key information extraction framework based on the role-aware module proposed in this study demonstrates significant performance advantages in the high-noise, highly constrained, and highly condensed expression environment of civil aviation flight crew instructions. Experimental results show that this method can effectively alleviate extraction difficulties caused by role semantic ambiguity, long-distance dependencies of key elements, and command-style omitted expressions in complex instructions. Compared to the baseline model, the method achieved stable improvements across key evaluation metrics, indicating that role-conditioned modeling plays a crucial role in enhancing the separability and consistency of instruction understanding.

From a model mechanism perspective, the core contribution of the role-aware module lies in explicitly strengthening the “role–semantic” alignment relationship, enabling the model to form differentiated representations of different piece of role information during the encoding phase, thereby enhancing the model’s ability to discriminate actions, objects, constraints, and other elements. Compared to traditional sequence labeling methods that rely solely on contextual semantics or lexical features, this module better captures semantic shifts in command fragments with the same form but different meanings or multiple referents under different role conditions, reducing the probability of label confusion and boundary errors. Additionally, the multi-level fusion strategy further enhances the model’s ability to jointly model local short-range clues and global long-range dependencies. By dynamically interacting and integrating features from different levels, the model maintains strong robustness and generalization capabilities even when handling complex commands with multiple instruction segments, nested constraints, or parallel structures.

Ablation experiment results further validate these conclusions: removing the role-aware module leads to more significant performance degradation in samples involving multi-role collaboration or implicit subjects, while lacking the multi-level fusion strategy results in more omissions and mismatches in long instructions or multi-element coupling scenarios. This indicates that the two strategies are complementary: the former improves role semantic separability, while the latter enhances cross-granularity information integration and complex structure adaptation, jointly contributing to the improvement of the model’s performance.

Although this study has achieved promising results in the laboratory evaluation of key information extraction, the limitations and challenges of actual operating environments must be fully considered when deploying and integrating the system into real-world cockpit environments. On one hand, the model’s adaptability to complex interaction scenarios requires further enhancement. Compared to generative Large Language Models (LLMs) pre-trained on massive data or highly efficient Small Language Models (SLMs), it still has inherent limitations. Specifically, when processing cross-seat communications, readback confirmations, and non-standard expressions, defining information boundaries and matching intentions become significantly more complex. On the other hand, extreme conditions in real-world operations impose stringent demands on system stability. Beyond typical physical background noise and communication interference, pilots’ physiological and psychological fluctuations—induced by fatigue, illness, or sudden high-stress situations (e.g., abnormal speech rates or slurred pronunciation)—can severely degrade input quality, consequently compromising the accuracy of the final information extraction. Furthermore, in the event of extreme failures where a pilot is unable to speak or audio equipment malfunctions, a singular input mechanism overly reliant on voice interaction risks total failure. Therefore, in practical deployments, this system must be explicitly positioned as an ‘intelligent assistive and cross-validation tool.’ To ensure absolute flight safety, the system must adhere to aviation standard operating procedures (SOPs) and be tightly integrated with physical backup interfaces, such as instrument buttons or data link communications, thereby mitigating the limitations of relying on a single information source

In summary, although this study has achieved promising results in experimental evaluation, the focus of future research must shift from isolated algorithm testing to real-world cockpit deployment and system integration. On a theoretical level, future work will explore more refined role collaboration mechanisms and consider integrating acoustic features with textual information to enhance the system’s fault tolerance under complex operational constraints. On an applied level, research will directly target practical deployment challenges by evaluating system adaptation strategies under constrained airborne computing resources, as well as exploring the deep cross-fusion of parsed voice commands with existing cockpit systems. These efforts will effectively overcome the limitations of current offline experimental evaluations and genuinely facilitate the safe and reliable deployment of this method within real-world intelligent crew assistance systems.

Author Contributions

Conceptualization, W.P. and Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z. and W.P.; formal analysis, Y.Z.; investigation, Y.Z.; resources, W.P.; data curation, S.C. and C.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, W.P., Y.Z. and Q.Z.; visualization, Y.Z.; supervision, Y.Z., T.L. and Y.W.; project administration, Y.Z.; funding acquisition, W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U2333209), Sichuan Science and Technology Program (2024ZDZX0046), and Sichuan Provincial Civil Aviation Flight Technology and Flight Safety Engineering Technology Research Center (F2024KF20D, GY2024-46E, GY2025-46E).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the. corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

International Air Transport Association. IATA Annual Safety Report—2024; International Air Transport Association: Montreal, QC, Canada, 2024. [Google Scholar]
International Civil Aviation Organization. ICAO Safety Report—2024 Edition; International Civil Aviation Organization: Montreal, QC, Canada, 2024. [Google Scholar]
European Union Aviation Safety Agency. Annual Safety Review 2024; Publications Office: Luxembourg, 2024. [Google Scholar]
Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A Survey on Named Entity Recognition—Datasets, Tools, and Methodologies. Nat. Lang. Process. J. 2023, 3, 100017. [Google Scholar] [CrossRef]
Chen, S.; Pan, W.; Wang, Y.; Chen, S.; Wang, X. Research on the Method of Air Traffic Control Instruction Keyword Extraction Based on the Roberta-Attention-BiLSTM-CRF Model. Aerospace 2025, 12, 376. [Google Scholar] [CrossRef]
Zhu, X.; Li, J.; Liu, Y.; Ma, C.; Wang, W. A Survey on Model Compression for Large Language Models. Trans. Assoc. Comput. Linguist. 2024, 12, 1556–1577. [Google Scholar] [CrossRef]
Xu, Y.; Chen, Y. Attention-Based Interactive Multi-Level Feature Fusion for Named Entity Recognition. Sci. Rep. 2025, 15, 3069. [Google Scholar] [CrossRef] [PubMed]
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4163–4174. [Google Scholar]
Moslemi, A.; Briskina, A.; Dang, Z.; Li, J. A Survey on Knowledge Distillation: Recent Advancements. Mach. Learn. Appl. 2024, 18, 100605. [Google Scholar] [CrossRef]
Zhao, D.; Chen, X.; Chen, Y. Named Entity Recognition for Chinese Texts on Marine Coral Reef Ecosystems Based on the BERT-BiGRU-Att-CRF Model. Appl. Sci. 2024, 14, 5743. [Google Scholar] [CrossRef]
Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2022, 34, 50–70. [Google Scholar] [CrossRef]
Li, M.; Yang, H.; Liu, Y. Biomedical Named Entity Recognition Based on Fusion Multi-Features Embedding. Technol. Health Care 2023, 31, 111–121. [Google Scholar] [CrossRef]
Shi, J.; Sun, M.; Sun, Z.; Li, M.; Gu, Y.; Zhang, W. Multi-Level Semantic Fusion Network for Chinese Medical Named Entity Recognition. J. Biomed. Inform. 2022, 133, 104144. [Google Scholar] [CrossRef] [PubMed]
Ke, X.; Wu, X.; Ou, Z.; Li, B. Chinese Named Entity Recognition Method Based on Multi-Feature Fusion and Biaffine. Complex Intell. Syst. 2024, 10, 6305–6318. [Google Scholar] [CrossRef]
Zheng, D.; Han, R.; Yu, F.; Li, Y. Biomedical Named Entity Recognition Based on Multi-Cross Attention Feature Fusion. PLoS ONE 2024, 19, e0304329. [Google Scholar] [CrossRef]
Guo, Y.; Liu, H.-C.; Liu, F.-J.; Lin, W.-H.; Shao, Q.-S.; Su, J.-S. Chinese Named Entity Recognition with Multi-Network Fusion of Multi-Scale Lexical Information. J. Electron. Sci. Technol. 2024, 22, 100287. [Google Scholar] [CrossRef]
Tan, X. Named Entity Recognition of Electronic Medical Records Based on Multi-Feature Fusion. Front. Comput. Intell. Syst. 2023, 3, 6–10. [Google Scholar] [CrossRef]
Wang, S. Graph Neural Network–Driven Text Classification for Fire-Door Defect Inspection in Pre-Completion Construction. Sci. Rep. 2025, 15, 44382. [Google Scholar] [CrossRef]
Zhang, Z.; Guo, T.; Chen, M. DialogueBERT: A Self-Supervised Learning Based Dialogue Pre-Training Encoder. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management; Virtual Event Queensland Australia; ACM: New York, NY, USA, 2021; pp. 3647–3651. [Google Scholar]
Dantas, P.V.; Cordeiro, L.C.; Junior, W.S.S. A Review of State-of-the-Art Techniques for Large Language Model Compression. Complex Intell. Syst. 2025, 11, 407. [Google Scholar] [CrossRef]
Wang, S. Development of an Automated Transformer-Based Text Analysis Framework for Monitoring Fire Door Defects in Buildings. Sci. Rep. 2025, 15, 43910. [Google Scholar] [CrossRef] [PubMed]
Kong, J.; Wang, J.; Zhang, X. Hierarchical BERT with an Adaptive Fine-Tuning Strategy for Document Classification. Knowl.-Based Syst. 2022, 238, 107872. [Google Scholar] [CrossRef]
Silva Barbon, R.; Akabane, A.T. Towards Transfer Learning Techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for Automatic Text Classification from Different Languages: A Case Study. Sensors 2022, 22, 8184. [Google Scholar] [CrossRef] [PubMed]
Lin, Y.-J.; Chen, K.-Y.; Kao, H.-Y. LAD: Layer-Wise Adaptive Distillation for BERT Model Compression. Sensors 2023, 23, 1483. [Google Scholar] [CrossRef]
Rohanian, O.; Nouriborji, M.; Jauncey, H.; Kouchaki, S.; Nooralahzadeh, F.; Clifton, L.; Merson, L.; Clifton, D.A. ISARIC Clinical Characterisation Group. Lightweight Transformers for Clinical Natural Language Processing. Nat. Lang. Eng. 2024, 30, 887–914. [Google Scholar] [CrossRef]
Singh, P.; De Clercq, O.; Lefever, E. Distilling Monolingual Models from Large Multilingual Transformers. Electronics 2023, 12, 1022. [Google Scholar] [CrossRef]
Kang, D.-H.; Kim, K.-T.; Habibilloh, E.; Chang, W.-D. A Lightweight Model of Learning Common Features in Different Domains for Classification Tasks. Mathematics 2026, 14, 326. [Google Scholar] [CrossRef]
Hong, Y. Design and Evaluation of Knowledge-Distilled LLM for Improving the Efficiency of School Administrative Document Processing. Electronics 2025, 14, 3860. [Google Scholar] [CrossRef]
Lamaakal, I.; Maleh, Y.; El Makkaoui, K.; Ouahbi, I.; Pławiak, P.; Alfarraj, O.; Almousa, M.; Abd El-Latif, A.A. Tiny Language Models for Automation and Control: Overview, Potential Applications, and Future Research Directions. Sensors 2025, 25, 1318. [Google Scholar] [CrossRef] [PubMed]
Muñoz-Ortiz, A.; Vilares, D.; Corro, C.; Gómez-Rodríguez, C. Nested Named Entity Recognition as Single-Pass Sequence Labeling. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025; Association for Computational Linguistics: Suzhou, China, 2025; pp. 9993–10002. [Google Scholar]
Jeong, M.; Kang, J. Consistency Enhancement of Model Prediction on Document-Level Named Entity Recognition. Bioinformatics 2023, 39, btad361. [Google Scholar] [CrossRef] [PubMed]
Huang, W.; Qian, T.; Lyu, C.; Zhang, J.; Jin, G.; Li, Y.; Xu, Y. A Multitask Learning Approach for Named Entity Recognition by Exploiting Sentence-Level Semantics Globally. Electronics 2022, 11, 3048. [Google Scholar] [CrossRef]
An, Y.; Xia, X.; Chen, X.; Wu, F.-X.; Wang, J. Chinese Clinical Named Entity Recognition via Multi-Head Self-Attention Based BiLSTM-CRF. Artif. Intell. Med. 2022, 127, 102282. [Google Scholar] [CrossRef]
Fan, R.; Wang, L.; Yan, J.; Song, W.; Zhu, Y.; Chen, X. Deep Learning-Based Named Entity Recognition and Knowledge Graph Construction for Geological Hazards. ISPRS Int. J. Geo-Inf. 2019, 9, 15. [Google Scholar] [CrossRef]
Pooja, H.; Jagadeesh, M.P.P. A Deep Learning Based Approach for Biomedical Named Entity Recognition Using Multitasking Transfer Learning with BiLSTM, BERT and CRF. SN COMPUT. SN Comput. Sci. 2024, 5, 482. [Google Scholar] [CrossRef]
Wang, T.; Liu, Y.; Liang, C.; Wang, B.; Liu, H. XLNet-CRF: Efficient Named Entity Recognition for Cyber Threat Intelligence with Permutation Language Modeling. Electronics 2025, 14, 3034. [Google Scholar] [CrossRef]
Huang, X.; Li, P.; Wang, Y.; Ren, X.; Zhao, Z.; Li, G. Knowledge Graph-Augmented ERNIE-CNN Method for Risk Assessment in Secondary Power System Operations. Energies 2025, 18, 2104. [Google Scholar] [CrossRef]
Wei, Z.; Qu, S.; Zhao, L.; Shi, Q.; Zhang, C. A Position- and Similarity-Aware Named Entity Recognition Model for Power Equipment Maintenance Work Orders. Sensors 2025, 25, 2062. [Google Scholar] [CrossRef] [PubMed]
Chen, X.; He, L.; Hu, W.; Yi, S. KGGCN: A Unified Knowledge Graph-Enhanced Graph Convolutional Network Framework for Chinese Named Entity Recognition. AI 2025, 6, 290. [Google Scholar] [CrossRef]
MHT4014-2003; Phraseology for Air Traffic Control Radiotelephony. Civil Aviation Administration of China (CAAC): Beijing, China, 2003.

Figure 1. Flight crew operational instruction information extraction framework.

Figure 2. TR-BiGRU-CRF model architecture.

Figure 3. Role-Aware Fusion.

Figure 4. BiGRU-CRF structure.

Figure 5. Experimental data collection environment and process.

Figure 6. Schematic diagram of annotation based on the BIO method.

Figure 7. Comparison of performance data across different models.

Figure 8. Confusion matrix.

Table 1. Comparative analysis of models under different technical approaches.

Category Number	Representative Techniques and Features	Key Datasets	Advantages and Limitations
Multi-Level Feature Fusion	Cross-attention and multi-feature fusion (glyph, pinyin, and lexical)	Biomedicine, Chinese medical, and industrial texts	Enhances representation depth for nested entities but remains sensitive to redundant noise in dense texts
Lightweight Modeling and Inference Optimization	Model compression via pruning, distillation, and quantization	Clinical NLP and edge computing scenarios	Achieves optimal efficiency–performance balance but may lose expressiveness in boundary-ambiguous cases
Sequence Dependency and Label Consistency Constraints	BiLSTM or BiGRU-CRF architectures and KG-enhanced GCNs	Medical, legal, and power equipment maintenance	Ensures global label consistency but lacks robustness against non-standard or informal input

Table 2. Keyword category classification.

Category Number	Keyword Category	Example
1	Action	Set, Check, Open, Retract
2	Flight Parameters	100 knots, Altitude 1200 m, 121.6 MHz
3	System Equipment	Radar, Landing Lights, Thrust Lever
4	Feedback	Confirm, Complete, Normal

Table 3. Hyperparameter settings.

Parameter	Value
Optimizer	AdamW
Learning Rate	$3 \times 1 0^{- 5}$
Batch Size	32
Dropout Rate	0.3
Number of Epochs	40
Hidden Units	768
Layers of Transformer	6
Max Sequence Length	128

Table 4. Ablation experiment data comparison.

Model	Accuracy	Precision	Recall	F1 Score
BiGRU-CRF	0.818	0.805	0.795	0.801
TinyBERT-BiGRU-CRF	0.878	0.872	0.865	0.868
Role-Aware-BiGRU-CRF	0.852	0.841	0.832	0.836
TR-BiGRU-CRF	0.895	0.891	0.885	0.888

Table 5. Results of McNemar’s Test for statistical significance.

Comparison	b (A Correct, B Wrong)	c (A Wrong, B Correct)	p Value	Significance
TR-BiGRU-CRF vs. BiGRU-CRF	98	24	<0.001	Yes
TR-BiGRU-CRF vs. TinyBERT-BiGRU-CRF	38	18	<0.05	Yes
TR-BiGRU-CRF vs. Role-Aware-BiGRU-CRF	62	22	<0.001	Yes

Table 6. Comparison of different models.

Model	Accuracy	Precision	Recall	F1 Score	Storage Size (MB)	Inference Time (ms)
BiLSTM-CRF	0.819	0.792	0.785	0.788	15.2	4.5
BiGRU-CRF	0.812	0.789	0.782	0.785	12.8	3.8
BERT-CRF	0.855	0.851	0.844	0.847	418.5	41.8
BERT-BiLSTM-CRF	0.871	0.865	0.858	0.861	425.6	48.2
BERT-BiGRU-CRF	0.863	0.862	0.856	0.859	423.4	45.7
RoBERTa-BiGRU-CRF	0.885	0.881	0.875	0.878	482.1	47.3
DeBERTa-BiGRU-CRF	0.89	0.885	0.881	0.883	538.6	59.2
TinyBERT-BiGRU-CRF	0.879	0.874	0.868	0.871	55.6	9.2
TinyBERT-Att-BiGRU-CRF	0.892	0.888	0.882	0.885	57.2	10.5
TR-BiGRU-CRF	0.926	0.922	0.918	0.92	58.5	11.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Pan, W.; Zheng, Y.; Wang, Y.; Chen, S.; Zuo, Q.; Luan, T.; Zeng, C. TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions. Appl. Sci. 2026, 16, 3461. https://doi.org/10.3390/app16073461

AMA Style

Pan W, Zheng Y, Wang Y, Chen S, Zuo Q, Luan T, Zeng C. TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions. Applied Sciences. 2026; 16(7):3461. https://doi.org/10.3390/app16073461

Chicago/Turabian Style

Pan, Weijun, Yao Zheng, Yidi Wang, Sheng Chen, Qinghai Zuo, Tian Luan, and Chen Zeng. 2026. "TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions" Applied Sciences 16, no. 7: 3461. https://doi.org/10.3390/app16073461

APA Style

Pan, W., Zheng, Y., Wang, Y., Chen, S., Zuo, Q., Luan, T., & Zeng, C. (2026). TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions. Applied Sciences, 16(7), 3461. https://doi.org/10.3390/app16073461

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

TR-BiGRU-CRF: A Lightweight Key Information Extraction Approach for Civil Aviation Flight Crew Operational Instructions

Abstract

1. Introduction

2. Related Work

2.1. Multi-Level Feature Fusion to Enhance Semantic Representation

2.2. Lightweight Pre-Trained Language Models

2.3. Sequence Dependency Modeling and Label Consistency Constraints

3. Methodology

3.1. TinyBert Module

3.2. Role-Aware Fusion

3.3. BiGRU-CRF Structure

4. Results

4.1. Experimental Dataset

4.2. Experimental Configuration

4.3. Performance Evaluation Metrics

4.4. Ablation Experiment

4.5. Statistical Significance Analysis

4.6. Comparative Performance Analysis of Different Models

4.7. Confusion Matrix

5. Conclusions

6. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI