Article

Aviation-BERT-NER: Named Entity Recognition for Aviation Safety Reports

Aerospace Systems Design Laboratory (ASDL), Georgia Institute of Technology, Atlanta, GA 30332, USA
* Author to whom correspondence should be addressed.
Aerospace 2024, 11(11), 890; https://doi.org/10.3390/aerospace11110890
Submission received: 8 October 2024 / Revised: 22 October 2024 / Accepted: 23 October 2024 / Published: 29 October 2024
(This article belongs to the Special Issue Machine Learning for Aeronautics (2nd Edition))

Abstract

This work introduces Aviation-BERT-NER, a Named Entity Recognition (NER) system tailored for aviation safety reports, building on the Aviation-BERT base model developed at the Georgia Institute of Technology’s Aerospace Systems Design Laboratory. This system integrates aviation domain-specific data, including aircraft types, manufacturers, quantities, and aviation terminology, to identify named entities critical for aviation safety analysis. A key innovation of Aviation-BERT-NER is its template-based approach to fine-tuning, which utilizes structured datasets to generate synthetic training data that mirror the complexity of real-world aviation safety reports. This method significantly improves the model’s generalizability and adaptability, enabling rapid updates and customization to meet evolving domain-specific requirements. The development process involved careful data preparation, including the synthesis of entity types and the generation of labeled datasets through template filling. Testing on real-world narratives from the National Transportation Safety Board (NTSB) database highlighted Aviation-BERT-NER’s robustness, with a precision of 95.34%, recall of 94.62%, and F1 score of 94.78% when evaluated over 50 manually annotated (BIO tagged) paragraphs. This work addresses a critical gap in English language NER models for aviation safety, promising substantial improvements in the analysis and understanding of aviation safety reports.

1. Introduction

Aviation is considered one of the safest forms of transport, with 2024 ranking among the safest years on record [1], primarily because of the industry's proactive and continuous focus on safety. The recent trend in aviation safety increasingly gravitates towards data-driven solutions. Aviation safety data have been used in reactive and proactive [2,3] approaches to detect risk factors using different kinds of models. Broadly, a reactive approach to safety examines past accidents and their causal factors to learn from historical data. A proactive approach focuses on building models that predict the probability of risky events, which can be used to recommend preventive actions.
Natural language processing (NLP) is a fast-growing area in artificial intelligence (AI), where computer models are built to analyze text data [4]. In aviation, the analysis of safety reports, aviation maintenance, and traffic control are the three broad areas where NLP is currently used [5]. For reactive safety analysis, the National Transportation Safety Board (NTSB) Report Database (https://www.ntsb.gov/Pages/AviationQuery.aspx, accessed on 19 March 2024) is invaluable for learning lessons from past incidents and accidents. It is an openly available database that stores accident text narrative data in structured and unstructured form. The Aviation Safety Reporting System (ASRS) (https://asrs.arc.nasa.gov/, accessed on 19 March 2024) is another database that collects unsafe aviation occurrences reported voluntarily [6]. Together, NTSB and ASRS have collected over 2.1 million accident and incident reports [7,8]. These reports carry a wealth of safety information from historical events that can be analyzed to improve aviation safety. Since any manual analysis of these reports is complicated and time-consuming, NLP can serve as a powerful enabler to automatically extract safety data from accident reports for use in data-driven reactive safety analysis. The models built using these analyses can then be used for improving aviation safety as a whole.
Yang and Huang [9] and Liu [10] conducted a detailed review of applications of NLP techniques in aviation safety and air transportation. The evolution of NLP techniques from word-counting techniques like term-frequency, inverse-document-frequency (TF-IDF), and bag-of-words to long short-term memory (LSTM), to more modern context-capturing transformer-based architectures is visible over the last decade. For instance, Perboli et al. [11] used Word2vec and Doc2vec representations of language for semantic similarity in the identification of human factor causes in aviation. Similarly, Madeira et al. [12] used TF-IDF and support vector machines (SVMs) for the prediction of human factors in aviation incident reports. Miyamoto et al. [13] used a bag-of-words approach with TF-IDF to identify operational safety causes behind flight delays using the ASRS database. Zhang et al. [14] used word embeddings and LSTM neural networks for the prognosis of adverse events using the NTSB database.
The state-of-the-art models in NLP use the transformer architecture with the attention mechanism, which allows context to be captured more effectively from textual data [15]. Google AI’s Bidirectional Encoder Representations from Transformers (BERT) is one such transformer-based model that revolutionized NLP through its innovative pre-training of deep bidirectional representations from unlabeled text [16]. Recently, transformer-based models have increasingly been applied to the aviation domain. Kierszbaum and Lapasset [17] applied the BERT base model on a small subset of Aviation Safety Reporting System (ASRS) narratives to answer the question “When did the accident happen?” In another work, Kierszbaum et al. [18] pre-trained a Robustly optimized BERT (RoBERTa) model on a limited ASRS dataset to evaluate its performance on natural language understanding (NLU). Recently, Wang et al. [19] developed AviationGPT on open-source LLaMA-2 and Mistral architectures, which showed superior performance on aviation domain-specific NLP tasks. A comprehensive review of NLP in aviation safety is beyond the scope of the present work. Readers interested in an overview of NLP in aviation are directed to Refs. [5,9,10] and to Ref. [20] for NLP applied to safety reports across different industries.
The present work introduces Aviation-BERT-NER, which builds on a previous aviation domain-specific BERT model developed by Chandra et al. [21]. This tailored base model, Aviation-BERT, incorporated domain-specific data from aviation safety narratives, including those from NTSB and ASRS. By doing so, Aviation-BERT significantly enhanced its performance in aviation text-mining tasks. Aviation-BERT not only captures the unique lexicon and complexities of aviation texts but also surpasses the capabilities of traditional BERT models in extracting valuable insights from aviation safety reports [21]. For instance, Aviation-BERT-Classifiers show improved performance for identifying occurrence categories for NTSB and ASRS text reports [22] compared to the generic English language version of BERT. Aviation-BERT was also shown to outperform other generic English language models more than four times its size in expanding the Aviation-Knowledge Graph (Aviation-KG) [23]. In a parallel effort by Jing et al. [23], the Aviation-KG question answering system, first proposed by Agarwal et al. [24], is being extended to mine and integrate data from structured knowledge graphs, thereby enabling the system to answer complex questions with accuracy. The development of these knowledge graphs involves a thorough process of mapping out subjects, objects, and their interrelations within texts to form structured datasets. However, this detailed process presents its own set of challenges, particularly in accurately extracting entities and their relationships, which are crucial for the model’s effectiveness.
This is where Named Entity Recognition (NER) adds value, particularly in niche domains like aviation. NER aids Knowledge Graph Question Answering (KGQA) systems because it identifies specific aviation-related named entities, such as airport names, aircraft models, and aviation terms. These entities and their relationships can be translated into a Knowledge Graph (KG), which can then be used in a KG-RAG (Retrieval-Augmented Generation) framework for accurately responding to fact-based queries in the aviation field [25,26]. Some of the biggest challenges in developing an English language NER model for aviation safety include developing annotated datasets for model training that can be easily scaled, are generalizable, are flexible to customization, and are representative of real-world datasets. These challenges and the problem definition are discussed further in Section 2.4. The present work proposes a new method for the development of synthetic training datasets that can tackle these issues while also describing the development and performance of the Aviation-BERT-NER model.
The rest of this paper is organized as follows: Section 2 reviews the latest in domain-specific NER models, data preparation, and related works and defines the problem. Section 3 details the data preparation, while Section 4 provides details for the Aviation-BERT-NER model fine-tuning. Section 5 discusses results, while Section 6 concludes the present work.

2. Literature Review and Problem Definition

2.1. Domain-Specific NER Models

There are a variety of specialized NER models tailored to specific domains. For instance, developed in 2023, the FiNER-LFs + Snorkel WMV NER model is designed for the financial sector [27] and focuses on entities like names of people, places, and organizations within financial documents, news articles, and regulatory texts, achieving an F1 score of 79.48%. Similarly, the BioClinicalBERT NER model focuses on the healthcare industry, particularly for parsing electronic health records (EHRs) [28]. It is designed to extract essential clinical information related to oncology, such as diseases, symptoms, treatments, tests, medications, procedures, and other specific cancer-related details with an F1 score of 87.6%.
A few NER models have been designed for aviation safety documents. For example, a customized Chinese IDCNN-BiGRU-CRF model created in 2023 targets the extraction of particular entities from aviation safety reports, including event, company, city, operation, date, aircraft type, personnel, flight number, aircraft registration, and aircraft part, with an F1 score of 97.93% [29]. Another Chinese language model, which uses a multi-feature fusion transformer (MFT), achieves an F1 score of 86.10% and focuses on capturing seven entities more related to the space domain [30]. For English language models, a tailored SpaCy NER model from Singapore identifies and categorizes aviation-specific terms such as airline, aircraft model, aircraft manufacturer, aircraft registration, flight number, departure location, and destination location, achieving an F1 score of 93.73% [31]. Andrade et al. [32] utilize a BERT-based model to extract failure-related entities, such as failure modes, causes, and effects, along with control processes and recommendations, from SAFECOM mishap reports. On five entities of interest, this model achieves an F1 score of 33% (excluding non-entity terms), with the authors acknowledging the need for additional training data. An aerospace requirements-focused model, aeroBERT-NER, is tailored for Model-Based Systems Engineering (MBSE) within the aerospace domain [33] and identifies entities such as system names, organizations, resources, values, and date/time expressions relevant to aerospace with an F1 score of 92%. While this dataset is accessible on Hugging Face, it is important to note the distinct nature of aerospace requirement documents compared to aviation safety reports, as the former does not include specific mentions of aircraft names, airport names, and similar entities.
The use of gazetteers and rule-based matching using syntactic-lexical patterns for NER tasks has shown tremendous promise in structured aviation datasets such as Letters of Agreement [34]. However, the performance of this method is unlikely to hold up for aviation safety datasets with a lot of variability, such as ASRS, where each contributor may have a different writing style.
A wide range of generic NER models is publicly accessible via platforms like GitHub and Hugging Face; however, not all are covered in scholarly articles. For example, the instructNER_ontonotes5_xl model available on Hugging Face is capable of identifying a broad spectrum of entities such as events, facilities, legal documents, geographic locations, products, and services, with an F1 score of 90.77% [35]. Another model found on Hugging Face, the span-marker-bert-base-fewnerd-fine-super, identifies diverse entities that include airports, islands, mountains, transit systems, companies, government agencies, etc., with an F1 score of 70.5% [36]. Although these models can assist in preparing datasets by analyzing aviation safety texts to identify specific entities, a process known as pseudo-labeling, the creation of a “Judge Model” still necessitates the use of gold labels or true labels [37]. This implies that manual checking of the data remains critical to correct any errors in labeling, a task that becomes impractical for datasets with more than 1000 sentences, especially when multiple iterations are required.

2.2. Data Preparation Methods

Numerous labeled datasets exist across different domains, including multilingual collections like CoNLL-2003 and OntoNotes, social media compilations such as WNUT-2017, Wikipedia datasets including Wiki Gold, Wiki Coref, and Hyena, as well as biomedical datasets like NCBI Disease, BioCreative, Genia Corpus, BC4CHEMD, JNLPBA, and BC2GM [38]. While there are several labeled datasets for Chinese aviation safety reports, no comparable labeled dataset exists for English aviation safety reports, highlighting the necessity for developing an original aviation safety dataset.
There are several approaches for handling unlabeled datasets in aviation safety reports. One method previously discussed is pseudo-labeling, where models from third parties are employed to analyze aviation safety texts for identifying specific entities. However, developing a Judge Model to oversee this process still demands the utilization of gold or true labels [37]. This necessitates a labor-intensive process of manually checking and correcting misclassified named entities, a task that becomes particularly burdensome with multiple iterations. In addition, adding new types of named entities introduces another layer of complexity, as it requires sourcing models capable of detecting the desired entities, which can be challenging for certain types, like heading or temperature, which may both be stated in degrees and can confuse generic NER models.
Unsupervised NER techniques, which innovate in avoiding manual input, face substantial hurdles when applied to niche datasets such as aviation safety reports [39]. The precision and relevance of seed entities pose a significant challenge in specialized fields like aviation, where the jargon and context diverge markedly from those found in generic or internet-based texts. Moreover, the intricate details and specific characteristics of aviation safety reports might not be adequately tackled by the ambiguity resolution strategies inherent to these systems, resulting in heightened inaccuracies and noise within the extracted entity data.
Another unsupervised approach, CycleNER, utilizes cycle-consistency training alongside seq2seq transformer models, providing an effective strategy for areas with a lack of annotated datasets [40]. Nevertheless, when applied to specific datasets, such as those related to aviation safety, CycleNER might face considerable obstacles. The success of this method depends significantly on the presence of a small, yet highly representative and high-quality set of entity examples, which may be difficult to procure or may be excessively niche for aviation safety scenarios. Furthermore, existing challenges within CycleNER, including issues with detecting entity spans and processing sentences devoid of entities, may impede its effectiveness in aviation safety contexts, where precise recognition of technical terminology and intricate entity relationships are paramount.

2.3. Related Work

Template-based strategies, specifically those that utilize structured datasets for NER, seem promising given the challenges presented by various models and data preparation techniques listed earlier. Cui et al. [41] developed a sequence-to-sequence (seq2seq) framework utilizing BART for dynamic template filling, highlighting the importance of context and semantic relationships for accurately predicting and classifying entity spans. However, this approach faces limitations in direct data generation control and requires extensive model familiarity for post-testing adjustments.
The method presented in the present work, inspired by the principles outlined in the dynamic template-filling technique, seeks to exploit the advantages of template-based strategies within NER. By incorporating a deterministic template-filling strategy that leverages structured datasets, the present work overcomes the hurdles noted in the seq2seq framework, such as the challenge of direct control and adaptability in real-world application scenarios. The proposed method not only capitalizes on the structured nature of the data for generating training material but also ensures greater flexibility and ease in making targeted adjustments, enhancing the overall adaptability and effectiveness of NER systems in practical settings.

2.4. Problem Definition

To summarize the prior discussion, in the realm of aviation safety, the effective analysis of incident reports is hindered by significant challenges in NER. These challenges include the absence of English-language NER models and annotated datasets tailored for aviation safety, the labor-intensive nature of manual data verification for label accuracy, and the complexities inherent in developing a Judge Model for oversight. For unsupervised data generation, the accuracy of seed entities is compromised by the specialized jargon of the aviation field. Dynamic template filling introduces innovative approaches to contextual understanding but lacks direct control over data generation and necessitates substantial expertise for post-testing adjustments. These issues collectively underscore a pressing need for advanced NER solutions that are not only technically sophisticated but also pragmatic and adaptable to the specialized domain of aviation safety.
The present work introduces a novel approach to domain-specific NER model creation applied specifically for the analysis of aviation safety reports. By customizing and extending the concept of template-based NER, this work pioneers several key innovations that address the unique challenges of this domain. These advancements promise not only to enhance the accuracy and efficiency of domain-specific NER systems but also to significantly improve their applicability and responsiveness to real-world needs.
1. Improved generalizability: This work enhances template-based NER by directly generating training data from templates populated with diverse entities drawn from structured datasets, reflecting the complexity of aviation safety reports. This ensures the domain-specific synthetic training data accurately represent the multifaceted nature of aviation entities, reducing the risk of overfitting.
2. Enhanced flexibility and customization: This work improves flexibility of domain-specific NER systems, enabling rapid updates and customization of entity types and formats to align with specific domain requirements or evolving stakeholder feedback. This adaptability ensures that the NER system remains current and effective, even as aviation safety standards and practices evolve.
3. Independence from existing models and manual labeling: The presented methodology eliminates reliance on pre-existing NER models and the intensive labor of manual data labeling. By leveraging structured datasets for template filling, a more efficient, scalable, and error-resistant approach is offered for generating training data, ensuring high-quality inputs from the start.
4. Ease of adjustments for real-world texts: In response to testing feedback and real-world application, this system provides swift and precise model adjustments. Whether refining entity definitions or modifying template wording, these changes can be implemented without the need for comprehensive model retraining, facilitating continuous improvement and adaptability to real-world needs.

3. Data Preparation

An overview of the methodology used to prepare data for the NER model is shown in Figure 1. The following subsections provide a detailed explanation of the steps taken to generate the training and test datasets.

3.1. Named Entity Data Preparation

Aviation safety reports feature a wide range of named entities and essential key terms vital for thorough analysis. Utilizing narratives from the NTSB database, 17 distinct types of entities were identified based on a manual examination of the narratives by the authors. The corresponding data were compiled from various sources, as shown in Table 1.
The data for each named entity derived from external sources included a diverse array of forms, such as full names, codes, and abbreviations, reflecting the complexity found in actual aviation safety reports. The counts of unique entities excluded overlapping terms, such as identical abbreviations and codes. Consistency was maintained across entities like AIRPORT, CITY, STATE, and COUNTRY to ensure alignment in the subsequent step of template-based NER dataset generation. For example, if a specific AIRPORT was used to build a template, the corresponding CITY, STATE, and COUNTRY (when applicable) were selected.
For entities where data were not available from external sources, synthetic data were generated, mirroring the formats seen in genuine reports. The DATE, TIME, WAY, SPEED, ALTITUDE, DISTANCE, TEMPERATURE, PRESSURE, DURATION, and WEIGHT data lists contained a variety of formats, and quantity-based entities were represented using realistic ranges and units. Format variations and ranges of values for these entities are detailed in Appendix A. The final count of entity data obtained after iteratively adding random values within these ranges is summarized in Table 1.
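The iterative generation of quantity-based entity strings can be sketched as follows; the value ranges, units, and format strings here are illustrative assumptions, not the exact specifications from Appendix A:

```python
import random

# Hypothetical sketch of synthetic quantity-entity generation. The value
# ranges, units, and format strings below are illustrative assumptions,
# not the exact specifications from Appendix A.
def make_speed() -> str:
    value = random.randint(40, 250)
    unit = random.choice(["knots", "kts", "mph"])
    return f"{value} {unit}"

def make_altitude() -> str:
    value = random.randint(5, 410) * 100          # e.g., 7,500
    fmt = random.choice(["{v} feet msl", "{v} ft agl", "{v} feet"])
    return fmt.format(v=f"{value:,}")

# Iteratively add random values until the entity list is large enough;
# a set removes accidental duplicates.
speeds = {make_speed() for _ in range(1000)}
altitudes = {make_altitude() for _ in range(1000)}
```

Sampling into a set in this way naturally yields the deduplicated entity counts summarized in Table 1.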

3.2. Template Preparation

To generate templates that closely resemble aviation safety reports, NTSB narratives were manually examined, occasionally adopting their exact structure or making minor alterations in phrasing and sequence to introduce diversity. Named entities within these templates were then replaced with placeholders marked by {}. These placeholders were designed to be randomly filled with named entities from the previously generated lists in Section 3.1. A few examples of such templates are provided below:
  • During a routine surveillance flight {DATE} at {TIME}, a {AIRCRAFT_NAME}
    ({AIRCRAFT_ICAO}) operated by {AIRLINE_NAME} ({AIRLINE_IATA}) experienced technical difficulties shortly after takeoff from {AIRPORT_NAME}
    ({AIRPORT_SYMBOL}) in {CITY}, {STATE_NAME}, {COUNTRY_NAME}.
  • The pilot of a {MANUFACTURER} {AIRCRAFT_ICAO} noted that the wind was about {SPEED} and favored a {DISTANCE} turf runway.
  • An unexpected {WEIGHT} shift prompted {AIRCRAFT_NAME}’s crew to reroute via {TAXIWAY} for an emergency inspection.
  • {AIRLINE_NAME} encountered unexpected {TEMPERATURE} fluctuations while cruising at {SPEED}, leading to an unscheduled maintenance check upon landing at {AIRPORT_NAME}.
  • Visibility challenges due to a low {TEMPERATURE} at {AIRPORT_NAME} led the {AIRCRAFT_NAME} to modify its course to an alternate {TAXIWAY}, under guidance from air traffic control.
Additionally, ChatGPT 4 was utilized to create varied sets of templates; approximately half of the final template collection was generated by this external model [45]. These templates were generated iteratively using few-shot prompting. The initial prompt included placeholder descriptions for all the entity categories listed in Table 1. The next prompt provided three genuine NTSB narratives to illustrate the structure of aviation safety reports, along with three example templates containing placeholders. Finally, ChatGPT was prompted to create templates with the following instructions:
“Using the information provided for entity placeholders, aviation safety reports, and example templates, generate 10 new templates based on aviation safety data, covering a range of possible incidents and scenarios. Do not add placeholders beyond the list already provided. Templates must be single sentences.”
ChatGPT was then prompted to generate further templates in batches of 10, ensuring no repetition of previously generated templates. Each template was reviewed for suitability and added to the collection. To adjust the frequency of specific placeholders, the prompt was modified accordingly. An example of such a prompt is given below:
“Generate 10 more templates. Increase the occurrence of placeholders {DURATION}, {WEIGHT}, and {TEMPERATURE}. Do not repeat any previously generated templates.”
This process ultimately resulted in the development of 423 distinct templates (manual and ChatGPT combined), each consisting of a single sentence.

3.3. Labeled Synthetic Dataset Generation for Training

The developed templates were used to construct meaningful sentences by randomly replacing placeholders with named entities. This process was repeated 166 times, resulting in the creation of 70,218 sentences. The precise number of repetitions was determined based on the study outlined in Appendix B, which highlights the flexibility of the template-based approach in generating synthetic training datasets of varying sizes. The entity annotation and BIO tagging were performed simultaneously, with “O” representing words outside the entities of interest, “B” marking the beginning of an entity phrase, and “I” indicating continuation inside the entity phrase. Table 2 provides an example of placeholders replaced with random named entities, along with their corresponding tags. Additionally, Figure 2 shows the distribution of entities across the training dataset, with a total of 2,125,157 entities (including “O”).
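The placeholder replacement and simultaneous BIO tagging described above can be sketched as below; the entity lists and template are toy examples, whereas the actual pipeline draws from the full entity lists of Section 3.1 and all 423 templates:

```python
import random

# Toy sketch of template filling with simultaneous BIO tagging; the entity
# lists and template here are illustrative, not the paper's actual data.
ENTITIES = {
    "MANUFACTURER": ["Cessna", "Piper", "Beech"],
    "SPEED": ["10 knots", "15 kts"],
}

def fill_template(template: str) -> tuple[list[str], list[str]]:
    tokens, tags = [], []
    for chunk in template.split():
        if chunk.startswith("{") and chunk.endswith("}"):
            label = chunk.strip("{}")
            words = random.choice(ENTITIES[label]).split()
            tokens.extend(words)
            # "B-" marks the first word of the entity, "I-" any continuation
            tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(words) - 1))
        else:
            tokens.append(chunk)
            tags.append("O")     # word outside the entities of interest
    return tokens, tags

tokens, tags = fill_template(
    "The pilot of a {MANUFACTURER} reported winds of about {SPEED} .")
```

Because the placeholder identity is known at substitution time, the tags come for free, which is what removes the manual-annotation burden discussed in Section 2.2.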

3.4. Labeled Dataset Generation for Testing

To test performance on real-world data, genuine narratives from the NTSB database were considered. As an initial step in the selection process, all narratives in the database were tokenized using the Aviation-BERT tokenizer, and only those narratives with a sequence length shorter than 512 tokens were selected. This was done to prevent indexing errors during inference, as Aviation-BERT is based on the BERT-Base-Uncased model with a maximum allowable sequence length of 512 tokens [16,21]. In total, 108,514 narratives were selected from this process.
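The length-based filter can be sketched as follows; the function accepts any encoding callable, and the commented lines indicate how the Aviation-BERT tokenizer (whose storage path is a placeholder here) would plug in:

```python
MAX_LEN = 512  # BERT-Base maximum sequence length, including [CLS] and [SEP]

def filter_narratives(narratives, encode):
    """Keep only narratives whose encoded length stays under MAX_LEN."""
    return [text for text in narratives if len(encode(text)) < MAX_LEN]

# With the actual model, the encoder would come from the Aviation-BERT
# tokenizer, e.g. (the path is a placeholder):
#   from transformers import BertTokenizerFast
#   tok = BertTokenizerFast.from_pretrained("path/to/aviation-bert")
#   selected = filter_narratives(all_narratives, tok.encode)
```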
From the 108,514 selected narratives, individual narratives were randomly chosen and qualitatively examined for entities of interest. Narratives with a visibly high number and diversity of entities were shortlisted for the test dataset, while others were disregarded. For example, extremely short narratives of one or two sentences with no words within the entities of interest were not considered as candidates for the test dataset. This random selection process continued until 50 narratives were shortlisted. These shortlisted narratives were then manually annotated for all entities, for a total of 22,021 (including “O”). Entity distribution for the test dataset is shown in Figure 3.
A secondary test dataset was constructed using the same 50 narratives to understand the effect of shorter inputs on the model’s performance. For this purpose, the 50 narratives were broken down into individual sentences without altering the manual annotation for entities. This process resulted in 848 sentences, which can be used to evaluate the model’s performance on the same narratives at the sentence level.

4. Fine-Tuning

4.1. Training

The Aviation-BERT model [21], initially pre-trained on textual data from NTSB and ASRS, was fine-tuned for NER using the generated training dataset. The same tokenizer employed during the model’s pre-training phase was used for fine-tuning. During fine-tuning, all layers of the Aviation-BERT model were frozen, except for the top layer, thereby preserving the model’s foundational linguistic knowledge.
The AdamW optimizer [46] was selected for the training stage. A learning rate of 5 × 10⁻⁵ and a batch size of 16 were chosen, aligning with best practices for fine-tuning transformer-based models to ensure a balance between quick adaptation to the new domain and maintenance of stability in the learning process. Fine-tuning was conducted over three epochs, a duration carefully calibrated to optimize the model’s learning without leading to overfitting (see study in Appendix B). A single NVIDIA Tesla V100 32GB GPU was utilized, and fine-tuning was completed in under two hours.
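A sketch of this configuration is given below; the layer-name prefix used for freezing is an illustrative assumption (matching common BERT token-classification implementations), since the authors' exact training script is not shown:

```python
# Hyperparameters reported for fine-tuning (AdamW, lr 5e-5, batch 16, 3 epochs).
HPARAMS = {
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "batch_size": 16,
    "epochs": 3,
}

def freeze_all_but_top(named_parameters, top_prefix="classifier"):
    """Freeze every parameter except the top (token-classification) layer.

    The prefix "classifier" is an assumption; it matches the naming used by
    common BERT token-classification implementations.
    """
    for name, param in named_parameters:
        param.requires_grad = name.startswith(top_prefix)
```

In a typical PyTorch setup, `freeze_all_but_top(model.named_parameters())` would be called once before constructing the optimizer, so that only the top layer's weights are updated.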

4.2. Testing

After the model was fine-tuned, its performance was evaluated using the 50 test narratives at both the paragraph and sentence levels. The performance was quantified using precision, recall, and F1 scores as metrics. Weighted average values of these metrics were calculated to account for the imbalanced nature of the test datasets. Additionally, the percentages of correct predictions for each entity were evaluated to compare performance at both levels.
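The weighted-average metric computation can be sketched directly from per-label counts; in practice, libraries such as scikit-learn or seqeval compute this, but the hand-rolled version below makes the support weighting explicit:

```python
from collections import Counter

def weighted_prf(true_tags, pred_tags):
    """Support-weighted precision, recall, and F1 over token-level labels."""
    support = Counter(true_tags)
    total = sum(support.values())
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == p == label for t, p in zip(true_tags, pred_tags))
        n_pred = sum(p == label for p in pred_tags)
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_true
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_true / total          # weight each label by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1
```

Because the weights are proportional to label frequency, the dominant "O" class contributes most of the score, which is exactly why the qualitative review described next is needed.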
The misclassifications and incorrect predictions were then manually reviewed as part of a qualitative assessment. This step was required because the excessive presence of the “O” entity may at times inflate the scores, potentially giving a misleading indication of the model’s predictive performance on the actual entities of interest.

5. Results and Discussion

5.1. Scores and Predictions

Precision, recall, and F1 scores for the test narratives at both the paragraph and sentence levels are summarized in Table 3. The percentages of correct predictions for each entity in the test datasets are presented in Table 4. F1 scores of 94.78% and 95.08% indicate that the model performs well at both levels, with slightly better performance at the sentence level. This is further supported by the higher percentages of correct predictions at the sentence level for nearly all entities. This behavior suggests a clear influence of the type of training data, as the training dataset consisted solely of single sentences derived from templates.
The model’s performance can also be visualized using a confusion matrix. Figure 4 shows the confusion matrix for the more conservative results at the paragraph level. The darkened diagonal in this figure represents the alignment between predicted and true labels for each entity, with a percentage scale on the right. The prominent diagonal demonstrates the model’s strong ability to accurately recognize and categorize the majority of entities.
When adapting the NER model for KGQA systems as in Ref. [23], the focus shifts from the fine-grained boundary delineation typical of BIO (Beginning, Inside, Outside) tagging to a broader emphasis on accurately recognizing and categorizing entities. In the context of KGQA, the detailed entity spans defined by BIO tagging are less relevant, as the primary goal is to understand the intent behind user queries and directly link identified entities to the knowledge graph. Consequently, this adaptation led to a review of performance metrics without the BIO tagging scheme. The results, summarized in Table 5, show that the model performs even better without the BIO tagging scheme, demonstrating its improved alignment with the operational requirements of KGQA systems.
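Collapsing the BIO scheme to plain entity labels for this KGQA-oriented evaluation amounts to stripping the B-/I- prefixes, as in this sketch:

```python
def strip_bio(tag: str) -> str:
    """Collapse a BIO tag (e.g., "B-AIRPORT", "I-AIRPORT") to its entity label."""
    return tag.split("-", 1)[1] if tag.startswith(("B-", "I-")) else tag

tags = ["O", "B-AIRPORT", "I-AIRPORT", "B-SPEED"]
labels = [strip_bio(t) for t in tags]   # → ["O", "AIRPORT", "AIRPORT", "SPEED"]
```

Splitting on the first hyphen only preserves entity names that themselves contain hyphens or underscores.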
As shown in Figure 3, the test dataset contains a high proportion of “O” entities, comprising 79.59% of all labeled tokens. While correctly predicting the “O” entity contributes to overall model performance, its outsized influence on the F1 score highlights the need for a qualitative assessment of the results to better understand the model’s performance on entities of interest.
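To see why a dominant “O” class props up support-weighted scores, consider a toy evaluation in which every “O” token is predicted correctly but only half the entity tokens are found: the weighted F1 remains high. A self-contained sketch (not the authors' code, which would typically rely on a standard metrics library):

```python
from collections import Counter

def weighted_f1(true_tags, pred_tags):
    """Support-weighted F1 over the labels present in the reference tags."""
    support = Counter(true_tags)
    n = len(true_tags)
    score = 0.0
    for label in support:
        tp = sum(t == p == label for t, p in zip(true_tags, pred_tags))
        fp = sum(p == label and t != label for t, p in zip(true_tags, pred_tags))
        fn = sum(t == label and p != label for t, p in zip(true_tags, pred_tags))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += support[label] / n * f1  # weight each label's F1 by its support
    return score

# 80% "O" tokens, all predicted correctly; only 10 of 20 DATE tokens recovered.
true_tags = ["O"] * 80 + ["DATE"] * 20
pred_tags = ["O"] * 80 + ["DATE"] * 10 + ["O"] * 10
round(weighted_f1(true_tags, pred_tags), 3)  # 0.886, despite 50% entity recall
```

The 88.6% weighted F1 here overstates the 50% recall on the entity of interest, which is exactly why the qualitative review of misclassifications was performed alongside the aggregate scores.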
Examples of correct entity classifications and predictions are shown in Figure 5. In Figure 5a, the model’s ability to distinguish between heading angle and temperature is noticeable, even though both use the same units. For instance, “330 degrees” is correctly classified as “O” based on meaning and context, while the temperature and dew point are identified as TEMPERATURE. A similar challenge for the model is differentiating between ALTITUDE and DISTANCE. Figure 5b illustrates how the model correctly classifies the dimensions of the runway as DISTANCE. Similarly, in Figure 5c, vertical distance is correctly classified as ALTITUDE. In the same example, quantities in gallons are classified as WEIGHT. While this classification is considered correct for the current model, it can easily be adjusted to volume by introducing a separate entity category during the data preparation and labeling stages.
Despite achieving high F1 scores, the model exhibits certain misclassifications, indicating areas for improvement. Figure 6 provides examples of some frequently occurring misclassifications. These misclassifications are particularly evident with aircraft engine specifics, such as model number, fuel consumption, and rotational speed, as shown in Figure 6a. Another area where the model underperforms is in classifying addresses and telephone numbers as entities outside of interest (Figure 6b). These misclassifications likely result from the absence of similar phrases in the training templates labeled as “O”, making the model unfamiliar with them. A different yet common type of misclassification is shown in Figure 6c, where the manufacturer “Beech” is identified as AIRCRAFT. Although this switch from MANUFACTURER to AIRCRAFT negatively impacts the overall scores, it was considered contextually acceptable within the narrative.
The root cause of most misclassifications can be addressed by increasing the diversity of templates to better capture specific non-entity terms. For instance, the strategy of using placeholders can be extended to the “O” label as well. Meaningful templates can be created with placeholders for addresses, telephone numbers, and quantities with units that are not of interest. During dataset generation, random values can be inserted into these placeholders, assigning them the “O” label. This approach will improve the model’s ability to recognize entities that fall outside the scope of interest.
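The extension described above can be sketched as follows: templates carry placeholders, each slot maps to a label, and slots mapped to “O” (addresses, telephone numbers, quantities with units that are not of interest) are filled with random values but labeled outside of interest. All template strings, slot names, and fillers here are illustrative, not the authors' actual templates:

```python
import random

# Each template pairs a string of tokens (placeholders in braces) with a
# slot -> label mapping; "O" slots teach the model to ignore their fills.
templates = [
    ("The operator can be reached at {PHONE} .", {"PHONE": "O"}),
    ("The pilot departed {CITY} at {TIME} .", {"CITY": "CITY", "TIME": "TIME"}),
]
fillers = {
    "PHONE": ["555-0142", "(404) 555-0101"],
    "CITY": ["Berrien Springs", "Wiesbaden"],
    "TIME": ["0945", "11:47"],
}

def fill(template, slot_labels, rng):
    """Expand one template into (tokens, labels), BIO-tagging entity slots."""
    tokens, labels = [], []
    for tok in template.split():
        if tok.startswith("{") and tok.endswith("}"):
            slot = tok[1:-1]
            label = slot_labels[slot]
            for i, vt in enumerate(rng.choice(fillers[slot]).split()):
                tokens.append(vt)
                labels.append("O" if label == "O" else ("B-" if i == 0 else "I-") + label)
        else:
            tokens.append(tok)
            labels.append("O")
    return tokens, labels

rng = random.Random(0)
tokens, labels = fill(*templates[0], rng)  # every label is "O", phone number included
```

During dataset generation, iterating such templates with fresh random fills produces labeled sentences in which address-like and phone-like spans consistently carry the “O” label.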

5.2. Model Comparison

In Table 6, Aviation-BERT-NER is compared with alternative models to highlight its capabilities and performance. A direct comparison with Chinese IDCNN-BiGRU-CRF [29], Chinese MFT-BERT [30], and aeroBERT-NER [33] is challenging, primarily because the first two are designed for Chinese language documents, while the latter focuses on aerospace design requirements. Andrade et al. [32] report an F1 score of 33% for their custom BERT NER model, but since the “O” label is excluded from the weighted average calculations, a straightforward comparison with Aviation-BERT-NER may not be possible. They also note that the F1 score achieved is passable for their specific purpose and suggest that additional labeled training data could potentially improve results.
The evaluation of Aviation-BERT-NER, based on a weighted F1 score from 50 narratives of actual aviation safety reports, shows performance closely aligned with Singapore’s SpaCy NER [31], though Aviation-BERT-NER identifies 17 entities, compared to just 7 by the prior model. A distinguishing feature of Aviation-BERT-NER is its ability to capture not only aviation-specific entities but also a wide range of quantities. The fine-tuning setup inherently supports the expansion of entity types, allowing for easy refinement of templates and datasets in response to testing outcomes. This flexibility, along with the elimination of extensive manual dataset labeling, makes the model highly practical and adaptable, especially in the evolving landscape of aviation document analysis.

6. Conclusions

Domain-specific NER models serve an important function in the analysis of textual data. This paper presents a novel approach to training such models by generating synthetic training datasets for improved generalizability, flexibility, and customization, while reducing reliance on manual labeling and facilitating their application to real-world problems.
The Aviation-BERT-NER model developed here enhances the extraction of key terminology from aviation safety reports, such as those found in the NTSB and ASRS databases. A central feature of Aviation-BERT-NER’s approach is its use of templates, designed to reflect the diverse formats and scenarios found in real-world aviation safety narratives. This approach not only supports generalizability but also allows for quick adaptation to evolving aviation terminology and standards. The synthetic training dataset can be tailored to accommodate any number of entities while ensuring balance and diversity. In this work, over 70,000 synthetic sentences were generated for training Aviation-BERT-NER to recognize 17 distinct entities. For testing, real NTSB accident narratives were manually annotated to assess the model’s performance.
The methodology behind Aviation-BERT-NER addresses common challenges in NER models, such as misclassification, by continuously refining template diversity and strategically enhancing its named entity datasets. This proactive approach to model improvement demonstrates the method’s effectiveness in delivering a robust tool for aviation safety analysis. In benchmarking against other NER models, Aviation-BERT-NER outperformed other aviation-specific English NER models while handling more than twice the number of entities. Aviation-BERT-NER achieved an F1 score of 94.78% while identifying 17 entities, compared to Singapore’s SpaCy NER, which had an F1 score of 93.73% for 7 entities. Testing on real-world narratives from the NTSB database highlighted Aviation-BERT-NER’s robustness, with a precision of 95.34%, recall of 94.62%, and F1 score of 94.78% when evaluated over 50 manually annotated (BIO tagged) paragraphs. These scores are even higher when tested on single sentences instead of paragraphs.
However, the model has limitations in handling entities it was not trained to recognize, such as aircraft engine details, telephone numbers, or addresses, which may appear in real-world aviation safety reports. Future work will focus on addressing these gaps. It will also involve integrating Aviation-BERT-NER into an Aviation-Knowledge Graph system. This integration will enable the development of an Aviation-KG Retrieval Augmented Generation (KG-RAG) as well as a KG-Question Answering (KG-QA) system for aviation safety analysis.

Author Contributions

Conceptualization, Y.O. and M.V.B.; methodology, data curation, formal analysis, C.C. and Y.O.; resources, M.V.B. and D.N.M.; writing—original draft preparation, C.C. and Y.O.; writing—review and editing, M.V.B.; visualization, C.C. and Y.O.; supervision, M.V.B.; project administration and funding, M.V.B. and D.N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the U.S. Federal Aviation Administration (FAA) as part of the Top-down Safety Risk Modeling For Facilitating Bottom-up Safety Risk Assessment project under FAA Award Number: 692M15-21-F-00225. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the FAA.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available upon request from the authors.

Acknowledgments

This research was supported in part through research cyber-infrastructure resources and services provided by the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology, Atlanta, Georgia, USA. The authors would also like to thank Huasheng Li, Shane Bertish, and Dereck Wilson for their guidance and support.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Generated data for the 10 entities mentioned in Section 3.1 span a wide variety of formats and values. The formats for each entity category are shown in Table A1, with the values represented as variables within the format itself. Random variable values were generated from specified ranges and inserted into the formatted text to create lists of entity data.
Table A1. Format and variations for synthetically generated entity data.
DATE
  Formats: y-m-d; m-d-y; d/m/y; on the {day} of {month} {year}; in {month} of {year} on the {date}
  Ranges: d = 01–31; m = 01–12; y = 2010–2023; {day} = 1st–31st (appropriate suffix for each day); {month} = January–December, Jan.–Dec.; {year} = 2010–2023
  Comments: All dates generated were practical values; for example, 30th Feb. is not possible and thus not used.
TIME
  Formats: HM; H:M; HM {tz}; H:M {tz}
  Ranges: H = 00–23; M = 00–59; {tz} = time zone names, abbreviations, offsets
  Comments: List of world time zones was obtained from [47].
WAY
  Formats: runway {number}; rwy {number}; rw {number}; taxiway {letter}; txy {letter}; txwy {letter}
  Ranges: {number} = 01–36, 01L–36L, 01C–36C, 01R–36R; {letter} = A–Z, A1–Z5, AA–ZZ
  Comments: Taxiway letters I, O, X, II, OO, and XX were excluded.
SPEED
  Formats: {s_knot} kts; {s_knot} knots; {s_kmph} kilometers per hour; {s_kmph} km/h; {s_kmph} km/hr; {s_mph} miles per hour; {s_mph} mph
  Ranges: {s_knot} = 1–600; {s_kmph} = 1–1200; {s_mph} = 1–745
  Comments: Both integers and floating-point values to 1 decimal place were used.
ALTITUDE
  Formats: {a_ft} feet {suffix}; {a_ft} ft {suffix}; {a_ft} {suffix}; {a_m} meters; {a_m} m; FL {level}
  Ranges: {a_ft} = 50–60,000; {suffix} = no suffix, ‘agl’, ‘above ground level’, ‘msl’, ‘mean sea level’; {a_m} = 20–18,300; {level} = 100–600
  Comments: Only integers were used for numbers. Multiples of 10 were used for flight level.
DISTANCE
  Formats: {d_nm} nautical miles; {d_nm} NM; {d_km} kilometers; {d_km} km; {d_mile} miles; {d_ft} feet; {d_ft} ft; {d_ft} foot; {d_m} meters; {d_m} m
  Ranges: {d_nm} = 10–7500; {d_km} = 10–14,000; {d_mile} = 10–8700; {d_ft} = 50–9000; {d_m} = 10–3000
  Comments: Both integers and floating-point values to 1 decimal place were used.
TEMPERATURE
  Formats: {t_c} Celsius; {t_c} C; {t_c} °C; {t_c} degrees Celsius; {t_c} degrees C; {t_f} Fahrenheit; {t_f} F; {t_f} °F; {t_f} degrees Fahrenheit; {t_f} degrees F
  Ranges: {t_c} = −60 to 70; {t_f} = −76 to 150
  Comments: Both integers and floating-point values to 1 decimal place were used.
PRESSURE
  Formats: {p_inHG} inches of mercury; {p_inHG} inHG; {p_hPa} hectopascals; {p_hPa} hPa; {p_mb} millibars; {p_mb} mb; {p_Pa} pascal; {p_Pa} Pa; {p_psi} pounds per square inch; {p_psi} psi
  Ranges: {p_inHG} = 2–33; {p_hPa} = 800–1014; {p_mb} = 72–1014; {p_Pa} = 7240–101,400; {p_psi} = 1–16
  Comments: Both integers and floating-point values to 1 decimal place were used.
DURATION
  Formats: {dur_s} seconds; {dur_s} sec; {dur_s} secs; {dur_h} hours; {dur_h} h; {dur_h} hr; {dur_h} hrs; {dur_m} minutes; {dur_m} m; {dur_m} min; {dur_m} mins
  Ranges: {dur_s} = 1–1000; {dur_h} = 1–24; {dur_m} = 1–300
  Comments: Both integers and floating-point values to 1 decimal place were used.
WEIGHT
  Formats: {w_lb} pounds; {w_lb} lb; {w_lb} lbs; {w_kg} kilograms; {w_kg} kg; {w_ton} tons; {w_ton} tonnes
  Ranges: {w_lb} = 1–1,000,000; {w_kg} = 1–453,592; {w_ton} = 1–453
  Comments: Both integers and floating-point values to 1 decimal place were used.
For example, 100 random values were generated for each of the variables w_lb, w_kg, and w_ton within the specified ranges for WEIGHT. These values were then applied to the 7 different formats, resulting in 700 named entity data points, as seen in Table 1. It is important to note that not all formats within certain entity categories were given equal emphasis. In PRESSURE, for instance, preference was given to p_inHG and p_psi, with more random values generated for these variables than for p_hPa, p_mb, and p_Pa, due to the higher prevalence of ‘inches of mercury’ and ‘psi’ units in the NTSB narratives.
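The WEIGHT generation described above can be sketched as follows, with the variable names (w_lb, w_kg, w_ton), formats, and ranges taken from Table A1; the helper itself is illustrative rather than the authors' generation script:

```python
import random

# Formats and ranges for WEIGHT from Table A1 (3 + 2 + 2 = 7 formats in total).
formats = {
    "w_lb": ["{v} pounds", "{v} lb", "{v} lbs"],
    "w_kg": ["{v} kilograms", "{v} kg"],
    "w_ton": ["{v} tons", "{v} tonnes"],
}
ranges = {"w_lb": (1, 1_000_000), "w_kg": (1, 453_592), "w_ton": (1, 453)}

def generate_weights(n_per_var, rng):
    """Draw n_per_var random values per variable and render every format."""
    out = []
    for var, fmts in formats.items():
        lo, hi = ranges[var]
        values = [rng.randint(lo, hi) for _ in range(n_per_var)]
        for fmt in fmts:
            # Thousands separators match examples such as '503,385 pounds'.
            out.extend(fmt.format(v=f"{v:,}") for v in values)
    return out

entities = generate_weights(100, random.Random(0))
len(entities)  # 100 values per variable x 7 formats = 700 entity strings
```

Weighting the number of random draws per variable, rather than keeping n_per_var uniform, reproduces the unequal emphasis applied to the PRESSURE formats.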

Appendix B

As outlined in Section 3.3, the 423 templates are iteratively processed to generate labeled sentences for the training dataset, with each iteration involving the random selection of entities to populate the templates. The number of iterations must be carefully determined to ensure sufficient model learning without overfitting. This section details the methodology used to select the optimal number of iterations, i.e., the dataset size, while balancing the number of training epochs to achieve the best results.
Table A2 presents models trained with different dataset sizes over 3 to 5 epochs. Note that the dataset sizes are all multiples of 423, as they are generated by iterating the templates. All other aspects, including the hyperparameters identified in Section 4.1, remain consistent for training these models. The paragraph-level test dataset containing 50 narratives is used to generate weighted F1 scores for quantitative comparison of performance. Scores without the BIO tagging scheme are also presented for additional comparison.
Table A2. Test results (paragraph level) for different dataset sizes and training epochs. Bold values represent the highest F1 score. The final model chosen is highlighted in green.
Model # | Dataset Size | Training Epochs | F1 Score | F1 Score (without BIO)
1 | 50,337 | 3 | 94.45% | 95.86%
2 | 50,337 | 4 | 95.03% | 96.12%
3 | 54,990 | 3 | 93.03% | 94.20%
4 | 54,990 | 4 | 94.28% | 95.43%
5 | 60,066 | 3 | 93.84% | 95.28%
6 | 60,066 | 4 | 94.11% | 95.26%
7 | 60,066 | 5 | 93.98% | 95.13%
8 | 65,142 | 3 | 94.26% | 95.53%
9 | 65,142 | 4 | 93.94% | 95.14%
10 | 70,218 | 3 | 94.78% | 96.14%
11 | 70,218 | 4 | 93.60% | 95.19%
12 | 80,370 | 3 | 92.54% | 93.65%
13 | 80,370 | 4 | 93.79% | 95.20%
14 | 80,370 | 5 | 93.47% | 94.69%
15 | 100,251 | 3 | 93.83% | 95.47%
16 | 100,251 | 4 | 93.27% | 94.82%
17 | 100,251 | 5 | 93.38% | 94.68%
18 | 120,132 | 3 | 93.45% | 94.60%
19 | 120,132 | 4 | 93.30% | 94.49%
20 | 120,132 | 5 | 92.57% | 93.87%
Model 2 in Table A2 showed the highest F1 score of 95.03%. While this model might seem like the natural choice, the general trend of high scores across all models being influenced by the “O” entity necessitated a qualitative assessment to evaluate the performance on entities of interest. The high scores already indicated a strong level of correct classifications; therefore, the qualitative assessment focused on the severity of misclassifications. An example of a misclassification observed in Model 2 that disqualified it as the top contender is shown in Figure A1. In some narratives, this model incorrectly identified full stops as CITY, which was deemed an unacceptable error.
Figure A1. Example of unacceptable misclassification in Model 2.
The next best model, Model 10, with an F1 score of 94.78%, was examined, and an analysis of its common misclassifications is provided in Section 5.1. While Model 10 was not without flaws, it did not exhibit signs of overfitting or serious misclassifications and simultaneously achieved an impressive F1 score. Incidentally, it also achieved the highest F1 score of 96.14% when tested without the BIO tagging scheme. As a result, the dataset size of 70,218, generated through 166 iterations of the templates and three epochs, was selected as optimal for the NER model.
An additional observation can be drawn from the data presented in Table A2. Models trained with larger dataset sizes over a greater number of epochs exhibited reduced performance. This suggests that the model was likely memorizing phraseology and context from the training data, leading to overfitting. Consequently, it can be inferred that merely increasing the volume of templatized data through repeated iterations does not inherently enhance performance.

References

  1. International Air Transport Association. IATA Annual Review 2024. Available online: https://www.iata.org/contentassets/c81222d96c9a4e0bb4ff6ced0126f0bb/iata-annual-review-2024.pdf (accessed on 12 September 2024).
  2. Oster, C.V., Jr.; Strong, J.S.; Zorn, C.K. Analyzing aviation safety: Problems, challenges, opportunities. Res. Transp. Econ. 2013, 43, 148–164. [Google Scholar] [CrossRef]
  3. Zhang, X.; Mahadevan, S. Bayesian Network Modeling of Accident Investigation Reports for Aviation Safety Assessment. Reliab. Eng. Syst. Saf. 2021, 209, 107371. [Google Scholar] [CrossRef]
  4. Zhong, K.; Jackson, T.; West, A.; Cosma, G. Natural Language Processing Approaches in Industrial Maintenance: A Systematic Literature Review. Procedia Comput. Sci. 2024, 232, 2082–2097. [Google Scholar] [CrossRef]
  5. Amin, N.; Yother, T.L.; Johnson, M.E.; Rayz, J. Exploration of Natural Language Processing (NLP) applications in aviation. Coll. Aviat. Rev. Int. 2022, 40, 203–216. [Google Scholar] [CrossRef]
  6. Rose, R.L.; Puranik, T.G.; Mavris, D.N.; Rao, A.H. Application of structural topic modeling to aviation safety data. Reliab. Eng. Syst. Saf. 2022, 224, 108522. [Google Scholar] [CrossRef]
  7. NASA. ASRS Program Briefing. 2023. Available online: https://asrs.arc.nasa.gov/docs/ASRS_ProgramBriefing.pdf (accessed on 22 October 2024).
  8. NTSB. National Transportation Safety Board–Aviation Investigation Search. 2024. Available online: https://www.ntsb.gov/Pages/AviationQueryv2.aspx (accessed on 22 October 2024).
  9. Yang, C.; Huang, C. Natural Language Processing (NLP) in Aviation Safety: Systematic Review of Research and Outlook into the Future. Aerospace 2023, 10, 600. [Google Scholar] [CrossRef]
  10. Liu, Y. Large language models for air transportation: A critical review. J. Air Transp. Res. Soc. 2024, 2, 100024. [Google Scholar] [CrossRef]
  11. Perboli, G.; Gajetti, M.; Fedorov, S.; Giudice, S.L. Natural Language Processing for the identification of Human factors in aviation accidents causes: An application to the SHEL methodology. Expert Syst. Appl. 2021, 186, 115694. [Google Scholar] [CrossRef]
  12. Madeira, T.; Melício, R.; Valério, D.; Santos, L. Machine Learning and Natural Language Processing for Prediction of Human Factors in Aviation Incident Reports. Aerospace 2021, 8, 47. [Google Scholar] [CrossRef]
  13. Miyamoto, A.; Bendarkar, M.V.; Mavris, D.N. Natural Language Processing of Aviation Safety Reports to Identify Inefficient Operational Patterns. Aerospace 2022, 9, 450. [Google Scholar] [CrossRef]
  14. Zhang, X.; Srinivasan, P.; Mahadevan, S. Sequential deep learning from NTSB reports for aviation safety prognosis. Saf. Sci. 2021, 142, 105390. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  17. Kierszbaum, S.; Lapasset, L. Applying Distilled BERT for Question Answering on ASRS Reports. In Proceedings of the 2020 New Trends in Civil Aviation (NTCA), Prague, Czech Republic, 23–24 November 2020; pp. 33–38. [Google Scholar] [CrossRef]
  18. Kierszbaum, S.; Klein, T.; Lapasset, L. ASRS-CMFS vs. RoBERTa: Comparing Two Pre-Trained Language Models to Predict Anomalies in Aviation Occurrence Reports with a Low Volume of In-Domain Data Available. Aerospace 2022, 9, 591. [Google Scholar] [CrossRef]
  19. Wang, L.; Chou, J.; Tien, A.; Zhou, X.; Baumgartner, D. AviationGPT: A Large Language Model for the Aviation Domain. In Proceedings of the AIAA Aviation Forum and Ascend 2024, Las Vegas, NV, USA, 29 July–2 August 2024. [Google Scholar] [CrossRef]
  20. Ricketts, J.; Barry, D.; Guo, W.; Pelham, J. A Scoping Literature Review of Natural Language Processing Application to Safety Occurrence Reports. Safety 2023, 9, 22. [Google Scholar] [CrossRef]
  21. Chandra, C.; Jing, X.; Bendarkar, M.; Sawant, K.; Elias, L.; Kirby, M.; Mavris, D. Aviation-BERT: A Preliminary Aviation-Specific Natural Language Model. In Proceedings of the AIAA AVIATION 2023 Forum, San Diego, CA, USA, 12–16 June 2023. [Google Scholar] [CrossRef]
  22. Jing, X.; Chennakesavan, A.; Chandra, C.; Bendarkar, M.V.; Kirby, M.; Mavris, D.N. BERT for Aviation Text Classification. In Proceedings of the AIAA AVIATION 2023 Forum, San Diego, CA, USA, 12–16 June 2023. [Google Scholar] [CrossRef]
  23. Jing, X.; Sawant, K.; Bendarkar, M.V.; Elias, L.R.; Mavris, D. Expanding Aviation Knowledge Graph Using Deep Learning for Safety Analysis. In Proceedings of the AIAA Aviation Forum and Ascend 2024, Las Vegas, NV, USA, 29 July–2 August 2024. [Google Scholar] [CrossRef]
  24. Agarwal, A.; Gite, R.; Laddha, S.; Bhattacharyya, P.; Kar, S.; Ekbal, A.; Thind, P.; Zele, R.; Shankar, R. Knowledge Graph - Deep Learning: A Case Study in Question Answering in Aviation Safety Domain. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 6260–6270. [Google Scholar]
  25. Sanmartin, D. KG-RAG: Bridging the Gap Between Knowledge and Creativity. arXiv 2024, arXiv:2405.12035. [Google Scholar] [CrossRef]
  26. Mollá, D.; Van Zaanen, M.; Smith, D. Named entity recognition for question answering. In Proceedings of the Australasian Language Technology Association Workshop, Sydney, Australia, 30 November–1 December 2006; Australasian Language Technology Association: Sydney, Australia, 2006; pp. 51–58. [Google Scholar]
  27. Shah, A.; Gullapalli, A.; Vithani, R.; Galarnyk, M.; Chava, S. FiNER-ORD: Financial Named Entity Recognition Open Research Dataset. arXiv 2024, arXiv:2302.11157. [Google Scholar] [CrossRef]
  28. Durango, M.C.; Torres-Silva, E.A.; Orozco-Duque, A. Named Entity Recognition in Electronic Health Records: A Methodological Review. Healthc. Inform. Res. 2023, 29, 286–300. [Google Scholar] [CrossRef]
  29. Wang, X.; Gan, Z.; Xu, Y.; Liu, B.; Zheng, T. Extracting Domain-Specific Chinese Named Entities for Aviation Safety Reports: A Case Study. Appl. Sci. 2023, 13, 11003. [Google Scholar] [CrossRef]
  30. Chu, J.; Liu, Y.; Yue, Q.; Zheng, Z.; Han, X. Named entity recognition in aerospace based on multi-feature fusion transformer. Sci. Rep. 2024, 14, 827. [Google Scholar] [CrossRef]
  31. Bharathi, A.; Ramdin, R.; Babu, P.; Menon, V.K.; Jayaramakrishnan, C.; Lakshmikumar, S. A hybrid named entity recognition system for aviation text. EAI Endorsed Trans. Scalable Inf. Syst. 2024, 11, 1–10. [Google Scholar]
  32. Andrade, S.R.; Walsh, H.S. What Went Wrong: A Survey of Wildfire UAS Mishaps through Named Entity Recognition. In Proceedings of the 2022 IEEE/AIAA 41st Digital Avionics Systems Conference (DASC), Portsmouth, VA, USA, 18–22 September 2022; pp. 1–10. [Google Scholar] [CrossRef]
  33. Ray, A.T.; Pinon-Fischer, O.J.; Mavris, D.N.; White, R.T.; Cole, B.F. aeroBERT-NER: Named-Entity Recognition for Aerospace Requirements Engineering using BERT. In Proceedings of the AIAA SCITECH 2023 Forum, National Harbor, MD, USA, 23–27 January 2023. [Google Scholar] [CrossRef]
  34. Pai, R.; Clarke, S.S.; Kalyanam, K.; Zhu, Z. Deep Learning based Modeling and Inference for Extracting Airspace Constraints for Planning. In Proceedings of the AIAA AVIATION 2022 Forum, Online, 27 June–1 July 2022. [Google Scholar] [CrossRef]
  35. Aarsen, T. SpanMarker for Named Entity Recognition. Available online: https://github.com/tomaarsen/SpanMarkerNER (accessed on 13 September 2024).
  36. Aarsen, T. SpanMarker with Bert-Base-Cased on FewNERD. Available online: https://huggingface.co/tomaarsen/span-marker-bert-base-fewnerd-fine-super (accessed on 13 September 2024).
  37. Li, Z.Z.; Feng, D.W.; Li, D.S.; Lu, X.C. Learning to select pseudo labels: A semi-supervised method for named entity recognition. Front. Inf. Technol. Electron. Eng. 2020, 21, 903–916. [Google Scholar] [CrossRef]
  38. Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A survey on Named Entity Recognition—Datasets, tools, and methodologies. Nat. Lang. Process. J. 2023, 3, 100017. [Google Scholar] [CrossRef]
  39. Nadeau, D.; Turney, P.D.; Matwin, S. Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. In Proceedings of the Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2006, Québec City, QC, Canada, 7–9 June 2006; Proceedings 19. Springer: Cham, Switzerland, 2006; pp. 266–277. [Google Scholar] [CrossRef]
  40. Iovine, A.; Fang, A.; Fetahu, B.; Rokhlenko, O.; Malmasi, S. CycleNER: An unsupervised training approach for named entity recognition. In Proceedings of the The Web Conference 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
  41. Cui, L.; Wu, Y.; Liu, J.; Yang, S.; Zhang, Y. Template-Based Named Entity Recognition Using BART. arXiv 2021. [Google Scholar] [CrossRef]
  42. Palt, K. ICAO Aircraft Codes—Flugzeuginfo.net. 2019. Available online: https://www.flugzeuginfo.net/table_accodes_en.php (accessed on 19 March 2024).
  43. Gacsal, C. airport-codes.csv—GitHub Gist. 2021. Available online: https://gist.github.com/chrisgacsal/070379c59d25c235baaa88ec61472b28 (accessed on 19 March 2024).
  44. Bansard International. Airlines IATA and ICAO Codes Table. 2024. Available online: https://www.bansard.com/sites/default/files/download_documents/Bansard-airlines-codes-IATA-ICAO.xlsx (accessed on 19 March 2024).
  45. OpenAI. ChatGPT (GPT-4). Large Language Model. 2024. Available online: https://openai.com/chatgpt (accessed on 19 March 2024).
  46. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
  47. Time and Date AS. Time Zone Abbreviations—Worldwide List. 2024. Available online: https://www.timeanddate.com/time/zones/ (accessed on 19 March 2024).
Figure 1. Methodology outline for NER data preparation.
Figure 2. Distribution of entities in synthetic training dataset.
Figure 3. Distribution of entities in test dataset.
Figure 4. Test confusion matrix for 50 paragraphs.
Figure 5. Examples of correct entity classification. (a) Example of the model’s ability to distinguish between heading angle and temperature. (b) Example of how the model can correctly identify DISTANCE, even though units are same as ALTITUDE. (c) Example showing vertical distances correctly classified as ALTITUDE.
Figure 6. Examples of frequent misclassifications. (a) Example of misclassification for aircraft engine specifics. (b) Example of misclassification for addresses and telephone numbers. (c) Example of an acceptable misclassification for MANUFACTURER vs. AIRCRAFT.
Table 1. Synthesized overview of entity types in aviation safety analysis.
Entity Category | Unique Count | Entity Examples | Source
MANUFACTURER | 272 | ‘Bell Helicopter’, ‘Aerospatiale’, ‘Piaggio’, ‘Boeing’, ‘EMBRAER’ | flugzeuginfo.net [42]
AIRCRAFT | 2720 | ‘A330-200’, ‘707-100’, ‘DH-115 Vampire’, ‘PA38’, ‘MD11’ | flugzeuginfo.net [42]
AIRPORT | 58,632 | ‘Caffrey Heliport’, ‘Regina General Hospital Helipad’, ‘Wilsche Glider Field’, ‘41WA’, ‘SAN’ | GitHub Gist [43]
CITY | 15,507 | ‘New Brighton’, ‘Hortonville’, ‘Wiesbaden’, ‘Dimapur’, ‘Marietta’ | GitHub Gist [43]
STATE | 1893 | ‘Michigan’, ‘Sipaliwini District’, ‘Saarland’, ‘SO’, ‘QLD’, ‘ME’ | GitHub Gist [43]
COUNTRY | 115 | ‘United States of America’, ‘Ghana’, ‘South Africa’, ‘Brazil’, ‘France’ | GitHub Gist [43]
AIRLINE | 759 | ‘Gulf Air’, ‘Air Moldova’, ‘Etihad Airways’, ‘S7’, ‘NCA’ | Bansard International [44]
DATE | 22,115 | ‘in October of 2022 on the 5th’, ‘02-04-2016’, ‘22/07/2016’, ‘on the 6th of Aug. 2022’ | See Appendix A
TIME | 14,360 | ‘05:37 CST’, ‘0031 UTC + 6’, ‘11:47’, ‘11:56 Central Europe Time’, ‘0945’ | See Appendix A
WAY | 915 | ‘runway 09’, ‘rwy 25R’, ‘taxiway D’, ‘txwy A4’ | See Appendix A
SPEED | 800 | ‘120 kts’, ‘107.2 knots’, ‘514.6 mph’, ‘87 miles per hour’, ‘217 km/h’ | See Appendix A
ALTITUDE | 2650 | ‘12,422 m’, ‘38,800 feet agl’, ‘10,132 ft’, ‘FL210’, ‘5541 m’ | See Appendix A
DISTANCE | 500 | ‘11,981 km’, ‘182.9 miles’, ‘496.7 kilometers’, ‘995 NM’, ‘2000 ft’ | See Appendix A
TEMPERATURE | 1500 | ‘9 degree C’, ‘31 Fahrenheit’, ‘−2.5 F’, ‘54 °F’, ‘25 C’ | See Appendix A
PRESSURE | 900 | ‘101,263.6 Pa’, ‘24.2 inches of mercury’, ‘20.5 inHg’, ‘1117 hPa’, ‘1026.2 millibars’ | See Appendix A
DURATION | 1100 | ‘35 s’, ‘11.6 h’, ‘1.5 h’, ‘9 h’, ‘10 mins’ | See Appendix A
WEIGHT | 700 | ‘56,667 kg’, ‘503,385 pounds’, ‘1200 kilograms’, ‘22,476.3 lbs’, ‘179.1 tonnes’ | See Appendix A
Table 2. Example of entity annotation in generated training sentences.
Template | Sentence | Entity | Entity with BIO
During | During | O | O
a | a | O | O
routine | routine | O | O
surveillance | surveillance | O | O
flight | flight | O | O
{DATE} | on | DATE | B-DATE
  | the | DATE | I-DATE
  | 12th | DATE | I-DATE
  | of | DATE | I-DATE
  | May | DATE | I-DATE
  | 2012 | DATE | I-DATE
at | at | O | O
{TIME} | 0821 | TIME | B-TIME
, | , | O | O
a | a | O | O
{AIRCRAFT_NAME} | 100 | AIRCRAFT | B-AIRCRAFT
  | King | AIRCRAFT | I-AIRCRAFT
  | Air | AIRCRAFT | I-AIRCRAFT
( | ( | O | O
{AIRCRAFT_ICAO} | BE10 | AIRCRAFT | B-AIRCRAFT
) | ) | O | O
operated | operated | O | O
by | by | O | O
{AIRLINE_NAME} | SATA | AIRLINE | B-AIRLINE
  | Internacional | AIRLINE | I-AIRLINE
( | ( | O | O
{AIRLINE_IATA} | S4 | AIRLINE | B-AIRLINE
) | ) | O | O
experienced | experienced | O | O
technical | technical | O | O
difficulties | difficulties | O | O
shortly | shortly | O | O
after | after | O | O
takeoff | takeoff | O | O
from | from | O | O
{AIRPORT_NAME} | Andrews | AIRPORT | B-AIRPORT
  | University | AIRPORT | I-AIRPORT
  | Airpark | AIRPORT | I-AIRPORT
( | ( | O | O
{AIRPORT_SYMBOL} | C20 | AIRPORT | B-AIRPORT
) | ) | O | O
in | in | O | O
{CITY} | Berrien | CITY | B-CITY
  | Springs | CITY | I-CITY
, | , | O | O
{STATE_NAME} | Michigan | STATE | B-STATE
, | , | O | O
{COUNTRY_NAME} | United | COUNTRY | B-COUNTRY
  | States | COUNTRY | I-COUNTRY
  | of | COUNTRY | I-COUNTRY
  | America | COUNTRY | I-COUNTRY
. | . | O | O
Table 3. Test results.
Test Dataset | Precision | Recall | F1 Score
50 paragraphs | 95.34% | 94.62% | 94.78%
848 sentences | 95.59% | 94.90% | 95.08%
Table 4. Percentages of correct predictions for all entities. Bold values represent the greater of two percentages for each entity label.
Entity Label% Correct PredictionsEntity Label% Correct Predictions
50 Paragraphs848 Sentences50 Paragraphs848 Sentences
B-MANUFACTURER82.0088.00 I-MANUFACTURER36.3663.64
B-AIRCRAFT82.8688.10I-AIRCRAFT92.7992.79
B-AIRPORT74.8073.60I-AIRPORT92.7589.12
B-CITY88.1486.02I-CITY88.4185.51
B-STATE96.6795.33I-STATE91.8991.89
B-COUNTRY73.5973.59I-COUNTRY65.9170.46
B-AIRLINE73.0885.90I-AIRLINE75.2085.60
B-DATE80.0087.62I-DATE97.7899.72
B-TIME93.7194.34I-TIME92.0697.20
B-WAY73.4975.90I-WAY88.0688.06
B-SPEED88.0691.05I-SPEED76.1578.90
B-ALTITUDE86.1590.00I-ALTITUDE69.4472.70
B-DISTANCE88.8993.16I-DISTANCE78.0279.12
B-TEMPERATURE100.00100.00I-TEMPERATURE90.3691.57
B-PRESSURE81.8295.46I-PRESSURE70.3780.25
B-DURATION79.3879.38I-DURATION57.4556.74
B-WEIGHT82.0584.62I-WEIGHT60.0060.00
O97.5597.35
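The per-tag figures in Table 4 can be read as per-label accuracy over gold occurrences: for each BIO tag, the fraction of tokens carrying that gold tag that the model predicted correctly. A minimal sketch of that tally (illustrative helper name, not the authors' script) is:

```python
from collections import defaultdict

def per_label_correct(gold, pred):
    """For each gold BIO tag, return the percentage of tokens with that
    gold tag whose predicted tag matches."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {lab: 100.0 * correct[lab] / total[lab] for lab in total}

gold = ["B-STATE", "I-STATE", "O", "B-STATE"]
pred = ["B-STATE", "O", "O", "B-STATE"]
scores = per_label_correct(gold, pred)
# B-STATE: 100.0, I-STATE: 0.0, O: 100.0
```

Under this reading, the low I-MANUFACTURER and I-DURATION figures in Table 4 point to boundary (continuation-token) errors rather than failures to detect the entity at all.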
Table 5. Test results without BIO tagging.

| Test Dataset | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| 50 paragraphs | 96.37% | 96.07% | 96.14% |
| 848 sentences | 96.55% | 96.23% | 96.32% |
Table 6. Comparison of NER models for aviation/aerospace documents.

| Model | Focus Area | Language | Identified Entities | F1 Score |
| --- | --- | --- | --- | --- |
| Chinese IDCNN-BiGRU-CRF [29] | Aviation Safety Reports | Chinese | 1. Aircraft Type, 2. Aircraft Registration, 3. Aircraft Part, 4. Flight Number, 5. Company, 6. Operation, 7. Event, 8. Personnel, 9. City, 10. Date | 97.93% |
| Chinese MFT-BERT [30] | Aerospace Documents | Chinese | 1. Companies & Organizations, 2. Airports & Spacecraft Launch Sites, 3. Vehicle Type, 4. Constellations & Satellites, 5. Space Missions & Projects, 6. Scientists & Astronauts, 7. Technology & Equipment | 86.10% |
| Singapore's SpaCy NER [31] | Aviation Safety Reports | English | 1. Aircraft Model, 2. Aircraft Registration, 3. Manufacturer, 4. Airline, 5. Flight Number, 6. Departure Location, 7. Destination Location | 93.73% |
| Custom BERT NER model [32] | Aviation Safety Reports | English | 1. Failure Mode, 2. Failure Cause, 3. Failure Effect, 4. Control Processes, 5. Recommendations | 33% * |
| aeroBERT-NER [33] | Aerospace Design Requirements | English | 1. System, 2. Organization, 3. Resource, 4. Values, 5. Date/Time | 92% |
| Aviation-BERT-NER | Aviation Safety Reports | English | 1. Manufacturer, 2. Aircraft, 3. Airport, 4. City, 5. State, 6. Country, 7. Airline, 8. Date, 9. Time, 10. Way, 11. Speed, 12. Altitude, 13. Distance, 14. Temperature, 15. Pressure, 16. Duration, 17. Weight | 94.78% |

* Score calculation excludes "O" label.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chandra, C.; Ojima, Y.; Bendarkar, M.V.; Mavris, D.N. Aviation-BERT-NER: Named Entity Recognition for Aviation Safety Reports. Aerospace 2024, 11, 890. https://doi.org/10.3390/aerospace11110890
