Article

A Named Entity Recognition Method for Chinese Vehicle Fault Repair Cases Based on a Combined Model

1 School of Automotive Engineering, Wuhan University of Technology, Wuhan 430070, China
2 Technical Development Center, Shanghai Automotive Industry Corp General Motors Wuling Automobile Co., Ltd., Liuzhou 545007, China
3 School of Automotive and Intelligent Manufacturing, Shaoyang Polytechnic, Shaoyang 422000, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2025, 14(7), 1361; https://doi.org/10.3390/electronics14071361
Submission received: 1 February 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025

Abstract: This paper addresses the inefficiency of manually screening fault knowledge in Chinese vehicle repair cases and proposes a NER method based on a combined model, aimed at efficiently extracting automotive fault knowledge entities from unstructured vehicle repair case texts. First, the data characteristics of historical vehicle repair cases are analyzed; in response to issues such as the misuse and overuse of punctuation and redundant text, long-text segmentation rules are designed, and text classification is performed using the Text-CNN method. Second, to address the low recognition accuracy of traditional methods for non-continuous and nested entities, a BERT-BiLSTM-CRF model is used to independently recognize entity categories and relationships; an entity relationship matching database is constructed, and methods and algorithms for non-continuous entity combination are designed. Finally, named entity fusion based on text similarity is employed to recognize automotive fault knowledge entities in vehicle repair case data. The results demonstrate that this method can effectively identify named entities related to automotive fault knowledge in Chinese vehicle fault cases.

1. Introduction

In recent years, the strong promotion of intelligent vehicle fault diagnosis platforms by automotive companies, along with the standardized management of vehicle fault data by their after-sales service departments, has led to the accumulation of a large volume of historical vehicle repair case data. These unstructured data contain substantial professional knowledge related to vehicle fault diagnosis [1]. Effectively utilizing this knowledge to serve the after-sales service process has become one of the current hot research directions. Named entity recognition (NER) is a commonly used fundamental technique for processing natural-language text data [2]; it can effectively extract specific textual entities, such as fault phenomena and fault causes, from historical vehicle fault repair case data. In the development of intelligent online/offline and remote diagnostic platforms for vehicles, a key task is the extraction of rich vehicle fault diagnosis knowledge. Therefore, mining diagnostic knowledge from historical repair case data through technical methods holds significant research value.
Current research on NER is relatively mature. In terms of application domains, NER technology has gradually become an important means for rapidly acquiring knowledge across various fields. For example, in the field of biology, NER is commonly used to extract entities such as diseases and prescriptions, along with their relationships, from medical records [3]. Pooja et al. [4] proposed a NER method that integrates transfer learning and multi-task models, which effectively improves the accuracy of named entity recognition in biomedical datasets, though it heavily relies on annotated texts for predefined named entities. Košprdić et al. [5] introduced a zero-shot and few-shot NER approach in the biomedical domain to address these challenges. In the news domain, NER is often used to extract key information such as time, events, and locations from news articles. Tu et al. [6] proposed a deep learning-based NER model that can effectively identify new words in news texts; however, the model lacks the ability to recognize polysemous entities. To tackle this issue, Ren et al. [7] proposed a new model based on multi-level semantic feature fusion and a keyword dictionary. In the automotive domain, NER is commonly applied to identify terms related to automobiles. Ding et al. [8] developed a multi-view joint feature embedding model for entity recognition in Chinese automotive reviews, and to further enhance model performance, Park et al. [9] proposed an effective NER model based on domain adversarial training and multi-task learning. Moreover, NER has broad applications in other fields such as law, finance, and geography. From a developmental perspective, NER can be traced back to the 1950s, when it was primarily used in academic and medical contexts. Its application expanded to the news domain in the 1980s, but the term “named entity recognition” was not formally introduced until 1996. 
Early NER systems largely relied on expert-defined rules and dictionaries, which required no training and were easy to implement. With the advent of machine learning, NER developed rapidly, and in recent years, the emergence of deep learning, neural networks, and large models has led to significant breakthroughs. For instance, Affi M et al. [10] proposed a NER method that combines the BERT and ELMO models, and Gao et al. [11] introduced a RoBERTa-BiLSTM-CRF-based NER approach. Both methods have effectively enhanced the performance of traditional BERT-based entity recognition, although they still struggle with the issue of missing inherent semantic information. Ni et al. [12] proposed an entity recognition method that integrates radical features with the BERT model, which effectively improves entity recognition performance.
Thanks to the rise of large models, NER has experienced rapid development across various fields, yet its practical application still faces many challenges, such as nested entities, discontinuous entities, and low-resource data [13,14]. Consequently, scholars have conducted extensive research on these issues. To address the difficulty of recognizing nested entities in NER tasks, Yao et al. [15] proposed a bidirectional context-aware network specifically for nested NER, which optimizes the recognition of nested entities but cannot capture the interdependencies among different entity types. To overcome this limitation, Han et al. [16] proposed a new NER method based on a multi-layer CRF segmentation-aware relational graph convolutional network architecture. In response to the challenges of recognizing nested entities and ambiguous entity boundaries, Huang et al. [17] presented an approach based on the entity category enhanced NER model that effectively improves the recognition accuracy of nested named entities. Chen et al. [18] proposed a nested NER method based on word fusion and span detection; however, these models fail to effectively address the issue of class imbalance in nested entity recognition. To mitigate the negative impact of class imbalance on NER tasks, Liu et al. [19] introduced a method that integrates a flat NER module with a candidate span classification module. For the challenge of recognizing discontinuous entities, Yan et al. [14] proposed a method that fuses high-frequency words with syntactic context, and experimental results indicate that this method demonstrates excellent performance. Regarding the low accuracy of NER under low-resource conditions, Chen et al. [20] proposed a NER method based on a cross-modal generation module, Hang et al. [21] introduced a NER approach based on prompt words that integrate prompt learning concepts, and Zhan et al. [22] developed the LELNER model, which includes an information interaction module and an information fusion network, thereby improving model performance under low-resource language conditions. However, these methods often pay little attention to specific entity information when constructing prompts, making it difficult for the model to understand the semantic relationship between labels and entities. To address this, Yang et al. [23] combined entity-guided masked prompt techniques with a multi-similarity matching mechanism to propose a novel solution for low-resource NER tasks.
Despite the remarkable success of NER across various industries, research specifically focused on the automotive fault domain remains relatively scarce. The reasons for this include the following: first, the limited availability of publicly available corpora on vehicle faults; and second, inherent challenges in processing Chinese natural language texts. These challenges include (1) the uneven length of Chinese texts, where some longer texts have entities that are widely separated and contain nested entities, resulting in suboptimal entity recognition accuracy; and (2) most Chinese corpora are manually collected and entered, leading to issues such as polysemy and vague expressions in some entities, which complicates the entity alignment process and hinders the formation of structured knowledge. This paper addresses the above issues by combining grammar rules, text classification, and NER to develop a method suitable for extracting entities from unstructured historical vehicle repair case data. Section 2 introduces the data source, analyzes the textual features of the original data, and establishes the NER process. Section 3 designs long-text segmentation rules based on syntactic rules and performs text classification using the Text-CNN model to remove redundant text from the original data. Section 4 uses the deep learning model BERT-BiLSTM-CRF to recognize part entities and failure mode entities while constructing an entity relationship matching database and designing a non-continuous entity combination method and entity combination algorithm. Section 5 analyzes entity features and designs the entity alignment process, completing the named entity alignment task based on text similarity. Section 6 provides the conclusion.

2. Data Sources and Analysis

2.1. Data Sources

The data utilized in this study comprise historical maintenance case records for a specific vehicle model, documented by the after-sales department of an automobile company from January 2021 to December 2022. A total of 45,613 records were collected, as illustrated in Figure 1. The data collection process includes maintenance records from 4S dealerships, as well as business data generated from assistance and claims procedures. These records encompass multiple labels, including service station number, vehicle model, VIN, purchase date, repair date, mileage, fault description, handling results, and names of replaced parts.

2.2. Data Analysis

A portion of the labels from the original fault maintenance case data is presented in Table 1. It is evident that only the “Fault description”, “Handling results” and “Replaced part name” labels contain fault diagnosis knowledge. The “Fault description” label provides a textual description of the vehicle’s fault condition, the “Handling results” label generally describes the actual repair process, and the “Replaced part name” label specifies the name of the part that needs to be replaced after diagnosing the fault cause. Both the “Fault description” and “Handling results” labels consist of unstructured data.

2.2.1. Text Feature Analysis

The historical fault maintenance case data for this vehicle contain a significant number of automotive domain-specific terms, such as vehicle structure, fault causes, and fault phenomena. Upon examining the original data, the “Fault description” label, when segmented by punctuation marks, can be categorized into two main groups, “Fault phenomenon field” and “Other fields”. The “Handling results” label contains numerous sentences, which, after segmentation, can be classified into the following four categories: “Fault cause field”, “Fault phenomenon field”, “Maintenance method field”, and “Other text fields”. Therefore, the key tasks for NER in this study are first to accurately extract the “Fault phenomenon field” from the “Fault description” label and the “Fault cause field” from the “Handling results” label; and second, to extract entities such as “Fault phenomenon location”, “Fault phenomenon”, “Fault location”, and “Fault cause” from the “Fault phenomenon” and “Fault cause” fields.
Analyzing the “Fault description” and “Handling results” labels in the maintenance case data reveals that the following five key issues must be addressed to accomplish the aforementioned tasks: (1) both labels exhibit mixed and excessive use of punctuation marks, and some lengthy texts lack any punctuation, rendering direct segmentation based on punctuation ineffective; (2) in addition to the “Fault phenomenon” and “Fault cause” fields, the “Handling results” text also contains highly similar “Fault phenomenon” and “Maintenance method” fields, making it challenging for traditional NER techniques to efficiently recognize entities; (3) the data are manually collected, resulting in missing “Fault description” and “Replaced part name” labels in some cases, although the corresponding fields exist within the “Handling results” label; (4) the target entities are specialized automotive terms, and direct segmentation using tools like Jieba or pkuseg yields suboptimal results; (5) the texts contain “discontinuous entities” and “nested entities”, such as “steering wheel noise” and “left rear wheel and right rear wheel hub bearing damage”.

2.2.2. Text Processing Methods

To address the aforementioned issues, this study designs corresponding solutions. For issues (1) and (2), a set of rules suitable for segmenting long texts is developed. These rules are combined with the Text-CNN model to further categorize the segmented long texts, thereby preventing confusion in entity recognition across different field categories, enhancing the accuracy of entity recognition and accelerating the convergence of the recognition process. For issue (3), the “Fault description” and “Maintenance method” fields obtained after processing with the Text-CNN model are supplemented to the originally missing labels in the dataset, thereby enriching the original data. To tackle issues (4) and (5), the BIOS sequence labeling method is employed to annotate the location and failure mode entities within the label texts. This is integrated with the BERT-BiLSTM-CRF model to perform entity recognition, thereby resolving the problem of nested named entities. Additionally, a named entity relationship matching database is constructed based on the continuous entities in the original data to assist in the recognition of discontinuous named entities. In summary, the NER process is outlined in Figure 2.

3. Text Segmentation and Classification

Both the fault description and handling result fields contain numerous entries unrelated to the target entities. Directly applying NER to long texts results in low precision for target entity recognition and makes it difficult to extract target entities from a multitude of entities. Therefore, segmenting long texts into shorter texts and classifying these shorter texts facilitates subsequent recognition of target entities.

3.1. Text Segmentation

This subsection designs suitable long-text segmentation rules based on the textual characteristics of the original data [23]. The specific segmentation steps are as follows: (1) check for the presence of punctuation marks such as “,”, “:”, “;”, “，”, “。”, “：”, “；”, and whitespace to perform an initial segmentation of the text; (2) identify the punctuation mark “.” and determine whether to segment based on the types of characters surrounding it. In the original text, “.” serves multiple purposes, indicating both a decimal point and a period. If the characters before and after “.” are both Chinese characters, the text is segmented at this punctuation mark; otherwise, it is not segmented; (3) for lengthy texts lacking punctuation-based segmentation, select specific keywords such as “Caused”, “Replaced”, etc., to reasonably segment the text. An example of long-text segmentation is shown in Table 2. According to the statistics, the original maintenance data contained 954 entries with texts that mixed punctuation marks (accounting for 2.1%) and 13,667 entries with lengthy texts (accounting for 30.0%). Through experiments on text classification using the Text-CNN model and on entity recognition using the BERT-BiLSTM-CRF model, the overall accuracies for these two types of texts improved by 36.7% and 60.2%, respectively, after applying the long-text segmentation rules.
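The three rules above can be sketched as a small Python routine. The punctuation class, the CJK check used for the period rule, and the keyword list are illustrative assumptions, not the paper's exact implementation:

```python
import re

# Illustrative keyword list for rule (3); the paper's actual keywords are Chinese terms
SPLIT_KEYWORDS = ["Caused", "Replaced"]

def segment_long_text(text: str) -> list[str]:
    """Split a maintenance-record text into short segments.

    Rule 1: split on common half- and full-width punctuation and whitespace.
    Rule 2: split on '.' only when it sits between two Chinese characters
            (a full stop), never when it acts as a decimal point.
    Rule 3: for segments still lacking punctuation, split before keywords.
    """
    # Rule 2: break only CJK '.' CJK patterns, protecting decimal points
    text = re.sub(r'(?<=[\u4e00-\u9fff])\.(?=[\u4e00-\u9fff])', '\n', text)
    # Rule 1: any of the listed punctuation marks triggers a split
    parts = re.split(r'[,:;，。：；\s]+', text)
    # Rule 3: zero-width (lookahead) split before each keyword
    pattern = '|'.join(map(re.escape, SPLIT_KEYWORDS))
    segments = []
    for part in parts:
        pieces = re.split(f'(?=(?:{pattern}))', part)
        segments.extend(p for p in pieces if p)
    return segments
```

For example, `segment_long_text("机油1.5升。更换机油")` keeps the decimal point in “1.5” intact while splitting at the full stop.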

3.2. Short-Text Classification

After obtaining short texts through the segmentation steps outlined above, the short texts are further categorized. For example, “Handling results” can be divided into four categories: “Fault cause field”, “Fault phenomenon field”, “Maintenance method field”, and “Other text fields”. “Fault descriptions” can be categorized into “Fault phenomenon field” and “Other fields”. Subsequently, only the key fields need to be identified. Examples of short-text classification for “Fault descriptions” and “Handling results” are presented in Table 3.

3.2.1. Text-CNN Model

Currently, text classification tasks are commonly performed using traditional machine learning and deep learning algorithms. Traditional machine learning algorithms primarily involve two processes: constructing feature engineering and classifiers. The feature engineering process typically extracts text feature values using methods such as TF-IDF or the bag-of-words model, including term frequency and inverse document frequency. Classifiers then determine the text category based on one or multiple input variables, using methods such as decision trees, random forests, support vector machines, and naive Bayes classifiers [24]. This approach requires manual construction of feature engineering, which is costly. In contrast, deep learning algorithms effectively address the challenges of text representation through word embedding techniques and fully utilize neural network models to extract text features, such as BERT, Text-CNN, and Text-RNN models [25]. Among them, the classic Text-CNN model effectively captures local correlation features of text, has a simple structure, and offers fast training speeds. The model structure is illustrated in Figure 3, and its working process is as follows: firstly, the target text is encoded in the embedding layer and transformed into a vector representation; secondly, features of adjacent text are extracted using convolutional kernels in the convolutional layer; subsequently, the pooling layer applies max pooling to the convolutional outputs, thereby reducing the dimensionality of the features; finally, the fully connected layer and the output layer perform mapping and classification based on the pooled results, outputting the probability for each category [26].
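The four stages of that forward pass (embedding, convolution over adjacent tokens, max pooling, fully connected classification) can be illustrated with a minimal NumPy sketch. All dimensions and weights are arbitrary placeholders, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the paper does not report its hyperparameters here
VOCAB, EMB, KERNEL, FILTERS, CLASSES, SEQ = 100, 16, 3, 8, 4, 10

E = rng.normal(size=(VOCAB, EMB))            # embedding table
W = rng.normal(size=(FILTERS, KERNEL, EMB))  # convolution kernels
Wf = rng.normal(size=(FILTERS, CLASSES))     # fully connected layer

def text_cnn_forward(token_ids):
    x = E[token_ids]                                       # embedding layer: (SEQ, EMB)
    # convolutional layer: each kernel slides over windows of adjacent tokens
    conv = np.array([[np.sum(x[i:i + KERNEL] * W[f])
                      for i in range(len(token_ids) - KERNEL + 1)]
                     for f in range(FILTERS)])             # (FILTERS, SEQ-KERNEL+1)
    pooled = conv.max(axis=1)                              # max pooling per filter
    logits = pooled @ Wf                                   # fully connected + output layer
    expz = np.exp(logits - logits.max())
    return expz / expz.sum()                               # per-category probabilities

probs = text_cnn_forward(rng.integers(0, VOCAB, size=SEQ))
```

The output is a probability over the four field categories, matching the model's role in Section 3.2.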

3.2.2. Comparative Experiments of Models

Since the classification results of “Handling results” encompass those of “Fault descriptions”, the two categories of label data can be merged for experimentation. A total of 4000 labeled data entries were selected, and after long-text segmentation, 6114 short texts were obtained. These were manually classified into “Fault phenomena”, “Fault causes”, “Maintenance methods”, and “Other texts”, resulting in sample sizes of 676, 2072, 1438, and 1928 for each category, respectively. The dataset was divided into training, validation, and test sets in a 6:1:1 ratio. The Text-CNN model was employed for category prediction, and commonly used text classification models such as BERT, Text-RNN, and DP-CNN were selected for comparative experiments. Common evaluation metrics in multi-class classification tasks—namely precision, recall, and F1-score—were used as the criteria for assessing classification performance. Due to the imbalance in the number of samples among the different classes, the F1-score on the test set was computed using a weighted average based on class weights [27]. As shown in Table 4, a comprehensive comparison of the evaluation metrics indicates that the Text-CNN model achieved the highest precision and a commendable F1-score. Therefore, the Text-CNN model was selected to perform the text classification task.
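The class-weighted F1-score used above can be computed as in the following plain-Python sketch, equivalent in spirit to scikit-learn's `average='weighted'` option:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1   # weight each class by its share of samples
    return score
```

Weighting by support keeps the minority “Fault phenomena” class (676 samples) from being drowned out by an unweighted macro average.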

4. Entity Recognition

Analysis of the original text reveals that some target entities are non-continuous or nested entities, as shown in Figure 4, and direct entity extraction performs poorly. Non-continuous entities refer to named entities that are not sequentially positioned within the text, meaning that their constituent parts are not directly adjacent but are separated by other unrelated entities or symbols. Nested entities, on the other hand, refer to named entities that contain other named entities within them.
Non-continuous and nested entities generally exhibit the following characteristics: (1) Non-continuous entities often omit overlapping parts of the fault location, fault phenomenon, or fault cause in the text. For example, Figure 4a hides the entity “steering wheel”. Due to the misuse of punctuation in the original text and the presence of long-distance non-continuous entities, it is not possible to identify non-continuous entities using punctuation alone. Figure 4b illustrates the incorrect combination of part entities and failure mode entities. (2) The primary form of nested entities includes overlapping part entities and phenomenon entities within the “Fault phenomenon”, and part entities and cause entities within the “Fault cause”, as shown in Figure 4c, where “steering wheel” is the overlapping part. Nested entities generally belong to two-layer nesting and exist both in continuous and non-continuous entities. In continuous entities, nested entities exhibit a clear subject–predicate structure and are easier to identify. However, for nested entities in non-continuous entities, non-continuous entity recognition should be prioritized, and these entities should be converted into nested entities within continuous entities.
Based on the above text characteristics, prior to performing NER, part and failure mode entities in the input data are manually labeled, and part and corresponding failure mode entities are extracted from all single fields. An entity-failure mode relationship matching database is constructed to facilitate the identification of non-continuous and nested entities, thereby improving recognition accuracy for such entities.

4.1. BERT-BiLSTM-CRF Model

This section employs a deep learning model for the NER task. To better mitigate the impact of nested and non-continuous entities on recognition accuracy, a BERT-BiLSTM-CRF model based on single-character input is selected. The model framework is shown in Figure 5, consisting of the BERT layer, BiLSTM layer, and CRF layer [28]. In operation, the BERT layer first splits the original text sequence (Fault phenomena, Fault cause fields) into individual characters and maps these characters into vector representations using the BERT model, retaining rich semantic features of the text. These are then passed to the BiLSTM layer, where the BiLSTM network further learns the contextual features of the input text and calculates the probability of each named entity belonging to different categories. Finally, the CRF layer applies learning constraints to adjust the results of the BiLSTM layer, producing the final sequence of named entity labels. The model parameters are as follows: Dropout: 0.5; Learning rate: 2 × 10^−5; Epochs: 40; Number of attention heads: 8; Optimizer: AdamW.
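The CRF layer's role, applying learned tag-transition constraints on top of the BiLSTM's per-character scores, can be illustrated with a minimal Viterbi decoder. The emission and transition scores below are toy values, not trained parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most likely tag sequence given per-token emission scores (from the
    BiLSTM layer) and tag-transition scores (learned by the CRF layer)."""
    n_tokens, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_tokens, n_tags), dtype=int)
    for t in range(1, n_tokens):
        # score of arriving at each tag j from every previous tag i
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

With a transition matrix that heavily penalizes tag switches, the decoder overrides a greedy per-token choice and emits a consistent sequence, which is exactly how the CRF layer corrects implausible label patterns such as an “I-” tag without a preceding “B-”.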

4.2. Model Training

4.2.1. Text Sequence Pre-Labeling

Named entity recognition involves classifying words in natural language text. Before training the model, each word in the text sequence must be manually labeled with a specific entity tag. The mainstream text sequence labeling methods include BIO, BIOS, BIOE, and BIOES [29]. Among these, “B” indicates the beginning of the target entity, “I” represents the inside of the target entity, “O” marks non-target entities, “E” marks the end of the target entity, and “S” indicates that the target entity consists of a single word. Since the “Fault phenomenon” field contains single-character entities like “Noise”, “Light”, “Tilted”, and “High”, the BIOS method is used to label the text sequence, with suffixes “-P” and “-S” added to “B”, “I”, and “S” to represent part and failure mode entities, respectively. For instance, the original text “The customer reported to the shop that the tire pressure light was on. Upon inspection by the technician, it was found that a short circuit in the tire pressure sensor caused the tire pressure light to illuminate. The tire pressure sensor was replaced” is labeled as shown in Table 5.
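The BIOS scheme with “-P”/“-S” suffixes can be sketched as a small labeling helper; the span boundaries passed in are assumed to come from manual annotation:

```python
def bios_tags(text, spans):
    """Character-level BIOS tags. `spans` maps (start, end) -> 'P' (part entity)
    or 'S' (failure mode entity); `end` is exclusive."""
    tags = ["O"] * len(text)          # non-target characters stay "O"
    for (start, end), kind in spans.items():
        if end - start == 1:
            tags[start] = f"S-{kind}"  # single-character entity
        else:
            tags[start] = f"B-{kind}"  # beginning of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{kind}"  # inside of the entity
    return tags
```

For the text “胎压传感器短路” (tire pressure sensor short circuit), annotating “胎压传感器” as a part and “短路” as a failure mode yields the tag sequence B-P I-P I-P I-P I-P B-S I-S.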

4.2.2. Comparative Experiment

A set of 6000 fault phenomenon and fault cause data after text classification processing is selected. Text sequences are manually labeled using the BIOS method and divided into training, validation, and test sets with an 8:1:1 ratio. The BERT-BiLSTM-CRF model is compared with other models, such as BERT-CRF, BiLSTM-CRF, and Word2vec-BiLSTM-CRF. The precision, recall, and F1-score metrics are used to evaluate the recognition performance of each model. As the number of entities is not balanced, a weighted average method is used to calculate the F1-score for the test set. Table 6 presents the experimental results for each model. It is evident that the BERT-BiLSTM-CRF model achieved the highest recognition accuracy. Compared with the BERT-CRF and BiLSTM-CRF models, the inclusion of a bidirectional LSTM network to capture contextual semantic features improved the accuracy by 9.5%. Furthermore, the addition of a Transformer architecture for representing textual information increased the accuracy by 7.03%. Relative to the Word2vec-BiLSTM-CRF and FLAT models, the accuracy improvements were 2.45% and 1.44%, respectively. Although its performance metrics were similar to those of the RoBERTa-wwm-ext model, the BERT-BiLSTM-CRF model demonstrated relatively superior accuracy. To enhance subsequent entity combination performance and reduce manual proofreading workload, this study adopted the BERT-BiLSTM-CRF model for the named entity recognition task.

4.3. Entity Combination

Due to the presence of nested and non-continuous entities, after completing named entity recognition, the entities need to be combined according to specific rules to output the complete target entities, “Fault phenomenon” and “Fault cause”. To improve the accuracy of entity combination, an entity relationship matching database is constructed, and based on this database, a process and algorithm for combining non-continuous entities are designed.

4.3.1. Entity Relationship Matching Database

By observing the fault phenomenon and fault cause fields, the following rules can be summarized: (1) Non-continuous fields may not contain entity matching relationships, such as “steering wheel noise, unable to charge” where “steering wheel” and “unable to charge” do not match. (2) Continuous fields inherently contain entity matching relationships, such as “steering wheel noise”, where “noise” is one of the failure modes of the “steering wheel”. Based on these rules, part entities and failure mode entities are extracted from all continuous fields, and failure modes are categorized by part labels, thereby completing the construction of the entity relationship matching database. This process results in 3771 “Phenomenon part-phenomenon” entities and 6296 “Fault part-cause” entities, some of which are shown in Table 7.
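Constructing the matching database amounts to grouping failure modes under their part labels. A minimal sketch follows, with toy pairs standing in for the 3771 and 6296 entity pairs extracted from continuous fields:

```python
from collections import defaultdict

def build_matching_db(continuous_pairs):
    """Build the part -> failure-mode matching database from
    (part entity, failure mode entity) pairs found in continuous fields."""
    db = defaultdict(set)
    for part, failure_mode in continuous_pairs:
        db[part].add(failure_mode)   # categorize failure modes by part label
    return db

def entities_match(db, part, failure_mode):
    """True if the database records this failure mode for this part."""
    return failure_mode in db.get(part, set())
```

Querying this database is what later allows “steering wheel” + “noise” to be accepted while “steering wheel” + “unable to charge” is rejected.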

4.3.2. Non-Continuous Entity Combination

To address the issue of non-continuous entity combination, the previously constructed entity relationship matching database, which contains only “part + failure mode” entities, is used to identify non-continuous entities and improve their recognition accuracy. The non-continuous entity combination process is shown in Figure 6.

4.3.3. Entity Combination Algorithm

After completing named entity recognition, all entities are arranged in the order they appear in the original text. Upon observing the original text, the following syntactic rules are identified: Rule 1: part entities always precede failure mode entities; Rule 2: if there is a failure mode entity between part entity A and C, then C and B generally have no matching relationship and do not need to be combined. Similarly, failure mode entities to the right of A and C also have no matching relationship; Rule 3: For non-continuous entities, whether part entity A and the adjacent failure mode entity B can be combined requires querying the entity relationship matching database. Based on these three syntactic rules, an entity combination algorithm is designed, with pseudocode shown in Algorithm 1 (the pseudocode is applicable when the first entity in the text is a part entity; if the first entity is a failure mode entity, it first outputs and removes the failure mode entity and then executes the pseudocode).
Algorithm 1 Entity combination algorithm based on syntactic rules
Input:  Part entity list L1, failure mode entity list L2, all entity list L3,
  entity relationship matching dictionary Dict = {x1: “y1, y2, …”, x2: “z1, z2, …”, …},
  where x represents a part, and y, z represent failure modes
Output: Combined entity list L4 of “part entity + failure mode entity” form
  01: L4 ← [], stack ← ∅; part_flag, sym_flag ← 0 //Initialize entity list, stack, and flags
  02: for i1 in L1 do //Traverse the part entity list L1
  03: for i2 in L3[L3.index(i1) + 1 : n − 1] do //Traverse all entities in list L3 after element i1
  04:  if i2 ∈ L2 then //Check if i2 is a failure mode entity
  05:   part_flag ← 1 //Mark that a failure mode entity has been encountered
  06:   stack.push(i2) //Push the failure mode entity i2 onto the stack
  07:   if part_flag == sym_flag then
  08:    break
  09:   end if
  10:  else
  11:   sym_flag ← 1 //Mark that the next part entity has been reached
  12:   if part_flag == sym_flag then
  13:    break
  14:   end if
  15:  end if
  16: end for
  17: for j ∈ stack do //Process the stacked failure mode entities
  18:  stack.pop(j) //Pop the top element j from the stack
  19:  if j ∈ Dict[i1] then //Check if failure mode j matches part entity i1
  20:   L4.append(i1 + j) //Combine the part entity and failure mode entity, and store in L4
  21:  end if
  22: end for
  23: stack ← ∅; part_flag, sym_flag ← 0 //Reset stack and flags
  24: end for
  25: return L4 = [x1 + yi, x1 + zi, …] //Return the result
  26: end
This entity combination algorithm can effectively recognize non-continuous entities with long distances and mixed punctuation. Four typical text samples, A, B, C, and D, were selected for entity recognition and combination, and the results are shown in Figure 7.
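Under the three syntactic rules, the combination step can also be rendered as a compact runnable sketch. This is a simplified reading of Algorithm 1 (the explicit stack and flag bookkeeping are folded into a single rightward scan), and all example entities and the matching dictionary are illustrative:

```python
def combine_entities(entity_seq, part_set, failure_set, match_db):
    """Combine part and failure-mode entities per the three syntactic rules.

    entity_seq: recognized entities in their original text order
    match_db:   part -> set of compatible failure modes (Section 4.3.1)
    """
    combined = []
    # Rule 1 is implicit: parts precede failure modes, so we only scan rightward
    for idx, ent in enumerate(entity_seq):
        if ent not in part_set:
            continue
        seen_failure = False
        for nxt in entity_seq[idx + 1:]:
            if nxt in failure_set:
                seen_failure = True
                # Rule 3: validate each candidate pair against the database
                if nxt in match_db.get(ent, set()):
                    combined.append(f"{ent} {nxt}")
            elif seen_failure:
                break  # Rule 2: stop at the next part entity once a failure mode was seen
    return combined
```

On the text “steering wheel noise, unable to charge”, the database check rejects the spurious pair, while the discontinuous example of Figure 4b pairs both wheel entities with the shared failure mode.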

4.4. Entity Recognition and Performance Analysis

4.4.1. Entity Recognition Experiment

After text segmentation and classification, 45,613 “Fault phenomenon” and “Fault cause” fields are processed. The BERT-BiLSTM-CRF model computes the entity categories for each entity, and the entity combination method described above is applied to obtain 25,940 valid “Fault phenomenon–fault cause” entity pairs. These pairs include “Phenomenon part-phenomenon” and “Fault part-cause” entities, resulting in a structured “Fault Phenomenon–fault cause” named entity dataset.

4.4.2. Performance Analysis

A total of 600 data samples were selected, consisting of 300 samples containing only nested entities and 300 samples containing both nested and non-continuous entities, drawn from the fault description and handling result fields, respectively. The long-text segmentation and short-text classification (hereinafter referred to as text classification, with the model selected as Text-CNN) from Section 2 and Section 3, the entity recognition model (BERT-BiLSTM-CRF), and the entity combination algorithm (hereafter referred to as the entity matching library) were integrated to complete the entity recognition task. Using the traditional approach of a single entity recognition model as a baseline for comparison, three different model combinations were set up (with variables being the presence or absence of text classification and the entity matching library) to analyze the improvement effect of integrating text classification and the entity matching library on traditional entity recognition methods. Finally, the correctness of the entity recognition results for “fault phenomenon–fault cause” was verified through manual checking, and the experimental statistical results are shown in Table 8.
The comparative results indicate that, relative to traditional entity recognition alone, adding text classification and the entity matching library improved the accuracy of the fault description fields for the two text types by 12.33% and 16.34%, respectively; for the processing result fields, accuracy for the two text types increased by 15.34% and 16.66%, respectively. The main reasons are as follows. The text classification step removed some redundant words, narrowed the scope of entity recognition, and thereby indirectly reduced the error rate of the entity recognition process. In addition, the processing result texts contain more fields, and their non-fault-cause fields include more technical terms, so text classification is more accurate for them than for the fault description texts. After incorporating the entity relationship matching database, entity recognition accuracy for texts containing non-contiguous entities increased by 5% (fault description) and 4.33% (processing results). This improvement is smaller than that brought by text classification alone, mainly because the base number of correctly identified named entities is not large and incorrect matching of entity information reduced accuracy. Furthermore, owing to the limited number of experimental samples, the accuracy for nested texts did not change. In summary, after integrating text classification and the entity relationship matching library, the average entity recognition accuracy of the integrated model improved by 15.17% over the traditional single entity recognition model.

5. Entity Alignment

Entity alignment is the process of further standardizing structured entities to form a logical, hierarchical system of knowledge. Its fundamental steps generally comprise entity alignment and entity fusion. The former encompasses coreference resolution and entity disambiguation: coreference resolution merges multiple descriptions of the same entity from various sources, while entity disambiguation distinguishes between the multiple meanings of a single entity. Entity fusion links all identical entities to the correct standardized entity objects.

5.1. Entity Feature Analysis

The original maintenance case data consist of manually entered unstructured text, which leads to ambiguous naming of the same entity and thus to synonymous multi-word structured entities that require alignment. For the location and failure mode entities, this ambiguity primarily manifests as abbreviations and variants. For example, component entities such as [ADAS module assembly, ADAS control module, ADAS module] and [Battery, 12 V battery, 12 V accumulator] are standardized to “ADAS module assembly” and “12 V battery”, respectively, while failure mode entities such as [circuit break, internal circuit break, circuit break fault] and [no effect, malfunction, failure] are standardized to “circuit break” and “failure”. To resolve this ambiguity, this section employs multiple similarity measures to calculate text similarity and merge synonymous entities, supplemented by a degree of manual review to ensure accuracy, thereby completing the entity alignment task.

5.2. Text Similarity

A key task in entity alignment is calculating the similarity between texts. Commonly used methods fall into two categories. The first calculates the literal (surface-form) similarity of the text directly [30], using techniques such as cosine distance, Jaccard distance, edit distance, and Euclidean distance; these methods are straightforward and convenient. The second uses model-based approaches such as Word2vec, FastText, SimCSE, and BERT [31], in which word vectors endow the text with semantic features; however, these methods require substantial manual annotation of the text, incurring high time and labor costs. The entities extracted in this study, particularly the location and failure mode entities, are predominantly specialized terms with high lexical similarity, so the first category of similarity measures is adopted for entity alignment. In addition, the Chinese synonym database Synonyms is used to capture semantically similar entities, compensating for the limited semantic understanding of lexical similarity methods. This section employs cosine similarity, the Jaccard similarity coefficient, edit distance similarity, and Synonyms similarity, using their average as the final text similarity value (see Section 5.4).
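A minimal sketch of this averaging scheme follows, with character-level implementations of the three literal measures. The Synonyms database score is passed in as a parameter here, since it requires the external Synonyms package; all function names are illustrative.

```python
# Character-level implementations of the literal similarity measures
# averaged in this section (sketch, not the authors' implementation).
from collections import Counter
import math

def cosine_sim(a, b):
    va, vb = Counter(a), Counter(b)
    dot = sum(va[c] * vb[c] for c in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def edit_sim(a, b):
    # Levenshtein distance (single-row DP) normalized to a [0, 1] similarity.
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return 1 - d[n] / max(m, n) if max(m, n) else 1.0

def text_similarity(a, b, synonyms_sim=0.0):
    """Average of the four measures; synonyms_sim stands in for the
    score obtained from the Synonyms database."""
    return (cosine_sim(a, b) + jaccard_sim(a, b)
            + edit_sim(a, b) + synonyms_sim) / 4

print(round(text_similarity("ADAS模块总成", "ADAS模块"), 3))
```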

5.3. Entity Alignment Methods

Analysis of the original data reveals the following characteristics: (1) entities describing similar parts or failure modes generally correspond to the same part name entity, and (2) the number of failure modes per part entity is typically fixed. In the original repair case data, the “Replaced part name” is an already aligned entity; based on characteristics (1) and (2), the part entities can be preliminarily clustered using it, and once the part entities are aligned, the failure mode entities can be clustered analogously using the aligned part entities. The resulting entity alignment method is shown in Figure 8. The process is as follows. First, the “Replaced part name” entity is used to preliminarily cluster the part entities into a set Si; the parts in Si are strongly associated with the corresponding part name entity and, from the perspective of automotive structure, should constitute that part’s components. Next, the similarity value (sim) between each pair of elements in the set is calculated. Pairs with a similarity above a predefined threshold (sim > sim1) are treated as the same entity and aligned directly, while pairs with similarity between sim2 and sim1 (1 > sim1 > sim2 > 0) are manually reviewed before alignment. The same procedure, with part entities replaced by failure mode entities, completes the alignment of the failure mode entities. Finally, the standardized part and failure mode entities are aligned with each other. The optimal threshold parameters for this dataset are sim1 = 0.8 and sim2 = 0.7 (see Section 5.4).
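The two-threshold logic of this process can be sketched as follows. Here `char_jaccard` is a crude stand-in for the averaged similarity of Section 5.2, and the cluster contents are illustrative; both names are assumptions for this sketch.

```python
def align_cluster(entities, sim_fn, sim1=0.8, sim2=0.7):
    """Within one clustered entity set S_i, merge pairs with sim > sim1
    automatically and queue pairs with sim2 < sim <= sim1 for manual
    review. Returns (auto_merged_pairs, manual_review_pairs)."""
    merged, review = [], []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            s = sim_fn(entities[i], entities[j])
            if s > sim1:
                merged.append((entities[i], entities[j]))
            elif s > sim2:
                review.append((entities[i], entities[j]))
    return merged, review

def char_jaccard(a, b):
    # Stand-in similarity: Jaccard overlap of character sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

cluster = ["ADAS module assembly", "ADAS module", "ADAS control module"]
merged, review = align_cluster(cluster, char_jaccard)
# Under this crude measure, both near-duplicate pairs land in the
# manual-review band rather than being merged automatically.
print(merged, review)
```

The thresholds trade automation against accuracy: raising sim1 routes more borderline pairs to manual review, which is exactly the effect measured in Table 9.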

5.4. Experimental Results

For this experiment, 1000 data samples were selected, and the following quantities were recorded under different threshold settings: the number of aligned entities, the number of entities pending manual review, the accuracy rate, the number of entity categories before and after alignment, and the alignment rate. The alignment results under the various threshold conditions are presented in Table 9. Here, the accuracy rate is the proportion of correctly aligned entities, and the alignment rate is the reduction in the number of entity categories from before to after alignment, divided by the number of categories before alignment. To maximize the accuracy of aligned entities while minimizing the number of entities pending review, the optimal threshold parameters are selected as sim1 = 0.8 and sim2 = 0.7.
Under these threshold conditions, the four similarity calculation methods were applied to compute the text alignment accuracy, as shown in Table 10, where “Average” and “Max” denote the average and maximum of the four methods, respectively. Using the maximum similarity yields the lowest accuracy, likely because the task involves multi-text alignment and taking the maximum tends to align similar but distinct entities. The Synonyms method also exhibits lower accuracy, primarily due to the limited coverage of the synonym database. Based on these experimental results, the average of the four methods is chosen as the similarity metric.
Following the entity alignment process outlined in this study, the original 25,940 fault phenomenon–fault cause diagnostic historical case data were processed, achieving named entity alignment for fault phenomena and fault causes. A total of 4375 location entities and 3939 failure mode entities were obtained. After further entity alignment, the total number of location entity categories was reduced by 30.9%, and the total number of failure mode entity categories was reduced by 25.7%, effectively completing the named entity extraction task for unstructured automotive maintenance case data.

6. Conclusions

Because the original maintenance case data stored by the enterprise’s after-sales department are unstructured, they cannot be used directly for fault diagnosis. This study proposes a method for extracting automotive fault diagnosis knowledge based on text classification, named entity recognition, and knowledge alignment. The main issues addressed and experimental conclusions are as follows. (1) To address the mixed and misused punctuation and redundant text in the data, a long-text segmentation rule was designed, and long-text classification was completed with the Text-CNN model; the control experiments show that its classification performance exceeds that of other traditional models, with a classification accuracy of 91.82%. (2) To improve the recognition accuracy of discontinuous and nested entities, the deep learning model BERT-BiLSTM-CRF was adopted to independently identify the location entities and failure mode entities; the control experiments show that its recognition performance exceeds that of other traditional models, with a recognition accuracy of 91.06%. (3) To improve the alignment accuracy of discontinuous entities, an entity relationship matching database was constructed, and a discontinuous entity combination method and algorithm were designed; experimental testing shows average recognition accuracies of 91% for nested entities and 77% for discontinuous entities. Compared with the traditional named entity recognition approach, integrating text classification and the entity relationship matching library improves recognition accuracy by 13.84% and 16.50% on the two text types. In summary, the proposed combined-model named entity recognition method can effectively extract automotive named entities from unstructured vehicle fault repair case texts.
Future work will further optimize the model performance as follows: (1) Integrate fault code data recorded during vehicle failures and employ association rule mining and other algorithms to analyze the complex coupling of fault codes, investigate their coupling logic, and perform fault code decoupling. (2) Building on this study, incorporate related knowledge such as vehicle structures and modules to assist automotive companies in constructing and enhancing automotive fault diagnosis knowledge graphs. (3) Collaborate with industry partners to access vehicle freeze frame data, data streams, and messages, and leverage large pre-trained models to design a more comprehensive vehicle fault diagnosis solution.

Author Contributions

Methodology, H.G.; Writing—review and editing, H.Q.; Supervision, J.H.; Software, W.H. and H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Major Project, grant number 2023AA03009.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Huangzheng Geng was employed by the company General Wuling Automobile Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NER: Named entity recognition
BERT: Bidirectional encoder representations from Transformers
BiLSTM: Bi-directional long short-term memory
CRF: Conditional random field
CNN: Convolutional neural network
ELMO: Embeddings from language models
Text-CNN: Convolutional neural network for sentence classification
Text-RNN: Recurrent neural network for sentence classification
RoBERTa: Robustly optimized BERT pretraining approach
LELNER: Lightweight and effective low-resource named entity recognition
TF-IDF: Term frequency–inverse document frequency
DP-CNN: Deep pyramid convolutional neural network
SimCSE: Simple contrastive learning of sentence embeddings
Word2vec: Word to vector

References

  1. Priyankar, B.; Sriram, S.; Sleeman, W.C.; Palta, J.; Kapoor, R.; Ghosh, P. A Survey on Recent Named Entity Recognition and Relationship Extraction Techniques on Clinical Texts. Appl. Sci. 2021, 11, 8319.
  2. Yeung, A.J.; Shek, A.; Searle, T.; Kraljevic, Z.; Dinu, V.; Ratas, M.; Al-Agil, M.; Foy, A.; Rafferty, B.; Oliynyk, V.; et al. Natural language processing data services for healthcare providers. BMC Med. Inform. Decis. Mak. 2024, 24, 356.
  3. Goyal, N.; Singh, N. Named entity recognition and relationship extraction for biomedical text: A comprehensive survey, recent advancements, and future research directions. Neurocomputing 2025, 618, 129171.
  4. Pooja, H.; Jagadeesh, M.P. A Deep Learning Based Approach for Biomedical Named Entity Recognition Using Multitasking Transfer Learning with BiLSTM, BERT and CRF. SN Comput. Sci. 2024, 5, 482.
  5. Košprdić, M.; Prodanović, N.; Ljajić, A.; Bašaragin, B.; Milošević, N. From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts. Artif. Intell. Med. 2024, 156, 102970.
  6. Tu, M. Named entity recognition and emotional viewpoint monitoring in online news using artificial intelligence. PeerJ Comput. Sci. 2024, 10, e1715.
  7. Ren, Y.; Liu, Y. Recognition of News Named Entity Based on Multi-Level Semantic Features Fusion and Keyword Dictionary. Acad. J. Comput. Inf. Sci. 2024, 7, 32–40.
  8. Ding, J.; Xu, W.; Wang, A.; Zhao, S.; Zhang, Q. Joint multi-view character embedding model for named entity recognition of Chinese car reviews. Neural Comput. Appl. 2023, 35, 14947–14962.
  9. Park, C.; Jeong, S.; Kim, J. ADMit: Improving NER in automotive domain with domain adversarial training and multi-task learning. Expert Syst. Appl. 2023, 225, 120007.
  10. Affi, M.; Latiri, C. BE-BLC: BERT-ELMO-Based Deep Neural Network Architecture for English Named Entity Recognition Task. Procedia Comput. Sci. 2021, 192, 168–181.
  11. Gao, F.F.; Zhang, L.; Wang, W.; Zhang, B.; Liu, W.; Zhang, J.; Xie, L. Named Entity Recognition for Equipment Fault Diagnosis Based on RoBERTa-wwm-ext and Deep Learning Integration. Electronics 2024, 13, 3935.
  12. Ni, J.; Wang, Y.; Wang, B. Named Entity Recognition for Automotive Production Equipment Fault Domain Fusing Radical Feature and BERT. J. Chin. Comput. Syst. 2024, 45, 1370–1375.
  13. Mao, T.; Xu, Y.; Liu, W.; Peng, J.; Chen, L.; Zhou, M. A simple but effective span-level tagging method for discontinuous named entity recognition. Neural Comput. Appl. 2024, 36, 7187–7201.
  14. Zhen, Y.; Li, Y.; Zhang, P.; Yang, Z.; Zhao, R. Frequent words and syntactic context integrated biomedical discontinuous named entity recognition method. J. Supercomput. 2023, 79, 13670–13695.
  15. Li, Y.; Liao, N.; Yan, H.; Zhang, Y.; Wang, X. Bi-directional context-aware network for the nested named entity recognition. Sci. Rep. 2024, 14, 16106.
  16. Han, D.J.; Wang, Z.; Li, Y.; Ma, X.; Zhang, J. Segmentation-aware relational graph convolutional network with multi-layer CRF for nested named entity recognition. Complex Intell. Syst. 2024, 10, 7893–7905.
  17. Huang, Z.; Hu, J. Entity category enhanced nested named entity recognition in automotive domain. J. Comput. Appl. 2024, 44, 377–384.
  18. Chen, S.; Dou, Q.; Tang, H.; Jiang, P. Chinese nested named entity recognition based on vocabulary fusion and span detection. Appl. Res. Comput. 2023, 40, 2382–2386+2392.
  19. Liu, Y.D.; Zhang, K.; Tong, R.; Cai, C.; Chen, D.; Wu, X. A Two-Stage Boundary-Enhanced Contrastive Learning approach for nested named entity recognition. Expert Syst. Appl. 2025, 271, 126707.
  20. Chen, J.; Su, L.; Li, Y.; Lin, M.; Peng, Y.; Sun, C. A multimodal approach for few-shot biomedical named entity recognition in low-resource languages. J. Biomed. Inform. 2024, 161, 104754.
  21. Hang, H.; Chao, M.; Shan, Y.; Tang, W.; Zhou, Y.; Yu, Z.; Yi, J.; Hou, L.; Hou, M. Low Resource Chinese Geological Text Named Entity Recognition Based on Prompt Learning. J. Earth Sci. 2024, 35, 1035–1043.
  22. Zhan, J.Z.; Hao, Y.Z.; Qian, W.; Liu, J. LELNER: A Lightweight and Effective Low-resource Named Entity Recognition model. Knowl.-Based Syst. 2022, 251, 109178.
  23. Yang, J.; Yao, L.; Zhang, T.; Tsai, C.Y.; Lu, Y.; Shen, M. Integrating prompt techniques and multi-similarity matching for named entity recognition in low-resource settings. Eng. Appl. Artif. Intell. 2025, 144, 110149.
  24. Zheng, C.; Xiao, S. Text Classification Method Combining Grammar Rules and Graph Neural Networks. J. Chin. Comput. Syst. 2023, 44, 1–12.
  25. Zhou, Y.; Li, J.; Chi, J.; Tang, W.; Zheng, Y. Set-CNN: A text convolutional neural network based on semantic extension for short text classification. Knowl.-Based Syst. 2022, 257, 109948.
  26. Zhang, Y.; Li, G.; Gao, H.; Dang, D. Multi-Scale Interaction Network for Multimodal Entity and Relation Extraction. Inf. Sci. 2024, 699, 121787.
  27. Cai, B.; Tian, S.; Yu, L.; Long, J.; Zhou, T.; Wang, B. ATBBC: Named entity recognition in emergency domains based on joint BERT-BILSTM-CRF adversarial training. J. Intell. Fuzzy Syst. 2024, 46, 4063–4076.
  28. Jiao, Y.; Zhao, L. Real-Time Extraction of News Events Based on BERT Model. Int. J. Adv. Netw. Monit. Control. 2024, 9, 24–31.
  29. Yoo, J.; Cho, Y. ICSA: Intelligent chatbot security assistant using Text-CNN and multi-phase real-time defense against SNS phishing attacks. Expert Syst. Appl. 2022, 207, 117893.
  30. Deforche, M.; De Vos, I.; Bronselaer, A.; De Tré, G. A Hierarchical Orthographic Similarity Measure for Interconnected Texts Represented by Graphs. Appl. Sci. 2024, 14, 1529.
  31. Chang, D.; Lin, E.; Brandt, C.; Taylor, R.A. Incorporating domain knowledge into language models by using graph convolutional networks for assessing semantic textual similarity: Model development and performance comparison. JMIR Med. Inform. 2021, 9, e23101.
Figure 1. Business data collection process of the after-sales department.
Figure 2. NER process for historical vehicle maintenance cases.
Figure 3. Schematic diagram of Text-CNN model structure.
Figure 4. Examples of nested and non-continuous entity recognition.
Figure 5. BERT-BiLSTM-CRF model framework.
Figure 6. Non-continuous entity combination process.
Figure 7. Typical text entity combination examples.
Figure 8. Entity alignment method for “Fault phenomenon and Fault cause” named entities.
Table 1. Initial format of selected labels in historical maintenance case data.
No. | Vehicle Model | Fault Description | Handling Results | Workstation No. | Replaced Part Name
1 | E… | “客户反映倒挡总是挂不上。” (“The customer reported that the reverse gear is always difficult to engage.”) | “经检查发现是由于换挡开关故障导致。” (“Upon inspection, it was found that the issue was caused by a faulty gear shift switch.”) | 259F2410 | “驾驶模式选择开关” (“The driving mode selection switch”)
2 | E… | “按动车辆喇叭有杂音” (“Pressing the vehicle horn produces a buzzing sound.”) | “电喇叭击穿故障,更换电喇叭” (“The electric horn has a short circuit fault, and the electric horn needs to be replaced.”) | 102G0200 | “电喇叭总成(高音)” (“Electric horn assembly (high tone)”)
3 | E… | “检查行驶中底盘后面异响。” (“Inspect abnormal noise from the rear chassis while driving.”) | “电驱动桥内部异响,更换电驱动桥处理。” (“Internal noise in the electric drive axle, replace the electric drive axle.”) | 402S3067 | “电驱动桥” (“Electric drive axle”)
Table 2. Example of long-text segmentation.
Original Long Text | Short Texts
Source text: “经技师检查,车内顶灯内部虚接导致顶灯时亮时不亮,更换前顶灯进行修复处理”
Translation: “Upon inspection by the technician, the internal loose connection of the interior roof light caused intermittent lighting. The front roof light was replaced for repair.”
Source text: “经技师检查”
Translation: “Upon inspection by the technician”
Source text: “车内顶灯内部虚接”
Translation: “the internal loose connection of the interior roof light”
Source text: “导致顶灯时亮时不亮”
Translation: “caused intermittent lighting”
Source text: “更换前顶灯进行修复处理”
Translation: “The front roof light was replaced for repair”
Source text: “该车辆报修方向盘偏左,路试确实偏左.上定位仪测量总前束0.53度偏大,重新做四轮定位调整前束修复.”
Translation: “The vehicle was reported for the steering wheel being slightly tilted to the left. Road testing confirmed the tilt. The alignment tool measured a total toe of 0.53 degrees, which was too large. The four-wheel alignment was redone, adjusting the toe for repair.”
Source text: “该车辆报修方向盘偏左”
Translation: “The vehicle was reported for the steering wheel being slightly tilted to the left”
Source text: “路试确实偏左”
Translation: “Road testing confirmed the tilt”
Source text: “上定位仪测量总前束0.53度偏大”
Translation: “the alignment tool measured a total toe of 0.53 degrees, which was too large”
Source text: “重新做四轮定位调整前束修复”
Translation: “The four-wheel alignment was redone, adjusting the toe for repair”
Source text: “转向器总成异响,左、右转向拉杆总成松旷,更换转向器总成、左右拉杆总成。”
Translation: “The steering assembly made abnormal noise, and the left and right steering rods were loose. The steering assembly and both steering rods were replaced.”
Source text: “转向器总成异响”
Translation: “The steering assembly made abnormal noise”
Source text: “左、右转向拉杆总成松旷”
Translation: “and the left and right steering rods were loose”
Source text: “更换转向器总成、左右拉杆总成”
Translation: “The steering assembly and both steering rods were replaced”
——
Table 3. Example of short-text classification.
Field Category | Fault Description Text | Handling Result Text
Fault phenomenon fieldSource text: “车里顶灯开门时候时亮时不亮”
Translation: “Interior roof light intermittently lights up when the door is opened”
Source text: “导致顶灯时亮时不亮”
Translation: “Caused roof light to intermittently light up”
Fault cause field——Source text: “车内顶灯内部虚接”
Translation: “Loose internal connection in the interior roof light”
Maintenance method field——Source text: “更换前顶灯进行修复处理”
Translation: “Replaced the front roof light for repair”
Other fieldsSource text: “车里顶灯开门时候时亮时不亮”
Translation: “Customer reported to the shop”
Source text: “经技师检查”
Translation: “Upon inspection by the technician”
Table 4. Text classification performance of various models.
Model | Precision/% | Recall/% | F1-Score/%
BERT | 86.49 | 83.27 | 84.85
RoBERTa | 87.35 | 85.19 | 86.26
Text-RNN | 89.22 | 90.61 | 89.91
DP-CNN | 90.35 | 91.37 | 90.86
Fast-Text | 91.24 | 90.88 | 91.06
Text-CNN | 91.82 | 90.36 | 91.08
Table 5. Example of BIOS-labeled text sequence.
Text Sequence | BIOS Labels
Source text: “客户到店反映,”
Translation: “ The customer reported to the shop that”
O|O|O|O|O|O|O
Source text: “车里胎压灯亮。”
Translation: “the tire pressure light was on.”
O|O|B-P|I-P|I-P|S|O
Source text: “经技师检查,”
Translation: “Upon inspection by the technician,”
O|O|O|O|O|O
Source text: “轮胎压力传感器短路,”
Translation: “it was found that a short circuit in the tire pressure sensor”
B-P|I-P|I-P|I-P|I-P|I-P|I-P|B-S|I-S|O
Source text: “导致胎压灯亮,”
Translation: “caused the tire pressure light to illuminate.”
O|O|B-P|I-P|I-P|S|O
Source text: “更换轮胎压力传感器。”
Translation: “The tire pressure sensor was replaced.”
O|O|B-P|I-P|I-P|I-P|I-P|I-P|I-P|O
Table 6. Entity recognition results for each model.
Model | Precision/% | Recall/% | F1-Score/%
BERT-CRF | 81.56 | 83.41 | 82.47
BiLSTM-CRF | 84.03 | 86.27 | 85.14
Word2vec-BiLSTM-CRF | 88.61 | 89.09 | 88.85
FLAT | 89.62 | 89.24 | 89.43
RoBERTa-wwm-ext | 90.57 | 90.06 | 90.31
BERT-BiLSTM-CRF | 91.06 | 89.53 | 90.29
Table 7. Examples from the entity relationship matching database.
Phenomenon Part Entity | Phenomenon Entity | Fault Part Entity | Cause Entity
Source text: “倒挡”
Translation: “reverse gear”
Source text: “挂不上、异响、…”
Translation: “difficult to engage, abnormal noise, …”
Source text: “换挡开关”
Translation: “shift switch”
Source text: “短路、损坏、…”
Translation: “short circuit, damage, …”
Source text: “胎压故障灯”
Translation: “tire pressure fault light”
Source text: “亮、报警、…”
Translation: “illuminates, alarm, …”
Source text: “右前轮发射器”
Translation: “right front wheel transmitter”
Source text: “信号丢失、信号失效、…”
Translation: “signal loss, signal failure, …”
Source text: “底盘”
Translation: “chassis”
Source text: “异响、漏油、…”
Translation: “abnormal noise, oil leakage, …”
Source text: “右半轴球笼”
Translation: “right half shaft ball joint”
Source text: “松旷、漏油、…”
Translation: “loose, oil leakage, …”
Source text: “雨刮”
Translation: “wiper”
Source text: “不工作、刮不干净、…”
Translation: “not working, does not clean properly, …”
Source text: “洗涤泵”
Translation: “wash pump”
Source text: “漏水、开裂、…”
Translation: “leaking, cracked, …”
Source text: “电喇叭”
Translation: “electric horn”
Source text: “不响、杂音、…”
Translation: “no sound, noise, …”
Source text: “电喇叭”
Translation: “electric horn”
Source text: “短路、接触不良、…”
Translation: “short circuit, poor contact, …”
Table 8. Comparative results of entity recognition on different types of texts after integrating text classification, entity recognition, and the entity matching database.
Text Type | Processing Method | Nested Text: Precision/% | Recall/% | F1-Score/% | Non-Continuous Entities and Nested Text: Precision/% | Recall/% | F1-Score/%
Fault description | Entity recognition | 84.00 | 72.22 | 77.67 | 59.33 | 64.61 | 61.86
Fault description | Text classification + entity recognition | 96.33 | 85.81 | 90.77 | 70.67 | 72.17 | 71.41
Fault description | Text classification + entity recognition + entity matching database | 96.33 | 85.81 | 90.77 | 75.67 | 74.01 | 74.83
Processing results | Entity recognition | 70.33 | 66.82 | 68.53 | 61.67 | 66.49 | 63.99
Processing results | Text classification + entity recognition | 85.67 | 83.66 | 84.65 | 74.00 | 76.13 | 75.05
Processing results | Text classification + entity recognition + entity matching database | 85.67 | 83.66 | 84.65 | 78.33 | 82.55 | 80.38
Table 9. Entity alignment performance under various threshold conditions.
sim1 | sim2 | Aligned Entities | Entities Pending Review | Accuracy Rate (%) | Entity Categories Before Alignment | Entity Categories After Alignment | Alignment Rate (%)
0.9 | 0.8 | 973 | 27 | 93.2 | 345 | 286 | 17.1
0.8 | 0.7 | 989 | 11 | 90.8 | 345 | 225 | 34.8
0.7 | 0.6 | 986 | 14 | 83.8 | 345 | 197 | 42.9
0.6 | 0.5 | 986 | 14 | 72.4 | 345 | 154 | 55.4
Table 10. Text alignment accuracy under various similarity calculation methods.
Similarity Method | Cosine | Jaccard | Edit Distance | Synonyms | Average | Max
Accuracy/% | 88 | 85 | 87 | 80 | 90.8 | 78
