Domain Knowledge Graph Question Answering Based on Semantic Analysis and Data Augmentation

Abstract: Information retrieval-based question answering (IRQA) and knowledge-based question answering (KBQA) are the main forms of question answering (QA) systems. The answer generated by the IRQA system is extracted from the relevant text but has a certain degree of randomness, while the KBQA system retrieves the answer from structured data, and its accuracy is relatively high. In the field of policy and regulations such as household registration, the QA system requires precise and rigorous answers. Therefore, we design a QA system based on the household registration knowledge graph, aiming to provide rigorous and accurate answers for relevant household registration inquiries. The QA system uses a semantic analysis-based approach to simplify a question into a simple question consisting of a single event entity and a single intention relationship, and quickly generates accurate answers by searching in the household registration knowledge graph. Due to the scarcity and imbalance of QA corpus data in the field of household registration, we use GPT-3.5 to augment the collected question dataset and explore the impact of data augmentation on the QA system. The experimental results show that the accuracy rate of the QA system using the augmented dataset reaches 93%, which is 6% higher than before.


Introduction
The question answering (QA) system is designed to provide users with personalized information services through human-computer interaction in question-and-answer form by analyzing the user's input. As one of the core tasks of artificial intelligence, QA has attracted extensive attention due to its widespread application in natural language processing and information retrieval [1]. Information retrieval-based question answering (IRQA) and knowledge-based question answering (KBQA) are the main forms of QA systems [2]. IRQA is known as open-domain QA: it can answer questions that come from any domain. This type of QA system uses information retrieval methods to retrieve texts relevant to the user's question from a large collection of passages. Simply using retrieved texts as the answer is not precise enough. Thanks to recent breakthroughs in large language models (LLMs), such as GPT-4 [3] and ChatGLM [4], which can better understand natural language questions, integrating LLMs into QA systems can provide users with more comprehensive answers. However, because the answers generated by LLMs have a certain degree of randomness, they cannot be guaranteed to always be consistent with the retrieved texts. In some special domains (such as medicine, policy, and law), QA systems are required to provide accurate and rigorous answers to ensure that the system can provide reliable and authoritative information. On the other hand, KBQA systems process unstructured or semi-structured passages into structured database storage. They construct query templates through semantic analysis of the questions, and the final answers are also retrieved from these texts. As a result, the answers obtained are consistent with the retrieved texts, providing high accuracy and
• This paper uses compound value type (CVT) nodes to store household registration events. Since CVT nodes collect multiple attributes of events and more accurately model complex relationships between entity nodes, this approach simplifies queries with multiple constraints in a knowledge graph (KG) into simple queries;
• This paper combines KGs and text similarity technology to improve the accuracy of the QA system. It leverages a corpus of query questions to train a RoBERTa-BiLSTM-MultiHeadAttention (RBMA) model to classify query intent. When the intent is clear, it utilizes the language technology platform (LTP) [7] to extract semantic role subjects from queries, and further retrieves the answer from the KG. When the intent is ambiguous, it uses text similarity techniques to match input queries with a corpus of queries and outputs the most similar answers;
• This paper applies an LLM to augment the training data, addressing the problem of data imbalance and improving the accuracy of intent classification. We use the GPT-3.5-turbo language model to enlarge the dataset by replacing synonyms and randomly inserting irrelevant words. The experimental results show that data augmentation techniques greatly improve the performance of QA systems.
The structure of the remaining sections of the paper is as follows: Section 2 discusses related works, Section 3 describes the process of constructing a household registration domain KG, Section 4 details the framework structure of the QA system, Section 5 presents the experiment results and analysis, and Section 6 concludes the paper and discusses future work.
Appl. Sci. 2023, 13, 8838

Related Works
In natural language processing, a simple question pertains to a single head entity and relation present in the knowledge graph (KG), with its corresponding tail entity acting as the answer [8], and a complex question commonly involves multiple entities and relationships within the KG or obtains the answer through specialized operations. This type of question is also referred to as a multi-constraint question [9].
Template-based methods and semantic parsing-based methods are the two main paradigms in KBQA [2]. Template-based methods use templates or rules to answer questions by mapping the questions to predefined templates [10,11]. Although this approach has higher accuracy, it has lower coverage and recall for various types of domain-specific questions [12]. For instance, H. Bast and E. Haussmann [13] proposed a model called Aqqu that maps the question to three templates, identifies all entities in the KG that match the question, and instantiates the three templates using Aqqu. Based on a ranking model, the best instantiation is selected to query the KG and retrieve the answer. However, these templates provide limited coverage for complex questions. Abujabal et al. [14] introduced an automated template generation model named QUINT, which generates question templates based on the dependency parse of the given question. Then, it queries candidate results based on these question templates, sorts them using a random forest classifier, and outputs the final answer obtained by the query. Semantic parsing-based methods involve constructing a semantic parser to map natural language questions into a semantic representation, logical expression, or query graph [15]. These representations are used to query the knowledge base and retrieve the answer. For instance, Yongrui Chen et al. [16] generated query graphs using a hierarchical self-recursive decoder that outlines the query graph and continually populates it. This end-to-end model enhances the accuracy of answering complex questions but requires manual design of semantic logic representations and query rules. K. Xu et al. [17] introduced a syntactic query graph that represents the intention of input questions based on three types of syntactic information: word order, dependency relations, and constituents. Then, they encoded the syntactic graph using a graph-to-sequence model and decoded the logical form of the question.
In the 1960s, KBQA systems such as BASEBALL [18] and LUNAR [19] had already been developed. BASEBALL was designed to answer questions about American League baseball issues within a one-year cycle, while LUNAR aimed to answer questions related to lunar rock geology analysis based on data collected from the Apollo moon landing missions. These early systems were designed specifically for domain-specific QA through structured data processing. Currently, there are three main ways to store knowledge: The first is RDF storage in the form of triples; the second is storage in traditional relational databases; and the third is storage in graph databases. Graph structures have the natural advantage of exploiting both structural and semantic information to analyze complex relationships [20], so we use a knowledge graph to store knowledge.
After Google proposed the knowledge graph (KG) in 2012, with the emergence of large-scale KGs such as Wikidata [21], DBpedia [22], and Freebase [23], knowledge-graph-based question answering (KGQA) has gradually become a research hotspot, attracting considerable attention from researchers [24]. This allows us to convert semantic analysis results into structured data and query information in the knowledge base [25]. A KG is a directed graph that uses entities as nodes and entity relations as edges [26]. Essentially, it is a knowledge base represented by a structured semantic network [27,28]. Each directed edge in the graph creates a triplet composed of a head entity, a tail entity, and their relation, forming a directed relationship between entities. The construction of KGs now has a fairly mature workflow, storing domain knowledge in a structured data format and providing data support for fast and efficient QA systems. In recent years, with the concept of the KG spreading to various fields, there has been no shortage of research in finance, medicine, education, e-commerce, and even the military. For example, the medical KGQA system [29], the e-commerce KGQA system [30], and the intelligent travel KGQA system [31], with their greater depth of knowledge, can provide more accurate professional knowledge services to users in their respective fields.
There are two characteristics of knowledge in the household registration domain. Firstly, household registration policies vary significantly by region. Secondly, there are many additional constraint conditions in the supplemental descriptions of household registration events. Template-based methods require manual design of template rules, and the templates have limited applicability. In order to ensure that the QA system has a certain degree of generalization ability and will not change the template due to changes in some household registration policies, we use the semantic parsing-based method to construct the QA system. However, the basic query graph or logical expression in the semantic parsing method is only suitable for simple questions and cannot express multi-constraint questions. Bao et al. [9] constructed multi-constraint query graphs by adding constraints to the basic query graph to limit the set of answers. This method effectively solves the problem of complex question queries by setting more rules to cover multi-constraint problems. Instead of adding rules to achieve multi-constraint queries, we simplified complex questions into simple ones by changing the storage method of the graph. Multiple-constraint event entities were replaced with CVT nodes to participate in the query, and the RBMA model was used to parse the question's intention into a relationship, ultimately obtaining a query representation for simple questions.

The Construction of Household Registration Domain Knowledge Graph
The aim of the construction of the household registration domain KG is to process unstructured or semi-structured texts, such as household registration policy regulation files, into structured information and store them in a graph structure to provide data support for subsequent QA systems. At present, there are two main ways to construct a KG: top-down and bottom-up [32]. Top-down construction refers to extracting ontology and schema information from high-quality structured data sources and adding them to the database. It is knowledge-oriented and can ensure that the KG has high standardization and accuracy. Bottom-up construction refers to using certain techniques to extract factual triplets from publicly collected data, processing them, cleaning and summarizing them to obtain ontology, and finally verifying them before adding them to the knowledge library. Considering that the ontology concepts in the household registration field already have standardized definitions and explanations in relevant policies and regulations, we adopted the top-down construction approach. The household registration KG was built based on government-published household registration documents and business question corpus. The process of constructing the KG is shown in Figure 1.

[Figure 1: process of constructing the KG. Labels recoverable from the figure: Entity Extraction, Relation Extraction, Attribute Extraction, Knowledge Fusion (Entity Disambiguation, Entity Linking), Knowledge Graph.]

Data Acquisition
Due to the significant regional differences in China's household registration policy, the requirements for handling the same household registration matter may vary greatly in different cities. Therefore, the knowledge source of the QA system must strictly comply with the household registration policy publicly announced in each specific region. In this paper, the household registration domain KG was constructed using the household registration policy of Wuhan City, Hubei Province, as the data source. The textual data for constructing the household registration KG mainly rely on the Wuhan City Household Registration Business Processing Guidelines [33] (hereinafter referred to as the Guidelines), issued by the Wuhan City Public Security Bureau, supplemented by the question corpus dataset compiled by staff of the Household Registration Center.
The Guidelines list the handling information for a total of 128 household registration businesses under six major categories and 28 subcategories in a semi-structured data format, totaling 112,400 words. This text summarizes a large number of household registration domain concepts and terminologies, accurately and succinctly describing how to handle various household registration businesses under different constraint conditions. The question corpus dataset consists of 1427 genuine questions recorded by the government service hall inquiry system, and each question corresponds to the household registration business information in the Guidelines. We trained a semantic parsing model using the question corpus dataset to convert the questions into simple queries, which are then used to search for answers in the KG database.

Information Extraction
We preprocessed the text data and focused on household registration events to extract specific content from semi-structured tables to form event triplets. We then aligned the questions and answers in the question corpus with the household registration item content to obtain the entity relationships in the household registration field. These entities mainly include the business item name, application requirements, required materials, processing procedures, application methods, and processing deadlines. The specific steps for information extraction are as follows, with an illustration shown in Figure 2:


(1) The Guidelines are semi-structured tables, and the theme tags and headers of the table can be directly extracted as relationships and attributes. The household registration business event name is extracted as the head entity, while the specific contents in the tables are extracted as tail entities, forming the basic event triplet of the household registration business;
(2) Supplementary explanations for household registration events under different constraint conditions are added as a tail entity to the head entity of the household registration business event name, forming the condition triplet of the household registration event;
(3) The questions from the question corpus were manually labeled, and corresponding relationships were established with the entities in steps (1) and (2), thereby constructing the question triplet of the household registration event.

[Figure 2: illustration of the information extraction steps. Labels recoverable from the figure: Semi-Structured, Event Name, Information Extraction, Question Tagging.]
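The three extraction steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Triplet` type, the function names, the relation labels `"condition"` and `"refers_to"`, and the sample row are hypothetical and are not taken from the Guidelines or from the authors' implementation.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


def extract_event_triplets(event_name, table_row, conditions=()):
    """Build basic and condition triplets for one household registration event."""
    # Step (1): each table header acts as the relation, the cell content as the tail entity.
    # Empty cells are skipped.
    triplets = [Triplet(event_name, header, cell)
                for header, cell in table_row.items() if cell]
    # Step (2): supplementary explanations under constraints become condition triplets.
    # The relation name "condition" is a hypothetical label.
    triplets += [Triplet(event_name, "condition", text) for text in conditions]
    return triplets


def tag_question(question, event_name):
    """Step (3): link a manually labelled question to its event entity.

    The relation name "refers_to" is a hypothetical label.
    """
    return Triplet(question, "refers_to", event_name)


# Hypothetical sample row, invented for illustration only.
row = {"Required materials": "Birth certificate", "Processing time limit": ""}
basic = extract_event_triplets("Newborn registration", row,
                               conditions=["Parents are divorced"])
```

Keeping the event name as the head entity of every triplet mirrors the description above, in which all table contents and supplementary conditions attach to the business event.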

Knowledge Representation
In a KG, the triplet is a commonly used representation. The graph is denoted as $G = (E, R, F)$, where $E = \{e_1, e_2, \cdots, e_m\}$ represents the entity set containing $m$ distinct entities, $R = \{r_1, r_2, \cdots, r_n\}$ represents the relation set containing $n$ distinct relationships, and $F = \{(e_a, r_b, e_c), \cdots, (e_i, r_j, e_k)\}$ represents the set of fact triplets. In knowledge extracted from household registration business text data, many entities have multiple relationships that cannot be effectively represented using conventional triplet forms. Currently, there are two primary methods for representing multiple relationships. The first adds the multiple relationships as edge attributes, as exemplified by the knowledge base ConceptNet [34]. However, this method yields redundant relationship data, reducing the efficiency of data retrieval and querying. The second uses compound value type (CVT) nodes to represent multiple relationships, as exemplified by the knowledge base Freebase [23]. However, this method requires data to be stored as a graph data structure.
Considering efficiency in querying and the redundancy in relationship data, we used CVT nodes to represent multiple relationships and convert the fact triplets under different circumstances into tuples represented by CVT nodes. CVT nodes are a node type in the Freebase knowledge base that is used to collect multiple attributes of an event and model complex relationships more accurately between entity nodes. We used the CVT nodes to represent household registration events that involve different constraint conditions and complex relationships, as illustrated in Figure 3.
After completing entity relation extraction and deduplication operations, the extracted triplets were reorganized using CVT nodes. The statistics of the quantity and categories of the entity relationships extracted from the document are shown in Table 1. This paper utilized the NEO4J graph database to store multi-tuple data. NEO4J provides Cypher statements to import and query graph data; Cypher is a declarative graph query language with simple syntax and powerful functions. In this paper, the entity relationships in the Guidelines and the corresponding question corpus were imported into the NEO4J database using the Cypher CREATE statement.

Figure 3. Example of CVT nodes representation. In the event of household registration for children born within marriage, when the parents divorce, they can choose to provide either one of two sets of documents for registration. One set includes the divorce certificate and divorce settlement, and the other set includes the divorce mediation agreement and divorce judgment. This text involves the use of 'and' and 'or' relationships, which are difficult to express in conventional triple formats, but CVT nodes can handle it effortlessly.
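To make the 'and'/'or' structure of the Figure 3 example concrete, the following Python sketch models a CVT node as a small data class and flattens it into ordinary triplets. It is a sketch under assumptions rather than the authors' storage schema: the class `CVTNode`, the node identifiers, and the relation names (`has_case`, `condition`, `materials_option`, `includes`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CVTNode:
    """Hypothetical CVT node grouping the attributes of one constrained event."""
    node_id: str
    condition: str
    # Outer list: alternative document sets ("or").
    # Inner lists: documents that must all be provided together ("and").
    material_sets: list = field(default_factory=list)


def cvt_to_triplets(event, cvt):
    """Flatten an event and its CVT node into plain (head, relation, tail) triplets."""
    triplets = [(event, "has_case", cvt.node_id),
                (cvt.node_id, "condition", cvt.condition)]
    for i, docs in enumerate(cvt.material_sets, start=1):
        option_id = f"{cvt.node_id}/option{i}"
        triplets.append((cvt.node_id, "materials_option", option_id))
        triplets.extend((option_id, "includes", doc) for doc in docs)
    return triplets


def is_satisfied(cvt, provided):
    """True if any one option ("or") has all of its documents ("and") provided."""
    return any(all(doc in provided for doc in docs) for docs in cvt.material_sets)


# The two document sets described in the Figure 3 caption.
divorced_case = CVTNode(
    node_id="cvt_divorced",
    condition="parents divorced",
    material_sets=[
        ["divorce certificate", "divorce settlement"],
        ["divorce mediation agreement", "divorce judgment"],
    ],
)
triplets = cvt_to_triplets("Registration of a child born within marriage", divorced_case)
```

With the intermediate node in place, a question about the required materials for this case reduces to a lookup starting from a single node, which is the simplification that motivates the CVT design.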

Question Answering System
There are presently two methods for retrieving household registration policy information: one is to manually query customer service, which is inefficient and prone to errors, and the other is to search government service websites in various regions (e.g., Hubei Government Services website [35]), which primarily use string-based search methods that cannot accurately interpret user intention. We present a QA system based on a household registration domain KG, which can retrieve information from complex household registration regulations relevant to the user's query, enabling the system to efficiently answer users' questions and improve the efficiency of household registration services.

System Structure
The QA system consists of two stages: the semantic parsing stage and the answer retrieval stage. In the semantic parsing stage, the QA system parses the question into a simple question composed of a single entity and a single relationship, outputting the household registration event entity and the intention relationship corresponding to the question. We analyzed the household registration event entity relationships in the household registration domain KG and combined them with the question corpus dataset. Overall, there were a total of 622 household registration event entities, and 618 of them had corresponding entries in the question dataset; there were nine types of question intention relationships, of which five were identified in the question dataset. In addition to these, there were also questions with unclear intentions and questions unrelated to household registration affairs, giving a total of seven intention categories.
Since there were a large number of household registration event entities and extremely imbalanced question data, we treated the recognition of household registration event entities in the questions as a text matching task, and used LTP to extract semantic subject roles in the questions to determine the corresponding event entity through the text similarity model. Since there were relatively few question intention relationships, and the imbalance in the dataset was significant, we treated the recognition of the question intention relationships as a text classification task [36] and trained a neural network classifier using the RBMA model. In the answer retrieval stage, we inserted the household registration event entity and the question intention relationship obtained by semantic parsing into a querying statement, and executed the query to obtain the answer. The QA system process is shown in Figure 4.
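The two outputs of semantic parsing (an event entity and an intention relationship) can be combined into a query as in the following Python sketch. Several parts are assumptions made for illustration: character-bigram Jaccard similarity stands in for the paper's text similarity model, and the node label `Event`, the property `name`, and the relation name used in the example are hypothetical, since the actual NEO4J schema is not given here. The extraction of the subject role by LTP is assumed to have happened already.

```python
def char_bigrams(text):
    """Set of character bigrams of a string, ignoring case and spaces."""
    text = text.lower().replace(" ", "")
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def similarity(a, b):
    """Jaccard similarity over character bigrams (a simple stand-in metric)."""
    x, y = char_bigrams(a), char_bigrams(b)
    return len(x & y) / len(x | y)


def match_event_entity(subject, event_entities):
    """Pick the KG event entity most similar to the extracted subject phrase."""
    return max(event_entities, key=lambda entity: similarity(subject, entity))


def build_query(event_entity, relation):
    """Fill a parameterized Cypher template with the parsed entity and relation.

    The label `Event` and the property `name` are hypothetical schema choices.
    """
    query = ("MATCH (e:Event {name: $event})-[r]->(answer) "
             "WHERE type(r) = $relation "
             "RETURN answer.name AS answer")
    return query, {"event": event_entity, "relation": relation}


# Hypothetical entity names, invented for illustration.
entities = ["newborn household registration",
            "household registration transfer",
            "name change"]
entity = match_event_entity("register newborn household", entities)
query, params = build_query(entity, "REQUIRED_MATERIALS")
```

Passing the entity and relation as query parameters rather than concatenating them into the statement keeps the template fixed, so one template serves every simple question.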

Question Intention Classification
We conducted a statistical analysis on the question corpus dataset and found that the question intentions in the dataset correspond to five relationships in the KG, as well as some ambiguous and unrelated expressions. Therefore, we classified the question intentions into seven categories, including the five intention categories corresponding to relationships and two special cases. Table 2 lists the seven question intention categories.

Table 2. Question intention category and definition.

Category    Definition
Class 1     Application method and processing time limit
Class 2     Processing location
Class 3     Application conditions
Class 4     Required materials
Class 5     Processing procedures
Class 6     Intention unclear
Class 7     Not related to household registration

In this paper, recognizing the intention of the question was considered a text classification task. We constructed the RBMA model as a classifier that parses and classifies the intention of the question. The model classifies a question in the following steps: it encodes the input text semantically using the Robustly Optimized BERT Pretraining Approach (RoBERTa) to obtain the word vector representation of each word; it inputs the word vector sequence into a bidirectional long short-term memory (BiLSTM) network to capture the contextual semantic information of the sentence; and it then uses a multi-head attention (MHA) mechanism to extract the essential information from the text [37]. Finally, the classification (CLS) vector representing the semantics of the entire sentence is concatenated, and a fully connected layer maps the CLS vector to a predefined set of labels to obtain the text's classification result [38]. The model structure is shown in Figure 5, and the intention category is ultimately obtained through question parsing.
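How the predicted class might steer the rest of the pipeline can be sketched as follows. The labels come from Table 2; the routing itself is an assumption for illustration. Classes 1 to 5 go to the KG query and Class 6 falls back to text similarity matching against the question corpus, as described in the contributions, whereas the handling of Class 7 is not specified in this section and the `out_of_scope` action is hypothetical. The logits are assumed to come from the final fully connected layer of the classifier.

```python
import math

# Intention categories from Table 2.
INTENT_LABELS = {
    1: "Application method and processing time limit",
    2: "Processing location",
    3: "Application conditions",
    4: "Required materials",
    5: "Processing procedures",
    6: "Intention unclear",
    7: "Not related to household registration",
}
KG_CLASSES = {1, 2, 3, 4, 5}  # classes that correspond to relationships in the KG


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits):
    """Map seven classifier logits to (class id, label, next action)."""
    probs = softmax(logits)
    cls = max(range(len(probs)), key=probs.__getitem__) + 1
    if cls in KG_CLASSES:
        action = "kg_query"             # clear intent: query the knowledge graph
    elif cls == 6:
        action = "similarity_fallback"  # unclear intent: match against the question corpus
    else:
        action = "out_of_scope"         # hypothetical handling of unrelated questions
    return cls, INTENT_LABELS[cls], action


# Made-up logits for illustration only.
predicted = route([0.1, 0.2, 0.1, 3.0, 0.0, 0.5, 0.2])
```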
Appl. Sci. 2023, 13, x FOR PEER REVIEW 9 Class 6 Intention unclear Class 7 Not related to household registration In this paper, recognizing the intention of the question was considered a text clas cation task. We constructed the RBMA model as a classifier that parses and classifies intention of the question. The model classifies the question process by: encoding in texts semantically using the Robustly Optimized BERT Pretraining Approach (RoBER to obtain the word vector representations of each word; inpuBing the word vector quence into a bidirectional long short-term memory (BiLSTM) to capture the contex semantic information of the sentence; and then using a multi-head aBention mechan (MHA) to extract the essential information from the text [37]. Finally, the classifica (CLS) vector representing the entire sentence's semantic is concatenated, and a fully nected layer maps the CLS vector to a predefined set of labels to obtain the text's clas cation result [38]. The model structure is shown in Figure 5, and the intention categ intention is ultimately obtained through question parsing. (1) RoBERTa Layer RoBERTa [39] is an improved version of the Bidirectional Encoder Representa from Transformers (BERT) [40] pretraining model, representing a more robust and fi (1) RoBERTa Layer RoBERTa [39] is an improved version of the Bidirectional Encoder Representation from Transformers (BERT) [40] pretraining model, representing a more robust and finetuned version of BERT. The RoBERTa model employs a Transformer-based encoder-decoder structure to represent input text as vectors. Specifically, the model converts each input word into its corresponding word vector and feeds them into the encoder in a particular order. The encoder consists of multiple Transformer blocks, each containing multi-headed attention mechanisms and feedforward neural network layers. 
By iterating through these blocks, the RoBERTa model can gradually convert the input text into a fixed-dimensional vector representation, serving as the output of the model.
The vector representation of an input text query, $E = (E_1^d, E_2^d, \cdots, E_n^d)$, is generated by tokenization followed by encoding, where $n$ is the length of the tokenized query and $d$ is the dimensionality of each word vector.
Feature extraction by the encoder, applied to $E$, yields the output vector $X = (X_1, X_2, \cdots, X_n)$.
(2) BiLSTM Layer
The bidirectional long short-term memory network (BiLSTM) consists of a forward LSTM that processes a sequence and a backward LSTM that processes the sequence in reverse, enabling deep feature extraction of the input context. This allows the network to capture the positional relationships and semantic information between words, and to handle longer text sequences while retaining semantic information. Thus, a BiLSTM layer is introduced after the RoBERTa layer to extract features from the context of the input query and capture the global feature information of the input sentence more accurately. This compensates for the RoBERTa layer's tendency to forget contextual information.
In Equations (1)-(3), $w_t$ and $v_t$ represent the weight matrices of the forward and backward LSTM cells, respectively, $b_t$ represents the bias, and $h_t$ denotes the output of the BiLSTM layer at time $t$. The input feature vector captures the global information of the sentence through the BiLSTM layer, and the resulting output is represented by $T = (T_1, T_2, \cdots, T_n)$.
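Equations (1)-(3) are described above only through their symbols. A common way of writing a BiLSTM that is consistent with those symbol definitions is sketched below. The exact form is an assumption: $X_t$ stands for the RoBERTa output at position $t$, and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are names introduced here for the forward and backward hidden states.

```latex
\begin{align}
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}\left(X_t,\ \overrightarrow{h}_{t-1}\right) \\
\overleftarrow{h}_t &= \overleftarrow{\mathrm{LSTM}}\left(X_t,\ \overleftarrow{h}_{t+1}\right) \\
h_t &= w_t\,\overrightarrow{h}_t + v_t\,\overleftarrow{h}_t + b_t
\end{align}
```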

(3) Multi-head attention layer
The attention mechanism enables selectively focusing on critical information in text. We employed the multi-head self-attention mechanism (MHA) to extract word dependencies in different semantic spaces [41]. The MHA first conducts multiple linear transformations on the input vectors and then performs attention computation on the transformed vectors to obtain a weighted vector representation. By introducing multiple heads, the MHA can learn more semantic information and improve the model performance [42,43].
The MHA is based on scaled dot-product attention (SDA), which can be expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{4}$$

In Equation (4), $Q$, $K$, and $V$ represent the query, key, and value matrices used for computing the attention, respectively, and $d_k$ denotes the dimension of the input keys. The attention weight of a value is obtained by calculating the dot product of the query with all keys, dividing the result by $\sqrt{d_k}$, and applying the softmax function.
For an input vector sequence, multiple linear transformations are first applied to obtain multiple heads (i.e., multiple vector sequences). Then, the attention weights for each head are computed separately and used to form a weighted sum of the corresponding value vectors, which gives the output of each head:

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{5}$$

In Equation (5), the projection matrices $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$ correspond to $Q$, $K$, and $V$, respectively, where $d_v$ is the value dimension. The single-head outputs $\mathrm{head}_i$ are concatenated and then dimensionally adjusted by the projection matrix $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$ to obtain the multi-head output:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)W^{O} \tag{6}$$

(4) Linear layer
We concatenate the CLS vector $X_1$ from the RoBERTa layer, the last sequential output $T_1$ from the BiLSTM layer, and the pooled output $H$ from the multi-head attention layer, normalize the result, and apply an activation function to obtain the vector $V_T$ (Equation (7)). This feature representation, which fuses the semantic information of the sentence, is input to a fully connected layer that maps it to the label space to obtain the final classification result (Equation (8)).
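To make the attention computation in Equations (4)-(6) concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`matmul`, `scaled_dot_product_attention`, `multi_head_attention`) and the use of nested lists instead of a tensor library are assumptions made to keep the example self-contained.

```python
import math


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Equation (4): softmax(Q K^T / sqrt(d_k)) V. Returns (output, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Equations (5)-(6) for self-attention, where Q = K = V = x.

    heads is a list of (W_q, W_k, W_v) projection matrices, one per head;
    w_o projects the concatenated head outputs back to the model dimension.
    """
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = scaled_dot_product_attention(
            matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outputs.append(out)
    # Concatenate the head outputs position by position.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

With identity projections, each position attends most strongly to itself, and every row of attention weights sums to one, which is a quick sanity check on the softmax step.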

Event Matching
After analyzing the questions in the dataset, we found that in actual usage scenarios, sentence components may be missing or unclear due to factors such as limited user expression abilities [44]. As a result, the identification of event entities is frequently less effective; in particular, the input to the QA system consists of spoken audio converted to text, which increases the difficulty of identifying event entities and makes the answer retrieval more difficult.
To address the above issues, we used LTP [7], developed by Harbin Institute of Technology, to analyze questions starting from the sentence structure. The tool performs syntactic analysis on the question and extracts the important semantic roles. We reorganized these roles into phrases based on their semantic role properties, and the resulting phrases were then matched with a predefined statement template. This process partly eliminates the noise caused by colloquial language or vague expressions and ultimately identifies the household registration event entity associated with the question.
(1) Semantic Role Analysis
The LTP platform supports user-defined dictionaries. We added commonly used words related to the field of household registration to the LTP dictionary. The question is then segmented, annotated with parts of speech, and labeled with semantic roles. Based on the result of the semantic analysis, the semantic roles are extracted from the clauses and output. Semantic roles and their meanings are listed in Table 3 [45].
LTP performs semantic role labeling and extracts the semantic roles present in interrogative sentences. To illustrate, we take the question 'Can I apply for my first ID card for non-local household registration in Wuhan?' as an example. Figure 6 shows the analysis result of LTP for this question, which provides a list of semantic labeling results.
The resulting list consists of multiple dictionaries, each containing a predicate ('predicate') and a semantic role ('argument'). The predicates and semantic roles in each dictionary are then reassembled, according to the role types shown in Table 3, into a phrase set $P = (p_1, p_2, \cdots, p_m)$ that expresses the key semantic elements of the question. For instance, using semantic role extraction, we obtained the following phrases: 'non-local household registration apply for first ID card' and 'non-local household registration apply for in Wuhan'.
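The reassembly of predicates and semantic roles into phrases can be sketched as follows. This is only an illustration under stated assumptions: the input structure (a list of dictionaries with a `predicate` string and an `arguments` list of `(role, text)` pairs), the role labels in the example, and the rule "agent, then predicate, then one other role per phrase" are hypothetical and are not taken from LTP's actual output format or from the paper's template.

```python
def build_phrases(srl_result):
    """Reassemble each predicate with its semantic roles into short phrases.

    Assumed rule (hypothetical): roles labeled "A0" come before the predicate,
    and every other role produces one phrase of the form
    "<A0 text> <predicate> <role text>".
    """
    phrases = []
    for frame in srl_result:
        predicate = frame["predicate"]
        agents = [text for role, text in frame["arguments"] if role == "A0"]
        others = [text for role, text in frame["arguments"] if role != "A0"]
        prefix = " ".join(agents + [predicate])
        if not others:
            phrases.append(prefix)
        for text in others:
            phrases.append(f"{prefix} {text}")
    return phrases


# Hand-written stand-in for a semantic role labeling result of the example
# question; the role labels are illustrative only.
example = [
    {
        "predicate": "apply for",
        "arguments": [
            ("A0", "non-local household registration"),
            ("A1", "first ID card"),
            ("LOC", "in Wuhan"),
        ],
    }
]

phrases = build_phrases(example)
```

For the hand-written example above, the function returns the two phrases quoted in the text.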
(2) Text similarity matching
The purpose of similarity matching is to calculate the similarity between the phrases extracted by semantic analysis and the event entities stored in the KG. This process identifies the household registration item name and contextual conditions for the given question, and provides important information to support answer retrieval [46]. The specific process of similarity matching includes the following steps:
• To begin with, the phrases in the phrase set P extracted by semantic analysis are reassembled and concatenated, resulting in the sentence sent with extracted semantic role features;
• Search for all CVT nodes that represent household registration events in the KG, extract the corresponding description texts of household registration events for these nodes, and form a candidate set $S = (s_1, s_2, \cdots, s_n)$;
• Perform similarity matching between the sentence sent and the description texts in the candidate set S to determine the household registration item corresponding to the given question and its CVT node id in the KG.
In NLP downstream tasks, determining the similarity between two pieces of text is an essential task called text similarity matching. This task involves transforming input text into vectors that capture semantic information and then calculating their similarity. In this paper, we employed pretrained language models to represent the sentence sent as a semantic vector $e_s$, and the sentence set $S$ as a semantic vector set $E_S = (e_1, e_2, \cdots, e_n)$. We calculated the cosine similarity between $e_s$ and each vector in $E_S$ and took the index of the maximum value, which determines the sentence $s_k$ in $S$ that is most similar to sent, along with the corresponding CVT node id of $s_k$. The cosine similarity is given in Equation (9):

$$\mathrm{cosine}(e_1, e_2) = \frac{e_1 \cdot e_2}{\|e_1\|_2 \cdot \|e_2\|_2} \tag{9}$$

The semantic vectors $e_1$ and $e_2$ have identical dimensions.
BERT and RoBERTa are highly effective for semantic representation, but these pretrained language models are trained with unsupervised objectives. To achieve better performance in text similarity subtasks, it is typically necessary to fine-tune the model with supervised learning. Currently, popular text similarity models, such as Sentence-BERT [47], use average pooling to obtain the mean vector as the sentence vector and have achieved good performance and fast convergence. However, the training objective of such models is not directly optimized for the cosine similarity used at prediction time. We used the CoSENT [48] model proposed by Jianlin Su to represent semantic vectors. This model designs a new loss that optimizes cosine values directly and resolves the inconsistency between training and prediction.
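The matching step built on Equation (9) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the toy three-dimensional vectors and the identifiers `cvt_001` to `cvt_003` are made up, and in practice the vectors would come from the sentence encoder.

```python
import math


def cosine(e1, e2):
    """Equation (9): dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm1 * norm2)


def best_match(e_s, candidates):
    """Return (cvt_id, similarity) of the candidate most similar to e_s.

    candidates is a list of (cvt_id, vector) pairs, one per CVT node
    description text in the candidate set S.
    """
    scores = [cosine(e_s, vec) for _, vec in candidates]
    k = max(range(len(scores)), key=scores.__getitem__)
    return candidates[k][0], scores[k]


# Toy vectors standing in for encoder outputs (hypothetical values).
e_s = [1.0, 0.0, 1.0]
candidates = [
    ("cvt_001", [0.0, 1.0, 0.0]),
    ("cvt_002", [2.0, 0.0, 2.0]),
    ("cvt_003", [1.0, 1.0, 0.0]),
]
matched_id, matched_score = best_match(e_s, candidates)
```

Because cosine similarity ignores vector length, `cvt_002` is selected even though its vector is twice as long as `e_s`.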
In text similarity tasks, the CoSENT model is trained on sentence pairs, where $\Omega_{pos}$ denotes the set of all positive sample pairs and $\Omega_{neg}$ denotes the set of all negative sample pairs. For any positive sample pair $(i, j) \in \Omega_{pos}$ and negative sample pair $(k, l) \in \Omega_{neg}$, it is desirable to fulfill the following criterion:

$$\mathrm{cosine}(u_i, u_j) > \mathrm{cosine}(u_k, u_l) \tag{10}$$

where $u_i$, $u_j$, $u_k$, and $u_l$ represent the semantic vectors of the respective sentences. The starting point is the following generalization of the cross-entropy loss, in which a term $e^{s_i - s_j}$ is added inside the logarithm for every prediction target $s_i < s_j$:

$$\log\left(1 + \sum_{i \in \Omega_{neg},\, j \in \Omega_{pos}} e^{s_i - s_j}\right) \tag{11}$$

Therefore, the loss function corresponding to (10) can be written as follows:

$$\log\left(1 + \sum_{(i,j) \in \Omega_{pos},\, (k,l) \in \Omega_{neg}} e^{\lambda\left(\mathrm{cosine}(u_k, u_l) - \mathrm{cosine}(u_i, u_j)\right)}\right) \tag{12}$$

Here, $\lambda > 0$ is a hyperparameter. By fine-tuning with a loss function that optimizes the cosine value directly, the CoSENT model achieves better convergence speed and final performance in text similarity tasks than Sentence-BERT.
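As a numerical illustration of the CoSENT objective, the sketch below evaluates the loss on precomputed cosine values. It is a simplified stand-in rather than a training implementation: there are no gradients or batching, the function name `cosent_loss` is introduced here, and the default `lam=20.0` is an illustrative choice for the hyperparameter λ.

```python
import math


def cosent_loss(pos_cosines, neg_cosines, lam=20.0):
    """log(1 + sum over (pos, neg) pairs of exp(lam * (cos_neg - cos_pos))).

    pos_cosines: cosine similarities of the positive sentence pairs.
    neg_cosines: cosine similarities of the negative sentence pairs.
    Every negative pair that scores higher than a positive pair adds a
    large term to the sum; correctly ordered pairs add almost nothing.
    """
    total = sum(math.exp(lam * (neg - pos))
                for pos in pos_cosines
                for neg in neg_cosines)
    return math.log1p(total)


well_ordered = cosent_loss(pos_cosines=[0.9], neg_cosines=[0.1])
mis_ordered = cosent_loss(pos_cosines=[0.1], neg_cosines=[0.9])
```

The loss is close to zero when every positive pair has a higher cosine than every negative pair, and it grows roughly linearly with λ times the size of the violation otherwise.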

Answer Retrieval
After intention classification and event matching, we obtain the intention type intention and event node ID value id. The answer retrieval is divided into two methods by the intention of the question. The flow chart in Figure 7 illustrates the process:

(1) The first method applies when intention belongs to class 1 to class 6. It fills the predesigned Cypher query templates with intention and the id of the event node to construct query statements. The Cypher query template is shown in Figure 8. If the query returns an answer node, the text content stored in that node is output as the answer. If a CVT node is returned, that node is re-inserted into the Cypher query template and the query continues until an answer node is returned;
(2) The second method is used when intention is class 7, which means the question is unrelated to household registration processing. Here, we use the CoSENT model to retrieve the most similar question from the question corpus to generate the answer.

Figure 7. Flowchart of answer query process.

Figure 8. Cypher query statement template.
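The first retrieval method, including the loop that follows CVT nodes until an answer node is reached, can be sketched as below. Everything concrete in it is hypothetical: the template string is not the one in Figure 8, the relationship names and node fields (`type`, `id`, `text`) are invented, and `run_query` stands for whatever function executes a Cypher statement against the graph database. A dictionary is used as a mock so the example runs without a database.

```python
# Hypothetical template; the actual template is the one shown in Figure 8.
CYPHER_TEMPLATE = "MATCH (e)-[:{relation}]->(n) WHERE id(e) = {node_id} RETURN n"

# Hypothetical mapping from intention class to a KG relationship name.
RELATION_BY_INTENTION = {
    1: "APPLICATION_METHOD",
    2: "PROCESSING_LOCATION",
    3: "APPLICATION_CONDITIONS",
    4: "REQUIRED_MATERIALS",
    5: "PROCESSING_PROCEDURES",
}


def build_query(intention, node_id):
    """Fill the template with the relationship for this intention and a node id."""
    # Note: a production system would pass values as query parameters
    # instead of formatting them into the string.
    return CYPHER_TEMPLATE.format(
        relation=RELATION_BY_INTENTION[intention], node_id=node_id)


def retrieve_answer(intention, node_id, run_query, max_hops=10):
    """Query repeatedly, following CVT nodes, until an answer node is returned."""
    for _ in range(max_hops):
        node = run_query(build_query(intention, node_id))
        if node is None:
            return None              # nothing found for this query
        if node["type"] == "answer":
            return node["text"]      # answer node: output its text content
        node_id = node["id"]         # CVT node: re-insert it into the template
    return None


# Toy stand-in for the graph database: maps a query string to the returned node.
MOCK_KG = {
    build_query(4, 1): {"id": 2, "type": "cvt"},
    build_query(4, 2): {"id": 3, "type": "answer",
                        "text": "Toy answer text stored in the answer node."},
}

answer = retrieve_answer(4, 1, MOCK_KG.get)
```

The `max_hops` bound is a design choice added for safety so that a cycle of CVT nodes cannot cause an endless loop.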

Data Augmentation
Currently, the training of the model relies heavily on manually annotated corpora, which may not achieve good results when the dataset is small [49][50][51]. Data augmentation refers to the use of various techniques and methods to expand the training dataset, thus improving the model's performance and robustness. Wei et al. [52] proposed four simple operations, namely synonym substitution, random deletion, random swapping, and random insertion, to prevent overfitting and enhance model generalization. Anaby-Tavor et al. [53] used the generative language model GPT-2 for text data augmentation and achieved excellent augmentation results in few-shot scenarios.
As there are currently no open-source question corpora in the household registration field, all the data used to train the model in this paper come from questions collected from the inquiry system of the Wuhan Municipal Government Service Center. After removing a small number of questions that were invalid or obsolete due to household registration policy changes, a total of 1427 authentic questions related to household registration were obtained. We labeled all these questions and obtained the question classification dataset required for the experiment. As shown in Figure 10, through visualization and analysis of the dataset, we found that the dataset is extremely unbalanced: the number of samples for class 5 questions is more than three times that for class 3 questions. Since small classes can lead to overfitting during model training, we used the GPT-3.5-turbo generative language model to expand the dataset by two methods: synonym substitution and random insertion of irrelevant words.
To enable the LLM to rewrite questions according to the requirements, we need to design the prompts given to GPT. Currently, there are two main types of prompt templates: cloze prompts [54] and prefix prompts [55]. Since question rewriting is a sentence generation task, the prefix prompt method is often more beneficial because such tasks align well with the model's left-to-right property [56]. We designed prompts based on the CRISPE prompt framework [57] provided by Matt Nigh, and the specific steps are shown in Table 4.

Table 4. Prompts Creation Framework.

| Step | Interpretation | Prompt |
|---|---|---|
| Capacity and Role | What role (or roles) should ChatGPT act as? | As a user who is going to handle household registration business |
| Insight | Provides the behind the scenes insight, background, and context to your request. | You have questions about the handling information of some household registration matters |
| Statement | What you are asking ChatGPT to do. | Please rewrite the given question, provide a similar question |
| Personality | The style, personality, or manner you want ChatGPT to respond in. | Ensure that the original meaning of the sentence is preserved. The rewriting method includes synonym substitution and random insertion of irrelevant words |
| Experiment | Asking ChatGPT to provide multiple examples to you. | The sentences that need to be rewritten are: |

The final prompt is as follows: 'As a user who is going to handle household registration business, you have questions about the handling information of some household registration matters. Please rewrite the given question, provide a similar question, and ensure that the original meaning of the sentence is preserved. The rewriting method includes synonym substitution and random insertion of irrelevant words. The sentences that need to be rewritten are:'. We rewrote the 1427 questions multiple times and finally obtained a dataset of 7055 questions.
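The way the five CRISPE parts combine into the final prompt can be sketched as simple string assembly. This sketch only builds the prompt text and does not call any model API. The dictionary keys, the function name `build_prompt`, and the numbered-list format used to append the questions are assumptions; the paper does not state how the questions were appended.

```python
# The five CRISPE parts from Table 4, with capitalization adjusted so that
# they join into the final prompt quoted in the text.
CRISPE_PARTS = {
    "capacity_and_role": "As a user who is going to handle household registration business",
    "insight": "you have questions about the handling information of some household registration matters",
    "statement": "Please rewrite the given question, provide a similar question",
    "personality": ("ensure that the original meaning of the sentence is preserved. "
                    "The rewriting method includes synonym substitution and "
                    "random insertion of irrelevant words"),
    "experiment": "The sentences that need to be rewritten are:",
}


def build_prompt(questions, parts=CRISPE_PARTS):
    """Join the CRISPE parts into one prefix prompt and append the questions."""
    header = (f"{parts['capacity_and_role']}, {parts['insight']}. "
              f"{parts['statement']}, and {parts['personality']}. "
              f"{parts['experiment']}")
    # Assumed format: one numbered question per line after the header.
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return f"{header}\n{numbered}"


prompt = build_prompt(["Can I apply for my first ID card for non-local "
                       "household registration in Wuhan?"])
```

Keeping the parts in a dictionary makes it easy to vary one element, for example the rewriting method in the personality part, without touching the rest of the prompt.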

Comparison Model
To assess the efficacy of the RBMA model in intention classification, we conducted comparative experiments against several prevailing text classification models. Furthermore, we conducted ablation experiments to examine the contributions of RoBERTa, BiLSTM, and MultiHeadAttention. Specifically, we compared the RBMA model against seven other models, which include the baselines DPCNN, TextRCNN, BERT, and RoBERTa and the ablation variants RoBERTa-BiLSTM and RoBERTa-MultiHeadAttention.
For each category's prediction, we calculated TP, TN, FN, and FP, which correspond to the numbers of true positives, true negatives, false negatives, and false positives, respectively. We evaluated the performance of the model using three metrics: precision, recall, and F1 score. The formulas for these metrics are shown below:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{14}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$

To ensure good generalization during training, we saved the model parameter files at the point where each model performed best on the validation set, provided that the model was not overfitting. Then, the accuracy of the final answer when the QA system used the model for intention classification was measured on the test set. Accuracy here is the proportion of test questions for which the final answer is correct.
The hyperparameter settings for the model experiment are shown in Table 5.
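The per-category metrics defined from TP, FP, and FN can be computed directly from the true and predicted labels. The sketch below is a minimal illustration with made-up labels. The function name and the convention of returning 0.0 when a denominator is zero are choices made here, not taken from the paper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one intention category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for five questions (classes 1 to 3 only).
y_true = [1, 1, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3]
precision_2, recall_2, f1_2 = per_class_metrics(y_true, y_pred, label=2)
```

For class 2 in this toy example there are two true positives and one false positive, so precision is 2/3 while recall is 1.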

Dataset Configuration
For the intention classification model experiments, the training, validation, and testing sets were constructed as follows. The number of samples in each dataset is presented in Figure 9:
• unaugmented_train_set: 1427 genuine questions collected and labeled from the question corpus dataset;
• augmented_train_set: Randomly selecting 90% of the data from the 7055 questions obtained through data augmentation resulted in a total of 6222 questions;
• validation_set: Extracting the remaining 10% of the data from the 7055 questions obtained through data augmentation resulted in a total of 833 questions;
• test_set: 100 genuine questions were randomly selected from the unaugmented_train_set, and these questions were manually rewritten to create an additional 100 questions as the test set.

Figure 9. Statistics of each dataset.
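A random split of the augmented data into training and validation portions can be sketched as follows. The function name, the fixed seed, and the use of integer indices in place of real questions are assumptions for illustration. The sizes 6222 and 833 are the ones reported above.

```python
import random


def split_dataset(samples, n_train, seed=0):
    """Shuffle a copy of the samples and cut it into train and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


# Integer indices stand in for the 7055 augmented questions.
augmented = range(7055)
train_part, validation_part = split_dataset(augmented, n_train=6222)
```

A purely random split preserves the class proportions only approximately. Splitting each intention category separately would be an alternative when exact proportions matter.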

Experimental Results and Analysis
(1) Model Comparison Experiment
To evaluate the performance of the RBMA model in text classification tasks, we conducted comparative experiments between the RBMA model and seven other models using the augmented_train_set and validation_set. The accuracy, precision, recall, and F1 scores were calculated for each model. The experimental results are presented in Table 6.
Table 6 shows that the RBMA model outperforms the four baseline models, namely DPCNN, TextRCNN, BERT, and RoBERTa. Compared with RoBERTa, RBMA improves precision, recall, and F1 score by 1.44%, 6.11%, and 4.03%, respectively. This improvement results in a 6% increase in the accuracy of the final answer. The results indicate that the RBMA model performs better in intention classification tasks than the current mainstream text classification models. In the ablation experiment, the RBMA model improved on the RoBERTa-BiLSTM and RoBERTa-MultiHeadAttention models across the evaluation metrics. This validates the effectiveness of combining the ability of the BiLSTM layer to extract contextual information from sentences with the multi-level feature representation of the MHA mechanism.
(2) Impact of Text Data Augmentation on the Model
There are seven categories of intention for household-registration-related questions, as defined in Table 2. To explore the RBMA model's ability to classify each intention category and the impact of text data augmentation on the model's prediction performance, we trained the model on the unaugmented_train_set and on the augmented_train_set, validated its prediction performance on the validation_set, and recorded the evaluation metrics for each category. The specific data statistics are presented in Table 7. The average values of each evaluation metric and the loss function curves are compared in Figures 10 and 11, respectively.
According to the comparison of the evaluation metrics for each category in Table 7, text data augmentation significantly improves the prediction performance of the model for each category. However, we found that regardless of whether the model was trained on the unaugmented_train_set or the augmented_train_set, the recognition accuracy for class 2 was lower than for other categories. Upon analyzing the class 2 questions in the dataset, we found that questions related to processing procedures were highly colloquial and some of them were easily confused with class 3 and class 4 questions, for example, 'What are the procedural requirements for applying for death registration of a household?' and 'What are the procedures and documents required for changing the head of household?' These samples are difficult for the model to learn, resulting in weak prediction performance for this intention category.
As shown in the comparison of average evaluation metric curves in Figure 10, when the model is trained on the augmented_train_set, the model fits faster and all evaluation metrics improve significantly, resulting in a marked improvement in prediction performance. Similarly, Figure 11 shows that the model's convergence accelerates and the degree of overfitting is reduced, indicating improved generalization ability.

Experimental Method Comparison
After manually annotating the question dataset, we conducted a similarity comparison experiment to verify the effectiveness of using LTP to extract semantic role phrases for similarity matching. The experiment compared two processing methods: (1) Match the raw question text directly with the corresponding event entity for text similarity without any processing, and record the experimental accuracy and similarity values. (2) Use LTP for syntactic analysis of the question, extract semantic role phrases, reorganize the phrases into sentences and then conduct text similarity matching with the corresponding event entities, and record the experimental accuracy and similarity values.
We performed the text similarity experiments on the dataset of 7055 questions, calculated the similarity values, and presented the results as box plots.

Experimental Results and Analysis
The box plots displaying the results of the two methods are presented in Figure 12. Overall, the use of LTP to extract and reorganize semantic role phrases in method (2) improved the median, mean, maximum, and minimum event entity similarity values compared to method (1). We compared the similarity values of the two methods for the same questions and found that 56.71% of the questions showed improved similarity values with the use of method (2), while 12.74% of the questions showed no change. However, 30.55% of the questions exhibited decreased similarity values after processing with method (2).
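The per-question comparison of the two methods amounts to counting how often the similarity value rises, stays the same, or falls. A minimal sketch is given below. The function name, the tolerance used to decide that two values are equal, and the four toy similarity values are assumptions and are not taken from the experiment.

```python
def compare_similarities(raw, processed, tol=1e-9):
    """Return the fractions of questions whose similarity improved,
    stayed unchanged, or decreased when moving from method (1) to method (2)."""
    improved = unchanged = decreased = 0
    for before, after in zip(raw, processed):
        if abs(after - before) <= tol:
            unchanged += 1
        elif after > before:
            improved += 1
        else:
            decreased += 1
    n = improved + unchanged + decreased
    return {"improved": improved / n,
            "unchanged": unchanged / n,
            "decreased": decreased / n}


# Toy similarity values for four questions (hypothetical).
raw_scores = [0.50, 0.60, 0.70, 0.80]        # method (1): raw question text
processed_scores = [0.60, 0.60, 0.65, 0.90]  # method (2): LTP-reorganized text
summary = compare_similarities(raw_scores, processed_scores)
```

The three fractions always sum to one, which matches the way the reported percentages partition the question set.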
We analyzed the questions that exhibited decreased similarity values and found that their expressions were clearer, and had well-formed syntax and almost no colloquial expression. Therefore, such sentences may have become disordered and repeated when their syntax and semantics were reorganized using LTP, leading to a decrease in similarity values. These observations underscore that while method (2) is effective in enhancing the matching capability of event entities for questions with missing grammatical elements or ambiguous expressions, it may reduce the matching accuracy of properly structured sentences.

Conclusions
This paper presents a QA system based on a household registration domain KG, using the household registration documents released by Wuhan city as an example. To keep the QA system general, we construct the KG and the QA system as two separate parts. The KG processes the collected semi-structured and unstructured documents into structured data to establish a graph database that provides information support for the QA system. The QA system is trained on a manually collected and data-augmented question corpus to achieve accurate question parsing, and then retrieves answers from the KG. Policy changes and updates can be accommodated by rebuilding the KG from the new policies and regulations. The QA system can adapt to regional differences in question expression by adding question corpora for further training. We validate the effectiveness of the QA system by comparing two semantic parsing steps. Our experimental results show that 56.71% of questions obtain a higher similarity value after the sentence is reorganized using LTP. This approach helps resolve problems caused by missing grammatical components and unclear expressions in spoken language. Additionally, the RBMA model outperforms several widely used text classification models in identifying question intentions, with an F1 value of 97.74%, precision of 99.49%, and recall of 96.15%. Finally, we apply the GPT3.5 generative language model to augment the dataset and reduce the impact of data imbalance. After training, our QA system achieves an accuracy of 87% on the unaugmented dataset and 93% on the augmented one.
Although the QA system based on the household registration domain KG in this paper achieved good performance, it still has some limitations: (1) Because the question corpus is small, the amount of data remains insufficient even after data augmentation, which limits further improvement of the model. We plan to collect more language resources or manually write more questions to expand the dataset and improve the accuracy of the QA system. (2) In the event entity matching experiment, some questions showed lower similarity values after processing with LTP. Therefore, we will compare the similarity values before and after processing to ensure that well-formed sentences do not lose accuracy due to LTP processing. (3) Currently, for unrelated questions, we retrieve similar questions from the question corpus. However, with such a small-scale corpus, this often yields irrelevant or completely erroneous answers. To address this issue, we plan to use other generative LLMs to enhance the system's robustness and improve the performance of answering general questions.
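One possible reading of the planned fix for limitation (2) is to score each question both ways and keep whichever variant matches the event entity better. The sketch below illustrates that idea under stated assumptions: `similarity` uses `difflib.SequenceMatcher` as a stand-in measure, and `reorganize` is a hypothetical placeholder for the LTP-based processing. It is not the authors' implementation.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in text similarity in [0, 1] (assumption, not the paper's metric)."""
    return SequenceMatcher(None, a, b).ratio()


def best_match_score(
    question: str,
    entity: str,
    reorganize: Callable[[str], str],
) -> Tuple[str, float]:
    """Compare similarity before and after processing and keep the higher one.

    Returns which variant won ("raw" or "processed") and its score, so a
    well-formed question is never scored lower because of the processing step.
    """
    raw_score = similarity(question, entity)
    processed_score = similarity(reorganize(question), entity)
    if processed_score > raw_score:
        return "processed", processed_score
    return "raw", raw_score
```

Ties go to the raw question, so the processing step is only used when it strictly improves the match.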