MRE: A Military Relation Extraction Model Based on BiGRU and Multi-Head Attention

Abstract: A great deal of operational information exists in the form of text. Therefore, extracting operational information from unstructured military text is of great significance for assisting command decision making and operations. Military relation extraction, one of the main tasks of military information extraction, aims to identify the relation between two named entities in unstructured military texts. However, traditional methods of military relation extraction cannot easily resolve problems such as inadequate manual features and inaccurate Chinese word segmentation in the military field, and fail to make full use of symmetrical entity relations in military texts. In this paper, based on a pre-trained language model, we present a Chinese military relation extraction method that combines a bi-directional gated recurrent unit (BiGRU) with a multi-head attention mechanism (MHATT). More specifically, we construct an embedding layer that combines word embedding with position embedding on top of the pre-trained language model; the output vectors of the BiGRU networks are symmetrically spliced to learn the contextual semantic features, and fused with the multi-head attention mechanism to improve the expression of semantic information. We conduct extensive experiments on a military text corpus that we have built, and demonstrate the superiority of our method over the traditional non-attention model, attention model, and improved attention model: the comprehensive evaluation value (F1-score) of the model is improved by about 4%.


Introduction
With the progress in science and technology, and the evolution of war patterns, operational information and intelligence data have exponentially increased. This massive amount of information forms a "war fog", which directly interferes with the commander's command decision making. Military data abound in the form of unstructured text. Therefore, understanding how to extract valuable operational information from unstructured text, and how to build a military knowledge base for command and decision support, has become a topic of intense research in the field of military information extraction. As one of the basic tasks in military information extraction technology, military relation extraction is a key approach to creating military knowledge bases and a military knowledge graph [1]. This approach also facilitates improvements in the quality of operational information services, assisting commanders in decision making.
At present, the most prevalent method of entity relation extraction in domain-specific fields is supervised learning. In particular, the effect of relation extraction is significantly improved by the in-depth application of deep neural network models. Nevertheless, this method requires considerable time and effort to construct a large number of artificial features and to label numerous documents, which directly affects relation extraction. Compared with other fields, the artificially constructed features of military text are not obvious, Chinese word segmentation is not very accurate, and the correlation between input and output is sometimes poor. Moreover, common relation extraction takes a single sentence as its processing unit, without taking the semantic association between sentences into account.
To address the above issues, we design a feature representation method that combines word embedding with position embedding based on the pre-trained model, and use BiGRU networks and the multi-head attention mechanism to capture the semantic features of military text and achieve effective extraction of military relations. We offer the following contributions: (1) We encode the input military text using the pre-trained language model. The word features and position features of military text are combined to generate the vector feature of military text, so that the semantic features of military text can be expressed more effectively. (2) We apply a multi-head attention mechanism combined with BERT to military relation extraction. As a variant of self-attention, the core idea of this approach is to calculate self-attention in multiple dimensional subspaces, so that, based on the effective expression of semantic features from BERT, the model can learn more semantic features of military texts from different subspaces and thus capture more contextual information. For example, relation extraction will automatically identify the relation between the symmetric entity pair "1st Infantry Division" and "16th Infantry Regiment" as a "command relation", and generate the relation triple (1st Infantry Division, 16th Infantry Regiment, Command). Consequently, the extraction of symmetric entity relations has a wide spectrum of applications in the fields of combat data processing, military knowledge graph construction, commander's critical information requirements (CCIRs), and question answering on military knowledge [3].
Feature-based methods are based on feature vectors. First, different feature sets are constructed manually; next, they are transformed into feature vectors and input into appropriate classifiers to realize relation extraction. For example, Kambhatla et al. [4] combined word features, syntactic features, and semantic features, and designed a classifier based on a maximum entropy model; the F1-score reached 52.8% on the ACE RDC2003 evaluation dataset. Che et al. [12] used the entity type, the order of occurrence of the two entities, and the number of words around the entities as features, and designed a classifier based on support vector machines (SVM); on the ACE RDC2004 dataset, the F1-score reached 73.27%. However, since the extraction efficiency relies heavily on artificially constructed features, it is difficult to improve the performance of this method.
The kernel-based method was first introduced by Zelenko et al. [13]. This method does not require the construction of feature vectors. Instead, it calculates the similarity of two nonlinear structures by analyzing the structural information of the corpus and adopting appropriate kernel functions, so as to realize relation extraction. Extensive experiments have indicated that this method can achieve useful results. Plank et al. [14] proposed introducing structural information and semantic information into kernel functions simultaneously to cope with the relation extraction problem. However, since all data must be fully summarized by the kernel function, the validity of the kernel function is the key to the extraction efficiency of kernel-based methods.
In recent years, deep learning methods have been enthusiastically applied to various fields of NLP, due to their superiority in learning and expressing deep features, and have made good progress on the entity relation extraction task. For instance, Liu et al. [15] first applied convolutional neural network theory to tackle relation extraction problems. Specifically, they built an end-to-end network based on convolutional neural networks, and encoded sentences according to synonym vectors and lexical features. On the ACE 2005 dataset, the model was 9% better than the most advanced kernel-based model at that time. Zeng et al. [16] proposed a novel piecewise convolutional neural network (PCNN) model based on multi-instance learning, which can not only automatically extract the internal features of sentences, but also effectively reduce the impact of noise. Although convolutional neural networks (CNN) can effectively improve the efficiency of relation extraction, CNN is not suitable for learning long-distance dependencies [10]. Recurrent neural networks (RNN) can learn long-distance dependencies, but suffer from the vanishing gradient problem during training, which limits the processing of context [17]. To address these problems, Hochreiter and Schmidhuber [18] designed the long short-term memory (LSTM) network, which effectively alleviates the vanishing gradient problem of RNN by introducing gating units. Zhou et al. [19] applied a BiLSTM neural network to learn sentence features, and a self-attention mechanism to capture more semantic information in sentences. Experiments on the SemEval2010 dataset show that attention-based mechanisms can effectively boost the efficiency of relation extraction. As a simplification of the LSTM model, the GRU neural network was first used in machine translation tasks.
This model has the advantages of simple calculation and high execution efficiency [20], and has recently been used in relation extraction tasks. Luo et al. [21] achieved good results by combining BiGRU neural networks and attention mechanisms to build a geographic data analysis model, from which entity relations are extracted. Zhang et al. [22] proposed a model combining a dual-layer attention mechanism and a BiGRU neural network to realize the extraction of character relations; the experimental results showed a significant improvement in extraction efficiency. Zhou et al. [23] proposed a neural network-based attention model (NAM) for the extraction of chemical-disease relations (CDR). Li et al. [24] proposed a relation extraction model based on a dual attention-guided graph convolutional network, using the dual attention mechanism to capture rich context dependencies, and achieved better performance. Thakur et al. [25] proposed a model of entity and relation extraction in IoT. Liu et al. [26] proposed a relation extraction method based on CRF and the syntactic analysis tree, and created a military knowledge graph.
Compared with relation extraction in open domains and other specific domains, military texts contain numerous abbreviations, combinations, nesting, and other complicated grammatical forms; as shown in Figure 2, military texts are usually very concise, yet a single long sentence may contain many kinds of relations. The semantic relationships of Chinese noun phrases are more complex than those of English [27]. Therefore, it is difficult to obtain effective entity and relation features. Meanwhile, existing word segmentation tools are mainly applicable to the general domain, and it is difficult to achieve good results in the military domain. Thus far, there is no public corpus in the military domain, which makes the extraction of military relations more difficult.
Military texts are usually long and contain many long-dependency sentences. Since the BiGRU structure can capture rich contextual features in military text, it makes relation extraction more effective.

Military Relation Extraction Model
Based on the pre-trained language model, this paper presents a military relation extraction model that combines BiGRU and the multi-head attention mechanism. The structure of this model is shown in Figure 3. First, all the characters in the input sentence are vectorized using the pre-trained language model [28], and the relative position vectors of each character are calculated. The word embedding and the position embedding are then concatenated to generate the sentence eigenvectors, which are input into the bi-directional gated recurrent unit (BiGRU) to capture the high-dimensional semantic features of sentences; more contextual information is captured by a multi-head attention mechanism; and finally, the conditional probability of each relation type is calculated through a SoftMax classifier, which outputs the classification results.
(Example military text: "The enemy of the 16th Infantry Regiment (less the 2nd Cavalry Battalion) was the German 352nd Infantry Division, and the 1st Infantry Division ordered that it should land on Omaha Beach before 6:30 on June 6 and immediately attack the Fusiliers Battalion.")

Embedding Layer
Before being input into the neural network model, sentences in natural language text must first be represented as vectors. To achieve the embedding of sentences, we combine word embedding with position embedding.

Word Embedding
There are many ways to achieve word embedding. The main word embedding models are Word2Vec [29] and GloVe [30]. These models are usually static and fixed, and cannot change with context, so they cannot effectively express word features in the context of military texts. Pre-trained language models can express rich sentence syntax and grammar information, and can model the ambiguity of words. These models are widely used in natural language processing, such as in information extraction and text classification [31]. Bi-directional encoder representations from transformers (BERT) is one such pre-trained language model, proposed by Google in 2018 [28].

We use BERT to achieve the word embedding, as shown in Figure 4. A sentence in a Chinese military text containing N characters can be represented as S = (s_1, s_2, …, s_N). Each character contains the following three types of features: character features, sentence features, and position features. We represent the character features of S as (e_1^t, e_2^t, …, e_N^t), the sentence features as (e_1^s, e_2^s, …, e_N^s), and the position features as (e_1^p, e_2^p, …, e_N^p). The input of the word vector representation layer of BERT is the sum of the character, sentence, and position features, as follows: C_i = e_i^t + e_i^s + e_i^p, C = (C_1, C_2, …, C_N).
After inputting C into the multi-layer transformers, we obtain the final word embedding X_w = (x_1, x_2, …, x_N) = BERT(C_1, C_2, …, C_N), where the dimension of x_i is d_w.
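As an illustration of the input construction above, the following minimal NumPy sketch sums toy character, sentence, and position features into C_i = e_i^t + e_i^s + e_i^p. The dimensions and the random lookup tables are assumptions standing in for BERT's learned embedding matrices, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 8   # toy sentence length and embedding dimension (assumptions)

# Random stand-ins for BERT's learned embedding tables.
char_emb = rng.normal(size=(N, d))                 # character features e^t
sent_emb = np.tile(rng.normal(size=d), (N, 1))     # sentence (segment) features e^s
pos_emb  = rng.normal(size=(N, d))                 # position features e^p

# C_i = e_i^t + e_i^s + e_i^p : element-wise sum, one vector per character
C = char_emb + sent_emb + pos_emb
assert C.shape == (N, d)
```

Note that the three feature vectors are summed element-wise, not concatenated, so the input to the transformer layers keeps the dimension d.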


Position Embedding
Although word embedding can effectively capture the word information in a sentence, it is difficult for it to obtain the structural information of the sentence. The distance between a word and an entity directly affects the determination of the entity relation. Therefore, position embedding X_p is used in this paper to denote the relative distance between the current word and the two entities. As shown in Figure 5, the relative positions of the current word "在" (on) with respect to the military named entities "第六步兵团" (16th Infantry Regiment) and "奥马哈海滩" (Omaha Beach) are 6 and -1, and each relative position is mapped to a position embedding of dimension d_p, giving X_p1 and X_p2.
(Example sentence from Figure 5: "The 1st Infantry Division ordered the 16th Infantry Regiment to land on Omaha Beach.") Finally, the word embedding is concatenated with the position embedding to generate the complete feature representation vector X_i = [X_i^w; X_i^p1; X_i^p2]; the dimension of X_i is d = d_w + 2d_p.
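A minimal sketch of the relative-position computation and the final concatenation described above. The token indices, dimensions, and random lookup tables are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

tokens = list("ABCDEFGHIJ")   # placeholder characters of a sentence
e1, e2 = 2, 7                 # assumed indices of the two entity head characters
T = len(tokens)

# Signed relative distance from each character to each entity.
rel1 = [i - e1 for i in range(T)]
rel2 = [i - e2 for i in range(T)]

d_w, d_p = 8, 3               # toy dimensions
rng = np.random.default_rng(1)
word_vecs = rng.normal(size=(T, d_w))        # stand-in for BERT word embeddings
pos_table = rng.normal(size=(2 * T, d_p))    # lookup table over shifted distances

# X_i = [X_i^w ; X_i^p1 ; X_i^p2], so dim(X_i) = d_w + 2 * d_p
X = np.stack([np.concatenate([word_vecs[i],
                              pos_table[rel1[i] + T],
                              pos_table[rel2[i] + T]])
              for i in range(T)])
assert X.shape == (T, d_w + 2 * d_p)
```

Shifting the relative distance by T turns the signed range [-T, T) into a valid index into the embedding table, which is a common implementation choice for position embeddings.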

BiGRU Layer
GRU neural networks are essentially a variant of recurrent neural networks (RNN). To address the problems that a traditional RNN rewrites its own memory at every step and suffers from gradient dispersion, Hochreiter et al. [18] proposed the long short-term memory (LSTM) network, which mainly comprises input gates, forget gates, and output gates. As shown in Figure 6, GRU is a simplified LSTM network that is computationally cheaper while maintaining the effect of LSTM neural networks.

As shown in Figure 6, x_t is the input vector, h_{t-1} is the hidden state at time t-1, and h_t is the output vector of the current GRU. At time t, x_t and h_{t-1} are input into the GRU network, and the output h_t is obtained with Formulas (1)-(4), as follows:

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)   (1)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)   (2)
h̃_t = tanh(W_h x_t + U_h (r_t ⊗ h_{t-1}) + b_h)   (3)
h_t = (1 − z_t) ⊗ h_{t-1} + z_t ⊗ h̃_t   (4)

where σ is the Sigmoid function, which helps the GRU network retain or forget information; ⊗ is the element-wise product; z_t is the update gate; r_t is the reset gate; and h̃_t is the candidate hidden state at time t. W_z, W_r, W_h are the input weights for the current time, U_z, U_r, U_h are the weights for the recurrent input, and b_z, b_r, b_h are the corresponding bias vectors. In order to make full use of the contextual information in military texts, we chose the BiGRU structure, which includes a forward hidden layer and a backward hidden layer. As shown in Figure 7, each input sequence is fed into a forward GRU network and a backward GRU network, and two symmetrical hidden-layer state vectors are obtained. These two state vectors are symmetrically merged to obtain the final encoded representation of the input sentence, as follows:

h_t = [→h_t ; ←h_t]   (5)
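The GRU gating equations and the bidirectional merge described above can be sketched in NumPy as follows. The function names and toy dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU step: update gate, reset gate, candidate state, interpolation."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])     # update gate
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])     # reset gate
    h_cand = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1 - z) * h_prev + z * h_cand                        # new hidden state

def init(d_in, d_h, rng):
    g = lambda *s: rng.normal(scale=0.1, size=s)
    return {"Wz": g(d_h, d_in), "Uz": g(d_h, d_h), "bz": g(d_h),
            "Wr": g(d_h, d_in), "Ur": g(d_h, d_h), "br": g(d_h),
            "Wh": g(d_h, d_in), "Uh": g(d_h, d_h), "bh": g(d_h)}

def bigru(X, Pf, Pb, d_h):
    """Run forward and backward GRUs; concatenate the two states per step."""
    T = len(X)
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], [None] * T
    for t in range(T):
        hf = gru_step(X[t], hf, Pf)
        fwd.append(hf)
    for t in reversed(range(T)):
        hb = gru_step(X[t], hb, Pb)
        bwd[t] = hb
    return np.stack([np.concatenate([fwd[t], bwd[t]]) for t in range(T)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                  # T=5 steps, input dimension 4
H = bigru(X, init(4, 3, rng), init(4, 3, rng), 3)
assert H.shape == (5, 6)                     # d_h = 3 per direction, concatenated
```

Because the forward and backward states are concatenated rather than summed, every position carries both left and right context, which is what the paper exploits for long military sentences.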

Multi-Head Attention Layer
The multi-head attention mechanism can be used to represent the correlation between the input and the output during text processing tasks. In military text, there are usually a large number of technical terms and abbreviations, and the referential relationships are complex and the sentence structures are diverse. Using the multi-head attention mechanism, the entity and relation information can be effectively analyzed and extracted.
After the sentence X = (x 1 ,x 2 ,…,x T ) is computed from the BiGRU layer, we can obtain the vector H = (H 1 ,H 2 ,…,H T ), where T is the length of X, and γ is the weighted average of H. We can construct the general attention model as follows:

where H ∈ R^(d_w×T), in which d_w is the dimension of the embedding layer, w is a parameter vector learned in training, and w^T is its transpose. The general attention model is:

M = tanh(H), α = softmax(w^T M), γ = H α^T

After the single-head attention calculation, we obtain the output eigenvalue h* = tanh(γ). As shown in Figure 8, the multi-head attention mechanism [32] helps our model derive more features from different representation subspaces and capture more contextual information from military texts. In a single self-attention calculation, H is transformed linearly [28] to obtain W_i^h H, where W_i^h ∈ R^(d_h/k × d_h), i ∈ {1, 2, …, k}. Additionally, by using the mechanism of multiplicative attention, the calculation can be implemented as highly optimized matrix multiplication. Formulas (6)-(8) are computed k times:

M_i = tanh(W_i^h H)   (6)
α_i = softmax(w_i^T M_i)   (7)
h_i* = (W_i^h H) α_i^T   (8)

After splicing and linearly mapping the k results, we obtain the final result, as follows:

h_s = w_s ⊗ concat(h_1*, h_2*, …, h_k*)   (9)

where w_i is the per-head parameter vector, the dimension of w_s is k × d_h, and ⊗ means element-wise multiplication.
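A minimal NumPy sketch of the attention computation described above, under the assumption that each head applies the general attention formulas to a linear projection W_i^h H of the hidden states; the per-head query vectors and toy dimensions are illustrative, not the paper's trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(H, w):
    """H: (d, T) hidden states; w: (d,) query vector. Returns weighted average."""
    M = np.tanh(H)              # nonlinearity over the states
    alpha = softmax(w @ M)      # (T,) attention weights over positions
    return H @ alpha            # weighted average of the columns of H

def multi_head(H, Ws, w, w_s):
    """Ws: k projection matrices of shape (d_h/k, d_h); concat per-head outputs."""
    heads = [attention_head(W @ H, w[: W.shape[0]]) for W in Ws]
    return w_s * np.concatenate(heads)     # element-wise scaling, as in Eq. (9)

rng = np.random.default_rng(0)
d_h, T, k = 6, 5, 3
H = rng.normal(size=(d_h, T))
Ws = [rng.normal(size=(d_h // k, d_h)) for _ in range(k)]
h_s = multi_head(H, Ws, rng.normal(size=d_h), rng.normal(size=d_h))
assert h_s.shape == (d_h,)     # k heads of size d_h/k, concatenated back to d_h
```

Each head attends over the same T positions but through a different d_h/k-dimensional projection, which is what lets the model read the sentence from several subspaces at once.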

Output Layer
Military relation extraction is essentially a multi-classification problem. There are several common classifiers, such as k-NN, random forest, and SoftMax [33]. SoftMax is simpler to compute and gives more remarkable results than the others, so we chose SoftMax in the output layer to calculate the conditional probability of each relation type, and the relation category with the maximum conditional probability is chosen as the predicted result. Let y' denote a predefined relation type of the entity pair in sentence S. SoftMax takes h_s as its input, and the predicted relation type y is obtained as follows:

p(y'|S) = SoftMax(W_o h_s)   (10)
y = argmax_{y'} p(y'|S)   (11)

where W_o ∈ R^(c×kd_w), in which c is the number of relation types in the dataset. We chose the cross-entropy loss function with an L2 penalty as the objective function, as follows:

J = − Σ_{i=1}^{m} y_i log y_i' + λ‖θ‖_2^2   (12)

where m is the number of relations in sentence S, y_i' is the probability of each relation type obtained through SoftMax, y_i is the corresponding true-label indicator, and λ is the L2 regularization factor.
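The SoftMax prediction and the cross-entropy objective with L2 penalty can be sketched as follows; the dimensions, the bias-free linear map, and the λ value are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def predict(h_s, W_o):
    """Conditional probability of each relation type, and the argmax class."""
    p = softmax(W_o @ h_s)           # p(y'|S)
    return p, int(np.argmax(p))      # y = argmax over relation types

def loss(p, target, params, lam=1e-3):
    """Cross-entropy for the gold class index plus an L2 penalty on weights."""
    l2 = sum(np.sum(w ** 2) for w in params)
    return -np.log(p[target]) + lam * l2

rng = np.random.default_rng(0)
c, d = 4, 6                          # toy: 4 relation types, feature dimension 6
W_o = rng.normal(size=(c, d))
p, y = predict(rng.normal(size=d), W_o)
assert np.isclose(p.sum(), 1.0) and 0 <= y < c
```

With a one-hot gold label, the sum in the paper's objective reduces to the single −log term above, which is why the cross-entropy is often written per-instance this way.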


Dataset
At present, the research on entity relation extraction tasks in the open domain, and in the medical and judicial fields, is mature, and there are many open datasets, such as ACE2003-2004 [34], SemEval2010 Task8 [35], and FewRel [36]. Research on relation extraction in Chinese is also developing gradually, but relation extraction in the military field is basically in its infancy, with no public datasets. Military scenarios, an important form of military text, contain a large amount of military information, such as subordination relations, location relations, attacking relations, etc. Therefore, we chose military scenarios as the research object; through the analysis of a large number of military scenarios, we organized experts in the military field to conduct research and discussion, and defined the relations in the military field in combination with the specifications of military documents. Table 1 lists six coarse-grained categories and 12 fine-grained categories of military relations, and Table 2 shows an example of the military relation labeling corpus.

Table 1 (excerpt). Military relation categories:

Coarse-grained category   Fine-grained category   Description
Alliance Relation         Alliance                The two organizations are in alliance.
Location Relation         Deploy (Dep)            Entity is deployed in a specific location.
                          Route                   Entity is in a position.
Equipment Relation        Own                     Organization configures some equipment.
                          Link                    Organization links some equipment; equipment links some equipment.
Target Relation           Target                  Organization attacks some organization; organization attacks some location.
Evaluation Criteria
We randomly selected military scenario texts as analysis objects, manually annotated 50 texts (about 320,000 words), and obtained a corpus of 6105 military text sentences as the dataset for the experiments. The distribution of relations in the military text corpus is shown in Table 3. In the experiments, TP denotes the number of correctly classified relations, FP denotes the number of incorrectly classified relations, and FN denotes the number of relations that should have been classified but were not. Precision, recall, and F1-score are chosen as the evaluation criteria, and can be calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
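The three criteria follow directly from the TP/FP/FN counts; a small self-checking sketch (the counts are made-up illustration values, not the paper's results):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from the counts defined above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Illustrative counts: 80 correct, 20 spurious, 20 missed predictions.
p, r, f1 = prf1(tp=80, fp=20, fn=20)
assert p == 0.8 and r == 0.8
```

F1 is the harmonic mean of precision and recall, so a model cannot raise it by trading one metric away entirely for the other.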

Parameters Setting
In the experiment, we used our previously developed cross-validation method to optimize the parameters of our model; the data are verified in the literature [9]. The specific parameters are shown in Tables 4-6. Since a self-attention head number k in the multi-head layer that is too large or too small degrades performance, we first determined the value of k before starting the comparative experiments. Referring to the experiments of Vaswani et al. [32], we took k = {1, 2, 4, 6, 10, 15, 30} as the candidate values (k must divide the hidden dimension d_h); the results are shown in Table 7. As k increases, the comprehensive evaluation index of the model reaches its highest value at k = 6. Therefore, the value of parameter k in this experiment is six.
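The divisibility constraint exists because multi-head attention splits the hidden dimension evenly across heads, so d_h / k must be an integer. A minimal numpy sketch of the split (the hidden size 60 and sequence length are hypothetical values for illustration, not the paper's actual settings; note that all candidate head numbers above divide 60):

```python
import numpy as np

def split_heads(x, k):
    """Split the hidden dimension of x, shape (seq_len, d_h), into k heads.

    Requires d_h to be divisible by k, which is why the candidate
    head numbers must all divide the hidden size.
    """
    seq_len, d_h = x.shape
    assert d_h % k == 0, f"d_h={d_h} is not divisible by k={k}"
    # (seq_len, d_h) -> (seq_len, k, d_h/k) -> (k, seq_len, d_h/k)
    return x.reshape(seq_len, k, d_h // k).transpose(1, 0, 2)

x = np.zeros((35, 60))                # hypothetical: 35 tokens, d_h = 60
for k in (1, 2, 4, 6, 10, 15, 30):    # the candidate values from Table 7
    assert split_heads(x, k).shape == (k, 35, 60 // k)
```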

Results and Analysis
The military text corpus is selected as the training corpus and test corpus, and the experiment is conducted according to the set experimental parameters. To verify the validity of our model, we have designed a number of comparative experiments.

Comparison of Results on Different Embedding Methods
In this section, we employ the commonly used tool Word2Vec (with its dimension set to 100) [29] for the comparative experiments. We compared four feature representation methods to verify the effectiveness of the pre-trained-model-based embedding (BERT) combined with word embedding and position embedding: (1) Word2Vec + word; (2) Word2Vec + word + position; (3) BERT + word; (4) BERT + word + position.
As shown in Table 8, embedding methods based on the pre-trained model are superior to those based on Word2Vec: the F1-score of BERT + word is 7.6% higher than that of Word2Vec + word, and that of BERT + word + position is 7.9% higher than that of Word2Vec + word + position. At the same time, input embeddings combining word embedding and position embedding outperform those based on word embedding alone (the F1-score of Word2Vec + word + position is 4.1% higher than that of Word2Vec + word, and BERT + word + position is 4.4% higher than BERT + word). This indicates that feature vectors combining word embedding and position embedding better express the semantic features of military text.
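One common way to realize the "word + position" representation in relation extraction is to concatenate each token's word vector with two position embeddings encoding its relative distance to the head and tail entities. The numpy sketch below illustrates that concatenation under assumptions: the dimensions, the random position-embedding table, and the clipping range are all illustrative, and the actual model derives its word vectors from BERT rather than at random.

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_pos, max_dist = 768, 20, 100          # illustrative dimensions
# Lookup table for relative distances clipped to [-max_dist, max_dist].
pos_table = rng.normal(size=(2 * max_dist + 1, d_pos))

def embed_token(word_vec, dist_head, dist_tail):
    """Concatenate a word vector with two relative-position embeddings."""
    e_head = pos_table[int(np.clip(dist_head, -max_dist, max_dist)) + max_dist]
    e_tail = pos_table[int(np.clip(dist_tail, -max_dist, max_dist)) + max_dist]
    return np.concatenate([word_vec, e_head, e_tail])

# A token 3 positions before the head entity and 7 after the tail entity.
v = embed_token(rng.normal(size=d_word), dist_head=-3, dist_tail=7)
assert v.shape == (d_word + 2 * d_pos,)          # 768 + 20 + 20 = 808
```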

Comparison of Results on Different Feature Extraction Models
To verify the advantages of the BiGRU-MHATT model, several classical relation extraction models are set up in this paper for comparison: traditional non-attention models (BiLSTM, BiGRU); traditional attention models (BiLSTM-ATT, BiGRU-ATT); and improved attention models (BiLSTM-2ATT, BiGRU-2ATT).
(1) The structure of BiGRU. As shown in Table 9, the BiGRU networks extract military relations more effectively: the F1-score with the BiGRU structure is 1.8-2.5% higher than that with the BiLSTM structure. The GRU network, as a variant of the LSTM network, can not only acquire memory sequence characteristics effectively, but can also learn long-distance dependency information. Military texts are usually long and contain many long-dependency sentences; as the BiGRU structure can acquire rich contextual features in military text, relation extraction is more effective.
(2) The influence of the attention mechanism. From Table 9, we can observe that the attention-based models outperform the non-attention models: the F1-score of the BiLSTM-ATT model is 4.4% higher than that of the BiLSTM model, and the F1-score of the BiGRU-ATT model is 5.1% higher than that of the BiGRU model, indicating that the attention mechanism effectively improves the accuracy of military relation extraction. At the same time, the F1-score of our proposed model, which combines BiGRU with a multi-head attention mechanism, is at least 4.1% higher than that of the other attention-based models, indicating that our model can learn more sentence characteristics from military text, thus improving the efficiency of military relation extraction. Therefore, we believe that the model proposed in this paper has several advantages in extracting military relations.
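As a minimal illustration of the bidirectional splicing that the BiGRU layer performs, the toy numpy sketch below runs one forward and one backward GRU over a token sequence and concatenates their hidden states per token. The weights are random, bias terms are omitted, and all dimensions are hypothetical; this is a sketch of the mechanism, not the paper's trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    """Minimal GRU cell (update gate z, reset gate r); biases omitted."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
        self.Wr = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
        self.Wh = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
        self.d_h = d_h

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                            # update gate
        r = sigmoid(self.Wr @ xh)                            # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def bigru(xs, fwd, bwd):
    """Run forward and backward GRUs and splice their states per token."""
    hf, hb = np.zeros(fwd.d_h), np.zeros(bwd.d_h)
    f_states, b_states = [], []
    for x in xs:                       # left-to-right pass
        hf = fwd.step(x, hf)
        f_states.append(hf)
    for x in reversed(xs):             # right-to-left pass
        hb = bwd.step(x, hb)
        b_states.append(hb)
    b_states.reverse()
    return [np.concatenate([f, b]) for f, b in zip(f_states, b_states)]

xs = [np.ones(8) for _ in range(5)]    # 5 tokens, input dim 8 (hypothetical)
out = bigru(xs, GRUCell(8, 16), GRUCell(8, 16, seed=1))
assert len(out) == 5 and out[0].shape == (32,)  # 16 forward + 16 backward
```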

Comparison of Results on Different Training Data Sizes
To test the training efficiency of the model, we designed training sets of six different sizes, between 1000 and 5000 instances, and evaluated the performance of the BiLSTM-2ATT, BiGRU-2ATT, and BiGRU-MHATT models. As shown in Figure 9, the performance gap between the three models becomes more significant as the size of the training set increases. When the training set reaches 4000 instances, the F1-score of BiGRU-MHATT already approaches the maximum value reached by BiGRU-2ATT. This indicates that the BiGRU-MHATT model proposed in this paper can make full use of the training data.
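A learning-curve experiment of this kind can be set up by drawing nested random subsets of increasing size from the full corpus, so each point on the curve trains on a superset of the previous one. The sketch below uses hypothetical sizes (the paper reports sizes between 1000 and 5000 but does not list the exact values) and the corpus is represented abstractly as a list of instances.

```python
import random

def learning_curve_splits(corpus, sizes=(1000, 2000, 3000, 4000, 5000),
                          seed=42):
    """Draw nested random training subsets of increasing size.

    Nesting the subsets (each larger set contains every smaller one)
    keeps the points on the learning curve directly comparable.
    """
    rng = random.Random(seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

# 6105 is the size of the annotated corpus reported above.
splits = learning_curve_splits(list(range(6105)))
assert sorted(splits) == [1000, 2000, 3000, 4000, 5000]
assert set(splits[1000]) <= set(splits[2000])   # subsets are nested
```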

Comparison of Results on Different Sentence Lengths
To test the sensitivity of the model to sentence length, we classified the test corpus by sentence length into the bins (<20, [20,30], [30,40], [40,50], >50) and evaluated the performance of BiLSTM-2ATT, BiGRU-2ATT, and BiGRU-MHATT. As shown in Figure 10, BiGRU-MHATT is superior to the BiLSTM-2ATT and BiGRU-2ATT models at every sentence length. As sentence length increases, the performance of all three models shows a downward trend, but the decline of the BiGRU-MHATT model is slower than that of the other two models. The results show that BiGRU-MHATT can acquire the semantic features of long text more effectively.
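The binning step can be sketched as a simple bucketing function. Note that the paper's bin labels overlap at their boundaries ([20,30] and [30,40] both contain 30); the sketch below resolves this by treating the bins as half-open intervals, which is an assumption, and the sample lengths are hypothetical.

```python
from collections import Counter

def length_bucket(n):
    """Assign a sentence of length n to one of the length bins above.

    Bins are treated as half-open intervals (an assumption, since the
    paper's labels overlap at the boundaries).
    """
    if n < 20:
        return "<20"
    if n < 30:
        return "[20,30)"
    if n < 40:
        return "[30,40)"
    if n < 50:
        return "[40,50)"
    return ">=50"

corpus_lengths = [12, 25, 25, 33, 47, 61, 61, 61]   # hypothetical lengths
counts = Counter(length_bucket(n) for n in corpus_lengths)
assert counts["[20,30)"] == 2 and counts[">=50"] == 3
```

Each bucket's instances can then be evaluated separately to produce one point per model per bin, as in Figure 10.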

Comparison of Results on a Public Dataset
To verify the generalization ability of our model, we conducted experiments on a public corpus, SemEval2010-Task8, which contains 10,717 sentences: 8000 training and 2717 testing instances.
As shown in Table 10, the extraction performance of our model on the English dataset is not the best, which indicates that its generalization ability still needs to be improved. On the other hand, this confirms that our model, based on the pre-trained language model and the multi-head attention mechanism, can learn more features from military texts.

Conclusions and Future Work
In this paper, we construct a military relation extraction model based on the characteristics of Chinese military text. Experimental results on the constructed military text corpus show that our model achieves better performance than traditional non-attention models, traditional attention models, and improved attention models.
Further experiments show that, although the generalization ability of our model on a dataset in a different language needs to be improved, the model exhibits strong robustness across different training data sizes and sentence lengths.
In the future, we plan to expand the military text corpus, distinguish fine-grained semantic information to achieve fine-grained military relation extraction, and attempt to extract military entities and relations jointly.