DAWE: A Double Attention-Based Word Embedding Model with Sememe Structure Information

Abstract: Word embedding is an important foundation for natural language processing tasks; it generates distributed representations of words from large amounts of text data. Recent evidence demonstrates that introducing sememe knowledge is a promising strategy to improve the performance of word embedding. However, previous works ignored the structure information of sememe knowledge. To fill the gap, this study implicitly synthesized the structural feature of sememes into word embedding models based on an attention mechanism. Specifically, we propose a novel double attention-based word embedding (DAWE) model that encodes the characteristics of sememes into words by a “double attention” strategy. DAWE is integrated with two specific word training models through context-aware semantic matching techniques. The experimental results show that, in the word similarity task and the word analogy reasoning task, the performance of word embedding can be effectively improved by synthesizing the structural information of sememe knowledge. The case study also verifies the power of the DAWE model in the word sense disambiguation task. Furthermore, the DAWE model is a general framework for encoding sememes into words, which can be integrated into other existing word embedding models to provide more options for various natural language processing downstream tasks.


Introduction
The basis of applying deep learning to natural language processing (NLP) tasks is to obtain high-quality representations of words from large amounts of text data [1]. Traditionally, words are represented in a sparse high-dimensional space using count-based vectors in which each word in a vocabulary is represented by a single dimension [2]. In contrast, word embedding aims to map words into a continuous low-dimensional semantic space; in this way, each word is represented by a real-valued vector, namely a word vector, often composed of tens or hundreds of dimensions [3]. Word embedding assumes that words used in similar ways should have similar representations, thereby naturally capturing their meaning. Word vectors obtained from word embedding have been widely used in many applications: text summarization, sentiment analysis, reading comprehension, machine translation, etc.
Among the many word embedding works that have emerged in recent years, the Word2Vec [3] model strikes a good balance between efficiency and quality.
For instance, the word (the first layer) "Apple" contains three senses (the second layer): "Apple Brand" (a famous computer brand), "Apple" (a sort of fruit) and "Apple Tree". The third layer consists of the sememes explaining each sense. The sememes of the sense "Apple Brand" are "computer", "PatternValue", "able", "bring" and "SpeBrand (specific brand)". The sememe of the sense "Apple" is "fruit". The sememes of the sense "Apple Tree" are "fruit", "reproduce" and "tree". The SE-WRL model assumes that all sememes under a sense contribute equally. However, the sememes under a sense are not interchangeable, which means that the contribution of each sememe to the sense should vary depending on the particular case. The inequality may be caused by two main reasons: (1) Different senses correspond to different hierarchical structures of sememes. The sememes are organized into a hierarchical structure, such as the sememes in the sense "Apple brand" in Figure 1. Because of the hierarchical structure, there is fusion among sememes, which means that sememes at different branches of different levels are usually not equivalent. For example, the sememe "computer" in the sense "Apple brand" can be represented by its lower-layer sememes ("PatternValue", "able", "bring" and "SpeBrand (specific brand)"). Furthermore, sememes at the same level are not equivalent in most cases. (2) The context of sememes varies. The meaning of a word needs to be reflected in a specific context, so the sememes are also affected by the context of the word.
As shown in Figure 2, when the word "Apple" appears in the context "I am going to the ~ store now.", the meaning of "Apple" should be close to the sense "Apple brand".
At this point, the sememe "SpeBrand" should have a higher weight than other sememes of the sense "Apple brand". Therefore, the weights of sememes in a sense should change dynamically with different contexts.
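The three-layer structure described above can be sketched as a nested mapping. This is only an illustration under stated assumptions: the dictionary name `WORD_SENSE_SEMEME` and the helper functions are hypothetical, and the flat sememe lists deliberately ignore the hierarchy among sememes that the text discusses.

```python
from collections import Counter

# Hypothetical in-memory form of the word -> sense -> sememe layers for "Apple".
# The sememe lists are flat; the hierarchy among sememes is NOT modeled here.
WORD_SENSE_SEMEME = {
    "Apple": {
        "Apple Brand": ["computer", "PatternValue", "able", "bring", "SpeBrand"],
        "Apple": ["fruit"],
        "Apple Tree": ["fruit", "reproduce", "tree"],
    }
}


def senses_of(word):
    """Return the senses (second layer) of a word, in insertion order."""
    return list(WORD_SENSE_SEMEME[word])


def sememes_of(word, sense):
    """Return the sememes (third layer) that explain one sense of a word."""
    return WORD_SENSE_SEMEME[word][sense]


def shared_sememes(word):
    """Return the sememes that appear under more than one sense of a word."""
    counts = Counter(
        sememe
        for sememes in WORD_SENSE_SEMEME[word].values()
        for sememe in set(sememes)
    )
    return {sememe for sememe, count in counts.items() if count > 1}


print(senses_of("Apple"))
print(shared_sememes("Apple"))
```

The sememe "fruit" occurs under two senses, which is one reason a fixed, context-independent weight per sememe is a coarse approximation.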
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Figure 2. Interpretation of weight changes of sense and sememe. A sense is a meaning that exists within the word and does not change with the context, but its semantic contribution to the word differs across contexts. For example, the word "Apple" has three different senses: "Apple Brand", "Apple (Fruit)" and "Apple Tree". In the context "I am going to the ~ store now.", the semantics of the word "Apple" tend toward the sense "Apple Brand". The contribution of each sememe to the sense should likewise differ.
To fill this gap, a "double attention" mechanism is proposed to capture the inequality of sememes, so that the meanings of words can be represented more accurately. Specifically, we derive a double attention-based word embedding (DAWE) model. This model uses senses as a bridge in the process of encoding sememes into words: a word is represented as a fusion of its different senses, and a sense is represented as the weighted sum of its sememes.
The original contributions of this work can be summarized as follows:
(1) The proposed "double attention" mechanism captures the weight changes of the different sememes of a sense with context, as well as the weight changes of the different senses within the word with context, so that the obtained word vectors can be represented completely and accurately by sememes.
(2) Two specific word training models are derived by combining the DAWE word encoding model with context-aware semantic matching. The experimental results of both the word similarity task and the word analogy reasoning task on the standard datasets show that the proposed models outperform previous models.
(3) The proposed DAWE model is a general framework for encoding sememes into words and can be integrated with other existing word embedding models to provide more methods for word embedding.

Notation and Definition
The symbolic conventions used below are as follows: $W$, $S$ and $X$ represent the word set, sense set and sememe set, respectively. For each word $w \in W$, there are multiple senses $s_i^{(w)} \in S^{(w)}$, where $S^{(w)}$ represents the sense set corresponding to the word $w$; each sense $s_i^{(w)}$ corresponds to several different sememes $x_j^{(s_i)} \in X_i^{(w)}$, where $X_i^{(w)}$ represents the sememe set of the $i$th sense corresponding to the word $w$, and $C(w)$ represents the context word set corresponding to the word $w$. We use the bold forms $\mathbf{w}/\mathbf{s}/\mathbf{x} \in \mathbb{R}^D$ corresponding to $w/s/x$ to represent the vectors of word/sense/sememe, where $D$ is the dimension of those vectors.

Definition 1. Word Embedding.
As shown in Figure 3, for the text corpus $C$, word embedding maps each word $w \in W$ to a continuous low-dimensional space $\mathbb{R}^D$, while ensuring that the final embeddings (vectors) can represent the semantic relevance between words in the original text corpus $C$.

Definition 2. Encoding Words with Sememes.
This is the process of using sememes as a semantic supplement to encode words. In this process, word embedding can be simplified as an encoding from sememes to words, which means that word vectors can be obtained by encoding the corresponding sememes of words:

$$\mathbf{w} = f_{X \to w}\big(X^{(w)}, \theta_{X \to w}\big) \tag{1}$$

where $\theta_{X \to w}$ represents the parameters used when encoding the sememe set $X^{(w)}$ into its corresponding word $w$. $f_{X \to w}(X^{(w)}, \theta_{X \to w})$ can be a simple encoding function, such as the sum operation $f_{X \to w}(X^{(w)}, \theta_{X \to w}) = \sum_i \mathbf{x}_i^{(w)}$, or a neural network, such as $f_{X \to w}(X^{(w)}, \theta_{X \to w}) = \sigma(W \cdot X^{(w)} + b)$, where $\sigma$ denotes the activation function, $W$ is the weight matrix and $b$ is the bias.

Definition 3. Encoding Words with Sememes through Senses.
As shown in Figure 1, a word may consist of many different senses, each of which is described by several sememes. Therefore, this "word-sense-sememe" structure allows us to achieve the encoding process from sememes to words using senses as a semantic bridge; that is, the encoding of the word $w$ is represented as a mapping function of all its corresponding senses $S^{(w)}$. The formalization is as follows:

$$\mathbf{w} = f_{S \to w}\big(S^{(w)}, \theta_{S \to w}\big) \tag{2}$$

and each sense $s_i^{(w)} \in S^{(w)}$ is encoded by all its corresponding sememes $X_i^{(w)}$:

$$\mathbf{s}_i^{(w)} = f_{X \to s}\big(X_i^{(w)}, \theta_{X \to s}\big) \tag{3}$$

where $\theta_{S \to w}$ and $\theta_{X \to s}$ denote the trainable parameters. The objective of this study is to find the $f_{S \to w}$ function and the $f_{X \to s}$ function in Equation (2) and Equation (3), while taking full advantage of the structure of the sememes.
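The two definitions above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the sum encoder stands in for both encoding functions, and the dense encoder pools its inputs by summation before applying a sigmoid, which is only one possible reading of σ(W·X + b).

```python
import numpy as np


def sum_encoder(vectors):
    """Parameter-free encoder: element-wise sum of the input vectors."""
    return np.sum(vectors, axis=0)


def dense_encoder(vectors, W, b):
    """sigma(W·x + b) on the summed input; sum pooling is an assumption."""
    pooled = np.sum(vectors, axis=0)
    return 1.0 / (1.0 + np.exp(-(W @ pooled + b)))


def encode_via_senses(sememes_per_sense, f_x_to_s, f_s_to_w):
    """Definition 3: encode sememes into senses, then senses into the word."""
    sense_vectors = np.stack([f_x_to_s(x) for x in sememes_per_sense])
    return f_s_to_w(sense_vectors)


# Toy word with two senses; each row is one sememe vector (D = 2).
sememes_per_sense = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[2.0, 2.0]]),
]
word_vec = encode_via_senses(sememes_per_sense, sum_encoder, sum_encoder)
print(word_vec)
```

Keeping the two encoders as separate arguments mirrors the split between the sememe-to-sense and sense-to-word functions, so either one can be swapped for an attention-based encoder without touching the other.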

Related Works
In this section, we introduce the works related to this study, including classical word embedding models and word embedding models that introduce internal semantic information of words or external semantic information (images, knowledge bases, etc.). These works are illustrated in Figure 4.

Classical Word Embeddings
Word embedding aims to embed words into a continuous, low-dimensional, dense semantic space. Early models usually use an NLM (neural language model) to generate word vectors (word-level word embedding vectors). A typical representative is Word2Vec, which includes a CBOW model (continuous bag-of-words model) and a Skip-gram model (continuous skip-gram model), as shown in Figure 4. The key idea of Word2Vec is that words with similar text contexts (or words appearing in the same window of size k that slides through the text) should be close to each other in the semantic space, that is, their word vectors should be similar. As shown in Figure 3, the words "first", "second" and "third" are close to each other in the semantic space because they have the same context "This is the ~ sentence".
Skip-gram is a model that predicts the context words (surrounding words) given a target word (the center word). It intends to maximize the likelihood of the contexts $P(w_{t-k}, \ldots, w_{t+k} \mid w_t)$ over the n target positions in the corpus, where n is the size of the text corpus, that is, the number of words contained in the corpus. $P(w_{t-k}, \ldots, w_{t+k} \mid w_t)$ denotes the probability of the context $[w_{t-k}, \ldots, w_{t+k}]$ being predicted by the target word $w_t$; $[w_{t-k}, \ldots, w_{t+k}]$ is the set of the k words before and the k words after the current word $w_t$ in the text sequence, and k is the size of the context window. For example, for the text sequence "I twisted an apple off the tree," when the target word is "apple" and k = 2, then $[w_{t-k}, \ldots, w_{t+k}]$ = [twisted, an, off, tree]. Based on the assumption of context independence, the probability of predicting the context $[w_{t-k}, \ldots, w_{t+k}]$ by the target word $w_t$ can be converted to the product of the probabilities of predicting each word $w_c$ in the context (the co-occurrence probability of the target word and the context word): $P(w_{t-k}, \ldots, w_{t+k} \mid w_t) = \prod_{w_c \in C(w_t)} P(w_c \mid w_t)$. By introducing the negative sampling [12] method, the co-occurrence probability of each context word and the target word is expressed in terms of the sigmoid function $\sigma(\cdot)$ and the negative word set $NEG(w_t)$ for the target word $w_t$. The objective of negative sampling is to make the context word $w_c$ as close as possible to the target word $w_t$ in the semantic space and as far away as possible from each negative sample $\tilde{w}_t \in NEG(w_t)$. It aims to make the co-occurrence probability of $w_c$ and $w_t$, $\sigma(\mathbf{w}_c^{T} \cdot \mathbf{w}_t)$, greater than that of $w_c$ and $\tilde{w}_t$, $\sigma(\mathbf{w}_c^{T} \cdot \tilde{\mathbf{w}}_t)$. Although Word2Vec strikes a good balance between efficiency and quality, the representation of low-frequency words remains a challenge because sparse words do not receive adequate training.
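The context window and the negative-sampling score described above can be sketched as follows. This is a hedged illustration, not the original formulation: the function names are hypothetical, tokenization is plain whitespace splitting with no filtering, and the score uses the common log-sigmoid form of negative sampling, which may differ in detail from the equation used in the source.

```python
import numpy as np


def context_window(tokens, t, k):
    """Return up to k tokens before and up to k tokens after position t."""
    return tokens[max(0, t - k):t] + tokens[t + 1:t + 1 + k]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def neg_sampling_log_prob(w_c, w_t, negatives):
    """log sigma(w_c·w_t) + sum over negatives of log sigma(-w_c·w_neg).

    Assumed form: it rewards a high score for the true (context, target)
    pair and a low score for each (context, negative sample) pair.
    """
    positive = np.log(sigmoid(w_c @ w_t))
    negative = sum(np.log(sigmoid(-(w_c @ w_neg))) for w_neg in negatives)
    return positive + negative


tokens = "I twisted an apple off the tree".split()
window = context_window(tokens, tokens.index("apple"), k=2)
print(window)

# Toy 2-dimensional vectors: the negative sample points away from the context.
w_c = np.array([1.0, 0.0])
w_t = np.array([1.0, 0.0])
score = neg_sampling_log_prob(w_c, w_t, [np.array([-1.0, 0.0])])
print(score)
```

Training raises this score for observed pairs, which pulls co-occurring words together and pushes sampled negatives apart in the embedding space.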

Word Embeddings with Internal Semantic Information
In addition to word co-occurrence, the internal features of words have also been shown to contribute to word embedding. Related works can be roughly divided into three categories: models based on morphological information, models based on character information and models based on subword information. Examples of "morphological information", "character information" and "subword information" are illustrated in Figure 4, where morphological information mainly refers to the features from components (i.e., prefix, root and suffix) of the word. Bian, Gao and Liu [1] utilized morphological (prefix, root and suffix), syntactic and semantic knowledge to achieve high-quality word embeddings. Chen et al. [13] and Sun et al. [14] performed character-level embedding and obtained word embeddings by fusing character features and word features. Xu et al. [15] also used a character-level embedding, and the weight information of different characters was taken into account in the fusion process. Cao and Lu [16] combined the morphological information of the word with character-level information and captured the structure information of the context by adding subword information (character n-grams, roots/affixes and inflections). To better discover the laws of language for word embedding, Li et al. [17] tried to discover the relationship between morphology and semantics in language expression and summarized 68 implicit morphological relationships and 28 explicit semantic relationships.
In Chinese, however, the smallest units are not characters but strokes. On top of strokes, there are structures such as radicals and components. Shi et al. [18] and Yin et al. [19] added the features of the radicals of the characters inside target words to the CBOW model. The method of Yu et al. [20], regarded as a more refined version of those of Shi, Zhai, Yang, Xie and Liu [18] and Yin, Wang, Li, Li and Wang [19], captured not only the radical information but also other components inside the character. To better exploit the structural information inside the character, Cao et al. [21] proposed to use the set of stroke n-grams of characters to supplement the semantics of characters.
The methods mentioned above only use the semantic information of the word itself, moving from word-level embeddings to character-level or other more fine-grained embeddings. However, the semantic information obtained from the word itself is limited. Besides, these models depend on the formation and characteristics of a particular language, so they are difficult to generalize to other languages.

Word Embeddings with External Semantic Information
A lot of semantic information related to words is now emerging, such as images with text labels, as well as semantic knowledge bases including WordNet [22], BabelNet [23], ConceptNet [24] and HowNet [11]. These semantic data should help improve the accuracy of word vectors.
A large and growing body of literature has investigated incorporating external semantic information into word embedding. Liu et al. [25] proposed a character-level embedding model that attempts to capture the common structure between characters from visual features by using morphological images corresponding to characters. Wang, Zhang and Zong [26] proposed a word-level embedding model that uses images from the real world as a complement to text semantics, rather than directly replacing text semantic information with visual feature information. In terms of external semantic knowledge bases, Yang and Sun [9] used Tongyici Cilin [27], a Chinese semantic knowledge base built on synonym sets that classifies words according to their semantics, to make words with the same semantic classification close to each other. Mancini, Camacho-Collados, Iacobacci and Navigli [6] used BabelNet to annotate the different senses of words and then performed joint learning to get word and sense embeddings. Tissier, Gravier and Habrard [8] introduced the concepts of "strong pairs" and "weak pairs" from dictionary entries, so as to better distinguish the relative intensity of word pairs in the semantic space. Liu et al. [28] proposed a knowledge-enabled language representation model with knowledge graphs (KGs), in which KG triples are injected into the sentences as domain knowledge. Niu, Xie, Liu and Sun [7] proposed the sememe-encoded word representation learning (SE-WRL) model, which embeds words by encoding the sememes in the word-sense-sememe knowledge of HowNet. Since the word-sense-sememe structure is an intuitive form of organizing words, it is easily organized and interpretable [11] and has a wide range of potential uses.
Specifically, three SE-WRL models are mentioned: simple sememe aggregation model (SSA), sememe attention over context model (SAC) and sememe attention over target model (SAT).
(1) The SSA model simply represents the vector $\mathbf{w}$ of each word as the average of all its sememe vectors, as shown in Equation (6):

$$\mathbf{w} = \frac{1}{m} \sum_{\mathbf{x} \in X^{(w)}} \mathbf{x} \tag{6}$$

where m is the number of sememes of the word w.
(2) Based on the SSA model, Niu, Xie, Liu and Sun [7] developed a SAC model and a SAT model that can distinguish different word meanings. In their formulation, if $w_v$ represents the target word $w_t$ and $w_u$ represents the context $w_c$, the model is SAC; if $w_v$ represents the context $w_c$ and $w_u$ represents the target word $w_t$, the model is SAT.
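The SSA aggregation in item (1) amounts to an unweighted mean, as the short sketch below shows. The function name and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np


def ssa_word_vector(sememe_vectors):
    """Simple sememe aggregation: unweighted mean of the m sememe vectors."""
    sememe_vectors = np.asarray(sememe_vectors, dtype=float)
    return sememe_vectors.mean(axis=0)


# Three toy sememe vectors (m = 3, D = 2).
toy_sememes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(ssa_word_vector(toy_sememes))
```

Because every sememe receives the same weight 1/m regardless of context, this baseline cannot express that one sememe matters more than another in a given sentence.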
In this study, we used both sememes and the word-sense-sememe structure as external supplements to refine the process of word embedding. Different from SE-WRL, our model captures the weight changes of different sememes under the same sense over different contexts, while using sememes to encode words.

Methodology
The proposed double attention-based word embedding (DAWE) model is derived from SE-WRL, where "double attention" refers to sense-level attention and sememe-level attention. The model assumes that the meaning of a word in a sentence is composed of senses with different weights, and that each sense is composed of different sememes with different weights. In addition, the study assumes that a better way to disambiguate the senses of words in different contexts is to carefully design the process of constituting senses from sememes. The DAWE model, introduced in Section 4.1, is a general framework for encoding sememes into words. The double attention over context model (DAC), introduced in Section 4.2, and the double attention over target model (DAT), introduced in Section 4.3, are two specific word training models obtained by integrating the DAWE model through context-aware semantic matching.

Double Attention-Based Word Embedding Model
To encode the semantics of sememes into words through the "word-sense-sememe" structure, a DAWE model is developed, as shown in Figure 5.
In the DAWE model, a "double attention" architecture is adopted: (1) Sense-level attention captures how the weights of senses change with context. A word may have different meanings in different contexts, but those meanings are not isolated. We argue that the meaning of a word in a specific context should be a fusion of different senses. As the context changes, the fusion weights of the senses change accordingly. (2) Sememe-level attention captures how the weights of sememes change with context. In the SE-WRL model, the sememes that constitute a sense are all weighted equally. In fact, when a sense presents different meanings, the weights of the sememes under that sense should differ.
As shown in Figure 5, DAWE is a word embedding model based on the word-sense-sememe structure, as well as a word sense disambiguation (WSD) model. In DAWE, sememes constitute the different senses of a word, and the different senses are then recombined into word meanings that are relevant to the textual context. The purpose of word embedding is to keep the semantic relevance of words while embedding them into a unified semantic space. However, conventional word embedding suffers from semantic confusion because it represents all the meanings of a word in a single vector. To remedy this deficiency, the different meanings of words need to be modeled separately. Research suggests that better decomposition of word meanings combined with context leads to better representations of word meanings. WSD aims to distinguish the different senses of words in different contexts, and its methods can be roughly divided into unsupervised methods and knowledge-based methods. DAWE uses a knowledge-based approach: it disambiguates the different senses of words in context by using weighted sememes to represent the senses of a word. As a knowledge-based word embedding model, DAWE has the same objective as conventional knowledge-based approaches, which is to place words with the same semantics close to each other and words with different semantics away from each other [29].
Depending on which word the attention is applied to, the DAWE model can be extended to the double attention over context model (DAC) and the double attention over target model (DAT). Figures 6 and 7 illustrate the relationships and differences between the two models.

Double Attention over Context Model
As shown in Figure 6, DAC consists of two parts, an encoding part and a training part, which correspond to the DAWE encoding framework and the Skip-gram training framework, respectively.
For each context word $w_c \in C(w_t)$ ($C(w_t) = [w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}]$, where k is the size of the context window), the word embedding $\mathbf{w}_c$ is the attention-weighted sum of its sense embeddings $\hat{\mathbf{s}}_i^{(w_c)}$ (Equation (8)), where $\mathrm{att}(s_i^{(w_c)}, w_t)$ denotes the weight of the $i$th sense of the context word $w_c$, calculated with the target word $w_t$ as attention (Equation (9)). Each sense embedding $\hat{\mathbf{s}}_i^{(w_c)}$ is in turn the attention-weighted sum of the sememe embeddings of that sense (Equation (10)). Similar to Equation (9), $\mathrm{att}(x_j^{(s_i)}, w_t)$ denotes the weight of the $j$th sememe in the $i$th sense of the context word $w_c$, calculated with the target word $w_t$ as attention (Equation (11)). DAWE is a two-layer encoding framework. The first layer is sense encoding, which corresponds to Equations (10) and (11). In the first layer, the sememe embeddings are used as input, and the sense embeddings are obtained through sememe-level attention. The second layer is word encoding, corresponding to Equations (8) and (9). In the second layer, the sense embeddings obtained by Equation (10) are used as input, and the word embeddings are obtained through sense-level attention. In DAC, the target word $w_t$ is used to guide the generation of the word vectors of context words. Under this attention mechanism, the more relevant the sememe vectors and sense vectors of the context word are to the target word vector, the higher the weights of the corresponding sememes and senses. This is similar to the idea in Word2Vec that more similar words are closer in the semantic space. In this way, the different senses of context words can be disambiguated as well.
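One plausible way to write out the two attention layers of DAC is given below. It is a sketch under an explicit assumption: the attention weights are taken to be dot-product softmax scores in the style of SE-WRL, which may differ from the exact form in the source. Here $x_j^{(s_i)}$ denotes the $j$th sememe of the $i$th sense of $w_c$, the summation indices $l$ and $m$ are introduced only for the normalizing sums, and the tags follow the equation numbers referenced in the text.

```latex
\begin{align}
\mathbf{w}_c &= \sum_{i} \mathrm{att}\big(s_i^{(w_c)}, w_t\big)\, \hat{\mathbf{s}}_i^{(w_c)} \tag{8} \\
\mathrm{att}\big(s_i^{(w_c)}, w_t\big) &=
  \frac{\exp\big(\mathbf{w}_t \cdot \hat{\mathbf{s}}_i^{(w_c)}\big)}
       {\sum_{l} \exp\big(\mathbf{w}_t \cdot \hat{\mathbf{s}}_l^{(w_c)}\big)} \tag{9} \\
\hat{\mathbf{s}}_i^{(w_c)} &= \sum_{j} \mathrm{att}\big(x_j^{(s_i)}, w_t\big)\, \mathbf{x}_j^{(s_i)} \tag{10} \\
\mathrm{att}\big(x_j^{(s_i)}, w_t\big) &=
  \frac{\exp\big(\mathbf{w}_t \cdot \mathbf{x}_j^{(s_i)}\big)}
       {\sum_{m} \exp\big(\mathbf{w}_t \cdot \mathbf{x}_m^{(s_i)}\big)} \tag{11}
\end{align}
```

Under this reading, both layers use the same guide vector $\mathbf{w}_t$, so a sememe or sense that aligns with the target word receives a larger share of the weight at its own level.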

Double Attention over Target Model
DAT is a variant of DAC, and word vectors are also encoded by the DAWE model and trained by the Skip-gram model. In contrast to DAC, DAT takes context embedding as attention to guide the generation of the word vector of the target word. The model structure of DAT is shown in Figure 7.
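For each target word $w_t \in W$, we have:
$$\mathbf{w}_t = \sum_{i=1}^{|S^{(w_t)}|} \mathrm{att}\big(s_i^{(w_t)}, w_{context}\big) \cdot \mathbf{s}_i^{(w_t)}, \quad (12)$$
where $\mathrm{att}(s_i^{(w_t)}, w_{context})$ denotes that the context word set $C(w_t)$ is used as attention to calculate the weight of the $i$th sense of $w_t$, as follows:
$$\mathrm{att}\big(s_i^{(w_t)}, w_{context}\big) = \frac{\exp\big(\mathbf{w}_{context} \cdot \hat{\mathbf{s}}_i^{(w_t)}\big)}{\sum_{n=1}^{|S^{(w_t)}|} \exp\big(\mathbf{w}_{context} \cdot \hat{\mathbf{s}}_n^{(w_t)}\big)}. \quad (13)$$
The calculation of $\hat{\mathbf{s}}_i^{(w_t)}$ is similar to that in DAC (Equations (10) and (11)): it is the weighted sum of all sememe embeddings in the sememe set $X_i^{(w_t)}$, with the context used as attention. Here $w_{context}$ denotes the context $C(w_t)$, and its vector is obtained as the average of all context word vectors in the context window. It is formalized by the following:
$$\mathbf{w}_{context} = \frac{1}{2k} \sum_{c=t-k,\, c \neq t}^{t+k} \mathbf{w}_c, \quad (14)$$
where $k$ is the size of the context window. Because DAT uses the whole context as attention, its attention signal is richer in contextual semantics than the single target word used in DAC; hence, it should be more conducive to the selection of sememes and senses.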
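The context vector that DAT uses as attention is a plain average over the window, which can be sketched in a few lines. This is an illustration under assumptions rather than the paper's code: the function name `context_vector` is hypothetical, and near a sentence boundary the sketch divides by the number of context words actually available, a detail the text does not specify.

```python
def context_vector(embeddings, t, k):
    """Average the embeddings of the words within k positions of index t,
    excluding position t itself."""
    lo, hi = max(0, t - k), min(len(embeddings), t + k + 1)
    window = [embeddings[c] for c in range(lo, hi) if c != t]
    if not window:
        raise ValueError("the target word has no context words")
    dim = len(window[0])
    return [sum(vec[d] for vec in window) / len(window) for d in range(dim)]


# Toy sentence of five word vectors; the target word sits at index 2.
sentence = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
print(context_vector(sentence, t=2, k=1))
```

Away from sentence boundaries the window holds exactly 2k words, so the sketch agrees with the 1/(2k) average over the context window.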

Optimization
This section takes DAT as an example to illustrate the training process of the proposed model. As shown in Figure 8, in DAT's pre-processing phase, each word in the vocabulary is annotated according to "word-sense-sememe" knowledge, which establishes the association among sememes, senses and words. Then, in the DAWE framework, the target word $w_t$ is encoded through the "double-attention" mechanism. In DAT, the context (Equation (14)) is used to guide the encoding of the target word (see Section 4.3 for details). The optimization objective is the same as that of the classical Skip-gram (Equation (4)); however, the parameters to be optimized include not only word embeddings but also sense embeddings and sememe embeddings:
$$\mathbf{w}_{t \pm i} := \mathbf{w}_{t \pm i} + \alpha \cdot \Delta \mathbf{w}_{t \pm i}, \quad S^{(w_t)} := S^{(w_t)} + \alpha \cdot \Delta S^{(w_t)}, \quad X^{(w_t)} := X^{(w_t)} + \alpha \cdot \Delta X^{(w_t)}, \quad i = 1, 2, \ldots, k,$$
where $\alpha$ denotes the learning rate; $k$ is the size of the context window (in Figure 8, $k = 2$); $S^{(w_t)}$ denotes the set of sense vectors corresponding to $w_t$; and $X^{(w_t)}$ denotes the set of sememe vectors corresponding to $w_t$.
The optimization process of DAC is similar to that of DAT and is not elaborated upon in this paper.

Experiments and Results
Our experiments were conducted on a Chinese word embedding task. Model performances were examined with two tasks: the word similarity task and the word analogy task. In this section, we first introduce the experimental datasets, including the training set and the evaluation set in the two evaluation tasks. Next, we introduce the experimental settings, including the selection of baselines and the setting of parameters. Finally, we present the metrics and results of the two evaluation tasks.

Datasets
For training, we used Clean-SogouT1 [7], a text corpus annotated with HowNet. Each word in the vocabulary of the Clean-SogouT1 dataset is annotated in the following form: (word $w$, number of senses $s$, number of sememes of the first sense, sememe set of the first sense $X_1^{(w)}$, ..., number of sememes of the $s$th sense, sememe set of the $s$th sense $X_s^{(w)}$). The example of "Apple" in Figure 1 can be represented as ("Apple", 3, 5, ("computer", "PatternValue", "able", "bring", "SpeBrand (specific brand)"), 1, ("fruit"), 3, ("fruit", "reproduce", "tree")). The basic statistics of the dataset are shown in Table 1. Table 1 shows that more than 60% of the words in the dataset have more than two senses, which suggests that dynamic sense disambiguation is necessary for improving word embedding models. Following Niu, Xie, Liu and Sun [7], we removed words with a frequency below 50 from the vocabulary. For evaluation, we chose the Chinese word similarity (CWS) dataset and the Chinese word analogy (CWA) dataset provided by Niu, Xie, Liu and Sun [7] to evaluate the performance of the models in the word similarity task and the word analogy reasoning task. The CWS datasets Wordsim-240 and Wordsim-297 contain 240 and 297 word pairs, respectively, and each word pair has a corresponding similarity score, e.g., "consumer, customer, 8.4". Each entry in the CWA dataset is composed of four words "$w_1$, $w_2$, $w_3$, $w_4$". The form of word analogy is $\mathbf{w}_2 - \mathbf{w}_1 \approx \mathbf{w}_4 - \mathbf{w}_3$, such as the classic example $\mathbf{w}_{king} - \mathbf{w}_{man} \approx \mathbf{w}_{queen} - \mathbf{w}_{woman}$. The bold form $\mathbf{w} \in \mathbb{R}^D$ denotes the embedding of the word $w$. The statistics of the CWA dataset used in this study are shown in Table 2.
Table 2. Chinese word analogy dataset, which contains three analogy types: capitals of countries (Capital), e.g., $\mathbf{w}_{London} - \mathbf{w}_{England} \approx \mathbf{w}_{Beijing} - \mathbf{w}_{China}$; cities in states (City), e.g., $\mathbf{w}_{Jacksonville} - \mathbf{w}_{Florida} \approx \mathbf{w}_{Francisco} - \mathbf{w}_{California}$; and family relationships (Relationship), e.g., $\mathbf{w}_{Father} - \mathbf{w}_{Mother} \approx \mathbf{w}_{Son} - \mathbf{w}_{Daughter}$.
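To make the annotation form concrete, the sketch below unpacks a flat record of the shape (word, number of senses, sememe count, sememe set, ...) into a word and its list of sememe sets. The in-memory tuple layout mirrors the notation used in the text; the actual on-disk format of Clean-SogouT1 is not described here, so the layout and the function name `parse_annotation` are assumptions made for illustration.

```python
def parse_annotation(record):
    """Split a flat annotation record into (word, list of sememe tuples)."""
    word, num_senses = record[0], record[1]
    rest = record[2:]
    if len(rest) != 2 * num_senses:
        raise ValueError("expected one (count, sememe set) pair per sense")
    senses = []
    for i in range(num_senses):
        count, sememes = rest[2 * i], tuple(rest[2 * i + 1])
        if count != len(sememes):
            raise ValueError("sememe count does not match the sememe set")
        senses.append(sememes)
    return word, senses


# The "Apple" example from the text, written with explicit one-element tuples.
APPLE = (
    "Apple", 3,
    5, ("computer", "PatternValue", "able", "bring", "SpeBrand"),
    1, ("fruit",),
    3, ("fruit", "reproduce", "tree"),
)

word, senses = parse_annotation(APPLE)
for index, sememes in enumerate(senses, start=1):
    print(word, "sense", index, sememes)
```

Keeping the explicit counts in the record allows a consistency check while parsing, which is why the sketch raises an error on a mismatch instead of silently trusting the sememe sets.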

Experimental Settings
In the experiments, we chose three comparison models: Skip-gram (the basic training framework of our models), CBOW (the other model in Word2Vec, for comparison with Skip-gram) and GloVe [30] (which, unlike Skip-gram's computation over a local context window, obtains word embeddings by global matrix factorization). We also chose as baselines the SSA model (which encodes words with sememes without an attention mechanism), SAC (for comparison with DAC) and SAT (for comparison with DAT), all proposed in Niu, Xie, Liu and Sun [7].
Following Niu, Xie, Liu and Sun [7], the vector dimensions of word embeddings, sense embeddings and sememe embeddings were set to 200; the size of the context window was set to 8; the initial learning rate was 0.025; and the number of negative samples was set to 25 in the negative sampling method. For the SAT and DAT, we set the context embedding window size to 2.
Our DAWE models were implemented based on the code of the SE-WRL model (https://github.com/thunlp/SE-WRL). The benchmark models and our models were trained on the same machine.

Word Similarity
In this section, we examine the quality of word embeddings through the performance of the proposed models in word similarity tasks. In the evaluation, we used the cosine of the vectors of two words as their similarity score to obtain the similarity ranking of all word pairs in the benchmark datasets (Wordsim-240 and Wordsim-297). We then calculated the Spearman correlation coefficient between the similarity ranking obtained by a model and the similarity ranking given in the benchmark dataset. The higher the Spearman correlation coefficient, the better the model performs in the word similarity task. Table 3 shows the evaluation results on the word similarity tasks, from which we observe the following: (1) On both the Wordsim-240 and Wordsim-297 datasets, our models performed better than the baseline models. This shows that distinguishing the sememes within the senses helps to represent the different senses of a word more accurately and in greater depth. (2) DAT performed better than DAC. DAT takes the context embedding as attention to guide the semantic generation of the target words, so it can better capture contextual semantic information. Therefore, when the words are sufficiently trained, the results of DAT are better than those of DAC.
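The evaluation procedure above (cosine similarity per word pair, then the Spearman correlation with the human scores) can be sketched in pure Python. The embeddings and all gold scores other than "consumer, customer, 8.4" are toy values invented for illustration, and the helper names are hypothetical; tied values receive average ranks, which is a common convention rather than something stated in the text.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def evaluate_similarity(embeddings, scored_pairs):
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in scored_pairs]
    gold = [score for _, _, score in scored_pairs]
    return spearman(model, gold)


# Toy embeddings and toy gold scores (only the first score appears in the text).
toy_embeddings = {
    "consumer": [1.0, 0.1],
    "customer": [0.9, 0.2],
    "apple": [0.0, 1.0],
    "fruit": [0.3, 1.0],
    "car": [1.0, -1.0],
}
toy_pairs = [
    ("consumer", "customer", 8.4),
    ("apple", "fruit", 7.0),
    ("consumer", "car", 2.0),
]
print(evaluate_similarity(toy_embeddings, toy_pairs))
```

Because the Spearman coefficient depends only on ranks, a model is rewarded for ordering the word pairs correctly even if its cosine values are on a different scale from the human scores.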

Word Analogy
In this section, we examine the quality of word embeddings through the performance of the models in the word analogy reasoning task. In the Chinese word analogy reasoning task, each analogy sample consists of two word pairs $(w_1, w_2)$ and $(w_3, w_4)$ that satisfy $\mathbf{w}_2 - \mathbf{w}_1 \approx \mathbf{w}_4 - \mathbf{w}_3$, i.e., $\mathbf{w}_2 - \mathbf{w}_1 + \mathbf{w}_3 \approx \mathbf{w}_4$. Therefore, the score of a candidate word $w$ is calculated by replacing $w_4$ with $w$ in the following formula:
$$\mathrm{score}(w) = \cos\big(\mathbf{w}_2 - \mathbf{w}_1 + \mathbf{w}_3, \mathbf{w}\big).$$
After ranking all candidate words by this score, we took the top-ranked word as the prediction and evaluated the model by accuracy and mean rank. The higher the accuracy and the lower the mean rank, the better the model.
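The ranking procedure can be sketched as follows. The sketch assumes cosine similarity as the scoring function and excludes the three query words from the candidate set, which is common practice but is not stated in the text; the function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy_rank(embeddings, w1, w2, w3, w4):
    """Rank of the gold word w4 among candidates scored against w2 - w1 + w3."""
    query = [
        b - a + c
        for a, b, c in zip(embeddings[w1], embeddings[w2], embeddings[w3])
    ]
    candidates = [w for w in embeddings if w not in (w1, w2, w3)]
    ranked = sorted(
        candidates, key=lambda w: cosine(query, embeddings[w]), reverse=True
    )
    return ranked.index(w4) + 1


def evaluate_analogy(embeddings, quadruples):
    ranks = [analogy_rank(embeddings, *q) for q in quadruples]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    mean_rank = sum(ranks) / len(ranks)
    return accuracy, mean_rank


# Toy embeddings in which king - man + woman coincides with queen.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}
print(evaluate_analogy(toy, [("man", "king", "woman", "queen")]))
```

Accuracy counts only exact top-1 hits, while mean rank also credits a model that places the correct word near the top, which is why the two metrics are reported together.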
The results of the word analogy reasoning task are shown in Table 4. From the evaluation results, we can conclude the following: (1) In the word analogy reasoning task, our models are significantly better than the previous models. The accuracy of DAC is 2% higher than that of SAC, and the accuracy of DAT is 3% higher than that of SAT. In terms of mean rank, DAC improves by more than 4% over SAC, and DAT improves by more than 3% over SAT. These results show that, by distinguishing the sememes inside each sense, both DAC and DAT describe senses more accurately.
(2) Our models perform well in the Capital class, which is a collection of capital-country pairs from around the world. Many capital names have distinct meanings in different contexts; for example, the word "Washington" may be the name of a capital city, a state, a university, a hotel, or a person. During training, the proposed model can dynamically adjust the weights of both senses and sememes through the "double-attention" mechanism, which makes it better at embedding such words.
(3) Although the performances of our models are not the best in the classes of City and Relationship, our models are more robust in the overall performance of accuracy and mean rank.
(4) DAWE models improve significantly in the word analogy reasoning task but only slightly in the word similarity task. Since Skip-gram trains word vectors based on context, the more similar the contexts of two words are, the closer their vectors are in the semantic space. Thus, with sufficient training, there is no significant difference among these Skip-gram-based models in the word similarity task. By adding sememe-level attention, our models can express the sense of a word more accurately, which leads to better results in the word analogy reasoning task, where higher semantic accuracy is required.

Case Study
To illustrate the dynamic semantic generation of our models, we selected some specific cases for analysis. Tables 5-7 list the relative weights of the different senses of the word "Apple" (Sense 1: "Apple Brand" (Sememes: "computer", "PatternValue", "able", "bring" and "SpeBrand (specific brand)"); Sense 2: "Apple" (Sememe: "fruit"); Sense 3: "Apple Tree" (Sememes: "fruit", "reproduce" and "tree")) in specific contexts, together with the relative weights of the sememes within each sense. These weights are calculated by the sense-level attention and sememe-level attention of DAT. Tables 5-7 show the following: (1) Our model correctly distinguishes the different senses of "Apple" in different contexts. This shows the power of our model in word sense disambiguation (WSD).
(2) In the sense "Apple Brand", the sememe "SpeBrand" receives a large weight. This is consistent with our description in the Introduction: in the process of sense construction, the weights should be distributed unequally among sememes.
(3) When the meaning of "Apple" changes with the context (that is, when the sentence changes), neither the sense items of the word nor the sememe items within each sense change; what changes with the context are the weights of those senses and sememes. The model in this paper is trained on the large text corpus Clean-SogouT1, and the learned word vectors and model parameters are consistent with the feature distribution of the entire corpus. As a result, the sense representations inside a word tend to be stable; that is, the weight distribution of the sememes inside each sense is also stable (sememes compose senses, and senses in turn compose words).
In the above cases, we took the word "Apple" (Apple/Apple Brand/Apple Tree) as an example to examine the weight distribution of sememes and to verify the effectiveness of our model in WSD. Next, we take the word "Notebook" (Notebook/Laptop Computer) as an example to study how a specific context affects the weight distribution of sememes.
Table 8. The impact of context on the weight of sememes. The values in this table represent relative weights, and the relative weight of the sense "Notebook" is 0. Word: "Notebook" (Sense 1: "Notebook" (Sememe: "account"); Sense 2: "Laptop Computer" (Sememes: "bring", "PatternValue", "computer" and "able")).
As shown in Table 8, when the meaning of the word "Notebook" in the context tends toward the sense "Laptop Computer", we observe the following:
(1) When the word "Notebook" and the word "Computer" appear together, that is, as "Notebook Computer", the weight of the sememe "computer" is the lowest among all the sememes of the sense "Laptop Computer". This can be explained as follows: when "Notebook Computer" appears as a phrase, "Notebook" mainly serves as a modifier of "Computer", indicating that the computer is light, thin and portable. Therefore, "Notebook" itself carries less of the "computer" meaning.
(2) When "Notebook" appears alone, the sememe "computer" has more weight than when "Notebook Computer" appears as a phrase. In this case, "Notebook" no longer acts as a modifier of "Computer" but as a separate entity, so it has to carry the "computer" semantics itself.
(3) When "Notebook" appears alone, the weight of the sense "Laptop Computer" is generally lower than when "Notebook Computer" appears as a phrase. The reason is that, in the phrase, the context carries more semantics pointing to the sense "Laptop Computer", so this sense is generally weighted more heavily. (Note the second example in Table 8, where the weight of "Laptop Computer" reaches 4.59. This is because "HP" is a computer brand, which makes the weight of "Laptop Computer" higher than in the other cases where "Notebook" appears alone.)
The results in Table 8 also show the effectiveness of our model. In the DAWE model, the representation of a word depends on its senses, so the weight distribution of the sememes cannot directly determine the final representation of the word. When "Notebook" appears alone, the weight of the sense "Laptop Computer" is lower than when "Notebook Computer" appears as a phrase, even though the weight of the sememe "computer" is higher.
In summary, during the training of word embeddings, the semantics of a word are affected not only by the semantic accumulation over the corpus but also by the context in the current sliding window.
(1) The impact of semantic accumulation is mainly reflected in the gradual stabilization of the representation of the inherent senses within the word. As shown in the examples in Tables 5-7, the weight distribution of the sememes used to represent the internal senses of the word "Apple" is consistent in different contexts.
(2) The current context is mainly used to select the appropriate senses and can affect the weight distribution of sememes. As shown in Tables 5-7, although the representation of the inherent senses inside the word "Apple" tends to be stable, the weights of these senses vary across sentences. In addition, the senses of the target word "Notebook" and the weights of their sememes in Table 8 illustrate this point.

Integrating DAWE with Other Models
DAWE is a general encoding framework. In this paper, we integrate and train DAWE based on the Skip-gram model. DAWE can be extended to other models by the following steps:
(1) Data pre-processing. Annotate the text corpus using "word-sense-sememe" knowledge.
(2) Determine the encoding "target" of DAWE. For example, in DAC, the "target" is the context, while, in DAT, the "target" is the target word.
(3) Determine the "object" of "double-attention". For example, in DAC, the "object" is the target word, while, in DAT, the "object" is the context.
(4) Forward propagation (encoding). According to the "target" and "object" determined in Steps 2 and 3, in the DAWE framework, "object" is used to guide the encoding of the "target" through the "double-attention" mechanism.
(5) Back propagation. Model parameters (word embeddings, sense embeddings and sememe embeddings) are updated according to the model optimization objective.
Among them, Steps 1 and 5 are relatively easy to implement. The core step is Step 4, which depends on Steps 2 and 3. Therefore, in expanding DAWE, the parts that are difficult and require careful design are Steps 2 and 3. Once Steps 2 and 3 are established, DAWE can be easily extended to other models.

Conclusion and Future Work
In this paper, we propose a double attention-based word embedding (DAWE) model that encodes sememes into words through a "double attention" mechanism, so that a word is described at the level of its senses and their sememes. The proposed DAWE model is a general framework that can be applied to other existing word embedding training frameworks, such as Word2Vec. In this paper, we extended the DAWE model into two specific training models, whose validity was demonstrated in the experiments on the word similarity task and the word analogy task. To further explore the proposed models, we also analyzed several cases. The results show that word semantics are affected not only by the global semantic accumulation but also by the context of a word. The experimental results further show that DAWE models can effectively capture the semantic changes of words through dynamic semantic generation, which means that our model is also effective in word sense disambiguation. These findings suggest that the performance of NLP tasks could be improved if words are processed from a more fine-grained perspective.
A limitation of this study is that the DAWE model requires more training time than baseline models because it increases training parameters as it integrates the "double attention" mechanism. Additionally, the values of hyperparameters in this study are set following previous research; further experimental investigations are needed to estimate the impacts of those hyperparameters.