A Gloss Composition and Context Clustering Based Distributed Word Sense Representation Model

Abstract: In recent years, there has been an increasing interest in learning distributed representations of word senses. Traditional context-clustering-based models usually require careful tuning of model parameters and typically perform worse on infrequent word senses. This paper presents a novel approach which addresses these limitations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural network. The initialized word sense embeddings are then used by a context-clustering-based model to generate distributed representations of word senses. Our learned representations outperform the publicly available embeddings on half of the metrics in the word similarity task and on 6 out of 13 subtasks in the analogical reasoning task, and give the best overall accuracy in the word sense effect classification task, which shows the effectiveness of our proposed distributed representation learning model.


Introduction
The representation of knowledge has become a focal area in natural language processing, and many different methods for conceptual information representation have been proposed. These range from extreme localist theories, in which each concept is represented by a single unit (symbolic or distributional representation), to extreme distributed theories, in which a concept corresponds to a pattern of activity over a large part of the cortex (distributed representation) [1].
Distributed representation of word senses refers to representing word senses in a low-dimensional space so as to convey the semantic information contained in the words. Usually, a word sense is represented as a dense, real-valued vector. To this end, most existing approaches adopt a cluster-based paradigm, which produces different sense vectors for each polysemous or homonymous word by clustering the contexts of the target word. However, this paradigm usually has two limitations: (1) The performance of these approaches is sensitive to the clustering algorithm, which requires setting the number of senses for each word. For example, Neelakantan et al. [4] proposed two clustering-based models: the Multi-Sense Skip-Gram (MSSG) model and the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) model. MSSG assumes each word has the same number of senses k (e.g., k = 3). However, the number of senses in WordNet [21] varies from 1 (e.g., "ben") to 75 (e.g., "break"). As such, fixing the number of senses for all words results in poor representations. NP-MSSG requires tuning a hyperparameter λ which controls the creation of cluster centroids during training, and different λs need to be tuned for different datasets. (2) The initial values of the sense representations are critical for most statistical clustering-based approaches. However, previous approaches usually adopted random initialization [4] or the mean of the candidate word vectors in a gloss [3]. As a result, they may not produce optimal clustering results for word senses.
Focusing on the aforementioned two problems, this paper proposes to learn distributed representations of word senses through WordNet gloss composition and context clustering. The basic idea is that a word sense is represented as a synonym set (synset) in WordNet. In this way, instead of assigning a fixed sense number to each word as in previous methods, different words are assigned different numbers of senses based on their corresponding entries in WordNet. Moreover, we notice that each synset has a textual definition (named gloss). Naturally, we use a convolutional neural network (CNN) to learn distributed representations of these glosses (a.k.a. sense vectors) through sentence composition. Then, we modify the MSSG algorithm for context clustering by initializing the sense vectors with the representations learned by our CNN-based sentence composition model. We expect that word sense vectors initialized in this way lead to more precise representations of word senses generated from context clustering.
The obtained word sense representations are evaluated on three tasks: a word similarity task on two datasets, an analogical reasoning task provided by WordRep [22], and a word sense effect classification task. Specifically, our learned representations outperform publicly available embeddings on half of the metrics in the word similarity task and in 6 out of 13 subtasks in the analogical reasoning task, and achieve state-of-the-art results in the sense effect classification task. The results show that our approach attains an overall better performance on learning distributed representations of word senses.
The main contributions of this work are as follows: (1) we propose to use a sentence composition model to capture word senses from a knowledge base, e.g., WordNet; (2) while previous approaches to sense vector clustering often adopted random initialization, we propose to initialize the sense vectors and the number of sense clusters with the word sense knowledge learned from WordNet for better clustering results; (3) we further verify our learned distributed word sense representations on three different tasks: word similarity measurement, analogical reasoning, and word sense effect classification. Our approach achieves comparable results to existing distributed word sense representation learning models on the first two tasks and gives state-of-the-art results on the last task.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents our proposed model. Section 4 describes the evaluation results and presents discussions. Section 5 concludes the paper and outlines future research directions.

Distributed Representation for Word Sense
Most distributed word sense representation approaches are derived from the distributed single-prototype word representation approach first proposed by Rumelhart [7], which has since become a successful paradigm, especially for neural probabilistic language models [8][9][10][11][12].
Reisinger and Mooney [23] proposed a multi-prototype vector space model that uses the context clusters of each word to generate a distinct prototype vector per cluster. Huang et al. [2] followed this idea, but introduced a probabilistic neural language model to generate distributed representations instead of distributional representations. Their approach first represents each word occurrence by a vector averaged over its context window, comprising the five words before, the five words after, and the word itself. The spherical k-means algorithm is then used to cluster such context representations. Each word occurrence in the corpus is re-labeled by its associated cluster and is used to train the distributed representation for that cluster.
Tian et al. [5] integrated a probabilistic multi-prototype model into the continuous skip-gram model, using the Expectation Maximization (EM) algorithm to learn multiple embeddings for polysemous words. Motivated by the intuition that the same word in a source language with different senses is supposed to have different translations in a foreign language, Guo et al. [6] proposed a distributed word sense representation approach that clusters translated words from bilingual parallel data. Neelakantan et al. [4] presented an extension to the skip-gram model that learns word sense representations by non-parametrically estimating the number of senses per word type. Chen et al. [3] used glosses in WordNet as clues for learning distributed representations of word senses, but they simply represent each word sense by the vector averaged over all the words occurring in the corresponding gloss, which may not produce a good word sense representation.

Distributed Sentence Composition Model
Distributed sentence composition refers to representing a sentence in a low-dimensional space so as to convey the semantic information contained in the sentence. Various types of distributed sentence representation models have been proposed recently. Socher et al. [16] proposed a recursive neural tensor network (RNTN) for semantic compositionality over a sentiment treebank, which pushed the binary classification accuracy on the Stanford Sentiment Treebank from 80% up to 85.4%. Kalchbrenner et al. [17] proposed a dynamic convolutional neural network (DCNN) that handles input sentences of varying length and induces a feature graph over the sentence capable of explicitly capturing short- and long-range relations; it improved the above accuracy from 85.4% to 86.8%. Kim [18] presented two simple CNN models, with little hyperparameter tuning, trained on pre-trained word vectors for sentence-level classification tasks, further improving the above accuracy to 88.1%. Le and Mikolov [19] proposed an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs and documents. The recurrent neural network (RNN) may also be viewed as a sentence model: the layer computed at the last word represents the sentence [11,17].

Our Approach
In this study, we propose a distributed word sense representation learning approach that incorporates WordNet gloss composition and context word clustering over large-scale raw text. The system framework, with three main components, is shown in Figure 1. The first component, the Word Embedding Construction Module, takes a large collection of raw text and trains a word embedding model. The word embeddings output by this model are then used by the Sentence Composition Model, which takes glosses in WordNet as positive training data and glosses with part of their words randomly replaced as negative training data, and constructs the corresponding word sense vectors with a one-dimensional CNN. The learned sense vectors are finally fed into a variant of the previously proposed Multi-Sense Skip-Gram (MSSG) model, which generates distributed representations of word senses from a text corpus. We name our approach CNN-VMSSG.

Word Embedding Construction
Mikolov et al. [12] introduced the Continuous Bag-Of-Words (CBOW) model and the continuous skip-gram model (Skip-gram) to learn vector representations capturing a large number of syntactic and semantic word relationships from unstructured text data. The training objective of the CBOW model is to use the surrounding words of a target word in a sentence or a document to predict the target word. Given a sequence of training words w_1, w_2, w_3, ..., w_T, the training objective is to maximize the average log probability:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le i \le c,\, i \ne 0} \log p(w_t \mid w_{t+i}), \]

where c is the size of the training context, w_t is the center word, and log p(w_t | w_{t+i}) is the conditional log probability of the center word w_t given the surrounding word w_{t+i}. The prediction task is performed via softmax. Hierarchical softmax [10,24], which uses a binary tree representation of the output layer with the words as leaves, is used to reduce the computational complexity.
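For illustration, a CBOW model with hierarchical softmax can be trained with the gensim library; this is a minimal sketch with toy sentences (assuming gensim ≥ 4), not the exact configuration used in this paper:

```python
from gensim.models import Word2Vec

# Any iterable of token lists would do here, e.g. a tokenized Wikipedia dump.
sentences = [["the", "bank", "approved", "the", "loan"],
             ["we", "sat", "on", "the", "river", "bank"]]

# sg=0 selects CBOW; hs=1 enables the hierarchical softmax mentioned above.
model = Word2Vec(sentences, vector_size=300, window=5,
                 sg=0, hs=1, negative=0, min_count=1)
print(model.wv["bank"].shape)  # (300,)
```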

Training Sense Vectors from WordNet Glosses Using CNN
Most of the glosses in WordNet are single sentences. We learn the distributed representation of each gloss sentence as the representation of the corresponding synset.

Training Objective
The training objective of this component is similar to that proposed in [2,9], where the goal is to maximize the conditional probability of observing the actual target word given the input context. A common practice is to replace the target word with a random word to create negative training examples. Our goal is to model glosses in WordNet; here, we replace several words in the gloss sentence at a time to construct a negative sample.
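A sketch of this corruption step, assuming tokenized glosses and a vocabulary list (the helper name is ours, not the authors' code):

```python
import random

def make_negative(gloss_tokens, vocab, lam=0.5):
    """Replace a fraction lam of the gloss words with random vocabulary
    words to build a negative training sample (illustrative helper)."""
    corrupted = list(gloss_tokens)
    n_replace = max(1, int(lam * len(corrupted)))
    for i in random.sample(range(len(corrupted)), n_replace):
        corrupted[i] = random.choice(vocab)
    return corrupted
```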
Given a gloss sentence s as a positive training sample, we randomly replace some of its words (controlled by a parameter λ) to construct a negative training sample s′. We compute scores f(s) and f(s′), where f(·) is the scoring function representing the whole CNN architecture without the softmax layer. We expect f(s) to approximate 1, f(s′) to approximate 0, and f(s) to be larger than f(s′) by a margin of 1 for all sentences in the positive training set P. The training objective is therefore to minimize the ranking loss:

\[ \sum_{s \in P} \max\big(0,\; 1 - f(s) + f(s')\big). \]
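In a framework such as PyTorch, this objective can be written as a hinge-style margin loss; f_pos and f_neg are assumed to be the CNN scores of a batch of glosses and of their corrupted counterparts:

```python
import torch

def ranking_loss(f_pos, f_neg, margin=1.0):
    # Penalize cases where f(s) does not exceed f(s') by the margin.
    return torch.clamp(margin - f_pos + f_neg, min=0.0).mean()
```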

Neural Network Architecture
The CNN architecture, shown in Figure 2, is used to model the glosses in WordNet. It follows the architecture proposed by [18] (the source code provided by the authors of that paper is available at https://github.com/yoonkim/CNN_sentence), which is a slight variant of the architecture proposed by [9].
It takes a gloss matrix s as input, where each column corresponds either to the distributed representation v_{w_i} ∈ R^d of a word w_i in the sentence or to a padding vector v_{z_i} (a d-dimensional zero vector). Here v_{w_i} is a d-dimensional pre-trained word vector constructed from a large corpus by the CBOW model, m is the size of a filter window, and n is the maximum length of sentences in the training set. There are two types of convolution operation: narrow and wide [17]. We use the wide type, in which the padding vectors v_{z_1} to v_{z_{m−1}} at the beginning and the end of the sentence ensure that the convolution operation can be applied from the beginning of the sentence to its end. The idea behind the one-dimensional convolution is to take the dot product of a filter vector w with each m-gram in the sentence s to obtain another sequence c. In the convolutional layer, the one-dimensional convolution is taken between a filter vector w ∈ R^{md} and a vector s_{i:i+m−1} ∈ R^{md} of m concatenated columns of s, where s_{i:i+m−1} refers to columns i to i + m − 1 of s. The i-th feature c_i ∈ R of a feature map F_j ∈ R^{n+m−1} is generated as

\[ c_i = f\big(\mathbf{w} \cdot \mathbf{s}_{i:i+m-1} + b\big), \]

where b ∈ R is a bias term and f is a point-wise non-linear function such as the hyperbolic tangent. A feature map F_j ∈ R^{n+m−1} is then defined as

\[ F_j = [c_1, c_2, \ldots, c_{n+m-1}]. \]

In order to make c cover different words in the negative sample corresponding to a positive sample, we randomly replace half of the words in a positive training sample to construct the negative training sample (λ = 0.5). In the pooling layer, a max-over-time pooling operation [25], which forces the network to capture the most useful local feature produced by each convolutional filter, is applied over F_j: the maximum value F̂_j = max(F_j) is the feature corresponding to a particular filter w. The F̂_j values of the k filters are concatenated to form a vector F ∈ R^k. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels. The training error propagates back to fine-tune the parameters (w, b) and the input word vectors. The vector generated in the penultimate layer of the CNN architecture is regarded as the sense vector, which captures, to some degree, the semantic content of the input gloss.
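The following PyTorch sketch mirrors the architecture described above under our own assumptions (filter widths 3-5 with 100 feature maps each, wide convolution realized via padding, and a scalar score head); it is illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlossCNN(nn.Module):
    def __init__(self, vocab_size, d=300, widths=(3, 4, 5), k=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d, padding_idx=0)
        # padding = w - 1 realizes the "wide" convolution described above.
        self.convs = nn.ModuleList(
            nn.Conv1d(d, k, w, padding=w - 1) for w in widths)
        self.out = nn.Linear(k * len(widths), 1)

    def forward(self, tokens):                 # tokens: (batch, n)
        x = self.emb(tokens).transpose(1, 2)   # (batch, d, n)
        # Max-over-time pooling per filter, then concatenate all filters.
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        sense_vec = torch.cat(feats, dim=1)    # penultimate layer = sense vector
        return self.out(sense_vec).squeeze(1), sense_vec
```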

Context Clustering and VMSSG Model
The sense vectors trained from WordNet glosses using the CNN do not perform well on some word sense evaluation tasks, partly because the semantic meaning of the actual context in which a word occurs may not be similar to the gloss of the synset to which the word belongs. To deal with this problem, we propose to incorporate the sense vectors learned from WordNet glosses by CNN composition as prior knowledge into a context clustering model such as the MSSG model proposed by Neelakantan et al. [4]. The MSSG model extends the skip-gram model to learn multi-prototype word embeddings by clustering the word embeddings of the context words around each word. In this model, for each word w, the corresponding word embedding and sense vectors v^s_{w,k} (k = 1, 2, ..., K) are initialized randomly, and the sense number K of each word is a fixed parameter of the training algorithm.

We improve the MSSG model in two ways. Firstly, instead of setting a fixed number of senses K for each word as in the original MSSG, we set the sense number of each word based on its actual number of senses in WordNet. By doing so, semantically rich words are assigned a larger number of senses and K becomes deterministic. Secondly, instead of randomly initializing the sense vectors as in the MSSG algorithm, we initialize them with the sense vectors trained from WordNet glosses by CNN composition. In addition, we use the learned CBOW word embeddings to initialize the global word vectors v_w. We name this model the variant MSSG (VMSSG) model.
The training algorithm of the VMSSG model is shown as Algorithm 1, where D is a text corpus, V is the vocabulary of D, |V| is the vocabulary size, M is the size of the context window, v_w is the word embedding of w, s_{w,k} is the k-th context cluster of word w, and μ_{w,k} is the centroid of cluster k for word w. The function NoisySamples(C) randomly replaces context words with noisy words from V.
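A minimal sketch of the cluster-assignment step (line 6 of Algorithm 1), assuming the context is summarized by the average of its word embeddings and similarity is cosine:

```python
import numpy as np

def nearest_cluster(context_vecs, centroids):
    """Assign an averaged context to the closest sense cluster."""
    c = np.mean(context_vecs, axis=0)
    sims = centroids @ c / (np.linalg.norm(centroids, axis=1)
                            * np.linalg.norm(c) + 1e-8)
    return int(np.argmax(sims))
```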

Experiments
In this section, we first give a qualitative analysis by comparing the nearest neighbors of our embeddings with those of other embeddings. Next, we evaluate the performance of our word sense representations on three tasks: the word similarity task, the analogical reasoning task, and the word sense effect classification task.

Experimental Setup
In all experiments, we use the publicly available word vectors trained on 100 billion words from Google News. The vectors have a dimensionality of 300 and were trained using the CBOW model. For training sense vectors with the VMSSG model, we use a snapshot of Wikipedia from April 2010 [26], previously used in [2,4]. WordNet 3.1 is used for training the sentence composition model.
For training the CNN, we use rectified linear units, filter windows of 3, 4 and 5 with 100 feature maps each, an AdaDelta decay parameter of 0.95, and a dropout rate of 0.5. For training VMSSG, we use MSSG-KMeans as the clustering algorithm and CBOW for learning sense vectors. We set the size of both word vectors and sense vectors to 300. For the other parameters, we use the default settings of MSSG.

Qualitative Evaluations
In Tables 1-3, we list the nearest neighbors of each sense of three example words, generated from two single-prototype word vector models (C&W and Skip-gram) and five multi-prototype word representation models. C&W refers to the word embedding published in [9]. Skip-gram refers to the language model proposed in [12]. Huang et al. refers to the multi-prototype word embedding proposed in [2]. Unified-WSR refers to the word sense embedding proposed in [3]. Both MSSG and NP-MSSG were proposed in [4], where MSSG assumes each word has the same number of senses and NP-MSSG extends MSSG by automatically inferring the number of senses from data. CNN-VMSSG is our model. The column heading N of the tables shows the number of sense vectors generated by each model; it is 1 for single-prototype word vector models. The nearest neighbor is selected by comparing the cosine similarity between each sense vector and all the sense vectors of other words in the vocabulary.
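This lookup is a plain cosine search over all sense vectors; a minimal sketch, assuming a matrix of sense vectors with a parallel list of labels:

```python
import numpy as np

def nearest_neighbor(query, sense_matrix, labels):
    # sense_matrix: (num_senses, d); labels: word/sense name per row.
    sims = sense_matrix @ query / (np.linalg.norm(sense_matrix, axis=1)
                                   * np.linalg.norm(query) + 1e-8)
    return labels[int(np.argmax(sims))]
```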
It is observed that single-prototype word vector models such as C&W and Skip-gram are not able to learn different sense representations for each word, while Huang et al. and MSSG always generate a fixed number of sense vectors. NP-MSSG finds fewer sense vectors than the actual number of word senses. Our model can find a diverse range of word senses, for example, "edge" and "IMF" for bank, "MVP" and "circle" for star, and "seed" and "Spedding" for plant. This shows that our model learns more distinct sense representations.

Word Similarity Task
In this task, we evaluate our learned word sense embeddings on two datasets: the WordSim-353 (WS353) dataset [27] and the Stanford Contextual Word Similarities (SCWS) dataset [2]. The WS353 dataset consists of 353 pairs of nouns. Each pair is associated with 13 to 16 human judgments on similarity and relatedness on a scale from 0 to 10. For example, (car, flight) received an average score of 4.94, while (car, automobile) received an average score of 8.94.
The SCWS dataset contains 2003 pairs of words together with their sentential contexts. It consists of 1328 noun-noun pairs, 399 verb-verb pairs, 140 verb-noun pairs, 97 adjective-adjective pairs, 30 noun-adjective pairs, and 9 verb-adjective pairs; 241 pairs are same-word pairs. Each pair is associated with 10 human judgments of similarity on a scale from 0 to 10.
We use the same metrics as in [4] to measure the similarity between two words w and w′ given their respective contexts c and c′. The avgSim metric computes the average similarity over all pairs of sense vectors of the two words, ignoring information from the context:

\[ avgSim(w, w') = \frac{1}{K_1 K_2} \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} d\big(v_{s_i}(w), v_{s_j}(w')\big), \]

where d(·, ·) is a standard distributional similarity measure (here, cosine similarity is adopted), v_{s_i}(w) is the i-th sense vector of w, and K_1, K_2 are the numbers of word senses of w and w′, respectively. The avgSimC metric weights each similarity term in avgSim by the likelihood of the word context appearing in its respective cluster:

\[ avgSimC(w, w') = \frac{1}{K_1 K_2} \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} d_{c,w,i}\; d_{c',w',j}\; d\big(v_{s_i}(w), v_{s_j}(w')\big), \]

where d_{c,w,i} = d(v_c, π_i(w)) is the likelihood of context c belonging to cluster π_i(w). The globalSim metric compares the global vectors of the two words, ignoring their multiple senses:

\[ globalSim(w, w') = d\big(v_g(w), v_g(w')\big). \]

The localSim metric chooses the most similar sense in context to estimate the similarity of the word pair:

\[ localSim(w, w') = d\big(v_{s_k}(w), v_{s_{k'}}(w')\big), \]

where k = argmax_i d_{c,w,i} and k′ = argmax_j d_{c′,w′,j}. We report the Spearman's correlation ρ × 100 between a model's similarity scores and the human judgements in the datasets.
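For reference, avgSim and the Spearman evaluation can be computed along these lines (a sketch; sense vectors are assumed to be numpy arrays):

```python
import numpy as np
from scipy.stats import spearmanr

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def avg_sim(senses_w, senses_v):
    # Mean cosine similarity over all sense-vector pairs of the two words.
    return float(np.mean([cos(a, b) for a in senses_w for b in senses_v]))

# Given model_scores and human_scores per word pair:
# rho = spearmanr(model_scores, human_scores).correlation * 100
```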
Table 4 shows the performance achieved on the WordSim-353 dataset. In this table, the avgSimC and localSim metrics are not given since no context is provided in this dataset. Random-VMSSG refers to MSSG trained with the sense number of each word taken from WordNet. Average-VMSSG refers to MSSG trained with the average vector of the candidate word vectors of WordNet glosses, as previously proposed by Chen et al. [3]. In Average-VMSSG, for each sense sense_i of a word w, a candidate set from gloss(sense_i) is defined as

\[ cand(sense_i) = \{\, u \mid u \in gloss(sense_i),\; POS(u) \in CW,\; \cos(v_w, v_u) > \sigma \,\}, \]

where POS(u) is the part-of-speech tag of the word u and CW is the set of possible part-of-speech tags in WordNet: noun, verb, adjective and adverb. v_w and v_u are the word vectors of w and u, respectively.
Following Chen et al. [3], we set the similarity threshold σ = 0 in this experiment. The average of the word vectors in cand(sense_i) is used to initialize the sense vectors in the VMSSG model. We also present the results obtained using word distributional representations, including Pruned TF-IDF [23], Tiered TF-IDF [28] and Explicit Semantic Analysis (ESA) [29]. Pruned TF-IDF and Tiered TF-IDF combine the vector-space model and context clustering. TF-IDF represents words in a word-word matrix capturing co-occurrence counts in all context windows. Pruned TF-IDF prunes the low-value TF-IDF features, while Tiered TF-IDF uses tiered clustering that leverages feature exchangeability to allocate data features between a clustering model and shared components. ESA explicitly represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia.
It is observed that our model achieves the best performance on the globalSim metric, which indicates that using pre-trained word vectors and initialized word sense vectors helps improve the quality of the global word vectors generated by CNN-VMSSG. Unified-WSR has the same number of senses as our model but gives a much worse result on avgSim, being 23.0% lower. Random-VMSSG also takes the number of senses of each word from WordNet, as our model does, but still performs worse on both avgSim and globalSim. CNN-VMSSG is 2.9% higher than Average-VMSSG on the avgSim metric (64.4 vs. 61.5) and 0.6% higher on the globalSim metric (69.8 vs. 69.2). This indicates that the WordNet gloss composition approach proposed in our model performs better than using the average of the candidate word vectors of WordNet glosses.
Our model gives lower avgSim results compared to MSSG and NP-MSSG. One possible reason is that we set the number of context clusters for each word to be the same as the number of its senses in WordNet. However, not all senses appear in our training corpus, which could lead to fragmented context clustering results. One possible way to alleviate this problem is to perform post-processing that merges clusters with small inter-cluster differences or removes sense clusters which are under-represented in our data. We leave this as future work.
We report the Spearman's correlation ρ × 100 between a model's similarity scores and the human judgements of the SCWS dataset in Table 5. It is observed that our model achieves the best performance on the globalSim and localSim metrics, being 0.8% higher on globalSim and 1.3% higher on localSim compared to the second best performing model, NP-MSSG. Compared with Average-VMSSG, our model achieves better performance on all four metrics, indicating that the CNN composition approach proposed in our model is beneficial for this task. Our approach, however, performs worse on avgSim and avgSimC, possibly for the same reason explained for the WS353 task.

Analogical Reasoning Task
The analogical reasoning task introduced by [12] consists of questions of the form "a is to b as c is to _", where (a, b) and (c, _) are two word pairs. The goal is to find the word d* in the vocabulary V whose representation vector is the closest to v_b − v_a + v_c, i.e.,

\[ d^* = \arg\max_{d \in V} \cos\big(v_d,\; v_b - v_a + v_c\big). \]

A question is judged as correctly answered only if d* is exactly the answer word in the evaluation set [22].
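The vocabulary search can be sketched as follows, assuming a row-per-word embedding matrix and cosine similarity:

```python
import numpy as np

def solve_analogy(v_a, v_b, v_c, vocab_matrix, exclude_ids):
    # Return the row index whose vector is closest to v_b - v_a + v_c.
    target = v_b - v_a + v_c
    sims = vocab_matrix @ target / (np.linalg.norm(vocab_matrix, axis=1)
                                    * np.linalg.norm(target) + 1e-8)
    sims[list(exclude_ids)] = -np.inf   # never return a, b, or c themselves
    return int(np.argmax(sims))
```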
WordRep is a benchmark collection for research on learning distributed word representations, which expands Mikolov et al.'s analogical reasoning questions. It includes two kinds of evaluation sets: an enlarged evaluation set where the word pairs are collected from Wikipedia, and a WordNet evaluation set where the word pairs are collected from WordNet. Considering the size of the evaluation sets, in our experiments we use one evaluation set in WordRep, the WordNet collection, which consists of 13 subtasks. Let the sense numbers of a, b, c be N_a, N_b, N_c, and the size of the vocabulary be V_size; the number of candidate vectors for a word sense model is then N_a × N_b × N_c × V_size, while it is only V_size for single-prototype word vector models. This shows that the evaluation task is computationally more demanding for word-sense-based models than for single-prototype models.
Table 6 shows the precision results on the 13 subtasks. The Word Pair column gives the number of word pairs in each subtask (N_wp). The results of C&W were obtained using the 50-dimensional word embeddings made publicly available by Turian et al. [30]. The CBOW results were previously reported in [22]. The Weighted Average is computed as

\[ \text{Weighted Average} = \frac{\sum_{i=1}^{13} N_{wp}^{(i)}\, p_i}{\sum_{i=1}^{13} N_{wp}^{(i)}}, \]

where p_i is the precision on the i-th subtask. It can be observed that our learned representations outperform all the other four embeddings on the weighted average. Among the 13 subtasks, our model outperforms the others by a good margin in six subtasks: Attribute, Causes, Entails, IsA, MadeOf and RelatedTo. Overall, our model gives superior performance compared to all the other models.

Word Sense Effect Classification
In this section, we evaluate our approach on word sense effect classification as proposed by Choi and Wiebe [31]. In this task, each sense is annotated with one of three classes: +effect, −effect and Null. In total, 258 +effect senses, 487 −effect senses, and 440 Null senses are manually annotated as a word sense lexicon with the help of FrameNet [32]. Half of each set is used as training data, and the other half is used for evaluation.
Choi and Wiebe [31] proposed three word sense effect classification methods, namely a supervised learning (onlySL) method, a graph-based learning (onlyGraph) method and a hybrid method. In the onlySL method, a gloss classifier (SVM) is trained with word features and sentiment features for WordNet glosses; the method also uses WordNet relations and WordNet similarity information as training features. In the onlyGraph method, a graph is constructed using WordNet relations, such as hypernymy, troponymy and grouping, and a graph-based semi-supervised learning method is used to perform label propagation. In the hybrid method, the results generated by onlySL and onlyGraph are combined by a set of rules, e.g., if the labels assigned by both models are +effect (or −effect), the final label is +effect (or −effect).
For evaluation metrics, we use precision (P × 100), recall (R × 100) and F1 score (F1 × 100) for each class, together with an overall accuracy. For the classifier, we use support vector machines (LibSVM [33]) with default parameters in the Weka software tool [34].
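As a rough analogue of this setup, using scikit-learn's SVC in place of the Weka LibSVM wrapper actually used, and stand-in random features in place of the real sense vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in data: one 300-d sense vector per annotated synset.
X_train = np.random.randn(20, 300)
y_train = np.random.choice(["+effect", "-effect", "Null"], size=20)
X_test = np.random.randn(5, 300)

clf = SVC()          # default parameters, mirroring the paper's default setup
clf.fit(X_train, y_train)
print(clf.predict(X_test))
```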
Table 7 shows the overall accuracy results and Table 8 gives a more detailed analysis of the results obtained using different models on each word sense effect class. In both tables, the first three models were proposed by Choi and Wiebe [31]. For distributed sense representation models, we only compare our approach with Unified-WSR, because other word sense models, such as Huang et al., MSSG and NP-MSSG, do not provide a one-to-one correspondence between a word sense and a WordNet synset and therefore cannot be used for this task. Random-VMSSG also provides a one-to-one mapping between a word sense and a WordNet synset, so it can be used in this task. It is observed that CNN-VMSSG achieves the best overall accuracy of 66.1%, outperforming Unified-WSR and Hybrid by 1.7% and 4.3%, respectively. For the individual effect classes, the Hybrid model achieves the best F1 performance of 66.7% on the +effect class, but the worst F1 performance of 53.8% on the Null class. The Unified-WSR model gives the best F1 performance of 75.0% on the −effect class, but a much worse F1 performance of 48.1% on the +effect class. Our model achieves the best F1 result of 65.8% on the Null class and comes second on both the +effect and −effect classes. Overall, our model gives superior performance in F1, outperforming Unified-WSR and Hybrid by 5.2% and 4.1%, respectively, which indicates the robustness and effectiveness of our proposed model in improving the quality of sense-level word vectors. Compared with Average-VMSSG, which uses the average of the candidate word vectors of WordNet glosses, CNN-VMSSG achieves a 2.7% higher overall accuracy (66.1 vs. 63.4), a 4.3% relative improvement. This further verifies the superiority of our proposed WordNet gloss composition approach.

Conclusions
This paper presents a method incorporating WordNet gloss composition and a context-clustering-based model for learning distributed representations of word senses. By initializing sense vectors with the embeddings learned by sentence composition from WordNet glosses, the context clustering method is able to generate better distributed representations of word senses. The obtained word sense representations achieve state-of-the-art results on half of the metrics in the word similarity task and in six subtasks of the analogical reasoning task, as well as state-of-the-art performance on word sense effect classification. This shows the effectiveness of our proposed learning algorithm for generating distributed word sense representations. Considering the coverage of word senses in the training data, in future work we plan to filter out sense vectors that are under-represented in the training corpus. We will also further investigate the feasibility of applying the multi-prototype word sense embeddings to a wide range of NLP tasks.

Figure 1. Framework of our approach.

Figure 2. A one-dimensional convolutional neural network (CNN) with two filter widths for an example gloss sentence.

Algorithm 1. Training algorithm of the VMSSG model.
1: Input: D, d, K_1, ..., K_w, ..., K_{|V|}, M.
2: Initialize: ∀w ∈ V, k ∈ {1, ..., K_w}: initialize v_w to a pre-trained word vector, v^s_{w,k} to the pre-trained sense vector for word w with sense k, and μ_{w,k} to a vector of random real values in (−1, 1)^d.
3: for each w_i in D do
4:    r ← random number ∈ [1, M]
5:    C ← {w_{i−r}, ..., w_{i−1}, w_{i+1}, ..., w_{i+r}}
6:    Assign C to context cluster k.
9:    Update μ_{w,k}.
10:   Gradient update on v^s_{w,k} and the vectors v_w in C, C′.
12: end for
13: Output: the learned word vectors v_w and sense vectors v^s_{w,k}.

Table 1. Nearest neighbors of each sense of the word bank.

Table 2. Nearest neighbors of each sense of the word star.

Table 3. Nearest neighbors of each sense of the word plant.

Table 4. Experimental results on the WordSim-353 (WS353) task. We compute the avgSim value using the published word vectors for Unified-WSR 200 d. Other results of the compared models, e.g., Huang et al., Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) and MSSG, were reported in [4]. 50 d, 200 d and 300 d refer to the dimensions of the vectors. The best results are highlighted in bold face.

Table 5. Experimental results on the Contextual Word Similarities (SCWS) task. We compute the evaluation results using the published word vectors for Unified-WSR 200 d. Other results of the compared models, e.g., Huang et al., NP-MSSG and MSSG, were reported in [4].

Table 6. Experimental results on the analogical reasoning task. The numbers are the precision p × 100.

Table 7. Experimental results on the word sense effect classification task.