1. Introduction
In computational semantic analysis, word sense disambiguation (WSD), i.e., finding the correct sense of a word in a given context, has long been a challenge in natural language understanding. Recently, it has mostly been studied with machine learning models typically using the supervised, unsupervised, and knowledge-based approaches [
1,
2,
3,
4]. The supervised approach [
5,
6] uses a sense annotated corpus to train the models and usually suffers from data sparseness. The unsupervised approach [
7,
8] uses raw text to find a word cluster with the same sense, discriminating it from others. However, these senses often do not entirely match those defined in a dictionary or a thesaurus; therefore, clustering such senses for explainable applications is not trivial. The knowledge-based approach [
9,
10] uses glossary information, usually from a dictionary or thesaurus, to match with the context of a target word. The glossary information is usually insufficient to cover all the contexts of a target word.
To date, supervised learning has become popular and is considered an effective approach, but it struggles with senses that do not appear in the training data. To alleviate this problem, combining the supervised approach with the knowledge-based approach has recently been studied. Advances in computing power enable WSD algorithms to exploit large dictionaries and corpora [
11]. Word embeddings and contextual word vectors from pre-trained neural network models are especially widely used [
12,
13,
14]. Huang et al. [
15] used the sense definition in a glossary to match the context of a target word by using the sentence pair comparison function of BERT [
16]. The context vector and the glossary vector are trained in a supervised way for fine-tuning of a generic BERT system. Kumar et al. [
17] extended this idea by incorporating thesaurus information into the sense definition and merging them into a continuous vector representation. However, this approach requires a large amount of memory to process all the senses of homographs and their related glossaries. Moreover, decoding requires multiple scans over the sense glossaries of each homograph.
The sense vocabulary compression (SVC) method tackles the unknown sense problem differently by compressing sense tags [
18]. It finds the optimal set of sense tags by decreasing the number of sense tags using sense relations, such as synonyms, hypernyms, and hyponyms in a thesaurus. The synonyms are simply compressed to one representative sense, known as the synset tag. The hypernym and hyponym sense tags are compressed by using the highest common ancestor tag not shared with the other senses in the homograph. 
Figure 1 shows an example of compressed sense vocabularies in hypernym hierarchical relations. There are two kinds of homographs in the figure: 
fund and 
bank. The dotted rectangles show the maximum subtrees not shared with the other senses in the homographs. Each root in the subtrees is a representative sense. For example, 
nature#1 is a representative sense of 
bank#7 and 
river#1; 
financial institution#1 is that of 
fund#1 and 
bank#4.
The compressed sense vocabulary yields engineering efficiency, because the model has fewer output parameters, and learning efficiency, because sparse or unknown sense tags are implicitly covered during the learning phase. Once the compressed sense vocabulary is obtained, no additional memory or multiple scans are required for decoding; therefore, it is extremely compact and fast.
However, for sense compression, the SVC method requires a high-quality thesaurus. Although there are thesauri for many languages, most are not rich enough or freely available [
19]. In this paper, we propose an alternative sense vocabulary compression method without using a thesaurus. Instead, the sense definitions in a dictionary become the embedding vectors and are clustered to find representative sense vocabularies, which are the compressed sense vocabularies. Our paper’s contributions are the following: (1) a novel method is proposed to compress sense vocabularies without using a thesaurus and (2) a practically effective method is proposed for sense vocabulary clustering, which is required to handle the vast number of senses with modern clustering algorithms.
  2. Sense Definition Clustering
Word sense definitions have been utilized for WSD tasks either as a surface form or a vector form [
9,
15,
17,
20]. In this paper, we also use the word sense definition in a dictionary to calculate the vectors for the senses, assuming that the sense is represented by the vector of the sense definition. To produce the sense definition vector (SDV), we used Universal Sentence Encoder [
21], which, at the time, was the best sentence vector generation program available. SDVs are clustered into groups of similar vectors. The similarity is measured using normalized Euclidean distance, which ranges from 0 to 1 as defined in the following:
where M is the maximum distance over all SDV pairs, and n is the dimension of the vectors x and y.
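A minimal sketch of this normalization, assuming the normalized distance is the plain Euclidean distance between two SDVs divided by M. The function names and the toy two-dimensional vectors are hypothetical and stand in for real encoder output; whether a similarity score is taken as this value or as its complement is not assumed here.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Plain Euclidean distance between two vectors of the same dimension n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalized_distances(vectors):
    """Map each index pair (i, j) to d(i, j) / M, which lies in [0, 1].

    M is the largest distance among all pairs, as in the text.
    """
    raw = {
        (i, j): euclidean(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }
    max_dist = max(raw.values())
    if max_dist == 0.0:  # all vectors identical: every distance is 0
        return {pair: 0.0 for pair in raw}
    return {pair: d / max_dist for pair, d in raw.items()}


# Toy stand-ins for sense definition vectors (hypothetical values).
sdvs = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
norm = normalized_distances(sdvs)
```

Because every distance is divided by the same M, the ordering of pairs is unchanged; only the scale is fixed so that a single threshold can be applied across partitions of different spread.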
A pair of vectors is clustered into a group under two conditions: 1. the similarity value is above a threshold, and 2. the new group does not include more than one sense in the same homograph set. This is accomplished by the sense definition clustering algorithm (SDC) as shown in Algorithm 1, which is a slight modification of the hierarchical agglomerative clustering (HAC) algorithm [
22].
      
| Algorithm 1 Sense Definition Clustering (SDC) |
| 1: Input: SDVs represented as w_i (1 ≤ i ≤ N) |
| 2: Output: clustered SDVs |
| 3: Place each w_i in a group, g_i, as a set with one element |
| 4: Let G = {g_i : 1 ≤ i ≤ N} |
| 5: Calculate the similarity matrix for each pair (g_j, g_k) where j ≠ k |
| 6: repeat |
| 7:     Find the most similar pair (g_j, g_k) with similarity above the threshold |
| 8:     Make a new group g_jk as the union of g_j and g_k, where j ≠ k, if and only if no pair of senses in the merged group is in the same homograph set |
| 9:     If g_jk equals null, then terminate |
| 10:     Let G = G − g_j − g_k + g_jk |
| 11:     Recalculate the similarity matrix |
| 12: until G contains a single group |
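Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation. It assumes that similarity is one minus the normalized Euclidean distance, that groups are compared with average linkage, and that the loop picks the most similar pair among those satisfying the homograph restriction. The function name `sdc`, the `lemmas` argument, and the toy vectors are all hypothetical.

```python
import math
from itertools import combinations


def sdc(vectors, lemmas, threshold):
    """Agglomerative clustering with a homograph restriction.

    vectors[i] is the definition vector of sense i, and lemmas[i] names the
    homograph set (the word) that sense i belongs to.
    """
    n = len(vectors)
    dist = {
        (i, j): math.dist(vectors[i], vectors[j])
        for i, j in combinations(range(n), 2)
    }
    max_dist = max(dist.values(), default=0.0) or 1.0  # avoid division by 0

    def sim(i, j):
        # Assumption: similarity = 1 - normalized Euclidean distance.
        return 1.0 - dist[(min(i, j), max(i, j))] / max_dist

    def group_sim(g, h):
        # Assumption: average linkage between two groups.
        return sum(sim(i, j) for i in g for j in h) / (len(g) * len(h))

    def compatible(g, h):
        # A merged group may hold at most one sense per homograph set.
        return not ({lemmas[i] for i in g} & {lemmas[j] for j in h})

    groups = [frozenset([i]) for i in range(n)]
    while len(groups) > 1:
        candidates = [
            (group_sim(g, h), g, h)
            for g, h in combinations(groups, 2)
            if compatible(g, h)
        ]
        candidates = [c for c in candidates if c[0] > threshold]
        if not candidates:
            break  # no admissible pair is left above the threshold
        _, g, h = max(candidates, key=lambda c: c[0])
        groups = [x for x in groups if x not in (g, h)] + [g | h]
    return groups


# Toy data: senses 0 and 1 belong to the same word, so they must stay apart
# even though they are the closest pair.
lemmas = ["bank", "bank", "fund", "river"]
vectors = [(0.0, 0.0), (0.2, 0.0), (0.5, 0.0), (10.0, 10.0)]
clusters = sdc(vectors, lemmas, threshold=0.9)
```

In the toy data the two `bank` senses are the nearest neighbours, yet the restriction keeps them in separate clusters; the second `bank` sense merges with `fund` instead, and the distant `river` sense stays alone.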
The time complexity of the SDC algorithm is O(N³), the same as HAC. Moreover, the number of SDVs is usually too large for an algorithm of this complexity; in the case of WordNet, there are 207 K. Therefore, for practical reasons, we divided the initial large set of SDVs into smaller partitions, so that the hierarchical agglomeration of SDC starts from efficiently sized partitions rather than from the set of singleton clusters [
23].
We first divide the data by parts of speech (noun, verb, adjective, adverb) and then utilize a flat clustering program, such as the affinity propagation program (AP) [
24] or the 
k-means program [22], for further partitioning. Affinity propagation is well suited to partitioning because it determines an appropriate number of clusters by itself. However, it fails on very large datasets such as the noun partition, because it needs considerable memory for its matrix calculations. In those cases, AP is run on a randomly chosen subset of the dataset to determine the number of clusters k, and the k-means program is then run on the original dataset with that k.
Once we have partitions, we apply the SDC algorithm to each partition to build compressed sense vocabulary groups within the partition. We then apply SDC again across partitions, to the newly formed groups, to find a higher common ancestor. The new clusters are given cluster numbers, which are used as compressed sense numbers. 
Figure 2a shows the overall flow of sense clustering. 
Figure 2b shows the clustering example in detail. In 
Figure 2b, the three red dots represent the different senses of a homograph. In the partitioning process at (2), more than one red dot can be included in a group, but in the clustering process at (3) and (4), the red dots cannot be located in the same group even though they are closer than others in Euclidean distance.
  5. Discussion
Training size effect: We tested Korean training datasets of three different sizes. As the size increased, the performance increased both in the uncompressed vocabulary (threshold 0.0) and in the compressed vocabulary. 
Table 5 shows the change in performance on the Small, Medium, and Large datasets: 95.7%, 96.0%, and 96.1% for the uncompressed vocabulary at threshold 0.0, and 97.2%, 97.3%, and 97.5% for the compressed vocabulary. This is because more training data reduces the amount of unseen data at test time. 
Table 6 shows that the percentage of untrained data decreases from 3.3% to 1.8% and 1.2% as the training data size increases. This confirms that more training data are also useful for the compressed vocabulary.
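The notion of untrained data can be illustrated with a small sketch. It assumes that a test instance counts as untrained when its sense label never occurs in the training data; the exact definition behind Table 6 may differ. The function names and the toy label lists are hypothetical, while the label mapping reuses the representative senses of the Figure 1 example.

```python
def untrained_ratio(train_labels, test_labels):
    """Share of test instances whose sense label never occurs in training."""
    seen = set(train_labels)
    unseen = sum(1 for label in test_labels if label not in seen)
    return unseen / len(test_labels)


def compress(labels, mapping):
    """Replace each sense label by its representative (cluster) label."""
    return [mapping.get(label, label) for label in labels]


# Toy corpora (hypothetical): bank#7 appears only in the test data.
train = ["bank#4", "fund#1", "river#1"]
test = ["bank#4", "bank#7", "fund#1", "river#1"]

# Representative senses taken from the Figure 1 example.
mapping = {
    "bank#7": "nature#1",
    "river#1": "nature#1",
    "bank#4": "financial institution#1",
    "fund#1": "financial institution#1",
}

raw_ratio = untrained_ratio(train, test)
compressed_ratio = untrained_ratio(
    compress(train, mapping), compress(test, mapping)
)
```

With the uncompressed labels one of the four test instances is untrained; after compression `bank#7` shares a label with `river#1`, which was seen in training, so the ratio drops to zero. The same mechanism is what allows both more training data and label compression to reduce the untrained share.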
Performance difference between English and Korean data: The best average F1 value on Korean data is much higher than on English data: 97.3% versus 69.1%. We conjecture that this difference is caused by the following:
1. The ambiguity of English is higher than that of Korean: it ranges from 6.9 to 10.9 in English and from 5.2 to 5.3 in Korean (shown in 
Table 3). This is because English senses are categorized more specifically than Korean, as shown in 
Table 2 (206 K versus 114 K senses). English sense prediction must therefore discriminate among finer labels than Korean, which makes performance gains harder to achieve.
2. The test corpora are configured differently. The English experiments used data from different domains, with SemCor for training and Senseval and SemEval for testing, whereas the Korean experiments used shuffled data from a single corpus (Sejong) with 3-fold cross-validation. This is supported by 
Table 6, which shows that the ratio of untrained data in the English test data is much higher than in the Korean test data: 16.3% versus 1.2% to 3.3%. This benefits the performance on Korean data.
3. For Korean BERT input, normalized words are used, whereas raw words are used for English input. English is an inflectional language with relatively few variations in word form, whereas Korean is agglutinative and has many variations in word endings [
34,
35]. Because these productive variations degrade the performance of a language model, some Korean language models are trained on preprocessed input words to avoid the degradation. For Korean BERT, we used the KorBERT-morp model, a Korean language model whose input words are normalized through morphological analysis and part-of-speech tagging. We conjecture that a normalized word with its part-of-speech tag benefits the performance on the Korean dataset, because it gives the BERT model more information than a raw word.
Performance increase ratio: The performance on English data increased by 6.2 percentage points, from 62.9% to 69.1%, whereas that on Korean data increased by 1.4 percentage points, from 95.9% to 97.3%. The improvement is thus much larger on English data. This is because the baseline performance for English is significantly lower than for Korean, which leaves more room for improvement than a high baseline does.
Compression level: Table 7 shows the comparison of (a) the model with a thesaurus and (b) the model without a thesaurus. The best value of (a) is higher than that of (b), i.e., 75.6% versus 69.1%. This means that a manually constructed thesaurus is better than our clustering method for sense label compression. Nonetheless, both models perform better than the baseline models that do not use compressed labels, and both reach their highest performance when compression is used with appropriate relations or threshold restrictions.
 Use of Korean thesaurus: The sense vocabulary compression method [
18] requires both a thesaurus and the corresponding sense-annotated corpus. In contrast, our method needs a sense-annotated corpus and the corresponding sense definitions, which can come from either a dictionary or a thesaurus. For Korean, most sense-annotated corpora are built on a dictionary. Examples are the Sejong sense-annotated corpus, based on the Standard Korean Language Dictionary, and the Modu sense-annotated corpus [
36] based on Urimalsaem dictionary [
37]. Korean thesauri that have been developed so far are KAIST’s CoreNet [
38], Pusan National University’s KorLex [
39], University of Ulsan’s UWordMap [
40], Postech’s WordNet translation [
41], and others. Among these, KorLex, UWordMap, and Postech’s WordNet use the same sense tags as the Sejong sense-annotated corpus. However, KorLex and Postech’s WordNet cover only part of the senses in the Standard Korean Language Dictionary: KorLex contains only the senses that exist in the English WordNet, and Postech’s WordNet is partially translated from the English WordNet. UWordMap came to include most of the sense vocabulary of the Standard Korean Language Dictionary only more recently, well after the release of the sense-annotated corpus [
42].
The Modu sense-annotated corpus has been recently developed based on the senses defined in the Urimalsaem dictionary, which contains more vocabulary and uses a finer sense number hierarchy than the Standard Korean Language Dictionary. A thesaurus corresponding to the Urimalsaem dictionary has not been developed yet. The Urimalsaem dictionary is being constructed and updated through online crowdsourcing, which would make updating a corresponding thesaurus, if one existed, non-trivial. Our sense clustering method, which uses the sense definitions in a dictionary, is very useful in this setting, where no thesaurus corresponding to the sense-annotated corpus exists. For future work, we will apply our method to the Modu sense-annotated corpus, for which a full version was not available at implementation time.
Further improvement of clustering: Our method could be further improved with a better clustering algorithm. For this, we should consider the following points: 1. Because we are dealing with a huge number of data points (146 K in the case of English nouns), the space and time complexity should be small enough for the algorithm to run within a reasonable time. 2. Clusters of appropriate size should be determined automatically, with clustering criteria that reflect not only similarity measures but also the homograph restriction, which prevents senses of the same homograph from falling into the same cluster.
Most clustering algorithms such as AP [
24], HAC [
22], GMM [
43], and DBSCAN [
44] have high computational complexity and could not process our large number of data points in our workstation-level computing environment (Intel i7 CPU and 48 GB of main memory). Although the k-means algorithm has relatively lower complexity than the other algorithms, the optimal number of clusters has to be determined automatically [
45,
46], which increases the complexity. What we have proposed in this paper is one practical solution; we leave the search for clustering algorithms with lower complexity for sense compression to future work.
Word embedding vectors retain multiple relations such as synonym, antonym, singular-plural, present-past tense, capital city-country, and so on [
47,
48]. The same holds for the SDVs used in this paper. As shown in 
Table 7a, the selective use of the relations increases the performance. Therefore, finding the useful relations from SDVs will improve the quality of clustering.