Arabic Gloss WSD Using BERT

Abstract: Word Sense Disambiguation (WSD) aims to predict the correct sense of a word given its context. This problem is of extreme importance in Arabic, as written words can be highly ambiguous; 43% of diacritized words have multiple interpretations and the percentage increases to 72% for non-diacritized words. Nevertheless, most Arabic written text does not have diacritical marks. Gloss-based WSD methods measure the semantic similarity or the overlap between the context of a target word that needs to be disambiguated and the dictionary definition of that word (gloss of the word). Arabic gloss WSD suffers from a lack of context-gloss datasets. In this paper, we present an Arabic gloss-based WSD technique. We utilize the celebrated Bidirectional Encoder Representation from Transformers (BERT) to build two models that can efficiently perform Arabic WSD. These models can be trained with few training samples since they utilize BERT models that were pretrained on a large Arabic corpus. Our experimental results show that our models outperform two of the most recent gloss-based WSDs when we test them against the same test data used to evaluate our model. Additionally, our model achieves an F1-score of 89% compared to the best-reported F1-score of 85% for knowledge-based Arabic WSD. Another contribution of this paper is introducing a context-gloss benchmark that may help to overcome the lack of a standardized benchmark for Arabic gloss-based WSD.


Introduction
Words in spoken and written languages can have multiple meanings and, thus, can be ambiguous. The intended meaning of a word (its sense) depends strongly on its context. Word Sense Disambiguation (WSD) is the task of predicting the correct sense of a word that has multiple senses given its context [1]. Many Computational Linguistics applications depend on the performance of WSD, such as Machine Translation, Text Summarization, Information Retrieval, and Question Answering [2]. The Arabic language is particularly challenging for WSD. It was shown in [3] that 43% of Arabic words that have diacritic marks are ambiguous. Moreover, that percentage increases to 72% when the diacritic marks are missing, which is the case for most written Arabic text [4]. This compares to 32% for the English language [5].
We note here three main challenges in Arabic WSD. First, most written text does not have diacritic marks. As a result, the level of ambiguity for each word can increase. For example, the word " " (meaning flag) and the word " " (meaning science) are both written as " ". Second, even when diacritic marks are present, many words can have several possible part-of-speech (POS) tags that lead to different meanings. For example, the word " " can have the POS tag of a verb (leading to the meaning of "showing something") or of another POS tag, leading to a different meaning. Third, Arabic is an agglutinative language that attaches affixes to the stems or roots of words to give new meanings [6]. For instance, the word " " (in the street) has a stem " " (a street) and two prefixes: " " (in) and " " (the definite article). This problem is addressed in the preprocessing phase of our work, which will be discussed shortly.

Bidirectional Encoder Representation from Transformers (BERT) [7] is a neural language model based on transformer encoders [8]. Recent Arabic Computational Linguistics studies show a large performance enhancement when pretrained BERT models are fine-tuned to perform many Arabic Natural Language Processing (NLP) tasks, for example, Question Answering (QA), Sentiment Analysis (SA), and Named Entity Recognition (NER) [9,10]. This performance enhancement stems from the fact that BERT can generate contextual word embeddings for each word in a context. These embeddings were shown to encode syntactic, semantic, and long-distance dependencies of sentences [11], thus making them ideal for many NLP tasks.
Contextual embedding methods (e.g., BERT [7], Embeddings from Language Models (ELMO) [12], and Generative Pre-trained Transformer (GPT) [13]) learn sequence-level semantics by considering the sequence of all the words in the input sentence. In particular, BERT is trained to predict the masked word(s) of the input sentence; to do this, BERT learns self-attention to weigh the relationship between each word in the input sentence and the other words in the same sentence. Each word then has a vector that represents its relationship with the other words in the input sentence. Those vectors are used to generate word embeddings, so the generated embedding for each word depends on the other words in a given sentence. This is unlike Glove [14] and Word2Vec [15], which represent each word by a fixed embedding, regardless of its context. These traditional models depend on the co-occurrence of words in the whole corpus: if two words tend to appear with similar neighboring words in different contexts, they will have a similar representation.
WSD is classified into three main categories: supervised WSD, unsupervised WSD, and knowledge-based WSD. Knowledge-based WSD methods usually utilize language resources such as knowledge graphs and dictionaries to solve the WSD problem. Knowledge-based WSD can be further subcategorized into graph-based approaches and gloss-based approaches [16]. Graph-based approaches utilize language knowledge graphs, such as WordNet, to represent a sense by its surrounding tokens in that graph. Gloss-based approaches measure the semantic similarity or the overlap between the context of the target word and the dictionary definition of that word (gloss of the word). Gloss-based methods require context-gloss datasets. In this paper, we explore utilizing pretrained BERT models to better represent gloss and context information in gloss-based Arabic WSD. Fortunately, there are pretrained BERT models available for Arabic [9,10]. We utilized these pretrained models to build two gloss-based WSD models. The first model uses the pretrained BERT model as a feature extractor, without fine-tuning BERT layers, to generate a contextual word embedding of the target word in its context. We also used it to generate a sentence vector representation of the gloss sentence. These representation vectors were then fed to a trainable dense layer to perform supervised WSD. In the second model, we fine-tuned BERT layers by training them with a sentence pair classification objective. Our experiments showed that we outperform many of the state-of-the-art Arabic knowledge-based WSDs.

The evaluation of WSD models suffers from a lack of standard benchmark datasets [16]. Each author tests on a different and small (around 100 ambiguous words) dataset, which makes it difficult to compare results. We attempt to fix this issue by creating a publicly available benchmark dataset. We hope that such a dataset can help ease evaluating and comparing different WSD techniques.
The rest of the paper is organized as follows. In Section 2, we discuss the related work. Section 3 introduces the context-gloss benchmark. The background is presented in Section 4. Section 5 describes the proposed models. In Sections 6-8, we present our experimental setup, results, and discussion. Finally, we conclude our paper in Section 9.

Related Work
As mentioned previously, knowledge-based WSD is categorized into graph-based approaches and gloss-based approaches [16]. Graph-based approaches depend on language knowledge graphs such as WordNet to represent a sense by its surrounding tokens in that graph. For example, the authors in [17] introduced a retrofit algorithm that utilized Arabic WordNet along with Glove [14] and Word2Vec [15] to generate context-dependent word vectors that carry sense information. Finally, they used cosine similarity between the generated word vector and the vectors generated for WordNet synsets of that word and selected the sense with the highest similarity score. In [18], the authors utilized Arabic WordNet (AWN) to map words to concepts, where a concept was defined as a specific word sense. They further utilized the English WordNet to extend the Arabic one by employing Machine Translation for the missing words in AWN. A concept was then selected for a target word by comparing the context of the target word to the neighbors of the concept in the extended WordNet; concepts with a high match to the context were selected as the correct sense. In [19], the authors used AWN to generate a list of potentially ambiguous words. For each ambiguous word, they found Wikipedia articles that correspond to the different senses of that word. Those articles were then converted to real-valued vectors using tf-idf [20]. They also generated a context vector by applying tf-idf to the word's context. Finally, they used cosine similarity between the two vectors and selected the sense with the highest similarity score.
The second type of knowledge-based approach is the gloss-based approach. These methods measure the semantic similarity or the overlap between the context of the target word and the dictionary definition of that word (gloss of the word). The authors in [21] used the Lesk algorithm to measure the relatedness of the target word's context to its definition, using AWN as a source for the word gloss. The authors of [22] combined unsupervised and knowledge-based approaches: they used a string matching algorithm to match the root of salient words that appear in many contexts of a particular sense with the gloss definition. The authors of [23] employed Word2Vec [15] to generate a vector representation for the target word's context sentence and the gloss sentence; cosine similarity was then used to match the gloss vector and the context vector. The authors in [24] used Rough Set Theory and semantic short text similarity to measure the semantic relatedness between the target word's context and multiple possible concepts (glosses). Recently, the authors in [25] used FLAIR [26], a character-level language model, to generate sentence representation vectors for both context and gloss sentences. The sentence vector was calculated by taking the mean of its word vectors, causing a loss of information, especially when the sequence is long. Cosine similarity was then used to measure the similarity between the word context sentence and the gloss sentence.

Benchmark
Arabic gloss WSD suffers from the lack of context-gloss pair datasets [16]. Arabic WordNet (AWN), which is the only available gloss dataset, contains only word senses with gloss definitions but lacks context examples for these senses [27]. Due to this limitation in AWN, we created a new Arabic WSD benchmark. This dataset is extracted from the "Modern Standard Arabic Dictionary" [28]. To the best of our knowledge, our benchmark is the only available Arabic context-gloss pair dataset. It consists of 15,549 senses for 5347 unique words, with an average of 3 senses per word. Figure 1 shows the histogram of sense counts per word. We can see that most of the words (4000+) have 2 to 4 senses, that about 750 words have 4 to 6 senses, and that the count decreases as the number of senses increases. Each record in the dataset is a tuple of three elements: a word sense, a context example, and a definition of that word sense. In Table 1, we present the statistics of the dataset. This dataset is available online (https://github.com/MElrazzaz/Arabic-word-sense-disambiguation-bench-mark.git, accessed on 10 March 2021). Table 2 shows examples of records in the benchmark's dataset. We believe that the benchmark can contribute to standardizing the evaluation of WSD models.
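To make the record structure concrete, the following is a minimal sketch of how such context-gloss records might be represented and how per-word sense counts (as in Figure 1) can be tallied. The field names and the English example values are ours for illustration only; the actual benchmark stores Arabic words, contexts, and glosses.

```python
from collections import Counter

# Hypothetical records: each is a (word, sense gloss, context example) triple,
# mirroring the benchmark's tuple structure. Values are illustrative only.
records = [
    {"word": "bank", "gloss": "land alongside a river",
     "context": "we walked along the bank of the Nile"},
    {"word": "bank", "gloss": "a financial institution",
     "context": "she deposited money at the bank"},
    {"word": "book", "gloss": "a written literary work",
     "context": "he read the book twice"},
]

# Sense counts per word, the quantity plotted in the histogram of Figure 1.
senses_per_word = Counter(r["word"] for r in records)
```

Aggregating `senses_per_word.values()` over the full benchmark would reproduce the histogram showing that most words carry 2 to 4 senses.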

Background
In this section, we review the background needed to introduce our approach. We begin by reviewing transformers [8], which are a substantial part of BERT [7]. We then review BERT as it is one of the building blocks of our approach.

Transformers
A transformer [8] is a celebrated Neural Network (NN) architecture that was introduced to approach translation tasks. Figure 2 depicts the transformer's main components. It consists of two components: an encoder and a decoder. The encoder encodes a source sentence into a sequence of real vectors; the decoder, on the other hand, decodes a set of vectors into a target sentence. Both the encoder and the decoder consist of 6 blocks. Each of these blocks consists of two layers: the first layer is a multi-head attention layer, while the second is a fully connected layer. The layers are equipped with a residual connection that connects the input of a layer to the output of that layer, and a Layer Norm [29] is applied after each block. Now, we elaborate on the multi-head attention layer. The intuition behind this layer is that the sense of a given word in a sentence is usually affected, to different degrees, by other words in the same sentence (self-attention) or in the target sentence. This intuitive idea is implemented in the attention layer by associating three vectors with each word: a key $k$, a query $q$, and a value $v$. The degree of dependence between a given word $w_s$ and a target word $w_t$ is quantified by the dot product between the query of the word, $q_{w_s}$, and the key of the target word, $k_{w_t}$. Finally, the word is represented by a weighted average of the values of the target words, using these dot products as weights. The weights are further normalized by applying a softmax to ensure that they sum to 1 before the weighted average is taken. To obtain a compact formula, the keys of the target words are stacked as rows of a matrix $K$, the queries are arranged similarly into a matrix $Q$, and $V$ denotes the concatenation of the values of the target words. The layer output is then realized using Equation (1), where $d_k$ denotes the dimensionality of the key vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
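The scaled dot-product attention computation described above can be sketched in a few lines of numpy. The matrix shapes and random inputs are ours for illustration; the softmax normalizes each row of query-key scores so the weights sum to 1.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key dot products
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted average of the values

# Toy example: a 3-word sequence with d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, which is exactly the weighted average over target-word values described in the text.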
As can be noticed in the above computation of the attention-based word representation, the position of a word in the target sequence is lost as a result of the weighted average over the target words. This can be alleviated by adding a vector that encodes the position of the given word to each word representation. Such a process is called Positional Encoding (PE). The PE vector is obtained using Equation (2):

$$PE_{(pos,\,2k)} = \sin\!\left(\frac{pos}{10000^{2k/d_{model}}}\right), \qquad PE_{(pos,\,2k+1)} = \cos\!\left(\frac{pos}{10000^{2k/d_{model}}}\right) \qquad (2)$$

where $k$ is the index of the value calculated for the positional vector and $pos$ is the position of the word.
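The sinusoidal positional encoding of Equation (2) can be sketched as follows; even dimensions take the sine term and odd dimensions the cosine term, with the sequence length and model dimension chosen here only for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Equation (2): even dims use sin(pos / 10000^(2k/d_model)), odd dims cos."""
    pos = np.arange(seq_len)[:, None]        # word positions: (seq_len, 1)
    k = np.arange(d_model // 2)[None, :]     # dimension-pair index: (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * k / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
```

The resulting matrix is simply added to the word representations, reinjecting the position information that the weighted average discards.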

Bidirectional Encoder Representation from Transformers
BERT [7] is a language model based on the architecture of the transformers reviewed above. As can be seen in Figure 3, BERT's architecture consists of a stacked set of transformer encoders. A key aspect of BERT is that it is designed to be trained in an unsupervised manner with no particular task at hand. To facilitate training, two objectives were proposed to optimize BERT: (1) given a sentence, a particular word is masked (replaced with a placeholder) and the model predicts that word; (2) given two sentences, say X and Y, the model predicts their order, that is, whether Y appears before X or vice versa. The model learned in this manner was shown in [11] to encode phrase-level information in the lower layers; in the middle layers, it encodes surface, syntactic, and semantic features, while the top layers of the model encode long-distance dependencies in the sentence.

Arabic Pretrained BERT Models
Recent studies on Arabic NLP showed that using BERT achieves state-of-the-art results on many NLP tasks, for instance, Question Answering (QA), Named Entity Recognition (NER), and Sentiment Analysis [9,10]. These studies trained two BERT models, namely AraBERTv2 [9] and ARBERT [10], on large Arabic corpora. The details of these corpora, as well as the number of parameters of each model, are shown in Table 3. Since these models already contributed to state-of-the-art results in many tasks, we use them as building blocks for our proposed models.

Transfer Learning with BERT
In this section, we discuss how to utilize pretrained BERT models to approach NLP tasks. There are three common strategies, as shown in Figure 4. The first strategy is to tune all layers, that is, to train the entire pretrained model on a new task while the pretrained model parameters serve as initializations. The second is to fine-tune only some layers while keeping the weights of the rest of the network unchanged. The third strategy is to freeze the entire architecture while attaching a set of new layers to the output of the pretrained model; during training, only the parameters of the new layers are allowed to change.

Materials and Methods
In this section, we present our proposed approach to gloss WSD. As stated above, we utilized a pretrained BERT to design our models. Our first model uses BERT as a feature extractor, that is, the parameters of the BERT layers are not retrained. The second model, on the other hand, fine-tunes the BERT layers by training them with a sentence pair classification objective. We experimented with both AraBERTv2 [9] and ARBERT [10] as our pretrained BERT. In the rest of this section, we present the details of our proposed models. The code and the data needed to replicate our proposed method are available on GitHub (https://github.com/MElrazzaz/Arabic-word-sense-disambiguation-bench-mark.git, accessed on 10 March 2021).

Model I
Model I uses the pretrained BERT model as a feature extractor. The intuition here is that, since we only have a few samples, fine-tuning the model can be affected by the noise in those samples. Therefore, training only our added layers reduces the complexity of the model and makes efficient use of the training sample size. Figure 5 shows the Model I architecture. The architecture consists of the concatenation of two BERT outputs, which are the features extracted from a pretrained BERT. We then applied a fully connected layer on top of those concatenated outputs. Finally, a softmax layer was attached on top of the fully connected layer to provide a means of training that layer. The two feature vectors were extracted in a specific manner to represent the sense of the word as defined in the gloss and its sense as defined by its context in a given example. To that end, we fed the word gloss to our pretrained BERT model and used the embedding of the '[CLS]' token as the feature vector representing the sense of the word from the gloss perspective. The second feature vector was generated by feeding the pretrained BERT an example sentence of this word sense; the extracted feature, in this case, was the embedding of that word averaged over the last N BERT layers. We then concatenated those feature vectors and applied a fully connected layer.
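The trainable head of Model I can be sketched as follows. This is only an illustration under our own assumptions: the two feature vectors are stubbed with random values standing in for the frozen BERT's gloss '[CLS]' embedding and the averaged target-word embedding, and we score the binary match with a sigmoid (equivalent to a two-way softmax for this binary objective).

```python
import numpy as np

rng = np.random.default_rng(42)
d = 768  # hidden size of a BERT-base model

# Stand-ins for features extracted from the frozen BERT:
gloss_vec = rng.standard_normal(d)    # '[CLS]' embedding of the gloss sentence
context_vec = rng.standard_normal(d)  # target-word embedding from the context

# Trainable dense layer over the concatenated features; only these
# parameters would be updated during training.
W = rng.standard_normal(2 * d) * 0.01
b = 0.0
features = np.concatenate([gloss_vec, context_vec])
score = 1.0 / (1.0 + np.exp(-(W @ features + b)))  # match probability in (0, 1)
```

During training, `score` would be compared to the 0/1 match label under a binary cross-entropy loss, as described in the training-configuration section.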
The model was trained with the objective of classifying whether the sense in the example context of the target word matches the sense defined by the gloss of the target word. To train such an objective, we created a dataset with pairs of glosses and examples. A record with a positive class has a matching gloss and example, while a negative record has a non-matching example and gloss. For examples of those records, see Table 4.
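The construction of such positive and negative context-gloss pairs can be sketched as follows. The sense inventory and the negative-sampling choice (pairing an example with a gloss of a different sense of the same word) are our illustrative assumptions, not the paper's exact procedure.

```python
import random

# Hypothetical sense inventory: word -> list of (gloss, context example) pairs.
senses = {
    "bank": [
        ("land alongside a river", "we walked along the bank of the Nile"),
        ("a financial institution", "she deposited money at the bank"),
    ],
}

def make_pairs(senses, seed=0):
    """Build (context, gloss, label) records: label 1 when the gloss
    matches the sense used in the context, 0 otherwise."""
    rng = random.Random(seed)
    records = []
    for word, entries in senses.items():
        for i, (gloss, example) in enumerate(entries):
            records.append((example, gloss, 1))  # positive: matching pair
            # Negative: pair the example with a gloss of a different sense.
            others = [g for j, (g, _) in enumerate(entries) if j != i]
            if others:
                records.append((example, rng.choice(others), 0))
    return records

pairs = make_pairs(senses)
```

Each resulting `(context, gloss, label)` record is one training sample for the binary matching objective.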

Model II
Unlike Model I, in Model II, we fine-tuned all BERT layers. Again, the training objective was to perform a sentence pair classification task. In this model, however, only one BERT architecture was used. The input to the BERT model, in this case, was two sentences separated by a separator token '[SEP]'. The first sentence was a context example, while the other sentence was the gloss of the target word. A sequence classification layer with a Gaussian Error Linear Units (GELU) activation function was added on top of the BERT layers to classify whether the context sentence and the gloss definition were related. Examples of training set records are shown in Table 5. For each context-gloss pair, a '[SEP]' token was added between the two sentences, and a '[CLS]' token was added at the beginning of the two sentences. The classification token '[CLS]' was used to represent an encoding of the whole sequence, and therefore, the classification layer was attached on top of the '[CLS]' embedding. Figure 6 shows the Model II architecture.
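The sentence-pair input described above can be sketched as a string for clarity. In practice, a BERT tokenizer inserts the special tokens itself when given two sentences, so this explicit string form is purely illustrative.

```python
def build_pair_input(context, gloss):
    """Model II input layout: '[CLS]' opens the sequence and '[SEP]'
    separates the context example from the gloss definition."""
    return f"[CLS] {context} [SEP] {gloss} [SEP]"

# Illustrative (non-benchmark) context and gloss.
seq = build_pair_input("he sat by the bank of the river",
                       "land alongside a river")
```

The classifier then reads the encoder's output at the '[CLS]' position to decide whether the pair matches.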

Experiments
In this section, we present the experimental setup for building and evaluating our proposed Arabic WSD models.

Evaluation Dataset
To evaluate the performance of the proposed models, the benchmark dataset was divided into 60%, 20%, and 20% splits for training, validation, and testing, respectively. The details of the test data are shown in Table 6. We preprocessed both the word sense definitions and the context examples to match the input expected by the BERT models: diacritic marks were removed from the input sentences, and the sentences were tokenized using the Farasa tokenizer [30].
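The 60/20/20 split can be sketched as follows; the shuffling seed and the helper name are ours, shown only to make the split proportions concrete.

```python
import random

def split_dataset(records, seed=0):
    """Shuffle and split records 60/20/20 into train/validation/test."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = round(0.6 * n), round(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
```

Shuffling before splitting avoids grouping all senses of the same word into one partition by accident.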

Evaluation Metrics
The goal of the model was to find the correct word sense for a given example. We report three evaluation metrics, namely, precision, recall, and F1-score. These metrics are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP denotes the true positive count, that is, the number of correctly classified sentence pairs when the context example matches the word sense definition; FP is the false positive count, that is, the number of non-matching sentence pairs falsely classified as matching; TN is the number of correctly classified sentence pairs when the word sense definition does not match its context; and FN is the number of matching pairs falsely classified as non-matching.
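These three metrics can be computed directly from the counts, as the following sketch shows (the counts here are arbitrary illustrative values):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = 2*P*R / (P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 90 true positives, 10 false positives, 10 false negatives
# gives precision = recall = F1 = 0.9.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
```

Note that TN does not enter any of the three formulas; it only affects accuracy, which is not reported here.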

Training Configuration
We now present the details of our training procedures. For Model I, we trained the model for 20 epochs. The features extracted from BERT were the average of the last 4 layers. We used the Adam [31] optimization method with a batch size of 8 and a learning rate of 0.00002. The learning objective was to minimize the binary cross-entropy loss. These learning parameters are summarized in Table 7. Model II was trained with the same parameters; however, we also optimized the BERT layers. Moreover, we appended the target word to the input sequence to drive the model to learn to disambiguate this specific word from the full sequence.
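Averaging the target word's embedding over the last 4 layers can be sketched as below. The random tensor stands in for the per-layer hidden states a BERT encoder would produce (e.g., with hidden-state output enabled); layer count, sequence length, and the target position are illustrative assumptions.

```python
import numpy as np

# Stand-in for BERT hidden states: (num_layers, seq_len, hidden_size).
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((12, 20, 768))
target_index = 5  # position of the target word in the tokenized sequence

# Model I feature: the target word's embedding averaged over the last 4 layers.
word_vec = hidden_states[-4:, target_index, :].mean(axis=0)
```

Averaging over several top layers is a common way to blend the semantic and long-distance information that different BERT layers encode.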

Results
In this section, we present our experimental results. Table 8 reports the results of both models when using either AraBERTv2 or ArBERT as the pretrained BERT model, with and without tokenizing the input sentences.
As can be seen in Table 8, both configurations that use AraBERTv2 outperform those that use ArBERT as the pretrained BERT model. This holds for both Model I and Model II. Under the tokenization configuration, Model II's precision increases from 92% to 96% while its recall decreases from 87% to 67%, which causes the F1-score to drop from 89% to 79%. Model I is likewise adversely affected by tokenization, with precision decreasing from 69% to 67% and F1-score from 74% to 72%. We can also see that Model II always outperforms Model I. Table 9 compares our proposed models against other embedding-based models that, like ours, represent the gloss and context sentences in a vector space. We compare our proposed models with two models, Laatar [25] and Laatar [23], which are the most recent Arabic WSD works. The test data of these two models are not available; thus, to make a fair comparison, we reimplemented their models and tested them on our benchmark. Laatar [25] used FLAIR [26], a pretrained character-level language model, to generate word embeddings and represented each sentence by the mean of its word vectors. The authors in [23] used the Word2Vec [15] language model to represent the words of a sentence in the vector space and then represented the sentence by summing its word vectors. After calculating the sentence vectors, both models measure the cosine similarity between the context sentence vector and each gloss sentence vector and choose the gloss with the highest similarity score. Table 9 shows that our models outperform the other embedding-based models in terms of precision, recall, and F1-score. Table 9 also shows the originally reported results of the other models on their own test sets.
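The cosine-similarity selection rule used by both baseline models can be sketched as follows; the toy vectors stand in for the sentence embeddings that FLAIR or Word2Vec would produce.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def pick_gloss(context_vec, gloss_vecs):
    """Baseline selection: index of the gloss whose sentence vector is
    most cosine-similar to the context sentence vector."""
    sims = [cosine(context_vec, g) for g in gloss_vecs]
    return int(np.argmax(sims))

# Toy sentence vectors standing in for real embeddings.
context = np.array([1.0, 0.0, 0.0])
glosses = [np.array([0.0, 1.0, 0.0]),   # orthogonal to the context
           np.array([0.9, 0.1, 0.0])]   # nearly parallel to the context
choice = pick_gloss(context, glosses)   # selects index 1
```

Unlike our trained classifiers, this rule involves no supervised training, which is one reason the baselines fall behind on the benchmark.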
We performed McNemar's significance test [32] between Model II and Laatar [25] and between Model II and Laatar [23]; we found that Model II is significantly better than both, with p-values less than 0.01. In Table 10, we present the performance in comparison to other Arabic knowledge-based WSDs.

Discussion
The results above show that our approach outperforms two of the recent gloss-based models and that the reported results surpass previously reported results for knowledge-based WSD. We believe that the reason behind this improvement lies in the use of a pretrained BERT. Indeed, BERT models are trained on a very large dataset to perform masked language modeling and next-sentence prediction tasks. To perform well on these tasks, it has been shown in [11] that BERT models learn both syntactic and semantic features at the middle layers, while at the top layers, they learn long-distance dependency information. Therefore, using these pretrained models transfers this knowledge to Arabic WSD, which improves performance and decreases the amount of data needed for these models to learn. Moreover, we exploited BERT's ability to perform well on sentence pair classification tasks by posing WSD as a sentence pair classification problem. Finally, we used our new benchmark to perform supervised fine-tuning; we believe that this is one of the main drivers behind these results.

Conclusions
In this paper, we introduced the first Arabic context-gloss pairs benchmark. We believe that it can help standardize the evaluation of models in Arabic WSD. We then introduced two Arabic WSD models that benefit from pretrained BERT models. The first model uses BERT as a feature extractor, without fine-tuning BERT layers, to generate two feature vectors: a contextualized embedding of the target word and a sense representation vector. These two vectors serve as inputs to a fully connected layer with a sigmoid activation function that predicts whether the word in the context matches the sense in the definition. In the second model, we posed the WSD task as a sentence pair classification task and fine-tuned a pretrained BERT using the sequence classification objective. We experimentally showed that our models outperform recent gloss-based WSD systems and that the reported results surpass other reported knowledge-based WSD results. Furthermore, we found that fine-tuning the model was crucial for achieving good performance.