Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest because understanding it can offer insight into how human beings comprehend meaning and make associations between words. However, when this problem is viewed from the standpoint of machine understanding, particularly for under-resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less resourced language. To improve the situation in such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English, and the results obtained are encouraging. Two semantic similarity approaches have been explored in Bangla, namely the path-based and the distributional model, and their cross-lingual counterparts were synthesized in light of the English WordNet and corpora. The proposed methods were evaluated on a dataset comprising 162 Bangla word pairs, which were annotated by five expert raters. The correlation scores obtained between the four metrics and human evaluation scores demonstrate the marked enhancement that the cross-lingual approach brings to semantic similarity calculation for Bangla.


Introduction
Semantic similarity between two words represents the semantic closeness (or semantic distance) between the two words or concepts. It is an important problem in natural language processing as it plays a crucial role in information retrieval, information extraction, text mining, web mining and many other applications. In artificial intelligence and cognitive science, semantic similarity has also long been used for scientific evaluation and measurement, as well as for deciphering the intricate interface operating behind the process of conceptualizing senses.
Theoretically, semantic similarity refers to the idea of commonality in characteristics between any two words or concepts within a language. Although it is a relational property between concepts or senses, it can also be defined as a measurement of conceptual similarity between two words, sentences, paragraphs, documents, or even two pieces of text.
Similarity among concepts is a quantitative measure of information and is calculated based on the properties of concepts and their relationships. Semantic similarity measures have applications in information extraction (IE) [1], information retrieval (IR) [2], bioinformatics [3,4], word sense disambiguation [5], etc.
Semantic relatedness, introduced by Gracia and Mena [6], and semantic similarity are two related terms, but semantic relatedness is less specific than semantic similarity. When we say that two words are semantically similar, we mean that they are used in the same way in relation to other words. For example, পেট্রোল (petrol) and ডিজেল (diesel) are similar terms owing to their common relation with fossil fuels. On the other hand, two words are related if they tend to be used near one another in different contexts. For example, পেট্রোল (petrol) and গাড়ি (car) are related terms but they are not similar in sense.
All similar concepts may be related but the converse is not true. Semantic similarity and semantic distance of words or concepts are defined inversely. Let us suppose A1 and A2 are two concepts that belong to two different nodes N1 and N2 in a particular ontology. The similarity between these two concepts is determined by the distance between the nodes N1 and N2. Both N1 and N2 can be considered as an ontology or taxonomy that contains a set of synonymous terms. Two terms are synonymous if they are in the same node, in which case their semantic similarity is maximized. Whenever we take up the question of semantic similarity, relatedness or distance, we expect our system of evaluation to return a score lying between −1 and 1 or between 0 and 1, where 0 indicates no similarity and 1 indicates extremely high similarity.
English is a well-resourced language and, as such, a wide array of resources and methods can be applied for determining similarity between English words. However, languages such as Bangla do not enjoy this status owing to the lack of well-crafted resources. Thus, determining similarity between word pairs in such a language is a more complex task.
This paper focuses on semantic similarity measurement between Bangla words and describes four different methods for achieving it. Each method of semantic similarity measurement is evaluated in monolingual and cross-lingual settings and compared with the other methods. The rest of the paper is structured as follows. Section 2 presents the related work on semantic similarity measures in English and other languages. Section 3 describes the proposed methodology. Section 4 describes the experimental setup. Section 5 gives details of the resources used for our work. Section 6 provides the experimental results and their analysis. Finally, the paper concludes in Section 7.

Related Work
Much work has been done on semantic similarity based on either word similarity or concept similarity. Based on semantic relationships, work has been done on approaches involving the use of dictionaries and thesauri. More complex approaches depend on WordNet [7] and ConceptNet [8]. Fellbaum [9] introduced a method for similarity measures based on WordNet. Liu and Singh [10] worked on a technique based on ConceptNet. So far, four strategies are known for measuring similarity. These are: (i) structure-based measures, (ii) information content (IC)-based measures, (iii) feature-based measures and (iv) hybrid measures.
The structure-based similarity measures use a function to compute semantic similarity. The function calculates the path length between the words or concepts and their position in the taxonomy. Thus, the more closely linked two words or concepts are, the more similar they are to each other. Rada et al. [11] calculated shortest path-based similarity using semantic nets. This measure is dependent on the distance method and is designed mainly to work with hierarchies. It is a very powerful measuring technique in hierarchical semantic nets. Weighted links [12] is an extension of the shortest path-based measure. Here the similarity between two concepts is computed using weighted links. There are two factors which affect the weight of a link, viz. the depth of hierarchy (namely density of taxonomy), and the strength between child and parent nodes. The summation of the weights of the traversed links gives the distance between two concepts. Hirst and St-Onge [13] came up with a method to find relatedness between concepts using the path distance between the concept nodes. The concepts are said to be semantically related to each other if there is relational closeness between the meanings of two concepts or words.
Wu and Palmer [14] proposed a similarity measure between two concepts in a taxonomy, which depends on the relative position of the concepts with respect to the position of their most specific common concept. Based on edge counting techniques, Slimani et al. [15] created a similarity measuring technique, which was an extension of the Wu and Palmer measure. To calculate sentence similarity, Li et al. [16] proposed a method that includes the semantic vector and word order in the taxonomy. Leacock and Chodorow [17] put forth the relatedness similarity measure. In this technique, the similarity of two concepts is evaluated by taking the negative logarithm of the shortest path length divided by twice the maximum depth of the taxonomy.
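For reference, the Wu and Palmer and the Leacock and Chodorow measures are commonly written as follows. The notation is ours and may differ from that of the cited papers: LCS denotes the lowest common subsumer of the two concepts, depth(·) the depth of a concept in the taxonomy, len(·,·) the shortest path length, and D the maximum depth of the taxonomy.

```latex
\mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{LCS}(c_1, c_2)\big)}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
\qquad
\mathrm{sim}_{LC}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}
```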
The IC of concepts is another approach for tackling the similarity problem. The frequency of a particular term in a document collection is the key for calculating the IC value. There are many methods for calculating semantic similarity based on the IC of words or concepts. Resnik [18] presented a technique that uses the IC of the shared parents. The reasoning behind this technique was that two concepts are more similar if they have more shared information. Lin et al. [19] put forth a semantic similarity measure based on ontology and corpus. The technique used the same formula as that of Resnik for information sharing, but the difference lay in the definition of concepts, which gave a better similarity score. Other IC-based methods for handling the similarity problem were proposed, such as the Jiang-Conrath [20] approach, which is an extension of the Resnik similarity. Jiang-Conrath and Lin similarity have almost identical formulae for calculating semantic similarity in the sense that both approaches compute the same components. However, the final similarity is formulated in two different ways using those same components.
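The shared components become visible when the three measures are written side by side. The following is the commonly used textbook form, in our own notation (P(c) is the corpus probability of concept c and LCS the lowest common subsumer); it is given as a reading aid rather than as the exact formulation of the cited papers.

```latex
\mathrm{IC}(c) = -\log P(c)
\mathrm{sim}_{Resnik}(c_1, c_2) = \mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big)
\mathrm{sim}_{Lin}(c_1, c_2) = \frac{2\,\mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}
\mathrm{dist}_{JC}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big)
```

Lin combines the three IC terms as a ratio, whereas Jiang-Conrath combines them as a difference, which yields a distance rather than a similarity.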
The problem with thesaurus-based approaches is that thesauri are not available for every language. Furthermore, they are hard to create and maintain, and many words and the links between them are sometimes absent. To circumvent such problems, distributional or vector space models of meaning are used. In this domain, mention must be made of the cosine similarity metric, which is perhaps the most widely used measure. The Jaccard index, also known as the Jaccard similarity coefficient, is another distributional similarity metric. Cosine similarity, along with several other distributional similarity measures, is calculated using the term-document matrix of a given corpus, which is essentially a 2D array where the rows correspond to terms and the columns represent the documents. Each cell of the matrix holds the number of times a particular term has appeared in a particular document. The intuition behind this approach is that two documents are similar if their vectors are similar; likewise, two terms are similar if their row vectors are similar.
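The term-document representation described above can be illustrated with a minimal sketch. The three toy "documents" and all function names are hypothetical and chosen only for illustration; they are not part of the experiments reported here.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy corpus (hypothetical): each string is one document.
docs = ["petrol diesel fuel", "petrol car fuel", "car road"]
vocab, matrix = term_document_matrix(docs)
row = dict(zip(vocab, matrix))  # term -> row vector over documents

petrol_fuel = cosine(row["petrol"], row["fuel"])  # terms with identical rows
diesel_road = cosine(row["diesel"], row["road"])  # terms that never co-occur
```

Terms that occur in exactly the same documents obtain a cosine close to 1, while terms that never share a document obtain 0.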
Mikolov et al. [21][22][23] published three papers on the topic of distributed word embedding to capture the notion of semantic similarity between words, which resulted in Google's Word2Vec model. Word2Vec can operate in two forms: continuous bag-of-words (CBOW) or skip-gram. Both are variants of a neural network language model proposed by Bengio et al. [24] and Collobert and Weston [25]. However, rather than predicting a word conditioned on its predecessor, as a traditional bi-gram language model does, a word is predicted from its surrounding words (CBOW) or multiple surrounding words are predicted from one input word (skip-gram). Arefyev et al. [26] used the Word2Vec model in their research to detect similarity between Russian words. After comparing the results from the Word2Vec experiment with two other corpus-based systems for evaluating semantic similarity, it became clear that the Word2Vec model was the superior approach and that further work needed to be done on it.
However, traditional word embeddings only allow a single representation for each word. Newer methods have been proposed to overcome the shortcomings of word embeddings by modeling sub-word level embeddings (Bojanowski et al. [27]; Wieting et al. [28]) or learning separate sense embeddings for each word sense (Neelakantan et al. [29]).
Bojanowski et al. [27] approached the embedding task by representing words as bags of character n-grams, with the embedding for a word defined as the sum of the embeddings of its n-grams. The method (popularly known as FastText) is particularly suited to morphologically rich languages and can compute word representations for words that are not present in the training data.
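The sub-word idea can be sketched in a few lines. This is an illustrative simplification, not the FastText implementation: the function names, the default n-gram range of 3 to 6, and the toy n-gram vectors are assumptions made for the example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'.

    The wrapped word itself is also included as one unit.
    """
    wrapped = f"<{word}>"
    grams = {wrapped}
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams of the word (unknown ones add zero)."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, x in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += x
    return vec


# Hypothetical 2-dimensional n-gram vectors, for illustration only.
toy_vectors = {"<ab": [1.0, 2.0], "ab>": [0.5, 0.5]}
vec_ab = word_vector("ab", toy_vectors, dim=2)
```

Because the vector is assembled from n-grams, a word unseen during training still receives a representation as long as some of its n-grams were seen.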
Faruqui and Dyer [30] presented a multi-lingual view of word embeddings. In this method, monolingual embeddings are first trained on monolingual corpora for each language independently. Then a bilingual dictionary is used to project the monolingual embeddings in both languages into a shared bilingual embedding space where the correlation of the multilingual word pairs is maximized using canonical correlation analysis (CCA). They reported that the resulting embeddings can model word similarities better than the original monolingual embeddings.
In a very recent development, Conneau et al. [31] presented a method for learning translation lexicons (or cross-lingual alignments) in a completely unsupervised manner, without the need for any cross-lingual supervision. The method involves learning monolingual embeddings independently and learning a linear mapping to overlap the monolingual semantic spaces of both languages by leveraging adversarial training. This method has paved the way for unsupervised machine translation, which is particularly suitable for language pairs that are low- or zero-resource in terms of parallel corpora.
Several other methods based on feature and hybrid measures have been suggested. Tversky [32] proposed a method using features of terms for measuring semantic similarity between them. The position of the terms in the taxonomy and their IC were ignored in this method. The common features between the concepts increase the similarity in this method. Petrakis et al. [33] gave a word matching method called X-similarity, which was a feature-based function. The words are extracted from WordNet by parsing term definitions for a match between the words. Two terms are said to be similar when the concepts of the words and their neighborhoods are lexically similar. Sinha et al. [34] introduced a new similarity measure for the Bangla language based on their Mental Lexicon, a resource inspired by the organization of lexical entries of a language in the human mind.
Sinha et al. [35] proposed a hierarchically organized semantic lexicon in Bangla and also a method to measure semantic similarity between two Bangla words using a graph-based edge weighting technique.

Methodology
Our work on measuring semantic similarity between Bangla words involves both path-based semantic similarity and distributional (Word2Vec-based) semantic similarity. WordNet [7], being the only semantic ontology available for Bangla, is used for implementing the path-based semantic similarity method in the present work.
The information content-based semantic similarity of Li et al. [16] requires a sense-annotated corpus, which is unavailable for Bangla. The Bangla semantic lexicon proposed in [35] is not publicly available and the method proposed in [35] for computing semantic similarity is not directly applicable to the Bangla WordNet as such. The semantic similarity measures of Wu and Palmer [14] and Slimani et al. [15] are applicable to WordNet. However, in this paper, we limit the study of semantic similarity in Bangla to path-based similarity, the most rudimentary method of computing similarity based on a semantic ontology, and distributional similarity, the state-of-the-art method in semantics.
We use the path-based semantic similarity and distributional semantic similarity in both monolingual and cross-lingual settings, thus giving rise to four different methods. The four methods are described below and are summarized in Table 1. The monolingual approaches described in the study are applied to Bangla only, as our objective is to study semantic similarity in Bangla, whereas the cross-lingual approaches involve translating Bangla words into their English counterparts and calculating semantic similarity in English. IC-based methods [4,18,19,20] are more reliable than the path-based method. Unfortunately, they could not be attempted due to the unavailability of sense-annotated corpora in Bangla. We could, of course, apply IC-based methods of semantic similarity in the translation-based approaches. However, to keep our evaluation metrics fair in all settings, we chose only the path-based method as the baseline, which could easily be applied to both languages.

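Table 1. The four semantic similarity methods.
Setting | Path-based | Distributional (Word2Vec)
Monolingual | SS_{P_M} | SS_{D_M}
Cross-Lingual | SS_{P_C} | SS_{D_C}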
We considered a cross-lingual approach in the study since the English WordNet is much richer than the Bangla WordNet and the English Word2Vec model is expected to be better than the Bangla Word2Vec model. One of the objectives of this study is also to test whether an enriched WordNet and a better Word2Vec model in English lead to an improved similarity metric in Bangla when we take a cross-lingual approach (i.e., via translation of Bangla words). An explanatory figure for the cross-lingual approaches is given in Figure 1. In order to measure semantic similarity between Bangla words $B_1$ and $B_2$ using the cross-lingual approach, we consider the English translations of $B_1$ and $B_2$. According to the figure, $B_1$ has $M$ translations in English, i.e., $Tr(B_1) = \{E^1_{B_1}, \ldots, E^M_{B_1}\}$, and the translations $Tr(B_2)$ of $B_2$ are obtained in the same way. We compute semantic similarity (either path-based or distributional) between each pair of English words, denoted by the table row and column headers, and fill up the entire semantic similarity matrix. For example, the matrix cell corresponding to the $i$-th row and $j$-th column represents the similarity between $E^i_{B_1}$ and $E^j_{B_2}$. Finally, the semantic similarity (SS) between $B_1$ and $B_2$ is taken as the maximum entry of this matrix, following Equation (1).
$$SS(B_1, B_2) = \max_{E^i_{B_1} \in Tr(B_1),\ E^j_{B_2} \in Tr(B_2)} SS(E^i_{B_1}, E^j_{B_2}) \qquad (1)$$
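The cross-lingual procedure described above (translate both words, score every pair of translations, keep the maximum) can be sketched as follows. The function names, the toy translation lists and the toy scores are hypothetical; in particular, returning 0.0 when a word has no translation is an assumption of this sketch.

```python
from itertools import product


def cross_lingual_similarity(b1, b2, translations, similarity):
    """Maximum similarity over all pairs of English translations of b1 and b2."""
    pairs = product(translations.get(b1, []), translations.get(b2, []))
    # default=0.0 is an assumption for words with no available translation.
    return max((similarity(e1, e2) for e1, e2 in pairs), default=0.0)


# Hypothetical translations and pairwise scores, for illustration only.
toy_translations = {"B1": ["mother", "mom"], "B2": ["woman", "lady"]}
toy_scores = {("mother", "woman"): 0.6, ("mom", "lady"): 0.3}


def toy_similarity(e1, e2):
    """Stand-in for an English path-based or Word2Vec similarity function."""
    return toy_scores.get((e1, e2), 0.1)


score = cross_lingual_similarity("B1", "B2", toy_translations, toy_similarity)
```

The `similarity` argument is deliberately left pluggable, so the same wrapper serves both the path-based and the distributional cross-lingual settings.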

Path-Based Semantic Similarity Using Bangla WordNet (SS_{P_M})
The path-based approach is one of the oldest methods used for calculating semantic similarity between senses or concepts. It belongs to the thesaurus-based class of semantic similarity algorithms and measures semantic similarity between a pair of senses in terms of the length of the shortest path between the two senses in an ontology. WordNet is the most popular resource for measuring path-based semantic similarity.
WordNets for all languages share a common foundation in construction. They all follow three main principles: minimalism, coverage, and substitution for the synsets they contain. Minimalism refers to the property of representing a concept by a small (minimal) set of lexemes, which clearly define the sense. Coverage refers to the property of a synset including all the words representing the concept in the language considered. Finally, substitution indicates the property of swapping or substituting words in a context with words appearing in their corresponding synset in a reasonable amount of corpora.
Formally, path-based similarity between two senses is defined as the inverse of the shortest path length between them, as in Equation (2).
$$SIM_{PATH}(s_1, s_2) = \frac{1}{Pathlength(s_1, s_2)} \qquad (2)$$
To avoid division by zero, path length is defined as in Equation (3).
$$Pathlength(s_1, s_2) = 1 + \text{number of edges in the shortest path between sense nodes } s_1 \text{ and } s_2 \qquad (3)$$
This formulation of the path length and $SIM_{PATH}$ also keeps the path-based similarity on a scale of 0 to 1 and assigns the maximum similarity of 1 between a sense and itself. The path-based semantic similarity algorithm measures similarity between senses or concepts. However, the same algorithm can be used to measure semantic similarity between words as in Equation (4).
$$SS_{P\_M}(B_1, B_2) = \max_{s_1 \in S(B_1),\ s_2 \in S(B_2)} SIM_{PATH}(s_1, s_2) \qquad (4)$$
In Equation (4), $B_1$ and $B_2$ represent the Bangla words between which we want to measure the semantic similarity and $S(w)$ returns the senses of $w$.
Figure 2 shows an excerpt of the hypernym-hyponym structure from the Bangla WordNet showing the shortest path (indicated by the bold arrow) between the synsets containing the Bangla words দিন (day) and রাত (night). Thus, the path length between দিন and রাত comes out to be 2.
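The path-based computation can be made concrete with a small sketch on a toy taxonomy. The taxonomy, the sense inventory and the function names below are hypothetical and do not reproduce the Bangla WordNet or Figure 2. Word-level similarity is taken here as the maximum over sense pairs, and a word with no senses in the resource receives a score of zero.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected adjacency map built from (hypernym, hyponym) edges."""
    graph = defaultdict(set)
    for parent, child in edges:
        graph[parent].add(child)
        graph[child].add(parent)
    return graph


def path_length(graph, s1, s2):
    """1 + number of edges on the shortest path; None if no path exists."""
    if s1 not in graph or s2 not in graph:
        return None
    seen, queue = {s1}, deque([(s1, 0)])
    while queue:  # breadth-first search yields the shortest path
        node, dist = queue.popleft()
        if node == s2:
            return dist + 1
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def sim_path(graph, s1, s2):
    """Inverse of the path length; 0.0 when the senses are not connected."""
    length = path_length(graph, s1, s2)
    return 0.0 if length is None else 1.0 / length


def word_similarity(graph, senses, w1, w2):
    """Maximum sense-level similarity over all sense pairs of two words."""
    scores = (sim_path(graph, a, b)
              for a in senses.get(w1, []) for b in senses.get(w2, []))
    return max(scores, default=0.0)


# Hypothetical toy taxonomy and sense inventory.
graph = build_graph([("entity", "time_period"),
                     ("time_period", "day"), ("time_period", "night")])
senses = {"day": ["day"], "night": ["night"]}
day_night = word_similarity(graph, senses, "day", "night")
```

In this toy taxonomy the two senses are two edges apart, so the path length is 3 and the similarity is 1/3; a sense compared with itself has path length 1 and similarity 1.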

Path-Based Semantic Similarity Using Translation and English WordNet (SS_{P_C})
English can be considered a 'well-resourced' language because of the myriad of richly designed resources and tools available for language processing tasks. The English WordNet is one such example. It is much more developed in comparison to the Bangla WordNet and has coverage far superior to the WordNets for other languages. The Bangla WordNet shares similar roots with the English WordNet as it was created using an expansion approach from the Hindi WordNet, which in turn was inspired by the English WordNet. However, there exist some dissimilarities in terms of the number of senses a word carries, such as রাত (night), which has eight senses in the English WordNet but only one sense in the Bangla WordNet.
The idea here is to obtain a projection of Bangla words in English through translation and then calculate the path-based similarity using the English WordNet. The similarity of the translation pair with the maximum value (least path length) is assigned as the similarity score for the Bangla word pair.
Path-based similarity is computed using the English WordNet for every English word pair $[E_i, E_j]$ such that $E_i \in Tr(\text{রাগী})$ and $E_j \in Tr(\text{যন্ত্রণা})$. Finally, the maximum of $SIM_{PATH}(E_i, E_j)$ (0.2 in this case) is assigned as the similarity between রাগী and যন্ত্রণা according to this approach.

Monolingual Distributional (Word2Vec) Semantic Similarity in Bangla (SS_{D_M})
Word2Vec is one of the most effective and efficient models for semantic similarity. It is a distributional or corpus-based approach for finding semantic similarity between word pairs and is emerging as one of the most promising and popular approaches for context modeling. It is a shallow word-embedding model, which means that the model learns to map each word into a low-dimensional continuous vector space from its distributional properties observed in a raw text corpus. A useful property of the Word2Vec model is that it not only generates positive similarity scores between word pairs but also produces negative scores, which indicate that the "word vectors" are opposite in direction and thus that the words have an antonym type of relationship.
As mentioned earlier, Word2Vec is a group of models which generate word embeddings. Word embedding is a collective term for a set of language modeling and feature learning techniques in NLP where words in a given corpus are mapped onto vectors of real numbers. Word2Vec then uses these word vectors to calculate semantic similarity. There are two modes of operation of Word2Vec, i.e., skip-gram and CBOW. CBOW is an architecture of Word2Vec which calculates the word vector for a target word given its surrounding words or context, while skip-gram calculates the context word(s) from the given word. Put another way, CBOW learns to predict the word given the context, whereas skip-gram can be considered the reverse of CBOW, predicting the context given the target word, because we are 'skipping' over the current word in the calculation.
As our work deals with semantic similarity, we wanted to generate the word vectors in order to calculate their distance (an inverse measure of similarity) from each other in semantic space and as such chose the CBOW approach. Moreover, prediction of context was not of much relevance to our work, which further motivated the choice of CBOW.
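The difference between the two modes lies in how training examples are formed from a sentence. The sketch below only generates the (input, output) pairs for each mode and does not train a model; the function names and the window size are assumptions made for illustration.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (target_word, context_word): skip-gram predicts each context word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


sentence = ["a", "b", "c"]
cbow_pairs = list(cbow_examples(sentence, window=1))
skipgram_pairs = list(skipgram_examples(sentence, window=1))
```

For the three-token sentence with a window of one, CBOW produces one example per position, while skip-gram produces one example per (target, neighbour) pair.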
We trained the model on a Bangla corpus (cf. Section 4) and obtained similarity scores given by the cosine of the angle between the word vectors.
Example: The distributional semantic similarity for মা (mother) and মহিলা (woman) is 0.67, which is slightly more than double the score of the previous approach, which was shown to be 0.33.

Cross-Lingual Distributional (Word2Vec) Semantic Similarity Using Translations (SS_{D_C})
The principle of distributional semantics is that the larger the training corpus, the better the resulting model. Bangla is a less digitized language and, therefore, obtaining a well-developed, sizable Bangla corpus is a difficult task. However, getting hold of a good-quality, large English corpus is almost trivial owing to the ready availability of such corpora. The idea here is similar to the SS_{P_C} approach (cf. Section 3.2). We obtain the English translations (Tr) of the Bangla words to be compared (say, $B_1$ and $B_2$) (cf. Figure 1) and compute semantic similarity between every word in $Tr(B_1)$ and every word in $Tr(B_2)$ according to the English Word2Vec model. Finally, the maximum of these similarity scores is assigned as the semantic similarity between $B_1$ and $B_2$.
Example: For the example word pair মা (mother) and মহিলা (woman), the following English translations are obtained.

Experimental Setup
The Bangla corpus used for training the Word2Vec model consisted of 1270 text files. These files were combined into a single text file and all unnecessary information, such as XML-like tags, was removed using a text editor. The English corpus comprised a collection of 182 XML files, all of which were merged into a single XML file, which was ultimately converted into a text file by removing the XML tags. Both corpora are described in Section 5.
In order to build the word vectors, the Word2Vec model was trained on the preprocessed corpora.
• Word2Vec can operate in two modes, i.e., skip-gram or CBOW. For our experiments, we chose the CBOW mode.
• We used two English corpora in our work. The Gigaword corpus (cf. Section 5) is available as pre-trained word vectors created using the GloVe (https://nlp.stanford.edu/projects/glove/) algorithm [36]. However, since we are dealing with the Word2Vec model, we had to convert the GloVe vectors to their corresponding vectors for use with Word2Vec using a converter program (https://github.com/manasRK/glove-gensim).
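As far as we understand, the two plain-text vector formats differ essentially in that the Word2Vec text format begins with a header line giving the vocabulary size and the vector dimension, which the GloVe text format lacks. The following is a minimal sketch of such a conversion under that assumption; the function name is hypothetical and this is not the code of the converter program cited above.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend a '<vocab_size> <dimension>' header to GloVe-style text lines.

    Each input line is expected to be 'word v1 v2 ... vd'.
    """
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    dim = len(rows[0].split()) - 1  # first field is the word itself
    return [f"{len(rows)} {dim}"] + rows


# Hypothetical two-word, three-dimensional GloVe file content.
converted = glove_to_word2vec_lines(["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"])
```

Writing the returned lines to a file, one per line, gives a file with the header that Word2Vec-style text loaders expect.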

Resources Used
The resources used for our work are as follows.
• We used both the Bangla WordNet (http://www.cfilt.iitb.ac.in/indowordnet/) [7] and the English WordNet 3.0 (http://wordnetweb.princeton.edu/perl/webwn) [37] in our study. Some statistics of the Bangla and the English WordNet are given in Table 2, which clearly shows the richer coverage of the English WordNet compared with the Bangla WordNet.
• The gensim (https://radimrehurek.com/gensim/) library, developed for the Python programming language, was used for the implementation of the Word2Vec model.
• Translations of the Bangla words were obtained from three online sources: Google Translate, www.shabdkosh.com and www.english-bangla.com. Coverage is always an issue with dictionaries; bilingual (translation) dictionaries often miss some source words, or some translations of the source words. Therefore, we considered translations from three different sources so that most of the translations are covered for each of the test set words. We used a python package, mtranslate 1.
• The subject matter of the TDIL corpus spans several domains such as literature, social science, commerce, mass media and many more [38]. In total, the TDIL corpus covers texts from 85 subject areas [39]. The BNC Baby corpus comprises texts from four domains: academic, fiction, newspapers and conversations between speakers of British English. Table 3 shows some statistics of the British National Corpus. We also used pre-calculated word vectors trained on a combined corpus including the English Gigaword corpus, 5th edition, developed by Parker et al. [40], and Wikipedia 2014. The Gigaword corpus is a collection of newswire text data that was collected over many years by the LDC (Linguistic Data Consortium) at the University of Pennsylvania. The combined corpus has 6 billion tokens and a vocabulary size of 400,000 uncased words. The vectors are available in four dimension variants: 50, 100, 200 and 300.
• We used the natural language tool kit (NLTK) [41] for its path-based model implementation using the English WordNet.

Evaluation Dataset
For the evaluation of the semantic similarity methods, we used a dataset (the dataset will be made available for public access upon acceptance of the article) comprising 162 Bangla word pairs. The dataset was carefully created by an expert linguist with over twenty years of research experience. The semantic similarity score for each word pair was assigned by students well versed in the problem of semantic similarity and with foundational knowledge of linguistic theory, which gave them the strong intuition needed for assigning their scores. The scores provided by them for each pair were considered the gold standard against which our results were measured. There were five raters in total and each rater provided a score for semantic similarity on a Likert scale of 1 to 5, where 1 indicates complete dissimilarity and 5 indicates absolute similarity.
The selection of the 162 word pairs was controlled by several linguistic-cum-cognitive criteria, which enabled us to delimit the dataset to a fixed number that can be openly verified and measured on account of semantic proximity by the respondents engaged in the experiment. The first criterion invoked for the selection of the dataset is the frequency of occurrence of the word pairs in the present Bangla text corpus. The word pairs that have been selected as controls for the experiment registered a very high frequency of usage across all text domains included in the corpus (Dash [42]). The second criterion is 'imageability', which signifies that each word pair included in the dataset must have a real image-like quality, so that a reference to the word pair evokes a clear and concrete image in the minds of the respondents and they are able to visualize the conceptual interfaces underlying the word pair. The third and last criterion, which is far more important and crucial here, is the 'degree of proximity' between the concepts represented by the word pairs and the respondents reacting to these word pairs within an ecosystem of language use controlled by various praxes of discourse and ethnographic constraints. Although, in a true pragmatic sense, we should refrain from claiming that the present dataset is 'global', we can argue that it is maximally wide and adequately representative for the present scheme of research; it may be further augmented, keeping in mind the nature and requirements of future studies, when we try to measure the length of semantic proximity across cross-lingual databases.

Results and Analysis
Inter-rater agreement was computed according to Fleiss' kappa (κ) and Krippendorff's alpha (α) (cf. Table 4). Pairwise percentage agreement and Cohen's kappa (cf. Table 5) between each pair of raters were also calculated. There is widespread disagreement within the research community regarding the interpretation of Fleiss' kappa scores, partly because an "acceptable" level of inter-rater agreement largely depends on the specific field of study. Among the several interpretations of kappa (κ) values, the one proposed by Landis and Koch [43] seems to have been cited most by researchers. According to this scheme, our raters had a slight agreement among themselves, as a kappa score of 0.17 (cf. Table 4) falls in the range 0.01-0.20, which is the range for this category of agreement. With such a low agreement score among our raters, the correlation results calculated subsequently (between the raters and the system scores) were bound to lie within a spectrum of high and low values, i.e., some raters' scores would have high correlation with the evaluation metrics while others would not. The same fact is further corroborated by the alpha value obtained. From the pairwise inter-rater agreement figures given below, it is clear that several pairs of raters agreed with each other more than they did with the rest.
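For readers unfamiliar with the statistic, the following sketch computes Fleiss' kappa from raw ratings using the standard textbook formula and maps a value to the Landis and Koch labels. The function names and the toy ratings are hypothetical; this is not the code used to produce Table 4, and the sketch assumes every item is rated by the same number of raters and that more than one category is used overall.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for ratings[i] = list of labels given to item i by all raters."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Proportion of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(categories))]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Verbal label for a kappa value following the Landis and Koch scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings: two items, three raters, scores on a 1-5 scale.
perfect = fleiss_kappa([[1, 1, 1], [5, 5, 5]], categories=[1, 2, 3, 4, 5])
```

Under this scale, the value of 0.17 reported above falls in the "slight" band.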
The cells marked in 'green' (in the upper triangle) indicate pairwise inter-rater percentage agreement while those marked in blue (in the lower triangle) indicate pairwise Cohen's kappa agreement.
We compute similarity between each word pair using the four different similarity metrics and compare the metric scores with the gold standard similarity scores as defined by human annotators to evaluate the similarity metrics.Table 6 shows the evaluation results.Each cell in this table indicates the Pearson correlation value between the scores provided by a rater and the corresponding similarity metric scores.The column titled 'majority' denotes the correlation scores obtained when the majority score from among the five annotators is considered.In case of a tie, we selected a score randomly from among the scores that tied.The column titled 'overall' represents the correlation values for a particular metric with respect to all the raters.
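The evaluation steps described above can be sketched as follows: a majority score per word pair with a random choice among tied scores, and the Pearson correlation between rater scores and metric scores. The function names and toy numbers are hypothetical and are not taken from Table 6.

```python
import math
import random
from collections import Counter


def majority_score(scores, rng=random):
    """Most frequent score among the raters; ties are broken at random."""
    counts = Counter(scores)
    top = max(counts.values())
    tied = sorted(s for s, c in counts.items() if c == top)
    return rng.choice(tied)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical rater scores (1-5) and metric scores (0-1) for three word pairs.
rater = [1, 3, 5]
metric = [0.1, 0.5, 0.9]
r = pearson(rater, metric)
```

Pearson correlation is invariant to linear rescaling, so the 1 to 5 Likert scores and the 0 to 1 metric scores can be compared directly without normalization.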
The path-based similarity metric based on the Bangla WordNet, $SIM^{BENG}_{PATH\_BASED}$, produces correlation scores between 0.16 and 0.20. However, it is to be noted that out of a total of 162 test cases, it returned a score of zero in 55 (33.95%) cases. A detailed analysis of these 55 cases revealed the following.

• In 21 (12.96%) cases, one of the words in the test pair was absent from the Bangla WordNet. There were no cases where both words in a test word pair were absent from the Bangla WordNet.
From the statistics above, it can be noticed that the numbers do not add up to the number of cases (55) producing a zero score. This is because there were several cases among the 55 in which a word in one test pair was repeated in another test pair, producing a zero score for both test pairs. As such, we wanted the analysis of the cases to reflect only the unique test pairs. These discrepancies reveal the weaknesses of the Bangla WordNet and, in turn, of the path-based similarity metric $SIM^{BENG}_{PATH\_BASED}$ built on it.
The main motive behind using cross-lingual approaches to semantic similarity was to take advantage of the well-developed resources in English. The path-based similarity model with translation and the English WordNet, $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$, shows significant improvements over its monolingual counterpart, as can be observed from the results in Table 6. It improved the correlation scores across all the annotators, the improvements being very high (more than double) with respect to R2, R4 and R5 and moderate for R1 and R3. The correlation for $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ with respect to the majority voting annotation scores was also found to be more than double that for $SIM^{BENG}_{PATH\_BASED}$, thus marking a significant improvement over the monolingual path-based setting.
$SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ is really put into perspective when we consider only those cases (106, 65.43% of the test set) for which both path-based approaches produced non-zero similarity scores. Such a setup is needed in order to truly appreciate the improvements obtained in light of the English WordNet. This is because several pairs obtained zero scores with the $SIM^{BENG}_{PATH\_BASED}$ approach, thus lowering the correlation for that method. As such, observing those zero scores along with the non-zero scores for other pairs would not lead to comparable results. Therefore, we recomputed the correlation scores considering only those pairs for which both path-based metrics produced non-zero scores, which helps in identifying how much improvement the English WordNet really brings. The results for this setup are presented in Table 7. Correlation values improved for $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ with respect to each annotator as well as majority voting and overall scoring when compared to $SIM^{BENG}_{PATH\_BASED}$.
As a consequence of removing the zero-scored pairs from both path-based metrics, we find several changes in the correlation values for this setup in comparison to when all the test cases were included (cf. Table 6). It can be seen that the $SIM^{BENG}_{PATH\_BASED}$ correlation scores increased for annotators R2, R4 and R5, with a good improvement with respect to the majority and overall scores as well. This was expected, since 55 zero scores were removed from the analysis and only the non-zero scores were used for measuring correlation. However, scores declined for raters R1 and R3. On the other hand, $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ was found to produce lower correlation scores (except for R3 and overall) in comparison to the ones obtained with the metric when all the pairs were considered. Intuitively, eliminating the pairs zero-scored by $SIM^{BENG}_{PATH\_BASED}$ from the dataset also removed pairs for which $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ obtained good scores, which in turn reduced its correlation values. However, the overall $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ correlation score remains the same. It is evident from Table 7 that, although the correlations improve substantially for $SIM^{BENG}_{PATH\_BASED}$ on this subset, $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ still outperforms $SIM^{BENG}_{PATH\_BASED}$ even on this dataset.
Compared to 55 (33.95%) cases of zero scores for $SIM^{BENG}_{PATH\_BASED}$, $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ resulted in zero scores for only 2 (1.23%) cases, a significant (96.36%) improvement, as is visible from both Tables 6 and 7. In both of these cases, a proper translation of the Bangla word was not obtained using our resources, the words being রোদ্দুর (sunlight) and ধারে (nearby). Thus, this method is reliant on the translation resources and is exposed to the errors introduced by the translation process. All in all, the improvement can be attributed to the wide coverage of the English WordNet. However, this method did show weaknesses in certain cases, e.g., in computing the similarity between বন্যা (flood) and পর্বত (mountain). The translations produced by the translation tools for these two words are as follows.
The path-based similarity between 'spate' and 'mountain' turned out to be 1 since spate#n#1 and mountain#n#2 belong to the same synset ("a large number or amount or extent") in English WordNet.
Although, according to the English WordNet, this approach results in such a high similarity score between বন্যা and পর্বত, native speakers seldom think of this similarity, which rests on a metaphoric (and rare) usage of these two words. This example is perhaps an indication that, when considering similarity between word pairs, we should not consider their very rare usages.
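The effect described above can be reproduced on a toy taxonomy. The sketch below is illustrative only: the hypernym edges, sense inventory and translation lists are invented for the example, the path similarity is assumed to be the common 1/(1 + shortest path length) form, and taking the maximum over all translation and sense pairs is an assumption about how the cross-lingual score is aggregated.

```python
from collections import deque
from itertools import product

# Toy hypernym edges (child -> parent); a stand-in for a real WordNet.
HYPERNYM = {
    "flood_event": "natural_event",
    "natural_event": "entity",
    "landform": "natural_object",
    "natural_object": "entity",
    "large_amount": "quantity",
    "quantity": "entity",
}
# Toy sense inventory: 'spate' and 'mountain' share one rare sense.
SENSES = {
    "flood": ["flood_event"],
    "spate": ["flood_event", "large_amount"],
    "mountain": ["landform", "large_amount"],
}


def _neighbours(node):
    """Parent and children of a synset in the toy taxonomy."""
    linked = {child for child, parent in HYPERNYM.items() if parent == node}
    if node in HYPERNYM:
        linked.add(HYPERNYM[node])
    return linked


def path_length(a, b):
    """Shortest number of edges between two synsets, or None if unconnected."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in _neighbours(node):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def word_similarity(w1, w2):
    """Best path similarity over all sense pairs; 0.0 if a word is missing."""
    best = 0.0
    for s1, s2 in product(SENSES.get(w1, []), SENSES.get(w2, [])):
        dist = path_length(s1, s2)
        if dist is not None:
            best = max(best, 1.0 / (1 + dist))
    return best


def cross_lingual_similarity(translations1, translations2):
    """Best word similarity over all pairs of candidate translations."""
    return max((word_similarity(a, b)
                for a, b in product(translations1, translations2)),
               default=0.0)
```

With these toy data, `cross_lingual_similarity(["flood", "spate"], ["mountain"])` is 1.0 because a single rare shared sense dominates the maximum, whereas `word_similarity("flood", "mountain")` is only 0.2.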
The Bangla Word2Vec model $SIM^{BENG}_{WORD2VEC}$ produced really poor correlation scores compared to the path-based models, with correlation scores ranging from 0.08 to 0.16. However, an interesting finding is that it correlated better than the $SIM^{BENG}_{PATH\_BASED}$ model with respect to the majority score. The $SIM^{BENG}_{WORD2VEC}$ correlation score with respect to the majority score was also found to be higher than its correlation scores with respect to the individual rater scores. It is interesting to note that a zero similarity score for a test word pair from either of the path-based methods can be due to a variety of factors, as discussed before. However, a zero score from a distributional approach simply implies that either (or both) of the words is absent from the corpus on which the model was trained, so that their vectors could not be generated.
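The zero-score behaviour of the distributional approach can be made concrete with a small sketch. It assumes that similarity is the cosine between word vectors and that an out-of-vocabulary word yields a score of zero; the function names and the `vectors` dictionary are hypothetical and stand in for a trained Word2Vec model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_similarity(w1, w2, vectors):
    """Cosine similarity of two words; 0.0 if either has no vector."""
    if w1 not in vectors or w2 not in vectors:
        return 0.0  # word never seen in the training corpus
    return cosine(vectors[w1], vectors[w2])
```

In this setup a zero score has a single cause, namely a missing vector, which is why it is easier to diagnose than a zero score from the path-based methods.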
The cross-lingual Word2Vec models $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ produced much better correlation scores than the $SIM^{BENG}_{WORD2VEC}$ model with respect to each annotator. Predictably, of the two English Word2Vec models, the model (pre)trained on the Gigaword corpus performed better than the one trained on the BNC corpus, with a sharp increase in correlation score with respect to majority voting; however, the scores either declined or stayed the same for raters R3 and R5. The comparative study (cf. Table 6) of the results obtained with $SIM^{BENG}_{WORD2VEC}$ and $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ indicates that using a richer and more diverse corpus results in better word vectors and, in turn, better similarity scores.
When contrasted with $SIM^{BENG}_{WORD2VEC}$, the distributional model trained on the Gigaword corpus showed as much as a 125% increase in correlation score with respect to rater R1, while it showed a maximum of 87.5% increase over the model trained on the British National Corpus for the same rater. Correlation scores improved for annotators R1, R2 and R4, increasing to almost double, whereas the improvement was slightly less evident for R3 and R5 when contrasting $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ Gigaword with $SIM^{BENG}_{WORD2VEC}$. Similar to $SIM^{BENG}_{WORD2VEC}$, the correlation score with respect to the majority score for the Word2Vec model trained on the Gigaword corpus was higher than the correlation scores with respect to each individual annotator.
It is to be noted that $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ BNC performs better than $SIM^{BENG}_{WORD2VEC}$ despite the English BNC corpus being smaller than the Bangla TDIL training data. This result is quite surprising. One could perhaps conjecture that Bangla is a morphologically richer language and that therefore, for corpora of comparable size, the Bangla corpus would have a much larger vocabulary than the English corpus. However, that is not the case here; in fact, the English corpus, despite being smaller than the Bangla corpus, has the larger vocabulary. Linguistically speaking, there are other reasons behind this phenomenon, which are not elaborated in this paper.
The $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ models could not beat the performance of the $SIM^{BENG}_{PATH\_BASED}$ model with respect to raters R1, R2 and R3 and overall; however, they correlate better than the $SIM^{BENG}_{PATH\_BASED}$ model with respect to R4, R5 and the majority score. These observations were quite consistent with our expectations and can be explained by the robust nature of the cross-lingual distributional model, owing to the vast vocabulary of the English corpora, which leads to high-quality word vectors.
It was presupposed that, when detecting similarity between Bangla words using the distributional models, the monolingual Word2Vec approach would yield scores nearly as well correlated with human judgments as those of the cross-lingual approach. This is because the language in which we are trying to discover similarity is Bangla and, as such, the Bangla corpora should have been able to provide more insightful and varied contexts and, in turn, better word embeddings suitable for measuring semantic similarity in Bangla. However, as can be seen from Table 6, this is not the case.
In order to further investigate and visualize how human scores relate to the similarity metric scores, we plotted graphs (for all the metrics) in which the x-axis denotes the test pair (i.e., word pair) ids and the y-axis represents $Ratio^{H}_{S}$ = (HumanScore)/(SystemScore). These graphs were created by first up-scaling the system scores, which originally lie in the [0, 1] range, to the annotator scoring range [1, 5] so as to avoid division by zero errors, and then plotting the $Ratio^{H}_{S}$ values. The reason for choosing such a plotting scheme is to examine the proximity of the plotted points to the y = 1 line in the graphs. If a similarity metric agrees perfectly with a human annotation (in which case r = 1), the corresponding points fall on the y = 1 line. The more points lie on or near this line, the stronger the correlation between the metric considered and the human annotation scores is expected to be. Both the human scores and the up-scaled system scores lie in the [1, 5] range.
Table 9 shows some statistics of the results presented in Figure 3. The number of points lying in the vicinity of the y = 1 line in these graphs gives a strong indication of the correlation. We observed that both $SIM^{BENG}_{PATH\_BASED}$ and $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ produced the highest number of points (92) on the y = 1 line, followed by $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ Gigaword (61), $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ BNC (43) and $SIM^{BENG}_{WORD2VEC}$ (19). The path-based metrics typically provided lower similarity scores, yielding $Ratio^{H}_{S}$ greater than one, which is visible from the majority of the plotted points lying above the y = 1 line in Figure 3a,b. Most of the points in the $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ graph (cf. Figure 3b) lie close to the y = 1 line, suggesting that this metric provides the most accurate similarity scores, a fact which is further corroborated by the correlation results (cf. Table 6). Furthermore, a sharp drop in the spread of the data points between the graphs of $SIM^{BENG}_{PATH\_BASED}$ (cf. Figure 3a) and $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ can also be observed, indicating that $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ produces better correlated similarity scores than $SIM^{BENG}_{PATH\_BASED}$, which shows divergent scores all across its plot. This shows what a marked improvement translation brings to semantic similarity.
Among the graphs for $SIM^{BENG}_{WORD2VEC}$ and $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ BNC, the graph of the latter showed less dispersion from the y = 1 line, meaning that the scores produced by that method were better correlated with the human judgments, a fact which can also be verified from Table 6. Figure 3e shows the graph for the $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ Gigaword metric. When examined in light of the other two distributional methods, it was found to produce the best plot for that class of methods. The points were relatively more divergent from the y = 1 line, although a higher number of points lay on the y = 1 line (61) as compared to the other two distributional methods (19 and 43).
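The quantities behind these plots can be sketched as follows. The paper states only that system scores are up-scaled from [0, 1] to [1, 5]; the linear mapping used here, the function names and the tolerance parameter are assumptions made for illustration.

```python
def upscale(system_score, low=1.0, high=5.0):
    """Linearly map a system score from [0, 1] to [low, high]."""
    return low + (high - low) * system_score


def ratio_hs(human_score, system_score):
    """Human score divided by the up-scaled system score."""
    return human_score / upscale(system_score)


def count_on_line(human_scores, system_scores, tol=1e-9):
    """Number of word pairs whose ratio lies on the y = 1 line."""
    return sum(
        1
        for h, s in zip(human_scores, system_scores)
        if abs(ratio_hs(h, s) - 1.0) <= tol
    )
```

Because the up-scaled score is never below 1, the division is always defined, and the ratio is bounded by 0.2 (human score 1, system score 1.0) and 5 (human score 5, system score 0.0).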
Our initial intuition drove us to believe that the Word2Vec model would produce the best results. However, the correlation scores obtained proved otherwise. Overall, the $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ model provides the best correlation scores with respect to all individual raters, the majority score and all rating scores together (overall), and these are much higher than the correlation scores yielded by the other similarity metrics. Finally, in comparison to the Word2Vec models, the path-based metrics performed far better with respect to the overall correlation scores (cf. Table 6), an explanation for which is proffered in Section 6.3. Clearly, the path-based model has visible advantages even when compared with one of the more robust, state-of-the-art models for semantic similarity, i.e., Word2Vec.

Comparative Analysis of the Various Methods
When discovering semantic similarity in the monolingual domain, the path-based model clearly performs better than the Word2Vec model, as can be seen from the correlation scores. This is because the Word2Vec algorithm requires a well-designed corpus with a large vocabulary and contexts that properly reflect the correct senses of a word in order to build a comprehensive model for detecting similarity. However, obtaining such a well-designed large corpus in Bangla is a difficult task. The correlation scores obtained with $SIM^{BENG}_{WORD2VEC}$ are a clear indication of the limitations of the corpus used. Even though the Bangla WordNet is lacking in terms of coverage, the higher correlation scores provided by $SIM^{BENG}_{PATH\_BASED}$ in comparison to $SIM^{BENG}_{WORD2VEC}$ are therefore fairly justifiable. It was clear that the cross-lingual approach via translation helped improve the similarity scores for both the path-based and Word2Vec models. However, the cross-lingual approach works better for the path-based metric than for the distributional ones. This could perhaps be attributed to the fact that, when obtaining cross-lingual senses (through translations) in English for a given Bangla word, we retrieve the most appropriate or the nearest one in sense from the bucket of all possible conceptual equivalents of the word, whereas in the Word2Vec approach we deal with only a subset of the translations, limited by the corpus, where the possibility of having multiple translational equivalents is restricted by contextual constraints. This explanation is also in line with the way a human annotator assigns scores to word pairs: annotators consider the possible senses that the words within a pair encompass and which sense pair has the strongest conceptual proximity.
When obtaining the entire array of translations (senses) of a word, some or all of them may be absent from the WordNet or a corpus (even the word itself may be absent from both of them). The Word2Vec approach depends only on a raw corpus to generate a model for calculating similarity. The problem with corpora is that they may not include the word itself and, even if they do, may not contain the contexts needed to capture all the possible senses of the word. The advantage of a corpus, on the other hand, is that it can provide contexts for a word that represent new senses not present in the WordNet. However, when assigning a similarity score to a word pair, a rater considers all possible senses of the words but rarely takes into account the newer senses which may have evolved with time and been incorporated into the present corpus.
Overall, the cross-lingual path-based metric excels due to the excellent coverage of concepts in the English WordNet. A missing word is a rare occurrence, as can be seen from the number of cases (2, 1.23%) producing a zero score with $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$, as opposed to the number of cases producing zero for $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ BNC (8, 4.94%) and $SIM^{BENG \rightarrow ENG}_{WORD2VEC}$ Gigaword (31, 19.14%).

Conclusions and Future Work
Linguistic resources available for poorly resourced languages like Bangla are few in number and underdeveloped when compared with those of richly resourced languages like English. This is one of the main reasons why research in under-resourced languages relies on either unsupervised or cross-lingual techniques. Our work clearly highlights the power of the Word2Vec model and its ability to overcome the limitations of thesaurus-based approaches, the biggest drawback of which is the inability to calculate similarity in the absence of resources like WordNet. Word2Vec is an extremely efficient model, capable of analyzing large volumes of text in minutes and generating similarity scores for word pairs present in the corpus. However, the model fails to tackle problems such as words with multiple meanings and out-of-vocabulary words. These issues deserve further exploration.
Semantic similarity plays a crucial role in many NLP applications. Even without such practical relevance, semantic similarity is in itself a fundamental linguistic question and a crucial conceptual hypothesis. Since it is a subjective issue, it is bound to receive different interpretations from different evaluation approaches. An accurate understanding of semantic similarity means getting a closer look into the enigmatic world of human cognition, to speculate on how human beings associate words (or word pairs, for that matter) based on their sense relations, semantic closeness and conceptual proximity. The present study has theoretical relevance in that it helps us identify the probability of semantic association of a word following a given word, with or without reference to any given context. Such a knowledge base is indispensable for many tasks of language engineering, such as machine translation, machine learning, information retrieval, lexical clustering, text categorization, word sense induction, language teaching, semantic nets and many more.
The objective of our work was to determine semantic similarity between Bangla word pairs. We have proposed translation-based approaches, which make use of existing algorithms and show improved results. We have also identified that the strategies adopted for some advanced languages like English cannot be used blindly on less resourced languages like Bangla, since the successful operation of those strategies requires a large amount of processed and structured linguistic resources in the form of corpora and WordNets, which are not yet available for these poorly resourced languages. However, the most striking finding of our study is that language corpora, whether for richly or poorly resourced languages, are not a useful hunting ground for executing semantic similarity measurement techniques. Owing to certain contextual constraints, corpora usually fail to reflect the wide range of possible semantic similarity of words, which a human being or a WordNet can easily do.
In future, we would also like to compare the semantic similarity measures of Wu and Palmer [14] and Slimani et al. [15] with the path-based similarity and the distributional similarity employed in this paper.
Semantic similarity is a crucial NLP task for both well-resourced and under-resourced languages such as English, Hindi and Bangla. The next step in this direction should be an effort to enrich WordNets as well as to create better corpora, so that the semantic similarity problem can be addressed for any word pair.

Figure 2. A snapshot of the hypernym-hyponym relations in the Bangla WordNet.
With the human scores and the up-scaled system scores both lying in the [1, 5] range, $Ratio^{H}_{S}$ lies in the [0.2, 5] range. Since the score pairs (2, 1) and (1, 2) result in $Ratio^{H}_{S}$ values of 2.0 and 0.5, respectively, and both ratios are equally divergent from y = 1, we place the lines y = 2 and y = 0.5 equidistant from the y = 1 line in the graphs. Similarly, the line pairs (3, 0.333), (4, 0.25) and (5, 0.2) are also shown equidistant from the y = 1 line in the graphs. The graphs for the five metrics are shown in Figure 3.

Table 1. Semantic Similarity Methods (P
3 (https://github.com/mouuff/mtranslate), an API for collecting the translations from Google Translate. The source code of mtranslate was modified to collect translations from all three sources.
• Bangla Corpus: The Technology Development for Indian Languages (TDIL) (http://www.isical.ac.in/~lru/downloadCorpus.html) corpus was used for training the Word2Vec model in Bangla. This corpus is a collection of modern Bangla prose texts published between 1981 and 1995. Table 3 provides some statistics of the TDIL corpus.
• English Corpus: The British National Corpus (BNC)-Baby Edition (http://ota.ox.ac.uk/desc/2553) maintained by the University of Oxford was used for training the English Word2Vec model.

Table 3. Statistics of the corpora used.

Table 6. Pearson correlation values for all approaches.

Table 7. Correlation scores for cases where both $SIM^{BENG}_{PATH\_BASED}$ and $SIM^{BENG \rightarrow ENG}_{PATH\_BASED}$ produced non-zero scores.

Table 9. Statistics of the points plotted in Figure 3.