A Survey on Portuguese Lexical Knowledge Bases: Contents, Comparison and Combination †

Abstract: In the last decade, several lexical-semantic knowledge bases (LKBs) were developed for Portuguese, by different teams and following different approaches. Most of them are open and freely available for the community. Those LKBs are briefly analysed here, with a focus on size, structure, and overlapping contents. However, we go further and exploit all of the analysed LKBs in the creation of new LKBs, based on the redundant contents. Both original and redundancy-based LKBs are then compared, indirectly, based on the performance of automatic procedures that exploit them for solving four different semantic analysis tasks. In addition to conclusions on the performance of the original LKBs, results show that, instead of selecting a single LKB to use, it is generally worth combining the contents of all the open Portuguese LKBs, towards better results.


Introduction
Lexical-semantic knowledge bases (LKBs) are computational resources that organise words according to their meaning. In addition to other features, they should have a significant coverage of the words of a language, which, according to their possible senses, should be connected by means of semantic relations. Princeton WordNet [1] is the paradigmatic resource of this kind, for English, used in many natural language processing (NLP) tasks, and with a model also adapted to many other languages, including Portuguese. However, the first Portuguese WordNet [2] is not available to be used by the research community and the first open alternatives were only developed in the last decade.
In order to cope with the lack of such a resource, several open Portuguese LKBs have since been created and most became available for download, either upon a paid license (e.g., MWN.PT (http://mwnpt.di.fc.ul.pt/)) or for free. However, those LKBs were developed by different teams, following different approaches, which resulted in LKBs with variable coverage and with slightly different features regarding their organisation. Due to the difficulties inherent in crafting such a broad resource manually, most Portuguese LKBs have some degree of automation in their creation process, which increases the chance of noise. Furthermore, not all follow the full WordNet model. For instance, even though all of them cover one or more types of semantic relations, not all handle word senses. In fact, none of them is as consensual for Portuguese as Princeton WordNet, which was created manually and has a large community of users, is for English. Finally, some Portuguese LKBs are not large enough, while others have an interesting size but include several incorrect, infrequent or useless relations or lexical items.
In this paper, ten open Portuguese LKBs are characterised in terms of covered lexical items and semantic relations. The redundancy across them is then analysed, towards the creation of (potentially) more useful LKBs. All the LKBs, including the new ones, are finally compared indirectly, when exploited in semantic similarity tasks with available benchmark datasets for Portuguese, namely: (i) given a word, selecting the most similar word from a predefined set; (ii) quantifying the semantic similarity of two words; (iii) filling a blank in a sentence with the correct word from a set; and (iv) quantifying the semantic textual similarity between two sentences. In addition to confirming our intuition that there are advantages in combining different LKBs, this can be seen as the first systematic comparison of the open Portuguese LKBs. This is an extended version of a previously published paper [3]. Here, a more detailed comparison is made, including a conversion table that maps semantic relation names in different LKBs, and two new resources are considered (ConceptNet and CARTÃO), as well as the most recent version of another (PULO). This resulted in different redundancy-based LKBs and, consequently, new experimentation results.

Related Work
The current scenario for Portuguese LKBs can be seen as atypical. There are currently many open LKBs for this language, but none is as consensual as Princeton WordNet [1] is for English. The latter started to be used by the NLP community at a time when there was nothing similar in terms of representation of the mental lexicon, with its coverage, granularity, reliability and, of course, the key factor of being freely available. On the other hand, the first Portuguese WordNet [2] was only released about a decade later and was not available to be used by the research community. Therefore, several related projects started, concurrently, for Portuguese. Those include several wordnets [4] and other simpler LKBs that, in some cases, may replace a wordnet. Looking at the Wordnets in the World list available at the site of the Global WordNet Association (http://globalwordnet.org/wordnets-in-theworld/ (January 2018)), one can see that the previous problem is probably not specific to Portuguese. There are other languages with more than one wordnet, available or not under different licenses. For other languages, there is one "main" LKB used by the NLP community, possibly further enriched or aligned with different knowledge bases in specific domains or kinds of knowledge. For instance, there are several extensions for Princeton WordNet (e.g., subject field codes [5]), as well as alignments with other lexical resources (e.g., FrameNet and VerbNet [6], or Wikipedia and Wiktionary [7]). WordNet is also the "core" of most multilingual wordnets (e.g., EuroWordNet [8], MultiWordNet [9], Open Multilingual WordNet [10], MCR [11]) and of multilingual knowledge bases that cover linguistic and encyclopaedic knowledge (e.g., Universal WordNet [12], BabelNet [13]). This is probably why there is not much work similar to what is presented here, where LKBs that aim at covering more or less the same kind of knowledge are combined. On the other hand, redundancy models have been proposed for assessing the confidence of relations automatically extracted from corpora [14]. The main intuition is that relation instances extracted more often, from different sources, are more likely to be correct or useful.

Open Portuguese LKBs
Ten open Portuguese knowledge bases with lexical-semantic information were identified and explored in this work, namely:

• Semantic relations between Portuguese words in the ConceptNet [22] semantic network, which includes common-sense knowledge, lexical knowledge and others.
As these resources do not share exactly the same structure, to enable their comparison and integration, they were all reduced to a set of relation instances of the kind "x related-to y", where x and y are lexical items and related-to is the name of a semantic relation. For synset-based LKBs, wordnets and thesauri, synsets had to be deconstructed. For example, the instance {porta, portão} partOf {automóvel, carro, viatura} resulted in: (porta synonymOf portão), (automóvel synonymOf carro), (automóvel synonymOf viatura), (carro synonymOf viatura), (porta partOf automóvel), (porta partOf carro), (porta partOf viatura), (portão partOf automóvel), (portão partOf carro), (portão partOf viatura). In English, {door, gate} partOf {automobile, car} results in: (door synonymOf gate), (automobile synonymOf car), (door partOf automobile), (door partOf car), (gate partOf automobile), (gate partOf car). Adopted relation names were those defined in the project PAPEL [19], a rich set that covered most relation types in all the LKBs. However, some relation names, in other LKBs, had to be converted to a common name, always considering their semantics. Table 1 presents the performed conversions. Inverse relation names are omitted from this table, but they were also considered in the conversion process.
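The synset deconstruction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `deconstruct` and the representation of a synset as a list of words are hypothetical, and the code is not the conversion tool actually used for the LKBs.

```python
from itertools import combinations, product

def deconstruct(source_synset, relation, target_synset):
    """Flatten one synset-level relation into word-level triples (x, relation, y)."""
    triples = []
    # Every pair of words inside the same synset becomes a synonymy instance.
    for synset in (source_synset, target_synset):
        for a, b in combinations(synset, 2):
            triples.append((a, "synonymOf", b))
    # Every word of the source synset is related to every word of the target synset.
    for a, b in product(source_synset, target_synset):
        triples.append((a, relation, b))
    return triples

# The example from the text: {porta, portão} partOf {automóvel, carro, viatura}.
triples = deconstruct(["porta", "portão"], "partOf", ["automóvel", "carro", "viatura"])
for triple in triples:
    print(triple)
```

For the example in the text, this yields the same ten relation instances: four synonymy pairs and six partOf pairs. It also shows why deconstruction inflates the number of instances, since a relation between synsets of sizes m and n produces m × n word-level instances, in addition to the synonymy pairs.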
The size and type of contents of the LKBs obtained after conversion are summarised in Tables 2 and 3. Table 2 focuses on the number of covered lexical items, organised according to their part-of-speech (POS). Given that, without a context, the same lexical item may have different POS, the table also provides the number of distinct lexical items, when the POS is not considered. Table 3 targets the number of relations covered by each LKB, grouped according to their broader types. The total number of relations is also provided, together with the average degree of each word, which measures the average number of relations involving each word in the network.
A remark should be made on ConceptNet. In addition to being a slightly different knowledge base, not exclusively focused on lexical-semantic knowledge, it was also the last one to be included in this work. After analysing the set of available relations, we found that several were not covered by our set of relation types. From this set, we discarded lexical relations, such as those related to word forms (e.g., FormOf, DerivedFrom, EtymologicallyRelatedTo), not so useful for semantic analysis, but we kept other interesting and potentially useful relation types (e.g., Desires, MotivatedByGoal). In the previous tables, the numbers of the latter types are only considered in the total, which is why the given number is followed by an asterisk (*). It should also be added that, for the converted relations, we only kept those for which we could identify the POS of both arguments. For this purpose, we used the POS provided by ConceptNet. However, as this information is only provided for some items, when it was not available, the possible POS of each word was automatically checked in the corpora of the AC/DC service [23]. More precisely, we considered that a word could have every POS with which its lemma occurred in AC/DC at least five times. It should also be mentioned that, although relation instances in the current version of ConceptNet have an attached confidence weight, the majority of the instances between two Portuguese words (≈95%) have this parameter set to 1.0, so it was not used.
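The POS check applied to ConceptNet items without POS information can be illustrated as follows. This is a minimal sketch: the frequency table `toy_counts` holds invented numbers rather than real AC/DC figures, and the helper names `possible_pos` and `keep_instance` are hypothetical. Only the threshold of five occurrences comes from the text.

```python
MIN_OCCURRENCES = 5  # threshold stated in the text

def possible_pos(lemma, corpus_counts, min_occurrences=MIN_OCCURRENCES):
    """Return every POS with which the lemma occurs at least min_occurrences times."""
    counts = corpus_counts.get(lemma, {})
    return {pos for pos, n in counts.items() if n >= min_occurrences}

def keep_instance(x, y, corpus_counts):
    """Keep a relation instance only if a POS can be identified for both arguments."""
    return bool(possible_pos(x, corpus_counts)) and bool(possible_pos(y, corpus_counts))

# Hypothetical lemma/POS frequencies, standing in for counts taken from a corpus.
toy_counts = {
    "casa": {"N": 1200, "V": 85, "ADJ": 2},
    "rapidamente": {"ADV": 430},
}

print(possible_pos("casa", toy_counts))
print(keep_instance("casa", "rapidamente", toy_counts))
print(keep_instance("casa", "xpto", toy_counts))
```

In this toy table, "casa" is accepted as a noun and as a verb but not as an adjective, because the adjective reading falls below the threshold.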
Although the LKB with the most lexical items is the one obtained from DA (≈95,000 distinct items), it contains substantially fewer relation instances than TeP, which covers ≈490,000 synonymy and antonymy instances but no other relation type. PAPEL, DA, OWN-PT and WN.Br all contain more than 100,000 relation instances. This is also noticeable from the average degree of each of those LKBs, which is the lowest in DA. On the other hand, WN.Br only covers verbs and is the smallest LKB in terms of lexical items, but the average degree of its words is substantially higher than in the others (36.9, followed by 11.9 in TeP). In fact, though lower than in WN.Br, the average degrees of the synset-based LKBs are higher than for the others, which is, to some extent, a consequence of the synset deconstruction process.
On the relation types, all LKBs cover synonymy; antonymy is not covered by OT.PT, WN.Br and Port4Nooj; and hypernymy is not covered by TeP and OT.PT because the latter are originally synset-based thesauri. Other types are present in several LKBs (e.g., part, cause, property), but some types are only found in the LKBs extracted from dictionaries. ConceptNet also has an interesting range of covered types, where we highlight the quantity of purpose-of and place-of relations.

Redundancy in Portuguese LKBs
Open Portuguese LKBs are not only organised in slightly different models. They were also created with different approaches, most of which involve automatic or semi-automatic steps for exploiting available resources, such as dictionaries or encyclopaedias, not only in Portuguese, but also in other languages. Therefore, although they try to cover the whole language, they end up having different granularities and contents, not only in terms of covered relation types, but also of lexical items and relation instances, some of which are less useful for some tasks, or even incorrect. Table 4 shows the number of relation instances grouped by relation type and number of LKBs they were found in.
Table 5 complements Table 4 and gives an idea of the typical knowledge covered by each LKB. More precisely, for each LKB, the included relation instances are grouped into those that are exclusive to the target LKB, those that are in only one more LKB (+1), and those that are in only two more (+2). This table shows, for instance, that ConceptNet is the network with the most non-overlapping knowledge. The LKBs extracted from dictionaries contain the lowest proportion of knowledge that is not found in another LKB, but this proportion is still high: ≈57% for DA and ≈64% for PAPEL and Wiktionary.
The majority of relation instances found (≈82%) is in only one LKB, ≈13% is in two, ≈3% in three and just ≈1% in four. Only synonymy, and a residual number of antonymy and hypernymy instances, are in six or more LKBs, which is expected because those also happen to be the types covered by more LKBs. Our intuition is that the more resources an instance is in, the more likely it is to transmit a consensual, frequent and useful relation. This does not mean, however, that most of the relations found in only one LKB are incorrect or not useful. It only means that the latter set should contain a higher proportion of relations that are either incorrect, very specific or useful only in a more limited domain of application, when compared to the set of relations in more than one LKB. This is confirmed by observed examples, including those in Table 6, which contains relation instances that are in nine to three LKBs. Each redundancy level includes only instances of relation types that were not present in the previous level, or were but with arguments with a different POS. On the other hand, instances that only occur in one LKB are more likely to either be incorrect, due to noise in the automatic process, or to involve very specific meanings, which makes them less useful. Observed examples also confirm this. Some of them are presented in Table 7, which shows a list of relation instances that are in a single LKB, selected randomly for different relation types.
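The redundancy analysis can be sketched as a simple count of how many LKBs contain each relation instance. The sketch below relies on several assumptions that are not confirmed by the text: that a redundancy-based LKB such as Redun2 keeps the instances found in at least two LKBs, that synonymy and antonymy are treated as symmetric when matching instances, and that the three toy LKBs A, B and C (invented here) stand in for the real resources.

```python
from collections import defaultdict

SYMMETRIC = {"synonymOf", "antonymOf"}  # assumption: argument order is irrelevant here

def normalise(triple):
    """Give symmetric relations a canonical argument order so that duplicates match."""
    x, relation, y = triple
    if relation in SYMMETRIC and y < x:
        x, y = y, x
    return (x, relation, y)

def redundancy_index(lkbs):
    """Map each normalised triple to the set of LKB names that contain it."""
    index = defaultdict(set)
    for name, triples in lkbs.items():
        for triple in triples:
            index[normalise(triple)].add(name)
    return index

def redundancy_lkb(index, min_sources):
    """Keep the triples found in at least min_sources LKBs."""
    return {t for t, sources in index.items() if len(sources) >= min_sources}

# Toy LKBs with invented contents.
toy_lkbs = {
    "A": {("carro", "synonymOf", "automóvel"), ("porta", "partOf", "carro")},
    "B": {("automóvel", "synonymOf", "carro"), ("roda", "partOf", "carro")},
    "C": {("carro", "synonymOf", "automóvel")},
}

index = redundancy_index(toy_lkbs)
redun2 = redundancy_lkb(index, 2)
print(redun2)
```

Raising `min_sources` trades coverage for reliability: in the toy data, only the synonymy instance survives at level two, while both partOf instances are dropped because each occurs in a single LKB.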

Comparing Portuguese LKBs Indirectly
Due to the time-consuming work required for evaluating the contents of each LKB manually, plus the subjectivity of such a task, the Portuguese LKBs were compared indirectly, when exploited to solve semantic similarity-related tasks, for which datasets, here used as benchmarks, are available. Experiments performed in this comparison cover four different tasks, namely: selecting the most similar word from a small set (B²SG, Section 5.1); computing the semantic similarity between pairs of words (SimLex-999, Section 5.2); selecting the most suitable word, in a set, for a blank in a sentence (cloze questions, Section 5.3); and computing the semantic similarity between pairs of sentences (ASSIN, Section 5.4). Table 9 organises those benchmark tests according to their type.

Selecting the Most Similar Word from a Small Set
The B²SG [25] test is similar to the WordNet-Based Synonymy Test [26], but based on the Portuguese part of BabelNet [13] and partially evaluated by humans. It contains frequent Portuguese nouns and verbs (targets), each followed by four candidates, of which only one is related. It is organised in six files: two each for synonymy, hypernymy and antonymy, one between nouns and one between verbs. Table 10 illustrates the B²SG test with the first line of each file. The correct answer is always the first candidate, followed by three distractors. Although the test was created for evaluating less structured resources, such as distributional thesauri, we analysed how many correct relations of this test are covered by the Portuguese LKBs. Furthermore, for the uncovered instances, the correct alternative was guessed from the top-ranked candidate, after running the Personalized PageRank [27] algorithm in each LKB, for 30 iterations, using the target word as context.
Table 11 presents the number of covered (In) and guessed (Guess) relation instances for each LKB. Coverage numbers highlight known limitations of some LKBs. For instance, antonymy relations extracted from dictionaries are mostly between adjectives; synset-based thesauri do not cover hypernymy; only the wordnet-based LKBs cover hypernymy between verbs; and WN.Br covers only verbs. However, for this specific test, some limitations could be minimized by exploiting the structure of the LKB. As expected, the highest coverage and proportion of guessed relations are obtained for the All LKB, for which 97.4% of the instances are guessed. It is followed by OWN-PT on both coverage and guesses, except for the guesses of hypernymy and antonymy between nouns. In the former, CARTÃO gets the second highest number, followed closely by Redun2, which gets the second highest number of guesses of antonymy relations between nouns. However, we suspect that these numbers are positively biased towards OWN-PT because it is currently integrated in BabelNet.
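The guessing step can be illustrated with a small pure-Python version of Personalized PageRank. This is a sketch under stated assumptions, not the implementation used in the experiments: relation instances are treated as undirected and unweighted edges, the damping factor of 0.85 is a conventional default that the text does not specify, and the toy triples and function names are invented. Only the 30 iterations and the use of the target word as context come from the text.

```python
from collections import defaultdict

def build_graph(triples):
    """Undirected adjacency sets built from (x, relation, y) triples."""
    graph = defaultdict(set)
    for x, _, y in triples:
        if x != y:
            graph[x].add(y)
            graph[y].add(x)
    return graph

def personalized_pagerank(graph, context, iterations=30, damping=0.85):
    """Power iteration in which random jumps land only on the context words."""
    seeds = [w for w in dict.fromkeys(context) if w in graph]
    if not seeds:
        return {}
    jump = {w: 1.0 / len(seeds) for w in seeds}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {w: (1.0 - damping) * p for w, p in jump.items()}
        for node, score in rank.items():
            share = damping * score / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] = nxt.get(neighbour, 0.0) + share
        rank = nxt
    return rank

def pick_candidate(graph, target, candidates):
    """Guess the related candidate as the one with the highest rank for the target."""
    rank = personalized_pagerank(graph, [target])
    return max(candidates, key=lambda c: rank.get(c, 0.0))

# Toy LKB with invented contents.
toy_triples = [
    ("carro", "synonymOf", "automóvel"),
    ("táxi", "hyponymOf", "carro"),
    ("gato", "synonymOf", "bichano"),
    ("gato", "hyponymOf", "animal"),
    ("cão", "hyponymOf", "animal"),
]
graph = build_graph(toy_triples)
rank = personalized_pagerank(graph, ["carro"])
answer = pick_candidate(graph, "carro", ["táxi", "gato", "animal", "cão"])
print(answer)
```

Because the walk always restarts at the target word, candidates in a part of the network that is not connected to the target receive no rank, which is why structure can compensate for a missing direct relation but not for a missing word.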

Computing the Similarity between Word Pairs
SimLex-999 [28] is a recent benchmark for assessing methods for computing semantic similarity. It contains 999 pairs of words, with the same POS, and their similarity score, given by human subjects who followed strict guidelines to differentiate between similarity and relatedness. Neither multiword expressions nor named entities are included. This dataset was originally made available for English but has been translated to other languages. The Portuguese adaptation was originally made to assess distributional models of Portuguese words [29] and is available online (http://metashare.metanet4u.eu/ or https://github.com/nlx-group/lx-dsemvectors/ (October 2017)). Table 12 shows two adjectives, two nouns and two verbs of the Portuguese SimLex-999.
In order to exploit the LKBs in this task, two different algorithms were applied to compute the similarity between the words of each pair, namely:

• Similarity of the adjacencies of each word in the LKB, using measures such as the Jaccard coefficient (Adj-Jac, Equation (1)) or the cosine similarity (Adj-Cos, Equation (2)).
• PageRank vectors, inspired by Pilehvar et al. [30]. For each word of a pair, Personalized PageRank was first run in the target LKB, for 30 iterations, using the word as context; a vector was then created with the resulting rank of each other word of the LKB in each position. Finally, the similarity between the vectors for each word was computed, using the Jaccard coefficient between the sets of words in these vectors (PR-Jac) or the cosine of the vectors (PR-CosV). Given the large vector sizes, vectors were trimmed to the top-N ranked words. Different sizes N were tested, from 50 to 3200.
In addition, since SimLex-999 is a similarity test, the previous methods were tested using all the relations of each LKB, or only synonymy and hypernymy relations, which are more connected with this phenomenon.
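The four similarity measures can be sketched as follows. The formulas are the standard set-based Jaccard coefficient and cosine, which we assume correspond to Equations (1) and (2); whether a word counts as its own neighbour, and how ties are broken when trimming to the top N words, are further assumptions. The adjacency sets and rank values are invented, and PageRank results are represented as plain dictionaries from words to scores.

```python
import math

def adj_jaccard(adj, w1, w2):
    """Adj-Jac: shared neighbours over all neighbours of the two words."""
    a, b = adj.get(w1, set()), adj.get(w2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def adj_cosine(adj, w1, w2):
    """Adj-Cos: cosine between the binary adjacency vectors of the two words."""
    a, b = adj.get(w1, set()), adj.get(w2, set())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def top_n(rank, n):
    """Trim a PageRank result to its n best-ranked words."""
    best = sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(best)

def pr_jaccard(rank1, rank2, n):
    """PR-Jac: Jaccard coefficient between the sets of top-n words."""
    s1, s2 = set(top_n(rank1, n)), set(top_n(rank2, n))
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def pr_cosine(rank1, rank2, n):
    """PR-CosV: cosine between the trimmed rank vectors."""
    v1, v2 = top_n(rank1, n), top_n(rank2, n)
    dot = sum(score * v2.get(word, 0.0) for word, score in v1.items())
    norm1 = math.sqrt(sum(s * s for s in v1.values()))
    norm2 = math.sqrt(sum(s * s for s in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Invented adjacency sets and rank vectors.
adj = {
    "carro": {"automóvel", "viatura", "veículo"},
    "automóvel": {"carro", "viatura", "veículo"},
}
rank1 = {"a": 0.5, "b": 0.3, "c": 0.2}
rank2 = {"a": 0.4, "c": 0.4, "d": 0.2}

print(adj_jaccard(adj, "carro", "automóvel"))
print(adj_cosine(adj, "carro", "automóvel"))
print(pr_jaccard(rank1, rank2, 2))
print(pr_cosine(rank1, rank2, 2))
```

The adjacency-based measures only need two set lookups per pair, whereas the PageRank-based ones require one full PageRank run per word before any comparison, which is consistent with the difference in computation time noted for larger LKBs.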
The obtained results were evaluated with the Spearman correlation (ρ) between the similarities in SimLex-999 and the similarities computed from each of the previous methods in each LKB. Table 13 shows the best results for each combination of method, relations used, and LKB, as well as different methods for the LKB with the best results (All).
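For readers unfamiliar with the evaluation measure, the Spearman correlation is the Pearson correlation computed over ranks instead of raw scores. The sketch below is a plain implementation with average ranks for ties; the gold and predicted scores are invented and do not come from SimLex-999 or from any LKB.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented scores for four word pairs: human judgements and LKB-based similarities.
gold = [9.2, 1.5, 6.0, 3.3]
predicted = [0.8, 0.1, 0.2, 0.4]
rho = spearman(gold, predicted)
print(round(rho, 2))
```

Since only the ordering of the pairs matters, similarity functions with very different scales, such as a Jaccard coefficient and a cosine, can be compared directly against the same human scores.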
Results show that LKBs extracted from dictionaries have better results with PageRank-based algorithms, using all relations. This also includes CARTÃO, which we recall combines relations extracted from three dictionaries. On the other hand, LKBs extracted from wordnets have better results with adjacency-based algorithms, using only synonymy and hypernymy relations. It should be noted that there are clear advantages in using the adjacency-based algorithms, which, because of their lower time complexity, take much less time to compute the similarity scores, especially in larger LKBs. The best results are clearly obtained with the combination of all LKBs, using different configurations (0.56-0.61). The original LKB with the best performance is PAPEL (0.49), which performed slightly better than Redun2 (0.48), but lower than CARTÃO (0.53), which got second place overall. PAPEL was followed by OWN-PT (0.44) and Wiktionary.PT (0.42), both better than Redun3 (0.44).
was obtained for the English SimLex-999 [31], which is not very far from the results of our best configuration (0.61). In the future, we will study the impact of combining the LKB-based approach with distributional vectors.

Answering Cloze Questions
Open domain cloze questions have been generated in the scope of REAP.PT [32], an assisted language learning tutoring system for European Portuguese. Those consist of sentences with a blank, to be filled with a word from a shuffled list of candidates, of which only one is correct and the others are distractors. Some of the Portuguese LKBs have previously been exploited [33] to answer a set of 3890 of those questions, provided by the researchers involved in the REAP.PT project. Table 14 illustrates the contents of this dataset with the first two questions and the respective set of candidate words, with the correct answer in bold. The experiment reported here used the same dataset, this time answered with each of the LKBs explored in this work. The selection method was similar to the one used for the B²SG test (Section 5.1): for each sentence, answers were guessed from the top-ranked candidate, after running Personalized PageRank, this time using the lemmas of all the open-class words as context. For instance, for sentence #2, the words artista, verdadeiro, obra and arte were used.
Table 15 shows the accuracy in the selection of the correct answer, using each LKB, and with a baseline that selects the most frequent alternative, based on the frequency lists of the AC/DC corpora [23]. Results are shown as a total, and also organised according to the POS of the correct word to fill the blank. When no alternative was covered by the LKB, the answer would contain all the alternatives, and the question was counted as 25% correct.
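The scoring convention for uncovered questions can be made concrete with a short sketch. It assumes that an answer containing k alternatives, one of which is correct, earns 1/k of a point, which reproduces the 25% credit for four alternatives. The function names, candidate words and rank values are invented for illustration.

```python
def answer_question(rank, candidates, lkb_words):
    """Select the top-ranked covered candidate, or all candidates if none is covered."""
    covered = [c for c in candidates if c in lkb_words]
    if not covered:
        return set(candidates)
    return {max(covered, key=lambda c: rank.get(c, 0.0))}

def accuracy(answers):
    """Mean credit over (selected_set, correct_word) pairs, with 1/k partial credit."""
    credit = 0.0
    total = 0
    for selected, correct in answers:
        total += 1
        if correct in selected:
            credit += 1.0 / len(selected)
    return credit / total if total else 0.0

# Invented rank values, as if produced by Personalized PageRank for one sentence.
toy_rank = {"obra": 0.30, "quadro": 0.10}
toy_lkb_words = {"obra", "quadro", "mesa"}
selected = answer_question(toy_rank, ["quadro", "obra", "mesa", "rua"], toy_lkb_words)
uncovered = answer_question({}, ["w1", "w2", "w3", "w4"], toy_lkb_words)

# One correct answer, one uncovered question and one wrong answer.
toy_answers = [(selected, "obra"), (uncovered, "w2"), ({"mesa"}, "rua")]
score = accuracy(toy_answers)
print(round(score, 3))
```

Under this convention, an LKB that covers none of the alternatives scores exactly the random chance level instead of zero, so low coverage pulls results towards 25% rather than below it.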
Although all LKBs performed better than random chance (25%), this proved to be a challenging task. WN.Br scored just slightly above this number, possibly because it only covers verbs. Other LKBs were not much higher than the frequency baseline, which improved on random chance for nouns and verbs, but apparently did not make much difference for adjectives and adverbs. The highest rate of correct answers (≈40%) was obtained with CARTÃO, with no significant differences when compared to the result obtained with the All LKB. CARTÃO got the highest proportion of correct answers when the blank was to be filled with a verb (≈37%) or an adjective (≈36%), while the All LKB got the highest proportion for nouns (≈50%). For adverbs, this proportion is not significantly different from random chance. Curiously, the highest result for adverbs is obtained with Port4Nooj (≈30%). If using a smaller LKB is desired, PAPEL (≈191,000 relation instances) or Redun2 (≈145,000) answer ≈38% of the questions correctly.

Textual Similarity and Entailment
The ASSIN shared task targeted semantic similarity and textual entailment in Portuguese [34]. Its training data comprises 6000 sentence pairs (t, h), half of which are in Brazilian Portuguese (PTBR) and the other half in European Portuguese (PTPT). Test data comprises 4000 pairs, 2000 in each variant. Data is available on the task's website (http://nilc.icmc.usp.br/assin/ (April 2017)), together with the gold annotations of the test data and evaluation scripts. Similarity values range from 1 (completely different sentences, on different subjects) to 5 (t and h mean essentially the same). Entailment can have one of the following values: Paraphrase, Entailment or None. Table 16 shows a selection of sentence pairs in the ASSIN training collection.
LKBs were exploited to compute similarity according to Equation (3). Briefly, after preprocessing the sentences and computing the cosine of their stems, a bonus (γ) was added for each additional word from t directly related to a word in h (γ += 0.75) or related to a common word (γ += 0.05).
A very simple approach was followed for the entailment task. Common words and synonyms were first removed from the longer sentence. If the proportion of remaining words was below α = 0.1, the pair would be classified as a Paraphrase. After this, words from the first sentence in a hypernymy relation with words from the second were also removed. If the proportion of remaining words was below β = 0.45, the pair would be classified as Entailment. Parameters α and β were set after several experiments in the training collection.
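The entailment heuristic can be sketched as a two-step filter. Several details below are assumptions, because the text leaves them open: sentences are given as lists of lemmas, proportions are computed over the original length of the longer sentence, relations are looked up in both directions, and hypernymy is checked between the remaining words of the longer sentence and the words of the other one. Only the thresholds α = 0.1 and β = 0.45 come from the text; the relation sets and sentences are invented.

```python
ALPHA = 0.1   # paraphrase threshold, from the text
BETA = 0.45   # entailment threshold, from the text

def related(w1, w2, pairs):
    """True if the two words are linked in either direction in a set of word pairs."""
    return (w1, w2) in pairs or (w2, w1) in pairs

def classify_pair(t_words, h_words, synonyms, hypernymy, alpha=ALPHA, beta=BETA):
    """Label a sentence pair as 'Paraphrase', 'Entailment' or 'None'."""
    if len(t_words) >= len(h_words):
        longer, other = t_words, set(h_words)
    else:
        longer, other = h_words, set(t_words)
    if not longer:
        return "None"
    # Step 1: drop words shared with, or synonymous to, a word of the other sentence.
    remaining = [w for w in longer
                 if w not in other and not any(related(w, o, synonyms) for o in other)]
    if len(remaining) / len(longer) < alpha:
        return "Paraphrase"
    # Step 2: drop words linked by hypernymy to a word of the other sentence.
    remaining = [w for w in remaining
                 if not any(related(w, o, hypernymy) for o in other)]
    if len(remaining) / len(longer) < beta:
        return "Entailment"
    return "None"

# Invented relation instances and sentences.
synonyms = {("carro", "automóvel")}
hypernymy = {("animal", "cão")}

label_a = classify_pair(["o", "carro", "ser", "novo"],
                        ["o", "automóvel", "ser", "novo"], synonyms, hypernymy)
label_b = classify_pair(["o", "cão", "correr", "em", "jardim"],
                        ["o", "animal", "correr", "em", "jardim"], synonyms, hypernymy)
label_c = classify_pair(["o", "gato", "dormir"],
                        ["a", "bolsa", "cair", "hoje"], synonyms, hypernymy)
print(label_a, label_b, label_c)
```

The order of the two steps matters: synonymy alone can only produce a Paraphrase, and hypernymy is consulted only for pairs that fail that stricter test.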
Table 17 shows the obtained results for the PTPT and PTBR variants, with each LKB, plus a baseline that does not use an LKB (α = β = 0), and the best official results of ASSIN. Entailment performance is scored in terms of accuracy and Macro-F1, while similarity resorts to the Pearson correlation and the mean square error (MSE).
An initial comparison focused on size, relation types covered, and redundancy across the LKBs. Despite sharing a similar goal, these LKBs were created by different teams, following different approaches, and there are significant differences in the covered lexical items, relations (more than 80% of all the relation instances are in only one LKB), and their correctness or utility. The creation of new LKBs by combining the existing ones was described and all LKBs were then compared indirectly, when exploited in different computational semantics tasks.
The limitations of some LKBs were confirmed, especially the smaller ones (Port4Nooj, OT.PT), or those focused on a single POS (WN.Br) or relation (OT.PT). Except for the expected impact of those limitations, obtained results are positive for every LKB, especially in the word-based similarity tests. However, experiments suggest that using all the available relation instances generally leads to the best results. Some of these LKBs were recently used to answer other word similarity and relatedness tests [36] and, despite different results for different tests, the claim that combining several LKBs leads to better results still holds.
This comparison should not be seen as complete and further analysis is needed for stronger conclusions. Due to the large size of the LKB with all relation instances, in some cases, it might be worth using an LKB containing only relations in two or three LKBs. In the performed experiments, the negative impact of the latter solution on performance is higher for algorithms based on the structure of the network, such as PageRank, and not so much on approaches that do not go one level further than the direct adjacencies. This happens because PageRank exploits every link in the network structure, some of which are not redundant and thus missing from the redundancy-based LKBs. Even though the aforementioned conclusions are still valid for the sentence-oriented tests, additional features and more sophisticated approaches would be required for a higher performance (see [35]).
It should be added that all the ten LKBs compared in this work were exploited in the creation of a new version of the fuzzy Portuguese wordnet CONTO.PT [37], to be released in the future. In CONTO.PT, words are grouped together or related with a confidence measure, computed from the relations in all the exploited LKBs. This way, users may set their own cut-point on confidence and use either a smaller but more reliable LKB or a larger one, though not so reliable. All the redundancy-based LKBs are freely available for anyone to use, from http://ontopt.dei.uc.pt/index.php?sec=download_outros. We aim at using these LKBs in additional tasks, or in the same tasks but focusing on certain aspects, such as the POS. However, a manual intervention might be required for stronger conclusions.
Following the current trend of using distributional models of words in NLP, such as word embeddings, the performance of the LKBs and algorithms used here was recently compared with the performance of some of the previous models for Portuguese [36]. On the one hand, LKBs lead to the highest results when it comes to genuine similarity. On the other hand, they are outperformed by the distributional models when computing relatedness. This is partially explained by the fact that LKBs are more theoretical views of the mental lexicon, while the distribution of words in a corpus models the way language is actually used. We are currently working on the combination of both kinds of models in a single, hopefully better, word similarity function, as others have done for English (e.g., [22,31]). Such a function might be useful for higher-level natural language tasks, such as semantic search systems or conversational agents.

Table 1. Conversion of relations in different LKBs.

Table 2. Number of lexical items extracted from each LKB.
* means that additional relation types were considered for computing the total.

Table 3. Number of triples extracted from each LKB.
* means that additional relation types were considered for computing the total.

Table 4. Occurrences of the same triples in different resources, per type.

Table 5. Proportion of relation instances in each LKB that occur only in this LKB, this and another, and this and two other LKBs.

Table 6. Examples of redundant relation instances.

Table 7. Examples of relation instances in only one LKB.

Table 8. Size of the redundancy-based LKBs.

Table 9. Characterization of the benchmark tests.

Table 10. First entries of each file of the B²SG test.
* Correct answers in bold.

Table 11. Relation instances in and guessed from the B²SG test. Highest and second highest numbers are in bold.

Table 12. First two adjectives, nouns and verbs of the Portuguese SimLex-999.

Table 14. First two cloze questions of the dataset used.

Table 15. Accuracy for answering cloze questions.

Table 16. Selected examples from the ASSIN training collection, for European Portuguese (PTPT) and for Brazilian Portuguese (PTBR).

Table 17. Exploiting LKBs in the ASSIN test set.

Ten open Portuguese LKBs were overviewed in this paper, namely PAPEL; relations acquired from Dicionário Aberto and Wiktionary.PT; OpenWordnet-PT; PULO; TeP; OpenThesaurus.PT; semantic relations of Port4Nooj; Wordnet.Br; and the relations between Portuguese words in ConceptNet.