
2 February 2018

A Survey on Portuguese Lexical Knowledge Bases: Contents, Comparison and Combination †

Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra (CISUC), University of Coimbra, 3030-290 Coimbra, Portugal
This paper is an extended version of our paper "Comparing and Combining Portuguese Lexical-Semantic Knowledge Bases", published in the Proceedings of the 6th Symposium on Languages, Applications and Technologies (SLATE 2017), Vila do Conde, Portugal, 26–27 June 2017.
This article belongs to the Special Issue Special Issues on Languages Processing

Abstract

In the last decade, several lexical-semantic knowledge bases (LKBs) were developed for Portuguese, by different teams and following different approaches. Most of them are open and freely available for the community. Those LKBs are briefly analysed here, with a focus on size, structure, and overlapping contents. However, we go further and exploit all of the analysed LKBs in the creation of new LKBs, based on the redundant contents. Both original and redundancy-based LKBs are then compared, indirectly, based on the performance of automatic procedures that exploit them for solving four different semantic analysis tasks. In addition to conclusions on the performance of the original LKBs, results show that, instead of selecting a single LKB to use, it is generally worth combining the contents of all the open Portuguese LKBs, towards better results.

1. Introduction

Lexical-semantic knowledge bases (LKBs) are computational resources that organise words according to their meaning. In addition to other features, they should have a significant coverage of the words of a language, which, according to their possible senses, should be connected by means of semantic relations. Princeton WordNet [] is the paradigmatic resource of this kind for English, used in many natural language processing (NLP) tasks, and its model has been adapted to many other languages, including Portuguese. However, the first Portuguese WordNet [] is not available to the research community, and the first open alternatives were only developed in the last decade.
In order to cope with the lack of such a resource, several Portuguese LKBs have since been created and most became available for download, either under a paid license (e.g., MWN.PT (http://mwnpt.di.fc.ul.pt/)) or for free. However, those LKBs were developed by different teams, following different approaches, which resulted in LKBs with variable coverage and with slightly different features regarding their organisation. Due to the difficulties inherent in crafting such a broad resource manually, most Portuguese LKBs have some degree of automation in their creation process, which increases the chance of noise. Furthermore, not all follow the full WordNet model. For instance, even though all of them cover one or more types of semantic relations, not all handle word senses. In fact, none of them is as consensual for Portuguese as Princeton WordNet, which was created manually and has a large community of users, is for English. Finally, some Portuguese LKBs are not large enough, while others have an interesting size but include several incorrect, infrequent or not very useful relations or lexical items.
In this paper, ten open Portuguese LKBs are characterised in terms of covered lexical items and semantic relations. The redundancy across them is then analysed, towards the creation of (potentially) more useful LKBs. All the LKBs, including the new ones, are finally compared indirectly, by exploiting them in semantic similarity tasks with available benchmark datasets for Portuguese, namely: (i) given a word, selecting the most similar word from a predefined set; (ii) quantifying the semantic similarity of two words; (iii) filling a blank in a sentence with the correct word from a set; and (iv) quantifying the semantic textual similarity between two sentences. In addition to confirming our intuition that there are advantages in combining different LKBs, this work can be seen as the first systematic comparison of the open Portuguese LKBs.
This is an extended version of a previously published paper []. Here, a more detailed comparison is made, including a conversion table that maps semantic relation names across LKBs; two new resources are considered (ConceptNet and CARTÃO), as well as the most recent version of another (PULO). This resulted in different redundancy-based LKBs and, consequently, in new experimentation results.

3. Open Portuguese LKBs

Ten open Portuguese knowledge bases with lexical-semantic information were identified and explored in this work, namely:
  • Three wordnets: WordNet.Br [], OpenWordNet-PT (OWN.PT) [] and PULO [];
  • Two synset-based thesauri: TeP [] and OpenThesaurus.PT (http://paginas.fe.up.pt/~arocha/AED1/0607/trabalhos/thesaurus.txt (January 2018)) (OT.PT);
  • Three lexical-semantic networks extracted from Portuguese dictionaries: PAPEL [], relations extracted from Dicionário Aberto (DA) [], and relations extracted from Wiktionary.PT (http://pt.wiktionary.org (2015 dump));
  • Semantic relations available in Port4Nooj [], a set of linguistic resources;
  • Semantic relations between Portuguese words in the ConceptNet [] semantic network, which includes common-sense knowledge, lexical knowledge and others.
As these resources do not share exactly the same structure, to enable their comparison and integration, they were all reduced to a set of relation instances of the kind “x related-to y”, where x and y are lexical items and related-to is the name of a semantic relation. For the synset-based LKBs, wordnets and thesauri, synsets had to be deconstructed. For example, the instance {porta, portão} partOf {automóvel, carro, viatura} resulted in: (porta synonymOf portão), (automóvel synonymOf carro), (automóvel synonymOf viatura), (carro synonymOf viatura), (porta partOf automóvel), (porta partOf carro), (porta partOf viatura), (portão partOf automóvel), (portão partOf carro), (portão partOf viatura). In English, {door, gate} partOf {automobile, car} would result in: (door synonymOf gate), (automobile synonymOf car), (door partOf automobile), (door partOf car), (gate partOf automobile), (gate partOf car). The adopted relation names were those defined in the PAPEL project [], a rich set that covers most relation types in all the LKBs. However, some relation names in other LKBs had to be converted to a common name, always considering their semantics. Table 1 presents the performed conversions. Inverse relation names are omitted from this table, but they were also considered in the conversion process.
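The deconstruction just described can be sketched as follows. This is an illustrative Python fragment, with function names and data structures of our own choosing rather than taken from any of the original resources.

```python
from itertools import combinations, product

def deconstruct(synset_a, relation, synset_b):
    """Expand a relation instance between two synsets into word-level triples."""
    triples = set()
    # synonymy between the members of each synset
    for synset in (synset_a, synset_b):
        for w1, w2 in combinations(sorted(synset), 2):
            triples.add((w1, "synonymOf", w2))
    # the original relation, between every pair of members of the two synsets
    for w1, w2 in product(synset_a, synset_b):
        triples.add((w1, relation, w2))
    return triples

print(deconstruct({"porta", "portão"}, "partOf", {"automóvel", "carro", "viatura"}))
```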
Table 1. Conversion of relations in different LKBs.
The size and type of contents of the LKBs obtained after conversion are summarised in Table 2 and Table 3. Table 2 is focused on the number of covered lexical items, organised according to their part-of-speech (POS). Given that, without a context, the same lexical item may have different POS, the table also provides the number of distinct lexical items when the POS is not considered. Table 3 targets the number of relations covered by each LKB, grouped according to their broader types. The total number of relations is also provided, together with the average degree of each word, which measures the average number of relations involving each word in the network.

A remark should be made about ConceptNet. In addition to being a slightly different knowledge base, not exclusively focused on lexical-semantic knowledge, it was also the last one to be included in this work. After analysing the set of available relations, several were not covered by our set of relation types. From this set, we discarded lexical relations, such as those related to word forms (e.g., FormOf, DerivedFrom, EtymologicallyRelatedTo), not so useful for semantic analysis, but we kept other interesting and potentially useful relation types (e.g., Desires, MotivatedByGoal). In the previous tables, the numbers of the latter types are only considered in the total, which is why the given number is followed by an asterisk (*). It should also be added that, for the converted relations, we only kept those for which we could identify the POS of both arguments. For this purpose, we used the POS provided by ConceptNet. However, as this information is only provided for some items, when it was not available, the possible POS of each word were automatically checked in the corpora of the AC/DC service []. More precisely, we considered that a word could have every POS with which its lemma occurred in AC/DC at least five times. It should also be mentioned that, although relation instances in the current version of ConceptNet have an attached confidence weight, the majority of the instances between two Portuguese words (≈95%) have this parameter set to 1.0, so it was not used.
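The POS filter just described can be sketched as follows. This is a hypothetical fragment that assumes the AC/DC frequencies were previously dumped into a local table of (lemma, POS) counts; names are illustrative.

```python
# Assumed, simplified view of the POS check against the AC/DC corpora.
MIN_FREQ = 5  # a lemma must occur at least five times with a POS for it to count

def possible_pos(lemma, freq_table):
    """Return every POS with which the lemma occurs at least MIN_FREQ times."""
    return {pos for (lem, pos), freq in freq_table.items()
            if lem == lemma and freq >= MIN_FREQ}

# toy example with made-up frequencies
freqs = {("porta", "NOUN"): 1234, ("porta", "VERB"): 2, ("portar", "VERB"): 87}
print(possible_pos("porta", freqs))  # {'NOUN'}
```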
Table 2. Number of lexical items extracted from each LKB.
Table 3. Number of triples extracted from each LKB.
Although the LKB with the most lexical items is the one obtained from DA (≈95,000 distinct items), it contains substantially fewer relation instances than TeP, which covers ≈490,000 synonymy and antonymy instances but no other relation type. PAPEL, DA, OWN-PT and WN.Br all contain more than 100,000 relation instances. This is also noticeable from the average degree of each of those LKBs, which is the lowest in DA. On the other hand, WN.Br only covers verbs and is the smallest LKB in terms of lexical items, but the average degree of its words is substantially higher than in the others (36.9, followed by 11.9 in TeP). In fact, though lower than in WN.Br, the average degrees of the synset-based LKBs are higher than those of the others, which is, to some extent, a consequence of the synset deconstruction process.
Regarding the relation types, all LKBs cover synonymy; antonymy is not covered by OT.PT, WN.Br and Port4Nooj; and hypernymy is not covered by TeP and OT.PT, because both are originally synset-based thesauri. Other types are present in several LKBs (e.g., part, cause, property), but some types are only found in the LKBs extracted from dictionaries. ConceptNet also has an interesting range of covered types, among which we highlight the number of purpose-of and place-of relations.

4. Redundancy in Portuguese LKBs

Open Portuguese LKBs are not only organised according to slightly different models. They were also created with different approaches, most of which involve automatic or semi-automatic steps for exploiting available resources, such as dictionaries or encyclopaedias, not only in Portuguese, but also in other languages. Therefore, although they try to cover the whole language, they end up having different granularities and contents, not only in terms of covered relation types, but also of lexical items and relation instances, some of which are less useful for some tasks, or even incorrect. Table 4 shows the number of relation instances, grouped by relation type and by the number of LKBs in which they were found.
Table 4. Occurrences of the same triples in different resources, per type.
Table 5 complements Table 4 and gives an idea of the typical knowledge covered by each LKB. More precisely, for each LKB, the included relation instances are grouped into those that are exclusive to the target LKB, those that are in only one more LKB (+1), and those that are in only two more (+2). This table shows, for instance, that ConceptNet is the network with the most non-overlapping knowledge. The LKBs extracted from dictionaries contain the lowest proportion of knowledge that is not found in another LKB, but this proportion is still high: ≈57% for DA and ≈64% for PAPEL and Wiktionary.
Table 5. Proportion of relation instances in each LKB that occur only in this LKB, this and another, and this and two other LKBs.
The majority of the relation instances found (≈82%) are in only one LKB, ≈13% are in two, ≈3% in three and just ≈1% in four. Only synonymy, and a residual number of antonymy and hypernymy instances, are in six or more LKBs, which is expected because those also happen to be the types covered by the most LKBs. Our intuition is that the more resources an instance is in, the more likely it is to transmit a consensual, frequent and useful relation. This does not mean, however, that most of the relations found in only one LKB are incorrect or not useful. It only means that the latter set should contain a higher proportion of relations that are either incorrect, very specific or useful only in a more limited domain of application, when compared to the set of relations in more than one LKB. This is confirmed by observed examples, including those in Table 6, which contains relation instances that are in three to nine LKBs. Each redundancy level includes only instances of relation types that were not present in the previous level, or that were present but with arguments of a different POS.
Table 6. Examples of redundant relation instances.
On the other hand, instances that only occur in one LKB are more likely either to be incorrect, due to noise in the automatic extraction process, or to involve very specific meanings, which makes them less useful. Observed examples also confirm this. Some of them are presented in Table 7, which shows a list of relation instances that are in a single LKB, randomly selected for different relation types.
Table 7. Examples of relation instances in only one LKB.
Following the aforementioned intuition, namely that relation instances found in more LKBs are more likely to transmit a consensual, frequent and useful relation, new LKBs were created based on the redundancy level: one with all the relation instances of all LKBs (All) and eight more with the relation instances found in at least two to at least nine LKBs (Redun2–9). The resulting LKBs are characterised in Table 8. From those, the largest three (All, Redun2, Redun3) were used to perform the same tasks as the original LKBs, as reported in the following section. For historical reasons, CARTÃO [], an LKB extracted exclusively from dictionaries, which combines PAPEL, DA and Wiktionary.PT, was also used in the following experiments. Table 8 also contains information on the size of CARTÃO.
Table 8. Size of the redundancy-based LKBs.
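Building the redundancy-based LKBs amounts to counting, for each triple, the number of original LKBs it appears in. The minimal sketch below assumes each LKB has already been loaded as a set of (word, relation, word) triples; the variable names in the comment are hypothetical.

```python
from collections import Counter

def redundancy_lkb(lkbs, min_level):
    """Keep the triples that occur in at least min_level of the given LKBs.
    lkbs: list of sets of (word1, relation, word2) triples."""
    counts = Counter(triple for lkb in lkbs for triple in set(lkb))
    return {triple for triple, n in counts.items() if n >= min_level}

# 'All' corresponds to min_level = 1 (the union of all LKBs);
# Redun2 to Redun9 correspond to min_level = 2 .. 9, e.g.:
# redun2 = redundancy_lkb([papel, dicionario_aberto, wikcionario, tep, ...], 2)
```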

5. Comparing Portuguese LKBs Indirectly

Due to the time-consuming work required for evaluating the contents of each LKB manually, plus the subjectivity of such a task, the Portuguese LKBs were compared indirectly, when exploited to solve semantic similarity-related tasks, for which datasets, here used as benchmarks, are available. Experiments performed in this comparison cover four different tasks, namely: selecting the most similar word from a small set (B²SG, Section 5.1); computing the semantic similarity between pairs of words (SimLex-999, Section 5.2); selecting the most suitable word, in a set, for a blank in a sentence (cloze questions, Section 5.3); and computing the semantic similarity between pairs of sentences (ASSIN, Section 5.4). Table 9 organises those benchmark tests according to their type.
Table 9. Characterization of the benchmark tests.

5.1. Selecting the Most Similar Word from a Small Set

The B²SG [] test is similar to the WordNet-Based Synonymy Test [], but based on the Portuguese part of BabelNet [] and partially evaluated by humans. It contains frequent Portuguese nouns and verbs (targets), each followed by four candidates, of which only one is related, and is organised in six files: one for nouns and one for verbs, for each of synonymy, hypernymy and antonymy. Table 10 illustrates the B²SG test with the first line of each file. The correct answer is always the first candidate, followed by three distractors.
Table 10. First entries of each file of the B²SG test.
Although B²SG was created for evaluating less structured resources, such as distributional thesauri, we analysed how many of its correct relations are covered by the Portuguese LKBs. Furthermore, for the uncovered instances, the correct alternative was guessed from the top-ranked candidate, after running the Personalized PageRank [] algorithm in each LKB, for 30 iterations, using the target word as context.
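A minimal sketch of this selection procedure is given below, using networkx's personalized PageRank as a stand-in for the authors' implementation and assuming the LKB has been loaded as an undirected graph whose nodes are lexical items.

```python
import networkx as nx

def guess_candidate(lkb_graph, target, candidates):
    """Rank every node with PageRank personalized on the target word and
    return the best-ranked candidate. The paper fixes 30 iterations, while
    networkx simply iterates until convergence."""
    if target not in lkb_graph:
        return None
    ranks = nx.pagerank(lkb_graph, personalization={target: 1.0})
    covered = [c for c in candidates if c in ranks]
    return max(covered, key=ranks.get) if covered else None

# e.g., guess_candidate(G, "carro", ["automóvel", "porta", "peixe", "mesa"])
```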
Table 11 presents the number of covered (In) and guessed (Guess) relation instances for each LKB. The coverage numbers highlight known limitations of some LKBs. For instance, antonymy relations extracted from dictionaries are mostly between adjectives; synset-based thesauri do not cover hypernymy; only the wordnet-based LKBs cover hypernymy between verbs; and WN.Br covers only verbs. However, for this specific test, some limitations could be minimised by exploiting the structure of the LKB. As expected, the highest coverage and proportion of guessed relations are obtained for the All LKB, for which 97.4% of the instances are guessed. It is followed by OWN-PT on both coverage and guesses, except for the guesses of hypernymy and antonymy between nouns. In the former, CARTÃO gets the second highest number, followed very closely by Redun2, which gets the second highest number of guesses of antonymy relations between nouns. However, we suspect that these numbers are positively biased towards OWN-PT, because it is currently integrated in BabelNet.
Table 11. Relation instances in and guessed from the B²SG test. Highest and second highest numbers are in bold.

5.2. Computing the Similarity between Word Pairs

SimLex-999 [] is a recent benchmark for assessing methods that compute semantic similarity. It contains 999 pairs of words with the same POS and their similarity scores, given by human subjects who followed strict guidelines to differentiate between similarity and relatedness. No multiword expressions or named entities are included. This dataset was originally made available for English but has been translated to other languages. The Portuguese adaptation was originally made to assess distributional models of Portuguese words [] and is available online (http://metashare.metanet4u.eu/ or https://github.com/nlx-group/lx-dsemvectors/ (October 2017)). Table 12 shows two adjectives, two nouns and two verbs of the Portuguese SimLex-999.
Table 12. First two adjectives, nouns and verbs of the Portuguese SimLex-999.
In order to exploit the LKBs in this task, two different algorithms were applied to compute the similarity between the words of each pair, namely:
  • Similarity of the adjacencies of each word in the LKB, using measures such as the Jaccard coefficient (Adj-Jac, Equation (1)) or the cosine similarity (Adj-Cos, Equation (2)); a code sketch of these measures is given after this list:
    $Adj\text{-}Jac(w_1, w_2) = \frac{|adjacencies(w_1) \cap adjacencies(w_2)|}{|adjacencies(w_1) \cup adjacencies(w_2)|}$, (1)
    $Adj\text{-}Cos(w_1, w_2) = \frac{|adjacencies(w_1) \cap adjacencies(w_2)|}{|adjacencies(w_1)| + |adjacencies(w_2)|}$. (2)
  • PageRank vectors, inspired by Pilehvar et al. []. For each word of a pair, Personalized PageRank was first run in the target LKB, for 30 iterations, using the word as context; a vector was then created, with the resulting rank of every other word of the LKB in each position. Finally, the similarity between the vectors of the two words was computed, using either the Jaccard coefficient between the sets of words in the vectors (PR-Jac) or the cosine of the vectors (PR-CosV). Given the large vector sizes, vectors were trimmed to the top-N ranked words. Different sizes N were tested, from 50 to 3200.
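The sketch below illustrates the adjacency-based Jaccard measure (Equation (1)) and the trimmed PageRank-vector variant with Jaccard overlap (PR-Jac), again assuming an LKB loaded as a networkx graph in which both words occur; it is an illustration under those assumptions, not the authors' code.

```python
import networkx as nx

def adj_jaccard(graph, w1, w2):
    """Adj-Jac (Equation (1)): Jaccard coefficient of the two adjacency sets."""
    a, b = set(graph.neighbors(w1)), set(graph.neighbors(w2))
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pr_top_words(graph, word, top_n=800):
    """Words with the highest personalized PageRank for the given word,
    trimmed to the top-N ranks (the paper tests N from 50 to 3200)."""
    ranks = nx.pagerank(graph, personalization={word: 1.0})
    ordered = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
    return {w for w, _ in ordered[:top_n]}

def pr_jaccard(graph, w1, w2, top_n=800):
    """PR-Jac: Jaccard coefficient of the words in the two trimmed vectors."""
    v1, v2 = pr_top_words(graph, w1, top_n), pr_top_words(graph, w2, top_n)
    return len(v1 & v2) / len(v1 | v2) if (v1 | v2) else 0.0
```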
In addition, since SimLex-999 is a similarity test, the previous methods were tested using either all the relations of each LKB or only the synonymy and hypernymy relations, which are more connected with this phenomenon.
The obtained results were evaluated with the Spearman correlation (ρ) between the similarities in SimLex-999 and the similarities computed with each of the previous methods in each LKB. Table 13 shows the best results for each combination of method, relations used, and LKB, as well as different methods for the LKB with the best results (All).
Table 13. Selection of results for the SimLex-999 test.
The results show that the LKBs extracted from dictionaries achieve better results with the PageRank-based algorithms, using all relations. This also includes CARTÃO, which, we recall, combines relations extracted from three dictionaries. On the other hand, the wordnet-based LKBs achieve better results with the adjacency-based algorithms, using only synonymy and hypernymy relations. It should be noted that there are clear advantages in using the adjacency-based algorithms, which, because of their lower time complexity, take much less time to compute the similarity scores, especially in larger LKBs. The best results are clearly obtained with the combination of all LKBs, using different configurations (0.56–0.61). The original LKB with the best performance is PAPEL (0.49), which performed slightly better than Redun2 (0.48), but worse than CARTÃO (0.53), which ranked second overall. PAPEL was followed by OWN-PT (0.44) and Wiktionary.PT (0.42), both better than Redun3 (0.44).
Although the top result is obtained with a PageRank-based algorithm, adjacency-based similarity is close, and even higher for some LKBs. It should thus be seen as a valuable alternative, especially because PageRank-based algorithms are either time-expensive (the complexity of running PageRank) or memory-expensive (ranks can be pre-computed, but large matrices are required). As for the size of the vectors, there is no clear trend, except that the best result is never obtained with the largest size tested (3200). Further discussion of the best methods is out of the scope of this paper.
Although languages are different and so are the available resources, a final word should be given on the comparison of these results with the top state-of-the-art results for English, as reported in the ACL Wiki (https://aclweb.org/aclwiki/SimLex-999_(State_of_the_art) (October 2017)). By combining distributional vectors with knowledge from Princeton WordNet, a Spearman coefficient of 0.642 was obtained for the English SimLex-999 [], which is not very far from the results of our best configuration (0.61). In the future, we will study the impact of combining the LKB-based approach with distributional vectors.

5.3. Answering Cloze Questions

Open domain cloze questions have been generated in the scope of REAP.PT [], an assisted language learning tutoring system for European Portuguese. These consist of sentences with a blank, to be filled with a word from a shuffled list of candidates, of which only one is correct and the others are distractors. Some of the Portuguese LKBs have previously been exploited [] to answer a set of 3890 of those questions, provided by the researchers involved in the REAP.PT project. Table 14 illustrates the contents of this dataset with the first two questions and the respective sets of candidate words, with the correct answer in bold.
Table 14. First two cloze questions of the dataset used.
The experiment reported here used the same dataset, this time answered with each of the LKBs explored in this work. The selection method was similar to the one used for the B²SG test (Section 5.1): for each sentence, answers were guessed from the top-ranked candidate, after running Personalized PageRank, this time using the lemmas of all the open-class words as context. For instance, for sentence #2, the words artista, verdadeiro, obra and arte were used.
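This variant can be sketched by spreading the personalization mass over all context lemmas, reusing the PageRank idea from Section 5.1; as before, this is an illustrative fragment with assumed names, not the original implementation.

```python
import networkx as nx

def answer_cloze(lkb_graph, context_lemmas, candidates):
    """Pick the candidate best ranked by PageRank personalized on the lemmas
    of the open-class words of the sentence."""
    context = [w for w in context_lemmas if w in lkb_graph]
    if not context:
        return None  # nothing covered: the experiment then keeps all alternatives
    personalization = {w: 1.0 / len(context) for w in context}
    ranks = nx.pagerank(lkb_graph, personalization=personalization)
    covered = [c for c in candidates if c in ranks]
    return max(covered, key=ranks.get) if covered else None

# e.g., answer_cloze(G, ["artista", "verdadeiro", "obra", "arte"], candidates)
```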
Table 15 shows the accuracy in the selection of the correct answer, using each LKB, and with a baseline that selects the most frequent alternative, based on the frequency lists of the AC/DC corpora []. Results are shown as a total, and also organised according to the POS of the correct word to fill the blank. When no alternative was covered by the LKB, the answer would contain all the alternatives (25% correct).
Table 15. Accuracy for answering cloze questions.
Although all LKBs performed better than random chance (25%), this proved to be a challenging task. WN.Br was just slightly above this number, possibly because it only covers verbs. Other LKBs were not much higher than the frequency baseline, which improved over random chance for nouns and verbs, but apparently did not make much difference for adjectives and adverbs. The highest rate of correct answers (≈40%) was obtained with CARTÃO, with no significant difference when compared to the result obtained with the All LKB. CARTÃO got the highest proportion of correct answers when the blank was to be filled with a verb (≈37%) or an adjective (≈36%), while the All LKB got the highest proportion for nouns (≈50%). For adverbs, the proportions are not significantly different from random chance; curiously, the highest result for adverbs is obtained with Port4Nooj (≈30%). If using a smaller LKB is desired, PAPEL (≈191,000 relation instances) or Redun2 (≈145,000) answer ≈38% of the questions correctly.

5.4. Textual Similarity and Entailment

The ASSIN shared task targeted semantic similarity and textual entailment in Portuguese []. Its training data comprises 6000 sentence pairs (t, h), half of which are in Brazilian Portuguese (PTBR) and the other half in European Portuguese (PTPT). Test data comprises 4000 pairs, 2000 in each variant. Data is available on the task's website (http://nilc.icmc.usp.br/assin/ (April 2017)), together with the gold annotations of the test data and evaluation scripts. Similarity values range from 1 (completely different sentences, on different subjects) to 5 (t and h mean essentially the same). Entailment can have one of the following values: Paraphrase, Entailment or None. Table 16 shows a selection of sentence pairs from the ASSIN training collection.
Table 16. Selected examples from the ASSIN training collection, for European Portuguese (PTPT) and Brazilian Portuguese (PTBR).
LKBs were exploited to compute similarity according to Equation (3). Briefly, after preprocessing the sentences and computing the cosine of their stems, a bonus (γ) was added for each additional word from t that is directly related to a word in h (γ += 0.75) or related to a common word (γ += 0.05):
$Sim(S_1, S_2) = \frac{|S_1 \cap S_2| + \gamma}{\sqrt{|S_1|} \times \sqrt{|S_2|}}$. (3)
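A rough sketch of this similarity function follows, under the assumptions that the two sentences are already reduced to sets of stems, that a related(a, b) predicate queries the LKB, and that "related to a common word" means related to a stem shared by both sentences; the denominator follows our reconstruction of Equation (3).

```python
import math

def lkb_similarity(s1, s2, related, gamma_direct=0.75, gamma_common=0.05):
    """s1, s2: sets of stems of sentences t and h;
    related(a, b): True if the LKB holds a relation between a and b."""
    if not s1 or not s2:
        return 0.0
    common = s1 & s2
    gamma = 0.0
    for t_word in s1 - s2:
        if any(related(t_word, h_word) for h_word in s2 - s1):
            gamma += gamma_direct   # directly related to a word of h
        elif any(related(t_word, c) for c in common):
            gamma += gamma_common   # related only to a word common to both sentences
    return (len(common) + gamma) / (math.sqrt(len(s1)) * math.sqrt(len(s2)))
```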
A very simple approach was followed for the entailment task. Common words and synonyms were first removed from the longer sentence. If the proportion of remaining words was below α = 0.1, the pair would be classified as a Paraphrase. After this, words from the first sentence in a hypernymy relation with words from the second were also removed. If the proportion of remaining words was below β = 0.45, the pair would be classified as Entailment. Parameters α and β were set after several experiments on the training collection.
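One possible reading of this heuristic is sketched below, assuming synonym_of and hypernym_of predicates backed by the LKB; details such as which sentence is reduced are simplifications on our part.

```python
def classify_entailment(t, h, synonym_of, hypernym_of, alpha=0.1, beta=0.45):
    """t, h: non-empty sets of stems of the two sentences."""
    longer, other = (t, h) if len(t) >= len(h) else (h, t)
    # remove common words and synonyms from the longer sentence
    remaining = {w for w in longer
                 if w not in other and not any(synonym_of(w, o) for o in other)}
    if len(remaining) / len(longer) < alpha:
        return "Paraphrase"
    # then also remove words in a hypernymy relation with words of the other sentence
    remaining = {w for w in remaining
                 if not any(hypernym_of(w, o) for o in other)}
    if len(remaining) / len(longer) < beta:
        return "Entailment"
    return "None"
```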
Table 17 shows the results obtained for the PTPT and PTBR variants, with each LKB, plus a baseline that does not use an LKB (α = β = 0), and the best official results of ASSIN. Entailment performance is scored in terms of accuracy and Macro-F1, while similarity is scored with the Pearson correlation and the mean square error (MSE).
Table 17. Exploiting LKBs in the ASSIN test set.
The approach followed in this task was admittedly simplistic. In fact, the performance with different LKBs does not vary significantly and no strong conclusions can be drawn, as the cosine seems to play a greater role. To reach the best performances, LKB features would have to be combined with others, possibly in a supervised approach, where the weights of each feature would be learned during the training phase. This is how most participating systems approached ASSIN, including those with the best results. Further experiments with these LKBs and additional features can be found elsewhere [].
Despite the previous remark, and in opposition to the cloze questions, in this case using the All LKB leads to the lowest results in most scores, possibly due to the noise in such a large LKB, and also due to the different method applied. Using the redundancy-based LKBs would probably be a good option, especially for similarity.

6. Conclusions

Ten open Portuguese LKBs were overviewed in this paper, namely: PAPEL; relations acquired from Dicionário Aberto and Wiktionary.PT; OpenWordnet-PT; PULO; TeP; OpenThesaurus.PT; the semantic relations of Port4Nooj; WordNet.Br; and the relations between Portuguese words in ConceptNet. An initial comparison focused on size, covered relation types, and redundancy across the LKBs. Despite sharing a similar goal, these LKBs were created by different teams, following different approaches, and there are significant differences in the covered lexical items and relations (more than 80% of all relation instances are in only one LKB), as well as in their correctness and utility. The creation of new LKBs by combining the existing ones was described, and all LKBs were then compared indirectly, when exploited in different computational semantics tasks.
The limitations of some LKBs were confirmed, especially the smaller ones (Port4Nooj, OT.PT), and those focused on a single POS (WN.Br) or relation type (OT.PT). Apart from the expected impact of those limitations, the obtained results are positive for every LKB, especially in the word-based similarity tests. However, the experiments suggest that using all the available relation instances generally leads to the best results. Some of these LKBs were recently used to answer other word similarity and relatedness tests [] and, despite different results for different tests, the claim that combining several LKBs leads to better results still holds.
This comparison should not be seen as complete, and further analysis is needed for stronger conclusions. Due to the large size of the LKB with all relation instances, in some cases it might be worth using an LKB containing only the relations found in at least two or three LKBs. In the performed experiments, the negative impact of the latter solution on performance is higher for algorithms based on the structure of the network, such as PageRank, and lower for approaches that do not go further than the direct adjacencies. This happens because PageRank exploits every link in the network structure, some of which are not redundant and are thus missing from the redundancy-based LKBs. Even though the aforementioned conclusions are still valid for the sentence-oriented tests, additional features and more sophisticated approaches would be required for a higher performance (see []).
It should be added that all ten LKBs compared in this work were exploited in the creation of a new version of the fuzzy Portuguese wordnet CONTO.PT [], to be released in the future. In CONTO.PT, words are grouped together or related with a confidence measure, computed from the relations in all the exploited LKBs. This way, users may set their own confidence cut-point and use either a smaller but more reliable LKB or a larger but less reliable one. All the redundancy-based LKBs are freely available from http://ontopt.dei.uc.pt/index.php?sec=download_outros. We aim to use these LKBs in additional tasks, or in the same tasks while focusing on particular aspects, such as the POS. However, manual intervention might be required for stronger conclusions.
Following the current trend of using distributional models of words in NLP, such as word embeddings, the performance of the LKBs and algorithms used here was recently compared with the performance of some of the previous models for Portuguese []. On the one hand, LKBs lead to the highest results when it comes to genuine similarity. On the other hand, they are outperformed by the distributional models when computing relatedness. This is partially explained by the fact that LKBs are more theoretical views of the mental lexicon, while the distribution of words in a corpus models the way language is actually used. We are currently working on the combination of both kinds of models in a single, hopefully better, word similarity function, as others have done for English (e.g., [,]). Such a function might be useful for higher-level natural language tasks, such as semantic search systems or conversational agents.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Fellbaum, C. (Ed.) WordNet: An Electronic Lexical Database (Language, Speech, and Communication); The MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  2. Marrafa, P. Portuguese WordNet: General architecture and internal semantic relations. DELTA 2002, 18, 131–146. [Google Scholar] [CrossRef]
  3. Gonçalo Oliveira, H. Comparing and Combining Portuguese Lexical-Semantic Knowledge Bases. In Proceedings of 6th Symposium on Languages, Applications and Technologies (SLATE 2017); Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, OASICS: Kobe, Japan, 2017; Volume 56, pp. 16:1–16:15. [Google Scholar]
  4. De Paiva, V.; Real, L.; Gonçalo Oliveira, H.; Rademaker, A.; Freitas, C.; Simões, A. An overview of Portuguese Wordnets. In Proceedings of the 8th Global WordNet Conference (GWC’16), Bucharest, Romania, 27–30 January 2016; pp. 74–81. [Google Scholar]
  5. Magnini, B.; Cavaglià, G. Integrating Subject Field Codes into WordNet. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May–2 June 2000; ELRA: Paris, France, 2000; pp. 1413–1418. [Google Scholar]
  6. Shi, L.; Mihalcea, R. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing’05); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2005; Volume 3406, pp. 100–111. [Google Scholar]
  7. Gurevych, I.; Eckle-Kohler, J.; Hartmann, S.; Matuschek, M.; Meyer, C.M.; Wirth, C. UBY—A Large-Scale Unified Lexical-Semantic Resource. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France, 23–27 April 2012; ACL Press: Avignon, France, 2012; pp. 580–590. [Google Scholar]
  8. Vossen, P. EuroWordNet: A multilingual database for information retrieval. In Proceedings of the DELOS Workshop on Cross-Language Information Retrieval, Zurich, Switzerland, 5–7 March 1997. [Google Scholar]
  9. Pianta, E.; Bentivogli, L.; Girardi, C. MultiWordNet: Developing an aligned multilingual database. In Proceedings of the 1st International Conference on Global WordNet (GWC 2002), Mysore, India, 21–25 January 2002. [Google Scholar]
  10. Bond, F.; Foster, R. Linking and Extending an Open Multilingual Wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013; ACL Press: Sofia, Bulgaria, 2013; pp. 1352–1362. [Google Scholar]
  11. Gonzalez-Agirre, A.; Laparra, E.; Rigau, G. Multilingual Central Repository version 3.0. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 21–27 May 2012; ELRA: Paris, France, 2012; pp. 2525–2529. [Google Scholar]
  12. De Melo, G.; Weikum, G. Towards a Universal Wordnet by Learning from Combined Evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2–6 November 2009; ACM: New York, NY, USA, 2009; pp. 513–522. [Google Scholar]
  13. Navigli, R.; Ponzetto, S.P. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artif. Intell. 2012, 193, 217–250. [Google Scholar] [CrossRef]
  14. Downey, D.; Etzioni, O.; Soderland, S. A Probabilistic Model of Redundancy in Information Extraction. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), Edinburgh, Scotland, 30 July–5 August 2005; pp. 1034–1041. [Google Scholar]
  15. Dias-da-Silva, B.C. Wordnet.Br: An exercise of human language technology research. In Proceedings of the 3rd International WordNet Conference (GWC), Jeju Island, Korea, 22–26 January 2006; pp. 301–303. [Google Scholar]
  16. De Paiva, V.; Rademaker, A.; de Melo, G. OpenWordNet-PT: An Open Brazilian WordNet for Reasoning. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India, 8–15 December 2012. [Google Scholar]
  17. Simões, A.; Guinovart, X.G. Bootstrapping a Portuguese WordNet from Galician, Spanish and English Wordnets. In Advances in Speech and Language Technologies for Iberian Languages, Proceedings of the 2nd International Conference on IberSPEECH 2014, Las Palmas de Gran Canaria, Spain, 19–22 November 2014; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2014; Volume 8854, pp. 239–248. [Google Scholar]
  18. Maziero, E.G.; Pardo, T.A.S.; Felippo, A.D.; Dias-da-Silva, B.C. A Base de Dados Lexical e a Interface Web do TeP 2.0—Thesaurus Eletrônico para o Português do Brasil. In Proceedings of the Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web, Vila Velha, Brazil, 26–29 October 2008; ACM: New York, NY, USA, 2008; pp. 390–392. [Google Scholar]
  19. Gonçalo Oliveira, H.; Santos, D.; Gomes, P.; Seco, N. PAPEL: A Dictionary-Based Lexical Ontology for Portuguese. In Proceedings of 8th International Conference on Computational Processing of the Portuguese Language (PROPOR 2008); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2008; Volume 5190, pp. 31–40. [Google Scholar]
  20. Simões, A.; Sanromán, Á.I.; Almeida, J.J. Dicionário-Aberto: A Source of Resources for the Portuguese Language Processing. In Proceedings of 10th International Conference on Computational Processing of the Portuguese Language (PROPOR 2012); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2012; Volume 7243, pp. 121–127. [Google Scholar]
  21. Barreiro, A. Port4NooJ: An open source, ontology-driven Portuguese linguistic system with applications in machine translation. In Proceedings of the 2008 International NooJ Conference (NooJ’08), Budapest, Hungary, 8–10 June 2008; Cambridge Scholars Publishing: Newcastle upon Tyne, UK, 2010. [Google Scholar]
  22. Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4444–4451. [Google Scholar]
  23. Santos, D.; Bick, E. Providing Internet access to Portuguese corpora: The AC/DC project. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May–2 June 2000; pp. 205–210. [Google Scholar]
  24. Gonçalo Oliveira, H.; Pérez, L.A.; Costa, H.; Gomes, P. Uma rede léxico-semântica de grandes dimensões para o português, extraída a partir de dicionários electrónicos. Linguamática 2011, 3, 23–38. [Google Scholar]
  25. Wilkens, R.; Zilio, L.; Ferreira, E.; Villavicencio, A. B2SG: A TOEFL-like Task for Portuguese. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23–28 May 2016; ELRA: Paris, France, 2016. [Google Scholar]
  26. Freitag, D.; Blume, M.; Byrnes, J.; Chow, E.; Kapadia, S.; Rohwer, R.; Wang, Z. New Experiments in Distributional Representations of Synonymy. In Proceedings of the 9th Conference on Computational Natural Language Learning (CONLL ’05), Ann Arbor, MI, USA, 29–30 June 2005; ACL Press: Stroudsburg, PA, USA, 2005; pp. 25–32. [Google Scholar]
  27. Agirre, E.; Soroa, A. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL’09), Athens, Greece, 30 March–3 April 2009; ACL Press: Stroudsburg, PA, USA, 2009; pp. 33–41. [Google Scholar]
  28. Hill, F.; Reichart, R.; Korhonen, A. Simlex-999: Evaluating Semantic Models with Genuine Similarity Estimation. Comput. Linguist. 2015, 41, 665–695. [Google Scholar] [CrossRef]
  29. Querido, A.; Carvalho, R.; Rodrigues, J.; Garcia, M.; Silva, J.; Correia, C.; Rendeiro, N.; Pereira, R.; Campos, M.; Branco, A. LX-LR4DistSemEval: A collection of language resources for the evaluation of distributional semantic models of Portuguese. Rev. Assoc. Port. Linguíst. 2017, 265–283. [Google Scholar] [CrossRef]
  30. Pilehvar, M.T.; Jurgens, D.; Navigli, R. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, 4–9 August 2013; Volume 1, pp. 1341–1351. [Google Scholar]
  31. Banjade, R.; Maharjan, N.; Niraula, N.B.; Rus, V.; Gautam, D. Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods. In Proceedings of 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2015); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2015; Volume 9041, Part I, pp. 335–346. [Google Scholar]
  32. Correia, R.; Baptista, J.; Eskenazi, M.; Mamede, N. Automatic generation of cloze question stems. In Proceedings of 10th International Conference on Computational Processing of the Portuguese Language (PROPOR 2012); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2012; Volume 7243, pp. 168–178. [Google Scholar]
  33. Gonçalo Oliveira, H.; Coelho, I.; Gomes, P. Exploiting Portuguese Lexical Knowledge Bases for Answering Open Domain Cloze Questions Automatically. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, 26–31 May 2014; ELRA: Paris, France, 2014. [Google Scholar]
  34. Fonseca, E.R.; dos Santos, L.B.; Criscuolo, M.; Aluísio, S.M. Visão Geral da Avaliação de Similaridade Semântica e Inferência Textual. Linguamática 2016, 8, 3–13. [Google Scholar]
  35. Gonçalo Oliveira, H.; Alves, A.O.; Rodrigues, R. Gradually Improving the Computation of Semantic Textual Similarity in Portuguese. In Progress in Artificial Intelligence, Proceedings of the 18th EPIA Conference on Artificial Intelligence, Porto, Portugal, 5–8 September 2017; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2017; Volume 10423, pp. 841–854. [Google Scholar]
  36. Gonçalo Oliveira, H. Unsupervised Approaches for Computing Word Similarity in Portuguese. In Progress in Artificial Intelligence, Proceedings of the 18th Portuguese Conference on Artificial Intelligence (EPIA 2017), Porto, Portugal, 5–8 September 2017; Springer: Berlin, Germany, 2017. [Google Scholar]
  37. Gonçalo Oliveira, H. CONTO.PT: Groundwork for the Automatic Creation of a Fuzzy Portuguese Wordnet. In Proceedings of 12th International Conference on Computational Processing of the Portuguese Language (PROPOR 2016); Lecture Notes in Computer Science; Springer: Berlin, Germany, 2016; Volume 9727, pp. 283–295. [Google Scholar]
