The outlier detection task is based on standard vocabulary-learning techniques and, more specifically, on the lexical questionnaires used in language exams [23]. More precisely, given a group of words, the objective of the task is to identify the word that does not belong to the group. The task is oriented towards separating synonyms and co-hyponyms (i.e., clusters of words with similar referents) from words that are neither synonyms nor co-hyponyms. Thus, it is not suited to identifying topical relationships.
3.1. The Compactness Score
In Camacho-Collados and Navigli [3], the outlier detection task was defined on the basis of a generic concept of compactness score that admits both symmetrical and asymmetrical measures. Here, we propose a more specific compactness score by assuming that the similarity/distance coefficient must be symmetrical (e.g., cosine). Intuitively, given a set of nine words consisting of eight words belonging to the same group and one outlier, the compactness score of each word of the set is the result of averaging the pair-wise similarity scores of the target word with the other members of the set.
Formally, given a set of words $W = \{w_1, \ldots, w_n, w_{n+1}\}$, where $w_1, \ldots, w_n$ belong to the same cluster and $w_{n+1}$ is the outlier, we define the compactness score $c(w)$ of a word $w \in W$, assuming a symmetrical similarity coefficient $sim$, as follows:

$$c(w) = \frac{1}{n} \sum_{\substack{w_i \in W \\ w_i \neq w}} sim(w, w_i)$$
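This definition can be sketched in a few lines of Python. The toy two-dimensional vectors and the cosine helper below are illustrative stand-ins for real word embeddings, not part of the evaluation itself:

```python
import numpy as np

# Toy 2-d vectors standing in for real word embeddings (illustrative only).
VECTORS = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "cow": np.array([0.8, 0.3]),
    "car": np.array([0.1, 1.0]),  # intended outlier
}

def cosine(u, v):
    """Symmetrical similarity coefficient sim (here: cosine)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compactness(word, words):
    """Average pair-wise similarity of `word` with the other members of the set."""
    others = [w for w in words if w != word]
    return sum(cosine(VECTORS[word], VECTORS[w]) for w in others) / len(others)

scores = {w: compactness(w, VECTORS) for w in VECTORS}
predicted_outlier = min(scores, key=scores.get)  # lowest compactness score
print(predicted_outlier)  # -> car
```

Since the outlier shares little similarity with the cluster words, its averaged pair-wise similarity, and hence its compactness score, is the lowest of the set.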
An outlier is correctly detected if the compactness score of the outlier word is lower than the scores of all the cluster words. Camacho-Collados and Navigli [3] defined two evaluation coefficients to measure the degree of correctness: Outlier Position Percentage (OPP) and accuracy. The former relies on the Outlier Position (OP), i.e., the position of the outlier among the $n+1$ words of the set $W$ ranked by their compactness score, which ranges from 0 to $n$ (0 indicates the lowest overall score among all words in $W$, and $n$ indicates the highest overall score). To compute the overall score on a dataset $D$ (composed of sets of words), OPP is defined as follows:

$$OPP = \frac{\sum_{W \in D} \frac{OP(W)}{n}}{|D|} \times 100$$
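A direct translation of OPP into code might look as follows. The score dictionaries are made-up illustrations, and the ascending ranking convention (position 0 for the lowest compactness score) follows the description above:

```python
def outlier_position(scores, outlier):
    """OP: rank of the outlier when the n+1 words of a set are ordered by
    ascending compactness score (0 = lowest score, n = highest)."""
    return sum(1 for s in scores.values() if s < scores[outlier])

def opp(results, n):
    """results: list of (compactness-score dict, outlier word) pairs, one per set."""
    return 100.0 * sum(outlier_position(sc, out) / n for sc, out in results) / len(results)

# Two hand-made sets (n = 2 cluster words each; scores are made up):
results = [
    ({"a": 0.9, "b": 0.8, "out": 0.1}, "out"),  # outlier ranked lowest: OP = 0
    ({"a": 0.9, "b": 0.2, "out": 0.5}, "out"),  # one word scores lower: OP = 1
]
print(opp(results, n=2))  # (0/2 + 1/2) / 2 * 100 = 25.0
```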
On the other hand, Camacho-Collados and Navigli [3] also defined accuracy, which measures how many outliers were correctly detected by the system, divided by the total number of detections: $12 \times 8 = 96$ in our 12-8-8 dataset. More formally, given Outlier Detection (OD), defined as 1 if the outlier is correctly detected (and 0 otherwise), accuracy is defined as follows:

$$Accuracy = \frac{\sum_{W \in D} OD(W)}{|D|} \times 100$$
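The accuracy coefficient can be sketched in the same style, again over hand-made score dictionaries rather than real embedding output:

```python
def od(scores, outlier):
    """OD = 1 if the outlier has the strictly lowest compactness score, else 0."""
    return int(all(scores[outlier] < s for w, s in scores.items() if w != outlier))

def accuracy(results):
    """results: list of (compactness-score dict, outlier word) pairs, one per set."""
    return 100.0 * sum(od(sc, out) for sc, out in results) / len(results)

# Two hand-made sets (scores are made up):
results = [
    ({"a": 0.9, "b": 0.8, "out": 0.1}, "out"),  # detected:     OD = 1
    ({"a": 0.9, "b": 0.2, "out": 0.5}, "out"),  # not detected: OD = 0
]
print(accuracy(results))  # 1 correct detection out of 2 -> 50.0
```

Note that accuracy is stricter than OPP: a set in which the outlier is ranked second lowest contributes nothing to accuracy, but still contributes partially to OPP.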
3.2. New Benchmarks for the Outlier Detection Task
For the outlier detection task, Camacho-Collados and Navigli [3] provided the 8-8-8 dataset (http://lcl.uniroma1.it/outlier-detection/), which consists of eight different topics, each containing a cluster of eight words and eight outliers that do not belong to the given topic. For instance, one of the topics is “European football teams”, which was defined with a set of eight nouns (see Table 2) and a set of eight outliers (Table 3).
To improve and expand this first dataset, we developed an extended version. In order to increase the number of examples, two annotators were asked to create four new topics and, for each topic, to provide a set of eight words belonging to the chosen topic and a set of eight heterogeneous outliers. One annotator proposed all the words in less than 15 min, and the other simply agreed with all the decisions made by the first. This 100% inter-annotator agreement contrasts with the low agreement levels achieved in the standard word similarity datasets; for instance, in WordSim353 [15], the average pair-wise Spearman correlation among annotators is merely 0.61. The new expanded dataset, called 12-8-8, contains 12 topics, each made up of 8 + 8 topic words and outliers. In addition, in order to simplify the comparison with systems that do not identify multi-words, we replaced the multi-words found in the 8-8-8 dataset with one-word terms denoting similar entities: for instance, the terms Celtic and Betis were used instead of Atletico Madrid and Bayern Munich, all referring to football teams. The 12-8-8 dataset contains 50% more test examples than the original one. Finally, we also created a new dataset by translating 12-8-8 into Portuguese.
One of the main problems in creating new clusters and outliers is the difficulty of finding a set of words that is quite similar to the cluster but does not belong to it. For instance, take one of the new topics, names of colors, included in the 12-8-8 dataset and shown in the first column of Table 4. There should be no disagreement about membership in this class. However, to make the search for outliers more challenging, it is necessary to find words semantically related to the topic that nevertheless do not belong to the class of colors. Some of these words are close hypernyms such as color, property, or substance (see the second column of Table 4). Annotators were instructed to use at least four words semantically closely related to the target topic. In Table 4, the first five outliers (in bold) are semantically related to the class of colors. The more words of this type there are among the outliers, the more difficult the detection task becomes.
Given the characteristics of the task, it is easy to exploit all kinds of taxonomies in any domain of knowledge, for example, zoological knowledge. Table 5 shows the topic of ruminants and its outliers. Notice that the set of outliers also contains very difficult cases, namely the names of animals that are similar to ruminants but do not belong to that zoological category (in bold in the second column).
The outlier detection task is conceptually closer to the TOEFL-style test than to tasks relying on correlation with human similarity judgments. However, unlike the TOEFL test, the outlier task is conceived so that non-professional annotators can easily build new test datasets. It also allows comparing a word (the outlier) with a larger set of words (not just three or four candidates per target word). The outlier is, in fact, compared against a cluster of words belonging to the same lexical class, e.g., mammals, colors, football teams, prime ministers, German people, fresh vegetables, or any other predefined class. This makes it possible to perform more word comparisons and, therefore, to increase the coverage of the test.
Finally, as the groups are homogeneous and the word meanings are contextualized, and therefore not ambiguous, there is much less subjectivity than in the correlation test. This facilitates and favors inter-annotator agreement. It also makes the datasets easier to translate into other languages without requiring much adaptation.