In this section, we evaluate our APD algorithms on football matches and basketball games, and contrast our implementations with standard NER techniques. Our evaluation shows how NER techniques do not suffice to detect event participants, but that APD can significantly improve performance. Before we discuss these results, we describe our evaluation set-up, including the datasets, ground truth, and baselines.
In this discussion, we analyse how people talk about events on Twitter before events start, and how our APD algorithms and off-the-shelf NER libraries perform in this environment. A glance at Figure 3
makes it immediately clear that the discussion starts well before the event itself. The figure shows a sharp spike in tweeting volume in our football datasets around the time when clubs release their line-ups for the upcoming match. This behaviour reflects the importance that users attribute to participants, as evidenced by how announcing the starting line-ups propels discussion.
However, not all participants are equally important. The best way to understand how Twitter users talk about events is by examining the rankings produced by NLTK and TwitterNER. The low precision and recall results of these baselines stand out in Table 2
, only around a quarter of NLTK’s and TwitterNER’s top 50 participants are valid. Intuition would have it that NER tools return many false positives because of Twitter’s informal nature, but that is an inaccurate assessment. The poor results of the baselines do not reflect poorly on the quality of NLTK and TwitterNER, which are relatively capable of overcoming Twitter’s challenges, but on the way Twitter users talk about events.
In fact, discussions on Twitter go beyond the event and consider its broader context; the result of one football match may affect other teams, so Twitter users often talk about these effects. Other contexts are more unique. The passing of ex-Boston Celtics player K.C. Jones shortly before the NBA game opposing Boston Celtics and Brooklyn Nets generated many tributes, which led to all algorithms mistaking him as a participant.
Another explanation for the poor precision is that there are several ways of referring to participants, many of them colloquial and redundant. For example, TwitterNER captured three different references to Liverpool in the top 10 ranking of the match between Liverpool and Burnley: Liverpool, the reds, and LFC.
The tweeting behaviour does not only impact precision, but recall too. Logically, the fact that football clubs release the line-ups on Twitter should make it easier to capture participants. However, it has become common practice to announce the players using images or videos, which NER techniques cannot parse. In NBA, the starting five players are published much later, so sometimes the participants remain uncertain until the very start of the game, making for an even more challenging task.
There are also inequalities among participants, with star players generally attracting far more attention than the others. For example, out of 6889 tweets in the corpus of the match between Barcelona and Atlético Madrid, only 4 mention Atlético Madrid’s defender, José Giménez.
All of these factors explain why NER is inadequate to solve the APD problem on its own, no matter how accurate the library is at identifying named entities. APD’s two steps that go beyond NER, resolution and extrapolation, solve these challenges by rejecting non-participant named entities and looking for missed participants respectively.
The resolution step in our APD approach is relatively successful in overcoming the low precision values, as shown in Table 2
. Naturally, the recall values at the resolution stage cannot exceed the baseline’s results, but precision increases drastically at a small expense to recall. In other words, our APD implementation correctly resolves most valid participants while simultaneously filtering out noisy named entities.
While the role of resolution is to automatically-generate a seed set of participants with high precision, extrapolation looks for the missing participants to improve recall. The final rankings of our APD algorithms is the list of resolved participants followed by the list of extrapolated participants. We retain the top 50 participants in this ranking, and Table 3
shows how our APD algorithms’ final rankings see a sharp increase in recall with a smaller drop in precision.
Extrapolation achieves its goal of improving recall, often needing very few examples of participants to identify others. At the same time, while precision decreases from resolution, it remains much higher than the baselines’ precision values. We used the one-tailed paired samples t
-test to evaluate the improvements of our APD approaches over the baselines’ results of Table 2
. With p
-values well below 0.01, the improvements in precision and recall are all statistically-significant at the 99% confidence level.
We attribute the drops in precision to two factors. First, extrapolation captures several entities that would normally be participants. For example, Ousmane Dembélé plays regularly for Barcelona, but he was injured for the match against Atlético Madrid. Even his absence, Dembélé could be interpreted as being an indirect participant, as described in Section 3
. Regardless of the interpretation, since our APD approach was unaware of his status, it identified him as a participant along the other Barcelona players.
Second, extrapolation captures several tangential concepts. For example, the teams and players in English football matches are related to the Premier League. Therefore when extrapolation could not find more relevant players, it continued listing concepts related to the Premier League and English teams in general. We also note that the three lowest precision values in the football matches came from the smallest datasets, indicating that the more comprehensive the pre-event domain is, the better the results.
A peculiar behaviour of our APD approach is that it often recalls around half of all participants, both in football and basketball. This value is not incidental. When one team is more popular than the other, NER and resolution predominantly capture players from that team. Consequently, extrapolation looks for participants that also belong to the more popular team. When this happens, our APD approach captures almost all of the participants from the popular team, but few from the other team, bringing the recall values to around 50% or 60%.
This observation exposes extrapolation’s reliance on resolution, and is confirmed in other datasets. Of particular interest are the matches between Liverpool and Burnley, and between Liverpool and Chelsea. These two events had similar recall values after resolution, but the latter did not succumb to bias. This happened because most of the resolved participants in the first match were Liverpool players, as shown in Figure 4
, so extrapolation sought other Liverpool players. Conversely, APD resolved a mix of Liverpool and Chelsea players in the second match, and extrapolation could detect participants from both sides.
Finally, the MAP results confirm that our APD approaches rank participants higher than unrelated concepts. Whereas NLTK and TwitterNER obtain MAP results of 0.36 and 0.46 respectively, our APD implementations based on NLTK and TwitterNER obtain MAP results of 0.68 and 0.69 respectively.