The results of these three strategies for the Arabic and English data from the SemEval task are given in Table 3. Mohammad et al. [5] included an SVM-based algorithm as a baseline. Since we were unable to obtain any useful results using SVMs ourselves, we have included their Jaccard results on the SemEval test data in these tables (we do not have their precision, recall and F-measure results), alongside the scores for WCP on the test data.
For both datasets, the two DNN algorithms had very similar scores. The multi-DNN versions, however, took much longer to train, since they required 11 distinct classifiers to be trained and tested. The DNNs seemed to outperform the SVM, but it should be noted that the SVM was tested on the SemEval test set, which may be easier than the development set. The ratios between WCP’s scores on the development sets and the DNNs’ scores were lower than the ratios between WCP’s scores on the test sets and the SVM’s score (in plain English: WCP beat the SVM by more than it beat the DNNs; the DNNs were better than the SVM).
In general, the results for the Arabic and English were similar. Considering Arabic first, WCP performed well at identifying true positives. Some emotions (anger, joy, love, sadness) were more accurately identified than emotions such as anticipation, surprise and trust. Note, however, the extremely small number of training and test tweets for these emotions. True negatives were also identified well, for all emotions, with the lowest accuracy being 0.788. Where true positives and true negatives were high, the opposite metrics, false negatives and false positives, were low. This was not true for the three difficult emotions anticipation, surprise and trust. Since WCP failed to identify these emotions adequately, the false negative measures were extremely high. Indeed, for trust, WCP failed to classify even one tweet correctly.
For the English dataset, WCP again generally performed well at identifying true positives, apart from the same difficult emotions, notably anticipation. For surprise, this may again be due to the small number of training tweets. However, for anticipation, the number of tweets could not have been a factor, because fear had a similar number of tweets, yet WCP performed much better on fear. WCP also performed well at identifying true negatives, even on difficult emotions such as anticipation. The accuracies were all above 0.7 and hence not particularly low; the lowest accuracies were on anger. These figures were evidence that the lexicon was doing its job, using high-scoring words to classify a tweet while at the same time preventing incorrect classifications. Consider the following tweet that was not annotated for anger, but was incorrectly classified as anger by WCP (i.e., a false positive):
“If you build up resentment in silence are you really doing anyone any favors”
When this tweet was processed by WCP, tokens such as “resentment” and “silence” had large scores for anger and thus contributed significantly to taking the score beyond the threshold for anger. It can be seen that these words can reasonably be considered as words that may be used to convey anger, e.g., in the tweets “Anger, resentment, and hatred are the destroyer of your fortune today.” and “I’m going to get the weirdest thank you note–or worse–total silence and no acknowledgement.”. The most common false positive misclassifications were joy being misclassified as optimism, anger being misclassified as disgust and sadness being misclassified as disgust. It is interesting to note that surprise-trust was the only mismatch that did not occur.
Although WCP performed well on some emotions, there were some that it found hard to classify with high accuracy. A number of possible factors may have caused WCP to perform poorly on these emotions, including excessive emotion co-occurrences in tweets, too few shared test and training tokens, and a lack of emotion-bearing words.
The experiments performed highlighted a number of factors that affected the performance of the classifier:
The effects of combining preprocessing steps such as lowercasing, removing punctuation and tokenising emojis were positive for the Arabic and the English datasets.
Expanding hashtags was a beneficial step for the English dataset, but detrimental for the Arabic dataset. This was because, out of the 5448 distinct hashtags in the Arabic dataset, only 1168 (21%) appeared five or more times, which reduced their ability to have any meaningful impact on the classifier.
Stemming using the tags from Albogamy and Ramsay’s tagger almost always decreased classifier performance.
There were emotions (e.g., trust) that WCP found difficult to classify.
The sizes of the training and test datasets and the proportions of tweets for each emotion were significant factors in classifier performance. Increasing the training dataset size only had a positive effect if the new data were from a similar domain and were annotated in a similar fashion.
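The combined preprocessing steps can be sketched as below; the emoji range, the camel-case hashtag splitter and all function names are illustrative assumptions, not the implementation used in the experiments (in particular, real hashtag expansion needs a dictionary-based word splitter):

```python
import re
import string

# Rough emoji range; a production system would use the full Unicode emoji data.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def expand_hashtag(match):
    # "#FumingMad" -> "fuming mad" (camel-case split only, for illustration).
    words = re.findall(r"[A-Z]?[a-z]+|\d+", match.group(1))
    return " ".join(w.lower() for w in words)

def preprocess(tweet, expand_hashtags=True):
    if expand_hashtags:
        tweet = re.sub(r"#(\w+)", expand_hashtag, tweet)
    tweet = tweet.lower()
    # Remove punctuation (emojis are untouched by string.punctuation).
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    # Tokenise emojis: surround each with spaces so it becomes its own token.
    tweet = EMOJI.sub(lambda m: " " + m.group(0) + " ", tweet)
    return tweet.split()

print(preprocess("So #FumingMad right now!! 😠"))
# ['so', 'fuming', 'mad', 'right', 'now', '😠']
```

With `expand_hashtags=False` (as for the Arabic dataset), the `#` is simply stripped with the other punctuation and the hashtag body is kept as a single token.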
Although WCP does have its limitations as described above, we observed that it outperformed many of the SemEval-2018 competition entrants, with significantly lower computational complexity.
In order for WCP to be effective, various preprocessing steps were used. For evaluation purposes, two vastly different languages were used, and the structures of Arabic and English differ in many ways: Arabic is written from right to left; most Arabic letters join each other when written; and many word forms are created by adding letters to a three-letter root, each with a nuanced, specific meaning. Consequently, the preprocessing was easier for English than for Arabic.
It was observed that simplistic tagging performed better than taggers trained on non-Twitter datasets. For English, standard Morphy stemming was used; for Arabic, our own stemmer was particularly valuable because existing Arabic stemmers did not stem effectively. Although other entrants would also have had Arabic taggers and stemmers, we believe that the tools we used gave us a slight advantage.
The steps in WCP had the effect of boosting a token’s score for an emotion when that score lay far from the token’s mean score across all emotions. This amounted to a form of double counting: subtracting the mean gave neutral tokens a score close to zero, and then multiplying the actual score for each emotion by this deviation emphasised the emotions for which the token was significant. This step simultaneously allocated words to emotions and gave extra weight to words that were much more important for one emotion than for the others. It thus implicitly paid attention to both correlations and anti-correlations between emotions and allowed unhelpful tokens to be removed in the next stage.
Consider the word “rage”. Figure 4 shows how the probabilities evolved into scores. The scores for the token started as probabilities and were very similar to each other. They remained closely clustered during normalisation and also during the step where the average was subtracted from the normalised value. It was only when the values were skewed that the scores started to become distinguishable from each other in a meaningful fashion.
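The transformation described above can be sketched for a token such as “rage”; the counts are invented for illustration, and the exact normalisation and skewing formulas are our assumptions based on the description:

```python
# Hypothetical per-emotion counts for the token "rage" (invented numbers).
counts = {"anger": 30, "fear": 8, "joy": 1, "sadness": 6}
total = sum(counts.values())

# Step 1: raw probabilities P(emotion | token); similar in scale at first.
probs = {e: c / total for e, c in counts.items()}

# Step 2: normalise by the maximum, so the top emotion scores 1.0.
m = max(probs.values())
norm = {e: p / m for e, p in probs.items()}

# Step 3: subtract the mean, pushing neutral emotions towards zero.
mean = sum(norm.values()) / len(norm)
centred = {e: s - mean for e, s in norm.items()}

# Step 4: "skew" - multiply each normalised score by its deviation from
# the mean, emphasising emotions for which the token is distinctive.
skewed = {e: norm[e] * centred[e] for e in norm}

print(max(skewed, key=skewed.get))  # anger
```

Only after step 4 do the scores separate meaningfully: emotions for which the token is distinctive stay large and positive, while the rest collapse towards (or below) zero.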
Figure 5 and Figure 6 show the number of tokens that were removed from each emotion at each threshold during the autocorrection stage for the Arabic and English datasets (the colour scale runs from green, for the smallest values, to red, for the largest). For both datasets, it can be seen that as the threshold decreased, more tokens were removed by the autocorrection process. This is because, as the threshold decreased, a smaller and smaller score was enough to classify the tweet as an emotion; this led to more false classifications, and thus more tokens were removed. Generally, the pure emotions were the least autocorrected. This is indicative of strong, unambiguous words being used in these tweets. The tweets that WCP had difficulty with (anticipation) had the highest numbers of autocorrections. This corresponds with the earlier findings that these tweets contained the highest numbers of multiple emotions, and thus it was more likely that tokens would appear in other correctly classified tweets, hence invoking autocorrection.
At the largest threshold, 1.0, the fewest tokens were autocorrected. Autocorrection involved scoring and classifying tweets. Recall that one of the steps in scoring a tweet was to divide the scores for each emotion by the maximum emotion score. This process, by definition, set one of the emotion scores to 1.0. At the largest threshold, therefore, only one emotion could be classified correctly; in effect, WCP reverted to a single-emotion classifier at this threshold. Hence, the largest threshold is perhaps of little use and should be discarded for autocorrection.
Tokens that appeared in multiple tweets labelled with multiple emotions were more likely to be removed due to the WCP autocorrection process. In other words, the more different emotions a token appeared against, the more likely it was to be removed.
Note that the autocorrection was only applied to true positives and false positives, i.e., when the tweet was annotated as having an emotion by the annotators. Only when a tweet had been annotated as having an emotion could autocorrection decide if it was useful or not; hence, tokens for tweets that had not been marked for an emotion were not used for autocorrection purposes.
Many tokens that were removed had no relationship to the emotion they were removed from and were removed only on the basis of being seen in a number of tweets that just happened to be misclassified, marginally, more often than classified correctly. For example, tokens such as “silver”, “mope” and “drink” (anger), “address”, “year” and “million” (fear), “buzz”, “older” and “jersey” (joy) and “son”, “inside” and “roll” (sadness) were all autocorrected on this basis.
Generally, those tokens that were removed that had much higher negative counts were more non-relevant to the emotion in the sense that one could easily imagine that they would not be useful in the emotions from which they were removed; for example, “horror”, “revenge” and “grudge” (anticipation), “gloomy”, “haunt” and “sadly” (love) and “nightmare”, “panic” and “terrible” (pessimism). These tokens with much higher autocorrection counter values were predominantly in the non-pure emotions that were difficult to classify. This indicates the difficulty of the exercise, that even after downplaying tokens that were unhelpful, it was still difficult to classify these tweets. It was also seen that some autocorrected tokens had multiple meanings (e.g., “cool”, “wicked”, “bad”, “gay”). These were tokens that originally meant something completely different (and sometimes opposite in emotion) to how they are used today.
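The removal criterion can be sketched as follows; this is a simplification in which a token's count in correctly classified tweets is compared directly against its count in misclassified ones, and the data layout and function names are our own, not the paper's exact bookkeeping:

```python
from collections import defaultdict

def autocorrect(lexicon, tweets, emotion):
    """Drop tokens from `emotion`'s lexicon that occur more often in tweets
    wrongly classified as that emotion (false positives) than in correctly
    classified ones (true positives). `tweets` holds triples of
    (tokens, gold_emotions, predicted_emotions)."""
    good = defaultdict(int)  # token seen in a correct classification
    bad = defaultdict(int)   # token seen in a misclassification
    for tokens, gold, pred in tweets:
        if emotion not in pred:
            continue  # only tweets classified as this emotion are considered
        counter = good if emotion in gold else bad
        for tok in set(tokens):
            counter[tok] += 1
    return {tok: score for tok, score in lexicon.items()
            if bad[tok] <= good[tok]}

anger_lexicon = {"fuming": 0.9, "silver": 0.4}   # illustrative scores
tweets = [
    (["fuming", "mad"], {"anger"}, {"anger"}),       # true positive
    (["silver", "ring"], {"joy"}, {"anger"}),        # false positive
    (["silver", "screen"], {"sadness"}, {"anger"}),  # false positive
]
print(autocorrect(anger_lexicon, tweets, "anger"))  # {'fuming': 0.9}
```

In this toy example, “silver” is removed for anger purely because it happened to appear in more misclassified tweets than correctly classified ones, mirroring the marginal removals described above.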
Autocorrecting tokens had two effects:
Decreased the likelihood that tweets containing autocorrected tokens would be incorrectly classified.
Increased the likelihood that genuine tweets containing autocorrected tokens would be correctly classified.
Since the autocorrected tokens had been identified as being detrimental, the overall effect was that this increased the accuracy of the classifier.
It is important to note that this reclassification was carried out on the original training data. This was methodologically sound, as the training data was not used for testing; it was simply reused as part of the overall training process. Experiments were performed where a portion of the training data was set aside for this purpose, but it turned out to be more effective to reuse the full set. Evidently, it was more important to have as much data as possible for this purpose than to keep the training and retraining data separate. This approach of autocorrecting was based on the suggestion of Brill [30] that one should attempt to learn from one’s own mistakes.
A limitation of autocorrecting, however, is that there were tokens that were incorrectly autocorrected that could, conceivably, have been useful in the emotion from which they were autocorrected; for example, “intimidate”, “defend” and “war” (fear), “cheerfulness”, “cuddle” and “joyous” (love) and “together”, “grateful” and “assistance” (trust). Autocorrection for tokens such as these was carried out purely on the basis that they appeared in more non-useful tweets. More training data may have rectified this issue, as it could have been expected that there would have been more instances of the correct use of the tokens, thus preventing them from being autocorrected. However, even this may not have fully rectified the problem, because there were tokens with large counts indicating that, regardless of what one might believe, they genuinely were not helpful to the emotion, e.g., “nervous” (anticipation), “anxiety” (pessimism) and “love” (trust).
On the whole, the autocorrection process was beneficial, improving the Jaccard score on the Arabic dataset from 0.342 to 0.370 and on the English dataset from 0.401 to 0.431.
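For reference, the Jaccard score used throughout is the per-tweet intersection-over-union of the gold and predicted emotion sets, averaged over tweets. A minimal sketch (the convention for a tweet with no gold and no predicted labels is our assumption):

```python
def jaccard_score(gold, pred):
    """Mean per-tweet Jaccard: |G ∩ P| / |G ∪ P|, averaged over tweets.
    A tweet with no gold and no predicted labels scores 1 (our convention)."""
    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        total += 1.0 if not union else len(g & p) / len(union)
    return total / len(gold)

# Toy example: exact match, partial overlap, and a complete miss.
gold = [{"anger"}, {"joy", "optimism"}, {"sadness"}]
pred = [{"anger", "disgust"}, {"joy", "optimism"}, set()]
print(jaccard_score(gold, pred))  # 0.5
```

Because credit is partial, a classifier that predicts one extra or one missing emotion per tweet is penalised gradually rather than scored zero, which is why the 0.342 to 0.370 and 0.401 to 0.431 improvements above correspond to many individually corrected labels.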
Experiments were performed running autocorrection multiple times, but it was found that very few words were removed after the first iteration. One possible explanation is that the actual numbers of tokens removed were quite small: 1% for Arabic and 5% for English.
Choosing a threshold above which a new tweet was classified as an emotion vs. a non-emotion was an important step. The raw data for each emotion were different, and hence, a single fixed threshold across all emotions produced poor results.
This step was implicitly carried out by the multi-DNN models: when a DNN was used as a classifier, i.e., when there were two output nodes, one for YES and one for NO, it calculated the optimal excitation level for the YES node. Thus, when a set of DNNs was used for multi-classification, each one had its own threshold. For the multi-DNN, it is likely that a single threshold was chosen for all the output nodes.
Figure 7 and Figure 8 show the results for the Arabic and English datasets when WCP was used with a number of fixed thresholds, compared to when a varying threshold was allowed, as well as the best thresholds for each emotion as determined by WCP. It is clear that WCP performed better on some emotions with higher thresholds and on others with lower thresholds. The “t2” column indicates the threshold used, followed by the Jaccard score for each emotion at that threshold and the overall Jaccard score for all the tweets in the last column. The last row in the table shows the best Jaccard scores for each emotion and the overall Jaccard score. The black line shows the best thresholds for each emotion. The results for both datasets shared a number of characteristics. WCP selected lower thresholds for good performance on anger. However, for disgust, the same threshold was selected for the Arabic dataset, whereas for the English dataset the thresholds were variable. The last four emotions showed a clear pattern: pessimism required low thresholds; surprise needed a threshold higher than these; and trust needed a small threshold for good performance.
It is interesting to note the scores at these thresholds (recall that although the thresholds were important, the aim was to maximise the scores). WCP scored highly for anger at the low thresholds regardless of the dataset. However, this was the only correlation between the two datasets. For every other emotion, both datasets behaved differently. Some emotions needed high thresholds, but still scored poorly (e.g., surprise in the Arabic dataset); some emotions had low thresholds, but scored poorly (e.g., trust in the English dataset); some emotions had high thresholds and scored highly (e.g., fear in the English dataset); and some emotions had low thresholds and scored highly (e.g., pessimism in the Arabic dataset). This highlights that there were general similarities, but also key differences between the languages and how they were used to convey emotion in tweets.
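Since the paper does not specify how the best per-emotion thresholds were found, a plausible sketch is a simple grid search over a development set, maximising each emotion's Jaccard independently; the data layout and function names here are our own:

```python
def best_thresholds(scores, gold, emotions, grid=None):
    """For each emotion, pick the threshold that maximises that emotion's
    Jaccard on held-out data. `scores[i][e]` is tweet i's normalised score
    for emotion e; `gold[i]` is its set of annotated emotions."""
    grid = grid or [t / 10 for t in range(11)]
    best = {}
    for e in emotions:
        def jacc(t):
            tp = sum(1 for s, g in zip(scores, gold) if s[e] >= t and e in g)
            fp = sum(1 for s, g in zip(scores, gold) if s[e] >= t and e not in g)
            fn = sum(1 for s, g in zip(scores, gold) if s[e] < t and e in g)
            denom = tp + fp + fn
            return tp / denom if denom else 1.0
        best[e] = max(grid, key=jacc)  # first threshold achieving the maximum
    return best

# Toy example: one high-scoring true tweet, one low-scoring negative tweet.
scores = [{"anger": 0.9}, {"anger": 0.2}]
gold = [{"anger"}, set()]
print(best_thresholds(scores, gold, ["anger"]))  # {'anger': 0.3}
```

In the toy example, any threshold at or below 0.2 admits the negative tweet as a false positive, so the first threshold that excludes it (0.3) maximises the per-emotion Jaccard.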
For both datasets, WCP performed better at lower thresholds for anger. This reflected the fact that there were many strong words that were highly indicative of anger alone, e.g., “fuming”, “inflame”, “pissed”, “scorn”, “furious” and “rage”. Words used to convey anger in the Arabic dataset were, perhaps, indicative of the current situation in certain Middle Eastern countries: الحوثيين (“Houthis”), العسكر (“military”), اعتداء (“assault”). Both sets of words were indicative of anger, and consequently, obtaining a high score was achieved more easily than for some of the other emotions. However, at the higher thresholds, the Jaccard score decreased because anger was confused with the other emotions, e.g., disgust. A small threshold was clearly not the only criterion for a good score, as can be seen in both datasets, where the best Jaccard for anger was much higher than the best Jaccard for anticipation, although both had low thresholds. It is interesting to see why the best thresholds were low for some emotions and high for others.
Higher scoring tokens for an emotion were indicative of strong emotion-bearing words for that emotion. Tweets that were classified as an emotion because they contained highly emotive words for that emotion remained classified as that emotion, regardless of the threshold. For example, consider the tweet: “Bloody parking ticket _145__217_#fuming”, where _145_ represents the UNAMUSED FACE emoji and _217_ represents the MONEY WITH WINGS emoji (the latter was used only in this one tweet and, as a singleton, was removed, so it had a score of zero). Due to the presence of “#fuming” and “fume” (and their high scores), the normalised score for anger was 1.0 (i.e., the largest possible value). Consequently, this tweet was always classified as anger, regardless of the threshold.
However, this was not always the case, especially for tweets that contained words that were less emotive or could also be used for other emotions. For example, consider the tweet “brb going to sulk in bed until friday”. For anger, the tweet only scored 0.177, largely on the basis of the token “sulk”. Consequently, this tweet would only be classified as anger at the 0 and 0.1 thresholds. Beyond the 0.1 threshold, it would be classified as a false negative for anger, although it may be classified correctly for some other emotion.
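The effect of the per-emotion thresholds on a single tweet can be sketched as below; the anger score of 0.177 is from the example above, while the sadness score and the threshold values are invented for illustration:

```python
def classify(tweet_scores, thresholds):
    """Label a tweet with every emotion whose normalised score reaches
    that emotion's threshold (a sketch of the final decision step)."""
    return {e for e, s in tweet_scores.items() if s >= thresholds[e]}

# The "sulk in bed" tweet scored 0.177 for anger; the sadness score here
# is invented for illustration.
scores = {"anger": 0.177, "sadness": 0.85}
print(sorted(classify(scores, {"anger": 0.1, "sadness": 0.3})))  # ['anger', 'sadness']
print(sorted(classify(scores, {"anger": 0.2, "sadness": 0.3})))  # ['sadness']
```

Raising anger's threshold past 0.177 turns the tweet into a false negative for anger while leaving its other labels untouched, which is exactly the per-emotion trade-off the thresholds control.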
Every emotion had a different optimal threshold; we believe this is because different tokens were important for different emotions and also because there were different numbers of training tweets for each emotion. If there were more training tweets for trust, for example, there might have been more words specifically used for trust, and this might have led to more high-scoring tokens for trust.
Every emotion had tokens that scored highly and tokens that scored poorly. However, recall that for tokens that did not occur in some emotion (e.g., “furious” did not appear in any trust tweets), these scores were always large and negative. For example, recall that trust had the smallest number of training tweets. Consequently, for the trust emotion, the lexicon contained relatively few tokens that genuinely came from trust tweets, but many tokens with large negative scores that came from other emotions, which would vote against trust. Thus, according to Figure 7 and Figure 8, for a tweet to be classified as trust, it merely had to achieve a score large enough to exceed the minimum threshold. Even with this minimal threshold, the Jaccard score for trust was reasonable; however, this may be due to the small number of test tweets.
The higher the token scores were for an emotion, the more likely that tweets with those tokens would generate high normalised tweet scores (i.e., closer to the maximum threshold of 1.0). This made these scores larger than many of the (lower) thresholds, and consequently, the likelihood of the tweet being classified as that emotion was increased. In other words, emotions containing tokens with high scores tended to be easier to classify, and the Jaccard scores for these emotions tended to be higher. However, as seen in the tweet about “sulking in bed until Friday”, not all angry tweets were straightforward to classify, and due to these tweets, as the threshold got higher, more and more angry tweets were classified as false negatives.
Although the final Jaccard scores were similar for both datasets, it is interesting to see that the emotive words used in the datasets were very different. It was observed that Arabs tend to use hashtags infrequently in tweets and that many of the top tokens in the Arabic dataset were referring to the situation in Yemen (“Houthis”, “victory”, “our land”). In general, it was difficult to draw concrete conclusions from the results because the Arabic dataset was small.
The top Arabic dataset tokens also scored substantially higher than their English counterparts. It is important to clarify that high scores did not indicate that a token appeared many times in an emotion or in a dataset; rather, they were an indication of the importance of the token to an emotion relative to the other emotions. Consider the token الانتصار (“the victory”). This was the highest scoring token for trust in the Arabic dataset. However, this token appeared in only five tweets throughout the dataset, three of which were classified as trust:
“Experience killed mercenaries of Taiz with a weapon and I killed the mercenaries of Twitter. Praise be to God for these victories. Hehe”
“The feeling of victory”
“Some fans were surprised about the amount of frustration inside them, trust your team and let the predestination does as it please. Hala Madrid! after defeat comes victory”
The token also appeared twice in tweets classified as optimism. Thus, the probability for the token was greater for trust than it was for optimism. This difference led to WCP ultimately calculating a high score for the token for trust. In other words, large scores were generated on the basis of a small number of tokens; these tokens then went on to contribute over-excessively to tweets being classified as trust.
This step had a reasonable impact on the performance of WCP, since it implicitly weighed up the likelihood of a given emotion occurring at all, as well as being sensitive to the fact that different emotions may be expressed in different ways.
Separate thresholds for each emotion increased the overall scores from 0.370 to 0.452 on the Arabic dataset and from 0.431 to 0.455 on the English dataset.
These results show that this process, while clearly helpful, was not a major contributor to the difference between WCP’s performance on the target data and the multi-DNNs’ performance, since the multi-DNNs also had similar processes. However, the final results on the two datasets were extremely similar, thus confirming the robustness and adaptability of WCP.
With the key elements of WCP, autocorrection and thresholding, in place, significant performance gains were achieved. This is consistent with the initial intuitions that the algorithm should be allowed to drop unhelpful words and that a single threshold for all emotions would not be adequate. We also observed that certain emotions often presented together in tweets as pairs; for example, joy and optimism. We believe that this also had an effect on the results.
It was noted that WCP performed poorly on the trust emotion. However, only 2% of the tweets in the datasets were labelled as trust. This was an extremely modest number of tweets to learn from and may account for the low scores. However, the surprise emotion had an even smaller number of tweets, yet WCP managed to perform at least on par with some of the other emotions. This suggested something different about the tokens used in the trust and surprise tweets. The difference in performance may be due to the differences in the numbers of tweets that had multiple emotions: for example, trust had almost three times as many tweets with 2, 3 and 4 emotions as surprise. It was also noticeable that the English dataset seemed easier to classify.
It is possible that the annotation process also played a part in shaping the results. The crowdsourcing nature of the annotation left much scope for judgement; hence, datasets constructed in this way are more difficult to classify due to noise and misclassifications.
WCP was evaluated on the Arabic and English datasets provided for the task by the competition organisers, where it ranked second for the Arabic dataset (out of 14 entries) and 12th for the English dataset (out of 35 entries). In the Arabic competition, WCP came second to Badaro et al. [31], who presented a system called “EMA” (emotion mining for Arabic), reporting that a linear SVC (support vector classifier) in conjunction with word embeddings gave the best performance. These word embeddings were obtained from AraVec [32], a word embedding model built from tweets, web pages and Wikipedia; more than three billion tokens were used to build the models. We believe that, in this context, being outperformed by only 0.005 serves to highlight the effectiveness of our approach. In the English competition, the entries that outperformed us were either based on neural networks [6] or word embeddings [7], with a number of these also making use of external lexicons. The problem with using lexicons is that lexicons, stop-word lists and resources of this type are not always available for all languages. Furthermore, the language on Twitter is continuously evolving, and from this point of view WCP has the advantage that it can be retrained quickly and easily.