Towards robust word embeddings for noisy texts

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform the state of the art on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with this type of noisy text at the word embedding level that goes beyond the support for out-of-vocabulary words.


Introduction
Continuous word representations, also known as word embeddings, have been successfully used in a wide range of NLP tasks such as dependency parsing (Bansal et al., 2014), information retrieval (Vulić and Moens, 2015), POS tagging (Kutuzov et al., 2016), or Sentiment Analysis (SA) (Xiong et al., 2018). A popular scenario for NLP tasks these days is social media platforms such as Twitter (Lampos et al., 2017;Yang et al., 2018;Liang et al., 2018), where texts are usually written without following the standard rules, containing varying levels of noise in the form of spelling mistakes (socisl for social), phonetic spelling of words (dat for that), abbreviations for common phrases (tbh for to be honest), emphasis (yessss as an emphatic yes) or incorrect word segmentations (noway for no way). However, the most commonly-used word embedding approaches do not take these phenomena into account (Mikolov et al., 2013;Pennington et al., 2014;Bojanowski et al., 2016), and we instead rely on their implicit capacity to cope with non-standard words provided a large enough amount of varied training text, such as in (Sumbler et al., 2018).
Another possibility to tackle non-standard texts would be to apply some preprocessing step that removes the noise, such as spell checking or text normalization (Eisenstein, 2013; Chrupała, 2014; van der Goot and van Noord, 2017a). Nonetheless, the trend nowadays is to use end-to-end approaches (Bordes et al., 2016; Klein et al., 2017; Schmitt et al., 2018) which exploit the raw data from the source without applying preprocessing steps, in an attempt to harness every bit of information for the specific task at hand while also avoiding introducing early errors in the NLP pipeline. On the other hand, it is also not entirely clear whether a normalization approach outperforms the direct use of word embeddings on noisy texts (van der Goot et al., 2017). Normalization, as a preprocessing step, will alter the original information encoded in the input text, although in a way that would benefit the next stages of the pipeline. For instance, if we normalize nooooo to no, the emphasis of the first word is lost. In this case, it is important to highlight the intentionality when using one form over the other, which contrasts with accidentally introducing spelling mistakes in the writing. Granted, a system that only includes normalized words in its vocabulary will probably benefit from using the latter form instead.
In this work, we introduce an adaptation of the skipgram model from (Bojanowski et al., 2016) to train word embeddings that better integrate word variants (otherwise considered noisy words) at training time. This can be regarded as an incremental improvement over fastText, analogous to the one fastText made over word2vec. Then, we perform an evaluation on a wide array of intrinsic and extrinsic tasks, comparing the performance of our embeddings to that of well-known embedding models such as word2vec and fastText on both standard and noisy English texts. The results show a clear improvement over the baselines in semantic similarity and sentiment analysis tasks, with a general tendency to retain the performance of the best baseline on standard texts and outperform them on noisy texts. Our ultimate goal is to improve the performance of traditional embedding models in the context of noisy texts. This would alleviate the need for the usual preprocessing steps such as spell checking or microtext normalization, and act as a good starting point for modern end-to-end NLP approaches.

Towards noise-resistant word embeddings
Word embedding models such as word2vec, GloVe or fastText are able to cluster word variants together when given a big enough training corpus that includes standard and non-standard language (Sumbler et al., 2018). That is, given enough examples where friend (standard word), freind (spell-checking error), frnd (phonetic-compressed spelling) and even dog or dawg (street-talk) appear in similar contexts, these words will be translated to similar vector representations. Taking advantage of this fact, many state-of-the-art microtext normalization systems use word embeddings in their pipelines (Bertaglia and Nunes, 2017;van der Goot and van Noord, 2017b;Ansari et al., 2017;Sridhar, 2015), both when generating normalization candidates for the input words and also when selecting them.
The problem with this approach is that the contexts where those example words appear are also likely to be affected by the same phenomena as the words themselves. For example, friend might appear in phrases such as that's my best friend or friend for life, while frnd in others such as dats my bst frnd or frnd 4 lifee. This can make it difficult for the embedding algorithm to find the semantic similarity between friend and frnd when only relying on the assumption that the training corpus is big and diverse enough to effectively convey this variability. However, not all of the embedding algorithms are equally affected by this, as those which take subword information into account may have an advantage: in our example, the similar morphology shared by the word variants may be exploited by algorithms such as fastText, which uses character n-grams to give them more similar vector representations.
In this paper, we present a modification of the skipgram model proposed by Bojanowski et al. (2016) (itself a modification of the original by Mikolov et al. (2013)), which tries to improve the clustering of standard words and their noisy variants. This is attained through the use of bridge-words, normalized derivatives of the original words from the training corpus where one of their constituent characters is removed. By using these new words at training time in addition to the original ones, our objective is to increase the similarity between word variants, using those bridge-words as intermediate terms that match the words we want to cluster together. For example, friend and freind have in common the bridge-words frind and frend. Even if the original words do not appear in the same context in the training corpus, using the bridge-words in place of the originals allows for indirect paths to be discovered: friend-frind-freind and friend-frend-freind. In the case of friend and frnd, and assuming that we use an embedding algorithm that exploits subword information, as we propose here, the higher morphological similarity of the latter to the bridge-words frend and frind favors their grouping in the same cluster. Notably, it should also be possible to apply modifications analogous to the ones described here to other training models, such as the continuous bag of words (Mikolov et al., 2013).
It is worth pointing out that we did not consider the latest state-of-the-art models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) as it would not be feasible to apply analogous modifications to these large and complex models at this point. On the other hand, although we currently consider a monolingual English setup, our method should be suitable for any other language with a similar concept of character, in contrast to those based on logograms such as Chinese.

Modified skipgram model
The skipgram model found in tools like word2vec and fastText establishes that for each word in a text it should be possible to predict those in their corresponding contexts (Mikolov et al., 2013). As a consequence, the words that appear in similar contexts end up represented by similar vectors, so that the transformation learned by the model can effectively map one group of words onto the other.
Based on the skipgram model from fastText, our proposal aims at increasing the similarity between standard words and their noisy counterparts.
To accomplish this, we introduce a new set of words at training time that we call bridge-words. For each word in the training corpus, we first lowercase it, strip the accents and remove successive character repetitions, and then obtain one bridge-word for each remaining character in the word by removing a different character each time. The removal of repetitions applies both to standard and non-standard repetitions (e.g. success vs daaammn), obtaining a common denominator for when users make the mistake of removing standard repetitions (e.g. from success to succes), or add repetitions to provide emphasis (e.g. from damn to daaammn). The resulting words are very similar and can still be read mostly in the same way. An analogous reasoning is used in the case of lowercasing and stripping the accents. Note that this procedure is exclusively applied to obtain the bridge-words, and the unprocessed corpus is the one used during training. Formally, let V be the word vocabulary extracted from the training corpus so that V = {w_1, w_2, ..., w_n}, with n the size of the vocabulary. The set of bridge-words is then defined as B = {b_{1,1}, b_{1,2}, ..., b_{1,|w_1|}, ..., b_{n,1}, b_{n,2}, ..., b_{n,|w_n|}}, where |w_i| is the length of word w_i, and b_{i,j} is the bridge-word obtained by first normalizing as described earlier and then removing the character at position j from the word w_i ∈ V (note that it is possible that V ∩ B ≠ ∅). These new words are used in addition to the original words when predicting their context in the skipgram model training. For example, in the phrase that's my best Friëndd ever, the objective is not only to predict that's, my, best and ever using the word Friëndd, but also using the derived bridge-words riend, fiend, frend, frind, fried and frien.
This idea of removing one character at a time is similar to the one used in the tool SymSpell to speed up spell-checking, where it replaces the exhaustive approach of considering all possible edit operations (i.e. addition, removal, substitution and transposition). In our case, bridge-words are not interesting per se but as intermediaries between other words. We do not require that they coincide with real words with which they would establish a direct connection; in fact, we assume that these connections will be indirect most of the time. For instance, we do not consider the substitution operations that would construct tome and tame from time, which would explicitly connect the three, but only tme, which can be obtained from the three of them by removing one character, linking them together indirectly.
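The generation of bridge-words described above can be illustrated with a short Python sketch. This is only a minimal sketch under our reading of the procedure, not the code used in this work: the function names normalize and bridge_words are hypothetical, accent stripping is assumed to mean removing Unicode combining marks after NFD decomposition, and repetition removal is assumed to collapse any run of an identical character into one.

```python
import re
import unicodedata
from typing import List


def normalize(word: str) -> str:
    """Lowercase, strip accents, and collapse runs of a repeated character."""
    lowered = word.lower()
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse every run of the same character into a single occurrence.
    return re.sub(r"(.)\1+", r"\1", stripped)


def bridge_words(word: str) -> List[str]:
    """Return one bridge-word per character of the normalized word,
    each obtained by deleting the character at a different position."""
    base = normalize(word)
    return [base[:j] + base[j + 1:] for j in range(len(base))]


if __name__ == "__main__":
    print(bridge_words("Friëndd"))
    # Bridge-words shared by a standard word and one of its misspellings.
    print(set(bridge_words("friend")) & set(bridge_words("freind")))
```

Because only deletions are generated, a word of length n yields at most n bridge-words, which keeps the number of extra training examples linear in the word length.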
It is important to observe that these bridge-words also constitute artificial noise introduced in our training process that could play a harmful role. As an example, the word fiend appears as a bridge-word for friend, while also being a standard word from the English dictionary without much semantic relation to the concept of friendship. Because of this, bridge-words should not have the same impact as the original words when tuning the parameters of the model. We propose two mechanisms for lowering the weight of bridge-words in the training process: (1) introducing them randomly, with a fixed probability p_b, instead of for all the original words, and (2) reducing the impact in the objective function by adding a weighting factor. Formally, let w_x be an input word of length |w_x|, b_j the bridge-word for w_x when the character at position j is removed, w_y a target word in the context of w_x, H a random variable with P(H = 1) = p_b and P(H = 0) = 1 − p_b, h ∼ H, λ the weight factor and E_ft(w_x, w_y) the objective function of the skipgram model from fastText. Then our new objective function, E_robust, is defined as:

\[ E_{robust}(w_x, w_y) = E_{ft}(\mathbf{w}_x, \mathbf{w}_y) + h \cdot \lambda \sum_{j=1}^{|w_x|} E_{ft}(\mathbf{b}_j, \mathbf{w}_y) \]

where \mathbf{w}_x, \mathbf{w}_y, and \mathbf{b}_j are the vector representations of the corresponding input, target and bridge words.
In any case, the proposed technique does not rule out the requirement of a training corpus where standard and noisy variants of words are used. Rather, it enhances the capacity of already existing models (in this case, the skipgram model from fastText) to bridge or further interconnect these word variants.
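To make the interplay of the two down-weighting mechanisms concrete, the following Python sketch combines them for a single (input word, target word) pair. It is a sketch under stated assumptions rather than the training code: e_ft is a hypothetical callable standing in for the fastText skipgram objective, h is assumed to be sampled once per input word, and the bridge-word terms are assumed to be summed before being scaled by λ.

```python
import random
from typing import Callable, Sequence


def robust_objective(
    e_ft: Callable[[str, str], float],
    w_x: str,
    w_y: str,
    bridges: Sequence[str],
    p_b: float,
    lam: float,
    rng: random.Random,
) -> float:
    """Objective for one (input, target) pair with down-weighted bridge-words."""
    # Mechanism (1): sample h once; bridge-words take part only when h == 1.
    h = 1 if rng.random() < p_b else 0
    # Mechanism (2): scale the bridge-word contribution by the weight factor.
    bridge_term = sum(e_ft(b, w_y) for b in bridges)
    return e_ft(w_x, w_y) + h * lam * bridge_term


if __name__ == "__main__":
    # Toy stand-in objective that returns a constant loss of 1.0 per pair.
    toy_e_ft = lambda source, target: 1.0
    bridges = ["riend", "fiend", "frend", "frind", "fried", "frien"]
    rng = random.Random(0)
    print(robust_objective(toy_e_ft, "friend", "best", bridges, 0.5, 0.1, rng))
```

With p_b = 1 and λ = 1 every bridge-word counts as much as the original word; lowering either value reduces the influence of spurious coincidences such as fiend.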

Evaluation
We use multiple intrinsic and extrinsic evaluation tasks to study the performance of our approach together with word2vec and fastText. The models are trained using the same unprocessed corpus of web text and tweets. Starting with the usual word similarity task, we also include outlier detection (Camacho-Collados and Navigli, 2016), most of the extrinsic tasks from the SentEval benchmark (Conneau and Kiela, 2018), and then we add Twitter SA from various editions of the SemEval workshop. Ideally, we should see that our embeddings are able to retain the performance of "vanilla" fastText embeddings (Bojanowski et al., 2016) for standard and less-corrupted text, while outperforming them on noisier texts, and that word2vec (Mikolov et al., 2013) is at a disadvantage in this case.

Word embedding training
In this work, we use a combination of web corpora, specifically the UMBC corpus (Lushan Han and Weese, 2013), and tweets collected through the Twitter Streaming API from dates between October 2015 and July 2018. It is worth noting that we did not perform any preprocessing or normalization step over the resulting corpus, and the final dataset is formed by 64.653M lines and 3.3B tokens, of which 24.558M are unique.
We employed a modified version of the skipgram model from fastText which incorporates the changes described in Section 2.1 together with a vanilla version and a word2vec baseline, using the default hyperparameters for all models. In the case of the proposed model, we train four instances in order to take a first look at the influence of the hyperparameters introduced: probability of introducing a bridge-word (p_b) and weight for bridge-words in the objective (λ). The combinations are (p_b = 1, λ = 1), (p_b = 0.5, λ = 1), (p_b = 1, λ = 0.1), and (p_b = 0.5, λ = 0.1). In this work, we do not perform hyperparameter optimization, and those values were selected according to the initial hypothesis that a decreased impact of bridge-words in the training process should be beneficial to the model.

Intrinsic tasks: word similarity and outlier detection
The first intrinsic evaluation task is the well-known semantic word similarity task. It consists in scoring the similarity between pairs of words, and comparing it to a gold standard given by human annotators. In a word embedding space, the similarity between two words can be measured through a distance or similarity metric between the corresponding vectors in the space, such as cosine similarity. The evaluation is performed using the Spearman correlation between the list of similarity scores obtained and the gold standard.
In this work, we use the wordsim353 (Finkelstein et al., 2002), SCWS (Huang et al., 2012), SimLex999 (Hill et al., 2015) and SemEval17 (monolingual) (Camacho-Collados et al., 2017) evaluation datasets. The second task is outlier detection, which consists in identifying the word that does not belong in a group of words according to their pair-wise semantic similarities. As an example, snake would be an outlier in the set german shepherd, golden retriever and french pitbull, in spite of also being an animal, since it is not a dog. In this case, we use the 8-8-8 (Camacho-Collados and Navigli, 2016) and wiki-sem-500 (Blair et al., 2016) datasets, and measure the proportion of times in which the outlier was successfully detected (i.e. the accuracy).
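As an illustration of this evaluation, the sketch below scores word pairs with cosine similarity and compares the scores with human judgments through the Spearman correlation, computed as the Pearson correlation of average ranks. It is a minimal standard-library sketch, not the evaluation code used here: the function names are hypothetical, every word is assumed to have a vector, and the toy vectors and scores are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation (undefined if either list is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate_similarity(
    vectors: Dict[str, Sequence[float]],
    gold: Sequence[Tuple[str, str, float]],
) -> float:
    """gold holds (word1, word2, human_score) triples."""
    predicted = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
    human = [score for _, _, score in gold]
    return spearman(predicted, human)


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
    toy_gold = [("a", "b", 9.0), ("a", "c", 1.0), ("b", "c", 2.0)]
    print(evaluate_similarity(toy_vectors, toy_gold))
```

Since Spearman only compares rankings, the absolute scale of the cosine scores and of the human ratings does not need to match.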
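A simplified version of the outlier detection task can be sketched as follows: the predicted outlier is the word with the lowest mean cosine similarity to the rest of the group, and accuracy is the fraction of groups where this prediction matches the gold outlier. This scoring rule is a simplified stand-in that may differ in detail from the official protocol of the datasets; the function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def detect_outlier(words: Sequence[str], vectors: Dict[str, Sequence[float]]) -> str:
    """Return the word that is, on average, least similar to the others."""

    def mean_similarity(word: str) -> float:
        others = [other for other in words if other != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    return min(words, key=mean_similarity)


def outlier_accuracy(
    groups: Sequence[Tuple[Sequence[str], str]],
    vectors: Dict[str, Sequence[float]],
) -> float:
    """groups holds (words, gold_outlier) pairs; returns the detection accuracy."""
    hits = sum(detect_outlier(words, vectors) == gold for words, gold in groups)
    return hits / len(groups)


if __name__ == "__main__":
    # Made-up 2-d vectors: three "dog" words point one way, "snake" another.
    toy_vectors = {
        "shepherd": [1.0, 0.1],
        "retriever": [1.0, 0.0],
        "pitbull": [0.9, 0.2],
        "snake": [0.0, 1.0],
    }
    group = (["shepherd", "retriever", "pitbull", "snake"], "snake")
    print(outlier_accuracy([group], toy_vectors))
```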

Extrinsic tasks: the SentEval benchmark and Twitter SA
Since it is not evident that performance on intrinsic tasks translates proportionally to extrinsic tasks (Faruqui et al., 2016; Chiu et al., 2016), where word embeddings are used as part of bigger systems, we resort to the SentEval benchmark (Conneau and Kiela, 2018) in order to evaluate our embeddings in a more realistic setup. The tasks included in this benchmark evaluate sentence embeddings, which can be obtained from word embeddings using an aggregating function, ranging from the simple bag of words to the more complex neural-based models InferSent (Conneau et al., 2017) or GenSen (Subramanian et al., 2018). Additionally, some tasks require a classifier to be trained on the sentence embeddings in order to obtain an output of the desired type. In both cases, we maintain a simple approach where we focus on the raw performance of the word embeddings rather than the models used on top of them. This means using the bag of words model to obtain sentence representations, which simply averages the corresponding word embeddings from each sentence, and then logistic regression for the classification tasks. SentEval includes 17 extrinsic tasks, of which we use 16, and 10 probing tasks. The first group includes semantic textual similarity (STS 2012-2016, STS Benchmark and SICK-Relatedness), natural language inference (SICK-Entailment and SNLI), sentiment analysis (SST, both binary and fine-grained), opinion-polarity (MPQA), movie and product review (MR and CR), subjectivity status (SUBJ), question-type classification (TREC) and paraphrase detection (MRPC). The second group is formed by tasks that evaluate other linguistic properties which could be found encoded in sentence embeddings, such as sentence length, depth of the syntactic tree or the number of the subject of the main clause. For a more detailed description of these tasks together with references to the original sources, see (Conneau and Kiela, 2018).
In general, for the similarity tasks, the performance is measured using Spearman correlation, while in the rest of the cases, which correspond to classification tasks, the accuracy of the classification is obtained. Unfortunately, we leave the image-caption retrieval task (COCO) out of our test bench as it is not possible to access the source texts, which would be needed for the processing that we perform as described in the next section.
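The bag-of-words aggregation mentioned above amounts to averaging the word vectors of each sentence. The following is a minimal sketch with a hypothetical function name; it assumes a plain lookup table of vectors, skips tokens without a vector, and maps a sentence with no known token to the zero vector, which are illustrative choices rather than a description of the benchmark code.

```python
from typing import Dict, List, Sequence


def sentence_embedding(
    tokens: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of the tokens found in the lookup table."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # Assumption: a sentence without any known token maps to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 3.0], "flute": [3.0, 5.0]}
    print(sentence_embedding(["a", "flute", "zzz"], toy_vectors, 2))
```

Averaging discards word order, which keeps the focus on the quality of the word vectors themselves rather than on the model placed on top of them.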
Finally, we also evaluate on the SA datasets released in the SemEval workshops by Nakov et al. (2013) (task 2, subtask B), Rosenthal et al. (2014) (task 9, subtask B), and Nakov et al. (2016) (task 4, subtasks B, D, C, and E). These already include noisy texts in the form of tweets, thus they are not processed in the same way as the SentEval datasets, as explained below. However, since we still use the SentEval code, we did filter the neutral/objective tweets in ternary SA datasets. We also performed downsampling on the 2016 training and development datasets, both binary and fine-grained, in order to compensate for the substantial imbalance across instance classes. This is important as the test datasets are also skewed in the same manner, and it led the classifiers to adjust to this bias to obtain unrealistic results. In the case of the binary task, we equated the positive instances with the number of negative ones, while in the case of the fine-grained task we used a fixed maximum number of 500 instances per class, given the huge gap between the least frequent class (accounting for 71 instances) and the most frequent one (including 2876 instances).
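The downsampling step can be sketched in Python as below. This is an illustrative sketch with a hypothetical function name: when no cap is given, every class is reduced to the size of the smallest one (which matches the binary case only under the assumption that the negative class is the smaller one), while a cap such as 500 reproduces the fixed per-class maximum of the fine-grained case. Random selection of the retained instances is also an assumption.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def downsample(
    examples: Sequence[Example],
    cap: Optional[int],
    rng: random.Random,
) -> List[Example]:
    """Keep at most `cap` examples per class (smallest class size if cap is None)."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    limit = cap if cap is not None else min(len(items) for items in by_label.values())
    balanced: List[Example] = []
    for items in by_label.values():
        if len(items) > limit:
            # Randomly choose which instances of an over-represented class to keep.
            items = rng.sample(items, limit)
        balanced.extend(items)
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    toy = [("p%d" % i, "positive") for i in range(10)]
    toy += [("n%d" % i, "negative") for i in range(3)]
    print(len(downsample(toy, None, random.Random(0))))
```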

Dataset de-normalization
Since we could not find noisy text datasets for such a wide variety of evaluation tasks as the ones from the SentEval benchmark, we decided to denormalize (i.e. introduce artificial noise into) these standard datasets, while also keeping the originals of the benchmark, in order to cover the case of noisy texts to the extent needed by this work. The procedure consists in randomly replacing every word in the texts by a noisy variant with some fixed probability. The noisy variants are obtained from two publicly available normalization dictionaries, utdallas and unimelb, released in the first (2015) edition of the W-NUT workshop (Baldwin et al., 2015), formed by (nonstandard, standard) word pairs.
For the word similarity and outlier detection datasets, this probability p_d was fixed to 1; i.e., we modify all the words in the test set which appear in our normalization dictionaries (which cover 78.61% of them). In the case of the SentEval datasets, we created three versions for each one of them: a heavily corrupted version (p_d = 1), a more balanced version (p_d = 0.6) and a less noisy one (p_d = 0.3). As an example, from the original sentence A man is playing a flute we obtain aa woma isz playiin thw flute, aa mann is playng da flute, and aa wman is playing the flute, in each respective case. The Twitter SA datasets, on the other hand, were not de-normalized.
Furthermore, we perform ten de-normalization runs over the intrinsic tasks datasets and three over the extrinsic ones, obtaining multiple noisy versions of each dataset. By averaging the results over the different de-normalizations, we try to neutralize extreme measurements that can be caused by different noisy variants of words.
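The de-normalization procedure can be sketched as follows. This is a minimal sketch with hypothetical function names, not the script used in this work: the (nonstandard, standard) pairs are inverted into a map from each standard word to its noisy variants, and each covered word is replaced by a randomly chosen variant with probability p_d. Uniform selection among variants and exact-match lookup are assumptions, and the dictionary entries in the example are toy values.

```python
import random
from typing import Dict, Iterable, List, Sequence, Tuple


def invert_dictionary(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Turn (nonstandard, standard) pairs into standard -> [nonstandard, ...]."""
    variants: Dict[str, List[str]] = {}
    for noisy, standard in pairs:
        variants.setdefault(standard, []).append(noisy)
    return variants


def denormalize(
    tokens: Sequence[str],
    variants: Dict[str, List[str]],
    p_d: float,
    rng: random.Random,
) -> List[str]:
    """Replace each covered token by a noisy variant with probability p_d."""
    noisy_tokens = []
    for token in tokens:
        options = variants.get(token)
        if options and rng.random() < p_d:
            noisy_tokens.append(rng.choice(options))
        else:
            # Tokens without a known variant are always left untouched.
            noisy_tokens.append(token)
    return noisy_tokens


if __name__ == "__main__":
    toy_pairs = [("da", "the"), ("thw", "the"), ("playng", "playing")]
    variants = invert_dictionary(toy_pairs)
    sentence = "the man is playing the flute".split()
    # Several runs with different seeds give different noisy versions to average over.
    for seed in range(3):
        print(" ".join(denormalize(sentence, variants, 0.6, random.Random(seed))))
```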

Results
Our currently best model is obtained with the hyperparameter combination (p_b = 0.5, λ = 0.1), which in some way validates our hypothesis that bridge-words should be introduced in a restrained fashion. In general terms, this model has a similar performance to fastText in the standard case, while outperforming both word2vec and fastText in noisy setups, with wider margins towards noisier texts.
Intrinsic evaluation. Table 1 shows the results on the intrinsic word similarity task. On standard words, fastText and our model obtain similar performance, both surpassing that of word2vec. On non-standard words, however, our model is able to consistently outperform fastText in every dataset, and word2vec falls further behind possibly due to its lack of support for out-of-vocabulary words in this scenario, as 48.77% of the unique noisy test words are not included in the vocabulary of the word2vec model. In the case of outlier detection, shown in Table 2, we obtained mixed results. On the 8-8-8 dataset, our model outperforms the baselines both in the standard and noisy scenarios, although with visibly lower margins than in the case of semantic similarity. However, on the wiki-sem-500 dataset, word2vec outperforms its competitors on standard words and does not lose much performance on the noisy setup. The latter may be explained by the low number of successfully denormalized words, with just 7.5% of the total (compared to 52.2% on the 8-8-8 dataset), which also hints at the tie between fastText and our model.
Extrinsic evaluation. Given the considerable number of tasks and datasets included in the SentEval benchmark, we decided to group similar tasks and datasets and show the aggregated results from each group instead of following an exhaustive approach. In this case, and given the variability in dataset sizes, we use a weighted average as the aggregation function.
First of all, we show in Figure 1 the dynamic behaviour of each model when going from standard texts to noisier ones. In this case, we divided the tasks into two groups based on the performance metric: Spearman correlation or accuracy. The first one encompasses the semantic similarity and relatedness tasks (STS* and SICK-Relatedness) and the second one the rest of the tasks. Except in the case of word2vec on the first group (yellow lines and crosses), all the models start from a very similar position in the standard scenario. Then, the performance begins its downward trend, where our model starts to stand out above the baselines. As we go towards noisier texts, our model manages to stay above the rest of the lines, increasing the distance margin up until the last stretch.
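The aggregation of results within a group of datasets can be illustrated with a short sketch. It assumes that each score is weighted by the size of its dataset, which is our reading of the motivation given above; the function name and the numbers in the example are made up.

```python
from typing import Sequence, Tuple


def weighted_average(results: Sequence[Tuple[float, int]]) -> float:
    """results holds (score, dataset_size) pairs for one group of datasets."""
    total_size = sum(size for _, size in results)
    return sum(score * size for score, size in results) / total_size


if __name__ == "__main__":
    # Two hypothetical datasets: the larger one dominates the aggregate.
    print(weighted_average([(0.5, 100), (1.0, 300)]))
```

Compared with a plain mean, this prevents a very small dataset from having the same influence on the group score as a much larger one.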
Next, Table 3 shows in greater detail the performance of each model in a less aggregated view. In this case, datasets have been grouped by task as described in Section 3.3. As we can see, our model is on par with the baselines on standard texts, with a few interesting exceptions: (1) it is able to obtain some advantage on sentiment analysis, which fastText also obtains over word2vec; (2) on question-type classification, word2vec obtains the best performance, and still clearly outperforms fastText on the lowest noise level, although not our model; and (3) on the probing tasks, word2vec takes the lead again, this time by a smaller margin. Regarding noisy texts, our model is clearly superior on semantic similarity and relatedness, as we had already seen before, and it also outperforms the baselines on the rest of the tasks, with wider margins on noisier texts, but with the sole exception of paraphrase detection. In this surprising case, word2vec outperforms both fastText and our model, obtaining better accuracy on texts with the highest level of noise compared to the previous step. It appears that, with the proper training (and hence, vocabulary), word2vec remains a strong baseline on extrinsic tasks, even in the case of noisy texts, where the level of noise has to be increased notably in order for fastText to obtain a clear advantage. This can also be observed following the continuous lines in Figure 1. On the other hand, the weakness seen on word semantic similarity (Table 1) relating to out-of-vocabulary words does not seem to translate to extrinsic tasks, where having more context and hence a higher chance of finding in-vocabulary words mitigates the problem, as we can see in the semantic similarity and relatedness (Table 3) results.
Figure 1: Performance of each considered model when going from standard texts to noisier ones on the extrinsic tasks. In lines and dots is the aggregated performance on semantic similarity and relatedness tasks (Spearman correlation). In continuous lines is the aggregated performance on the rest of the tasks (accuracy).
Finally, in Table 4 we show the results obtained on the SemEval Twitter SA datasets. In this case, word2vec continues to display a strong performance, fastText loses the advantage it had on the SentEval benchmark for the same SA task, and our approach is able to revert this performance loss to outperform, once again, both of the baselines. At this point, we can observe how fastText is inferior to word2vec on a real-world social media setting, when we may have expected the opposite at first. But for this same reason, it is remarkable to see our approach taking the lead despite being a modification of fastText, which also demonstrates the benefit of including the bridge-words at training time.
Having said that, it would be relevant to investigate if higher performance figures can be obtained by modifying the skipgram model from word2vec.

Related work
Word embeddings have been at the forefront of NLP research for the past decade, although the first application of vector representations of words dates back to (Rumelhart et al., 1986). More recently, the first models to attain wide use were word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which take words as basic and indivisible units, implying that the word vocabulary is fixed at training time and any unknown word would be given the same vector representation, regardless of its context or any other intrinsic property. To address the limitations of word2vec and GloVe with out-of-vocabulary words, where morphologically-rich languages such as Finnish or Turkish are especially affected, new models appeared which take subword information into account. The type of subword information used varies in each particular approach: some of them require a preprocessing step to extract morphemes (Luong et al., 2013), while others employ a less strict approach by directly using the characters (Ling et al., 2015; Kim et al., 2016) or character n-grams (Bojanowski et al., 2016; Wieting et al., 2016) that form the words. When targeting noisy texts from social media, such as tweets from Twitter, previous work relies solely on the high coverage that can be obtained from training in an equally noisy domain (Sumbler et al., 2018). An exception to this rule is the work from Malykh et al. (2018), where they try to obtain embeddings that are robust to misspelled words (one or two edit operations away from the correct form) by using a new neural-based model. In this case, the flexibility is obtained by an encoding of the prefix, suffix and set of characters that form each word. By using this set of characters in the encoding, where the specific order between them is disregarded, this approach achieves some form of robustness to low-level noise, while the prefix and suffix part encodes most of the semantic information.
The main difference of our approach is that we are not proposing a whole new model but a generic technique to adapt existing ones. This could be applied to many others, including that from Malykh et al. (2018) itself. Furthermore, we evaluate our embeddings in the context of non-standard texts, a noisier medium than the slightly misspelled standard texts regarded in (Malykh et al., 2018). Unfortunately, we could not include this approach in our test bench as, probably due to differences in the development environment setup, we were not able to train new models or extract embeddings through pretrained models using the latest version of the code at https://gitlab.com/madrugado/robust-w2v/tree/py3_launch. Lastly, if we consider standard and nonstandard texts as pertaining to different languages, our approach would be similar to (Luong et al., 2015), where they also adapt the skipgram model to obtain bilingual embeddings. In this work, they start with comparable bilingual corpora and automatically calculate alignments between words across languages. At training time, they use the words from alignment pairs interchangeably in the texts from each language, requiring each word to predict not only the context in its own language but also the context in the other language. In our case, we only consider one training corpus and create a set of bridge-words that act as alignments between standard words and their noisy counterparts. On the other hand, the weight given to these new words in the objective function is λ < 1 as they represent noisy examples, whereas in (Luong et al., 2015) the words from the other language are given more weight (λ > 1).

Conclusions
In this work, we have proposed a modification of the skipgram model from fastText intended to improve the performance of word embedding models on noisy texts as they are found on social media, while retaining the performance on standard texts. To do this, we introduce a new set of words in the training process, called bridge-words, whose objective is to connect standard words with their noisy counterparts.
We have evaluated the performance of the proposed approach together with word2vec and fastText baselines on a wide array of intrinsic and extrinsic tasks. The results show that, while the performance of our best model on standard texts is mostly preserved when compared to the baselines, it generally outperforms them on noisier texts with wider margins as the level of noise increases.
As future lines of research, we will perform the same study on other languages and adapt the proposed modification of the skipgram model to work with the newest ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) models. In light of its competitive performance, adapting the skipgram model from word2vec might prove useful. Other types of bridge-words such as phonetic codes obtained from a phonetic algorithm like the Metaphone (Philips, 1990) could also prove to be beneficial. Additionally, our approach is orthogonal to other techniques that enhance the performance of word embeddings, such as the ones described in (Mikolov et al., 2017), and so they too can be applied to the models obtained in this work.