On the Processing and Analysis of Microtexts: From Normalization to Semantics

User-generated content published on microblogging social platforms is an invaluable source of information for diverse purposes: health surveillance, business intelligence, political analysis, etc. We present an overview of our work in the field of microtext processing, covering the entire pipeline from input preprocessing to high-level text mining applications.


Introduction
Extracting information from microtexts (e.g., tweets) requires Natural Language Processing (NLP) techniques. Unfortunately, the performance of these techniques is sensitive to the so-called texting phenomena (shortenings, substitutions, word concatenation, etc.) present in such texts. Thus, we first need to adapt the input to standard writing conventions, a process called microtext normalization.

Microtext Normalization
A common approach to implementing a microtext normalization system is to decompose it into two steps [1]: normalization candidate generation, where domain dictionaries, phonetic algorithms [2], and other spell-checking techniques are used to obtain standard words that may replace those in the input text; and candidate selection, where the most likely normalized sequence according to some language model is constructed.
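The two steps above can be illustrated with a minimal sketch. It is not the system described in [1]: the dictionary `DOMAIN_DICT`, the bigram table `BIGRAM_LOGP`, the penalty `UNSEEN_LOGP` and both function names are hypothetical toy stand-ins, and selection is done here with a plain Viterbi search over a bigram model, which is only one possible choice of language model.

```python
# Toy resources (assumptions for illustration only); a real system would
# rely on large dictionaries, phonetic matching and a trained language model.
DOMAIN_DICT = {"c": ["see"], "u": ["you"], "l8r": ["later"]}
BIGRAM_LOGP = {
    ("<s>", "see"): -1.0,
    ("see", "you"): -0.5,
    ("you", "later"): -1.0,
}
UNSEEN_LOGP = -10.0  # flat penalty for bigrams missing from the toy table


def generate_candidates(token):
    """Step 1: propose standard words for a token, keeping the token itself."""
    return DOMAIN_DICT.get(token, []) + [token]


def bigram_logp(prev, word):
    return BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)


def select_candidates(tokens):
    """Step 2: Viterbi search for the most likely candidate sequence."""
    # best[w] = (log-probability, path) of the best sequence ending in w
    best = {"<s>": (0.0, [])}
    for token in tokens:
        new_best = {}
        for cand in generate_candidates(token):
            score, path = max(
                (prev_score + bigram_logp(prev, cand), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values())[1]


print(select_candidates(["c", "u", "l8r"]))  # → ['see', 'you', 'later']
```

Keeping the original token among the candidates lets words that are already standard pass through unchanged, leaving the decision to the selection step.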
Notably, this approach works at the word level, as candidates are generated and selected for each word in the input text. However, word boundaries (in this case, blank spaces) are also affected by texting phenomena, so their positions cannot be assumed to be correct.
To address this issue, we can add a word segmentation subsystem as an early step in the normalization pipeline to normalize the positioning of word boundaries. In particular, we have experimented with character-based n-gram language models paired with a beam search algorithm, obtaining state-of-the-art results [3].
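As a rough illustration of how a character-based n-gram model and beam search can cooperate for segmentation, consider the sketch below. It is a simplification under stated assumptions rather than the system of [3]: the model is a character trigram model trained on a toy corpus, a fixed penalty (`UNSEEN_LOGP`) stands in for proper smoothing, and all function names are hypothetical.

```python
import math
from collections import defaultdict

ORDER = 3            # character trigram model (assumption)
UNSEEN_LOGP = -20.0  # crude fixed penalty standing in for real smoothing
START, END = "^", "$"


def train_char_lm(sentences):
    """Count which character follows each (ORDER - 1)-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        padded = START * (ORDER - 1) + sentence + END
        for i in range(ORDER - 1, len(padded)):
            counts[padded[i - ORDER + 1:i]][padded[i]] += 1
    return counts


def char_logp(lm, history, char):
    """Log-probability of char given the last ORDER - 1 characters."""
    context = (START * (ORDER - 1) + history)[-(ORDER - 1):]
    seen = lm.get(context)
    if not seen or char not in seen:
        return UNSEEN_LOGP
    return math.log(seen[char] / sum(seen.values()))


def segment(lm, text, beam_width=8):
    """Discard the input spacing and re-insert spaces via beam search."""
    chars = text.replace(" ", "")
    beam = [(0.0, "")]  # hypotheses as (log-probability, output so far)
    for c in chars:
        expanded = []
        for score, out in beam:
            # Option 1: no word boundary before c.
            expanded.append((score + char_logp(lm, out, c), out + c))
            # Option 2: a word boundary before c (never at the very start).
            if out:
                spaced = out + " "
                s = score + char_logp(lm, out, " ") + char_logp(lm, spaced, c)
                expanded.append((s, spaced + c))
        beam = sorted(expanded, reverse=True)[:beam_width]
    # Also score the end-of-sentence symbol before picking the winner.
    return max((s + char_logp(lm, out, END), out) for s, out in beam)[1]


lm = train_char_lm(["see you later"])
print(segment(lm, "seeyoulater"))  # → see you later
```

The beam keeps only the `beam_width` best partial segmentations after each character, which trades exactness for a search cost that grows linearly with the input length.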
On top of this, to support multilingual environments such as most microblogging social platforms, it is essential to know in advance the language or languages in which the texts to be normalized are written, so that we can choose the right modules for the task. Consequently, we have added an automatic language identifier to our normalization pipeline. In this regard, we have tested and adapted well-known tools for the task [4].
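The role of the language identifier in the pipeline can be sketched as follows. The identifier below is a deliberately naive stopword-overlap heuristic used only as a placeholder; it is not one of the tools evaluated in [4], and the `STOPWORDS` table, the `"und"` (undetermined) label and the function name are assumptions made for this example.

```python
# Tiny stopword lists (illustrative assumption, far from complete).
STOPWORDS = {
    "en": {"the", "and", "is", "you", "to", "of"},
    "es": {"el", "la", "y", "es", "que", "de"},
}


def identify_language(text):
    """Return the language whose stopwords overlap most with the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Without any evidence, report the language as undetermined.
    return best if scores[best] > 0 else "und"


print(identify_language("the cat is here"))     # → en
print(identify_language("la casa es de ella"))  # → es
```

The returned label would then be used to pick the language-specific dictionaries and language models for the later normalization steps.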
Ongoing work focuses on obtaining an accurate candidate selection mechanism, where language models again play a key role.

Sentiment Analysis
Normalization systems have many applications in downstream NLP tasks, such as Sentiment Analysis (SA) on Twitter, where the goal is to predict whether the polarity of a text is positive, negative or neutral. In this context, we have studied symbolic systems that compute the sentiment of sentences by taking into account their syntactic structure. The hypothesis is that syntactic relations between pairs of words help to process linguistic phenomena such as negation, intensification or adversative subordinate clauses, which are very relevant for the task at hand. Our experiments suggest that our approach deals with these phenomena better than lexicon-based systems. We have also developed machine learning models that have been evaluated in international evaluation campaigns [5,6].
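To make the hypothesis concrete, the sketch below computes polarity over a hand-built dependency tree, letting negators and intensifiers act on the word they depend on. It is only a toy under stated assumptions: the lexicon, the weights, the `Node` class and the propagation rules are hypothetical, no parser is involved, and adversative clauses are not modelled.

```python
from dataclasses import dataclass, field

# Toy resources (assumptions for illustration only).
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "boring": -1.0}
INTENSIFIERS = {"very": 1.5, "really": 1.5}
NEGATORS = {"not", "never"}


@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)


def polarity(node):
    """Polarity of the subtree rooted at node."""
    score = LEXICON.get(node.word, 0.0)
    scale, negate = 1.0, False
    for child in node.children:
        if child.word in NEGATORS:
            negate = not negate                 # flips the head's subtree
        elif child.word in INTENSIFIERS:
            scale *= INTENSIFIERS[child.word]   # scales the head's subtree
        else:
            score += polarity(child)            # other dependents add up
    score *= scale
    return -score if negate else score


def label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


# "not very good": both modifiers depend on "good".
tree = Node("good", [Node("not"), Node("very")])
print(polarity(tree), label(polarity(tree)))  # → -1.5 negative
```

A purely lexical sum over the same three words would only see "good" and return a positive score, which is the kind of error that the syntactic treatment is meant to avoid.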
These techniques are usually applied in monolingual environments, but their application to multilingual and code-switching texts, where words from two or more languages are used interchangeably, is attracting increasing interest [7].
Normalization and sentiment analysis may also be useful in higher-level text mining applications. Political analysis, where the main goal is to use social media to estimate the popularity of politicians, is of special interest, as it can be used as an alternative to traditional polls [8].
Furthermore, NLP techniques can be used in social analysis to study cultural differences across countries. In particular, in [9] we explore the semantics of part-of-day nouns for different cultures on Twitter, which can help to understand how different societies organize their daily schedules.