Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19

Quantifying the characteristics of public attention is an essential prerequisite for appropriate crisis management during severe events such as pandemics. For this purpose, we propose language-agnostic tweet representations to perform large-scale Twitter discourse classification with machine learning. Our analysis of more than 26 million COVID-19 tweets shows that large-scale surveillance of public discourse is feasible with computationally lightweight classifiers by out-of-the-box utilization of these representations.


Introduction
Coronavirus disease 2019 (COVID-19) was declared a pandemic by the World Health Organization on 11 March 2020 [1]. Since the first recorded case in Wuhan, China in late December 2019, 17.1 million people have been infected by COVID-19 and, consequently, 670,000 people have lost their lives globally as of 30 July [...] COVID-19 as the primary responsibility of risk management is not centralized to a single institution, but distributed across society. For instance, a recent study by Zhong et al. shows that people's adherence to COVID-19 control measures is affected by their knowledge and attitudes towards it [6]. Previous national and global adverse health events show that social media surveillance can be utilized successfully for systematic monitoring of public discussion due to its instantaneous global coverage [7,8,9,10,11,12].
Twitter, due to its large user base, has been the primary social media platform for seeking, acquiring, and sharing information during global adverse events, including the COVID-19 pandemic [13]. Especially during the early stages of the global spread, millions of posts were tweeted in a span of a couple of weeks [14,15,16,17,18]. Consequently, several studies proposed and utilized Twitter as a data source for extracting insights on public health as well as on public attention during the COVID-19 pandemic. The focus of these studies includes nowcasting or forecasting of the disease, sentiment analysis, topic modeling, and quantifying misinformation/disinformation. Due to the novelty and unknown epidemiological characteristics of COVID-19, accurate quantification of public discussions on social media becomes especially relevant for disaster management (e.g., devising timely interventions or clarifying common misconceptions). So far, manual or automatic topical analyses of discussions on Twitter during the COVID-19 pandemic have been performed in an exploratory or descriptive manner [19,20,21]. Characterization of public discourse in these studies relies predominantly on manual inspection, aggregate statistics of keyword counts, or unsupervised topic modeling that utilizes joint distributions of word co-occurrences, followed by qualitative assessment of the discovered topics. A main reason for previous studies to avoid supervised approaches may be the lack of annotated (labeled) datasets of public discourse on COVID-19. Furthermore, previous studies either restrict their scope to a single language (typically English) or examine tweets from different languages in separate analyses. This is mainly due to limitations of traditional topic modeling algorithms, as they do not operate in a multilingual or cross-lingual fashion.
In this study, we propose large-scale characterization of public discourse themes by categorizing more than 26 million tweets in a supervised manner, i.e., classifying text into semantic categories with machine learning. For this purpose, we utilize two different annotated datasets of COVID-19 related questions and comments for training our algorithms. To be able to capture themes from 109 languages in a single model, we employ state-of-the-art multilingual sentence embeddings for representing the tweets, i.e., Language-agnostic BERT Sentence Embeddings (LaBSE) [22]. Our results show that large-scale surveillance of COVID-19 related public discourse themes and topics is feasible with computationally lightweight classifiers by out-of-the-box utilization of these representations. We release the full source code of our study and the trained models along with the instructions to access the experiment datasets. We believe our work contributes to the pursuit of expanding social media research for disaster informatics regarding health response activities.

Representing Tweets
As effective representation learning of generic textual data has been studied extensively in natural language processing research, tasks involving social media text benefit from recent advancements as well. While traditional feature extraction methods relying on word occurrence counts (e.g., bag-of-words or term frequency-inverse document frequency) have been extensively utilized in previous studies involving Twitter [50,51,52], they have been replaced by distributed representations of words in a vector space (e.g., word2vec [53] or GloVe [54] embeddings). Distributed word representations are learned from large corpora by a neural network, resulting in words with similar meanings being mapped to closer vector representations whose dimensionality is much smaller than the vocabulary size. Consequently, sentences, documents, or tweets can be represented, e.g., as an average pooling of their word embeddings. Such representations have also been learned specifically from Twitter corpora as tweet2vec [55,56] or hashtag2vec [57].
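The average-pooling idea mentioned above can be illustrated with a minimal sketch. The three-dimensional vectors and the helper name `average_pool` are made up for illustration and are not taken from word2vec, GloVe, or the study's code; real embeddings have hundreds of dimensions.

```python
from typing import Dict, List

# Toy 3-dimensional word vectors (illustrative values only; real word2vec or
# GloVe vectors are learned from large corpora).
TOY_VECTORS: Dict[str, List[float]] = {
    "wash": [1.0, 0.0, 2.0],
    "your": [0.0, 1.0, 0.0],
    "hands": [2.0, 2.0, 1.0],
}


def average_pool(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Represent a text as the element-wise mean of its known word vectors."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        raise ValueError("no token has a known embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


tweet_vector = average_pool("wash your hands".split(), TOY_VECTORS)
```

Because the lookup table returns the same vector for a word regardless of its neighbours, such a representation is static, which is the limitation that contextual embeddings address.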
While distributed word/sentence embeddings provide effective capturing of semantics, they operate as a static mapping from the textual space to the latent space. Serving essentially as a dictionary look-up, they often fail to capture the context of the textual inputs (e.g., polysemy). This drawback has been circumvented by contextual word/token embeddings such as ELMo [58] or BERT [59].
Contextual word embeddings enable the possibility of the same word being represented [...] show that out-of-the-box sentence embeddings of BERT and its variants (also known as transformers) cannot capture semantic similarities between sentences, requiring further training for that purpose [68]. They propose a mechanism for learning contextual sentence embeddings using the BERT neural architecture, i.e., sentence-BERT, enabling large-scale semantic similarity comparison, clustering, and information retrieval with out-of-the-box vector representations [68].

Tweet Embeddings
As the daily volume of COVID-19 related discussions on Twitter is enormous, computational public attention surveillance would benefit from lightweight approaches that can still maintain a high predictive power. Preferably, numerical representations should encode the semantics of tweets in such a way that simple vector arithmetic suffices for large-scale retrieval or even classification.
Moreover, developed machine learning systems should be able to accommodate tweets in several languages in order to capture the public discourse in an unbiased manner. Multilingual BERT-like contextual word/token embeddings [59] have been shown to be effective as pre-trained models if followed by task-specific fine-tuning. However, they do not intrinsically produce effective sentence-level representations [68]. In order to take advantage of multilingual BERT encoders for extracting out-of-the-box sentence embeddings, we employ Language-agnostic BERT Sentence Embeddings [22].
LaBSE embeddings combine a BERT-based dual-encoder framework with masked language modeling (an unsupervised fill-in-the-blank task where a model tries to predict a masked word) to reach state-of-the-art performance in embedding sentences across 109 languages [22]. Trained on a corpus of 6 billion translation pairs, LaBSE embeddings allow out-of-the-box comparison of sentences even by a simple dot product (essentially corresponding to cosine similarity, as the embeddings are $\ell_2$-normalized). We encode both the training data and the 26.8 million tweets using this deep learning approach, ending up with a vector of length 768 for each observation. Embeddings are extracted with the TensorFlow (version 2.2) framework in Python 3.7 on a 64-bit Linux machine with an NVIDIA Titan Xp GPU.
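The remark that a dot product corresponds to cosine similarity for normalized vectors can be checked with a small sketch. The vectors below are toy three-dimensional stand-ins, not actual 768-dimensional LaBSE outputs, and the helper names are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of arbitrary length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for two sentence embeddings.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

sim_cos = cosine_similarity(a, b)
sim_dot = dot(l2_normalize(a), l2_normalize(b))  # same value, up to rounding
```

Once every embedding is stored in normalized form, the per-pair norm computation disappears, which is what makes dot-product comparison attractive at the scale of millions of tweets.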

Intent Classification
As our choice of embeddings provides effective, out-of-the-box latent space representations of the textual data, simpler classifiers can be directly employed for identifying semantically similar texts. In fact, LaBSE embeddings provide representations that are suitable to be compared with simple cosine similarity [22]. We train three classifiers, namely k-nearest neighbour (kNN), logistic regression (LR), and support vector machine (SVM), to classify the observations into 11 categories. We employ a 10-fold stratified cross-validation scheme to evaluate the performance of the three models. Hyperparameters of the classifiers are selected by Bayesian optimization (see Section 3.4). Once the classifier and the set of hyperparameters giving the highest cross-validation classification performance are selected, the classifier is trained with the full dataset of 4,919 observations.
With this model, inference on 26,759,164 samples of Twitter data embeddings is performed.
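To make the idea of classifying by similarity concrete, the following is a minimal sketch of a k-nearest-neighbour vote over unit-length vectors. It is not the study's implementation: the two-dimensional vectors, the function name `knn_predict`, and the choice of k are assumptions made for illustration, and only the category names come from the text.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def knn_predict(query: Sequence[float], train: List[Sample], k: int = 3) -> str:
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: dot(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy unit-length "embeddings" with category labels taken from the paper.
TRAIN: List[Sample] = [
    ([1.0, 0.0], "Prevention"),
    ([0.8, 0.6], "Prevention"),
    ([0.96, 0.28], "Prevention"),
    ([0.0, 1.0], "Travel"),
    ([0.6, 0.8], "Travel"),
]

predicted = knn_predict([0.28, 0.96], TRAIN, k=3)
```

A brute-force ranking like this is linear in the training set size per query, which is workable for a few thousand labeled samples even when the number of queries is very large.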

Bayesian Hyperparameter Optimization
Typically, machine learning algorithms have several hyperparameters that require tuning for the specific task to avoid sub-optimal predictive performance.
The most influential hyperparameters of the k-nearest neighbour classifier are k (the number of neighbours) and the distance metric (e.g., cosine, Euclidean, or Manhattan).
For logistic regression and support vector machine classifiers, the $\ell_2$ regularization coefficient, $\lambda$, is the most crucial hyperparameter. We formulate the problem of finding the optimal set of classifier hyperparameters, $\theta^*$, as a Bayesian optimization problem:

$$\theta^* = \arg\max_{\theta} f(\theta),$$

where $f(\theta)$ is the average of cross-validation accuracies for a given set of hyperparameters, i.e., $f(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{ACC}_i$. For our experiments, $N = 10$ as we perform 10-fold cross-validation. We use Gaussian Processes for the surrogate model [77] of the Bayesian optimization, by which we emulate the statistical relationships between the hyperparameters and model performance, given a dataset. We run the optimization scheme for 30 iterations (each iteration corresponds to one full cross-validation) for each classifier.
Bayesian optimization is especially beneficial in settings where the function to be minimized/maximized, $f(\theta)$, is a black-box function that has no known closed form and is expensive to evaluate [78]. As $f(\theta)$ corresponds to cross-validation performance in our case, it is indeed a black-box function that is computationally expensive to evaluate. That is our motive for employing Bayesian hyperparameter optimization instead of manual tuning or performing grid search over a manually selected hyperparameter space.
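The objective that the optimizer queries, i.e., the mean accuracy over stratified folds for one hyperparameter setting, can be sketched as below. This shows only the black-box function for a kNN classifier with the hyperparameter k; the Gaussian Process surrogate and the rule for proposing the next setting are deliberately omitted. The toy data, the fold-assignment scheme, and all function names are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def knn_predict(query, train_x, train_y, k: int) -> str:
    """Majority vote among the k training vectors with the largest dot product."""
    ranked = sorted(zip(train_x, train_y), key=lambda p: dot(query, p[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def stratified_folds(labels: List[str], n_folds: int, seed: int = 0) -> List[List[int]]:
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)  # fixed seed: every setting sees the same splits
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def cv_objective(k: int, xs, ys, n_folds: int = 10, seed: int = 0) -> float:
    """f(theta) for theta = k: accuracy averaged over the held-out folds."""
    accuracies = []
    for held_out in stratified_folds(ys, n_folds, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        correct = sum(knn_predict(xs[i], train_x, train_y, k) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Toy, well-separated two-class data (20 samples per class).
XS = [[1.0, 0.01 * i] for i in range(20)] + [[0.01 * i, 1.0] for i in range(20)]
YS = ["Prevention"] * 20 + ["Travel"] * 20

scores = {k: cv_objective(k, XS, YS) for k in (1, 3, 5)}
```

Each call to `cv_objective` retrains and evaluates the classifier once per fold, which is why a sample-efficient search strategy is preferable to exhaustively enumerating settings.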

Evaluation
For visual inspection of LaBSE embeddings, we utilize Uniform Manifold Approximation and Projection (UMAP) to map the 768-dimensional embeddings to a 2-dimensional plane [79]. UMAP is a frequently used dimensionality reduction and visualization technique that can preserve the global structure of the data better than other similar methods [79]. In their recent study, Ordun et al. [...]
Evaluation of classifiers and their sets of hyperparameters is performed by 10-fold cross-validation. The randomness (seed) in cross-validation splits is fixed in order to perform a fair comparison. Average accuracy (%) and Area Under the Receiver Operating Characteristic (AUROC) curve scores across 10 folds are reported for all classifiers (for their best-performing set of hyperparameters).
AUROC scores are calculated in a one-vs-rest manner with macro averaging. As SVMs do not directly provide the probability estimates required for AUROC calculation, Platt scaling is used for probabilistic output estimation [80]. The confusion matrix for the best-performing classifier is reported as well. After running inference on Twitter data to classify 26.8 million tweets into 11 categories with the best-performing classifier, we aggregate the overall distribution of Twitter chatter into percentages. We also show tweet examples from each predicted category.
This is intuitive as the total number of tweets in January is several orders of magnitude lower than that of April, and sudden percentage jumps in January can be attributed to only a handful of tweets. Finally, random samples of tweets and their predicted labels can be observed in Table 4.
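The one-vs-rest, macro-averaged AUROC described above can be sketched without any library by using the rank interpretation of AUROC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The probabilities below are invented for illustration, and the function names are not from the study's code.

```python
from typing import List, Sequence


def binary_auroc(scores: Sequence[float], positives: Sequence[bool]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(probabilities: List[List[float]], labels: List[str],
                    classes: List[str]) -> float:
    """Treat each class as positive against the rest, then average unweighted."""
    per_class = []
    for column, cls in enumerate(classes):
        scores = [row[column] for row in probabilities]
        positives = [label == cls for label in labels]
        per_class.append(binary_auroc(scores, positives))
    return sum(per_class) / len(per_class)


CLASSES = ["Donate", "Travel", "Symptoms"]
LABELS = ["Donate", "Donate", "Travel", "Travel", "Symptoms", "Symptoms"]
# Invented class-probability rows, one per sample, columns ordered as CLASSES.
PROBS = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.1, 0.4],
]

macro_auroc = macro_ovr_auroc(PROBS, LABELS, CLASSES)
```

Macro averaging gives every category the same weight regardless of how many samples it has, so small categories influence the reported score as much as large ones.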

Discussion
Adequate risk management in crisis situations has to take into account not only the threat itself but also the perception of the threat by the public [82].
In the digital era, the public relies heavily on social media to inform their level of risk perception, often in a rapid manner. In fact, social media enhances collaborative problem-solving and citizens' ability to make sense of the situation during disasters [4]. With this paradigm in mind, we attempt to perform large-scale classification of 26.8 million COVID-19 tweets using natural language processing and machine learning. We utilize state-of-the-art language-agnostic tweet representations coupled with simple, lightweight classifiers to capture COVID-19 related discourse during a span of 13 weeks.
Our first observation of "increasing Twitter activity with increased COVID-19 spread throughout the globe" (Figure 1) is in line with other studies. For instance, Bento et al. show that Internet searches for "coronavirus" increase on the day immediately after the first case announcement for a location [83].
Wong et al. correlate the announcement of new infections with Twitter activity [84].
Similar associations between official cases and Twitter activity have been discovered by causal modeling as well [69]. Secondly, we show that language-agnostic embeddings can be utilized in an out-of-the-box fashion (without requiring task-
When compared to existing studies that often employ unsupervised topic modeling, our approach tries to perform public attention surveillance from a more automated perspective, as we formulate the problem as a supervised learning one. Topic modeling with LDA, which has been employed by the majority of previous studies, relies on manual/qualitative inspection of the discovered topics.
Furthermore, plain LDA fails to accommodate contextual representations and does not assume a distance metric between discovered topics, as it is based on the notion that words belonging to a topic are more likely to appear in the same document. With language-agnostic embeddings, we also include tweets from languages other than English in our analysis, hence decreasing the selection bias.
Utilization of large-scale social media data for extracting health insights is even more pertinent during a global pandemic such as COVID-19, as running randomized controlled trials becomes less practical. Moreover, traditional surveys for public attention surveillance may further stress the participants whose mental health and overall well-being might have been affected by lockdowns, associated financial issues, and changes in social dynamics [86,87,88]. Once accurate estimation of global or national discourse is possible, social media can also be used to direct people to trusted resources, counteract misinformation, disseminate reliable information, and enable a culture of preparedness [89]. Assessment
Future research includes running a similar analysis for a more granular category set or sub-categories. For instance, the Speculation category can be divided into conspiracies related to the origin of the disease, transmission characteristics, and treatment options. Including up-to-date Twitter data (after April 2020) as well as extracting location-specific insights will be performed in future analyses as well.

Conclusions
Transforming social media data into actionable knowledge for public health systems faces several challenges, such as advancing methodologies to extract relevant information for health services, creating dynamic knowledge bases that address disaster contexts, and expanding social media research to focus on health response activities [90]. We hope our study serves this purpose by providing methodologies for large-scale, language-agnostic discourse classification on Twitter.

Figure 1: Daily Twitter activity related to COVID-19 during the early stages of the pandemic.

Figure 4 depicts the timeline of normalized daily category distributions obtained by running inference on tweets posted between 26 January and 5 April 2020. Transmission and travel-related chatter as well as speculations (opinions on the origin of COVID-19, myths, and conspiracies) show a significant presence throughout the pandemic. What Is Corona?, i.e., questions and inquiries regarding what exactly COVID-19 is, shows a presence in the early stages of the pandemic but decreases through time, possibly due to gained scientific knowledge about the nature of the disease. On the contrary, the prevalence of Prevention-related tweets increases through time, especially after the declaration of the pandemic by WHO on 11 March. Similarly, chatter for Donation discussions is observed only starting from March. Timeline curves become smoother (less spiky) with increasing date as the percentage changes between consecutive days get smaller.
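A normalized daily category distribution of the kind plotted in Figure 4 can be computed with a few lines. The sketch below assumes predictions are available as (date, category) pairs; that input format, the example dates and counts, and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def daily_category_shares(
    predictions: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Map each day to the percentage of that day's tweets in each category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for day, category in predictions:
        counts[day][category] += 1
    shares = {}
    for day, counter in counts.items():
        total = sum(counter.values())
        shares[day] = {cat: 100.0 * n / total for cat, n in counter.items()}
    return shares


# Invented predictions: four tweets on one day, two on another.
PREDICTIONS = [
    ("2020-01-26", "Travel"),
    ("2020-01-26", "Travel"),
    ("2020-01-26", "Travel"),
    ("2020-01-26", "Speculation"),
    ("2020-03-13", "Donate"),
    ("2020-03-13", "Prevention"),
]

shares = daily_category_shares(PREDICTIONS)
```

With only a handful of tweets on a day, a single tweet shifts a category's share by many percentage points, whereas on high-volume days the same change is negligible.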

Figure 4: Distribution of semantic discussion categories in Twitter predicted by the classifier during COVID-19.
specific fine-tuning of BERT models) even by a simple nearest neighbour classifier, which achieves 0.964 AUROC. An SVM classifier reaches 86.92% accuracy and 0.986 AUROC for classification into 11 topic categories. Finally, we show that the overall public discourse shifts through the pandemic. Questions of "what coronavirus is" give way to donation and prevention related discussions as the disease spreads into more and more countries, especially during March 2020. Tweets related to donation increase especially around 13 March 2020, when WHO and the United Nations Foundation started a global COVID-19 donation fund [85].
of effectiveness of public risk communication and interventions is also feasible with properly designed computational systems. Guided by machine learning insights, some of these interventions can be made on social media itself. Our study has several limitations. First, the training data consists of single-label annotations, while in reality a tweet can have several topics simultaneously, e.g., Prevention and Travel. Secondly, we do not employ a confidence threshold for categorizing tweets, which forces our model to classify every observation into one of the 11 categories. Considering that some Twitter discourses related to COVID-19 may not be properly represented by our existing categories, a probability threshold can be introduced for the final classification decision. Finally, we discard retweets in our analysis, which in fact contribute to public attention on Twitter.
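The probability threshold suggested among the limitations above could look like the following sketch. The threshold value of 0.5 and the fallback label "Other" are hypothetical choices, not values proposed in the study.

```python
from typing import Dict


def classify_with_threshold(
    probabilities: Dict[str, float],
    threshold: float = 0.5,      # hypothetical cut-off, would need tuning
    fallback: str = "Other",     # hypothetical label for uncovered discourse
) -> str:
    """Return the most probable category, or the fallback if it is not confident."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else fallback


confident = classify_with_threshold({"Prevention": 0.81, "Travel": 0.12, "Donate": 0.07})
uncertain = classify_with_threshold({"Prevention": 0.36, "Travel": 0.34, "Donate": 0.30})
```

A rule like this trades coverage for precision: tweets that fit none of the 11 categories well are set aside instead of being forced into the nearest one.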

Table 1: Distribution of languages.

Tweets have been collected using the Twitter streaming API with the following keywords: COVID19, CoronavirusPandemic, COVID-19, 2019nCoV, CoronaOutbreak, coronavirus, WuhanVirus, covid19, coronaviruspandemic, covid-19, 2019ncov, coronaoutbreak, wuhanvirus [74]. As the Twitter Terms of Service do not allow redistribution of tweet contents, only tweet IDs are publicly available. Extraction of the textual content of tweets, timestamps, and other metadata was performed with the open-source software Hydrator and a Twitter developer account. For our study, we discard the retweets; at the time of extraction, 26,759,164 unique tweets were available, which is the final number of observations used in this study. The daily distribution of these tweets (7-day rolling average) can be observed in Figure 1. For training machine learning classifiers, we utilize the following two recently curated datasets: COVID-19 Intent [75] and COVID-19 Questions [76]. The Intent dataset consists of 4,938 COVID-19 specific utterances (typically a question or a request) categorized into 16 categories to describe the author's intent [75]. For instance, the sample "is coughing a sign of the virus" has an intent related to Symptoms. The dataset consists of English, French, and Spanish utterances and has been synthetically created by native-speaker annotators based on an ontology. We discard the uninformative categories of Hi and Okay/Thanks to end up with 4,325 samples from this dataset. We combine the Can i get from feces animal pets, Can i get from packages surfaces, and How does corona spread categories into a single category of Transmission. Similarly, we merge the What if i visited high risk area category into the Travel category to end up with 11 categories (classes).
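The relabeling steps described above (dropping two uninformative categories and merging several others) amount to a simple mapping. In the sketch below, the label strings are copied from the text, but whether they match the dataset's exact label spelling is an assumption, and all utterances except the Symptoms example quoted above are invented.

```python
from typing import List, Tuple

# Categories merged into a broader one (strings as written in the text;
# the dataset's exact label spelling may differ).
MERGE_MAP = {
    "Can i get from feces animal pets": "Transmission",
    "Can i get from packages surfaces": "Transmission",
    "How does corona spread": "Transmission",
    "What if i visited high risk area": "Travel",
}
# Categories dropped as uninformative.
DISCARDED = {"Hi", "Okay/Thanks"}


def relabel(samples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop discarded categories and map merged categories to their new label."""
    cleaned = []
    for text, category in samples:
        if category in DISCARDED:
            continue
        cleaned.append((text, MERGE_MAP.get(category, category)))
    return cleaned


RAW = [
    ("is coughing a sign of the virus", "Symptoms"),                        # from the text
    ("hello there", "Hi"),                                                  # invented
    ("can my dog give it to me", "Can i get from feces animal pets"),       # invented
    ("i just came back from a hotspot", "What if i visited high risk area"),  # invented
]

CLEANED = relabel(RAW)
```

Keeping the mapping in one dictionary makes the category scheme easy to audit and to change, for example when moving to a more granular set of sub-categories.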

Table 2: Distribution of category labels.
[...] Speculation, Symptoms, Transmission, and Treatment categories. In the end, the dataset for our experiments, i.e., training and validating text classification algorithms, consists of 4,919 textual samples collected from the above-mentioned two datasets. The 11 category labels of the final dataset are Donate, News & Press, Prevention, Reporting, Share, Speculation, Symptoms, Transmission, Travel, Treatment, and What Is Corona?. The sample distribution of languages and categories in the dataset can be examined from Table [...]

Table 3: Cross-validation results of three classifiers.

Table 4: Example tweets and predicted classification categories.