TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels

The widespread usage of social networks during mass convergence events, such as health emergencies and disease outbreaks, provides instant access to citizen-generated data that carry rich information about public opinions, sentiments, urgent needs, and situational reports. Such information can help authorities understand the emergent situation and react accordingly. Moreover, social media plays a vital role in tackling misinformation and disinformation. This work presents TBCOV, a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year. More importantly, several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities (e.g., mentions of persons, organizations, locations), user types, and gender information. Last but not least, a geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a myriad of data analysis tasks to understand real-world issues. Our sentiment and trend analyses reveal interesting insights and confirm TBCOV's broad coverage of important topics.

Table 1. A sample of keywords/hashtags used for data collection.

The tweet volume during the first three weeks is relatively low, e.g., ∼5 million daily tweets on average. However, a sudden surge can be noticed starting from week four, which amounts to an overall average of 33 million tweets per week. The maximum number of tweets recorded in a week is 65 million. The tweets in the TBCOV dataset are posted by 87,771,834 unique users, 268,642 of whom are verified. In total, the dataset covers 67 international languages. Figure 2 shows the distribution of languages (with at least 10K tweets) and the corresponding number of tweets on a log scale. English dominates with around one billion tweets; Spanish and Portuguese are the second and third largest languages, respectively. There are around 55 million tweets for which the language is undetermined; this is an important set of tweets for the language detection task with code-mixing properties 30.

The TBCOV dataset is a substantial extension of our previous COVID-19 data release named GeoCoV19 31, and is superior in several ways. First, TBCOV contains 1.4 billion more tweets than GeoCoV19, which consists of 524 million tweets. Second, the data collection period of GeoCoV19 was restricted to only four months (Feb 2020 to May 2020), whereas TBCOV covers 14 months (Feb 2020 to March 2021). The third and most critical extension comprises the several derived attributes that TBCOV offers, including sentiment labels, named-entities, user types, and gender information; none of these attributes were part of the GeoCoV19 data. Furthermore, the geotagging method used in GeoCoV19 has been substantially improved for TBCOV, yielding better inference results.

Table 2. Named-entity extraction results (person, organization, location, and miscellaneous) for the top four languages. 'U' denotes "unique occurrences" and 'A' denotes "all occurrences" of entities.
The English NER model supports eighteen entity types, including persons, organizations, locations, language, product, time, money, etc. However, all other NER models can detect only the three fundamental entity types, i.e., persons, organizations, and locations, in addition to a miscellaneous type representing other entities. We introduced an additional entity, named covid19, to represent different COVID-19 related terms (N = 60), including Coronavirus, SARS-CoV, SARS-COVID-19, Corona, Covid19, etc. All six models and their performance scores are publicly available 1. The text of all two billion tweets was first preprocessed by removing URLs, usernames, emojis, and other special characters, and then fed to one of the six NER models depending on the language attribute. Four NVIDIA Tesla P100 GPUs were used to process all the data. The entity recognition and extraction process resulted in 4.7 billion entities from all tweets. Table 2 shows the number of extracted entities of type person, organization, location, and misc (i.e., miscellaneous) for the top four languages. The selected languages account for 38% of person, 68% of organization, and 76% of location entities out of all the extracted entities. The remaining entities follow a long-tail distribution.
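To make the preprocessing step concrete, the following is a minimal sketch of the kind of text cleaning described above. The regular expressions are illustrative stand-ins; the exact cleaning rules used for TBCOV are not reproduced here.

```python
import re

# Illustrative patterns; the exact rules used for TBCOV may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def preprocess(text: str) -> str:
    """Strip URLs, user mentions, and emojis, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Stay safe! 😷 @WHO https://covid19.who.int #COVID19"))
# -> "Stay safe! #COVID19"
```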

Geographic information
Geotagged social media messages with situational or actionable information have a profound impact on decision-making processes during emergencies 35,36. For example, recurring tweets showing face mask violations in a shopping mall or a park, or on a beach, can potentially inform authorities' decisions regarding stricter measures. Moreover, when governments' official helplines are overwhelmed 37, social media reports, e.g., shortages of essential equipment in a remote hospital or patients stuck in traffic requiring urgent oxygen supply 38, could be life-saving if processed and geotagged promptly and effectively. Furthermore, GIS systems, which rely heavily on geotagged information, are critical for many real-world applications such as mobility analysis, hot-spot prediction, and disease spread monitoring. Despite these advantages, social media messages are often not geotagged, and thus not suitable for automatic consumption and processing by GIS systems. However, they may still contain toponyms or place names such as street, road, or city names, i.e., information useful for geotagging.

Geotagging approach
This work geotags tweets using five meta-data attributes. Three of them, i.e., tweet text, user location, and user profile description, are free-form text fields potentially containing toponym mentions. The tweet text attribute, which represents the actual content of a tweet in 280 characters, can have multiple toponym mentions for various reasons. The user location is an optional field that allows users to add location information such as their country, state, and city, whereas the user profile description field usually carries users' demographic data 39. These two user-related attributes are potential sources for user location inference 40. The remaining two attributes, i.e., geo-coordinates and place tags, carry geo-information in a structured form that is suitable for direct consumption by GIS systems. The geo-coordinates field contains latitude and longitude, which are directly obtained from the users' GPS-enabled devices. However, many users refrain from enabling this feature, so only 1-2% of tweets contain exact coordinates 41. The place attribute carries a bounding box representing a location tag that users optionally provide while posting tweets. Although the geo-coordinates and place attributes suit GIS consumption, for the sake of standardization with the text-based attributes, we convert them to country-, state-, county-, and city-level information using a process known as "reverse geocoding", which is described next.
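As an illustration of reverse geocoding, the sketch below queries a Nominatim instance for the address of a coordinate pair. The local host and port are assumptions; the official public endpoint is rate-limited, as discussed next, so a local installation is assumed here.

```python
import requests

NOMINATIM = "http://localhost:8080"  # assumed local Nominatim installation

def reverse_geocode(lat: float, lon: float) -> dict:
    """Resolve coordinates to country/state/county/city via Nominatim /reverse."""
    resp = requests.get(
        f"{NOMINATIM}/reverse",
        params={"lat": lat, "lon": lon, "format": "jsonv2"},
        timeout=10,
    )
    resp.raise_for_status()
    address = resp.json().get("address", {})
    # Smaller places may appear under 'town' or 'village' instead of 'city'.
    return {key: address.get(key) for key in ("country", "state", "county", "city")}

print(reverse_geocode(37.7749, -122.4194))
# e.g. {'country': 'United States', 'state': 'California', ...}
```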
The pseudo-codes of the proposed geotagging procedures are presented in Algorithms 1, 2, & 3. Two processes common across the three procedures are (i) geocoding and (ii) reverse geocoding. The geocoding process is used to obtain geo-coordinates for a given place name (e.g., California), while the reverse geocoding process is used to retrieve the place name corresponding to given geo-coordinates. Multiple geographical databases exist that support these two processes. We use the Nominatim database 2, which is the search engine of OpenStreetMap 3. The official online Nominatim service is restricted to 60 calls/minute, and hence is not suitable for making billions of calls in a reasonable time period. Therefore, we set up a local installation of the Nominatim database. Both Nominatim calls (i.e., geocoding and reverse geocoding) return, among others, a dictionary object named "address", which, depending on the location granularity, comprises several attributes such as country, state, county, and city.

The procedure to process toponyms from text fields (except user location) is highlighted in Algorithm 1. The procedure assumes that all six NER models are already loaded (line 1). After initializing the required arrays, preprocessing of the text (i.e., removing all URLs, usernames, emoticons, etc.) is performed (line 3). The lang attribute, which represents the language of a tweet, determines the NER model to be applied on the processed text for entity extraction. Recall that five language-specific and one multilingual NER models are used in this study. Since NER models return different types of entities, we next iterate over all predicted entities (line 7) to retain the ones with the following types: LOC, FAC, or GPE (line 8). The LOC entity type represents locations, mountain ranges, and bodies of water; FAC corresponds to buildings, airports, highways, bridges, etc.; and GPE represents countries, cities, and states. Finally, a geocoding call per entity is made and the responses are stored (lines 9 & 10).

Algorithm 2 outlines the procedure for processing the place attribute. The place_type attribute inside the place object helps determine whether a reverse or a simple geocoding call is required (lines 2 & 5). Places of type POI (Point-of-Interest) contain exact latitude and longitude coordinates, and are thus suitable for reverse geocoding calls (line 4). However, non-POI places (i.e., city, neighborhood, admin, or country) are represented with a bounding box spanning a few hundred square feet (e.g., for buildings) to thousands of square kilometers (e.g., for cities or countries). Moreover, large bounding boxes can potentially cover multiple geographic areas, e.g., two neighboring countries, and hence can be ambiguous to resolve. To tackle this issue, we use the full_name attribute to make geocoding calls (lines 7 & 16) and compare the country name of the obtained address with that of the original place object (lines 9 & 18). In case the countries do not match, as a last resort, the midpoint of the bounding box is obtained (lines 11 & 20) to make reverse geocoding calls (lines 12 & 21).

Algorithm 3 outlines the pseudo-code of the overall geotagging process. It starts with loading a batch of tweets (line 1) and iterating over them (line 2). Tweets with coordinates are used to make a reverse geocoding call (lines 3-5). For place tweets, the geoLocalizePlace procedure is called, which is defined in Algorithm 2. For the two text-based attributes (i.e., tweet text and user profile description), the geoLocalizeText procedure is called, which is defined in Algorithm 1. However, the user location attribute is pre-processed and geocoded without applying the NER model (lines 13-15).
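A compact sketch of Algorithm 1 in Python follows; it uses spaCy's English model as a stand-in for the six NER models and a local Nominatim /search endpoint for geocoding, both of which are assumptions rather than the exact implementation.

```python
from typing import Optional

import requests
import spacy

nlp = spacy.load("en_core_web_lg")      # stand-in for the language-specific NER models
LOCATION_TYPES = {"LOC", "FAC", "GPE"}  # entity types retained in Algorithm 1

def geocode(name: str) -> Optional[dict]:
    """Geocode a toponym via a local Nominatim /search call (assumed endpoint)."""
    hits = requests.get(
        "http://localhost:8080/search",
        params={"q": name, "format": "json", "addressdetails": 1, "limit": 1},
        timeout=10,
    ).json()
    return hits[0].get("address") if hits else None

def geo_localize_text(text: str) -> list:
    """Algorithm 1 (sketch): extract location-type entities and geocode each one."""
    doc = nlp(preprocess(text))  # preprocess() as sketched earlier
    results = []
    for ent in doc.ents:
        if ent.label_ in LOCATION_TYPES:
            address = geocode(ent.text)
            if address:
                results.append({"toponym": ent.text, "address": address})
    return results
```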
The evaluation results of the proposed geotagging approach are presented in the next section. The geotagging approach identified 515,802,081 mentions of valid toponyms in tweet text and 180,508,901 in user profile descriptions. More importantly, out of all 1,284,668,011 users' self-declared locations in the user location field, 1,132,595,646 (88%) were successfully geotagged. Moreover, the process yielded 2,799,378 and 51,061,938 locations for the geo-coordinates and place fields, respectively. Table 3 shows important geotagging results, including total occurrences, geotagging yield, and the granularity of the resolved locations at the country, state, county, and city level. To determine the country, state, county, and city of a tweet, we mainly rely on three attributes. The first two are users' self-reported locations in the user location and user profile description fields. GPS coordinates are used (if available) in case a tweet is not resolved through the user location and user profile description fields. Altogether, >1.8 billion locations corresponding to 218 unique countries, 2,518 states, 26,605 counties, and 24,424 cities worldwide were resolved. The dataset contains 175 countries and 609 cities around the world with at least 100K tweets. Figure 3 depicts the monthly distribution of the top 10 countries and cities throughout the data collection period.
To allow meaningful comparisons of geotagged tweets across different countries, we normalize the tweets from each country by its population and calculate posts per 100,000 persons. For this purpose, geotagged tweets resolved through the user location, user profile description, and geo-coordinates attributes were used. Figure 4 shows the normalized counts of geotagged tweets for each country on a world map.
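The per-capita normalization is straightforward; the sketch below illustrates it with made-up counts and populations (the real computation uses the geotagged counts described above).

```python
import pandas as pd

# Hypothetical tweet counts and populations; illustrative values only.
counts = pd.Series({"US": 450_000_000, "IN": 120_000_000, "QA": 9_000_000})
population = pd.Series({"US": 331_000_000, "IN": 1_380_000_000, "QA": 2_800_000})

# Posts per 100,000 persons, as plotted in Figure 4.
tweets_per_100k = (counts / population * 100_000).round(1)
print(tweets_per_100k.sort_values(ascending=False))
```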

Sentiment classification
Understanding public opinion and sentiment is important for governments and authorities to maintain social stability during health emergencies and disasters 42,43. Prior studies have highlighted social networks as a potential medium for analyzing public sentiment and attitudes towards a topic 44. Opinionated messages on social media can vary from reactions to a policy decision 45 or expressions of sentiment about a situation 46 to sharing opinions during sociopolitical events such as the Arab Spring 47. Sentiment analysis, a computational method to determine text polarity, is a growing field of research in the text mining and NLP communities 48. There is a vast literature on the algorithms and techniques proposed for sentiment analysis; detailed surveys can be found in 49-51. Moreover, numerous studies employ sentiment analysis techniques to comprehend public sentiment during events ranging from elections and sports to health emergencies 46,52. We are interested in understanding the public sentiment perceived from multilingual, multi-topic COVID-19 tweets from around the world.
Our Twitter data is multilingual and covers dozens of real-world problems and incidents such as lockdowns, travel bans, and food shortages, among others. Thus, sentiment analysis models that focus on specific topics or domains and support specific languages do not suit our purpose. The NLP community offers a myriad of multilingual architectures ranging from LSTMs to the more famous transformer-based models 51. Most recently, a transformer-based model called XLM-T has been proposed as a multilingual variant of the XLM-R model 53, obtained by fine-tuning it on millions of general-purpose tweets in eight languages 17. Although the original XLM-R model is trained on one hundred languages using more than two terabytes of filtered CommonCrawl data 4, its Twitter variant XLM-T achieves better performance on a large multilingual benchmark for sentiment analysis 17. We used the XLM-T model to obtain sentiment labels and confidence scores for all two billion tweets in our dataset.

Next, we highlight important distributions and present brief analyses of the obtained sentiment labels. Of all two billion tweets, 1,054,008,922 (52.31%) are labeled as negative, 680,300,793 (33.77%) as neutral, and 280,483,181 (13.92%) as positive. Figure 5 presents the weekly aggregation of sentiment labels for all tweets in all languages. As anticipated, the negative sentiment dominates throughout the data collection period (i.e., all 14 months). A significant surge of negative sentiment is apparent at the beginning of March, peaking in the first week of April and then averaging down in the later months. Several hills and valleys appear, but no week after April 2020 reaches the height of the April surge in negative tweets. The neutral sentiment worldwide always stays lower than the negative, but follows a similar pattern. Not surprisingly, the positive sentiment remains the lowest sentiment expressed in tweets, with a steady average except for a few weeks in April 2020.

Figure 6 shows countries' aggregated sentiment on a world map. The sentiment scores for countries represent normalized weighted averages based on the total number of tweets from a country and the model's confidence scores for positive, negative, and neutral tweets. Equation 1 shows the computation of the weighted average sentiment score for a country:

$$S_c = \frac{1}{N_c} \sum_{i=1}^{N_c} t_i^c \, \Theta_i^c \tag{1}$$

where $t_i^c$ represents the sentiment label of tweet $i$ from country $c$, $\Theta_i^c$ indicates the model's confidence score for $t_i^c$, and $N_c$ corresponds to the total number of tweets from the country. The normalized scores range from -1 to 1, where -1 represents high-negative and 1 high-positive, with zero being neutral. The model confidence score represents the model's trust level in assigning a sentiment class to a tweet and ranges between 0 and 1. The numbers on top of each country are z-scores computed using the representative sentiment tweets normalized by the total tweets from all countries. Overall, the map shows overwhelming negative sentiment across all but a few countries. Surprisingly, Saudi Arabia and other Gulf countries, including Qatar, UAE, and Kuwait, show a strong positive sentiment. The rest of the world, including the US, Canada, and Australia, shows moderate to strong negative sentiment. Figure 7 shows the weekly sentiment trends for the top six countries (by total tweets in our data). Consistent with the worldwide sentiment trends, the negative sentiment of all six countries dominates throughout.
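For concreteness, the following minimal sketch computes Equation 1 with NumPy, assuming sentiment labels are encoded as -1 (negative), 0 (neutral), and +1 (positive); this numeric encoding is our assumption.

```python
import numpy as np

def country_sentiment(labels: np.ndarray, confidences: np.ndarray) -> float:
    """Equation 1: weighted average of labels in {-1, 0, +1} with confidences
    in [0, 1]; the result lies in [-1, 1]."""
    return float((labels * confidences).sum() / len(labels))

labels = np.array([-1, -1, 0, 1])        # two negative, one neutral, one positive
confidences = np.array([0.9, 0.7, 0.8, 0.95])
print(country_sentiment(labels, confidences))  # -> -0.1625
```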
While a few countries (the US, UK, and India) reach a couple of million negative tweets in some weeks, the other countries stay lower, around half a million per week.
In Figure 8, we provide additional information about the distribution, skewness (through quartiles), and median of positive and negative sentiment for the top five countries. We notice that in most cases the earlier months of COVID-19 (i.e., February-March 2020) show high variation in both positive and negative sentiment, except for the UK and India, where the numbers of both positive and negative tweets are comparatively low and in high agreement with each other. Surprisingly, the February 2020 data for both sentiment types in the US, and especially for negative sentiment in the other countries, is highly positively skewed.

Most countries seem to have less dispersion in April 2020, albeit with quite high maxima for every sentiment type. These interesting patterns can reveal many more hidden insights, which could help authorities gain situational awareness leading to timely planning and actions. Figure 9 shows the distribution of sentiment scores across US counties. Similar to the worldwide sentiment map, the sentiment scores for counties are normalized by the total number of tweets from each county using the weighted average for positive, negative, and neutral tweets. Overall, the negative sentiment dominates across different states and counties. While most counties show strong to moderate negative sentiment, a strong positive sentiment can be observed in Sioux County in Nebraska, Ziebach County in South Dakota, Highland County in West Virginia, and Golden Valley County in Montana. California is mostly on the negative side, whereas New York appears near neutral or on the negative side. Texas seems to represent all ends of the spectrum, covering moderate-to-strong negative as well as some positive sentiment. Florida and Washington are all negative. Overall, the western region is mostly negative, the Midwest is fairly divided but strong in whatever sentiment it exhibits, the Northeast shows less negative intensity (more towards neutral), and the South shows some counties with positive sentiment, though the majority is either negative or neutral.

User type and gender classification
Twitter has 186 million daily active users, with 70.4% male and 29.7% female users 54. Twitter users represent, among others, businesses, government agencies, NGOs, bots, and, most importantly, the general public 55,56. Information about user types is helpful for many application areas, including customer segmentation and engagement 57, making recommendations 58, user profiling for content filtering 59, and more. Moreover, users' demographic information such as gender is important for addressing societal challenges such as identifying knowledge gaps 26, health inequities 28, the digital divide 27, and other health-related issues 29. The tweets in TBCOV are from 87.7 million unique users worldwide, which is 47% of the daily active users on Twitter. Our aim is to determine the accounts that belong to the general public, hereinafter personal accounts, and their gender. However, Twitter provides neither account types nor gender information. To this end, we observed that user-provided names can potentially be used not only to distinguish personal accounts from other types, such as organizational accounts, but their morphological patterns are indicative of gender as well 60,61. For example, the username "Capital Press" is a media account, whereas the username "Laura Sanchez" is a personal account that likely belongs to a female. First, we determine users' type (i.e., personal, organizational, etc.) by applying the English NER model (described previously) to user-provided names. Usernames are preprocessed (i.e., removing URLs, numerals, emojis, tab spaces, and newlines) prior to being fed to the model, which assigns one of eighteen entity types to a username, including person. Entity types for all 87.7 million usernames were obtained, according to which there are 46,504,838 (52.98%) person, 11,909,855 (13.57%) organization, and 29,357,141 (33.45%) miscellaneous user types. More importantly, nearly half (48%) of the tweets in the dataset are posted by personal accounts, 11% by organizational accounts, and 40% by other user types.
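A minimal sketch of this user-type step is given below, using spaCy's English model as a stand-in for the NER model described above; the predicted labels are model-dependent.

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # stand-in English NER model

def user_type(display_name: str) -> str:
    """Map a user-provided display name to a coarse account type."""
    labels = {ent.label_ for ent in nlp(display_name.strip()).ents}
    if "PERSON" in labels:
        return "person"
    if "ORG" in labels:
        return "organization"
    return "miscellaneous"

print(user_type("Laura Sanchez"))  # -> person
print(user_type("Capital Press"))  # -> organization (model-dependent)
```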
Next, we sought to further disaggregate the identified personal accounts (i.e., 46,504,838) by their gender. Prior studies demonstrate that morphological features of a person's given name (also known as a first name or forename) provide gender cues; for example, voiced phonemes are associated with male names and unvoiced phonemes with female names 61. Hence, the first names of the identified personal accounts are employed for training supervised machine learning classifiers. Several publicly available name-gender resources 62-64 were used as our training datasets. Names in these datasets are written in the English alphabet. We combined the datasets and removed duplicates. This process yielded 121,335 unique names, distributed as 73,314 (60%) female and 48,021 (40%) male.
Prior to training the classifiers, the data was split into train and test sets with an 80:20 ratio, and phonetic features were extracted from first names by moving a variable-sized window over them in two directions (i.e., left-to-right and right-to-left). The window of length one moves from its starting point (i.e., either the first or the last character of a name). Each subsequent move increases the window size by one until a threshold value is reached. The threshold limits the number of features extracted in one direction; we learned it empirically by experimenting with values ranging from 1 to 7 (7 being the average length of names in our dataset). Fewer than four features (in one direction) negatively impact classifier performance, whereas larger values yield diminishing returns. Thus, a threshold of four is set, representing the first four and last four features of a name. For example, given the name "Michael", the feature extraction method extracts eight features: four from the start (i.e., 'm', 'mi', 'mic', 'mich') and four from the end (i.e., 'l', 'el', 'ael', 'hael'). The extracted features are then encoded with their corresponding positions in names; e.g., the 'mic' feature in the earlier example carries its position, i.e., first-three-letters. The extracted positional features are then used to train several well-known machine learning classifiers, including Naive Bayes 65, Decision Trees 66, and Random Forests 67. The Random Forests algorithm yielded the best performance and was thus used to process all 87.7 million names. The evaluation of the gender classification model is presented in the next section.
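The sketch below illustrates the positional prefix/suffix features and a Random Forests classifier on a toy sample; the real model is trained on the 121,335 labelled names described above, and the scikit-learn setup is our choice, not necessarily the authors'.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def name_features(name: str, k: int = 4) -> dict:
    """Positional prefix/suffix features, e.g. 'michael' ->
    prefix_3='mic', suffix_4='hael', for window sizes 1..k."""
    name = name.lower()
    feats = {}
    for i in range(1, k + 1):
        feats[f"prefix_{i}"] = name[:i]
        feats[f"suffix_{i}"] = name[-i:]
    return feats

# Toy training sample; illustrative only.
names = ["michael", "laura", "david", "maria"]
genders = ["male", "female", "male", "female"]

model = make_pipeline(DictVectorizer(), RandomForestClassifier(n_estimators=100))
model.fit([name_features(n) for n in names], genders)
print(model.predict([name_features("sandra")]))
```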

The gender classification process identified 19,598,252 (42.14%) female and 26,906,586 (57.86%) male users. The share of tweets posted by male users is about 15 percentage points higher than that of female users: of all 963,681,513 tweets from personal accounts, 558,259,178 (57.93%) are from male and 405,422,335 (42.07%) from female users. We further determine the female-to-male ratio for each country. To choose countries for computing these ratios, we estimated the required sample size for each country, setting the confidence level at 95% and the margin of error to ≤1%. Countries with fewer users (of any gender) than the required sample size were dropped (N = 78). Figure 11 shows the percentage of female users for the countries meeting this representativeness criterion.
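The paper does not state the exact sample size formula; a standard choice consistent with the stated confidence level and margin of error is Cochran's formula, sketched below.

```python
import math

def required_sample_size(z: float = 1.96, margin: float = 0.01, p: float = 0.5) -> int:
    """Cochran's formula: z = 1.96 for 95% confidence; p = 0.5 is the most
    conservative proportion assumption."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(required_sample_size())  # -> 9604 users needed at 95% confidence, 1% margin
```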

Global digital divide
Next, we sought to determine the global digital divide by relying on users' access to different types of devices used for tweeting. From more than two billion tweets, we extracted 1,003 unique application types (provided by Twitter) that support posting tweets, including both web- and mobile-based apps. We manually analyzed all the applications to determine the operating system they are built for (e.g., iOS, Android). Next, based on the operating system information, we categorized each application into one of three device types: (i) Apple, representing all iOS devices such as iPhone, iPad, etc.; (ii) Android, representing all types of Android-based devices; and (iii) Web, representing all web-based applications for tweeting. Finally, an aggregation is performed on device types for each country and the most frequent device type is selected. Figure 12 shows the most frequently used device type in each country. The map shows a device type for 217 countries worldwide. Overall, Android is the most used device type (N = 103, 48%), followed by Apple (N = 97, 45%), and Web is the least used (N = 17, 7%). As Apple devices are more expensive than Android devices, we expected Apple to dominate in rich countries. This assumption holds true except in a few countries, including Niger and Senegal, among others.
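The per-country aggregation reduces to taking the modal device type, as in this sketch over hypothetical per-tweet records.

```python
import pandas as pd

# Hypothetical records: each tweet's country and the device class of its source app.
tweets = pd.DataFrame({
    "country": ["US", "US", "US", "NE", "NE"],
    "device":  ["Apple", "Apple", "Android", "Apple", "Web"],
})

# Most frequent device type per country, as mapped in Figure 12.
dominant = tweets.groupby("country")["device"].agg(lambda s: s.mode().iat[0])
print(dominant)
```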

Trend Analysis
The impact of the COVID-19 pandemic on people's livelihoods, health, families, businesses, and employment is devastating.
To determine whether TBCOV covers information about such unprecedented challenges, we next perform a trend analysis of six important issues. The first two issues are directly related to people's health: (i) tweets about anxiety and depression, and (ii) self-declared COVID-19 symptoms. The next two issues represent severe consequences of COVID-19 that millions of families worldwide faced directly: (iii) deaths of family members and relatives, and (iv) food shortages. The last two issues concern people's social life and preventive measures: (v) face mask usage in public areas as well as mask shortages, and (vi) willingness to take, or having already taken, a vaccine. For each issue, a set of related terms is curated to form logical expressions. For instance, in the case of the "COVID-19 symptoms" issue, we divide it into five sub-groups representing different COVID-19 symptoms listed on the CDC website 5, which can also be seen in Table 4. Several related terms were added to each sub-group to increase recall. For example, for COVID-19 deaths of parents, the "parents" group contains two sets of terms: (i) "father OR mother OR dad OR mom", and (ii) "deceased OR succumbed OR perished OR lost battle OR killed OR my * passed OR my * died" 6. The logical operator 'AND' between these two sets forms the final expression used to retrieve weekly tweets (a minimal matching sketch is shown at the end of this section). The full list of terms will be released with the dataset.

Table 4. Term groups of four topics for trend analysis.

Figure 13 depicts the weekly distributions (in log scale) of the retrieved tweets. Figure 13(a) shows the sub-groups of the COVID-19 symptoms category. The two most reported symptoms in tweets are fever and cough, followed by shortness of breath and headache. Interestingly, reports of loss of taste and smell are almost zero until the end of February 2020, and then suddenly spike from March 8th onward. Figure 13(b) shows trends of the different groups of the anxiety and depression topic. Feelings of sadness and hopelessness seem to dominate throughout the year, followed by anger, outbursts, and frustration. Surprisingly, expressions of suicidal thoughts are captured in the data as well. These particular trends need an in-depth investigation to better understand the motives behind such extreme thoughts so that authorities can intervene and offer counseling.
The weekly trends representing two important and direct consequences of COVID-19 on the general public are shown in Figure 13(c & d): tweets mentioning the death of parents, siblings, relatives, or close connections; and food insecurity in terms of availability, accessibility, adequacy, and acceptability. A large number of tweets reporting deaths is observed, with the majority about parents. The grandparents category and the category representing uncles and aunts are significant as well. Overall, reports of elderly deaths are significantly more frequent than those of the younger population.
Similarly, TBCOV shows coverage of the food insecurity topics (Figure 13(d)). Food availability dominates over food accessibility and adequacy in most weeks. However, food acceptability, other than a few spikes in February and May 2020, remains less of a concern for the public and is thus not discussed much on Twitter. Food shortage was one of the critical issues faced by many countries around the world; this Twitter data might help detect hot-spots with severe food shortages, ultimately helping authorities focus on the most vulnerable areas. Figure 13(e & f) shows trends for mask usage and shortages as well as vaccination. The "importance of mask" category, which includes mask usage, importance of masks, etc., leads the discussion throughout. The mask shortage category spikes in the early months of 2020 and then averages out. Mask violations seem to surge in May and November 2020 and stay steady for the rest of the period. Mask shortage tweets are worth further analysis to find areas with severe shortages. The discussion on vaccines is comparatively lower in volume than all other topics. However, the category on willingness to take, or having already taken, a vaccine is hopeful and spikes in most months, in particular in late 2020 and early 2021.
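Returning to the logical expressions described at the start of this section, the sketch below shows how an AND of two OR-groups (with a simple wildcard) can be applied as regular expressions; the term sets are abbreviated examples, and the full curated lists will ship with the dataset.

```python
import re

# Abbreviated term sets for the "deaths of parents" issue (illustrative).
PARENT_TERMS = re.compile(r"\b(father|mother|dad|mom)\b")
DEATH_TERMS = re.compile(
    r"\b(deceased|succumbed|perished|lost battle|killed)\b"
    r"|\bmy \w+ (passed|died)\b"  # approximates the 'my * passed/died' wildcard
)

def matches_issue(text: str) -> bool:
    """AND of two OR-groups: a parent term and a death term must co-occur."""
    t = text.lower()
    return bool(PARENT_TERMS.search(t)) and bool(DEATH_TERMS.search(t))

print(matches_issue("My dad passed away last night from covid"))  # -> True
print(matches_issue("Miss my mom, stay safe everyone"))           # -> False
```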

Data Records
The TBCOV dataset is shared through the CrisisNLP repository 7. The dataset contains three types of releases covering different dimensions of the data. Specifically, we offer a base release including a comprehensive set of attributes such as tweet ids, user ids, sentiment labels, named-entities, geotagging results, user types, and gender labels, among others. The base release contains tab-separated values (TSV) files, one per data collection month (i.e., February 1st, 2020 to March 31st, 2021). In addition to the base data, we offer two additional releases consisting of tweet ids for the top 20 languages and the top ten countries. The purpose of the id-based releases is to maximize data accessibility for data analysts targeting one or a few languages or countries in their analyses. Additional releases will be provided based on end-user demand. We make the dataset publicly available for research and non-profit uses. Adhering to Twitter's data redistribution policies, we cannot share full tweet content.
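As a usage sketch, a monthly base-release file can be loaded with pandas; the file and column names below are hypothetical, so consult the release documentation for the actual schema.

```python
import pandas as pd

# Hypothetical file and column names; check the released README for the schema.
df = pd.read_csv(
    "tbcov_2020-02.tsv",
    sep="\t",
    dtype={"tweet_id": str, "user_id": str},  # keep 64-bit ids as strings
)
print(df["sentiment_label"].value_counts(normalize=True))
```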

Validation of geotagging approach
To evaluate the proposed geotagging method, we first obtain ground-truth data for the different attributes. Geotagged tweets with GPS coordinates, i.e., latitude and longitude, were used as ground truth for the evaluation of the place field. Specifically, tweets with (i) geo-coordinates and (ii) place fields were sampled and their location granularities, such as country, state, county, and city, were obtained. Finally, we compute the precision metric, i.e., the ratio of correctly predicted location granularities to the total predicted outcomes (i.e., the sum of true positives and false positives). Table 5 shows the evaluation results along with the number of sampled tweets (in parentheses). All location granularity scores, except county, are promising. The evaluation of the user location geotagging method is performed on a manually annotated 8 random sample of 500 user locations. Specifically, each user location string was examined to determine its corresponding country, state, county, and city. Google search, Wikipedia, and other sources were used to search and disambiguate in case multiple candidates emerged. Location strings such as "Planet earth" were annotated as "NA" and used in the evaluation procedure (i.e., the system's output for an "NA" case is considered a true positive if blank and a false positive otherwise). Table 6 shows the evaluation results in terms of precision, recall, and F1-score. Overall, the F1-scores for all location granularities are high. However, fine-grained location resolution poses more challenges for the method (e.g., the recall at the city level is 0.656 compared to 1.0 at the country level).

7 https://crisisnlp.qcri.org/tbcov
8 The authors of this paper performed the manual annotation.
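The sketch below mirrors this scoring convention for a single location granularity; how "NA" cases enter the recall computation is our assumption, as the paper only specifies their treatment for precision.

```python
def precision_recall(gold: list, pred: list) -> tuple:
    """Score one granularity (e.g., country). None stands for 'NA'/blank:
    a blank prediction on an 'NA' gold label counts as a true positive,
    a non-blank one as a false positive (per the paper's convention)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g is None:
            if p is None:
                tp += 1
            else:
                fp += 1
        elif p is None:
            fn += 1
        elif p == g:
            tp += 1
        else:
            fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(["France", None, "Spain"], ["France", None, "Italy"]))
# -> (0.666..., 1.0)
```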
Lastly, to evaluate the text-based attributes (i.e., tweet text and user profile description), 1,000 tweets were randomly sampled and crowdsourced on Appen 9, a paid crowdsourcing platform. Specifically, given a tweet text, annotators were asked to (i) tag toponyms (i.e., location names such as USA or Paris) and (ii) specify the location type (i.e., country, state, county, or city) of the identified toponyms. Three evaluation metrics, i.e., precision, recall, and F1-score, were computed using the annotated location tokens. Table 7 presents the geotagging evaluation results for the two text-based attributes. Geotagging at the country and state levels yields promising F1-scores (0.803 and 0.703, respectively). However, the results for county and city are weak.


Validation of person user type
Since our main focus is on tweets posted by the general public, here we evaluate the person entity predictions. A random sample of 200 model predictions of the person entity was selected for evaluation. The sampled accounts were manually checked by the authors of this paper and marked as either person or non-person. The manual investigation revealed 186 user accounts with correct and 14 with incorrect model predictions. This yields a precision of 0.93 for the person category, which is quite promising.

Validation of gender classification
To evaluate the gender classification model, 20% (i.e., 24,267) of the 121,335 annotated names were randomly sampled and held out during the training phase. This unseen hold-out set was used to test the model and compute several evaluation metrics. Table 8 shows the evaluation results. The F1-score of the female class is very reasonable (0.878) compared to the male class (0.807). This is probably due to the higher prevalence of the female class in the training set.

Usage Notes
All the collected data is persisted in an Elasticsearch 7.10 database. The code used for data processing is written in Python 3. The code required to hydrate tweets and to use the provided base release files is available on GitHub 10. Furthermore, we postulate that this large-scale, multilingual, geotagged social media data can empower multidisciplinary research communities to perform longitudinal studies, evaluate how societies are collectively coping with this unprecedented global crisis, and develop computational methods to address real-world challenges, including but not limited to the following:

• Disease forecasting and surveillance lead to the early detection and prevention of an outbreak. Moreover, early warning systems alert authorities and healthcare providers to prepare for and respond to outbreaks in a timely fashion. TBCOV's broad topical coverage, particularly of self-reported symptoms and deaths, can be a strong input for early warning systems.

• Identification of fake information is essential to tackle negative influences on societies, especially during health emergencies. Tweets' temporal information, re-sharing and retweeting patterns, and the use of a specific tone in the textual content can potentially lead to the identification of rumors and fake information. The more than two billion tweets in the TBCOV dataset are a goldmine for detecting conspiracies, rumors, and misinformation circulated on social media (e.g., that drinking bleach can cure COVID-19). More importantly, the data can be used to develop robust models for fake news and rumor detection.

• Understanding communities' knowledge gaps during emergency situations such as the COVID-19 pandemic is crucial for authorities to deal with the surge of uncertainties. TBCOV's comprehensive geographic and temporal coverage can be analyzed to understand public questions and queries.

• Identification of shortages of important items such as Personal Protective Equipment (PPE), oxygen, and face masks becomes a top priority for governments during health emergencies. Building models to identify pertinent social media reports could help authorities plan for and prevent the devastating consequences of shortages.

• Understanding public sentiment and reactions to government policies, such as lockdowns and closures of businesses, as well as to slow response or vaccination rates, can be performed using social media data such as TBCOV.

• Rapid needs assessment informs humanitarian organizations' and governments' response operations and determines relief priorities for an affected population during emergencies such as the COVID-19 pandemic. Our trend analysis results highlighted the effectiveness of TBCOV for mining the priority needs of a population in terms of food, cash, medicines, and more.

• Identification of self-reported symptoms such as fever, cough, and loss of taste through social media data could indicate a likely future hot-spot when reports spike in a geographical area. TBCOV tweets geotagged with fine-grained locations, such as counties and cities, can be useful for building models for symptom detection and hot-spot prediction.

• Finding correlations is an important measure of the relationship between two variables. We remark that the TBCOV dataset can be used to perform various types of correlation analysis to detect patterns and generate hypotheses. These analyses include, but are not limited to, finding correlations between COVID-19 cases and self-reported symptoms on Twitter, or between COVID-19 cases and death reports.
Correlations between COVID-19 cases and negative sentiment in a geographical location, between the surge of messages showing anxiety and the unemployment rate, or between daily negative tweets and the rate of food insufficiency in an area can open new avenues for interesting analyses. The aforementioned topics mainly cover real-world applications of the TBCOV dataset. However, we believe the dataset is also useful for several computing problems, such as unsupervised learning to identify clusters of related messages, transfer learning between topical and language domains, geographic information systems, automatic recognition and disambiguation of location mentions, named-entity extraction, and topic evolution and concept-drift detection, among others.

Code availability
The code to use this dataset is available through https://github.com/CrisisComputing/TBCOV. The code repository contains scripts to hydrate tweets using the released tweet ids; the hydration process fetches full tweet content from the Twitter APIs. Moreover, we provide code to use the base release data files efficiently, particularly for analyses focusing on specific languages or countries.
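For readers who want a self-contained starting point, the sketch below hydrates a batch of tweet ids against the Twitter API v2 tweets lookup endpoint (up to 100 ids per call). The repository's scripts should be preferred; the bearer token and field selection here are illustrative.

```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # your own Twitter/X API credential

def hydrate(tweet_ids: list) -> list:
    """Fetch full tweet objects for up to 100 ids via the v2 tweets lookup."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "ids": ",".join(tweet_ids[:100]),
            "tweet.fields": "created_at,lang,geo,public_metrics",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Deleted or protected tweets are omitted from the 'data' array.
    return resp.json().get("data", [])
```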