A Geo-Tagged COVID-19 Twitter Dataset for 10 North American Metropolitan Areas over a 255-Day Period

One of the unfortunate findings from the ongoing COVID-19 crisis is the disproportionate impact the crisis has had on people and communities who were already socioeconomically disadvantaged. It has, however, been difficult to study this issue at scale and in greater detail using social media platforms like Twitter. Several COVID-19 Twitter datasets have been released, but they have very broad scope, both topically and geographically. In this paper, we present a more controlled and compact dataset that can be used to answer a range of potential research questions (especially pertaining to computational social science) without requiring extensive preprocessing or tweet-hydration from the earlier datasets. The proposed dataset comprises tens of thousands of geotagged (and in many cases, reverse-geocoded) tweets originally collected over a 255-day period in 2020 over 10 metropolitan areas in North America. Since there are socioeconomic disparities within these cities (sometimes to an extreme extent, as witnessed in ‘inner city neighborhoods’ in some of these cities), the dataset can be used to assess such socioeconomic disparities from a social media lens, in addition to comparing and contrasting behavior across cities.


In addition to its medical consequences, the ongoing COVID-19 crisis has also geographically). What is missing is a carefully controlled dataset that would enable computational social scientists in specific contexts to study the issue from a social media lens without much hassle. At the same time, rather than reinvent the wheel and collect raw data from scratch, it should be possible to use some of these earlier larger datasets as a starting point for constructing the more controlled (and also, appropriately augmented) dataset.

In this paper, we address these desiderata by presenting a dataset that, while compact, contains many tens of thousands of tweets that comprise a subset of the broader GeoCOV19Tweets dataset, originally obtained by filtering English tweets from the Twitter streaming API by using a continuously updated, expansive list of keywords and hashtags [7]. As of this writing, the GeoCOV19Tweets Twitter feed is monitored using 90+ keywords and hashtags commonly used when referencing the pandemic. Although only English tweets were gathered, the data collection has global span. Each collection starts between 10:00 and 11:00 hrs GMT+5:45 every day [9]. The data collection started on March 20, 2020 and has been updated daily with newly collected tweet IDs.

Unlike the GeoCOV19Tweets dataset, our dataset has further filtered the tweets

The data release comprises 10 JavaScript Object Notation (JSON) files, one for each of the 10 metropolitan areas. The top-level JSON value within each file is a list, each element of which is a dictionary (itself a JSON object, since JSON is defined recursively). Each dictionary represents a tweet. In addition to containing geographic metadata, the tweet also retains the sentiment score in the GeoCOV19Tweets data.
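As a minimal sketch of how one of these files could be read, the snippet below parses a JSON document and checks that it has the list-of-dictionaries layout described above. The key names `tweet_id` and `sentiment_score` and the sample values are illustrative assumptions, not the released schema.

```python
import json


def load_tweets(json_text):
    """Parse one metropolitan-area file: a top-level list of tweet dictionaries."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list")
    for record in records:
        if not isinstance(record, dict):
            raise ValueError("expected every list element to be a JSON object")
    return records


# Hypothetical two-tweet file; the keys and values are assumptions for illustration.
sample_text = json.dumps([
    {"tweet_id": "1", "sentiment_score": 0.25},
    {"tweet_id": "2", "sentiment_score": -0.5},
])

tweets = load_tweets(sample_text)
print(len(tweets), tweets[0]["sentiment_score"])
```

In practice the text would come from reading one of the 10 released files rather than from an in-memory string.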

Concerning quality, we note that since the dataset is an augmented subset of GeoCOV19Tweets, the coverage of the dataset is bounded above by the coverage of Geo-

Figure 2. A workflow illustrating the methodology behind data processing and collection as applied to the underlying GeoCOV19Tweets dataset to obtain the proposed dataset.

One of the issues with the tweets hydrated from GeoCOV19Tweets is that some tweets contain a user-defined location tag that may be different from the origin of the tweet. We are interested in the latter rather than the former. To enforce this constraint, we only consider tweets having a populated (i.e., non-null) "coordinates" object in the metadata. This precludes inclusion of tweets that may be assigned a user-defined location in the "place" object associated with the tweet, even though the tweet itself did not originate from that place. However, for tweets returned from the Twitter Developer API with the "coordinates" object defined, the "place" object is populated to correspond to the location indicated by the "coordinates" object. Therefore, an important processing step upon hydration is to filter out tweets that do not have a "coordinates" object defined in the metadata.

As mentioned earlier, although the Twitter Developer API deduces the "place" object from the "coordinates" object associated with the tweet, the "place" object is sometimes None in a tweet even though the "coordinates" object is defined. In these instances, we reverse-geocode the latitude and longitude in the "coordinates" object using the Geocodio tool⁷.
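The coordinate filter, and the split between tweets that already carry a "place" object and those that need reverse-geocoding, might be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, and the sample tweets (including their coordinate values) are invented placeholders that follow the hydrated-tweet layout in which "coordinates" is a GeoJSON point of the form [longitude, latitude].

```python
def has_point_coordinates(tweet):
    """True when the tweet carries a populated (non-null) "coordinates" object."""
    coords = tweet.get("coordinates")
    return isinstance(coords, dict) and coords.get("coordinates") is not None


def split_by_place(tweets):
    """Keep only geotagged tweets, then separate those lacking a "place" object."""
    geotagged = [t for t in tweets if has_point_coordinates(t)]
    with_place = [t for t in geotagged if t.get("place") is not None]
    needs_reverse_geocoding = [t for t in geotagged if t.get("place") is None]
    return with_place, needs_reverse_geocoding


# Placeholder tweets; ids, coordinates, and place values are invented for illustration.
hydrated = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [-100.0, 40.0]},
     "place": {"place_type": "city", "full_name": "Example City"}},
    {"id": 2, "coordinates": {"type": "Point", "coordinates": [-100.5, 40.5]},
     "place": None},
    {"id": 3, "coordinates": None,
     "place": {"place_type": "city", "full_name": "User-Tagged City"}},
]

with_place, needs_reverse_geocoding = split_by_place(hydrated)
print([t["id"] for t in with_place], [t["id"] for t in needs_reverse_geocoding])
```

Tweet 3 is dropped even though its "place" object is populated, which mirrors the constraint that a user-defined place alone does not establish where the tweet originated.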

Geocodio's API allows both forward- and reverse-geocoding within the United States and Canada, returning up to five possible matches ranked by an accuracy score between 0.00 and 1.00. When the geocoding service is needed (i.e., when a "coordinates" object, but not a "place" object, exists within the tweet metadata), we use the reverse-geocoded result with the highest accuracy score to deduce the location's city, state, zip code, and country.
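Selecting the highest-accuracy match can be sketched as below. The snippet makes no network call and does not reproduce Geocodio's actual response format: the candidate dictionaries, their `accuracy` and `address_components` keys, and all values are assumptions chosen to illustrate the selection step only.

```python
def best_match(candidates):
    """Return the candidate with the highest accuracy score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["accuracy"])


def to_location(candidate):
    """Extract the four location fields of interest from a chosen candidate."""
    parts = candidate["address_components"]
    return {
        "city": parts.get("city"),
        "state": parts.get("state"),
        "zip": parts.get("zip"),
        "country": parts.get("country"),
    }


# Invented reverse-geocoding candidates (assumed shape, placeholder values).
candidates = [
    {"accuracy": 0.5, "address_components": {
        "city": "Other Town", "state": "XX", "zip": "11111", "country": "US"}},
    {"accuracy": 0.9, "address_components": {
        "city": "Example City", "state": "XX", "zip": "00000", "country": "US"}},
]

location = to_location(best_match(candidates))
print(location)
```

Returning None for an empty candidate list leaves the caller to decide whether such tweets are discarded or kept without location details.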

Unlike the Twitter Developer API, Geocodio is unable to provide the "place_type" of the latitude-longitude location. The "place_type" field returned by the Twitter API contains a description of the tweet's origin (e.g., "city" or "admin"). However, Geocodio can return the location's zip code. For this reason, the "place_type" field in the presented dataset may contain either the "place_type" extracted directly from the Twitter API (which is a string) or the zip code obtained from Geocodio. We note, however, that in most cases, the "coordinates" object did not need to be reverse-geocoded and we were simply able to use the populated "place" object provided directly in the tweet.

States-Canada region, it is disregarded due to its significant French-speaking population⁸

⁵ https://scholarslab.github.io/learn-twarc/
⁶ The GeoCOV19Tweets dataset begins on March 20, 2020.
⁷ https://www.geocod.io/
⁸ According to https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&Geo1=CMACA&Code1=462&Geo2=PR&Code2=01&Data=Count&SearchText=montreal&SearchType=Begins&SearchPR=01&B1=All&TABID=1

To determine whether a tweet originates from one of the 10 areas of interest (Figure 3), bounding rectangles are drawn around the city and its surrounding neighborhoods. Any tweet that originates within or on the bounding box is labeled with the respective metropolitan area. The selected coordinates for each metropolitan area are tabulated in
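The bounding-box labeling rule (a point within or on the rectangle receives the area's label) could be implemented along the following lines. The `BoundingBox` type, the `label_area` helper, and the `example_area` rectangle are hypothetical; the rectangle's coordinates are placeholders and are not the values selected for any of the 10 metropolitan areas.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon, lat):
        """Inclusive test, so points on the rectangle's edge also count."""
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


def label_area(lon, lat, boxes):
    """Return the name of the first area whose rectangle contains the point, else None."""
    for name, box in boxes.items():
        if box.contains(lon, lat):
            return name
    return None


# Placeholder rectangle; not one of the dataset's actual bounding boxes.
boxes = {"example_area": BoundingBox(-101.0, 39.0, -99.0, 41.0)}

print(label_area(-100.0, 40.0, boxes))   # inside the rectangle
print(label_area(-101.0, 39.0, boxes))   # on the corner, still labeled
print(label_area(-90.0, 40.0, boxes))    # outside every rectangle
```

A simple inclusive comparison is enough here because the rectangles are axis-aligned in longitude and latitude; overlapping rectangles would need an explicit tie-breaking rule, which this sketch resolves by dictionary order.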

A number of datasets related to COVID-19 have been released in the last year. We note below the ones most closely related to the presented dataset, and describe how our dataset complements them.

The USC Dataset by [6] tracks social media discourse about the COVID-19 pandemic. While broad (comprising more than a hundred million tweets at present), the dataset does not limit itself to geotagged tweets, and it is difficult to obtain this data without first hydrating the tweets. The dataset is therefore useful for large-scale studies tracking COVID discourse on social media (as the authors present its primary use case to be) but not as much for more constrained studies within a specific locational and topical context. Other similar examples include the COVID-19 Twitter Dataset [5] and TweetsCOV19 [8]. Earlier, we also provided details on GeoCOV19Tweets [7], which served as the primary superset on which the presented dataset is based.

Other examples that are designed for studying specific topics or are in specific

The presented dataset, with its metadata, is findable as it has been hosted on Zenodo and assigned a globally unique and eternally persistent DOI. It has also been described with rich metadata, and indexed in a searchable resource. The metadata also specify the data identifier. The dataset is also accessible and interoperable, the latter due to the use of a formal, accessible, shared and broadly applicable language (JSON) for knowledge representation. Finally, the data is re-usable as it is also associated with provenance, since it has been derived from GeoCOV19Tweets, and the tweets are also accompanied by their IDs. The dataset also has a plurality of accurate and relevant attributes (including location, hashtags and sentiment score) that could be used to answer a range of computational social science questions.

¹⁰ As mentioned, the original dataset began sampling tweets on March 20, 2020. The average sentiment score for March is therefore taken over an 11-day period.

Figure 5. The 10 most prevalent hashtags (determined over all the tweets in our dataset). For each hashtag, we also show the metropolitan areas with the lowest and highest prevalence.

We hypothesize that the dataset can be used to address a range of research questions:

1. Given that different metropolitan areas were impacted differently by COVID-19 (in  cities?

We note again that an important advantage of this dataset is that some of these questions can be answered relatively quickly, due to the far lower number of tweets that would have to be hydrated. For other questions, only sentiment analysis or hashtags may be necessary, which would require no hydration at all as we provide these metadata.
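To illustrate the kind of analysis that needs no hydration, the sketch below counts hashtags and averages sentiment per month directly from tweet records. The key names `hashtags`, `sentiment_score`, and `created_at`, the ISO-style timestamp format, and all sample values are assumptions made for the example and may differ from the released files.

```python
from collections import Counter, defaultdict


def hashtag_counts(tweets):
    """Count case-insensitive hashtag occurrences across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in tweet.get("hashtags", []))
    return counts


def mean_sentiment_by_month(tweets):
    """Average the sentiment score per "YYYY-MM" month prefix of the timestamp."""
    totals = defaultdict(lambda: [0.0, 0])
    for tweet in tweets:
        month = tweet["created_at"][:7]  # assumes an ISO-style "YYYY-MM-DD..." string
        totals[month][0] += tweet["sentiment_score"]
        totals[month][1] += 1
    return {month: total / n for month, (total, n) in totals.items()}


# Invented records for one metropolitan area (placeholder values).
sample = [
    {"created_at": "2020-03-21", "sentiment_score": 0.5, "hashtags": ["COVID19", "StayHome"]},
    {"created_at": "2020-03-25", "sentiment_score": 0.0, "hashtags": ["covid19"]},
    {"created_at": "2020-04-02", "sentiment_score": -0.5, "hashtags": []},
]

print(hashtag_counts(sample).most_common(2))
print(mean_sentiment_by_month(sample))
```

Running the same two functions on each of the 10 files would give per-area hashtag prevalence and monthly sentiment that can then be compared across cities.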