1. Introduction
The rapid development of information and communications technology (ICT) has led to a proliferation of highly personalized mobility data extracted from social media posts. Social media services, especially micro-blogging platforms like Twitter, make it easy for people to share their thoughts about real-time events spontaneously, providing information that researchers can extract and use for a variety of purposes. From 2007 to 2013, the number of daily tweets rocketed from five thousand to 500 million worldwide [
1]. The increasing amount of tweets and the abundant geographical content embedded in tweets have turned Twitter into a great resource for geographic mobility studies. Twitter data have been a great boon to geography researchers who struggle to collect geographic event information, which can be fleeting and dynamic.
Several key features of Twitter make it valuable in monitoring how events develop. First, the Twitter platform enables users to tweet about what is happening at any time and in any location. Second, users functioning as social sensors constantly report real-time or near real-time information to the public. The retweet feature (re-posting of someone else’s tweet) helps information spread wider and faster. Third, the geo-referenced tweets also provide explicit time-space descriptions of the events. Fourth, the large user base (both individuals and organizations) and broad geographic distribution of users offer a wide coverage of events across the world. These features have attracted a growing number of researchers who use Twitter data to investigate urban human activity and mobility patterns [
2,
3,
4].
Despite the advantages of Twitter in geography research, the data heterogeneity and big data size make extracting useful information from Twitter data challenging [
5]. Messages from users address a variety of topics and emotions, personal interests, and activities. Tweets also use abbreviations and unusual expressions or words. Extracting consistent information about events is difficult due to the abundant and varied content. Some major events can trigger a huge number of posts in a very short time period, making it difficult to efficiently handle the large data volume in time-sensitive applications. In recent years, many systems and algorithms have been developed to address these challenges [
6], including approaches that analyze the spatial distribution of geotagged tweets, such as geospatial clustering or spatial-temporal scans [
7]. Most prior studies focus on developing algorithms for event detection; less attention has been given to analyzing the spatial-temporal evolution of the detected events. Finding patterns and sequences of events becomes essential in a state of continual event flux [
8]. Questions such as when and where events start to form, and how events shape the evolution of tweets in space and time, remain less documented.
In this paper, we present a systematic approach for harvesting, processing, and analyzing social media data in order to delineate the spatiotemporal evolution of events.
Our approach offers three unique contributions to the literature. First, this study uses real-time streaming Twitter data: unlike approaches such as spatial-temporal scans, which are mainly used for batch processing, it applies two moving windows to efficiently identify potential real-time events in the study area. Second, the proposed approach simultaneously discovers regional and local events based on features from multiple dimensions of tweets. Third, we explore the spatial-temporal evolution and dynamics of natural and social events.
In this study, we develop an efficient approach based on machine learning and geovisualization by utilizing multiple dimensions of tweets (message, author, time, and location information), to identify the evolution of events, including planned events (e.g., festivals or sports) and incidental events (e.g., disasters or accidents). We trace the trajectory of the events through space and time. We demonstrate the method with two case studies, which analyze the temporal movement patterns of events in the New York City—Washington, DC area. By synthesizing multiple dimensions of Twitter data, this paper presents a method for creating spatiotemporal trajectories of events by mining voluntary data from social media platforms. It demonstrates a means of leveraging local knowledge to better depict city dynamics and discover spatiotemporal characteristics of events.
2. Related Work
Analyzing social media data to obtain geospatial information and event-related knowledge has received increasing attention [
9,
10,
11]. Social media data present an unprecedented opportunity to study temporal dynamics in near real time and at multiple scales [
8]. However, due to the noisy and complex nature of social media messages, extracting meaningful information is nontrivial. For instance, more than 200 million tweets were posted each day in 2011 [
12]. Important urban information is often buried in a large pool of irrelevant data. Extracting meaningful information without smart text analytics and efficient strategies is practically impossible [
12].
To facilitate such data extraction, recent studies have developed methods to capture the spatiotemporal patterns of human activities and urban events from Twitter data [
6,
13]. The event detection methods in these studies can be largely classified as targeted or general. Targeted event detection usually focuses on certain types of events based on a selection of words or hashtags, such as earthquakes [
14], influenza epidemics [
15], and sports games [
16,
17]. Tweets containing certain keywords or hashtags, such as “earthquake” or “NFL” can be used to accurately detect events related to the topic of interest. However, the collection of keywords may be subjective and exclude many other tweets related to the events. It may also require prior experience of the event to select the appropriate words to track [
7]. Some recent studies have developed algorithms to deal with this issue. The TEDAS system was developed to detect crime and disaster-related events (CDE). This study manually set a collection of keywords related to CDE as seeds, and then applied an iteratively refined algorithm to extract new related keywords [
4]. Laylavi et al. (2017) assessed the degree of relatedness of Twitter messages to a specific event of interest [
18]. Wang et al. (2012) used a semantic role labeling approach to target crime-related tweets [
19].
General event detection, in contrast, focuses on emerging topics that attract the attention of a large population (e.g., a hurricane or national festival) or local incidents that happen quickly in time and densely in space (e.g., traffic accidents or parades). A variety of methods have been used to detect these general events. Content-based detection methods use either document-pivot or term-pivot techniques [
20,
21]. Document-pivots usually apply clustering techniques to a document-term matrix to detect a topic in a large corpus. Term-pivot techniques work on n-grams features, aiming to detect representative terms for the event in question. Many data mining techniques have been used in these two approaches, including hierarchical clustering techniques based on pairwise distances [
21], wavelet analysis of word frequencies to obtain features for each word [
22], and locality sensitive hashing (LSH) to discover potential events [
23]. For coordinate-based detection, spatial proximity has been widely used to prepare candidate tweets for local events [
3,
24]. DBSCAN has also been used to discover clusters with arbitrary shape [
25]. The detected hot spots are likely to be associated with certain events.
The space-time scan statistic has been used to look for clusters of tweets across both space and time, regardless of tweet content. This method can detect various events, even within a relatively short time of data collection [
7]. Temporal patterns of tweets can also be used to recognize events. Events usually exhibit a burst of features in Twitter streams, such as a sudden increase in specific keywords [
20]. Lee used a sliding window technique to detect context changes and weighed message streams accordingly [
26]. Boettcher and Lee used density-based clustering techniques on the tweets captured within a sliding time interval to detect potential events [
24].
Many of the methods used to extract events, such as the space-time scan method, are based on location clustering techniques. These techniques are effective in the retrospective event detection (RED) context, because historical datasets usually contain rich point coordinates. However, approaches are needed to tackle the challenge of new event detection (NED) from real-time streams. Prior methods for NED, such as hierarchical clustering, are computationally intensive and slow; lightweight and efficient methods are needed to process real-time tweet data.
Many studies focus on event detection techniques, but fewer of them explore the spatial-temporal evolution of these events. Social media data may embed semantic meaning, background information, and sentiments in the content. This content is sometimes geo-tagged, either in the form of precise location from where these tweets were posted, or as toponyms of these locations [
9]. Studies have reported that the percentage of precisely geo-tagged tweets may vary depending on the event, time, and location, ranging approximately from 0.5% to 5.0% of the total data corpus [
9,
27,
28]. Although the overall percentage of geo-tagged tweets is not high, it is still possible to discern geo-tagged events from tweets at an aggregated level, especially at a regional scale. The semantic and locational information at regional scales provides a good opportunity to analyze the spatial-temporal evolution of events. The sentiments embedded in the tweets can also track public attitudes and emotions as the event develops. A few geographical studies have explored the progression of events by harvesting and analyzing geospatial information from social media content. These studies have explored the progression of natural disasters such as wildfires [
29] and earthquakes [
30]. The spatiotemporal analysis of Twitter content has also been used to track disease outbreaks and distribution [
31,
32].
This study aims, first, to develop an efficient approach that quickly scans multiple dimensions of tweets to capture real-time regional events, including planned events (e.g., festivals or sports) and accidental events (e.g., disasters or accidents), and to formulate thematic depictions of these events at their points of origin. Second, we trace the spatiotemporal trajectories of the formulated events to investigate the spatiotemporal characteristics of the detected regional events and examine people's reactions to them.
3. Methods
Figure 1 shows the overall data processing flow. Real-time Twitter data published via the streaming application-programming interface (API) are collected and parsed as MongoDB documents. Then, spatiotemporal information is read from the MongoDB documents for the pattern recognition process. The computational model for pattern recognition was implemented in R. Leaflet and R were used to visualize the results.
3.1. Data Collection
Data used in this study were piped into our system through the Twitter streaming API in real time. Twitter's Geo API allows users to collect real-time tweets posted within a geographic area defined by a bounding box. In this study, we drew a bounding box covering the metropolitan areas from New York City to Washington, DC. This area not only has one of the largest Twitter user populations but also contains cities with distinctive and prominent socioeconomic status: Washington, DC is the U.S. capital, and New York City is a world city with a decisive role in the world economy. The data are published in a standardized key-value structure, from which much valuable geographic information can be extracted, such as user profiles and the geographic locations of tweets. Tweets collected from the Geo API have at least one type of location data, such as location content, time zones, place names, or global positioning system (GPS) measurements. GPS information gives the most accurate point location of where a tweet was posted. The estimated median horizontal error for GPS on smartphones is about 5–8.5 m [
33]. As the majority of the tweets used GPS to denote their locations, we used only tweets with GPS information. In this study, the actual tweet message, the posting location, and the time of the post were parsed. Each parsed tweet can be represented with the following expression: tw = (id; uid; twtxt; twtime; twloc), where id is the tweet ID, uid is the user ID, twtxt is the tweet message, twtime is the timestamp, and twloc is the geolocation. The information from each parsed tweet was saved as a MongoDB document. MongoDB supports both text (non-spatial) and spatial queries. To improve query performance, two indices were created: a non-spatial index on id and a spatial index on twloc. The spatial index captures the relative geographic relations of tweets, making it efficient to find nearby tweets.
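The parsing step above can be sketched as follows. This is an illustrative Python fragment (the study's own pipeline is not published), using the field names of Twitter's v1.1 streaming payload and keeping only tweets carrying GPS point coordinates:

```python
def parse_tweet(raw):
    """Parse a raw status object from the streaming API into the
    tw = (id, uid, twtxt, twtime, twloc) representation described above.
    Field names follow Twitter's v1.1 JSON payload; tweets without
    GPS point coordinates are discarded."""
    coords = raw.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None  # no GPS information: skip this tweet
    lon, lat = coords["coordinates"]  # GeoJSON order is [lon, lat]
    return {
        "id": raw["id"],
        "uid": raw["user"]["id"],
        "twtxt": raw["text"],
        "twtime": raw["created_at"],
        "twloc": {"type": "Point", "coordinates": [lon, lat]},
    }
```

With PyMongo, the two indices would then be created with something like `col.create_index("id")` and `col.create_index([("twloc", "2dsphere")])`, the latter enabling geospatial queries on the GeoJSON point.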
3.2. Data Pre-Processing
In this step, tweets were cleaned and filtered: non-English tweets, special characters, and stop words were removed; capital letters were replaced with lower case; and each tweet was tokenized into individual words. Stop words were retrieved from the SMART list in the R tm package. We also detected other popular words in the study area based on historical tweets, such as "feel," "watch," and "friend," that are not included in stop-word packages. These words are less useful in the detection process and hence were removed from the set. We used the MC toolkit to tokenize a document into a vector space. The MC toolkit is a C++ based program that creates vector-space models from text documents using a multi-threaded implementation that can efficiently process very large document collections [
34]. Suppose tw is the tweet being processed and twtxt is the content of tw. The pre-processing splits twtxt around whitespaces to generate a set of words W = {w1, w2, ..., wn}; in this way, each cleaned tweet message is transformed into its corresponding set of tokens.
3.3. Layer Construction Module
Based on timestamps, tweets in a one-hour time window were first extracted and sent to a layer constructor. The constructor mapped each tweet to a token-number tuple Pwn = (w, n), a token-user tuple Pwu = (w, u), and a token-coordinate tuple Pwc = (w, C), where n is the count of each token w, u is the number of users who mentioned the token w, and C is a list of coordinate pairs (lat, lon) in the study area associated with w. The constructor also computed each token's frequency f(w) and the user frequency f(u) for each token w in the corpus. Tokens with a frequency smaller than three and tokens mentioned by only one user were excluded: these words account for a large proportion of total words but are very likely to be noise and are rarely associated with potential events. We thus discard these tokens and encapsulate the rest into key-value hash tables, where the keys are tokens and the values are token frequencies, user numbers, and lists of coordinates associated with each token. The result of this step is three hash sets whose keys are the retained tokens and whose values are the corresponding values from Pwn, Pwu, and Pwc. A layer is a list made up of these three hash sets.
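The layer constructor can be sketched as follows. The data layout is illustrative, but the filtering rule (frequency below three, or a single user) follows the description above:

```python
from collections import defaultdict

def build_layer(tweets, min_freq=3, min_users=2):
    """Build one hourly layer: three hash sets keyed by token, holding
    token frequency (Pwn), user count (Pwu), and coordinate lists (Pwc).
    `tweets` is an iterable of (uid, tokens, (lat, lon)) triples."""
    freq = defaultdict(int)
    users = defaultdict(set)
    coords = defaultdict(list)
    for uid, tokens, latlon in tweets:
        for w in tokens:
            freq[w] += 1                 # every occurrence counts
        for w in set(tokens):
            users[w].add(uid)            # each user counted once per token
            coords[w].append(latlon)     # one coordinate per tweet mentioning w
    # drop low-frequency and single-user tokens (likely noise)
    keep = {w for w in freq if freq[w] >= min_freq and len(users[w]) >= min_users}
    return ({w: freq[w] for w in keep},
            {w: len(users[w]) for w in keep},
            {w: coords[w] for w in keep})
```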
In order to find the tokens that are potentially related to events, we used bursty word detection techniques. Bursty words are spikes in the frequency of tweets along the time spectrum. Event detection based on bursty words is similar to trend detection [
24]. Instead of using traditional methods to detect bursty words, we designed two floating time-window pairs to better reveal events. The first window pair compares tokens that occurred in the most recent hour hi with the same tokens that occurred in the past four hours, hi-1 to hi-4. The second window pair compares tokens that occurred in the most recent hour hi with the same tokens that occurred at the same time a week earlier, hd-7. The system maintains two queues to store the data of the two time-window pairs. Each queue contains five components corresponding to five tables, one for each hour (Figure 2). When the system first launches, all ten components are calculated at once (represented by color blocks in Figure 2). In the following hours, only two components (the most recent hour and the same hour a week ago) are pushed to their respective time-window queues; the oldest components are popped from the queues and the rest are kept (represented in black and white in Figure 2). This moving-window design largely reduces the computational demand, since only two of the ten components need to be updated each hour. Tokens in the most recent hour hi are marked as the reference layer RL. Classification features are prepared based on the reference layer and are introduced in the following section.
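A sketch of the two five-slot queues, assuming each hourly layer has already been built by the constructor above; `deque(maxlen=5)` pops the oldest component automatically when a new one is pushed:

```python
from collections import deque

class WindowPair:
    """Maintain the two five-slot queues of hourly layers.  Each hour,
    only two new layers are computed and pushed (the most recent hour
    and the same hour a week earlier); the oldest layers drop off."""
    def __init__(self):
        self.recent = deque(maxlen=5)    # hi-4 ... hi
        self.lastweek = deque(maxlen=5)  # the same hours, seven days earlier

    def push(self, layer_now, layer_week_ago):
        self.recent.append(layer_now)        # oldest slot popped automatically
        self.lastweek.append(layer_week_ago)

    @property
    def reference_layer(self):
        return self.recent[-1]  # RL: tokens from the most recent hour
```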
3.4. Feature Preparation Module
Once ten layers are constructed, classification features are prepared for identifying regional and local events. Tokens in the reference layer RL are used as observations. In other words, features will only be computed for tokens occurring in the RL. Tokens occurring in other layers but not in RL will not be computed.
Tokens related to events usually have four characteristics. First, words associated with major events tend to have a higher sudden increase in frequency. Second, tokens being tweeted by many people tend to indicate regional events. Third, regional events tend to be associated with a sudden increase in tokens from a wide geographic area. Fourth, tokens concentrated in a small area in a short time may imply local events.
To capture the first characteristic, we computed the frequency of each token w ∈ RL in the ten layers. To adjust for the total tweet number at different times, we divided each token frequency by the total tweet number in the time window, yielding the time-adjusted token frequency.
There is a possibility that similar tweets are being tweeted by only one person or machine many times in a short time period. The contribution of keywords in such tweets should be discounted. For this reason, as with token frequency, we computed the user numbers for the ten layers as well. User numbers represent popularity of keywords among different users.
To account for the third characteristic, we computed the number of coordinates associated with each token w ∈ RL in the ten layers. Tokens that are mentioned over a wide area are more likely to be associated with a regional event.
Local events may involve densely reported messages. We used the DBSCAN technique to account for local clusters. DBSCAN is a clustering algorithm that groups points that are closely located. It requires two parameters: a minimum number of points and a maximum search radius around a member (seed); points within that radius of a given point satisfying the seed condition are recursively selected as cluster members [
35]. We used the “fpc” package in R to conduct the DBSCAN analysis. We scanned tokens with geographic coordinates and determined the two parameters by observing the size and average points for the local events. We used search radius eps = 0.0007 and minimum points MinPts = 3 as parameters in this study. The number of clusters and the total number of points in clusters were used as features.
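For illustration, a minimal pure-Python DBSCAN with the study's parameters (eps = 0.0007, MinPts = 3). The study itself used the R "fpc" package; this naive O(n²) version only shows the mechanics:

```python
import math

def dbscan(points, eps=0.0007, min_pts=3):
    """Minimal DBSCAN over (lat, lon) pairs: grow clusters from points
    with at least `min_pts` neighbours within radius `eps` (degrees).
    Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [-1] * n
    visited = [False] * n

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            continue                      # noise (may become a border point later)
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if not visited[j]:
                visited[j] = True
                nb = neighbours(j)
                if len(nb) >= min_pts:
                    queue.extend(nb)      # expand only from core points
            if labels[j] == -1:
                labels[j] = cluster
        cluster += 1
    return labels
```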
Events usually occur when the number of users, the geographic coverage, and the number of tokens change suddenly. We calculated ratios of the features computed above (i.e., token frequency, user number, and coordinate number) to capture such abrupt changes. We computed two groups of ratios: (1) ratios of features between the most recent hour hi and the same hour a week ago, and (2) ratios of features between the past five hours hi to hi-4 and the same hours a week ago. A higher ratio represents a higher chance that the keyword is related to an event.
There are cases when a certain token w emerges in the reference layer (LR) but does not occur in the layer seven days earlier (Ld-7). Calculating the ratio directly in this case would cause a divide-by-zero problem. Tokens in this case can be further divided into two scenarios, S1 and S2. S1 contains random words (e.g., special words or misspelled words) that appear in LR but not in Ld-7; these words have a low frequency in LR and are unlikely to be associated with events. S2 contains bursty words that do not occur in Ld-7 but occur frequently in LR; these words are very likely to be event-related. We tested cutoffs from the 60th to the 80th percentile and the results did not vary considerably, so we used the 70th percentile of word frequency as the cutoff to distinguish the two scenarios. Words in S1 are assigned a ratio of zero, and the ratio for words in S2 is proportional to the word frequency in LR, as shown in the following equation:

R(w) = 0, if f(w) < quantile(f, 0.7)
R(w) = f(w) / quantile(f, 0.7), if f(w) ≥ quantile(f, 0.7)

where R represents the ratio of word occurrence between the reference layer (LR) and the layer seven days ago (Ld-7), f(w) calculates the frequency of the word w in LR, and quantile(f, 0.7) calculates the sample quantile of word frequencies corresponding to the given probability of 0.7.
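The ratio computation, including the percentile rule for tokens absent a week earlier, might look as follows. The exact proportionality used for S2 is not spelled out in the text, so f(w)/quantile is one plausible reading:

```python
import math

def burst_ratio(freq_now, freq_week_ago, cutoff):
    """Ratio of a token's frequency between the reference layer and the
    layer a week earlier.  Zero-denominator cases follow the percentile
    rule: low-frequency new words (S1) get 0, frequent new words (S2)
    get a ratio proportional to their frequency (here f/cutoff)."""
    if freq_week_ago > 0:
        return freq_now / freq_week_ago
    if freq_now < cutoff:       # S1: random or misspelled words
        return 0.0
    return freq_now / cutoff    # S2: bursty, likely event-related

def percentile70(freqs):
    """Sample 70th percentile of the word-frequency distribution
    (nearest-rank; R's quantile() interpolates slightly differently)."""
    s = sorted(freqs)
    k = max(0, math.ceil(0.7 * len(s)) - 1)
    return s[k]
```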
3.5. Classification Module
To prepare the training dataset, we manually sampled and coded 8167 tokens from consecutive days in August 2015. Based on our pre-testing of different algorithms, including kNN, SVM, naive Bayes, and random forest (RF), we found that the RF algorithm produced the highest classification accuracy. The RF classifier generates multiple decision trees in the training process to predict an outcome variable. To classify a new observation, the random forest feeds the observation's variables into each tree in the forest; each tree produces a classification result, and the forest chooses the classification with the most votes as the final result. We used the "randomForest" package in R to conduct the RF analysis, growing 200 trees to classify the input variables. The inputs of the classification module comprise 14 features: the time-adjusted token frequency and the token frequency ratio in the intervals hi and hi to hi-4 (4 features), the time-adjusted user frequency and the user frequency ratio in the intervals hi and hi to hi-4 (4 features), the time-adjusted coordinate frequency and the coordinate frequency ratio in the intervals hi and hi to hi-4 (4 features), and the number of clusters and the total number of points in clusters (2 features). Outcomes of the model were dichotomous classes indicating whether a token belongs to an event class. The model also generates a probability score suggesting how likely the token is to be related to an event. We selected tokens with probability scores greater than 90% as event-related candidates.
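Before classification, the 14 features per token are assembled into a single vector. A sketch with illustrative key names (not the study's own data structures); the study trained R's randomForest with 200 trees on these inputs and kept tokens whose event probability exceeded 0.9:

```python
def feature_vector(tok):
    """Assemble the 14 classification inputs for one token.
    `tok` maps illustrative keys to the quantities computed above:
    f_*  : time-adjusted frequencies (token, user, coordinate)
    r_*  : burst ratios against the week-earlier layers
    *_h  : most recent hour hi;  *_h4 : interval hi to hi-4"""
    return [
        tok["f_token_h"],  tok["f_token_h4"],    # token frequency features
        tok["r_token_h"],  tok["r_token_h4"],    # token ratio features
        tok["f_user_h"],   tok["f_user_h4"],     # user frequency features
        tok["r_user_h"],   tok["r_user_h4"],     # user ratio features
        tok["f_coord_h"],  tok["f_coord_h4"],    # coordinate frequency features
        tok["r_coord_h"],  tok["r_coord_h4"],    # coordinate ratio features
        tok["n_clusters"], tok["n_cluster_pts"], # DBSCAN cluster features
    ]
```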
Based on the trained model, each token in the hash set was labeled as event-related or not event-related. We define a potential event as a key-value tuple PE = (Ke, Ve), where Ke is a set of tokens classified as event-related and Ve is a set of tweets. We used an association index between tokens in a term-document matrix to find the tokens in Ke that are related to the same event. The association index indicates the correlation between a pair of terms across all tweets in the documents; a high association index represents a high probability that two words coexist in tweets. For instance, searching for tokens associated with the word "NYFW" (New York Fashion Week) with an association index greater than 0.4 detects the keyword "fashion." All tweets associated with these keywords are put in the set Ve. Tweets that contain geographic coordinates are saved in a Shapefile for further pattern analysis.
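The association index can be read as the correlation between two tokens' occurrence vectors across tweets, akin to findAssocs in R's tm package; a self-contained sketch:

```python
import math

def assoc_index(term_doc, w1, w2):
    """Association index between two tokens: Pearson correlation of
    their per-tweet occurrence vectors.  `term_doc` maps each token
    to a list of counts over the same ordering of tweets."""
    x, y = term_doc[w1], term_doc[w2]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def related_terms(term_doc, seed, threshold=0.4):
    """Tokens whose association with `seed` exceeds the threshold
    (0.4 in the NYFW example) are grouped into the same event."""
    return [w for w in term_doc
            if w != seed and assoc_index(term_doc, w, seed) > threshold]
```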
3.6. Spatial-Temporal Evolution of Events
In the event analysis module, we mainly look at the temporal, spatial, and semantic characteristics of an event. For temporal characteristics, we analyzed the time spectrum of a unique event and detected when the event started, ended, or reached its prime time. For spatial characteristics, we explored how the spatial pattern evolved during the event. Contour lines were created to display the density of tweets. Because city centers are usually where most tweets concentrate, we used a floating catchment area (FCA) approach to dampen the weight of tweets in densely covered areas. Specifically, we drew a 0.05-degree buffer around each related tweet to define a filtering window; the weight of each event-related tweet is inversely related to the number of all tweets within its filtering window. We then computed the kernel density based on these weights across the study area and examined the kernel density surfaces as time proceeded. For semantic characteristics, we analyzed popular expressions and created word clouds associated with each event. The SentiWordNet 3.0 English lexical resource was used to infer the sentiment of the event-related tweets. SentiWordNet is publicly available for supporting sentiment classification and opinion mining applications [
36]. The background database WordNet includes a rich set of nouns, verbs, adjectives, and adverbs in different cognitive concepts and sentiment scores. We calculated sentiments of each tweet and aggregated them into an hourly window. Positive or negative scores represent positive or negative sentiments respectively. A score of zero means a neutral sentiment.
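The FCA damping step can be sketched as follows; a circular window of radius 0.05 degrees is assumed here, although the paper's buffer could equally be implemented as a polygon with a GIS library:

```python
import math

def fca_weights(event_pts, all_pts, radius=0.05):
    """Floating catchment area weighting: each event-related tweet's
    weight is the inverse of the number of tweets (of any kind) that
    fall within a 0.05-degree window around it, damping the influence
    of dense city centres before kernel density estimation."""
    weights = []
    for ex, ey in event_pts:
        within = sum(1 for ax, ay in all_pts
                     if math.hypot(ex - ax, ey - ay) <= radius)
        weights.append(1.0 / within if within else 0.0)
    return weights
```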
Figure 3 summarizes the methods used in this study.
5. Discussion and Conclusions
Parkes and Thrift (1980) argue that urban life has a rhythmic pattern, formed by the spatial distribution of facilities and events and their temporal availability [
37,
38]. Researchers can now use Twitter data to make these invisible rhythmic patterns visible; the patterns become even more apparent when multiple days of data are displayed. We demonstrate how peaks in tweet data can identify significant events and begin to trace the ebb and flow of urban life. Twitter data thus provide a unique window into urban spatial and temporal patterns.
In this study, we propose an innovative approach to identifying event-related tweets by analyzing live-streaming Twitter data posted approximately one hour before collection. The prominent feature of this approach is that it does not assume any prior knowledge about events; instead, it establishes knowledge about places as rhythmic profiles. No prior keywords were used to confine the domain of the events, and the system relies only on streaming tweets: other knowledge, such as news or GIS data, is not required to infer events. Hence, although the case study was conducted in the Washington, DC—New York City region, the approach can be applied to other geographic regions. By extracting training features from tweets (messages, users, timestamps, and geo-coordinates), the system can discover the spatial-temporal development of regional events in near real time. Because it uses sliding windows and hash structures, it can quickly extract keywords associated with potential events. Keywords were used as the unit of analysis; by applying association functions, we can find a set of keywords closely related to one event and then extract the related tweets from the corpus.
We show how this approach can identify the spatial-temporal patterns of two events: a natural event and a social event. The two events have different temporal durations and geographic coverage: the rain event spanned a day, while the Pope visit event spanned a week. We were able to easily detect the start, end, and peak of each event. When looking at the time-space patterns, the distribution of rain-related tweets generally conformed to the satellite cloud map, while the tweets about the Pope-related events reflected the itinerary of the visit. Additionally, the analysis reflected people's sentiments about the events. Such analysis enriches geographic information by adding human perceptions to traditional GIS data.
By using features from both temporal and locational dimensions, the proposed method can capture both regional and local events. The numbers and ratios of users and tokens, as well as the geographic coverage of tokens, provide clues that an event is regional, while the cluster sizes and counts hint that events are local. The analysis thus helps discover geographic information at different scales.
Figure 11 shows snapshots of the extracted regional and local events and their spatial distribution. For instance, we learned that 29 September was National Coffee Day: tweets containing the token "coffee" spread across the study area, and many people mentioned free coffee from Dunkin Donuts (e.g., "Nothing makes me happier than free coffee @DunkinDonuts #CoffeeDay"). We were also able to identify local events, such as the United States Conference on AIDS (USCA) on 10 September in Washington, DC, the US Open Tennis Championship (USOPEN) in Arthur Ashe Stadium on 2 September 2015, New York Fashion Week on 10 September 2015, and the mixed martial arts event UFC 205 on 12 November 2016. Clustered and bursty tokens were observed for these local events.
This is an exploratory study that extracted events from tweets published within a recent one-hour window. We acknowledge several limitations. First, due to the low proportion of geo-tagged tweets, only major local events can be revealed, especially in near real time. Second, we used individual words as the analysis unit, so events described by phrases may not be captured as well with this approach. In future studies, we plan to compare the extracted events with information reported by traditional media and evaluate the relevancy of the events discovered from tweets. We also plan to extend this study in two ways. First, since the model was trained on individual tokens only, we plan to incorporate contiguous sequences of n items (n-grams) to better represent longer expressions. Second, the detected event-related keywords do not carry any additional attributes in the current approach, so we do not know the relative importance of the information extracted from tweets. We plan to rank the importance of the detected events, as well as identify the type of each event (e.g., sports), in future work.