Automated Seeded Latent Dirichlet Allocation for Social Media Based Event Detection and Mapping

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages, and for supporting disaster management. Topic modeling can help in detecting disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster relevant information. Initializing the topic model with a set of seed words already allows to directly identify the corresponding disaster topic. In order to enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results of two past events (Napa Valley earthquake 2014 and hurricane Harvey 2017) show that the geospatial distribution of Tweets identified as disaster related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data available.


Introduction
Social media channels like Twitter have become an established communication channel for various actors, private and public. The restriction of Tweets to a length of 140 characters until 2017 and 280 characters since leads to concise messages that are available in near real-time and at no cost. Twitter started service in 2006, and in 2009, when an airplane crashed on the Hudson river in New York City, the news was spread on Twitter 15 minutes before mainstream media caught up [1]. With currently around 6000 Tweets sent every second and about 500 million per day [2], events all around the world are reported and made public. Monitoring public sentiment to predict election results [3] or even riots [4] or to analyze urban processes [5] is one aspect of Twitter based event detection.
Another line of research is concerned with detecting natural disasters by analyzing Tweets that share information in near real-time [6]. Especially geo-tagged Tweets can be a valuable source for disaster response. Twitter users as "social sensors" [7,8] immediately deliver in-situ information at currently no additional cost. Traditional remote sensing approaches such as satellite or drone images suffer from lower coverage (both spatial and temporal) and a temporal lag before the data is available [9]. Although social media data have been shown to be a valuable addition, their analysis is difficult due to the data's noisiness, its unstructured and diverse nature and the lack of labeling.
Annotating and classifying the data is not feasible in a timely manner. An unsupervised alternative are topic models, allowing for organizing document collections into general themes, so called topics. As an example, Latent Dirichlet Allocation (LDA) [10] models topics as distributions over a fixed vocabulary and documents as a mixture of topics.
The output of an LDA is difficult to process. Topics can be noisy, not well separated and identifying a topic with desired content can be infeasible even for humans [11]. The fictitious example in Table 1 (left) with five topics illustrates that there might be not a single disaster-related topic, but three ( 2 , 4 , and 5 ). Automating the detection of relevant topics with a keyword-based approach could fail, as topic 5 lacks the term "earthquake". For real-world and more diverse datasets, the difficulties even increase. Guided [12] and seeded [13] LDA variants are a first step towards automated extraction of the disaster-related topic. Both suggest methods to incorporate prior knowledge about the topics' term distribution. Instead of initializing the topics with random terms, the so called seed words are assigned a high probability for the specific topic. This seeding guides the LDA towards a desirable output during inference, meaning that terms that occur in the same context will eventually also have a high probability in the same topic. Topic 1 in Table 1 (right) could be an example for a topic seeded with disaster-related terms, e.g., "earthquake" or "shaking". Tweets corresponding to the seeded topic can then automatically be detected. Table 1. Fictitious example to illustrate the output of a normal, unseeded (left) and a guided LDA (right). The five topics could be detected by a topic model in a dataset containing Tweets related to an earthquake. Note that for guided LDA, all earthquake-related terms are in a single topic 1 , whereas the event is covered in the topics 2 , 4 , 5 by the LDA.

LDA
Guided LDA 1 love thank happy 1 earthquake sleep damage 2 earthquake woke felt 2 show watch life 3 game stadium sunday 3 game stadium today 4 earthquake last night 4 school tomorrow start 5 damage street hope 5 love thank happy One remaining question is how to automatically determine meaningful seed words. In [14], domain experts manually define the seeds. Another possibility is to apply external sources to incorporate word correlation knowledge, such as WordNet [15] or pretrained word embeddings [16]. Jagarlamudi et al. [13] rely on labeled data to extract seed words by applying feature selection. We aim to close this gap between currently existing approaches, where manual interference is needed especially to adapt to a new event, and the applicability in real world scenarios that often requires immediate action. We propose a method to automatically determine seed words for the disaster-related topic by comparing the vocabulary of the day when the disaster took place with that of a preceding, typical day in the same area. The resulting seed words are used to initialize a single topic of the LDA. After modeling the dataset, Tweets having assigned their maximum value at the specific topic are labeled as related to the event.
This paper investigates the potential of the fully automated seeding for the topic modeling of Tweets. We compare the performance against a basic LDA model and against a single pre-determined seed word for two different Twitter datasets: one covering the Napa valley earthquake in 2014 and the other covering hurricane Harvey in Texas in 2017. Besides an intrinsic evaluation of the coherence of the modeled topics, we determine the classification performance on a small, labeled subset of the data to assess the semantic congruence of the extracted relevant Tweets. Furthermore, we apply a geospatial analysis to determine the exact locations and "hot spots" affected by the events. Validating the spatial distributions of the Tweets against the official disaster footprints allows to generate additional value for disaster management.
In the following, Section 2 summarizes existing work on using topic models for event and disaster detection in Tweets. Our proposed method to automatically seed a LDA model is introduced in Section 3. Section 4 presents the datasets used to run the corresponding experiments and the generated seed words. Intrinsic and extrinsic evaluations are conducted in Sections 5 and 6, respectively, followed by a discussion of the results in Section 7 and a conclusion in Section 8.

Background and Related Work
The use of Tweets to gain relevant information for disaster management is an increasingly popular field of study. Some approaches for exploiting Twitter data for disaster management extract messages based on predefined hashtags and keywords only [17], without applying topic models. An issue with working with hashtags and keywords only is that the approach loses its immediacy. Hashtags are defined by the Twitter community and evolve over time. Topic models capture more of the semantics of a Tweet and are less dependent on single terms. Gründer-Fahrer et al. [18] assess the possibilities of the LDA model for the detection of a flood in Germany and Austria, based on German language Tweets. Their claim is that LDA combined with optimization methods shows high applicability for disaster management. Before discussing the various modifications of the model meant to improve the performance, we formally introduce the basic LDA and its key concepts.

Latent Dirichlet Allocation
Latent Dirichlet Allocation is a generative probabilistic model where it is assumed that words observed in a number of documents of a corpus are generated by latent topics. The assumption is that each document, and a single Tweet is considered a document, is a mixture of topics. A topic is a distribution over terms of a fixed vocabulary and is usually represented by its terms with the highest probabilities. Note that we use "term" to denote a single class or entity, while word describes a specific occurrence of a term in a document.
The plate notation in Figure 1 illustrates the generative process of the LDA for M documents in the corpus, N words in a document with V different terms in the vocabulary and K topics. Each observed word w i has a unique topic assignment z i . The assignment is drawn from the multinomial distributions θ (distributions of topics in a document) and β (distribution of terms in a topic) with parameters α and η, respectively. θ and β are Dirichlet distributions. With θ ∈ R M×K and β ∈ R K×V known, each document can be represented in K-dimensional space by the topics it contains.  [19]). The shaded node represents the observed variable, a word w i in a document. The latent variables z i for the topic assignment, θ for the document-topic distribution and β for the topic-term distribution are shown as clear nodes.
In a topic modeling application, the variables θ, β and z are unobserved, or latent. The goal is to learn the posterior probability and thus computing the latent variables. As there is no closed-form solution to this problem, a number of approximation methods are used instead. Variational inference is applied in the original LDA paper [19]. Hoffman et al. [20] apply variational Bayes inference. Another approach is to use Gibbs sampling, where the topic assignments per word are iteratively refined [21]. The evaluation of topic models is not straightforward. Intrinsic measurement scores such as log-likelihood and perplexity are shown to be negatively correlated with human judgement [11]. Thus, Mimno et al. [22] suggest an alternative intrinsic measure for assessing the coherence of a topic model: UMass coherence. In contrast to other coherence metrics, UMass coherence does not rely on external data sources [23]. It is instead based on the co-occurrence of the top n terms of each topic in the modeled documents: is the number of documents both terms w i and w j occur in, D(w j ) is the document frequency of term w j . The UMass coherence indicates how well the topics are separable based on their vocabularies. Higher values indicate higher topic coherence. Besides the intrinsic measures, topic models can be evaluated with task-specific measures or based on classification performance for a held-out, annotated part of the data.
Incorporating prior knowledge in the model can improve the model output, especially when used for follow-up tasks. Wang et al. [24] introduce a targeted topic model (TTM) that uses seed words to guide the decision about the relevance of a document for a topic. The relevance is represented by a Bernoulli distribution included in the generative process.
Guided [12] and seeded [13] LDA variants do not change the generative process of the original LDA, but adapt the initialization of the topic-term distribution to reflect the importance of the seed words. Another option to recognize prior knowledge would be to change the Dirichlet prior of the topic-term distribution from symmetric to (heavily) skewed towards the seeds.

LDA for Disaster Detection in Tweets
Various efforts have been made to adapt and improve the basic LDA model to meet the requirements for specifically detecting disaster-related content in Twitter data. One line of research builds on iteratively refining the model by manually selecting the topic attributed to the disaster.
Kireyev et al. [25] use a general-purpose dataset to model the LDA and apply it afterwards to their Twitter data (approx. 20,000 Tweets each) of two disasters: a tsunami and an earthquake. By analyzing the topic distribution of disaster-related Tweets, similar documents in the larger original corpus are selected to further refine the topic model. Resch et al. [26] use a cascaded approach to select disaster related Tweets from a dataset on the Napa valley earthquake in a first step and separate into more refined topics afterwards. The selection of relevant topics is done manually in both stages and the performance is assessed on a small, labeled part of the dataset. They introduce a method to assess the spatial distribution of relevant Tweets.
Using labeled Tweets is another method to improve the basic LDA model, yet a labor-intensive one. Imran and Castillo [27] have experts label a set of Tweets on different crisis in 2012 and 2013 with predefined, relevant topics. Tweets that fit neither of these topics are further refined using LDA. Both works use qualitative analysis, i.e., visualization and human judgment, to assess their results.
Supervised LDA (sLDA) was introduced by Blei and McAuliffe [28] to predict document classes and is trained on a labeled dataset. Ashktorab et al. [29] use sLDA to provide relevant information extracted from social media to first responders. They experimented with datasets on twelve different crisis events in North America with approximately 1000 annotated Tweets each. Other classification methods proved to be more accurate than the sLDA and achieve higher evaluation scores (including F 1 , precision and recall) for the binary classification task.
In more recent work, topic models capable of handling seed words to produce a more desired and better interpretable output have been used. To the best of our knowledge, those seed words have so far been only defined manually.
Kirsch et al. [30] use a targeted topic model (TTM) with predefined keywords in an emergency context. The results are evaluated by qualitative analyses such as the top words and hashtags of the topics or related sentiment.
Yang et al. [31] suggest a location-based dynamic sentiment-topic model (LDST) to make use of the geographic information of Tweets directly in the topic model. They apply the LDST to a dataset of approx. 160,000 Tweets with geo-reference related to hurricane Sandy and manually define seed words for five different topics. The evaluation of the geographic accuracy and relevance is done subjectively by assessing the shift in topics in different areas and states. Table 2 gives an overview of the methods discussed above. While all approaches succeed in extracting relevant information, all rely on manually provided prior information (either hand-crafted seed words, manual data annotation or human assignment of the relevant topic). This delays the reaction to a disaster and minimizes generalization over different disaster types which both can be avoided by automatically extracting the initial seed words.

Automated Generation of Seed Words
The automated generation of seed words is based on the assumption that in the case of a natural disaster, as with any other event, Tweets from within the affected area will differ significantly from those on "normal" days both concerning the discussed topics (i.e., terms used) and the frequency of certain terms. However, extracting the most frequent terms of the Tweets posted on the day of the event might not be significant enough. Thus, we propose to compare the vocabularies of two different days and suggest two variants: the first one computes term frequencies based on the union set of the two vocabularies, the second based on their difference set.

Union Vocabulary
Let C and D be the set of Tweets from the normal ("comparison") and event ("disaster") day, respectively, and V 1 = V C ∪ V D the shared vocabulary from both corpora. tf t,D denotes the absolute term frequency of term t ∈ V 1 in corpus D.
The score s t ∈ [−1, 1] for each term is the difference between the normalized term frequencies in D and C. Terms only occurring in V D have a high positive score, while terms only occurring in V C have a negative score. For terms being equally present in both vocabularies, the score is 0. The set of n seed words S 1 = {t 1 , . . . , t n } consists of the n highest positive scores. This method allows to also determine the n most negative values in v 1 for seeding a second topic with non-disaster-related terms. This possibility is currently not further explored.

Difference Vocabulary
The second method computes the frequency of terms that only occur in the disaster vocabulary The set of n seed words S 2 = {t 1 , . . . , t n } consists of the n terms with highest scores s t ∈ (0, 1]. Because of the second method being very restrictive and potentially yielding an empty set V 2 when only considering single terms, biterms are added to the vocabularies for both methods.

Topic Initialization
Once the seed words are available, they can be incorporated into the LDA model. The LDA algorithm is given a document-term matrix (the observed words w in M documents) as input and produces a topic-term matrix β and a document-topic matrix θ in the run of several iterative inference steps as output. Each Tweet is treated as a single document. The topic-term matrix β is usually initialized by a constant [19] or randomly [20]. For guiding the LDA, we explicitly initialize the selected seed terms with 1 for the disaster topic (usually the first topic) and 0 else, while randomly initializing all other terms, which is similar to the method in [12].

Tweet Extraction
During several training iterations (inference), the distributions for the topic-term matrix β and the document-topic matrix θ are learned. The result is a modeled LDA, meaning that both distributions β and θ are as close as possible to the mathematically exact solution of C = β · θ, where C is the normalized co-occurrence matrix of terms in documents.
The modeled matrices allow for an intuitive interpretation: The topic-term matrix reveals the probability of each term occurring in a specific topic, such that we can deduce the top terms for each topic. The document-topic matrix models the probability distribution of topics in a Tweet so that we can assign the most probable topic to a Tweet if we need a single label only.

Experiments
The aim of the experiments is to show that (i) the suggested methods extract suitable seed words that (ii) improve the performance of a baseline LDA implementation. The proposed method for an automated topic modeling process is evaluated on two Twitter datasets covering different natural disasters. The first one contains 94,458 geo-referenced Tweets captured on the 24th of August, 2014 in the area of the Napa valley in California, where an earthquake occurred in the early morning. The second dataset covers geo-referenced Tweets in the area around Houston, Texas from 27 August 2017, when a hurricane hit. It contains 8078 Tweets. Both datasets contain Tweets in English only and were crawled using Twitter's Streaming and REST API [32] restricted to extracting only Tweets with geo-reference.
The difference of the two datasets lies not only in their size, but also in the different nature of the two disasters: While early-warning systems can predict hurricanes in advance, earthquakes hit unexpectedly. Smaller earthquakes can follow the initial one, whereas hurricanes are often accompanied by heavy rainfall and flooding [33]. Both the preparedness to a hurricane and its more diverse nature will affect the coverage and terminology of the disaster in social media and thus the topic models. This also lead to different approaches in selecting a day for comparison: While for the earthquake, a day from a week before the event is considered, the day of comparison for the hurricane lies one month ahead. By being this restrictive, we aim at ruling out possible vocabulary overlaps, as the imminent weather changes might be announced in the days following up to the hurricane.
The Tweets are preprocessed in the following order before passing them to the LDA: 1. Removal of URLs, user names, mentions and email addresses; 2. Lower casing; 3. Removal of numbers and special characters; 4. Stemming; 5. Removal of (stemmed) stopwords; 6. Removal of words shorter than four characters.
For assessing the term frequencies, scikit-learn's [34] CountVectorizer is used. The parameter ngram_range is set to (1,2) to also consider biterms and min_df is set to 2 to disregard terms that only occur once in the dataset. Applying this range of preprocessing steps is in line with findings by Denny and Spirling [35] and Maier et al. [36] who investigated the effects of different preprocessing steps and their ordering.

Earthquake
The 17th of August, 2014 is considered as normal day for comparison with the earthquake on the 24th of August, 2014. After preprocessing, 85,311 Tweets from the day of the disaster and 72,320 Tweets from the normal day are available. Note that the Tweets from the normal day are not used for topic modeling, but only for the seed word generation. Figure 2 lists the ten most frequent terms in the Tweets on the event day and compares them to the frequencies in the Tweets on the normal day. "earthquake" is the most frequent term and could be considered a good seed word for the disaster topic. The remaining terms, except for the swear words and "night" to some extent, are unrelated to the earthquake. Moreover, their frequencies do not differ from the normal day. The most frequent terms thus cannot be expected to help in guiding the LDA. In order to extract a wider range of seed words, it is inevitable to set the frequencies of the two days in relation: Applying the two presented methods V1 (union vocabulary) and V2 (difference vocabulary) with n = 10 yields the terms listed in Table 3. A number of terms is directly related to the earthquake ("earthquak", "felt", "quak", "shake"), others indirectly, referring to the night-time of the event ("woke", "sleep"). Some terms refer to a music event happening at the same day: "beyonc" for the singer Beyoncé and "voteso" as the preprocessed version of "#vote5sos". As expected, V2 contains more biterms. All terms are directly related to the earthquake. The term "napaquak" is in fact a hashtag, with the character "#" removed during preprocessing. For simplicity, the biterms are split up (with duplicates removed) to only provide single terms to the LDA. Table 4 lists the final set of 10 seed terms for both variants.

Hurricane
Hurricane Harvey swept over Houston, Texas particularly from 26th to 28th of August, 2017 [37]. We consider the 27th of August as day of the disaster, and the 28th of July, 2017 as normal day for comparison. After preprocessing, 7043 Tweets from the disaster day and 7020 from the normal day are available.
As with the earthquake dataset, extracting the most frequent terms on event day as seed words would be misleading. Most of the terms listed in Figure 3 are related to job postings ("hire", "careerarc", "open") in the Houston area. Except for the terms "houston" and "texa", the term frequencies are also comparable to the normal day.
The methods V1 and V2 comparing the two datasets yield different terms. Table 5 displays the results of V1 and V2 with n = 10. Although biterms are less frequent for this dataset, again most occur in V2, while only one is included in V1. Both methods cover a number of related terms to the hurricane (e.g., "flood", "rain") and also the most prominent hashtag "hurricaneharvey". Table 6 lists the final set of seed terms after splitting the biterms, resulting in nine seeds for V1 and 12 for V2.   Table 6. Final set of seed words for the variants V1 (union vocabulary) and V2 (difference vocabulary) for the hurricane dataset (bottom). The number of seed words can differ as the biterms from Table 5 are split up.

Intrinsic Evaluation
As there are no other variants that allow for a fully automated process, we assess the competitiveness of our suggested methods (referred to as GLDA V1 and GLDA V2 in short) against a basic, unseeded LDA (LDA in short) and a baseline guided LDA (GLDA Baseline) that only uses the terms "earthquake" and "hurricane" as seeds, respectively. The selection of the disaster types as seed words could be considered as an almost automatic method, as the manual effort is negligible. That said, the dataset is not investigated any further to come up with more manual seed words for the GLDA Baseline. For all experiments, the number of topics K are varied from 2 to 100 to find an optimum.
Scikit-learn's LatentDirichletAllocation was used as LDA implemenation, where we overwrite the initialization method as described in Section 3.3 and allow for 25 iterations. The implemented inference method is the variational Bayes algorithm applied in batch mode, i.e., using all training data at once in each update step. The parameters α and η, prior parameters for the document-topic distribution and the topic-term distribution (see Figure 1), are set to 1/10, 000 and 1/K. 1/K is the default value in scikit-learn, whereas α is deliberately set to a low value to account for the short text length in Tweets instead of applying a specific Twitter LDA model as in [38,39].
For our experiment, we compute the topic coherence for the disaster topic T based on the list of n = 20 top terms W (T) = (w 1 , . . . , w S ) (see Equation (2)). Figure 4 (left) illustrates the topic coherence of the disaster topic for varying K on the earthquake and the hurricane dataset, respectively. We run the LDA 5 times for each K to assess the average performance. For all methods, higher values of K are better. Obviously, the coherence of the LDA and the baseline GLDA improve significantly while the values of the GLDA with automatically extracted seed words remain low. However, for the overall topic coherence, i.e., the mean over all topics, the GLDA V1 and V2 are competitive (see Figure 4 (middle)). The variance over five runs is much higher for the GLDA Baseline and the LDA, indicating that the guidance by seed words adds stability to the model. The disaster topic's size in Figure 4 (right) shows an opposing trend to its coherence and is lower for the LDA and GLDA Baseline variants than for the GLDA versions 1 and 2. As the topic coherence increases for the LDA, the topic becomes very restrictive and only covers a small fraction of Tweets. The remaining fraction of Tweets in the disaster topic is close to zero for the hurricane dataset. For the topic classification, this could suggest that the recall is low for increasing K.

Extrinsic Evaluation
Besides topic coherence, we use a small set of annotated data for an extrinsic evaluation and define a binary classification task: The test sets comprise 1331 manually labeled Tweets (1.6%) from the earthquake dataset and 993 (14.1%) from the hurricane dataset. Almost half of the labeled Tweets are disaster related for the earthquake dataset and about a quarter for the hurricane dataset.
Tweets having their maximum at the disaster topic in the document-topic matrix are considered as disaster related, all other as not related. For the LDA, we define the disaster topic as topic for which the term "earthquake" or "hurricane" has the highest value. For the guided variants, the seeded topic is the disaster topic. All variants in the experiments model the unlabeled Tweets, and the results are used to "classify" the held-out set of labeled Tweets.

Tweet Classification
As performance measure for the binary classification, we use the F 1 score, the harmonic mean of precision and recall [40]. Precision is the measure for the number of Tweets correctly classified as disaster related. Recall defines the ratio of Tweets classified as disaster related from all Tweets labeled as disaster related. In other words, precision gives an idea of how many relevant Tweets are extracted, while recall assesses the completeness of retrieved relevant Tweets.
The resulting mean F 1 scores and their 95% confidence interval over five runs for both datasets are illustrated in Figure 5 (left). The guided variants outperform the LDA. For the hurricane dataset, the GLDA Baseline is also clearly inferior to the variants V1 and V2. Moreover, the performance of the variants V1 and V2 is more stable and stays above 60% F 1 score for all K. Again, the variance over the five runs is lower for the variants V1 and V2. As expected, the recall (see Figure 5 (middle)) correlates with the disaster topic size and is lower for the LDA. While the recall is better for smaller values of K, the precision ( Figure 5 (right)) increases with higher K.  Table 7 lists the detailed classification performance measures for both datasets and all variants corresponding to the best F 1 score each. With all models, the best performance is achieved with K < 8 on both datasets. With this optimal setting, the performance of the three guided LDA variants is almost on a par for the earthquake dataset, but better than that of the LDA. For the hurricane dataset, the performance in terms of F 1 score of GLDA V1 and V2 outperforms the LDA by approximately 10% and the GLDA Baseline by more than 20%. Table 8 highlights the resulting top 10 terms of the disaster topic for both GLDA variants, and K = 5 or K = 6 for the earthquake and hurricane dataset, respectively. Most of the terms occur for both variants (90%, unique terms in italic), although the ordering differs. It is interesting to note that the terms differ from the initial seed words shown in Tables 4 and 6. This underlines the idea that the initial setting only guides the LDA but is not restrictive. Note, on the one hand, that the term "hurrican" is missing, as well as "shake" for the earthquake dataset. On the other hand, the unrelated terms "beyonc" and "voteso" have vanished. Table 7. Detailed classification results including precision and recall on the earthquake dataset (top) and the hurricane dataset (bottom) based on the best average F 1 score of the five runs for each variant.  Table 8. Top 10 terms in the disaster-related topic (i.e., highest probability in the topic-term matrix β) for both datasets based on the best value for K for the two automated variants GLDA V1 and V2. The ordering corresponds to the term probability. Terms that are unique for one variant are highlighted in italic.

Geospatial Analysis
In addition to the extraction of semantically relevant Tweets, their geospatial distribution is analyzed in Figures 6-9. For both datasets, we compare the results based on the output of the LDA and GLDA V2. Cells in red and blue signify the hot and cold spots of the disaster related Tweets in relation to the total amount of Tweets in the given cell area as proposed by Resch et al. [20]. All Tweets are aggregated in a fishnet that overlaps with the area of interest and the ratio between the number of total Tweets and the disaster related Tweets in each grid cell is computed. The fishnet consists of squared cells with side length l = 2 A n , where A is the size of the study area, and n is the number of points in the study area [41]. The informativeness is high if the ratio of disaster related Tweets is high which is especially true in rural areas where the Tweet activity is usually lower than in urban areas.
To identify geospatial clusters, we applied a hot spot analysis based on Getis-Ord GI* [42] that determines statistically significant hot and cold spots by including the neighborhood of the analyzed cell. The hypothesis is that disaster related Tweets will be more frequent in affected areas. Thus, the identified hot spots are a strong indicator for areas impacted by a natural disaster. In order to evaluate that the automatically generated seed words yield an accurate disaster topic, we compare the hot and cold spots with the official disaster footprints. For the Napa valley earthquake in Figures 6 and 7, the results are matched against the earthquake's footprint as assessed by the US Geological Survey (USGS) [43]. We excluded values that would fall in the categories "Not felt" and "Weak" in the USGS Peak Ground Acceleration (PGA) dataset [43]. Those categories would not add significant value to the interpretation of the map as those values signify that people do not notice the appearance of an earthquake and do not recognize it as such [44].
The hot and cold spots as identified based on the disaster related Tweets from the LDA in Figure 6 reveal mismatches: North of Santa Rosa, where the earthquake still was felt, the analysis reveals cold spots. For the area south of Oakland that lies outside of the affected area, no significant results were obtained. Most importantly, the epicenter around Napa is not correctly detected, as no hot spots are identified in this area. However, the hot spots that are detected mostly are located within the official footprint.
The results based on the disaster related Tweets from GLDA V2 in Figure 7 reveal some major improvements. First of all, the epicenter is correctly identified by a cluster of hot spots with highest confidence. The area south of Oakland down to San Jose that lies outside of the official footprint is clearly highlighted as cold spot. The area north of Santa Rosa is now correctly identified as hot spot. Figure 6. Earthquake hot and cold spots obtained using plain LDA compared to the US Geological Survey (USGS) [43] footprint measuring the intensity in per cent of the peak ground acceleration (PGA).
The official data available for hurricane Harvey assesses the extend of the flood that followed in the area [45]. Figure 8 shows the results based on the disaster related Tweets as detected by the LDA. Almost no significant spots could be detected, although the most affected area should be in and around Houston, while San Antonio, Austin and Lafayette are hardly affected. Figure 9 illustrates the results based on GLDA V2 that now are in line with the flooded areas. Not only the hot spot in Houston is correctly detected, but also the area further east. The surrounding cities (San Antonio, Austin and Lafayette) are identified as cold spots this time.
The accordance between the computed hot spots and the official footprints for both disasters suggests that the automatically extracted information from Tweets can be an immediate evidence to support disaster management.

Discussion
Comparing Tweets from two different days to extract meaningful seed words is a feasible way to automate the topic modeling process. Moreover, the experiments show that guiding the LDA towards those seed words for topics of interest improves the performance-only slightly in terms of overall topic coherence, but significantly for extracting disaster related Tweets, which is the central goal in supporting disaster management [26].
When comparing the two datasets, it is clear that by fully automating the topic modeling process, the classification performance does not suffer. On the contrary, the classification of the hurricane dataset improves considerably. Keeping in mind that the nature of a hurricane is more diverse than that of an earthquake, it seems that the standard methods are limited. While using a "dummy" keyword (GLDA Baseline) as seed is competitive on the earthquake dataset, this variant also performs poorly on the hurricane dataset. Automatically generating seed words thus will also be helpful in case of unknown, diverse or rare events that are hard to describe by a small set of manually defined keywords.
As the results of five different runs each imply, the performance of the guided LDA with automatically generated seed words is more stable. As the variation over the runs is low, there is no need for further hyper-parameter tuning or experimenting with the right number of topics, which is beneficial in real-world scenarios. For the LDA, even with the best performing K, the standard deviation is 3.9% on the earthquake dataset, while only 0.5% for both GLDA variants. On the hurricane dataset, the standard deviation is 1.7% for LDA, 1.5% for GLDA V1 and 0.8% for GLDA V2. At least for the LDA, this result could also mean that the model did not fully converge yet and more iteration steps (and more time) are needed.
According to our experiments, meaningful and disaster related Tweets are extracted with a small number of topics, i.e., K <= 6. However, the F 1 score might not always be a good proxy for assessing the retrieval. Tasks focusing on displaying the retrieved Tweets in text-based form might require smaller, but highly precise result sets and thus favor precision over recall. For quantity-focused tasks, a higher recall might be preferable. The experiments suggest that the optimal value for K then differs: smaller K for better recall, higher K for improved precision.
The geospatial visualization of hot and cold spots reveals an impressive alignment with the actual, official disaster footprints when computed based on the automatically extracted Tweets with GLDA V2. Correctly distinguishing between affected and non-affected areas allows to direct aid and rescue efforts to places where help is needed, and not to densely populated, urban areas by default.
Concerning the generalizability of these observations, experiments with further datasets and different events would be needed. Future experiments could also serve to investigate the effect of multilingualism on our approach. As both catastrophes took place in the United States, the predominant number of Tweets is in English. Although LDA can model topics over documents in different languages, more sophisticated topic models have been introduced to explicitly handle multilingualism. Existing methods for multilingual topic modeling [46] would need to be adapted to also handle seed words.
In this study, the Twitter dataset was collected with a geo-crawler software that can crawl the Streaming and REST API of Twitter besides multiple other social media network APIs. Using the API, Tweets can currently be crawled up to a week back. In case of a natural disaster or other event, this allows for a comparison of two different days a week apart. For dates further back, online repositories such as the Internet Archive's Twitter stream [47] that collect data continuously can be consulted. Authorities or disaster response organizations might even have an interest in monitoring a geographic region by crawling the data regularly by themselves.
The geo-crawler focuses on geo-referenced social media posts that can include the precise coordinates of a GPS-enabled device. Although in June 2019, Twitter announced a fundamental change in adding precise location information to a Tweet [48], Tweets with precise geolocation information still can be collected. Furthermore, the extraction of location information from text has made substantial progress in the last years and has consequently opened new opportunities for geospatial analysis on Twitter data [49][50][51]. Therefore, the methodology developed in this paper can also be applied to other use cases which has been tested with success outside the scope of this study.

Conclusions
Information on social media is available in a timely manner, thus being an important source for detecting and responding to events. Especially for disaster management, a system monitoring the Twitter stream in a given region and automatically detecting natural disasters or other crises would be useful. Such a system can be implemented based on our fully automated approach to incorporate data-intrinsic a-priori knowledge into a topic model (LDA) without any need for manual interference. By comparing Twitter vocabularies from different days, seed words indicating the events can be generated. These extracted seed words are used to guide the LDA to model a single disaster-related topic. The Tweets corresponding to this topic can then be mapped to get an estimate for affected areas.
The method is tested on Tweets from two datasets covering an earthquake and a hurricane, respectively. Two approaches for comparing the Tweets' vocabularies from a preceding day with the day of the disaster event are presented. The first approach is to extract the most frequent terms over the union vocabulary of both days and subtracting the normal day's term counts from that of the disaster day. The second possibility is to extract the most frequent terms over the difference set.
Both methods yield event-related seed words that improve the F 1 score of a basic LDA or a baseline guided LDA on a small set of labeled Tweets. The resulting topics are similarly coherent as with a standard topic model. Moreover, the detected hot and cold spots based on the geo-references of the Tweets are aligned with the official footprints of the examined disasters. This underlines the value of Tweets as decision support for disaster management. The model parameter K, i.e., the number of topics to model, can be used to trade off recall against precision. Funding: This study has been carried out in the HUMAN+ project, which has been funded by the Austrian security research programme KIRAS of the Federal Ministry of Agriculture, Regions and Tourism (BMLRT), project number 865697. Additional funding was granted by the European Regional Development Fund (ERDF) for project number AB215 in the INTERREG program Austria-Bavaria 2014-2020.