Measuring the Impact of Natural Hazards with Citizen Science: The Case of Flooded Area Estimation Using Twitter

Abstract: Twitter has significant potential as a source of Volunteered Geographic Information (VGI), as its content is updated at high frequency, with high availability thanks to dedicated interfaces. However, the diversity of content types and the low average accuracy of the geographic information attached to individual tweets remain obstacles in this context. The contributions in this paper relate to the general goal of extracting actionable information regarding the impact of natural hazards on a specific region from social platforms, such as Twitter. Specifically, our contributions describe the construction of a model classifying whether given spatio-temporal coordinates, materialized by raster cells in a remote sensing context, lie in a flooded area. For training, remotely sensed data are used as the target variable, and the input covariates are built on the sole basis of textual and spatial data extracted from a Twitter corpus. Our contributions enable the use of trained models for arbitrary new Twitter corpora collected for the same region, but at different times, allowing for the construction of a flooded area measurement proxy available at a higher temporal frequency. Experimental validation uses true data that were collected during Hurricane Harvey, which caused significant flooding in the Houston urban area between mid-August and mid-September 2017. Our experimental section compares several spatial information extraction methods, as well as various textual representation and aggregation techniques, which were applied to the collected Twitter data. The best configuration yields an F1 score of 0.425, boosted to 0.834 if restricted to the 10% most confident predictions.


Introduction
Several authors have explored the use of Twitter in the context of environmental hazard prevention and mitigation. For example, Twitter is used by Sakaki et al. [1] to support damage detection and reporting in the context of earthquake events. The TAGGS platform aims at using Twitter for flood impact assessment at a global scale [2]. Twitter has also been used to monitor the spread of seasonal flu [3]. Jongman et al. [4] use Twitter in order to trigger humanitarian actions for early flood response. Specifically, the authors mainly use Twitter to raise an alert over a region, which is coupled to the Global Flood Detection System (GFDS) [5] in order to trigger near real-time mapping, in a way that keeps up with the GFDS data update frequency. Wiegmann et al. [6] identify the prime promise of filling in information gaps with citizen observations, issued by citizen sensors. Opportunities for social media are identified as helping impact assessment and model verification, and as strengthening the acquisition of relevant information. Their literature review concludes that social media content is unlikely to increase the spatial resolution of traditional sources, such as remotely sensed data. Fohringer et al. [32] use a DEM in a similar manner, but with a view to deriving water levels over a region from water depths manually estimated from photographs.
In order to facilitate the processing of textual content in general, and of Twitter messages in particular, a necessary step is to convert textual items to a numerical representation (often referred to as feature vectors [20,33,34], or embeddings [35,36], in the machine learning literature). A classification model may then learn the relationship between these representations and a target concept, for example, in the context of this paper: does the message reveal a flood event? The most immediate representation is the binary value indicating whether the text string matches any of a set of pre-defined keywords, known as the keyword-matching feature in the literature [37,38]. TFIDF (Term-Frequency-Inverse-Document-Frequency) vectors [33] are also often used to represent textual documents. Dimensions in this feature vector are attached to the vocabulary of all possible phrases, and their non-negative values are given by the product between the frequency of the respective phrase in the document and its inverse frequency in the whole document corpus. Intuitively, the value of a dimension is greater when the respective phrase characterizes the document, that is, when it is frequent in the document and rare in the remainder of the corpus. In the context of TFIDF, phrases are tuples of n words, often known as n-grams. The simplest option is to only use single words (one-grams), but the union of one-grams and two-grams has often been considered in the literature [34,37,38]. A simpler, binary version of TFIDF, which assigns 1 to the respective dimension when a given phrase is found in the text, and 0 otherwise, has also been used [39].
The dimensionality of TFIDF feature vectors is determined by the size of the vocabulary, which can be quite large. Additionally, the TFIDF feature vectors of Twitter messages are very sparse, which is problematic for most statistical learning models. Neural language models were recently introduced as a means to generate numerical embedding vectors. They are generally trained on large corpora offline, and used on smaller test collections. The obtained feature vectors are more compact and denser (generally a few hundred dimensions), at the cost of interpretability: unlike TFIDF, individual dimensions cannot be attached to semantics. A well-known neural language model is Word2Vec [40], which is trained in an unsupervised, generative fashion to map textual documents to feature vectors. However, similarly to TFIDF, this model is phrase-based, and it is trained on well-curated text, which is ill-suited to Twitter content. Tweet2Vec circumvents these issues by considering a character-based model [35]. It is trained in a self-supervised manner on a hashtag prediction task. Additionally, instead of the shallow neural network used in Word2Vec, the authors use a bidirectional LSTM model [41].
Keyword matching is implemented by the Twitter API, and it has served to collect a corpus related to Hurricane Harvey with the keywords Hurricane Harvey, #HurricaneHarvey, #Harvey, and #Hurricane during the event [42]. In the context of a flu spread analysis, Gao et al. [39] also isolated a corpus using this technique. However, keyword matching is likely to inaccurately reflect whether a tweet is semantically related to a flood event. To circumvent this issue, Gao et al. manually annotated 6500 tweets in order to train an SVM classifier that further filters out false positives. As a counterpart, this approach requires heavy manual annotation work. Instead, when a spatial distribution is available for the target event (e.g., rainfall [34] or flood [27]), an alternative approach is to aggregate tweet feature vectors with respect to spatial cells, and then learn a function that maps the aggregated representation to the target values. In the context of data assimilation mentioned above, this kind of approach opens the possibility to learn a mapping function from a corpus of tweets collected at time t to a related ground truth map (e.g., remotely sensed data at time t) and, hence, to increase the temporal frequency of proxy observations by using a new test corpus of tweets collected at time t + δt.

Author Contributions and Position
Volunteered Geographic Information (VGI) is the umbrella domain that denotes the contribution of geographic information by the public in a participatory fashion, and the processing and exploitation of this information [2,13–15]. This domain encompasses contributions to the OpenStreetMap database, as well as people geotagging photos showing flood extents on social networks. In particular, Craglia et al. [13] distinguish explicit and implicit VGI, depending mostly on the initial purpose of information contributions. The contribution to OpenStreetMap is a typical example of explicit VGI: the contributor purposefully updates the database. On the other hand, the initial aim of a Twitter user witnessing a natural hazard, such as a storm or a flood, and attaching geographic information, such as a geotag or a toponym mention, to her message, is to inform her followers about what she witnesses.
In the context of natural hazards, VGI is presented as an interesting way to provide data that are useful for emergency response personnel, emergency reporting, civil protection authorities, and the general public [2,13]. It comes in complement to more traditional information sources, such as forecasts obtained from hydrological and hydraulic models [7], and as an additional tool for situational awareness. Additionally, social platforms, such as Twitter, are constantly updated, which opens the possibility to picture the situation in impacted regions quickly and regularly.
Our technical contributions relate to the exploitation of implicit VGI, which poses specific challenges. First, the event sought is intertwined with many topics in the Twitter stream of posts. The identification of relevant implicit VGI is then not trivial. Second, in contrast to standard GIS formats, geographic information in Twitter is provided in several heterogeneous forms (geotags, spatial bounding boxes, user information, and toponym mentions in the tweet text) with various levels of accuracy and reliability.
In this work, we formalize mapping function fitting as a classification task parametrized by a coordinate space. This formulation allows us to implicitly account for the spatio-temporal dimensions in a classical statistical learning context. We also formalize the way a set of tweets collected in a given time frame parametrizes a feature vector mapping function, which associates aggregated feature vectors to coordinate points. We show how to compute a feature vector that reflects the local diversity of aggregated tweets.
For the experiments, we focus on flooded area estimation, and make use of a new corpus of tweets collected during Hurricane Harvey, which affected the Houston urban region between mid-August and mid-September 2017. We presented this corpus at length in a dedicated technical report [22], recalled in Section 5.2. We compare several textual representations and spatial extraction methods, which affect the obtained feature vector mapping function. We address concerns regarding Twitter spatial information accuracy [23] by issuing recommendations for filtering the corpus at hand, and quantify the consequences in terms of imbalance and support, thereby contributing to addressing the challenges posed by implicit VGI.
The present work is heavily inspired by Lampos and Cristianini [34], and it is incremental with respect to one of our previous contributions [43]. From the former, we took the idea of mapping aggregated vectors to a target spatial variable, and transposed it to a much finer-grained spatial context (Lampos and Cristianini [34] perform the classification task at a city neighborhood scale), in order to extend the range of possibilities offered by VGI. We improved the latter with a more general formalism in Section 3, and a more comprehensive treatment of VGI through the combination of Twitter geographic information, a DEM, and toponyms extracted using NER techniques. We also compared several solutions for feature vector construction, involving advanced machine learning techniques, such as neural embedding models [35] and Fisher vectors [44]. Spatial information extraction methods, combining one or more sources among place bounding boxes, geotags, and toponyms obtained by NER techniques, are compared with respect to F1, precision, and recall scores. Finally, we describe a qualitative comparison between predictions achieved by the proposed method and labels obtained by crowdsourcing.
The objective of this study differs significantly from the related work by Brouwer et al. [27] on flood extent prediction. In their study, the authors attempt to infer flood boundaries from a small number of carefully validated tweets, whereas, here, we aim to identify a subset of raster cells with high confidence in their classification as a flooded or dry area. To this end, we make use of a larger amount of automatically collected tweets, with minimal manual effort.
Here, we aggregate all tweets, irrespective of their relevance to the subject, and learn feature combinations that enable the prediction of the ground truth class. The spatial uncertainty of Twitter content is accounted for by having the parameters of an aggregation function vary with the bounding box size, giving higher weight to content with more precise spatial information.
Specifically, in this work, we retain all tweets that meet the geographic information accuracy requirements (as discussed in Section 5.4), and learn decisive features without traditional supervision at the tweet level. Here, the supervision comes from a ground truth on geographic raster cells being flooded or not, and useful features are aggregated at the spatial cell level. This design can be seen as a distant supervision variant [45], where spatial aggregation acts as a bridge connecting labels to textual representation features. Unrelated content is implicitly discarded by the model building procedure presented in Section 3, which learns to exclude irrelevant features from aggregated vectors.
The authors of the present study have contributed related techniques for extracting flood extent observations from SAR imagery using a hierarchical split-based approach [46], which was recently improved in urban areas using temporal coherence between SAR acquisitions [47,48]. The target variable in our experiments, as presented in Section 5.1, was actually generated using one of these techniques. Besides advancing VGI in general, a motivation for the present work is to show that a mapping between information extracted from Twitter and SAR imagery can be built, hence opening the way to the computation of a flood observation proxy at a much higher temporal frequency than what is currently possible with remote sensing derived information. This perspective is highly promising for improving flood forecasting using techniques from the data assimilation domain, to which the authors of this study have also contributed [9].

Spatial Mapping Problem Definition
In this section, we formalize a classification problem, which relates feature vectors to a spatially distributed binary target variable (flooded or not flooded). Let us consider a coordinate space ∆. In the context of remotely sensed data, points p ∈ ∆ generally lie on a regular spatio-temporal grid. We also consider the binary target mapping function y : ∆ → {0, 1}, with y(p) = 1 if the spatio-temporal cell p is flooded. We assume that a multivariate feature mapping function x : ∆ → R^d is also available, and that it is linked to the binary function by another function f : R^d → {0, 1}, so that y(p) = f(x(p)). In practice, training feature vector and target couples {x(p), y(p)}_{p ∈ ∆_train} are provided, and the best possible function f* is found, which minimizes the cross-entropy loss:

f* = arg min_f − ∑_{p ∈ ∆_train} [ y(p) log f(x(p)) + (1 − y(p)) log(1 − f(x(p))) ]   (1)

The trained model can then be used to predict the target variable y for coordinates p ∈ ∆_test unseen at training time. In the remainder of the paper, without loss of generality, we assume that training data are collected for a single point in time, that is, all training coordinates lie in a spatial grid for the same given day. The trained model would then use coordinates for other days as test data. This design choice conforms to the stated objective of providing a flooded area proxy estimator with higher temporal frequency, as expected in the literature [6]. Hence, training coordinates refer to a single day: for simplicity, in the remainder of the paper, coordinate points in equations will leave the temporal dimension implicit, and refer to the (longitude, latitude) couple.
The loss function in Equation (1) is minimized by models such as the logistic regression model and classification neural networks, using nonlinear optimization [49] or gradient descent algorithms [50], depending on the model at hand. Fitting a classification model with respect to a coordinate space has already been considered in the literature [34,39,51]. However, the formalism here is more general, as we are not tied to a specific model architecture, as long as it minimizes the required loss function. The computation of the multivariate feature vector mapping function is the crux of our methodology. This aspect is discussed in the next section.

Feature Vector Mapping Function
We now detail the construction of the feature vectors (left implicit in the previous section) by weighted aggregation, and expose several aggregation variants. Let us consider a collection of tweets C_N:

C_N = {(p_n, d_n, x_n)}_{n ∈ 1...N}   (2)

with p_n ∈ ∆ the spatial coordinates of the tweet, d_n its spatial dispersion, and x_n its feature vector representation, which we assume to be homogeneous with the multivariate feature vectors x(p) in Equation (1). Numerical vector representations would typically be obtained using language models, such as TFIDF [33] and Word2Vec [40]. The concrete representations used in our experiments are presented in Section 5.3. The dispersion reflects the geographic precision of the tweet.
In this paper, we use the surface of a geographic bounding box as dispersion. The feature mapping function for coordinates p ∈ ∆ is generated from C_N as:

x(p) = ∑_{n=1}^{N} w_n x_n / ∑_{n=1}^{N} w_n   (3)

with weight w_n obtained from:

w_n = w(p, p_n, d_n) = N(p; p_n, d_n I_2)   (4)

The weight w_n for spatial coordinates p with respect to the nth tweet is hence obtained from the 2D Gaussian distribution N centered on the tweet spatial coordinates p_n with variance d_n. Hence, the importance of the nth tweet in the feature vector for coordinates p will be greater as the tweet coordinates get closer to p, and as its dispersion is small (i.e., tweet geographic information is accurate). Let us note that, if the temporal coordinate were to be included in the coordinate space, specific mean and dispersion terms would have to be added to Equation (4).
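For illustration, the following minimal Python sketch (not the code used in our experiments; array shapes and variable names are assumptions) computes the weighted aggregation of Equations (3) and (4) for a single coordinate point:

```python
# Illustrative sketch: Gaussian-weighted aggregation of tweet feature vectors
# into x(p), following Equations (3) and (4).
import numpy as np
from scipy.stats import multivariate_normal

def feature_mapping(p, tweet_coords, tweet_dispersions, tweet_features):
    """Aggregate tweet feature vectors at coordinates p = (lon, lat).

    tweet_coords      : (N, 2) array of tweet coordinates p_n
    tweet_dispersions : (N,) array of dispersions d_n (bounding-box surfaces)
    tweet_features    : (N, d) array of per-tweet feature vectors x_n
    """
    weights = np.array([
        multivariate_normal.pdf(p, mean=pn, cov=dn * np.eye(2))
        for pn, dn in zip(tweet_coords, tweet_dispersions)
    ])
    # Weighted average: tweets close to p with small dispersion dominate.
    return weights @ tweet_features / weights.sum(), weights.sum()
```

The returned cumulated weight is also useful later on, when defining the support of a tweet collection.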
Gao et al. [39] used the Epanechnikov kernel function instead of the Gaussian distribution in Equation (4), in combination with the keyword-matching binary feature. Lampos and Cristianini [34] discretized the coordinate space ∆ into neighborhoods, and used constant weights when computing per-neighborhood feature vectors with Equation (3). The weighting function in Equation (4) is also closely related to the Inverse Distance Weighting interpolation described by Brouwer et al. [27], which involves a heavy-tailed improper distribution instead.
In this paper, we consider additional variants to the weight and feature vector computation. First, we propose to account for topographical features with an alternative weight computation function:

w^t_n = w^t(p, p_n, d_n) = I(p, p_n) · w(p, p_n, d_n)   (5)

where the t superscript stands for terrain. The values for I are obtained by applying a flood-fill algorithm (https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_floodfill.html (accessed on 17 March 2021)) originating from the tweet coordinates p_n. The flood-fill algorithm is parametrized by a DEM. At this stage, we assume the availability of this DEM and that its resolution matches the grid on which coordinates p_n lie: implementation details regarding the DEM are disclosed in Section 5.1. Subsequently, under the hypothesis that p_n is flooded according to the ground truth y, I(p, p_n) is 1 if p is flooded as a consequence of the flood-fill algorithm, and 0 otherwise. In practice, flood-filling introduces a spatial regularization to w, which favors the propagation of tweet information in a way that is consistent with the DEM. Information extracted from a DEM has also been accounted for in similar contexts. Fohringer et al. use a DEM to propagate water levels that were estimated visually in photographs with known locations [32]. Water heights are alternatively identified in the text by Eilander et al. [30]. However, these two approaches proceed by propagating information from a low number (fewer than 100) of carefully curated tweets, whereas our approach is fully automated, and accounts for a much larger (a few thousand in our experiments), and possibly noisier, amount of tweets. Brouwer et al. [27] proposed a more advanced method using the HAND model. However, we argue that the flood-fill algorithm is sufficient for our purpose of masking weights that characterize a small area around a given tweet.
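A minimal sketch of the terrain masking of Equation (5), assuming a DEM raster aligned with the coordinate grid and using the scikit-image flood-fill referenced above, could look as follows (the elevation tolerance and the coordinate-to-pixel conversion are placeholder assumptions):

```python
# Minimal sketch of the terrain-masked weight w^t of Equation (5), using
# scikit-image's flood-fill on a DEM raster.
from skimage.segmentation import flood

def terrain_mask(dem, seed_rowcol, tolerance=1.0):
    """Boolean raster I(., p_n): cells reachable from the tweet cell by
    flood-filling the DEM within the given elevation tolerance."""
    return flood(dem, seed_point=tuple(seed_rowcol), tolerance=tolerance)

def terrain_weight(p_rowcol, seed_rowcol, dem, base_weight, tolerance=1.0):
    """w^t(p, p_n, d_n) = I(p, p_n) * w(p, p_n, d_n)."""
    mask = terrain_mask(dem, seed_rowcol, tolerance)
    return base_weight if mask[tuple(p_rowcol)] else 0.0
```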
We also considered the adaptation of Fisher vectors [44] to our feature vector mapping context. We first proceed by extracting K clusters from a collection of feature vectors, in our case {x_n}_{n ∈ 1...N}. Clusters are obtained from a Gaussian Mixture Model, which is estimated using the EM algorithm [52]. The model outputs the membership probabilities of feature vectors a_n, so that a_nk is the probability that x_n belongs to cluster k. Krapac et al. [44] then compute the gradient of the log-likelihood of the set of feature vectors extracted from a test image with respect to the clustering model parameters, namely the cluster weights, means, and covariance matrices.

In this paper, we limit ourselves to taking gradients with respect to the cluster weights θ_k in order to limit the dimensionality of the resulting Fisher vector (K = 200, as recommended by Krapac et al. [44]). Additionally, instead of using an independent test image, we consider the feature vectors in C_N, weighted with respect to the coordinates p (Equation (4)). Krapac et al. show that the gradient of the log-likelihood of x_n with respect to θ_k is a_nk − θ_k. Translating to our case, the kth dimension of the Fisher vector results in:

x^F_k(p) = ∑_{n=1}^{N} w_n (a_nk − θ_k) / ∑_{n=1}^{N} w_n   (6)

Intuitively, the Fisher vector of a raster cell is a multi-dimensional measure of the extent to which the collection of tweets specific to this cell diverges from the overall tweet distribution.
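As an illustration, the cluster-weight Fisher vector of Equation (6) can be sketched with scikit-learn as follows; this is a simplified example under the assumptions above, not our exact implementation:

```python
# Sketch of the cluster-weight Fisher vector of Equation (6): a GMM with K
# components is fitted on all tweet feature vectors, and each raster cell gets
# the weighted deviation between its local membership profile and the global
# cluster weights.
from sklearn.mixture import GaussianMixture

def fit_clusters(tweet_features, n_clusters=200, seed=0):
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    gmm.fit(tweet_features)
    return gmm

def fisher_vector(weights, tweet_features, gmm):
    """weights: (N,) Gaussian weights w_n for the cell; returns a K-dim vector."""
    a = gmm.predict_proba(tweet_features)   # a_{nk}: membership probabilities
    theta = gmm.weights_                    # theta_k: global cluster weights
    return weights @ (a - theta) / weights.sum()
```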

Data and Preprocessing
In this section, we describe the map data feeding the target variable in the problem formulation that is given in Section 3. We also disclose the criteria that directed the collection of our Twitter data corpus. Subsequently, we enumerate the textual representation models that were tested in this paper. We recall that they allow associating numerical feature vectors to pieces of text, hence building x_n in Equation (2). Afterwards, we detail how the relevant spatial information was extracted from the Twitter corpus, and enriched with NER from the Twitter text.

Target Map Construction
As the ground truth for our learning procedure (y(p) in Equation (1)), we consider the flood map that is shown in Figure 1a. Hurricane Harvey affected the Houston urban region between mid-August and mid-September 2017, with a flooding peak around 30 August 2017. The map originates from Sentinel-1 SAR images acquired over the metropolitan area of Houston both during the flood event on 30 August 2017 and prior to the flood event on 18 and 24 August 2017. The Sentinel-1 mission is a constellation of two SAR satellites of the European Copernicus programme. The SAR images were transformed using a method contributed by authors of this paper [47]. Because of the requirements of the present work, the resulting map was rescaled to a 2 × 10⁻³° resolution (approximately 200 m). The map features approximately 1.23 M pixels, of which 3.1% report flooded areas. After masking permanent waters from this map, 840 k pixels remain, among which 4.4% are marked as flooded, thus allowing for a slight rebalance of the target data set. In the end, for coordinates p falling in its non-masked area, the map returns a binary value indicating whether the associated area is flooded or not according to remote sensing sources.

The DEM for the region enclosing the spatial bounds in Figure 1b, which parametrizes Equation (5), was obtained from the US Geological Survey (https://viewer.nationalmap.gov/basic (accessed on 17 March 2021)). The obtained raster file has a 10⁻⁴° resolution. We converted it to the same coordinate system as the SAR-derived data described above, and rescaled it with respect to the nearest neighboring pixels, so that its resolution equals 2 × 10⁻³°.
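For illustration, a nearest-neighbor rescaling of the DEM to the flood map resolution could be sketched as follows, assuming that both rasters already share the same coordinate system (the reprojection step is omitted here):

```python
# Sketch of the nearest-neighbour rescaling of the 1e-4 deg DEM to the
# 2e-3 deg grid of the SAR-derived flood map.
from scipy.ndimage import zoom

def rescale_dem(dem, src_res=1e-4, dst_res=2e-3):
    factor = src_res / dst_res              # 0.05: 20x coarser grid
    return zoom(dem, zoom=factor, order=0)  # order=0 -> nearest neighbour
```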

Twitter Data Collection
A corpus of tweets collected during Hurricane Harvey was made available shortly after the event [42]. It features tweets matching any of the phrases Hurricane Harvey, #HurricaneHarvey, #Harvey, and #Hurricane. Using the Twitter interfaces, we also collected our own corpus of 7.5 M tweets related to the event. Tweets are obtained as JavaScript Object Notation (JSON) items. Under this format, each tweet is a set of key-value pairs. Among the variety of meta-data distributed by Twitter, we identified coordinates (the geotag), place.full_name (the toponym), place.bounding_box (the spatial bounding rectangle associated with the toponym), and text (the actual tweet) as the pieces of information relevant to our work.
In order to match the scope and objectives disclosed in the introduction, especially regarding content localization, we did not use textual query filters, and collected all tweets with either the attached bounding box or the geotag overlapping the spatial bounds of the Houston urban surroundings between 19 August and 21 September 2017. The spatial area of interest, as shown in Figure 1b, was determined according to posterior analyses of Hurricane Harvey impacts. We found a very significant positive correlation (p < 10⁻¹⁰ under both Pearson's and Kendall's tests) between toponym frequency in the corpus and the associated bounding box surface. Such a correlation is quite natural, as toponyms such as Houston are indeed much more likely to be attached as meta-data to tweets than very specific places, such as Cypress Park High School. As a consequence, only a small portion of a Twitter corpus collected according to geographic criteria will actually be usable for fine-grained purposes, such as identifying flooded areas at a city scale. Figures obtained in the context of the requirements of the present work, as given in Section 5.4, confirm this conjecture. With this regression analysis, we also qualitatively identified the surface of 350 km² as the threshold delineating large (city-scale) toponyms from local-scale toponyms, such as streets or malls. We provide additional details regarding the corpus collection, pre-processing, and descriptive analysis in a dedicated technical report [22].
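The correlation analysis mentioned above can be reproduced with standard SciPy tests; the sketch below is illustrative, with hypothetical variable names for the per-toponym frequencies and bounding box surfaces:

```python
# Sketch of the reported correlation check between toponym frequency and
# bounding-box surface.
from scipy.stats import pearsonr, kendalltau

def toponym_surface_correlation(frequencies, surfaces):
    """frequencies: occurrence count per toponym; surfaces: bounding-box areas."""
    r, p_pearson = pearsonr(frequencies, surfaces)
    tau, p_kendall = kendalltau(frequencies, surfaces)
    return {"pearson": (r, p_pearson), "kendall": (tau, p_kendall)}
```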

Textual Representation
The Twitter corpus definition that we use in Equation (2) assumes that individual tweets can be represented by a numerical vector, but it leaves the transformation method actually used implicit. The act of numerical embedding, that is, transforming a piece of text into a numerical vector, has been covered in the introduction. In this section, we compare three embedding methods from the literature. For each method, we indicate specific implementation and pre-processing details.
The keyword-matching feature equals 1 if the tweet text, after conversion to lowercase, matches any of the predefined keywords. Therefore, the resulting feature vector has a single dimension. We use harvey and flood as keywords reflecting relatedness to the flooding event, as we showed in a previous contribution that these keywords are more effective than those used by Littman [42] (see Section 2).
Given a vocabulary of V phrases, the TFIDF model [33] is defined, for v ∈ 1...V, as:

tfidf_n(v) = tf_n(v) · log(N / N_v)   (7)

with N_v = |{n ∈ 1...N | tf_n(v) > 0}|, where tf_n(v) is the frequency of phrase v in the nth tweet. Notations in Equation (7) refer to the Twitter corpus definition given in Equation (2). For TFIDF, as suggested in Lampos and Cristianini [34], we apply Twitter-specific pre-processing. Specifically, user references (for example, @Bob) are removed, and hashtags are expanded as regular text (accounting for camel case; for instance, #HarveyFlood becomes Harvey Flood). We also filter out URLs. Punctuation is removed, and Porter stemming [53] is applied to the result. This pre-processing is similar to that presented by Dittrich [19] in the context of NER. Following the recommendations by Lampos and Cristianini [34], we consider the set of one- and two-grams as the dimensions of the TFIDF space. Simply put, dimensions in TFIDF are associated to single words in the one-gram model, whereas two-word phrases are also included in the two-gram model. TFIDF yields a very high-dimensional space (6618 with one-grams alone, 26,125 with one- and two-grams, averaged over the experimental conditions in this section). Actually, many of these tokens are present only a few times in C_N: as suggested by Lampos and Cristianini [34], we retain features that appear at least 10 times in C_N. This parametrization yields 534 one-grams and 188 two-grams, hence the dimensionality of the TFIDF space is 722 (averaged over the experimental conditions reported in this section). The obtained feature vectors are normalized to unit norm.

As the neural language embedding method, we consider the Tweet2Vec model proposed by Dhingra et al. [35]. It uses a bidirectional LSTM [41]. The model is character-based in order to mitigate Twitter-specific language artifacts (e.g., poor syntax, abbreviations). It updates a latent state vector using bidirectional passes on the sequence of tweet characters. Hashtags are removed from tweets beforehand, and the model is trained to predict them. Dhingra et al. [35] showed that the logistic regression layer that relates the eventual latent vector to hashtag classes captures some of the corpus semantics. The model is trained as specified in the reference paper. The only notable difference in our implementation is the size of the output embedding, which is set to 500 in the Tweet2Vec paper. Experimentally, we found that it could be reduced to 200 with negligible consequences in terms of performance (less than 1% reduction of the macro-averaged F1 score; see Section 6.2 for the definition of the F1 score). The models were trained using a corpus of 1.3 M tweets sampled from all tweets sent in English during one week. URLs are removed, and user references are kept after replacing all of them by the same @user placeholder. In test conditions with our Harvey corpus, similarly to the TFIDF case, the text in hashtags is kept, just removing the hash key and expanding the camel case text. Dhingra et al. [35] completely remove the hashtags, as they are the target classes to be predicted when training the language embedding vectors. We keep this text when the point is to compute a representative embedding vector, as the text behind the hash key may also have semantic value.
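As an illustration, the pre-processing and TFIDF vectorization steps described above (user reference and URL removal, hashtag expansion, Porter stemming, one- and two-grams, minimum frequency of 10) could be sketched as follows; the regular expressions and parameter choices are simplifying assumptions, not our exact implementation:

```python
# Sketch of the Twitter-specific pre-processing and TFIDF vectorization.
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(tweet):
    text = re.sub(r"@\w+", " ", tweet)                  # drop user references
    text = re.sub(r"https?://\S+", " ", text)           # drop URLs
    text = re.sub(r"#(\w+)", r"\1", text)               # keep hashtag text
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)    # expand camel case
    text = re.sub(r"[^\w\s]", " ", text).lower()        # strip punctuation
    return " ".join(stemmer.stem(tok) for tok in text.split())

# min_df=10 approximates the 10-occurrence threshold used in the paper.
vectorizer = TfidfVectorizer(preprocessor=preprocess,
                             ngram_range=(1, 2), min_df=10, norm="l2")
# x_n = vectorizer.fit_transform(corpus_texts)   # one row per tweet
```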

Spatial Information Extraction
In the context of this paper, we focus on tweets sent on 30 August. For this day, 310 k tweets are stored in our database. As already discussed in the introduction, two main pieces of geographic information are optionally attached to tweets: geotags and place bounding boxes. Among tweets sent on 30 August, 4501 (ca. 1.5%) carry a geotag. This proportion is close to expectations according to the literature [14]. The range of surfaces associated with place bounding boxes is very large, from the street to the city scale. As the target application in this paper is fine-grained, we considered only bounding boxes with an associated surface of at most 20 km² (1.6 × 10⁻³ deg²). On 30 August, 2630 tweets (ca. 0.8%) match this specification. We note that only 244 tweets fall in the overlap between the selections based on place or geotag, so these two collections are complementary. As the utility of geotags for a fairly high resolution target has been questioned in the literature [23,54–56], in our experiments we test all of the alternatives (bounding box alone, geotag alone, joint use). We note that the alternatives cause a variation of the size of C_N (Equation (2)), which serves as the basis for the computation of the feature vectors (N = 2630 when using bounding boxes alone, and N = 6887 for the joint use).
The surface of geographic bounding boxes (in squared degrees) is used as the dispersion parameter, which drives the feature vector mapping function, as mentioned in Section 4. This parameter may be zero if the only geographic information attached to the tweet is a geotag, or if the bounding box has zero length or width (which occurs in the collected data). Such a case would cause a degenerate optimization problem in Equation (1), as infinite weights would then be issued from Equation (4). To prevent this problem, we lower bound d_n by d_min = 8 × 10⁻⁵ deg². This threshold was set so that the weight at the center of this minimal bounding box is 20 times the weight at the center of a 20 km² bounding box. When both the geotag and the place bounding box are attached to a given tweet, we use d_min as its dispersion, in order to reflect that the geotag is a priori the most accurate. Additionally, in this case, we use the geotag as p_n; that is, when both the place bounding box and the geotag are present, we ignore the spatial information carried by the bounding box. It is worth noting that the usable data are only a small portion of the total amount of collected tweets. However, the quantity of tweets used is still two orders of magnitude larger than that witnessed in related work, such as Fohringer et al. [32] and Eilander et al. [30].
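The per-tweet assignment of coordinates p_n and dispersion d_n described in this section can be sketched as follows; the JSON field handling is simplified, and the helper is hypothetical:

```python
# Hypothetical sketch of the spatial information extraction for one tweet:
# geotags take precedence and receive the floor dispersion d_min; otherwise the
# place bounding box provides both the centre p_n and the dispersion d_n.
D_MIN = 8e-5            # deg^2, lower bound on the dispersion
MAX_SURFACE = 1.6e-3    # deg^2, roughly 20 km^2

def spatial_info(tweet):
    geotag = tweet.get("coordinates")
    place = tweet.get("place") or {}
    bbox = place.get("bounding_box")
    if geotag is not None:
        return tuple(geotag["coordinates"]), D_MIN       # (lon, lat), d_min
    if bbox is not None:
        lons, lats = zip(*bbox["coordinates"][0])
        surface = (max(lons) - min(lons)) * (max(lats) - min(lats))
        if surface <= MAX_SURFACE:
            centre = (sum(lons) / len(lons), sum(lats) / len(lats))
            return centre, max(surface, D_MIN)
    return None  # tweet discarded: no usable spatial information
```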

Named Entity Recognition
We used the TwitterNER system [20] to augment the spatial information described in the previous section. Extracted toponyms were geocoded using a custom Nominatim (https://github.com/mediagis/nominatim-docker (accessed on 17 March 2021)) instance restricted to the smallest available database file enclosing the region of interest in the experiments (Texas). Specifically, we retain extracted location entities, and retrieve associated candidate place bounding boxes from the Nominatim instance. Nominatim searches may return multiple results. We chose to exclude identified NER locations whose first returned result is associated with a surface larger than 350 km² (see Section 5.2 for the justification of this threshold), in order to ignore place mentions, such as Houston, which are not useful regarding the accuracy requirements in this paper. Multiple returned results are stored after checking for overlap with the area of interest highlighted in Figure 1b. Entities whose first matching result does not overlap the area of interest were also excluded. Thus, over the whole corpus, 72,026 location named entities were extracted, associated with 36,715 distinct tweets. These mentions refer to 5249 unique location names.
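A sketch of the geocoding and filtering steps, using the geopy client against a local Nominatim instance, is given below; the surface approximation and helper names are assumptions made for the example:

```python
# Sketch of geocoding extracted location entities against a local Nominatim
# instance and applying the 350 km^2 and area-of-interest filters.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="flood-vgi", domain="localhost:8080", scheme="http")

def bbox_of(location):
    # Nominatim returns [south, north, west, east] as strings
    return tuple(map(float, location.raw["boundingbox"]))

def surface_km2(b):
    # rough conversion, ignoring the latitude correction for longitude
    return (b[1] - b[0]) * 111.0 * (b[3] - b[2]) * 111.0

def overlaps(b, aoi):
    # both boxes as (south, north, west, east)
    return not (b[1] < aoi[0] or b[0] > aoi[1] or b[3] < aoi[2] or b[2] > aoi[3])

def candidate_bboxes(location_name, aoi_bbox, max_km2=350.0):
    results = geocoder.geocode(location_name, exactly_one=False) or []
    if not results:
        return []
    first = bbox_of(results[0])
    # discard city-scale mentions and entities whose best match is outside the AOI
    if surface_km2(first) > max_km2 or not overlaps(first, aoi_bbox):
        return []
    return [bbox_of(loc) for loc in results if overlaps(bbox_of(loc), aoi_bbox)]
```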
Because Twitter is a noisy text source, and TwitterNER may output false positives, the extracted location names feature obvious mismatches (for example, Cali, short for California, but mistaken for Cali Drive in the results returned by Nominatim). Hence, we manually curated a subset of the identified location names. Empirically, we found that the sorted frequencies of extracted location names are exponentially decreasing: therefore, the curation effort was focused on the 1000 most frequent location names, which account for 90% of the extracted mentions. In practice, we curated 1175 location names, among which 711 were identified as valid and 444 marked as mismatches. The curation resulted in 38 k valid and 26 k invalid location mentions. In the end, 2470 tweets contain at least one valid location reference matching the targeted time frame (30 August).
The polygon stacking technique presented by Schulz et al. [18] allows for combining multiple geographic information sources in view of supporting toponym resolution at the continental scale. In our work, we focus on a smaller area of interest, and resort to a simplified variant of this method in order to combine the spatial information provided by Twitter (Section 5.4) with the location named entities extracted via NER. For a given tweet, the intersection of all bounding boxes is computed, with an index incremented depending on the degree of overlap. In this context, the geotag adds an increment to all bounding boxes that it overlaps with. Figure 2 summarizes this procedure. For each tweet, we sort the resulting bounding boxes with respect to decreasing indexes. We retain the first resulting intersection with an associated surface less than or equal to 20 km² (if it exists), in order to be consistent with the granularity requirements in this work. This procedure results in the selection of 5144 tweets, irrespective of whether geotags are accounted for. Hence, the amount of accounted tweets is doubled when jointly using NER and Twitter place bounding boxes, when compared to using Twitter place bounding boxes alone, as discussed in Section 5.4. We note that polygon stacking essentially differs from the approach of Section 5.4 in how geotags are accounted for: there, they are converted to bounding boxes with dispersion d_min, whereas here they merely increment the indexes of the bounding boxes containing them.
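A simplified sketch of this polygon stacking variant is given below; it only considers pairwise intersections and handles surfaces in squared degrees, which is a simplification of the procedure summarized in Figure 2:

```python
# Simplified sketch of the polygon-stacking combination: intersect candidate
# bounding boxes of a tweet, increment an index per overlap (geotags only add
# to the index), and keep the highest-index intersection below the surface limit.
from itertools import combinations
from shapely.geometry import box, Point

MAX_SURFACE_DEG2 = 1.6e-3   # ~20 km^2

def stack_polygons(bboxes, geotag=None):
    """bboxes: list of (west, south, east, north); geotag: optional (lon, lat)."""
    polys = [box(*b) for b in bboxes]
    candidates = [(poly, 1) for poly in polys]
    for a, b in combinations(polys, 2):
        inter = a.intersection(b)
        if not inter.is_empty:
            candidates.append((inter, 2))
    if geotag is not None:
        pt = Point(*geotag)
        candidates = [(poly, idx + (1 if poly.contains(pt) else 0))
                      for poly, idx in candidates]
    # highest index first; keep the first candidate that is small enough
    for poly, _ in sorted(candidates, key=lambda c: -c[1]):
        if poly.area <= MAX_SURFACE_DEG2:
            return poly
    return None
```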

Experimental Protocol
As the functional form f in Equation (1), we use boosted ensembles of 10 logistic regression classifiers, optimized using the limited-memory BFGS algorithm [49]. This model outputs probabilities y*(p) that coordinates p belong to a flooded area. Let us define ∆_map as the set of coordinates such that p ∈ ∆_map if p is included in the area of the target map described in Section 5.1. Training data couples {x(p), y(p)} are then made by picking y values from the target map, associated with a feature vector built using a mapping function.
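An illustrative scikit-learn instantiation of this classifier is sketched below; it is not our exact implementation, and the ensemble parameter is named base_estimator in older scikit-learn releases:

```python
# Sketch of the classifier f: a boosted ensemble of 10 logistic regressions
# optimised with L-BFGS, fitted on (x(p), y(p)) pairs and producing flood
# probabilities y*(p).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

model = AdaBoostClassifier(
    estimator=LogisticRegression(solver="lbfgs", max_iter=1000),
    n_estimators=10,
)
# X_train: feature vectors x(p) for training coordinates, y_train: flood labels
# model.fit(X_train, y_train)
# flood_probability = model.predict_proba(X_test)[:, 1]   # y*(p)
```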
We define the support of a collection of tweets with respect to a weight function as the subset of ∆_map such that the cumulated weight ∑_{n=1}^{N} w(p, p_n, d_n) for coordinates p in this subset is greater than a threshold. As Figure 3 illustrates, this support generally lies in urban areas, from which tweets are generally sent. Coordinates outside this support area do not have enough tweets in their vicinity to enable the reliable computation of a feature vector. In contrast to remote sensing based methods, the prediction of flooded areas using social media will hence be limited to this support. We note that the support is a function of the tweet collection, which varies in our experiments depending on the geographic pre-processing method used (see Sections 5.4 and 5.5), as reported in Table 1. As the table shows, it also depends on the weight function used to compute the feature vector mapping. We use w_max/2 = w(p, p, d_min)/2 as the threshold, meaning that p belongs to the support only if its cumulated weight is greater than half the weight obtained if p matches the center of a tweet bounding box with dispersion d_min. In Table 1, we see that using geotags and polygon stacking yields an increased support. Masking pixels with the flood-fill algorithm naturally decreases this support (by approximately 50%). We note that limiting the support decreases the imbalance of the target variable y, and that flood-fill masking tends to reinforce this tendency.

Each experiment reported in the next section is characterized by a weight function (w or w^t, see Section 4), a feature vector type (x or x^F, see Section 4), a textual representation (keyword-matching, TFIDF, or Tweet2Vec, see Section 5.3), and a spatial information extraction method (place bounding boxes only, geotags only, place and geotag jointly, polygon stacking without geotags, or polygon stacking with geotags). For each experiment, we extract a stratified random sample (with the same balance as the original data set) of 200 k coordinates in ∆_map (23% of the map), compute the support of this subset with respect to the tested spatial information extraction method, and fit a function f* using Equation (1), parametrized by the tested experimental conditions. We then use the whole support of ∆_map as the test set, so that the results are easily comparable across experimental conditions.

Table 2 reports the test precision, recall, and F1 (defined as the harmonic mean between precision and recall) scores for all possible experimental conditions. Metrics for Top-10 predictions, that is, the 10% most confident positive and negative predictions, are also reported. Each metric shown in Table 2 averages the results of five independent experiments. Limiting to Top-10 predictions further reduces the predicted area, but outputs coordinates with higher confidence. Such information is still valuable in the context of assimilation approaches that can cope with partial maps [9]. Only the best performing condition using the keyword-matching feature is reported. We see that its performance is very significantly below any other variant reported in Table 2. Furthermore, in contrast with all other tested conditions shown in Table 2, restricting to the 10% most confident predictions yields degraded performance metrics, which suggests that this setting is close to a fully random classifier. This observation highlights that merely matching a shortlist of carefully chosen keywords does not have enough expressiveness for the task.
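The support thresholding and the Top-10 restriction described above can be sketched as follows; the way the 10% most confident predictions are selected here is an illustrative assumption:

```python
# Sketch of the support definition (cumulated weight above w_max/2) and of the
# Top-10 metric restriction.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import precision_recall_fscore_support

D_MIN = 8e-5
W_MAX = multivariate_normal.pdf([0, 0], mean=[0, 0], cov=D_MIN * np.eye(2))

def in_support(cumulated_weights, threshold=W_MAX / 2):
    return cumulated_weights > threshold

def top10_scores(y_true, flood_probability, fraction=0.10):
    confidence = np.abs(flood_probability - 0.5)
    keep = confidence >= np.quantile(confidence, 1 - fraction)
    y_pred = (flood_probability[keep] > 0.5).astype(int)
    return precision_recall_fscore_support(y_true[keep], y_pred, average="binary")
```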

Quantitative and Qualitative Results
First, we analyze the marginal influence of the flood-fill masked weight computation (w^t, see Equation (5)), the Fisher vector (x^F, see Equation (6)), and their combined use. With the geotag only spatial extraction method (for both textual representations), both w^t and x^F yield degraded performance. In other cases, using w^t always yields a very significant improvement in performance, on both global and Top-10 metrics. As noted earlier in Table 1, this improvement comes at the cost of a reduced support, though. Using the Fisher vector x^F occasionally brings improvement, but this tendency is less general than that observed when using w^t. With the TFIDF representation, except under the place only spatial extraction method in conjunction with w^t, using x^F always leads to a performance degradation (especially in terms of precision). With the Tweet2Vec representation, the improvement with respect to the global metrics brought by x^F alone is not very significant, but it is much more apparent in terms of Top-10 metrics (for instance, +0.082 Top-10 F1 with the polygon stacking method). Additionally, with the Tweet2Vec representation, a synergy between w^t and x^F is observed for most spatial extraction methods (for example, +0.086 Top-10 F1 with the place only condition).
Spatial extraction methods involving NER with polygon stacking do not yield spectacular improvement. However, they generally lead to decent performance (when gathering the Top 5 results for each metric column, 14 out of 30 are tied to polygon stacking methods). In particular, they are generally associated with good precision (for instance, the 2nd and 3rd best Top-10 precisions are obtained by polygon stacking variants). Actually, while the NER-based methods slightly underperform with respect to the place only methods (15 of the 30 best results), they are also tied to an increased support (see Table 1): in comparison to using only Twitter place bounding boxes, we are able to infer approximately 20% more pixels at a minimal performance cost.
Polygon stacking yields stable results with or without using the geotags (seven of the best 30 results in each case). If we aggregate Table 2 with respect to the feature vector column, and then rank the polygon stacking variants, it appears that the variants not using geotags are slightly ahead. If we focus on the F1 metric, which reflects precision and recall holistically, Tweet2Vec with w^t and x^F obtains both the most competitive overall F1 (0.425) and the best Top-10 F1 (0.834). Both of these results are established with the place only spatial extraction method.
On 31 August 2017, just one day after the Sentinel-1 acquisitions used to build the flood map in Figure 1a, a GeoEye-1 image was acquired over the Houston area and immediately made available thanks to Digital Globe's Open Data program. The image clearly shows the massive flooding in and around Houston, thanks to the reduced cloud coverage at the acquisition time. In addition to the GeoEye-1 images, by 31 August 2017, a total of 14,525 points over the city of Houston had been labeled as flooded houses or roads by Digital Globe's Tomnod crowdsourcing team (data available for download at https://www.maxar.com/open-data/hurricane-harvey (accessed on 17 March 2021)), providing a pointwise and independent interpretation of these images. Figure 4a,b show how the predictions by our method (flooded in blue, non-flooded in grey) relate to the labeled points (marker glyphs). The displayed sub-figures were chosen not only to qualitatively show the overall agreement of our method with the ground truth labels, but also to highlight the complementarity between these information sources, with expanded riverbeds inferred as flooded, but also grey areas marking predicted flooding limits.

Discussion
The TFIDF representation for tweets is very sparse, and it generally improves as more tweets are accumulated. Conversely, the Tweet2Vec representation is meant to effectively represent tweets individually, but there is no guarantee that their weighted average is a fair representation of the whole collection. The differentiated effectiveness of the Fisher vector is explained by the fact that it reflects the distribution of the weighted Tweet2Vec vectors with respect to the current coordinates, which is a more sensible way to exploit the individual representation capacity of the Tweet2Vec feature vector.
Regarding spatial extraction methods, it is interesting to note that most of the best-performing variants with respect to any metric reported in Table 2 do not use the geotag. When gathering the Top 5 results for each metric column (30 results overall), only one makes primary use of the geotag (we did not include the polygon stacking + geotag method, as the geotag has a secondary utility in this case). Therefore, we provided experimental confirmation of the sensor displacement problem, coined by Robertson and Feick [23], which disqualifies Twitter geotags for fine-grained spatial applications, such as the one considered in this paper. We found that exploiting the bounding boxes without consideration of the geotags led to the construction of more effective flood mapping functions, at the cost of a reduced spatial support (see Table 1). This conclusion is further supported by the fact that the most competitive condition overall does not use geotags. Combining these bounding boxes with toponyms extracted using NER techniques emerged as a middle path, with increased support and minimal influence on model performance.
Our Twitter corpus collection method is based on geographic criteria instead of keywords. This is meant to avoid false positive problems (for example, people posting support messages to population affected by Harvey from other US states) as well as low recall of the relevant content (many relevant posts may be missed with keyword searches). However, doing so, we retrieve many unrelated Twitter posts, for example, from automated sources. In the end, our experiments show that even a strict and small set of keywords yields low usability for our application. In contrast, using language models and identifying relevant features discriminatively acts as an automatic filter of unrelated content, even if this filter is far from perfect, as the experimental results show.
In this work, we circumvented the cost of individual tweet annotation by using aggregation with respect to a coordinate space, in a way that is agnostic to the topic of the content, and by discriminatively learning decisive features. Automatically generated content is among the types of content found on Twitter. If we consider this type of content as an additional set of topics, our method intrinsically copes with it, merely adding to the difficulty of the model fitting task. The only hypothesis that we rely on is that the spam proportion remains fairly constant through time. In parallel, the authors of the present study are working on active learning aiming at efficient Twitter corpus labelling [57]. In preliminary experiments, the method was found to be effective for quickly isolating a large part of the automatically generated content from the remainder of the corpus. Filtering the set of selected tweets using this method could be tested in order to further improve the mapping quality.
The present approach assumes that the representativeness of the general population by the set of Twitter users is constant with respect to geographic coordinates. A potential bias regarding wealth in the Twitter population may severely harm the validity of this hypothesis. A slight over-representation of wealthy classes in the Twitter user base seems to exist, but the penetration rate is similar in the underprivileged and middle classes [58].
Experimentally, we found that areas with sufficient support are heavily correlated to the population density. Assuming that the proportion of Twitter users is constant with respect to space, this is a quite natural conclusion. As a consequence, the practical usability of the present method is restricted to urban areas. The supported area may be extended by lowering the threshold presented in Section 6.1, but experimentally doing so led to systematic performance degradation.
The take-away message to stakeholders, such as emergency response or civil protection authorities, is that regional event mapping using Twitter from the implicit VGI perspective is possible in an automated way, but they should be aware of the restriction to sufficiently populated areas, as only a small fraction of the Twitter feed is usable in practice. An alternative (or complementary) way could be the move to explicit VGI, for example, with incentives for population to send localized storm-related content that is associated to a dedicated hashtag. However, anticipation is needed in the latter case, which is often impossible, especially in the case of unpredictable events, such as riots.

Conclusions
The proposed classification model converts a collection of tweets to flood probabilities for the subset of a map, determined by the presence of a sufficient support in terms of Twitter content. We emphasize that this support correlates to urban density. Extensive experiments show that combining filtering with respect to a DEM and Fisher vectors yields increased performance, and neural embedding models bring a significant qualitative improvement. At best, our method is able to reach 0.425 F1 score if predicting the whole map support, and 0.834 when restricting to the 10% most confident predictions.
Restricting the predicted area to the most confident model outputs yields an important boost. We quantified to which extent the various experimental conditions affect the size of the predicted area as well as the classification accuracy. We note that some applications, such as data assimilation, can cope with rather sparse areas, provided that the accuracy of predictions is high enough. Our most immediate perspective is to validate the proposed method in the context of a data assimilation technique for flood forecasting, such as presented in Sections 1 and 2, hence assessing the value brought by an observation proxy with higher temporal frequency.
Tweets also often feature images, some of which may picture flooded scenes. Detecting flooded scenes has already been considered in the literature. For example, the MediaEval Multimedia Satellite task disclosed a dedicated training data set [59]. As a perspective, we could consider including this information as additional input to the feature mapping function that is described in Section 4.
The procedure described in this paper requires training the model using the data of a reference day. The model is then used in production for any time frame of another day to generate novel maps. We believe that our approach is robust to changes in the distribution of the messages sent on Twitter, but probably not to the appearance of novel topics, for example, if a specific public building or bridge collapses due to an ongoing storm. Beyond simply updating the model on a regular basis, evolving our model into a stochastic process integrating time could address this limitation.