1. Introduction
As pointed out by the World Meteorological Organization (WMO), more than 11,000 natural disasters have occurred in the past half-century, among which floods are one of the main types, causing huge economic losses and casualties. With continuous global climate change and accelerating urbanization, the frequency of floods in urban areas has increased dramatically, and their influence keeps expanding. The latest research finds that about 1.81 billion people worldwide are exposed to severe flood risks, and 89% of them live in relatively fragile socio-economic environments [
1]. How to quickly respond to and minimize the loss of urban floods has become a global concern [
2]. We believe that timely and effective access to urban flood information can help improve a city’s flood prevention capability and emergency management efficiency, as well as provide a reference for predicting and handling potential future events.
Urban flood information refers to information that can represent possible or ongoing flood events, including information related to the event itself (e.g., time, scope, location), residents (e.g., deaths, health and epidemic prevention, psychological counseling), and economic activities (e.g., industrial losses, communication barriers). By piecing this information together, city managers can quickly grasp the whole picture of an event and determine its level, so as to take corresponding action and ensure the normal operation of urban activities. Current urban flood prediction mainly relies on numerical models. However, because of complex underlying surfaces, regional rainfall patterns, and poorly understood flood mechanisms, it is usually difficult for models to reproduce complex urban hydrological processes. Up-to-date flooding information supports the calibration and validation of urban flooding models, thus improving their simulation accuracy and foresight range [
3]. In addition, basic flooding patterns can be obtained through statistics and analysis of information from several historical rainfall events. On this basis, flood-prone and hazardous areas in a city could be easily delineated and would be noticed during future events, thereby reducing the event risk and the loss of all parties.
Traditionally, flood information is acquired in three ways. The first is using multiple types of sensors for field data monitoring. Sensors placed around drainage facilities, flood-prone areas and rivers enable real-time monitoring of regional rainfall and water level changes, as well as measuring proxies such as the Earth’s electromagnetic field [
4,
5,
6]. But in severe floods, debris such as collapsed buildings and branches forms large sheltered areas at the affected location, which may damage the sensors and reduce the collection range. In remote areas, sensors are installed at low densities and are difficult to maintain, resulting in under-reporting. Remote sensing technology, the second way, can solve some of these problems, such as measuring ground data in remote areas with complex terrain and improving the efficiency of data acquisition and processing. However, some thorny problems remain: adverse weather and rapid changes in spatial conditions may affect the behavior of remote sensing instruments, resulting in data loss [
7]; and the resolution of remote sensing images may decrease in smaller affected areas, such as commercial zones with dense buildings and residential areas full of greenery. The third is field investigation, which researchers use for data that are otherwise difficult to access. This approach, however, has low efficiency and accuracy, and generally serves as a supplement to the first two methods.
It is not difficult to notice that current flood information acquisition suffers from slow speed, low coverage, and limited timeliness. In addition to constantly improving technical tools (sensors and remote sensing equipment), an effective breakthrough might come from reconsidering the carriers (participants) of information. Let us shift the focus from professionals (city managers and research institutions) to the public they formerly served. With the distinctive features of large quantity, high density and strong immediacy, data acquired from the public are regarded as having vast potential. Across scientific fields, these “Citizen Data Scientists” can not only improve the scale and scope of data collection and reduce research costs, but also break down the barriers and dimensions of information sharing, allowing researchers to reach the most authentic emotional expressions.
The ways to obtain data from the public mainly include active acquisition based on crowdsourcing [
8] and passive acquisition based on ubiquitous sensing. Unlike crowdsourcing data, ubiquitous sensing data are broader and can be collected anywhere, anytime, and by anyone or anything, such as data from social media, public systems and map applications [
9]. Today, nearly 5 billion people around the world use various electronic devices daily. Free from the limitations of identity, manner, content and time, people can share anything that happens nearby during fragmented moments of their day. We believe that if the value embedded in such data can be fully explored and utilized, it will have a positive impact on information collection for urban flood events.
Existing research has confirmed that data from social media can be processed to complement the flood data obtained by existing collection methods, contributing to a comprehensive view of the full lifecycle of an event [
10,
11]. Although the content posted by the public contains a relatively low percentage of valid data due to various reasons, it can still reflect the characteristics (such as duration, peak, trends, and population movement) of an event well. For example, Rowe [
12] discovered that digital footprint data from Meta Facebook could provide real-time feedback on the survivors’ movement routes, which facilitated the timely arrival of government aid; Ponukumati and Regonda [
13] explored methods of extracting and analyzing information from Twitter data and designed a flood impact score to evaluate flood impact. Some researchers investigated methods for automatically estimating water levels from images posted on Twitter, Facebook, and Instagram, successfully generating flood maps of several severe flood events [
14,
15]. The distribution of regions with high flood susceptibility in a certain city can equally be acquired by analyzing historical social media data to predict possible inundation events in the future. Most existing studies with China as the research area have discussed the identification method for urban storm disasters based on Weibo data and highlighted its successful performance in the actual inundation events [
16,
17]. For example, Wang et al. [
18] revealed the spatiotemporal dynamics of public emotion in response to disaster evolution by establishing the relationship between rainfall and Weibo activity, while Yu and Wang [
19] proposed an analytical method integrating multimodal neural networks and a location recognition model based on Weibo data from extreme rainfall events, demonstrating the application value of social media data with diverse formats in disaster dynamic perception. Beyond just postings, changes in browsing behavior are also able to reflect the severity of an event and its transfer path [
20]. However, it can be seen that existing research primarily analyzes and discusses data from a single social media platform. The user profile and content-sharing style of any one platform may well be skewed, leading to incomplete data acquisition. Considering China’s vast territory, there are great variations in economic conditions, residents’ living habits, and social structures among different regions. Whether the validity of public data is regionally dependent is therefore also a matter of concern.
Thus, this study selected three cities with diverse geographical characteristics and differing developmental levels. For each city, an extreme rainfall event occurring in 2023 was chosen. Relevant data during the rainfall events were extracted from China’s two most popular social media platforms. By analyzing these data alongside urban features, this study comprehensively explores the practicality of multi-source social media data in reflecting and supplementing waterlogging events and their adaptability to spatial variations, and provides suggestions on how such data can be better utilized by scientists and government decision makers.
2. Methods
The study obtained multi-source data on urban floods using web crawler technology and evaluated their availability and universality across various temporal and spatial dimensions. Meanwhile, a topic classification model was constructed by combining the Latent Dirichlet Allocation model and a Support Vector Machine, so as to mine the social media data comprehensively and expand the mapping scope. In the following sections, the actions described in
Figure 1 are addressed in detail.
Step 1: Data crawler.
We employed a keyword-based web crawler technique to collect data. A web crawler is a program that automatically accesses web pages and extracts information according to certain rules and targets. It can accurately and efficiently search for large amounts of required information and help complete the data analysis and visualization, and it is mainly used in search engines and data mining.
The types of data required should be determined before starting to crawl, including the username, user type (official/self-media/VIP/common), release time, text content, pictures, videos and IP address. For short-video apps, additional information such as the title, the link to the work, and auto-tags should also be crawled according to each app’s page settings. After setting the keywords related to urban floods, the release timeframe ought to be limited to a certain event to ensure the timeliness of the data. To enhance the effectiveness, accuracy and precision of the information obtained, crawling is performed only on the main content of the original posts.
The specific crawler process is as follows: establish the pre-login connection using user information from the current website and encrypt the results; make a login request to the website and extract cookie information from the returned URL to complete the simulated login; after setting the search conditions, obtain the initial URL by manually constructing the URL list; then execute the crawler, equipped with the pre-set cookies, and submit a request to that URL via an HTTP library to acquire the HTML structure of the page. XML Path Language (XPath) is used to parse the HTML information, and the required data are stored as a CSV file for further processing. The above steps are repeated to continuously crawl subsequent pages with newly fetched URLs until all pages have been visited. To counter the website’s anti-crawler mechanisms, such as IP blocking and request rate limiting, a random buffer time (60 ± 15 s) is employed to extend the crawling interval. This delay allows the crawler to rest briefly between requests.
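The crawling loop described above can be sketched as follows. This is a minimal illustration, not the study’s actual implementation: the search URL, cookie values, and XPath expressions are placeholders, and only the random 60 ± 15 s buffer mirrors the procedure in the text directly.

```python
# Illustrative sketch of a keyword crawler with pre-set cookies, XPath
# parsing, CSV storage, and a random buffer time (60 ± 15 s) per page.
# The endpoint, cookie, and XPath expressions below are placeholders.
import csv
import random
import time

import requests
from lxml import html

COOKIES = {"session": "<cookie-from-simulated-login>"}  # placeholder
HEADERS = {"User-Agent": "Mozilla/5.0"}

def crawl(keyword: str, start_page: int, end_page: int, out_path: str) -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "time", "text"])
        for page in range(start_page, end_page + 1):
            url = f"https://example.com/search?q={keyword}&page={page}"  # placeholder
            resp = requests.get(url, cookies=COOKIES, headers=HEADERS, timeout=30)
            tree = html.fromstring(resp.text)
            for post in tree.xpath("//div[@class='post']"):  # placeholder XPath
                writer.writerow([
                    post.xpath("string(.//a[@class='user'])"),
                    post.xpath("string(.//span[@class='time'])"),
                    post.xpath("string(.//p[@class='text'])"),
                ])
            # random buffer time (60 ± 15 s) against rate limiting
            time.sleep(random.uniform(45, 75))
```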
Step 2: Data pre-processing.
Delete invalid data from the original dataset, including the data that are: (1) irrelevant to urban flooding; (2) obviously false; (3) lacking valid information (e.g., keywords refer to unrelated concepts or posts are unrelated to rainfall event features); or (4) too short (less than five Chinese characters).
Chinese Word Segmentation (CWS) refers to converting Chinese text into a separate word sequence to eliminate ambiguity when processing the text in models. CWS is the basis of natural language processing and text mining algorithms. Jieba Chinese text segmentation is employed in this step, which is available on GitHub and easy to implement. Jieba offers three modes of operation. Among them, the default mode of sentence cutting is suitable for text analysis. On this basis, the search mode can conduct a second cut on lengthy words to enhance the recall rate. Part-Of-Speech tagging can also be achieved using the posseg module in Jieba. It first pre-processes the input text to segment it into Chinese character sequences. For these sequences, the posseg module can construct a directed acyclic graph based on the prefix dictionary, find the highest-probability path to perform word segmentation and then assign appropriate parts of speech to the words.
Stop word removal refers to deleting words without any semantic value in the text to optimize the result of CWS, including commonly used conjunctions, modal and auxiliary verbs and punctuation. We apply a combination of the Baidu stop word list and Harbin Institute of Technology stop word list in this study.
Step 3: Content recognition.
Location information extraction is carried out through text filtering. After sorting by publishing time, data from the same location during a specific consecutive time period are consolidated. Following this, an open-source location search application is used to obtain the longitude and latitude of the flood points, which are then visualized. The coordinate system used here is the World Geodetic System 1984 (WGS-84).
Topic detection is conducted. The Latent Dirichlet Allocation (LDA) model is adopted to ensure high-level identification and understanding of the topics and potential insights within the original dataset, exhibiting high effectiveness in extracting latent semantic topics from short, sparse texts like social media posts [
18]. The interrelation among the model parameters can be expressed as follows (
Figure 2).
As shown in Figure 2, a word is the basic unit in this concept. A document is a random mixture of N words (w = (w_1, w_2, …, w_N)), and a corpus is a random mixture of M documents [21,22]. α and β are two hyperparameters intended for generating θ (the multinomial topic distribution of a document) and φ (the multinomial word distribution of a topic). Assuming the number of topics is K, for each document, the LDA model randomly samples a topic from θ and then selects a word from the φ corresponding to that topic. The process is repeated until a document is completely generated. Parameters α and β were set to their default values (50/K and 0.01), which are empirical standards in LDA. The topic number K = 15 was selected based on the perplexity calculation.
In contrast to the generative process, topic identification involves statistical inference to estimate the latent topic variables that best explain the observed data. For a specific corpus, the model needs to acquire the topic distribution and word distribution through effective dimensionality reduction in the text data. The formula is [21]

p(w) = \sum_{t=1}^{K} p(w \mid t)\, p(t \mid d) \quad (1)

where p(w) denotes the occurrence probability of each word in the dataset, p(t|d) is the occurrence probability of each topic in a document, and p(w|t) represents the occurrence probability of each word in a topic.
The detailed construction process of the topic recognition model is as follows:
(1) Determine the number of topics for the LDA model. To obtain the appropriate number of topics, it is necessary to identify the topic set with the least similarity. In the visualization of topic classification, reduced overlap of topics leads to improved topic division. Perplexity is, meanwhile, adopted to help define the optimal topic number [21], which is calculated as follows:

\mathrm{Perplexity} = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right)

where N_d is the number of words in the text d. Perplexity quantifies the uncertainty that a given text belongs to a particular topic. Typically, the lower the perplexity, the better the model’s performance in dealing with new text data.
(2) Extract the features using the Gibbs sampling algorithm. This algorithm is frequently used for analyzing multidimensional objectives, wherein single-dimensional sampling is executed successively for each dimension [23]. In the context of LDA, the process iteratively samples a new topic for each word in the corpus based on the current topic assignments of all other words. A topic number t_j is randomly assigned to each word w_j of each text d_s in the corpus, and then Equation (1) becomes

p(t_j = k \mid \mathbf{t}_{\neg j}, \mathbf{w}) \propto \frac{n_{\neg j,k}^{(w_j)} + \beta}{n_{\neg j,k}^{(\cdot)} + V\beta} \left( n_{\neg j,k}^{(d_s)} + \alpha \right)

where the n terms count the current topic assignments excluding word j, and V is the vocabulary size.
Then, the topic number of each word in the corpus is iteratively updated until the sampling reaches convergence. Finally, the probability distribution of each topic is derived.
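The sampling update above can be implemented in a few dozen lines. The following is a self-contained toy implementation of collapsed Gibbs sampling for LDA, with a synthetic word-id corpus and illustrative hyperparameters, not the study’s code.

```python
# Collapsed Gibbs sampling for LDA on a tiny synthetic corpus:
# each word's topic is resampled conditioned on all other assignments.
import random

random.seed(0)

docs = [[0, 1, 2], [1, 3, 3], [0, 2, 2], [3, 1, 0]]  # word ids per document
V, K = 4, 2                       # vocabulary size, number of topics
alpha, beta = 50.0 / K, 0.01      # priors as in the text

# counters: n_dk topic counts per document, n_kw word counts per topic
n_dk = [[0] * K for _ in docs]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
z = []                            # topic assignment t_j for every word w_j

for d, doc in enumerate(docs):    # random initialization
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[d].append(t)
        n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1

for _ in range(200):              # iterate until (approximate) convergence
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]           # remove the current assignment
            n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
            # p(t_j = k | t_-j, w) ∝ (n_kw + β)/(n_k + Vβ) · (n_dk + α)
            weights = [(n_kw[k][w] + beta) / (n_k[k] + V * beta) *
                       (n_dk[d][k] + alpha) for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[d][j] = t           # record the new assignment
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1

# per-document topic distribution θ derived from the final counts
theta = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
```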
Topic classification. The Support Vector Machine (SVM) is employed to construct the topic classification model for further aggregate analysis of the data due to its robustness in high-dimensional spaces with limited training samples. To train feature words for the current corpus, the SVM is integrated with the LDA–Gibbs topic model above. When the latest data is imported, the classification model can match its topics by assessing feature similarity. In addition, 20% of data are randomly sampled to evaluate model performance [
24]. Precision and F1-Score are used as follows [25]:

\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}

where TP_i represents the data that are correctly categorized as class i, FP_i represents the data belonging to other categories but wrongly classified as class i, and FN_i represents the data belonging to class i but wrongly classified into other categories.
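The classification and evaluation step might look like the following with scikit-learn; the feature matrix and labels are synthetic stand-ins for the LDA-derived feature vectors and annotated topics, with the 20% held-out split from the text.

```python
# SVM topic classifier with a held-out 20% test split and
# macro-averaged precision / F1. Data here are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((200, 15))            # e.g. K = 15 topic proportions per post
y = X.argmax(axis=1) % 3             # synthetic topic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
pred = clf.predict(X_te)

precision = precision_score(y_te, pred, average="macro", zero_division=0)
f1 = f1_score(y_te, pred, average="macro")
```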
4. Results and Discussion
4.1. Social Media Data Extraction
Firstly, data from Weibo and Douyin were collected for the selected rainstorm and flood events that occurred in Beijing, Fuzhou, and Beihai in 2023. The time frame according to rainfall duration and keywords required are shown in
Table 1. Because the Weibo platform displays at most 50 pages of results per visit, appropriate time segmentation was necessary to obtain the complete data. The keywords utilized in the crawling mainly pertained to the weather conditions, the extent of damage, and the resulting consequences. In addition, given the infrequent use of location-based services by Weibo users, we incorporated location-specific keywords based on local Points of Interest (POIs). These included names of major intersections, underpasses, and landmarks (e.g., “Wusi Road” in Fuzhou), used to capture implicit geographic information in posts that omit generic flood terms.
A total of 15,883 pieces of data were obtained as the original dataset and then screened by removing duplicate records, invalid data (including excessively short text), and irrelevant items. The remaining dataset contained 7447 values, as outlined in
Table 2. It is evident that both the raw and processed data acquired from Weibo are considerably more numerous than those from Douyin, highlighting the fact that Weibo remains the primary information source relied upon by the public. However, the residual rates of Weibo and Douyin were 42.7% and 75.5%, respectively, after comparing the data before and after filtering, which demonstrates the fairly high validity of the Douyin data. This is probably attributable to the higher proportion of personal users on Douyin and a page setup that allows users to interact in a comment-based manner, thereby reducing the repetition rate of the raw data. Additionally, a correlation can be observed between the Weibo data quantity during different rainfall events and city scale; that is, in more developed cities, more users share information on Weibo. In contrast, the amount of Douyin data exhibits some randomness. The distinction between Beihai (a fourth-tier city) and Beijing (a first-tier city) is remarkable. The effective data volume of Weibo in Beihai amounts to only about 30% of that in Beijing, whereas the corresponding volume for Douyin is almost 28% greater than that in Beijing, reflecting Douyin’s reach and inclusiveness in smaller cities. This confirms that the user profiles of Douyin and Weibo are complementary in expanding data coverage.
4.2. Spatio-Temporal Validity of Social Media Data
The data collected were counted every 2 h to identify the changing pattern and its correlation with the official rainfall intensity (which was obtained from official emergency management reports provided by local water authorities), as shown in
Figure 5. Although high rainfall intensity does not necessarily lead to flooding events, analyzing it can still reveal how social media data reflect rainfall processes. Pearson correlation analysis revealed a significant positive correlation between rainfall intensity and social media volume, with a time lag of 2 h (R² > 0.6, p < 0.05), statistically confirming the response capability of public sensing. Social media data typically increased within 2–12 h following the event outburst, and then displayed a cyclical fluctuation trend by the day. The peak of user data usually appeared during the “cooling off” period of the event, because users were more concerned about the impact of waterlogging on their personal and professional lives, rather than the forced shutdown of economic activities and traffic as the event unfolded. Also, the variation in user data mirrored residents’ daily timetables. Hours with a greater volume of user data per day were more likely to occur during residents’ commutes amid traffic congestion. And even in the most active period of the event, the number of posts remained mostly below 20 late at night.
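The lagged correlation test can be sketched as follows; the two series are synthetic, constructed only to show the one-bin (2 h) shift, not the study’s measurements.

```python
# Pearson correlation between 2-h rainfall bins and post counts,
# with the post series shifted back by one bin (a 2-h lag).
from scipy.stats import pearsonr

rainfall = [0, 2, 10, 25, 18, 6, 1, 0]    # mm per 2-h bin (synthetic)
posts    = [1, 1, 3, 12, 30, 22, 8, 2]    # posts per 2-h bin (synthetic)

lag = 1                                    # one bin = 2 h
r, p = pearsonr(rainfall[:-lag], posts[lag:])
r_squared = r ** 2
```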
Event A2 in Fuzhou was characterized by a short duration and high intensity of rainfall with a single peak, while the data obtained from both sources were substantial, making the trend concise and representative. Thus, taking
Figure 5b as an illustration, we detail the trend and reliability of social media data along the temporal dimension of a rainstorm event. From 6 pm onwards, water accumulation in the city became severe as the rainfall intensity increased. The rainstorm gradually impacted users’ daily routines as it coincided with the evening rush hour, so the number of related posts rose sharply. At 10:30 pm that day, the Fuzhou government implemented the second-level contingency plan for rainfall and issued the red warning signal an hour later. In this context, a small peak of user attention to the event was observed. Although precipitation peaked at 00:00 on the 6th, users’ concern gradually diminished as they turned to rest. The second surge of attention appeared the next day during the morning rush. In the early afternoon of the 6th, the rainstorm event drew to a close, yet users’ discussion remained fervent: the public remained entangled in the aftermath of the event through a sequence of direct or indirect disasters, and a recovery period was required for society to return to normal.
Then, the social media data were further differentiated by user type based on event A2.
Figure 6 displays the trends in the number of posts published by authoritative, influential news accounts and by general user accounts as separate colored lines. As can be seen, news users appeared to possess marginally stronger insight into events, with their trend line occasionally exhibiting a turning point prior to that of common users. The time difference typically amounted to approximately 2 h, and the advantage was not especially notable. Additionally, data from news accounts frequently showed a rapid increase to a peak followed by a sharp decline, demonstrating greater volatility than that of normal users. That is, normal users showed slightly higher persistence, helping maintain the event’s visibility over an extended period. For example, on 6 September, the data from news users experienced only two peaks, during the morning and evening rush hours, whereas normal users’ focus remained stable for over 14 h until late at night, after they began to notice the flooding situation at 6 am.
The quantity of data posted did not significantly differ between the two types of users, with news users occasionally sharing even more than normal users. In fact, over 40% of the crawled data were released by news accounts, a phenomenon linked to users’ posting habits and the varying focus of news users. Firstly, during periods of heavy rainfall and flooding, many normal users tend to disseminate accurate information to help others avoid being trapped, rather than expressing personal emotions, which creates some duplicated information. Secondly, news accounts can be divided into two categories: news media that report breaking news, such as rescue requests and casualty notifications, and official accounts of government departments that primarily publish professional weather forecasts or officially measured rainfall data. We prefer to focus on the former in the analysis of social media data reflecting and complementing the event situation, i.e., the objective description of what happened. But the latter kind of account amalgamates the benefits of conventional journalism and social media, offering the public reliable information to stimulate increased attention and wider distribution. Hence, we argue that it is possible for all three parties—general users, news users, and official accounts—to collaborate in shaping a social media opinion matrix in a rainstorm event, with each contributing analytical value.
As previously mentioned, users tend to prioritize how the rainstorm affects them and their immediate surroundings, resulting in a higher concentration of social media data in the cities where the events occurred, despite the lack of geographic restrictions. To investigate the spatial distribution of social media data, we tallied the mentions of districts in each urban rainfall event. The findings are depicted in
Figure 7.
Taking
Figure 7a as an example, postings during event A1 predominantly focused on multiple districts in the central and southwestern regions of the city. Among them, the four districts in the city center gained the highest amount of relevant data, with Fangshan, Mentougou, and Daxing Districts following closely behind. The distribution map depicts a gradual decline in data quantity from south to north. Indeed, precipitation was concentrated and widespread, influenced by the typhoon and the peripheral low-pressure cloud system. Districts in the southwest and south experienced exceptionally heavy rainfall accompanied by thunder and strong winds, and the southwestern regions were at higher risk of secondary disasters such as flash floods and building collapses because of their proximity to mountainous areas. According to data from the Meteorological Service, Fangshan and Mentougou Districts had average rainfall of 598.7 mm and 538.1 mm, respectively, affecting nearly 80% of residents’ lives and representing the worst-impacted areas during this event. A comparison with the official map of rainfall distribution clearly indicates the coherence of the social media data’s spatial distribution with the real rainfall process.
4.3. Spatial Characterization Value of Social Media Data
The waterlogging locations were acquired through matching the latitude and longitude of imprecise geo-information taken from social media data. According to relevant government reports, the officially announced flood-prone points of each city are presented in
Figure 8, alongside public data.
Table 3 summarizes the flood-prone points extracted from various social media data and their discrepancies with official points. The supplementary rate is defined as the ratio of newly identified inundation points to official inundation points. Most of the officially given points are concentrated in the urban center of the city, such as Hai Cheng District in Beihai. However, the information shared by individuals with various identities and geographical locations can partially portray the flooding conditions in regions with lower administrative levels. Thus, social media data can help recognize several unknown inundated spots, as a complement to official data. While it is recognized that data provided by non-expert users may be subject to uncertainties like vague descriptions and low resolution, we still consider the potential value of such information in identifying infrastructure problems on a given road, such as pipe leakage or small-scale ground depressions. Also, multi-source inundated points exhibited significant variations in characteristics across cities located in diverse geographical regions. Next, we explore the geospatial features of information shared on social media during the events by considering their distinctive regional contexts to deeply ascertain the universal value of social media data.
4.3.1. Geographical Features
The social media data results were basically aligned with local geographic features. Its vast size endows China with diverse geographical attributes. The three representative cities selected for the study are located in the northern, southeastern coastal and southern regions of China. Beijing is surrounded by mountains in the north-west and flat land in the south and east, driving rainfall runoff towards the plains. During periods of heavy rainfall, water from mountainous regions flows into urban areas and intensifies as the terrain changes, thereby triggering flash floods and urban waterlogging. As illustrated in
Figure 8, most flooding points identified from public data in Beijing were distributed in the southern and southwestern regions of the city, overlapping with the actual occurrence of localized waterlogging. Fuzhou has a diverse terrain comprising plains, hills, and mountains, with an overall basin inclined from west to east. The city experiences plentiful rainfall and has a dense water network. The Min River, which ranks seventh among China’s largest rivers, runs through the city and ultimately empties into the sea via the Changle District. It is evident that the inundation points from both official sources and social media in Fuzhou were focused within the Min River basin, representing the greatest level of overlap compared to the other two cases. Beihai has a mostly flat landscape, with an average altitude of 10–15 m. Despite being a coastal city like Fuzhou, it has distinct wet and dry seasons and receives much less rainfall than Fuzhou. Whilst the amount of social media data and the flooding points collected in Beihai were limited, they were still discerned to be mainly situated in the southwestern region, where the terrain is notably low and the area next to the sea is substantial.
4.3.2. Demographic Characteristics
The population size also influences the results obtained from the data analysis. As the capital city of China, Beijing possesses the largest permanent population (21.9 million people by the end of 2023), mainly residing in the central and southern urban areas (
Figure 9). Fuzhou and Beihai have 38.6% and 8.6% of Beijing’s population, respectively. Although youth and middle-aged (15–59) individuals, who are considered the backbone of society, constitute approximately 65% of the population in all three cities, the vast difference in population base remains significant, with disparities in the number of potential users willing to adopt new technological devices and sharing platforms. This partly explains one of the notable phenomena in the graph: Beijing, the only inland city, with neither the highest annual precipitation nor the highest waterlogging risk, had a far higher amount of social media data and more waterlogging points than any other city.
4.3.3. Difference in Regional Development
Differences in the social development levels of cities constitute another factor for this phenomenon. The Gini coefficient in mainland China has remained around 0.40 over an extended period, indicating a significant issue of uneven development [
24]. In recent times, the openness of society and the flow of resources have contributed to the growing north–south gap and east–west gap. Also, unbalanced development exists within the city. Rural populations have been migrating to towns and cities for better social welfare, employment opportunities and educational quality, causing a depletion of labor and resources in the underdeveloped regions. But the first-tier cities always benefit from extensive resources that enable a stronger focus on overall city development, rather than concentrating on a specific area. In this study, the inundation points from social media in the three cities were all distributed mainly within the main urban areas, yet a considerable number of public findings were also reported in the secondary regions of Beijing.
The imbalanced development results in substantial disparities in the industrial structure and economic status among various cities.
Table 4 illustrates this economic disparity. Differences in public budgets and education expenditure correlate directly with the “digital divide”, which explains the lower volume of high-quality data in less developed cities. As city scale expands, the industrial structure is optimized, benefiting social and economic outcomes as well as population quality. The aggregate of each indicator in Beijing overwhelmingly surpasses that of the other cities, particularly public expenditure on education. High-quality educational resources raise residents’ awareness and acceptance of emerging technologies, motivating them to pay closer attention to urban construction and flood prevention, which in turn stimulates the production of richer and more valuable data on flood-prone areas. By comparison, Beihai, the least developed of the three cities, has users who mostly belong to a down-market (consumer-oriented) segment and tend to prioritize immediate gratification over rigorous analysis and scientific thought. So, although Beihai produced 28.3% more effective data on Douyin than Beijing, its volume of data with precise descriptions of events and locations was merely 45.6% of Beijing’s.
The city’s infrastructure status is another reflection of economic disparities. Fuzhou experiences the greatest frequency of heavy rainfall events and the highest flood risk among the three cities, and indeed nationwide. Its extracted data and identifiable waterlogging points, however, ranked only in the middle, which may be related to the evolution of water management strategies in Fuzhou. In August 2015, Typhoon Soudelor hit Fujian Province. Torrential rain raised the water level in the inland river and, combined with the high tide of the Min River, caused certain river segments to overflow into the city. A total of 956,000 people in the city were affected by the rainfall, and the direct economic losses amounted to about 535.98 million dollars. Drawing on the lessons of this rare rainstorm, the Fuzhou government reviewed its water management measures and established a more effective policy framework alongside enhanced flood prevention and drainage systems. In this context, the government centralized the governance of the urban flood disaster risk reduction system, previously scattered across various administrations, and integrated information technologies such as IoT and big data to establish a pioneering water joint-scheduling center. Since this mature new water management system was implemented, waterlogging in the city has decreased in both frequency and duration, although sporadic events still occur; this may also contribute to the lower amount of waterlogging data from social media within the city.
4.4. Topic Classification Results and Analysis
The LDA model was constructed using data from three rainfall events as training samples, and the optimal number of topics was determined to be 15 (K = 15). To validate the classification performance, we used the 20% hold-out dataset mentioned in Equation (4). The Support Vector Machine (SVM) classifier achieved an average precision of 88.4% and an F1-score of 85.2% across the five topics, demonstrating robust performance in identifying flood-related categories. The connections between the word sets were further analyzed to merge similar items. Eventually, five main topics were identified: real-time weather, disaster situation, rescue information, damage and losses, and emotion. Statistics were computed separately for each topic to quantify the amount of data available. The results are visualized in
Figure 10.
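The pipeline described above (LDA topic features, then an SVM evaluated on a 20% hold-out split) can be sketched as follows. This is an illustrative toy example, not the authors’ code: the mini-corpus, labels, and parameter choices are hypothetical, with only K = 15 and the 20% split taken from the text.

```python
# Sketch of the described pipeline: LDA topic distributions as features,
# an SVM classifier, and evaluation on a 20% hold-out split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, f1_score

# Hypothetical flood-related posts, two per topic (weather, disaster,
# rescue, damage, emotion), repeated to give the split enough samples.
posts = [
    "heavy rainfall warning issued for the city tonight",
    "typhoon approaching coast expect strong wind and rain",
    "street near the mall is submerged cars cannot pass",
    "underpass flooded water level rising fast",
    "rescue team evacuated residents everyone is safe",
    "volunteers helping with rescue boats downtown",
    "shop flooded goods damaged losses are heavy",
    "car engine ruined by flood water repair cost is high",
    "why does this happen every year so frustrating",
    "feeling anxious stuck at home because of the flood",
] * 5
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] * 5  # five topics, as in the paper

X_counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=15, random_state=0)  # K = 15
X_topics = lda.fit_transform(X_counts)  # document -> topic distribution

X_tr, X_te, y_tr, y_te = train_test_split(
    X_topics, labels, test_size=0.2, random_state=0, stratify=labels)
clf = LinearSVC().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred, average="macro", zero_division=0))
print("F1:", f1_score(y_te, pred, average="macro", zero_division=0))
```

On real social media data the feature extraction and label set would of course be far richer; the sketch only shows how topic distributions can serve as classifier inputs and how the hold-out evaluation is wired together.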
From the results, public opinion on urban flooding was primarily associated with the damage and losses caused by intense events, which affected users’ daily lives and ability to travel. Real-time weather and the disaster situation held the next highest importance, constituting 21.6% and 18% of all topics, respectively. Recently, the government’s focus on climate change and urban drainage construction has raised citizens’ awareness of waterlogging. Construction achievements have generated excitement, as they are expected to increase urban resilience and help people return to their daily routines quickly. We then analyzed the topic distribution of data from the different social media platforms through segmented word matching. The results revealed discernible user profiles and distinct atmospheres on the two platforms. Although both datasets expressed significant concern about disaster damage, in line with the overall trend, Weibo users shared a greater quantity of information and opinions regarding the disaster, whereas Douyin users displayed richer emotional reactions. Integrating both sources could provide decision makers with a fuller picture of the situation and of public sentiment, enabling them to select an appropriate and prompt emergency response plan.
Another question is whether residents in less developed areas are conscious of sharing and disseminating immediate information and seeking relief while posting. We analyzed the frequency of typical topic terms to investigate the concerns and sentiment changes among residents during rainfall (in
Figure 11). Real-time weather conditions were the topic of greatest concern to users. The words “rainfall” and “warning” appeared more often than other words related to this topic. In particular, the frequency of “typhoon” in Fuzhou users’ posts was considerably greater than in the other two cities. From June to September every year, Fuzhou experiences the typhoon season, which is responsible for the majority of flooding in the city. Concerned about a typhoon’s intensity and track, users expect to receive up-to-date information from social media platforms. The disaster zone and rescue communication were two other popular subjects of user discussion. High-frequency words such as “rescue”, “safe”, and “submerged” showcase social media’s capacity to broaden response channels in emergencies while also conveying users’ unease and apprehension. Compared with their use of such terms, however, users were not accustomed to expressing their emotions directly in text. “Why” was the most popular emotional word in Beijing and Fuzhou; in context, it usually indicated that users harbored reservations and disappointment regarding the occurrence and handling of waterlogging events, underscoring the importance they attach to urban flood control. The sentiment hot word in Beihai was “anxious”: residents of smaller cities were more interested in how the event had affected their own arrangements than in questioning the monitoring system and response procedures behind it. Despite the limited public data available for Beihai, its distribution trend was consistent with that of the larger cities while still reflecting the city’s unique characteristics. We therefore recommend that such cities increase their focus on social media to familiarize residents with the platform communication model, which may yield unexpected benefits in emergency situations.
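The per-city keyword-frequency analysis above can be illustrated with a minimal sketch. The keyword lists and toy posts below are hypothetical stand-ins, not the study’s actual term dictionaries or data; only the example terms (“rainfall”, “warning”, “typhoon”, “rescue”, “safe”, “submerged”, “why”, “anxious”) come from the text.

```python
# Minimal sketch: count how often tracked topic keywords appear
# in each city's posts, the kind of tally behind Figure 11.
from collections import Counter

# Hypothetical keyword lists per topic (example terms from the text).
topic_terms = {
    "weather": ["rainfall", "warning", "typhoon"],
    "rescue":  ["rescue", "safe", "submerged"],
    "emotion": ["why", "anxious"],
}

# Toy per-city post collections (already tokenizable by whitespace).
city_posts = {
    "Fuzhou": ["typhoon warning heavy rainfall",
               "rescue boats everyone safe"],
    "Beihai": ["anxious about the rainfall",
               "street submerged again why"],
}

def term_frequencies(posts, tracked):
    """Count occurrences of each tracked term across a list of posts."""
    counts = Counter()
    for post in posts:
        for token in post.split():
            if token in tracked:
                counts[token] += 1
    return counts

all_terms = {t for terms in topic_terms.values() for t in terms}
for city, posts in city_posts.items():
    print(city, dict(term_frequencies(posts, all_terms)))
```

Real Chinese-language posts would need a proper word segmenter (e.g., a segmentation library) rather than whitespace splitting; the sketch only conveys the counting logic.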
4.5. Discussion
Our findings regarding the temporal lag of social media data align with previous studies [
29], confirming that peak public attention typically trails peak rainfall by 2–4 h. However, unlike studies focusing solely on first-tier cities, our cross-regional analysis reveals that this lag is more pronounced in lower-tier cities due to different user habits. The integration of Weibo (text-heavy) and Douyin (video-heavy) proved crucial. While Weibo provided higher temporal resolution, Douyin contributed unique visual confirmations of flood depth, validating the necessity of multi-source fusion proposed in our methodology. The primary uncertainty lies in deviations in socioeconomic development, as shown in
Table 4 and
Figure 9. Beihai’s lower GDP and education budget correlate positively with its sparse effective data, suggesting that social media analysis may underestimate risks in underdeveloped regions. Furthermore, social media data are published by users, which inevitably introduces a degree of subjectivity: different users describe the waterlogging degree according to their own perceptions and adaptability. Relying solely on textual location descriptions (e.g., “near the mall”) can also introduce spatial positioning errors. In the future, computer vision techniques could be integrated into social media data processing to automatically extract information from images and address these limitations.
Despite some limitations (a low percentage of valid data, ambiguous location information, etc.), social media data were shown to be a valuable source of information for urban flood management. How to increase the potential value of social media data and make them more useful in scientific research and urban management is the main challenge ahead. To advance in this direction, we offer suggestions from two perspectives: data analysis and city management:
(1) Image and video analysis techniques should be refined to improve the scalability and application efficiency of social media data. The existing methods comprise utilizing known-sized objects in a single frame to determine flood depth [
30] and implementing machine learning methods for automatic image classification, filtering, and real-time processing [
31]. However, several issues remain to be addressed: the quantity of available data is limited, and obtaining videos and images without accompanying text is arduous; differences in image quality and category sometimes prevent the construction of representative training datasets; and privacy concerns may arise. Considering the perspective of city managers could partially alleviate these problems.
(2) City managers ought to guide the market and the public, establish sound shared values and foster an environment of scientific democratization, so as to fundamentally boost the availability of data. Firstly, deeper integration of industry (platforms) and scientific research should be promoted, for example by establishing dedicated data channels for research. Secondly, regulating and upholding transparency and fairness on platforms is crucial to stimulate users’ willingness to share. Finally, cultivating a public sense of scientific ownership and developing new disciplines, such as public hydrology, that explore new forms of ubiquitous data to enhance traditional research paradigms are highly promising. When citizens recognize that what they share can function as significant scientific data, the overall quality of such data can be greatly enhanced.