1. Introduction
As pointed out by the World Meteorological Organization (WMO), more than 11,000 natural disasters have occurred in the past half-century, among which floods are one of the main types, causing huge economic losses and casualties. With continuous global climate change and accelerating urbanization, the frequency of floods in urban areas has increased dramatically, and their influence keeps expanding. The latest research finds that about 1.81 billion people worldwide are exposed to severe flood risks, and 89% of them live in relatively fragile socio-economic environments [
1]. How to quickly respond to and minimize the loss of urban floods has become a global concern [
2]. We believe that timely and effective access to urban flood information can help improve a city’s flood prevention capability and emergency management efficiency, as well as provide a reference for predicting and handling potential future events.
Urban flood information refers to information that can represent possible or ongoing flood events, including information related to the event itself (e.g., time, scope, location), residents (e.g., deaths, health and epidemic prevention, psychological counseling), and economic activities (e.g., industrial losses, communication barriers). By piecing this information together, city managers can quickly grasp the whole picture of an event and determine its level, so as to take corresponding action and ensure the normal operation of urban activities. Current urban flood prediction mainly relies on numerical models. However, because of complex underlying surfaces, regional rainfall patterns, and poorly understood flood mechanisms, it is usually difficult for models to reproduce complex urban hydrological processes. Up-to-date flooding information supports the calibration and validation of urban flooding models, thus improving their simulation accuracy and foresight range [
3]. In addition, basic flooding patterns can be obtained through statistics and analysis of information from several historical rainfall events. On this basis, flood-prone and hazardous areas in a city could be easily delineated and would be noticed during future events, thereby reducing the event risk and the loss of all parties.
Traditionally, flood information is acquired in three ways. The first is using multiple types of sensors for field data monitoring. Sensors placed around drainage facilities, flood-prone areas and rivers enable real-time monitoring of regional rainfall and water level changes, as well as measuring proxies such as the Earth’s electromagnetic field [
4,
5,
6]. But in severe floods, debris such as collapsed buildings and branches forms large sheltered areas at the affected location, which may damage the sensors and reduce the collection range. In remote areas, sensors are installed at low densities and are difficult to maintain, resulting in under-reporting. Remote sensing technology, the second way, can solve some of these problems, such as measuring ground data in remote areas with complex terrain and improving the efficiency of data acquisition and processing. However, some thorny problems remain: adverse weather and rapid changes in spatial conditions may affect the behavior of remote sensing instruments, resulting in data loss [
7]; and the resolution of remote sensing images may decrease in smaller affected areas, such as commercial zones with dense buildings and residential areas full of greenery. The third is field investigation, which researchers use for data that are otherwise difficult to access. This approach, however, has low efficiency and accuracy, and generally serves as a supplement to the first two methods.
It is not difficult to notice that current flood information acquisition suffers from slow speed, low coverage, and limited timeliness. In addition to constantly improving technical tools (sensors and remote sensing equipment), an effective breakthrough might come from reconsidering the carriers (participants) of information. Let us shift the focus from professionals (city managers and research institutions) to the public they formerly served. With the distinctive features of large quantity, high density and strong immediacy, data acquired from the public are regarded as having vast potential. Across scientific fields, these “Citizen Data Scientists” can not only improve the scale and scope of data collection and reduce research costs, but also break down the barriers and dimensions of information sharing, allowing researchers to reach the most authentic emotional expressions.
The ways to obtain data from the public mainly include active acquisition based on crowdsourcing [
8] and passive acquisition based on ubiquitous sensing. Unlike crowdsourcing data, ubiquitous sensing data are broader and can be collected anywhere, anytime, and by anyone or anything, such as data from social media, public systems and map applications [
9]. Today, nearly 5 billion people around the world use various electronic devices daily. Free from the limitations of identity, manner, content and time, people can share anything that happens nearby during fragmented moments of their day. We believe that if the value embedded in such data can be fully explored and utilized, it will have a positive impact on information collection for urban flood events.
Existing research has confirmed that data from social media can be processed to complement the flood data obtained by existing collection methods, contributing to a comprehensive view of the full lifecycle of an event [
10,
11]. Although the content posted by the public contains a relatively low percentage of valid data due to various reasons, it can still reflect the characteristics (such as duration, peak, trends, and population movement) of an event well. For example, Rowe [
12] discovered that digital footprint data from Meta Facebook could provide real-time feedback on the survivors’ movement routes, which facilitated the timely arrival of government aid; Ponukumati and Regonda [
13] explored methods of extracting and analyzing information from Twitter data and designed a flood impact score to evaluate flood impact. Some researchers investigated methods for automatically estimating water levels from images posted on Twitter, Facebook, and Instagram, successfully generating flood maps of several severe flood events [
14,
15]. The distribution of regions with high flood susceptibility in a certain city can equally be acquired by analyzing historical social media data to predict possible inundation events in the future. Most existing studies with China as the research area have discussed the identification method for urban storm disasters based on Weibo data and highlighted its successful performance in the actual inundation events [
16,
17]. For example, Wang et al. [
18] revealed the spatiotemporal dynamics of public emotion in response to disaster evolution by establishing the relationship between rainfall and Weibo activity, while Yu and Wang [
19] proposed an analytical method integrating multimodal neural networks and a location recognition model based on Weibo data from extreme rainfall events, demonstrating the application value of social media data with diverse formats in disaster dynamic perception. Beyond just postings, changes in browsing behavior are also able to reflect the severity of an event and its transfer path [
20]. However, it can be seen that existing research primarily analyzes and discusses data from a single social media platform. The user profile and content-sharing style of any one platform may well be skewed, leading to incomplete data acquisition. Considering China’s vast territory, there are great variations in economic conditions, residents’ living habits, and social structures among different regions. Whether the validity of public data is regionally dependent is therefore also a matter of concern.
Thus, this study selected three cities with diverse geographical characteristics and differing developmental levels. For each city, an extreme rainfall event occurring in 2023 was chosen. Relevant data during the rainfall events were extracted from China’s two most popular social media platforms. By analyzing these data alongside urban features, this study comprehensively explores the practicality of multi-source social media data in reflecting and supplementing waterlogging events and their adaptability to spatial variations, and provides suggestions on how such data can be better utilized by scientists and government decision makers.
2. Methods
The study obtained multi-source data on urban floods using web crawler technology and evaluated their availability and universality across various temporal and spatial dimensions. Meanwhile, a topic classification model was constructed by combining the Latent Dirichlet Allocation model and a Support Vector Machine, so as to mine the social media data comprehensively and expand the mapping scope. In the following sections, the actions described in
Figure 1 are addressed in detail.
Step 1: Data crawler.
We employed a keyword-based web crawler technique to collect data. A web crawler is a program that automatically accesses web pages and extracts information according to certain rules and targets. It can accurately and efficiently search for large amounts of required information and help complete the data analysis and visualization, and it is mainly used in search engines and data mining.
The types of data required should be determined before starting to crawl, including the username, user type (official/self-media/VIP/common), release time, text content, pictures, videos and IP address. For short-video apps, additional information such as the title, the link to the work, and auto-tags should also be crawled according to each app’s page settings. After setting the keywords related to urban floods, the release timeframe ought to be limited to a certain event to ensure the timeliness of the data. To enhance the effectiveness, accuracy and precision of the information obtained, crawling is performed only on the main content of the original posts.
The specific crawler process is as follows: establish the pre-login connection using user information from the current website and encrypt the results; make a login request to the website and extract cookie information from the returned URL to complete the simulated login; after setting the search conditions, obtain the initial URL by manually constructing the URL list; then execute the crawler, equipped with the pre-set cookies, and submit a request to that URL via an HTTP library to acquire the HTML structure of the page. XML Path Language (XPath) is used to parse the HTML information, and the required data are stored as a CSV file for further processing. The above steps are repeated to continuously crawl subsequent pages with newly fetched URLs until all pages have been visited. To counter the website’s anti-crawler mechanisms, such as IP blocking and request rate limiting, a random buffer time (60 ± 15 s) is employed to extend the crawling interval. This delay allows the crawler to rest briefly between requests.
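The crawling loop described above can be sketched as follows. This is a minimal illustration, not the study’s actual implementation: the search URL, cookie values, and XPath expressions are placeholders, and only the random 60 ± 15 s buffer mirrors the procedure in the text directly.

```python
# Illustrative sketch of a keyword crawler with pre-set cookies, XPath
# parsing, CSV storage, and a random buffer time (60 ± 15 s) per page.
# The endpoint, cookie, and XPath expressions below are placeholders.
import csv
import random
import time

import requests
from lxml import html

COOKIES = {"session": "<cookie-from-simulated-login>"}  # placeholder
HEADERS = {"User-Agent": "Mozilla/5.0"}

def crawl(keyword: str, start_page: int, end_page: int, out_path: str) -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "time", "text"])
        for page in range(start_page, end_page + 1):
            url = f"https://example.com/search?q={keyword}&page={page}"  # placeholder
            resp = requests.get(url, cookies=COOKIES, headers=HEADERS, timeout=30)
            tree = html.fromstring(resp.text)
            for post in tree.xpath("//div[@class='post']"):  # placeholder XPath
                writer.writerow([
                    post.xpath("string(.//a[@class='user'])"),
                    post.xpath("string(.//span[@class='time'])"),
                    post.xpath("string(.//p[@class='text'])"),
                ])
            # random buffer time (60 ± 15 s) against rate limiting
            time.sleep(random.uniform(45, 75))
```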
Step 2: Data pre-processing.
Delete invalid data from the original dataset, including the data that are: (1) irrelevant to urban flooding; (2) obviously false; (3) lacking valid information (e.g., keywords refer to unrelated concepts or posts are unrelated to rainfall event features); or (4) too short (less than five Chinese characters).
Chinese Word Segmentation (CWS) refers to converting Chinese text into a separate word sequence to eliminate ambiguity when processing the text in models. CWS is the basis of natural language processing and text mining algorithms. Jieba Chinese text segmentation is employed in this step, which is available on GitHub and easy to implement. Jieba offers three modes of operation. Among them, the default mode of sentence cutting is suitable for text analysis. On this basis, the search mode can conduct a second cut on lengthy words to enhance the recall rate. Part-Of-Speech tagging can also be achieved using the posseg module in Jieba. It first pre-processes the input text to segment it into Chinese character sequences. For these sequences, the posseg module can construct a directed acyclic graph based on the prefix dictionary, find the highest-probability path to perform word segmentation and then assign appropriate parts of speech to the words.
Stop word removal refers to deleting words without any semantic value in the text to optimize the result of CWS, including commonly used conjunctions, modal and auxiliary verbs and punctuation. We apply a combination of the Baidu stop word list and Harbin Institute of Technology stop word list in this study.
Step 3: Content recognition.
Location information extraction is carried out through text filtering. After sorting by publishing time, data from the same location during a specific consecutive time period are consolidated. Following this, an open-source location search application is used to obtain the longitude and latitude of the flood points, which are then visualized. The coordinate system used here is the World Geodetic System 1984 (WGS-84).
Topic detection is conducted. The Latent Dirichlet Allocation (LDA) model is adopted to ensure high-level identification and understanding of the topics and potential insights within the original dataset, exhibiting high effectiveness in extracting latent semantic topics from short, sparse texts like social media posts [
18]. The interrelation among the model parameters can be expressed as follows (
Figure 2).
As shown in Figure 2, a word is the basic unit in this concept. A document is a random mixture of N words (w = (w_1, w_2, …, w_N)), and a corpus is a random mixture of M documents [21,22]. α and β are two hyperparameters intended for generating θ (the multinomial topic distribution of a document) and φ (the multinomial word distribution of a topic). Assuming the number of topics is K, for each document, the LDA model randomly samples a topic from θ and then selects a word from the φ corresponding to that topic. The process is repeated until a document is completely generated. Parameters α and β were set to their default values (50/K and 0.01), which are empirical standards in LDA. The topic number K = 15 was selected based on the perplexity calculation.
In contrast to the generative process, topic identification involves statistical inference to estimate the latent topic variables that best explain the observed data. For a specific corpus, the model needs to acquire the topic distribution and word distribution through effective dimensionality reduction in the text data. The formula is [21]

p(w) = \sum_{t=1}^{K} p(w \mid t)\, p(t \mid d) \quad (1)

where p(w) denotes the occurrence probability of each word in the dataset, p(t|d) is the occurrence probability of each topic in a document, and p(w|t) represents the occurrence probability of each word in a topic.
The detailed construction process of the topic recognition model is as follows:
(1) Determine the number of topics for the LDA model. To obtain the appropriate number of topics, it is necessary to identify the topic set with the least similarity. In the visualization of topic classification, reduced overlap of topics leads to improved topic division. Perplexity is, meanwhile, adopted to help define the optimal topic number [21], which is calculated as follows:

\mathrm{Perplexity} = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right)

where N_d is the number of words in the text d. Perplexity quantifies the uncertainty that a given text belongs to a particular topic. Typically, the lower the perplexity, the better the model’s performance in dealing with new text data.
(2) Extract the features using the Gibbs sampling algorithm. This algorithm is frequently used for analyzing multidimensional objectives, wherein single-dimensional sampling is executed successively for each dimension [23]. In the context of LDA, the process iteratively samples a new topic for each word in the corpus based on the current topic assignments of all other words. A topic number t_j is randomly assigned to each word w_j of each text d_s in the corpus, and then Equation (1) becomes

p(t_j = k \mid \mathbf{t}_{\neg j}, \mathbf{w}) \propto \frac{n_{\neg j,k}^{(w_j)} + \beta}{n_{\neg j,k}^{(\cdot)} + V\beta} \left( n_{\neg j,k}^{(d_s)} + \alpha \right)

where the n terms count the current topic assignments excluding word j, and V is the vocabulary size.
Then, the topic number of each word in the corpus is iteratively updated until the sampling reaches convergence. Finally, the probability distribution of each topic is derived.
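The sampling update above can be implemented in a few dozen lines. The following is a self-contained toy implementation of collapsed Gibbs sampling for LDA, with a synthetic word-id corpus and illustrative hyperparameters, not the study’s code.

```python
# Collapsed Gibbs sampling for LDA on a tiny synthetic corpus:
# each word's topic is resampled conditioned on all other assignments.
import random

random.seed(0)

docs = [[0, 1, 2], [1, 3, 3], [0, 2, 2], [3, 1, 0]]  # word ids per document
V, K = 4, 2                       # vocabulary size, number of topics
alpha, beta = 50.0 / K, 0.01      # priors as in the text

# counters: n_dk topic counts per document, n_kw word counts per topic
n_dk = [[0] * K for _ in docs]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
z = []                            # topic assignment t_j for every word w_j

for d, doc in enumerate(docs):    # random initialization
    z.append([])
    for w in doc:
        t = random.randrange(K)
        z[d].append(t)
        n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1

for _ in range(200):              # iterate until (approximate) convergence
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]           # remove the current assignment
            n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
            # p(t_j = k | t_-j, w) ∝ (n_kw + β)/(n_k + Vβ) · (n_dk + α)
            weights = [(n_kw[k][w] + beta) / (n_k[k] + V * beta) *
                       (n_dk[d][k] + alpha) for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[d][j] = t           # record the new assignment
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1

# per-document topic distribution θ derived from the final counts
theta = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
```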
Topic classification. The Support Vector Machine (SVM) is employed to construct the topic classification model for further aggregate analysis of the data due to its robustness in high-dimensional spaces with limited training samples. To train feature words for the current corpus, the SVM is integrated with the LDA–Gibbs topic model above. When the latest data is imported, the classification model can match its topics by assessing feature similarity. In addition, 20% of data are randomly sampled to evaluate model performance [
24]. Precision and F1-Score are used as follows [25]:

\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}

where TP_i represents the data that are correctly categorized as class i, FP_i represents the data belonging to other categories but wrongly classified as class i, and FN_i represents the data belonging to class i but wrongly classified into other categories.
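The classification and evaluation step might look like the following with scikit-learn; the feature matrix and labels are synthetic stand-ins for the LDA-derived feature vectors and annotated topics, with the 20% held-out split from the text.

```python
# SVM topic classifier with a held-out 20% test split and
# macro-averaged precision / F1. Data here are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((200, 15))            # e.g. K = 15 topic proportions per post
y = X.argmax(axis=1) % 3             # synthetic topic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
pred = clf.predict(X_te)

precision = precision_score(y_te, pred, average="macro", zero_division=0)
f1 = f1_score(y_te, pred, average="macro")
```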
4. Results and Discussion
4.1. Social Media Data Extraction
Firstly, data from Weibo and Douyin were collected for the selected rainstorm and flood events that occurred in Beijing, Fuzhou, and Beihai in 2023. The time frame according to rainfall duration and keywords required are shown in
Table 1. Because the Weibo platform displays at most 50 pages of results per visit, appropriate time segmentation was necessary to obtain the complete data. The keywords utilized in the crawling mainly pertained to the weather conditions, the extent of damage, and the resulting consequences. In addition, given the infrequent use of location-based services by Weibo users, we incorporated location-specific keywords based on local Points of Interest (POIs). These included names of major intersections, underpasses, and landmarks (e.g., “Wusi Road” in Fuzhou), used to capture implicit geographic information in posts that omit generic flood terms.
A total of 15,883 pieces of data were obtained as the original dataset and then screened by removing duplicate records, invalid data (including excessively short text), and irrelevant items. The remaining dataset contained 7447 values, as outlined in
Table 2. It is evident that both the raw and processed data acquired from Weibo are considerably more numerous than those from Douyin, highlighting the fact that Weibo remains the primary information source relied upon by the public. However, the residual rates of Weibo and Douyin were 42.7% and 75.5%, respectively, after comparing the data before and after filtering, which demonstrates the fairly high validity of the Douyin data. This is probably attributable to the higher proportion of personal users on Douyin and a page setup that allows users to interact in a comment-based manner, thereby reducing the repetition rate of the raw data. Additionally, a correlation can be observed between the Weibo data quantity during different rainfall events and city scale; that is, in more developed cities, more users share information on Weibo. In contrast, the amount of Douyin data exhibits some randomness. The distinction between Beihai (a fourth-tier city) and Beijing (a first-tier city) is remarkable. The effective data volume of Weibo in Beihai amounts to only about 30% of that in Beijing, whereas the corresponding volume for Douyin is almost 28% greater than that in Beijing, reflecting Douyin’s reach and inclusiveness in smaller cities. This confirms that the user profiles of Douyin and Weibo are complementary in expanding data coverage.
4.2. Spatio-Temporal Validity of Social Media Data
The data collected were counted every 2 h to identify the changing pattern and its correlation with the official rainfall intensity (which was obtained from official emergency management reports provided by local water authorities), as shown in
Figure 5. Although high rainfall intensity does not necessarily lead to flooding events, analyzing it can still reveal how social media data reflect rainfall processes. Pearson correlation analysis revealed a significant positive correlation between rainfall intensity and social media volume, with a time lag of 2 h (R² > 0.6, p < 0.05), statistically confirming the response capability of public sensing. Social media data typically increased within 2–12 h following the event outburst, and then displayed a cyclical fluctuation trend by the day. The peak of user data usually appeared during the “cooling off” period of the event, because users were more concerned about the impact of waterlogging on their personal and professional lives, rather than the forced shutdown of economic activities and traffic as the event unfolded. Also, the variation in user data mirrored residents’ daily timetables. Hours with a greater volume of user data per day were more likely to occur during residents’ commutes amid traffic congestion. And even in the most active period of the event, the number of posts remained mostly below 20 late at night.
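The lagged correlation test can be sketched as follows; the two series are synthetic, constructed only to show the one-bin (2 h) shift, not the study’s measurements.

```python
# Pearson correlation between 2-h rainfall bins and post counts,
# with the post series shifted back by one bin (a 2-h lag).
from scipy.stats import pearsonr

rainfall = [0, 2, 10, 25, 18, 6, 1, 0]    # mm per 2-h bin (synthetic)
posts    = [1, 1, 3, 12, 30, 22, 8, 2]    # posts per 2-h bin (synthetic)

lag = 1                                    # one bin = 2 h
r, p = pearsonr(rainfall[:-lag], posts[lag:])
r_squared = r ** 2
```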
Event A2 in Fuzhou was characterized by a short duration and high intensity of rainfall with a single peak, while the data obtained from both sources were substantial, making the trend concise and representative. Thus, taking
Figure 5b as an illustration, we detail the trend and reliability of social media data along the temporal dimension of a rainstorm event. From 6 pm onwards, water accumulation in the city became severe as the rainfall intensity increased. The rainstorm gradually impacted users’ daily routines as it coincided with the evening rush hour, so the number of related posts rose sharply. At 10:30 pm that day, the Fuzhou government implemented the second-level contingency plan for rainfall and issued the red warning signal an hour later. In this context, a small peak of user attention to the event was observed. Although precipitation peaked at 00:00 on the 6th, users’ concern gradually diminished as they turned to rest. The second surge of attention appeared the next day during the morning rush. In the early afternoon of the 6th, the rainstorm event drew to a close, yet users’ discussion remained fervent: the public remained entangled in the aftermath of the event through a sequence of direct or indirect disasters, and a recovery period was required for society to return to normal.
Then, the social media data were further differentiated by user type based on event A2.
Figure 6 displays the trends in the number of posts published by authoritative, influential news accounts and by general user accounts as separate colored lines. As can be seen, news users appeared to possess marginally stronger insight into events, with their trend line occasionally exhibiting a turning point prior to that of common users. The time difference typically amounted to approximately 2 h, and the advantage was not especially notable. Additionally, data from news accounts frequently showed a rapid increase to a peak followed by a sharp decline, demonstrating greater volatility than that of normal users. That is, normal users showed slightly higher persistence, helping maintain the event’s visibility over an extended period. For example, on 6 September, the data from news users experienced only two peaks, during the morning and evening rush hours, whereas normal users’ focus remained stable for over 14 h until late at night, after they began to notice the flooding situation at 6 am.
The quantity of data posted did not significantly differ between the two types of users, with news users occasionally sharing even more than normal users. In fact, over 40% of the crawled data were released by news accounts, a phenomenon linked to users’ posting habits and the varying focus of news users. Firstly, during periods of heavy rainfall and flooding, many normal users tend to disseminate accurate information to help others avoid being trapped, rather than expressing personal emotions, which creates some duplicated information. Secondly, news accounts can be divided into two categories: news media that report breaking news, such as rescue requests and casualty notifications, and official accounts of government departments that primarily publish professional weather forecasts or officially measured rainfall data. We prefer to focus on the former in the analysis of social media data reflecting and complementing the event situation, i.e., the objective description of what happened. But the latter kind of account amalgamates the benefits of conventional journalism and social media, offering the public reliable information to stimulate increased attention and wider distribution. Hence, we argue that it is possible for all three parties—general users, news users, and official accounts—to collaborate in shaping a social media opinion matrix in a rainstorm event, with each contributing analytical value.
As previously mentioned, users tend to prioritize how the rainstorm affects them and their immediate surroundings, resulting in a higher concentration of social media data in the cities where the events occurred, despite the lack of geographic restrictions. To investigate the spatial distribution of social media data, we tallied the mentions of districts in each urban rainfall event. The findings are depicted in
Figure 7.
Taking
Figure 7a as an example, postings during event A1 predominantly focused on multiple districts in the central and southwestern regions of the city. Among them, the four districts in the city center gained the highest amount of relevant data, with Fangshan, Mentougou, and Daxing Districts following closely behind. The distribution map depicts a gradual decline in data quantity from south to north. Indeed, precipitation was concentrated and widespread, influenced by the typhoon and the peripheral low-pressure cloud system. Districts in the southwest and south experienced exceptionally heavy rainfall accompanied by thunder and strong winds, and the southwestern regions were at higher risk of secondary disasters such as flash floods and building collapses because of their proximity to mountainous areas. According to data from the Meteorological Service, Fangshan and Mentougou Districts had average rainfall of 598.7 mm and 538.1 mm, respectively, affecting nearly 80% of residents’ lives and representing the worst-impacted areas during this event. A comparison with the official map of rainfall distribution clearly indicates the coherence of the social media data’s spatial distribution with the real rainfall process.
4.3. Spatial Characterization Value of Social Media Data
The waterlogging locations were acquired through matching the latitude and longitude of imprecise geo-information taken from social media data. According to relevant government reports, the officially announced flood-prone points of each city are presented in
Figure 8, alongside public data.
Table 3 summarizes the flood-prone points extracted from various social media data and their discrepancies with official points. The supplementary rate is defined as the ratio of newly identified inundation points to official inundation points. Most of the officially given points are concentrated in the urban center of the city, such as Hai Cheng District in Beihai. However, the information shared by individuals with various identities and geographical locations can partially portray the flooding conditions in regions with lower administrative levels. Thus, social media data can help recognize several unknown inundated spots, as a complement to official data. While it is recognized that data provided by non-expert users may be subject to uncertainties like vague descriptions and low resolution, we still consider the potential value of such information in identifying infrastructure problems on a given road, such as pipe leakage or small-scale ground depressions. Also, multi-source inundated points exhibited significant variations in characteristics across cities located in diverse geographical regions. Next, we explore the geospatial features of information shared on social media during the events by considering their distinctive regional contexts to deeply ascertain the universal value of social media data.
4.3.1. Geographical Features
The social media data results were basically aligned with local geographic features. Its vast size endows China with diverse geographical attributes. The three representative cities selected for the study are located in the northern, southeastern coastal and southern regions of China. Beijing is surrounded by mountains in the north-west and flat land in the south and east, driving rainfall runoff towards the plains. During periods of heavy rainfall, water from mountainous regions flows into urban areas and intensifies as the terrain changes, thereby triggering flash floods and urban waterlogging. As illustrated in
Figure 8, most flooding points identified from public data in Beijing were distributed in the southern and southwestern regions of the city, overlapping with the actual occurrence of localized waterlogging. Fuzhou has a diverse terrain comprising plains, hills, and mountains, with an overall basin inclined from west to east. The city experiences plentiful rainfall and has a dense water network. The Min River, which ranks seventh among China’s largest rivers, runs through the city and ultimately empties into the sea via the Changle District. It is evident that the inundation points from both official sources and social media in Fuzhou were focused within the Min River basin, representing the greatest level of overlap compared to the other two cases. Beihai has a mostly flat landscape, with an average altitude of 10–15 m. Despite being a coastal city like Fuzhou, it has distinct wet and dry seasons and receives much less rainfall than Fuzhou. Whilst the amount of social media data and the flooding points collected in Beihai were limited, they were still discerned to be mainly situated in the southwestern region, where the terrain is notably low and the area next to the sea is substantial.
4.3.2. Demographic Characteristics
The population size also influences the results obtained from the data analysis. As the capital city of China, Beijing possesses the largest permanent population (21.9 million people by the end of 2023), mainly residing in the central and southern urban areas (
Figure 9). Fuzhou and Beihai have 38.6% and 8.6% of Beijing’s population, respectively. Although youth and middle-aged (15–59) individuals, who are considered the backbone of society, constitute approximately 65% of the population in all three cities, the vast difference in population base remains significant, with disparities in the number of potential users willing to adopt new technological devices and sharing platforms. This partly explains one of the notable phenomena in the graph: Beijing, the only inland city, with neither the highest annual precipitation nor the highest waterlogging risk, had a far higher amount of social media data and more waterlogging points than any other city.
4.3.3. Difference in Regional Development
Differences in the social development levels of cities constitute another factor for this phenomenon. The Gini coefficient in mainland China has remained around 0.40 over an extended period, indicating a significant issue of uneven development [
24]. In recent times, the openness of society and the flow of resources have contributed to the growing north–south gap and east–west gap. Also, unbalanced development exists within the city. Rural populations have been migrating to towns and cities for better social welfare, employment opportunities and educational quality, causing a depletion of labor and resources in the underdeveloped regions. But the first-tier cities always benefit from extensive resources that enable a stronger focus on overall city development, rather than concentrating on a specific area. In this study, the inundation points from social media in the three cities were all distributed mainly within the main urban areas, yet a considerable number of public findings were also reported in the secondary regions of Beijing.
The imbalanced development results in substantial disparities in the industrial structure and economic status among various cities.
Table 4 illustrates this economic disparity. Differences in public budgets and education expenditure correlate directly with the “digital divide”, which explains the lower volume of high-quality data in less developed cities. As city scale expands, the industrial structure is optimized, benefiting social and economic outcomes as well as population quality. The aggregate of each indicator in Beijing overwhelmingly surpasses that of the other cities, particularly public expenditure on education. High-quality educational resources raise residents’ awareness and acceptance of emerging technologies, motivating them to pay closer attention to urban construction and flood prevention, which in turn stimulates the production of richer and more valuable data on flood-prone areas. By comparison, Beihai, the least developed of the three cities, has users who mostly belong to a down-market (consumer-oriented) segment and tend to prioritize immediate gratification over rigorous analysis and scientific thought. So, although Beihai produced 28.3% more effective data on Douyin than Beijing, its volume of data with precise descriptions of events and locations was merely 45.6% of Beijing’s.
The city’s infrastructure status is another reflection of economic disparities. Fuzhou experiences the greatest frequency of heavy rainfall events and the highest flood risk among the three cities, and indeed nationwide. Its extracted data and identifiable waterlogging points, however, ranked only in the middle, which may be related to the evolution of water management strategies in Fuzhou. In August 2015, Typhoon Soudelor hit Fujian Province. Torrential rain raised the water level in the inland river and, combined with the high tide of the Min River, caused certain river segments to overflow into the city. A total of 956,000 people in the city were affected by the rainfall, and the direct economic losses amounted to about 535.98 million dollars. Drawing on the lessons of this rare rainstorm, the Fuzhou government reviewed its water management measures and established a more effective policy framework alongside enhanced flood prevention and drainage systems. In this context, the government centralized the governance of the urban flood disaster risk reduction system, previously scattered across various administrations, and integrated information technologies such as IoT and big data to establish a pioneering water joint-scheduling center. Since this mature new water management system was implemented, waterlogging in the city has decreased in both frequency and duration, although sporadic events still occur; this may also contribute to the lower amount of waterlogging data from social media within the city.
4.4. Topic Classification Results and Analysis
The LDA model was constructed using data from three rainfall events as training samples, and the optimal number of topics was determined to be 15 (K = 15). To validate the classification performance, we used the 20% hold-out dataset mentioned in Equation (4). The Support Vector Machine (SVM) classifier achieved an average precision of 88.4% and an F1-score of 85.2% across the five topics, demonstrating robust performance in identifying flood-related categories. The connections between the word sets were further analyzed to merge similar items. Eventually, five main topics were identified: real-time weather, disaster situation, rescue information, damage and losses, and emotion. Statistics were computed separately for each topic to quantify the amount of data available. The results are visualized in
Figure 10.
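The pipeline described above (LDA topic features, then an SVM evaluated on a 20% hold-out split) can be sketched as follows. This is an illustrative toy example, not the authors’ code: the mini-corpus, labels, and parameter choices are hypothetical, with only K = 15 and the 20% split taken from the text.

```python
# Sketch of the described pipeline: LDA topic distributions as features,
# an SVM classifier, and evaluation on a 20% hold-out split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, f1_score

# Hypothetical flood-related posts, two per topic (weather, disaster,
# rescue, damage, emotion), repeated to give the split enough samples.
posts = [
    "heavy rainfall warning issued for the city tonight",
    "typhoon approaching coast expect strong wind and rain",
    "street near the mall is submerged cars cannot pass",
    "underpass flooded water level rising fast",
    "rescue team evacuated residents everyone is safe",
    "volunteers helping with rescue boats downtown",
    "shop flooded goods damaged losses are heavy",
    "car engine ruined by flood water repair cost is high",
    "why does this happen every year so frustrating",
    "feeling anxious stuck at home because of the flood",
] * 5
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] * 5  # five topics, as in the paper

X_counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=15, random_state=0)  # K = 15
X_topics = lda.fit_transform(X_counts)  # document -> topic distribution

X_tr, X_te, y_tr, y_te = train_test_split(
    X_topics, labels, test_size=0.2, random_state=0, stratify=labels)
clf = LinearSVC().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("precision:", precision_score(y_te, pred, average="macro", zero_division=0))
print("F1:", f1_score(y_te, pred, average="macro", zero_division=0))
```

On real social media data the feature extraction and label set would of course be far richer; the sketch only shows how topic distributions can serve as classifier inputs and how the hold-out evaluation is wired together.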
From the results, public opinion on urban flooding was primarily associated with the damage and losses caused by intense events, which affected users’ daily lives and ability to travel. Real-time weather and the disaster situation held the next highest importance, constituting 21.6% and 18% of all topics, respectively. Recently, the government’s focus on climate change and urban drainage construction has raised citizens’ awareness of waterlogging. Construction achievements have generated excitement, as they are expected to increase urban resilience and help people return to their daily routines quickly. We then analyzed the topic distribution of data from the different social media platforms through segmented word matching. The results revealed discernible user profiles and distinct atmospheres on the two platforms. Although both datasets expressed significant concern about disaster damage, in line with the overall trend, Weibo users shared a greater quantity of information and opinions regarding the disaster, whereas Douyin users displayed richer emotional reactions. Integrating both sources could provide decision makers with a fuller picture of the situation and of public sentiment, enabling them to select an appropriate and prompt emergency response plan.
Another question is whether residents in less developed areas are conscious of sharing and disseminating immediate information and seeking relief while posting. We analyzed the frequency of typical topic terms to investigate the concerns and sentiment changes among residents during rainfall (in
Figure 11). Real-time weather conditions were the topic of greatest concern to users. The words “rainfall” and “warning” appeared more often than other words related to this topic. In particular, the frequency of “typhoon” in Fuzhou users’ posts was considerably greater than in the other two cities. From June to September every year, Fuzhou experiences the typhoon season, which is responsible for the majority of flooding in the city. Concerned about a typhoon’s intensity and track, users expect to receive up-to-date information from social media platforms. The disaster zone and rescue communication were two other popular subjects of user discussion. High-frequency words such as “rescue”, “safe”, and “submerged” showcase social media’s capacity to broaden response channels in emergencies while also conveying users’ unease and apprehension. Compared with their use of such terms, however, users were not accustomed to expressing their emotions directly in text. “Why” was the most popular emotional word in Beijing and Fuzhou; in context, it usually indicated that users harbored reservations and disappointment regarding the occurrence and handling of waterlogging events, underscoring the importance they attach to urban flood control. The sentiment hot word in Beihai was “anxious”: residents of smaller cities were more interested in how the event had affected their own arrangements than in questioning the monitoring system and response procedures behind it. Despite the limited public data available for Beihai, its distribution trend was consistent with that of the larger cities while still reflecting the city’s unique characteristics. We therefore recommend that such cities increase their focus on social media to familiarize residents with the platform communication model, which may yield unexpected benefits in emergency situations.
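The per-city keyword-frequency analysis above can be illustrated with a minimal sketch. The keyword lists and toy posts below are hypothetical stand-ins, not the study’s actual term dictionaries or data; only the example terms (“rainfall”, “warning”, “typhoon”, “rescue”, “safe”, “submerged”, “why”, “anxious”) come from the text.

```python
# Minimal sketch: count how often tracked topic keywords appear
# in each city's posts, the kind of tally behind Figure 11.
from collections import Counter

# Hypothetical keyword lists per topic (example terms from the text).
topic_terms = {
    "weather": ["rainfall", "warning", "typhoon"],
    "rescue":  ["rescue", "safe", "submerged"],
    "emotion": ["why", "anxious"],
}

# Toy per-city post collections (already tokenizable by whitespace).
city_posts = {
    "Fuzhou": ["typhoon warning heavy rainfall",
               "rescue boats everyone safe"],
    "Beihai": ["anxious about the rainfall",
               "street submerged again why"],
}

def term_frequencies(posts, tracked):
    """Count occurrences of each tracked term across a list of posts."""
    counts = Counter()
    for post in posts:
        for token in post.split():
            if token in tracked:
                counts[token] += 1
    return counts

all_terms = {t for terms in topic_terms.values() for t in terms}
for city, posts in city_posts.items():
    print(city, dict(term_frequencies(posts, all_terms)))
```

Real Chinese-language posts would need a proper word segmenter (e.g., a segmentation library) rather than whitespace splitting; the sketch only conveys the counting logic.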
4.5. Discussion
Our findings regarding the temporal lag of social media data align with previous studies [
29], confirming that peak public attention typically trails peak rainfall by 2–4 h. However, unlike studies focusing solely on first-tier cities, our cross-regional analysis reveals that this lag is more pronounced in lower-tier cities due to different user habits. The integration of Weibo (text-heavy) and Douyin (video-heavy) proved crucial. While Weibo provided higher temporal resolution, Douyin contributed unique visual confirmations of flood depth, validating the necessity of multi-source fusion proposed in our methodology. The primary uncertainty lies in deviations in socioeconomic development, as shown in
Table 4 and
Figure 9. Beihai’s lower GDP and education budget correlate positively with its sparse effective data, suggesting that social media analysis may underestimate risks in underdeveloped regions. Furthermore, social media data are published by users, which inevitably introduces a degree of subjectivity: different users describe the waterlogging degree according to their own perceptions and adaptability. Relying solely on textual location descriptions (e.g., “near the mall”) can also introduce spatial positioning errors. In the future, computer vision techniques could be integrated into social media data processing to automatically extract information from images and address these limitations.
Despite some limitations (a low percentage of valid data, ambiguous location information, etc.), social media data were shown to be a valuable source of information for urban flood management. How to increase the potential value of social media data and make them more useful in scientific research and urban management is the main challenge ahead. To advance in this direction, we offer suggestions from two perspectives: data analysis and city management:
(1) Image and video analysis techniques should be refined to improve the scalability and application efficiency of social media data. The existing methods comprise utilizing known-sized objects in a single frame to determine flood depth [
30] and implementing machine learning methods for automatic image classification, filtering, and real-time processing [
31]. However, several issues remain to be addressed: the quantity of available data is limited, and obtaining videos and images without accompanying text is arduous; differences in image quality and category sometimes prevent the construction of representative training datasets; and privacy concerns may arise. Considering the perspective of city managers could partially alleviate these problems.
(2) City managers ought to guide the market and the public, establish sound shared values and foster an environment of scientific democratization, so as to fundamentally boost the availability of data. Firstly, deeper integration of industry (platforms) and scientific research should be promoted, for example by establishing dedicated data channels for research. Secondly, regulating and upholding transparency and fairness on platforms is crucial to stimulate users’ willingness to share. Finally, cultivating a public sense of scientific ownership and developing new disciplines, such as public hydrology, that explore new forms of ubiquitous data to enhance traditional research paradigms are highly promising. When citizens recognize that what they share can function as significant scientific data, the overall quality of such data can be greatly enhanced.