Optimized CNN–LSTM Modeling for Crisis Event Detection in Noisy Social Media Streams

Wani, Mudasir Ahmad

doi:10.3390/math14081369


College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 13318, Saudi Arabia

Mathematics 2026, 14(8), 1369; https://doi.org/10.3390/math14081369

Submission received: 26 February 2026 / Revised: 4 April 2026 / Accepted: 15 April 2026 / Published: 19 April 2026

(This article belongs to the Special Issue Deep Representation Learning for Social Network Analysis)


Abstract

Event detection is crucial for disaster response, public safety, and trend analysis, enabling real-time identification of critical events. Social media platforms provide a vast content source, offering timely and diverse event coverage compared to traditional news reports. However, challenges arise due to the informal and noisy nature of the text, along with the limited availability of ground truth data for training models. This study introduces SOCIAL (Social Media Event Classification using Integrated Artificial Learning and Natural Language Processing), a mathematically grounded framework for real-time social media event detection. SOCIAL integrates a formal representation of social media text with a customized CNN–LSTM architecture, combining convolutional operations for local feature extraction with sequential modeling to capture temporal dependencies, thereby enhancing classification accuracy. Generative AI is employed to create synthetic event-related samples, addressing data scarcity and ensuring a balanced dataset, while the design incorporates quantitative principles to guide embedding selection and model optimization. This study systematically evaluates six experimental configurations with TF-IDF and Word2Vec embeddings. The TF-IDF-based CNN–LSTM model achieved top performance with 98.59% accuracy, 98.13% precision, 99.06% recall, and 0.9719 MCC. Additionally, the $F_{0.5}$, $F_{1}$, and $F_{2}$ scores were 98.31%, 98.59%, and 98.87%, respectively, confirming the model’s strong predictive capabilities. TF-IDF integration enhanced event-specific term recognition, reducing misclassifications and improving reliability. These results demonstrate that SOCIAL is not only a fast, accurate, and scalable tool for crisis event detection, but also a formally principled framework for modeling and analyzing social media signals.

Keywords:

social network analysis; event detection; deep learning; natural language processing; generative AI; convolutional neural networks (CNNs); long short-term memory (LSTM)

MSC:

68T07

1. Introduction

Event detection automatically identifies events from online social network (OSN) data, providing details about what occurred, when, where, to whom, and why [1]. It is crucial in areas such as disaster management [2], public safety, marketing, and trend analysis. The rise of social media provides real-time, user-generated content such as tweets, posts, and videos that offers immediate insights into unfolding events. Leveraging this content, deep learning, particularly in Natural Language Processing (NLP) and computer vision, enables automatic, near-real-time event detection and classification. The field has advanced rapidly, fueled by large-scale data, modern neural architectures, and increased computational power [3].

Early research in social media event detection primarily relied on keyword-based and statistical approaches to identify bursts in term frequency and co-occurrence patterns across platforms such as Twitter [4]. These traditional methods, although effective in detecting sudden events, often struggle with noise, ambiguity, and a lack of contextual understanding. Subsequently, machine learning-based approaches improved detection accuracy by incorporating feature engineering techniques such as TF-IDF and topic modeling [5]. With the evolution of deep learning, more sophisticated models have been developed that leverage contextual embeddings and sequential modeling to better capture semantic and temporal dependencies in social media streams. Recent studies have further explored hybrid and ensemble approaches, integrating multiple models and data sources to enhance robustness and scalability in real-time event detection systems [2].

Unlike traditional event detection that relies on delayed, structured data from news or official sources, social media offers real-time, unfiltered insights directly from the public. Deep learning models such as Convolutional Neural Networks (CNNs) [6], Recurrent Neural Networks (RNNs), and transformers excel in processing this noisy, informal text by capturing context, timelines, and latent meaning. For instance, during the COVID-19 pandemic [7], social media posts revealed early signs of outbreaks and shifting public sentiment on vaccines and lockdowns long before formal announcements. Similarly, in natural disasters, real-time posts using keywords like “evacuation” or “rescue” provided critical information on affected areas. Such timely insights enhance situational awareness and accelerate emergency response efforts.

Because social media platforms continuously generate vast streams of posts, images, videos, and metadata, the resulting information can be formalized mathematically as a high-dimensional input space $X \subset \mathbb{R}^{d}$. Each data instance in this space corresponds to a multi-modal, high-dimensional representation of social media activity. Formally, each social media instance is denoted as $x \in X$, while the corresponding crisis state is represented by a label $y \in Y = \{0, 1\}$, where $y = 1$ indicates a crisis-related signal and $y = 0$ denotes non-crisis content. The central objective of crisis detection is to learn a mapping,

$$f : X \to Y, \tag{1}$$

such that $f(x)$ accurately predicts the crisis state of previously unseen signals.

To endow the mapping $f$ with strong representational power, recent approaches integrate deep neural networks with generative modeling paradigms. In a probabilistic formulation, the prediction is the label that maximizes the posterior probability:

$$f(x) = \arg\max_{y \in Y} P(y \mid x), \tag{2}$$

where $P(y \mid x)$ quantifies the probability that an observed signal $x$ corresponds to a crisis event. Since direct computation of the posterior is often intractable due to the high dimensionality and heterogeneity of social media data, generative models approximate the data distribution using a latent representation $z \in Z \subset \mathbb{R}^{k}$. This forms the basis for the generative process:

$$x \sim P_{\theta}(x \mid z), \quad z \sim P(z), \tag{3}$$

where $P(z)$ is commonly chosen as a multivariate Gaussian prior and $P_{\theta}(x \mid z)$ is a parameterized decoder. The latent representation itself is inferred via the encoder distribution:

$$z = E_{\phi}(x), \tag{4}$$

with $\phi$ denoting the encoder parameters. This latent space provides a compressed, noise-reduced abstraction of the social media input, allowing the model to capture underlying crisis-related semantics more effectively.

To train the generative framework, the reconstruction-based objective function $L_{rec}$ is employed:

$$L_{rec} = \mathbb{E}_{z \sim E_{\phi}(x)} \left[ \| x - D_{\theta}(z) \|^{2} \right], \tag{5}$$

where $D_{\theta}(\cdot)$ represents the decoder. This loss encourages the model to preserve important structural information in the latent space. In addition, a classification loss $L_{cls}$ guides the latent representation toward discriminative crisis features:

$$L_{cls} = - \sum_{i = 1}^{N} y_{i} \log f(x_{i}), \tag{6}$$

which ensures accurate crisis prediction over a dataset of $N$ labeled samples.

The final optimization problem thus combines the reconstruction and classification objectives using a weighted formulation:

$$L = \lambda_{1} L_{rec} + \lambda_{2} L_{cls}, \tag{7}$$

where $\lambda_{1}$ and $\lambda_{2}$ control the contribution of generative learning and discriminative learning, respectively. This unified formulation leverages the strengths of both deep learning and probabilistic modeling, enabling robust crisis event detection even in noisy, imbalanced, and rapidly evolving social media environments.

Overall, this mathematical formulation establishes a principled framework for crisis detection, connecting high-dimensional input representations, latent generative modeling, and supervised classification within a single end-to-end trainable architecture.
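As a minimal numerical sketch of Equations (5)–(7), the snippet below evaluates the weighted objective on a toy batch. Everything in it is an illustrative assumption rather than the SOCIAL implementation: the function names, the toy values, the choice of $\lambda_{1}$ and $\lambda_{2}$, and in particular the reading of $f(x_{i})$ as a predicted crisis probability in (0, 1], which Equation (6) needs for the logarithm to be defined.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Eq. (5): batch mean of the squared L2 error between input and reconstruction."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def classification_loss(y, p, eps=1e-12):
    """Eq. (6) as printed: -sum_i y_i * log f(x_i).

    Assumption: p[i] stands for f(x_i) read as P(y = 1 | x_i).
    """
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    return float(-np.sum(y * np.log(p)))

def total_loss(x, x_hat, y, p, lam1=1.0, lam2=1.0):
    """Eq. (7): weighted sum of the reconstruction and classification terms."""
    return lam1 * reconstruction_loss(x, x_hat) + lam2 * classification_loss(y, p)

# Toy batch of two 2-dimensional inputs (hypothetical values).
x = np.array([[1.0, 0.0], [0.0, 1.0]])
x_hat = np.array([[0.5, 0.0], [0.0, 1.0]])  # decoder output D_theta(z)
y = np.array([1, 0])                        # 1 = crisis, 0 = non-crisis
p = np.array([0.5, 0.2])                    # predicted crisis probabilities

loss = total_loss(x, x_hat, y, p, lam1=0.5, lam2=2.0)
```

Note that Equation (6) as printed only penalizes samples with $y_{i} = 1$; a full binary cross-entropy would add a matching term for $y_{i} = 0$. The sketch follows the printed form.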

Beyond this formal representation of crisis detection, NLP is pivotal in extracting meaningful insights from vast, unstructured social media text, enabling efficient event detection [8]. By applying techniques such as text classification, sentiment analysis, topic modeling, and entity recognition, NLP can identify keywords, emotions, and contextual patterns indicative of specific events. For instance, during the COVID-19 pandemic, NLP models could detect spikes in discussions about quarantine, hospital shortages, or vaccination drives, signaling emerging health crises or public concerns [9]. Similarly, during floods in Pakistan and Sri Lanka, NLP algorithms could analyze posts mentioning terms like “floodwater”, “rescue”, or “evacuation”, helping to pinpoint affected areas and prioritize relief efforts. These two countries are highlighted because the datasets used in this study contain tweets about flood incidents there. With NLP’s ability to process multilingual and noisy data, social media platforms become a rich source of real-time information, empowering researchers and organizations to monitor, detect, and respond to events with unprecedented speed and precision.

We propose an AI-based system named SOCIAL (Social Media Event Classification using Integrated Artificial Learning and NLP) for real-time event detection from social media streams, focusing on COVID-19 and flood-related incidents. Leveraging the richness of social media data and the power of deep learning, our approach supports applications in disaster management, public safety, marketing, and trend analysis.

Contributions

The proposed study’s major contributions are as follows:

Customized Deep Learning Architecture for Social Media Event Classification: This study presents SOCIAL, a custom deep learning framework for real-time event classification from social media. Unlike standard CNN–LSTM models, SOCIAL features a restructured architecture tailored toward short, noisy, and unstructured text. It combines CNN for local feature extraction and LSTM for sequence modeling to boost accuracy. The study also evaluates the effects of embedding choices (TF-IDF, Word2Vec) and generative AI-based data augmentation to ensure balanced training and enhanced performance.
Integration of Generative AI for Data Augmentation: This study employs generative AI, such as ChatGPT (GPT-5.3), to generate synthetic event-related data, addressing issues of data scarcity and imbalance in social media event detection. Generated samples are validated through a human-in-the-loop process involving three domain experts, ensuring semantic accuracy and contextual relevance. Disagreements in annotations are resolved via majority voting, producing high-quality, diverse datasets.
Feature Engineering Optimization for Short-Text Event Classification: This study presents a comparative analysis of TF-IDF and Word2Vec embeddings for event classification in social media streams. Given the short and noisy nature of social media text, we evaluate their performance across CNN, LSTM, and CNN + LSTM models. Results show that TF-IDF, by highlighting event-specific term importance, outperforms Word2Vec in accuracy, recall, and robustness, addressing context sparsity more effectively. These findings offer practical guidance for selecting feature representations in real-time NLP-based event detection systems.
Comprehensive Evaluation Metrics: This study employs an extensive evaluation framework beyond standard accuracy, precision, recall, and F1-score to ensure a rigorous assessment of social media event classification models. Matthews Correlation Coefficient (MCC), Negative Predictive Value (NPV), False Positive Rate (FPR), False Discovery Rate (FDR), and False Negative Rate (FNR) provide deeper performance insights, addressing imbalanced and high-stakes classification challenges. Additionally, F-score variants ( $F_{0.5}$ , $F_{1}$ , and $F_{2}$ ) refine the balance between precision and recall for different event detection priorities. This approach enhances model generalizability, ensuring reliable decision-making in real-world applications such as disaster response and misinformation analysis.
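The metrics named in the last contribution all follow from the four confusion-matrix counts. The sketch below uses their standard definitions; the function names and the counts in the usage example are hypothetical and are not taken from the paper's experiments.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "npv": tn / (tn + fn),   # Negative Predictive Value
        "fpr": fp / (fp + tn),   # False Positive Rate
        "fdr": fp / (fp + tp),   # False Discovery Rate
        "fnr": fn / (fn + tp),   # False Negative Rate
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts, for illustration only.
m = confusion_metrics(tp=90, fp=10, tn=85, fn=15)

# Precision and recall as reported in the abstract (98.13% and 99.06%).
scores = {beta: f_beta(0.9813, 0.9906, beta) for beta in (0.5, 1.0, 2.0)}
```

Plugging the reported precision and recall into `f_beta` gives approximately 0.9831, 0.9859, and 0.9887 for $\beta$ = 0.5, 1, and 2, which agrees with the $F_{0.5}$, $F_{1}$, and $F_{2}$ values stated in the abstract.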

2. Literature Survey

Crisis-event detection from social media has attracted significant research attention in recent years; this section examines key contributions, model architectures, and remaining challenges in the field. A system called “EARWORM” was introduced in [10]. This system utilizes Twitter data to detect earthquakes in Japan. The authors performed keyword-based filtering and applied a machine-learning algorithm to these data. Another study proposed a system called FireExpert [11]. In this work, the authors presented a two-stage framework for fire event detection and their assessment. The first stage uses a multi-band remote sensing method and environmental images to classify fire types and map affected areas. The second stage integrates detection results of stage one with social media data using a large language model for real-time impact assessment. Evaluations carried out by the authors on a real-world dataset show an F1-score of 61.0% and a mean average precision (mAP) of 57.7%, surpassing state-of-the-art methods. Similarly, SatCoBiLSTM, a self-attention-based hybrid deep learning framework for crisis event detection in noisy social media streams, was proposed in [12]. It combines multi-scale convolution, Bidirectional Long Short-Term Memory (BiLSTM), and self-attention to capture local, contextual, and critical crisis features. Experiments were conducted on three benchmark datasets and achieved an F1-score of 96%, 94%, and 95%, with 1%, 1%, and 6% improvements, respectively, over state-of-the-art methods. A probabilistic model that identified bursts of tweets related to specific events was proposed in [13]. The study utilized bursty topic modeling and demonstrated its effectiveness in capturing event-related discussions on Twitter. Similarly, in [14], TwitterMonitor, a system, was proposed for trend detection over the Twitter stream. The authors in this study introduced the concept of burstiness. 
They also utilized content similarity to detect and track emerging events in real time. Ref. [15] introduces DisTGranD, a 47,600-tweet crisis dataset with dual-layer event classification using the Automatic Content Extraction (ACE) standard. In this study, the authors achieved high inter-annotator agreement (Kappa: 0.90, 0.93). They also utilized XLNet, which attains 96–97% intra-label similarity. The authors proposed the RoBiCCus model, which outperforms existing models on DisTGranD and public disaster datasets. Similarly, the study in [16] proposes CrisisSpot, a Graph-based Neural Network (GNN) that integrates textual, visual, and Social Context Features (SCF) for disaster analysis. It introduces the concept of Inverted Dual Embedded Attention (IDEA) to capture complex multimodal interactions. The proposed method, CrisisSpot, outperforms other state-of-the-art methods, achieving F1-score gains of 9.45% on the Crisis Multimodal Dataset (CrisisMMD) and 5.01% on TSEqD.

Clustering methods to detect events in news streams have also been utilized in some studies [17]. These studies leverage temporal and content features for improved event clustering. Another study [18] extended this work to event tracking, using supervised learning to label incoming news efficiently with minimal training data. Similarly, ref. [19] conducted a user study exploring the integration of news articles and user-generated content for event-related information retrieval. They assessed how combining these two sources meets user needs. A comparison of news streams and tweets in event reporting was performed in [20], using discrete dynamic topic modeling and a Hidden Markov Model for event detection, followed by clustering news documents and analyzing the timeliness of tweets and articles in reporting events. The study in [21] addresses the challenges of Event Extraction (EE) from unstructured web data by proposing a novel event ontology called Cofee. It integrates expert knowledge, prior ontologies, and a data-driven approach to enhance event identification.

The authors of [22] propose an early warning system for epidemics of diseases such as meningitis or COVID-19 using Twitter data. They employ NLP techniques and domain ontologies to classify tweets, followed by geolocation and timestamp extraction from metadata. Real-time tweet identification is performed using SVM and CNN models, with the CNN model achieving an accuracy of 0.99. Similarly, ref. [23] proposes a CNN–BiLSTM hybrid model with an attention mechanism to identify resource-related tweets for disaster recovery. Trained on earthquake tweets from Nepal (2015) and Italy (2016), it outperforms state-of-the-art methods in in-domain and cross-domain experiments. Results show that leveraging local key information and global term dependencies improves crisis text classification, aiding efficient resource allocation. Another study [24] used Space–Time Scan Statistics (STSS) to detect events on Twitter without predefined keywords. Applied to the 2013 London helicopter crash, STSS identified a short-lived but informative cluster of tweets. The method also detected other events, such as football matches and transportation delays, showcasing its effectiveness for spatio-temporal event analysis. In a similar study [25], the authors proposed event detection algorithms for short- and long-term Twitter trends. These algorithms use dynamic metrics such as time-sensitive IDF and fuzzy representations to capture temporal shifts. Events were detected via multi-assignment spectral graph partitioning, enabling words to belong to multiple clusters. Similarly, ref. [1] proposes a novel framework to detect composite social events by integrating content, time, and location dimensions using a Location–Time-Constrained Topic (LTT) model. This model represents social messages as topic distributions and measures similarity via distribution distances.
Events are identified through efficient similarity joins, accelerated by a variable-dimensional extendible hash.

Some of the studies in the literature are based on deep learning. For example, a dual-CNN, semantically enhanced deep learning model for crisis event detection from social media was proposed in [26]. It achieved over 79% F-measure for event types but 61% for fine-grained details, competing with traditional models like SVM. Similarly, ref. [27] uses word embeddings to map tweets into numerical vectors and classify them into non-traffic, traffic incidents, or traffic information using CNN and RNN models. In this study, a dataset of 51,100 tweets was collected, labeled, and publicly released.

The literature reviewed in this section shows that crisis-event detection from social media has been approached in many ways, including traditional machine learning models, deep learning architectures such as CNN and LSTM, and, more recently, transformer-based models (e.g., BERT, RoBERTa, and DeBERTa). While transformer-based approaches have demonstrated strong performance in many NLP tasks, they are primarily designed for large-scale corpora and long-context modeling, which may not directly align with the characteristics of social media data.

In particular, social media text is typically short, noisy, and context-sparse, often containing informal language, abbreviations, and irregular structures. In such scenarios, the advantages of deep self-attention mechanisms may be limited, while the computational complexity and resource requirements of Transformer models remain high. Moreover, real-time crisis-event detection systems demand efficient models capable of fast inference and deployment under resource-constrained environments.

To address these challenges, this study proposes the SOCIAL framework, which adopts a hybrid CNN–LSTM architecture specifically optimized for short-text event detection. The CNN component captures local n-gram patterns and event-specific lexical cues from noisy inputs, while the LSTM component models sequential dependencies and contextual flow within the text. This combination enables effective feature learning without relying on large-scale pretraining or extensive computational resources.
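To make the division of labor between the two components concrete, the following NumPy sketch runs one forward pass through a convolution over time followed by an LSTM and a sigmoid output. It is an illustration under stated assumptions, not the SOCIAL architecture: the layer sizes, the gate ordering, the absence of pooling and dropout, the random weights, and the representation of a tweet as a sequence of `d`-dimensional token vectors are all hypothetical choices that this section of the paper does not specify.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution over time with ReLU.

    x: (T, d) token vectors; kernels: (k, d, c); returns (T - k + 1, c).
    Each output row summarizes one window of k consecutive tokens (an n-gram).
    """
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.stack([
        np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)

def lstm_last_state(seq, W, U, b):
    """Single-layer LSTM over seq (T, c); returns the final hidden state.

    W: (c, 4h), U: (h, 4h), b: (4h,). Assumed gate order: input, forget, cell, output.
    """
    hidden = np.zeros(U.shape[0])
    cell = np.zeros(U.shape[0])
    for x_t in seq:
        i, f, g, o = np.split(x_t @ W + hidden @ U + b, 4)
        cell = sigmoid(f) * cell + sigmoid(i) * np.tanh(g)
        hidden = sigmoid(o) * np.tanh(cell)
    return hidden

def cnn_lstm_forward(x, params):
    """Local n-gram features (CNN) -> sequence summary (LSTM) -> probability."""
    feats = conv1d_relu(x, params["conv_w"], params["conv_b"])
    hidden = lstm_last_state(feats, params["W"], params["U"], params["b"])
    return float(sigmoid(hidden @ params["out_w"] + params["out_b"]))

# Hypothetical sizes: 12 tokens, 8-dim vectors, trigram filters.
rng = np.random.default_rng(0)
T, d, k, n_filters, n_hidden = 12, 8, 3, 4, 5
params = {
    "conv_w": rng.normal(scale=0.1, size=(k, d, n_filters)),
    "conv_b": np.zeros(n_filters),
    "W": rng.normal(scale=0.1, size=(n_filters, 4 * n_hidden)),
    "U": rng.normal(scale=0.1, size=(n_hidden, 4 * n_hidden)),
    "b": np.zeros(4 * n_hidden),
    "out_w": rng.normal(scale=0.1, size=n_hidden),
    "out_b": 0.0,
}
tweet = rng.normal(size=(T, d))
prob = cnn_lstm_forward(tweet, params)
```

With untrained random weights the output is only a placeholder probability; the point of the sketch is the data flow, in which the LSTM consumes the convolutional feature sequence rather than the raw token vectors.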

Furthermore, the proposed framework integrates feature engineering techniques (TF-IDF and Word2Vec) with generative AI-based data augmentation to enhance data representation and address class imbalance. This holistic design differentiates SOCIAL from existing approaches by combining architectural efficiency, feature optimization, and data augmentation within a unified pipeline.
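For readers unfamiliar with TF-IDF weighting, the sketch below computes it for three toy tweets in plain Python. It uses the unsmoothed textbook variant (term frequency times $\log(N / df)$); the paper does not state which variant was used, and library implementations often apply smoothing and normalization, so the exact numbers here are illustrative only. The function name and the example tweets are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (cnt / len(doc)) * math.log(n_docs / df[term])
            for term, cnt in counts.items()
        })
    return weights

# Hypothetical toy tweets.
docs = [
    "flood water rising near the bridge".split(),
    "covid cases rising in the city".split(),
    "rescue teams near flood area".split(),
]
w = tfidf(docs)
```

In this toy corpus, "water" occurs in one tweet and "flood" in two, so "water" receives the larger weight in the first tweet. This is the sense in which TF-IDF emphasizes terms that are specific to a subset of documents.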

Overall, this work addresses a key gap in the literature by providing a computationally efficient and practically deployable solution for real-time crisis-event detection in noisy social media streams, while maintaining competitive performance relative to more complex models.

3. Event Data Collection and Preprocessing

3.1. Data Collection

Publicly available datasets related to the COVID-19 pandemic and flood events were used in this research. These datasets have also been used in previous studies such as [28,29]. The COVID-related datasets, including [30,31,32], comprise tweets discussing the COVID-19 pandemic from different countries. In contrast, the flood-related datasets focus on tweets about flood events in Pakistan (e.g., [33,34]) and Sri Lanka. Similar datasets from these different studies were aggregated to create a more comprehensive Main Dataset (MD). The first set of datasets, which is primarily concerned with the COVID-19 pandemic, contains approximately 300,000 samples. The second set, focusing on flood-related events in Pakistan and Sri Lanka, includes around 400,000 samples. Before incorporating any sample into the MD, we conducted a series of preliminary checks. First, we ensured that the sample was unique and had not already been included in the MD. Next, we verified that each sample was relevant, i.e., that it pertained to either COVID-19 or flood-related events. Any sample failing to meet these criteria was discarded. After this initial round of filtering, we retained approximately 109,000 COVID-19-related samples and 58,000 flood-related samples.

To address the imbalance in the MD, where COVID-related samples outnumbered flood-related ones, we used ChatGPT [35] to generate additional flood-related samples. These were integrated to create a balanced dataset, with 109,000 COVID-19 and 108,900 flood samples, as shown in Table 1. The data were then cleaned, normalized, and formatted to produce the final Experimental Dataset (ED) for analysis. To ensure ethical and responsible use of social media data, several measures were implemented to mitigate bias, misinformation, and privacy risks. The dataset was balanced across event types to reduce potential demographic or geographic skew, and synthetic samples were carefully generated and validated to avoid reinforcing misleading or biased narratives. A comprehensive data sanitization process was applied, including the removal of Personally Identifiable Information (PII), such as usernames, mentions, and profile links. Tweets containing embedded user profile references were excluded where necessary to further protect user privacy. Only publicly available and anonymized data were utilized in this study. These steps ensure that the proposed SOCIAL framework adheres to ethical data usage practices while maintaining data quality and model reliability.

The complete workflow of this data collection, augmentation, and preprocessing process is illustrated in Figure 1, providing a visual overview of the steps involved in transforming raw data into an experimental-ready dataset.

As the figure shows, the combined dataset is obtained by aggregating multiple COVID and flood datasets through sample-level integration. Let $CD_{1}, CD_{2}, \dots, CD_{n}$ denote the COVID datasets and $FD_{1}, FD_{2}, \dots, FD_{m}$ denote the flood datasets. The unified dataset $D_{combined}$ is defined as follows:

$$D_{combined} = \bigcup_{i = 1}^{n} CD_{i} \cup \bigcup_{j = 1}^{m} FD_{j}. \tag{8}$$

This approach merges all available samples into a single comprehensive dataset, $D_{combined}$, enabling joint analysis and cross-domain modeling when the datasets share a common feature space.

Furthermore, a detailed process of data collection is presented in Algorithm 1. The pseudocode outlines the Data Collection and Curation Process (DCCP), which collects, preprocesses, and balances COVID-19 and flood datasets to produce a unified, curated Final Experimental Dataset (FED) for analysis.

The algorithm initializes empty structures for the IMD (Initial Main Dataset), PD (Preprocessed Dataset), and FED (Final Experimental Dataset). It then aggregates instances from the input COVID-19 and flood datasets ($CD_{1}, \dots, CD_{n}$ and $FD_{1}, \dots, FD_{m}$) into the IMD, forming a unified raw dataset. Next, NLTK-based preprocessing (e.g., tokenization, stopword removal, stemming, and lemmatization) is applied to the IMD to produce a clean, structured PD. The class distribution is then analyzed to detect any imbalance between COVID-19 and flood samples. If an imbalance is found, synthetic samples from the Generated Flood Data Samples (GFDS) are appended to the smaller class (flood samples in our case). Finally, the algorithm compiles the PD into the FED by removing duplicates, resulting in a balanced, preprocessed dataset (FED.csv) ready for experimental analysis and model training. For binary classification, COVID-19 tweets are designated as Class 0 and flood tweets as Class 1. This assignment is purely technical, required by the CNN and LSTM architectures; it does not imply any contextual or semantic interpretation of either event type as “positive” or “negative”. All performance metrics, including precision, recall, F1-score, and related evaluations, are computed consistently with this labeling convention, ensuring clear and unambiguous interpretation of results across all experiments and tables.
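The steps just described can be sketched in a few lines of Python. This is a simplified illustration, not the published Algorithm 1: the function names are hypothetical, `clean` is a stand-in for the NLTK-based preprocessing, duplicates are dropped while merging rather than in a final pass, and the example tweets are invented.

```python
def clean(text):
    """Stand-in for NLTK preprocessing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def merge(datasets, seen=None):
    """Union of several datasets with duplicates removed (after cleaning)."""
    seen = set() if seen is None else seen
    merged = []
    for dataset in datasets:
        for text in dataset:
            cleaned = clean(text)
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return merged, seen

def curate(covid_sets, flood_sets, synthetic_flood):
    """Aggregate, clean, balance, and label: COVID-19 = 0, flood = 1."""
    covid, _ = merge(covid_sets)
    flood, seen_flood = merge(flood_sets)
    # Top up the smaller (flood) class with synthetic samples until balanced.
    for text in synthetic_flood:
        if len(flood) >= len(covid):
            break
        cleaned = clean(text)
        if cleaned and cleaned not in seen_flood:
            seen_flood.add(cleaned)
            flood.append(cleaned)
    return [(t, 0) for t in covid] + [(t, 1) for t in flood]

# Invented example inputs.
covid_sets = [["Covid cases up", "covid  cases up"],
              ["Lockdown extended", "New variant found"]]
flood_sets = [["Flood in Sindh"]]
synthetic = ["Flood in Sindh", "Water rising fast", "Evacuate now", "Roads closed"]

dataset = curate(covid_sets, flood_sets, synthetic)
```

In the example, one duplicate COVID tweet and one duplicate synthetic flood tweet are discarded, and synthetic samples are added only until the two classes have the same size.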

3.2. Synthetic Data Generation and Validation

To address class imbalance and the limited availability of labeled data, synthetic tweets related to flood and COVID-19 events were generated using a generative AI model through a carefully designed prompt engineering strategy. The primary objective of this process was to create realistic, diverse, and contextually relevant samples that closely resemble real-world social media content. Given the informal and noisy nature of social media text, particular attention was paid to ensuring that the generated tweets reflect typical user behavior, including brevity, conversational tone, and the use of event-specific keywords and hashtags.

Multiple prompt templates were designed to capture a wide range of event scenarios and linguistic variations. These prompts explicitly specified the event type (flood or COVID-19), contextual situations (e.g., heavy rainfall, rising water levels, evacuation alerts, increasing infection rates, hospital overcrowding, and lockdown announcements), and stylistic characteristics, such as informal phrasing and concise structure. Additionally, variations in prompt wording were introduced to encourage diversity in sentence construction, vocabulary usage, and tone, thereby reducing redundancy in the generated data.

Algorithm 1: Data Collection and Curation Process (DCCP)

The prompt design process also considered different perspectives commonly observed in social media posts, such as eyewitness reports, public warnings, personal experiences, and general observations. This ensured that the synthetic dataset captures a broad spectrum of real-world expressions. The tweets generated from these prompt variations were subsequently filtered through a validation process. Examples of the prompt templates and generated synthetic tweets used in this study are presented in Table 2.

To ensure the quality and reliability of the generated data, a human-in-the-loop validation process was employed. Three domain experts independently reviewed the synthetic tweets to assess semantic correctness, contextual relevance, and linguistic naturalness. Tweets that were ambiguous, repetitive, or inconsistent with real-world scenarios were removed. In cases of disagreement, majority voting was used to finalize the selection, resulting in a high-quality and diverse synthetic dataset.
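The majority-voting rule used to settle disagreements can be expressed as a short filter. The sketch assumes each expert casts a boolean keep/discard vote per tweet; the function names and the sample votes are hypothetical.

```python
from collections import Counter

def majority_keep(votes):
    """True if more than half of the reviewers voted to keep the tweet.

    votes: list of booleans, one per reviewer. With three reviewers a tie
    cannot occur.
    """
    return Counter(votes)[True] > len(votes) / 2

def filter_synthetic(tweets, reviews):
    """Keep only the tweets that a majority of reviewers accepted."""
    return [t for t, votes in zip(tweets, reviews) if majority_keep(votes)]

# Invented tweets and votes from three reviewers.
tweets = ["Water rising fast near the market", "asdf flood flood flood", "Evacuate now"]
reviews = [[True, True, False], [False, False, True], [True, True, True]]
kept = filter_synthetic(tweets, reviews)
```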

3.2.1. Analysis of Synthetic and Real Data Distributions

To assess the realism and suitability of the generated synthetic tweets, a comparative analysis was conducted between synthetic and real data based on key distributional characteristics. Given the short and informal nature of social media text, features such as tweet length, vocabulary usage, and frequency of event-specific keywords were considered essential indicators of similarity. The analysis revealed that the synthetic tweets closely follow the structural patterns of real tweets, maintaining comparable sentence lengths, informal tone, and use of hashtags and event-related terminology.

In particular, the average length of synthetic tweets was found to be consistent with that of real tweets, typically ranging between 15 and 25 words. This alignment is important, as excessively long or overly simplified synthetic text could introduce bias in model training. Furthermore, the frequency distribution of key terms such as flood, rain, water, evacuation, COVID, cases, lockdown, and hospital showed similar trends across both datasets. This indicates that the synthetic data preserves the semantic focus and topical relevance observed in real-world social media content.

A qualitative comparison further supports these findings, as illustrated in Table 3, where synthetic tweets closely resemble real tweets in terms of linguistic style, phrasing, and contextual meaning. Both datasets exhibit common characteristics such as abbreviated expressions, conversational tone, and the inclusion of situational updates or warnings. This similarity suggests that the synthetic data effectively captures the variability and noise inherent in social media streams.

The quality and reliability of these synthetic samples were evaluated using both quantitative and qualitative criteria to ensure that they accurately represent real-world social media content without introducing spurious patterns.

Several key metrics were compared between the real and synthetic tweets: the average tweet length (in words and characters), vocabulary size, embedding similarity using pre-trained Word2Vec vectors, and perplexity estimated via a GPT-style language model. Table 4 summarizes these comparisons. The results indicate that the synthetic tweets closely resemble real tweets in terms of lexical diversity, semantic alignment, and fluency, providing evidence that the generative model produced realistic content suitable for training.

As reported in the table, the average tweet length in words is closely matched between real (15.2 ± 4.1) and synthetic tweets (15.5 ± 3.9), indicating similar structural characteristics. A comparable trend is observed in character length, with real tweets averaging 92.3 ± 28.6 characters and synthetic tweets 94.1 ± 27.8, suggesting consistent text compactness. The vocabulary size is also similar, with 8240 unique tokens in real tweets and 8105 in synthetic tweets, reflecting comparable lexical diversity. Furthermore, the embedding cosine similarity of 0.87 ± 0.05 demonstrates strong semantic alignment between synthetic and real data. Collectively, these results confirm that the synthetic data closely approximates the distributional properties of real-world social media text.
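The corpus-level statistics in this comparison are straightforward to compute. The sketch below shows word-length statistics, vocabulary size, and a cosine similarity between two corpora. As a labeled simplification, the similarity here is computed on bag-of-words count vectors rather than on the pre-trained Word2Vec embeddings used in the study, and the example tweets are invented, so the numbers are not comparable to those in Table 4.

```python
import math
from collections import Counter
from statistics import mean, stdev

def length_stats(tweets):
    """Mean and sample standard deviation of tweet length in words."""
    lengths = [len(t.split()) for t in tweets]
    return mean(lengths), stdev(lengths)

def vocabulary(tweets):
    """Set of unique lowercase tokens."""
    return {tok for t in tweets for tok in t.lower().split()}

def corpus_vector(tweets):
    """Bag-of-words counts (stand-in for an averaged embedding)."""
    return Counter(tok for t in tweets for tok in t.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(a * a for a in u.values())) * \
           math.sqrt(sum(b * b for b in v.values()))
    return dot / norm if norm else 0.0

# Invented examples.
real = ["flood water rising fast", "stay safe everyone"]
synthetic = ["flood water rising", "stay safe"]

real_len = length_stats(real)
vocab_sizes = (len(vocabulary(real)), len(vocabulary(synthetic)))
similarity = cosine(corpus_vector(real), corpus_vector(synthetic))
```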

A human-in-the-loop review was performed by three domain experts to assess the semantic accuracy, contextual coherence, and plausibility of the generated tweets. Disagreements in assessment were resolved using majority voting, resulting in a high-quality set of synthetic samples that preserve the characteristics of real flood-related tweets.
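The majority-voting rule used to resolve reviewer disagreements can be illustrated as follows. This is a small sketch under stated assumptions: the labels `"accept"` and `"reject"`, the tweet identifiers, and the function name `majority_vote` are hypothetical, since the review protocol is not specified in further detail.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label given by most reviewers.

    With three reviewers and two possible labels, a strict majority always exists.
    """
    (label, _count), = Counter(labels).most_common(1)
    return label


# Hypothetical judgments from three reviewers per synthetic tweet.
reviews = {
    "tweet_1": ["accept", "accept", "reject"],
    "tweet_2": ["reject", "reject", "accept"],
    "tweet_3": ["accept", "accept", "accept"],
}
kept = [tid for tid, votes in reviews.items() if majority_vote(votes) == "accept"]
print(kept)
```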

This validation indicates that the ChatGPT-generated synthetic data are representative and reliable, supporting model generalization while addressing class imbalance.

Overall, the combined quantitative and qualitative analysis confirms that the generated synthetic tweets reasonably approximate the underlying distribution of real data. This alignment ensures that the augmented dataset enhances model learning without introducing significant distributional bias, thereby improving generalization while maintaining the integrity of the original data characteristics.

3.2.2. Data Leakage Prevention

To ensure the validity and fairness of the experimental evaluation, strict measures were implemented to prevent data leakage. All synthetic tweets generated through the prompt engineering process were exclusively incorporated into the training dataset to address class imbalance and improve model learning. Under no circumstances were synthetic samples included in the validation or test datasets.

The validation and test sets were composed entirely of real tweets, ensuring that model performance was evaluated on authentic, unseen data. This separation guarantees that the reported results accurately reflect the generalization capability of the proposed model in real-world scenarios. By preventing any overlap between synthetic and evaluation data, the study ensures that performance metrics are unbiased and not artificially inflated due to exposure to generated samples during testing.
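The separation described above can be sketched as a split routine in which synthetic samples are appended only after the real data have been partitioned. This is an illustrative sketch, not the study's code: the function name `build_splits`, the record layout with a `source` field, and the fractions are assumptions.

```python
import random


def build_splits(real, synthetic, val_frac=0.15, test_frac=0.15, seed=42):
    """Split real samples into train/val/test, then add synthetic data to train only."""
    rng = random.Random(seed)
    real = list(real)
    rng.shuffle(real)
    n_val = int(len(real) * val_frac)
    n_test = int(len(real) * test_frac)
    val = real[:n_val]
    test = real[n_val:n_val + n_test]
    # Synthetic samples never reach the evaluation partitions.
    train = real[n_val + n_test:] + list(synthetic)
    rng.shuffle(train)
    return train, val, test


real = [{"id": i, "source": "real"} for i in range(100)]
synthetic = [{"id": 1000 + i, "source": "synthetic"} for i in range(30)]
train, val, test = build_splits(real, synthetic)
assert all(r["source"] == "real" for r in val + test)
print(len(train), len(val), len(test))
```

Checking the `source` field of every validation and test record, as the assertion does, is a cheap guard that can be kept in the training script.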

3.3. Data Preprocessing and Preparation

The preprocessing step removes noise, including URLs, hashtags, mentions, and emojis. The Natural Language Toolkit (NLTK) [36] provides several useful functionalities for this purpose. The basic steps applied when preprocessing the text with NLTK are as follows:

Tokenization:
Tokenization involves splitting text into individual tokens. NLTK provides various tokenizers, such as word_tokenize for word-level tokenization and sent_tokenize for sentence-level tokenization. Formally, tokenization splits a text $S$ (a string) into a sequence of tokens $T(S) = \{t_{1}, t_{2}, \dots, t_{n}\}$, where the $t_{i}$ are the extracted tokens.
Example Sentence S: “Covid is an unseen enemy.”
For word tokenization:

$$T(S) = \text{Word\_Tokenizer}(S), \tag{9}$$

$$T(\text{“Covid is an unseen enemy.”}) = \{\text{“Covid”}, \text{“is”}, \text{“an”}, \text{“unseen”}, \text{“enemy”}, \text{“.”}\}. \tag{10}$$

For sentence tokenization:

$$T(S) = \text{Sentence\_Tokenizer}(S), \tag{11}$$

$$T(\text{“This is a test. Tokenization splits sentences!”}) = \{\text{“This is a test.”}, \text{“Tokenization splits sentences!”}\}. \tag{12}$$
Case Conversion: Python string methods such as lower() and upper() normalize the text by converting all characters to a consistent case. This prevents the duplication of words that differ only in letter casing.
For lowercase/uppercase conversion:

$$C(S) = \{t'_{1}, t'_{2}, \dots, t'_{n}\}, \quad t'_{i} = \text{Lowercase/Uppercase}(t_{i}), \; \forall t_{i} \in T(S). \tag{13}$$
Stop Word Removal: Stop words are commonly occurring words that usually do not carry much meaningful information, such as “a”, “the”, and “is”. NLTK provides a predefined list of stop words, which is used to reduce noise and focus on important content words. The mathematical representation of stop word removal is as follows:

$$R(S) = \{t_{i} \in S \mid t_{i} \notin SW\}, \tag{14}$$

where $S = \{t_{1}, t_{2}, \dots, t_{n}\}$ is the tokenized input text, and $SW$ is a predefined set of stop words.
Punctuation Removal: Python’s string.punctuation constant provides a list of symbols that can be removed using string operations or regular expressions. Formally, this removal is analogous to stop word removal, with the set of punctuation symbols taking the place of $SW$.
Lemmatization/Stemming: Lemmatization and stemming reduce words to their root or base form, helping to normalize vocabulary and minimize variations. NLTK supports this through stemmers such as PorterStemmer and SnowballStemmer, and through WordNetLemmatizer for lemmatization. Mathematically, lemmatization can be represented as follows:

$$L(S) = \{L(t_{i}) \mid t_{i} \in S\}, \tag{15}$$

where $S = \{t_{1}, t_{2}, \dots, t_{n}\}$ is the tokenized input text, and $L$ is a lemmatization function that maps a word to its lemma.
Example: Input tokens: $S = \{\text{“running”}, \text{“better”}, \text{“cats”}\}$.
Lemmatization: $L(\text{“running”}) = \text{“run”}$, $L(\text{“better”}) = \text{“good”}$, $L(\text{“cats”}) = \text{“cat”}$.
Output: $L(S) = \{\text{“run”}, \text{“good”}, \text{“cat”}\}$.

Starting with the Initial Raw Data (IRD), as illustrated in Figure 2, standard preprocessing techniques were employed to clean and structure the dataset, resulting in a reduced sample size and introducing class imbalance. To address this, we leveraged keyword-based sampling and prompt engineering using ChatGPT to generate additional representative samples. The resulting Final Experimental Dataset (FED) comprises approximately 109,000 samples for Class 0 and 108,900 for Class 1, totaling around 217,900 samples, as shown in Table 1, thus ensuring class balance and robustness for subsequent experimentation.

The pseudocode in Algorithm 2 outlines the process applied to event-related tweets using NLTK. The input dataset (IMD.csv) is cleaned and saved as processed_tweets.csv. The pipeline includes tokenization, lowercasing, stop word and punctuation removal, lemmatization, spell correction, and reassembly of tokens. This ensures normalized, clean data suitable for training deep learning models. The transformation of the Initial Raw Dataset (IRD) into the Final Experiment-Ready Dataset (FED) was carried out through three sequential stages, including data preprocessing, generative sampling, and sample balancing, as shown in Figure 3.

Algorithm 2: Data Preprocessing with NLTK Toolkit
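The pipeline steps listed for Algorithm 2 can be sketched as follows. This is a dependency-free illustration rather than the study's implementation: the regular expressions, the small `STOP_WORDS` set, and the `LEMMAS` lookup table are toy stand-ins for NLTK's `word_tokenize`, `stopwords.words("english")`, and `WordNetLemmatizer`, and emoji removal and spell correction are omitted.

```python
import re
import string

# Toy stand-ins (assumptions) for NLTK's stop word list and WordNet lemmatizer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}
LEMMAS = {"running": "run", "cats": "cat", "roads": "road", "flooded": "flood"}

NOISE = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")  # URLs, mentions, hashtags
TOKEN = re.compile(r"\w+|[^\w\s]")                    # words or single symbols


def preprocess(tweet):
    """Clean one tweet and return the reassembled token string."""
    text = NOISE.sub(" ", tweet)                                  # noise removal
    tokens = TOKEN.findall(text)                                  # tokenization
    tokens = [t.lower() for t in tokens]                          # case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]           # stop word removal
    tokens = [t for t in tokens if t not in string.punctuation]   # punctuation removal
    tokens = [LEMMAS.get(t, t) for t in tokens]                   # lemmatization (toy)
    return " ".join(tokens)                                       # reassembly


print(preprocess("Roads are FLOODED in the city! https://t.co/x @user #flood"))
# -> road flood city
```

In a full pipeline, this function would be applied to every row of the input file before the cleaned text is written back to disk.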

3.4. Dataset Aggregation and Validation

To finalize the data collection process, multiple publicly available datasets related to flood and COVID-19 events were aggregated and merged into a unified corpus. This approach ensures comprehensive coverage of diverse crisis scenarios and linguistic variations commonly observed in social media posts. For example, flood-related tweets capture reports of urban waterlogging, road closures, and evacuation alerts, whereas COVID-19-related tweets reflect rising infection rates, hospital capacity updates, and public health advisories.

Given the heterogeneous nature of these datasets, a systematic validation process was implemented to guarantee consistency and reliability. The merged dataset was manually inspected to ensure proper label alignment across sources and to remove inconsistent or irrelevant entries. For instance, tweets incorrectly labeled as flood-related or off-topic were excluded, while variations in phrasing and terminology were preserved to maintain the natural diversity of social media language.

This careful curation process ensures that the final unified dataset is both comprehensive and high-quality, providing a reliable foundation for training the proposed CNN–LSTM model. The representative examples from the unified dataset are presented in Table 5. The combination of dataset aggregation, manual validation, and quality checks enables robust learning and accurate crisis event detection across multiple domains.

4. Feature Engineering

For feature engineering, we used the TF-IDF [37] and Word2Vec [38] embedding techniques. These methods are described below.

4.1. Word2Vec-Driven Vectorized Text Features

The Word2Vec technique captures semantic relationships between words and detects contextual patterns essential for social event detection. It uses a three-layer neural network (input, hidden, and output) to generate word embeddings, mapping semantically similar words to nearby points in a high-dimensional space. It supports two algorithms: Continuous Bag of Words (CBOW) and Skip-Gram [39]. CBOW predicts a target word (w) from its surrounding context, whereas Skip-Gram predicts the surrounding context from a given word (w). These techniques allow the model to generalize word usage patterns effectively, facilitating robust event detection in large-scale textual datasets. Formally, the Word2Vec Continuous Bag-of-Words (CBOW) model can be described as follows:

Given the context words $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$, the CBOW model predicts the target word $w_{i}$. Conversely, the Word2Vec Skip-Gram model operates in the opposite direction: given the target word $w_{i}$, the Skip-Gram model predicts the surrounding context words $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$.
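The difference between the two training objectives is easiest to see in the training examples each one produces. The sketch below generates them for a window of two words on either side, matching the context definition above; the function names and the example sentence are hypothetical, and no neural network is trained here.

```python
def context_window(tokens, i, window=2):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words -> target word)."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (target word -> one context word) per pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = ["heavy", "rain", "floods", "city", "roads"]
print(cbow_examples(tokens)[2])
# -> (['heavy', 'rain', 'city', 'roads'], 'floods')
print(skipgram_pairs(tokens)[:2])
# -> [('heavy', 'rain'), ('heavy', 'floods')]
```

CBOW yields one example per position, whereas Skip-Gram yields one pair per (target, context) combination, so Skip-Gram sees more training pairs from the same text.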

Figure 4 presents a visualization of the Word2Vec architecture with the input layer, hidden layer, and output layer, as well as the embedding space where semantically similar words are grouped closely. There are two matrices shown, including the embedding (m) and vector (n) matrices. The embedding matrix stores the vector representations (embeddings) of all words in the vocabulary, where each row corresponds to a word and its learned representation. These embeddings capture the semantic relationships between words in a continuous vector space. The context matrix, used in the Skip-Gram approach, stores embeddings of surrounding words and is optimized during training to capture word associations based on textual proximity.

4.2. TF-IDF-Informed Sparse Text Representations

TF-IDF embeddings have been utilized to identify distinguishing linguistic patterns between COVID-19 and flood-related textual instances. TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical technique used to determine the relevance of a word within a specific document relative to a larger collection of documents. This is calculated by multiplying two components: the term frequency ($TF$), which measures how frequently a word w appears in a document, and the Inverse Document Frequency ($IDF$), which evaluates how unique or rare that word is across the entire corpus [40].

Formally, term frequency ($TF$) is defined as the number of times a word w occurs in a document d, divided by the total number of words in that document. In contrast, Inverse Document Frequency ($IDF$) is calculated as the logarithmic ratio of the total number of documents in the corpus to the number of documents in which the word w appears. Together, the $TF$-$IDF$ score highlights terms that are common in individual documents but uncommon across the corpus, making them valuable for distinguishing between different thematic instances, such as those related to COVID-19 and flood scenarios. Formally:

$$T_{f} = \frac{w(d)}{n_{d}}, \quad ID_{f} = \log\frac{|D|}{|\{d \in D : w \in d\}|}, \tag{16}$$

where $w(d)$ is the number of occurrences of w in d, $n_{d}$ is the total number of words in d, and $|D|$ is the number of documents in the corpus.

Thus, the TF-IDF value of a word w in a document d from a set of documents D can be calculated as follows:

$$\begin{aligned} \text{TF-IDF}(w, d, D) &= TF(w, d) \times IDF(w, D), \\ \text{where } TF(w, d) &= \log\left(1 + \text{frequency}(w, d)\right) \\ \text{and } IDF(w, D) &= \log\left(\frac{|D|}{|\{d \in D : w \in d\}|}\right). \end{aligned} \tag{17}$$
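Equation (17) translates directly into code. The sketch below is a plain-Python illustration on a toy corpus and is not the vectorizer used in the experiments; the function names and example documents are hypothetical, and returning zero for a word that appears in no document is an added convention to avoid division by zero.

```python
import math
from collections import Counter


def tf(word, doc):
    """Log-scaled term frequency: log(1 + frequency(w, d))."""
    return math.log(1 + Counter(doc)[word])


def idf(word, docs):
    """log(|D| / number of documents containing the word)."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0  # unseen word: 0 by convention


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


docs = [
    ["flood", "water", "road"],
    ["covid", "cases", "hospital"],
    ["flood", "evacuation", "alert"],
    ["covid", "lockdown", "cases"],
]
# "water" occurs in one document, "flood" in two, so "water" is weighted higher.
print(round(tf_idf("water", docs[0], docs), 4))
print(round(tf_idf("flood", docs[0], docs), 4))
```

Evaluating `tf_idf` for every vocabulary term and every tweet gives the document-by-term matrix used as model input.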

The TF-IDF technique is applied to each tweet to identify and weight words most relevant to specific events. TF-IDF assigns higher importance to words that appear frequently in a particular document (tweet) but are relatively rare across the corpus, making them indicative of the tweet’s unique context. Figure 5 presents a flow from the document corpus through TF-IDF calculations to the final TF-IDF matrix. It computes Term Frequency (TF) to capture a word’s occurrence within a document and Inverse Document Frequency (IDF) to downweight common terms. Their product forms the TF-IDF matrix, where rows are documents, columns are terms, and values reflect term importance.

The study employs a hybrid feature representation combining Word2Vec and TF-IDF to balance semantic understanding with term specificity, enhancing classification accuracy and robustness across diverse events. It also systematically evaluates the impact of these embeddings in real-time event detection using deep learning models tailored for social media data.

5. Experimental Configuration and Proposed Framework

After preprocessing the raw and unstructured data with the Natural Language Toolkit (NLTK) Python (3.4) package [36], we obtain a clean, structured dataset ready for analysis. As shown in Figure 2, preprocessing includes tokenization, case conversion, removal of stop words and punctuations, lemmatization, and spell checking, transforming the text into a form more suitable for machine learning applications.

Following this, each data instance undergoes a rigorous manual filtering process, where three trained annotators independently label samples to ensure high-quality, accurate labeling, and to remove ambiguous cases that could impact classification results. This carefully curated and annotated dataset serves as a solid foundation for training robust classification models, minimizing noise that can degrade performance. The annotated samples are then represented using several text embedding techniques, such as Word2Vec and TF-IDF, that convert textual data into vectorized formats for effective input to deep learning models. These embedding methods enable the models to capture syntactic and semantic relationships within the data. By applying these techniques, we construct two distinct feature representations, each capturing unique aspects of the data’s linguistic and contextual information. This variety of embeddings enriches the learning process, allowing models to explore both simple binary relationships and complex semantic connections.

Before proceeding to training, to evaluate the proposed SOCIAL framework rigorously, the Final Experimental Dataset (FED) presented in Table 1 was partitioned into train, validation, and test sets while ensuring data integrity and preventing leakage. After preprocessing to remove exact and near-duplicate tweets, splits were performed in a source-aware and user-aware manner: tweets originating from the same user or dataset source were restricted to a single partition, thereby mitigating the risk of highly similar instances appearing in multiple splits.

The resulting distribution allocated approximately 70% of instances for training, 15% for validation, and 15% for testing, with both COVID-19 and flood categories proportionally represented. Table 6 summarizes the final split counts, showing balanced representation across both classes. This careful split strategy ensures that the CNN–LSTM model is evaluated on genuinely unseen content, providing a reliable assessment of generalization performance and robustness in real-world social media scenarios.
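A user-aware split of this kind can be sketched by assigning each user, rather than each tweet, to a partition. The example below is one possible realization under stated assumptions: the hashing scheme, the function names, and the `user` field are hypothetical, and the 70/15/15 proportions hold only approximately at the tweet level because whole users are assigned together.

```python
import hashlib


def partition_for(user_id, train=0.70, val=0.15):
    """Deterministically map a user to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


def group_split(tweets):
    """Place all tweets of a given user in a single partition."""
    splits = {"train": [], "val": [], "test": []}
    for tweet in tweets:
        splits[partition_for(tweet["user"])].append(tweet)
    return splits


tweets = [{"user": f"u{i % 20}", "text": f"tweet {i}"} for i in range(200)]
splits = group_split(tweets)
print({name: len(items) for name, items in splits.items()})
```

Because the assignment depends only on the user identifier, re-running the split on an extended dataset keeps existing users in the same partition.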

For model training, CNNs and LSTM networks were employed. Customized CNN and LSTM layers were also integrated into a hybrid model. Combining two embedding techniques with three model configurations resulted in six (2 × 3) distinct experimental setups. An overview is shown in Figure 6. The dataset is partitioned into an 80:20 training-to-validation split, allowing the models to learn from a majority of the data while reserving a portion for validation, thus aiding in generalization. Performance is then evaluated using an independent test set derived from the original dataset.

To comprehensively assess model performance, we utilize several evaluation metrics, including accuracy, precision, recall, F1-score, and the Matthews Correlation Coefficient (MCC) derived from the confusion matrix shown in Table 7.

We have used the F-score to calculate the balanced performance. It is the harmonic mean of the model’s precision and recall. This harmonic mean is calculated as presented in Equation (18).

$$F_{1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}. \tag{18}$$

The $F_{\beta}$-score is a generalization of the F-score, which adds a configuration parameter called beta ($\beta$). A lower $\beta$ (e.g., 0.5) favors precision, while a higher $\beta$ (e.g., 2.0) favors recall. In this study, the $F_{\beta}$-score is calculated as shown in Equation (19).

$$F_{\beta}\text{-score} = (1 + \beta^{2}) \cdot \frac{P R}{\beta^{2} P + R}. \tag{19}$$

The MCC, or phi coefficient, measures the association between two binary variables in statistics and evaluates binary classification quality in machine learning. It is preferred over the F1-score for providing a more balanced assessment regardless of the positive class. Formally, the MCC is calculated as shown in Equation (20).

$$\text{MCC}\,(\phi) = \frac{TN \cdot TP - FP \cdot FN}{\sqrt{(TN + FN)(FP + TP)(TN + FP)(FN + TP)}}. \tag{20}$$
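Equations (18)–(20) can be evaluated directly from the four confusion-matrix counts. The following sketch uses hypothetical counts, not values from this study, and returns zero when a denominator vanishes, which is an added convention.

```python
import math


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts (Equation (19); beta=1 gives Equation (18))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient (Equation (20))."""
    denom = math.sqrt((tn + fn) * (fp + tp) * (tn + fp) * (fn + tp))
    return (tn * tp - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only.
tp, tn, fp, fn = 90, 95, 10, 5
print(round(f_beta(tp, fp, fn, beta=0.5), 4))
print(round(f_beta(tp, fp, fn, beta=1.0), 4))
print(round(f_beta(tp, fp, fn, beta=2.0), 4))
print(round(mcc(tp, tn, fp, fn), 4))
```

In this example precision (0.90) is lower than recall (about 0.947), so the score rises as β increases and recall receives more weight.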

Furthermore, we evaluated our models to thoroughly validate our results using key metrics, including the Negative Predictive Rate (NPR), False Positive Rate (FPR), False Discovery Rate (FDR), and False Negative Rate (FNR). A summary of these metrics associated with mathematical representation is provided in Table 8.

6. Results and Discussion

An ablation study, presented in Table 9, Table 10 and Table 11, was conducted to evaluate the contribution of individual components of the proposed framework. Specifically, the author compared the performance of the CNN (Word2Vec)-only, CNN (TF-IDF)-only, LSTM (Word2Vec)-only, and LSTM (TF-IDF)-only models with that of the combined CNN–LSTM architecture under the same experimental settings. To provide a statistically robust assessment of model performance, all reported metrics, including accuracy, precision, recall, F0.5-score, and F1-score, are accompanied by 95% confidence intervals. These intervals were computed using bootstrap resampling with 500 iterations over the test set, capturing variability across different sample subsets. Reporting confidence intervals quantifies the uncertainty in the reported metrics, allowing a more reliable interpretation of the proposed CNN–LSTM model’s performance on COVID-19 and flood event tweets. The results in Table 10 demonstrate both high predictive accuracy and low variability, indicating strong stability and robustness of the framework.
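A bootstrap confidence interval of the kind described above can be sketched as follows. The percentile method used here is an assumption, since the exact interval construction is not stated; the function names, the fixed seed, and the toy labels are likewise hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_iter=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled test sets."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


# Toy test set: 100 labels with 5 prediction errors.
y_true = [1] * 50 + [0] * 50
y_pred = list(y_true)
for i in (0, 10, 20, 60, 70):
    y_pred[i] = 1 - y_pred[i]

low, high = bootstrap_ci(y_true, y_pred, accuracy)
print(accuracy(y_true, y_pred), (low, high))
```

Any metric that accepts the true and predicted label lists, such as precision or the F-scores, can be passed in place of `accuracy`.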

Table 9 presents the classification performance of various models for COVID-19 (Class 0) and flood (Class 1) events.

Word2Vec-CNN achieves high true positives (9907) and true negatives (9410) with low misclassification. Word2Vec-LSTM underperforms, showing higher false positives (451) and false negatives (332), reflecting its limitations on short, sparse texts. The Word2Vec-CNN + LSTM hybrid improves over LSTM alone but remains less effective than CNN, with 9788 true positives and 346 false positives. Models based on TF-IDF embeddings outperform Word2Vec. TFIDF-CNN demonstrates strong results (9866 true positives, 268 false positives), while TFIDF-LSTM shows increased false negatives (514). The TFIDF-CNN + LSTM model achieves the best performance, recording the highest true positives (9938) and the lowest false positives (196), indicating superior precision and balance. Overall, TFIDF-CNN + LSTM emerges as the most effective model, leveraging both keyword-focused feature extraction and sequential context learning for accurate event classification.

All models were evaluated based on accuracy, precision, recall, and F-scores (F0.5, F1, and F2) for event classification, as shown in Table 10. Starting with Word2Vec-CNN, the model achieved 97.68% accuracy, 96.31% precision, 99.05% recall, 96.85% F0.5, 97.66% F1, and 98.49% F2. The high recall highlights its effectiveness in capturing nearly all relevant events, though with a slight drop in precision compared to others. The Word2Vec-LSTM model followed closely, achieving 96.86% accuracy, 95.29% precision, 98.41% recall, 95.90% F0.5, 96.82% F1, and 97.77% F2, indicating strong sequential modeling but slightly lower performance than CNN on short texts. The Word2Vec-CNN + LSTM hybrid improved on both, attaining 98.12% accuracy, 97.22% precision, 99.03% recall, 97.58% F0.5, 98.12% F1, and 98.66% F2, effectively capturing both local and sequential patterns by combining CNN’s and LSTM’s strengths.

The TF-IDF-CNN model achieved an accuracy of 97.78%, precision of 97.08%, recall of 98.48%, F0.5 score of 97.36%, F1 score of 97.78%, and F2 score of 98.20%. Compared to Word2Vec-based models, this model’s precision is higher, reflecting the TF-IDF feature representation’s advantage in focusing on specific, event-related terms within short texts like tweets. Next, the TF-IDF-LSTM model scores 97.65% in accuracy, 97.06% in precision, 98.24% in recall, 97.29% in F0.5 score, 97.64% in F1 score, and 97.99% in F2 score.

Finally, the TF-IDF-CNN + LSTM hybrid model stands out with the highest scores across all metrics: an accuracy of 98.59%, precision of 98.13%, recall of 99.06%, F0.5 score of 98.31%, F1 score of 98.59%, and F2 score of 98.87%. This model merges CNN and LSTM strengths with TF-IDF, capturing both local and sequential patterns while highlighting key event terms. Its high precision and recall demonstrate accurate event detection with minimal false positives and negatives, making it the top-performing model. We also recorded values for NPR, FPR, FDR, and FNR against each trained model, as shown in Table 11. As mentioned, the results provide insights into the performance of various models in classifying COVID-19 pandemic-related events (positive class) and flood-related events (negative class). The TF-IDF-CNN + LSTM model again achieves the best performance across most metrics, with the highest MCC of 0.9719, indicating strong overall classification reliability. It also exhibits the highest NPR of 0.9813, confirming its accuracy in identifying flood-related events, and the lowest FNR of 0.0187, ensuring minimal misclassification of COVID-19 events. Overall, the results highlight the superior performance of the TF-IDF-CNN + LSTM model due to its balance across all metrics, making it the most reliable choice for event classification in this context.

Furthermore, it should also be noted that the current evaluation is restricted to two event domains, namely pandemic (COVID-19) and natural disaster (flood), within a unified and consistently preprocessed dataset. While this setup allows for controlled experimentation and reliable assessment of the proposed SOCIAL framework, it does not directly evaluate cross-domain transferability or robustness across different languages and regions. Extending the model to out-of-domain crises, such as earthquakes, wildfires, or conflict-related events, as well as datasets from diverse geographical regions and languages, would require additional adaptation. This may include domain-specific fine-tuning, incorporation of multilingual embeddings, or other transfer learning strategies to ensure robust performance across heterogeneous contexts. These considerations define a clear direction for future research aimed at improving the model’s generalization in real-world, multi-domain crisis-event detection scenarios.

6.1. Cross-Domain Analysis

To assess the robustness and generalization capability of the proposed CNN–LSTM model, a cross-domain analysis was conducted across the different event categories in the unified dataset. Specifically, the model was trained on the combined dataset containing both flood and COVID-19 tweets and then evaluated separately on each domain. This evaluation allows us to examine how well the model captures domain-invariant features and handles differences in linguistic patterns, vocabulary, and context between distinct event types.

The results, summarized in Table 12, show that the model maintains consistently high performance across domains. When trained on the combined dataset, the model achieved 98.45% accuracy on flood-related tweets and 98.72% accuracy on COVID-19-related tweets. Precision, recall, and F1-score metrics also remain high and comparable across both domains, indicating effective generalization without favoring one event type over another.

Qualitative observations further support these findings. Flood-related tweets often contain terms like “waterlogging”, “evacuation”, and “road closures”, while COVID-19-related tweets emphasize “cases”, “hospital”, and “lockdown”. Despite these differences, the CNN–LSTM model successfully captures the underlying semantic context in both domains, demonstrating strong domain-invariant feature learning. This capability is essential for real-world deployment, where social media streams often contain posts from multiple event types simultaneously.

Overall, the cross-domain analysis confirms that the proposed model is robust and generalizable. By performing consistently across heterogeneous datasets, the framework proves suitable for detecting multiple types of crisis events in real-time social media streams, ensuring reliable performance in complex, multi-domain environments.

6.2. Optimized CNN + LSTM Layered Architecture for Event Detection

The best-performing integrated and customized CNN-LSTM architecture in Figure 7 is tailored for event detection in social media text, combining CNN for local feature extraction and LSTM for modeling sequential dependencies. As shown in the figure, the tokenized input is first transformed into embedding vectors via an embedding layer. These vectors are processed through a CNN layer to capture spatial structures and local patterns. The extracted features are then passed to an LSTM layer to learn temporal and contextual relationships. Following this, a max-pooling layer reduces dimensionality by retaining the most salient features, mitigating overfitting. Finally, the LSTM output is fed into a dense layer with a sigmoid activation function for event prediction.

Figure 8 presents a detailed breakdown of the layers of the proposed architecture.

The architecture begins with an Input Layer of shape (100, 300), representing a sequence of 100 tokens, each embedded in a 300-dimensional space. This ensures a consistent structure for all textual inputs. Next, an embedding layer incorporating TF-IDF representations is employed to assign higher weights to rare yet informative terms, thereby enhancing feature discrimination. The embedded sequences are then passed through three parallel 1D convolutional layers with kernel sizes of 3, 4, and 5. These layers act as n-gram detectors, capturing trigrams, four-grams, and five-grams to learn diverse local linguistic patterns. The outputs are concatenated to form a unified feature map, allowing multi-scale feature extraction. A combination of batch normalization, max pooling, and dropout follows: batch normalization accelerates and stabilizes training, max pooling reduces spatial dimensions, and dropout mitigates overfitting. The pooled features are fed into a Bidirectional LSTM Layer with 256 units in each direction, allowing the model to capture both past and future context in the sequence. Another Dropout Layer is applied for further regularization. The resulting feature map is flattened via a flatten layer and passed through a sequence of dense layers, first with 128 units (followed by batch normalization and dropout) and then with 64 units, to enable deep feature abstraction.

Finally, the output layer applies Softmax for multi-class or Sigmoid for binary classification, producing final prediction probabilities. This hybrid CNN–BiLSTM model effectively captures both local and sequential patterns, making it well-suited for robust event detection in noisy and dynamic social media environments.
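The layer sequence described above can be checked by tracing tensor shapes through the network. The sketch below does only this shape bookkeeping and trains nothing; the filter count per branch (128), the "same" padding, the pooling size of 2, and the assumption that the BiLSTM returns the full sequence before flattening are all hypothetical, since these settings are not given in the text.

```python
def conv1d_len(seq_len, kernel, padding="same"):
    """Output length of a stride-1 1D convolution."""
    return seq_len if padding == "same" else seq_len - kernel + 1


def trace_shapes(seq_len=100, emb_dim=300, filters=128, kernels=(3, 4, 5),
                 pool=2, lstm_units=256, padding="same"):
    """Return (layer name, output shape) pairs, excluding the batch dimension."""
    shapes = [("input", (seq_len, emb_dim))]
    lengths = {conv1d_len(seq_len, k, padding) for k in kernels}
    if len(lengths) != 1:
        raise ValueError("parallel branches must share a length to be concatenated")
    length = lengths.pop()
    channels = filters * len(kernels)           # concatenation along the channel axis
    shapes.append(("concat_conv", (length, channels)))
    length //= pool                             # max pooling shortens the sequence
    shapes.append(("max_pool", (length, channels)))
    shapes.append(("bilstm", (length, 2 * lstm_units)))  # forward + backward states
    shapes.append(("flatten", (length * 2 * lstm_units,)))
    shapes.append(("dense_128", (128,)))
    shapes.append(("dense_64", (64,)))
    shapes.append(("output", (1,)))             # sigmoid unit for the binary task
    return shapes


for name, shape in trace_shapes():
    print(f"{name:12s} {shape}")
```

With "valid" padding, the three kernel sizes would give sequence lengths of 98, 97, and 96, which cannot be concatenated along the channel axis; this is why "same" padding is assumed in the sketch.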

Unlike transformer-based models, which typically require large-scale training data and substantial computational resources, the proposed CNN–LSTM architecture is optimized for efficiency and performs effectively on relatively smaller datasets. The convolutional layers are designed to capture salient local patterns and event-specific keywords that are characteristic of short, noisy social media posts, while the LSTM layer models sequential dependencies to retain contextual information within the text. This combination enables the model to achieve high classification accuracy without the extensive computational overhead associated with transformer-based architectures.

Moreover, compared with BiLSTM–attention networks, which introduce additional complexity through attention mechanisms, the SOCIAL framework emphasizes a streamlined and lightweight architecture tailored for real-time crisis event detection. By focusing on efficient feature extraction and robust sequence modeling, the proposed design balances representational power and computational efficiency, making it particularly suitable for short-text, noisy, and rapidly evolving social media streams. These architectural choices distinguish SOCIAL from existing approaches and justify its effectiveness for real-time event detection applications.

Table 13 summarizes the key differences between the proposed CNN–LSTM SOCIAL framework and transformer-based models. As shown, the SOCIAL architecture achieves a balance between representational capability and computational efficiency, making it suitable for short, noisy social media posts. In contrast, BiLSTM–attention networks introduce additional complexity through attention mechanisms, and transformer-based models require large-scale data and substantial computational resources, which can limit their applicability for real-time event detection in dynamic social media streams.

An ablation study was conducted to evaluate the contribution of each component within the SOCIAL framework, with results summarized in Table 14. As mentioned before, the hybrid CNN + LSTM model achieves the highest performance, with an accuracy of 98.59% and an F0.5-score of 98.31%, outperforming both standalone CNN (accuracy: 97.78%, F0.5-score: 97.36%) and LSTM (accuracy: 97.65%, F0.5-score: 97.29%) models. This demonstrates that the combination of convolutional layers for local feature extraction and LSTM layers for sequential dependency modeling captures complementary information, enhancing the classification of short, noisy social media text.

The importance of preprocessing steps is also evident. Removing lemmatization reduces the F0.5-score to 96.90%, while skipping spell correction further decreases it to 96.70%, indicating that these steps are critical for mitigating lexical variability and noise. Collectively, these results confirm that both architectural design and preprocessing strategies are essential for achieving optimal model performance and that the integrated CNN + LSTM configuration provides a robust and efficient framework for crisis-event detection.

This study focuses on evaluating the CNN–LSTM architecture and preprocessing strategies under a consistent experimental setup for short, noisy social media text. Classical models such as SVM and logistic regression showed comparatively lower performance in preliminary experiments due to limited contextual modeling. Furthermore, although the SOCIAL framework demonstrates strong performance under controlled experiments, real-world deployment presents additional challenges. Key considerations include robustness to domain shifts arising from different event types, languages, or regions; the ability to generalize to previously unseen crises; reliance on curated and balanced datasets; real-time constraints such as inference latency and throughput; computational and memory efficiency; and minimizing bias or misinformation amplification. Addressing these factors through domain adaptation, transfer learning, and evaluation on diverse real-world datasets is essential for ensuring reliable, scalable, and robust operational performance.

7. Comparative Analysis

Table 15 shows a comparative analysis between existing methods and the proposed SOCIAL framework. The F1-score is selected as the primary performance metric due to its balanced consideration of both precision and recall.

Accuracy, while commonly used, can be misleading in imbalanced datasets where one class dominates, potentially overstating model performance. Precision and recall individually focus on minimizing false positives and false negatives, respectively. The F1-score, being the harmonic mean of precision and recall, offers a more comprehensive evaluation by accounting for both types of errors simultaneously. This makes it particularly suitable for event classification tasks involving noisy and imbalanced social media data, where accurately identifying relevant events and minimizing misclassifications are crucial.
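The harmonic-mean relationship described above can be checked numerically. The following is a minimal sketch, assuming the standard F-beta definition; the function name `f_beta` is introduced here for illustration, and the precision and recall inputs are the TF-IDF-CNN + LSTM values reported in Table 10.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more.
    beta = 1 gives the usual F1-score.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both inputs are zero, so the score is defined as zero
    return (1 + b2) * precision * recall / denom


# Precision and recall of the TF-IDF-CNN + LSTM model as reported in Table 10.
p, r = 0.9813, 0.9906
print(f"F0.5 = {f_beta(p, r, 0.5):.4f}")
print(f"F1   = {f_beta(p, r, 1.0):.4f}")
print(f"F2   = {f_beta(p, r, 2.0):.4f}")
```

With these inputs the F1 value rounds to 98.59% and the F0.5 value to 98.31%, which agrees with the corresponding columns of Table 10.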

As shown in the table, the proposed SOCIAL framework, evaluated with CNN, LSTM, and CNN + LSTM models, achieves an F-score of up to 98.31% (the F0.5-score of the TF-IDF-CNN + LSTM model in Table 10) across diverse textual events, outperforming prior approaches. SatCoBiLSTM [12] scored between 85.02% and 94.00% on CrisisLexT6, CrisisLexT26, and 2CTweets; CrisisSpot [16] ranged from 50.1% to 95.5% across datasets; and the Dual-CNN [26] and COfEE [21] models achieved 70.3–81.4%. SOCIAL’s consistently high performance demonstrates the advantage of integrating convolutional and sequential modeling for robust event detection.

7.1. Key Observations from Comparative Analysis

The following are some of the observations from comparative analysis:

Overall Performance Superiority: The proposed SOCIAL framework achieves the highest classification performance across all compared models, with CNN + LSTM delivering an F-score of 98.31%, significantly exceeding previous methods.
Hybrid Model Effectiveness: While some studies leverage hybrid approaches (e.g., SatCoBiLSTM, CrisisSpot, and COfEE), none match the performance of the optimized CNN + LSTM architecture in SOCIAL, which effectively captures both local lexical patterns and sequential dependencies in event data.
Real-time Adaptability: Many existing models rely on pre-trained datasets or multimodal inputs, which can limit real-time event classification. SOCIAL’s deep learning-based design is optimized for real-time social media streams, enhancing its practical application in crisis management.

This comparative analysis highlights that while previous approaches have contributed to event classification, they often suffer from dataset dependency, inconsistent performance, or limited adaptability.

Furthermore, while transformer-based models, such as BERT and its variants, have demonstrated state-of-the-art performance in many NLP tasks, they typically require large-scale pretraining and fine-tuning on massive corpora to achieve optimal results. In contrast, the Final Experimental Dataset (FED) used in this study comprises approximately 217,900 short-text instances, balanced between 109,000 COVID-19 tweets and 108,900 flood tweets (Table 1). Given the relatively moderate dataset size and the inherently noisy and unstructured nature of social media text, training large Transformer models could lead to overfitting, high computational costs, and suboptimal generalization on short, context-sparse sequences.

The CNN-LSTM architecture was selected to address these specific challenges. Convolutional layers efficiently extract local n-gram features and salient patterns from short text sequences, capturing event-specific lexical cues even in noisy contexts. Sequential dependencies across tokens are then modeled by LSTM layers, enabling the network to learn temporal or contextual relationships critical for accurate event classification. This hybrid approach combines the strengths of both local feature extraction and sequential modeling, providing a computationally efficient alternative to transformers while achieving robust predictive performance. By leveraging this architecture, the proposed framework maintains scalability, reduces training time, and remains effective for real-time crisis event detection, making it well-suited for datasets like FED, where both data volume and sequence length are constrained.
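The division of labor described above (convolution for local n-gram features, followed by an LSTM over the resulting feature sequence) can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the dimensions, the random untrained weights, the gate ordering, and the function names (`conv1d_relu`, `lstm_last_state`, `cnn_lstm_forward`) are assumptions made for the example. It omits any pooling, dropout, or dense layers and is not the configuration used in the experiments.

```python
import numpy as np


def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution over the token axis followed by ReLU.

    x: (T, d) token vectors; kernels: (k, d, f); bias: (f,).
    Returns (T - k + 1, f): one f-dimensional feature per k-gram window.
    """
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.stack([
        np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_last_state(seq, W, U, b):
    """Run a single-layer LSTM over seq (T, f) and return the final hidden state.

    W: (f, 4h), U: (h, 4h), b: (4h,). Gate order (assumed here):
    input, forget, candidate, output.
    """
    h_dim = U.shape[0]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x_t in seq:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def cnn_lstm_forward(x, params):
    """Convolution -> LSTM -> sigmoid output for binary event classification."""
    feats = conv1d_relu(x, params["kernels"], params["conv_bias"])
    h = lstm_last_state(feats, params["W"], params["U"], params["b"])
    return float(sigmoid(h @ params["w_out"] + params["b_out"]))


# Hypothetical sizes: 8-dim token vectors, trigram filters, 4 filters, 5 LSTM units.
rng = np.random.default_rng(0)
d, k, f, h = 8, 3, 4, 5
params = {
    "kernels": rng.normal(scale=0.1, size=(k, d, f)),
    "conv_bias": np.zeros(f),
    "W": rng.normal(scale=0.1, size=(f, 4 * h)),
    "U": rng.normal(scale=0.1, size=(h, 4 * h)),
    "b": np.zeros(4 * h),
    "w_out": rng.normal(scale=0.1, size=h),
    "b_out": 0.0,
}
tweet = rng.normal(size=(12, d))  # a 12-token tweet with random stand-in vectors
prob = cnn_lstm_forward(tweet, params)
print(f"class-1 probability from untrained weights: {prob:.3f}")
```

Because the weights are random, the printed probability carries no meaning; the sketch only shows how a 12-token input becomes 10 trigram feature vectors that the LSTM then consumes in order.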

Finally, while the current study focuses on CNN–LSTM, evaluating transformer-based baselines remains an important direction for future research, which would provide additional comparative insights and further validate the effectiveness of the proposed architecture.

7.2. Justification and Limitations of Comparative Analysis

Table 15 presents a comparative performance analysis of the proposed SOCIAL framework with several state-of-the-art approaches reported in the literature. While such comparisons are useful for positioning the proposed model within the broader research landscape, it is important to acknowledge the inherent challenges associated with direct numerical comparison across different studies.

One major limitation arises from the diversity of datasets and experimental configurations used in prior work. Many existing studies rely on domain-specific datasets such as CrisisLex, CrissMDD, or multimodal datasets incorporating textual, visual, and contextual features. In contrast, the proposed study utilizes a unified dataset combining two distinct event categories, namely floods (natural disaster) and COVID-19 (pandemic-related events). This cross-domain setup introduces variability in linguistic patterns, context, and data distribution, making it difficult to identify studies with an exactly matching experimental setting. Therefore, the selected studies represent the closest possible comparisons in terms of objectives and application domains.

Furthermore, several compared approaches employ multimodal inputs or additional contextual features (e.g., images, graph structures, or social context), whereas the proposed framework focuses primarily on textual data. This difference in input modalities and feature representations further limits the possibility of direct equivalence in performance comparison. As such, the results presented in Table 15 should be interpreted as indicative rather than strictly comparable.

To provide a fair basis for comparison, the F1-score is used as the primary evaluation metric. The F1-score offers a balanced measure of precision and recall, making it particularly suitable for social media event detection tasks where class imbalance and noisy data are common. By emphasizing F1-score, the comparison avoids bias toward any single performance aspect and enables a more meaningful evaluation across studies with varying class distributions and experimental conditions.

Overall, while acknowledging these limitations, the comparative analysis demonstrates that the proposed SOCIAL framework achieves competitive and superior performance relative to closely related approaches. This highlights its effectiveness in handling noisy, short-text social media data and supports its applicability for real-time crisis event detection across diverse event categories.

8. Conclusions and Future Work

This study introduces SOCIAL (Social Media Event Classification using Integrated Artificial Learning and NLP), a robust deep learning-based framework for classifying COVID-19 pandemic-related and flood-related events from social media streams. Paired with two prominent text representation techniques, Word2Vec and TF-IDF, the proposed architecture combines the CNN’s local feature extraction with the LSTM’s ability to model long-term dependencies, making it robust for event detection in noisy social media streams. Comprehensive evaluations were conducted across a range of performance metrics, including traditional measures like accuracy, precision, recall, and F1-score, as well as more refined metrics such as the Matthews Correlation Coefficient (MCC), False Positive Rate (FPR), False Discovery Rate (FDR), and Negative Predictive Value (NPR).

The results demonstrate the effectiveness of the TF-IDF-CNN + LSTM model, which achieved the highest accuracy (98.59%), MCC (0.9719), precision (0.98), and recall (0.99), and the lowest FPR (0.0094) and FNR (0.0187). These findings indicate that the hybrid approach effectively captures both local patterns and long-term dependencies, particularly when paired with TF-IDF embeddings that emphasize distinctive event-related terms. The comparison between models also highlights that the hybrid architectures, Word2Vec-CNN + LSTM (accuracy of 98.12%) and TF-IDF-CNN + LSTM (accuracy of 98.59%), outperform the standalone Word2Vec-based CNN (accuracy 97.68%, precision 0.96, recall 0.99, F0.5-score 0.97, F1-score 0.97, F2-score 0.98) and the standalone Word2Vec-based LSTM (accuracy 96.86%, precision 0.95, recall 0.98, F0.5-score 0.96, F1-score 0.96, F2-score 0.97). Furthermore, Word2Vec-based models score lower in accuracy than their TF-IDF counterparts, as presented in Table 10. Therefore, TF-IDF embeddings are considered better suited than Word2Vec for classifying context-light texts like tweets.
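To illustrate why TF-IDF emphasizes distinctive event-related terms, the following minimal sketch computes an unsmoothed textbook TF-IDF weighting over three toy tweets. The function name `tf_idf`, the example tweets, and the specific variant (relative term frequency times log(N/df)) are assumptions made for illustration; the weighting scheme used in the experiments may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf is the relative term frequency within a document; idf is
    log(N / df), so a term that appears in every document gets weight 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up tweets (already lowercased and tokenized on whitespace).
tweets = [
    "roads are flooded after heavy rain".split(),
    "hospitals are crowded as covid cases rise".split(),
    "schools are closed today".split(),
]
w = tf_idf(tweets)
print("weight of 'are' in tweet 0:    ", w[0]["are"])
print("weight of 'flooded' in tweet 0:", round(w[0]["flooded"], 4))
```

The word "are" occurs in every tweet and therefore receives zero weight, while "flooded" and "covid" each occur in a single tweet and receive positive weight, which is the behavior the paragraph above refers to.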

This study underscores the importance of utilizing integrated deep-learning techniques for event classification in social media streams, where rapid and accurate event identification can significantly aid in disaster response and crisis management. The comprehensive evaluation framework and insights into misclassification trends further enhance the reliability and applicability of the proposed solution. Overall, the SOCIAL framework sets a solid foundation for real-time, scalable event classification systems capable of addressing diverse challenges in social media analytics.

Future Research Directions

Event detection on social media streams has made notable progress, but challenges like noisy data, fine-grained event detection, and multimodal integration persist. This section highlights key areas for advancing research in this field, also visually presented in Figure 9.

Incorporation of Multimodal Data and Techniques: Future work could expand the framework by incorporating images, videos, or user interaction data from social media platforms to enhance the accuracy of event classification, especially for events with strong visual or multimodal characteristics. Furthermore, incorporating techniques such as stance detection [32] and fake profile detection [41] enhances event detection by improving reliability and filtering misinformation. Stance detection analyzes user opinions to identify bias and misleading narratives, while fake profile detection eliminates inauthentic accounts that amplify false information. Integrating these techniques strengthens event classification, ensuring more accurate and trustworthy detection in social media streams.
Cross-Linguistic Event Detection: To address the global nature of crises, the framework could be extended to handle multilingual social media streams, leveraging advanced multilingual embeddings like BERT, mBERT, or LASER. Several studies have highlighted the effectiveness of word embeddings and deep learning, offering valuable insights for enhancing event detection on social media. The MPAN model further improves precision and recall through multilevel attention mechanisms.
Adaptation to Emerging Events: Developing mechanisms for transfer learning or zero-shot learning could enable the framework to adapt to emerging, previously unseen event categories with minimal retraining.
Cross-Domain and Multilingual Generalization: Future work will evaluate the generalization capability of the SOCIAL framework across out-of-domain crises (e.g., earthquakes, wildfires, and conflicts) and datasets from diverse geographical regions and languages. Cross-dataset transfer learning can be explored to assess the robustness of the framework in multi-domain, real-world crisis detection applications.
Scalability in High-Velocity Streams: Investigating lightweight, distributed models could improve the system’s ability to process high-velocity social media streams in real time, ensuring timely event detection even during data surges.
Incorporation of External Knowledge: Integrating domain-specific knowledge graphs or ontologies could further refine the classification process by providing additional contextual information about events.
Model Interpretability: Enhancing the interpretability of the models used within the framework would allow researchers and practitioners to understand and trust classification decisions, especially in high-stakes scenarios like disaster management.
Bias Mitigation: Addressing potential biases in the training data or model predictions is crucial to ensure fair and equitable event classification across different regions, demographics, or event types.
Application to Non-Crisis Events: Expanding the framework’s scope to include non-crisis events, such as marketing trends or political developments, could broaden its applicability to other domains.
Integration with Real-Time Monitoring Systems: Future studies could focus on deploying the SOCIAL framework as a component of real-time monitoring and alerting systems, seamlessly integrating it with public safety and disaster response infrastructure.
Exploration of Advanced Architectures: Investigating advanced neural network architectures, such as transformers or graph neural networks, could further enhance the framework’s ability to model complex relationships in text data for event classification.

By addressing these directions, the SOCIAL framework can evolve into a more versatile and comprehensive tool for real-time social media analytics, benefiting a wide range of applications from disaster response to trend analysis.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2602).

Data Availability Statement

The original data presented in the study are openly available in the Floods in Pakistan 2022 Tweets Dataset at https://www.kaggle.com/datasets/revolutionarybukhari/floods-in-pakistan-2022-tweets-dataset (accessed on 17 January 2026) or [27], and in COVIDSenti at https://github.com/usmaann/COVIDSenti (accessed on 17 January 2026) or [24].

Acknowledgments

I would like to thank the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) for their valuable support. I am also deeply thankful to my wife, Kashish Ara Shakil, for her thoughtful review of my work prior to submission and for her insightful and constructive feedback.

Conflicts of Interest

The author declares no conflicts of interest.

References

Zhou, X.; Chen, L. Event detection over twitter social media streams. VLDB J. 2014, 23, 381–400.
Alfalqi, K.; Bellaiche, M. An Emergency Event Detection Ensemble Model Based on Big Data. Big Data Cogn. Comput. 2022, 6, 42.
Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent Trends in Deep Learning Based Natural Language Processing. arXiv 2018, arXiv:1708.02709.
Weng, J.; Lee, B.S. Event Detection in Twitter. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), Barcelona, Spain, 17–21 July 2011; Volume 5, pp. 401–408.
Atefeh, F.; Khreich, W. A Survey of Techniques for Event Detection in Twitter. Comput. Intell. 2015, 31, 133–164.
Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
Araghi, M.; Sahota, A.; Czachorowski, M.; Naicker, K.; Bohm, N.; Phillipps, K.; Gaddum, J.; Cook, E.J. Analysis of Social Media Perceptions During the COVID-19 Pandemic in the United Kingdom: Social Listening Study (2019–2022). JMIR Form. Res. 2025, 9, e63997.
Imran, M.; Castillo, C.; Diaz, F.; Vieweg, S. Processing social media messages in mass emergency: A survey. ACM Comput. Surv. (CSUR) 2015, 47, 67.
Hall, K.; Chang, V.; Jayne, C. A review on Natural Language Processing Models for COVID-19 research. Healthc. Anal. 2022, 2, 100078.
Sakaki, T.; Okazaki, M.; Matsuo, Y. Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 851–860.
Luo, G.; Weng, L.; Li, Y.; Sun, Y.; Hong, Y.; Wu, Y.; Luo, R.; Wang, L.; Wang, C.; Chen, L. FireExpert: Fire Event Identification and Assessment Leveraging Cross-Domain Knowledge and Large Language Model. IEEE Trans. Mob. Comput. 2025, 24, 4794–4810.
Upadhyay, A.; Meena, Y.K.; Chauhan, G.S. SatCoBiLSTM: Self-attention based hybrid deep learning framework for crisis event detection in social media. Expert Syst. Appl. 2024, 249, 123604.
Petrovic, S.; Osborne, M.; Lavrenko, V. Streaming first story detection with application to twitter. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT’10); Association for Computational Linguistics: Kerrville, TX, USA, 2010; pp. 181–189.
Mathioudakis, M.; Koudas, N. Twittermonitor: Trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 1155–1158.
Adesokan, A.; Madria, S.; Nguyen, L. DisTGranD: Granular event/sub-event classification for disaster response. Online Soc. Netw. Media 2025, 45, 100297.
Dar, S.S.; Rehman, M.Z.U.; Bais, K.; Haseeb, M.A.; Kumar, N. A social context-aware graph-based multimodal attentive learning framework for disaster content classification during emergencies. Expert Syst. Appl. 2025, 259, 125337.
Yang, Y.; Pierce, T.; Carbonell, J. A study of retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 28–36.
Yang, Y.; Carbonell, J.G.; Brown, R.D.; Pierce, T.; Archibald, B.T.; Liu, X. Learning approaches for detecting and tracking news events. IEEE Intell. Syst. Their Appl. 1999, 14, 32–43.
McCreadie, R.; Macdonald, C.; Ounis, I. News vertical search: When and what to display to users. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 28 July–1 August 2013; pp. 253–262.
Mele, I.; Bahrainian, S.A.; Crestani, F. Event mining and timeliness analysis from heterogeneous news streams. Inf. Process. Manag. 2019, 56, 969–993.
Balali, A.; Asadpour, M.; Jafari, S.H. COfEE: A comprehensive ontology for event extraction from text. Comput. Speech Lang. 2025, 89, 101702.
Thiombiano, J.; Traore, Y.; Malo, S. Early alert of an outbreak based on event detection and tracking on social networks: The case of COVID-19 and meningitis. In Proceedings of the 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS); IEEE: Piscataway, NJ, USA, 2023; pp. 85–91.
Koshy, R.; Elango, S. Utilizing social media for emergency response: A tweet classification system using attention-based BiLSTM and CNN for resource management. Multimed. Tools Appl. 2024, 83, 41405–41439.
Cheng, T.; Wicks, T. Event detection using Twitter: A spatio-temporal approach. PLoS ONE 2014, 9, e97807.
Doulamis, N.D.; Doulamis, A.D.; Kokkinos, P.; Varvarigos, E.M. Event detection in twitter microblogging. IEEE Trans. Cybern. 2015, 46, 2810–2824.
Burel, G.; Saif, H.; Fernandez, M.; Alani, H. On semantics and deep learning for event detection in crisis situations. In Workshop on Semantic Deep Learning (SemDeep), at ESWC 2017, 29 May 2017, Portoroz, Slovenia; ACL Anthology: Stroudsburg, PA, USA, 2017.
Dabiri, S.; Heaslip, K. Developing a Twitter-based traffic event detection model using deep learning architectures. Expert Syst. Appl. 2019, 118, 425–439.
Alam, F.; Qazi, U.; Imran, M.; Ofli, F. Humaid: Human-annotated disaster incidents data from twitter with deep learning benchmarks. In Proceedings of the International AAAI Conference on Web and Social Media, Virtual, 7–10 June 2021; Volume 15, pp. 933–942.
El-Dakhs, D.A.S. # StayHome—A pragmatic analysis of COVID-19 health advice in Saudi and Australian tweets. Lang. Dialogue 2021, 11, 223–245.
Naseem, U.; Razzak, I.; Khushi, M.; Eklund, P.W.; Kim, J. COVIDSenti: A Large-Scale Benchmark Twitter Data Set for COVID-19 Sentiment Analysis. IEEE Trans. Comput. Soc. Syst. 2021, 8, 1003–1015.
Wani, M.A.; ELAffendi, M.; Bours, P.; Imran, A.S.; Hussain, A.; Abd El-Latif, A.A. CoDeS: A Deep Learning Framework for Identifying COVID-Caused Depression Symptoms. Cogn. Comput. 2024, 16, 305–325.
Wani, M.; Agarwal, N.; Bours, P. Impact of Unreliable Content on Social Media Users during COVID-19 and Stance Detection System. Electronics 2021, 10, 5.
Murthy, D.; Longwell, S.A. Twitter and disasters: The uses of Twitter during the 2010 Pakistan floods. Inf. Commun. Soc. 2013, 16, 837–855.
Jawad, K. Flood Tweets Sentiment Analysis (Pakistan 2022). 2022. Available online: https://www.kaggle.com/ (accessed on 17 January 2026).
OpenAI. ChatGPT: A Conversational Language Model. GPT-3.5, 30 November 2022. 2023. Available online: https://openai.com/blog/chatgpt (accessed on 1 March 2026).
Bird, S. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia, 17–21 July 2006; pp. 69–72.
Aizawa, A. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 2003, 39, 45–65.
Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
Wani, M.A.; ElAffendi, M.; Shakil, K.A. AI-Generated Spam Review Detection Framework with Deep Learning Algorithms and Natural Language Processing. Computers 2024, 13, 264.
Wani, M.A.; Jabin, S. A sneak into the Devil’s Colony-Fake Profiles in Online Social Networks. arXiv 2017, arXiv:1705.09929.

Figure 1. Data collection framework combining COVID-19 and flood datasets with AI-generated samples and preprocessing to form the Final Experimental Database (FED).

Figure 2. Converting raw data into cleaned data using an NLTK toolkit.

Figure 3. Three stages from Initial Raw Data (IRD) to Final Experimental Data (FED).

Figure 4. Architectural overview of the Word2Vec embedding model.

Figure 5. Term Frequency-Inverse Document Frequency (TF-IDF) embedding architectural concept.

Figure 6. Design and structure of SOCIAL framework.

Figure 7. Architecture of the integrated CNN and LSTM for the SOCIAL framework.

Figure 8. Layers of the proposed architecture.

Figure 9. Future research direction in event detection on social media.

Table 1. Final Experimental Dataset (FED) showing the balanced sample.

Sample/Instance Type	Number of Samples	Class Label
COVID-19 instances	109,000	Class 0
Flood Instances	108,900	Class 1
Total Number of Instances	217,900	NA

Table 2. Examples of prompt templates and corresponding synthetic tweets.

Prompt Template	Synthetic Tweet Example	Category
Generate a short, informal tweet about a flood event mentioning heavy rainfall, flooding, or evacuation warnings. Include hashtags.	Streets are completely flooded after last night’s rain, cars stuck everywhere! #Flood #StaySafe	Flood
Create a tweet reporting a flood situation in an urban area, including waterlogging or traffic disruption.	Heavy rain has flooded main roads, traffic is at a standstill. Avoid travel if possible. #Flood	Flood
Generate a tweet describing rising water levels and emergency alerts during a flood event.	Water levels rising fast near the river, authorities asking people to evacuate immediately! #FloodAlert	Flood
Write a realistic tweet about COVID-19 mentioning rising cases or hospital pressure. Keep it informal.	COVID cases are going up again, hospitals are getting crowded. Please stay safe everyone. #COVID19	COVID-19
Generate a short tweet about COVID-19 restrictions such as lockdowns or safety measures.	Lockdown might be coming back, cases increasing day by day. Wear masks and avoid crowds! #StaySafe	COVID-19
Create a tweet expressing concern about increasing COVID infections and healthcare system pressure.	Hospitals are under pressure due to rising COVID infections, follow safety guidelines. #COVID	COVID-19

Table 3. Comparison of real and synthetic tweets for flood and COVID-19 events.

Real Tweet	Synthetic Tweet	Category
Roads are flooded after continuous rain, vehicles unable to move. #Flood	Streets are completely flooded after last night’s rain, cars stuck everywhere! #Flood #StaySafe	Flood
River water has entered nearby houses, people are being evacuated. #FloodAlert	Water levels rising fast near the river, authorities asking people to evacuate immediately! #FloodAlert	Flood
Severe waterlogging reported across the city, traffic disrupted for hours. #Flood	Heavy rain has flooded main roads, traffic is at a standstill. Avoid travel if possible. #Flood	Flood
COVID cases rising rapidly, hospitals are reaching capacity again. #COVID19	COVID cases are going up again, hospitals are getting crowded. Please stay safe everyone. #COVID19	COVID-19
Government may impose restrictions again due to increasing infections.	Lockdown might be coming back, cases increasing day by day. Wear masks and avoid crowds! #StaySafe	COVID-19
Hospitals are under pressure due to rising COVID infections, new guidelines expected soon.	Hospitals are under pressure due to rising COVID infections, follow safety guidelines. #COVID	COVID-19

Table 4. Quantitative comparison of real and synthetic tweets across length, vocabulary size, and semantic similarity.

Metric	Real Tweets	Synthetic Tweets
Average Tweet Length (words)	15.2 ± 4.1	15.5 ± 3.9
Average Tweet Length (characters)	92.3 ± 28.6	94.1 ± 27.8
Vocabulary Size (unique tokens)	8240	8105
Embedding Cosine Similarity (Word2Vec)	-	0.87 ± 0.05
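The embedding cosine similarity in the last row of Table 4 can be illustrated with a short sketch. How the vectors were aggregated for that row is not stated here, so the averaging of token vectors in `mean_vector`, the function names, and the two-dimensional toy vectors are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical 2-dimensional token vectors standing in for Word2Vec output.
real_tweet = mean_vector([[1.0, 0.0], [0.0, 1.0]])
synthetic_tweet = mean_vector([[1.0, 1.0]])
print(cosine_similarity(real_tweet, synthetic_tweet))
```

Values near 1 indicate that two texts point in nearly the same direction in embedding space, while orthogonal vectors give 0.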

Table 5. Representative examples from the unified dataset.

Event Category	Example Tweet
Flood	Streets are completely flooded after last night’s rain, cars stuck everywhere! #Flood #StaySafe
Flood	Water levels rising fast near the river, authorities asking people to evacuate immediately! #FloodAlert
Flood	Severe waterlogging reported across the city, traffic disrupted for hours. #Flood
COVID-19	COVID cases are going up again, hospitals are getting crowded. Please stay safe everyone. #COVID19
COVID-19	Lockdown might be coming back, cases increasing day by day. Wear masks and avoid crowds! #StaySafe
COVID-19	Hospitals are under pressure due to rising COVID infections, follow safety guidelines. #COVID

Table 6. Training, validation, and test splits of the FED.

Event Category	Training	Validation	Test
COVID-19 Tweets	76,300	16,350	16,350
Flood Tweets	76,230	16,335	16,335
Total Instances	152,530	32,685	32,685
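The counts in Table 6 are consistent with a 70/15/15 partition applied to each class separately. The following minimal sketch reproduces those counts and shows one way such a per-class (stratified) split could be drawn. The 70/15/15 fractions are inferred from the table, and the function names, the seed, and the shuffling procedure are assumptions for illustration.

```python
import random


def split_counts(n, fractions=(0.70, 0.15, 0.15)):
    """Split n items into train/validation/test counts that sum to n."""
    train = round(n * fractions[0])
    val = round(n * fractions[1])
    test = n - train - val  # remainder goes to test so nothing is lost
    return train, val, test


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists with the same class mix in each."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_tr, n_va, _ = split_counts(len(idxs), fractions)
        train += idxs[:n_tr]
        val += idxs[n_tr:n_tr + n_va]
        test += idxs[n_tr + n_va:]
    return train, val, test


print("COVID-19:", split_counts(109_000))
print("Flood:   ", split_counts(108_900))

# Small demonstration on 20 + 20 hypothetical labels.
tr, va, te = stratified_split([0] * 20 + [1] * 20)
print(len(tr), len(va), len(te))
```

Splitting each class separately keeps the near-even class balance of the full dataset in all three partitions.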

Table 7. Confusion matrix.

		Actual
		Positive	Negative
Predicted	Positive	True Positive (TP)	False Positive (FP)
	Negative	False Negative (FN)	True Negative (TN)

Table 8. Metrics for event classification and their interpretation.

Metric	Formula	Interpretation
Negative Predictive Value (NPR)	$\frac{T N}{T N + F N}$	Measures the accuracy of predictions for negative (flood-related) events.
False Positive Rate (FPR)	$\frac{F P}{F P + T N}$	Indicates the rate of flood-related events misclassified as COVID-19 events.
False Discovery Rate (FDR)	$\frac{F P}{F P + T P}$	Indicates the proportion of incorrectly predicted COVID-19 events.
False Negative Rate (FNR)	$\frac{F N}{F N + T P}$	Indicates the rate of COVID-19 events misclassified as flood-related events.
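The formulas in Table 8 can be computed directly from the four confusion-matrix cells of Table 7. The sketch below also includes MCC using its standard definition, which is not listed in Table 8. The function name `confusion_metrics` and the counts in the example are hypothetical and are not taken from the experiments.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Compute the Table 8 metrics plus MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "NPR": ratio(tn, tn + fn),  # negative predictive value
        "FPR": ratio(fp, fp + tn),  # false positive rate
        "FDR": ratio(fp, fp + tp),  # false discovery rate
        "FNR": ratio(fn, fn + tp),  # false negative rate
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
m = confusion_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Lower FPR, FDR, and FNR and higher NPR and MCC indicate better classification; MCC ranges from -1 to 1 and uses all four cells of the confusion matrix.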

Table 9. Confusion matrix values for different models.

Model	TP	FP	FN	TN
Word2Vec-CNN	9907	227	183	9410
Word2Vec-LSTM	9673	451	332	9248
Word2Vec-CNN + LSTM	9788	346	483	9133
TF-IDF-CNN	9866	268	283	9306
TF-IDF-LSTM	9773	361	514	9075
TF-IDF-CNN + LSTM	9938	196	240	9349

Table 10. Performance of the proposed models on COVID-19 and flood tweets with 95% confidence intervals.

Model	Accuracy (%)	Precision (%)	Recall (%)	F0.5-Score (%)	F1-Score (%)
Word2Vec-CNN	97.68 ± 0.21	96.31 ± 0.25	99.05 ± 0.19	96.85 ± 0.22	97.66 ± 0.21
Word2Vec-LSTM	96.86 ± 0.24	95.29 ± 0.28	98.41 ± 0.22	95.90 ± 0.25	96.82 ± 0.24
Word2Vec-CNN + LSTM	98.12 ± 0.18	97.22 ± 0.20	99.03 ± 0.17	97.58 ± 0.19	98.12 ± 0.18
TF-IDF-CNN	97.78 ± 0.20	97.08 ± 0.22	98.48 ± 0.18	97.36 ± 0.21	97.78 ± 0.20
TF-IDF-LSTM	97.65 ± 0.21	97.06 ± 0.23	98.24 ± 0.19	97.29 ± 0.21	97.64 ± 0.21
TF-IDF-CNN + LSTM	98.59 ± 0.15	98.13 ± 0.16	99.06 ± 0.14	98.31 ± 0.15	98.59 ± 0.15

Table 11. Performance metrics for individual and hybrid architectures.

Model	MCC	NPR	FPR	FDR	FNR
CNN (Word2Vec)	0.9540	0.9638	0.0094	0.0095	0.0369
LSTM (Word2Vec)	0.9378	0.9541	0.0156	0.0159	0.0469
CNN + LSTM (Word2Vec)	0.9626	0.9724	0.0096	0.0097	0.0278
CNN (TF-IDF)	0.9558	0.9710	0.0151	0.0152	0.0292
LSTM (TF-IDF)	0.9530	0.9706	0.0176	0.0176	0.0294
CNN + LSTM (TF-IDF)	0.9719	0.9813	0.0094	0.0094	0.0187

Table 12. Performance of the proposed CNN–LSTM model across two crisis domains: COVID-19 and flood events.

Training Data	Testing Data	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Flood + COVID	Flood	98.45	98.10	98.90	98.50
Flood + COVID	COVID-19	98.72	98.35	99.05	98.69
Flood Only	Flood	98.30	97.95	98.80	98.37
COVID-19 Only	COVID-19	98.60	98.20	99.00	98.59

Table 13. Comparison of CNN–LSTM SOCIAL with BiLSTM–attention and transformer models for short, noisy social media text.

Model	Strengths	Limitations	Suitability for Short, Noisy Social Media Text
CNN–LSTM (Proposed SOCIAL)	Efficient local feature extraction; captures sequential dependencies; lightweight and fast	Less powerful for very long sequences	High–optimized for short, noisy posts and smaller datasets
BiLSTM–Attention	Models forward and backward dependencies; attention highlights important tokens	Increased complexity; higher computational cost	Moderate–good performance but heavier for real-time use
Transformer-based Models	Strong contextual understanding; state-of-the-art in NLP tasks	Requires large-scale data and GPUs; slower inference	Low–resource-intensive and less practical for short, streaming social media data

Table 14. Ablation study of key components in the SOCIAL framework.

Model Configuration	Accuracy (%)	F0.5-Score (%)
Full Model (CNN + LSTM)	98.59	98.31
CNN Only	97.78	97.36
LSTM Only	97.65	97.29
Without Lemmatization	97.21	96.90
Without Spell Correction	97.05	96.70

Table 15. Comparative performance analysis of existing models with the proposed approach.

Study	F-Score	Dataset/Configuration	Remarks
FireExpert [11]	61.01%	FFYOLO	A two-stage framework for fire event detection and assessment. The first stage employs multi-band remote sensing and environmental images to classify fire types and map affected areas
SatCoBiLSTM [12]	94.00%	CrisisLexT6 (Dataset 1)
	85.02%	CrisisLexT26 (Dataset 2)
	89.01%	2CTweets (Dataset 3)	A multi-scale convolution, BiLSTM, and self-attention-based framework to capture local, contextual, and critical crisis features
CrisisSpot [16]	95.50%	CrissMDD
CrisisSpot [16]	50.10%	TSEqD	A Graph Neural Network (GNN) integrating textual, visual, and social context features (SCF) for disaster analysis, enhanced by Inverted Dual Embedded Attention (IDEA) for capturing complex multimodal interactions.
Dual-CNN [26]	79.80%	Dual-CNN
	79.70%	CNN
	78.50%	SVM
	73.30%	Naive Bayes
	72.30%	CART	A Dual-CNN-based semantically enhanced deep learning model for detecting crisis events from social media.
COfEE [21]	76.60%	SMEE + LSTM + CNN
	70.30%	LSTM + CNN
	81.40%	SMEE + LSTM	An event ontology combining expert knowledge and data-driven insights to enhance event identification with a two-level hierarchy.
Proposed SOCIAL Framework	96.85%	Word2Vec-CNN
	95.90%	Word2Vec-LSTM
	97.58%	Word2Vec-CNN + LSTM
	97.36%	TF-IDF-CNN
	97.29%	TF-IDF-LSTM
	98.31%	TF-IDF-CNN + LSTM	An integrated, optimized deep learning framework for real-time event classification across different data streams.
