1. Introduction
Event detection is the process of automatically identifying events from online social network (OSN) data, providing details about what occurred, when, where, to whom, and why [1]. Event detection is crucial in areas like disaster management [2], public safety, marketing, and trend analysis. The rise of social media provides real-time, user-generated content such as tweets, posts, and videos that offer immediate insights into unfolding events. Leveraging this, deep learning, particularly in Natural Language Processing (NLP) and computer vision, enables automatic, near-real-time event detection and classification. The field has rapidly advanced, fueled by large-scale data, modern neural architectures, and increased computational power [3].
Early research in social media event detection primarily relied on keyword-based and statistical approaches to identify bursts in term frequency and co-occurrence patterns across platforms such as Twitter [4]. These traditional methods, although effective in detecting sudden events, often struggle with noise, ambiguity, and a lack of contextual understanding. Subsequently, machine learning-based approaches improved detection accuracy by incorporating feature engineering techniques such as TF-IDF and topic modeling [5]. With the evolution of deep learning, more sophisticated models have been developed that leverage contextual embeddings and sequential modeling to better capture semantic and temporal dependencies in social media streams. Recent studies have further explored hybrid and ensemble approaches, integrating multiple models and data sources to enhance robustness and scalability in real-time event detection systems [2].
Unlike traditional event detection that relies on delayed, structured data from news or official sources, social media offers real-time, unfiltered insights directly from the public. Deep learning models such as Convolutional Neural Networks (CNNs) [6], Recurrent Neural Networks (RNNs), and transformers excel in processing this noisy, informal text by capturing context, timelines, and latent meaning. For instance, during the COVID-19 pandemic [7], social media posts revealed early signs of outbreaks and shifting public sentiment on vaccines and lockdowns long before formal announcements. Similarly, in natural disasters, real-time posts using keywords like “evacuation” or “rescue” provided critical information on affected areas. Such timely insights enhance situational awareness and accelerate emergency response efforts.
Given that social media platforms continuously generate vast streams of posts, images, videos, and metadata, the resulting information can be formalized mathematically as a high-dimensional input space $\mathcal{X}$. Each data instance in this space corresponds to a multi-modal, high-dimensional representation of social media activity. Formally, each social media instance is denoted as $x \in \mathcal{X}$, while the corresponding crisis state is represented by a label $y \in \{0, 1\}$, where $y = 1$ indicates a crisis-related signal and $y = 0$ denotes non-crisis content. The central objective of crisis detection is to learn a mapping,

$$f: \mathcal{X} \rightarrow \{0, 1\},$$

such that $f$ is a function that accurately predicts the crisis state of previously unseen signals.
To endow the mapping $f$ with strong representational power, recent approaches integrate deep neural networks with generative modeling paradigms. In a probabilistic formulation, the learning objective can be written as the posterior distribution:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)},$$

where $p(y = 1 \mid x)$ quantifies the likelihood that an observed signal $x$ corresponds to a crisis event. Since direct computation of the posterior is often intractable due to the high dimensionality and heterogeneity of social media data, generative models approximate the data distribution using a latent representation $z$. This forms the basis for the generative process:

$$p_{\theta}(x) = \int p_{\theta}(x \mid z)\, p(z)\, dz,$$

where $p(z)$ is commonly chosen as a multivariate Gaussian prior and $p_{\theta}(x \mid z)$ is a parameterized decoder. The latent representation itself is inferred via the encoder distribution:

$$q_{\phi}(z \mid x),$$

with $\phi$ denoting the encoder parameters. This latent space provides a compressed, noise-reduced abstraction of the social media input. It allows the model to capture underlying crisis-related semantics more effectively.
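To train the generative framework, the reconstruction-based objective function $\mathcal{L}_{\mathrm{rec}}$ is employed:

$$\mathcal{L}_{\mathrm{rec}} = -\mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right],$$

where $p_{\theta}(x \mid z)$ represents the decoder. This loss encourages the model to preserve important structural information in the latent space. In addition, a classification loss $\mathcal{L}_{\mathrm{cls}}$ guides the latent representation toward discriminative crisis features:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y = 1 \mid x_i) + (1 - y_i)\log\left(1 - p(y = 1 \mid x_i)\right)\right],$$

which ensures accurate crisis prediction over a dataset of $N$ labeled samples.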
The final optimization problem thus combines the reconstruction and classification objectives using a weighted formulation:

$$\mathcal{L}_{\mathrm{total}} = \lambda_{1}\, \mathcal{L}_{\mathrm{rec}} + \lambda_{2}\, \mathcal{L}_{\mathrm{cls}},$$

where $\lambda_{1}$ and $\lambda_{2}$ control the contribution of generative learning and discriminative learning, respectively. This unified formulation leverages the strengths of both deep learning and probabilistic modeling, enabling robust crisis event detection even in noisy, imbalanced, and rapidly evolving social media environments.
Overall, this mathematical formulation establishes a principled framework for crisis detection, connecting high-dimensional input representations, latent generative modeling, and supervised classification within a single end-to-end trainable architecture.
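As a rough illustration of the weighted objective, the following minimal sketch combines a reconstruction term and a binary cross-entropy term in plain Python. It makes several assumptions that are not stated in the text: the reconstruction term is taken to be a mean squared error (which a Gaussian decoder with fixed variance yields up to a constant), and the function names and the weights `w_rec` and `w_cls` are hypothetical stand-ins for the two weighting coefficients.

```python
import math


def reconstruction_loss(x, x_hat):
    """Mean squared error between one input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def classification_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the N labeled samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def total_loss(batch_x, batch_x_hat, y_true, y_prob, w_rec=1.0, w_cls=1.0):
    """Weighted sum of the generative and discriminative terms.

    w_rec and w_cls are illustrative names for the two weights; their
    values here are placeholders, not values reported in the study.
    """
    rec = sum(
        reconstruction_loss(x, x_hat) for x, x_hat in zip(batch_x, batch_x_hat)
    ) / len(batch_x)
    cls = classification_loss(y_true, y_prob)
    return w_rec * rec + w_cls * cls
```

Raising `w_cls` relative to `w_rec` in this sketch pushes the optimization toward discriminative accuracy, while the opposite choice favors faithful reconstruction of the input.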
Apart from the formal representation of crisis detection, NLP is pivotal in extracting meaningful insights from the vast and unstructured social media text, enabling efficient event detection [8]. By applying techniques such as text classification, sentiment analysis, topic modeling, and entity recognition, NLP can identify keywords, emotions, and contextual patterns indicative of specific events. For instance, during the COVID-19 pandemic, NLP models could detect spikes in discussions about quarantine, hospital shortages, or vaccination drives, signaling emerging health crises or public concerns [9]. Similarly, during floods in Pakistan and Sri Lanka, NLP algorithms could analyze posts mentioning terms like “floodwater”, “rescue”, or “evacuation”, helping to pinpoint affected areas and prioritize relief efforts. These two countries are highlighted because the datasets used in this study contain tweets related to flood incidents there. By leveraging NLP’s ability to process multilingual and noisy data, social media platforms become a rich source of real-time information, empowering researchers and organizations to monitor, detect, and respond to events with unprecedented speed and precision.
We propose an AI-based system named SOCIAL (Social Media Event Classification using Integrated Artificial Learning and NLP) for real-time event detection from social media streams, focusing on COVID-19 and flood-related incidents. Leveraging the richness of social media data and the power of deep learning, our approach supports applications in disaster management, public safety, marketing, and trend analysis.
Contributions
The proposed study’s major contributions are as follows:
Customized Deep Learning Architecture for Social Media Event Classification: This study presents SOCIAL, a custom deep learning framework for real-time event classification from social media. Unlike standard CNN–LSTM models, SOCIAL features a restructured architecture tailored toward short, noisy, and unstructured text. It combines CNN for local feature extraction and LSTM for sequence modeling to boost accuracy. The study also evaluates the effects of embedding choices (TF-IDF, Word2Vec) and generative AI-based data augmentation to ensure balanced training and enhanced performance.
Integration of Generative AI for Data Augmentation: This study employs generative AI, such as ChatGPT (GPT-5.3), to generate synthetic event-related data, addressing issues of data scarcity and imbalance in social media event detection. Generated samples are validated through a human-in-the-loop process involving three domain experts, ensuring semantic accuracy and contextual relevance. Disagreements in annotations are resolved via majority voting, producing high-quality, diverse datasets.
Feature Engineering Optimization for Short-Text Event Classification: This study presents a comparative analysis of TF-IDF and Word2Vec embeddings for event classification in social media streams. Given the short and noisy nature of social media text, we evaluate their performance across CNN, LSTM, and CNN + LSTM models. Results show that TF-IDF, by highlighting event-specific term importance, outperforms Word2Vec in accuracy, recall, and robustness, addressing context sparsity more effectively. These findings offer practical guidance for selecting feature representations in real-time NLP-based event detection systems.
Comprehensive Evaluation Metrics: This study employs an extensive evaluation framework beyond standard accuracy, precision, recall, and F1-score to ensure a rigorous assessment of social media event classification models. Matthews Correlation Coefficient (MCC), Negative Predictive Value (NPV), False Positive Rate (FPR), False Discovery Rate (FDR), and False Negative Rate (FNR) provide deeper performance insights, addressing imbalanced and high-stakes classification challenges. Additionally, $F_{\beta}$-score variants with different values of $\beta$ refine the balance between precision and recall for different event detection priorities. This approach enhances model generalizability, ensuring reliable decision-making in real-world applications such as disaster response and misinformation analysis.
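The evaluation metrics named in the last contribution can all be derived from the four cells of a binary confusion matrix. The sketch below is one way to compute them, given under assumptions: the function name `confusion_metrics` is hypothetical, the `betas` default of 0.5, 1, and 2 is only a common choice and not necessarily the set of F-score variants used in this study, and undefined ratios are reported as 0.0 for convenience.

```python
import math


def confusion_metrics(tp, fp, tn, fn, betas=(0.5, 1.0, 2.0)):
    """Derive MCC, NPV, FPR, FDR, FNR and F-beta scores from confusion counts."""

    def safe_div(num, den):
        # Report 0.0 when a ratio is undefined (empty denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    metrics = {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "npv": safe_div(tn, tn + fn),  # negative predictive value
        "fpr": safe_div(fp, fp + tn),  # false positive rate
        "fdr": safe_div(fp, fp + tp),  # false discovery rate
        "fnr": safe_div(fn, fn + tp),  # false negative rate
    }
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    metrics["mcc"] = safe_div(tp * tn - fp * fn, mcc_den)
    for beta in betas:
        b2 = beta ** 2
        # beta > 1 weights recall more heavily, beta < 1 favors precision.
        metrics[f"f{beta:g}"] = safe_div(
            (1 + b2) * precision * recall, b2 * precision + recall
        )
    return metrics
```

For example, `confusion_metrics(tp=40, fp=10, tn=45, fn=5)` returns a dictionary whose keys include `"mcc"`, `"npv"`, `"f0.5"`, `"f1"`, and `"f2"`.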
2. Literature Survey
Crisis-event detection from social media has attracted significant research attention in recent years; this section examines key contributions, model architectures, and remaining challenges in the field. A system called “EARWORM” was introduced in [10]. This system utilizes Twitter data to detect earthquakes in Japan. The authors performed keyword-based filtering and applied a machine-learning algorithm to these data. Another study proposed a system called FireExpert [11]. In this work, the authors presented a two-stage framework for fire event detection and assessment. The first stage uses a multi-band remote sensing method and environmental images to classify fire types and map affected areas. The second stage integrates the detection results of stage one with social media data using a large language model for real-time impact assessment. Evaluations carried out by the authors on a real-world dataset show an F1-score of 61.0% and a mean average precision (mAP) of 57.7%, surpassing state-of-the-art methods. Similarly, SatCoBiLSTM, a self-attention-based hybrid deep learning framework for crisis event detection in noisy social media streams, was proposed in [12]. It combines multi-scale convolution, Bidirectional Long Short-Term Memory (BiLSTM), and self-attention to capture local, contextual, and critical crisis features. Experiments were conducted on three benchmark datasets and achieved F1-scores of 96%, 94%, and 95%, with 1%, 1%, and 6% improvements, respectively, over state-of-the-art methods. A probabilistic model that identified bursts of tweets related to specific events was proposed in [13]. The study utilized bursty topic modeling and demonstrated its effectiveness in capturing event-related discussions on Twitter. Similarly, in [14], a system called TwitterMonitor was proposed for trend detection over the Twitter stream. The authors of this study introduced the concept of burstiness. They also utilized content similarity to detect and track emerging events in real time. Ref. [15] introduces DisTGranD, a 47,600-tweet crisis dataset with dual-layer event classification using the Automatic Content Extraction (ACE) standard. In this study, the authors achieved high inter-annotator agreement (Kappa: 0.90, 0.93). They also utilized XLNet, which attains 96–97% intra-label similarity. The authors proposed the RoBiCCus model, which outperforms existing models on DisTGranD and public disaster datasets. Similarly, the study in [16] proposes CrisisSpot, a Graph-based Neural Network (GNN) that integrates textual, visual, and Social Context Features (SCF) for disaster analysis. It introduces the concept of Inverted Dual Embedded Attention (IDEA) to capture complex multimodal interactions. The proposed method, CrisisSpot, outperforms other state-of-the-art methods, achieving F1-score gains of 9.45% on the Crisis Multimodal Dataset (CrisisMMD) and 5.01% on TSEqD.
Clustering methods to detect events in news streams have also been utilized in some studies [17]. These studies leverage temporal and content features for improved event clustering. Another study [18] extended this work to event tracking, using supervised learning to label incoming news efficiently with minimal training data. Similarly, ref. [19] conducted a user study exploring the integration of news articles and user-generated content for event-related information retrieval. They assessed how combining these two sources meets user needs. A comparison of news streams and tweets in event reporting was performed in [20], using discrete dynamic topic modeling and a Hidden Markov Model for event detection, followed by clustering news documents and analyzing the timeliness of tweets and articles in reporting events. The study in [21] addresses the challenges of Event Extraction (EE) from unstructured web data by proposing a novel event ontology called Cofee. It integrates expert knowledge, prior ontologies, and a data-driven approach to enhance event identification.
The authors of [22] propose an early warning system for alerts against epidemics of diseases such as meningitis or COVID-19 using Twitter data. They employ NLP techniques and domain ontologies to classify tweets, followed by geolocation and timestamp extraction from metadata. Real-time tweet identification is performed using SVM and CNN models, with the CNN model achieving an accuracy of 0.99. Similarly, ref. [23] proposes a CNN–BiLSTM hybrid model with an attention mechanism to identify resource-related tweets for disaster recovery. Trained on earthquake tweets from Nepal (2015) and Italy (2016), it outperforms state-of-the-art methods in in-domain and cross-domain experiments. Results show that leveraging local key information and global term dependencies improves crisis text classification, aiding efficient resource allocation. One more study [24] used Space–Time Scan Statistics (STSS) to detect events on Twitter without predefined keywords. Applied to the 2013 London helicopter crash, STSS identified a short-lived but informative cluster of tweets. This method also detected other events, such as football matches and transportation delays, showcasing its effectiveness for spatio-temporal event analysis. In another similar study [25], the authors proposed event detection algorithms for short- and long-term Twitter trends. The algorithms use dynamic metrics like time-sensitive IDF and fuzzy representations to capture temporal shifts. Events were detected via multi-assignment spectral graph partitioning, enabling words to belong to multiple clusters. Similarly, ref. [1] proposes a novel framework to detect composite social events by integrating content, time, and location dimensions using a Location–Time-Constrained Topic (LTT) model. This model represents social messages as topic distributions and measures similarity via distribution distances. Events are identified through efficient similarity joins, accelerated by a variable-dimensional extendible hash.
Some of the studies in the literature are based on deep learning. For example, a dual-CNN, semantically enhanced deep learning model for crisis event detection from social media was proposed in [26]. It achieved over 79% F-measure for event types but 61% for fine-grained details, competing with traditional models like SVM. Similarly, ref. [27] uses word embeddings to map tweets into numerical vectors and classify them into non-traffic, traffic incidents, or traffic information using CNN and RNN models. In this study, a dataset of 51,100 tweets was collected, labeled, and publicly released.
It is now clear from this section that the existing literature on crisis-event detection from social media has explored a wide range of approaches, including traditional machine learning models, deep learning architectures, such as CNN and LSTM, and, more recently, transformer-based models (e.g., BERT, RoBERTa, and DeBERTa). While transformer-based approaches have demonstrated strong performance in many NLP tasks, they are primarily designed for large-scale corpora and long-context modeling, which may not directly align with the characteristics of social media data.
In particular, social media text is typically short, noisy, and context-sparse, often containing informal language, abbreviations, and irregular structures. In such scenarios, the advantages of deep self-attention mechanisms may be limited, while the computational complexity and resource requirements of Transformer models remain high. Moreover, real-time crisis-event detection systems demand efficient models capable of fast inference and deployment under resource-constrained environments.
To address these challenges, this study proposes the SOCIAL framework, which adopts a hybrid CNN–LSTM architecture specifically optimized for short-text event detection. The CNN component captures local n-gram patterns and event-specific lexical cues from noisy inputs, while the LSTM component models sequential dependencies and contextual flow within the text. This combination enables effective feature learning without relying on large-scale pretraining or extensive computational resources.
Furthermore, the proposed framework integrates feature engineering techniques (TF-IDF and Word2Vec) with generative AI-based data augmentation to enhance data representation and address class imbalance. This holistic design differentiates SOCIAL from existing approaches by combining architectural efficiency, feature optimization, and data augmentation within a unified pipeline.
Overall, this work addresses a key gap in the literature by providing a computationally efficient and practically deployable solution for real-time crisis-event detection in noisy social media streams, while maintaining competitive performance relative to more complex models.
3. Event Data Collection and Preprocessing
3.1. Data Collection
Publicly available datasets related to the COVID-19 pandemic and flood events have been used in this research. These datasets have also been utilized in various previous studies such as [28,29]. The COVID-related datasets, including [30,31,32], comprise tweets discussing the COVID-19 pandemic from different countries. In contrast, the flood-related datasets focus on tweets regarding flood events in Pakistan [33,34] and Sri Lanka. Similar datasets from these different studies were aggregated to create a more comprehensive and enriched Main Dataset (MD). The first set of datasets, which is primarily concerned with the COVID-19 pandemic, contains approximately 300,000 samples. The second set, focusing on flood-related events in Pakistan and Sri Lanka, includes around 400,000 samples. Before incorporating any sample into the MD, we conducted a series of preliminary checks. First, we ensured that the sample was unique and had not already been included in the MD. Next, we verified that each sample was relevant, i.e., that it pertained to either COVID-19 or flood-related events. Any sample failing to meet these criteria was discarded. After this initial round of filtering, we retained approximately 109,000 COVID-19-related samples and 58,000 flood-related samples.
To address the imbalance in the MD, where COVID-related samples outnumbered flood-related ones, we used ChatGPT [35] to generate additional flood-related samples. These were integrated to create a balanced dataset, with 109,000 COVID-19 and 108,900 flood samples, as shown in Table 1. The data were then cleaned, normalized, and formatted to produce the Final Experimental Dataset (FED) for analysis. To ensure ethical and responsible use of social media data, several measures were implemented to mitigate bias, misinformation, and privacy risks. The dataset was balanced across event types to reduce potential demographic or geographic skew, and synthetic samples were carefully generated and validated to avoid reinforcing misleading or biased narratives. A comprehensive data sanitization process was applied, including the removal of Personally Identifiable Information (PII), such as usernames, mentions, and profile links. Tweets containing embedded user profile references were excluded where necessary to further protect user privacy. Only publicly available and anonymized data were utilized in this study. These steps ensure that the proposed SOCIAL framework adheres to ethical data usage practices while maintaining data quality and model reliability.
The complete workflow of this data collection, augmentation, and preprocessing process is illustrated in Figure 1, providing a visual overview of the steps involved in transforming raw data into an experiment-ready dataset.
As the figure shows, the combined dataset is obtained by aggregating multiple COVID and flood datasets through sample-level integration. Let $CD_1, \dots, CD_n$ denote the COVID datasets and $FD_1, \dots, FD_n$ denote the flood datasets. The unified dataset $D$ is defined as follows:

$$D = \left(\bigcup_{i=1}^{n} CD_i\right) \cup \left(\bigcup_{j=1}^{n} FD_j\right)$$

This approach merges all available samples into a single comprehensive dataset, $D$, enabling joint analysis and cross-domain modeling when the datasets share a common feature space.
Furthermore, a detailed process of data collection is presented in Algorithm 1. The pseudocode outlines the Data Collection and Curation Process (DCCP), which collects, preprocesses, and balances COVID-19 and flood datasets to produce a unified, curated Final Experimental Dataset (FED) for analysis.
The algorithm initializes empty structures for the Initial Main Dataset (IMD), Preprocessed Dataset (PD), and Final Experimental Dataset (FED). It then aggregates instances from the input COVID-19 and flood datasets (CD1, ..., CDn and FD1, ..., FDn) into the IMD, forming a unified raw dataset. Next, NLTK-based preprocessing (e.g., tokenization, stopword removal, stemming, and lemmatization) is applied to the IMD to produce a clean, structured PD. The class distribution is then analyzed to detect an imbalance between COVID-19 and flood samples. If an imbalance is found, synthetic samples from the Generated Flood Data Samples (GFDS) are appended to the smaller class (flood samples in our case). Finally, the algorithm compiles the PD into the FED by removing duplicates, resulting in a balanced, preprocessed dataset (FED.csv) optimized for experimental analysis and model training. For binary classification, COVID-19 tweets are designated as Class 0 and flood tweets as Class 1. This assignment is purely technical, required by the CNN and LSTM architectures, and does not imply any contextual or semantic interpretation of either class as “positive” or “negative”. All performance metrics, including precision, recall, F1-score, and related evaluations, are computed consistently with this labeling convention, ensuring clear and unambiguous interpretation of results across all experiments and tables.
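The steps of the DCCP described above can be sketched in a few lines of Python. This is an illustrative reading of the prose description, not the authors' Algorithm 1: the function names `dccp` and `preprocess` are hypothetical, the cleaning step is reduced to lowercasing and whitespace normalization as a placeholder for the NLTK-based preprocessing, and only the case where flood samples form the minority class is handled.

```python
from collections import Counter

COVID, FLOOD = 0, 1  # class labels as stated in the text


def preprocess(text):
    """Placeholder for the NLTK-based cleaning: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


def dccp(covid_datasets, flood_datasets, gfds):
    """Aggregate, preprocess, balance, and de-duplicate the source datasets.

    covid_datasets / flood_datasets: lists of datasets, each a list of tweets.
    gfds: list of generated flood samples used to fill the minority class.
    """
    # Step 1: aggregate all source datasets into the Initial Main Dataset (IMD).
    imd = [(tweet, COVID) for ds in covid_datasets for tweet in ds]
    imd += [(tweet, FLOOD) for ds in flood_datasets for tweet in ds]

    # Step 2: preprocess every sample to obtain the Preprocessed Dataset (PD).
    pd_data = [(preprocess(tweet), label) for tweet, label in imd]

    # Step 3: analyze the class distribution to detect imbalance.
    counts = Counter(label for _, label in pd_data)
    deficit = counts[COVID] - counts[FLOOD]

    # Step 4: append synthetic flood samples if flood is the smaller class.
    if deficit > 0:
        pd_data += [(preprocess(tweet), FLOOD) for tweet in gfds[:deficit]]

    # Step 5: remove duplicates (order-preserving) to obtain the FED.
    return list(dict.fromkeys(pd_data))
```

Because duplicates are removed after balancing, as in the description, the final class counts can differ slightly when synthetic samples repeat existing tweets.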
3.2. Synthetic Data Generation and Validation
To address class imbalance and the limited availability of labeled data, synthetic tweets related to flood and COVID-19 events were generated using a generative AI model through a carefully designed prompt engineering strategy. The primary objective of this process was to create realistic, diverse, and contextually relevant samples that closely resemble real-world social media content. Given the informal and noisy nature of social media text, particular attention was paid to ensuring that the generated tweets reflect typical user behavior, including brevity, conversational tone, and the use of event-specific keywords and hashtags.
Multiple prompt templates were designed to capture a wide range of event scenarios and linguistic variations. These prompts explicitly specified the event type (flood or COVID-19), contextual situations (e.g., heavy rainfall, rising water levels, evacuation alerts, increasing infection rates, hospital overcrowding, and lockdown announcements), and stylistic characteristics, such as informal phrasing and concise structure. Additionally, variations in prompt wording were introduced to encourage diversity in sentence construction, vocabulary usage, and tone, thereby reducing redundancy in the generated data.
Algorithm 1: Data Collection and Curation Process (DCCP)
The prompt design process also considered different perspectives commonly observed in social media posts, such as eyewitness reports, public warnings, personal experiences, and general observations. This ensured that the synthetic dataset captures a broad spectrum of real-world expressions. Multiple prompt variations were employed to generate a diverse set of synthetic tweets, which were subsequently filtered through a validation process. Examples of the prompt templates and generated synthetic tweets used in this study are presented in Table 2.
To ensure the quality and reliability of the generated data, a human-in-the-loop validation process was employed. Three domain experts independently reviewed the synthetic tweets to assess semantic correctness, contextual relevance, and linguistic naturalness. Tweets that were ambiguous, repetitive, or inconsistent with real-world scenarios were removed. In cases of disagreement, majority voting was used to finalize the selection, resulting in a high-quality and diverse synthetic dataset.
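The majority-voting step can be illustrated with a small sketch. The label strings `"keep"` and `"reject"` and the function names are assumptions made for illustration; the text only states that three experts reviewed each tweet and that disagreements were settled by majority vote.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most reviewers (expects an odd number of votes)."""
    label, _ = Counter(votes).most_common(1)[0]
    return label


def filter_synthetic(tweets, reviews):
    """Keep only the tweets that a majority of reviewers marked as 'keep'.

    reviews[i] holds the three expert decisions for tweets[i].
    """
    kept = []
    for tweet, votes in zip(tweets, reviews):
        if majority_vote(votes) == "keep":
            kept.append(tweet)
    return kept
```

With three reviewers and two possible decisions, a strict majority always exists, so no tie-breaking rule is needed in this setting.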
3.2.1. Analysis of Synthetic and Real Data Distributions
To assess the realism and suitability of the generated synthetic tweets, a comparative analysis was conducted between synthetic and real data based on key distributional characteristics. Given the short and informal nature of social media text, features such as tweet length, vocabulary usage, and frequency of event-specific keywords were considered essential indicators of similarity. The analysis revealed that the synthetic tweets closely follow the structural patterns of real tweets, maintaining comparable sentence lengths, informal tone, and use of hashtags and event-related terminology.
In particular, the average length of synthetic tweets was found to be consistent with that of real tweets, typically ranging between 15 and 25 words. This alignment is important, as excessively long or overly simplified synthetic text could introduce bias in model training. Furthermore, the frequency distribution of key terms such as flood, rain, water, evacuation, COVID, cases, lockdown, and hospital showed similar trends across both datasets. This indicates that the synthetic data preserves the semantic focus and topical relevance observed in real-world social media content.
A qualitative comparison further supports these findings, as illustrated in Table 3, where synthetic tweets closely resemble real tweets in terms of linguistic style, phrasing, and contextual meaning. Both datasets exhibit common characteristics such as abbreviated expressions, conversational tone, and the inclusion of situational updates or warnings. This similarity suggests that the synthetic data effectively captures the variability and noise inherent in social media streams.
The quality and reliability of these synthetic samples were evaluated using both quantitative and qualitative criteria to ensure that they accurately represent real-world social media content without introducing spurious patterns.
Several key metrics were compared between the real and synthetic tweets. These metrics included the average tweet length (in words and characters), vocabulary size, embedding similarity using pre-trained Word2Vec vectors, and perplexity estimated via a GPT-style language model. Table 4 summarizes these comparisons. The results demonstrate that the synthetic tweets closely resemble real tweets in terms of lexical diversity, semantic alignment, and fluency, providing evidence that the generative model produced realistic content suitable for training.
As reported in the table, the average tweet length in words is closely matched between real (15.2 ± 4.1) and synthetic tweets (15.5 ± 3.9), indicating similar structural characteristics. A comparable trend is observed in character length, with real tweets averaging 92.3 ± 28.6 characters and synthetic tweets 94.1 ± 27.8, suggesting consistent text compactness. The vocabulary size is also similar, with 8240 unique tokens in real tweets and 8105 in synthetic tweets, reflecting comparable lexical diversity. Furthermore, the embedding cosine similarity of 0.87 ± 0.05 demonstrates strong semantic alignment between synthetic and real data. Collectively, these results confirm that the synthetic data closely approximates the distributional properties of real-world social media text.
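A comparison of this kind can be approximated with standard-library Python, as in the sketch below. It is a simplified stand-in rather than the procedure used in the study: whitespace tokenization replaces a proper tokenizer, and the cosine similarity is computed over corpus-level term-frequency vectors instead of pre-trained Word2Vec embeddings. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev


def length_stats(tweets):
    """Mean and sample standard deviation of tweet length in words."""
    lengths = [len(tweet.split()) for tweet in tweets]
    return mean(lengths), stdev(lengths)


def vocabulary(tweets):
    """Set of unique lowercase tokens across a corpus."""
    return {word for tweet in tweets for word in tweet.lower().split()}


def cosine_similarity(tweets_a, tweets_b):
    """Cosine similarity between corpus-level term-frequency vectors.

    Assumption: bag-of-words counts stand in for embedding vectors.
    """
    freq_a = Counter(w for t in tweets_a for w in t.lower().split())
    freq_b = Counter(w for t in tweets_b for w in t.lower().split())
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Calling `length_stats` and `vocabulary` on the real and synthetic corpora separately, and `cosine_similarity` on the pair, gives a quick first check of whether the two sets have comparable length, lexical diversity, and term usage.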
The human-in-the-loop review by three domain experts, with disagreements resolved by majority voting, further assessed the semantic accuracy, contextual coherence, and plausibility of the generated tweets, yielding a high-quality set of synthetic samples that preserve the characteristics of real flood-related tweets.
Overall, the combined quantitative and qualitative analysis indicates that the ChatGPT-generated synthetic tweets reasonably approximate the underlying distribution of real data. This alignment allows the augmented dataset to address class imbalance and support model generalization without introducing significant distributional bias, while maintaining the integrity of the original data characteristics.
3.2.2. Data Leakage Prevention
To ensure the validity and fairness of the experimental evaluation, strict measures were implemented to prevent data leakage. All synthetic tweets generated through the prompt engineering process were exclusively incorporated into the training dataset to address class imbalance and improve model learning. Under no circumstances were synthetic samples included in the validation or test datasets.
The validation and test sets were composed entirely of real tweets, ensuring that model performance was evaluated on authentic, unseen data. This separation guarantees that the reported results accurately reflect the generalization capability of the proposed model in real-world scenarios. By preventing any overlap between synthetic and evaluation data, the study ensures that performance metrics are unbiased and not artificially inflated due to exposure to generated samples during testing.
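The separation described above can be sketched as follows. The 10% validation and 10% test fractions, the fixed seed, and the function name are illustrative assumptions; the text specifies only that synthetic samples enter the training set and that validation and test sets contain real tweets exclusively.

```python
import random


def split_without_leakage(real, synthetic, val_frac=0.1, test_frac=0.1, seed=42):
    """Split real samples into train/val/test, then add synthetic data to train only."""
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)

    # Validation and test sets are carved out of real data before augmentation.
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]

    # Synthetic samples are appended to the remaining real training data only.
    train = shuffled[n_test + n_val:] + list(synthetic)
    rng.shuffle(train)
    return train, val, test
```

Splitting the real data first and augmenting afterwards is the design choice that matters here: it makes it impossible for a generated sample to end up in an evaluation set.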
3.3. Data Preprocessing and Preparation
This preprocessing step removes noise, including URLs, hashtags, mentions, and emojis. The Natural Language Toolkit (NLTK) [36] provides several useful functionalities for this purpose. The basic steps we applied when preprocessing text with NLTK are as follows:
Tokenization:
Tokenization involves splitting text into individual tokens. NLTK provides various tokenizers, such as word_tokenize for word-level tokenization and sent_tokenize for sentence-level tokenization. Formally, tokenization splits a text $S$ (a string) into a sequence of tokens as follows: $T = [t_1, t_2, \dots, t_n]$, where $t_1, t_2, \dots, t_n$ are the extracted tokens.
Example Sentence S: “Covid is an unseen enemy.”
For word tokenization: [“Covid”, “is”, “an”, “unseen”, “enemy”, “.”]
For sentence tokenization: [“Covid is an unseen enemy.”]
Case Conversion: Python string methods, like lower() or upper(), normalize the text by converting all characters to a consistent case. This prevents the duplication of words that differ only in letter casing.
For lowercase/uppercase conversion: “covid is an unseen enemy.” / “COVID IS AN UNSEEN ENEMY.”
Stop Word Removal: Stop words are commonly occurring words that usually do not carry much meaningful information, such as “a”, “the”, and “is”. NLTK provides a predefined list of stop words to reduce noise and focus on important content words. The mathematical representation of stop word removal is as follows:
$T' = \{\, t_i \in T \mid t_i \notin SW \,\}$
where $T = [t_1, t_2, \ldots, t_n]$ is the tokenized input text, and $SW$ is a predefined set of stop words.
Punctuation Removal: Python’s string.punctuation constant provides a list of symbols that can be removed using string operations or regular expressions. Their removal is mathematically similar to eliminating stop words from text: tokens that belong to the punctuation set are filtered out.
Lemmatization/Stemming: Lemmatization and stemming reduce words to their root or base form, helping to normalize vocabulary and minimize variations. NLTK supports this through stemmers such as PorterStemmer and SnowballStemmer, and through WordNetLemmatizer for lemmatization. Mathematically, lemmatization can be represented as follows:
$T' = [L(t_1), L(t_2), \ldots, L(t_n)]$
where $t_1, t_2, \ldots, t_n$ are the tokens of the input text, and $L$ is a lemmatization function that maps a word to its lemma.
Example: Input tokens: [“floods”, “enemies”]. Output lemmas: [“flood”, “enemy”].
Starting with the Initial Raw Data (IRD), as illustrated in
Figure 2, standard preprocessing techniques were employed to clean and structure the dataset, resulting in a reduced sample size and introducing class imbalance. To address this, we leveraged keyword-based sampling and prompt engineering using ChatGPT to generate additional representative samples. The resulting Final Experimental Dataset (FED) comprises approximately 109,000 samples for Class 0 and 108,900 for Class 1, totaling around 217,900 samples, as shown in
Table 1, thus ensuring class balance and robustness for subsequent experimentation.
The pseudocode in Algorithm 2 outlines the process applied to event-related tweets using NLTK. The input dataset (IMD.csv) is cleaned and saved as processed_tweets.csv. The pipeline includes tokenization, lowercasing, stop word and punctuation removal, lemmatization, spell correction, and reassembly of tokens. This ensures normalized, clean data suitable for training deep learning models. The transformation of the Initial Raw Data (IRD) into the Final Experimental Dataset (FED) was carried out through three sequential stages, namely data preprocessing, generative sampling, and sample balancing, as shown in
Figure 3.
| Algorithm 2: Data Preprocessing with NLTK Toolkit |
![Algorithm 2 pseudocode (image)]() |
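The pipeline stages listed for Algorithm 2 can be illustrated with a minimal, standard-library-only sketch. It is not the authors' NLTK implementation: the regular-expression tokenizer, the tiny `STOP_WORDS` set, and the toy `LEMMAS` table are hypothetical stand-ins for NLTK's tokenizers, stop word list, and WordNetLemmatizer, hashtags are assumed to be dropped whole, and spell correction is omitted.

```python
import re
import string

# Tiny illustrative stop word set (assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "and"}

# Toy lemma table (assumption) standing in for a real lemmatizer.
LEMMAS = {"floods": "flood", "enemies": "enemy", "cases": "case", "roads": "road"}

PUNCTUATION = set(string.punctuation)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def preprocess(tweet: str) -> str:
    """Clean one tweet and return the reassembled, normalized text."""
    text = URL_RE.sub(" ", tweet)                    # remove URLs
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)      # remove mentions and hashtags
    text = text.encode("ascii", "ignore").decode("ascii")  # drop emojis / non-ASCII
    tokens = TOKEN_RE.findall(text)                  # tokenization
    tokens = [t.lower() for t in tokens]             # case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    tokens = [t for t in tokens if t not in PUNCTUATION]   # punctuation removal
    tokens = [LEMMAS.get(t, t) for t in tokens]      # lemmatization (toy lookup)
    return " ".join(tokens)                          # reassembly


print(preprocess("Covid is an unseen enemy."))
```

Applied to the running example sentence, the sketch keeps only the content words in lowercase. In an NLTK-based version, each stand-in would be swapped for the corresponding library component while the order of stages stays the same.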
3.4. Dataset Aggregation and Validation
To finalize the data collection process, multiple publicly available datasets related to flood and COVID-19 events were aggregated and merged into a unified corpus. This approach ensures comprehensive coverage of diverse crisis scenarios and linguistic variations commonly observed in social media posts. For example, flood-related tweets capture reports of urban waterlogging, road closures, and evacuation alerts, whereas COVID-19-related tweets reflect rising infection rates, hospital capacity updates, and public health advisories.
Given the heterogeneous nature of these datasets, a systematic validation process was implemented to guarantee consistency and reliability. The merged dataset was manually inspected to ensure proper label alignment across sources and to remove inconsistent or irrelevant entries. For instance, tweets incorrectly labeled as flood-related or off-topic were excluded, while variations in phrasing and terminology were preserved to maintain the natural diversity of social media language.
This careful curation process ensures that the final unified dataset is both comprehensive and high-quality, providing a reliable foundation for training the proposed CNN–LSTM model. Representative examples from the unified dataset are presented in
Table 5. The combination of dataset aggregation, manual validation, and quality checks enables robust learning and accurate crisis event detection across multiple domains.
5. Experimental Configuration and Proposed Framework
After preprocessing the raw and unstructured data with the Natural Language Toolkit (NLTK) Python (3.4) package [
36], we obtain a clean, structured dataset ready for analysis. As shown in
Figure 2, preprocessing includes tokenization, case conversion, removal of stop words and punctuations, lemmatization, and spell checking, transforming the text into a form more suitable for machine learning applications.
Following this, each data instance undergoes a rigorous manual filtering process, where three trained annotators independently label samples to ensure high-quality, accurate labeling, and to remove ambiguous cases that could impact classification results. This carefully curated and annotated dataset serves as a solid foundation for training robust classification models, minimizing noise that can degrade performance. The annotated samples are then represented using two text representation techniques, Word2Vec and TF-IDF, that convert textual data into vectorized formats for effective input to deep learning models. These representations enable the models to capture syntactic and semantic relationships within the data. By applying these techniques, we construct two distinct feature representations, each capturing unique aspects of the data’s linguistic and contextual information. This variety enriches the learning process, allowing models to exploit both frequency-based term weights and distributed semantic relationships.
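To make the TF-IDF representation concrete, the following minimal sketch computes term weights for a toy corpus. The exact TF-IDF variant used in the study is not stated, so the formula here (relative term frequency multiplied by log(N / df)) and the function name `tfidf` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


corpus = ["flood water rising", "covid cases rising"]
for vec in tfidf(corpus):
    print(vec)
```

In this toy corpus, "rising" occurs in every document and therefore receives zero weight, while event-specific terms such as "flood" and "covid" receive positive weights, which is the behavior the text attributes to TF-IDF for short, keyword-driven tweets.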
To evaluate the proposed SOCIAL framework rigorously, the Final Experimental Dataset (FED) presented in
Table 1 was partitioned into train, validation, and test sets while ensuring data integrity and preventing leakage. After preprocessing to remove exact and near-duplicate tweets, splits were performed in a source-aware and user-aware manner: tweets originating from the same user or dataset source were restricted to a single partition, thereby mitigating the risk of highly similar instances appearing in multiple splits.
The resulting distribution allocated approximately 70% of instances for training, 15% for validation, and 15% for testing, with both COVID-19 and flood categories proportionally represented.
Table 6 summarizes the final split counts, showing balanced representation across both classes. This careful split strategy ensures that the CNN–LSTM model is evaluated on genuinely unseen content, providing a reliable assessment of generalization performance and robustness in real-world social media scenarios.
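One simple way to realize a user-aware split is to assign each user, rather than each tweet, to a partition. The sketch below does this with a deterministic hash; it is an illustrative assumption, not the authors' procedure, and the `user_id` field and function names are hypothetical. Because the assignment is made per user, the resulting tweet-level proportions only approximate 70/15/15.

```python
import hashlib


def assign_split(user_id: str, train: float = 0.70, val: float = 0.15) -> str:
    """Map a user to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


def split_by_user(records):
    """Group records so that all tweets of one user land in a single split."""
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        splits[assign_split(rec["user_id"])].append(rec)
    return splits


records = [{"user_id": f"u{i % 20}", "text": f"tweet {i}"} for i in range(200)]
splits = split_by_user(records)
print({name: len(recs) for name, recs in splits.items()})
```

The same idea extends to source-aware splitting by hashing a dataset-source identifier instead of, or together with, the user identifier.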
For model training, CNNs and LSTM networks were employed. Customized CNN and LSTM layers were also integrated into a hybrid model. Combining two embedding techniques with three model configurations resulted in six (2 × 3) distinct experimental setups. An overview is shown in
Figure 6. The dataset is partitioned into an 80:20 training-to-validation split, allowing the models to learn from a majority of the data while reserving a portion for validation, thus aiding in generalization. Performance is then evaluated using an independent test set derived from the original dataset.
To comprehensively assess model performance, we utilize several evaluation metrics, including accuracy, precision, recall, F1-score, and the Matthews Correlation Coefficient (MCC) derived from the confusion matrix shown in
Table 7.
We have used the F-score to calculate the balanced performance. It is the harmonic mean of the model’s precision and recall. This harmonic mean is calculated as presented in Equation (
18).
The $F_\beta$-score is a generalization of the F-score, which adds a configuration parameter called beta ($\beta$). A lower $\beta$ (e.g., 0.5) favors precision, while a higher $\beta$ (e.g., 2.0) favors recall. In this study, the $F_\beta$-score is calculated as shown in Equation (19).
The MCC, or phi coefficient, measures the association between two binary variables in statistics and evaluates binary classification quality in machine learning. It is preferred over the F1-score for providing a more balanced assessment regardless of the positive class. Formally, the MCC is calculated as shown in Equation (
20).
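For reference, the standard definitions of these three metrics are sketched below in terms of precision $P$, recall $R$, and the confusion-matrix counts $TP$, $TN$, $FP$, and $FN$. The equation tags are assumed to correspond to Equations (18)–(20); the symbols $P$ and $R$ are introduced here for illustration.

```latex
\begin{align}
F_1 &= \frac{2 \, P \, R}{P + R} \tag{18} \\
F_\beta &= \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R} \tag{19} \\
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{20}
\end{align}
```

Setting $\beta = 1$ in Equation (19) recovers the $F_1$-score, while $\beta = 0.5$ and $\beta = 2$ give the $F_{0.5}$ and $F_2$ scores reported in the results.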
Furthermore, we evaluated our models to thoroughly validate our results using key metrics, including the Negative Predictive Rate (NPR), False Positive Rate (FPR), False Discovery Rate (FDR), and False Negative Rate (FNR). A summary of these metrics associated with mathematical representation is provided in
Table 8.
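The metrics discussed above can all be derived from the four confusion-matrix counts. The following sketch uses the standard definitions; the function name `confusion_metrics` and the example counts are hypothetical and are not taken from the reported results. It assumes every denominator is nonzero.

```python
import math


def confusion_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "mcc": mcc,
        "npr": tn / (tn + fn),  # negative predictive rate (value)
        "fpr": fp / (fp + tn),  # false positive rate
        "fdr": fp / (fp + tp),  # false discovery rate = 1 - precision
        "fnr": fn / (fn + tp),  # false negative rate = 1 - recall
    }


# Hypothetical counts, for illustration only.
for name, value in confusion_metrics(tp=90, fp=10, fn=5, tn=95).items():
    print(f"{name:10s} {value:.4f}")
```

Note that FDR and FNR are the complements of precision and recall, respectively, which provides a quick consistency check when reading the tabulated values.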
6. Results and Discussion
An ablation study, presented in
Table 9,
Table 10 and
Table 11, was conducted to evaluate the contribution of individual components of the proposed framework. Specifically, we compared the performance of CNN (Word2Vec)-only, CNN (TF-IDF)-only, LSTM (Word2Vec)-only, LSTM (TF-IDF)-only, and the combined CNN–LSTM architecture under the same experimental settings. To provide a statistically robust assessment of model performance, all reported metrics, including accuracy, precision, recall, F0.5-score, and F1-score, are accompanied by 95% confidence intervals. These intervals were computed using bootstrap resampling with 500 iterations over the test set, capturing variability across different sample subsets. Reporting confidence intervals quantifies the uncertainty in the estimated metrics, allowing a more reliable interpretation of the proposed CNN–LSTM model’s performance on COVID-19 and flood event tweets. The results in
Table 10 demonstrate both high predictive accuracy and low variability, indicating strong stability and robustness of the framework.
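A percentile bootstrap of the kind described above can be sketched as follows. The 500 iterations and the 95% level come from the text; the percentile method, the fixed seed, and the helper names `bootstrap_ci` and `accuracy` are assumptions made for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_iter=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        # Resample test indices with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


# Hypothetical labels: 100 test items with 10 misclassified.
y_true = [0, 1] * 50
y_pred = [1 - y for y in y_true[:10]] + y_true[10:]
print(bootstrap_ci(y_true, y_pred, accuracy))
```

Any metric with the same `(y_true, y_pred)` signature, such as precision or an F-score, can be passed in place of `accuracy`.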
Table 9 presents the classification performance of various models for COVID-19 (Class 0) and flood (Class 1) events.
Word2Vec-CNN achieves high true positives (9907) and true negatives (9410) with low misclassification. Word2Vec-LSTM underperforms, showing higher false positives (451) and false negatives (332), reflecting its limitations on short, sparse texts. The Word2Vec-CNN + LSTM hybrid improves over LSTM alone but remains less effective than CNN, with 9788 true positives and 346 false positives. Models based on TF-IDF embeddings outperform Word2Vec. TFIDF-CNN demonstrates strong results (9866 true positives, 268 false positives), while TFIDF-LSTM shows increased false negatives (514). The TFIDF-CNN + LSTM model achieves the best performance, recording the highest true positives (9938) and the lowest false positives (196), indicating superior precision and balance. Overall, TFIDF-CNN + LSTM emerges as the most effective model, leveraging both keyword-focused feature extraction and sequential context learning for accurate event classification.
All models were evaluated based on accuracy, precision, recall, and F-scores (F0.5, F1, and F2) for event classification, as shown in
Table 10. Starting with Word2Vec-CNN, the model achieved 97.68% accuracy, 96.31% precision, 99.05% recall, 96.85% F0.5, 97.66% F1, and 98.49% F2. The high recall highlights its effectiveness in capturing nearly all relevant events, though with a slight drop in precision compared to others. The Word2Vec-LSTM model followed closely, achieving 96.86% accuracy, 95.29% precision, 98.41% recall, 95.90% F0.5, 96.82% F1, and 97.77% F2, indicating strong sequential modeling but slightly lower performance than CNN on short texts. The Word2Vec-CNN + LSTM hybrid improved on both, attaining 98.12% accuracy, 97.22% precision, 99.03% recall, 97.58% F0.5, 98.12% F1, and 98.66% F2, effectively capturing both local and sequential patterns by combining CNN’s and LSTM’s strengths.
The TF-IDF-CNN model achieved an accuracy of 97.78%, precision of 97.08%, recall of 98.48%, F0.5 score of 97.36%, F1 score of 97.78%, and F2 score of 98.20%. Compared to its Word2Vec-based counterpart (96.31%), this model’s precision is higher, reflecting the TF-IDF feature representation’s advantage in focusing on specific, event-related terms within short texts like tweets. Next, the TF-IDF-LSTM model scored 97.65% in accuracy, 97.06% in precision, 98.24% in recall, 97.29% in F0.5 score, 97.64% in F1 score, and 97.99% in F2 score.
Finally, the TF-IDF-CNN + LSTM hybrid model stands out with the highest scores across all metrics: an accuracy of 98.59%, precision of 98.13%, recall of 99.06%, F0.5 score of 98.31%, F1 score of 98.59%, and F2 score of 98.87%. This model merges CNN and LSTM strengths with TF-IDF, capturing both local and sequential patterns while highlighting key event terms. Its high precision and recall demonstrate accurate event detection with minimal false positives and negatives, making it the top-performing model. We also recorded values for NPR, FPR, FDR, and FNR against each trained model, as shown in
Table 11. As mentioned, the results provide insights into the performance of various models in classifying COVID-19 pandemic-related events (positive class) and flood-related events (negative class). The TF-IDF-CNN + LSTM model again achieves the best performance across most metrics, with the highest MCC of 0.9719, indicating strong overall classification reliability. It also exhibits the highest NPR of 0.9813, confirming its accuracy in identifying flood-related events, and the lowest FNR of 0.0187, ensuring minimal misclassification of COVID-19 events. Overall, the results highlight the superior performance of the TF-IDF-CNN+LSTM model due to its balance across all metrics, making it the most reliable choice for event classification in this context.
Furthermore, it should also be noted that the current evaluation is restricted to two event domains, namely pandemic (COVID-19) and natural disaster (flood), within a unified and consistently preprocessed dataset. While this setup allows for controlled experimentation and reliable assessment of the proposed SOCIAL framework, it does not directly evaluate cross-domain transferability or robustness across different languages and regions. Extending the model to out-of-domain crises, such as earthquakes, wildfires, or conflict-related events, as well as datasets from diverse geographical regions and languages, would require additional adaptation. This may include domain-specific fine-tuning, incorporation of multilingual embeddings, or other transfer learning strategies to ensure robust performance across heterogeneous contexts. These considerations define a clear direction for future research aimed at improving the model’s generalization in real-world, multi-domain crisis-event detection scenarios.
6.1. Cross-Domain Analysis
To assess the robustness and generalization capability of the proposed CNN–LSTM model, a cross-domain analysis was conducted across the different event categories in the unified dataset. Specifically, the model was trained on the combined dataset containing both flood and COVID-19 tweets and then evaluated separately on each domain. This evaluation allows us to examine how well the model captures domain-invariant features and handles differences in linguistic patterns, vocabulary, and context between distinct event types.
The results, summarized in
Table 12, show that the model maintains consistently high performance across domains. When trained on the combined dataset, the model achieved 98.45% accuracy on flood-related tweets and 98.72% accuracy on COVID-19-related tweets. Precision, recall, and F1-score metrics also remain high and comparable across both domains, indicating effective generalization without favoring one event type over another.
Qualitative observations further support these findings. Flood-related tweets often contain terms like “waterlogging”, “evacuation”, and “road closures”, while COVID-19-related tweets emphasize “cases”, “hospital”, and “lockdown”. Despite these differences, the CNN–LSTM model successfully captures the underlying semantic context in both domains, demonstrating strong domain-invariant feature learning. This capability is essential for real-world deployment, where social media streams often contain posts from multiple event types simultaneously.
Overall, the cross-domain analysis confirms that the proposed model is robust and generalizable. By performing consistently across heterogeneous datasets, the framework proves suitable for detecting multiple types of crisis events in real-time social media streams, ensuring reliable performance in complex, multi-domain environments.
6.2. Optimized CNN + LSTM Layered Architecture for Event Detection
The best-performing integrated and customized CNN-LSTM architecture in
Figure 7 is tailored for event detection in social media text, combining CNN for local feature extraction and LSTM for modeling sequential dependencies. As shown in the figure, the tokenized input is first transformed into embedding vectors via an embedding layer. These vectors are processed through a CNN layer to capture spatial structures and local patterns. The extracted features are then passed to an LSTM layer to learn temporal and contextual relationships. Following this, a max-pooling layer reduces dimensionality by retaining the most salient features, mitigating overfitting. Finally, the LSTM output is fed into a dense layer with a sigmoid activation function for event prediction.
Figure 8 presents a detailed breakdown of the layers of the proposed architecture.
The architecture begins with an Input Layer of shape (100, 300), representing a sequence of 100 tokens, each embedded in a 300-dimensional space. This ensures a consistent structure for all textual inputs. Next, an embedding layer incorporating TF-IDF representations is employed to assign higher weights to rare yet informative terms, thereby enhancing feature discrimination. The embedded sequences are then passed through three parallel 1D convolutional layers with kernel sizes of 3, 4, and 5. These layers act as n-gram detectors, capturing trigrams, four-grams, and five-grams to learn diverse local linguistic patterns. The outputs are concatenated to form a unified feature map, allowing multi-scale feature extraction. A combination of batch normalization, max pooling, and dropout follows: batch normalization accelerates and stabilizes training, max pooling reduces spatial dimensions, and dropout mitigates overfitting. The pooled features are fed into a Bidirectional LSTM Layer with 256 units in each direction, allowing the model to capture both past and future context in the sequence. Another Dropout Layer is applied for further regularization. The resulting feature map is flattened via a flatten layer and passed through a sequence of dense layers, first with 128 units (followed by batch normalization and dropout) and then with 64 units, to enable deep feature abstraction.
Finally, the output layer applies Softmax for multi-class or Sigmoid for binary classification, producing final prediction probabilities. This hybrid CNN–BiLSTM model effectively captures both local and sequential patterns, making it well-suited for robust event detection in noisy and dynamic social media environments.
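To make the layer stack easier to follow, the sketch below traces output shapes through the described architecture without building a real network. Several values are assumptions not given in the text: 128 filters per convolution branch, "same" padding (so the three branches can be concatenated along the channel axis), a pool size of 2, and a BiLSTM that returns the full sequence so that the flatten layer has something to flatten.

```python
def trace_shapes(seq_len=100, emb_dim=300, filters=128, kernel_sizes=(3, 4, 5),
                 pool_size=2, lstm_units=256, dense_units=(128, 64)):
    """Return (layer name, output shape) pairs, omitting the batch dimension."""
    shapes = [("input", (seq_len, emb_dim))]
    # With "same" padding, every Conv1D branch keeps the sequence length.
    for k in kernel_sizes:
        shapes.append((f"conv1d_k{k}", (seq_len, filters)))
    channels = filters * len(kernel_sizes)
    shapes.append(("concatenate", (seq_len, channels)))
    # Batch normalization and dropout leave shapes unchanged, so they are skipped.
    pooled_len = seq_len // pool_size
    shapes.append(("max_pooling", (pooled_len, channels)))
    # Forward and backward LSTM outputs are concatenated per time step.
    shapes.append(("bilstm", (pooled_len, 2 * lstm_units)))
    shapes.append(("flatten", (pooled_len * 2 * lstm_units,)))
    for units in dense_units:
        shapes.append((f"dense_{units}", (units,)))
    shapes.append(("output_sigmoid", (1,)))
    return shapes


for name, shape in trace_shapes():
    print(f"{name:15s} {shape}")
```

Under these assumptions the flatten layer produces 25,600 features, which shows that most of the dense-layer parameters sit in the first 128-unit layer; a different pool size or a BiLSTM that returns only its final state would change this substantially.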
Unlike transformer-based models, which typically require large-scale training data and substantial computational resources, the proposed CNN–LSTM architecture is optimized for efficiency and performs effectively on relatively smaller datasets. The convolutional layers are designed to capture salient local patterns and event-specific keywords that are characteristic of short, noisy social media posts, while the LSTM layer models sequential dependencies to retain contextual information within the text. This combination enables the model to achieve high classification accuracy without the extensive computational overhead associated with transformer-based architectures.
Moreover, compared with BiLSTM–attention networks, which introduce additional complexity through attention mechanisms, the SOCIAL framework emphasizes a streamlined and lightweight architecture tailored for real-time crisis event detection. By focusing on efficient feature extraction and robust sequence modeling, the proposed design balances representational power and computational efficiency, making it particularly suitable for short-text, noisy, and rapidly evolving social media streams. These architectural choices distinguish SOCIAL from existing approaches and justify its effectiveness for real-time event detection applications.
Table 13 summarizes the key differences between the proposed CNN–LSTM SOCIAL framework and transformer-based models. As shown, the SOCIAL architecture achieves a balance between representational capability and computational efficiency, making it well suited to short, noisy social media posts. In contrast, BiLSTM–attention networks introduce additional complexity through attention mechanisms, and transformer-based models require large-scale data and substantial computational resources, which can limit their applicability for real-time event detection in dynamic social media streams.
An ablation study was conducted to evaluate the contribution of each component within the SOCIAL framework, with results summarized in
Table 14. As mentioned before, the hybrid CNN + LSTM model achieves the highest performance, with an accuracy of 98.59% and an F0.5-score of 98.31%, outperforming both standalone CNN (accuracy: 97.78%, F0.5-score: 97.36%) and LSTM (accuracy: 97.65%, F0.5-score: 97.29%) models. This demonstrates that the combination of convolutional layers for local feature extraction and LSTM layers for sequential dependency modeling captures complementary information, enhancing the classification of short, noisy social media text.
The importance of preprocessing steps is also evident. Removing lemmatization reduces the F0.5-score to 96.90%, while skipping spell correction further decreases it to 96.70%, indicating that these steps are critical for mitigating lexical variability and noise. Collectively, these results confirm that both architectural design and preprocessing strategies are essential for achieving optimal model performance and that the integrated CNN + LSTM configuration provides a robust and efficient framework for crisis-event detection.
This study focuses on evaluating the CNN–LSTM architecture and preprocessing strategies under a consistent experimental setup for short, noisy social media text. Classical models such as SVM and logistic regression showed comparatively lower performance in preliminary experiments due to limited contextual modeling. Furthermore, although the SOCIAL framework demonstrates strong performance under controlled experiments, real-world deployment presents additional challenges. Key considerations include robustness to domain shifts arising from different event types, languages, or regions; the ability to generalize to previously unseen crises; reliance on curated and balanced datasets; real-time constraints such as inference latency and throughput; computational and memory efficiency; and minimizing bias or misinformation amplification. Addressing these factors through domain adaptation, transfer learning, and evaluation on diverse real-world datasets is essential for ensuring reliable, scalable, and robust operational performance.
7. Comparative Analysis
Table 15 shows a comparative analysis between existing methods and the proposed SOCIAL framework. The F1-score is selected as the primary performance metric due to its balanced consideration of both precision and recall.
Accuracy, while commonly used, can be misleading in imbalanced datasets where one class dominates, potentially overstating model performance. Precision and recall individually focus on minimizing false positives and false negatives, respectively. The F1-score, being the harmonic mean of precision and recall, offers a more comprehensive evaluation by accounting for both types of errors simultaneously. This makes it particularly suitable for event classification tasks involving noisy and imbalanced social media data, where accurately identifying relevant events and minimizing misclassifications are crucial.
As shown in the table, the proposed SOCIAL framework, combining CNN, LSTM, and CNN + LSTM models, achieves an F1-score of up to 98.59% (F0.5-score: 98.31%) across diverse textual events, outperforming prior approaches. SatCoBiLSTM [12] scored 94–89% on CrisisLexT6, CrisisLexT26, and 2CTweets; CrisisSpot [16] ranged from 50.1% to 95.5% across datasets; and dual-CNN/COfEE models achieved 70.3–81.4%. SOCIAL’s consistently high performance demonstrates the advantage of integrating convolutional and sequential modeling for robust event detection.
7.1. Key Observations from Comparative Analysis
The following are some of the observations from comparative analysis:
Overall Performance Superiority: The proposed SOCIAL framework achieves the highest classification performance across all compared models, with CNN + LSTM delivering an F0.5-score of 98.31% and an F1-score of 98.59%, exceeding previously reported results.
Hybrid Model Effectiveness: While some studies leverage hybrid approaches (e.g., SatCoBiLSTM, CrisisSpot, and COfEE), none match the performance of the optimized CNN + LSTM architecture in SOCIAL, which effectively captures both spatial dependencies and temporal patterns in event data.
Real-time Adaptability: Many existing models rely on pre-trained datasets or multimodal inputs, which can limit real-time event classification. SOCIAL’s deep learning-based design is optimized for real-time social media streams, enhancing its practical application in crisis management.
This comparative analysis highlights that while previous approaches have contributed to event classification, they often suffer from dataset dependency, inconsistent performance, or limited adaptability.
Furthermore, while transformer-based models, such as BERT and its variants, have demonstrated state-of-the-art performance in many NLP tasks, they typically require large-scale pretraining and fine-tuning on massive corpora to achieve optimal results. In contrast, the Final Experimental Dataset (FED) used in this study comprises approximately 217,900 short-text instances, balanced between 109,000 COVID-19 tweets and 108,900 flood tweets (
Table 1). Given the relatively moderate dataset size and the inherently noisy and unstructured nature of social media text, training large Transformer models could lead to overfitting, high computational costs, and suboptimal generalization on short, context-sparse sequences.
The CNN-LSTM architecture was selected to address these specific challenges. Convolutional layers efficiently extract local n-gram features and salient patterns from short text sequences, capturing event-specific lexical cues even in noisy contexts. Sequential dependencies across tokens are then modeled by LSTM layers, enabling the network to learn temporal or contextual relationships critical for accurate event classification. This hybrid approach combines the strengths of both local feature extraction and sequential modeling, providing a computationally efficient alternative to transformers while achieving robust predictive performance. By leveraging this architecture, the proposed framework maintains scalability, reduces training time, and remains effective for real-time crisis event detection, making it well-suited for datasets like FED, where both data volume and sequence length are constrained.
Finally, while the current study focuses on CNN–LSTM, evaluating transformer-based baselines remains an important direction for future research, which would provide additional comparative insights and further validate the effectiveness of the proposed architecture.
7.2. Justification and Limitations of Comparative Analysis
As we can observe,
Table 15 presents a comparative performance analysis of the proposed SOCIAL framework with several state-of-the-art approaches reported in the literature. While such comparisons are useful for positioning the proposed model within the broader research landscape, it is important to acknowledge the inherent challenges associated with direct numerical comparison across different studies.
One major limitation arises from the diversity of datasets and experimental configurations used in prior work. Many existing studies rely on domain-specific datasets such as CrisisLex, CrisisMMD, or multimodal datasets incorporating textual, visual, and contextual features. In contrast, the proposed study utilizes a unified dataset combining two distinct event categories, namely floods (natural disaster) and COVID-19 (pandemic-related events). This cross-domain setup introduces variability in linguistic patterns, context, and data distribution, making it difficult to identify studies with an exactly matching experimental setting. Therefore, the selected studies represent the closest possible comparisons in terms of objectives and application domains.
Furthermore, several compared approaches employ multimodal inputs or additional contextual features (e.g., images, graph structures, or social context), whereas the proposed framework focuses primarily on textual data. This difference in input modalities and feature representations further limits the possibility of direct equivalence in performance comparison. As such, the results presented in
Table 15 should be interpreted as indicative rather than strictly comparable.
To provide a fair basis for comparison, the F1-score is used as the primary evaluation metric. The F1-score offers a balanced measure of precision and recall, making it particularly suitable for social media event detection tasks where class imbalance and noisy data are common. By emphasizing F1-score, the comparison avoids bias toward any single performance aspect and enables a more meaningful evaluation across studies with varying class distributions and experimental conditions.
Overall, while acknowledging these limitations, the comparative analysis indicates that the proposed SOCIAL framework achieves competitive, and in the reported settings higher, performance relative to closely related approaches. This highlights its effectiveness in handling noisy, short-text social media data and supports its applicability for real-time crisis event detection across diverse event categories.
8. Conclusions and Future Work
This study introduces a robust deep learning-based framework called SOCIAL (Social Media Event Classification using Integrated Artificial Learning and NLP), aimed at classifying COVID-19 pandemic-related and flood-related events from social media streams. The proposed architecture, along with two prominent text representation techniques, Word2Vec and TF-IDF, effectively combines CNN’s local feature extraction and LSTM’s ability to model long-term dependencies, making it robust for event detection in noisy social media streams. Comprehensive evaluations were conducted across a range of performance metrics, including traditional measures like accuracy, precision, recall, and F1-score, as well as more refined metrics such as the Matthews Correlation Coefficient (MCC), False Positive Rate (FPR), False Discovery Rate (FDR), and Negative Predictive Rate (NPR).
The results demonstrate the effectiveness of the TF-IDF-CNN + LSTM model, which achieved the highest accuracy (98.59%), MCC (0.9719), precision (0.98), and recall (0.99), and the lowest FPR (0.0094) and FNR (0.0187). These findings indicate that the hybrid approach effectively captures both local patterns and long-term dependencies, particularly when paired with TF-IDF representations that emphasize distinctive event-related terms. The comparison between models also highlights that hybrid architectures such as Word2Vec-CNN + LSTM (accuracy 98.12%) and TF-IDF-CNN + LSTM (accuracy 98.59%) outperform standalone Word2Vec-based CNNs with accuracy (97.68%), precision (0.96), recall (0.99), F0.5-score (0.97), F1-score (0.97), and F2-score (0.98), and Word2Vec-based LSTMs with accuracy (96.86%), precision (0.95), recall (0.98), F0.5-score (0.96), F1-score (0.96), and F2-score (0.97). Furthermore, Word2Vec-based models score lower than their TF-IDF counterparts, as presented in the results. Therefore, TF-IDF representations are considered better suited than Word2Vec for classifying context-light texts like tweets.
This study underscores the importance of utilizing integrated deep-learning techniques for event classification in social media streams, where rapid and accurate event identification can significantly aid in disaster response and crisis management. The comprehensive evaluation framework and insights into misclassification trends further enhance the reliability and applicability of the proposed solution. Overall, the SOCIAL framework sets a solid foundation for real-time, scalable event classification systems capable of addressing diverse challenges in social media analytics.
Future Research Directions
Event detection on social media streams has made notable progress, but challenges such as noisy data, fine-grained event detection, and multimodal integration persist. This section highlights key areas for advancing research in this field, which are also presented visually in Figure 9.
Incorporation of Multimodal Data and Techniques: Future work could expand the framework by incorporating images, videos, or user interaction data from social media platforms to enhance the accuracy of event classification, especially for events with strong visual or multimodal characteristics. Furthermore, incorporating techniques such as stance detection [32] and fake profile detection [41] enhances event detection by improving reliability and filtering misinformation. Stance detection analyzes user opinions to identify bias and misleading narratives, while fake profile detection eliminates inauthentic accounts that amplify false information. Integrating these techniques strengthens event classification, ensuring more accurate and trustworthy detection in social media streams.
Cross-Linguistic Event Detection: To address the global nature of crises, the framework could be extended to handle multilingual social media streams, leveraging advanced embeddings such as BERT, its multilingual variant mBERT, or LASER. Several studies have highlighted the effectiveness of word embeddings and deep learning, offering valuable insights for enhancing event detection on social media. The MPAN model further improves precision and recall through multilevel attention mechanisms.
Adaptation to Emerging Events: Developing mechanisms for transfer learning or zero-shot learning could enable the framework to adapt to emerging, previously unseen event categories with minimal retraining.
Cross-Domain and Multilingual Generalization: Future work will evaluate the generalization capability of the SOCIAL framework across out-of-domain crises (e.g., earthquakes, wildfires, and conflicts) and datasets from diverse geographical regions and languages. Cross-dataset transfer learning can be explored to assess the robustness of the framework in multi-domain, real-world crisis detection applications.
Scalability in High-Velocity Streams: Investigating lightweight, distributed models could improve the system’s ability to process high-velocity social media streams in real time, ensuring timely event detection even during data surges.
Incorporation of External Knowledge: Integrating domain-specific knowledge graphs or ontologies could further refine the classification process by providing additional contextual information about events.
Model Interpretability: Enhancing the interpretability of the models used within the framework would allow researchers and practitioners to understand and trust classification decisions, especially in high-stakes scenarios like disaster management.
Bias Mitigation: Addressing potential biases in the training data or model predictions is crucial to ensure fair and equitable event classification across different regions, demographics, or event types.
Application to Non-Crisis Events: Expanding the framework’s scope to include non-crisis events, such as marketing trends or political developments, could broaden its applicability to other domains.
Integration with Real-Time Monitoring Systems: Future studies could focus on deploying the SOCIAL framework as a component of real-time monitoring and alerting systems, seamlessly integrating it with public safety and disaster response infrastructure.
Exploration of Advanced Architectures: Investigating advanced neural network architectures, such as transformers or graph neural networks, could further enhance the framework’s ability to model complex relationships in text data for event classification.
By addressing these directions, the SOCIAL framework can evolve into a more versatile and comprehensive tool for real-time social media analytics, benefiting a wide range of applications from disaster response to trend analysis.