Article

An ML-Based Approach to Leveraging Social Media for Disaster Type Classification and Analysis Across World Regions

by Mohammad Robel Miah 1, Lija Akter 1, Ahmed Abdelmoamen Ahmed 1,*, Louis Ngamassi 2 and Thiagarajan Ramakrishnan 2

1 Department of Computer Science, Prairie View A&M University, Prairie View, TX 77446, USA
2 Department of Accounting, Finance and MIS, Prairie View A&M University, Prairie View, TX 77446, USA
* Author to whom correspondence should be addressed.
Computers 2026, 15(1), 16; https://doi.org/10.3390/computers15010016
Submission received: 30 November 2025 / Revised: 22 December 2025 / Accepted: 28 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)

Abstract

Over the past decade, the frequency and impact of both natural and human-induced disasters have increased significantly, highlighting the urgent need for effective and timely relief operations. Disaster response requires efficient allocation of resources to the right locations and disaster types in a cost- and time-effective manner. However, during such events, large volumes of unverified and rapidly spreading information—especially on social media—often complicate situational awareness and decision-making. Consequently, extracting actionable insights and accurately classifying disaster-related information from social media platforms has become a critical research challenge. Machine Learning (ML) approaches have shown strong potential for categorizing disaster-related tweets, yet substantial variations in model accuracy persist across disaster types and regional contexts, suggesting that universal models may overlook linguistic and cultural nuances. This paper investigates the categorization and sub-categorization of natural disaster tweets using a labeled dataset of over 32,000 samples. Logistic Regression and Random Forest classifiers were trained and evaluated after comprehensive preprocessing to predict disaster categories and sub-categories. Furthermore, a country-specific prediction framework was implemented to assess how regional and cultural variations influence model performance. The results demonstrate strong overall classification accuracy, while revealing marked differences across countries, emphasizing the importance of context-aware, culturally adaptive ML approaches for reliable disaster information management.

1. Introduction

Disaster management is a comprehensive approach that encompasses preparedness, mitigation, response, and recovery phases to minimize the adverse effects of natural or human-made disasters [1]. It requires careful planning, organization, and coordination to save human lives, property, and the environment [2]. Social media posts feature early alert messages, real-time ground reports of impacts, requests for assistance, and updates on relief efforts. Consequently, they become valuable data streams for situational awareness, public messaging, and decision-making during an emergency.
Effective disaster management mitigates potential catastrophes by transforming them into manageable disruptions. Early warnings, evacuation plans, and resilient infrastructure—such as adaptive building codes, flood defenses, and vegetation management—reduce loss of life and property. Coordinated response efforts ensure the timely delivery of rescue operations and essential resources, while sustained recovery strategies strengthen social and economic stability. In the United States, climate change has intensified extreme weather events, including wildfires, droughts, and hurricanes, by increasing temperatures and humidity, raising precipitation levels, and exacerbating flooding. Consequently, disasters are occurring more frequently and with greater severity, disrupting agriculture, public health, and the livelihoods of many.
The use of social media in disaster management has become increasingly relevant, as platforms like X (previously known as Twitter) provide immediacy in disseminating critical information and reaching large audiences in real time. Rescue authorities use it for alerts and evacuation warnings, especially during hurricanes, wildfires, and floods, to quickly disseminate time-critical information to the public [3,4]. Locals use it for neighborhood updates and service requests, while geotags and hashtags make it easier to follow events and mobilize organized responses [5,6]. Social media thus enhances situational awareness along the data-to-action continuum of preparedness, response, and lifesaving interventions.
Disaster tweet classification is a multifaceted problem that must account for the socio-cultural and linguistic diversity involved. Tweets related to disasters serve as the basis for this work, and a framework based on Logistic Regression and Random Forest is developed to implement our approach for predicting category and subcategory labels. The timely classification of disaster-related tweets is critical for emergency response, early warning systems, and situational awareness. While past research has categorized tweets into high-level classes (e.g., natural vs. human-induced disasters), few studies have explored sub-categorization or examined model performance across regional and cultural boundaries. This paper addresses three research questions. RQ1: How can we categorize and sub-categorize natural disaster tweets? RQ2: How do different ML models perform across different disaster tweet categories? RQ3: How does model performance vary across various cultural and regional contexts?
This work develops and evaluates a robust disaster tweet classification framework that can automatically identify, categorize, and sub-categorize social media posts related to disasters, while also assessing the effectiveness of machine learning models across different contexts. RQ1 aims to design a systematic approach for categorizing and sub-categorizing tweets (e.g., Natural vs. Human-Induced disasters, and further into specific disaster types such as floods, earthquakes, or accidents). This ensures a structured and meaningful representation of social media data. RQ2 aims to compare and analyze ML models (e.g., Logistic Regression and Random Forest) to determine which algorithms provide the most accurate and reliable classification results across disaster-related categories. RQ3 extends the analysis by exploring regional and cultural variations in disaster-related tweets. Differences in how languages are used across cultures may present a significant challenge for machine learning models. This ensures that the models are not only technically adequate but also socially and contextually relevant, highlighting how cultural context may influence tweet language, interpretation, and classification accuracy.
The remainder of this paper is organized as follows. Section 2 reviews existing research in disaster management using social media. Section 3 describes the design and implementation of the proposed system, detailing dataset preparation, preprocessing, and the development of ML models. Section 4 presents the evaluation and result analysis, including performance assessment and cross-country comparison. Finally, Section 5 concludes the paper by summarizing the findings and outlining potential directions for future research.

2. Related Work

Disaster management tweets provide time-stamped and geolocated insights into impacts, needs, and public emotions during crises. Existing research [7,8,9,10] leverages these data for rapid event detection, damage and needs assessment, situational awareness, and misinformation control. ML pipelines for classification and sentiment analysis enhance resource prioritization, regional assessment, and cross-border comparison, thereby improving the effectiveness of these processes. The literature further emphasizes the significance of data quality, bias mitigation, ethical considerations, and rigorous training and validation to ensure reliable model performance.
Khatoon et al. [11] present a cloud-based, multimodal social media analytics framework for disaster response. In their image-classification module, they compare transfer-learning CNNs (i.e., Social Media, AlexNet, and ResNet). The best CNN is a fine-tuned CNN-16 with 95% accuracy, while ResNet is used for situational awareness. In a similar direction, Sinha et al. [12] propose an AI-based decision support system for Twitter analytics in the disaster space, where the sentiment of tweets is labeled using TextBlob, and a Long Short-Term Memory (LSTM) classifier is trained on about 1200 collected tweets. The mainstay model is the LSTM, which the authors state achieves 97.5% accuracy for tweet-level sentiment classification, outperforming several traditional ML baselines detailed in their comparison (e.g., Random Forest 91–95%, Gradient Boosting 92–96%, SVM 80–90%, XGBoost 93–95%, AdaBoost 90–94%, and ensembles 93–97%).
Building on multilingual analysis, Sufi et al. [13] developed a completely automated AI pipeline for disaster monitoring that combines multilingual NLP with geospatial analytics. This pipeline comprises components for named entity recognition, which extracts locations, sentiment, and category classification. For comparison, Lamsal et al. [14] introduce CReMa, a crisis-response matcher that integrates textual, temporal, and spatial signals on social media. The matcher is built on CrisisTransformers and a cross-lingual embedding space, achieving a top-3 matching accuracy of 0.90 with exhaustive search. The concurrent classification modules, powered by CrisisTransformers, obtain strong F1 scores on three identification tasks, highlighting the model’s efficacy before matching.
Focusing on wildfire forecasting, Laksito et al. [15] construct an Indonesian-language wildfire tweet dataset and evaluate sequence models—GRU, LSTM, bidirectional GRU, and bidirectional LSTM–under class imbalance. The BiLSTM is the most effective model, achieving 89.95% test accuracy (and 98.09% training accuracy) with Random Oversampling (ROS). Additionally, ROS is implemented for GRU (89.73%), LSTM (89.64%), and BiGRU (89.70%). SMOTE significantly reduces performance, emphasizing that ROS is the preferred augmentation for this task.
Turning to rescue-tweet detection, Khallouli et al. [16] suggest an integrated rescue-tweet detector that classifies via an MLP and fuses a fine-tuned BERT (low-level text features) with rule-based regex indicators (high-level, task-specific features). They evaluate the datasets with AUC-PR/F1 instead of accuracy due to their high imbalance. The accuracy of the proposed model is not reported by design, as it achieves an AUC-PR of 0.9621 and an F1 score of 0.9093 on Hurricane Harvey, and an AUC-PR of 0.9614 and an F1 score of 0.9047 on the Ida/Ian set.
Franceschini et al. [7] employ a transformer-based text classifier to identify Italian-language landslide reports on Twitter. This classifier stacks a CNN head on top of BERT embeddings. The best model among four settings is BERT-multicase + CNN without preprocessing, which achieves 96.10% accuracy (AUC = 0.96) on the held-out test set. This model outperforms the lighter DistilBERT + CNN variant (95.50%) and demonstrates that aggressive text cleaning can negatively impact performance.
Schmidt et al. [17] propose a geo–social media–driven early warning pipeline for wildfires that integrates semantic filtering of georeferenced tweets using a fine-tuned multilingual XLM-RoBERTa classifier (“Disaster-RoBERTa”), further refined through active learning, before spatio-temporal hotspot detection and satellite tasking. The RoBERTa model–pretrained on 198 million tweets and fine-tuned on CrisisLex data–achieves 94% accuracy on a 44,848-tweet CrisisLex test set (compared to 70% for keyword filtering) and 80% accuracy on a 364-tweet Chile wildfires dataset (vs. 71% keyword). These results demonstrate notable improvements over keyword-based baselines, reinforcing the study’s conclusion that social sensing can effectively anticipate satellite-detected wildfire footprints in populated areas.
He et al. [18] introduce SiamTCN, a Siamese text-classification network designed for the multi-class, multi-label extraction of typhoon-related information from Sina Weibo posts. SiamTCN outperforms robust baselines such as BERT-BiLSTM (0.8907), BERT-LSTM (0.8864), and Word2Vec-BiLSTM (0.8561) by achieving a top precision of 0.9454 (averaged across typhoon categories) on their benchmark. Ablation studies reveal that SingleTCN-1 and SingleTCN-2 achieve precision scores of 0.9292 and 0.9124, respectively, while SiamTCN-1 (without weighted KNN) achieves a score of 0.9271.
Wang et al. [19] applied a three-step social sensing pipeline to study the Dingri earthquake of 7 January 2025. The pipeline identifies discussion themes with BERTopic, couples topics with emotions using SKEP, and classifies damage levels with a BERT-based classifier (4 classes; n = 900 labeled posts; 60/20/20 split). The authors report per-class precision, recall, and F1 rather than a single accuracy figure. At the extremes, the results show strong performance, including perfect classification of the severe-damage class (F1 = 1.00) and relatively high performance for the no-damage class. Confusion matrices indicate semantic overlap between adjacent moderate damage levels, where precision is correspondingly lower. The paper does not report a single macro precision; instead, it highlights per-class precision and the model's effectiveness for extreme damage detection.
Ngamassi et al. (2022) [20] analyzed 28,041 Hurricane Harvey tweets using text mining and Latent Dirichlet Allocation (LDA) for topic modeling. The method achieved an overall classification accuracy of 92%, identifying major themes such as emergency response, community concerns, and climate change. The findings showed that social media mining can effectively capture real-time public sentiment and enhance disaster communication. The study further proposed policy recommendations to improve clarity, coordination, and standardized information sharing during crises.
In conclusion, prior research has demonstrated the effectiveness of ML in disaster tweet classification; however, it remains limited by a lack of sub-categorization and cross-cultural analysis. Most studies employ binary or coarse-grained classifications—typically distinguishing only between relevant and non-relevant tweets—thereby overlooking specific needs, damages, or services. Moreover, the predominant focus on single-country or single-language contexts neglects cultural and linguistic variation in disaster communication. Addressing these gaps, the present study introduces a fine-grained, context-aware framework that advances methodological transparency, enhances reproducibility, and broadens the applicability of findings across diverse disaster management settings. In contrast, this paper presents a general and scalable framework for analyzing disaster tweets. The methodology of this study differs from existing literature because it not only employs established machine learning approaches but also introduces two critical extensions: the sub-categorization of disaster tweets and the exploration of cross-cultural variations.

3. Design and Methodology

Figure 1 presents the system architecture of the proposed ML-based framework for disaster-related tweet classification and regional analysis. The architecture is structured as a four-layer framework, purposefully designed to enable both automated disaster type identification and cross-regional analytical capabilities. Layer 1 (Data Collection) integrates multiple publicly available sources, including the Twitter API, the CrisisLex T26 corpus, and the Hurricane Florence dataset. These heterogeneous data streams are aggregated into a unified corpus comprising approximately 32,228 tweets, encompassing a diverse range of disaster categories and geographic regions. This comprehensive data integration ensures broad coverage of linguistic, contextual, and regional variations, which are critical for achieving accurate disaster classification and robust comparative analysis across world regions.
Layer 2 (Preprocessing and Feature Extraction) encompasses a sequence of linguistic normalization and feature representation procedures, including text cleaning, tokenization, lemmatization, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The lemmatization process standardizes words to their base forms (e.g., “floods” to “flood”), thereby enhancing linguistic consistency and reducing lexical redundancy. During the text cleaning stage, extraneous elements such as URLs, punctuation, and non-informative symbols are removed to ensure the semantic purity of the corpus. The TF-IDF transformation subsequently converts the preprocessed text into a high-dimensional numerical feature matrix, facilitating the effective capture of salient linguistic patterns for downstream model learning.
Layer 3 (ML Models) implements Logistic Regression and Random Forest classifiers to categorize tweets into predefined disaster types. Model performance is rigorously evaluated through learning-curve accuracy trends and confusion matrix analyses, achieving an overall classification accuracy of 97% across all categories. Layer 4 (Regional and Cultural Analysis) focuses on the interpretive assessment of classification outcomes across geographic and cultural dimensions. By mapping predicted disaster types to their corresponding countries and regions, this layer generates visual heatmaps and bar charts to illustrate spatial and cultural variations in disaster-related discourse. This integrative, multi-layered architecture thus enables both accurate disaster classification and nuanced regional analysis, supporting insights into the cultural and geographic dynamics underlying social media communications during crisis events.

3.1. Dataset Collection

Table 1 presents the distribution of tweets across the constituent datasets and disaster categories. The Hurricane dataset comprises a total of 4771 tweets, of which 473 are categorized as human-induced, 3913 as natural, and 385 as a mix of both. The Tweet dataset is comparatively larger, containing 27,933 tweets, including 10,382 human-induced, 16,548 natural, and 1000 mixed instances. When combined, these sources yield a final corpus of 32,228 tweets, encompassing 10,382 human-induced, 20,461 natural, and 1385 mixed tweets. This integrated dataset provides a comprehensive foundation for model training and evaluation, capturing a diverse spectrum of disaster-related discourse across multiple event types and geographic regions.
The dataset employed in this study was derived from the CrisisLexT26 tweet corpus and subsequently augmented with tweets related to Hurricane Florence to construct a comprehensive dataset for analysis. The CrisisLexT26 corpus, compiled initially by Olteanu et al. [21], serves as a diverse and extensive resource for disaster-related tweet classification, containing labeled social media communications collected during multiple crisis events. Complementarily, the Hurricane Florence tweets were obtained from a publicly accessible Kaggle dataset [8], comprising posts gathered via the Twitter API during the 2018 hurricane event. The integration of these two datasets enabled a robust cross-event and cross-regional evaluation of the proposed machine learning framework, enhancing the diversity, representativeness, and generalizability of the corpus for disaster classification and regional analysis tasks.
Table 2 presents representative examples of tweets collected from a range of disaster events. These samples illustrate the diversity of the dataset in terms of geographic regions, event types, and linguistic expressions, highlighting the heterogeneous nature of crisis-related communications analyzed in this study.

3.2. Dataset Preprocessing

The transformation from the raw dataset to the cleaned dataset was performed through a structured preprocessing pipeline. Initially, tweet content was automatically detected and extracted from the relevant raw data fields. Standardized data-cleaning procedures were then applied to remove extraneous formatting, inconsistent spacing, and placeholder entries. Subsequently, a keyword-based mapping approach was employed to categorize each tweet into one of three high-level disaster types (i.e., Human-Induced, Natural, or Mixed) and to assign it to a corresponding subcategory based on predefined disaster-related keywords. This process converted unstructured textual data into a structured word-frequency matrix, enabling the application of topic modeling techniques, such as LDA, to identify dominant themes and improve classification performance. The final output retained only the features relevant to disaster classification while discarding redundant or unstructured attributes.
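The keyword-based mapping step can be sketched as follows. The keyword lists and the `map_category` helper below are hypothetical stand-ins for illustration only; the study's actual lexicon is not reproduced here.

```python
# Hypothetical keyword lexicon; the paper's predefined keyword lists are not
# published here, so these sets are illustrative assumptions.
NATURAL_KEYWORDS = {"flood", "earthquake", "hurricane", "wildfire", "typhoon", "drought"}
HUMAN_KEYWORDS = {"explosion", "shooting", "crash", "derailment", "spill"}

def map_category(text: str) -> str:
    """Assign a high-level disaster type from keyword matches."""
    tokens = set(text.lower().split())
    has_natural = bool(tokens & NATURAL_KEYWORDS)
    has_human = bool(tokens & HUMAN_KEYWORDS)
    if has_natural and has_human:
        return "Mixed"
    if has_natural:
        return "Natural"
    if has_human:
        return "Human-Induced"
    return "Unlabeled"
```

A tweet matching keywords from both lists falls into the Mixed category, mirroring the three high-level types described above.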
To enhance coverage and improve representativeness, the hurricane dataset was integrated with a broader tweet dataset to construct the combined dataset. This comprehensive corpus comprises 32,228 disaster-related tweets, distributed as 10,382 Human-Induced, 20,461 Natural, and 1385 Mixed tweets. The hurricane dataset provides focused, event-specific data, containing 4771 tweets, whereas the tweet dataset offers a larger, multi-event representation comprising 27,933 tweets. By merging these two sources, the combined dataset achieves both depth, through detailed hurricane-specific discourse, and breadth, through cross-event coverage encompassing diverse disaster types. This integration enhances the overall balance of data distribution, facilitates cross-category generalization, and establishes a robust foundation for training and evaluating disaster classification models.
Figure 2 shows an overview of the proposed methodology for classifying disaster-related tweets by category and subcategory, and analyzing their distribution across different levels. The process begins with three key research questions: (RQ1) How can natural disaster tweets be categorized and subcategorized? (RQ2) How do different machine learning models perform across disaster tweet categories? (RQ3) How does model performance vary across various cultural and regional contexts? The overall framework enables a comprehensive assessment of disaster communication patterns on Twitter by systematically integrating data acquisition, preprocessing, feature extraction, model training, evaluation, and regional analysis.
In this study, a systematic procedure was employed to prepare, extract, analyze, and evaluate Twitter data related to disaster events. Each tweet record contains key attributes, including Tweet ID, text content, country, category, subcategory, disaster type, and date of occurrence. The textual data were preprocessed by converting all text to lowercase, removing URLs, eliminating stopwords, and applying lemmatization to normalize word forms. Numerical feature representations were generated using the TF-IDF technique. The dataset was partitioned into training and testing subsets using a 70:30 ratio. Two machine learning classifiers (i.e., Random Forest and Logistic Regression) were trained on the training subset and subsequently evaluated on the test subset. This approach enables a systematic comparison of model performance for tweet-level disaster classification.
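The text normalization steps above (lowercasing, URL removal, stopword elimination, lemmatization) can be sketched as below. The stopword set and lemma table are tiny illustrative stand-ins for the full NLTK/spaCy resources a production pipeline would use.

```python
import re

# Tiny illustrative stand-ins for full stopword and lemmatization resources.
STOPWORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "to", "and"}
LEMMAS = {"floods": "flood", "flooding": "flood", "fires": "fire"}

def preprocess(tweet: str) -> str:
    """Lowercase, strip URLs and punctuation, drop stopwords, lemmatize."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation and symbols
    tokens = [LEMMAS.get(t, t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

The cleaned strings produced by such a function are then ready for TF-IDF vectorization and model training.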

3.3. Feature Extraction: TF-IDF Vectorization

To enable the classification of textual data, tweets were transformed into numerical feature representations using the TF-IDF technique. TF-IDF assigns higher weights to terms that appear frequently within an individual tweet but are relatively rare across the overall corpus. This weighting scheme effectively captures the discriminative importance of words in short, context-specific texts, making it particularly suitable for representing disaster-related tweets in machine learning models. Let t be a term, d a tweet, and N the number of tweets in the corpus. We define TF-IDF as follows:
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log \frac{N}{\text{df}(t)}$$
where $\text{TF}(t, d)$ is the term frequency of $t$ in $d$, and $\text{df}(t)$ is the number of tweets in the corpus that contain $t$.
Disaster tweet classification performance is enhanced by using TF-IDF over the traditional Bag-of-Words representation. TF-IDF down-weights persistent, non-informative tokens while emphasizing rare and contextually significant terms, thereby improving feature discriminability. This approach achieves a balance between computational efficiency and information richness, enhancing class separability and supporting reliable tweet-level classification. The top 10,000 features with the highest TF-IDF scores were selected to represent the dataset, ensuring that only the most informative terms contribute to disaster category and subcategory prediction.
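The effect of this weighting can be checked numerically against the formula above; the counts used here are invented for illustration.

```python
import math

def tf_idf(tf: float, n_docs: int, df: int) -> float:
    """TF-IDF(t, d) = TF(t, d) * log(N / df(t))."""
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a tweet but appearing in only 10 of 1000 tweets
# gets a high weight, while the same count for a near-ubiquitous term is
# heavily down-weighted (counts invented for illustration).
weight_rare = tf_idf(3, 1000, 10)
weight_common = tf_idf(3, 1000, 900)
```

A term that occurs in every document receives a weight of exactly zero, which is precisely how persistent, non-informative tokens are suppressed.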

3.4. ML Models

Random Forest (RF) and Logistic Regression (LR) models were employed to classify disaster-related tweets within the TF-IDF feature space. The RF model leverages TF-IDF features to capture non-linear interactions among tokens, while the LR model utilizes cross-entropy loss to estimate class posterior probabilities. Both models are computationally efficient, interpretable, and well-suited for high-dimensional, sparse text data. The RF model achieved competitive accuracy and additionally provided feature-importance measures, facilitating the identification of terms most influential in distinguishing between disaster categories and subcategories.
The LR model is employed to estimate the probability that a given tweet pertains to a particular disaster category or subcategory. Binary logistic regression is applied at the category level (e.g., Natural vs. Human-Induced), while multinomial logistic regression is utilized at the subcategory level (e.g., Geophysical, Hydrological, Meteorological, etc.), which is defined as:
$$P(y = 1 \mid x) = \frac{1}{1 + \exp\left(-(w^{\top} x + b)\right)}$$
where x denotes the feature vector derived from the tweet text, w represents the weight vector, and b is the bias. In this context, y = 1 signifies natural disasters, whereas y = 0 indicates human-induced disasters.
For the multinomial case at the subcategory level, the conditional probability of assigning tweet $x$ to a particular class $k \in \{1, 2, \dots, K\}$ is determined by:
$$P(y = k \mid x) = \frac{\exp\left(w_k^{\top} x + b_k\right)}{\sum_{j=1}^{K} \exp\left(w_j^{\top} x + b_j\right)}$$
where $K$ represents the number of subcategories (e.g., geophysical, hydrological, meteorological, climatological, and others), with each class $k$ characterized by its weight vector $w_k$ and bias $b_k$.
The predicted class label is then chosen as the one that maximizes this probability:
$$\hat{y} = \arg\max_{k} P(y = k \mid x)$$
We utilize the RF model, an ensemble learning technique, to enhance the classification stability of both disaster-related tweet categories and subcategories. This model aggregates predictions from multiple decision trees trained on bootstrapped samples of the training data, employing feature subsampling at each split. For a tweet feature vector $x$, the posterior probability of class $k$ in the $t$-th tree is given by:
$$p_t(k \mid x) = \frac{n_{t,k}(L_t)}{\sum_{j=1}^{K} n_{t,j}(L_t)}$$
where $L_t(x)$ denotes the terminal (leaf) node reached by $x$, and $n_{t,k}(L_t)$ indicates the count of training samples of class $k$ within that leaf. Additionally, $K$ represents the total number of categories or subcategories.
The class probability is calculated as the average of all T trees, as follows:
$$\hat{P}(y = k \mid x) = \frac{1}{T} \sum_{t=1}^{T} p_t(k \mid x)$$
The predicted label of the sample is then obtained by soft majority voting across the trees, calculated as follows:
$$\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \hat{P}(y = k \mid x)$$
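The tree-level averaging and final argmax can be illustrated with a toy forest of three trees and three classes; the per-tree leaf probabilities are invented for illustration.

```python
import numpy as np

# Per-tree leaf probabilities p_t(k | x) for T = 3 trees and K = 3 classes
# (values invented for illustration).
tree_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
])

# Average over trees to get the forest probability, then take the argmax.
forest_prob = tree_probs.mean(axis=0)
predicted = int(np.argmax(forest_prob))
```

Here the first class wins the soft vote even though no single tree is unanimous, which is the stabilizing effect the ensemble provides.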
Each tree is constructed by recursively splitting nodes to reduce impurity. The Gini impurity for a node with class proportions $p_k$ is defined as follows:
$$\text{Gini} = 1 - \sum_{k=1}^{K} p_k^2$$
The entropy-based measure is expressed as follows:
$$H = -\sum_{k=1}^{K} p_k \log p_k$$
For a candidate split $(f, \tau)$, which divides the node $S$ into left and right child nodes $S_L$ and $S_R$, the impurity decrease can be calculated using the following formula:
$$\Delta I(f, \tau) = I(S) - \left[ w_L \, I(S_L) + w_R \, I(S_R) \right]$$
where $I$ is either the Gini index or the entropy, and $w_L$ and $w_R$ are the fractions of the node's samples that fall into the left and right children, respectively.
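A minimal sketch of the Gini-based impurity decrease, assuming a two-class node: a perfect split of a balanced node removes all impurity.

```python
def gini(props):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in props)

def impurity_decrease(parent, left, right, w_left, w_right):
    """Delta I(f, tau) = I(S) - [w_L * I(S_L) + w_R * I(S_R)], Gini variant."""
    return gini(parent) - (w_left * gini(left) + w_right * gini(right))

# A perfect split of a balanced two-class node (Gini = 0.5) into two pure
# children removes all impurity, yielding the maximum decrease of 0.5.
delta = impurity_decrease([0.5, 0.5], [1.0, 0.0], [0.0, 1.0], 0.5, 0.5)
```

Splits are scored by this decrease, and the candidate $(f, \tau)$ with the largest value is chosen at each node.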
In summary, at the category level (binary, $K = 2$), Equations (4)–(6) separate natural from human-induced disasters. At the subcategory level (multiclass, $K > 2$), the same equations apply with $K$ equal to the number of classes, such as geophysical, hydrological, meteorological, and climatological.

3.5. Implementation Details

The ML models were implemented in Python 3.10 on the Google Colab cloud platform, with scikit-learn, NumPy, Pandas, and Matplotlib/Seaborn (version 0.13.2) for model development and visualization. The TF-IDF vectorization was configured with max_features = 10,000, English stopword removal, and unigram–bigram tokenization, producing a high-dimensional sparse matrix that served as input to the classifiers. The dataset was partitioned using a stratified 70/15/15 split to preserve category and subcategory distributions across the training, validation, and test sets. All experiments were performed with a fixed random_state = 42 to ensure reproducibility.
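A configuration sketch matching these settings, with a small invented corpus standing in for the 32,228-tweet dataset; the 70/15/15 split is obtained by peeling off 30% and then halving it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy corpus and labels standing in for the real dataset.
tweets = ["flood waters rising fast", "earthquake shakes the city",
          "factory explosion downtown", "wildfire spreads overnight"] * 10
labels = ["Natural", "Natural", "Human-Induced", "Natural"] * 10

# Settings as reported: 10,000 max features, English stopwords, unigrams+bigrams.
vectorizer = TfidfVectorizer(max_features=10_000, stop_words="english",
                             ngram_range=(1, 2))
X = vectorizer.fit_transform(tweets)

# Stratified 70/15/15 split: peel off 30%, then halve it into val and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```

Stratifying both splits keeps the category proportions of the full corpus in each of the three subsets.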
The LR classifier operates as a single-layer linear model, where the TF-IDF features constitute the input layer, and the output corresponds to the predicted disaster subcategories. For binary classification, LR applies the sigmoid activation function, defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^{\top} x + b$ represents the weighted linear combination of the TF-IDF feature vector $x$, model weights $w$, and bias $b$. The sigmoid maps this linear output to a probability in the range (0, 1), enabling classification based on a chosen decision threshold.
For multi-class prediction, the LR model employs the softmax function, which normalizes linear logits into probability distributions over all classes:
$$\text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
The LR model was trained using multinomial cross-entropy loss with the scikit-learn solver, setting max_iter = 1000 to ensure convergence in high-dimensional feature spaces. L2 regularization with the default strength (C = 1.0) was applied to mitigate overfitting. As observed during experimentation, Logistic Regression converged in approximately 5.59 s and achieved a test accuracy of 0.97, demonstrating a strong generalization ability.
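A minimal training sketch with these hyperparameters, on an invented, linearly separable toy corpus; it does not reproduce the timings or accuracies reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented, linearly separable toy corpus in place of the real dataset.
tweets = ["flood waters rising", "earthquake damage reported",
          "oil spill at sea", "bridge collapse downtown"] * 15
labels = ["Natural", "Natural", "Human-Induced", "Human-Induced"] * 15

X = TfidfVectorizer().fit_transform(tweets)

# max_iter = 1000 for convergence; C = 1.0 is the default L2 regularization.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X, labels)
train_acc = clf.score(X, labels)
```

scikit-learn's `LogisticRegression` handles the multinomial cross-entropy objective internally when more than two classes are present.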
The RF classifier was implemented as a non-linear ensemble learning method comprising 100 decision trees trained using bootstrap sampling. Each decision tree was grown without depth limitation (max_depth = None), producing a hierarchical multi-layer structure in which each internal node evaluates a threshold-based splitting criterion. Unlike neural networks, decision trees do not employ differentiable activation functions; instead, they rely on step-function-like decisions driven by the minimization of the Gini impurity. Although random forests do not possess “layers” in the neural network sense, each tree contains multiple levels that collectively approximate complex non-linear decision boundaries. The values for n_estimators = 100 and random_state = 42 were used as configured. Training required approximately 16.95 s, and the model achieved a test accuracy of 0.969, confirming strong performance across all disaster categories.
The learning curves obtained through 5-fold cross-validation indicate that both models generalize effectively as training data increases. Logistic Regression shows a consistent margin between training and validation accuracy, whereas Random Forest exhibits slight overfitting due to its fully expanded trees. These observations are reinforced by the confusion matrices, where the majority of predictions align along the diagonal, with notable misclassifications occurring between the Mixed and Others categories due to feature overlap. Collectively, these architectural and technical details demonstrate a robust and reproducible machine learning framework for disaster tweet classification.
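The cross-validated learning curves described here can be generated with scikit-learn's learning_curve utility; the synthetic feature matrix below is a hypothetical stand-in for the actual TF-IDF representation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the TF-IDF matrix and labels (hypothetical data).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

# 5-fold cross-validated learning curve over increasing training sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  val={va:.3f}")
```

A persistent gap between the two curves, as reported for Random Forest's fully expanded trees, is the usual signature of mild overfitting.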

4. Evaluation and Result Analysis

This section evaluates the proposed framework in terms of classification accuracy and overall performance, followed by a detailed discussion of the results. The analysis begins with a comparative assessment of LR and RF classifiers in categorizing disaster-related tweets across both categories and subcategories. Comprehensive performance metrics are reported for each ML model, highlighting their respective effectiveness in accurately detecting and classifying disaster-related content. Confusion matrices are subsequently presented to illustrate and compare classification outcomes, providing insights into relative accuracy and patterns of misclassification. Finally, disaster tweets are analyzed by type and subtype across countries, capturing regional and linguistic variations and demonstrating the global applicability and robustness of the proposed framework.
We evaluated two interpretable ML models (i.e., LR and RF) using the combined dataset of 32,228 disaster-related tweets. The primary evaluation metrics were overall accuracy and the macro-averaged F1 score. Secondary analyses included per-class precision, recall, and F1 scores, supplemented by confusion matrix visualizations. Stratified cross-validation with optimized hyperparameters was used for model selection, and all reported results were derived from a held-out test set to ensure an unbiased assessment of performance. The evaluation protocol emphasized the macro-F1 metric to mitigate the effects of class imbalance, applied probability calibration for more reliable confidence estimation, and performed comprehensive error diagnostics through confusion matrices and country-wise performance breakdowns.
The F1-score, the harmonic mean of precision and recall, provides a balanced measure that accounts for both missed positive instances and false alarms. It is defined as:
$$F_1 = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}$$
where
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
and $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives, respectively. The $F_1$-score thus serves as a comprehensive indicator of a model's performance, particularly in imbalanced classification tasks such as disaster tweet categorization.
For aggregating across multiple classes, we consider both macro-F1 (a simple mean of class-wise F1, which treats all classes equally) and weighted-F1 (a support-weighted mean, which takes class prevalence into account), which is defined as:
$$\text{Macro-}F_1 = \frac{1}{C}\sum_{c=1}^{C} F_{1,c}, \qquad \text{Weighted-}F_1 = \frac{1}{N}\sum_{c=1}^{C} n_c\, F_{1,c}$$
where $C$ is the number of classes, $n_c$ is the support for class $c$, and $N$ is the total sample size. (Note: in single-label multiclass classification, the micro-averaged $F_1$ equals accuracy; hence, macro-$F_1$ is preferred to reflect minority-class performance.)
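The difference between the two averaging schemes can be checked on a small hypothetical prediction set, in which a minority "mixed"-style class (support 1) is entirely misclassified:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions for an imbalanced 3-class task: the minority
# "mixed" class has a single instance and is misclassified.
y_true = ["natural"] * 6 + ["human"] * 3 + ["mixed"]
y_pred = ["natural"] * 5 + ["human"] + ["human"] * 3 + ["natural"]

macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"macro-F1={macro:.3f}  weighted-F1={weighted:.3f}")
```

Here the weighted score (about 0.76) masks the minority-class failure that the macro score (about 0.56) exposes, which is exactly why macro-F1 is emphasized in this evaluation.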

4.1. Performance Comparison with State-of-the-Art Approaches

Table 3 presents a comparison between the proposed Logistic Regression and Random Forest models with representative state-of-the-art approaches for disaster-related tweet classification. The findings indicate that the proposed TF–IDF-based classifiers achieve accuracy and F1-score values comparable to those reported in earlier research. While numerous current methodologies rely on deep learning or transformer-based architectures that are often customized for specific datasets or disaster types, the proposed framework achieves competitive performance using a lightweight TF-IDF representation combined with classical classifiers. This highlights that well-designed traditional machine learning methods can remain effective for large-scale, cross-regional disaster analysis, while offering practical advantages in terms of computational efficiency, interpretability, and deployment simplicity.

4.2. Computational Complexity and Runtime Analysis

To evaluate the computational efficiency and practical deployability of the proposed framework, we conducted a runtime analysis that covered the feature representation, model training, and inference stages. All experiments were performed using identical dataset splits and the same hardware environment to ensure a fair comparison among classical machine learning models (e.g., LR and RF models) and a deep learning baseline (BiLSTM).
For LR and RF, TF–IDF vectorization with 1000 features was implemented, followed by model training and prediction. The deep learning baseline utilizes a BiLSTM architecture trained on tokenized and padded tweet sequences of 100 tokens. To optimize performance and computational cost, the BiLSTM model was trained for 10 epochs, which was sufficient to achieve convergence while reflecting a realistic deployment-oriented configuration.
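A simple timing harness along these lines is sketched below; the synthetic data stand in for the 1000-feature TF-IDF matrix, and absolute times depend on hardware, so they will not match Table 4 exactly:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic 1000-dimensional features standing in for the TF-IDF matrix.
X, y = make_classification(n_samples=1000, n_features=1000, n_informative=50,
                           random_state=0)

for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=100,
                                                  random_state=42))]:
    t0 = time.perf_counter()
    model.fit(X, y)                      # training time
    t_train = time.perf_counter() - t0

    t0 = time.perf_counter()
    model.predict(X)                     # inference time
    t_infer = time.perf_counter() - t0
    print(f"{name}: train={t_train:.2f}s  infer={t_infer:.2f}s")
```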
As demonstrated in Table 4, Logistic Regression exhibits the lowest training and inference times, making it highly suitable for real-time disaster monitoring applications. Random Forest achieves competitive predictive performance at a moderate computational cost due to its ensemble-based structure. In contrast, the BiLSTM model incurs substantially higher training time (1252.65 s) and inference latency (8.97 s), even with a limited number of training epochs, highlighting the trade-off between representational power and computational efficiency.
Overall, this analysis suggests that classical machine learning models provide significant advantages in terms of scalability, transparency, and low-latency inference, which are essential for real-time disaster management systems. In contrast, deep learning approaches may be more suitable for offline analysis or resource-rich environments.

4.3. Inference Accuracy

Table 5 presents the inference accuracy results for both the RF and LR models. Each model achieved a testing accuracy of 97%, indicating strong generalization on the held-out tweet test set. While the RF model demonstrated a slightly better fit to the training data, the two models exhibited comparable performance on the test set, suggesting that both approaches effectively capture the key linguistic and contextual features necessary for disaster tweet classification.
The disaster tweet category and subcategory analysis demonstrated consistently high performance, as illustrated by the learning curves in Figure 3 (Logistic Regression) and Figure 4 (Random Forest). The Logistic Regression (LR) model achieved a training accuracy of 0.98 and a validation accuracy of 0.96, indicating strong generalization across both datasets. In contrast, the Random Forest (RF) model attained a perfect training accuracy (1.0) alongside a validation accuracy of 0.96, suggesting slight overfitting despite its excellent overall performance. Both models exhibited stable learning behavior, though the LR model maintained a more balanced relationship between training and validation performance, reflecting its robustness and interpretability.
The confusion matrix for the LR model (Table 6) demonstrates excellent predictive performance, with the majority of predictions concentrated along the diagonal. The human-induced categories, particularly Accidental (1247) and Intentional (295), were classified with near-perfect accuracy, reflecting the model’s strong ability to distinguish these classes. The natural disaster subtypes were also predicted with high accuracy, although performance showed a slight decline for the Natural–Other class (205), likely due to broader linguistic variability. The highest rate of misclassification occurred within the Mixed–Others category, where semantic overlap with both natural and human-induced events contributes to increased ambiguity and reduced model certainty.
The confusion matrix for the Random Forest model (Table 7) reveals similarly high classification accuracy, with improved performance in several categories compared to Logistic Regression. The model achieved near-perfect predictions for the human-induced classes, accurately identifying Accidental (1246) and Intentional (297) tweets. Among the natural disaster subtypes, Random Forest outperformed Logistic Regression in the Natural–Other category, correctly classifying 213 instances compared to 205 for LR. As with the Logistic Regression model, some misclassification persisted within the Mixed–Others class due to semantic overlap across event types; however, overall accuracy remained consistently high across categories.
Table 8 summarizes the performance of the LR model in classifying disaster-related tweets across categories and subcategories. The human-induced classes exhibited near-perfect performance, with exceptionally high precision, recall, and F1-scores (Accidental: precision = 0.98, recall = 1.00, F1 = 0.99; Intentional: precision = 0.99, recall = 0.97, F1 = 0.98). Similarly, the model performed strongly on climatological and geophysical subtypes, achieving an F1-score of 0.98. Performance was slightly lower for the hydrological and meteorological categories (F1 = 0.95), and the Mixed–Others class showed the weakest recall (0.78), reflecting the inherent semantic ambiguity and overlap within mixed disaster contexts. Overall, the Logistic Regression model demonstrated robust classification capability, achieving macro- and weighted-average precision, recall, and F1-scores exceeding 0.95, alongside a macro F1-score that ranked highest among all categories and an overall accuracy of 0.97, confirming its effectiveness for disaster tweet classification.
The performance of the RF model is summarized in Table 9, yielding an overall accuracy of 0.97, which is comparable to that of Logistic Regression. The human-induced categories were classified with near-perfect precision and recall (Accidental: F1 = 0.99; Intentional: F1 = 0.98). The model also performed strongly across the natural disaster subtypes, achieving high F1-scores for Geophysical (0.98), Hydrological (0.97), and Climatological (0.97) events, with Meteorological (0.94) performing slightly lower but still within a high-accuracy range. The Natural–Other class exhibited very high recall (0.99) but more moderate precision (0.89), suggesting occasional misclassification from semantically related categories. As with Logistic Regression, the Mixed–Others class remained the most challenging, with the lowest recall (0.79). Despite this, the macro- and weighted-average precision, recall, and F1-scores remained high across all categories, confirming the model's overall robustness.

Loss Curve Analysis

In addition to accuracy-based learning curves, we examine the trends in training and validation loss to assess convergence behavior and potential overfitting. Although accuracy is reported as the primary evaluation metric, the Logistic Regression model is optimized by minimizing a cross-entropy (logarithmic) loss function. Consequently, loss curves provide direct insight into its optimization dynamics and generalization behavior.
The loss curves were generated by calculating cross-validated log-loss values across increasing training set sizes. As illustrated in Figure 5, both training and validation loss decrease consistently as the training size increases, indicating progressive improvement in model fit with additional data. The training loss remains slightly lower than the validation loss across all training sizes, as expected, reflecting stable learning behavior. Notably, the close alignment between the training and validation loss curves, along with the absence of divergence, suggests that the LR model generalizes effectively and does not exhibit significant overfitting.
Figure 5 illustrates the progression of training and validation loss for the Logistic Regression model as the size of the training set increases. A steady reduction in both loss curves is observed, indicating that the model benefits from additional data and achieves improved optimization of the cross-entropy objective. Across all training sizes, the validation loss closely follows the training loss with only a small and consistent gap, indicating stable learning dynamics. The absence of divergence between the curves suggests that the model exhibits robust generalization capabilities, with no indications of overfitting.
Figure 6 depicts the training and validation log-loss trends for the Random Forest model evaluated under varying training set sizes. Unlike iterative optimization-based models, Random Forest performance is assessed through evaluation-based log-loss, which decreases consistently as more data become available. The persistently lower training loss compared to validation loss reflects the ensemble’s capacity to capture patterns in the training data while preserving generalization. The absence of an expanding gap between the two curves indicates stable predictive behavior and suggests that the model does not experience significant overfitting across various data scales.
The bar chart in Figure 7 compares the training and testing accuracies of the Logistic Regression and Random Forest models. The Logistic Regression model achieved substantial and consistent performance, with a training accuracy of 0.98 and a testing accuracy of 0.97, indicating good generalization. The Random Forest model demonstrated slightly higher training performance, reaching a perfect training accuracy of 1.00, while maintaining a testing accuracy of 0.97, suggesting minor overfitting but overall high reliability. Both models therefore performed at a comparable level, each attaining approximately 97% test accuracy, and demonstrated robust predictive capability across the various categories and subcategories of disaster-related tweets.

4.4. Country- and Region-Based Categorization

Table 10 presents the geographic distribution of disaster-related tweets across multiple countries. The United States generated the highest volume of tweets (4032 human-induced; 2200 natural), followed by Australia (2406 natural) and Italy (2023 natural). Several other countries, including Bangladesh, Brazil, Canada, Spain, Venezuela, and the United Kingdom, also exhibited substantial human-induced tweet activity, with each exceeding 1000 tweets. Countries such as Russia, Costa Rica, and Singapore showed notably high contributions within specific categories (either natural or human-induced). Mixed-category tweets were comparatively rare across most regions, except Nepal and Bangladesh, which accounted for 40 and 43 such tweets, respectively.
Overall, the distribution demonstrates that disaster-related Twitter activity is shaped not only by population size but also by regional hazard exposure and patterns of digital engagement. Highly developed nations (e.g., the United States, Australia, the United Kingdom) produced consistently large tweet volumes across categories. In contrast, certain developing countries (e.g., Bangladesh, Nepal) displayed elevated activity in specific disaster types, reflecting differences in vulnerability, event frequency, and social media response behaviors.
Table 11 presents the distribution of disaster tweet subcategories across countries. The United States overwhelmingly dominated the dataset, with high counts across multiple categories, including Human-Induced Accidental (2000), Climatological (1200), Hydrological (1000), and Intentional (2032). Australia and Italy also exhibited substantial activity, each surpassing 1000 tweets in the Climatological and Hydrological subcategories. Notably, Costa Rica and Guatemala showed elevated hydrological tweet volumes (1411 and 1100, respectively). Several countries—including Bangladesh, Brazil, Canada, Spain, Venezuela, and the United Kingdom—produced more than 1000 Human-Induced Accidental tweets. A distinctive pattern emerged in Russia, which demonstrated unusually high activity in the Others subcategory (1442).
The Mixed and Meteorological categories had comparatively lower counts, with moderate representation in Nepal, Iraq, and the Philippines. This distribution not only reflects the geographic prevalence of specific disaster types but also highlights variations in digital engagement and reporting practices across regions.
Table 12 presents the classification results showing how disaster-related tweets are distributed across the top 10 countries. In the confusion matrix, diagonal elements represent tweets correctly classified by country, while off-diagonal elements indicate misclassifications. The USA has the highest number of true positive tweets (1864), followed by Australia (719), Italy (604), Canada (598), and Russia (429). Bangladesh (422), Costa Rica (416), Guatemala (330), the United Kingdom (330), and Brazil (297) also achieve relatively high classification accuracy, despite having fewer samples. Overall, misclassifications are small compared to the correctly classified counts, suggesting that the model is robust in distinguishing country-specific tweets. However, some minor overlaps occur; for example, a few U.S. tweets were misclassified as Australian, Canadian, or U.K. tweets, and some Italian tweets were misclassified as Costa Rican or Guatemalan. These results highlight both the model's strength in capturing regional differences in disaster information-seeking and its remaining minor limitations.

4.4.1. Regional Statistical Significance Analysis

Although Table 10 and Table 11 illustrate the descriptive distribution of disaster-related tweets across countries and regions, descriptive statistics alone do not determine whether these observed regional differences are statistically meaningful. To address this limitation and respond to the variability observed in country-level performance, we conducted formal statistical significance testing to evaluate whether disaster category and sub-category distributions are independent of geographic region.
Specifically, chi-square tests of independence were applied to country × category and country × sub-category contingency tables. Given the inherent imbalance in country-level tweet volumes, additional robustness analyses were performed by restricting the evaluation to high-support countries and aggregating low-frequency countries where necessary. Effect sizes were quantified using Cramér’s V to assess the practical strength of regional associations, independent of statistical significance.
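A minimal sketch of this test follows, using an illustrative (not actual) country × category contingency table; the real analysis uses the full tables behind Tables 10 and 11:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative (not actual) country x category contingency table:
# rows = countries, columns = Natural / Human-Induced / Mixed counts.
table = np.array([
    [2200, 4032, 10],
    [2406,  150,  5],
    [ 120, 1050, 43],
])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V quantifies the practical strength of the association,
# independent of sample size.
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"chi2={chi2:.1f}, p={p:.3g}, dof={dof}, V={cramers_v:.3f}")
```

Checking the `expected` matrix against the minimum-frequency assumption motivates the restriction to high-support countries described above.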
As demonstrated in Table 13, statistically significant associations were observed between country and category distributions ( p < 0.001 ) as well as between country and sub-category distributions ( p < 0.001 ). The extensive chi-square statistics and strong effect sizes (Cramér’s V ranging from 0.70 to 0.88) indicate that regional differences are both statistically significant and practically meaningful. Due to the full dataset exhibiting sparsity and violations of expected frequencies, additional analyses were performed using the top 20 high-support countries to ensure the validity of chi-square assumptions. The persistence of strong effect sizes and statistical significance in this filtered setting confirms that regional differences are robust and not solely driven by dataset imbalance.

4.4.2. Ablation Study on Feature Importance and Dimensionality

Feature Importance: Figure 8 presents the top 20 most important TF-IDF features identified by the Random Forest model using Gini-based importance scores. The results indicate that disaster-related keywords and location-specific terms (e.g., haze, explosion, landslides, bangladesh, and colorado) contribute most strongly to the model’s decision-making process. Notably, several high-importance features correspond to well-known disaster events and affected regions, highlighting the model’s ability to capture contextually meaningful signals from textual data. Although RF feature importance reflects global contributions rather than class-specific relevance, the identified features are consistent with those obtained from Logistic Regression analysis, reinforcing the robustness and interpretability of the learned representations.
Feature Dimensionality: The dimensionality of the TF–IDF feature representation is a crucial factor in balancing classification performance, computational efficiency, and interpretability. To justify the choice of feature dimensionality and ensure that the reported performance is not an artifact of excessive vocabulary size, an ablation study was conducted by varying the TF–IDF max_features parameter across multiple values.
Figure 9 illustrates the impact of TF–IDF feature dimensionality on macro-averaged F1-score. As the number of features increases from low-dimensional representations, classification performance improves rapidly, reaching a plateau at approximately 1000 features. Beyond this point, further increases in dimensionality result in marginal or slightly negative performance gains, indicating diminishing returns from additional features. Based on this plateau behavior, a TF–IDF dimensionality of 1000 features was selected for all subsequent experiments. This choice provides an effective trade-off between predictive performance and model complexity, while preserving interpretability by focusing on the most informative lexical features.
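The ablation sweep can be sketched as follows, varying max_features over a tiny synthetic corpus; the study used the 32k-tweet dataset, so absolute scores and the plateau location will differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny synthetic corpus (hypothetical stand-in for the tweet dataset).
base = [("flood waters rising in the valley", "natural"),
        ("earthquake damages several buildings", "natural"),
        ("explosion reported at chemical plant", "human"),
        ("arson suspected in warehouse fire", "human")]
texts, labels = [], []
for i in range(10):
    for t, y in base:
        texts.append(f"{t} sample {i}")
        labels.append(y)

# Sweep TF-IDF max_features and track cross-validated macro-F1.
for k in [10, 50, 200, 1000]:
    pipe = Pipeline([("tfidf", TfidfVectorizer(max_features=k)),
                     ("lr", LogisticRegression(max_iter=1000))])
    score = cross_val_score(pipe, texts, labels, cv=5,
                            scoring="f1_macro").mean()
    print(f"max_features={k:5d}  macro-F1={score:.3f}")
```

Plotting macro-F1 against k reproduces the plateau-style curve used to justify the 1000-feature setting.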

4.5. Discussion

The results of our study highlight the strong potential of machine learning models, specifically Logistic Regression and Random Forest classifiers, for categorizing and sub-categorizing disaster-related tweets. Both models demonstrated robust performance on a large, labeled dataset of over 32,000 tweets, effectively addressing the challenge of extracting critical information from the high volume and noise typical of social media during crises. These findings align with previous research emphasizing the utility of such approaches in crisis informatics [7,9,10,16,22].
Our investigation went beyond standard model evaluation by examining the influence of regional and cultural variations, which is an underexplored but critical factor in developing practical and equitable disaster response technologies. A key finding is the variation in model accuracy across countries. While universal models demonstrated strong overall performance, country-specific models revealed notable differences in classification quality, suggesting that a single, universal model may not be equally effective in all contexts. These performance variations are likely driven by the unique linguistic and cultural factors that shape how people communicate during and immediately after disasters [2,10,22,23].
Accurate categorization of disaster-related tweets has direct implications for real-time relief operations. By identifying the type and severity of a disaster, responders can allocate resources more efficiently and prioritize areas of greatest need. However, the observed regional disparities in model performance raise concerns about equitable access to timely disaster relief. Machine learning models that underperform in certain regions due to linguistic or cultural differences may lead to affected populations being overlooked or misrepresented in digital assessments. To address this, future research should explore hybrid approaches that combine supervised learning with unsupervised clustering to detect emergent patterns in underrepresented regions. Integrating geolocation data, user metadata, and temporal trends could further enrich the contextual interpretation of tweets.

Framework Limitations and Future Work

Despite the promising results achieved in this study, several limitations should be acknowledged. First, although the dataset includes tweets written in multiple languages, the proposed framework employs a language-agnostic TF-IDF representation, without requiring explicit language detection or language-specific preprocessing. While this design choice supports scalability and computational efficiency, it may limit the model’s ability to capture linguistic nuances, code-switching behavior, and culturally specific expressions across regions. As a result, classification performance may vary in linguistically diverse settings.
Second, the dataset demonstrates class imbalance, particularly for the Mixed disaster category, which is significantly underrepresented relative to the Natural and Human-Induced classes. No explicit data-level resampling techniques (e.g., SMOTE) or class-weighting strategies were applied to preserve realistic data distributions and prevent the introduction of synthetic noise into short social media texts. The lower recall observed for the Mixed class can therefore be attributed to both limited sample size and semantic overlap with other disaster categories. To mitigate this effect during evaluation, the macro-averaged F1-score is emphasized as a primary performance metric, as it provides a balanced assessment across categories regardless of class frequency.
Moreover, model performance is reported as point estimates obtained from a single train–validation–test split. Although statistical significance testing was performed to analyze regional and categorical variations, standard deviations and confidence intervals were not computed. This limits the assessment of performance variability across repeated runs and should be taken into account when interpreting the reported results.
Building upon these limitations, several directions for future work are identified. Future research will explore language-aware and multilingual modeling pipelines, including automatic language detection and multilingual contextual embeddings, to improve robustness in cross-cultural and multilingual disaster communication. Furthermore, advanced imbalance-handling strategies, such as cost-sensitive learning, focal-loss-based objectives, and controlled data augmentation, will be explored to enhance the performance of minority classes. Finally, extending the proposed framework toward real-time deployment scenarios, which include API-based inference, edge-cloud hybrid architectures, and latency-aware evaluation, will support the development of scalable and operationally reliable disaster monitoring systems.

5. Conclusions

The objective of this study is to develop and evaluate a multi-class classification approach for disaster-related tweets, including both category and subcategory labels. Using a combined dataset of over 32,000 tweets, we performed preprocessing, applied TF-IDF for feature extraction, and trained classifiers such as Logistic Regression and Random Forest models. Both models achieved strong performance, with an aggregate accuracy of approximately 97%. Notably, Random Forest handled class imbalance effectively, while Logistic Regression demonstrated consistent generalization. Secondary analyses, including confusion matrices and learning curves, revealed that national cultural and linguistic differences had a significant influence on classification performance. These findings demonstrate that social media data, particularly Twitter, can support emergency management tasks such as response coordination, early warning, and situational awareness. The results also underscore the importance of cultural and regional representation in disaster analytics. To further enhance disaster management applications, future work will focus on incorporating deep learning methods, multilingual embeddings, and real-time deployability.

Author Contributions

Conceptualization, M.R.M. and A.A.A.; methodology, M.R.M., L.N. and M.R.M.; formal analysis, M.R.M., L.A. and A.A.A.; investigation, M.R.M., A.A.A., L.N. and T.R.; resources, A.A.A. and L.A.; data curation, T.R. and A.A.A.; writing—original draft preparation, M.R.M., A.A.A. and T.R.; writing—review and editing, A.A.A. and T.R.; visualization, A.A.A. and M.R.M.; supervision, A.A.A., L.N. and T.R.; project administration, A.A.A. and L.A.; funding acquisition, A.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is supported in part by the National Science Foundation (NSF) under grants # 2011330, 2200377 and 2302469.

Data Availability Statement

The data and source code that support the findings of this study are available online: https://github.com/mmiah-pvmu/Disaster-Management.git (accessed on 27 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. NOAA National Centers for Environmental Information (NCEI). U.S. Billion-Dollar Weather and Climate Disasters: Overview. 2021. Available online: https://www.ncdc.noaa.gov/billions/ (accessed on 22 September 2025).
  2. Easterling, D.R.; Kunkel, K.E.; Arnold, J.R.; Knutson, T.; LeGrande, A.N.; Leung, L.Y.R.; Wehner, M.F. Precipitation change in the United States. Nat. Clim. Change 2021, 11, 427–435. [Google Scholar] [CrossRef]
  3. Houston, J.B.; Hawthorne, J.; Perreault, M.F.; Park, E.H.; Goldstein Hode, M.; Halliwell, M.R.; Griffith, S.A. Social media and disasters: A functional framework for social media use in disaster planning, response, and research. Disasters 2015, 39, 1–22. [Google Scholar] [CrossRef] [PubMed]
  4. Reuter, C.; Kaufhold, M.A. Fifteen years of social media in emergencies: A retrospective review and future directions for crisis informatics. J. Conting. Crisis Manag. 2018, 26, 41–57. [Google Scholar] [CrossRef]
  5. Vieweg, S.; Hughes, A.L.; Starbird, K.; Palen, L. Microblogging during two natural hazards events: What twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Atlanta, GA, USA, 10–15 April 2010; pp. 1079–1088. [Google Scholar] [CrossRef]
  6. Starbird, K.; Palen, L. “Voluntweeters”: Self-organizing by digital volunteers in times of crisis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vancouver, BC, Canada, 7–12 May 2011; pp. 1071–1080. [Google Scholar] [CrossRef]
  7. Franceschini, R.; Rosi, A.; Catani, F.; Casagli, N. Detecting information from Twitter on landslide hazards in Italy using deep learning models. Geoenviron. Disasters 2024, 11, 22. [Google Scholar] [CrossRef]
  8. Adhikari, R. Hurricane Florence Tweets. 2018. Available online: https://www.kaggle.com/datasets/rabinad/hurricane-florence-tweets (accessed on 18 October 2025).
  9. Basit, M.; Alam, B.; Fatima, Z.; Shaikh, S. Natural Disaster Tweets Classification Using Multimodal Data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 7584–7594. [Google Scholar] [CrossRef]
  10. Bhattarai, Y.; Duwal, S.; Sharma, S.; Talchabhadel, R. Leveraging machine learning and open-source spatial datasets to enhance flood susceptibility mapping in transboundary river basin. Int. J. Digit. Earth 2024, 17, 2313857. [Google Scholar] [CrossRef]
  11. Khatoon, S.; Asif, A.; Hasan, M.M.; Alshamari, M. Social media-based intelligence for disaster response and management in smart cities. In Artificial Intelligence, Machine Learning, and Optimization Tools for Smart Cities; Springer: Cham, Switzerland, 2022; pp. 211–235. [Google Scholar]
  12. Sinha, A.; Pokhriyal, K.; Kumar, S. A Decision Support System for Extracting Artificial Intelligence Driven Insights on Natural Disasters. In Proceedings of the 2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT), Greater Noida, India, 29–31 August 2024; Volume 1, pp. 1–6. [Google Scholar]
  13. Sufi, F.K.; Khalil, I. Automated disaster monitoring from social media posts using AI-based location intelligence and sentiment analysis. IEEE Trans. Comput. Soc. Syst. 2022, 11, 4614–4624. [Google Scholar] [CrossRef]
  14. Lamsal, R.; Read, M.; Karunasekera, S.; Imran, M. Crema: Crisis response through computational identification and matching of cross-lingual requests and offers shared on social media. IEEE Trans. Comput. Soc. Syst. 2024, 12, 306–319. [Google Scholar] [CrossRef]
  15. Laksito, A.D.; Kusrini, K.; Setyanto, A.; Johari, M.Z.F.; Maruf, Z.R.; Yuana, K.A.; Adninda, G.B.; Kartikakirana, R.A.; Nucifera, F.; Widayani, W.; et al. Machine learning and social media harvesting for wildfire prevention. In Proceedings of the 2023 IEEE 13th International Conference on Pattern Recognition Systems (ICPRS), Guayaquil, Ecuador, 4–7 July 2023; pp. 1–6. [Google Scholar]
  16. Khallouli, W.; Li, J.; Huang, J.; Rabadi, G.; Kovacic, S. An integrated machine learning approach for identifying emergency rescue messages on social media during natural disasters. Soc. Netw. Anal. Min. 2025, 15, 50. [Google Scholar] [CrossRef]
  17. Schmidt, S.; Friedemann, M.; Hanny, D.; Resch, B.; Riedlinger, T.; Mühlbauer, M. Enhancing satellite-based emergency mapping: Identifying wildfires through geo-social media analysis. Big Earth Data 2025, 9, 389–411. [Google Scholar] [CrossRef]
  18. He, Z.; Zhou, C.; Zou, L.; Zhou, S.; Zhao, X. Siamese text classification network (SiamTCN) for multi-class multi-label information extraction of typhoon disasters from social media data. Int. J. Digit. Earth 2025, 18, 2517790. [Google Scholar] [CrossRef]
  19. Wang, C.; Zhang, X.; Liu, L.; Wu, J. Public perception of earthquake events: Evidence from social media–a case study of the 2025 Dingri earthquake. Geomat. Nat. Hazards Risk 2025, 16, 2542196. [Google Scholar] [CrossRef]
  20. Ngamassi, L.; Shahriari, H.; Ramakrishnan, T.; Rahman, S. Text mining hurricane Harvey tweet data: Lessons learned and policy recommendations. Int. J. Disaster Risk Reduct. 2022, 70, 102753. [Google Scholar] [CrossRef]
  21. Olteanu, A.; Vieweg, S.; Castillo, C. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the 2015 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW’15), Vancouver, BC, Canada, 14–18 March 2015; pp. 994–1009. [Google Scholar] [CrossRef]
  22. Guha-Sapir, D.; Mizutori, M. The Human Cost of Disasters: An Overview of the Last 20 Years, 2000–2019; United Nations Office for Disaster Risk Reduction (UNDRR) and Centre for Research on the Epidemiology of Disasters (CRED): Geneva, Switzerland, 2020. [Google Scholar]
  23. Kumar, V.V.; Sahoo, A.; Balasubramanian, S.K.; Gholston, S. Mitigating healthcare supply chain challenges under disaster conditions: A holistic AI-based analysis of social media data. Int. J. Prod. Res. 2025, 63, 779–797. [Google Scholar] [CrossRef]
Figure 1. System architecture of the proposed four-layer ML framework for disaster-related tweet classification and regional analysis. The framework integrates data collection, preprocessing, and feature extraction, as well as ML-based classification and regional visualization layers, to enable accurate identification of disaster types and cross-regional interpretation of social media data.
Figure 2. Overview of the Disaster Tweet Classification Methodology and Research Framework. It illustrates the workflow of Data Preprocessing, Integration, and Classification for Disaster Tweets.
Figure 3. Learning curve for the Logistic Regression model. The model exhibits strong generalization, with training accuracy stabilizing at approximately 0.98 and validation accuracy approaching 0.96, indicating effective learning and minimal overfitting.
Figure 4. Learning curve for the Random Forest model. The model attains near-perfect training accuracy (1.0) while maintaining a validation accuracy around 0.96, suggesting high predictive power with a slight tendency toward overfitting.
Figure 5. Training and validation loss curves for the Logistic Regression model computed using cross-validated log-loss. The consistent decrease in both curves and the absence of divergence indicate stable convergence and good generalization performance.
Figure 6. Training and validation log-loss trends for the Random Forest model evaluated across increasing training set sizes. Although Random Forest does not optimize a continuous loss function, the decreasing log-loss values indicate improved predictive stability and generalization as more training data are incorporated.
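The accuracy and log-loss curves described in Figures 3–6 can be produced with scikit-learn's `learning_curve` utility, which refits a model on growing training subsets and cross-validates each fit. The sketch below is illustrative only: it substitutes a synthetic multi-class dataset for the paper's TF–IDF matrix, and the subset sizes, fold count, and class count are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the TF-IDF feature matrix (illustrative only).
X, y = make_classification(n_samples=1500, n_features=40, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=42)

# scoring="neg_log_loss" yields the log-loss trends of Figures 5-6;
# scoring="accuracy" yields the learning curves of Figures 3-4.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring="neg_log_loss")

train_loss = -train_scores.mean(axis=1)  # the scorer returns negated loss
val_loss = -val_scores.mean(axis=1)
```

Plotting `train_loss` and `val_loss` against `sizes` reproduces the shape of the loss curves; a persistent gap between the two signals overfitting.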
Figure 7. Training and testing accuracy comparison for Logistic Regression and Random Forest models, showing similarly strong generalization with both achieving 97% test accuracy.
Figure 8. Top 20 most important TF-IDF features identified by the Random Forest model based on Gini importance.
Figure 9. Ablation study on TF–IDF feature dimensionality. Macro-F1 score improves rapidly with increasing feature size up to 1000 features and then saturates, demonstrating that additional features provide negligible performance gains.
Table 1. Tweets Datasets.
| Dataset | Total Tweets | Human-Induced | Natural | Mixed |
|---|---|---|---|---|
| Hurricane Dataset | 4771 | 473 | 3913 | 385 |
| Tweet Dataset | 27,933 | 10,382 | 16,548 | 1000 |
| Combined Dataset | 32,228 | 10,855 | 20,461 | 1385 |
Table 2. Tweet Samples.
| Tweet ID | Tweet Text | Country | Category | Sub-Category | Type | Crisis Date |
|---|---|---|---|---|---|---|
| T1 | Entre mal estado de Plantas Josefa Camejo y Planta Centro sin Planta… IHBJKF2 | Venezuela | Human-induced | Accidental | Explosion | 2012 |
| T2 | RT @VitoCaro7: Good luck to @kyle_piersma today… | US | Human-induced | Intentional | Bombings | 2013 |
| T3 | Ethiopia hosting #SW4ALL Meeting of Ministers of WASH… | Ethiopia | Natural | Hydrological | Floods | 2016 |
| T4 | We are taking the #WHO Surgical Safety Checklist on tour… | Cameroon | Natural | Meteorological | Typhoon | 2017 |
Table 3. Comparison of the proposed framework with representative state-of-the-art methods for disaster-related tweet classification. Performance metrics are reported as stated in the corresponding references.
| Reference | Method | Model Architecture | Reported Performance (Accuracy/F1) |
|---|---|---|---|
| Sinha et al. [12] | LSTM-based Classifier | LSTM | 0.975 |
| Lamsal et al. [14] | CrisisTransformer | Cross-lingual Transformer | ∼0.90 |
| Laksito et al. [15] | BiLSTM (Wildfire Tweets) | BiLSTM + Random Oversampling | 0.8995 |
| Franceschini et al. [7] | BERT + CNN | Transformer (BERT) + CNN | 0.961 |
| Schmidt et al. [17] | RoBERTa (XLM-R) | Transformer (XLM-R) | 0.94 |
| Khallouli et al. [16] | BERT + Regex + MLP | Hybrid (BERT + Rules + MLP) | 0.91 |
Table 4. Runtime and computational cost comparison of different models, including feature representation, training time, and inference time on the test set. All experiments were conducted using identical data splits and hardware configurations.
| Model | Feature Representation (Input Encoding) | Training Time (s) | Inference Time (s) | Computational Cost | Real-Time Suitability |
|---|---|---|---|---|---|
| LR | TF–IDF (10,000 features) | 5.39 | 0.0200 | Very Low | High |
| RF | TF–IDF (10,000 features) | 16.95 | 0.2594 | Medium | Medium |
| BiLSTM | Tokenized sequences (length = 100) | 1252.65 | 8.97 | High | Low |
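Wall-clock figures like those in Table 4 can be collected by wrapping the `fit` and `predict` calls with `time.perf_counter`. The sketch below shows only the measurement pattern; the model and synthetic data are placeholders, not the paper's exact benchmark setup.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the TF-IDF matrix.
X, y = make_classification(n_samples=2000, n_features=100, random_state=0)
model = LogisticRegression(max_iter=1000)

t0 = time.perf_counter()
model.fit(X, y)                    # training time
train_s = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X)                   # inference time over the full set
infer_s = time.perf_counter() - t0
```

For stable numbers, the same splits and hardware should be used across models, as the table's caption notes; averaging over several repetitions further reduces timer noise.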
Table 5. Inference Accuracy.
| ML Model | Training Confidence | Testing Inference Accuracy |
|---|---|---|
| Logistic Regression | 0.98 | 0.97 |
| Random Forest | 1.00 | 0.97 |
Table 6. Confusion Matrix of the Logistic Regression ML Model.
| Actual \ Predicted | Human—Accidental | Human—Intentional | Mixed—Others | Natural—Climatological | Natural—Geophysical | Natural—Hydrological | Natural—Meteorological | Natural—Others |
|---|---|---|---|---|---|---|---|---|
| Human—Accidental | 1247 | 4 | 0 | 1 | 0 | 1 | 0 | 0 |
| Human—Intentional | 5 | 295 | 0 | 0 | 2 | 2 | 1 | 0 |
| Mixed—Others | 2 | 0 | 162 | 6 | 9 | 16 | 12 | 0 |
| Natural—Climatological | 1 | 0 | 0 | 429 | 0 | 6 | 5 | 0 |
| Natural—Geophysical | 2 | 0 | 1 | 0 | 765 | 10 | 5 | 0 |
| Natural—Hydrological | 0 | 0 | 0 | 2 | 1 | 1019 | 12 | 0 |
| Natural—Meteorological | 7 | 0 | 0 | 0 | 0 | 21 | 568 | 0 |
| Natural—Others | 2 | 0 | 0 | 0 | 0 | 9 | 0 | 205 |
Table 7. Confusion Matrix of the Random Forest ML Model.
| Actual \ Predicted | Human—Accidental | Human—Intentional | Mixed—Others | Natural—Climatological | Natural—Geophysical | Natural—Hydrological | Natural—Meteorological | Natural—Others |
|---|---|---|---|---|---|---|---|---|
| Human—Accidental | 1246 | 3 | 0 | 1 | 1 | 2 | 0 | 0 |
| Human—Intentional | 5 | 297 | 0 | 0 | 1 | 1 | 0 | 1 |
| Mixed—Others | 0 | 0 | 163 | 6 | 10 | 18 | 9 | 1 |
| Natural—Climatological | 1 | 0 | 0 | 428 | 1 | 5 | 3 | 3 |
| Natural—Geophysical | 5 | 0 | 0 | 0 | 767 | 2 | 5 | 4 |
| Natural—Hydrological | 1 | 0 | 0 | 3 | 1 | 1010 | 16 | 3 |
| Natural—Meteorological | 9 | 0 | 0 | 0 | 0 | 13 | 561 | 13 |
| Natural—Others | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 213 |
Table 8. Classification performance of the LR model across disaster categories and subcategories. The table reports precision, recall, and F1-scores for each class, highlighting near-perfect performance for human-induced events, strong results for climatological and geophysical subtypes, and comparatively lower recall for the Mixed–Others category due to semantic overlap. Overall macro- and weighted-average metrics exceed 0.95, reflecting the model’s high accuracy and robust generalization.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Human-induced—Accidental | 0.98 | 1.00 | 0.99 | 1253 |
| Human-induced—Intentional | 0.99 | 0.97 | 0.98 | 305 |
| Mixed—Others | 0.99 | 0.78 | 0.88 | 207 |
| Natural—Climatological | 0.98 | 0.97 | 0.98 | 441 |
| Natural—Geophysical | 0.98 | 0.98 | 0.98 | 783 |
| Natural—Hydrological | 0.94 | 0.99 | 0.96 | 1034 |
| Natural—Meteorological | 0.94 | 0.95 | 0.95 | 596 |
| Natural—Others | 1.00 | 0.95 | 0.97 | 216 |
| Accuracy | | | 0.97 | 4835 |
| Macro Avg | 0.98 | 0.95 | 0.96 | 4835 |
| Weighted Avg | 0.97 | 0.97 | 0.97 | 4835 |
Table 9. Classification performance of the RF model across disaster categories and subcategories. The table reports precision, recall, and F1-scores for each class, showing near-perfect performance for human-induced events, strong results across natural disaster subtypes, and comparatively lower recall for the Mixed–Others category. High macro- and weighted-average metrics (0.95–0.97) indicate robust and well-balanced classification capability.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Human-induced—Accidental | 0.98 | 0.99 | 0.99 | 1253 |
| Human-induced—Intentional | 0.99 | 0.97 | 0.98 | 305 |
| Mixed—Others | 1.00 | 0.79 | 0.88 | 207 |
| Natural—Climatological | 0.98 | 0.97 | 0.97 | 441 |
| Natural—Geophysical | 0.98 | 0.98 | 0.98 | 783 |
| Natural—Hydrological | 0.96 | 0.98 | 0.97 | 1034 |
| Natural—Meteorological | 0.94 | 0.94 | 0.94 | 596 |
| Natural—Others | 0.89 | 0.99 | 0.94 | 216 |
| Accuracy | | | 0.97 | 4835 |
| Macro Avg | 0.97 | 0.95 | 0.96 | 4835 |
| Weighted Avg | 0.97 | 0.97 | 0.97 | 4835 |
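Per-class precision/recall/F1 tables such as Tables 8 and 9, and the confusion matrices of Tables 6 and 7, follow directly from scikit-learn's metrics API. The labels below are a tiny hypothetical sample, not the paper's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Tiny hypothetical sample of actual vs. predicted sub-categories.
y_true = ["Hydrological", "Geophysical", "Hydrological",
          "Meteorological", "Geophysical"]
y_pred = ["Hydrological", "Geophysical", "Meteorological",
          "Meteorological", "Geophysical"]

labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)   # rows = actual classes

# output_dict=True exposes per-class and macro/weighted averages programmatically.
report = classification_report(y_true, y_pred, labels=labels,
                               output_dict=True, zero_division=0)
macro_f1 = report["macro avg"]["f1-score"]
```

Fixing the `labels` order keeps the confusion-matrix rows and columns aligned with the report's class listing, which is what makes per-class tables like Tables 8 and 9 directly comparable to the matrices in Tables 6 and 7.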
Table 10. Geographic distribution of disaster-related tweets across countries. The table summarizes the volumes of human-induced, natural, and mixed-category tweets, highlighting regional differences in disaster exposure and social media engagement. High activity is observed in the United States, Australia, Italy, and several other countries, with notable category-specific contributions from regions such as Bangladesh and Nepal.
CountryHuman-InducedMixedNatural
Afghanistan0448
Australia002406
Bangladesh125043142
Brazil1000010
Canada100011006
China0110
Colombia0223
Costa Rica001413
Ecuador0036
Fiji0126
Greece0017
Guatemala051116
Guinea0110
Haiti022267
Honduras0010
India012149
Indonesia0393
Iraq0482
Italy012023
Japan0029
Jordan0022
Kenya08154
Madagascar0863
Mexico018106
Myanmar0767
Nepal040347
Pakistan0244
Peru0653
Philippines02397
Puerto Rico010140
Russia001442
Rwanda0128
Senegal007
Sierra Leone0222
Singapore010000
Somalia017225
Spain100000
Sudan0565
Tajikistan014
Thailand004
UK110000
US403202200
Uganda0050
Vanuatu0423
Venezuela100000
Yemen01466
Table 11. Country-level distribution of disaster tweet subcategories. The table highlights regional variations in disaster types and digital engagement, with the United States showing dominant activity across multiple subcategories and other countries exhibiting category-specific concentrations.
CountryAccidentalClimatologicalGeophysicalHydrologicalIntentionalMeteorologicalOthers
Afghanistan021480015
Australia0120201120301
Bangladesh1250180478643
Brazil1000103060
Canada1000010100501
China0100310
Colombia06107220
Costa Rica0020000
Ecuador02910160
Fiji00000241
Greece03010044
Guatemala0088885
Guinea02247066
Haiti04747210102222
Honduras0312400
India01528924343
Indonesia02835353
Iraq0628242444
Italy0101913311
Japan015110303
Jordan000200020
Kenya0971130438
Madagascar02040578
Mexico0691111262618
Myanmar066440177
Nepal02765219521940
Pakistan025540010
Peru03149116
Philippines027292904023
Puerto Rico011311010
Russia0000000
Rwanda0011120
Senegal0111055
Sierra Leone0333372
Singapore0000000
Somalia100017702502317
Spain0000000
Sudan011000435
Tajikistan0013331
Thailand1100000100
UK2000000000
US01200010002032010
Uganda02000430
Vanuatu10000700164
Venezuela0000000
Yemen03315154514
Table 12. Confusion matrix showing the classification of disaster-related tweets by country. Diagonal values indicate correctly classified tweets (true positives), while off-diagonal values represent misclassifications.
| True \ Predicted | USA | AUS | ITA | CAN | RUS | BGD | CRI | GTM | GBR | BRA |
|---|---|---|---|---|---|---|---|---|---|---|
| USA | 1864 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| AUS | 3 | 719 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ITA | 0 | 1 | 604 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| CAN | 4 | 0 | 0 | 598 | 0 | 0 | 0 | 0 | 0 | 0 |
| RUS | 3 | 1 | 0 | 0 | 429 | 0 | 0 | 0 | 0 | 0 |
| BGD | 6 | 0 | 1 | 0 | 0 | 422 | 0 | 0 | 0 | 0 |
| CRI | 0 | 0 | 4 | 0 | 0 | 0 | 416 | 4 | 0 | 0 |
| GTM | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 330 | 0 | 0 |
| GBR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 330 | 0 |
| BRA | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 297 |
Table 13. Statistical significance analysis of country-level category and sub-category distributions.
| Comparison | χ² | DoF | p-Value | Cramér's V |
|---|---|---|---|---|
| Country vs. Category (All Countries) | 46,730.99 | 510 | <0.001 | 0.8515 |
| Country vs. Sub-Category (All Countries) | 96,166.58 | 1530 | <0.001 | 0.7052 |
| Country vs. Category (Top 20 Countries) | 46,039.56 | 38 | <0.001 | 0.8800 |
| Country vs. Sub-Category (Top 20 Countries) | 88,072.04 | 114 | <0.001 | 0.7027 |
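The statistics in Table 13 follow the standard Pearson χ² test of independence with Cramér's V as the effect size, V = sqrt(χ² / (n · min(r−1, c−1))) for an r × c contingency table of n observations. The sketch below implements both with NumPy only; the 3 × 3 table of counts is hypothetical, not the paper's data.

```python
import numpy as np

def chi2_cramers_v(table):
    """Pearson chi-square statistic, degrees of freedom, and Cramér's V
    for an r x c contingency table of observed counts."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    # Expected counts under independence: outer product of the marginals / n.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))
    return chi2, dof, v

# Hypothetical country-by-category counts (Human-Induced, Mixed, Natural).
counts = [[400, 20, 2200],
          [0, 10, 2400],
          [1200, 40, 150]]
chi2, dof, v = chi2_cramers_v(counts)   # dof = (3-1)*(3-1) = 4
```

The p-value can then be taken from the χ² distribution's survival function (e.g., `scipy.stats.chi2.sf(chi2, dof)`); values of V near 1, as in Table 13, indicate a very strong country–category association.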
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Miah, M.R.; Akter, L.; Ahmed, A.A.; Ngamassi, L.; Ramakrishnan, T. An ML-Based Approach to Leveraging Social Media for Disaster Type Classification and Analysis Across World Regions. Computers 2026, 15, 16. https://doi.org/10.3390/computers15010016


