1. Introduction
Disasters—whether natural, technological, or anthropogenic—constitute complex, multidimensional events that exert significant human, environmental, and economic tolls across the globe. In the age of digital information, timely and accurate dissemination of disaster-related news is not only critical for emergency response and coordination but also central to shaping public perception and institutional action [1,2,3]. While social media platforms have been extensively mined for such purposes, structured extraction and quantitative analysis of disaster narratives from traditional news portals remain substantially underexplored. News articles often contain more vetted and detailed information compared to user-generated content, yet few frameworks leverage the full potential of this medium for automated disaster analytics [2].
Prior research has employed unsupervised topic models such as Latent Dirichlet Allocation (LDA) to analyze disaster-related content in social media and microblogs [4,5,6], but such efforts have rarely extended to formal media coverage. Moreover, the integration of large-scale language models like GPT for extracting structured attributes—such as disaster type, affected location, death toll, injury count, and severity—from unstructured textual sources remains methodologically underdeveloped. Existing studies also lack rigorous statistical testing to determine whether narrative framings vary systematically by disaster typology or geography, an omission that limits our understanding of media biases and disaster representation [7].
To address these gaps, this study develops a fully automated framework that combines generative language models, probabilistic topic inference, and statistical testing to uncover structured patterns in disaster news coverage. Using a multimodal ingestion pipeline, we collected and curated a dataset of 20,756 disaster-related news articles published between 27 September 2023 and 1 May 2025. The data were sourced from 471 distinct news portals worldwide, including major media outlets such as bbc.com, cnn.com, ndtv.com, and reuters.com. Each article was processed using GPT-3.5 Turbo, which extracted structured entities including disaster type, location, normalized country name, reported deaths, injuries, and a severity score ranging from 0 to 5. After extensive normalization and validation, a subset of headlines was subjected to LDA topic modeling with five latent narrative frames inferred using variational inference.
Each headline was assigned a dominant topic based on posterior probability, and two key hypotheses were formulated: (1) media outlets use different narrative framings for different disaster types, and (2) media narratives vary by the country in which the disaster occurs. To empirically test these hypotheses, we constructed two contingency tables (topic vs. disaster type and topic vs. country) and computed Pearson’s chi-square statistics. For Hypothesis 1, the chi-square statistic was 25,280.78 with p < 0.001, and the test for Hypothesis 2 was likewise highly significant with p < 0.001, both indicating strong dependencies.
These results affirm the non-random distribution of media narratives with respect to both the typology and geography of disasters. The findings contribute to the computational social science literature by introducing a scalable, GPT-augmented pipeline for structured disaster news analysis. Moreover, the empirical insights carry practical implications for media monitoring, humanitarian logistics, and policymaking, where understanding narrative construction can inform targeted intervention strategies, resource prioritization, and international response coordination. This study advances the methodological frontier of disaster informatics and underscores the role of media analytics in global crisis response systems.
To summarize, the core contributions of this study are as follows:
Large-Scale Structured Dataset Creation: This study introduces a novel dataset comprising 20,756 disaster-related news articles collected from 471 distinct global news portals between 27 September 2023 and 1 May 2025. Using GPT-3.5 Turbo, each article was automatically transformed into structured records containing disaster type, location, country, reported deaths, injuries, and severity—yielding the largest known structured disaster news corpus derived from formal media sources.
Narrative Typology via Topic Modeling: We applied Latent Dirichlet Allocation (LDA) to extract 5 coherent narrative topics, each representing a dominant media framing. Headlines were vectorized and modeled using a bag-of-words approach, producing topic distributions for each article and enabling fine-grained narrative classification across thousands of records.
Empirical Validation of Narrative Variability: Using dominant topic assignments, we conducted two chi-square tests to validate hypotheses regarding narrative variation. The first test (topic vs. disaster type) yielded a highly significant statistic of 25,280.78 (p < 0.001), and the second test (topic vs. country) was also highly significant (p < 0.001), confirming strong dependencies and media framing heterogeneity across both dimensions.
Scalable, Fully Automated Pipeline for Disaster Informatics: This study presents a reproducible, end-to-end NLP pipeline integrating GPT-based information extraction, topic modeling, and statistical inference, demonstrating that state-of-the-art language models can reliably structure and interpret high-volume disaster news content with strong implications for humanitarian response, policy planning, and crisis communication research.
2. Background
The increasing frequency and severity of disasters worldwide have prompted extensive research into how such events are represented in media and how these representations shape public perception, institutional response, and policy interventions [8,9,10]. While early efforts primarily focused on content analysis and qualitative framing [11,12], recent advances in natural language processing and machine learning have enabled scalable and reproducible methodologies to examine large-scale disaster narratives [1].
Despite this methodological shift, gaps remain in integrating structured information extraction, cross-national comparisons, and statistical hypothesis testing into the study of disaster news reporting [13,14]. Numerous studies have explored media framing of disasters using topic modeling, structural inference, or computational linguistics [15,16,17]. However, many are limited by region-specific focus, constrained datasets (e.g., Twitter-only) [18,19], or the absence of cross-type or cross-country narrative differentiation.
Furthermore, while large language models (LLMs) like GPT-3.5 offer powerful tools for extracting semantic structures from unstructured texts, their application in disaster informatics has been largely experimental [20].
Table 1 summarizes 14 key studies that have applied computational approaches to disaster-related news analysis. For each, we list the reference, the method used, key limitations, and potential improvements.
This study is grounded in established theoretical foundations of media framing. Entman’s seminal definition of framing as the selection and salience of certain aspects of perceived reality to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation [21] provides a conceptual baseline for interpreting disaster-related narratives in headlines. Additionally, Boydstun’s empirical treatment of issue framing as a generalizable phenomenon across domains and events further justifies the use of topic modeling as a proxy for frame structure [22].
Given the reliance on large language models (LLMs) for structured entity extraction, the study also engages with emerging debates in AI ethics concerning bias, hallucination, and opacity in generative models. To mitigate concerns about black-box inference, we adopt precision-recall validation and recommend follow-up auditing practices inspired by recent methodological advances in LLM assessment [23,24]. These references contextualize our approach within a broader framework of responsible AI-driven media analysis.
While the current study does not explicitly correct for outlet-level variation, our prior publication in [25] presents a validated methodology for outlet-specific narrative profiling. Using 6485 articles from 56 media sources, the study introduced a Kullback–Leibler Divergence–based bias index, achieving 0.88 inter-source narrative separation accuracy and clustering outlets with over 91% purity based on ideological lean. By integrating entropy normalization and conditional divergence metrics, it systematically captured framing asymmetries at the outlet level. Though not re-applied here, this framework offers a reproducible foundation for source-level de-biasing, previously peer-reviewed and validated across geopolitical reporting domains.
Geographical amplification bias—arising from uneven regional media infrastructures—was empirically modeled in [2], which analyzed 19,130 disaster articles across 453 sources from 92 countries. The study introduced an amplification coefficient (AC), computed as normalized article volume per disaster event per million population, revealing a 3.6× overrepresentation of Global North disasters. Post-stratification weighting using GDP-adjusted event frequency reduced regional distortion by 41%, yielding more equitable narrative visibility across underreported regions. While the current study excludes reimplementation, it directly builds upon this calibrated analytical foundation, enabling a sharper focus on cross-hazard narrative variation.
Temporal evolution in media framing was comprehensively modeled in [26], which applied rolling-window LDA, ARIMA modeling, and harmonic seasonality decomposition to 15,278 articles over 24 months. This revealed significant topic drift, with a cosine distance >0.35 between monthly narrative vectors and a 5.2-week periodicity in framing peaks. Temporal entropy measures further captured instability in theme dominance, with a 23% volatility increase post-major geopolitical events. Although the present study uses a static 20-month corpus, these prior results underscore that dynamic narrative behavior has already been rigorously analyzed, and its omission here reflects a targeted scope rather than an analytical deficit.
3. Methodology
This section presents the computational and inferential framework used to extract, normalize, model, and statistically test disaster narratives from global news media. The full pipeline integrates automated data acquisition, GPT-based entity extraction, probabilistic topic modeling, and statistical hypothesis testing. The core notations used are described in the Abbreviations section.
3.1. Data Collection
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ denote the set of disaster-related news articles, with $N = 20{,}756$. Each article $d_i$ is defined as:
$$d_i = (h_i,\ u_i,\ s_i,\ \tau_i),$$
where $h_i$ is the headline text, $u_i$ the article URL, $s_i$ the source portal, and $\tau_i$ the publication timestamp.
Deduplication was performed using SHA-256 hashing on headlines and URLs.
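As a concrete illustration of this step, the following minimal Python sketch deduplicates a list of article records by hashing the concatenated headline and URL; the dictionary keys (`headline`, `url`) are hypothetical field names rather than the paper's actual schema, and only the SHA-256 hashing mirrors the stated procedure.

```python
import hashlib

def deduplicate(articles):
    """Keep the first occurrence of each (headline, URL) pair.

    `articles` is assumed to be an iterable of dicts with hypothetical
    'headline' and 'url' keys; only the SHA-256 hashing mirrors the paper.
    """
    seen, unique = set(), []
    for article in articles:
        key = hashlib.sha256(
            (article["headline"].strip().lower() + "|" + article["url"].strip()).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```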
3.2. Structured Entity Extraction via GPT
Each article was transformed into a structured record:
$$r_i = (\text{deaths}_i,\ \text{injuries}_i,\ \text{type}_i,\ \text{country}_i,\ \text{location}_i,\ \text{severity}_i),$$
where the fields correspond to casualty counts, disaster typology, country, location, and severity, with $\text{severity}_i \in \{0, 1, \ldots, 5\}$.
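The extraction call can be sketched with the OpenAI Python client as below; the prompt wording, JSON field names, and error handling are illustrative assumptions rather than the authors' exact implementation, and the temperature is set to zero in line with Section 4.3.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "From the news headline below, return a JSON object with keys "
    "disaster_type, country, location, deaths, injuries, severity (integer 0-5). "
    "Use null for fields that are not stated.\n\nHeadline: {headline}"
)

def extract_record(headline: str) -> dict:
    """Illustrative structured-extraction call; prompt text and field names
    are assumptions, not the authors' exact prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # suppress sampling randomness (cf. Section 4.3)
        messages=[{"role": "user", "content": PROMPT.format(headline=headline)}],
    )
    return json.loads(response.choices[0].message.content)
```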
3.3. Country and Disaster Type Normalization
A normalization function was defined as:
$$\phi: \mathcal{L}_{\text{raw}} \to \mathcal{L}_{\text{canonical}},$$
mapping raw country and disaster-type labels to canonical forms. Standard equivalence classes were used, e.g., $\phi(\text{“US”}) = \phi(\text{“USA”}) = \text{“United States”}$.
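A minimal sketch of such a normalization table is given below; only the United States aliases come from the text, and the lookup function is an assumed simplification of the full country and disaster-type mapping.

```python
# Aliases for the United States are taken from the text; the full mapping
# used in the paper covers many more countries and disaster-type synonyms.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}

def normalize_country(raw_label: str) -> str:
    """Map a raw country string to its canonical form; unseen labels pass through."""
    return COUNTRY_ALIASES.get(raw_label.strip().lower(), raw_label.strip())
```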
3.4. Topic Modeling via Latent Dirichlet Allocation
Let $\mathbf{w}_i = (w_{i,1}, \ldots, w_{i,n_i})$ represent the token vector of article $d_i$. The Dirichlet and Multinomial distributions used in the LDA generative process are given as follows:
$$\theta_i \sim \mathrm{Dirichlet}(\alpha), \qquad z_{i,n} \sim \mathrm{Multinomial}(\theta_i), \qquad w_{i,n} \sim \mathrm{Multinomial}(\beta_{z_{i,n}}),$$
where $\theta_i$ is the document-topic distribution, $z_{i,n}$ the topic assignment of the $n$-th token, and $\beta_k$ the word distribution of topic $k$. Dirichlet priors are standard in LDA due to their conjugacy with the multinomial, allowing tractable inference. Alternative priors such as the logistic normal require variational autoencoders or sampling-based approximations, which compromise analytical interpretability in large-scale unsupervised contexts. The number of topics was fixed at $K = 5$, and the model was learned using variational inference.
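A bag-of-words vectorization followed by variational LDA can be sketched with scikit-learn as below; the stop-word removal and minimum-frequency threshold are illustrative preprocessing choices, not parameters reported in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(headlines, n_topics=5, seed=0):
    """Bag-of-words vectorization followed by variational-Bayes LDA.

    English stop-word removal and the min_df threshold are illustrative
    preprocessing choices, not parameters reported in the paper.
    """
    vectorizer = CountVectorizer(stop_words="english", min_df=5)
    counts = vectorizer.fit_transform(headlines)
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        learning_method="batch",  # batch variational inference, as in Section 3.4
        random_state=seed,
    )
    theta = lda.fit_transform(counts)  # per-document topic proportions theta_i
    return lda, vectorizer, theta
```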
3.5. Dominant Topic Assignment
Each article was assigned its most likely topic via:
$$t_i = \arg\max_{k \in \{1, \ldots, K\}} \theta_{i,k}.$$
This yields a categorical label required for contingency-based hypothesis testing.
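Assuming `theta` is the document-topic matrix returned by the LDA sketch above, the hard assignment reduces to a single argmax per row:

```python
import numpy as np

def dominant_topics(theta: np.ndarray) -> np.ndarray:
    """Hard label t_i = argmax_k theta_{i,k} for an (N, K) document-topic matrix."""
    return np.argmax(theta, axis=1)
```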
3.6. Hypothesis Testing via Chi-Square Analysis
The Pearson chi-square test was selected due to the categorical nature of both predictor (disaster type/country) and response (dominant topic) variables, satisfying the test’s assumptions of independence and minimum expected frequencies per cell. Expected frequencies are computed under the independence assumption as:
$$E_{ij} = \frac{R_i \, C_j}{n},$$
where $R_i$ and $C_j$ are the row and column totals of the contingency table and $n$ is the total number of observations. The Pearson chi-square statistic is then:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
where $O_{ij}$ denotes the observed frequency in cell $(i, j)$. The degrees of freedom for the test are:
$$\mathrm{df} = (r - 1)(c - 1).$$
While alternative versions of the chi-square test such as Yates’ continuity correction or the likelihood-ratio test exist, these are primarily intended for small-sample settings. Given our large sample size (n = 20,756), the standard Pearson chi-square test remains asymptotically optimal and sufficiently robust for hypothesis testing.
A significance threshold of α = 0.05 is used to determine the rejection of the null hypotheses. Although the threshold was set at α = 0.05, our observed p-values were <0.001, providing extremely strong evidence against the null hypothesis. This distinction underscores the robustness of the statistical findings rather than a deviation from the defined significance level.
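A hedged sketch of the contingency-table test using pandas and SciPy is shown below; the column names `dominant_topic` and `disaster_type` are assumed for illustration, and the same function applies to the country-level test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def test_independence(df: pd.DataFrame, group_col: str = "disaster_type"):
    """Pearson chi-square test of independence between dominant topic and
    `group_col` (disaster type or country); column names are assumed."""
    table = pd.crosstab(df[group_col], df["dominant_topic"])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof, table
```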
While the Pearson $\chi^2$ test provides a statistical measure of independence, its value is known to inflate with large sample sizes, potentially exaggerating the practical significance of observed differences. To quantify the strength of association in a scale-independent manner, we employ Cramér’s V, a normalized effect size statistic derived from the $\chi^2$ value.
Formally, Cramér’s V is defined as:
$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\ c - 1)}},$$
where $\chi^2$ is the Pearson chi-square statistic, $n$ is the total number of observations, and $r$ and $c$ denote the number of rows and columns in the contingency table, respectively. This metric ranges from 0 (no association) to 1 (perfect association) and enables comparative interpretation of association strength irrespective of sample size or table dimensionality. We compute Cramér’s V for both hypothesis tests.
These values are subsequently reported alongside the $\chi^2$ test results in Section 6.
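Cramér's V can be computed directly from the chi-square output of the previous sketch; the helper below follows the definition above (without bias correction) and is not taken from the authors' code.

```python
import numpy as np

def cramers_v(chi2: float, table) -> float:
    """Cramér's V for an r x c contingency table (no bias correction)."""
    counts = np.asarray(table)
    n = counts.sum()
    r, c = counts.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

# Example usage with the test from Section 3.6:
# chi2, p_value, dof, table = test_independence(df, group_col="country")
# v = cramers_v(chi2, table)
```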
3.7. Conceptual Workflow
To complement the formal specification above, Figure 1 provides a high-level overview of the complete methodology pipeline. It visually maps the flow of data from raw headline ingestion through to topic modeling and hypothesis testing.
4. Theoretical Foundations and Formal Guarantees
This section articulates the theoretical guarantees and formal justifications underlying the modeling pipeline. The core procedures—topic modeling, GPT-based entity extraction, dominant topic categorization, and statistical testing—are each supported with mathematical reasoning and structural validity. The claims are presented as theorems, lemmas, propositions, corollaries, and remarks.
4.1. Topic Modeling Assumptions and Identifiability
Lemma 1 (Topic Identifiability under Sparse Priors). Given a sparse Dirichlet prior over topic proportions, $\theta_i \sim \mathrm{Dirichlet}(\alpha)$, the latent topics learned via Latent Dirichlet Allocation (LDA) are identifiable in expectation, assuming vocabulary separability.
Proof. A sparse Dirichlet prior constrains each document to draw from a limited number of topics. Provided each topic is associated with a distinct subset of high-probability words, the variational inference objective converges to a locally unique solution that maximizes the evidence lower bound (ELBO). □
Corollary 1 (Narrative Separability). If disaster types or countries yield lexically divergent headline vocabularies, then the topic distributions will cluster separately in topic space with high probability.
4.2. Topic Assignment and Discrete Modeling
Proposition 1 (Hard Topic Assignment Validity). Let $t_i = \arg\max_k \theta_{i,k}$ be the dominant topic for article i. Then $t_i$ defines a categorical label suitable for discrete statistical modeling.
Proof. While $\theta_i$ is a probability distribution, statistical inference such as the $\chi^2$ test requires discrete categorical values. The mapping $\theta_i \mapsto t_i$ yields a consistent transformation from soft to hard assignments without violating independence assumptions. □
4.3. GPT-Driven Entity Structuring
Lemma 2 (Conditional Invariance of GPT Extraction). Let $g_{\pi, \omega, \delta}$ be the extraction function operating under fixed parameters: prompt template $\pi$, model weights $\omega$, and decoding strategy $\delta$. Then for well-formed input $h_i$, the output $g_{\pi, \omega, \delta}(h_i)$ is conditionally deterministic.
Proof. With the temperature set to zero and decoding constrained by a schema, the autoregressive behavior of the GPT model is deterministic. Any stochasticity is suppressed, resulting in invariant outputs for identical inputs. □
It should be noted that while GPT-3.5 was executed with the temperature set to zero to reduce sampling randomness and maximize conditional determinism, it is acknowledged that model outputs may still vary over time due to backend API updates, undocumented model revisions, or prompt sensitivity effects. As such, the exact replication of extraction results across different API access periods may be subject to slight drift, a known limitation when using proprietary LLM platforms.
4.4. Validity of Statistical Testing
Theorem 1 (Chi-Square Test for Topic Independence). Let $T$ be the random variable denoting dominant topic assignment, and $X$ be a discrete variable denoting either disaster type or country. Then under the null hypothesis $H_0 : T \perp X$, the test statistic $\chi^2$ follows a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom.
Proof. Assuming independence, $E_{ij}$ is derived from the outer product of marginal distributions. The Pearson statistic defined in Section 3.6 arises from comparing observed to expected frequencies. Its asymptotic distribution converges to the chi-squared distribution by the central limit theorem for multinomial samples. □
Theorem 2 (Inference Validity on GPT-Derived Data). Let $\{r_i\}_{i=1}^{N}$ be GPT-extracted tuples constrained to finite, schema-validated domains. Then statistical procedures such as LDA and $\chi^2$ analysis are valid over the corpus $\mathcal{D}$.
Proof. Finite label sets, bounded ranges (e.g., $\text{severity}_i \in \{0, 1, \ldots, 5\}$), and validation of each record prior to modeling ensure that all assumptions of frequentist inference over categorical variables are satisfied. □
4.5. Interpretive Remarks
Remark 1 (Topic Coherence Under Fixed K). For headline corpora, empirical coherence metrics tend to stabilize for small values of K. In this study, K = 5 was selected to balance narrative separation and interpretability.
Remark 2 (LDA Topics as Narrative Frames). Despite its unsupervised nature, LDA frequently recovers semantically meaningful themes aligned with journalistic narrative patterns (e.g., geographic emphasis, cause-consequence framing, severity escalation), making it a valid proxy for narrative structure.
5. Experimentation
To illustrate the end-to-end operational dynamics of our disaster narrative quantification framework, we provide a UML sequence diagram (Figure 2) depicting the chronological flow of information between the system’s core components. The pipeline initiates with the user-triggered acquisition of 20,756 disaster-related news headlines gathered from 471 global news portals. These headlines are processed through GPT-3.5 Turbo to extract structured tuples—comprising disaster type, country, location, casualty figures, and severity score—followed by normalization of country names and disaster categories. The processed data are subsequently subjected to topic modeling using Latent Dirichlet Allocation (LDA) with five latent topics (K = 5) inferred via variational inference. Each headline is assigned a dominant topic ($t_i$), which serves as the categorical label for hypothesis testing. Statistical validation is conducted using Pearson’s chi-square test, yielding highly significant results: $\chi^2$ = 25,280.78 (p < 0.001) for the topic–disaster type association, and a similarly significant statistic (p < 0.001) for the topic–country association. These results confirm the systematic dependence of narrative framings on both disaster typology and geographic context. The UML diagram serves as a visual abstraction that complements the mathematical formulation by delineating the precise data flow, modular interactions, and analytical milestones across the experimental pipeline.
5.1. Data Acquisition and Preprocessing
The empirical analysis is grounded on a high-resolution dataset comprising disaster-related news articles collected over a continuous 20-month period, from 27 September 2023 to 1 May 2025. Data were aggregated using a multimodal acquisition pipeline integrating RSS feed monitoring, automated web scraping, and public API ingestion from a diverse set of global media portals. The scraping and normalization infrastructure was implemented using custom Python scripts (Python Version 3.13.5), ensuring minimal duplication and adherence to platform-specific rate limits and access protocols.
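For illustration, the RSS-monitoring component of such a pipeline might resemble the following sketch built on the `feedparser` library; the feed URL is a placeholder, and polling schedules, rate limiting, scraping, and API ingestion are omitted.

```python
import feedparser  # third-party RSS/Atom parser

FEEDS = [
    "https://example.com/world-news/rss.xml",  # placeholder feed URL
]

def poll_feeds(feeds=FEEDS):
    """Collect headline, link, and publication date from each configured feed."""
    items = []
    for url in feeds:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            items.append({
                "headline": entry.get("title", ""),
                "url": entry.get("link", ""),
                "published": entry.get("published", ""),
            })
    return items
```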
A total of 471 unique news portals (e.g., bbc.com, cnn.com, ndtv.com) contributed content to the corpus. For each article, metadata including headline, publication date, country of occurrence, disaster type, location, severity score, and estimated casualty figures (reported deaths and injuries) were extracted and structured. The dataset was then enriched using GPT-3.5 Turbo to transform free-text headlines into machine-readable entities, ensuring a consistent schema across heterogeneous sources.
Country names were normalized to address inconsistencies, such as “US”, “USA”, and “United States” being used interchangeably. Similarly, lexical ambiguity in disaster type (e.g., “cyclone” vs. “hurricane”) was resolved through synonym unification and expert annotation. Outliers and entries with missing critical fields (e.g., empty headlines, malformed dates) were excluded from downstream analysis.
5.2. Descriptive Statistics and Dataset Composition
The final curated dataset comprises 20,756 unique news articles spanning 457 normalized country labels, reporting on a wide spectrum of disaster types. It documents a staggering 14.32 million reported deaths and approximately 588,767 injuries, underscoring the global scale and human cost of the events under consideration.
Table 2 summarizes the disaster news dataset used in this study.
It is important to clarify that the aggregate death count extracted from headlines (approximately 14 million) does not imply that this number of individuals actually perished due to disasters during the study period. The dataset compiled in this study (https://github.com/DrSufi/DisasterHeadline, accessed on 8 May 2025) is constructed from several hundred mainstream news outlets, each of which may independently report on the same disaster event. For instance, a wildfire that results in 100 fatalities and is covered by 100 distinct media sources would yield 10,000 reported deaths in the aggregated dataset due to duplication across sources.
This inflated total arises from the nature of media redundancy and should not be interpreted as an actual estimate of global disaster mortality. No deduplication or event-level normalization was performed in this study, as casualty estimation is not among the research objectives. Future studies aiming to analyze disaster mortality using this dataset must identify and cluster articles referring to the same event and compute normalized casualty estimates (e.g., averaging reported values across sources covering the same incident).
To facilitate an initial understanding of the data distribution, several high-level aggregations were computed. Figure 3 displays the ten most frequently mentioned countries in the dataset, indicating prominent representation from both highly industrialized and disaster-prone regions. Figure 4 shows the frequency distribution of the ten most reported disaster types, revealing hurricanes, floods, and earthquakes as dominant themes. Furthermore, Figure 5 illustrates the distribution of article counts across severity levels on a discrete ordinal scale from 0 (minimal) to 5 (catastrophic), as annotated during the GPT-assisted extraction phase.
Collectively, these statistical summaries validate the dataset’s representativeness and thematic diversity, offering a robust empirical foundation for the topic modeling and hypothesis testing performed in subsequent sections.
6. Results
This section presents empirical findings derived from the application of topic modeling and statistical inference to evaluate two key hypotheses concerning the narrative structure of disaster reporting across media outlets. Specifically, we examined whether (1) disaster type and (2) geographic location (i.e., country) are associated with distinct patterns of narrative framing in news headlines.
6.1. Topic Modeling and Narrative Inference
We applied Latent Dirichlet Allocation (LDA), a generative probabilistic model widely used for discovering latent themes in large text corpora, to the titles of 20,756 disaster news articles. The model was configured to extract five latent topics, each interpreted as a representative “narrative frame” commonly employed in media reporting on disasters. The model yielded semantically coherent topics, each characterized by a set of high-probability keywords.
Table 3 summarizes the lexical composition of each topic. This Table provides a quantitative delineation of the five dominant narrative topics inferred through Latent Dirichlet Allocation (LDA), each characterized by high-probability lexical clusters. The distribution of articles across topics ranges from 17.68% (Topic 3) to 22.61% (Topic 0), indicating a relatively even allocation of thematic content and minimizing the risk of topic dominance or model skewness. This balance, coupled with the semantic coherence of the top keywords, reflects the stability and discriminative power of the LDA model in uncovering latent narrative structures. The proportional dispersion further validates the model’s granularity in capturing diverse journalistic framings of disaster events across a large-scale corpus.
Each news headline was probabilistically assigned a topic distribution, with the dominant topic retained as the representative narrative type for subsequent statistical analysis.
6.2. Topic Number Selection and Coherence Evaluation
To justify the choice of K = 5 latent topics in our LDA modeling, we conducted a comparative evaluation of topic coherence across multiple values of K. Specifically, we computed the UMass and CV coherence metrics over a range of candidate topic counts, following best practices in unsupervised topic modeling.
As shown in Table 4, the coherence scores peaked at K = 5, indicating an optimal balance between semantic interpretability and topic distinctiveness. Lower values of K yielded overly coarse-grained topics, while higher values introduced redundant or fragmented themes with minimal coherence gain.
Based on these results, K = 5 was retained as the most semantically coherent and parsimonious configuration, ensuring stability and interpretability in downstream narrative analyses.
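A coherence sweep of this kind can be reproduced approximately with gensim as sketched below; the hyperparameters and tokenization are assumptions, so the resulting scores will not exactly match those in Table 4.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def coherence_for_k(token_lists, k):
    """Fit a k-topic LDA model and return its (UMass, CV) coherence scores.

    `token_lists` is a list of tokenized headlines; all hyperparameters are
    gensim defaults rather than values reported in the paper.
    """
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    u_mass = CoherenceModel(model=lda, corpus=corpus,
                            coherence="u_mass").get_coherence()
    c_v = CoherenceModel(model=lda, texts=token_lists, dictionary=dictionary,
                         coherence="c_v").get_coherence()
    return u_mass, c_v
```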
6.3. Validation of Topic-Thematic Alignment
To validate whether the inferred latent topics correspond meaningfully to journalistic framing rather than arbitrary keyword co-occurrence, we compared the dominant topic assigned to each headline with its GPT-extracted disaster type. For a random subsample of 200 headlines, we computed the agreement between the LDA-inferred topic label and the GPT-classified hazard type. The agreement was defined based on whether the dominant topic’s top-10 keywords semantically aligned with the disaster type (e.g., “earthquake”, “seismic”, “richter” for earthquakes). The alignment rate was 87.5%, indicating that topic assignments exhibit high consistency with thematic hazard framing.
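The agreement computation can be approximated as in the sketch below, assuming hypothetical per-hazard keyword lists; the paper's semantic-alignment judgment was broader than simple lexical overlap.

```python
# Hypothetical per-hazard lexicons; the paper's alignment judgment was
# broader than simple keyword overlap.
HAZARD_KEYWORDS = {
    "earthquake": {"earthquake", "quake", "seismic", "richter", "aftershock"},
    "flood": {"flood", "flooding", "inundation", "river", "rainfall"},
}

def topic_matches_hazard(topic_top_words, hazard_type) -> bool:
    """True if any of the topic's top-10 keywords falls in the hazard lexicon."""
    lexicon = HAZARD_KEYWORDS.get(hazard_type.lower(), set())
    return bool({w.lower() for w in topic_top_words[:10]} & lexicon)

def alignment_rate(samples) -> float:
    """`samples`: (top_words, gpt_hazard_type) pairs for the 200-headline subsample."""
    matches = sum(topic_matches_hazard(words, hazard) for words, hazard in samples)
    return matches / len(samples)
```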
In addition, illustrative examples are provided in Table 5, showing that topic-word distributions closely reflect the journalistic narrative for the corresponding disaster types. This supports the validity of using LDA outputs as narrative frames in downstream hypothesis testing.
While the study acknowledges the absence of formal human annotation or frame typology coding (e.g., Entman’s schema [21]), it is important to emphasize that the methodology does not rely solely on unsupervised topic modeling. The pipeline leverages GPT-3.5, a state-of-the-art large language model (LLM) that exhibits contextual reasoning, geopolitical awareness, and situational comprehension—attributes far surpassing traditional Named Entity Recognition (NER) or LDA-based classifiers. By integrating GPT-driven extraction of hazard types, locations, and numeric severity indicators, the approach injects a layer of semantically informed structuring into the data generation process. This mitigates the risk of spurious topic emergence purely from statistical co-occurrence. In effect, GPT-3.5 acts as a surrogate for human semantic interpretation, enabling the system to approximate human-like framing distinctions even in the absence of explicit expert coding. Consequently, the resulting narrative frames, as verified in Section 6.3, reflect a hybrid of algorithmic rigor and latent interpretive alignment, advancing the methodological frontier in disaster media analytics.
6.4. Hypothesis 1: Narrative Types Vary by Disaster Type
To evaluate whether the inferred narrative frames are contingent upon disaster type, we constructed a contingency table of disaster categories (e.g., flood, wildfire, earthquake) against their respective dominant topics. A chi-square test of independence was conducted to assess the null hypothesis of no association between disaster type and narrative frame.
Chi-square statistic: 25,280.78
Degrees of freedom: (r − 1)(c − 1), for r disaster-type categories and c = 5 topics
p-value: <0.001
The results indicate a statistically significant association between disaster type and narrative frame (p < 0.001), thereby supporting Hypothesis 1. While the $\chi^2$ values are large due to the dataset’s scale, we also compute Cramér’s V to assess the strength of association. The effect size for the disaster-type association indicates a small but significant effect. This suggests that while the narrative variation is statistically non-random, its practical magnitude across disaster categories is modest.
Figure 6 visualizes the distribution of topics across the six most frequent disaster types.
6.5. Hypothesis 2: Narrative Types Vary by Country
To test whether narrative variation is also geographically contingent, we cross-tabulated the dominant narrative topics against the normalized country names from which the disasters were reported. Applying the chi-square test once more yielded a highly significant result (p < 0.001).
Again, the test strongly rejected the null hypothesis, affirming Hypothesis 2: media outlets use distinct narrative frames when reporting disasters occurring in different countries. The corresponding Cramér’s V indicates a moderate effect size, implying that narrative framing varies more substantially across national contexts than across disaster types. This strengthens the empirical claim that geographical setting plays a more prominent role in shaping media narratives.
Figure 7 displays the narrative-topic distributions for the six most frequently mentioned countries.
6.6. Interpretation
The findings demonstrate that media narrative construction is not agnostic to either disaster typology or geographic context. Instead, narratives are thematically aligned with both the nature and location of the disaster event, potentially reflecting editorial strategies, cultural framing, and audience expectations. This narrative stratification has implications for crisis communication, risk perception, and public engagement during transnational disaster events.
7. Discussion and Implications
7.1. GPT-Based Classification Performance Evaluation
To assess the effectiveness of the GPT-3.5 Turbo model in extracting structured information from unstructured disaster news headlines, we performed a quantitative evaluation using four standard classification metrics: precision, recall, F1-score, and accuracy. A stratified random sample of 500 news articles was manually annotated and used as a ground-truth benchmark to compute these metrics across five representative disaster categories: Earthquake, Flood, Wildfire, Hurricane, and Cyclone.
Figure 8 presents a 3D surface plot illustrating the distribution of evaluation scores for each metric–disaster type pair. Across all categories, GPT-3.5 Turbo exhibited consistently high classification performance. Precision ranged from 0.802 (Cyclone) to 0.943 (Earthquake), indicating the model’s high reliability in avoiding false positives. Recall values remained similarly robust, ranging from 0.786 (Wildfire) to 0.921 (Flood), showing the model’s ability to correctly identify relevant cases.
The F1-score, a harmonic mean of precision and recall, varied between 0.798 and 0.929, affirming the model’s balanced performance. Accuracy scores across disaster types exceeded 0.86 in all cases, with a peak accuracy of 0.951 for Earthquake-related headlines. As shown in Table 6, these results demonstrate the GPT model’s strong generalization ability and its suitability for high-stakes domains such as disaster informatics.
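Under the assumption that the manual annotations and GPT labels are available as parallel label sequences, the reported metrics can be computed with scikit-learn as sketched below; per-class accuracy as shown in Table 6 would additionally require the confusion matrix.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

LABELS = ["Earthquake", "Flood", "Wildfire", "Hurricane", "Cyclone"]

def evaluate(y_true, y_pred):
    """Per-class precision/recall/F1 and overall accuracy on the benchmark;
    y_true holds the manual annotations, y_pred the GPT-assigned labels."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, zero_division=0
    )
    per_class = {
        label: {"precision": precision[i], "recall": recall[i], "f1": f1[i]}
        for i, label in enumerate(LABELS)
    }
    return per_class, accuracy_score(y_true, y_pred)
```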
Figure 9 presents the confusion matrix generated from the 500-headline validation set, providing an interpretable breakdown of disaster-type classification outcomes. Notably, the matrix reveals specific confusion between semantically adjacent hazard types, such as cyclone and hurricane, a pattern that aligns with regionally variable nomenclature (e.g., “tropical cyclone” in Asia-Pacific vs. “hurricane” in the Americas). While such misclassifications reflect the inherent ambiguity of headline-only inference, their downstream impact on topic modeling and hypothesis testing remains negligible, as both are driven by aggregated latent topic distributions rather than isolated label correctness. These observations underscore the robustness of the proposed pipeline while also identifying opportunities for future refinement using context-aware taxonomies and disambiguation rules.
While this study employs GPT-3.5 for structured entity extraction from disaster-related headlines, it is important to note that the model’s efficacy in news classification tasks has been extensively validated in prior research. For instance, Sufi [27] demonstrated that GPT-3.5 achieved an accuracy of 90.67% in categorizing over 1,033,000 news articles into 11 major categories and 202 subcategories. Similarly, another study from our recent works applied GPT-3.5 to classify 33,979 articles into three high-level domains—Education and Learning, Health and Medicine, and Science and Technology—and 32 subdomains, reporting macro-averaged performance scores of 0.986 (Precision), 0.987 (Recall), and 0.987 (F1-Score).
These prior studies offer a comparative benchmark against traditional techniques, such as rule-based Named Entity Recognition (NER), and consistently show that GPT-3.5 yields state-of-the-art performance in large-scale news classification. Given this empirical precedent, the present study’s manual evaluation on a random sample of 500 headlines was considered sufficient to validate the model’s performance in the specific domain of disaster news, without duplicating earlier large-scale benchmarking efforts [27].
The empirical consistency of these metrics across diverse disaster types affirms the reliability of GPT as a foundational model for downstream structured analysis. These quantitative outcomes validate the subsequent topic modeling and statistical inference steps, ensuring methodological soundness from extraction to interpretation.
7.2. Sensitivity Analysis: Headline-Only vs. Full-Text Framing
A sensitivity analysis was performed to compare narrative inferences based on headlines with those derived from full-text articles. A stratified random subsample of 200 disaster-related news reports was selected to ensure diversity in disaster type, geographic origin, and media source.
For each article, both the headline and the full body text were subjected to identical preprocessing procedures (tokenization, lemmatization, stopword removal), followed by Latent Dirichlet Allocation (LDA) modeling with K = 5 topics and consistent hyperparameters. Dominant topic assignments were computed for each version using maximum a posteriori estimates.
The degree of agreement between headline-derived and full-text-derived topic labels was evaluated using Cohen’s κ, a standard inter-rater reliability coefficient for categorical classification tasks. The obtained κ value indicates substantial agreement according to the Landis and Koch interpretation scale.
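Assuming the two sets of dominant-topic labels are stored as parallel sequences (the argument names below are hypothetical), the agreement statistic is a single scikit-learn call:

```python
from sklearn.metrics import cohen_kappa_score

def headline_fulltext_agreement(headline_topics, fulltext_topics) -> float:
    """Cohen's kappa between headline-only and full-text dominant-topic labels
    for the same 200-article subsample (inputs are parallel label sequences)."""
    return cohen_kappa_score(headline_topics, fulltext_topics)
```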
In absolute terms, the majority of articles showed exact topic matches, a smaller proportion exhibited adjacent-topic deviations (interpretable as partial matches), and only a small fraction reflected complete mismatches. These results are visualized in Figure 10.
Overall, this analysis confirms that while full-text processing may offer richer semantic information, headlines alone capture the dominant narrative frames with high fidelity in the majority of cases. This justifies the use of headline-only modeling in large-scale automated media framing pipelines.
7.3. Implications for Disaster Management and Policy
The empirical confirmation that media narratives systematically vary by disaster type and geographic location bears significant implications for both disaster management and policy formulation. First, the demonstrated ability to algorithmically extract, normalize, and categorize disaster news content at scale introduces a novel layer of situational awareness that complements traditional early warning systems. Prior frameworks such as Tweedr have shown how social media mining can support disaster response in real-time situations by structuring unstructured data streams like tweets into geolocatable and actionable events [4]. Similarly, advanced pipelines using automated news analytics, as explored by Sufi and Alsulami, illustrate how GPT-driven extraction from formal news outlets offers high-fidelity, globally scalable disaster intelligence without the noise and misinformation risks inherent to social media streams [7].
Second, the cross-country variation in framing uncovered in this study underscores the necessity for localized communication strategies in international disaster response. Prior research on geographical topic discovery demonstrates how disaster narratives can differ semantically across regions, necessitating country-aware policy interventions and media messaging [6]. The use of geographically adaptive narrative analytics allows policymakers to customize public warnings, resource allocation messages, and risk communications to align with national cognitive frameworks and cultural sensitivities—reducing the probability of misperception or distrust during crises.
Third, integrating large language models (LLMs) like GPT-3.5 into disaster informatics represents a transformative but delicate shift. While such models enable high-throughput extraction of structured semantics from vast, multilingual corpora, concerns persist regarding their robustness in identifying misinformation, framing bias [21,22], and sycophantic hallucinations [28,29,30]. As platforms like Twitter and Facebook continue to serve as volatile yet essential channels for crisis communication, AI-based monitoring tools must be regulated for transparency, bias detection, and ethical interpretability [5,31].
Finally, the scalable architecture proposed in this study, combining GPT-based extraction, LDA modeling, and statistical inference, paves the way for real-time narrative monitoring dashboards at the institutional level. Such platforms could allow organizations like UNDRR or the IFRC to track temporal narrative shifts, detect early public misinformation trends, and evaluate the alignment of governmental communications with emergent social framings. The vision aligns with recent frameworks for AI-based location intelligence and sentiment-aware disaster dashboards already validated in operational deployments.
In essence, this work contributes not only a scalable computational framework but also a policy-relevant paradigm in which disaster management is enhanced by the integration of narrative intelligence, automated media analysis, and empirically grounded forecasting.
7.4. Limitations and Future Research Directions
Despite the demonstrable utility of AI-enhanced disaster intelligence frameworks, several ethical and methodological constraints warrant rigorous scrutiny. Foremost among these is the structural bias introduced by an exclusive reliance on open-access news sources. News coverage is inherently shaped by geopolitical salience, editorial prerogatives, and audience composition [32,33,34], often resulting in the disproportionate amplification of disasters in affluent, media-rich regions and the systematic underrepresentation of events in data-scarce or geopolitically marginalized contexts. These representational asymmetries can distort perceptions of global disaster frequency, severity, and spatial vulnerability, thereby influencing downstream inferential outcomes and potentially biasing policy responses.
Furthermore, while GPT-3.5 Turbo affords significant advantages in multilingual contextual understanding and semantic abstraction, it is not immune to well-documented limitations. Large language models (LLMs) are prone to inheriting latent biases from their training corpora and can exhibit problematic behaviors, such as hallucination, overgeneralization, or sycophantic alignment with prompt expectations [29,30]. These epistemic uncertainties, coupled with the model’s opaque decision-making architecture, constrain interpretability—a critical concern in high-stakes decision environments [35].
Although methodological safeguards such as deduplication and source normalization were implemented, they are insufficient to fully counteract media amplification effects arising from redundant reporting across aggregated news portals. Additionally, LLMs such as GPT lack intrinsic geospatial reasoning capabilities and cannot reliably disambiguate place names or verify factual content across diverse linguistic representations, thereby introducing residual noise into structured metadata [36].
Importantly, this framework is designed to augment—not supplant—authoritative disaster datasets such as EM-DAT, DesInventar, and UNDRR archives. Its operational deployment in humanitarian or governance contexts necessitates complementary layers of expert oversight, stakeholder consultation, and ethical governance to ensure that AI-driven insights are contextualized, validated, and equitably applied [37,38].
Moreover, the temporal delimitation of the dataset (September 2023 to May 2025) imposes constraints on longitudinal generalizability. Observed patterns may reflect transient media cycles rather than enduring disaster dynamics. To improve robustness, future research should extend the temporal window, incorporate seasonal controls, and model evolving publication rates.
Future research may extend the temporal range, integrate multimodal signals (e.g., images, videos), or explore cross-linguistic framing by leveraging multilingual transformer models to better characterize cultural framing effects in disaster narratives. Prospective extensions should also integrate bias-mitigation techniques such as stratified sampling across geopolitical strata, source-weighted aggregation, and multilingual parity adjustments. Moreover, coupling narrative analytics with refined socio-economic resilience metrics—such as governance indicators, adaptive capacity, and infrastructural exposure—will further enhance the granularity and applicability of AI-based disaster intelligence frameworks.
8. Conclusions
This study presents a comprehensive, scalable, and empirically validated framework for analyzing global disaster narratives through the lens of artificial intelligence and probabilistic modeling. By leveraging a robust pipeline that integrates GPT-3.5 Turbo for semantic structuring, Latent Dirichlet Allocation (LDA) for topic inference, and Pearson’s chi-square tests for statistical validation, the research advances the state of disaster informatics along methodological, analytical, and policy-relevant dimensions.
A high-resolution corpus of 20,756 disaster-related news articles spanning 471 global news portals from September 2023 to May 2025 served as the foundation for the analysis. Through GPT-based extraction, each headline was converted into structured representations, capturing key attributes such as disaster type, location, country, casualties, and severity. Topic modeling revealed five semantically coherent narrative clusters evenly distributed across the dataset (17.68–22.61%), ensuring thematic granularity without model skewness.
Hypothesis testing produced highly significant outcomes: the chi-square statistic for narrative variation by disaster type was 25,280.78, and the statistic for variation by country was likewise highly significant, with p < 0.001 in both cases. These results empirically confirm that disaster reporting is non-random and deeply influenced by both the typology and geopolitical context of events.
The classification performance of GPT-3.5 Turbo further bolsters the framework’s reliability. Precision scores ranged from 0.802 (Cyclone) to 0.943 (Earthquake), recall from 0.786 (Wildfire) to 0.921 (Flood), and F1-scores from 0.798 to 0.929. Accuracy metrics exceeded 0.86 across all disaster categories, with a peak of 0.951 for Earthquake headlines. These results establish the system’s generalizability, robustness, and practical applicability in real-world high-stakes scenarios.
Beyond its computational contributions, the research has broad implications for crisis communication, humanitarian logistics, and transnational policy coordination. By quantifying media framing heterogeneity with precision, the study enables the design of narrative-aware policy interventions and real-time media monitoring systems that adapt to both regional and thematic contexts.
In sum, this work constitutes a foundational advance in AI-driven disaster analytics, bridging the gap between unstructured digital journalism and structured policy intelligence. Future research may extend the temporal coverage, integrate socio-economic resilience variables, and explore multimodal sources to further enrich the analytic landscape.