Mix-Persona Comment Generation and Geographically Enhanced Context Retrieval for LLM Fine-Tuning in Multimodal Crisis Post Classification

Bie, Tong; Hu, Yongli; Fu, Yu; Hao, Linjia; Liu, Tengfei; Guo, Kan; Jiang, Huajie; Gao, Junbin; Sun, Yanfeng; Yin, Baocai

doi:10.3390/ijgi15030104

Open Access Article


by Tong Bie ¹,², Yongli Hu ¹,²,*, Yu Fu ¹,², Linjia Hao ¹,², Tengfei Liu ¹,², Kan Guo ¹,², Huajie Jiang ¹,², Junbin Gao ³, Yanfeng Sun ¹,² and Baocai Yin ¹,²

¹ Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, 100 Pingleyuan, Chaoyang District, Beijing 100124, China

² School of Information Science and Technology, Beijing University of Technology, 100 Pingleyuan, Chaoyang District, Beijing 100124, China

³ The University of Sydney Business School, The University of Sydney, Camperdown, NSW 2006, Australia

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(3), 104; https://doi.org/10.3390/ijgi15030104

Submission received: 28 November 2025 / Revised: 31 January 2026 / Accepted: 20 February 2026 / Published: 2 March 2026

(This article belongs to the Topic Natural Hazards Monitoring, Risk Assessment, Modelling and Management in the Artificial Intelligence Era)


Abstract

Social media has become a vital source for humanitarian organizations to gather information during crises. However, existing multimodal classification methods operate primarily as isolated systems, while neglecting external references crucial for accurate judgment. Furthermore, while user comments can provide valuable context, they are often scarce during the early stages of a crisis. To address these limitations, we propose a framework named Mix-Persona Comment Generation with Geographically Enhanced Context Retrieval for LLM Instruction Fine-tuning (MPCG-GECR). To mitigate comment scarcity, we employ a Synthetic Persona Generator (SPG) that prompts LLMs to adopt diverse mix-personas, generating synthetic comments that simulate multi-perspective public discourse. To incorporate external references, we introduce a Geographically Enhanced Context Retrieval (GECR) module. Unlike standard retrieval approaches, GECR utilizes a hybrid re-ranking strategy to identify samples that are both multimodally similar and geographically consistent, serving as reliable reference anchors for the LLM. By integrating these social perspectives and geographic references into a unified instruction-tuning format, we transform the classification task into a context-aware text generation problem and fine-tune the LLM using Low-Rank Adaptation (LoRA). Extensive experiments on the CrisisMMD and DMD datasets demonstrate that MPCG-GECR effectively overcomes data scarcity and context isolation, significantly outperforming existing methods.

Keywords: multimodal crisis informatics; geographically retrieval-augmented generation; generative data augmentation; LLM instruction fine-tuning; crisis management


1. Introduction

The widespread adoption of social media platforms has rendered them a crucial channel for humanitarian organizations to obtain information about crisis events [1]. Accurately and rapidly analyzing these highly time-sensitive multimodal posts can effectively assist humanitarian organizations in planning relief operations [2].

Existing methods for multimodal crisis post classification primarily rely on multimodal feature fusion, as illustrated in Figure 1a. These approaches employ various pre-trained models to extract features from different modalities, feeding them into downstream fusion modules for feature integration [3,4,5,6]. Although progress has been made, these methods operate within a closed system, relying solely on the models’ internal parametric knowledge. While Lin et al. [7] introduced Large Language Models (LLMs) to extract event-related information, they restricted the LLM’s role to simple querying, failing to leverage its reasoning capabilities effectively or connect the posts to broader external contexts.

To enhance the semantic depth of social media posts, some existing methods (though not specifically designed for crisis post analysis) incorporate user comments as auxiliary text [8,9,10], as shown in Figure 1b. However, relying on real-world comments presents a significant bottleneck: during the early stages of a crisis, immediate user engagement is often scarce. Many posts receive no comments because of the urgency of the situation or limited visibility, and the time constraints of crisis data collection further narrow the window in which users can react. This “comment scarcity” renders methods dependent on crowd-sourced feedback ineffective precisely when they are most needed.

Furthermore, current approaches overlook a fundamental aspect of human cognition in crisis assessment: reference via retrieval. When experts analyze a crisis post, they do not examine it in isolation; they intuitively compare it with similar historical events or environmental contexts to make informed judgments [11,12]. In the broader field of Natural Language Processing (NLP), Retrieval-Augmented Generation (RAG) has achieved remarkable success by grounding LLM responses in external non-parametric knowledge, thereby reducing hallucinations and improving accuracy [13,14,15]. Despite its promise, RAG has not yet been effectively adapted for multimodal crisis post analysis. A naive application of standard retrieval methods is insufficient because crisis posts are characterized by unique geographic semantics. Merely retrieving similar text and images without considering geographic consistency (e.g., retrieving a flood image from a tropical zone to match a flood query in an arid zone) can introduce noise and mislead the model.

To address these limitations—specifically the scarcity of user interaction and the lack of geographically grounded reference information—we propose the Mix-Persona Comment Generation with Geographically Enhanced Context Retrieval for LLM Instruction Fine-tuning (MPCG-GECR) framework. Our pipeline is illustrated in Figure 1c. First, to solve the comment scarcity issue, we employ a Synthetic Persona Generator (SPG) that prompts the LLM to act as hundreds of diverse mix-personas, generating varied user comments that simulate multi-perspective public discourse. Second, to incorporate high-value external context, we introduce a Geographically Enhanced Context Retrieval (GECR) module. Unlike traditional retrieval methods, GECR considers both multimodal feature similarity and geographic semantics (e.g., location names, functional zones). By performing a hybrid re-ranking strategy, GECR retrieves samples that serve as “Geographic Reference Anchors,” providing the LLM with objective evidence to ground its reasoning. Finally, we integrate these synthetic subjective perspectives and retrieved objective references to construct a rich instruction fine-tuning dataset. We employ Low-Rank Adaptation (LoRA) to efficiently fine-tune the LLM, transforming the classification task into a context-aware generation task.

Crucially, the SPG and GECR modules are active not only during training but also during the inference phase. When handling new, unseen tweets, the system dynamically augments them with synthetic viewpoints and retrieved historical context before classification. This ensures that the model operates on a consistent data distribution across both training and testing phases. Furthermore, a critical distinction exists between the two modules: while SPG synthesizes subjective social perspectives to simulate public discourse, GECR strictly retrieves objective real-world historical data to serve as factual grounding.

By combining diverse subjective perspectives (via SPG) with objective geographic references (via GECR), our framework enables the LLM to mine underlying crisis information in greater depth. Experiments demonstrate that MPCG-GECR outperforms other methods on the CrisisMMD dataset [16] and the DMD dataset [17].

The main contributions of this paper are as follows:

We propose the MPCG-GECR framework, a novel approach that merges synthetic social media user comments and retrieval-augmented geographic context to address the data scarcity and context isolation problems in multimodal crisis post classification. Extensive experiments show that our method outperforms existing baselines.
We introduce the Geographically Enhanced Context Retrieval (GECR) module, representing the first attempt to integrate geographic semantics into the RAG paradigm for crisis post analysis. This module provides the LLM with interpretable geographic reference anchors, improving classification reliability.
We demonstrate the effectiveness of prompting LLMs to generate diverse personas and comments for data augmentation, offering a new direction for handling low-resource crisis post analysis tasks.

2. Related Work

2.1. Traditional Multimodal Crisis Post Classification

Early research in multimodal crisis post classification predominantly relied on multimodal feature fusion techniques. These methods typically employ separate encoders for text and image modalities, followed by fusion modules to integrate cross-modal representations. For instance, Abavisani et al. [3] aimed to address data inconsistency and overfitting through two key mechanisms: Stochastic Shared Embeddings regularization for robust training and Cross-Attention-based fusion of projected image and text embeddings. Qian et al. [4] utilized a multimodal masked transformer network to capture cross-modal semantic relationships, while filtering out redundant information. More recently, Wang et al. [5] adopted a multi-layer architecture combining autoencoders and graph convolutional networks to extract comprehensive information from multimodal data. Yu et al. [6] incorporated uncertainty estimation into the multimodal fusion process, treating encoder outputs as subjective opinions to learn more reliable representations.

While these approaches have demonstrated utility, they operate within a closed-system paradigm, relying exclusively on the parametric knowledge embedded in the models. This restricts their capacity to incorporate real-time or contextually relevant external knowledge, which is often critical for understanding evolving crisis situations.

2.2. Comment-Augmented Social Media Analysis

Beyond crisis-specific applications, several studies in related domains, such as fake news detection and rumor verification, have explored user comments as valuable auxiliary signals. Zheng et al. [8] proposed the Multi-modal Feature-enhanced Attention Networks, which integrate textual, visual, and social graph features into a unified framework. Their approach specifically focuses on aligning and complementing relationships across different modalities to enhance rumor detection. Similarly, Su et al. [10] constructed a dual-layer graph comprising both a news layer and a user layer. By mining multi-relations and user interaction features, their model effectively captures user credibility signals to assist in distinguishing fake news. Nan et al. [9] introduced a teacher–student framework named CAS-FEND. This method transfers knowledge from a comment-aware teacher model trained on historical data to a content-only student model, designed to preserve detection accuracy even when social contexts are absent in newly emerging news.

However, these methods depend on the availability of sufficient user engagement or established historical social contexts. This reliance becomes a significant limitation in crisis contexts where posts often receive few or no comments due to urgency, low visibility, or the nascent stage of the event. Consequently, the problem of comment scarcity makes approaches dependent on crowd-sourced feedback or social graph accumulation ineffective during the critical early phases of crisis response.

2.3. LLM Applications and Retrieval-Augmented Generation

Recent advancements have demonstrated the potential of LLMs and Retrieval-Augmented Generation (RAG) in analyzing complex social media content, particularly in the domain of fake news detection. Zheng et al. [18] proposed the Explainable Adaptive Rationale-Augmented Multimodal framework. They utilized Large Vision-Language Models to generate analyses and employed task-specific small models to extract useful rationales, thereby enhancing both generalization and explainability. To exploit the synergy between different model sizes, Zhou et al. [19] introduced the Multi-Round Collaboration Detection framework. This approach features a two-stage retrieval module to select relevant demonstrations and integrates the generalization abilities of LLMs with the specialized functionalities of small models through iterative learning. Similarly, Irnawan et al. [20] emphasized the importance of factual grounding. They proposed a method that assesses claim veracity primarily based on retrieved factual evidence, utilizing the common-sense reasoning mechanisms of LLMs only when explicit evidence is absent.

Despite their success in general NLP and fake news detection, RAG and LLMs have seen limited application in multimodal crisis settings. Standard retrieval methods often ignore the critical role of geographic consistency. This oversight can lead to misleading matches, for example, retrieving posts from unrelated regions that lack relevant geographic context. Moreover, existing LLM-based crisis post analysis approaches, such as that of Lin et al. [7], typically treat LLMs as passive knowledge bases for simple information extraction. They do not fully leverage the generative and reasoning capabilities of LLMs in an integrated manner, which is crucial for the accurate analysis of crisis posts.

2.4. Geographic Information in Social Media Analysis

The integration of geographic information has become a pivotal enhancement in social media analysis, particularly for crisis management. Unlike general text classification, crisis-related posts are inherently tied to specific spatio-temporal contexts. Recent studies have demonstrated that incorporating geographic attributes can significantly improve the granularity and accuracy of social media mining.

Several approaches focus on the precise extraction and inference of geographic locations from unstructured data. Wei et al. [21] fused the mention and retweet graphs with a gating mechanism to infer user locations. Similarly, Han et al. [22] employed fine-tuning on open-source LLMs to extract crisis geospatial intelligence and structured spatial entity attributes from Chinese social media texts.

Beyond mere location extraction, geographic information serves as a critical bridge for multi-task learning and multi-source data fusion. Zou et al. [23] introduced a framework that integrates toponym extraction with damage identification. By designing toponym-enhanced weights, their model utilizes geographic references to improve the representation of crisis severity. Zhu et al. [24] combined social media texts with microwave remote sensing images. Their framework matches textual traffic information with basic geographic data to assess road damage levels under extreme flood conditions. Furthermore, Zorenböhmer et al. [25] applied an aspect-based emotion analysis to wildfire tweets, revealing that emotion patterns, such as fear and sadness, exhibit distinct variations correlated with the distance to wildfire perimeters.

These studies collectively underscore a fundamental argument: geographic information is not merely a metadata tag but a semantic anchor that grounds social media content in physical reality. In crisis analysis, the interpretation of a post often depends on its location. However, existing methods primarily treat geographic information as an extraction target or a visualization tool. They rarely utilize geographic consistency as a retrieval signal to guide the reasoning process of the model. This limitation motivates our approach, where we leverage geographic semantics to retrieve contexts, thereby enabling the LLM to make more informed and grounded judgments.

3. Method

This paper proposes a Mix-Persona Comment Generation with Geographically Enhanced Context Retrieval for LLM Instruction Fine-tuning (MPCG-GECR) framework, as illustrated in Figure 2. The framework comprises four primary components: Image Information Extraction, User Comment Generation, Geographically Enhanced Context Retrieval, and Instruction Dataset Construction with LLM Fine-tuning. Crucially, our framework integrates “Subjective Generation” and “Objective Retrieval” within a unified inference pipeline. The User Comment Generation module synthesizes diverse social personas to address the scarcity of interaction. In contrast, the Geographically Enhanced Context Retrieval (GECR) module grounds reasoning in reality by retrieving verifiable real-world historical samples based on geographic and multimodal similarities. Both components are utilized during inference to construct a comprehensive context for the LLM. The following sections provide a detailed description of the proposed MPCG-GECR framework.

3.1. Image Information Extraction

To enable the joint processing of images and text using LLMs, we first employ a Multimodal Large Language Model (MLLM) to extract image information, while simultaneously incorporating common-sense knowledge related to the image content provided by the MLLM. Specifically, we utilize LLaVA [26,27] as the MLLM, replacing the original visual encoder of LLaVA with FG-CLIP [28] to enhance its capability in fine-grained image understanding. FG-CLIP is a powerful pre-trained multimodal model that achieves fine-grained visual and textual alignment.

Subsequently, we use a fixed prompt template to instruct the MLLM to generate detailed descriptions of the image. The prompt template is shown in Figure 3:

Here, [Tweet text] and [Tweet image] denote the original content of the tweet. We store the response generated by the MLLM as [Detailed image caption].

The primary objective of this process is to bridge the modality gap by translating visual content into textual descriptions. Since LLMs currently exhibit superior reasoning and instruction-following capabilities compared to MLLMs, converting images into text allows us to leverage the robust analytical power of LLMs in the subsequent stages. Furthermore, this process serves to explicitly externalize fine-grained visual details and implicit common-sense knowledge into a unified textual format, providing the model with a richer context for accurate analysis. It is worth noting that this stage focuses solely on visual content extraction. The verification of geographic entities and consistency alignment is rigorously addressed in the Geographically Enhanced Context Retrieval (GECR) module, as detailed in Section 3.3.

3.2. Diverse Comment Generation

We design a Synthetic Persona Generator (SPG) to generate sufficiently diverse personas. Specifically, we select five distinct attributes of social media users: Gender, Age, Education, Occupation, and Location. The Location attribute is particularly relevant to crisis events. Categorizing users by their geographic relationship to the crisis area is designed to activate the LLM’s understanding of the differing perspectives between affected and non-affected populations. This is a key aspect closely aligned with our task. The other four attributes are typical user characteristics. Prior research suggests that differences in these attributes lead people to express divergent views in their comments [29,30]. Gender differences reflect varied interests and perspectives; age reflects differences in cognitive maturity; education level corresponds to disparities in knowledge acquisition; and occupation represents disparities in professional expertise and skills. Work in crisis informatics likewise indicates that user demographics significantly influence information-sharing behaviors [31,32,33,34]. For instance, professional backgrounds (Occupation) often correlate with the sharing of actionable situational awareness, while educational levels can impact the linguistic formality and logical structure of the text. By explicitly modeling these attributes, we enable the SPG to simulate the semantic divergence between ‘emotional noise’ and ‘actionable intelligence’, thereby aiding the classifier in learning robust decision boundaries.

The specific attribute categories used are as follows:

Gender: Non-binary, Male, Female;
Age: Minors, Young Adulthood, Middle Age, Old Age;
Education: Primary Education, Secondary Education, Undergraduate Education, Graduate Education;
Occupation: Humanitarian Worker, Domain expert, Engineer, Medical Staff, Journalist, General public;
Location: Resident of the local crisis area, Resident of the surrounding crisis areas, Resident of the non-affected area.

We construct diverse mix-personas by combining different attributes, as illustrated by User 1 in Figure 2. Some combinations are inherently contradictory, such as pairing Minors with Graduate Education. To prevent the LLM from generating hallucinations, we filter out such clearly inconsistent combinations during persona construction, for example [Male, Minors, Graduate Education, Domain expert, Resident of the non-affected area]. The ‘General public’ category in the Occupation attribute models the common scenario where specific professional expertise is absent or unknown, ensuring that the model covers the baseline perspective of the average user.
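The persona construction step can be illustrated with a minimal Python sketch. The attribute values are taken from the list above, which yields 3 × 4 × 4 × 6 × 3 = 864 raw combinations. The consistency rule in `is_consistent` is an assumption made for illustration (the text only names the Minors and Graduate Education conflict explicitly), and the function names are hypothetical.

```python
from itertools import product

# Attribute categories as listed in the text.
ATTRIBUTES = {
    "Gender": ["Non-binary", "Male", "Female"],
    "Age": ["Minors", "Young Adulthood", "Middle Age", "Old Age"],
    "Education": ["Primary Education", "Secondary Education",
                  "Undergraduate Education", "Graduate Education"],
    "Occupation": ["Humanitarian Worker", "Domain expert", "Engineer",
                   "Medical Staff", "Journalist", "General public"],
    "Location": ["Resident of the local crisis area",
                 "Resident of the surrounding crisis areas",
                 "Resident of the non-affected area"],
}


def is_consistent(persona):
    """Hypothetical consistency rule (an assumption, not the paper's exact filter)."""
    if persona["Age"] == "Minors":
        # Assumed: minors have not reached tertiary education ...
        if persona["Education"] in ("Undergraduate Education", "Graduate Education"):
            return False
        # ... and do not hold a professional occupation.
        if persona["Occupation"] != "General public":
            return False
    return True


def build_personas():
    """Enumerate every attribute combination and drop the inconsistent ones."""
    keys = list(ATTRIBUTES)
    candidates = [dict(zip(keys, combo)) for combo in product(*ATTRIBUTES.values())]
    return [p for p in candidates if is_consistent(p)]


personas = build_personas()
```

Under this assumed rule the filter keeps several hundred personas; a stricter or looser rule set would change that count but not the overall procedure.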

After obtaining the mix-persona, we generate comments using the prompt template shown in Figure 2. Specifically, we decompose the mix-persona into individual attribute descriptions to guide the LLM in accurately simulating the specified character. Meanwhile, we assign the LLM a specific task and instruct it to mimic real Twitter users when generating comments. We provide the LLM with tweet text, image caption, and comment list to generate new comments while taking into account the existing comments.

Notably, we adopt a multi-round iterative strategy to generate the comment list, which aligns closely with the Chain-of-Thought (CoT) paradigm [35,36]. In each round, the LLM is assigned only one new persona that it has not previously simulated and is prompted to produce a corresponding comment, rather than generating a full comment list for multiple personas at once. This CoT-like strategy helps prevent the LLM from becoming confused by multiple conflicting persona attributes, thereby reducing the risk of introducing additional hallucinations during comment generation. Due to the context length limitations of the LLM, each [Comment list] contains only a limited number of comments. To accommodate a large volume of diverse comments, we generate multiple [Comment list] instances for the same tweet. We also ensure that the persona attributes differ across various [Comment list] instances for the same tweet to maintain diversity. Although we do not explicitly model temporal dynamics, the generation is conditioned on the tweet’s content, which serves as an implicit temporal anchor (e.g., descriptions of ‘shaking ground’ trigger immediate reactions, while ‘rescue arrival’ triggers post-event comments). This allows the model to dynamically adapt to the crisis stage depicted in the post.
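The multi-round strategy can be sketched as a simple loop in which each round introduces exactly one persona and conditions on the comments produced so far. This is a sketch under stated assumptions: `llm` stands for any callable that maps a prompt string to a reply string, the prompt wording is a placeholder rather than the template from Figure 2, and `max_comments` is a hypothetical cap standing in for the context length limit.

```python
def format_persona(persona):
    # Decompose the mix-persona into one description line per attribute.
    return "\n".join(f"- {name}: {value}" for name, value in persona.items())


def generate_comment_list(tweet_text, image_caption, personas, llm, max_comments=5):
    """Generate one comment per round, each round with a single new persona.

    `llm` is any callable mapping a prompt string to a reply string.
    The prompt text below is a placeholder, not the paper's template.
    """
    comments = []
    for persona in personas[:max_comments]:
        existing = "\n".join(comments) if comments else "(none)"
        prompt = (
            "You are a Twitter user with the following attributes:\n"
            f"{format_persona(persona)}\n\n"
            f"Tweet text: {tweet_text}\n"
            f"Image caption: {image_caption}\n"
            f"Existing comments:\n{existing}\n\n"
            "Write one new comment on this tweet."
        )
        # The new comment joins the list and is visible in the next round.
        comments.append(llm(prompt).strip())
    return comments
```

To obtain several [Comment list] instances for the same tweet, the function would be called repeatedly with disjoint slices of the persona pool.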

In summary, the implementation of the Synthetic Persona Generator (SPG) aims to simulate the multifaceted nature of public discourse during a crisis. By varying key demographic attributes—particularly Location—we enable the model to capture the distinct cognitive perspectives of both affected populations and remote observers. The primary benefit of this approach, reinforced by our multi-round iterative generation strategy, is the creation of a high-quality heterogeneous comment corpus that minimizes hallucinations while maximizing the diversity of social viewpoints.

However, while the SPG simulates geographic perspectives via persona attributes, it relies on the internal knowledge of the LLM and lacks connection to specific real-world external data. To complement this simulated social diversity with objective data-driven evidence, we introduce the Geographically Enhanced Context Retrieval (GECR) module in the following section.

3.3. Geographically Enhanced Context Retrieval

To assist the LLM in incorporating broader information for the accurate analysis of crisis-related posts, we adopt a Retrieval-Augmented Generation (RAG) approach to collect high-value reference information. Specifically, we introduce the Geographically Enhanced Context Retrieval (GECR) module, marking its first application in crisis post analysis. Unlike existing methods that perform retrieval based solely on feature information, our GECR module incorporates a retrieval process targeting geographic semantics. This allows us to simultaneously retrieve high-value samples that share both similar multimodal features and geographic semantic information with the target sample, thereby providing the model with more explicit references. Unlike the synthetic comments generated in Section 3.2, the geographic context here is retrieved from real-world historical data (training set) to serve as objective grounding.

The GECR process consists of three steps: Geographic Entity Extraction and Indexing, Multimodal Feature Retrieval, and Hybrid Re-ranking with Geographic Semantics.

3.3.1. Geographic Entity Extraction and Indexing

We first employ lightweight NER tools (BERT-NER [37,38]) and OCR tools (EasyOCR [39]) to process multimodal data. Specifically, for the target sample and all samples in the training set, we use the OCR tool to extract all textual information from images. Subsequently, we utilize the NER tool to extract Geo-Entities (e.g., Countries, States, Cities, Towns) and Location-Entities (e.g., river names, street names, landmarks, functional zones).

This extracted information is inserted into the original training set data and stored under two new distinct tags: Geo and Location. This transformation converts unstructured user-generated content into structured geographic metadata, successfully constructing a retrieval database enriched with geographic semantics. This step serves as the foundation for the subsequent geographic semantic retrieval process.
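The indexing step can be sketched as follows. To keep the example self-contained, the OCR and NER tools are passed in as plain callables rather than tied to a specific library call. The interfaces of `ocr` and `ner`, the label names "GEO" and "LOCATION", the sample field names, and the choice to run NER over the tweet text together with the OCR text are all assumptions made for illustration.

```python
def index_sample(sample, ocr, ner):
    """Attach Geo and Location tags to one sample.

    Assumed interfaces (hypothetical):
      ocr(image_path) -> list of strings read from the image
      ner(text)       -> list of (entity_text, label) pairs,
                         with label "GEO" or "LOCATION" for geographic entities
    """
    ocr_text = " ".join(ocr(sample["image_path"]))
    full_text = f"{sample['text']} {ocr_text}".strip()

    geo, location = [], []
    buckets = {"GEO": geo, "LOCATION": location}
    for entity, label in ner(full_text):
        bucket = buckets.get(label)
        # Keep only geographic labels and skip duplicates.
        if bucket is not None and entity not in bucket:
            bucket.append(entity)

    # Return a copy of the sample with the two new tags added.
    return {**sample, "Geo": geo, "Location": location}
```

Applying `index_sample` to every training sample produces the retrieval database; the same function applied to the target sample gives the tags used later during re-ranking.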

3.3.2. Multimodal Feature Retrieval

We utilize the FG-CLIP encoder to extract text and image features from the target sample and all training set samples. We then calculate the multimodal similarity between the target sample and other samples, respectively. Specifically, for a multimodal target sample $X_{i} = \{T_{i}, V_{i}\}$ and a training set sample $X_{j} = \{T_{j}, V_{j}\}$, we denote the unimodal features extracted by FG-CLIP as $t_{i}, v_{i}, t_{j}, v_{j}$. We use $\cos(\cdot)$ to represent the normalized cosine similarity between two features. For example, the similarity between $t_{i}$ and $t_{j}$ is expressed as

$$\mathrm{Cos}_{t_{i}, t_{j}} = \frac{t_{i} \cdot t_{j}}{\lVert t_{i} \rVert \, \lVert t_{j} \rVert}. \tag{1}$$

Here, $\lVert \cdot \rVert$ indicates the L2 norm, used to normalize the feature vectors. Given the strong cross-modal alignment capabilities of FG-CLIP, potential associations between cross-modal contents are already considered during encoding. Therefore, we employ a cross-modal similarity fusion method to comprehensively measure multimodal similarity from four different cross-modal directions. Finally, the average similarity $S_{i, j}$ between sample $X_{i}$ and $X_{j}$ is calculated as

$$S_{i, j} = \frac{\mathrm{Cos}_{t_{i}, t_{j}} + \mathrm{Cos}_{t_{i}, v_{j}} + \mathrm{Cos}_{v_{i}, t_{j}} + \mathrm{Cos}_{v_{i}, v_{j}}}{4}. \tag{2}$$

Using the similarity calculation method above, we match the features of the target sample $X_{i}$ against all samples in the retrieval database (training set). We select the Top-K samples with the highest similarity as candidate samples following this coarse-grained multimodal feature retrieval. These K candidates similar to $X_{i}$ form the sample candidate pool $R_{i} = \{X_{i_{1}}, X_{i_{2}}, \dots, X_{i_{k}}\}$.

In this step, we establish a preliminary candidate pool based on multimodal features; these candidates are already similar to the target sample at the feature level, providing a solid basis for the subsequent fine-grained geographic re-ranking.
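The coarse-grained retrieval described in this subsection (the cosine similarity of Equation (1), the four-way average of Equation (2), and the Top-K selection) can be written as a short pure-Python sketch. Features are assumed to be non-zero lists of floats already extracted by the encoder; the dictionary keys `"t"` and `"v"` and the function names are hypothetical.

```python
import math


def cosine(a, b):
    # Eq. (1): dot product divided by the product of the L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multimodal_similarity(target, other):
    # Eq. (2): average over the four text/image pairings.
    pairs = [
        (target["t"], other["t"]),
        (target["t"], other["v"]),
        (target["v"], other["t"]),
        (target["v"], other["v"]),
    ]
    return sum(cosine(a, b) for a, b in pairs) / 4


def retrieve_top_k(target, database, k):
    # Coarse-grained retrieval: keep the k most similar training samples.
    scored = [(multimodal_similarity(target, s), s) for s in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The returned list pairs each candidate with its score $S_{i, j}$, so the score can be reused as the tie-breaker in the re-ranking stage. For a large retrieval database, a vectorized or approximate nearest-neighbor implementation would be the more practical choice.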

3.3.3. Hybrid Re-Ranking with Geographic Semantics

Upon obtaining the candidate pool $R_{i}$, we further employ a Hybrid Re-ranking Strategy based on geographic semantics to select high-value samples that provide additional context for the LLM’s reasoning. This strategy comprises two distinct branches: Explicit Alignment and Implicit Inference. Regardless of whether the target sample possesses explicit geographic semantic tags, the dual guarantee provided by these branches ensures that the hybrid re-ranking strategy effectively retrieves the most discriminative samples from the candidate pool.

Before re-ranking, we first check the geographic semantic information of the target sample. If the target sample contains geographic semantic information (i.e., it has Geo and Location tags), we execute the Explicit Alignment strategy.

Explicit Alignment strategy. We first divide the candidate pool into two groups based on the target sample’s Geo and Location tags: those that share the same tags and those that do not.

Same-Tag Group: We employ the FG-CLIP encoder to extract feature embeddings for the Geo and Location tags of both the target and candidate samples, subsequently computing the geographic semantic similarity between them. Based on this metric, we re-rank all candidate samples in descending order. In cases where the geographic semantic similarity between two candidate samples and the target sample is identical, their relative order is determined by their multimodal similarity. Finally, we retrieve the Top-2 samples from the re-ranked list. Here, ‘Top-2’ refers to the two candidates with the highest similarity scores in the sorted list, serving as the most reliable geographic reference anchors.
Different-Tag Group: To supplement extra diversity, we also retrieve information from the group with different tags. Exposure to diverse data enables models to learn essential patterns, mitigate overfitting, and develop a more holistic reasoning capability [40]. This method of supplementing diverse information has been proven beneficial in fields such as Visual Question Answering (VQA), enabling models to make more comprehensive and accurate inferences [41]. For this group, we rank based on multimodal similarity and retrieve the Top-2 samples.

In cases where the “Same-Tag” group contains fewer than two samples, the deficit is filled by retrieving additional samples from the “Different-Tag” group.
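The Explicit Alignment branch can be sketched as below. Two points are assumptions for illustration: a candidate is treated as sharing the target's tags when the Geo or Location tag sets overlap at all, and each candidate is assumed to carry precomputed scores `geo_sim` (similarity of the tag embeddings) and `mm_sim` (the multimodal similarity from the coarse retrieval). The field and function names are hypothetical.

```python
def shares_tags(target, candidate):
    # Assumed criterion: any overlap in the Geo or Location tag sets.
    return bool(
        set(target["Geo"]) & set(candidate["Geo"])
        or set(target["Location"]) & set(candidate["Location"])
    )


def explicit_alignment(target, candidates, total=4, per_group=2):
    """Pick `total` reference samples; candidates carry `geo_sim` and `mm_sim`."""
    same = [c for c in candidates if shares_tags(target, c)]
    diff = [c for c in candidates if not shares_tags(target, c)]

    # Same-Tag group: geographic similarity first, multimodal similarity breaks ties.
    same.sort(key=lambda c: (c["geo_sim"], c["mm_sim"]), reverse=True)
    # Different-Tag group: multimodal similarity only.
    diff.sort(key=lambda c: c["mm_sim"], reverse=True)

    picked_same = same[:per_group]
    # Any deficit in the Same-Tag group is filled from the Different-Tag group.
    picked_diff = diff[: total - len(picked_same)]
    return picked_same + picked_diff
```

When the Same-Tag group holds at least two candidates, this returns the 2 + 2 split; with a single same-tag candidate it returns one same-tag and three different-tag samples.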

Ultimately, we select four high-value samples from the candidate pool $R_{i}$ to constitute the final retrieved set $R_{i}^{'}$. This set includes not only samples exhibiting explicit environmental similarity and phenomenological consistency but also supplementary samples that share similar multimodal features yet possess distinct geographic semantics. The choice of a total retrieval size of $K = 4$ and the specific $2 + 2$ allocation strategy are grounded in our experimental findings (see Section 4.6) and a design philosophy of balance. Specifically, retrieving two samples from the “Same-Tag” group provides a robust reference average to mitigate the risk of outliers inherent in single-sample retrieval. Simultaneously, allocating the remaining two slots to the “Different-Tag” group prevents the marginalization of diversity benefits that would occur with a skewed (e.g., 3-1) split. This strategy ensures that the retrieved context maintains clear environmental similarity and phenomenological consistency while simultaneously offering valuable supplementary information.

If the target sample lacks geographic semantic information (i.e., no Geo or Location tags), we execute the Implicit Inference strategy. This serves as a data-driven form of latent geographic inference.

Implicit Inference strategy. We implement two distinct processing strategies depending on the availability of event type information in the training set:

If event type information is available in the training set (e.g., the CrisisMMD dataset [16] annotates specific event names), we group the samples in the candidate pool by event type and rank them within each group by multimodal similarity (high to low). We then select the Top-2 samples from the largest group, and the Top-1 sample from the second and third largest groups, respectively. If only two groups exist, we select the Top-2 from each; if only one group exists, we simply select the Top-4.
If event type information is unavailable (e.g., the DMD dataset [17] only annotates sample categories), we implement a latent semantic clustering strategy. We employ the K-Means++ algorithm [42] to cluster samples based on the multimodal features of the target and candidate pool. The construction of a multimodal feature involves three steps:
- Pre-normalization: We first apply L2 normalization to the single-modal text and image features extracted by FG-CLIP. This step is crucial to address the modality gap phenomenon [43]. Without this, the modality with naturally larger feature magnitudes (typically vision) would dominate the distance calculation, causing the model to ignore textual semantics [44].
- Concatenation: We concatenate the normalized multimodal features to form a fused vector, ensuring an equal contribution (1:1) from both modalities.
- Post-normalization: Finally, we perform a second L2 normalization on the fused features. This ensures that the Euclidean distance used in K-Means effectively approximates Cosine Similarity, which is the native metric for CLIP-based embeddings.
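
The three-step feature construction above can be illustrated with a small pure-Python sketch. The function names and the toy two-dimensional features are hypothetical (real FG-CLIP features are high-dimensional); the point is only to show that, after the second normalization, squared Euclidean distance equals $2 - 2\cos$, so K-Means on the fused vectors orders neighbors the same way cosine similarity does.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_feat, image_feat):
    """Norm-Concat-Norm: normalize each modality, concatenate, normalize again."""
    fused = l2_normalize(text_feat) + l2_normalize(image_feat)  # list concatenation
    return l2_normalize(fused)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy features: the image vectors have a much larger magnitude than the text vectors.
u = fuse([1.0, 2.0], [30.0, 40.0])
v = fuse([2.0, 1.0], [40.0, 30.0])

print(sq_euclidean(u, v), 2 - 2 * dot(u, v))  # the two values agree up to rounding
print(dot(u[:2], u[:2]), dot(u[2:], u[2:]))   # each modality carries half of the squared norm
```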

The number of clusters K is set to a small constant (e.g., $K = 4$ ) to impose a structured diversity on the retrieved candidates, forcing the selection algorithm to sample from distinct latent semantic sub-groups rather than over-focusing on a single dominant pattern. From the clustering results, we select the Top-2 samples from the cluster containing the target sample and the Top-1 sample from each of the two clusters with the nearest centroids. In the specific case where the target sample forms a cluster of its own (i.e., a singleton cluster), we identify the nearest clusters by calculating the Euclidean distance between the target sample and the centroids of all other clusters. We then select the Top-2 samples from each of the two clusters with the smallest centroid distances. In cases where $K \neq 4$ , we continue to retrieve the Top-2 samples from within the target sample’s cluster, while selecting the two samples with the highest similarity from the nearest neighboring clusters.
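
A sketch of the cluster-based selection rule follows, assuming the clustering itself has already been run and each candidate carries a cluster label. The function name, the tuple layout, and the toy centroids are hypothetical. As a simplifying assumption, neighboring clusters are ranked by the distance between the target sample and their centroids in both the regular and the singleton case, and the case where the target cluster contains exactly one candidate is not treated specially.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_from_clusters(target_vec, target_cluster, candidates, centroids):
    """candidates: list of (sample_id, cluster_id, mm_sim) tuples.
    centroids:  dict mapping cluster_id -> centroid vector.
    """
    def top(cluster_id, n):
        members = [c for c in candidates if c[1] == cluster_id]
        members.sort(key=lambda c: c[2], reverse=True)  # highest multimodal similarity first
        return [c[0] for c in members[:n]]

    # The two other clusters whose centroids lie closest to the target sample.
    neighbors = sorted(
        (cid for cid in centroids if cid != target_cluster),
        key=lambda cid: euclidean(target_vec, centroids[cid]),
    )[:2]

    own = top(target_cluster, 2)
    # Singleton target cluster (no candidates in it): Top-2 from each neighbor instead.
    per_neighbor = 1 if own else 2
    picked = list(own)
    for cid in neighbors:
        picked += top(cid, per_neighbor)
    return picked


centroids = {0: (0.1, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0), 3: (5.0, 5.0)}
cands = [
    ("a", 0, 0.90), ("b", 0, 0.80), ("c", 0, 0.70),
    ("d", 1, 0.60), ("e", 1, 0.50),
    ("f", 2, 0.40), ("h", 2, 0.30),
    ("g", 3, 0.95),
]
print(select_from_clusters((0.0, 0.0), 0, cands, centroids))  # prints ['a', 'b', 'd', 'f']
```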

Through this implicit inference, we ensure that, even when the target sample lacks specific geographic semantic information, we can implicitly infer it through event categories or feature similarity and form the retrieved set $R_{i}^{'}$. This strategy effectively retrieves context with environmental similarity and phenomenological consistency based on environmental morphology and crisis characteristics, while also supplementing diversity rather than limiting results to samples that are merely highly similar to the target.
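
For the event-type branch of the implicit inference strategy, the quota rule (2/1/1 for three or more groups, 2/2 for two groups, 4 for a single group) might be written as below. The function name, the tuple layout, and the toy candidates are illustrative assumptions; ties in group size are broken by first appearance, which the text does not specify.

```python
from collections import defaultdict


def select_by_event(candidates):
    """candidates: list of (sample_id, event_type, mm_sim) tuples."""
    groups = defaultdict(list)
    for cand in candidates:
        groups[cand[1]].append(cand)

    # Largest group first; Python's sort is stable, so ties keep insertion order.
    ranked = sorted(groups.values(), key=len, reverse=True)
    for group in ranked:
        group.sort(key=lambda c: c[2], reverse=True)  # multimodal similarity, high to low

    if len(ranked) == 1:
        quotas = [4]
    elif len(ranked) == 2:
        quotas = [2, 2]
    else:
        quotas = [2, 1, 1]

    picked = []
    for group, quota in zip(ranked, quotas):
        picked += [c[0] for c in group[:quota]]
    return picked


pool = [
    ("a", "hurricane_harvey", 0.90),
    ("b", "hurricane_irma", 0.85),
    ("c", "hurricane_harvey", 0.80),
    ("d", "hurricane_harvey", 0.70),
    ("e", "hurricane_irma", 0.60),
    ("f", "hurricane_maria", 0.50),
]
print(select_by_event(pool))  # prints ['a', 'c', 'b', 'f']
```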

In summary, by introducing the GECR module, we overcome the limitations of traditional retrieval methods that ignore geographic semantics. Whether through explicit alignment or implicit inference, this module ensures that retrieved samples remain maximally similar to the target in terms of geospatial and environmental context without losing diversity. These samples not only share similar text and images with the target, but also serve as critical geographic reference anchors, providing bases for the LLM to evaluate the content of crisis posts. The complete workflow of the proposed GECR module is summarized in Algorithm 1.

Algorithm 1 Geographically Enhanced Context Retrieval (GECR)

Require: Target sample $X_{i} = \{T_{i}, V_{i}\}$, training set $D$, retrieval size $K$, final size $N = 4$.
Ensure: Retrieved context set $R_{i}^{'}$.
// Step 1: Multimodal Feature Retrieval
Extract features $f_{i} = \text{FG-CLIP}(X_{i})$, $f_{j} = \text{FG-CLIP}(X_{j}), \forall X_{j} \in D$.
Calculate similarity $S_{i, j}$ via Equation (2) for all $X_{j} \in D$.
$R_{i} \leftarrow \text{TopK}(D, S_{i, j}, K)$. {Initial Candidate Pool}
// Step 2: Hybrid Re-ranking Strategy
if $X_{i}$ has Geo/Location tags then
    Branch A: Explicit Alignment
    Split $R_{i}$ into $G_{same}$ (share tags) and $G_{diff}$ (different tags).
    Sort $G_{same}$ by tag similarity (descending).
    Sort $G_{diff}$ by multimodal similarity $S_{i, j}$ (descending).
    Select top samples from $G_{same}$: $S_{1} \leftarrow G_{same}[0 : \min(2, |G_{same}|)]$.
    Calculate remaining quota: $N_{rem} = 4 - |S_{1}|$.
    Select top samples from $G_{diff}$: $S_{2} \leftarrow G_{diff}[0 : N_{rem}]$.
    $R_{i}^{'} \leftarrow S_{1} \cup S_{2}$.
else
    Branch B: Implicit Inference
    if Event Types are available in $D$ then
        Group $R_{i}$ by Event Type.
        Select samples based on group size (Top-2 from largest, Top-1 from others).
    else
        Construct fused features $F$ via Norm-Concat-Norm strategy.
        Perform K-Means++ on $R_{i} \cup \{X_{i}\}$ with $F$ ($K = 4$).
        Identify target cluster $C_{target}$ and nearest clusters $C_{near}$.
        Select Top-2 from $C_{target}$ and Top-1 from each $C_{near}$.
    end if
    $R_{i}^{'} \leftarrow$ Selected samples.
end if
return $R_{i}^{'}$

With the successful acquisition of the high-value retrieved samples ($R_{i}^{'}$), we have obtained all the information necessary to construct the instruction fine-tuning dataset. These retrieved samples $R_{i}^{'}$, which carry their content, ground-truth labels, and Geo/Location tags, serve as few-shot demonstrations to guide the LLM’s reasoning process.

3.4. Instruction Dataset Construction

We reformulate the multimodal crisis classification task as a text generation task, requiring the LLM to predict the category of a post via text generation conditioned on the provided multimodal context. To optimize this mapping and unlock the reasoning potential of the LLM, we employ instruction fine-tuning. To this end, we reconstruct the original dataset by integrating heterogeneous information modules—tweet text, visual descriptions, synthetic comments, and the retrieved geographic context—into a unified instruction format.

Specifically, for the i-th multimodal post $X_{i} = \{T_{i}, V_{i}\}$ with the ground-truth label $Y_{i}$, we aggregate the following augmented components:

The [Detailed image caption] $D_{i}$ , generated by the MLLM to externalize image details;
The [Comment List] $C_{i, j}$ , generated by the LLM. Each list comprises n distinct comments (denoted as $c_{i, j, 1}, \dots, c_{i, j, n}$ ), corresponding to varying social perspectives derived from different simulated personas;
The [Retrieved Samples] $R_{i}^{'}$ , consisting of high-value reference samples obtained through the GECR module, which provides geographic semantics and reasoning anchors.

During the construction of the instruction dataset, we transform the original sample $X_{i}$ into a series of augmented instruction instances. Since we generate multiple distinct comment lists for a single post to maximize diversity, a single original sample $X_{i}$ yields j instruction-tuning instances, defined as $I_{i, j} = \{T_{i}, D_{i}, C_{i, j}, R_{i}^{'}\}$.

Finally, these components are populated into a structured prompt template (as shown in Figure 2). This template functions as a structured framework guiding the LLM to synthesize the textual content ($T_{i}$), visual context ($D_{i}$), social sentiment ($C_{i, j}$), and geographic references ($R_{i}^{'}$) to infer the correct crisis category. The [Class] field in the output component of the template is replaced with the category name corresponding to $Y_{i}$.
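
To make the construction concrete, the sketch below builds one instruction instance per comment list. The prompt layout is a hypothetical stand-in for the template in Figure 2, which is not reproduced here; the function name, dictionary keys, and toy post are likewise assumptions made for illustration.

```python
def build_instances(tweet_text, caption, comment_lists, retrieved, label):
    """Return one instruction-tuning instance per comment list.

    retrieved: list of dicts with hypothetical keys 'text', 'tags', 'label'.
    """
    instances = []
    for comments in comment_lists:
        lines = [
            "[Tweet text] " + tweet_text,
            "[Detailed image caption] " + caption,
            "[Comment List]",
        ]
        lines += ["- " + c for c in comments]
        lines.append("[Retrieved Samples]")
        lines += [
            "- {} | tags: {} | label: {}".format(r["text"], r["tags"], r["label"])
            for r in retrieved
        ]
        # The output side holds the category name that replaces the [Class] field.
        instances.append({"input": "\n".join(lines), "output": label})
    return instances


# Toy data, invented purely for illustration.
demo = build_instances(
    tweet_text="Road to the harbor is under water.",
    caption="A flooded street with partially submerged cars.",
    comment_lists=[["Stay safe!", "Is the bridge still open?"], ["Which district is this?"]],
    retrieved=[{"text": "Flooding near the pier", "tags": "coast", "label": "Informative"}],
    label="Informative",
)
print(len(demo))  # prints 2
print(demo[0]["input"])
```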

By constructing the dataset in this manner, we effectively simulate a “Human-in-the-Loop” reasoning process where the model is provided with objective content (tweet text and image), subjective public opinion (comments), and comparative context (retrieved samples). This holistic data construction strategy ensures that the LLM is not merely memorizing labels but is learning to reason across multiple dimensions of the crisis event. In the next section, we will detail how we utilize this enriched dataset to fine-tune the LLM for optimal performance.

3.5. LoRA-Based LLM Instruction Fine-Tuning

To effectively adapt LLMs to the nuances of crisis analysis without the prohibitive computational cost of full fine-tuning, we employ Low-Rank Adaptation (LoRA) [45]. LoRA facilitates efficient fine-tuning by freezing the pre-trained weight matrices of the LLM and injecting a small number of trainable low-rank decomposition matrices into the transformer layers. We specifically integrate LoRA into the query and value projection layers of the attention blocks within the LLM. The forward pass in a LoRA-adapted layer is expressed as

$$h = W_{0} x + \Delta W x = W_{0} x + M N x, \tag{3}$$

where $x$ is the input, $W_{0} \in \mathbb{R}^{d \times k}$ denotes the frozen pre-trained weight matrix, and $\Delta W$ represents the trainable adapter. The adapter is decomposed into two low-rank matrices $M \in \mathbb{R}^{d \times r}$ and $N \in \mathbb{R}^{r \times k}$, where the rank $r \ll \min(d, k)$.
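
A tiny numerical sketch of Equation (3) is given below, using plain Python lists so that the shapes are explicit. The matrices are toy values, and the dimensions $d = k = 4096$ with rank $r = 8$ in the parameter count are illustrative assumptions, not settings reported in this paper.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def lora_forward(W0, M, N, x):
    """h = W0 x + M (N x); W0 stays frozen, only M and N would be trained."""
    base = matvec(W0, x)
    update = matvec(M, matvec(N, x))  # never materializes the d x k matrix M N
    return [b + u for b, u in zip(base, update)]


def trainable_fraction(d, k, r):
    """Adapter parameters r * (d + k) relative to the d * k frozen weights."""
    return r * (d + k) / (d * k)


W0 = [[1, 0, 2], [0, 1, 0]]  # d = 2, k = 3
M = [[1], [2]]               # d x r with r = 1
N = [[1, 1, 0]]              # r x k
x = [1, 2, 3]

print(lora_forward(W0, M, N, x))         # prints [10, 8]
print(trainable_fraction(4096, 4096, 8))  # prints 0.00390625
```

Under these assumed dimensions the adapter adds roughly 0.4% of the parameters of the frozen matrix it modifies.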

During the fine-tuning process, the model is optimized by minimizing the negative log-likelihood loss, which drives it to learn how to autoregressively generate the correct textual response based on the given instruction. Crucially, the input instruction $X_{I}$ is constructed by populating the comprehensive template that integrates distinct informational elements: the [Tweet text] $T_{i}$, the MLLM-generated [Detailed image caption] $D_{i}$, the [Comment List] $C_{i, j}$, and the [Retrieved Samples] $R_{i}^{'}$. The loss function is defined as

$$\mathcal{L} = - \sum_{s = 1}^{S} \log p_{\theta} (y_{s} \mid X_{I}, y_{<s}), \tag{4}$$

where $y = \{y_{s}\}_{s = 1}^{S}$ denotes the target output sequence (i.e., the textual representation of the ground-truth crisis category) and $y_{<s}$ denotes the tokens preceding position $s$.
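
As a worked instance of Equation (4), the loss for one target sequence is the negative sum of the log-probabilities the model assigns to each correct token. The probabilities below are made-up numbers for illustration, and `sequence_nll` is a hypothetical helper name.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence.

    token_probs[s] is the model probability of the correct token y_s given
    the instruction X_I and the preceding target tokens y_<s.
    """
    return -sum(math.log(p) for p in token_probs)


confident = sequence_nll([0.9, 0.8, 0.95])
uncertain = sequence_nll([0.5, 0.5, 0.5])
print(confident, uncertain)  # the confident prediction has the lower loss
```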

Through this instruction fine-tuning strategy, we transform the LLM from a general-purpose generator into a specialized reasoning agent for crisis informatics. Consequently, the LLM learns not only to classify textual and visual details but also to perform a holistic assessment, integrating retrieved geographic evidence with social comments to provide accurate and context-aware crisis post classification.

4. Results

4.1. Dataset

We conducted experiments on the publicly available multimodal crisis post classification dataset, CrisisMMD [16]. This dataset comprises seven natural disaster events from 2017, including Hurricane Irma, Hurricane Maria, Hurricane Harvey, Mexico earthquake, Iraq–Iran earthquakes, Sri Lanka floods, and California wildfires. A protocol is applied for multi-image tweets: if a tweet has two or more images, we process each text–image pair as an independent sample while maintaining identical textual content across pairs. In the experiments, we used tweets with consistent multimodal labels and evaluated the performance on both Task 1 and Task 2:

Task 1 aims to assess whether a tweet contains valuable humanitarian aid information, requiring classification into two categories: Informative and Not Informative.
Task 2 further categorizes tweets into five fine-grained classes: “Affected individuals” (including “Injured or dead people” and “Missing or found people”), “Rescue, volunteering, or donation effort”, “Infrastructure and utility damage” (including “Vehicle damage”), “Other relevant information”, and “Not humanitarian”.

The CrisisMMD dataset is split into training, validation, and test sets with a ratio of 75/13/12%.

In addition, we conducted experiments on the DMD dataset [17]. The DMD dataset comprises multimodal posts sourced from the Instagram platform, covering five distinct damage categories in addition to a “No damage” category: Infrastructural damage, Natural landscape (such as landslides, avalanches, and fallen trees), Fires, Floods, and Human damage (human casualties or deaths). In our experiments, only multimodal posts containing both images and text were utilized. The DMD dataset is split into training, validation, and test sets with a ratio of 80/10/10%. Notably, the DMD dataset does not contain specific event labels, providing only the sample content itself and the corresponding ground-truth labels. It is important to note that the classification task on the DMD dataset involves discriminating among these six specific categories. Therefore, it represents a fine-grained multi-class classification task, analogous to Task 2 in the CrisisMMD dataset.

4.2. Implementation Details

We utilized LLaVA-1.5-7B [26,27] as the MLLM, replacing its original visual feature extractor with FG-CLIP [28]. Mix-persona simulation and diverse comment generation were conducted using DeepSeek-3.1 [46]. We utilized Llama 3-8B [47] during the fine-tuning stage and employed AdamW [48] as the optimizer. The batch size was set to 12. To ensure statistical reliability, we conducted experiments with five random seeds and reported the average performance. The model was trained for a total of 20 epochs, incorporating an early stopping strategy with a patience value of 5. During comment generation, we restricted the maximum length of DeepSeek output to 50 tokens, while limiting LLaVA’s output to 120 tokens. The maximum number of comments n in each [Comment list] was set to 30.
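
The early stopping rule mentioned above (at most 20 epochs, patience of 5) can be sketched as follows. The monitored quantity is assumed here to be a validation score where higher is better, which the text does not specify, and the score curve is invented for illustration.

```python
def early_stopping_epoch(val_scores, max_epochs=20, patience=5):
    """Return (best_epoch, last_epoch_run) for a sequence of validation scores.

    Training stops once the score has failed to improve for `patience`
    consecutive epochs, or after `max_epochs`, whichever comes first.
    """
    best_score, best_epoch, stale = float("-inf"), 0, 0
    last_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, last_epoch


curve = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.87, 0.85, 0.84, 0.90]
print(early_stopping_epoch(curve))  # prints (3, 8)
```

In this toy curve the best checkpoint is from epoch 3, and training halts at epoch 8 before the later improvement at epoch 10 is ever seen, which is the usual cost of a finite patience.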

4.3. Baselines

We compare the proposed MPCG-GECR with numerous prior studies. We select ResNet [49] and BERTweet [50] as unimodal baselines and choose SCBD [3], AT-CAVE [51], OWSEC [4], MFEK [7], CEFN [6], MMDG [5], OWDII [52], CG-PG [53], and CrisisSpot [54] as multimodal baselines, primarily focused on improving multimodal fusion. Additionally, we include DeepSeek-V3 [46], DeepSeek-R1 [55], Llama 3-8B [56], and LLaVA-1.5-7B [26,27] as representative LLMs and MLLMs to evaluate their performance on downstream zero-shot classification tasks. Since LLMs have not been widely adopted in crisis tweet analysis, we select MFEK as the representative model that utilizes LLMs. Detailed explanations of the selected baselines are provided below.

ResNet [49]: This classical image processing model has shown robust performance in various computer vision tasks.
BERTweet [50]: A large-scale pre-trained language model for English tweets, which shares the same architecture as BERT and uses the RoBERTa pre-training procedure.
SCBD [3]: This crisis tweet classification model addresses data inconsistency and overfitting through two key mechanisms: Stochastic Shared Embeddings (SSE) regularization for robust training and Cross-Attention-based fusion of projected image and text embeddings. These mechanisms enhance model stability while maintaining multimodal integration.
AT-CAVE [51]: An adaptive transformer-based conditioned variational autoencoder network for incomplete multimodal tweet classification. It jointly models the textual information, visual information and label information into a unified deep model, which can generate more discriminative latent features and enhance the performance in missing-modality scenarios.
OWSEC [4]: An open-world multimodal classification model that combines fine-grained semantic interaction. The model utilizes a multimodal mask transformer architecture to establish cross-modal semantic relation, filter redundant information through dynamic masking, and generate virtual mixed samples for training a separate open-world classifier.
MFEK [7]: A knowledge-enhanced multimodal architecture that addresses out-of-distribution challenges via image-guided textual enhancement, multi-source knowledge extraction from Wikipedia and GPT-3.5 Turbo, and a co-attention-based fusion mechanism of external knowledge with text and image features.
CEFN [6]: An evidential fusion framework grounded in subjective logic theory that explicitly models uncertainty during multimodal integration. The network treats encoder outputs as subjective opinions, enabling direct uncertainty quantification and more reliable representation learning.
MMDG [5]: A multilayer deep graph model that constructs text and image graphs for multimodal representation learning. The architecture combines Graph Convolutional Networks (GCN) with autoencoder components to systematically extract and integrate heterogeneous graph-structured information.
OWDII [52]: A multimodal model for categorizing social media posts related to disasters in an open-world environment, utilizing a multitask (closed-world and open-world) classifier and a sample generation strategy that models the distribution of unknown samples using known data.
CG-PG [53]: A complementary graph learning and prompt-based cross-modal generation network for missing-modality cases in the fake news detection field. It explores structural complementary information in image and text graphs and generates representations of the missing modality from available modalities.
CrisisSpot [54]: A graph neural network architecture designed to model complex cross-modal relationships by jointly analyzing textual–visual content correlations and social context features (user-centric and content-centric). Its inverted dual embedded attention mechanism simultaneously captures both complementary and contradictory data patterns.
DeepSeek-V3 [46]: DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, employing innovative Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient training and inference.
DeepSeek-R1 [55]: DeepSeek-R1 is an advanced iteration of DeepSeek-R1-Zero that addresses readability and language consistency issues through multi-stage training with cold-start data prior to reinforcement learning. This enhanced version achieves more robust performance while maintaining powerful logical behaviors.
Llama 3-8B [56]: Llama 3-8B is an open-source LLM with 8 billion parameters. This model employs a decoder-only transformer architecture with Grouped-Query Attention (GQA) and an efficient 128K-token tokenizer. Pre-trained on over 15 trillion tokens and instruction-tuned, it delivers strong performance, particularly in reasoning and coding, establishing itself as a leading model in its class.
LLaVA-1.5-7B [26,27]: LLaVA-1.5 is an MLLM using CLIP-ViT-L-336px with a multilayer perceptron projection enhanced by academic-task-oriented VQA data with response formatting prompts.

4.4. Experimental Results

4.4.1. CrisisMMD Dataset Results

Table 1 presents the experimental results of our proposed MPCG-GECR framework on the CrisisMMD dataset. The results indicate that multimodal methods generally surpass unimodal approaches, demonstrating the importance of jointly processing information from both modalities in multimodal tweet analysis.

MPCG-GECR outperforms all baseline methods across all tasks and metrics. Specifically, on the accuracy metric, it surpasses the state-of-the-art method CrisisSpot by 2.61% in Task 1 and 1.67% in Task 2, while the performance gap further widens to 2.63% and 1.72% in terms of the Macro-F1 score, respectively. By generating a large number of diverse user comments, MPCG-GECR significantly enriches the original dataset with external information that was previously absent. Through instruction fine-tuning, the LLM successfully learns to capture the relationship between tweet content and user comments. A key differentiator of MPCG-GECR compared to other methods lies in its ability to supplement and leverage missing user comments for crisis tweets, which is also the crucial factor driving its performance advantage.
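
For readers less familiar with the two reported metrics, the sketch below computes accuracy and Macro-F1 (the unweighted mean of per-class F1 scores) on a toy prediction set. The labels and predictions are invented, and taking the class set as the union of true and predicted labels is an assumption about the averaging convention.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["informative", "informative", "informative", "not_informative"]
y_pred = ["informative", "informative", "not_informative", "not_informative"]
print(accuracy(y_true, y_pred))  # prints 0.75
print(macro_f1(y_true, y_pred))  # mean of 0.8 and 2/3
```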

Furthermore, a critical observation from Table 1 is the performance disparity between our fine-tuned framework and general-purpose LLMs (e.g., DeepSeek-V3, DeepSeek-R1, Llama 3) or Multimodal LLMs (LLaVA-1.5) in zero-shot settings. Despite their immense parameter scale and generalization capabilities, these models exhibit suboptimal performance, significantly lagging behind supervised baselines. For instance, LLaVA-1.5 achieves only 68.94% accuracy on Task 1, which is far lower than the 95.72% achieved by MPCG-GECR. This underscores that without task-specific adaptation, general LLMs struggle to grasp the nuanced definitions of humanitarian categories and the specific boundaries of crisis informatics. Even DeepSeek-R1, which possesses advanced reasoning capabilities, fails to match the performance of specialized models in this domain. The motivation for comparing our fine-tuned framework with zero-shot general-purpose LLMs is not to establish a trivial superiority of supervised training, but to investigate whether the massive parametric knowledge of modern LLMs can supersede the need for task-specific adaptation in crisis informatics.

This comparison highlights the paradigm shift represented by MPCG-GECR: moving from feature fusion (as seen in methods like CrisisSpot and MMDG) to generative reasoning. By transforming the classification task into a generative instruction-following task and grounding it with synthesized comments and retrieved geographic contexts, our method effectively aligns the vast knowledge of the LLM with the specific requirements of crisis analysis, achieving a new state of the art.

For an assessment of the incremental value of our proposed modules against a standard Supervised Fine-Tuning (SFT) baseline, we refer readers to the Ablation Study in Section 4.5. The comparison there explicitly isolates the contributions of the SPG and GECR modules over the base Llama 3-8B architecture.

4.4.2. DMD Dataset Results

Table 2 presents the experimental results of our proposed MPCG-GECR framework on the DMD dataset.

On the DMD dataset, our proposed MPCG-GECR framework likewise achieved the best performance, with an accuracy of 94.83% and a Macro-F1 score of 94.68%, outperforming all baseline models. This result is consistent with the observations on the CrisisMMD dataset and supports the effectiveness and robustness of our method across different data distributions and crisis types. Specifically, compared to the current state-of-the-art model, CrisisSpot, MPCG-GECR achieved a gain of 1.04% in accuracy and 1.08% in Macro-F1. Although the margin of improvement is slightly smaller than that on the CrisisMMD dataset, it demonstrates that our method can still achieve stable and statistically significant performance gains on DMD, where baseline models already exhibit strong performance and little headroom remains.

4.5. Ablation Study

To validate the efficacy of various designs in MPCG-GECR, we conducted ablation studies testing the following variants:

w/o GECR and w/o SPG: The standard fine-tuning baseline excluding both modules, utilizing only tweet text and image captions.
w/o GECR: The variant removing the retrieval module but retaining synthetic comments.
Random retrieval: The variant replacing GECR with random retrieval.
w/o [Attribute]: Variants removing specific persona attributes.
Full-list generation at once: The variant generating comments without the iterative strategy.

Table 3 shows the test results of MPCG-GECR and its variants.

First, we establish a baseline using the standard supervised fine-tuning approach (w/o GECR and w/o SPG), which achieves an accuracy of 90.31% (Task 1). This indicates the base capability of Llama 3-8B when adapted to the crisis domain using only intrinsic parametric knowledge. Upon introducing the Synthetic Persona Generator (w/o GECR), we observe a performance gain of 3.90 percentage points (reaching 94.21%), confirming that simulating diverse public discourse effectively bridges the semantic gap in isolated tweets. Finally, the complete MPCG-GECR framework achieves the highest accuracy of 95.72%.

Significant performance degradation is observed when the GECR module is removed (Task 1 accuracy falls from 95.72% to 94.21%), underscoring its pivotal role in our framework. By retrieving high-value samples aligned with geographic semantics, GECR provides the LLM with effective reference anchors, enhancing reasoning beyond purely generative augmentation.

Furthermore, replacing GECR with a random retrieval strategy results in a notable decline in performance (92.59%), which is even lower than the w/o GECR variant that uses no retrieval at all (94.21%). This finding directly answers the question of whether the performance gains stem from mere information volume or from semantic consistency. It demonstrates that blindly introducing external context acts as noise, impairing the model’s judgment. The improvement observed in MPCG-GECR is therefore attributable to the geographic consistency of the retrieved samples, which validates their role as reliable reference anchors, rather than to the mere presence of additional few-shot demonstrations.
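
The Task 1 accuracies quoted in this subsection can be laid side by side to make the contribution of each step explicit. The short script below only restates those figures and their differences in percentage points; the variable names are ours.

```python
# Task 1 accuracies (%) as quoted in the ablation discussion.
acc = {
    "w/o GECR and w/o SPG": 90.31,
    "w/o GECR": 94.21,
    "Random retrieval": 92.59,
    "MPCG-GECR": 95.72,
}

spg_gain = round(acc["w/o GECR"] - acc["w/o GECR and w/o SPG"], 2)
gecr_gain = round(acc["MPCG-GECR"] - acc["w/o GECR"], 2)
random_vs_none = round(acc["Random retrieval"] - acc["w/o GECR"], 2)

print(spg_gain)        # prints 3.9
print(gecr_gain)       # prints 1.51
print(random_vs_none)  # prints -1.62
```

Random retrieval therefore costs slightly more accuracy (1.62 points) than GECR adds (1.51 points) relative to using no retrieval at all.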

It can be observed that removing any persona attribute leads to varying degrees of performance degradation, indicating that each attribute possesses distinctive characteristics that help emphasize differences among users. Notably, we find that the performance decline is most pronounced when the “Occupation” and “Location” attributes are excluded, highlighting the critical importance of these two dimensions. We attribute this to the specific semantic value they carry: the “Occupation” attribute introduces variations in professional expertise and domain knowledge, while the “Location” attribute encapsulates distinct cognitive perspectives derived from spatial proximity to the crisis. Consequently, these attributes exert a substantial influence on the model’s ability to accurately parse crisis information.

A noticeable performance drop occurs when the LLM is required to generate the entire comment list at once. This confirms that our iterative generation strategy effectively improves the quality of comment lists by producing more accurate user comments. We posit that this iterative approach also helps mitigate LLM hallucinations. When the model is prompted to simulate multiple conflicting personas and generate a dialogue simultaneously (i.e., the Full-list generation at once strategy), it is prone to context confusion and role-mixing. However, it is important to acknowledge a trade-off: the Full-list generation at once strategy significantly reduces generation latency, particularly when a large volume of comments is required. Therefore, in scenarios where timeliness is prioritized over accuracy, it remains a viable option.

4.6. Hyperparameter Sensitivity

In the MPCG-GECR framework, the primary hyperparameters include n, which controls the number of comments per list, and K, which determines the number of retrieved samples in the GECR module. We investigated the impact of these hyperparameters on Task 1 of the CrisisMMD dataset.

For hyperparameter n, we tested six different values: 5, 10, 20, 30, 40, and 50. The corresponding results are presented in Figure 4. As depicted, the model performance generally exhibits a trend of initial improvement, followed by a plateau as the number of comments per list increases. Specifically, performance peaks at $n = 30$ and begins to decline at $n = 40$. We posit that at $n = 30$, the model is equipped with a sufficiently diverse set of social perspectives; thus, further increasing this number does not yield additional performance gains.

For hyperparameter K, we evaluated five values: 2, 3, 4, 5, and 6. The results, shown in Figure 5, indicate that the model performance first increases and then decreases, achieving its optimum at $K = 4$. This suggests that providing the model with more retrieved context does not invariably aid the decision-making process. Instead, excessive retrieved items may introduce noise, thereby impairing judgment. Similarly, an insufficiently diverse set of retrievals (i.e., with excessively high similarity to the target sample) is also suboptimal, as it hinders the LLM’s ability to assess the target content from a broad and comprehensive perspective.

4.7. Computational Efficiency

In the previous subsection, we discussed the impact of the hyperparameter n on model performance. We further evaluated the average inference time per sample in the test set under different n values, using an NVIDIA RTX 4090 GPU and a batch size of 12.

The inference latency of MPCG-GECR comprises the following components: the image captioning latency from LLaVA-1.5-7B, the comment generation latency, the retrieval latency from GECR, and the inference latency from Llama 3-8B. Among these, the image captioning latency (∼1.3 s) and GECR retrieval latency (∼0.3 s) are independent of n. The inference latency of Llama 3-8B exhibited negligible variation across different n values and was relatively short overall, with a minimum of ∼1.6 s at n = 5, ∼1.9 s at n = 30, and a maximum of ∼2.1 s at n = 50. This marginal increase is attributable to the relative insensitivity of the LLM to input length, as its inference latency is primarily influenced by the length of the generated output.
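
Adding up the components quoted above gives the per-sample latency excluding comment generation, which is reported only graphically in Figure 6 and is therefore left out. The constant and function names are ours; the numbers are the approximate values from the text.

```python
CAPTION_S = 1.3    # LLaVA-1.5-7B image captioning, independent of n
RETRIEVAL_S = 0.3  # GECR retrieval, independent of n
LLM_INFERENCE_S = {5: 1.6, 30: 1.9, 50: 2.1}  # Llama 3-8B inference at the quoted n values


def latency_without_comment_generation(n):
    """Approximate seconds per sample, excluding the comment generation stage."""
    return round(CAPTION_S + RETRIEVAL_S + LLM_INFERENCE_S[n], 1)


for n in (5, 30, 50):
    print(n, latency_without_comment_generation(n))
# prints:
# 5 3.2
# 30 3.5
# 50 3.7
```

Going from n = 5 to n = 50 adds about half a second to these stages, so any large change in end-to-end latency must come from comment generation.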

The comment generation latency for different n values is shown in Figure 6.

As illustrated in Figure 6, the comment generation latency increases rapidly with n, as the LLM requires increased time to generate the larger volume of content. Considering the critical requirement for timeliness in crisis scenarios, we recommend selecting the appropriate comment generation strategy and hyperparameter n based on the specific accuracy requirements.

4.8. Case Study and Discussion

To qualitatively evaluate the effectiveness of different models, we selected several examples from the CrisisMMD dataset, as shown in Figure 7.

In all three cases, the proposed MPCG-GECR succeeded, whereas CrisisSpot and MFEK each failed in at least one case. The specific analysis is as follows:

Case 1: The primary challenge lies in accurately discerning the meaning of the image and jointly analyzing it with the text. CrisisSpot and MFEK likely failed to correctly interpret the visual content, exhibiting a bias towards classifying the image as “Not informative,” which led to their failure. In contrast, MPCG-GECR leverages the MLLM to effectively understand the image information, ultimately enabling the LLM to make the correct classification.
Case 2: The difficulty here involves accurately inferring the implicit information in the text. The sample presents a headline from a news weekly that contains an informative image. The external knowledge introduced by MFEK backfired, causing it to erroneously over-index on the textual content while overlooking the image, leading to a misclassification. Both CrisisSpot and MPCG-GECR achieved correct multimodal fusion, resulting in successful classification.
Case 3: The textual content in this sample strongly biased both CrisisSpot and MFEK towards interpreting it as containing other informational content. Additionally, the human present in the image introduced further distraction. Only MPCG-GECR correctly comprehended that the image lacked substantive crisis-related content, leading to the accurate classification of “Not humanitarian”.

5. Discussion

5.1. Paradigm Shift: From Feature Fusion to Generative Reasoning

MPCG-GECR outperforms state-of-the-art baselines due to a paradigm shift from traditional feature fusion to generative reasoning. While conventional “closed-system” fusion methods are limited by pre-trained parametric knowledge, our approach aligns the LLM’s vast world knowledge with domain requirements via an “augment, retrieve, and fine-tune” pipeline. The stark performance disparity between our fine-tuned model and general-purpose LLMs (e.g., DeepSeek-V3, LLaVA-1.5) in zero-shot settings offers a critical insight: model parameter scale alone is insufficient for crisis informatics. Instead, instruction fine-tuning with augmented context is essential to capture the nuanced definitions and specific boundaries of crisis informatics.

5.2. Mechanism Analysis: Anchors and Perspectives

Ablation studies clarify the distinct roles of our two core modules. First, removing the GECR module lowers Task 1 accuracy on CrisisMMD from 95.72% to 94.21%, and replacing it with random retrieval lowers it further to 92.59% (Table 3), which supports the hypothesis that geographic semantics are central to useful retrieval. Unlike random few-shot examples, which may introduce noise, GECR retrieves samples that share spatial and environmental semantics with the target. Second, the SPG attribute ablations indicate that crisis perception is heavily influenced by cognitive perspectives, specifically “Location” and “Occupation,” whose removal causes the largest drops among the five persona attributes. By synthesizing these diverse viewpoints, the model overcomes the information scarcity intrinsic to isolated posts.
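The contrast between geographically consistent retrieval and plain similarity search can be illustrated with a small re-ranking sketch. It scores each candidate by a weighted sum of multimodal cosine similarity and geographic-entity overlap. The linear scoring form, the weight `alpha`, the use of Jaccard overlap, and all function and field names are assumptions made for illustration; they are not the exact formulation of Section 3.3.3. The default `k=4` mirrors the K = 4 setting, on the assumption that K counts the retrieved references.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def jaccard(a, b):
    """Overlap between two sets of geographic entities."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_rerank(query, candidates, k=4, alpha=0.5):
    """Return the ids of the top-k candidates under the hybrid score."""
    scored = []
    for cand in candidates:
        sim = cosine(query["feat"], cand["feat"])   # multimodal similarity
        geo = jaccard(query["geo"], cand["geo"])    # geographic consistency
        scored.append((alpha * sim + (1 - alpha) * geo, cand["id"]))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy data: "a" is the nearest neighbour in feature space but comes from
# a different region; "b" is slightly less similar but geographically
# consistent with the query.
query = {"feat": [1.0, 0.0], "geo": {"texas", "houston"}}
candidates = [
    {"id": "a", "feat": [1.0, 0.0], "geo": {"nepal"}},
    {"id": "b", "feat": [0.8, 0.6], "geo": {"texas", "houston"}},
    {"id": "c", "feat": [0.0, 1.0], "geo": {"texas"}},
]
top = hybrid_rerank(query, candidates, k=2)
```

In this toy example the geographically consistent candidate `b` is ranked ahead of `a`, although `a` has the higher feature similarity.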

5.3. Trade-Offs Between Accuracy and Latency in Deployment

A practical implication of our study lies in the trade-off between computational efficiency and classification accuracy, a crucial consideration for real-time crisis response systems. Our efficiency analysis indicates that iterative comment generation acts as the primary bottleneck. While the iterative strategy yields higher accuracy by minimizing context confusion, it incurs a latency penalty that scales linearly with n. However, the “full-list generation at once” strategy offers a viable alternative for time-critical scenarios. Although it sacrifices a degree of accuracy due to potential role-mixing, it dramatically reduces the inference time. Consequently, we propose an adaptive deployment strategy: using the rapid mode for initial screening and the iterative mode for high-priority analysis, with optimal resource points at n = 30 and K = 4.
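The adaptive deployment strategy can be sketched as a simple router with a rough latency model. The cost model below (a fixed per-call overhead plus a per-comment generation cost), the placeholder timings, and the names `estimated_latency` and `choose_mode` are assumptions for illustration only; they are not measurements from our efficiency analysis.

```python
def estimated_latency(n, mode, overhead_s=0.5, per_comment_s=0.25):
    """Rough latency in seconds for generating n comments.

    The timings are placeholder values, not measured results.
    """
    if mode == "iterative":
        # One LLM call per persona: the call overhead is paid n times,
        # so latency grows linearly with n.
        return n * (overhead_s + per_comment_s)
    if mode == "full_list":
        # A single call produces all n comments: overhead is paid once.
        return overhead_s + n * per_comment_s
    raise ValueError(f"unknown mode: {mode}")


def choose_mode(high_priority):
    """Rapid mode for initial screening, iterative mode for priority posts."""
    return "iterative" if high_priority else "full_list"


n = 30
screening_latency = estimated_latency(n, choose_mode(high_priority=False))
priority_latency = estimated_latency(n, choose_mode(high_priority=True))
```

Under these placeholder timings, the iterative mode is several times slower than the full-list mode at n = 30, which is the kind of gap that motivates reserving it for high-priority posts.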

5.4. Quality of Synthetic Comments

To assess the reliability of the comments generated by the SPG module, we conducted a manual review of 50 randomly selected samples. We focused on checking for Label Bias (whether the generated comments contradict the ground-truth category of the post) and Logical Consistency (whether the comments align with the assigned persona). The review indicated that 96% of the generated comments were logically consistent with the post content and persona attributes. Importantly, we did not observe significant label bias that could mislead the model’s classification. The generated comments primarily reflected diverse emotional reactions and subjective viewpoints typical of social media discourse, rather than introducing factual discrepancies. Furthermore, regarding the risk of ‘echo chambers’ or self-reinforcement loops, our Mix-Persona strategy serves as an active countermeasure. By explicitly enforcing diversity in attributes, we force the model to generate perspectives from conflicting standpoints, such as those of local residents and remote observers. This ensures the classifier is exposed to a broad spectrum of social discourse rather than a single reinforced viewpoint.
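The diversity constraint described above can be sketched as follows. The five attribute names follow the ablation study (Gender, Age, Education, Occupation, Location), and the “local resident” and “remote observer” values come from the discussion above; all other attribute values, the balancing scheme, and the prompt wording are hypothetical placeholders rather than the SPG implementation.

```python
import random

# Attribute values are illustrative placeholders.
ATTRIBUTES = {
    "gender": ["female", "male"],
    "age": ["18-29", "30-49", "50+"],
    "education": ["high school", "bachelor", "graduate"],
    "occupation": ["teacher", "nurse", "journalist", "engineer"],
    "location": ["local resident", "remote observer"],
}


def sample_mix_personas(n, seed=0):
    """Sample n personas with roughly balanced values per attribute."""
    rng = random.Random(seed)
    columns = {}
    for name, values in ATTRIBUTES.items():
        # Cycle through the values so each appears about equally often,
        # then shuffle so attributes are not aligned across personas.
        col = [values[i % len(values)] for i in range(n)]
        rng.shuffle(col)
        columns[name] = col
    return [{name: columns[name][i] for name in ATTRIBUTES} for i in range(n)]


def persona_prompt(persona, post_text):
    """Build one prompt per persona (iterative generation)."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (f"You are a social media user ({profile}). "
            f"Write one short comment on this post: {post_text}")


personas = sample_mix_personas(6)
prompts = [persona_prompt(p, "Bridge closed after the earthquake.") for p in personas]
```

Because every attribute column is balanced before shuffling, a batch of six personas always contains three local residents and three remote observers, so conflicting standpoints are represented by construction.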

5.5. Limitations and Future Directions

A primary limitation is the GECR module’s dependence on historical data, which may underperform in regions with sparse digital footprints. Future research should focus on two directions:

Cross-Regional Transfer Learning: Investigating how geographic knowledge learned from data-rich regions can be adaptively transferred to data-sparse regions, potentially using domain adaptation techniques.
Dynamic Persona Adaptation: Moving beyond static attribute lists to dynamically generate persona profiles based on the real-time demographic and cultural characteristics of the impacted region, thereby generating more culturally context-aware comments.

6. Conclusions

In this paper, we address the dual challenges of user comment scarcity and contextual isolation in multimodal crisis post analysis. To overcome these limitations, we propose the Mix-Persona Comment Generation with Geographically Enhanced Context Retrieval for LLM Instruction Fine-tuning (MPCG-GECR) framework, which combines subjective social simulation with objective geographic retrieval.

Our key contribution lies in establishing a novel methodology that transforms crisis classification from a feature discrimination task into a context-aware generative reasoning task. By leveraging the Synthetic Persona Generator (SPG) to simulate diverse public discourse and the Geographically Enhanced Context Retrieval (GECR) module to provide grounded environmental references, we successfully bridge the gap between data-driven deep learning and knowledge-driven reasoning. Extensive experiments on the CrisisMMD and DMD datasets confirm that our approach not only achieves state-of-the-art performance but also offers robust predictions. We believe this framework sets a new baseline for leveraging Large Language Models in multimodal crisis post classification.

Author Contributions

Conceptualization, Tong Bie and Yongli Hu; methodology, Tong Bie; software, Tong Bie and Yu Fu; validation, Tong Bie and Yu Fu; formal analysis, Yongli Hu, Linjia Hao, Tengfei Liu and Kan Guo; investigation, Tong Bie and Yu Fu; resources, Yongli Hu and Huajie Jiang; data curation, Tong Bie; writing—original draft preparation, Tong Bie; writing—review and editing, Tong Bie and Yongli Hu; visualization, Tong Bie and Yu Fu; supervision, Tong Bie; project administration, Tong Bie; funding acquisition, Yongli Hu, Junbin Gao, Yanfeng Sun and Baocai Yin. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the National Key R&D Program of China (No. 2021ZD0111902), NSFC (No. 62572017, 62441232, 62206007), and R&D Program of Beijing Municipal Education Commission (KZ202210005008).

Data Availability Statement

Data available on request due to restrictions. The original benchmark datasets used in this study, CrisisMMD [16] and DMD [17], are publicly available at https://crisisnlp.qcri.org/crisismmd (accessed on 19 February 2026) and https://archive.ics.uci.edu/dataset/456 (accessed on 19 February 2026), respectively. The instruction tuning dataset generated through the methodology proposed in this study is available on request due to the following restrictions: (1) It has not undergone rigorous validation and review; potential copyright issues and biases have not been fully assessed, and premature public release could lead to intellectual property disputes or misguide subsequent research; (2) It represents a core output of our methodology, and full disclosure prior to the publication of related work may compromise ongoing research. For legitimate academic verification purposes, this dataset can be requested from the corresponding author. Requestors will be required to sign a data use agreement, committing to utilize the data solely for verifying the findings presented in this study and prohibiting its redistribution or use for other unauthorized purposes.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their useful comments on the manuscript. A preliminary version of this work was presented at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing, and its title was Mix-Persona Comment Generation for LLM Fine-Tuning in Multimodal Crisis Post Classification. In this extended version, we further introduce the GECR module and provide comprehensive experimental results, including extensive ablation and hyperparameter sensitivity studies, to thoroughly evaluate the contribution of the GECR module. We expand our evaluation to include an additional DMD dataset and present case studies on the CrisisMMD dataset to comprehensively validate the effectiveness and robustness of our approach.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Apostol, E.S.; Truică, C.O.; Paschke, A. ContCommRTD: A Distributed Content-Based Misinformation-Aware Community Detection System for Real-Time Disaster Reporting. IEEE Trans. Knowl. Data Eng. 2024, 36, 5811–5822.
2. Ghafarian, S.H.; Yazdi, H.S. Identifying Crisis-related Informative Tweets Using Learning on Distributions. Inf. Process. Manag. 2020, 57, 102145.
3. Abavisani, M.; Wu, L.; Hu, S.; Tetreault, J.; Jaimes, A. Multimodal Categorization of Crisis Events in Social Media. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14679–14689.
4. Qian, S.; Chen, H.; Xue, D.; Fang, Q.; Xu, C. Open-World Social Event Classification. In Proceedings of the WWW ’23: The ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1562–1571.
5. Wang, J.; Yang, S.; Zhao, H.; Chen, Y. A Crisis Event Classification Method Based on A Multimodal Multilayer Graph Model. Neurocomputing 2025, 621, 129271.
6. Yu, C.; Wang, Z. Cross-modal Evidential Fusion Network for Social Media Classification. Comput. Speech Lang. 2025, 92, 101784.
7. Lin, Z.; Xie, J.; Li, Q. Multi-modal News Event Detection with External Knowledge. Inf. Process. Manag. 2024, 61, 103697.
8. Zheng, J.; Zhang, X.; Guo, S.; Wang, Q.; Zang, W.; Zhang, Y. MFAN: Multi-Modal Feature-Enhanced Attention Networks for Rumor Detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 2413–2419.
9. Nan, Q.; Sheng, Q.; Cao, J.; Zhu, Y.; Wang, D.; Yang, G.; Li, J. Exploiting User Comments for Early Detection of Fake News Prior to Users’ Commenting. Front. Comput. Sci. 2025, 19, 1910354.
10. Su, X.; Yang, J.; Wu, J.; Zhang, Y. Mining User-aware Multi-relations for Fake News Detection in Large Scale Online Social Networks. In Proceedings of the WSDM ’23: The Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 51–59.
11. Imran, M.; Castillo, C.; Diaz, F.; Vieweg, S. Processing social media messages in mass emergency: A survey. ACM Comput. Surv. (CSUR) 2015, 47, 67.
12. Olteanu, A.; Castillo, C.; Diaz, F.; Vieweg, S. Crisislex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of the International AAAI Conference on Web and Social Media, Los Angeles, CA, USA, 27–29 May 2014; pp. 376–385.
13. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Kuttler, H.; Lewis, M.; Yih, W.t.; Rocktaschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Curran Associates: Red Hook, NY, USA, 2020; pp. 9459–9474.
14. Wu, S.; Xiong, Y.; Cui, Y.; Wu, H.; Chen, C.; Yuan, Y.; Huang, L.; Liu, X.; Kuo, T.W.; Guan, N.; et al. Retrieval-augmented generation for natural language processing: A survey. arXiv 2024, arXiv:2407.13193.
15. Hu, Y.; Lu, Y. Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv 2024, arXiv:2404.19543.
16. Alam, F.; Ofli, F.; Imran, M. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the Twelfth International AAAI Conference on Web and Social Media (ICWSM 2018), Stanford, CA, USA, 25–28 January 2018; pp. 465–473.
17. Mouzannar, H.; Rizk, Y.; Awad, M. Damage identification in social media posts using multimodal deep learning. In Proceedings of the Information Systems for Crisis Response and Management Conference, Rochester, NY, USA, 20–23 May 2018; pp. 1–15.
18. Zheng, X.; Zeng, Z.; Wang, H.; Bai, Y.; Liu, Y.; Luo, M. From predictions to analyses: Rationale-augmented fake news detection with large vision-language models. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 5364–5375.
19. Zhou, Z.; Zhang, X.; Tan, S.; Zhang, L.; Li, C. Collaborative evolution: Multi-round learning between large and small language models for emergent fake news detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 1210–1218.
20. Irnawan, B.R.; Xu, S.; Tomuro, N.; Fukumoto, F.; Suzuki, Y. Claim veracity assessment for explainable fake news detection. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 19–24 January 2025; pp. 4011–4029.
21. Wei, Q.; Qiao, Y.; Zhu, S.; Jiao, A.; Dong, Q. Twitter User Geolocation Based on Multi-Graph Feature Fusion with Gating Mechanism. ISPRS Int. J. Geo-Inf. 2025, 14, 424.
22. Han, Y.; Liu, J.; Luo, A.; Wang, Y.; Bao, S. Fine-Tuning LLM-Assisted Chinese Disaster Geospatial Intelligence Extraction and Case Studies. ISPRS Int. J. Geo-Inf. 2025, 14, 79.
23. Zou, L.; He, Z.; Wang, X.; Liang, Y. Spatiotemporal Typhoon Damage Assessment: A Multi-Task Learning Method for Location Extraction and Damage Identification from Social Media Texts. ISPRS Int. J. Geo-Inf. 2025, 14, 189.
24. Zhu, H.; Meng, J.; Yao, J.; Xu, N. Feasibility of Emergency Flood Traffic Road Damage Assessment by Integrating Remote Sensing Images and Social Media Information. ISPRS Int. J. Geo-Inf. 2024, 13, 369.
25. Zorenbohmer, C.; Gandhi, S.; Schmidt, S.; Resch, B. An Aspect-Based Emotion Analysis Approach on Wildfire-Related Geo-Social Media Data—A Case Study of the 2020 California Wildfires. ISPRS Int. J. Geo-Inf. 2025, 14, 301.
26. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. arXiv 2023, arXiv:2310.03744.
27. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates: Red Hook, NY, USA, 2023; pp. 34892–34916.
28. Xie, C.; Wang, B.; Kong, F.; Li, J.; Liang, D.; Zhang, G.; Leng, D.; Yin, Y. FG-CLIP: Fine-grained Visual and Textual Alignment. arXiv 2025, arXiv:2505.05071.
29. Sheng, Q.; Cao, J.; Bernard, H.R.; Shu, K.; Li, J.; Liu, H. Characterizing Multi-domain False News and Underlying User Effects on Chinese Weibo. Inf. Process. Manag. 2022, 59, 102959.
30. Gaillard, S.; Oláh, Z.A.; Venmans, S.; Burke, M. Countering the Cognitive, Linguistic, and Psychological Underpinnings Behind Susceptibility to Fake News: A Review of Current Literature with Special Focus on The Role of Age and Digital Literacy. Front. Commun. 2021, 6, 661801.
31. Starbird, K.; Maddock, J.; Orand, M.; Achterman, P.; Mason, R.M. Rumors, false flags, and digital vigilantes: Misinformation on twitter after the 2013 boston marathon bombing. In Proceedings of the IConference 2014 Proceedings; iSchools: Westford, MA, USA, 2014; pp. 654–662.
32. Mendoza, M.; Poblete, B.; Castillo, C. Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics, Washington, DC, USA, 25–28 July 2010; pp. 71–79.
33. Starbird, K.; Palen, L. “Voluntweeters” self-organizing by digital volunteers in times of crisis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vancouver, BC, Canada, 7–12 May 2011; pp. 1071–1080.
34. Reuter, C.; Heger, O.; Pipek, V. Combining real and virtual volunteers through social media. In Proceedings of the Information Systems for Crisis Response and Management Conference, Baden-Baden, Germany, 12–15 May 2013; pp. 780–790.
35. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates: Red Hook, NY, USA, 2022; pp. 24824–24837.
36. Chen, Q.; Qin, L.; Liu, J.; Peng, D.; Guan, J.; Wang, P.; Hu, M.; Zhou, Y.; Gao, T.; Che, W. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv 2025, arXiv:2503.09567.
37. Kaur, N.; Saha, A.; Swami, M.; Singh, M.; Dalal, R. Bert-ner: A transformer-based approach for named entity recognition. In Proceedings of the 15th International Conference on Computing Communication and Networking Technologies, Kamand, India, 24–28 June 2024; pp. 1–7.
38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
39. JaidedAI. EasyOCR: Ready-to-Use OCR with 80+ Supported Languages and All Popular Writing Scripts. GitHub Repository. 2024. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 19 February 2026).
40. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Curran Associates: Red Hook, NY, USA, 2015; pp. 1–9.
41. Wang, P.; Wu, Q.; Cao, J.; Shen, C.; Gao, L.; Hengel, A.v.d. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1960–1968.
42. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
43. Liang, V.W.; Zhang, Y.; Kwon, Y.; Yeung, S.; Zou, J.Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates: Red Hook, NY, USA, 2022; pp. 17612–17625.
44. Wu, N.; Jastrzebski, S.; Cho, K.; Geras, K.J. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 24043–24055.
45. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. Lora: Low-rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022; pp. 1–20.
46. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
47. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
48. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101v2.
49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
50. Nguyen, D.Q.; Vu, T.; Nguyen, A.T. BERTweet: A Pre-trained Language Model for English Tweets. arXiv 2020, arXiv:2005.10200.
51. Li, Z.; Qian, S.; Cao, J.; Fang, Q.; Xu, C. Adaptive transformer-based conditioned variational autoencoder for incomplete social event classification. In Proceedings of the 30th ACM International conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 1698–1707.
52. Yu, C.; Hu, B.; Wang, Z. Open-world disaster information identification from multimodal social media. Complex Intell. Syst. 2025, 11, 7–20.
53. Wu, F.; Zhou, R.; Hu, C.; Huang, Q.; Jing, X.Y. Complementary graph learning and prompt-based cross-modal generation for missing-modality fake news detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
54. Shahid, S.D.; Mohammad, Z.U.R.; Karan, B.; Mohammed, A.H.; Nagendra, K. A social context-aware graph-based multimodal attentive learning framework for disaster content classification during emergencies. Expert Syst. Appl. 2025, 259, 125337.
55. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI Blog. 2025. Available online: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf (accessed on 19 February 2026).
56. Meta AI. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Meta AI Blog. 2024. Available online: https://ai.meta.com/blog/meta-llama-3 (accessed on 19 February 2026).

Figure 1. Comparison of multimodal post (tweet) processing pipelines: (a) Common multimodal feature extraction and fusion schemes in crisis post classification. These methods primarily employ encoders to obtain text and image features, utilizing complex networks designed to fuse multimodal features. (b) Comment fusion strategies utilized in fake news and rumor detection. These approaches assist multimodal models in determining post categories by learning underlying patterns from user comments. (c) Our proposed MPCG-GECR framework. We address the scarcity of user comments in crisis posts by generating diverse synthetic comments. Furthermore, we introduce the GECR module to retrieve high-value samples based on geographic semantics and multimodal features. This supplementary information enables the LLM to comprehend post content from a holistic perspective, ultimately leading to accurate judgments.

Figure 2. Illustration of the MPCG-GECR framework, comprising four key components: Image Information Extraction, Diverse Comment Generation, Geographically Enhanced Context Retrieval, and LoRA-based LLM Instruction Fine-tuning.

Figure 3. Prompt template for the MLLM to generate detailed descriptions of the image.

Figure 4. The sensitivity of hyperparameter n on Task 1 of the CrisisMMD dataset.

Figure 5. The sensitivity of hyperparameter K on Task 1 of the CrisisMMD dataset.

Figure 6. The inference latency of MPCG-GECR with different n values on Task 1 of the CrisisMMD dataset.

Figure 7. Three samples from the CrisisMMD dataset with their text, image, and ground-truth label. The correctly classified results are marked in green, while the incorrect ones are marked in red.

Table 1. Performance comparison on the CrisisMMD dataset. All values are percentages (%); MPCG-GECR (last row) achieves the best result in every column.

Method	Task 1 Accuracy	Task 1 Macro-F1	Task 2 Accuracy	Task 2 Macro-F1
ResNet (2016, [49])	81.85	79.14	83.58	60.61
BERTweet (2020, [50])	85.48	81.40	86.58	66.96
SCBD (2020, [3])	89.75	88.45	91.44	68.85
AT-CAVE (2022, [51])	91.69	89.42	91.89	70.54
OWSEC (2023, [4])	92.09	90.43	92.75	73.79
MFEK (2024, [7])	92.69	90.83	92.95	73.83
CEFN (2025, [6])	91.32	89.73	92.47	72.54
MMDG (2025, [5])	92.64	90.38	92.98	73.44
OWDII (2025, [52])	92.35	90.62	92.79	73.90
CG-PG (2025, [53])	92.16	90.82	92.62	73.67
CrisisSpot (2025, [54])	93.11	91.41	93.54	74.45
DeepSeek-V3 (2024, [46])	81.25	78.05	61.20	44.45
DeepSeek-R1 (2025, [55])	83.17	79.25	63.21	46.14
Llama 3-8B (2024, [56])	65.68	54.48	54.95	39.97
LLaVA-1.5-7B (2024, [26,27])	68.94	61.08	64.36	63.10
MPCG-GECR (Ours)	95.72	94.04	95.21	76.17

Table 2. Performance comparison on the DMD dataset (multi-class classification, analogous to Task 2 of the CrisisMMD dataset). All values are percentages (%); MPCG-GECR (last row) achieves the best result in both columns.

Method	Accuracy	Macro-F1
ResNet (2016, [49])	86.53	86.37
BERTweet (2020, [50])	85.89	85.64
SCBD (2020, [3])	92.36	92.27
AT-CAVE (2022, [51])	92.67	92.58
OWSEC (2023, [4])	93.22	93.14
MFEK (2024, [7])	93.51	93.40
CEFN (2025, [6])	93.03	92.91
MMDG (2025, [5])	93.33	93.26
OWDII (2025, [52])	92.90	92.59
CG-PG (2025, [53])	93.14	92.98
CrisisSpot (2025, [54])	93.79	93.60
DeepSeek-V3 (2024, [46])	80.19	79.95
DeepSeek-R1 (2025, [55])	82.12	81.88
Llama 3-8B (2024, [56])	75.60	72.51
LLaVA-1.5-7B (2024, [26,27])	82.62	82.14
MPCG-GECR (Ours)	94.83	94.68

Table 3. Results of the ablation study on the CrisisMMD dataset. All values are percentages (%).

Variant	Task 1 Accuracy	Task 1 Macro-F1	Task 2 Accuracy	Task 2 Macro-F1
MPCG-GECR	95.72	94.04	95.21	76.17
w/o GECR and w/o SPG	90.31	88.62	90.25	69.45
w/o GECR	94.21	92.66	94.03	74.78
Random retrieval	92.59	90.20	92.14	72.05
w/o Gender	94.92	93.11	94.42	75.81
w/o Age	94.60	92.72	94.35	75.23
w/o Education	94.79	92.85	94.36	75.39
w/o Occupation	94.26	92.69	94.16	74.95
w/o Location	94.25	92.72	94.04	74.75
Full-list generation at once	94.88	93.08	94.71	75.53
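As a reading aid for Table 3, the short script below computes the drop in Task 1 accuracy of each variant relative to the full model and ranks the variants by that drop. The accuracy values are copied from the table; the variable names are illustrative.

```python
FULL_TASK1_ACC = 95.72  # MPCG-GECR, Task 1 accuracy (%)

task1_acc = {
    "w/o GECR and w/o SPG": 90.31,
    "w/o GECR": 94.21,
    "Random retrieval": 92.59,
    "w/o Gender": 94.92,
    "w/o Age": 94.60,
    "w/o Education": 94.79,
    "w/o Occupation": 94.26,
    "w/o Location": 94.25,
    "Full-list generation at once": 94.88,
}

# Drop in percentage points relative to the full model.
drops = {name: round(FULL_TASK1_ACC - acc, 2) for name, acc in task1_acc.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

for name, drop in ranked:
    print(f"{name:32s} -{drop:.2f}")
```

The ranking places the removal of both modules first (5.41 points), followed by random retrieval (3.13) and the removal of GECR (1.51). Among the five persona attributes, removing Location (1.47) or Occupation (1.46) costs the most.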

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
