Mix-Persona Comment Generation and Geographically Enhanced Context Retrieval for LLM Fine-Tuning in Multimodal Crisis Post Classification
Abstract
1. Introduction
- We propose the MPCG-GECR framework, a novel approach that merges synthetic social media user comments and retrieval-augmented geographic context to address the data scarcity and context isolation problems in multimodal crisis post classification. Experiments on the CrisisMMD and DMD datasets show that our method outperforms all compared baselines in both accuracy and Macro-F1.
- We introduce the Geographically Enhanced Context Retrieval (GECR) module, representing the first attempt to integrate geographic semantics into the RAG paradigm for crisis post analysis. This module provides the LLM with interpretable geographic reference anchors, improving classification reliability.
- We demonstrate the effectiveness of prompting LLMs to generate diverse personas and comments for data augmentation, offering a new direction for handling low-resource crisis post analysis tasks.
2. Related Work
2.1. Traditional Multimodal Crisis Post Classification
2.2. Comment-Augmented Social Media Analysis
2.3. LLM Applications and Retrieval-Augmented Generation
2.4. Geographic Information in Social Media Analysis
3. Method
3.1. Image Information Extraction
3.2. Diverse Comment Generation
- Gender: Non-binary, Male, Female;
- Age: Minors, Young Adulthood, Middle Age, Old Age;
- Education: Primary Education, Secondary Education, Undergraduate Education, Graduate Education;
- Occupation: Humanitarian Worker, Domain expert, Engineer, Medical Staff, Journalist, General public;
- Location: Resident of the local crisis area, Resident of the surrounding crisis areas, Resident of the non-affected area.
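
The attribute pools above can be combined into persona descriptions that are then placed in the comment-generation prompt. The following is a minimal sketch, not the authors' implementation: the uniform sampling rule, the function names, and the text template are assumptions made for illustration, and no filtering of implausible combinations is applied.

```python
import itertools
import random

# Attribute pools copied from the list above.
PERSONA_ATTRIBUTES = {
    "Gender": ["Non-binary", "Male", "Female"],
    "Age": ["Minors", "Young Adulthood", "Middle Age", "Old Age"],
    "Education": ["Primary Education", "Secondary Education",
                  "Undergraduate Education", "Graduate Education"],
    "Occupation": ["Humanitarian Worker", "Domain expert", "Engineer",
                   "Medical Staff", "Journalist", "General public"],
    "Location": ["Resident of the local crisis area",
                 "Resident of the surrounding crisis areas",
                 "Resident of the non-affected area"],
}


def count_personas(attributes=PERSONA_ATTRIBUTES):
    """Number of distinct attribute combinations."""
    total = 1
    for values in attributes.values():
        total *= len(values)
    return total


def sample_personas(n, seed=0, attributes=PERSONA_ATTRIBUTES):
    """Draw n distinct personas uniformly at random (assumed sampling rule)."""
    rng = random.Random(seed)
    keys = list(attributes)
    combos = list(itertools.product(*(attributes[k] for k in keys)))
    return [dict(zip(keys, combo)) for combo in rng.sample(combos, n)]


def persona_to_text(persona):
    """Render a persona as one line for a prompt (hypothetical template)."""
    return "; ".join(f"{key}: {value}" for key, value in persona.items())


if __name__ == "__main__":
    print(count_personas())
    for persona in sample_personas(3):
        print(persona_to_text(persona))
```

The five pools yield 3 × 4 × 4 × 6 × 3 = 864 possible combinations, so even a small number of comments per post can draw on clearly different perspectives.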
3.3. Geographically Enhanced Context Retrieval
3.3.1. Geographic Entity Extraction and Indexing
3.3.2. Multimodal Feature Retrieval
3.3.3. Hybrid Re-Ranking with Geographic Semantics
- Same-Tag Group: We employ the FG-CLIP encoder to extract feature embeddings for the Geo and Location tags of both the target and candidate samples, subsequently computing the geographic semantic similarity between them. Based on this metric, we re-rank all candidate samples in descending order. In cases where the geographic semantic similarity between two candidate samples and the target sample is identical, their relative order is determined by their multimodal similarity. Finally, we retrieve the Top-2 samples from the re-ranked list. Here, ‘Top-2’ refers to the two candidates with the highest similarity scores in the sorted list, serving as the most reliable geographic reference anchors.
- Different-Tag Group: To introduce additional diversity, we also retrieve information from the group with different tags. Exposure to diverse data enables models to learn essential patterns, mitigate overfitting, and develop a more holistic reasoning capability [40]. Supplementing diverse information in this way has proven beneficial in fields such as Visual Question Answering (VQA), enabling models to make more comprehensive and accurate inferences [41]. For this group, we rank candidates by multimodal similarity and retrieve the Top-2 samples.
- If event type information is available in the training set (e.g., the CrisisMMD dataset [16] annotates specific event names), we group the samples in the candidate pool by event type and rank them within each group by multimodal similarity (high to low). We then select the Top-2 samples from the largest group, and the Top-1 sample from the second and third largest groups, respectively. If only two groups exist, we select the Top-2 from each; if only one group exists, we simply select the Top-4.
- If event type information is unavailable (e.g., the DMD dataset [17] only annotates sample categories), we implement a latent semantic clustering strategy. We employ the K-Means++ algorithm [42] to cluster samples based on the multimodal features of the target and candidate pool. The construction of a multimodal feature involves three steps:
- Pre-normalization: We first apply L2 normalization to the single-modal text and image features extracted by FG-CLIP. This step is crucial to address the modality gap phenomenon [43]. Without this, the modality with naturally larger feature magnitudes (typically vision) would dominate the distance calculation, causing the model to ignore textual semantics [44].
- Concatenation: We concatenate the normalized multimodal features to form a fused vector, ensuring an equal contribution (1:1) from both modalities.
- Post-normalization: Finally, we perform a second L2 normalization on the fused features. This ensures that the Euclidean distance used in K-Means effectively approximates Cosine Similarity, which is the native metric for CLIP-based embeddings.
- The number of clusters K is set to a small constant (e.g., ) to impose a structured diversity on the retrieved candidates, forcing the selection algorithm to sample from distinct latent semantic sub-groups rather than over-focusing on a single dominant pattern. From the clustering results, we select the Top-2 samples from the cluster containing the target sample and the Top-1 sample from each of the two clusters with the nearest centroids. In the specific case where the target sample forms a cluster of its own (i.e., a singleton cluster), we identify the nearest clusters by calculating the Euclidean distance between the target sample and the centroids of all other clusters. We then select the Top-2 samples from each of the two clusters with the smallest centroid distances. In cases where , we continue to retrieve the Top-2 samples from within the target sample’s cluster, while selecting the two samples with the highest similarity from the nearest neighboring clusters.
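
The three-step feature construction and the nearest-centroid lookup described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: plain Python lists stand in for FG-CLIP embeddings, the function names are hypothetical, and the clustering step itself (K-Means++) is omitted. It also checks the identity behind the post-normalization step: for unit vectors a and b, the squared Euclidean distance equals 2 − 2·cos(a, b).

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def fuse_features(text_feat, image_feat):
    """Pre-normalize each modality, concatenate, then post-normalize."""
    fused = l2_normalize(text_feat) + l2_normalize(image_feat)  # list concatenation
    return l2_normalize(fused)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product equals cosine similarity only for unit-length inputs.
    return sum(x * y for x, y in zip(a, b))


def nearest_centroids(target, centroids, k=2):
    """Indices of the k centroids closest to the target (Euclidean distance)."""
    order = sorted(range(len(centroids)),
                   key=lambda i: squared_euclidean(target, centroids[i]))
    return order[:k]


if __name__ == "__main__":
    # Toy vectors; the image features deliberately have a much larger magnitude.
    f = fuse_features([0.2, -0.1, 0.4], [30.0, 10.0, -20.0])
    g = fuse_features([0.1, 0.3, -0.2], [-5.0, 40.0, 15.0])
    print(squared_euclidean(f, g), 2 - 2 * cosine(f, g))
```

Because each modality is normalized before concatenation, each half of the fused vector carries exactly half of the squared norm, regardless of the raw feature magnitudes.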

Algorithm 1. Geographically Enhanced Context Retrieval (GECR)
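
Two of the selection rules in this section lend themselves to a short sketch: the Same-Tag re-ranking (geographic similarity first, multimodal similarity as tie-breaker) and the event-type quota rule (Top-2 from the largest group, Top-1 from the second and third largest, with the stated fallbacks for two groups or one group). The code below is a minimal illustration, not the authors' implementation: the dictionary fields `geo_sim`, `mm_sim`, and `event` are hypothetical names for precomputed similarity scores and event annotations, and ties in group size are resolved by first occurrence, which the text does not specify.

```python
from collections import defaultdict


def rerank_same_tag(candidates, top_k=2):
    """Sort by geographic similarity; break ties with multimodal similarity."""
    ranked = sorted(candidates,
                    key=lambda c: (c["geo_sim"], c["mm_sim"]),
                    reverse=True)
    return ranked[:top_k]


def select_by_event_groups(candidates):
    """Apply the 2/1/1, 2/2, or 4 quota rule over event-type groups."""
    groups = defaultdict(list)
    for cand in candidates:
        groups[cand["event"]].append(cand)
    # Rank inside each group by multimodal similarity, high to low.
    for members in groups.values():
        members.sort(key=lambda c: c["mm_sim"], reverse=True)
    # Largest group first; sorted() is stable, so size ties keep first-seen order.
    ordered = sorted(groups.values(), key=len, reverse=True)
    if len(ordered) >= 3:
        quotas = [2, 1, 1]
    elif len(ordered) == 2:
        quotas = [2, 2]
    else:
        quotas = [4]
    picked = []
    for members, quota in zip(ordered, quotas):
        picked.extend(members[:quota])
    return picked


# Toy candidate pool with made-up scores and event labels.
CANDIDATES = [
    {"id": "a", "geo_sim": 0.9, "mm_sim": 0.50, "event": "hurricane"},
    {"id": "b", "geo_sim": 0.9, "mm_sim": 0.70, "event": "hurricane"},
    {"id": "c", "geo_sim": 0.8, "mm_sim": 0.95, "event": "wildfire"},
    {"id": "d", "geo_sim": 0.4, "mm_sim": 0.60, "event": "hurricane"},
    {"id": "e", "geo_sim": 0.3, "mm_sim": 0.20, "event": "earthquake"},
    {"id": "f", "geo_sim": 0.2, "mm_sim": 0.80, "event": "wildfire"},
]

if __name__ == "__main__":
    print([c["id"] for c in rerank_same_tag(CANDIDATES)])
    print([c["id"] for c in select_by_event_groups(CANDIDATES)])
```

In the toy pool, candidates `a` and `b` share the same geographic similarity, so their order is decided by multimodal similarity, which mirrors the tie-breaking rule stated for the Same-Tag Group.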
3.4. Instruction Dataset Construction
- The [Detailed image caption], generated by the MLLM to externalize image details;
- The [Comment List], generated by the LLM. Each list comprises n distinct comments (denoted as ), corresponding to varying social perspectives derived from different simulated personas;
- The [Retrieved Samples], consisting of high-value reference samples obtained through the GECR module, which provides geographic semantics and reasoning anchors.
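
To make the structure of one instruction sample concrete, the sketch below assembles the three components listed above, together with the tweet text, into a single prompt. It is a hypothetical layout: the instruction wording, the `instruction`/`output` keys, and the fields of each retrieved sample are assumptions, since the exact template is not reproduced here.

```python
def build_instruction_sample(tweet_text, caption, comments, retrieved, label):
    """Assemble one instruction-tuning record (hypothetical prompt layout)."""
    comment_block = "\n".join(
        f"{i}. {comment}" for i, comment in enumerate(comments, start=1)
    )
    retrieved_block = "\n".join(
        f"- Text: {item['text']} | Label: {item['label']}" for item in retrieved
    )
    prompt = (
        "Classify the following crisis-related post.\n"
        f"[Tweet text]\n{tweet_text}\n"
        f"[Detailed image caption]\n{caption}\n"
        f"[Comment List]\n{comment_block}\n"
        f"[Retrieved Samples]\n{retrieved_block}\n"
    )
    return {"instruction": prompt, "output": label}


if __name__ == "__main__":
    # Made-up example content for illustration only.
    sample = build_instruction_sample(
        tweet_text="Roads near the river are closed after the flood.",
        caption="A flooded street with partially submerged cars.",
        comments=["Stay safe, the bridge is closed too.",
                  "Which shelters are open tonight?"],
        retrieved=[{"text": "Flooding blocks the main highway.",
                    "label": "Informative"}],
        label="Informative",
    )
    print(sample["instruction"])
```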
3.5. LoRA-Based LLM Instruction Fine-Tuning
4. Results
4.1. Dataset
- Task 1 aims to assess whether a tweet contains valuable humanitarian aid information, requiring classification into two categories: Informative and Not Informative.
- Task 2 further categorizes tweets into five fine-grained classes: “Affected individuals” (including “Injured or dead people” and “Missing or found people”), “Rescue, volunteering, or donation effort”, “Infrastructure and utility damage” (including “Vehicle damage”), “Other relevant information”, and “Not humanitarian”.
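
The Task 2 label merging described above can be written as a small lookup. This is a sketch that uses the class names exactly as they appear in the text; the raw annotation strings in the dataset files may be spelled differently, so the keys are an assumption.

```python
# Sub-labels folded into a broader Task 2 class, as described above.
TASK2_MERGE = {
    "Injured or dead people": "Affected individuals",
    "Missing or found people": "Affected individuals",
    "Vehicle damage": "Infrastructure and utility damage",
}

# The five fine-grained Task 2 classes.
TASK2_CLASSES = [
    "Affected individuals",
    "Rescue, volunteering, or donation effort",
    "Infrastructure and utility damage",
    "Other relevant information",
    "Not humanitarian",
]


def to_task2_class(raw_label):
    """Map a raw annotation to one of the five Task 2 classes."""
    label = TASK2_MERGE.get(raw_label, raw_label)
    if label not in TASK2_CLASSES:
        raise ValueError(f"Unknown Task 2 label: {raw_label!r}")
    return label


if __name__ == "__main__":
    print(to_task2_class("Vehicle damage"))
    print(to_task2_class("Not humanitarian"))
```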
4.2. Implementation Details
4.3. Baselines
- ResNet [49]: This classical image processing model has shown robust performance in various computer vision tasks.
- BERTweet [50]: A large-scale pre-trained language model for English tweets, which shares the same architecture as BERT and uses the RoBERTa pre-training procedure.
- SCBD [3]: This crisis tweet classification model addresses data inconsistency and overfitting through two key mechanisms: Stochastic Shared Embeddings (SSE) regularization for robust training and Cross-Attention-based fusion of projected image and text embeddings. These mechanisms enhance model stability while maintaining multimodal integration.
- AT-CAVE [51]: An adaptive transformer-based conditioned variational autoencoder network for incomplete multimodal tweet classification. It jointly models the textual information, visual information and label information into a unified deep model, which can generate more discriminative latent features and enhance the performance in missing-modality scenarios.
- OWSEC [4]: An open-world multimodal classification model that combines fine-grained semantic interaction. The model utilizes a multimodal mask transformer architecture to establish cross-modal semantic relation, filter redundant information through dynamic masking, and generate virtual mixed samples for training a separate open-world classifier.
- MFEK [7]: A knowledge-enhanced multimodal architecture that addresses out-of-distribution challenges via image-guided textual enhancement, multi-source knowledge extraction from Wikipedia and GPT-3.5 Turbo, and a co-attention-based fusion mechanism of external knowledge with text and image features.
- CEFN [6]: An evidential fusion framework grounded in subjective logic theory that explicitly models uncertainty during multimodal integration. The network treats encoder outputs as subjective opinions, enabling direct uncertainty quantification and more reliable representation learning.
- MMDG [5]: A multilayer deep graph model that constructs text and image graphs for multimodal representation learning. The architecture combines Graph Convolutional Networks (GCN) with autoencoder components to systematically extract and integrate heterogeneous graph-structured information.
- OWDII [52]: A multimodal model for categorizing social media posts related to disasters in an open-world environment, utilizing a multitask (closed-world and open-world) classifier and a sample generation strategy that models the distribution of unknown samples using known data.
- CG-PG [53]: A complementary graph learning and prompt-based cross-modal generation network for missing-modality cases in the fake news detection field. It explores structural complementary information in image and text graphs and generates representations of the missing modality from available modalities.
- CrisisSpot [54]: A graph neural network architecture designed to model complex cross-modal relationships by jointly analyzing textual–visual content correlations and social context features (user-centric and content-centric). Its inverted dual embedded attention mechanism simultaneously captures both complementary and contradictory data patterns.
- DeepSeek-V3 [46]: DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, employing innovative Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient training and inference.
- DeepSeek-R1 [55]: DeepSeek-R1 is an advanced iteration of DeepSeek-R1-Zero that addresses readability and language consistency issues through multi-stage training with cold-start data prior to reinforcement learning. This enhanced version achieves more robust performance while maintaining powerful logical behaviors.
- Llama 3-8B [56]: Llama 3-8B is an open-source LLM with 8 billion parameters. This model employs a decoder-only transformer architecture with Grouped-Query Attention (GQA) and a tokenizer with a 128K-token vocabulary. Pre-trained on over 15 trillion tokens and instruction-tuned, it delivers strong performance, particularly in reasoning and coding, establishing itself as a leading model in its class.
4.4. Experimental Results
4.4.1. CrisisMMD Dataset Results
4.4.2. DMD Dataset Results
4.5. Ablation Study
- w/o GECR and w/o SPG: The standard fine-tuning baseline excluding both modules, utilizing only tweet text and image captions.
- w/o GECR: The variant removing the retrieval module but retaining synthetic comments.
- Random retrieval: The variant replacing GECR with random retrieval.
- w/o [Attribute]: Variants removing specific persona attributes.
- Full-list generation at once: The variant generating comments without the iterative strategy.
4.6. Hyperparameter Sensitivity
4.7. Computational Efficiency
4.8. Case Study and Discussion
- Case 1: The primary challenge lies in accurately discerning the meaning of the image and jointly analyzing it with the text. CrisisSpot and MFEK likely failed to correctly interpret the visual content, exhibiting a bias towards classifying the image as “Not informative,” which led to their failure. In contrast, MPCG-GECR leverages the MLLM to effectively understand the image information, ultimately enabling the LLM to make the correct classification.
- Case 2: The difficulty here involves accurately inferring the implicit information in the text. The sample presents a headline from a news weekly that contains an informative image. The external knowledge introduced by MFEK backfired, causing it to erroneously over-index on the textual content while overlooking the image, leading to a misclassification. Both CrisisSpot and MPCG-GECR achieved correct multimodal fusion, resulting in successful classification.
- Case 3: The textual content in this sample strongly biased both CrisisSpot and MFEK towards interpreting it as containing other informational content. Additionally, the human present in the image introduced further distraction. Only MPCG-GECR correctly comprehended that the image lacked substantive crisis-related content, leading to the accurate classification of “Not humanitarian”.
5. Discussion
5.1. Paradigm Shift: From Feature Fusion to Generative Reasoning
5.2. Mechanism Analysis: Anchors and Perspectives
5.3. Trade-Offs Between Accuracy and Latency in Deployment
5.4. Quality of Synthetic Comments
5.5. Limitations and Future Directions
- Cross-Regional Transfer Learning: Investigating how geographic knowledge learned from data-rich regions can be adaptively transferred to data-sparse regions, potentially using domain adaptation techniques.
- Dynamic Persona Adaptation: Moving beyond static attribute lists to dynamically generate persona profiles based on the real-time demographic and cultural characteristics of the impacted region, thereby generating more culturally context-aware comments.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Apostol, E.S.; Truică, C.O.; Paschke, A. ContCommRTD: A Distributed Content-Based Misinformation-Aware Community Detection System for Real-Time Disaster Reporting. IEEE Trans. Knowl. Data Eng. 2024, 36, 5811–5822.
2. Ghafarian, S.H.; Yazdi, H.S. Identifying Crisis-related Informative Tweets Using Learning on Distributions. Inf. Process. Manag. 2020, 57, 102145.
3. Abavisani, M.; Wu, L.; Hu, S.; Tetreault, J.; Jaimes, A. Multimodal Categorization of Crisis Events in Social Media. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14679–14689.
4. Qian, S.; Chen, H.; Xue, D.; Fang, Q.; Xu, C. Open-World Social Event Classification. In Proceedings of the WWW ’23: The ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1562–1571.
5. Wang, J.; Yang, S.; Zhao, H.; Chen, Y. A Crisis Event Classification Method Based on A Multimodal Multilayer Graph Model. Neurocomputing 2025, 621, 129271.
6. Yu, C.; Wang, Z. Cross-modal Evidential Fusion Network for Social Media Classification. Comput. Speech Lang. 2025, 92, 101784.
7. Lin, Z.; Xie, J.; Li, Q. Multi-modal News Event Detection with External Knowledge. Inf. Process. Manag. 2024, 61, 103697.
8. Zheng, J.; Zhang, X.; Guo, S.; Wang, Q.; Zang, W.; Zhang, Y. MFAN: Multi-Modal Feature-Enhanced Attention Networks for Rumor Detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 2413–2419.
9. Nan, Q.; Sheng, Q.; Cao, J.; Zhu, Y.; Wang, D.; Yang, G.; Li, J. Exploiting User Comments for Early Detection of Fake News Prior to Users’ Commenting. Front. Comput. Sci. 2025, 19, 1910354.
10. Su, X.; Yang, J.; Wu, J.; Zhang, Y. Mining User-aware Multi-relations for Fake News Detection in Large Scale Online Social Networks. In Proceedings of the WSDM ’23: The Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 51–59.
11. Imran, M.; Castillo, C.; Diaz, F.; Vieweg, S. Processing social media messages in mass emergency: A survey. ACM Comput. Surv. (CSUR) 2015, 47, 67.
12. Olteanu, A.; Castillo, C.; Diaz, F.; Vieweg, S. Crisislex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of the International AAAI Conference on Web and Social Media, Los Angeles, CA, USA, 27–29 May 2014; pp. 376–385.
13. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Kuttler, H.; Lewis, M.; Yih, W.t.; Rocktaschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Curran Associates: Red Hook, NY, USA, 2020; pp. 9459–9474.
14. Wu, S.; Xiong, Y.; Cui, Y.; Wu, H.; Chen, C.; Yuan, Y.; Huang, L.; Liu, X.; Kuo, T.W.; Guan, N.; et al. Retrieval-augmented generation for natural language processing: A survey. arXiv 2024, arXiv:2407.13193.
15. Hu, Y.; Lu, Y. Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv 2024, arXiv:2404.19543.
16. Alam, F.; Ofli, F.; Imran, M. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. In Proceedings of the Twelfth International AAAI Conference on Web and Social Media (ICWSM 2018), Stanford, CA, USA, 25–28 January 2018; pp. 465–473.
17. Mouzannar, H.; Rizk, Y.; Awad, M. Damage identification in social media posts using multimodal deep learning. In Proceedings of the Information Systems for Crisis Response and Management Conference, Rochester, NY, USA, 20–23 May 2018; pp. 1–15.
18. Zheng, X.; Zeng, Z.; Wang, H.; Bai, Y.; Liu, Y.; Luo, M. From predictions to analyses: Rationale-augmented fake news detection with large vision-language models. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 5364–5375.
19. Zhou, Z.; Zhang, X.; Tan, S.; Zhang, L.; Li, C. Collaborative evolution: Multi-round learning between large and small language models for emergent fake news detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 1210–1218.
20. Irnawan, B.R.; Xu, S.; Tomuro, N.; Fukumoto, F.; Suzuki, Y. Claim veracity assessment for explainable fake news detection. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 19–24 January 2025; pp. 4011–4029.
21. Wei, Q.; Qiao, Y.; Zhu, S.; Jiao, A.; Dong, Q. Twitter User Geolocation Based on Multi-Graph Feature Fusion with Gating Mechanism. ISPRS Int. J. Geo-Inf. 2025, 14, 424.
22. Han, Y.; Liu, J.; Luo, A.; Wang, Y.; Bao, S. Fine-Tuning LLM-Assisted Chinese Disaster Geospatial Intelligence Extraction and Case Studies. ISPRS Int. J. Geo-Inf. 2025, 14, 79.
23. Zou, L.; He, Z.; Wang, X.; Liang, Y. Spatiotemporal Typhoon Damage Assessment: A Multi-Task Learning Method for Location Extraction and Damage Identification from Social Media Texts. ISPRS Int. J. Geo-Inf. 2025, 14, 189.
24. Zhu, H.; Meng, J.; Yao, J.; Xu, N. Feasibility of Emergency Flood Traffic Road Damage Assessment by Integrating Remote Sensing Images and Social Media Information. ISPRS Int. J. Geo-Inf. 2024, 13, 369.
25. Zorenbohmer, C.; Gandhi, S.; Schmidt, S.; Resch, B. An Aspect-Based Emotion Analysis Approach on Wildfire-Related Geo-Social Media Data—A Case Study of the 2020 California Wildfires. ISPRS Int. J. Geo-Inf. 2025, 14, 301.
26. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. arXiv 2023, arXiv:2310.03744.
27. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates: Red Hook, NY, USA, 2023; pp. 34892–34916.
28. Xie, C.; Wang, B.; Kong, F.; Li, J.; Liang, D.; Zhang, G.; Leng, D.; Yin, Y. FG-CLIP: Fine-grained Visual and Textual Alignment. arXiv 2025, arXiv:2505.05071.
29. Sheng, Q.; Cao, J.; Bernard, H.R.; Shu, K.; Li, J.; Liu, H. Characterizing Multi-domain False News and Underlying User Effects on Chinese Weibo. Inf. Process. Manag. 2022, 59, 102959.
30. Gaillard, S.; Oláh, Z.A.; Venmans, S.; Burke, M. Countering the Cognitive, Linguistic, and Psychological Underpinnings Behind Susceptibility to Fake News: A Review of Current Literature with Special Focus on The Role of Age and Digital Literacy. Front. Commun. 2021, 6, 661801.
31. Starbird, K.; Maddock, J.; Orand, M.; Achterman, P.; Mason, R.M. Rumors, false flags, and digital vigilantes: Misinformation on twitter after the 2013 boston marathon bombing. In Proceedings of the IConference 2014 Proceedings; iSchools: Westford, MA, USA, 2014; pp. 654–662.
32. Mendoza, M.; Poblete, B.; Castillo, C. Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics, Washington, DC, USA, 25–28 July 2010; pp. 71–79.
33. Starbird, K.; Palen, L. “Voluntweeters” self-organizing by digital volunteers in times of crisis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vancouver, BC, Canada, 7–12 May 2011; pp. 1071–1080.
34. Reuter, C.; Heger, O.; Pipek, V. Combining real and virtual volunteers through social media. In Proceedings of the Information Systems for Crisis Response and Management Conference, Baden-Baden, Germany, 12–15 May 2013; pp. 780–790.
35. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates: Red Hook, NY, USA, 2022; pp. 24824–24837.
36. Chen, Q.; Qin, L.; Liu, J.; Peng, D.; Guan, J.; Wang, P.; Hu, M.; Zhou, Y.; Gao, T.; Che, W. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv 2025, arXiv:2503.09567.
37. Kaur, N.; Saha, A.; Swami, M.; Singh, M.; Dalal, R. Bert-ner: A transformer-based approach for named entity recognition. In Proceedings of the 15th International Conference on Computing Communication and Networking Technologies, Kamand, India, 24–28 June 2024; pp. 1–7.
38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
39. JaidedAI. EasyOCR: Ready-to-Use OCR with 80+ Supported Languages and All Popular Writing Scripts. GitHub Repository. 2024. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 19 February 2026).
40. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Curran Associates: Red Hook, NY, USA, 2015; pp. 1–9.
41. Wang, P.; Wu, Q.; Cao, J.; Shen, C.; Gao, L.; Hengel, A.v.d. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1960–1968.
42. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
43. Liang, V.W.; Zhang, Y.; Kwon, Y.; Yeung, S.; Zou, J.Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates: Red Hook, NY, USA, 2022; pp. 17612–17625.
44. Wu, N.; Jastrzebski, S.; Cho, K.; Geras, K.J. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 24043–24055.
45. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. Lora: Low-rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022; pp. 1–20.
46. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437.
47. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
48. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101v2.
49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
50. Nguyen, D.Q.; Vu, T.; Nguyen, A.T. BERTweet: A Pre-trained Language Model for English Tweets. arXiv 2020, arXiv:2005.10200.
51. Li, Z.; Qian, S.; Cao, J.; Fang, Q.; Xu, C. Adaptive transformer-based conditioned variational autoencoder for incomplete social event classification. In Proceedings of the 30th ACM International conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 1698–1707.
52. Yu, C.; Hu, B.; Wang, Z. Open-world disaster information identification from multimodal social media. Complex Intell. Syst. 2025, 11, 7–20.
53. Wu, F.; Zhou, R.; Hu, C.; Huang, Q.; Jing, X.Y. Complementary graph learning and prompt-based cross-modal generation for missing-modality fake news detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
54. Shahid, S.D.; Mohammad, Z.U.R.; Karan, B.; Mohammed, A.H.; Nagendra, K. A social context-aware graph-based multimodal attentive learning framework for disaster content classification during emergencies. Expert Syst. Appl. 2025, 259, 125337.
55. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI Blog. 2025. Available online: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf (accessed on 19 February 2026).
56. Meta AI. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Meta AI Blog. 2024. Available online: https://ai.meta.com/blog/meta-llama-3 (accessed on 19 February 2026).







| Method | Task 1 Accuracy | Task 1 Macro-F1 | Task 2 Accuracy | Task 2 Macro-F1 |
|---|---|---|---|---|
| ResNet (2016, [49]) | 81.85 | 79.14 | 83.58 | 60.61 |
| BERTweet (2020, [50]) | 85.48 | 81.40 | 86.58 | 66.96 |
| SCBD (2020, [3]) | 89.75 | 88.45 | 91.44 | 68.85 |
| AT-CAVE (2022, [51]) | 91.69 | 89.42 | 91.89 | 70.54 |
| OWSEC (2023, [4]) | 92.09 | 90.43 | 92.75 | 73.79 |
| MFEK (2024, [7]) | 92.69 | 90.83 | 92.95 | 73.83 |
| CEFN (2025, [6]) | 91.32 | 89.73 | 92.47 | 72.54 |
| MMDG (2025, [5]) | 92.64 | 90.38 | 92.98 | 73.44 |
| OWDII (2025, [52]) | 92.35 | 90.62 | 92.79 | 73.90 |
| CG-PG (2025, [53]) | 92.16 | 90.82 | 92.62 | 73.67 |
| CrisisSpot (2025, [54]) | 93.11 | 91.41 | 93.54 | 74.45 |
| DeepSeek-V3 (2024, [46]) | 81.25 | 78.05 | 61.20 | 44.45 |
| DeepSeek-R1 (2025, [55]) | 83.17 | 79.25 | 63.21 | 46.14 |
| Llama 3-8B (2024, [56]) | 65.68 | 54.48 | 54.95 | 39.97 |
| LLaVA-1.5-7B (2024, [26,27]) | 68.94 | 61.08 | 64.36 | 63.10 |
| MPCG-GECR (Ours) | 95.72 | 94.04 | 95.21 | 76.17 |

| Method | Accuracy | Macro-F1 |
|---|---|---|
| ResNet (2016, [49]) | 86.53 | 86.37 |
| BERTweet (2020, [50]) | 85.89 | 85.64 |
| SCBD (2020, [3]) | 92.36 | 92.27 |
| AT-CAVE (2022, [51]) | 92.67 | 92.58 |
| OWSEC (2023, [4]) | 93.22 | 93.14 |
| MFEK (2024, [7]) | 93.51 | 93.40 |
| CEFN (2025, [6]) | 93.03 | 92.91 |
| MMDG (2025, [5]) | 93.33 | 93.26 |
| OWDII (2025, [52]) | 92.90 | 92.59 |
| CG-PG (2025, [53]) | 93.14 | 92.98 |
| CrisisSpot (2025, [54]) | 93.79 | 93.60 |
| DeepSeek-V3 (2024, [46]) | 80.19 | 79.95 |
| DeepSeek-R1 (2025, [55]) | 82.12 | 81.88 |
| Llama 3-8B (2024, [56]) | 75.60 | 72.51 |
| LLaVA-1.5-7B (2024, [26,27]) | 82.62 | 82.14 |
| MPCG-GECR (Ours) | 94.83 | 94.68 |

| Variant | Task 1 Accuracy | Task 1 Macro-F1 | Task 2 Accuracy | Task 2 Macro-F1 |
|---|---|---|---|---|
| MPCG-GECR | 95.72 | 94.04 | 95.21 | 76.17 |
| w/o GECR and w/o SPG | 90.31 | 88.62 | 90.25 | 69.45 |
| w/o GECR | 94.21 | 92.66 | 94.03 | 74.78 |
| Random retrieval | 92.59 | 90.20 | 92.14 | 72.05 |
| w/o Gender | 94.92 | 93.11 | 94.42 | 75.81 |
| w/o Age | 94.60 | 92.72 | 94.35 | 75.23 |
| w/o Education | 94.79 | 92.85 | 94.36 | 75.39 |
| w/o Occupation | 94.26 | 92.69 | 94.16 | 74.95 |
| w/o Location | 94.25 | 92.72 | 94.04 | 74.75 |
| Full-list generation at once | 94.88 | 93.08 | 94.71 | 75.53 |
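
The ablation table is easier to read as drops relative to the full model. The short script below is only a reading aid: it copies the Task 1 accuracy and Task 2 Macro-F1 values from the table above and subtracts each variant from the full MPCG-GECR row; the function and variable names are illustrative.

```python
# (Task 1 accuracy, Task 2 Macro-F1) copied from the ablation table above.
ABLATION = {
    "MPCG-GECR": (95.72, 76.17),
    "w/o GECR and w/o SPG": (90.31, 69.45),
    "w/o GECR": (94.21, 74.78),
    "Random retrieval": (92.59, 72.05),
    "w/o Gender": (94.92, 75.81),
    "w/o Age": (94.60, 75.23),
    "w/o Education": (94.79, 75.39),
    "w/o Occupation": (94.26, 74.95),
    "w/o Location": (94.25, 74.75),
    "Full-list generation at once": (94.88, 75.53),
}


def drops(table, full="MPCG-GECR"):
    """Drop of each variant relative to the full model, largest first."""
    full_acc, full_f1 = table[full]
    rows = []
    for name, (acc, f1) in table.items():
        if name == full:
            continue
        rows.append((name, round(full_acc - acc, 2), round(full_f1 - f1, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, d_acc, d_f1 in drops(ABLATION):
        print(f"{name:30s} -{d_acc:.2f} acc  -{d_f1:.2f} macro-F1")
```

Read this way, the table shows that random retrieval scores below the variant without retrieval on Task 1 accuracy (92.59 vs. 94.21), and that among the persona attributes, removing Location or Occupation costs the most Task 1 accuracy.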
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Bie, T.; Hu, Y.; Fu, Y.; Hao, L.; Liu, T.; Guo, K.; Jiang, H.; Gao, J.; Sun, Y.; Yin, B. Mix-Persona Comment Generation and Geographically Enhanced Context Retrieval for LLM Fine-Tuning in Multimodal Crisis Post Classification. ISPRS Int. J. Geo-Inf. 2026, 15, 104. https://doi.org/10.3390/ijgi15030104

