CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating
Abstract
1. Introduction
- (1)
- Bidirectional cross-attention (Bi-Cross-Attention) mechanism: Explicitly models mutual feature refinement between text and images, directly addressing the complex cross-modal relationships identified in [10].
- (2)
- Adaptive gating mechanism: Adaptive gating mechanism to dynamically suppress noisy inputs and balance modality contributions, improving robustness to social media noise.
- (3)
- Hierarchical learning strategy: Optimizes pre-trained CLIP representations for crisis-specific tasks via hierarchical learning rate scheduling (1 × 10−5 for base layers, 1 × 10−4 for fusion layers), improving upon baseline CLIP by 1.55% in accuracy [7].
2. Related Works
2.1. Unimodal Paradigms: Text-Only and Image-Only Models
2.2. Multimodal Fusion Strategies: Early, Late, and Intermediate Fusion
2.3. Emerging Techniques: Contrastive Learning, Weak Supervision, and Adaptive Fusion
2.4. Remaining Challenges in Crisis Tweet Classification
3. Materials and Methods
3.1. Dataset Construction and Preprocessing
3.1.1. Data Source and Label Consolidation
- (1)
- Merge low-frequency human impact labels: Injured_or_dead_people and missing_or_found_people—semantically related and collectively <15% of data—were merged into affected_individuals.
- (2)
- Consolidate infrastructure labels: Vehicle_damage was integrated into infrastructure_and_utility_damage per FEMA guidelines.
- (3)
- Retain core semantic categories: Rescue_volunteering_or_donation_effort and other_relevant_information were preserved to maintain granularity.
- (1)
- Affected individuals: Merged from “injured_or_dead_people” and “missing_or_found_people”;
- (2)
- Rescue_volunteering_or_donation_effort: Retained as original;
- (3)
- Infrastructure_and_utility_damage: Merged from “vehicle_damage” and the original infrastructure category;
- (4)
- Other_relevant_information: Retained as original;
- (5)
- Not_humanitarian: Retained as original.
3.1.2. Data Preprocessing
- (1)
- Text Preprocessing
- (2)
- Image Preprocessing
3.1.3. Data Augmentation
- (1)
- Text Augmentation
- (2)
- Image Augmentation
3.1.4. Dataset Splitting
3.2. Model Architecture
3.2.1. Dual-Tower Encoding with CLIP
3.2.2. Bidirectional Cross-Attention for Local Alignment
- (1)
- Fundamental Bidirectional Attention Paradigm
- (2)
- Multi-Head Attention Architecture for Fine-Grained Interaction
- (3)
- Layer Normalization
3.2.3. Reliability-Aware Modality Gating
3.2.4. Hierarchical Fusion and Classification
- (1)
- Bidirectional Cross-Attention: Local Interaction Refinement
- (2)
- Adaptive Gating Fusion Layer (Global Integration)
- (3)
- Hierarchical Fusion: From Local to Global
- (4)
- Classification Layer: Crisis Category Prediction
3.3. Training Strategy
3.3.1. Hierarchical Learning Rate Scheduling
- (1)
- Base CLIP Encoders: Fine-tuned with LR = . This low rate protects generalizable cross-modal representations, avoiding catastrophic forgetting of foundational visual–language mappings (e.g., “fire” → flame semantics).
- (2)
- BCA and Gating Modules: Trained with LR = . Higher rates accelerate learning of crisis-specific interaction patterns (e.g., aligning “collapsed bridge” text to structural damage in images), as these modules require rapid adaptation to domain-unique noise (typos, blurry visuals).
- (3)
- Classification Head: Optimized with LR to rapidly adapt to disaster category boundaries. The ReduceLROnPlateau scheduler decayed LR by after 5 consecutive epochs without validation accuracy improvement, with a minimum LR of . This strategy outperformed uniform LR settings, improving validation accuracy by 0.89% and reducing overfitting by 17% compared to a fixed LR.
3.3.2. Optimization and Regularization
4. Results
4.1. Overall Classification Performance
- Infrastructure_and_utility_damage: 93.42%;
- Rescue_volunteering_or_donation_effort: 92.15%;
- Affected_individuals: 89.73%.
4.2. Comparative Analysis of Multimodal Fusion Models
- Dynamic Fusion Outperforms Static Strategies: CLIP-BCA-Gated (91.77% accuracy) surpasses static fusion models (e.g., ALIGN-Concat-Aug: 89.91%) by 1.86 percentage points, confirming that bidirectional cross-attention enables finer-grained text–image alignment. For example, in tweets combining “collapsed bridge” text with oblique-angle damage images, the BCA module aligns “collapsed” to structural deformation regions, whereas static fusion relies on global feature similarity.
- Adaptive Gating Enhances Noise Resilience: The model outperforms text-augmented baselines (e.g., CLIP-Txt-Aug: 83.03%) by 8.74 percentage points, demonstrating that adaptive gating effectively suppresses noisy modalities. When text contains typos (e.g., “fload” for “flood”), the gating mechanism reduces text weight (α = 0.31), prioritizing visual cues (waterlogged areas).
- Superiority Over State-of-the-Art Models: Compared to advanced attention models (e.g., CBAN-Dot: 88.40% F1), CLIP-BCA-Gated improves F1 by 3.34 percentage points. This advantage is attributed to the synergy of bidirectional attention and dynamic modality weighting, which captures complex cross-modal relationships.
4.3. Class-Specific Performance Analysis
4.3.1. Confusion Matrix Analysis
- (1)
- High-Accuracy Categories: Infrastructure_and_utility_damage achieves 93.42% correct classification (diagonal value), with minimal misclassifications to Natural_hazard (5.83%) due to shared visual features (e.g., storm-damaged buildings vs. hurricane imagery). Rescue_volunteering_or_donation_effort shows 92.15% accuracy, with rare misclassifications to Affected_individuals (3.27%) when visual cues (e.g., rescue teams) are ambiguous.
- (2)
- Challenging Categories: Hazardous_materials_release (84.90% accuracy) exhibits 12.3% misclassifications to Infrastructure_damage, primarily due to limited training samples (327 instances) and overlapping semantics (e.g., “chemical leak” vs. “structural damage” tweets). Affected_individuals (89.73% accuracy) has 7.21% errors to Other_relevant_information, driven by vague text descriptions (e.g., “people affected” without clear context).
4.3.2. Misclassification Patterns
- (1)
- Semantic Overlap: Natural_hazard and Infrastructure_damage share 8.7% cross-class errors, as both involve disaster-related visuals (e.g., flood images vs. flooded road images).
- (2)
- Modality Noise: Health_related tweets with blurry images show 8.7% misclassifications, where the gating mechanism underweights visual features but struggles with ambiguous text (e.g., “illness” vs. “injury”).
- (3)
- Class Imbalance: Hazardous_materials_release (minority class, 4.1% of dataset) has 15% more errors than the majority classes, highlighting the need for category-specific augmentation.
4.4. Ablation Study Results
- Bidirectional Cross-Attention (BCA): Removal caused a 2.54% drop, highlighting its core role in disaster-specific semantic binding (e.g., “landslide” text-image alignment).
- Adaptive Gating: Disabling led to a 1.12% decline, confirming its necessity for balancing modality reliability.
- Data Augmentation: Elimination resulted in a 0.83% drop, underscoring its supplementary role in scene diversity.
4.5. Training Dynamics and Convergence
4.6. Real-Time Inference Efficiency
4.7. Statistical Significance
5. Discussion
5.1. Theoretical Mechanisms of Model Superiority
5.2. State-of-the-Art Comparisons and Multimodal Synergy
5.3. Practical Implications for Crisis Response
5.3.1. Real-Time Responsiveness and Robustness in Crisis Scenarios
5.3.2. Cost Optimization for Essential Components
- (1)
- Hardware cost flexibility: Optimized for mid-tier GPUs (e.g., NVIDIA RTX 3060, Section 4.6) and compatible with cloud pay-as-you-go services, reducing upfront investments by scaling costs to actual usage.
- (2)
- Open-source toolchain: Critical components (PyTorch, OpenCV) use open-source software, eliminating licensing fees while maintaining efficiency gains (e.g., 50% reduced computational load, Section 4.6).
- (3)
- Modular deployment: Incremental adoption starts with core text-based functionalities on standard CPUs, scaling to multimodal capabilities as resources allow, lowering initial investment barriers.
5.3.3. Computational Efficiency for Low-End Equipment
- (1)
- Optimized inference: Mixed-precision training (FP16) and dynamic graph optimization (Section 4.6) reduce computational load by 50% and eliminate 31% of redundant operations, enabling reliable performance on entry-level GPUs (e.g., NVIDIA MX550) and standard CPUs.
- (2)
- Distributed training: Via PyTorch’s Distributed Data Parallel, multiple low-end devices (e.g., 4× entry-level GPUs like NVIDIA MX550) aggregate to match mid-tier GPU efficiency—reducing reliance on expensive hardware for both inference and training. For resource-constrained organizations, aggregating 4–6 such GPUs achieves training efficiency comparable to a single RTX 3060 when fine-tuning on crisis-specific datasets, avoiding dependence on high-end clusters (rarely accessible in remote disaster zones). Complemented by transfer learning with pre-trained backbones (e.g., CLIP), fine-tuning requires 60% fewer steps than training from scratch, further lowering hardware demands for model updates.
- (3)
- Lightweight modules: Building on modular deployment (Section 5.3.2), CPU-friendly functionalities (e.g., text-only classification) minimize computational demands, ensuring usability for resource-constrained users.
5.3.4. Potential Deployment Scenarios in Real-World Crisis Response
- (1)
- Live Crisis Monitoring: Streamlining Alert Prioritization
- (2)
- Emergency Response Platform Integration
- (3)
- Edge Deployment for On-Site Verification
5.4. Limitations and Future Trajectories
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mandal, B.; Khanal, S.; Caragea, D. Contrastive learning for multimodal classification of crisis related tweets. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 4555–4564. [Google Scholar]
- Alam, F.; Ofli, F.; Imran, M. Crisismmd: Multimodal twitter datasets from natural disasters. In Proceedings of the International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 25–28 June 2018; Volume 12. [Google Scholar]
- Shetty, N.P.; Bijalwan, Y.; Chaudhari, P.; Shetty, J.; Muniyal, B. Disaster assessment from social media using multimodal deep learning. Multimed. Tools Appl. 2024, 84, 18829–18854. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Ofli, F.; Alam, F.; Imran, M. Analysis of social media data using multimodal deep learning for disaster response. arXiv 2020, arXiv:2004.11838. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PmLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
- Bielawski, R.; Devillers, B.; Van De Cruys, T.; VanRullen, R. When does CLIP generalize better than unimodal models? When judging human-centric concepts. In Proceedings of the 7th Workshop on Representation Learning (Repl4NLP 2022), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics (ACL): Kerrville, TX, USA, 2022; pp. 29–38. [Google Scholar]
- Biamby, G.; Luo, G.; Darrell, T.; Rohrbach, A. Twitter-COMMs: Detecting climate, COVID, and military multimodal misinformation. arXiv 2021, arXiv:2112.08594. [Google Scholar]
- Sirbu, I.; Sosea, T.; Caragea, C.; Caragea, D.; Rebedea, T. Multimodal semi-supervised learning for disaster tweet classification. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2711–2723. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 4904–4916. [Google Scholar]
- Ponce-López, V.; Spataru, C. Social media data analysis framework for disaster response. Discov. Artif. Intell. 2022, 2, 10. [Google Scholar] [CrossRef]
- Chaudhary, V.; Goel, A.; Yusuf, M.Z.; Tiwari, S. Disaster Tweets Classification Using Natural Language Processing. In Proceedings of the International Conference on Smart Computing and Informatics, Kochi, Kerala, India, 3–5 July 2025; Springer: Singapore, 2025; pp. 91–101. [Google Scholar]
- Alcántara, T.; García-Vázquez, O.; Calvo, H.; Torres-León, J.A. Disaster Tweets: Analysis from the Metaphor Perspective and Classification Using LLM’s. In Proceedings of the Mexican International Conference on Artificial Intelligence, Mérida, Mexico, 6–10 November 2023; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 106–117. [Google Scholar]
- Aamir, M.; Ali, T.; Irfan, M.; Shaf, A.; Azam, M.Z.; Glowacz, A.; Brumercik, F.; Glowacz, W.; Alqhtani, S.; Rahman, S. Natural disasters intensity analysis and classification based on multispectral images using multi-layered deep convolutional neural network. Sensors 2021, 21, 2648. [Google Scholar] [CrossRef] [PubMed]
- Yang, L.; Cervone, G. Analysis of remote sensing imagery for disaster assessment using deep learning: A case study of flooding event. Soft Comput. 2019, 23, 13393–13408. [Google Scholar] [CrossRef]
- Jena, R.; Pradhan, B.; Beydoun, G.; Alamri, A.M.; Ardiansyah; Nizamuddin; Sofyan, H. Earthquake hazard and risk assessment using machine learning approaches at Palu, Indonesia. Sci. Total Environ. 2020, 749, 141582. [Google Scholar] [CrossRef]
- Asif, A.; Khatoon, S.; Hasan, M.M.; Alshamari, M.A.; Abdou, S.; Elsayed, K.M.; Rashwan, M. Automatic analysis of social media images to identify disaster type and infer appropriate emergency response. J. Big Data 2021, 8, 83. [Google Scholar] [CrossRef]
- Zou, Z.; Gan, H.; Huang, Q.; Cai, T.; Cao, K. Disaster image classification by fusing multimodal social media data. IEEE Geosci Remote Sens. Lett. 2021, 18, 636–640. [Google Scholar] [CrossRef]
- Parasher, S.; Mittal, P.V.; Karki, S.; Narang, S.; Mittal, A. Natural Disaster Twitter Data Classification Using CNN and Logistic Regression. In International Conference on Soft Computing for Problem-Solving; Springer Nature Singapore: Singapore, 2023; pp. 681–692. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 6 July 2025).
- Zhang, M.; Huang, Q.; Liu, H. A multimodal data analysis approach to social media during natural disasters. Sustainability 2022, 14, 5536. [Google Scholar] [CrossRef]
- Belcastro, L.; Marozzo, F.; Talia, D.; Trunfio, P.; Branda, F.; Palpanas, T.; Imran, M. Using social media for sub-event detection during disasters. J. Big Data 2021, 8, 79. [Google Scholar] [CrossRef]
- Koshy, R.; Elango, S. Multimodal tweet classification in disaster response systems using transformer-based bidirectional attention model. Neural Comput. Appl. 2023, 35, 1607–1627. [Google Scholar] [CrossRef]
- Zou, H.P.; Caragea, C.; Zhou, Y.; Caragea, D. Crisismatch: Semi-supervised few-shot learning for fine-grained disaster tweet classification. arXiv 2023, arXiv:2310.14627. [Google Scholar]
- Teng, S.; Öhman, E. Using Multimodal Models for Informative Classification of Ambiguous Tweets in Crisis Response. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, Albuquerque, NM, USA, 3–4 May 2025; pp. 265–271. [Google Scholar]
- Ochoa, K.S.; Comes, T. A Machine learning approach for rapid disaster response based on multi-modal data. The case of housing & shelter needs. arXiv 2021, arXiv:2108.00887. [Google Scholar] [CrossRef]
- Gite, S.; Patil, S.; Pradhan, B.; Yadav, M.; Basak, S.; Rajendra, A.; Alamri, A.; Raykar, K.; Kotecha, K. Analysis of Multimodal Social Media Data Utilizing VIT Base 16 and GPT-2 for Disaster Response. Arab. J. Sci. Eng. 2025, 1–19. [Google Scholar] [CrossRef]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23729. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 18988–19000. [Google Scholar]
- Peng, Z.; Wang, W.; Dong, L.; Hao, Y.; Huang, S.; Ma, S.; Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv 2023, arXiv:2306.14824. [Google Scholar] [CrossRef]
- Tekumalla, R.; Banda, J.M. TweetDIS: A Large Twitter Dataset for Natural Disasters Built using Weak Supervision. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 4816–4823. [Google Scholar] [CrossRef]
- Gupta, K.; Gautam, N.; Sosea, T.; Caragea, D.; Caragea, C. Calibrated Semi-Supervised Models for Disaster Response based on Training Dynamics. In Proceedings of the International ISCRAM Conference, Halifax, NS, Canada, 18–21 May 2025. [Google Scholar]
- Yin, K.; Liu, C.; Mostafavi, A.; Hu, X. Crisissense-llm: Instruction fine-tuned large language model for multi-label social media text classification in disaster informatics. arXiv 2024, arXiv:2406.15477. [Google Scholar]
- Zahera, H.M.; Jalota, R.; Sherif, M.A.; Ngomo, A.-C.N. I-AID: Identifying actionable information from disaster-related tweets. IEEE Access 2021, 9, 118861–118870. [Google Scholar] [CrossRef]
- Hughes, A.L.; Clark, H. Seeing the Storm: Leveraging Multimodal LLMs for Disaster Social Media Video Filtering. In Proceedings of the ISCRAM 2025. Available online: https://ojs.iscram.org/index.php/Proceedings/article/view/159 (accessed on 6 July 2025).
- Abavisani, M.; Wu, L.; Hu, S.; Tetreault, J.; Jaimes, A. Multimodal categorization of crisis events in social media. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14679–14689. [Google Scholar]
- Pranesh, R. Exploring multimodal features and fusion strategies for analyzing disaster tweets. In Proceedings of the Eighth Workshop on Noisy User-Generated Text (W-NUT 2022), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 62–68. [Google Scholar]
- Cheung, T.; Lam, K. Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing 2022, 514, 1–12. [Google Scholar] [CrossRef]
- Rezk, M.; Elmadany, N.; Hamad, R.K.; Badran, E.F. Categorizing crises from social media feeds via multimodal channel attention. IEEE Access 2023, 11, 72037–72049. [Google Scholar] [CrossRef]
Train (70%) | Dev (15%) | Test (15%) | Total | |||||
---|---|---|---|---|---|---|---|---|
Text | Image | Text | Image | Text | Image | Text | Image | |
other_relevant_information | 1222 | 1269 | 235 | 239 | 244 | 245 | 1222 | 1269 |
rescue_volunteering_or_donation_effort | 749 | 827 | 183 | 188 | 168 | 172 | 749 | 827 |
affected_individuals | 310 | 329 | 67 | 70 | 67 | 71 | 444 | 470 |
infrastructure_and_utility_damage | 478 | 539 | 99 | 108 | 122 | 126 | 478 | 539 |
not_humanitarian | 2666 | 2957 | 644 | 660 | 642 | 660 | 2666 | 2957 |
Total | 5425 | 5921 | 1228 | 1265 | 1243 | 1274 | 7896 | 8460 |
Symbol | Dimension | Description |
---|---|---|
49,408 × 768 | Text token embedding matrix | |
77 × 768 | Position encoding matrix | |
77 × 768 | Initial text embedding (token + position encoding) | |
768 × 768 | Query/Key/Value matrices for Transformer multi-head attention | |
Attention weight matrix | ||
Output of multi-head attention | ||
49 × 768 (7 × 7 patches) | Image patch embedding vector | |
512D | Projected text feature vector | |
512D | Projected image feature vector | |
1024D | Concatenated text–image feature vector |
Model | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
---|---|---|---|---|
VGG-16 + CNN [6] | 78.4 ± 0.5 | 78.5 ± 0.4 | 78.0 ± 0.6 | 78.3 ± 0.5 |
VGG-16 + CNN (Image only) | 76.8 ± 0.3 | 76.4 ± 0.3 | 76.8 ± 0.4 | 76.3 ± 0.3 |
VGG-16 + CNN (Text only) | 70.4 ± 0.6 | 70.0 ± 0.5 | 70.0 ± 0.5 | 67.7 ± 0.7 |
DenseNet + BERT [37] | 82.72 ± 0.3 | 82.50 ± 0.3 | 82.72 ± 0.2 | 82.46 ± 0.3 |
FBP with fusion [38] | – | 88.5 ± 0.2 | 88.1 ± 0.2 | 88.1 ± 0.2 |
CBAN-Dot [39] | 88.38 ± 0.2 | 87.95 ± 0.2 | 87.80 ± 0.2 | 88.40 ± 0.2 |
DMCC [40] | 88.00 ± 0.2 | 87.95 ± 0.2 | 87.80 ± 0.2 | 87.72 ± 0.2 |
CLIP [1] | 90.22 ± 0.12 | 90.23 ± 0.11 | 90.22 ± 0.12 | 90.04 ± 0.12 |
CLIP (Image only) [1] | 87.43 ± 0.21 | 87.48 ± 0.22 | 87.43 ± 0.21 | 87.14 ± 0.26 |
CLIP (Text only) [1] | 81.26 ± 0.32 | 81.47 ± 0.31 | 81.26 ± 0.32 | 80.70 ± 0.41 |
CLIP Surgery [1] | 90.21 ± 0.11 | 90.23 ± 0.12 | 90.21 ± 0.11 | 90.02 ± 0.14 |
CLIP Surgery (Image only) [1] | 87.49 ± 0.27 | 87.51 ± 0.22 | 87.49 ± 0.27 | 87.26 ± 0.19 |
CLIP Surgery (Text only) [1] | 81.14 ± 0.33 | 81.18 ± 0.31 | 81.14 ± 0.33 | 80.65 ± 0.36 |
ALIGN [1] | 89.44 ± 0.18 | 89.40 ± 0.18 | 89.44 ± 0.18 | 89.31 ± 0.19 |
ALIGN (Image only) [1] | 86.49 ± 0.18 | 86.58 ± 0.17 | 86.49 ± 0.18 | 86.20 ± 0.18 |
ALIGN (Text only) [1] | 80.63 ± 0.21 | 80.63 ± 0.22 | 80.63 ± 0.21 | 80.40 ± 0.25 |
ALIGN-Concat-Aug | 89.91 ± 0.1 | 89.92 ± 0.1 | 89.91 ± 0.1 | 89.85 ± 0.1 |
ALIGN-Img-Aug | 86.08 ± 0.2 | 86.62 ± 0.2 | 86.08 ± 0.2 | 86.14 ± 0.2 |
ALIGN-Txt-Aug | 82.72 ± 0.2 | 82.68 ± 0.2 | 82.72 ± 0.2 | 82.70 ± 0.2 |
FLAVA-Concat-Aug | 89.60 ± 0.1 | 89.59 ± 0.1 | 89.60 ± 0.1 | 89.56 ± 0.1 |
FLAVA-Img-Aug | 82.80 ± 0.3 | 83.14 ± 0.3 | 82.80 ± 0.3 | 82.81 ± 0.3 |
FLAVA-Txt-Aug | 79.20 ± 0.4 | 79.04 ± 0.4 | 79.20 ± 0.4 | 79.02 ± 0.4 |
CLIP-Img-Aug | 87.18 ± 0.2 | 87.16 ± 0.2 | 87.18 ± 0.2 | 87.14 ± 0.2 |
CLIP-Txt-Aug | 83.03 ± 0.2 | 82.87 ± 0.2 | 83.03 ± 0.2 | 82.85 ± 0.2 |
CLIP-BCA-Gated | 91.77 ± 0.11 | 91.81 ± 0.10 | 91.77 ± 0.11 | 91.74 ± 0.11 |
Ablation Variant | Accuracy (%) | Drop from Full Model (%) |
---|---|---|
CLIP-BCA-Gated | 91.77 ± 0.11 | 0.00 |
No Bidirectional Cross-Attention | 89.23 ± 0.18 | 2.54 |
No Adaptive Gating | 90.65 ± 0.16 | 1.12 |
No Data Augmentation | 90.94 ± 0.20 | 0.83 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, S.; Liu, Q.; Pan, Z.; Wu, X. CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating. Appl. Sci. 2025, 15, 8758. https://doi.org/10.3390/app15158758
Li S, Liu Q, Pan Z, Wu X. CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating. Applied Sciences. 2025; 15(15):8758. https://doi.org/10.3390/app15158758
Chicago/Turabian StyleLi, Shanshan, Qingjie Liu, Zhian Pan, and Xucheng Wu. 2025. "CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating" Applied Sciences 15, no. 15: 8758. https://doi.org/10.3390/app15158758
APA StyleLi, S., Liu, Q., Pan, Z., & Wu, X. (2025). CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating. Applied Sciences, 15(15), 8758. https://doi.org/10.3390/app15158758