CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The proposal seems solid, but the high cost of the necessary components may be an obstacle. The authors are asked to analyze their proposal and consider more cost-effective schemes.
The same concern applies to the high computational cost. This is not a huge disadvantage in itself, but most users do not have high-end equipment. The authors are asked to analyze their proposal and consider more cost-effective schemes.
The proposal may have problems with classes that contain little data. How do the authors address this point?
Modal alignment is another potential problem: because of the configuration of the model, the alignment may be incomplete and therefore limited. How do the authors face this issue, and how do they solve it?
What happens when the data set is not clean enough? The authors are asked to elaborate on this issue, since the proposal depends on the data always being correct.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper addresses a very important and timely topic: real-time humanitarian crisis classification using multimodal data analysis. This is a socially significant area, and the proposed method is interesting and well-structured. The combination of Bi-Cross-Attention and Adaptive Gating in a dynamic framework provides a practical solution for understanding complex crisis-related data.
One of the strongest aspects of this work is its experimental evaluation. The paper includes comparisons with state-of-the-art methods such as CLIP and ALIGN, presents ablation studies, and applies statistical significance tests (t-tests), which adds credibility to the findings. These details show that a serious effort has been made to validate the proposed approach. The description of the framework and the performance analysis are clear and detailed, helping readers to follow both the methodology and the results.
However, a few points can be improved to increase the overall impact of the paper. Although the reference list is good, it would be helpful to include more recent works (from 2024–2025), especially studies on multimodal fusion and LLM-based approaches such as Flamingo, BLIP-2, and Kosmos-2. Adding these references will better position the proposed method within the current research landscape and highlight its contribution.
In addition, it would be valuable to include a discussion about possible real-world applications of the proposed model beyond the CrisisMMD dataset. Even if additional experiments are not possible, a short section suggesting how the model could work in practical scenarios (for example, live crisis monitoring or integration with disaster response platforms) would make the paper stronger and more relevant to applied research.
Overall, the paper is well written, with strong experimental results and a clearly defined research problem. The suggestions above aim to improve the scientific value without requiring major structural changes. For these reasons, the recommendation is Minor Revision.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors explain and elaborate on each of the observations.