Abstract
Multimodal question answering (QA) requires integrating information from visual and textual inputs and demands models that can reason compositionally and accurately across modalities. Existing approaches, including fine-tuned vision–language models and prompting-based methods, often struggle with generalization, interpretability, and reliance on task-specific data. In this work, we propose the Mixture-of-Reasoning Agents (MiRA) framework for zero-shot multimodal reasoning. MiRA decomposes the reasoning process across three specialized agents (Visual Analyzing, Text Comprehending, and Judge), which together consolidate multimodal evidence. Each agent operates independently using pretrained language models, enabling structured, interpretable reasoning without supervised training or task-specific adaptation. Evaluated on the ScienceQA benchmark, MiRA achieves 96.0% accuracy, surpassing all zero-shot methods, outperforming few-shot GPT-4o by more than 18% on image-based questions, and performing on par with the best fine-tuned systems. Our analysis further shows that the Judge agent consistently improves the reliability of individual agent outputs and reveals a previously unreported, robust pattern: accuracy on image-specific questions strongly predicts overall performance across models (r > 0.95). Detailed error analyses for each agent highlight their complementary strengths and failure modes. These results demonstrate that modular agent collaboration with zero-shot reasoning enables highly accurate multimodal QA, establishing a new paradigm for zero-shot multimodal reasoning and offering a principled framework for future research in generalizable AI.