Applied Sciences
Article | Open Access

29 December 2025

MiRA: A Zero-Shot Mixture-of-Reasoning Agents Framework for Multimodal Answering of Science Questions

1 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 Department of Information Systems, College of Computer Science and Information Systems, Najran University, Najran 1988, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 372; https://doi.org/10.3390/app16010372
This article belongs to the Special Issue Deep Learning and Its Applications in Natural Language Processing

Abstract

Multimodal question answering (QA) involves integrating information from visual and textual inputs and requires models that can reason compositionally and accurately across modalities. Existing approaches, including fine-tuned vision–language models and prompting-based methods, often struggle with generalization and interpretability and rely on task-specific data. In this work, we propose a Mixture-of-Reasoning Agents (MiRA) framework for zero-shot multimodal reasoning. MiRA decomposes the reasoning process across three specialized agents (Visual Analyzing, Text Comprehending, and Judge), which consolidate multimodal evidence. Each agent operates independently using pretrained language models, enabling structured, interpretable reasoning without supervised training or task-specific adaptation. Evaluated on the ScienceQA benchmark, MiRA achieves 96.0% accuracy, surpassing all zero-shot methods, outperforming few-shot GPT-4o models by more than 18% on image-based questions, and matching the performance of the best fine-tuned systems. The analysis further shows that the Judge agent consistently improves the reliability of individual agent outputs and that strong linear correlations (r > 0.95) exist between image-specific accuracy and overall performance across models. We identify a previously unreported and robust pattern in which performance on image-specific tasks strongly predicts overall task success. We also conduct detailed error analyses for each agent, highlighting complementary strengths and failure modes. These results demonstrate that modular agent collaboration with zero-shot reasoning provides highly accurate multimodal QA, establishing a new paradigm for zero-shot multimodal AI and offering a principled framework for future research in generalizable AI.
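
To make the agent decomposition concrete, the following is a minimal Python sketch of the idea described above: three independently prompted agents whose outputs are consolidated by a judge. The `Question` fields, the prompt wording, the `toy_llm` stand-in, and the judge's selection step are illustrative assumptions for this sketch, not the authors' implementation or prompts.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical LLM interface: any function mapping a prompt string to a text reply.
LLM = Callable[[str], str]

@dataclass
class Question:
    text: str                 # science question plus answer choices
    image_caption: str = ""   # textual stand-in for the visual input, if any

def visual_agent(llm: LLM, q: Question) -> str:
    """Reason over the visual evidence (represented here as a caption)."""
    prompt = (
        "You are a visual analysis agent. Using only the image description, "
        f"answer the question.\nImage: {q.image_caption}\nQuestion: {q.text}\nAnswer:"
    )
    return llm(prompt)

def text_agent(llm: LLM, q: Question) -> str:
    """Reason over the textual context of the question alone."""
    prompt = (
        "You are a text comprehension agent. Using only the question text, "
        f"answer the question.\nQuestion: {q.text}\nAnswer:"
    )
    return llm(prompt)

def judge_agent(llm: LLM, q: Question, visual: str, textual: str) -> str:
    """Consolidate the two candidate answers into a final verdict."""
    prompt = (
        "You are a judge agent. Two agents answered the same question.\n"
        f"Question: {q.text}\nVisual agent: {visual}\nText agent: {textual}\n"
        "Choose the better-supported answer and state it briefly:"
    )
    return llm(prompt)

def answer(llm: LLM, q: Question) -> str:
    """Zero-shot pipeline: independent agents, then a judge merges their evidence."""
    return judge_agent(llm, q, visual_agent(llm, q), text_agent(llm, q))

if __name__ == "__main__":
    # Toy stand-in LLM so the sketch runs without any external service.
    def toy_llm(prompt: str) -> str:
        return "(B) chemical change" if "rust" in prompt.lower() else "(A)"

    q = Question(
        text="Iron nails left outside turn orange. Is this a physical or chemical "
             "change? (A) physical (B) chemical",
        image_caption="Close-up photo of rusted iron nails.",
    )
    print(answer(toy_llm, q))
```

In practice the `toy_llm` callable would be replaced by a call to a pretrained language model; the sketch only illustrates how the three roles stay independent until the judge step.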
