The evaluation results reveal meaningful differences between human participants and MLLMs, alongside notable variability in model performance across the range of tasks.
4.2. Performance of Multimodal Models
Building on the experimental framework, this subsection presents the results of the MLLMs evaluated on the three tasks. Specifically, we analyze model accuracy, error patterns, and the influence of architecture and parameter scale across tasks. These results reveal the strengths and limitations of the models’ reasoning capabilities and set the stage for a comparison with human performance.
The results in Table 1 reveal several notable patterns in the performance of MLLMs across tasks and modalities. Consistent with our previous study [7], most models show a gradual decline in accuracy when shifting from Task 1 to Task 3, indicating that longer contexts and more complex reasoning demands continue to challenge current architectures. At the same time, the overall mean performance of most models has increased compared to the previous study, suggesting that scaling contributes to performance gains. An exception is LLaVA-34B, which did not follow this trend. Qwen2.5-VL-72B achieved the highest overall performance among the models evaluated in this study. In particular, it surpassed both GPT-4o and GPT-4.1 mini from the previous study in the XML modality of Task 3, reaching an accuracy of 0.91. This advantage aligns with the claims in the Qwen2.5-VL technical report [2], which highlights the model’s strength in handling text-rich scenarios.
Furthermore, a clear parameter-scaling effect is observed within the Qwen2.5-VL family. Both the 32B and 72B variants demonstrate measurable improvements, highlighting the importance of scaling in supporting more robust multimodal reasoning. In contrast, LLaVA-34B exhibited the lowest overall accuracy, with a particularly severe drop in the JSON modality of Task 3, suggesting potential deficiencies in structured data exposure during training. These results underscore that while larger parameter counts enhance general multimodal reasoning, training strategies and modality-specific data coverage remain equally critical. Overall, the findings highlight the need for further research into the physical reasoning abilities of MLLMs, showing that performance depends not only on model size but also on the interaction between modality, task complexity, and training objectives.
To further assess model robustness, we examined the number of invalid responses across tasks, modalities, and models, as shown in Table 2. An invalid response occurs when the model fails to generate a parsable or meaningful output during the answer extraction process. Overall, invalid responses were most frequent in the JSON modality, far exceeding those observed for image and XML. This pattern was most evident in LLaVA-34B, which recorded the highest number of invalid responses in Task 2 and Task 3. These findings indicate that structured formats such as JSON pose challenges for some large-parameter models, likely due to uneven training data exposure.
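As a rough illustration of this validity check, the following Python sketch attempts to extract an option letter from a raw model response, first as structured JSON and then as free text, and flags the response as invalid when nothing parsable is found. The choice set and function name are illustrative assumptions and do not reproduce the exact extraction code used in the study.

```python
import json
import re

# Illustrative set of answer options; the actual tasks may use different labels.
VALID_CHOICES = {"A", "B", "C", "D"}

def extract_answer(raw_response: str):
    """Return the extracted choice, or None if the response is invalid."""
    # First, try to parse a structured reply such as {"answer": "B"}.
    try:
        payload = json.loads(raw_response)
        choice = str(payload.get("answer", "")).strip().upper()
        if choice in VALID_CHOICES:
            return choice
    except (json.JSONDecodeError, AttributeError):
        pass

    # Fall back to searching for a standalone option letter in free text.
    match = re.search(r"\b([ABCD])\b", raw_response.upper())
    if match:
        return match.group(1)

    # No parsable or meaningful answer: counted as an invalid response.
    return None
```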
A secondary observation is that invalid response rates show diverse patterns across models and modalities, rather than declining consistently with increasing parameter size. For instance, Gemma3 demonstrates a clear decrease in invalid responses across modalities and tasks, suggesting improved reliability within this family. In contrast, LLaVA-34B performs substantially worse under the JSON modality in Task 2 and Task 3 despite its larger size, highlighting that scale alone does not guarantee robustness. Across tasks, invalid responses were least frequent in Task 1 and most frequent in Task 3, indicating that more complex tasks exacerbate validity issues. These findings underscore the importance of evaluating not only accuracy but also response validity, since invalid outputs can distort performance comparisons across tasks and modalities.
4.3. Comparison with Human Judgments
To contextualize model performance, we compared MLLM outputs with judgments from 14 human participants collected through the user study. Participants completed the same tasks under the constraints outlined in Section 3.4 and Section 4.1: image-only modality, balanced stable and unstable structures, and random selection of candidate questions. Their responses provide a clear empirical reference for evaluating MLLM accuracy under identical task conditions.
As described in Section 3.4, a duplicate question was included in Task 1 to assess response reliability. This mechanism ensured that participants were attentive and engaged with the survey. All 14 participants provided consistent answers for the duplicate question, confirming the reliability of the human responses.
Table 3 presents the performance of humans and selected MLLMs on the stability reasoning tasks. For model comparison, we included the highest-scoring model from this study (Qwen2.5-VL-72B) and the highest-scoring model from the previous study (GPT-4o) to establish representative reference points. Although MLLMs were originally evaluated on all 300 levels across modalities, we also report their performance on the same 35 image-based items to enable fair comparison with human participants. Human participants demonstrated high accuracy across tasks, with particularly strong results in Task 1 (0.91) and Task 2 (0.83), though performance was lower in Task 3.
Human participants achieved consistently high accuracy across tasks, especially in the binary and comparison categories, whereas models exhibited larger variability across modalities and subtasks. These results illustrate the present empirical performance gap between human participants and current MLLMs in physical reasoning tasks. However, as clarified in Section 3.4, the human data are used strictly as a descriptive reference to contextualize model capabilities. We do not interpret these differences as evidence of distinct reasoning mechanisms or perceptual strategies, but rather as measurable disparities in task accuracy. In this way, the human benchmark serves to indicate what level of performance can be achieved under identical conditions, providing a practical frame for assessing progress in multimodal physical reasoning.
In contrast, the models showed more variability. GPT-4o performed perfectly on stable structures in Task 1 but failed on unstable structures, resulting in an overall accuracy of 0.50. Qwen2.5-VL-72B showed a more balanced performance between stable and unstable conditions in Task 1 (0.40 and 0.60, respectively), but its overall accuracy remained lower than that of human participants across most subtasks. For Task 2, which involved multiple-choice options, human accuracy remained relatively high (0.83), whereas both models struggled to consistently identify stable and unstable structures, highlighting the difficulty of integrating visual cues under more complex formats.
In Task 3, humans maintained strong performance on the “diff_block” and “diff_level” subtasks but lower accuracy on the “diff_id” subtask (0.60), reflecting subtle challenges in recognizing structural identity differences. Interestingly, GPT-4o achieved higher accuracy than humans on the “diff_block” and “diff_id” subtasks, with a larger difference observed in “diff_id”, suggesting that the model can sometimes detect small visual details more consistently than human participants. Nevertheless, the models exhibited inconsistent performance across other subtasks, indicating that current MLLMs still face significant limitations in capturing nuanced human-like reasoning about physical stability.
To assess the statistical significance of differences between human participants and MLLMs at the subtask level, we employed a two-stage bootstrap approach suitable for two independent groups. This method considers participant variability and the limited number of questions (items) in each task, ensuring reliable estimates despite the small sample size. The procedure was applied consistently across all three tasks. Task 1 and Task 2 each consisted of 10 questions, evenly divided between stable and unstable structures, while Task 3 comprised 15 questions partitioned into three subtasks (“diff_block”, “diff_id”, and “diff_level”). Each scenario was analyzed separately to ensure that bootstrap estimates accurately capture performance differences at a fine-grained level.
Human participant responses were represented as matrices of size 14 × number of questions per task, whereas model outputs were represented as 12 × number of questions and treated as fixed for bootstrap resampling. Notably, the set of 12 models (Group B) consisted of both the MLLMs evaluated in this study and those reported in prior work, providing a broader comparative perspective. This approach enabled a direct and fair comparison between humans and models for each subtask while capturing the uncertainty arising from both question-level and participant-level variability.
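As a minimal sketch of how such a two-stage bootstrap can be implemented for a single subtask, the following Python code resamples human participants and question items with replacement while holding the model set fixed and resampling only items for the model group. The iteration count, shared item resampling, and p-value convention are illustrative assumptions rather than the exact procedure used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_stage_bootstrap(human, model, n_boot=10_000):
    """Two-stage bootstrap for the accuracy gap between two independent groups.

    human: 0/1 matrix of shape (n_participants, n_items), e.g. 14 x 10
    model: 0/1 matrix of shape (n_models, n_items), e.g. 12 x 10 (fixed set)
    """
    n_h, n_items = human.shape
    observed_diff = human.mean() - model.mean()

    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Stage 1: resample human participants with replacement.
        part_idx = rng.integers(0, n_h, n_h)
        # Stage 2: resample items with replacement; the same item draw is
        # applied to both groups so they are compared on identical questions.
        item_idx = rng.integers(0, n_items, n_items)
        a_mean = human[part_idx][:, item_idx].mean()
        # The model set is treated as fixed; only items vary for Group B.
        b_mean = model[:, item_idx].mean()
        diffs[b] = a_mean - b_mean

    # Two-sided p-value: how often the bootstrap difference crosses zero.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed_diff, min(p, 1.0)
```

In this sketch, observed_diff would correspond to the A–B gap reported in Table 4, and bootstrap p-values falling below the reporting threshold would appear as p < 0.001.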
Table 4 summarizes the bootstrap results comparing human participants (Group A) and MLLMs (Group B) across the three tasks and their subtasks. Group A and Group B means represent the average accuracy of humans and models, respectively, while the observed difference (A–B) reflects the accuracy gap between the two groups. In Task 1, humans achieved significantly higher accuracy than models for both the stable (0.59, p = 0.001) and unstable (0.22, p = 0.039) subtasks, demonstrating stronger physical reasoning in identifying stability. In Task 2, humans again outperformed models, with differences of 0.62 (p < 0.001) for the stable and 0.51 (p < 0.001) for the unstable subtask. This confirms that human participants apply more reliable reasoning when evaluating relative stability across multiple structures.
In Task 3, the results were more varied. Humans showed clear advantages in “diff_block” (0.63, p < 0.001) and “diff_level” (0.55, p = 0.001), indicating stronger reasoning when assessing the consequences of structural changes. In “diff_id”, the difference (0.23) was not statistically significant (p = 0.167), suggesting that MLLMs performed closer to humans when the task relied on recognizing object identity. It is worth noting that several subtasks report p-values of 0.000. These should be interpreted as p < 0.001, reflecting extremely small values below the reporting threshold and providing strong evidence of significant differences. Overall, humans consistently demonstrated superior physical reasoning across most tasks, while MLLMs exhibited comparable performance only in the binary classification task, where all models showed similar reasoning ability when responding to basic and direct stability questions.