A Tree-Based Search Algorithm with Global Pheromone and Local Signal Guidance for Scientific Chart Reasoning

Zhou, Min; Qi, Zhiheng; Zhu, Tianlin; Vijg, Jan; Huang, Xiaoshui

doi:10.3390/math13172739

by Min Zhou ¹, Zhiheng Qi ², Tianlin Zhu ³, Jan Vijg ⁴,* and Xiaoshui Huang ¹,*

¹ School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China

² School of Basic Medical Sciences and Forensic Medicine, Hangzhou Medical College, Hangzhou 310013, China

³ School of Computer Science and Artificial Intelligence, The Jiangxi University of Finance and Economics, Nanchang 330013, China

⁴ Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, NY 10461, USA

* Authors to whom correspondence should be addressed.

Mathematics 2025, 13(17), 2739; https://doi.org/10.3390/math13172739

Submission received: 28 July 2025 / Revised: 18 August 2025 / Accepted: 23 August 2025 / Published: 26 August 2025

(This article belongs to the Special Issue Multimodal Deep Learning and Its Application in Healthcare)

Abstract

Chart reasoning is a critical task for automating data interpretation in domains such as scientific data analysis and medical diagnostics. It leverages large-scale vision language models (VLMs) to interpret chart images and answer natural language questions, enabling semantic understanding that enhances knowledge accessibility and supports data-driven decision making across diverse domains. In this work, we formalize chart reasoning as a sequential decision-making problem governed by a Markov Decision Process (MDP), thereby providing a mathematically grounded framework for analyzing visual question answering tasks. While recent advances such as multi-step reasoning with Monte Carlo tree search (MCTS) offer interpretable and stochastic planning capabilities, these methods often suffer from redundant path exploration and inefficient reward propagation. To address these challenges, we propose a novel algorithmic framework that integrates a pheromone-guided search strategy inspired by Ant Colony Optimization (ACO). In our approach, chart reasoning is cast as a combinatorial optimization problem over a dynamically evolving search tree, where path desirability is governed by pheromone concentration functions that capture global phenomena across search episodes and are reinforced through trajectory-level rewards. Transition probabilities are further modulated by local signals, which are evaluations derived from the immediate linguistic feedback of large language models. This enables fine-grained decision making at each step while preserving long-term planning efficacy. Extensive experiments across four benchmark datasets, ChartQA, MathVista, GRAB, and ChartX, demonstrate the effectiveness of our approach, with multi-agent reasoning and pheromone guidance yielding success rate improvements of +18.4% and +7.6%, respectively.

Keywords: Markov decision processes; Monte Carlo methods; ant colony optimization; stochastic search algorithms

MSC: 90C40

1. Introduction

Chart reasoning is a question-answering task that requires computational models to interpret a chart image, such as a bar chart, line graph, or pie chart, leveraging both its visual and textual information to produce accurate responses to natural language questions [1]. This task necessitates strong cross-modal understanding: models must perceive both images and text and integrate the two [2,3,4]. Recent advances [4,5] in large vision language models (LVLMs) have significantly improved the ability to extract and reason over such multimodal information [6,7,8], enabling structured and interpretable responses rather than shallow pattern matching [9]. Chart reasoning has broad applications in domains such as medicine and scientific research, where users must interpret complex chart images in response to specific queries [10]. Given the ubiquity of charts across disciplines, such as mathematics and computer science, chart reasoning has the potential to substantially improve knowledge accessibility, reduce cognitive burden, and accelerate data-driven decision making [11,12].

Current chart reasoning algorithms mainly build on stochastic and statistical theory together with foundation models, and they can be divided into two categories: single-step and multi-step approaches. Single-step approaches treat charts as images and directly generate answers using a general multimodal foundation model based on image–text alignment [13,14]. However, these methods, which rely on the ability of a general LLM, often struggle to adapt to domain-specific charts and lack transparency in how the answers are derived [15,16]. In contrast, multi-step chart reasoning decomposes the problem into a sequence of reasoning sub-tasks, allowing interpretability via recursive or parallel evaluation of reasoning branches. Explicit reasoning frameworks such as Chain-of-Thought (CoT) [17] prompting encourage models to produce interpretable, step-by-step rationales. Tree-of-Thought (ToT) [18] further extends this paradigm by enabling parallel exploration of multiple reasoning paths, sampling and evaluating reasoning trajectories in a stochastic decision process. Building on these developments, recent work explores search-based reasoning techniques such as Monte Carlo tree search (MCTS) and its collaborative multi-agent extensions [19,20,21], which provide an advanced way to systematically explore structured reasoning paths.

However, CoT methods often demand extensive prompt engineering and exhibit limited generalization when applied to smaller models [22], while ToT methods incur high computational cost [18]. Tree search methods such as MCTS depend on local heuristics such as the upper confidence bound (UCB), which makes them susceptible to local optima, and their lack of memory or global feedback restricts scalability. Recent innovations, including learning-to-reason frameworks for MLLMs [21,23], propose collective learning mechanisms that integrate collaborative reasoning and structured feedback into the search process. Our framework draws on these ideas to construct interpretable, step-wise reasoning paths, combining the strengths of explicit multi-step modeling and efficient search to enhance the generalization, transparency, and scalability of chart-based question answering.

However, multi-step approaches still face significant challenges due to the lack of effective mechanisms for sharing feedback and coordinating exploration across multiple reasoning paths [24,25], as well as difficulties in defining fine-grained reward signals for intermediate reasoning steps in chart reasoning [26]. As illustrated in Figure 1, these issues often lead to redundant search trajectories and high computational overhead.

To overcome these limitations in the effectiveness and efficiency of multi-step reasoning for chart question answering, we propose a novel multi-agent reasoning algorithm that augments MCTS with a pheromone-guided search mechanism, inspired by Ant Colony Optimization (ACO) [27,28,29]. This mechanism simulates collaborative exploration and feedback propagation, allowing multiple reasoning agents to share information and collectively discover high-quality reasoning paths. Unlike traditional UCB [30], which is based solely on local statistical signals within a single search tree, we formulate chart reasoning as a combinatorial optimization problem over a dynamic search graph, where nodes correspond to intermediate reasoning states and edges represent reasoning transitions. Each path is assigned a pheromone score, a global desirability function updated based on historical trajectory performance, while local edge utilities are computed from heuristic language-model feedback. This construction yields a dual-signal architecture: global memory via pheromone reinforcement and local evaluation via rollout-based heuristics.

Specifically, we introduce two novel mechanisms that enhance multi-step chart reasoning by integrating both global and local decision signals within the MCTS framework. First, we propose a pheromone-guided path selection strategy, which incorporates global knowledge accumulated across reasoning trajectories. Drawing inspiration from Ant Colony Optimization (ACO), we assign a pheromone score to each reasoning path that reflects its historical effectiveness across multiple rollouts. Paths that have led to correct or high-quality answers receive higher pheromone reinforcement, increasing their future selection probability. This global signal complements the local feedback (e.g., immediate validation or step-wise reasoning quality as traditional UCB value) obtained during individual path traversal. To balance exploration and exploitation, we design a hybrid weighting function that adaptively combines global pheromone strength and local evaluation signals, ensuring that both historically strong paths and promising new candidates are considered during a search. This strategy mitigates the risk of being trapped in local optima and enables the model to focus on informative, high-potential reasoning trajectories.

Second, we propose a Dynamic Pheromone Update Strategy that adaptively refines the search space over time. After generating intermediate reasoning steps, an LLM-based verifier evaluates their quality. If a reasoning path surpasses a predefined quality threshold, its pheromone level is increased, reinforcing its global desirability. Conversely, paths that repeatedly yield low-quality steps undergo pheromone decay, reducing their influence. Unlike static or heuristic weighting, our adaptive reinforcement-decay mechanism dynamically reshapes the search landscape based on both recent local reasoning quality and long-term performance trends. This allows the system to continuously shift focus toward effective reasoning strategies, while suppressing unproductive ones, enhancing both efficiency and robustness in complex chart reasoning tasks.

We conduct comprehensive experiments across four benchmark datasets: ChartQA, MathVista, GRAB, and ChartX (Table 1 and Table 2). Further ablation studies validate the effectiveness of our design. Comparing different iteration budgets (Table 3), we find that fifteen iterations strike a practical balance between performance and efficiency, reaching a 71.2% success rate. Model collaboration experiments (Table 4) demonstrate that multi-agent reasoning boosts the success rate by +18.4% compared to single-model variants. In particular, incorporating the pheromone mechanism (Table 5) produces a +7.6% improvement in the success rate, highlighting its crucial role in guiding effective reasoning, as illustrated in Figure 1.

The main contributions of this work are as follows.

First, we propose a novel multi-agent pheromone-guided Monte Carlo tree search algorithm that utilizes global historical information to improve exploration in reasoning spaces.
Second, we integrate pheromone weighting into the UCB-based selection process, enabling a balanced trade-off between exploiting successful reasoning paths and exploring new ones based on historical performance.
Third, we introduce a dynamic pheromone update mechanism, guided by LLM feedback, into the backpropagation process, yielding an adaptive optimization scheme that improves both robustness and search convergence, particularly on tasks requiring deep and complex reasoning chains.

2. Related Work

Chart reasoning methods can be broadly divided into single-step and multi-step approaches, depending on whether the model generates answers in a single forward pass or through iterative refinement. Single-step chart reasoning emphasizes efficiency by producing direct outputs without intermediate steps [13,14], while multi-step chart reasoning involves exploring and refining multiple reasoning paths to improve accuracy on complex tasks [17,20].

2.1. Single-Step Reasoning Paradigms

Single-step chart reasoning refers to the process by which a model generates an output through a single forward reasoning pass upon receiving input, without relying on multi-turn interactions or intermediate planning [15,16]. This approach is widely used in image reasoning tasks to enhance response efficiency and streamline the inference pipeline [34]. Broadly, single-step reasoning methods can be categorized into three types: those based on unimodal large language models, those based on multimodal large models, and those that incorporate reinforcement learning techniques.

Unimodal approaches focus on converting visual information into structured or semi-structured textual representations—such as image captions, OCR outputs, or chart data—which are then input into large language models (e.g., the GPT series) [35,36]. These models utilize their strong language understanding and reasoning capabilities to perform tasks like question answering and inference entirely within the text domain. This strategy benefits from reduced model complexity and compatibility with powerful pretrained LLMs, making it a lightweight and interpretable solution. However, unimodal approaches face notable limitations in the context of chart reasoning [37]. The conversion process often introduces noise or loss of critical visual–spatial relationships—such as relative positions, colors, and scale annotations—that are essential for accurate interpretation. Additionally, they depend heavily on the quality and completeness of upstream extraction tools (e.g., OCR), which may fail on complex or stylized charts [38]. As a result, unimodal methods can struggle with generalization across diverse chart formats and may be inadequate for handling semantically rich or visually intricate reasoning tasks.

Multimodal approaches leverage models capable of simultaneously processing both images and text. The input typically includes an image paired with a natural language query. Representative models such as the Qwen-VL series [4], the LLaVA series [39], and Gemini [2] employ cross-modal alignment mechanisms to enable joint understanding and reasoning across visual and textual modalities. This allows them to effectively handle more complex visual scenes and semantically rich tasks, making them well-suited for interpreting diverse chart types [40]. However, despite their flexibility, multimodal models often function as black boxes, lacking transparency in how answers are derived—posing challenges for interpretability and trustworthiness in high-stakes applications [41,42]. Moreover, their reliance on large-scale pretraining and high computational cost can limit scalability and accessibility in practical deployments.

Reinforcement learning-based approaches build upon the strengths of multimodal models by incorporating reward signals to guide the optimization of modality fusion and decision-making processes [43]. This approach enhances the model’s capacity to perceive and reason over complex visual structures and fine-grained semantic details, leading to improved accuracy and robustness in cross-modal reasoning tasks. Moreover, reinforcement learning provides a natural framework for optimizing intermediate reasoning steps and integrating feedback into the model’s decision process [26]. However, these methods face several challenges. Reinforcement learning pipelines are typically data- and compute-intensive, requiring extensive sampling and tuning to converge [44]. This limits their scalability and practical applicability, particularly in settings where real-time performance or interpretability is critical. Furthermore, without careful architecture or training design, the resulting policies may still suffer from limited generalization across diverse chart types and reasoning formats.

2.2. Multi-Step Reasoning Paradigms

The main idea of multi-step chart reasoning is to iteratively explore and refine potential solutions through multiple reasoning steps, enabling a more accurate and thorough interpretation of complex chart data. These methods can be broadly categorized into three types: Best of N, Chain-of-Thought, and Tree search.

Best of N (BoN), or repeated sampling, involves generating multiple independent solutions by prompting the language model several times with the same input [45]. A verifier then selects the best candidate based on criteria such as accuracy or the number of tests passed. Although this approach increases the likelihood of finding a correct solution, it lacks iterative refinement and interaction between samples, making it less effective for complex reasoning tasks [46].

Chain-of-Thought (CoT) prompting guides a model to generate step-by-step rationales before the final answer, which can improve performance and interpretability on complex reasoning tasks [17]. However, CoT’s effectiveness is highly dependent on careful prompt engineering, limiting its flexibility and scalability. The reasoning quality relies on both the prompt design and the model’s inherent capabilities [47], and the approach often fails with smaller models that struggle to maintain a coherent thought process [17].

Tree of Thought (ToT) extends CoT by exploring multiple reasoning paths in parallel, enhancing solution space coverage [18]. While effective, this method introduces significant computational overhead due to the exponential expansion of its search tree. Unlike guided search techniques such as Monte Carlo tree search (MCTS) [20,21,26], ToT can suffer from inefficient exploration without strong heuristics. This often leads to redundant branches or to the search becoming trapped in local optima [48]. Moreover, challenges in integrating global feedback across branches limit its scalability for complex, data-intensive tasks like chart reasoning.

Biologically Inspired Algorithms, such as Ant Colony Optimization (ACO), excel at solving problems on well-defined static graphs [27] but are difficult to apply to the dynamic and semantic reasoning of LLMs. The core challenge is defining a semantic heuristic to guide the search, a task traditional metrics like distance cannot perform. A recent approach [49] addresses this by integrating ACO within a Tree-of-Thoughts framework, using fine-tuned LLM ants to provide the necessary semantic guidance. However, this technique has yet to be explored in the Multimodal Large Vision Model domain.

Despite improvements in LVLMs, existing tree search methods such as ToT and MCTS still face two critical limitations in complex chart reasoning. First, they fail to effectively propagate success signals across different reasoning paths. This leads to redundant exploration and inefficient search strategies. Second, these methods frequently suffer from high computational overhead due to the exponential growth of the search space, particularly because they rely on local heuristics and struggle without well-defined global guidance. To address this specific gap, our method proposes a dual-signal framework that integrates a pheromone-guided search algorithm. This approach combines local evaluation signals with global, trajectory-level feedback to overcome the aforementioned limitations.

3. Methodology

3.1. Problem Formulation as a Markov Decision Process (MDP)

To establish a rigorous mathematical foundation for chart reasoning, we formulate the task as a sequential decision-making problem modeled by a Markov Decision Process (MDP). This problem exhibits a sparse-reward structure, where intermediate reasoning steps receive no supervision and rewards are only provided upon final answer generation as shown in Figure 2. Such settings necessitate advanced planning strategies, notably Monte Carlo tree search (MCTS), and benefit substantially from global credit assignment mechanisms enabled by pheromone-based reinforcement. The MDP framework captures the inherent stochasticity of language model outputs and the agent’s control over reasoning actions, with the objective of identifying an optimal policy that maximizes expected cumulative reward.

Our MDP is formally defined by the tuple $(S, A, P, R, γ)$:

State Space ($S$): A state $s_{t} \in S$ at any step $t$ represents the complete history of the reasoning process. It is defined by the initial chart–question pair $\{p, Q\}$ and the sequence of reasoning steps (actions) generated thus far, i.e., $s_{t} = \{p, Q, a_{0}, \dots, a_{t-1}\}$.
Action Space ($A$): An action $a_{t} \in A(s_{t})$ is the generation of the next intermediate reasoning step. This corresponds to a language model from a cooperative set $\{π_{1}, \dots, π_{K}\}$ producing a subsequent sentence that extends the current reasoning trajectory from state $s_{t}$.
Transition Probability ($P$): The transition function $P(s_{t+1} \mid s_{t}, a_{t})$ defines the probability of moving from state $s_{t}$ to state $s_{t+1}$ after taking action $a_{t}$. This function is implicitly defined by the stochastic nature of the language models, and our agent samples from it by invoking a model $π_{k}$.
Reward Function ($R$): Rewards are sparse and assigned only at the terminal state of a completed trajectory $π$. Let $\hat{A}(π)$ be the final answer generated by the trajectory. The terminal reward $r = R(π)$ is a binary score based on the semantic similarity, $\mathrm{sim}$, between the generated answer $\hat{A}(π)$ and the ground-truth answer $A$. The reward function is formally defined using the indicator function as:

$R(π) = \mathbb{I}(\mathrm{sim}(\hat{A}(π), A) \geq ϵ)$ (1)

where $\mathbb{I}(\cdot)$ is the indicator function, which returns 1 if the condition is true and 0 otherwise, and $ϵ$ is a predefined similarity threshold for correctness.
Discount Factor ($γ$): We set the discount factor $γ = 1$, as the primary reward is terminal.
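The state and reward definitions above can be illustrated with a minimal Python sketch. The `State` class, the `sim` stand-in (a character-level ratio from the standard library), and the threshold `eps=0.9` are illustrative assumptions; the similarity measure and threshold actually used are not specified in this section.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Tuple


@dataclass(frozen=True)
class State:
    """s_t = {p, Q, a_0, ..., a_{t-1}}: chart, question, and the steps so far."""
    chart: str                     # placeholder for the chart image p (e.g. a file path)
    question: str                  # natural language question Q
    steps: Tuple[str, ...] = ()    # reasoning steps a_0 .. a_{t-1}

    def extend(self, action: str) -> "State":
        """Return the successor state after appending one reasoning step."""
        return State(self.chart, self.question, self.steps + (action,))


def sim(predicted: str, truth: str) -> float:
    """Assumed stand-in similarity in [0, 1] (not the measure used in the paper)."""
    return SequenceMatcher(None, predicted.strip().lower(), truth.strip().lower()).ratio()


def reward(predicted: str, truth: str, eps: float = 0.9) -> int:
    """Equation (1): 1 if sim(predicted, truth) >= eps, else 0 (eps is assumed)."""
    return int(sim(predicted, truth) >= eps)


s1 = State("chart.png", "Which bar is tallest?").extend("Read the y-axis labels.")
```

Because the reward is an indicator, every completed trajectory scores either 0 or 1, which is what makes the problem sparse-reward.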

The goal is to find an optimal policy $π^{*} : S \to A$ that maximizes the expected cumulative reward. Formally, the optimal policy $π^{*}$ is the one that satisfies:

$π^{*} = \underset{π \in Π}{\arg\max} \; \mathbb{E}\left[\sum_{t=0}^{\infty} γ^{t} R(s_{t}, a_{t}) \mid π, s_{0}\right]$ (2)

where the expectation is taken over trajectories generated by following policy $π$ from an initial state $s_{0}$, and $\sum_{t=0}^{\infty}$ denotes the summation from step $t = 0$ to infinity.
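As a short worked consequence of Equations (1) and (2), the objective reduces to a success probability. The derivation below assumes every trajectory terminates after finitely many steps, and it writes $\xi$ for a completed trajectory, a symbol introduced here only to avoid clashing with the policy $π$.

```latex
\begin{aligned}
\mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, \pi, s_0\Big]
  &= \mathbb{E}_{\xi \sim \pi}\big[R(\xi)\big]
     && \text{($\gamma = 1$, reward only at the terminal step)} \\
  &= \mathbb{E}_{\xi \sim \pi}\big[\mathbb{I}\big(\mathrm{sim}(\hat{A}(\xi), A) \geq \epsilon\big)\big]
     && \text{(Equation (1))} \\
  &= \Pr_{\xi \sim \pi}\big(\mathrm{sim}(\hat{A}(\xi), A) \geq \epsilon\big).
\end{aligned}
```

Under these assumptions, maximizing the expected cumulative reward is the same as maximizing the probability that the final answer is judged correct.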

3.2. A Dual-Signal MCTS Algorithm for Solving the MDP

The Markov Decision Process (MDP) defined for this task is characterized by an enormous state space and black-box transition and reward functions, rendering it intractable for classical dynamic programming methods. To solve this challenging search problem, we propose a pheromone-guided multi-agent Monte Carlo tree search algorithm designed around a central insight: an optimal search policy should be informed by two distinct signals. These are a local signal, reflecting the immediate, in-episode utility of an action, and a global signal, representing the accumulated historical success of an action across all past episodes.

Let the total number of search iterations be $N$. We denote the trajectory generated during the $j$-th iteration ($j = 1, \dots, N$) as $π_{j}$. This trajectory consists of a sequence of state–action pairs, $π_{j} = \{(s_{i,j}, a_{i,j})\}_{i=1}^{T_{j}}$, where $T_{j}$ is the length of the trajectory.

As illustrated in Figure 3, we integrate these signals by adapting the Selection and Backpropagation phases of the pheromone-guided MCTS algorithm.

3.2.1. Selection: Hybrid Policy Guided by Local and Global Signals

During the tree traversal of the $j$-th iteration, the selection of an action $a_{i,j}$ from a state $s_{i-1,j}$ is governed by a hybrid scoring function. This function serves as our selection criterion for the “best reasoning action” by intelligently balancing two distinct sources of information: the immediate, local utility of an action and its accumulated, global desirability based on historical performance. The action selected is the one that achieves the highest combined score according to the following maximization:

$a_{i,j}^{*} = \underset{a \in A(s_{i-1,j})}{\arg\max} \left( \underbrace{Q(s_{i-1,j}, a) + c \cdot \sqrt{\frac{\log N(s_{i-1,j})}{N(s_{i-1,j}, a)}}}_{\text{Local Signal (UCB)}} + α \cdot \underbrace{τ(s_{i-1,j}, a)}_{\text{Global Signal (Pheromone)}} \right)$ (3)

where:

The Local Signal (UCB Score) is the part of the formula that balances two competing desires:
− Exploitation ($Q(s, a)$): This term represents the average reward of taking action $a$.
− Exploration ($c \cdot \sqrt{\dots}$): This term gives a bonus to actions that have been tried less often. $N(s)$ is the visit count of the parent state, $N(s, a)$ is the visit count of the action, and $c$ is the exploration constant.
The Global Signal, $τ(s, a)$, is our ACO-inspired pheromone value, representing the historical desirability of taking action $a$ in state $s$.
The Balancing Factor ($α$) is a hyperparameter that controls the influence of the global pheromone signal.
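The hybrid score of Equation (3) can be sketched in a few lines of Python. The function names, the default `c=1.4`, and the convention of returning infinity for unvisited actions are assumptions made for illustration; the text does not state the value of $c$ or how zero visit counts are handled.

```python
import math


def hybrid_score(q, n_parent, n_action, tau, c=1.4, alpha=1.0):
    """Equation (3): UCB local signal plus alpha-weighted pheromone."""
    if n_action == 0:
        return math.inf  # assumption: unvisited actions are tried first
    explore = c * math.sqrt(math.log(n_parent) / n_action)
    return q + explore + alpha * tau


def select_action(edges, n_parent, c=1.4, alpha=1.0):
    """edges maps action -> (Q, N(s, a), tau); returns the arg max of Equation (3)."""
    return max(
        edges,
        key=lambda a: hybrid_score(edges[a][0], n_parent, edges[a][1], edges[a][2], c, alpha),
    )


# Two actions with equal visit counts: "a" has the better mean reward,
# "b" carries more pheromone from earlier successful trajectories.
edges = {"a": (0.6, 4, 1.0), "b": (0.5, 4, 2.9)}
local_only = select_action(edges, n_parent=8, alpha=0.0)
with_pheromone = select_action(edges, n_parent=8, alpha=1.0)
```

With `alpha=0.0` the choice is driven by the UCB term alone and picks `"a"`; with `alpha=1.0` the larger pheromone value tips the choice to `"b"`, which is the behavior the balancing factor is meant to control.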

3.2.2. Expansion and Simulation with Multi-Agent Collaboration

Our collaborative framework is designed to leverage the “diverse reasoning capabilities” of multiple agents during the Expansion phase through a Parallel Expansion strategy. The Expansion phase begins when the selection process reaches a leaf node $s_{L,j}$. Instead of selecting a single model, we query all cooperative agents in our framework in parallel. Each agent $π_{k}$ generates a candidate for the next reasoning step, $a_{new,j}^{(k)}$, based on its unique capabilities. This results in a diverse set of potential actions $\{a_{new,j}^{(1)}, a_{new,j}^{(2)}, \dots, a_{new,j}^{(K)}\}$. The system then evaluates these candidates using the hybrid scoring function (Equation (3)), and the action with the highest score is selected as the definitive next step, $a_{new,j}$. This approach ensures that the most promising reasoning path is chosen at each step, benefiting from the collective intelligence of the entire agent team.
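A minimal sketch of the parallel expansion step is shown below. The agents are hypothetical stub callables standing in for the vision language models, and `score_fn` is a placeholder for the candidate evaluation; how freshly generated candidates (which have no visit statistics yet) are scored is not detailed in this section, so the scorer is left as a parameter.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_expand(state, agents, score_fn):
    """Query every agent for a candidate step, then keep the best-scoring one.

    agents:   list of callables state -> candidate step (hypothetical stubs here)
    score_fn: callable (state, candidate) -> float, an assumed stand-in scorer
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves agent order, so candidates[k] comes from agents[k].
        candidates = list(pool.map(lambda agent: agent(state), agents))
    best = max(candidates, key=lambda a: score_fn(state, a))
    return best, candidates


# Stub agents and a toy scorer (longer steps score higher), for illustration only.
stub_agents = [lambda s: "Read the y-axis scale.", lambda s: "Guess."]
best_step, all_steps = parallel_expand(("chart", "question"), stub_agents,
                                       score_fn=lambda s, a: len(a))
```

Threads are used here because model calls are typically I/O-bound requests; a real deployment might batch the calls instead.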

This new edge $(s_{L,j}, a_{new,j})$ is added to the tree, and its pheromone value is initialized to a base level:

$τ(s_{L,j}, a_{new,j}) = τ_{0}$ (4)

From the resulting new state, $s_{new,j}$, the Simulation phase completes a full reasoning trajectory $π_{j}$. The trajectory is then evaluated by the reward model $R$ to obtain a scalar score $r_{j} \in [0, 1]$.

3.2.3. Backpropagation: Unified Update of Local and Global Values

The backpropagation phase uses the terminal reward $r_{j}$ of the $j$-th trajectory to update the information along its visited path. For each state–action pair $(s_{i,j}, a_{i,j})$ on the path ($i = 1, \dots, T_{j}$), we update both the local and global signals:

Local Value Update (MCTS): The standard MCTS update is performed. First, the visit count is incremented:

$N(s_{i,j}, a_{i,j}) = N(s_{i,j}, a_{i,j}) + 1$ (5)

Then, the mean action value is updated with the new reward $r_{j}$:

$Q(s_{i,j}, a_{i,j}) = Q(s_{i,j}, a_{i,j}) + \frac{r_{j} - Q(s_{i,j}, a_{i,j})}{N(s_{i,j}, a_{i,j})}$ (6)

Global Pheromone Update (ACO): Crucially, all agents share and contribute to the same global pheromone map. Concurrently, the global pheromone trail is updated for all pairs $(s_{i,j}, a_{i,j})$ on the trajectory $π_{j}$, regardless of which agent generated that specific step. This follows the standard ACO reinforcement-and-evaporation rule:

$τ(s_{i,j}, a_{i,j}) = (1 - ρ) \cdot τ(s_{i,j}, a_{i,j}) + Δτ(s_{i,j}, a_{i,j})$ (7)

where $ρ \in (0, 1)$ is the pheromone evaporation rate. The added pheromone $Δτ(s_{i,j}, a_{i,j})$ is defined as:

$Δτ(s_{i,j}, a_{i,j}) = \begin{cases} β \cdot r_{j}, & \text{if } (s_{i,j}, a_{i,j}) \text{ is on path } π_{j} \text{ and reward } r_{j} \geq θ \\ 0, & \text{otherwise} \end{cases}$ (8)

where $β$ is the reinforcement strength. This update ensures that successful reasoning paths found by any agent are globally reinforced, benefiting the entire collective over time.
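The updates in Equations (5) to (8) can be traced with a small Python sketch. The dictionary layout and the threshold `theta=0.5` are assumptions made for illustration (no value of $θ$ is given in this section); `rho=0.1` and `beta=2.0` follow the settings reported in Section 4.1.

```python
def backpropagate(path, stats, r, rho=0.1, beta=2.0, theta=0.5):
    """Apply Equations (5)-(8) to every (state, action) edge on the path.

    stats maps an edge to a dict with keys "N", "Q", "tau".
    theta=0.5 is an assumed threshold, not a value taken from the paper.
    """
    for edge in reversed(path):
        e = stats[edge]
        e["N"] += 1                                  # Eq. (5): visit count
        e["Q"] += (r - e["Q"]) / e["N"]              # Eq. (6): running mean of rewards
        delta = beta * r if r >= theta else 0.0      # Eq. (8): reinforce only good paths
        e["tau"] = (1.0 - rho) * e["tau"] + delta    # Eq. (7): evaporate, then deposit


edge = ("s0", "a0")
stats = {edge: {"N": 0, "Q": 0.0, "tau": 1.0}}       # tau starts at tau_0 = 1.0
backpropagate([edge], stats, r=1.0)                  # successful trajectory
```

After this single successful trajectory the edge has $N = 1$, $Q = 1$, and $τ = 0.9 \cdot 1.0 + 2 \cdot 1 = 2.9$. A subsequent failed trajectory ($r = 0$) would only evaporate the trail, giving $τ = 0.9 \cdot 2.9 = 2.61$ and $Q = 0.5$.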

The complete process is summarized in Algorithm 1.

Algorithm 1 Pheromone-guided MCTS for Solving the MDP

Input: Problem instance $x = \{p, Q, A\}$, a set of language models $\{π_{1}, π_{2}, \dots, π_{K}\}$, reward model $R$, parameters $α, β, ρ, θ$, maximum iterations $N$
Output: The best reasoning action from the root state.
Initialize Tree $T$ with root node $s_{0} = \{p, Q\}$
Initialize $Q(s, a) = 0$, $N(s, a) = 0$, $τ(s, a) = τ_{0}$ for all $(s, a)$
for $iteration = 1$ to $N$ do
    Initialize path for backpropagation: $path = []$
    $s \leftarrow s_{0}$
    Selection
    while $s$ is a non-terminal, fully expanded node in $T$ do
        Find action $a^{*}$
        Append $(s, a^{*})$ to $path$
        $s \leftarrow$ the state reached from $s$ after action $a^{*}$
    end while
    Expansion
    if $s$ is not a terminal state then
        Choose a model $π_{k}$ from $\{π_{1}, \dots, π_{K}\}$
        Sample a new action $a_{new} \sim π_{k}(\cdot \mid s)$ that is not yet in $T$
        Add edge $(s, a_{new})$ and its resulting state $s_{new}$ to $T$
        Initialize $Q(s, a_{new}) = 0$, $N(s, a_{new}) = 0$, $τ(s, a_{new}) = τ_{0}$
        Append $(s, a_{new})$ to $path$
        $s_{rollout} \leftarrow s_{new}$
    else
        $s_{rollout} \leftarrow s$
    end if
    Simulation
    Complete a trajectory from $s_{rollout}$ using a default policy $π_{def}$
    Obtain terminal reward $r = R(trajectory) \in [0, 1]$
    Backpropagation
    for each state-action pair $(s, a)$ in reversed $path$ do
        Update local value $Q$
        Update global pheromone $τ$
    end for
end for
return best action
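To make the control flow of Algorithm 1 concrete, the following is a toy, self-contained Python sketch rather than the authors' implementation. Several choices are assumptions made only so the example runs: states are tuples of step strings, a node counts as fully expanded once it has `max_children` children, the expanding model and the rollout policy are picked uniformly at random from stub agents, `c`, `theta`, and `max_children` are arbitrary, and the best action is taken to be the most visited root child.

```python
import math
import random


def pheromone_mcts(root, agents, reward_fn, is_terminal, n_iter=15, max_children=2,
                   c=1.4, alpha=1.0, beta=2.0, rho=0.1, theta=0.5, tau0=1.0, seed=0):
    """Toy version of Algorithm 1. A state is a tuple of reasoning steps."""
    rng = random.Random(seed)
    children = {root: []}   # state -> child states already in the tree
    visits = {root: 0}      # N(s)
    stats = {}              # child state -> N, Q, tau of the edge leading into it

    def score(parent, child):
        e = stats[child]
        if e["N"] == 0:
            return math.inf  # assumption: untried edges are selected first
        bonus = c * math.sqrt(math.log(visits[parent]) / e["N"])
        return e["Q"] + bonus + alpha * e["tau"]  # Equation (3)

    for _ in range(n_iter):
        s, path = root, []
        # Selection: descend while the node is non-terminal and fully expanded.
        while not is_terminal(s) and len(children[s]) >= max_children:
            best_child = max(children[s], key=lambda ch: score(s, ch))
            path.append((s, best_child))
            s = best_child
        # Expansion: one randomly chosen agent proposes the next step.
        if not is_terminal(s):
            new = s + (rng.choice(agents)(s),)
            if new not in stats:  # only add steps that are not yet in the tree
                children[s].append(new)
                children[new], visits[new] = [], 0
                stats[new] = {"N": 0, "Q": 0.0, "tau": tau0}
            path.append((s, new))
            s = new
        # Simulation: random rollout to a terminal state (assumed default policy).
        rollout = s
        while not is_terminal(rollout):
            rollout = rollout + (rng.choice(agents)(rollout),)
        r = reward_fn(rollout)
        # Backpropagation: Equations (5)-(8) along the visited path.
        for parent, child in reversed(path):
            visits[parent] += 1
            e = stats[child]
            e["N"] += 1
            e["Q"] += (r - e["Q"]) / e["N"]
            e["tau"] = (1 - rho) * e["tau"] + (beta * r if r >= theta else 0.0)

    root_stats = {ch[-1]: stats[ch] for ch in children[root]}
    best = max(root_stats, key=lambda a: root_stats[a]["N"]) if root_stats else None
    return best, root_stats


# Toy problem: two stub "agents", depth-2 trajectories, reward = share of good steps.
toy_agents = [lambda s: "good", lambda s: "bad"]
toy_reward = lambda traj: sum(step == "good" for step in traj) / len(traj)
toy_terminal = lambda s: len(s) >= 2
best, root_stats = pheromone_mcts((), toy_agents, toy_reward, toy_terminal)
```

Every iteration passes through exactly one root edge, so the visit counts in `root_stats` sum to `n_iter`. With a fixed `seed` the run is reproducible, which makes the effect of changing `alpha`, `rho`, or `beta` easy to compare.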

4. Experiments

4.1. Implementation Details

We evaluated our proposed method on the four representative benchmark datasets detailed in Table 1. Each dataset targets different facets of chart-based and mathematical visual reasoning, providing a comprehensive test of our algorithm’s capabilities. We utilize Qwen2-VL-7B, LLaMA-3.2-11B-Vision-Instruct, and Qwen2-VL-72B as cooperative agents within the search framework. The maximum number of search iterations is set to 15. All experiments are conducted using eight NVIDIA RTX 4090 GPUs. The pheromone mechanism is initialized with a base level $τ_{0} = 1.0$, with hyperparameters set as $α = 1$, $β = 2$, and the pheromone evaporation rate $ρ = 0.1$. For all experiments, we used the official test splits provided by each benchmark. No additional preprocessing was applied to the images. Textual inputs and outputs were tokenized using the standard tokenizer associated with each language model.

4.2. Main Results

Accuracy comparison across datasets. To comprehensively evaluate the performance of our algorithm, we compared accuracy across multiple datasets. The results are summarized in Table 2. We selected three models for comparison: the pheromone-guided Monte Carlo tree search (Pheromone_MCTS), the MulBerry model, and the Qwen2VL_7B model. These models were tested on four datasets: ChartQA_test, MathVista, GRAB, and ChartX. The Pheromone_MCTS model achieved the highest accuracy of 84.80% on the ChartQA_test dataset, significantly outperforming the other two models. On the MathVista dataset, the MulBerry model performed best with an accuracy of 63.10%, whereas the Pheromone_MCTS and Qwen2VL_7B models achieved accuracies of 60.20% and 58.20%, respectively. On the GRAB and ChartX datasets, the Pheromone_MCTS model also held an advantage, achieving accuracies of 12.40% and 37.41%, respectively. In contrast, the Qwen2VL_7B model performed relatively weakly on these two datasets.

Accuracy comparison across different maximum iterations. We analyze the impact of the iteration budget on our algorithm’s performance, as shown in Table 3. Pheromone_MCTS with fifteen iterations achieved a 71.2% success rate and required 118.2 s on average; we use it as the baseline for both accuracy and efficiency. Reducing the budget to ten iterations lowered the success rate to 45.2%, a drop of 26.0 percentage points. Although this variant completed in 108.4 s, saving 9.8 s relative to the baseline, it yielded a lower performance score of 0.634, indicating a poor balance between speed and accuracy. Conversely, increasing the budget to twenty iterations raised the success rate to 76.8%, a gain of 5.6 percentage points over the baseline. However, this came at the cost of increased computational time, averaging 129.7 s, which is 11.5 s more than the baseline. Overall, these findings support fifteen iterations as the efficiency–performance sweet spot for Pheromone_MCTS, striking a practical balance between success rate and computational overhead.
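The diminishing returns described above can be made explicit by dividing each gain in success rate by the extra time it costs. The snippet below is an illustrative calculation on the Table 3 values only; the function name `marginal_gains` is hypothetical, and "points per extra second" is a derived quantity that the paper does not report.

```python
# Values copied from Table 3: (iterations, success rate in %, mean time in s).
runs = [(10, 45.2, 108.4), (15, 71.2, 118.2), (20, 76.8, 129.7)]


def marginal_gains(rows):
    """Return (from_iters, to_iters, success points gained per extra second)."""
    out = []
    for (i0, s0, t0), (i1, s1, t1) in zip(rows, rows[1:]):
        out.append((i0, i1, round((s1 - s0) / (t1 - t0), 2)))
    return out


gains = marginal_gains(runs)
for lo, hi, g in gains:
    print(f"{lo} -> {hi} iterations: {g} points per extra second")
```

Moving from ten to fifteen iterations buys roughly 2.65 points per additional second, while moving from fifteen to twenty buys roughly 0.49, which is the quantitative sense in which fifteen iterations sits near the knee of the curve.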

4.3. Ablation Study

To validate the impact of each critical component on the model’s overall performance, we conducted ablation experiments along three dimensions: (1) the model collaboration mechanism, (2) the pheromone-guided mechanism, and (3) hyperparameter sensitivity, using success rate and average decision-making time as the core evaluation metrics.

Ablation Study on Model Collaboration. We conduct an ablation study to evaluate the effectiveness of dual-model collaborative reasoning in our enhanced algorithm, as summarized in Table 4. The dual-agent collaboration achieves a success rate of 71.2%, significantly outperforming both single-model settings: 52.8% for a single Qwen2VL_7B model and 58.4% for a single LLaMA-11B Vision model. This corresponds to absolute improvements of 18.4 and 12.8 percentage points, respectively, when switching from a single-model to a collaborative setup. In terms of efficiency, the dual-model configuration completes execution in 117.6 s, which is faster than both the single Qwen2VL_7B model (125.1 s) and the single LLaMA-11B Vision model (132.1 s), contradicting the common assumption that collaboration necessarily slows down computation.

Ablation Study on Pheromone-Guided Search Mechanism. We conducted an ablation study to evaluate the effectiveness of pheromone-based search guidance in our algorithm, as shown in Table 5. Pheromone_MCTS, which incorporates the pheromone mechanism, achieved a 71.2% success rate. Removing this mechanism reduced the success rate to 63.6%, a drop of 7.6 percentage points that underscores its importance. The no-pheromone variant also required 5.5 s more to execute (123.1 s vs. 117.6 s). These findings indicate that the pheromone mechanism improves both reasoning quality and computational efficiency.

Ablation Study on Hyperparameter Analysis. We conducted extensive ablation studies to systematically evaluate the impact of hyperparameter choices. These experiments assessed the influence of the global pheromone weight (α), the local exploration weight (β), and the pheromone evaporation rate (ρ) on both success rate and average inference time, with results summarized in Figure 4. The most critical finding is the validation of the pheromone guidance mechanism: a direct ablation setting α = 0 resulted in a drop of 7.1 percentage points in success rate, from 71.2% to 64.1%. The model’s performance peaked at our default of α = 1.0, with higher values causing a slight degradation. The exploration weight β also exhibited a clear optimal range. A low value (β = 0.5) yielded poor accuracy, whereas a high value (β = 10.0) not only decreased the success rate but also substantially increased inference time. Similarly, for the pheromone evaporation rate ρ, both very low (ρ = 0.01) and high (ρ = 0.5) rates were suboptimal. The data confirm that our chosen configuration (α = 1.0, β = 2.0, ρ = 0.1) achieves the highest success rate at reasonable computational cost and lies within a stable, high-performing region, indicating the robustness of our approach (Figure 4).
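To illustrate the roles of α and β, the sketch below uses the classical Ant Colony Optimization transition rule, in which the probability of choosing a child is proportional to τ^α · η^β. This is an assumption made for illustration: the selection policy actually used in Section 3.2.1 combines UCB with pheromone scores and is not reproduced here, and the function name and the sample values of `tau` and `eta` are hypothetical.

```python
def transition_probs(pheromone, local_signal, alpha=1.0, beta=2.0):
    """Classical ACO rule: p_i is proportional to tau_i**alpha * eta_i**beta."""
    weights = [(t ** alpha) * (e ** beta) for t, e in zip(pheromone, local_signal)]
    total = sum(weights)
    return [w / total for w in weights]


tau = [2.0, 1.0, 1.0]  # hypothetical pheromone levels of three child nodes
eta = [0.5, 0.5, 1.0]  # hypothetical local scores of the same children

p_default = transition_probs(tau, eta)             # alpha = 1, beta = 2
p_no_pher = transition_probs(tau, eta, alpha=0.0)  # pheromone ignored
print(p_default)
print(p_no_pher)
```

With α = 0 the first two children, which share the same local score, become equally likely even though one carries twice the pheromone. This mirrors the α = 0 ablation, where the search can no longer exploit information accumulated across episodes.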

5. Discussion

Accuracy Comparison across Models and Datasets. To comprehensively evaluate the performance of our algorithm, we conducted accuracy comparison experiments on multiple datasets, as summarized in Table 2. On the ChartQA_test dataset, Pheromone_MCTS achieved the highest accuracy of 84.80%, ahead of MulBerry (84.52%) and Qwen2VL_7B (83.00%). This demonstrates the algorithm’s ability to effectively explore multi-step reasoning trajectories using global guidance from pheromone signals in conjunction with local decision heuristics. On the MathVista benchmark, MulBerry obtained the best result (63.10%), while Pheromone_MCTS and Qwen2VL_7B achieved 60.20% and 58.20%, respectively. Although MulBerry outperforms our method on this dataset, the gap remains narrow. Notably, on the more challenging GRAB and ChartX datasets, Pheromone_MCTS demonstrated superior generalization capabilities, achieving 12.40% and 37.41%, respectively. The contrast between MathVista and ChartX can be attributed to the distinct skill demands of the two datasets. MathVista primarily evaluates mathematical question answering. Its question types include Geometry Problem Solving (GPS) and Math Word Problems (MWPs) [31]. For these tasks, the main challenge is accurate calculation. Therefore, our search mechanism offers a smaller advantage, as success depends more on the model’s computational ability than on its search efficiency. In contrast, ChartX focuses on visual analysis skills. It is composed of perception and cognition tasks [33]. These tasks require the model to accurately parse chart structures and identify trends, which involves longer reasoning chains. Our pheromone-based search method is highly effective here. It uses a history of previous searches to better guide the exploration process. This leads to more precise data extraction from complex visual charts. These results suggest that integrating our pheromone-guided search with Monte Carlo planning enables more robust performance in complex and sparse-reward environments, where traditional decoding strategies often fail (Figure 5).

Impact of Iteration Strategy on Algorithm Performance. Table 3 provides a comparative analysis of different iteration budgets and their influence on algorithm performance. The results reveal a clear trade-off between computational cost and reasoning robustness. The Pheromone_MCTS configuration, employing fifteen iterations, establishes a balanced baseline with a success rate of 71.2% and an execution time of 118.2 s. Reducing the budget to ten iterations significantly degrades performance, with the success rate falling to 45.2%, despite a modest time saving of 9.8 s. This suggests that insufficient exploration limits the agent’s ability to discover reliable reasoning paths, leading to premature convergence on suboptimal solutions. In contrast, increasing the budget to twenty iterations yields an improvement of 5.6 percentage points in success rate (to 76.8%), reflecting enhanced solution quality through a more extensive search. However, the increased time cost (129.7 s) highlights diminishing returns in efficiency. These findings underscore the effectiveness of Pheromone_MCTS in balancing exploration and efficiency. The use of fifteen iterations achieves a near-optimal trade-off, allowing the pheromone-guided search to accumulate sufficient global information while maintaining acceptable runtime. This validates the algorithm’s capacity to leverage limited iteration budgets for robust and efficient chart reasoning.

Ablation Study on Model Collaboration. We evaluate the effectiveness of dual-model collaborative reasoning within our proposed Pheromone_MCTS algorithm, with the results summarized in Table 4. The complete framework achieves a success rate of 71.2%, significantly outperforming the single-model baselines, which attain 52.8% (Single_model_7b) and 58.4% (Single_model_llama), respectively. This represents an absolute improvement of 18.4 percentage points over Qwen2VL-7B and 12.8 percentage points over LLaMA-11B. These results highlight the advantage of heterogeneous collaboration, where diverse reasoning capabilities from different models contribute to constructing more robust and complementary reasoning trajectories. The pheromone-guided algorithm further amplifies this synergy by adaptively reinforcing effective model choices and guiding future selections toward historically successful strategies. In terms of runtime, the collaborative framework maintains competitive efficiency, averaging 117.6 s per query. This is faster than the single-model configurations (125.1 s for Qwen2VL-7B and 132.1 s for LLaMA-11B), which we attribute to better planning efficiency enabled by dual-agent coordination.

Ablation Study on the Pheromone-Guided Search Mechanism. We present a quantitative analysis of the contribution of the pheromone-guided search mechanism, directly addressing its impact on computational overhead, as summarized in Table 5. The results demonstrate that incorporating pheromone signals significantly enhances the effectiveness of the algorithm. Specifically, removing the pheromone mechanism (no_pheromone) leads to a drop of 7.6 percentage points in success rate (from 71.2% to 63.6%), indicating that global reinforcement across search episodes is crucial for guiding the agent toward high-reward reasoning paths. To isolate the additional cost of maintaining global pheromone trails, we compared the execution times. Our results show that the absence of pheromone trails increases the average execution time from 117.6 s to 123.1 s. This suggests that without global heuristics, the agent spends more time exploring suboptimal branches, reducing planning efficiency. Therefore, while the pheromone update step has a minor computational cost, it is more than offset by the time saved from avoiding unproductive searches. This finding demonstrates that our mechanism improves not only effectiveness but also overall efficiency, addressing concerns about its practicality for large-scale applications. Overall, these findings confirm the importance of the pheromone-augmented component in accelerating convergence and improving robustness. By persistently reinforcing effective state–action choices, the pheromone-guided search enhances the exploration–exploitation balance beyond what local MCTS statistics alone can achieve.
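The low cost of the pheromone update can be seen from a sketch of a joint backpropagation step, which touches only the edges of the path just explored. The code below is a hedged illustration, not the authors' implementation: the update τ ← (1 − ρ)τ + reward is the standard ACO form and is assumed here, evaporation is applied only to traversed edges for simplicity, and all function and variable names are hypothetical.

```python
def backpropagate(path, reward, visits, value, pheromone, rho=0.1, tau0=1.0):
    """Update local MCTS statistics and global pheromone along one path.

    Assumed pheromone rule (standard ACO form): tau <- (1 - rho) * tau + reward.
    """
    for edge in zip(path, path[1:]):
        visits[edge] = visits.get(edge, 0) + 1        # local visit count
        value[edge] = value.get(edge, 0.0) + reward   # local cumulative reward
        pheromone[edge] = (1.0 - rho) * pheromone.get(edge, tau0) + reward


visits, value, pheromone = {}, {}, {}
# One successful and one failed trajectory, using the binary reward of the paper.
backpropagate(["root", "s1", "s2"], 1.0, visits, value, pheromone)
backpropagate(["root", "s1", "s3"], 0.0, visits, value, pheromone)
print(pheromone)
```

The work per iteration is linear in the path length, which is consistent with the observation that the update adds only a minor overhead compared with the model calls made during expansion and simulation.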

Ablation Study on Hyperparameter Analysis. The ablation studies validate the effectiveness of the pheromone-guided search mechanism and clarify its role in balancing exploration and exploitation. Setting the global pheromone weight α = 0 confirms that the performance gains arise from the global, history-aware signal rather than the MCTS structure. Peak performance at α = 1.0 indicates an optimal balance, while higher values risk over-exploitation. For the exploration weight β, low values cause premature convergence, whereas high values increase inference time and reduce the success rate due to overly random exploration; β = 2.0 achieves a desirable trade-off. The evaporation rate ρ controls memory adaptability: low ρ leads to rigid memory biased by early suboptimal paths, while high ρ prevents knowledge accumulation. Optimal performance at ρ = 0.1 balances learning and forgetting, enabling robust yet adaptable reasoning. Overall, these findings demonstrate a well-tuned algorithm capable of efficiently navigating complex reasoning spaces.
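The memory effect of ρ can be quantified by the number of updates after which an unreinforced trail loses half of its pheromone. The calculation below assumes the standard multiplicative evaporation rule τ ← (1 − ρ)τ applied once per update, which is an assumption about the update schedule rather than a detail stated in this section; the function name `half_life` is hypothetical.

```python
import math


def half_life(rho):
    """Updates needed for an unreinforced trail to halve,
    assuming tau <- (1 - rho) * tau at every update."""
    return math.log(0.5) / math.log(1.0 - rho)


half_lives = {rho: half_life(rho) for rho in (0.01, 0.1, 0.5)}
for rho, h in half_lives.items():
    print(f"rho = {rho}: half-life of about {h:.1f} updates")
```

Under this assumption, ρ = 0.01 gives a half-life of about 69 updates, far longer than a fifteen-iteration budget, so early trails barely fade. With ρ = 0.5 a trail halves at every update, and with ρ = 0.1 the half-life of about 6.6 updates falls inside the search budget.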

Limitations. While our pheromone-guided search algorithm performs well, we recognize several areas for future work. A primary constraint is scalability to complex documents. Extending our framework to multi-chart documents is conceptually straightforward, as our algorithm’s logic is independent of input size. However, it presents a computational challenge, requiring greater GPU memory and longer inference times due to the increased number of input tokens. Although our tests were lightweight, the computational cost for large-scale scenarios has not been fully profiled. Additionally, our reward function is binary, and future work could explore continuous rewards based on metrics like embedding similarity to offer a more granular signal. Our analysis lacks a direct measure of exploration quality, such as trajectory entropy, and current search budget experiments only provide an indirect perspective. A more quantitative metric would better validate how our mechanism enhances search diversity. In addition, exploring dynamic budget allocation based on early convergence signals (e.g., entropy reduction or reward stabilization) is a promising direction to improve efficiency without sacrificing accuracy. Addressing these limitations will be key to developing more robust and scalable reasoning systems.
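Two of the directions above, a trajectory-entropy measure of exploration and a stopping rule based on reward stabilization, can be sketched briefly. Both functions are hypothetical illustrations of how such metrics might be defined; neither is part of the evaluated system, and the window size and tolerance are arbitrary placeholder values.

```python
import math
from collections import Counter


def trajectory_entropy(paths):
    """Shannon entropy (in bits) of the empirical distribution over sampled paths."""
    counts = Counter(tuple(p) for p in paths)
    n = len(paths)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def should_stop(rewards, window=3, tol=1e-3):
    """Stop once the mean reward of the last two windows differs by less than tol."""
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return abs(recent - previous) < tol


diverse = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "f"]]
collapsed = [["a", "b"]] * 4
print(trajectory_entropy(diverse), trajectory_entropy(collapsed))
print(should_stop([0, 1, 0, 1, 1, 1, 1, 1, 1]))
```

Four distinct paths give the maximum entropy for four samples (2 bits), while four copies of one path give zero, so a falling entropy curve would signal that the search has converged and the remaining budget could be released.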

6. Conclusions

In this work, we propose Pheromone_MCTS, a dual-signal search algorithm that integrates local exploration statistics with global pheromone-based memory for effective chart reasoning. Empirical results across multiple benchmarks consistently demonstrate its superior performance over strong baseline models, particularly in complex and sparse-reward scenarios. Our analysis reveals that the algorithm achieves a favorable balance between accuracy and efficiency with fifteen iterations, while additional iterations yield marginal gains at the cost of increased computational time. Ablation studies further confirm that multi-agent collaboration significantly enhances the diversity and quality of reasoning trajectories, enabling more robust exploration of the solution space. Moreover, the pheromone mechanism serves as an effective guide for long-term credit assignment, improving both convergence speed and final accuracy. Overall, Pheromone_MCTS offers a principled and efficient solution for multi-step reasoning tasks, combining insights from Monte Carlo tree search and Ant Colony Optimization in a unified algorithm.

Author Contributions

Conceptualization, M.Z. and X.H.; Methodology, T.Z. and X.H.; Validation, M.Z.; Formal analysis, M.Z.; Writing—original draft, M.Z. and X.H.; Writing—review & editing, Z.Q., J.V. and X.H.; Supervision, J.V. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by Hunan Natural Science Foundation Project (No. 2025JJ50338) and Shanghai Education Committee AI Project (No. JWAIYB-2).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huang, K.H.; Chan, H.P.; Fung, Y.R.; Qiu, H.; Zhou, M.; Joty, S.; Chang, S.F.; Ji, H. From pixels to insights: A survey on automatic chart understanding in the era of Large Foundation Models. arXiv 2024, arXiv:2403.12027.
2. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
3. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F. Qwen technical report. arXiv 2023, arXiv:2309.16609.
4. Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv 2024, arXiv:2409.12191.
5. Zhang, D.; Wu, J.; Lei, J.; Che, T.; Li, J.; Xie, T.; Huang, X.; Zhang, S.; Pavone, M.; Li, Y. LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 7315–7337.
6. Huang, X.; Li, S.; Qu, W.; He, T.; Zuo, Y.; Ouyang, W. Frozen CLIP model is efficient point cloud backbone. arXiv 2022, arXiv:2212.04098.
7. Huang, X.; Huang, Z.; Li, S.; Qu, W.; He, T.; Hou, Y.; Zuo, Y.; Ouyang, W. Frozen CLIP transformer is an efficient point cloud encoder. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–28 February 2024; Volume 38, pp. 2382–2390.
8. Huang, X.; Huang, Z.; Zuo, Y.; Gong, Y.; Zhang, C.; Liu, D.; Fang, Y. PSReg: Prior-guided Sparse Mixture of Experts for Point Cloud Registration. arXiv 2025, arXiv:2501.07762.
9. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644.
10. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 2023, 36, 28541–28564.
11. Zhang, H.; Chen, J.; Jiang, F.; Yu, F.; Chen, Z.; Li, J.; Chen, G.; Wu, X.; Zhang, Z.; Xiao, Q. HuatuoGPT, towards taming language model to be a doctor. arXiv 2023, arXiv:2305.15075.
12. Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-O1, towards medical complex reasoning with LLMs. arXiv 2024, arXiv:2412.18925.
13. Masry, A.; Long, D.X.; Tan, J.Q.; Joty, S.; Hoque, E. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv 2022, arXiv:2203.10244.
14. Wang, Z.; Xia, M.; He, L.; Chen, H.; Liu, Y.; Zhu, R.; Liang, K.; Wu, X.; Liu, H.; Malladi, S. CharXIV: Charting gaps in realistic chart understanding in multimodal LLMs. Adv. Neural Inf. Process. Syst. 2024, 37, 113569–113697.
15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtually, 18–24 July 2021; pp. 8748–8763.
16. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtually, 18–24 July 2021; pp. 5583–5594.
17. Sprague, Z.; Yin, F.; Rodriguez, J.D.; Jiang, D.; Wadhwa, M.; Singhal, P.; Zhao, X.; Ye, X.; Mahowald, K.; Durrett, G. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arXiv 2024, arXiv:2409.12183.
18. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822.
19. Zhang, D.; Zhoubian, S.; Hu, Z.; Yue, Y.; Dong, Y.; Tang, J. Rest-MCTS*: LLM self-training via process reward guided tree search. Adv. Neural Inf. Process. Syst. 2024, 37, 64735–64772.
20. Zhao, Y.; Yin, H.; Zeng, B.; Wang, H.; Shi, T.; Lyu, C.; Wang, L.; Luo, W.; Zhang, K. Marco-O1: Towards open reasoning models for open-ended solutions. arXiv 2024, arXiv:2411.14405.
21. Yao, H.; Huang, J.; Wu, W.; Zhang, J.; Wang, Y.; Liu, S.; Wang, Y.; Song, Y.; Feng, H.; Shen, L. Mulberry: Empowering MLLM with O1-like reasoning and reflection via collective Monte Carlo Tree Search. arXiv 2024, arXiv:2412.18319.
22. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
23. Gan, B.; Zhao, Y.; Zhang, T.; Huang, J.; Li, Y.; Teo, S.X.; Zhang, C.; Shi, W. MASTER: A Multi-Agent System with LLM Specialized MCTS. arXiv 2025, arXiv:2501.14304.
24. Zhang, Y.; Mao, S.; Ge, T.; Wang, X.; de Wynter, A.; Xia, Y.; Wu, W.; Song, T.; Lan, M.; Wei, F. LLM as a mastermind: A survey of strategic reasoning with large language models. arXiv 2024, arXiv:2404.01230.
25. Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv 2023, arXiv:2302.12813.
26. Zelikman, E.; Wu, Y.; Mu, J.; Goodman, N. STaR: Bootstrapping reasoning with reasoning. Adv. Neural Inf. Process. Syst. 2022, 35, 15476–15488.
27. Blum, C. Ant colony optimization: Introduction and recent trends. Phys. Life Rev. 2005, 2, 353–373.
28. Dorigo, M.; Birattari, M.; Stützle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2007, 1, 28–39.
29. Dorigo, M.; Stützle, T. Ant colony optimization: Overview and recent advances. In Handbook of Metaheuristics; Springer: Berlin/Heidelberg, Germany, 2018; pp. 311–351.
30. Tolpin, D.; Shimony, S. MCTS based on simple regret. In Proceedings of the AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012; Volume 26, pp. 570–576.
31. Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.W.; Galley, M.; Gao, J. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv 2023, arXiv:2310.02255.
32. Roberts, J.; Han, K.; Albanie, S. GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models. arXiv 2024, arXiv:2408.11817.
33. Xia, R.; Zhang, B.; Ye, H.; Yan, X.; Liu, Q.; Zhou, H.; Chen, Z.; Ye, P.; Dou, M.; Shi, B. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. arXiv 2024, arXiv:2402.12185.
34. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
35. Chaudhry, R.; Shekhar, S.; Gupta, U.; Maneriker, P.; Bansal, P.; Joshi, A. Leaf-QA: Locate, encode & attend for figure question answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Aspen, CO, USA, 1–5 May 2020; pp. 3512–3521.
36. Singh, H.; Shekhar, S. STL-CQA: Structure-based transformers with localization and encoding for chart question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 3275–3284.
37. Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O.K.; Patra, B. Language is not all you need: Aligning perception with language models. Adv. Neural Inf. Process. Syst. 2023, 36, 72096–72109.
38. Kafle, K.; Price, B.; Cohen, S.; Kanan, C. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5648–5656.
39. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
40. Du, Y.; Liu, Z.; Li, J.; Zhao, W.X. A survey of vision-language pre-trained models. arXiv 2022, arXiv:2202.10936.
41. Hassija, V.; Chamola, V.; Mahapatra, A.; Singal, A.; Goel, D.; Huang, K.; Scardapane, S.; Spinelli, I.; Mahmud, M.; Hussain, A. Interpreting black-box models: A review on explainable artificial intelligence. Cogn. Comput. 2024, 16, 45–74.
42. Roumeliotis, K.I.; Tselikas, N.D. ChatGPT and open-ai models: A preliminary review. Future Internet 2023, 15, 192.
43. Xu, F.; Hao, Q.; Zong, Z.; Wang, J.; Zhang, Y.; Wang, J.; Lan, X.; Gong, J.; Ouyang, T.; Meng, F. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. arXiv 2025, arXiv:2501.09686.
44. Jiao, F.; Qin, C.; Liu, Z.; Chen, N.F.; Joty, S. Learning planning-based reasoning by trajectories collection and process reward synthesizing. arXiv 2024, arXiv:2402.00658.
45. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
46. Hsieh, C.Y.; Chen, S.A.; Li, C.L.; Fujii, Y.; Ratner, A.; Lee, C.Y.; Krishna, R.; Pfister, T. Tool documentation enables zero-shot tool-usage with large language models. arXiv 2023, arXiv:2308.00675.
47. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q. Least-to-most prompting enables complex reasoning in large language models. arXiv 2022, arXiv:2205.10625.
48. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171.
49. Chari, A.; Tiwari, A.; Lian, R.; Reddy, S.; Zhou, B. Pheromone-based learning of optimal reasoning paths. arXiv 2025, arXiv:2501.19278.

Figure 1. Comparison of pheromone-enhanced MCTS and traditional MCTS frameworks, with accuracy benchmark results. The top panel illustrates the architectural difference between traditional MCTS and our pheromone-enhanced variant. The bottom panel presents a benchmark comparison across multiple datasets.

Figure 2. Pheromone-guided MCTS: An overview of the pheromone-enhanced path search framework, which iteratively explores reasoning paths by jointly (a) expanding high-potential candidate nodes using UCB and global pheromone scores, (b) simulating downstream reasoning steps and adjusting pheromone levels based on reasoning quality, (c) backpropagating feedback to update UCB and pheromone values, and (d) selecting the next start node based on a selection score that balances exploration and exploitation.

Figure 3. Overview of the pheromone-enhanced path search framework with global and local signal integration. The framework integrates both global (pheromone) and local (UCB-based) decision signals within an MCTS process.

Figure 4. Sensitivity analysis of hyperparameters α, β, and ρ on the accuracy and average time consumed. The optimal values used in our experiments are marked with a red diamond.

Figure 5. An example illustrating several reasoning paths for a single chart query.

Table 1. Statistics and key characteristics of the four benchmark datasets.

Dataset	Questions	Chart Types	Question Types
ChartQA [13]	2500	Bar Charts, Line Charts, Pie Charts	Data Retrieval, Visual, Compositional, Both Visual and Compositional
MathVista [31]	1000	Bar Charts, Line Plots, Pie Charts, Scatter Plots, Function Plots, Scientific Figures, Tables	FQA, GPS, MWP, TQA, VQA ^a
GRAB [32]	2170	Mathematical Graphs	Properties, Functions, Series, Transforms
ChartX [33]	1150	Domain-specific Chart Types, General Chart Types, Fine-grained Chart Types	Perception Tasks, Cognition Tasks

^a FQA: Figure Question Answering, GPS: Geometry Problem Solving, MWP: Math Word Problem, TQA: Textbook Question Answering, VQA: Visual Question Answering.

Table 2. Accuracy comparison across models and datasets.

Dataset	Pheromone_MCTS	MulBerry	Qwen2VL_7B
ChartQA_test_human	79.68%	78.16%	–
ChartQA_test_augment	89.92%	90.88%	–
ChartQA_test	84.80%	84.52%	83.00%
MathVista	60.20%	63.10%	58.20%
GRAB	12.40%	–	10.18%
ChartX	37.41%	–	28.39%
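As a consistency check on Table 2, the overall ChartQA_test accuracy can be recomputed from the two split accuracies. The sketch below assumes that the human and augmented splits are of equal size, so that the overall score is their unweighted mean; this assumption is not stated in the table, and the dictionary layout and the function name `mean_of_splits` are hypothetical.

```python
# Accuracies (%) copied from the ChartQA rows of Table 2.
table2 = {
    "Pheromone_MCTS": {"human": 79.68, "augment": 89.92, "overall": 84.80},
    "MulBerry": {"human": 78.16, "augment": 90.88, "overall": 84.52},
}


def mean_of_splits(row):
    """Unweighted mean of the two ChartQA test splits, rounded to two decimals."""
    return round((row["human"] + row["augment"]) / 2, 2)


for model, row in table2.items():
    print(model, mean_of_splits(row), row["overall"])
```

For both models the recomputed mean matches the reported ChartQA_test value, which also shows that the overall lead of Pheromone_MCTS comes from the human-written split, since MulBerry is ahead on the augmented split.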

Table 3. Accuracy comparison across different iteration budgets.

Experiment	Succ. (%)	Diff (%)	Time (s)	Time Diff (s)
Pheromone_MCTS	71.2	0.0	118.2	0.0
10_iterations	45.2	−26.0	108.4	−9.8
20_iterations	76.8	+5.6	129.7	+11.5

Table 4. Ablation study on model collaboration.

Experiment	Succ. (%)	Diff (%)	Time (s)	Time Diff (s)
Pheromone_MCTS	71.2	0.0	117.6	0.0
Single_model_7b	52.8	−18.4	125.1	+7.5
Single_model_llama	58.4	−12.8	132.1	+14.5

Table 5. Ablation study on pheromone-guided search mechanism.

Experiment	Succ. (%)	Diff (%)	Time (s)	Time Diff (s)
Pheromone_MCTS	71.2	0.0	117.6	0.0
no_pheromone	63.6	−7.6	123.1	+5.5
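The Diff and Time Diff columns of Tables 4 and 5 are plain differences from the full Pheromone_MCTS configuration, so the success-rate entries are percentage points rather than relative changes. The short script below recomputes them from the reported values; the variable and function names are hypothetical.

```python
# (success rate in %, mean time in s), copied from Tables 4 and 5.
baseline = (71.2, 117.6)
variants = {
    "Single_model_7b": (52.8, 125.1),
    "Single_model_llama": (58.4, 132.1),
    "no_pheromone": (63.6, 123.1),
}


def deltas(variant, base=baseline):
    """Return (success difference in points, time difference in seconds)."""
    return (round(variant[0] - base[0], 1), round(variant[1] - base[1], 1))


for name, row in variants.items():
    print(name, deltas(row))
```

Every ablated variant is both less accurate and slower than the full configuration, which is why the tables show negative success differences alongside positive time differences.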

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
