Article

ACR: Adaptive Confidence Re-Scoring for Reliable Answer Selection Among Multiple Candidates

Eunhye Jeong and Yong Suk Choi
1 Department of Artificial Intelligence, Hanyang University, Seoul 04763, Republic of Korea
2 Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9587; https://doi.org/10.3390/app15179587
Submission received: 5 August 2025 / Revised: 22 August 2025 / Accepted: 27 August 2025 / Published: 30 August 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

With the improved reasoning capabilities of large language models (LLMs), their applications have rapidly expanded across a wide range of tasks. In recent question answering tasks, performance gains have been achieved through Self-Consistency, where LLMs generate multiple reasoning paths and determine the final answer via majority voting. However, this approach can fail when the correct answer is generated but does not appear frequently enough to be selected, highlighting its vulnerability to inconsistent generations. To address this, we propose Adaptive Confidence Re-scoring (ACR)—a method that adaptively evaluates and re-scores candidate answers to select the most trustworthy one when LLMs fail to generate consistent reasoning. Experiments on arithmetic and logical reasoning benchmarks show that ACR maintains or improves answer accuracy while significantly reducing inference cost. Compared to existing verification methods such as FOBAR, ACR reduces the number of inference calls by up to 95%, while improving inference efficiency—measured as accuracy gain per inference call—by a factor of 2× to 17×, depending on the dataset and model.

1. Introduction

As the scale of language models (LMs) has grown, so has interest in leveraging their capabilities to solve complex problems across diverse domains [1,2,3]. In particular, large language models (LLMs) have demonstrated strong reasoning abilities in question answering (QA) tasks through in-context learning (ICL), even without fine-tuning [4,5,6]. One prominent strategy for enhancing LLM reasoning is Chain-of-Thought (CoT) prompting [7], which encourages step-by-step reasoning before generating a final answer. This technique has been shown to substantially improve QA performance across a variety of tasks [8,9,10].
Self-Consistency (SC) [11] is typically used in combination with CoT prompting. It improves the consistency and reliability of final answers by selecting the most frequent answer among multiple reasoning paths generated by the LLMs. Since CoT encourages LLMs to perform multi-step reasoning, SC works more effectively when combined with CoT, enhancing both the accuracy and consistency of the final answers [11,12]. However, SC relies solely on frequency-based aggregation, which may fail to select the correct answer when it appears among the minority responses. For example, as shown in Figure 1, the LLM generates three unique candidate answers for a given question. Although the correct answer (189) is included among them, SC selects the most frequently generated response—an incorrect one (47.25)—as the final answer due to its majority voting strategy. This limitation is especially problematic when multiple answers occur with similar frequencies, leading to ambiguity and reduced confidence in the final prediction. To address this issue, recent methods such as Self-Verification (SV) [13] and FOBAR [14] have been proposed, which verify candidate answers before selecting the final answer. However, these approaches require performing separate verification for each candidate answer, resulting in substantial computational overhead.
We introduce Adaptive Confidence Re-scoring (ACR), a method that improves the reliability of final answers by collectively evaluating candidates generated by SC, rather than verifying each one individually. Unlike prior verification-based approaches, ACR formulates this process as a lightweight classification task over the candidate set, thereby minimizing computational overhead. ACR first computes an initial confidence score distribution based on the SC results and reformulates the evaluation process as a classification task. The LLM is then prompted to select the most plausible answer from the set of candidate answers, and the confidence scores are adjusted accordingly to compute the final score. The candidate with the highest final score is selected as the final answer. We evaluated the proposed method on five reasoning datasets—GSM8K [15], TabMWP [16], MATH500 [17], Word Sorting [18], and Reasoning about Colored Objects [18]—using five different language models: Qwen2.5-7B [19], Qwen3-14B [20], Mistral-Small-24B [21], Gemini-1.5-Flash-8B [22], and Gemini-2.5-Flash [3]. In all cases, ACR consistently outperformed SC across both arithmetic and logical reasoning tasks.
Our key contributions are as follows:
  • We reduce computational overhead by selectively evaluating candidate answers based on the confidence distribution derived from SC.
  • We reformulate the selection process as a classification task, enabling the model to recover correct answers even when they appear in the minority.
  • We propose a collective evaluation strategy that enhances answer reliability without requiring separate verification for each candidate.

2. Related Work

2.1. Self-Consistency

Recent QA research has explored various strategies to improve answer accuracy by generating and aggregating multiple responses from LLMs. A notable early approach is Best-of-N (BoN) [23,24], which employs a reward model to score and select the most preferred output. More recently, Self-Consistency (SC) [11] has become a widely adopted alternative, selecting final answers based on consensus among reasoning paths rather than reward signals. SC generates multiple candidate answers by prompting the LLM to follow diverse reasoning paths and selects the final answer through majority voting. This method naturally filters out incorrect responses and tends to yield more reliable and accurate outputs [11,25,26].
To further enhance its performance, researchers have proposed several extensions that improve the robustness and flexibility of SC. Adaptive Consistency [27] dynamically adjusts the number of generated samples based on the problem instance, rather than using a fixed sample size, thereby maintaining efficiency without compromising performance. Soft Self-Consistency [28] replaces majority voting with continuous scoring based on token probabilities, improving the stability of answer selection. Additionally, paraphrasing-based methods [29,30] encourage diverse reasoning by constructing different input prompts for the same question.
While these approaches focus on improving SC—via sampling strategies, scoring mechanisms, or prompt diversification—our work instead aims to improve answer reliability after SC. Specifically, we propose a collective evaluation strategy that accounts for the possibility that minority answers may be correct, mitigating the risk of errors caused by frequency-based aggregation.

2.2. Verification

While SC selects answers based on consistency, it does not guarantee that the selected answer is always correct. To address this issue, recent approaches have incorporated an additional verification step to assess the validity of candidate answers. Self-Verification (SV) [13] performs backward reasoning (BR) for each candidate and selects the one with the highest verification score, regardless of the SC outcome. FOBAR (FOrward and BAckward Reasoning for verification) [14] combines SC and SV by computing both scores for each candidate and then selects the answer with the highest combined score. However, these methods require conducting BR individually for every candidate, resulting in substantial computational overhead. Moreover, verification is uniformly applied to all examples, which may lead to inefficiencies—especially when many cases are already confidently resolved by SC.
In contrast, our approach performs a collective evaluation over SC-generated candidates and applies it selectively, only to examples that require further scrutiny. This strategy improves both accuracy and efficiency by avoiding unnecessary evaluation.

3. Method

Our proposed method, Adaptive Confidence Re-scoring (ACR), consists of three main stages. Figure 1 illustrates the overall structure of ACR.

3.1. SC: Generating Initial Answers and Scores

In the first stage, we apply SC using few-shot CoT prompting (for GSM8K, TabMWP, Word, and Color) and zero-shot prompting (for MATH500 and AIME 2024) to generate M reasoning paths, yielding a set of M candidate answers $\mathcal{A} = \{a_i\}_{i=1}^{M}$. From these, we extract the set of unique answers $\hat{\mathcal{A}} = \{\hat{a}_c\}_{c=1}^{|\hat{\mathcal{A}}|}$ and compute the initial confidence score $\mathrm{Conf}_1(\hat{a}_c)$ for each $\hat{a}_c$ as follows:
$\mathrm{Conf}_1(\hat{a}_c) = \frac{\mathrm{Freq}(\hat{a}_c)}{M}$
where $\mathrm{Freq}(\hat{a}_c)$ denotes the number of times $\hat{a}_c$ was generated. This score reflects the relative frequency of each candidate answer and serves as a measure of the model’s initial confidence.
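As a concrete illustration, the Step 1 score is simply a normalized frequency count. The sketch below reflects our reading of the paper (it is not the authors' released code); the answer strings mirror the Figure 1 scenario with hypothetical counts, where the correct answer 189 is in the minority:

from collections import Counter

def sc_confidence(answers: list[str]) -> dict[str, float]:
    """Conf_1: relative frequency of each unique answer among the M sampled reasoning paths."""
    m = len(answers)
    counts = Counter(answers)
    return {answer: freq / m for answer, freq in counts.items()}

# Hypothetical tally over M = 20 samples for the question in Figure 1.
conf1 = sc_confidence(["47.25"] * 9 + ["189"] * 7 + ["141.75"] * 4)
# -> {'47.25': 0.45, '189': 0.35, '141.75': 0.2}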

3.2. SO: Selective and Collective Evaluation of Candidate Answers

ACR is designed to re-score candidate confidences when an LLM generates inconsistent answers across multiple reasoning paths. However, performing verification on every data example—including those where the model consistently generates the same answer—can be inefficient. To address this, we introduce a selective evaluation step. Unlike the prior verification-based methods discussed in Section 2.2, our approach employs a distinct strategy—termed Select Option (SO)—within the evaluation process. To enable selective evaluation, we utilize the initial confidence scores $\mathrm{Conf}_1$ computed in Step 1 (SC). When the model’s answers are inconsistent, the resulting score distribution $\mathrm{Conf}_1(\hat{\mathcal{A}})$ tends to be more uniform, exhibiting a lower standard deviation.
As illustrated in Figure 2a, examples B and C show such uniform-like distributions, where no single candidate dominates—indicating that the model has generated inconsistent answers. In contrast, example A has a more concentrated distribution, and Figure 2b shows that its standard deviation is correspondingly higher, reflecting a strong preference for a single answer and thus greater consistency. We use this standard deviation as a proxy for model uncertainty and trigger SO only when it falls below a predefined threshold δ. This approach avoids unnecessary computation on examples where the model’s output is already consistent, thereby improving both efficiency and effectiveness.
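A minimal sketch of the selective trigger follows. Whether the population or sample standard deviation is used is not stated in the paper, so statistics.pstdev is our assumption; with δ = 0.2 it reproduces the decisions for examples A, B, and C in Figure 2:

import statistics

def needs_rescoring(conf1: dict[str, float], delta: float = 0.2) -> bool:
    """Trigger SO only when the SC score distribution is near-uniform (low standard deviation)."""
    scores = list(conf1.values())
    if len(scores) < 2:
        return False  # a unanimous answer is already consistent; skip Steps 2 and 3
    return statistics.pstdev(scores) < delta

needs_rescoring({"x": 0.85, "y": 0.15})                        # False (std = 0.35, example A)
needs_rescoring({"w": 0.15, "x": 0.20, "y": 0.40, "z": 0.25})  # True  (std ~ 0.09, example B)
needs_rescoring({"x": 0.45, "y": 0.55})                        # True  (std = 0.05, example C)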
For selected examples, we extract the top-K candidates from the initial set $\hat{\mathcal{A}}$, denoted as $\mathcal{A}_K = \{\hat{a}_k\}_{k=1}^{K}$, and reformat them into a multiple-choice prompt, which is then fed back into the LLM. This reformulation transforms the open-ended QA task into a classification setting, where the model is asked to choose one answer from the given options. In SO, we sample N responses from the model, $\mathcal{O} = \{o_j\}_{j=1}^{N}$, where $o_j \in \mathcal{A}_K$. From the set of unique selections $\hat{\mathcal{O}} = \{\hat{o}_d\}_{d=1}^{|\hat{\mathcal{O}}|}$, we compute the second-stage confidence score $\mathrm{Conf}_2$ as follows:
$\mathrm{Conf}_2(\hat{o}_d) = \frac{\mathrm{Freq}(\hat{o}_d)}{N}, \quad \hat{o}_d \in \mathcal{A}_K$
where $\mathrm{Freq}(\hat{o}_d)$ denotes the number of times $\hat{o}_d$ was selected.
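The sketch below illustrates one way the SO stage could be implemented; sample_llm stands in for whichever completion client is used (it is not part of the paper), and the letter-parsing logic is our own simplification:

import string
from collections import Counter
from typing import Callable

def select_option(question: str, conf1: dict[str, float],
                  sample_llm: Callable[[str], str], k: int = 4, n: int = 5) -> dict[str, float]:
    """Conf_2: collective evaluation of the top-K candidates via an N-sample multiple-choice vote."""
    # Top-K candidates, ordered by their Step 1 confidence (highest first).
    top_k = sorted(conf1, key=conf1.get, reverse=True)[:k]
    letters = string.ascii_uppercase[:len(top_k)]
    options = "\n".join(f"({letter}) {ans}" for letter, ans in zip(letters, top_k))
    prompt = (f"{question}\n\nChoose the correct answer from the options below "
              f"and reply with a single letter.\n{options}\nAnswer:")
    picks = []
    for _ in range(n):  # N independent selections
        reply = sample_llm(prompt).upper()
        letter = next((ch for ch in reply if ch in letters), None)
        if letter is not None:
            picks.append(top_k[letters.index(letter)])
    counts = Counter(picks)
    return {ans: counts.get(ans, 0) / n for ans in top_k}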

3.3. ACR: Re-Scoring

Finally, we compute the final confidence score $\mathrm{Conf}_{\mathrm{final}}$ by applying a scoring function $G(\mathrm{Conf}_1, \mathrm{Conf}_2)$ to the confidence scores obtained from Step 1 and Step 2. The candidate answer $\hat{a}_k$ with the highest $\mathrm{Conf}_{\mathrm{final}}$ is selected as the final prediction. In our implementation, the scoring function $G$ is defined as follows:
$G = \mathrm{Conf}_1 + \mathrm{Conf}_2$
This formulation simply aggregates the confidence scores from the two steps to produce the final score. We empirically find this scoring strategy to be the most effective and compare it with alternative formulations in Section 6.1. In cases such as the example in Figure 1, ACR flips the prediction by leveraging the evaluation stage, where the correct answer is selected by the majority, and re-scoring the candidates accordingly. This demonstrates ACR’s ability to recover correct answers in challenging scenarios where the LLM fails to generate consistent outputs.
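The re-scoring step then reduces to an element-wise sum over the candidate set followed by an argmax. The sketch below continues the hypothetical Figure 1 numbers from the earlier snippets; the tie-breaking rule is our own assumption, not specified in the paper:

def rescore(conf1: dict[str, float], conf2: dict[str, float]) -> str:
    """Conf_final = Conf_1 + Conf_2; return the candidate with the highest final score."""
    final = {ans: conf1.get(ans, 0.0) + conf2.get(ans, 0.0) for ans in conf1}
    # Break ties by the Step 1 (SC) score -- an assumption on our part.
    return max(final, key=lambda ans: (final[ans], conf1.get(ans, 0.0)))

# If SO picks "189" in 4 of 5 samples and "47.25" once:
#   Conf_final("189")   = 0.35 + 0.80 = 1.15
#   Conf_final("47.25") = 0.45 + 0.20 = 0.65
# so the minority-but-correct answer is recovered.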

4. Experimental Setup

4.1. Datasets

We evaluate our method on five datasets covering two types of reasoning tasks: arithmetic reasoning and logical reasoning.

4.1.1. Arithmetic Reasoning

  • GSM8K  [15]: A dataset of grade-school-level math word problems. We use all 1319 examples from the test set.
  • TabMWP [16]: From the TabMWP 1K test set, we exclude multiple-choice questions and evaluate on 751 examples whose answer types are either integer_number or decimal_number.
  • MATH500 [17]: The MATH500 dataset comprises 500 challenging competition-level mathematics problems, uniformly sampled from the MATH dataset [31].

4.1.2. Logical Reasoning

  • Reasoning about Colored Objects (Color) [18]: A subtask from the BIG-Bench Hard benchmark that requires reasoning over object–color relationships. Although the original format is multiple-choice, we remove the options and prompt the model in an open-ended format. We use all 250 examples.
  • Word Sorting (Word) [18]: Another BIG-Bench Hard subtask, which involves sorting a list of words in lexicographic order. The dataset contains 250 examples. Due to time constraints, we evaluate this task only with Gemini-1.5-Flash-8B.

4.2. Implementation Details

We evaluate ACR using five language models: Qwen2.5-7B (Qwen/Qwen2.5-7B-Instruct) [19], Qwen3-14B (Qwen/Qwen3-14B) [20], Mistral-Small-24B (mistralai/Mistral-Small-24B-Instruct-2501) [21], Gemini-1.5-Flash-8B (models/gemini-1.5-flash-8b) [22], and Gemini-2.5-Flash (models/gemini-2.5-flash) [3]. For MATH500, we conduct experiments using only Gemini-2.5-Flash, as the dataset consists of particularly challenging competition-level problems. Preliminary tests with smaller models yielded poor performance and unstable outputs, making them unsuitable for reliable evaluation on this benchmark. For all models except Qwen3-14B, we use the same generation configuration for both SC and SO: temperature T = 0.7, top-p = 1.0, and top-k = 50. For Qwen3-14B, we set temperature T = 0.7, top-p = 0.95, and top-k = 20, following its recommended decoding configuration, and set enable_thinking = False. Prompting strategies vary across datasets: we use few-shot chain-of-thought (CoT) prompting with 8-shot for GSM8K, 4-shot for TabMWP, and 3-shot for both Color and Word; for MATH500, we adopt zero-shot prompting. The maximum output length is set to 350 tokens for GSM8K and TabMWP, 1024 tokens for Word and Color, and 1500 tokens for MATH500.
In Step 1 (SC), we generate M = 20 responses per example. In Step 2 (SO), we set N = 5 and consider the top-K = 4 candidate answers based on their SC scores. The options presented in the SO input prompt are ordered according to their initial SC confidence scores, from highest to lowest. The choice of K is motivated by the typical number of distinct answers observed when the LLM exhibits inconsistency. Empirically, we found that even with M = 20 samples, this number rarely exceeds four. Thus, setting K = 4 provides sufficient coverage of plausible candidates while maintaining computational efficiency.

4.3. Prompt Design

We include representative prompt formats used in both the SC (Step 1) and SO (Step 2) stages to clarify how LLMs are guided during each stage of our framework. Figure 3 illustrates examples of the prompts used for GSM8K. While we adopt consistent prompt templates across datasets, the number of few-shot exemplars varies by task as described in the previous section.

4.4. Baselines

We compare our proposed method against the following baselines:
  • SC [11]: Generates M = 20 responses per example and selects the final answer via majority voting.
  • SV [13]: Based on SC results, performs backward reasoning (BR) N = 5 times for each candidate and selects the one with the highest verification score. We adopt the True–False Item Verification variant for consistency across datasets. For a fair comparison with ACR, BR is applied only to the top-K unique candidates from SC, using the same K value as in our method.
  • FOBAR [14]: Extends SV by aggregating SC and BR scores using the weighting strategy proposed in the original FOBAR paper. We set the aggregation weight to α = 0.5 .

5. Results

Table 1 presents a comparison between baseline methods and our proposed approach, ACR, across five datasets and five different LLMs. Among the four scoring strategies evaluated within ACR, we adopt Sum as the default. Details of all scoring variants are provided in Section 6.1.
Experimental results show that ACR consistently outperforms SC across all settings. While it does not always achieve the highest absolute accuracy, it performs comparably to FOBAR, which builds on SV-based scoring. Notably, there are cases where both SV and FOBAR underperform relative to SC, whereas ACR consistently maintains or improves performance. This trend holds not only for relatively easier datasets such as GSM8K and TabMWP, but also for a more challenging dataset like MATH500, where ACR achieves the best performance among all compared methods.
As shown in Figure 4, SV (and by extension FOBAR) performs separate verification for each candidate generated by SC, resulting in computational overhead proportional to the number of unique answers. In contrast, ACR employs a collective evaluation strategy that requires a fixed number of inference calls per example, independent of the number of candidates. Furthermore, ACR selectively triggers evaluation based on the standard deviation of the SC score distribution, often skipping additional inference altogether. This structural efficiency, coupled with robust performance, positions ACR as a practical and scalable alternative. A detailed analysis of computational cost is provided in Section 6.4.

6. Analysis

In this section, we analyze the key components of ACR and evaluate their impact on overall performance. We begin by comparing alternative scoring strategies used to combine the two confidence scores, $\mathrm{Conf}_1$ and $\mathrm{Conf}_2$, then assess robustness to the ordering of options in the SO prompt, examine the effectiveness of selective re-scoring, and finally analyze the trade-off between accuracy and efficiency in comparison to FOBAR.

6.1. Final Confidence Scoring Strategies

To examine the robustness of ACR to different methods of integrating confidence scores, we evaluate four scoring functions for computing the final score $G(\mathrm{Conf}_1, \mathrm{Conf}_2)$ by combining $\mathrm{Conf}_1$ from Step 1 and $\mathrm{Conf}_2$ from Step 2:
(1) Sum: A simple additive strategy that assigns equal weight to both scores. While intuitive and easy to implement, it may overemphasize noisy SO scores when $\mathrm{Conf}_1$ is already high:
$G_{\mathrm{Sum}} = \mathrm{Conf}_1 + \mathrm{Conf}_2$
(2) Product: A strictly multiplicative strategy that enforces agreement between SC and SO. However, it may excessively penalize candidates when either score is low:
$G_{\mathrm{Product}} = \mathrm{Conf}_1 \times \mathrm{Conf}_2$
(3) Fobar: A softly weighted geometric mean, proposed by Jiang et al. [14] and controlled by a tunable parameter α:
$G_{\mathrm{Fobar}} = \mathrm{Conf}_1^{\alpha} \times \mathrm{Conf}_2^{1-\alpha}$
Following the original implementation, we set α = 0.5 in our experiments.
(4) SC-Boost: A hybrid strategy that augments $\mathrm{Conf}_1$ with a multiplicative boost from $\mathrm{Conf}_2$, thereby reflecting both the generation and evaluation stages:
$G_{\mathrm{SC\text{-}Boost}} = \mathrm{Conf}_1 + \mathrm{Conf}_1 \times \mathrm{Conf}_2$
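For reference, the four rules can be written as plain functions (an illustrative sketch; the function names are ours):

def g_sum(conf1: float, conf2: float) -> float:
    return conf1 + conf2                                  # Sum (default)

def g_product(conf1: float, conf2: float) -> float:
    return conf1 * conf2                                  # Product

def g_fobar(conf1: float, conf2: float, alpha: float = 0.5) -> float:
    return (conf1 ** alpha) * (conf2 ** (1.0 - alpha))    # Fobar-style weighted geometric mean

def g_sc_boost(conf1: float, conf2: float) -> float:
    return conf1 + conf1 * conf2                          # SC-Boost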
As shown in Table 1, all four strategies yield comparable performance across datasets, suggesting that ACR is robust to the choice of scoring function. Specifically, the average accuracy for each strategy is: Fobar (91.36%), Sum (91.35%), SC-Boost (91.30%), and Product (91.24%). Although Fobar achieves the highest average accuracy, we adopt Sum as the default scoring function in this work due to its simplicity, consistently strong performance, and lack of tunable hyperparameters.

6.2. Robustness to Option Order

In the SO stage, candidate answers obtained from SC are converted into a multiple-choice format, and the LLM is prompted to select one of the provided options. We adopt an ordering strategy in which candidates are sorted by their initial confidence scores ($\mathrm{Conf}_1$) from Step 1, placing higher-confidence candidates earlier in the list. However, prior work [32] has shown that LLMs can exhibit inherent selection biases in multiple-choice tasks, making them sensitive to the ordering of options.
To assess the potential impact of this bias, we conducted a comparison on the GSM8K dataset using Mistral-Small-24B and Qwen2.5-7B, evaluating our $\mathrm{Conf}_1$-based ordering against a random ordering. As shown in Table 2, the accuracy of Mistral dropped slightly from 95.29% ($\mathrm{Conf}_1$) to 95.06% with random ordering, though the difference is minimal. For Qwen, the respective accuracies were 92.79% and 92.87%, indicating virtually no difference. These results suggest that our SO method is relatively robust to variations in option ordering.
Table 3 compares the accuracy of SV and the SO stage in ACR. While ACR selectively applies SO based on the SC score distribution, for a fair comparison, we evaluate SO by uniformly applying it to all examples. As shown in Table 3, this often results in lower accuracy than SV. For instance, the accuracy gap reaches 4.40% for Gemini-1.5-Flash-8B on the Color dataset and 4.13% for Qwen2.5-7B on TabMWP. However, even in these cases, Table 1 shows that ACR still matches or even outperforms SV overall. This indicates that the performance of SO alone does not critically undermine ACR’s effectiveness. In particular, ACR does not rely solely on SO but incorporates a selective evaluation strategy that avoids unnecessary second-stage inference when the SC confidence is already high. Thus, the lower standalone performance of SO does not directly translate into a degradation in ACR’s overall performance. Nevertheless, as the performance of SO can still affect ACR in certain borderline cases, improving SO remains an important direction for future work. Enhancing SO’s reliability may lead to further gains in ACR’s accuracy without sacrificing its computational efficiency.

6.3. Effectiveness of Selective Re-Scoring

Figure 5 illustrates how the performance of ACR varies with different values of the threshold δ . Across all three datasets, ACR maintains stable accuracy over a broad range of δ values, showing minimal sensitivity to the specific choice of threshold. This robustness indicates that the effectiveness of ACR does not critically depend on fine-tuning δ , making it suitable for real-world applications where labeled data for validation may not be available. In particular, even on datasets such as TabMWP—where SO alone performs relatively poorly—ACR still exhibits consistent performance provided that δ is not set excessively high. These results highlight the practical benefit of ACR’s selective evaluation mechanism, which avoids unnecessary second-pass inference while preserving accuracy. Additionally, our analysis of performance across different δ values reveals that while the optimal threshold may vary depending on the model and dataset, thresholds in the range of 0.3 to 0.4 generally yield strong performance, as reflected in the δ values used for reporting ACR results in Table 1.

6.4. Efficiency and Effectiveness: ACR vs. FOBAR

To further evaluate the efficiency of ACR, we compare it against verification-based baselines, with both methods evaluated using n = 5 generations. Since FOBAR improves upon SV by applying its own BR-based scoring strategy, we compare ACR with FOBAR rather than SV to reflect the most competitive variant. For ACR, we report results using the δ value that yields the highest accuracy, as shown in Table 1. We quantify efficiency as the accuracy gain per inference call of ACR relative to that of FOBAR:
$\text{Efficiency} = \frac{\text{Accuracy Gain}_{\text{ACR}} / \text{Inference Calls}_{\text{ACR}}}{\text{Accuracy Gain}_{\text{FOBAR}} / \text{Inference Calls}_{\text{FOBAR}}}$
As shown in Table 4, both FOBAR and ACR aim to enhance the SC baseline, but their computational costs differ significantly. FOBAR performs verification for each unique answer generated by SC, resulting in an inference cost proportional to the number of candidates. In contrast, ACR employs a selective, collective evaluation strategy based on the standard deviation of the SC score distribution. This allows ACR to either skip additional inference altogether or use a fixed number of calls, regardless of the number of candidates.
To quantify the trade-off between accuracy and cost, we compute both the accuracy gain per inference call and the overall efficiency ratio between ACR and FOBAR. As reported in Table 4, ACR consistently achieves comparable or better accuracy while requiring significantly fewer inference calls. For example, on GSM8K with Qwen2.5-7B, ACR reduces inference calls by nearly 87%, resulting in over 2× improvement in efficiency compared to FOBAR. On the Color dataset with Gemini-1.5-Flash-8B, ACR achieves an even greater efficiency gain of more than 17×, attributed to its selective evaluation and re-scoring mechanism.
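For concreteness, substituting the GSM8K results for Qwen2.5-7B from Table 4 into the efficiency ratio gives
$\text{Efficiency} = \frac{0.38 / 1270}{1.29 / 9615} = \frac{0.38 \times 9615}{1.29 \times 1270} \approx 2.2,$
which matches the reported 2.21 up to rounding of the tabulated gains and call counts.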
In addition, Figure 6 illustrates how the number of inference calls varies with the threshold δ . As δ increases, more data instances proceed to Step 2 and Step 3, leading to a gradual increase in computation. Nevertheless, ACR continues to require substantially fewer inference calls compared to SV or FOBAR across all threshold settings. These results underscore the advantage of ACR’s selective and collective evaluation strategy, which not only minimizes computational overhead but also preserves robust and stable performance.

7. Error Cases Analysis

In this section, we analyze instances where ACR fails to improve accuracy and categorize the main error types to better understand its limitations. We identify five recurring patterns:
  1. Failure to generate the correct answer: If the model fails to generate the correct answer in any of the SC reasoning paths, ACR cannot recover it during re-scoring. This represents a fundamental limitation of SC: if the correct answer is absent from the candidate set, no selection mechanism—ACR or otherwise—can succeed. Table 5 shows how often such unrecoverable cases occur.
  2. Consistently ambiguous cases: In some examples, both SC and SO assign similarly low confidence score gaps across multiple candidates, indicating persistent uncertainty throughout the pipeline. In such cases, ACR struggles to distinguish the correct answer from other plausible but incorrect ones.
  3. Correct answer favored in SO, but final prediction incorrect: Even when SO correctly assigns high confidence to the correct answer, the final prediction can still be wrong due to residual influence from SC.
  4. Overconfidence in an incorrect answer: SO occasionally assigns a disproportionately high confidence score to an incorrect candidate, even when SC’s distribution is relatively balanced. As discussed in Section 6.3, this illustrates how unreliable SO outputs can undermine ACR’s overall performance. These cases show that re-scoring may inject spurious certainty, leading ACR to reinforce an initially weak—but ultimately incorrect—prediction.
  5. Correct answer in SC, incorrect shift in SO: In these cases, SC initially assigns the highest confidence score to the correct answer, but SO shifts the score distribution toward an incorrect one, resulting in a wrong prediction. This highlights the risk of re-scoring introducing harmful bias—even when the SC result was already correct. As in the previous case, this underscores the importance of SO reliability to the overall performance of ACR. Table 6 shows how often errors of types 3–5 lead ACR to make incorrect predictions.

8. Conclusions

We introduced Adaptive Confidence Re-scoring (ACR), a lightweight and effective approach for improving answer selection in SC. While SC often generates the correct answer among its candidates, its reliance on majority voting can lead to the exclusion of accurate but minority predictions. ACR addresses this limitation by selectively evaluating candidate answers based on the confidence score distribution—without requiring exhaustive evaluation or incurring substantial additional cost. Experimental results across multiple datasets and models show that ACR consistently outperforms SC. While verification-based methods such as SV and FOBAR also improve upon SC, they require significantly more inference calls. In contrast, ACR achieves performance that is comparable to or better than SV/FOBAR, while requiring far fewer calls, demonstrating strong cost-effectiveness. Moreover, ACR remains robust across a range of threshold values, indicating that it does not require extensive hyperparameter tuning for practical deployment. These findings position ACR as a scalable and efficient solution for answer selection, providing a favorable trade-off between accuracy and computational cost.

9. Limitations

While ACR demonstrates strong performance and efficiency across multiple reasoning tasks, several limitations remain. First, ACR relies on the assumption that the correct answer is present among the candidate answers generated by SC. As discussed in Section 7, when the model fails to generate the correct answer in any reasoning path, ACR cannot recover it, since its re-scoring mechanism operates only over the existing candidates. Second, the effectiveness of ACR depends on the quality of the SO stage, which reformulates answer selection as a classification task. As shown in Section 6.3 and Section 7, SO can occasionally assign high confidence to incorrect answers or shift the final decision away from the correct one. These cases highlight the need for more effective prompting strategies within SO. In future work, we plan to explore more sophisticated prompt designs to improve the reliability of the evaluation stage. Third, although ACR reduces inference cost relative to SV and FOBAR, it still requires additional sampling beyond SC, particularly in the SO stage. In resource-constrained environments, even this reduced overhead may be non-trivial—especially in large-scale deployments. Finally, ACR currently uses a fixed scoring function (e.g., Sum), which may not be optimal across all datasets or model behaviors. Developing more adaptive or learnable scoring functions remains a promising direction for future research.

Author Contributions

Conceptualization, E.J. and Y.S.C.; methodology, E.J.; software, E.J.; validation, E.J. and Y.S.C.; formal analysis, E.J.; investigation, E.J.; resources, E.J.; data curation, E.J.; writing—original draft preparation, E.J.; writing—review and editing, E.J. and Y.S.C.; visualization, E.J.; supervision, Y.S.C.; project administration, E.J.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant (No. RS-2025-25422680, No. RS-2020-II201373), and the National Research Foundation of Korea (NRF) grant (No. RS-2025-00520618) funded by the Korean Government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

GSM8K: https://github.com/openai/grade-school-math/tree/master/grade_school_math/data (accessed on 1 February 2025); TabMWP: https://github.com/lupantech/PromptPG/tree/main/data/tabmwp (accessed on 1 February 2025); MATH500: https://huggingface.co/datasets/HuggingFaceH4/MATH-500 (accessed on 1 July 2025); Reasoning about Colored Objects and Word Sorting: https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/bbh (accessed on 18 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. 2024. Available online: http://arxiv.org/abs/2303.08774 (accessed on 1 August 2025).
  2. Anthropic. claude-3-7-sonnet. 2025. Available online: https://www.anthropic.com/news/claude-3-7-sonnet (accessed on 26 July 2025).
  3. Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025. Available online: http://arxiv.org/abs/2507.06261 (accessed on 1 August 2025).
  4. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  5. Li, T.; Ma, X.; Zhuang, A.; Gu, Y.; Su, Y.; Chen, W. Few-shot In-context Learning on Knowledge Base Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023 (Volume 1: Long Papers); Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 6966–6980. [Google Scholar]
  6. Coda-Forno, J.; Binz, M.; Akata, Z.; Botvinick, M.; Wang, J.X.; Schulz, E. Meta-in-context learning in large language models. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  7. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; ichter, b.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  8. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 22199–22213. [Google Scholar]
  9. Fu, Y.; Peng, H.; Sabharwal, A.; Clark, P.; Khot, T. Complexity-Based Prompting for Multi-step Reasoning. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  10. Liu, Y.; Peng, X.; Du, T.; Yin, J.; Liu, W.; Zhang, X. ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024 (Volume 1: Long Papers); Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 8780–8794. [Google Scholar]
  11. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  12. Li, Y.; Yuan, P.; Feng, S.; Pan, B.; Wang, X.; Sun, B.; Wang, H.; Li, K. Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  13. Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large Language Models are Better Reasoners with Self-Verification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2550–2575. [Google Scholar]
  14. Jiang, W.; Shi, H.; Yu, L.; Liu, Z.; Zhang, Y.; Li, Z.; Kwok, J. Forward-Backward Reasoning in Large Language Models for Mathematical Verification. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 6647–6661. [Google Scholar]
  15. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training Verifiers to Solve Math Word Problems. 2021. Available online: http://arxiv.org/abs/2110.14168 (accessed on 1 August 2025).
  16. Lu, P.; Qiu, L.; Chang, K.W.; Wu, Y.N.; Zhu, S.C.; Rajpurohit, T.; Clark, P.; Kalyan, A. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  17. Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; Cobbe, K. Let’s Verify Step by Step. 2023. Available online: http://arxiv.org/abs/2305.20050 (accessed on 1 August 2025).
  18. Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.; Chi, E.; Zhou, D.; et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13003–13051. [Google Scholar]
  19. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. 2025. Available online: http://arxiv.org/abs/2412.15115 (accessed on 1 August 2025).
  20. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. 2025. Available online: http://arxiv.org/abs/2505.09388 (accessed on 1 August 2025).
  21. Mistral-AI. Mistral-Small-3.1. 2025. Available online: https://mistral.ai/news/mistral-small-3 (accessed on 26 July 2025).
  22. Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.b.; Soricut, R.; Lazaridou, A.; Firat, O.; Schrittwieser, J.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
  23. Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P.F. Learning to summarize with human feedback. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 3008–3021. [Google Scholar]
  24. Sessa, P.G.; Dadashi, R.; Hussenot, L.; Ferret, J.; Vieillard, N.; Ramé, A.; Shariari, B.; Perrin, S.; Friesen, A.; Cideron, G.; et al. BOND: Aligning LLMs with Best-of-N Distillation. 2024. Available online: http://arxiv.org/abs/2407.14622 (accessed on 1 August 2025).
  25. Huang, B.; Lu, S.; Wan, X.; Duan, N. Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024 (Volume 1: Long Papers); Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1429–1450. [Google Scholar]
  26. Wang, A.; Song, L.; Tian, Y.; Peng, B.; Jin, L.; Mi, H.; Su, J.; Yu, D. Self-Consistency Boosts Calibration for Math Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 6023–6029. [Google Scholar]
  27. Aggarwal, P.; Madaan, A.; Yang, Y.; Mausam. Let’s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 12375–12396. [Google Scholar]
  28. Wang, H.; Prasad, A.; Stengel-Eskin, E.; Bansal, M. Soft Self-Consistency Improves Language Models Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024 (Volume 2: Short Papers); Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 287–301. [Google Scholar]
  29. Zhou, Y.; Zhu, Y.; Antognini, D.; Kim, Y.; Zhang, Y. Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024 (Volume 1: Long Papers); Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2793–2804. [Google Scholar]
  30. Chen, W.; Wang, W.; Chu, Z.; Ren, K.; Zheng, Z.; Lu, Z. Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 14162–14167. [Google Scholar]
  31. Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring Mathematical Problem Solving with the MATH Dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Virtual, 6–14 December 2021; Vanschoren, J., Yeung, S., Eds.; Volume 1. [Google Scholar]
  32. Zheng, C.; Zhou, H.; Meng, F.; Zhou, J.; Huang, M. Large Language Models Are Not Robust Multiple Choice Selectors. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Figure 1. Overview of the ACR Framework. Step 1 generates candidate answers via Self-Consistency, computing initial confidence scores $\mathrm{Conf}_1$. Step 2 evaluates these answers to compute updated scores $\mathrm{Conf}_2$. Step 3 applies re-scoring to obtain the final confidence score $\mathrm{Conf}_{\mathrm{final}}$ and determine the final answer.
Figure 2. Comparison of SC score distributions from Step 1. (a) shows the confidence score distributions for examples A, B, and C, where A = [0.85, 0.15], B = [0.15, 0.20, 0.40, 0.25], and C = [0.45, 0.55]. (b) presents the corresponding standard deviations for each example, with the hyperparameter δ = 0.2 serving as a threshold. If the standard deviation is below δ, the model proceeds with Step 2 (Evaluation) and Step 3 (Re-scoring).
Figure 3. Prompt examples used for GSM8K in our two-stage framework. (a) shows the SC prompt used in Step 1 to generate multiple reasoning paths. (b) shows the SO prompt used in Step 2, where candidate answers from Step 1 are converted into multiple-choice options for re-scoring.
Figure 4. Comparison between SV and our proposed method, ACR. SV performs verification for each unique candidate answer generated by SC—5, 2, 4, and 9 in this example—conducting n = 5 inference calls per candidate. This results in a total of len(unique candidates) × n = 4 × 5 = 20 additional calls. In contrast, ACR adopts a collective evaluation strategy, requiring only n = 5 inference calls in total, regardless of the number of candidate answers. This design significantly reduces computational cost while preserving answer quality.
Figure 5. Performance variation of ACR across three datasets as the threshold δ changes. (a) Results on GSM8K dataset, (b) results on TabMWP dataset, and (c) results on Color dataset. Solid lines represent ACR accuracy at different δ values. On all datasets, ACR demonstrates stable performance across a wide range of thresholds, indicating its robustness to δ .
Figure 6. Inference call variation of ACR across three datasets as the threshold δ changes. (a) Results on GSM8K dataset, (b) results on TabMWP dataset, and (c) results on Color dataset. Solid lines represent the number of inference calls made by ACR at different δ values, while dashed lines indicate the fixed inference cost of SV. Across all datasets, ACR consistently requires significantly fewer inference calls than SV, demonstrating its efficiency under varying threshold settings.
Table 1. Performance comparison across models and datasets. All values represent accuracy (%), with the best results shown in bold and second-best underlined. We compare baseline methods (SC, SV, and FOBAR) with our proposed method ACR, evaluated using four different scoring strategies. The δ value indicates the threshold used to decide whether ACR proceeds to Step 2 (Evaluation) and Step 3 (Re-scoring). Among the scoring variants, Sum is adopted as the default for comparison, due to its simplicity and performance. ACR consistently outperforms or closely matches the best-performing baseline across all settings. ACR variants are shaded.
Setting | Model | Dataset | SC | SV | FOBAR | ACR (Sum) | ACR (Product) | ACR (Fobar) | ACR (SC-Boost) | δ
Few-shot | Qwen2.5-7B | GSM8K | 91.96 | 91.81 | 93.25 | 92.34 | 91.96 | 92.49 | 92.72 | 0.30
Few-shot | Qwen2.5-7B | TabMWP | 90.40 | 91.73 | 92.53 | 91.73 | 92.00 | 92.00 | 91.73 | 0.30
Few-shot | Qwen2.5-7B | Color | 91.60 | 91.20 | 92.40 | 92.00 | 92.00 | 92.00 | 91.60 | 0.30
Few-shot | Gemini-1.5-Flash-8B | GSM8K | 86.88 | 87.26 | 87.64 | 88.40 | 88.17 | 88.48 | 88.25 | 0.40
Few-shot | Gemini-1.5-Flash-8B | TabMWP | 83.36 | 83.49 | 84.02 | 84.69 | 84.29 | 84.55 | 84.42 | 0.40
Few-shot | Gemini-1.5-Flash-8B | Color | 94.00 | 94.40 | 94.80 | 95.20 | 95.60 | 95.60 | 95.60 | 0.20
Few-shot | Gemini-1.5-Flash-8B | Word | 66.80 | 62.80 | 68.40 | 67.60 | 67.20 | 67.60 | 67.20 | 0.05
Few-shot | Qwen3-14B | GSM8K | 93.71 | 93.93 | 93.86 | 94.24 | 93.93 | 94.09 | 94.01 | 0.40
Few-shot | Qwen3-14B | TabMWP | 95.87 | 97.33 | 97.07 | 97.07 | 97.20 | 97.20 | 97.07 | 0.40
Few-shot | Qwen3-14B | Color | 98.80 | 98.00 | 98.80 | 98.80 | 98.80 | 98.80 | 98.80 | 0.40
Few-shot | Mistral-Small-24B | GSM8K | 94.84 | 94.77 | 94.92 | 94.91 | 94.91 | 94.91 | 95.29 | 0.30
Few-shot | Mistral-Small-24B | TabMWP | 92.80 | 89.33 | 90.93 | 91.87 | 91.87 | 91.87 | 91.87 | 0.30
Few-shot | Mistral-Small-24B | Color | 98.40 | 97.60 | 97.60 | 98.00 | 98.00 | 98.00 | 98.00 | 0.30
Zero-shot | Gemini-2.5-Flash | MATH500 | 91.60 | 91.00 | 91.20 | 92.00 | 91.40 | 91.40 | 91.60 | 0.30
Table 2. Accuracy (%) comparison across different answer selection methods. ACR ($\mathrm{Conf}_1$) ranks candidate answers based on SC confidence, while ACR (Random) uses a random order. All methods are evaluated with M = 20, N = 5, and K = 4.
Model | SC | SV | FOBAR | ACR ($\mathrm{Conf}_1$) | ACR (Random)
Qwen2.5-7B | 91.96 | 91.81 | 93.25 | 92.79 (δ = 0.25) | 92.87 (δ = 0.25)
Mistral-Small-24B | 94.84 | 94.77 | 94.92 | 95.29 (δ = 0.25) | 95.06 (δ = 0.25)
Table 3. Comparison of accuracy (%) between SV and SO across datasets and models. While SO is applied selectively in ACR, here it is applied uniformly to all examples for a fair comparison with SV.
Dataset | Model | SV | SO
GSM8K | Qwen2.5-7B | 91.81 | 89.83
GSM8K | Gemini-1.5-Flash-8B | 87.26 | 87.79
GSM8K | Qwen3-14B | 93.93 | 93.86
GSM8K | Mistral-Small-24B | 94.77 | 94.00
TabMWP | Qwen2.5-7B | 91.73 | 87.60
TabMWP | Gemini-1.5-Flash-8B | 83.49 | 84.02
TabMWP | Qwen3-14B | 97.33 | 96.53
TabMWP | Mistral-Small-24B | 89.33 | 89.20
Color | Qwen2.5-7B | 91.20 | 91.20
Color | Gemini-1.5-Flash-8B | 94.40 | 90.00
Color | Qwen3-14B | 98.00 | 98.00
Color | Mistral-Small-24B | 97.60 | 97.20
MATH500 | Gemini-2.5-Flash | 91.00 | 91.80
Table 4. Comparison of accuracy gain and inference cost relative to SC. Both FOBAR and ACR results are based on n = 5 generations. For ACR, inference calls are computed using the δ value that yields the highest accuracy (see Table 1). “# of samples” refers to the number of data instances on which BR was performed in FOBAR, and the number of instances that proceeded to Step 2 and Step 3 in ACR, respectively. “Reduction” under “Inference Calls” indicates the percentage decrease in inference calls by ACR compared to FOBAR. “Accuracy Gain” denotes the increase in accuracy over SC. “Efficiency” is calculated using Equation (8), showing how much more efficient ACR is compared to FOBAR. We omit efficiency values when either FOBAR or ACR fails to outperform SC, as neither method is considered efficient in such cases.
Dataset | Model | # of Samples (FOBAR) | # of Samples (ACR) | Inference Calls (FOBAR) | Inference Calls (ACR) | Reduction (%) | Accuracy Gain FOBAR (%) | Accuracy Gain ACR (%) | Efficiency
GSM8K | Qwen2.5-7B | 665 | 254 | 9615 | 1270 | 86.79 | 1.29 | 0.38 | 2.21
GSM8K | Gemini-1.5-Flash-8B | 403 | 318 | 6685 | 1590 | 76.22 | 0.76 | 1.52 | 8.41
GSM8K | Qwen3-14B | 104 | 90 | 1345 | 450 | 66.54 | 0.15 | 0.53 | 10.52
GSM8K | Mistral-Small | 277 | 93 | 3605 | 465 | 87.10 | 0.08 | 0.07 | 7.04
TabMWP | Qwen2.5-7B | 288 | 143 | 4020 | 715 | 82.21 | 2.13 | 1.33 | 3.52
TabMWP | Gemini-1.5-Flash-8B | 214 | 178 | 3500 | 890 | 74.57 | 0.66 | 1.33 | 7.91
TabMWP | Qwen3-14B | 94 | 74 | 1310 | 370 | 71.76 | 1.20 | 1.20 | 3.53
TabMWP | Mistral-Small | 219 | 100 | 3155 | 500 | 84.15 | −1.87 | −0.93 | -
Color | Qwen2.5-7B | 87 | 31 | 1020 | 155 | 84.80 | 0.80 | 0.40 | 3.29
Color | Gemini-1.5-Flash-8B | 40 | 9 | 535 | 45 | 91.59 | 0.80 | 1.20 | 17.83
Color | Qwen3-14B | 11 | 8 | 120 | 40 | 66.67 | 0.00 | 0.00 | -
Color | Mistral-Small | 44 | 4 | 460 | 20 | 95.65 | −0.80 | −0.40 | -
Table 5. Proportion of incorrect predictions made by SC (Incorrect (%)), and within those, the percentage of cases where the correct answer was entirely missing from the generated candidate set (Gold Missing (%)). The latter represents cases fundamentally unrecoverable by rescoring methods like ACR.
Dataset | Model | Incorrect (%) | Gold Missing (%)
GSM8K | Qwen2.5-7B | 8.04 | 31.13
GSM8K | Gemini-1.5-Flash-8B | 13.12 | 41.04
GSM8K | Qwen3-14B | 6.29 | 63.86
GSM8K | Mistral-Small | 5.16 | 42.65
TabMWP | Qwen2.5-7B | 9.60 | 9.72
TabMWP | Gemini-1.5-Flash-8B | 16.64 | 51.20
TabMWP | Qwen3-14B | 4.13 | 29.03
TabMWP | Mistral-Small | 7.20 | 29.63
Color | Qwen2.5-7B | 8.40 | 4.76
Color | Gemini-1.5-Flash-8B | 6.00 | 0.00
Color | Qwen3-14B | 1.20 | 33.33
Color | Mistral-Small | 1.60 | 0.00
Table 6. Breakdown of predictions by SC and ACR across all dataset–model pairs. Proportions are reported as percentages of the total evaluation samples. “SC O & ACR O” indicates both methods predicted correctly; “SC O & ACR X” denotes degradation by ACR; “SC X & ACR O” indicates improvement by ACR; and “SC X & ACR X” refers to cases where both methods failed.
Dataset | Model | SC O & ACR O (%) | SC O & ACR X (%) | SC X & ACR O (%) | SC X & ACR X (%)
GSM8K | Qwen2.5-7B | 90.14 | 1.82 | 2.20 | 5.84
GSM8K | Gemini-1.5-Flash-8B | 85.60 | 1.28 | 2.81 | 10.31
GSM8K | Qwen3-14B | 93.18 | 0.53 | 1.06 | 5.23
GSM8K | Mistral-Small | 94.23 | 0.61 | 0.68 | 4.48
TabMWP | Qwen2.5-7B | 88.27 | 2.13 | 3.47 | 6.13
TabMWP | Gemini-1.5-Flash-8B | 81.63 | 1.73 | 3.06 | 13.58
TabMWP | Qwen3-14B | 95.20 | 0.67 | 1.87 | 2.26
TabMWP | Mistral-Small | 91.07 | 1.73 | 0.80 | 6.40
Color | Qwen2.5-7B | 90.40 | 1.20 | 1.60 | 6.80
Color | Gemini-1.5-Flash-8B | 94.00 | 0.00 | 1.20 | 4.80
Color | Qwen3-14B | 98.80 | 0.00 | 0.00 | 1.20
Color | Mistral-Small | 98.00 | 0.40 | 0.00 | 1.60
