Recall Mechanism and Multi-Head Attention for Numerical Reasoning

Lai, Linjia; Tan, Tien-Ping; Zeng, Bocan

doi:10.3390/app15073528

by Linjia Lai ^1,2,3, Tien-Ping Tan ^2,* and Bocan Zeng ^2

^1 New Engineering Industry College, Putian University, Putian 351100, China
^2 School of Computer Sciences, Universiti Sains Malaysia, Penang 11800, Malaysia
^3 Putian Electronic Information Industry Technology Research Institute, Putian University, Putian 351100, China
^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(7), 3528; https://doi.org/10.3390/app15073528

Submission received: 17 February 2025 / Revised: 14 March 2025 / Accepted: 16 March 2025 / Published: 24 March 2025

Abstract

Numerical reasoning is a challenging question-answering task in artificial intelligence (AI) that requires both reading comprehension and numerical computation capabilities. Although recent approaches have made significant progress in reasoning, two critical issues remain: (1) information tends to be gradually lost as the network deepens due to the complexity of deep learning models, and (2) the performance of multi-step reasoning is suboptimal, leaving room for improvement in accuracy. To address these issues, we propose a model with a recall mechanism and multi-head attention for numerical reasoning (RMMANR). The recall mechanism prevents the embedding information of questions and passages from being forgotten as the model deepens, while the multi-head attention mechanism analyzes possible solutions for numerical reasoning. RMMANR consists of two main components: an encoder and a decoder. The encoder leverages RoBERTa to encode the question and passage into contextual embeddings. The decoder, which consists of four modules (RM Module, Selector, MA Module, and Program Solver), generates inference steps and answers based on contextual information from the encoder. We implement our model using PyTorch and evaluate it on the FINQA dataset, a benchmark for numerical reasoning in the financial domain. Experimental results demonstrate that RMMANR outperforms several baseline models, achieving superior accuracy.

Keywords: artificial intelligence; financial report analysis; natural language processing; numerical reasoning; multi-head attention; recall mechanism

1. Introduction

Numerical reasoning refers to the ability to understand and work with numbers, make sense of quantitative information, and perform mathematical operations. In numerical reasoning, a person or machine answers a question that typically draws on numerical data in a given document. It encompasses a range of mathematical operations, such as arithmetic, algebra, and ratios. While simple numerical reasoning tasks, such as math word problems (MWPs) [1], typically involve a few arithmetic operations, more complex tasks, such as those in the financial domain, require analyzing heterogeneous data sources like financial reports to answer intricate questions and derive their solutions [2,3]. In healthcare, numerical reasoning aids in interpreting patient data (such as medical images and clinical readings), diagnosing conditions, and making treatment decisions. Similarly, applications in information technology may involve extracting patterns and insights from large datasets [4].

An example of numerical reasoning in the financial domain is illustrated in Figure 1. Here, the task is to determine the percentage change in cash flow hedges from 2010 to 2011 based on a given passage. Solving such problems requires identifying relevant operands and deducing the correct operators from the passage, based on the question. Even competitive models like ChatGPT-4 struggle with numerical reasoning exams [5]. FINQA is the first dataset of its kind that handles complex QA tasks based on real financial documents. It consists of 8281 question–answer pairs, collaboratively constructed by 11 finance professionals using earnings reports from S&P 500 companies. Answering questions in the FINQA dataset involves many common financial analysis calculations, such as addition, subtraction, multiplication, division, averaging, summation, comparison and so on [2].

The encoder–decoder architecture has emerged as a widely adopted deep neural network architecture for solving numerical reasoning tasks due to its ability to derive mathematical equations from textual inputs. As depicted in Figure 2, this architecture typically employs an encoder to transform a question and a document into vector representations, which are then processed by a decoder to produce the corresponding equation. More recent models, such as FINQA [2], ELASTIC [6], and SoarGraph [7], leverage pre-trained language models like RoBERTa to encode problems and documents, significantly improving performance. In terms of the decoder, early approaches, such as the deep neural solver (DNS) [8], utilized LSTMs, while later advancements introduced tree decoders to generate prefix traversal sequences of expression trees [9]. In the financial domain, FINQA proposes a retrieval-based generator QA framework that first retrieves supporting facts from financial reports and then generates executable reasoning programs to answer questions [2].

Despite these advancements, two significant challenges persist in numerical reasoning: (1) information loss as network depth increases, leading to the gradual dissipation of embedding vectors for questions and passages, and (2) suboptimal performance in multi-step reasoning tasks, leaving substantial room for accuracy improvement. To address these issues, we propose a model with a recall mechanism and multi-head attention for numerical reasoning (RMMANR). RMMANR consists of two main components: an encoder and a decoder. The encoder leverages RoBERTa to encode the question and passage into contextual embeddings. The decoder, which consists of four modules (RM Module, Selector, MA Module, and Program Solver), generates reasoning steps and answers based on contextual information from the encoder. The main contribution of our work is three-fold:

We propose a recall mechanism for numerical reasoning, which refocuses on the embedding information of questions and passages before generating operators and operands. This mechanism prevents the loss of critical information as the model deepens, ensuring that the initial input embeddings are retained throughout the network.
We propose multi-head attention for numerical reasoning, which enhances the model’s ability to handle complex, multi-step reasoning tasks. By efficiently extracting multiple relevant information fragments, this mechanism optimizes the solution process for multi-step problems.
We evaluate our proposed model, RMMANR (recall mechanism and multi-head attention for numerical reasoning), on the FINQA dataset [2]. Experiment results show that RMMANR outperforms several baseline models, achieving superior accuracy and robustness.

2. Related Works

In this section, we review existing research on numerical reasoning, focusing on three key aspects: datasets, question and document representation, and reasoning modules.

2.1. Datasets

Numerous datasets have been developed to evaluate numerical reasoning capabilities. An early example, Math23K, consists of 23,161 Chinese math word problems curated from online educational platforms and refined using rule-based extraction methods [10]. DROP, proposed by Dua et al., emphasizes discrete reasoning over paragraphs and requires more complex arithmetic operations [11]. Recently, domain-specific datasets like FINQA [2] and TAT-QA [12,13] have been introduced. FINQA contains 8281 financial question–answer pairs derived from S&P 500 company reports, while TAT-QA features 16,552 questions based on 2757 hybrid contexts of tables and textual descriptions. These datasets focus on numerical reasoning in financial domains and support the development of explainable models.

2.2. Question and Document Representation

Early approaches to numerical parsing relied on language parsers, such as the Stanford CoreNLP suite, to identify variables, operands, and their relationships in questions and documents [14,15]. With the rise of deep learning, word embeddings became the dominant method for representing questions and documents. Early embeddings, such as skip-gram, CBOW, and GloVe, used static word vectors, where each token was represented by a single vector [16,17,18]. However, these approaches struggled with polysemy, because a single word can have multiple meanings that one static vector cannot distinguish.

To address this limitation, contextualized word embeddings were introduced. Early efforts, such as ELMo, used bidirectional LSTMs to generate context-aware embeddings [19]. The advent of transformer-based models, such as BERT, RoBERTa, and LLaMA [20], revolutionized the field by enabling dynamic, context-sensitive representations. These models have been widely adopted in numerical reasoning tasks. For example, neural-symbolic reader (NeRd) uses BERT as its encoder [21], while models like QDGAT [22], MHST [23], and ELASTIC [6] leverage RoBERTa for encoding questions and documents. Additionally, models like LayoutLM and LayoutLMv2 [24] extend these approaches to visually rich documents by incorporating text, image, and layout information. When the reasoning process involves images within a document, certain image feature techniques can also be applied. For example, Li et al. use the combined channel and spatial attention (CCSA) mechanism, which emphasizes capturing both spatial and channel-wise features while effectively preserving essential high-frequency details [25].

2.3. Reasoning Module

The reasoning module in deep neural networks typically incorporates a decoder, which can be based on feed-forward neural networks (FFNNs), recurrent neural networks (RNNs), multi-head attention, transformers, or graph neural networks (GNNs). Table 1 summarizes the performance of various approaches.

Feed-Forward Neural Networks: Models like MHST use multi-head classifiers to identify arithmetic operations and relevant numbers from questions and documents [23]. GDS introduces a goal-driven approach to generate expression trees by recursively decomposing target outcomes into subgoals [9]. FinMath adopts a similar strategy for financial numerical reasoning [26].

Recurrent Neural Networks (RNNs): The deep neural solver (DNS) combines an encoder–decoder architecture with a similarity-based retrieval model, achieving accuracies of 64.7% and 70.1% on Math23K and Alg514, respectively [10]. NeRd integrates BERT for encoding and an LSTM-based programmer for generating multi-step reasoning programs [21]. ELASTIC employs a GRU-based decoder with an attention layer and a memory register, achieving 68.96% execution accuracy on the FINQA dataset [6].

Transformers: Transformer-based decoders, such as those used by Wang et al., generate mathematical expression trees but often underperform compared to LSTM-based decoders [27]. Ensemble models combining multiple encoder–decoder architectures have shown improved results on datasets like Math23K.

Graph Neural Networks (GNNs): Doc2SoarGraph converts embeddings into hierarchical graphs (e.g., quantity comparison, text relation, and date comparison graphs) and processes them using a graph convolutional neural network (GCNN) [7]. QDGAT constructs graphs from questions and documents, using a question-directed layer to simulate reasoning processes [22].

Prompting: Prompting is a technique used to guide large language models (LLMs) to generate more accurate and structured responses without modifying the model’s parameters. By carefully designing the input query, users can significantly enhance the model’s reasoning ability in numerical tasks. There are several strong prompting techniques that improve LLM performance in numerical reasoning. Chain-of-thought (CoT) [28] demonstrates how step-by-step reasoning enhances model performance. Xuezhi Wang et al. proposed self-consistency in “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, where multiple reasoning paths are sampled to select the most consistent answer [29]. Program-aided language (PAL) models explore integrating code execution with language models, enabling precise calculations by generating and running code [30].

Despite these advancements, challenges such as information loss in deep networks and suboptimal multi-step reasoning in numerical reasoning remain significant. Inspired by ELASTIC [6], we propose a recall mechanism and multi-head attention for numerical reasoning (RMMANR) to address these limitations.

3. Methods

Deep neural networks often suffer from the gradual loss of embedding vectors for questions and passages as the network depth increases. To address this issue, we propose a recall mechanism (RM) for numerical reasoning. In memory and cognitive psychology, a recall mechanism refers to the process by which the brain retrieves stored information from long-term memory. This mechanism allows individuals to access past experiences, facts, or learned skills. Inspired by the human tendency to review relevant information before making a decision, RM incorporates a recall module that revisits question and passage embeddings before generating operators and operands. In most models, the Reasoning Module passes the guidance vectors, obtained from the embedding vectors of questions and passages, directly to the Generator Module for operator and operand generation; RM instead introduces an additional recall step. Specifically, as shown in Figure 3, the Recall Module is placed between the Reasoning Module and the Generator Module, enabling a second access to the question and passage embeddings. This additional retrieval step ensures that the guidance vectors contain richer contextual information, leading to more accurate operator and operand generation. By integrating this mechanism, RM prevents the degradation of initial input embeddings, enhancing numerical reasoning accuracy.

On the other hand, many existing models struggle with multi-step reasoning tasks [2,6]. To address this challenge, we propose multi-head attention for numerical reasoning. As shown in Figure 4, the MA Module consists of a multi-head attention reasoning module and a Generator Module. Multi-head attention enables the model to focus on different parts of the input sequence simultaneously, facilitating parallel processing of multiple reasoning steps. This capability is particularly beneficial for complex numerical reasoning tasks, as it efficiently extracts relevant information fragments and optimizes the solution process for multi-step problems.

RMMANR combines the strengths of the RM and MA Modules. As illustrated in Figure 5, the question and passage embeddings are first processed by the RM Module, which generates initial operators $OP_{RM}$ and operands $OE_{RM}$. The Selector then calculates $stepNum$ based on $OP_{RM}$. If $stepNum = 1$, the Selector passes $OP_{RM}$ and $OE_{RM}$ to the Program Solver. Otherwise, the MA Module is activated to generate $OP_{MA}$ and $OE_{MA}$, which are then passed to the Program Solver for execution.

3.1. Task Definition

All the notations are defined in Table 2, and these definitions align with those used in ELASTIC [6]. Given a question Q and a passage P, the task is to construct a numerical reasoning program R, which consists of a sequence of symbols s (operators $OP$ and operands $OE$). Operands are derived from constants $CONS$ or numbers $NUM$ in the input. The program R is structured as:

$R = \{ op_{i} [oe_{j}^{i}]_{j = 0}^{m - 1} \}_{i = 0}^{n - 1}$

(1)

where $op_{i}$ is the i-th operator and $oe_{j}^{i}$ are its associated operands. The program R represents the reasoning process and can be executed to derive the final answer. Appendix A gives an example of a financial question-answering problem expressed in the notation defined in Table 2.
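To make the structure in Equation (1) concrete, the following minimal sketch represents a program as a list of subprograms, each holding one operator and its operands. The class name `SubProgram`, the helper `flatten`, the operator names, and the numbers are illustrative assumptions and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubProgram:
    """One step of R: an operator op_i followed by its operands oe_j^i."""
    op: str
    operands: List[str]  # numbers, constants, or "#k" references to earlier results


def flatten(program: List[SubProgram]) -> List[str]:
    """Flatten R into its symbol sequence: each operator, then its operands."""
    symbols: List[str] = []
    for sub in program:
        symbols.append(sub.op)
        symbols.extend(sub.operands)
    return symbols


# Hypothetical two-step program (n = 2, m = 2): relative change from 80 to 100.
program = [
    SubProgram("subtract", ["100", "80"]),
    SubProgram("divide", ["#0", "80"]),  # "#0" refers to the result of step 0
]
print(flatten(program))
```

The flattened list is the symbol sequence s that the decoder emits one element at a time.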

3.2. Proposed RMMANR

As illustrated in Figure 5, our proposed model, RMMANR, consists of two main components: an encoder and a decoder. The encoder transforms the question and passage into contextual embeddings. The decoder comprises four modules: RM Module, Selector, MA Module, and Program Solver. The process begins with the question and passage embeddings being fed into the RM Module, which generates initial operators and operands. The Selector then determines whether to activate the MA Module based on the complexity of the task (i.e., the number of reasoning steps). Finally, the Program Solver executes the generated program to produce the final result.

3.3. Encoder

As shown in Figure 5, the encoder in RMMANR is implemented using RoBERTa, which transforms the combined sequence of the question Q and passage P into contextual embeddings h. Specifically, $h^{q}$ represents the embedding of the question. These embeddings capture the semantic and contextual information necessary for numerical reasoning. The decoder then uses h to generate the numerical reasoning program R and derive the final result.

RMMANR is encoder-agnostic, meaning that any model capable of providing sequence-context vectors, such as BERT, RoBERTa, or LLaMA, can be used as the encoder. We chose RoBERTa primarily due to its efficiency in memory usage and its competitive performance in capturing contextual relationships, although other models with similar functionality could also be applied.

3.4. Decoder

As shown in Figure 5, the decoder in RMMANR is responsible for generating the numerical reasoning program R based on the contextual embeddings h provided by the encoder. It consists of four modules: RM Module, Selector, MA Module, and Program Solver. The decoding process begins with the RM Module, which generates initial operators and operands, followed by the Selector determining whether to activate the MA Module for multi-step reasoning. Finally, the Program Solver executes the generated program to produce the final result.

3.4.1. Decoding Vocabulary and Token Embedding

The decoding vocabulary consists of two primary elements: operators $OP$ and operands $OE$. Operands are further divided into constants $CONS$ (e.g., mathematical constants like $\pi$) and numbers $NUM$ extracted from the input. The embedding retrieval function $E_{op, cons, num}(s)$ maps a symbol s in the decoding vocabulary to its corresponding embedding $e_{s}$. Specifically, the embedding for a symbol s is defined as:

$e_{s} = \begin{cases} E_{op}(s) & \text{if } s \in OP, \\ E_{cons}(s) & \text{if } s \in CONS, \\ E_{num}(s) = h_{i} & \text{if } s \in NUM, \end{cases}$

(2)

where $E_{op}(s)$, $E_{cons}(s)$, and $E_{num}(s)$ are the embedding retrieval functions for operators, constants, and numbers, respectively.
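The three cases of Equation (2) can be sketched as a single lookup function. This is an illustration under assumptions: the table names, the toy two-dimensional vectors, the constant name `const_100`, and the `num_positions` map from a number to its token index i are all hypothetical. The one point taken from the equation is that operators and constants use learned embeddings while a number reuses the encoder output $h_{i}$ at its position.

```python
from typing import Dict, List

Vector = List[float]


def embed_symbol(
    s: str,
    op_table: Dict[str, Vector],     # learned operator embeddings, E_op
    cons_table: Dict[str, Vector],   # learned constant embeddings, E_cons
    num_positions: Dict[str, int],   # assumed map: number symbol -> token index i
    h: List[Vector],                 # encoder outputs, one vector per input token
) -> Vector:
    """Return e_s following the three cases of Equation (2)."""
    if s in op_table:
        return op_table[s]
    if s in cons_table:
        return cons_table[s]
    if s in num_positions:
        return h[num_positions[s]]   # E_num(s) = h_i: reuse the contextual vector
    raise KeyError(f"symbol {s!r} is not in the decoding vocabulary")


# Toy example with made-up values.
op_table = {"add": [1.0, 0.0]}
cons_table = {"const_100": [0.0, 1.0]}
h = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
num_positions = {"80": 2}
print(embed_symbol("80", op_table, cons_table, num_positions, h))
```

Because number embeddings come from the encoder, they change with every input passage, whereas operator and constant embeddings are shared across inputs.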

3.4.2. RM Module

The RM Module applies a recall mechanism to refocus on the embedding information from the question and passage before generating operators and operands, ensuring that the initial input embeddings are not lost as the model deepens. As illustrated in Figure 3, the RM Module is composed of three modules: the Reasoning Module, the Recall Module, and the Generator Module. The Recall Module receives the initial guidance vector $g_{initial\_op}$ from the Reasoning Module, recalls the relevant information $h^{q}$, and outputs $g_{final\_op}$, which directs the operator generator to produce the operator $op_{t}$. Concurrently, the Recall Module receives the initial guidance vector $g_{initial\_oe}$ from the Reasoning Module, recalls the relevant information h, and outputs $g_{final\_oe}$. Next, the operator generator pauses and passes $g_{final\_op}$ and $op_{t}$ to the operand generator, which generates the first operand $oe_{t}^{1}$. Once the operator and operands of the subprogram $r_{t}$ are complete, the RM Module moves on to generate the operator and operands of the next subprogram $r_{t + 1}$. After the array of mathematical operators $OP_{RM}$ and the array of operands $OE_{RM}$ have been generated, they are passed to the Selector.

Reasoning Module: As shown in Figure 5, the inputs for the Reasoning Module are the embedding of the previously generated symbol $s_{t - 1}$ and the contextual vector $h^{q}$ from the encoder. The Reasoning Module first calculates the context vector c from the attention weights $a_{i}$ and the normalized vectors of $h_{i}^{q}$:

$c = \sum_{i} a_{i} h_{i}^{q}$

(3)

$a_{i} = \frac{\exp(\mathrm{score}(e_{s_{t - 1}}, h_{i}^{q}))}{\sum_{j} \exp(\mathrm{score}(e_{s_{t - 1}}, h_{j}^{q}))}$

(4)

$\mathrm{score}(e_{s_{t - 1}}, h_{i}^{q}) = e_{s_{t - 1}}^{T} W_{1} W_{2} h_{i}^{q}$

(5)

where $W_{1} \in \mathbb{R}^{h \times h}$, $W_{2} \in \mathbb{R}^{h \times h}$, $h = 1024$, and both weight parameters are trainable. The encoding information from the encoder is summarized into c based on the previously generated symbol $s_{t - 1}$. Next, the Reasoning Module uses a GRU [31] to obtain the initial guidance output $g_{initial\_op}$:

$g_{initial\_op}, H_{t} = \mathrm{GRU}(\mathrm{ReLU}(W_{3} [c : E_{op, cons, num}(s_{t - 1})]), H_{t - 1})$

(6)

where “:” denotes concatenation, ReLU is the activation function, and $W_{3} \in \mathbb{R}^{h \times 2h}$ is a trainable parameter. $H_{t - 1} \in \mathbb{R}^{h}$ represents the GRU’s hidden state from the previous step, and $H_{0}$ is initialized to zero. The Reasoning Module outputs the vector $g_{initial\_op}$. The process of obtaining $g_{initial\_oe}$ is the same as that of obtaining $g_{initial\_op}$, with the main difference being that $g_{initial\_oe}$ is obtained based on h. Although $g_{initial\_op}$ could guide the Generator Module to produce $op$ and $oe$ directly, in the RM Module it is not used directly to generate operators or operands.
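The attention step of Equations (3)–(5) can be sketched in a few lines of plain Python. This is a toy illustration rather than the authors' implementation: the dimension is 2 instead of 1024, the weight matrices are identity matrices rather than trained parameters, and the GRU update of Equation (6) is omitted. The function and variable names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax (Equation (4))."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    query: Vector, states: List[Vector], W_1: Matrix, W_2: Matrix
) -> Tuple[Vector, Vector]:
    """Return (c, a): the context vector and the attention weights."""
    # Equation (5): score = query^T W_1 W_2 h_i
    scores = [dot(query, matvec(W_1, matvec(W_2, h_i))) for h_i in states]
    weights = softmax(scores)
    # Equation (3): c = sum_i a_i h_i
    dim = len(states[0])
    context = [sum(a * h_i[d] for a, h_i in zip(weights, states)) for d in range(dim)]
    return context, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
e_prev = [1.0, 0.0]                  # embedding of the previous symbol s_{t-1}
h_q = [[1.0, 0.0], [0.0, 1.0]]       # two question-token embeddings
c, a = attention_context(e_prev, h_q, identity, identity)
print(a, c)
```

In this toy setting the first token aligns with the query, so it receives the larger weight and dominates the context vector.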

Recall Module: As the network deepens, the information from input questions and passages tends to dissipate in many models. Similar to how humans solve math problems by revisiting the question and passage for verification, the RM Module applies a recall mechanism to refocus on the embedding information before generating operators and operands. This ensures that important question and passage information is retained throughout the model’s depth. When predicting operators, the Recall Module recalls the contextual vector $h^{q}$. Here, $h^{q}$ is the key (K) of the attention layer, and $g_{initial\_op}$ is the query (Q) of the attention layer. The Recall Module calculates the final guidance vector $g_{final\_op}$ from the attention weights $b_{i}$ and the normalized vectors of $h_{i}^{q}$:

$g_{final\_op} = \sum_{i} b_{i} h_{i}^{q}$

(7)

$b_{i} = \frac{\exp(\mathrm{score}(g_{initial\_op}, h_{i}^{q}))}{\sum_{j} \exp(\mathrm{score}(g_{initial\_op}, h_{j}^{q}))}$

(8)

$\mathrm{score}(g_{initial\_op}, h_{i}^{q}) = g_{initial\_op}^{T} W_{4} W_{5} h_{i}^{q}$

(9)

where $W_{4} \in \mathbb{R}^{h \times h}$, $W_{5} \in \mathbb{R}^{h \times h}$, $h = 1024$, and both weight parameters are trainable. The vector $g_{final\_op}$ recalls the embedding vectors from the encoder based on $g_{initial\_op}$. The process of obtaining $g_{final\_oe}$ is the same as that of obtaining $g_{final\_op}$, with the main difference being that $g_{final\_oe}$ is obtained based on h.
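One consequence of Equations (7) and (8) is worth spelling out as a short derivation; it follows from the form of the equations and is not a claim made in the paper. Because the weights $b_{i}$ come from a softmax, they are positive and sum to one, so the final guidance vector is a convex combination of the encoder states:

```latex
b_i > 0, \qquad \sum_i b_i = 1
\quad \Longrightarrow \quad
g_{final\_op} = \sum_i b_i \, h_i^{q}
\in \operatorname{conv}\{ h_1^{q}, \dots, h_n^{q} \}
```

In other words, $g_{initial\_op}$ influences the result only through the weights $b_{i}$, while the content of $g_{final\_op}$ is assembled entirely from the original encoder embeddings. This is one way to read the statement that the recall step retains the input information.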

Generator Module: The Generator Module consists of an operator generator and an operand generator. The guidance vector $g_{final\_op}$ enters the operator generator. The operator generator uses softmax to calculate the most likely i-th operator:

$P(op_{i} \mid g_{final\_op}, E_{op_{t - 1}}) = \mathrm{softmax}(E_{op_{i}}^{T} R_{final\_op})$

(10)

$R_{final\_op} = \mathrm{ReLU}(W_{op} g_{final\_op})$

(11)

The guidance vector $g_{final\_oe}$ enters the operand generator. The operand generator uses softmax to calculate the most likely i-th operand:

$P(oe_{i} \mid g_{final\_oe}, E_{oe_{i - 1}}) = \mathrm{softmax}(E_{oe_{i}}^{T} R_{final\_oe})$

(12)

$R_{final\_oe} = \mathrm{ReLU}(W_{argu} g_{final\_oe})$

(13)
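The generator equations amount to scoring every candidate symbol by the inner product between its embedding and a ReLU-projected guidance vector, then normalizing with softmax. The sketch below illustrates this for operators (Equations (10) and (11)) with made-up two-dimensional values; the function name, the candidate embeddings, and the identity projection are assumptions for illustration only.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def generator_distribution(
    g: Vector, W: Matrix, candidates: Dict[str, Vector]
) -> Dict[str, float]:
    """Softmax over E_s^T ReLU(W g) for every candidate symbol s."""
    r = relu(matvec(W, g))                                   # Equation (11)
    logits = {s: sum(a * b for a, b in zip(e, r)) for s, e in candidates.items()}
    peak = max(logits.values())
    exps = {s: math.exp(v - peak) for s, v in logits.items()}
    total = sum(exps.values())
    return {s: v / total for s, v in exps.items()}           # Equation (10)


W_op = [[1.0, 0.0], [0.0, 1.0]]          # toy projection (identity)
g_final_op = [1.0, -2.0]                 # toy guidance vector
operator_embeddings = {
    "add": [2.0, 0.0],
    "subtract": [0.0, 5.0],
    "divide": [1.0, 1.0],
}
probs = generator_distribution(g_final_op, W_op, operator_embeddings)
print(max(probs, key=probs.get), probs)
```

The operand generator of Equations (12) and (13) has the same shape, with $W_{argu}$ as the projection and the operand embeddings as candidates.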

3.4.3. MA Module

Many current models struggle with multi-step reasoning tasks [2,6]. Multi-head attention allows the model to focus on different parts of the input sequence simultaneously. This enables parallel processing of multiple reasoning steps, which is especially beneficial for complex numerical reasoning tasks. By analyzing different possible solutions, multi-head attention optimizes the solution process for multi-step problems. As shown in Figure 4, the MA Module consists of two modules: a Multi-head Attention Reasoning Module and a Generator Module. The Multi-head Attention Reasoning Module processes the contextual embeddings, and the Generator Module generates the array of mathematical operators $OP_{MA}$ and the array of operands $OE_{MA}$.

Multi-head Attention Reasoning Module: The input to the Multi-head Attention Reasoning Module consists of the previously generated symbol $s_{t - 1}$ and the embedded context vector h from the encoder. The reasoning module first calculates the context vector c using the multi-head attention mechanism, which allows it to effectively focus on different segments of the input. In the multi-head attention mechanism, attention weights and context vectors are calculated separately for each attention head. The specific steps are as follows:

Calculate the attention weights for each attention head: For the kth attention head, calculate the attention weight $a_{i}^{(k)}$ from the previously generated symbol $s_{t - 1}$ and the context vector h:

$a_{i}^{(k)} = \frac{\exp(\mathrm{score}(e_{s_{t - 1}}, h_{i}^{(k)}))}{\sum_{j} \exp(\mathrm{score}(e_{s_{t - 1}}, h_{j}^{(k)}))}$

(14)

where the scoring function for the kth attention head is:

$\mathrm{score}(e_{s_{t - 1}}, h_{i}^{(k)}) = e_{s_{t - 1}}^{T} W_{6}^{(k)} W_{7}^{(k)} h_{i}^{(k)}$

(15)

Here, $W_{6}^{(k)} \in \mathbb{R}^{h \times h}$ and $W_{7}^{(k)} \in \mathbb{R}^{h \times h}$ are trainable parameters of the kth attention head, where $h = 1024$.

Calculate the context vector for each attention head: Different heads provide diverse attention perspectives, enabling the model to focus on multiple levels of information simultaneously, thereby enhancing its modeling capability. Sum the weighted vectors of attention heads to obtain the context vector $c^{(k)}$ for the kth attention head:

$c^{(k)} = \sum_{i} a_{i}^{(k)} h_{i}^{(k)}$

(16)

Merge the context vectors of multiple attention heads: Combine the context vectors $c^{(k)}$ from all attention heads and obtain the final context vector c through a linear transformation:

$c = W_{c} [c^{(1)}, c^{(2)}, \dots, c^{(headNum)}]$

(17)

where $W_{c}$ is the linear transformation matrix after merging, and $headNum = 4$ is the number of attention heads.

Next, the reasoning module uses a GRU to compute the initial guidance output $g_{initial\_op}$:

$g_{initial\_op}, H_{t} = \mathrm{GRU}(\mathrm{ReLU}(W_{8} [c : E_{op, cons, num}(s_{t - 1})]), H_{t - 1})$

(18)

where “:” denotes concatenation, ReLU is the activation function, and $W_{8} \in \mathbb{R}^{h \times 2h}$ is a trainable parameter. As in Equation (6), $H_{t - 1} \in \mathbb{R}^{h}$ represents the hidden state of the GRU from the previous step, and $H_{0}$ is initialized to zero.
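The three multi-head steps (Equations (14)–(17)) can be sketched as follows. The source does not state how the per-head states $h_{i}^{(k)}$ are derived from h, so this sketch assumes every head sees the full encoder states and differs only in its own pair of weight matrices. It also uses two heads of dimension 2 rather than four heads of dimension 1024, and every name and value is illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_context(
    query: Vector,
    states: List[Vector],
    heads: List[Tuple[Matrix, Matrix]],   # one (W_6, W_7) pair per head
    W_c: Matrix,                          # merge matrix of shape dim x (heads * dim)
) -> Tuple[Vector, List[Vector]]:
    head_contexts: List[Vector] = []
    dim = len(states[0])
    for W_6, W_7 in heads:
        # Equations (14)-(15): per-head scores and attention weights
        scores = [
            sum(q * v for q, v in zip(query, matvec(W_6, matvec(W_7, h_i))))
            for h_i in states
        ]
        weights = softmax(scores)
        # Equation (16): per-head context vector
        head_contexts.append(
            [sum(a * h_i[d] for a, h_i in zip(weights, states)) for d in range(dim)]
        )
    # Equation (17): concatenate the head contexts and apply the linear merge
    merged = [v for ctx in head_contexts for v in ctx]
    return matvec(W_c, merged), head_contexts


identity = [[1.0, 0.0], [0.0, 1.0]]
negated = [[-1.0, 0.0], [0.0, -1.0]]
heads = [(identity, identity), (negated, identity)]   # head 2 prefers the opposite token
W_c = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]    # toy merge: average the heads
c, per_head = multi_head_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], heads, W_c)
print(per_head, c)
```

In this toy case the two heads attend to different tokens, and the merged context mixes both views, which is the behavior the text attributes to multi-head attention.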

Generator Module: In the MA Module, $g_{initial\_op}$ directly guides the Generator Module to generate operators and operands. The Generator Module consists of an operator generator and an operand generator. The guidance vector $g_{initial\_op}$ enters the operator generator. The operator generator uses softmax to calculate the most likely i-th operator:

$P(op_{i} \mid g_{initial\_op}, E_{op_{t - 1}}) = \mathrm{softmax}(E_{op_{i}}^{T} R_{initial\_op})$

(19)

$R_{initial\_op} = \mathrm{ReLU}(W_{op} g_{initial\_op})$

(20)

The guidance vector $g_{initial\_oe}$ enters the operand generator. The operand generator uses softmax to calculate the most likely i-th operand:

$P(oe_{i} \mid g_{initial\_oe}, E_{oe_{i - 1}}) = \mathrm{softmax}(E_{oe_{i}}^{T} R_{initial\_oe})$

(21)

$R_{initial\_oe} = \mathrm{ReLU}(W_{argu} g_{initial\_oe})$

(22)

3.4.4. Selector

The Selector uses the hyperparameter $stepNum$ as a selection threshold and calculates $stepNum$ from the size of $OP_{RM}$ generated by the RM Module. If $stepNum = 1$, the Selector passes the $OP_{RM}$ and $OE_{RM}$ generated by the RM Module to the Program Solver. If $stepNum > 1$, the Selector starts the MA Module, which then generates $OP_{MA}$ and $OE_{MA}$ and passes them to the Program Solver.
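The routing rule described here is small enough to sketch directly. In the sketch below, `select_program` and `run_ma_module` are hypothetical names; the MA Module is passed in as a callable so that it runs only when more than one step is predicted, which is an implementation choice of this illustration rather than something the paper specifies.

```python
from typing import Callable, List, Tuple

Operators = List[str]
Operands = List[List[str]]
Program = Tuple[Operators, Operands]


def select_program(
    op_rm: Operators,
    oe_rm: Operands,
    run_ma_module: Callable[[], Program],
) -> Program:
    """Route single-step programs to the RM output and all others to the MA Module."""
    step_num = len(op_rm)        # stepNum is the number of operators in OP_RM
    if step_num == 1:
        return op_rm, oe_rm      # keep the RM Module's prediction
    return run_ma_module()       # otherwise regenerate with the MA Module


# Toy usage with a stand-in for the MA Module.
def fake_ma_module() -> Program:
    return ["subtract", "divide"], [["100", "80"], ["#0", "80"]]


print(select_program(["add"], [["1", "2"]], fake_ma_module))
print(select_program(["subtract", "add"], [["5", "3"], ["#0", "1"]], fake_ma_module))
```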

3.4.5. Program Solver

The Program Solver executes the generated program. Since a later subprogram may use the result of an earlier subprogram as an operand, we follow ELASTIC [6] and store the result of each subprogram in a slot #n of a memory register.
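A minimal sketch of such an executor is shown below, assuming a program is a list of (operator, operands) pairs and that an operand of the form `#n` refers to the result of subprogram n. The four operator names and the example numbers are illustrative assumptions; the operator set actually used by the model is not reproduced here.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Illustrative operator table; the names are assumptions for this sketch.
OPS: Dict[str, Callable[[float, float], float]] = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def solve(program: List[Tuple[str, List[str]]]) -> float:
    """Execute subprograms in order, keeping each result in a memory register."""
    memory: List[float] = []                 # memory[n] holds the result of step n
    for op, operands in program:
        values = [
            memory[int(x[1:])] if x.startswith("#") else float(x)
            for x in operands
        ]
        memory.append(OPS[op](*values))
    return memory[-1]


# Hypothetical two-step program: (100 - 80) / 80
program = [("subtract", ["100", "80"]), ("divide", ["#0", "80"])]
print(solve(program))  # 0.25
```

Keeping every intermediate result in the register lets any later step refer back to any earlier one, not only to the immediately preceding step.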

3.5. RMNR and MANR

For the framework of the model RMNR, see Appendix B. The main difference between RMNR and RMMANR is that RMNR does not have an MA Module and Selector. For the framework of the model MANR, see Appendix C. The main difference between MANR and RMMANR is that MANR does not have an RM Module and Selector.

4. Experiments

4.1. Dataset

The FINQA dataset [2] consists of 8281 question–answer pairs derived from annual financial reports. The dataset is divided into training (6251 samples), evaluation (883 samples), and test (1147 samples) sets. Each sample includes a financial question, a supporting passage, and a gold-standard numerical reasoning program. The FINQA dataset is designed to test models’ ability to perform complex numerical reasoning in the financial domain. We evaluate our model on the processed FINQA dataset from ELASTIC, in which all table data related to the reasoning process has already been converted into text and placed in the passage during the data preprocessing stage. (The preprocessed FINQA dataset is available at https://github.com/NeuraSearch/NeurIPS-2022-Submission-3358 (accessed on 15 July 2022).)

4.2. Evaluation Metrics

Following the original FINQA paper, we use two evaluation metrics: execution accuracy (Exe Acc) and program accuracy (Prog Acc).

Execution Accuracy measures the correctness of the final answer produced by executing the model-predicted program against the gold-standard executable result.
Program Accuracy evaluates the precision of the predicted program by comparing the operands and operators in the model-generated program with those in the gold-standard program.

These metrics provide a comprehensive assessment of both the reasoning process (program accuracy) and the final result (execution accuracy).
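As a rough illustration of how the two metrics differ, the sketch below computes both over a toy batch. It makes simplifying assumptions: execution accuracy compares numeric answers within a small tolerance, and program accuracy uses exact matching of the operator and operand sequence, which is stricter than any equivalence-aware comparison an official evaluation script may perform. The function names and data are hypothetical.

```python
from typing import List, Sequence


def execution_accuracy(
    predicted: Sequence[float], gold: Sequence[float], tol: float = 1e-5
) -> float:
    """Fraction of examples whose executed answer matches the gold answer."""
    hits = sum(abs(p - g) <= tol for p, g in zip(predicted, gold))
    return hits / len(gold)


def program_accuracy(
    predicted: Sequence[List[str]], gold: Sequence[List[str]]
) -> float:
    """Fraction of examples whose symbol sequence equals the gold program (exact match)."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy batch: the second prediction reaches the right answer with a different program.
pred_answers = [0.25, 6.0]
gold_answers = [0.25, 6.0]
pred_programs = [["subtract", "100", "80", "divide", "#0", "80"], ["add", "3", "3"]]
gold_programs = [["subtract", "100", "80", "divide", "#0", "80"], ["multiply", "2", "3"]]

print(execution_accuracy(pred_answers, gold_answers))    # 1.0
print(program_accuracy(pred_programs, gold_programs))    # 0.5
```

The toy batch shows why program accuracy is typically the lower of the two numbers: a program can produce the correct answer without matching the gold reasoning steps.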

4.3. Baselines

We use several prompting models and fine-tuning models to compare with RMNR, MANR, and RMMANR.

Prompting models include the following:

GPT-3.5-turbo: GPT-3.5-turbo is a large language model with 175 billion parameters [32].
GPT-4: GPT-4 is a large multimodal model that is able to process both text and image inputs and generate text outputs [33].
Program-of-Thought: Program-of-Thought first generates programming and text statements, and then produces an answer [28].

Fine-tuning models include:

FINQANet: FINQANet employs a typical encoder–decoder architecture, where pre-trained LMs are used as the encoder, and LSTM serves as the decoder [2].
NeRd: NeRd generates symbolic nested programs using BERT and a model based on pointer generators [21].
NumNet: NumNet models arithmetic information through GNN networks [34].
ELASTIC: ELASTIC separately generates mathematical operators and operands, reducing the occurrence of cascading errors [6]. (The preprocessed FINQA dataset and code are available at https://github.com/NeuraSearch/NeurIPS-2022-Submission-3358 (accessed on 15 July 2022).)
DyRRen: DyRRen is an extended retriever-reranker-generator framework [35].
We also compared the reasoning results of non-experts and experts from the original FINQA paper [2].

4.4. Implementation Details

RMMANR was trained on a server with an RTX 4090 GPU with 24 GB of memory and implemented with Transformers 4.44.0 [36] and PyTorch 2.3.0 [37]. The training batch size was set to 3 and the number of epochs to 100. We chose Adam [38] as the optimizer to update the model parameters. The initial learning rate was set to $1 \times 10^{-5}$, and the learning rate was halved every 25 epochs. To prevent overfitting, the dropout rate was set to 0.1 and the weight decay to $1 \times 10^{-5}$ during training.
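The learning-rate schedule described above can be written out as a short sketch. It assumes a step schedule in which the rate is halved at the start of epochs 25, 50, and 75 (zero-indexed); the exact epoch boundaries and the helper name `learning_rate` are assumptions, not details taken from the training code. In PyTorch, a schedule of this shape is usually expressed with `torch.optim.lr_scheduler.StepLR` using `step_size=25` and `gamma=0.5`, although the source does not state which scheduler was used.

```python
INITIAL_LR = 1e-5   # initial learning rate reported above
HALVE_EVERY = 25    # epochs between halvings
EPOCHS = 100        # total number of training epochs


def learning_rate(epoch):
    """Learning rate during a zero-indexed epoch under the assumed step schedule."""
    return INITIAL_LR * 0.5 ** (epoch // HALVE_EVERY)


schedule = [learning_rate(epoch) for epoch in range(EPOCHS)]
for epoch in (0, 24, 25, 50, 75, 99):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.3e}")
```

Under these assumptions the rate takes only four distinct values over the 100 epochs, ending at one eighth of its initial value.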

5. Experiment Results and Analysis

5.1. Overall Results

Table 3 compares the performance of our models (RMNR, MANR, and RMMANR) with several baselines on the FINQA dataset. With RoBERTa-large as the encoder, all three of our models outperform the existing baselines. Specifically, RMMANR (RoBERTa-large) achieves the highest performance, surpassing all baselines by at least 2.67 points in execution accuracy and 2.10 points in program accuracy. Compared to FINQANet (RoBERTa-large) from the original FINQA paper, RMMANR improves execution accuracy by 6.58 points and program accuracy by 3.79 points.

Both RMNR (RoBERTa-large) and MANR (RoBERTa-large) also demonstrate significant improvements over ELASTIC (RoBERTa-large). RMNR increases execution accuracy by 1.62 points and program accuracy by 1.4 points, while MANR improves execution accuracy by 1.25 points and program accuracy by 0.53 points. Given that RMNR, MANR, and ELASTIC use RoBERTa encoders of the same size, these results highlight the effectiveness of the recall mechanism and multi-head attention in enhancing numerical reasoning capabilities.
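The margins quoted above can be recomputed directly from the RoBERTa-large rows of Table 3. The snippet below is only an arithmetic check; the dictionary layout and the helper name `margin` are illustrative.

```python
# (Exe Acc, Prog Acc) in percent, copied from the RoBERTa-large rows of Table 3.
scores = {
    "FINQANet": (65.05, 63.52),
    "ELASTIC": (68.96, 65.21),
    "RMNR": (70.58, 66.61),
    "MANR": (70.21, 65.74),
    "RMMANR": (71.63, 67.31),
}


def margin(model, baseline):
    """Difference in percentage points, rounded to two decimals."""
    return tuple(round(m - b, 2) for m, b in zip(scores[model], scores[baseline]))


comparisons = [("RMMANR", "ELASTIC"), ("RMMANR", "FINQANet"),
               ("RMNR", "ELASTIC"), ("MANR", "ELASTIC")]
for model, baseline in comparisons:
    exe, prog = margin(model, baseline)
    print(f"{model} vs {baseline}: +{exe:.2f} Exe Acc, +{prog:.2f} Prog Acc")
```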

Our models also outperform other baselines, such as NeRd, NumNet, and DyRRen. Unlike NeRd, which relies on external rules for operators [34], our models achieve superior results without such dependencies. Additionally, NumNet and NumNet+ exhibit poor scalability due to limitations in their internal architectures [40], while DyRRen’s retriever–reranker–generator framework is prone to cascading errors, resulting in lower performance.

Furthermore, our RoBERTa-large models outperform all prompting models listed in Table 3, demonstrating stronger reasoning abilities on FINQA than models like GPT-3.5-turbo and GPT-4. While RMNR, MANR, and RMMANR surpass non-expert human performance, there remains a significant gap between our models and expert human performance. This suggests that there is still considerable room for improvement on the FINQA dataset.

5.2. The Convergence Process

The convergence process of RMNR (RoBERTa-large) is illustrated in Figure 6. RMNR demonstrates rapid convergence during the initial five training rounds, capturing the majority of core data features and patterns. In subsequent training iterations, the model’s accuracy continues to improve steadily, indicating that it is fine-tuning its parameters to better fit the data by learning more intricate features. However, occasional fluctuations in accuracy are observed in later stages, which may be attributed to the presence of noisy data. The convergence rate slows significantly towards the end of training, with the highest accuracy achieved near the final iterations. This suggests the potential for further optimization of hyperparameters to enhance performance.

The convergence behavior of MANR (RoBERTa-large), as depicted in Figure 7, is similar to that of RMNR (RoBERTa-large) and is therefore not discussed in detail here.

It is worth noting that execution accuracy may overestimate the model’s performance, as it can occasionally produce correct answers by chance. On the other hand, program accuracy may lead to false negatives in cases where a question admits multiple valid reasoning programs. Since correct reasoning programs invariably yield correct results, execution accuracy is generally higher than program accuracy.

5.3. Performance on Different Program Steps

The models’ performance across different program steps is presented in Table 4. When the program step is 1, RMNR outperforms ELASTIC, FINQANet, and MANR, demonstrating its superior capability in handling single-step program scenarios. MANR achieves the best performance on two-step reasoning tasks, suggesting that its multi-head attention mechanism is particularly effective for reasoning tasks of moderate complexity. RMMANR exhibits the highest performance on both the entire dataset and single-step programs, while also surpassing RMNR on two-step programs. This indicates that RMMANR successfully integrates the strengths of RMNR and MANR, leveraging their complementary features.

Interestingly, all models experience a significant drop in accuracy, with execution accuracy roughly halved, when moving from two-step programs to programs with more than two steps. As shown in Table 4, the number of training examples with more than two program steps (521) is substantially smaller than the number with one (3717) or two (2013) steps. We hypothesize that this scarcity of training data for longer programs is the main factor behind the observed degradation. Addressing data scarcity for complex multi-step reasoning is therefore an important direction for future research.
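To make the imbalance concrete, the short calculation below derives the share of training examples in each step bucket from the counts quoted in the paragraph above, together with the ratio of execution accuracy on programs with more than two steps to that on two-step programs, using the values in Table 4. The variable names are illustrative.

```python
# Training-example counts by number of program steps, as quoted in the text above.
train_counts = {"1 step": 3717, "2 steps": 2013, ">2 steps": 521}

total = sum(train_counts.values())  # matches the 6251 training samples
shares = {bucket: 100 * n / total for bucket, n in train_counts.items()}
for bucket, share in shares.items():
    print(f"{bucket}: {train_counts[bucket]} examples ({share:.1f}% of training set)")

# Execution accuracy (%) on two-step and more-than-two-step programs, from Table 4.
exe_acc = {
    "FINQANet": (62.34, 28.57),
    "ELASTIC": (66.01, 31.78),
    "RMNR": (65.04, 32.14),
    "MANR": (68.22, 30.95),
    "RMMANR": (66.01, 30.95),
}
ratios = {model: long / two for model, (two, long) in exe_acc.items()}
for model, ratio in ratios.items():
    print(f"{model}: accuracy on >2 steps is {ratio:.2f} of accuracy on 2 steps")
```

Programs with more than two steps make up less than a tenth of the training set, and every model retains a little under half of its two-step execution accuracy on them.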

5.4. Error Analysis

To better understand the limitations of our model in predicting mathematical operations, we analyze common error patterns. The errors can be categorized into the following three types:

Operator prediction error: One common source of error is the incorrect selection of mathematical operations. In Example 1 in Figure 8, the task required computing the percentage change in deferred tax assets; the correct sequence involved a subtraction operation, but the model predicted an addition. Such errors often arise when the model fails to grasp the underlying mathematical relationship, particularly in scenarios involving changes or differences, where subtraction is required instead of addition.

Operand prediction error: Another prevalent issue is the misidentification of numerical values within the input text. In Example 2 in Figure 8, which involves the ratio of collateral pledged to short-term borrowings, the model selected 19.3 instead of the correct value 2.3. This type of error may stem from (1) numerical mismatch, where the model prioritizes numbers that appear more prominently in the text and overlooks the values the problem requires, or (2) contextual misunderstanding, where the model fails to associate numerical values with temporal constraints and selects numbers from the wrong time period or category.

Operand and operator prediction error: In Example 3 in Figure 8, which requires the percentage change in the fair value of non-vested shares, the correct sequence involves a multiplication step followed by subtraction and division. However, the model predicted only multiplication and division, omitting the intermediate subtraction. This suggests that the model struggles with multi-step reasoning, particularly in percentage-based calculations, where computing the absolute difference before division is crucial.
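The three categories above can be assigned mechanically by comparing a predicted program with the gold program. The sketch below is one possible way to do so, not the procedure used for the analysis in this section. Programs are represented as lists of `(operator, operands)` steps, the function name `classify_error` is hypothetical, and all numbers are made up for illustration except 19.3 and 2.3, which come from Example 2.

```python
def classify_error(pred, gold):
    """Label a prediction by which parts of the program differ from the gold program."""
    ops_ok = [op for op, _ in pred] == [op for op, _ in gold]
    args_ok = [args for _, args in pred] == [args for _, args in gold]
    if ops_ok and args_ok:
        return "correct"
    if args_ok:
        return "operator prediction error"
    if ops_ok:
        return "operand prediction error"
    return "operand and operator prediction error"


# Example 1 pattern: addition predicted where subtraction is required.
gold_1 = [("subtract", ("150", "120")), ("divide", ("#0", "120"))]
pred_1 = [("add", ("150", "120")), ("divide", ("#0", "120"))]

# Example 2 pattern: 19.3 selected instead of 2.3 (the denominator is hypothetical).
gold_2 = [("divide", ("2.3", "10.0"))]
pred_2 = [("divide", ("19.3", "10.0"))]

# Example 3 pattern: the intermediate subtraction step is missing.
gold_3 = [("multiply", ("100", "50")), ("subtract", ("#0", "4000")),
          ("divide", ("#1", "4000"))]
pred_3 = [("multiply", ("100", "50")), ("divide", ("#0", "4000"))]

for pred, gold in [(pred_1, gold_1), (pred_2, gold_2), (pred_3, gold_3)]:
    print(classify_error(pred, gold))
```

A dropped step changes both the operator sequence and the operand references, so it falls into the combined category, which matches how Example 3 is described.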

These errors indicate three major challenges: (1) inaccurate selection of mathematical operations, (2) incorrect numerical value extraction, and (3) incorrect operands and operators. To mitigate these issues, future work can focus on the following:

Enhancing the model’s ability to understand mathematical relationships by refining operation selection mechanisms.
Improving number extraction accuracy through better numerical alignment techniques.
Strengthening multi-step reasoning to ensure all steps are correctly executed.

5.5. Ablation Studies

As illustrated in Figure 9 and Figure 10, both the execution accuracy and the program accuracy of RMNR and MANR are significantly higher than those of RMMANR without the recall mechanism and multi-head attention. This indicates that either the recall mechanism or multi-head attention alone can effectively enhance the model’s accuracy. Furthermore, RMMANR achieves the highest accuracy, demonstrating that the combination of these two components yields the best performance. The framework of RMMANR without the recall mechanism and multi-head attention is shown in Appendix D.

Figure 11 and Figure 12 reveal that RMNR with both operator and operand recall achieves the highest scores. Notably, RMNR with only operand recall performs worse than RMNR without any recall modules. This degradation in performance may be attributed to the recalled h containing excessive numerical noise. Since the model must select the correct operand from a large set of numbers, revisiting h before performing operand prediction could distract the model and increase the difficulty of identifying the correct operand. This observation highlights a potential area for future research.

Figure 13 and Figure 14 demonstrate that MANR with four attention heads achieves the best performance, both on the entire FINQA dataset and on multi-step reasoning tasks within FINQA. This suggests that, for MANR, increasing the number of attention heads does not necessarily improve performance. Instead, the configuration of four heads is optimal for this model.

6. Conclusions and Future Work

This paper introduces the recall mechanism and multi-head attention for numerical reasoning (RMMANR). RMMANR consists of two main components: an encoder and a decoder. The encoder leverages RoBERTa to encode the question and passage into contextual embeddings. The decoder, which consists of four modules (RM Module, Selector, MA Module, and Program Solver), generates inference steps and answers based on contextual information from the encoder. The recall mechanism addresses the issue of information loss in question and passage embeddings as the model depth increases, ensuring that critical details are retained throughout the reasoning process. On the other hand, multi-head attention proves particularly effective for complex numerical reasoning tasks, as it enables the extraction of multiple relevant information fragments and optimizes the solution process for multi-step problems.

RMNR, MANR, and RMMANR were evaluated on the FINQA dataset. Experimental results demonstrate that all three models outperform several baselines, with RMMANR achieving the highest performance by effectively combining the strengths of RMNR and MANR.

For future work, we plan to focus on enhancing the model’s performance on more complex multi-step reasoning tasks. This includes addressing challenges such as data scarcity for longer program steps and further optimizing the recall mechanism to reduce noise interference during operand prediction. Additionally, exploring alternative architectures or training strategies to improve generalization on complex reasoning tasks will be a key direction.

Author Contributions

Conceptualization, L.L.; methodology, L.L.; software, L.L. and B.Z.; validation, L.L. and B.Z.; formal analysis, L.L.; investigation, L.L.; resources, L.L.; data curation, L.L.; writing—original draft preparation, L.L.; writing—review and editing, T.-P.T.; visualization, L.L.; supervision, T.-P.T.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Putian University (grant No. 2022048), Putian Science and Technology Bureau (Putian Science and Technology Plan Project 2023GJGZ003), and Putian University (grant No. JG202388).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://github.com/NeuraSearch/NeurIPS-2022-Submission-3358 (accessed on 15 July 2022) [6].

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

RMMANR	Recall Mechanism and Multi-head Attention for Numerical Reasoning
RMNR	Recall Mechanism for Numerical Reasoning
MANR	Multi-head Attention for Numerical Reasoning

Appendix A. An Example of Financial Question–Answering Problems Using Notations Defined in Table 2

Appendix B. Recall Mechanism for Numerical Reasoning (RMNR)

Appendix C. Multi-Head Attention for Numerical Reasoning (MANR)

Appendix D. RMMANR Without Recall Mechanism and Multi-Head Attention

References

1. Hosseini, M.J.; Hajishirzi, H.; Etzioni, O.; Kushman, N. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 523–533.
2. Chen, Z.; Chen, W.; Smiley, C.; Shah, S.; Borova, I.; Langdon, D.; Moussa, R.; Beane, M.; Huang, T.H.; Routledge, B. Finqa: A dataset of numerical reasoning over financial data. arXiv 2021, arXiv:2109.00122.
3. Cheng, W.K.; Bea, K.T.; Leow, S.M.H.; Chan, J.Y.L.; Hong, Z.W.; Chen, Y.L. A review of sentiment, semantic and event-extraction-based approaches in stock forecasting. Mathematics 2022, 10, 2437.
4. Ooi, B.Y.; Lee, W.K.; Shubert, M.J.; Ooi, Y.W.; Chin, C.Y.; Woo, W.H. A flexible and reliable internet-of-things solution for real-time production tracking with high performance and secure communication. IEEE Trans. Ind. Appl. 2023, 59, 3121–3132.
5. Callanan, E.; Mbakwe, A.; Papadimitriou, A.; Pei, Y.; Sibue, M.; Zhu, X.; Ma, Z.; Liu, X.; Shah, S. Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams. In Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning, Jeju, Republic of Korea, 3 August 2024; pp. 23–32.
6. Zhang, J.; Moshfeghi, Y. ELASTIC: Numerical reasoning with adaptive symbolic compiler. Adv. Neural Inf. Process. Syst. 2022, 35, 12647–12661.
7. Zhu, F.; Li, M.; Xiao, J.; Feng, F.; Wang, C.; Chua, T.S. SoarGraph: Numerical Reasoning over Financial Table-Text Data via Semantic-Oriented Hierarchical Graphs. In Proceedings of the Companion Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1236–1244.
8. Wang, L.; Wang, Y.; Cai, D.; Zhang, D.; Liu, X. Translating a math word problem to an expression tree. arXiv 2018, arXiv:1811.05632.
9. Xie, Z.; Sun, S. A Goal-Driven Tree-Structured Neural Model for Math Word Problems. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 5299–5305.
10. Wang, Y.; Liu, X.; Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 845–854.
11. Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv 2019, arXiv:1903.00161.
12. Zhu, F.; Lei, W.; Huang, Y.; Wang, C.; Zhang, S.; Lv, J.; Feng, F.; Chua, T.S. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv 2021, arXiv:2105.07624.
13. Chen, Q.; Gao, X.; Guo, X.; Wang, S. Multi-head attention based candidate segment selection in QA over hybrid data. Intell. Data Anal. 2023, 27, 1839–1852.
14. Liang, C.C.; Hsu, K.Y.; Huang, C.T.; Li, C.M.; Miao, S.Y.; Su, K.Y. A tag-based English math word problem solver with understanding, reasoning and explanation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 67–71.
15. Chen, D.; Manning, C.D. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 740–750.
16. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
17. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
18. Lazaridou, A.; Pham, N.T.; Baroni, M. Combining language and vision with a multimodal skip-gram model. arXiv 2015, arXiv:1501.02598.
19. Alammary, A.S. BERT models for Arabic text classification: A systematic review. Appl. Sci. 2022, 12, 5720.
20. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
21. Chen, X.; Liang, C.; Yu, A.W.; Zhou, D.; Song, D.; Le, Q.V. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
22. Chen, K.; Xu, W.; Cheng, X.; Xiaochuan, Z.; Zhang, Y.; Song, L.; Wang, T.; Qi, Y.; Chu, W. Question directed graph attention network for numerical reasoning over text. arXiv 2020, arXiv:2009.07448.
23. Zhu, F.; Lei, W.; Feng, F.; Wang, C.; Zhang, H.; Chua, T.S. Towards complex document understanding by discrete reasoning. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4857–4866.
24. Xu, Y.; Xu, Y.; Lv, T.; Cui, L.; Wei, F.; Wang, G.; Lu, Y.; Florencio, D.; Zhang, C.; Che, W.; et al. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv 2020, arXiv:2012.14740.
25. Li, B.; Chen, W.; Tang, X.; Bian, S.; Liu, Y.; Guo, J.; Zhang, D.; Huang, F. Squeeze and Excitation Convolution with Shortcut for Complex Plasma Image Recognition. Comput. Mater. Contin. 2024, 80, 2221–2236.
26. Li, C.; Ye, W.; Zhao, Y. Finmath: Injecting a tree-structured solver for question answering over financial reports. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 6147–6152.
27. Wang, L.; Zhang, D.; Zhang, J.; Xu, X.; Gao, L.; Dai, B.T.; Shen, H.T. Template-based math word problem solvers with recursive neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7144–7151.
28. Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv 2022, arXiv:2211.12588.
29. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171.
30. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551.
31. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
32. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
33. Li, X.; Chan, S.; Zhu, X.; Pei, Y.; Ma, Z.; Liu, X.; Shah, S. Are ChatGPT and GPT-4 general-purpose solvers for financial text analytics? A study on several typical tasks. arXiv 2023, arXiv:2305.05862.
34. Ran, Q.; Lin, Y.; Li, P.; Zhou, J.; Liu, Z. NumNet: Machine reading comprehension with numerical reasoning. arXiv 2019, arXiv:1910.06701.
35. Li, X.; Zhu, Y.; Liu, S.; Ju, J.; Qu, Y.; Cheng, G. Dyrren: A dynamic retriever-reranker-generator model for numerical reasoning over tabular and textual data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2023; Volume 37, pp. 13139–13147.
36. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
37. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
39. Sun, J.; Zhang, H.; Lin, C.; Su, X.; Gong, Y.; Guo, J. Apollo: An optimized training approach for long-form numerical reasoning. arXiv 2022, arXiv:2212.07249.
40. Lake, B.; Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2873–2882.

Figure 1. An example of numerical reasoning.

Figure 2. A general encoder–decoder architecture for numerical reasoning.

Figure 3. The RM module.

Figure 4. The MA module.

Figure 5. The Recall Mechanism and Multi-Head Attention for Numerical Reasoning (RMMANR).

Figure 6. The convergence process of RMNR (RoBERTa-large).

Figure 7. The convergence process of MANR (RoBERTa-large).

Figure 8. Examples of prediction error.

Figure 9. The ExeAcc of RMMANR with or without recall mechanism and multi-head attention. Each model employs RoBERTa-large for encoding.

Figure 10. The ProgAcc of RMMANR with or without recall mechanism and multi-head attention. Each model employs RoBERTa-large for encoding.

Figure 11. The ExeAcc of RMNR with different recall modules. Each model employs RoBERTa-large for encoding.

Figure 12. The ProgAcc of RMNR with different recall modules. Each model employs RoBERTa-large for encoding.

Figure 13. The accuracy of MANR with a different number of heads on multi-step reasoning problems within FINQA. Each model employs RoBERTa-large for encoding.

Figure 14. The accuracy of MANR with different numbers of heads on the entire FINQA. Each model employs RoBERTa-large for encoding.

Table 1. Performance of deep neural network approaches in different numerical reasoning datasets. Exe Acc (execution accuracy) measures whether the executed result of the generated program matches the ground truth answer. Prog Acc (program accuracy) evaluates whether the generated program is mathematically equivalent to the ground truth program. The F1 Score assesses the overlap between the predicted and gold answers using a bag-of-words representation. Bold values indicate the best performance for each metric within a given dataset.

Category	Model	Math23K	DROP		FINQA		TAT-QA
		Prog Acc	Exe Acc	F1	Exe Acc	Prog Acc	Exe Acc	F1
	Human Expert [2]	-	-	-	91.16	87.49	-	-
FF	MHST [23]	-	-	-	-	-	63.6	72.7
	FinMath [26]	-	-	-	-	-	58.3	68.2
RNN	DNS [10]	64.7	-	-	-	-	-	-
	NeRd [21]	-	78.55	81.85	52.48	49.9	-	-
	ELASTIC [6]	-	-	-	68.96	65.21	-	-
Transformer	Ensemble [27]	66.7	-	-	-	-	-	-
Graph	QDGAT [22]	-	64.56	67.97	13.1	-	39.1	49.7
	SoarGraph [7]	-	-	-	67.2	-	65.4	75.3

Table 2. Task definition notation.

Notation	Description
P, Q, R	The text of the passage, the text of the question, the program of numerical reasoning.
$N U M$	The numbers that appear in Q and P.
$C O N S$	Constants described in domain-specific language (DSL).
$O P$	All mathematical operators.
$o p_{i}$	The i-th mathematical operator in R.
$O E$	All operands.
$o e^{i}$	All operands associated with $o p_{i}$ .
$o e_{j}^{i}$	The j-th operand of $o p_{i}$ .
$O P_{R M}$	An array of mathematical operators generated by RM Module.
$O P_{M A}$	An array of mathematical operators generated by MA Module.
$O E_{R M}$	An array of mathematical operands generated by RM Module.
$O E_{M A}$	An array of mathematical operands generated by MA Module.
s	Selected from either $O P$ or $O E$ , R consists of s.
$r_{i}$	$r_{i} = o p_{i} [o e^{i}]$ , the i-th subprogram of R.
$e_{s}$	Symbol s’s embedding from the decoding vocabulary.
$E_{o p} (s)$	The embedding retrieval function of $O P$ .
$E_{c o n s} (s)$	The embedding retrieval function of $C O N S$ .
$E_{n u m} (s)$	The embedding retrieval function of $N U M$ .

Table 3. Overall results of our models and the baselines on the test data. † denotes that the results are sourced from the original paper. ‡ denotes the results are sourced from [2]. * denotes that the results are sourced from [6]. ! denotes that the results are sourced from [39]. Bold values indicate the best performance for each metric.

Category	Models	FINQA Results
		Exe Acc (%)	Prog Acc (%)
Prompting-Model	GPT-3.5-turbo [32] !	48.56	-
	GPT-4 [33] !	68.79	-
	Program-of-Thought [28] !	68.10	-
Fine-Tuning-Model	NumNet [34] *	2.32	-
	NumNet+ [34] *	10.29	-
	NeRd [21] ‡	52.48	49.90
	FINQANet (RoBERTa-base) [2] †	60.10	58.38
	FINQANet (RoBERTa-large) [2] †	65.05	63.52
	ELASTIC (RoBERTa-base) [6] †	62.66	59.28
	ELASTIC (RoBERTa-large) [6] †	68.96	65.21
	DyRRen [35] !	63.30	61.29
	RMNR (RoBERTa-base)	65.33	61.20
	RMNR (RoBERTa-large)	70.58	66.61
	MANR (RoBERTa-base)	64.98	61.12
	MANR (RoBERTa-large)	70.21	65.74
	RMMANR (RoBERTa-large)	71.63	67.31
Human Performance	Human Expert †	91.16	87.49
Human Performance	Human Non-Expert †	50.68	48.17

Table 4. Results on different program steps. † denotes that the results are sourced from the original ELASTIC [6] paper. Each model employs RoBERTa-large for encoding. Bold values indicate the best performance for each metric.

Model	The Entire Dataset		=1 (3717)		=2 (2013)		≥3 (521)
	Exe Acc	Prog Acc	Exe Acc	Prog Acc	Exe Acc	Prog Acc	Exe Acc	Prog Acc
FINQANet †	65.05	63.52	73.70	71.25	62.34	59.65	28.57	23.80
ELASTIC †	68.96	65.21	76.30	75.66	66.01	66.01	31.78	31.10
RMNR	70.58	66.61	78.96	76.61	65.04	58.68	32.14	27.38
MANR	70.21	65.74	76.49	74.01	68.22	60.88	30.95	25.00
RMMANR	71.63	67.31	80.34	77.68	66.01	59.41	30.95	25.00

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lai, L.; Tan, T.-P.; Zeng, B. Recall Mechanism and Multi-Head Attention for Numerical Reasoning. Appl. Sci. 2025, 15, 3528. https://doi.org/10.3390/app15073528
