Modeling Extractive Question Answering Using Encoder-Decoder Models with Constrained Decoding and Evaluation-Based Reinforcement Learning

Abstract: Extractive Question Answering, also known as machine reading comprehension, can be used to evaluate how well a computer comprehends human language. It is a valuable topic with many applications, such as in chatbots and personal assistants. End-to-end neural-network-based models have achieved remarkable performance on these tasks. The most frequently used approach to extract answers with neural networks is to predict the answer's start and end positions in the document, independently or jointly. In this paper, we propose another approach that considers all words in an answer jointly. We introduce an encoder-decoder model to learn from all words in the answer. This differs from previous works, which usually focused on the start and end and ignored the words in the middle. To help the encoder-decoder model to perform this task better, we employ evaluation-based reinforcement learning with different reward functions. The results of an experiment on the SQuAD dataset show that the proposed method can outperform the baseline in terms of F1 scores, offering another potential approach to solve the extractive QA task.


Introduction
Text question-answering [1] systems can answer natural language questions automatically, providing a convenient way for people to obtain required knowledge. A successful system must be able to carry out several Natural Language Processing (NLP) tasks, such as natural language understanding [2], information retrieval [3], or natural language inference [4], making Question Answering (QA) one of the most challenging tasks that has attracted the interest of many NLP researchers.
Researchers have proposed and defined different types of text QA tasks, such as multichoice QA [5], generative QA [6], and extractive QA [7]. This paper focuses on extractive QA tasks, which take a question and a document as the input. The document contains a span that can serve as the correct answer for the question. Figure 1 shows an example. An extractive QA system needs to locate the start and end positions of the answer span in the input document and "extract" the span as the answer.
Extractive QA specifies the scope of the input question. The input question should be about a document, and the answer exists in the document as a substring. This setting enables us to confidently evaluate the predicted answers by comparing them to ground truth using overlap-based metrics, thus providing a relatively ideal and convenient test bed for QA models. Many works have used the extractive QA task to test models' ability to answer questions [2,8-10] or focus on achieving better results [11].
The neural network models in [8,11-14] have achieved remarkable performances. They differ in their approaches for adapting end-to-end neural networks to model extractive QA tasks, e.g., the model proposed in [14] predicts the start and end positions of the span jointly, while [12] predicts them independently. This paper discusses several approaches to the modeling of extractive QA with neural networks. We propose the use of encoder-decoder models to solve the extractive QA task with evaluation-based reinforcement learning. The main ideas of this paper are as follows:

1.
We solve the extractive QA task with an encoder-decoder model that generates all answer words jointly, enabling the model to use more information from the answers for training and to naturally output entire answers in the inference.

2.
The proposed encoder-decoder extractive QA model uses evaluation-based reinforcement learning to enhance the model's performance. The experiment results show that the proposed model can achieve better results than the baseline.
The structure of this paper is as follows: Section 2 gives the background and related work, which includes a discussion about existing approaches to the training of neural-network-based extractive QA models, the introduction of the encoder-decoder model, and some general ideas behind reinforcement learning. Section 3 proposes our encoder-decoder model, the constrained decoding method, and evaluation-based reinforcement learning. Next, we present the experiment settings, results, and discussion in Section 4 and the conclusion in Section 5.

Extractive Question Answering
Extractive Question Answering is also known as span prediction or machine reading comprehension [15,16]. An extractive QA sample contains a question Q, a document D, and an answer A to question Q. Q = {q_1, q_2, ..., q_|Q|}, D = {d_1, d_2, ..., d_|D|}, and A = {a_1, a_2, ..., a_|A|}. These are represented as word sequences, where q_i, d_i, and a_i denote the words, and |Q|, |D|, and |A| are the numbers of words in each sequence. The word sequence A = {a_1, a_2, ..., a_|A|} is a substring (a subsequence occupying consecutive positions) of D, i.e., there exist positions s and e such that

a_1 = d_s, a_2 = d_{s+1}, ..., a_|A| = d_e. (1)

We can solve this task by using an end-to-end model that takes Q and D as the input and A as the output. The model predicts the probability of a span S being the correct answer, P(S | Q, D) (we refer to P(S | Q, D) as P(S) for brevity). The span S = {d_s, d_{s+1}, ..., d_e} denotes a substring of D.

Independent Assumption for the Start and End Positions
Since S could be an arbitrary span satisfying s ≤ e, the number of valid spans is quadratically related to |D| (e.g., D has (|D|^2 + |D|)/2 spans). Inferring the probabilities of all these spans is time-consuming and ineffective. Additionally, only one or a few spans are correct answers, so training models to learn from such imbalanced data is challenging. Therefore, precisely calculating P(S) usually hinders such models from obtaining state-of-the-art prediction accuracy [17,18]. Rajpurkar et al. [19] and Seo et al. [12] approximate P(S) by assuming that the start position s and the end position e are independent, which also accelerates the computation:

P(S) = P(d_s, d_e) ≈ P_start(d_s) P_end(d_e). (2)

d_s and d_e denote the start and end tokens of the span S. S can be uniquely identified by its boundary (s and e), so P(d_s, d_e) equals P(S), and the words between d_s and d_e can be omitted from this formulation. P_start(d_s) denotes the probability that the s-th word in D is the start of the answer, and P_end(d_e) denotes the probability that the e-th word is the end.
With this approximation, a model π_θ can be used to estimate P_start(d_i) and P_end(d_j) for each word in D as π_θ,start(d_i) and π_θ,end(d_j) and then assemble them to get the predicted answer Ã:

s̃ = argmax_i π_θ,start(d_i), ẽ = argmax_j π_θ,end(d_j), Ã = {d_s̃, ..., d_ẽ}. (3)

The number of calculations is reduced to be linearly related to the document length |D|.
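The independent prediction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the per-token probabilities are assumed to come from a trained model.

```python
import numpy as np

def extract_span_independent(start_probs, end_probs):
    """Pick the answer span assuming start and end are independent.

    start_probs[i] and end_probs[j] stand in for P_start(d_i) and
    P_end(d_j); both are 1-D arrays of length |D|.
    """
    s = int(np.argmax(start_probs))  # most likely start position
    e = int(np.argmax(end_probs))    # most likely end position
    return s, e

# Toy distributions over a 5-token document.
start_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
end_probs = np.array([0.05, 0.05, 0.1, 0.7, 0.1])
print(extract_span_independent(start_probs, end_probs))  # (1, 3)
```

Note that because the two argmax calls are independent, nothing prevents e < s; practical systems typically search for the highest-scoring pair subject to s ≤ e.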

Greedy Search in the Multistep Decomposition
Although the independence assumption reduces the computational load satisfactorily [8,14], it is somewhat counterintuitive, since the two positions cannot truly be independent. The start and end positions form the answer together, and if one varies, the other should also move to produce a rational answer. Thus, Yang et al. [20] and Clark et al. [21] insisted on using a precise formula for the span's probability, but they decomposed it into multiple steps and used a greedy search to reduce the computational load. Usually, the end position is modeled as a conditional probability in the decomposition:

P(S) = P(d_s, d_e) = P_start(d_s) P_end(d_e | d_s). (4)

Compared with Equation (2), P_end(d_e) becomes P_end(d_e | d_s) in Equation (4), indicating that the end position depends on the start position. A model π_θ can estimate the probability P_start(d_i) over each word d_i ∈ D, pick the position with the highest value as s̃, and then find the end position ẽ based on s̃. The two steps are performed in a greedy manner:

s̃ = argmax_i π_θ,start(d_i), ẽ = argmax_j π_θ,end(d_j | d_s̃). (5)

The independent (Equation (2)) and decomposition (Equation (4)) methods have similar levels of computational complexity and can achieve competitive performances. The decomposition method is more flexible, since its performance can be improved through more complex search methods, such as beam search [22]. It provides a more convenient way of balancing performance and efficiency. Moreover, Fajcik et al. [14] and Wang and Jiang [23] pointed out that independent methods may reduce prediction accuracy, and therefore, the start and end positions should be considered jointly. Regarding approaches to predicting the answer span, this paper explores another way to decompose P(S), which takes care of all tokens in the answer using an encoder-decoder model. Section 3 details the proposed approach.
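The two-step greedy search in Equation (4) can be sketched as follows. This is an illustrative stand-in, not the cited implementations: `end_prob_fn` is a hypothetical callback representing a second model pass that scores end positions conditioned on the chosen start.

```python
import numpy as np

def extract_span_greedy(start_probs, end_prob_fn):
    """Two-step greedy search: choose the start first, then the end
    conditioned on it.

    end_prob_fn(s) returns P_end(d_j | d_s) for every token j.
    """
    s = int(np.argmax(start_probs))        # step 1: best start
    end_probs = np.array(end_prob_fn(s), dtype=float)
    end_probs[:s] = 0.0                    # enforce s <= e
    e = int(np.argmax(end_probs))          # step 2: best end given start
    return s, e

start_probs = np.array([0.1, 0.7, 0.1, 0.1])
print(extract_span_greedy(start_probs,
                          lambda s: [0.0, 0.2, 0.1, 0.7]))  # (1, 3)
```

Masking positions before s̃ guarantees a valid span, which the independent method cannot do without an extra search over (s, e) pairs.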

Neural Network Models Used for Prediction
As introduced above, we need to predict P(d_s), P(d_e), or P(d_e | d_s) to extract the answer span from D. Neural networks are suitable choices, with remarkable performance on this task [24]. Generally, neural-network-based extractive QA models consist of an encoder that transforms each word into a feature vector and a classifier that computes the probability of each word (as start or end) based on its feature vector [8,12,13]:

P_start(d_i) = classifier_start(d_i), P_end(d_j) = classifier_end(d_j). (6)

Here, the classifier inputs d_i and d_j denote the feature vectors produced by the encoder for the words d_i and d_j, respectively. Commonly, the classifier is a multilayer perceptron with a softmax function that gives the final probability [25]. When calculating the end position with Equation (4), the classifier can take the features of both the predicted start word d_s̃ and the candidate end word d_j. The encoder has a more elaborate architecture, such as a Recurrent Neural Network (RNN) [26,27] or a Convolutional Neural Network (CNN) [28,29], to transform words into feature vectors effectively. θ denotes the parameters of the model π_θ, which can be optimized by minimizing a loss function on labeled data using gradient descent.
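A minimal sketch of such a classifier head, under the simplifying assumption of a single linear layer (real models use a multilayer perceptron over contextual encoder features):

```python
import numpy as np

def softmax(x):
    z = x - x.max()  # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum()

def start_probabilities(features, w, b):
    """Score every document token and normalize over the document,
    giving one P_start(d_i) per token.

    features: (|D|, h) matrix, one h-dim encoder vector per token.
    w, b: classifier parameters (hypothetical, untrained here).
    """
    scores = features @ w + b  # one scalar score per token
    return softmax(scores)

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))  # 6 tokens, hidden size 8
w, b = rng.normal(size=8), 0.0
p = start_probabilities(features, w, b)
print(p.shape, float(p.sum()))  # (6,) 1.0
```

An identical head with separate parameters produces the end probabilities; training minimizes cross-entropy against the labeled start and end positions.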

Encoder-Decoder Models
This paper employs an encoder-decoder model to predict the answer span S. Encoder-decoder models are also known as sequence-to-sequence models, which can generate word sequences according to the input word sequences [30,31]. The input and output of the models, i.e., the sequence of words, are adaptable and can be used in many natural language generation tasks, such as dialogue systems [32,33], machine translation [34,35], text summarization [36,37], and knowledge graph completion [38,39]. Figure 2 illustrates an encoder-decoder model. The encoder transforms an input word sequence X = {x_1, x_2, ..., x_m} into numerical features, similar to the encoder in Equation (6). These features are considered to capture the semantic information in X. The decoder generates the output sequence Y = {y_1, y_2, ..., y_n} based on the features of X constructed by the encoder.
Figure 2. An encoder-decoder model autoregressively generates an output word sequence based on the input. <start> and <end> are the special tokens representing the generation's start and end.
Generally, an encoder-decoder model π_θ estimates the conditional probability P(Y | X) with π_θ(Y | X). The number of output candidates is exponential, and it is impractical to enumerate them to find the one with the highest probability. Thus, P(Y | X) is conditionally factorized so that the decoder can handle it in an autoregressive manner:

P(Y | X) = P(y_1 | X) P(y_2 | X, y_1) P(y_3 | X, y_1, y_2) ... P(y_n | X, y_1, y_2, ..., y_{n-1}). (7)

The decoder can generate the output sequence in a word-by-word greedy manner, where the generation of the i-th word depends upon the already generated words:

ỹ_i = argmax_{w ∈ V} (decoder(w | ỹ_1, ỹ_2, ..., ỹ_{i-1})). (8)

V represents the vocabulary containing all of the candidate words. The process starts with a special "start of the sequence" token representing an empty generated sequence and stops when a special "end of the sequence" token is generated. Additionally, the advanced search algorithms in [22,40,41] can be used to improve the generation results.
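The greedy loop of Equation (8) can be sketched as below. The decoder here is a hypothetical stand-in that deterministically spells out a fixed answer; a real decoder would compute the next-word distribution from the encoder features and the generated prefix.

```python
import numpy as np

def greedy_decode(decoder_step, vocab, start_token="<start>",
                  end_token="<end>", max_len=20):
    """Word-by-word greedy generation.

    decoder_step(prefix) returns a probability distribution over
    `vocab` for the next word given the words generated so far.
    """
    prefix = [start_token]
    while len(prefix) < max_len:
        probs = decoder_step(prefix)
        word = vocab[int(np.argmax(probs))]
        if word == end_token:          # stop at the end-of-sequence token
            break
        prefix.append(word)
    return prefix[1:]                  # drop the <start> token

# Toy decoder that always emits a scripted next word.
vocab = ["<end>", "Santa", "Clara"]
script = {0: 1, 1: 2, 2: 0}            # generation step -> vocab index
def decoder_step(prefix):
    probs = np.zeros(len(vocab))
    probs[script[len(prefix) - 1]] = 1.0
    return probs

print(greedy_decode(decoder_step, vocab))  # ['Santa', 'Clara']
```

Beam search replaces the single argmax with a small set of running hypotheses, trading computation for output quality.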

Reinforcement Learning for Encoder-Decoder Models
Reinforcement Learning (RL) involves learning to control an agent to accomplish an objective in an environment [42]. Usually, the agent does not learn, or barely learns, from examples showing which action is best for the objective but must discover the superior actions by trial and error. RL uses a reward function r to quantify how well the agent accomplishes the objective. The agent's goal is to maximize the expected reward:

max_θ E_{A ∼ π_θ}[r(A)], (9)

where A denotes actions, π_θ(A) determines the probability of choosing A, and r(A) is the reward obtained from the environment after taking A.
A simple way to integrate RL with encoder-decoder models is to consider the output sequence Y as the actions A and the input sequence X as the state [43]. Thus, Equation (9) becomes

max_θ E_{Y ∼ π_θ(·|X)}[r(Y)]. (10)

π_θ denotes the encoder-decoder model here; we continue to use π_θ to represent the model for convenience. π_θ generates Y word by word and earns a reward when the generation is finished. π_θ(Y | X) is the probability of generating the whole sequence Y. The selection of the reward function can be flexible: task-specific evaluation methods, such as BLEU [44] for machine translation, ROUGE [45] for text summarization, or Diversity [46] for response generation, are suitable for the corresponding tasks.

RL has been used in many NLP models. Ranzato et al. [47] proposed sequence-level training, which uses BLEU and ROUGE to calculate the reward for a generated sequence. Chen and Bansal [48] used RL to train a sentence selector in their two-stage model, which selects salient sentences from the document and then summarizes them to get the final summary. Li et al. [49] proposed the use of RL to help with the question generation task. Xiong et al. [50] defined an additional RL-based objective for an extractive QA model that independently predicts the answer span's start and end positions. Hu et al. [51] improved the reward function used in [50], which involves overlap-based metrics and could neglect acceptable answers. These works show the significant potential of using RL in NLP. To our knowledge, RL has not been evaluated in extractive QA models using the encoder-decoder framework; this paper fills this gap by providing empirical results. Other than extracting answers from textual evidence, locating answers over a knowledge graph [52] has been well-studied and applied in many real-world applications [53].
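In practice, the RL objective is optimized via sampled sequences, e.g., with a REINFORCE-style surrogate loss. The sketch below shows only the loss value; the inputs are hypothetical, and in a real system the log-probabilities come from the model so that automatic differentiation propagates the gradient.

```python
import numpy as np

def policy_gradient_loss(log_probs, rewards):
    """Surrogate loss whose gradient is the REINFORCE estimator:
    minimizing -r(Y) * log pi(Y|X) pushes the model to raise the
    probability of high-reward sequences.

    log_probs[i]: log pi_theta(Y_i | X) of the i-th sampled sequence,
                  i.e., the sum of its per-word log-probabilities.
    rewards[i]:   r(Y_i) from the reward function.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return float(-(rewards * log_probs).mean())

# Two sampled answers: the first matched the reference better.
loss = policy_gradient_loss(log_probs=[-2.0, -5.0], rewards=[0.9, 0.1])
print(round(loss, 2))  # 1.15
```

Because the reward multiplies the whole-sequence log-probability, any non-differentiable metric (BLEU, ROUGE, F1) can serve as r.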

Modeling the Whole Answer Span Using the Encoder-Decoder Model
As introduced in Section 2, two kinds of approximation, the independence assumption (Equation (2), described in Section 2.1.1) and the greedy search (Equation (4), described in Section 2.1.2), have been developed for use in extractive QA. Each has advantages, e.g., the independence assumption simplifies the training and inference process, while the decomposition requires no strong assumptions and can deal with the multistep problem using search algorithms.
However, both approaches can neglect some words in the answers, i.e., the words between a_1 and a_|A| in the answer span A = {a_1, a_2, ..., a_|A|}. This paper proposes that the words between the boundaries, {a_2, ..., a_{|A|-1}} (called middle words for convenience), can also provide useful semantic information for QA models to learn the semantics of the answer, which can aid in the understanding of the question and document. By considering the middle words, the probability of a span S becomes

P(S) = P(d_s) P(d_{s+1} | d_s) ... P(d_e | d_s, d_{s+1}, ..., d_{e-1}), (11)

which treats the span as a word sequence that an encoder-decoder model can generate. s and e are the start and end positions. Thus, we introduce an encoder-decoder model for extractive QA, as shown in Figure 3a.

Figure 3. The comparison between the proposed model and a baseline model. The input question is "Which NFL team represented the NFC at Super Bowl 50?". The answer is "Santa Clara California". (a) The encoder-decoder model is used to solve the extractive QA task, taking advantage of all words in the answer. (b) The baseline extractive QA models use the start and end words only.
The encoder-decoder model is trained to predict all the words in the answers, estimating the P(S) following Equation (11). As a comparison, Figure 3b shows a traditional extractive QA model which ignores the middle words.
We use BART-base as the backbone of the proposed encoder-decoder model [54]. The encoder in BART-base consists of six transformer encoder layers, and the decoder has the same number of transformer decoder layers. Each transformer layer contains a self-attention layer and a feed-forward network [55]. The input is the concatenation of the question and document sequences, and the output is the word sequence of the answer.

Constrained Decoding
The decoder usually considers all words in the vocabulary as candidates when generating each word, as shown in Equation (8). However, extractive QA ensures that the output answer is a substring of the document, so we need to limit the output space of the decoder, which is usually referred to as constrained decoding [56,57]. We implement constrained decoding based on a trie tree, as shown in Algorithm 1: at each generation step, the constrained vocabulary V_c is built by searching the trie T for the sequences that continue the current prefix and collecting the first token P[1] of each matched continuation P (V_c ← V_c + P[1]); the next word is then chosen as ỹ_i = argmax_{w ∈ V_c} (decoder(w | ỹ_1, ỹ_2, ..., ỹ_{i-1})) and appended to the prediction (Ã ← Ã + ỹ_i), and the loop repeats until the generation ends, returning Ã.

T is a trie tree, also known as a prefix tree. It is a tree data structure that can store and locate a set of sequences [58,59]. It groups the sequences based on their prefixes, so we can obtain all stored sequences that have a specified prefix. The function search(T, ...) obtains the sequences that start with a given prefix (we use the trie tree implemented in https://github.com/pytries/marisa-trie, accessed on 7 August 2021), and add(T, ...) denotes the addition of a new sequence to T; all possible answer starts are added to T in advance. V_c is the constrained vocabulary that stores all of the candidate words and ensures that the generation never becomes invalid. It is updated at each generation step.
The constrained decoding required here is not trivial, since we need to take care of the tokenization [60], which can split a natural word into pieces whose indices differ from those of the original word. The constrained decoder should not reject a generated word piece simply because its index differs from the expected one, but each candidate must be checked carefully. Thus, in practice, we append every word w in the vocabulary V to the current sequence, check whether the updated sequence satisfies the constraints, and save each valid word into the current constrained vocabulary V_c.
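The trie-based constraint can be sketched with plain dictionaries. This is an illustrative simplification, not the marisa-trie implementation used in the paper: every document suffix is indexed so that any substring can be generated, and the allowed next words are the children of the trie node reached by the current prefix.

```python
def build_trie(spans):
    """Nested-dict trie over the given token sequences."""
    root = {}
    for span in spans:
        node = root
        for tok in span:
            node = node.setdefault(tok, {})
    return root

def constrained_vocab(trie, prefix):
    """Words allowed after `prefix`: the children of the node it reaches."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []       # prefix is not a document substring
        node = node[tok]
    return list(node.keys())

# Document "the cat sat": index every suffix so any span can start anywhere.
doc = ["the", "cat", "sat"]
trie = build_trie([doc[i:] for i in range(len(doc))])
print(constrained_vocab(trie, []))       # ['the', 'cat', 'sat']
print(constrained_vocab(trie, ["the"]))  # ['cat']
```

At each decoding step, the argmax in Equation (8) is taken over this constrained set (plus the end-of-sequence token, omitted here for brevity) instead of the full vocabulary.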

Evaluation-Based Reinforcement Learning
Predicting a whole answer span is more challenging than predicting only its start and end words, since there are more labels to predict. Therefore, we introduce evaluation-based reinforcement learning to help the encoder-decoder model. Evaluation-based reinforcement learning is implemented as an auxiliary loss:

L_RL = -E_{Ã ∼ π_θ}[r(Ã, A) log π_θ(Ã | Q, D)]. (12)

Both L_text and L_RL are optimized in training. The final loss function of the proposed model is

L = L_text + L_RL. (13)

Minimizing this loss function encourages the model to maximize the reward. To make the optimization process more stable, we generate k candidate answers for each sample (Q, D, and A) and standardize the rewards within the same sample:

r̂(Ã_i, A) = (r(Ã_i, A) - mean(r(Ã_1, A), ..., r(Ã_k, A))) / std(r(Ã_1, A), ..., r(Ã_k, A)). (14)
Ã_i denotes the i-th answer sampled from π_θ. We standardize the k raw rewards by subtracting their mean and then dividing by their standard deviation. This standardization helps to ensure that the final rewards have both positive and negative values. r(Ã, A) is the reward obtained by comparing the predicted answer Ã with the ground truth answer A. We compute the reward using the conventional metrics for extractive QA, F1 and Exact Match (EM) [19]. We also test the use of ROUGE-L [45] for reward calculation in our experiment.
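The per-sample standardization is straightforward; a minimal sketch (in practice a small epsilon would guard against a zero standard deviation when all k rewards are equal):

```python
import numpy as np

def standardize_rewards(raw):
    """Standardize k rewards sampled for the same (Q, D, A): zero mean,
    unit standard deviation, so better-than-average samples get positive
    values and worse ones negative."""
    raw = np.asarray(raw, dtype=float)
    return (raw - raw.mean()) / raw.std()

r = standardize_rewards([0.2, 0.5, 0.8])
print(np.round(r, 3))  # [-1.225  0.     1.225]
```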
Specifically, EM checks whether the prediction is identical to the ground truth:

EM(Ã, A) = 1 if the predicted sequence Ã equals the ground truth A, and 0 otherwise. (15)

Additionally, F1 measures the word-level overlap between the prediction and the ground truth:

overlap = Σ_{w ∈ {A} ∩ {Ã}} min(count(w, A), count(w, Ã)),
F1(Ã, A) = 2 · (overlap/|Ã|) · (overlap/|A|) / (overlap/|Ã| + overlap/|A|). (16)

In the above equation, w ∈ {A} ∩ {Ã} denotes enumerating the words that exist in both A and Ã; each word only counts once. count(w, A) is the number of times that word w appears in the sequence A. |A| and |Ã| denote the lengths of the sequences A and Ã, respectively. The official evaluation script (https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/, accessed on 21 February 2022) normalizes the answers before computing EM and F1, for example, by converting them into lower case and removing some stop words, punctuation, and extra spaces. ROUGE-L indicates the similarity between the prediction Ã and the ground truth A based on their Longest Common Subsequence (LCS [61,62]):

ROUGE-L(Ã, A) = 2 · (|LCS(A, Ã)|/|Ã|) · (|LCS(A, Ã)|/|A|) / (|LCS(A, Ã)|/|Ã| + |LCS(A, Ã)|/|A|), (17)

where LCS is the function that gives the longest common subsequence of the two input sequences, A and Ã.
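The EM and F1 rewards can be sketched as below. This follows the standard word-overlap formulation; answer normalization (lower-casing, punctuation stripping) is omitted and assumed to have been applied beforehand.

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the (normalized) prediction equals the ground truth."""
    return int(pred == gold)

def f1_score(pred, gold):
    """Word-level F1 over the overlap between prediction and ground truth."""
    pred_toks, gold_toks = pred.split(), gold.split()
    # Counter intersection takes min(count in pred, count in gold) per word.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("santa clara", "santa clara california"))        # 0
print(round(f1_score("santa clara", "santa clara california"), 2))  # 0.8
```

The example shows why F1 is the smoother reward signal: a partially correct span earns partial credit, whereas EM is all-or-nothing.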

Experiment Settings
We evaluated the proposed model on the SQuAD dataset [19], where each sample consists of a question, a document, and an annotated answer. The documents were obtained from Wikipedia (https://www.wikipedia.org/, accessed on 16 June 2016) paragraphs. Table 1 shows two examples from the SQuAD dataset; the ground-truth answers are marked in bold, and the corresponding clues are presented in blue. There were 87,599 samples in the training set and 10,570 samples in the validation set. We used the validation set for both validation and testing. We trained each model for 100,000 steps in total, saved checkpoints every 10,000 steps, and chose the best checkpoint using the results on the validation set. Table 2 shows the other hyperparameters used. The baseline models are as follows:

1. BiDAF [12]: a classical extractive QA model that uses bidirectional attention flow (question-to-document and document-to-question attention) to enrich the representation of words. BiDAF predicts the answers' start and end positions independently according to the representations.

2.
BiDAF\w compound (best) [14]: uses the approaches proposed in [14] that jointly predict the start and end positions to enhance the BiDAF model. The best result is reported.

3.
DCN [63]: locates the answer spans by iteratively predicting the start and end positions to overcome initial local maxima, which may lead to wrong answers.

4.
DCN+ [50]: introduces reinforcement learning techniques to directly optimize the F1 metric for extractive QA.

5.
R.M-Reader [51]: a memory-based model that uses reinforcement learning with a reward function refined for better coverage.

6.
BERT-base\w compound (best) [14]: jointly predicts the start and end positions. It is similar to Model 2: BiDAF\w compound (best).

8.
BART-base: directly trains a BART-base model to generate the whole answer based on the question and document.
The following models are trained with the proposed approaches. "RL\w F1", "RL\w ROUGE-L", and "RL\w EM&F1" represent the BART-base models trained using the reinforcement learning loss with the F1 reward (Equation (16)), the ROUGE-L reward (Equation (17)), and the sum of the F1 and EM values (Equations (15) and (16)), respectively. "Constrained" denotes the use of constrained decoding for answer generation. As shown in Table 3, the proposed method, Model 13, achieves better F1 results than the baseline models. Model 13 uses the F1 score as the reward function for reinforcement learning and constrained decoding. Model 13 outperforms the models that also use reinforcement learning for extractive QA and works slightly better than the model that also jointly models the start and end positions, showing the potential of considering the whole answer span in extractive QA.
The use of metrics as rewards enables the models to be directly optimized based on the metrics. However, Model 10 uses both metrics, EM and F1, as rewards but does not achieve better results, indicating that the discrete EM value (0 or 1) may not be suitable as a reward function. Again, we emphasize that reward functions should be chosen carefully.
The "#Out of Document" shows the number of the predicted answer spans that are not a substring of the input document, violating the settings of extractive QA. We can see that there are always some invalid predictions without constrained decoding. This means that the encoder-decoder model used, BART-base, still cannot thoroughly learn the input-output paradigm of extractive QA. Generally, we find that the constrained decoding strategy always brings a notable improvement to the EM score but does not improve F1 as much. Meanwhile, the proposed evaluation-based reinforcement learning method is always helpful for improving the performance, even when the ROUGE-L is used as the reward. ROUGE-L does not directly correspond with the evaluation metrics and is slightly different from EM and F1 in terms of the calculation formula. The results again demonstrate the effectiveness of the combination of the RL and text generation.
We set different beam sizes for the proposed encoder-decoder model and obtained similar results, as shown in Table 4. One possible reason for this is that the generated sequence is short and the search space for generation is small, so the benefit of the search algorithm is limited. We set the beam size to 4 to obtain the results presented in Table 3.

Case Study and Discussion
Table 5 presents a case study to show the improvement qualitatively. The proposed model is compared with a strong baseline model, BERT, on the SQuAD dataset. Question 1 asks about the allies of the Normans in the war. The baseline model gives their opponent, whereas the proposed model yields the correct answer, demonstrating that it understands the document correctly. Question 2 explicitly asks for the complete date, including the year, month, and day, but the baseline provides only the month and omits the others.

Based on these cases, the proposed model seems to be better at producing longer answers. To verify this assumption quantitatively, we investigated the lengths of the answers predicted by different models. Figure 4 shows the average number of words/characters in the answers; the horizontal axis shows the answer length, and the vertical axis displays the models. We sorted the models by the answers' lengths in ascending order. The ground-truth answers were the longest. The plain encoder-decoder model BART-base and the baseline model BERT-base had the shortest average lengths, demonstrating that they cannot perform ideally with long answers. The models equipped with the proposed methods, constrained decoding and evaluation-based RL (denoted by Constrained and RL\w, respectively), prefer to give longer answers that are closer to the ground truth's length.
However, generating longer answers does not mean generating better answers, and we need to evaluate their quality with the corresponding metrics. To analyze how the answer length affects the models' prediction results, we correlated the length of the ground-truth answer with the F1 score of the answer predictions, as shown in Figure 5. We grouped questions with similar answer lengths in the same range together and averaged the F1 scores of the predictions for the questions in the same group. The horizontal axis shows the length ranges, and the vertical axis denotes the F1 scores. Overall, the performance degraded as the target answer became longer. Notably, the proposed model (BART-base RL\w F1 Constrained) and the baseline model (BERT-base) performed similarly when the answers were shorter (e.g., less than ten words). In comparison, the performance of the proposed model was much better than the baseline model for longer answers. This demonstrates that the proposed model generates longer answers and performs better when the ground truth is longer, which also reveals that the improvement brought by the proposed model is mainly for longer answers.

Conclusions
This paper proposes an encoder-decoder model for extractive QA. The proposed encoder-decoder model predicts all the words in the answer span, which differs from previous extractive QA models that only predict the start and end positions. Thus, the extractive QA task runs in a more natural way, and the model gives a complete answer rather than answer pieces. We additionally introduced reinforcement learning to create an auxiliary objective based on evaluation metrics to assist the models in training. We evaluated the proposed method on the SQuAD dataset. The experiment results show that the proposed encoder-decoder model can achieve results competitive with state-of-the-art baseline models, showing the potential of encoder-decoder models that use reinforcement learning and of unifying NLP tasks with the encoder-decoder framework.

Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.