You Don’t Need Labeled Data for Open-Book Question Answering

Open-book question answering is a subset of question answering (QA) tasks in which the system aims to find answers in a given set of documents (open-book) and in common knowledge about a topic. This article proposes a solution for answering natural language questions from a corpus of Amazon Web Services (AWS) technical documents with no domain-specific labeled data (zero-shot). These questions have a yes-no-none answer and a text answer, which can be short (a few words) or long (a few sentences). We present a two-step, retriever-extractor architecture in which a retriever finds the right documents and an extractor finds the answers in the retrieved documents. To test our solution, we introduce a new dataset for open-book QA based on real customer questions on AWS technical documentation. In this paper, we conducted experiments on several information-retrieval systems and extractive language models, attempting to find the yes-no-none answers and text answers in the same pass. Our custom-built extractor model is created from a pretrained language model and fine-tuned on the Stanford Question Answering Dataset (SQuAD) and Natural Questions datasets. We were able to achieve 42% F1 and a 39% exact match (EM) score end-to-end with no domain-specific training.


Introduction
Question answering (QA) has been a major area of research in artificial intelligence and machine learning since the early days of computer science [1][2][3][4]. The need for a performant open-book QA solution has been intensified by the rapid growth of available information in niche domains, the growing number of users accessing this information, and the expanding need for more efficient operations. QA systems are especially useful when a user searches for specific information and does not have the time, or simply does not want, to peruse all available documentation related to their search to solve the problem at hand.
In this article, open-book QA is defined as the task whereby a system (such as computer software) answers natural language questions from a set of available documents (open-book). These questions can have yes-no-none answers, short answers, long answers, or any combination of the above. In this work, we did not train the system on our domain-specific documents or questions and answers, a technique called zero-shot learning [5]. The system should be able to perform with a variety of document types, questions, and answers without training. We defined this approach as "zero-shot open-book QA". The proposed solution is tested on the AWS Documentation dataset. As the models within this solution are not trained on the dataset, the solution can be used in other similar domains, such as finance and law.
Software technical documentation is a critical piece of the software development life-cycle process. Finding the correct answers for one's questions can be a tedious and time-consuming process. Currently, software developers, technical writers, and marketers are required to spend substantial time writing documents such as technology briefs, web content, white papers, blogs, and reference guides. Meanwhile, software developers and solution architects have to spend a lot of time searching for information they need to solve their problems. Our approach to QA aims to expedite this process.
Our work's key contributions are:

1. Introducing a new dataset for open-book QA.
2. Proposing a two-module architecture to find answers without context.
3. Experimenting with ready-to-use information-retrieval systems.
4. Inferring text and yes-no-none answers in a single forward pass once we find the right document.
The rest of the paper is structured as follows: First, related previous work is summarized. Then, the dataset is described. Next, details on implementing the zero-shot open-book QA pipeline are provided. In addition, the experiments are explained, and finally, the results, along with limitations and next steps, are presented.

Related Work
There are a number of datasets in the literature for natural language QA [6][7][8][9][10][11][12][13][14][15], along with several solutions to answer these questions [16][17][18][19][20][21][22][23][24][25][26]. In this paper, we propose an approach that differs from the previous body of work as we do not receive the context but assume that the answer lies in a set of readily available documents (open-book), and we were not allowed to train our models on the given questions or set of documents (zero-shot). Our proposed solution attempts to answer questions from a set of documents with no prior training or fine-tuning (zero-shot open-book question answering).

QA Approaches
Natural language QA solutions take a question along with a block of text as context and attempt to find the correct answer to the question within the context. Open-book QA solutions take a question along with a set of documents that may contain the answer; the solution then attempts to find the answer to the original question within the available set of documents. Open-book QA solutions have been explored by several research teams, including Banerjee et al. [27], who perform QA using fine-tuned extractive language models, and Yasunaga et al. [28], who perform QA using graph neural networks (GNNs).

Our Inspirations
The inspiration for our two-step retriever-extractor approach came from the work of Banerjee et al. [27], which consisted of six main modules: Hypothesis Generation, Open-Book Knowledge Extraction, Abductive Information Retrieval, Information Gain-based Reranking, Passage Selection, and Question Answering. This solution tries to answer a question from a set of available documents along with a database of common knowledge. Alternatively, earlier works [29] created a logical form from the natural language text and then used formal reasoning to answer questions. In our solution, we were able to combine and condense these steps into two modules, without the need for a common-knowledge database or a logical form to answer open-book questions.
For our extractor, our approach is inspired by the extractive question-answering solutions presented by Devlin et al. [18], who showed that a bidirectional transformer-based [17] model pretrained on unlabeled text can be fine-tuned with only one extra output layer to achieve good performance on a variety of NLP tasks, including language inference, text classification, sentiment analysis, and question answering.

Data
Real-world open-book QA use cases require significant amounts of time, human effort, and cost to access or generate domain-specific labeled data. For our solution, we intentionally did not use any domain-specific labeled data and conducted experiments on popular QA datasets and pretrained models. We used feedback from customers to generate a set of 100 questions as the test dataset and used QA datasets, explained in Sections 3.2 and 3.3, for training.

AWS Documentation Dataset
Herein, we present the AWS Documentation corpus (https://github.com/siagholami/aws-documentation, accessed on 10 November 2021), an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual, and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information.

Questions and Answers
There are two types of answers: text and yes-no-none answers. Text answers range from a few words to a full paragraph sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none, depending on the type of question. For example, the question "Can I stop a DB instance that has a read replica?" has a clear yes or no answer, but the question "What is the maximum number of rows in a dataset in Amazon Forecast?" is not a yes or no question, and therefore obtains "None" as the YNN answer. Here, 23 questions have "Yes" YNN answers, 10 questions have "No" YNN answers, and 67 questions have "None" YNN answers. Table 1 shows a few examples from the dataset.

Annotation Process
Answers are provided by the authors and reviewed by a pool of five experts. We divided the annotation tasks into four steps, where all four steps are completed by one annotator and verified by one expert. The guidelines given to annotators are stated below:

1. Question validation: Annotators determine whether a question is valid or invalid. A valid question is a fact-seeking question that has a clear and unambiguous answer. An invalid question is incomprehensible, vague, opinion-seeking, or not plainly a query for factual information. Annotators must determine this from the question's content only.
2. Document selection: Annotators select the right source document from the AWS Documentation corpus that contains the answer. Every valid question in the dataset must have an accompanying source document.
3. Text answer: If a question is valid and has a valid source document, annotators select the correct answer from the source document. Every question must have a clear and unambiguous text answer. For example, for the question "What is the maximum number of rows in a dataset in Amazon Forecast?", there is one and only one answer, which is "1 billion".
4. YNN answer: In the final step, annotators flag the answer's YNN answer as "Yes", "No", or "None", depending on the question and the source document. Every question must have a clear and unambiguous YNN answer. For example, for the question "Can I run CodePipeline in a VPC?", there is one and only one answer, which is "Yes".

SQuAD Datasets
The Stanford Question Answering Dataset (SQuAD) (https://rajpurkar.github.io/SQuAD-explorer/, accessed on 10 November 2021) is a reading comprehension dataset [6] consisting of questions created by crowdworkers on Wikipedia articles. The answer to each question is a segment of text from the reading passage, or the question may be unanswerable. SQuAD1.1 comprises 100,000 question-answer pairs on more than 500 articles. SQuAD2.0 adds 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Natural Questions Dataset
The Natural Questions (NQ) dataset (https://ai.google.com/research/NaturalQuestions, accessed on 10 November 2021) includes 400,000 questions and answers created on Wikipedia articles [7]. The corpus contains questions from real users with answers that can be long (a few sentences), short (a few words) if present on the page, or null if no long or short answer is present.

Approach
Our approach consists of two high-level modules: a retriever and an extractor. Given a question, the retriever tries to find a set of documents that may contain the answer; then, from these documents, the extractor tries to find the answer. Figure 1 illustrates a high-level workflow of the solution, and Table 2 shows an example of a question, the retrieved documents, and the extracted answers using the solution.
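This two-module flow can be sketched end-to-end as follows. The keyword-overlap retriever and the `extractor` callback here are toy stand-ins for the components described in the following sections, not the authors' implementation:

```python
def keyword_retriever(question, corpus, k=3):
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(question, corpus, extractor, k=3):
    """Two-step pipeline: retrieve candidate documents, then keep the
    highest-scoring answer the extractor finds among them."""
    best = None
    for doc in keyword_retriever(question, corpus, k):
        span, score = extractor(question, doc)
        if best is None or score > best[1]:
            best = (span, score)
    return best[0] if best else None
```

The retriever narrows the search space so that the (expensive) extractor only runs on a handful of candidate documents.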

Retrievers
Given a question with no context, our approach relies on the retriever to find the right documents that contain the answer. The need for a retriever stems from the fact that our extractors are fairly large models, and it is time- and cost-prohibitive for the extractor to go through all available documents. For example, in our AWS Documentation dataset from Section 3.1, it would take hours for a single compute instance to run an extractor through all available documents. We conducted experiments with simple information-retrieval systems based on keyword search, along with deep semantic search models, to list relevant documents for a question. We used the precision at K (P@K) metric to evaluate our retrievers. Precision at K is the proportion of retrieved items in the top-K set that are relevant:

P@K = (number of retrieved documents that are relevant) / (total number of retrieved documents)

Whoosh

Whoosh (https://whoosh.readthedocs.io/, accessed on 10 November 2021) is a fast, pure-Python search engine library. Its primary design impetus is that it is pure Python and can be used anywhere Python is running, as no compiler or Java is required. It lets you index free-form or structured text and find matching documents based on simple or complex search criteria. For information retrieval, Whoosh uses the Okapi BM25F [30] ranking function, a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document. Since Whoosh is easy to implement, maintain, and customize, many teams use it for their information-retrieval use cases.
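The P@K metric used to evaluate retrievers is straightforward to compute; a minimal sketch, assuming the retrieved list is ranked and relevance is judged against a gold set of document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Precision at K: fraction of the top-k retrieved items that are
    in the set of relevant document IDs."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)
```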

Amazon Kendra
Amazon Kendra (https://aws.amazon.com/kendra/, accessed on 10 November 2021) is a semantic search and question answering service provided by AWS for enterprise customers. Kendra allows customers to power natural language searches on their own AWS data: under the hood, it uses a deep learning-based semantic search model that understands natural language questions and returns a ranked list of the most relevant documents.

Extractors
Given a question with no context, the retriever finds a set of documents. Then, the output of the retriever will pass on to the extractor to find the correct answer for the original question. We built our extractor from a base model, created from different variations of BERT [18] language models, and added three fine-tuning layers to extract yes-no-none answers and text answers. Our extractor attempts to find YNN answers and text answers in the same pass.
We used F1 and exact match (EM) metrics to evaluate our extractors. EM is a binary metric determining whether the model's prediction is exactly correct (i.e., the same words in the same order). The F1 metric is more relaxed: it treats answers as bags of words and calculates the harmonic mean of precision and recall.
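These two metrics can be sketched as follows. This is a simplified version of the SQuAD-style scorers: only lowercasing and whitespace tokenization are applied here, whereas the official evaluation scripts also strip punctuation and articles:

```python
from collections import Counter

def exact_match(prediction, truth):
    """EM: 1 if the normalized token sequences are identical, else 0."""
    return int(prediction.lower().split() == truth.lower().split())

def f1_score(prediction, truth):
    """Bag-of-words F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that captures part of a multi-part answer thus earns partial F1 credit while scoring zero on EM.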

Extractor Model Data Processing
For preprocessing, we followed the work of Devlin et al. [18]. We first tokenized every document using a 30,522-wordpiece vocabulary, then constructed the examples by combining a "[CLS]" token, the tokenized question, a "[SEP]" token, tokens from the document, and a final "[SEP]" token. The examples had a maximum sequence length of 512 tokens. For documents longer than 512 tokens, we used a sliding window over the document tokens with a stride of 128 tokens. The YNN answers are simply transformed to three classes (i.e., yes, no, and none). For the text answers, we computed start and end token indices to represent the target answer span. In the AWS Documentation dataset (Section 3.1), some of the text answers are not from a continuous block of text, and therefore there will be more than one text answer span (i.e., several start and end indices) for a single answer.
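The windowing scheme described above can be sketched as follows. Pre-tokenized lists stand in for the WordPiece tokenizer, and `build_examples` is an illustrative helper rather than the actual preprocessing code; `max_len=512` and `stride=128` follow the text:

```python
def build_examples(question_tokens, doc_tokens, max_len=512, stride=128):
    """Pack question and document into [CLS] question [SEP] doc [SEP]
    sequences, sliding a fixed-stride window over long documents."""
    # Room left for document tokens after the question and special tokens.
    budget = max_len - len(question_tokens) - 3  # [CLS] and two [SEP]
    examples = []
    start = 0
    while True:
        window = doc_tokens[start:start + budget]
        examples.append(["[CLS]"] + question_tokens + ["[SEP]"]
                        + window + ["[SEP]"])
        if start + budget >= len(doc_tokens):
            break  # this window reached the end of the document
        start += stride
    return examples
```

Consecutive windows overlap by `budget - stride` tokens, so an answer span falling near a window boundary still appears intact in at least one example.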

Extractor Model
Our extractor model is created from a base model and three fine-tuning layers: a YNN layer, a start indices layer, and an end indices layer. For our base model, we compared BERT (tiny, base, large) [18] along with RoBERTa [31], ALBERT [32], and DistilBERT [33]. We implemented the same strategy as the original papers to fine-tune these models. We also used the same hyperparameters as the original papers: L is the number of transformer blocks (layers), H is the hidden size, and A is the number of self-attention heads.
The YNN fine-tuning layer takes the pooled output from the base BERT model and feeds it into a three-node dense layer with a softmax activation to obtain the YNN answer. Furthermore, our model takes the sequence output from the base BERT model and feeds it into two sets of dense layers with sigmoid activation. The first layer aims to find the start indices of the answer sequences, and the second layer aims to detect the end indices of the answer sequences. Figure 2 illustrates the extractor model architecture. We define a training set data point as a four-tuple (d, s, e, yn), where d is a document containing the answers, s and e are indices of the start and end of the text answer, and yn is the yes-no-none answer.
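The three heads can be illustrated numerically as follows. This is a NumPy sketch in which externally supplied weight matrices stand in for the learned dense layers; `extractor_heads` is a hypothetical helper, not the model code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extractor_heads(pooled, sequence, w_ynn, w_start, w_end):
    """pooled: (H,) [CLS] vector; sequence: (T, H) per-token vectors.
    Returns YNN class probabilities and per-token start/end probabilities."""
    ynn_probs = softmax(pooled @ w_ynn)        # (3,) yes / no / none
    start_probs = sigmoid(sequence @ w_start)  # (T,) one score per token
    end_probs = sigmoid(sequence @ w_end)      # (T,)
    return ynn_probs, start_probs, end_probs
```

Using sigmoids (rather than a softmax over positions) for the start and end heads lets each token be scored independently, which accommodates answers composed of multiple spans.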
The loss of our model is

L = −log p(s, e, yn | d) = −log [ p(s | d) · p(e | d) · p(yn | d) ],

where each probability is defined as a softmax over the corresponding model output, e.g., p(s | d) = exp(f_start(s, d; θ)) / Σ_s′ exp(f_start(s′, d; θ)), θ is the set of base model parameters, and f_start, f_end, f_yn represent the three outputs from the last layer of the model.
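As a numerical illustration: assuming the three outputs are conditionally independent given the document, the joint probability factorizes and the loss reduces to a sum of negative log-probabilities. A minimal sketch with the three probabilities taken as given:

```python
import math

def qa_loss(p_start, p_end, p_ynn):
    """Negative log-likelihood of the joint answer annotation, under
    conditional independence of the three outputs given the document."""
    return -(math.log(p_start) + math.log(p_end) + math.log(p_ynn))
```

Confident, correct predictions (probabilities near 1) drive the loss toward zero, while any one poorly predicted component dominates it.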
At inference, we pass over all text from each retrieved document and return all start and end indices with scores higher than a threshold (implementation details are presented in Section 5.2).
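The thresholded span decoding can be sketched as follows. This uses an illustrative greedy pairing of start and end indices; the actual pairing strategy is not specified in the text, so it is an assumption:

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Return (start, end) index pairs where both head scores clear the
    threshold, pairing each start with the nearest end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, min(candidates)))
    return spans
```

Because every index above the threshold is kept, a single question can yield several disjoint spans, matching the multi-span answers present in the AWS Documentation dataset.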

Experiments
We experimented with two information-retrieval systems for retrievers and seven models for extractors.

Retriever Experiments
In our experiments, we used pretrained and ready-to-use information-retrieval systems. Table 3 demonstrates the results of these experiments. We found that Amazon Kendra's semantic search is far superior to a simple keyword search; therefore, we chose Amazon Kendra as our retriever. Given a question, we used Amazon Kendra to find the documents that may contain the answer, then passed the top nine documents to our extractor. Given that these information-retrieval systems are easy to use and maintain and are adequately performant, we chose to use them in our solution and focused our efforts on building a custom extractor. Section 6 states our plans for the next iteration of this solution, which includes creating a custom-built retriever.

Extractor Experiments
Regarding the extractors, we found that the bigger the base model, the better the performance. We used popular pretrained BERT-based language models as described in Section 4.2 and fine-tuned these models on SQuAD1.1 and SQuAD2.0 [6] along with Natural Questions datasets [7]. We trained the models by minimizing loss L from Section 4.2.2 with the AdamW optimizer [18] with a batch size of eight. We used 0.5 as the threshold for start and end indices. Then, we tested our models against the AWS documentation dataset (Section 3.1) while using Amazon Kendra as the retriever. Our final results are shown in Table 4.

Error Analysis
To better understand our solution and perform an error analysis, we manually analyzed our solution's predicted answers to the AWS Documentation dataset. We identified four categories of error in the results: retriever error, partial answer, table extraction error, and wrong prediction (see Table 5).

Exact Matches
The model provides an exact match for 39% of questions. Generally, these questions have a clear answer in the source document coming from a continuous block of text. Our solution works best in this category primarily because our training dataset mostly consists of factual questions and answers that can be extracted from a continuous block of text. For example, for the question "Can I run my AWS Lambda in a VPC?", our solution correctly predicted the answer as "You can configure a Lambda function to connect to private subnets in a virtual private cloud (VPC) in your AWS account". This answer is clearly stated in the document and the predicted answer had one start and one end token index. Another example in this category of questions is "In Amazon SageMaker, what is the internetwork communication protocol?", for which the solution correctly predicted "TLS 1.2".

Retriever Errors
As discussed in Section 5.1, we used Amazon Kendra as our retriever to obtain the top nine documents to pass to our extractor. Our retriever has an error rate of 10% which affects our end-to-end performance. Undeniably, if we cannot find the right document that contains the answer, we cannot find the correct answer. Improving our retriever's performance will enhance the overall performance of our solution. We are planning to perform experiments on this subject in our future works.

Partial Answers
For 7% of questions, our solution was not able to find an exact match; however, it was able to find parts of the answer. These questions have answers that should be extracted from different locations of a document, and our solution was not able to find all correct answer spans. For example, for the question "What are the Amazon RDS storage types?", the correct answer is "General Purpose SSD, Provisioned IOPS, and Magnetic", but our solution predicted the answer as "Provisioned IOPS", which is partly correct, but not entirely. These types of questions and answers were underrepresented in our training dataset. We are planning to train our models with more data in this category in our future work.

Table Extraction Errors
For 24% of questions, our solution was not able to extract the right information from a table in the source document. For example, for the question "What is the instance store type for d2.xlarge?", our solution was not able to extract the correct answer, which is "HDD", from a table within the document. In our training dataset, all of the questions and answers were presented in plain text and there were no examples with tables. In our future work, we will explore more datasets along with different approaches to tackle this challenge.

Wrong Predictions
For 20% of questions, our solution was not able to find any part of the answer. These questions and answers were challenging in nature, even for a human. For example, for the question "Is Administrator an Amazon Forecast reserved field name?", the correct answer is "No, Administrator is not an Amazon Forecast reserved field name", but our solution was not able to predict the correct answer. To answer this question correctly, a solution must read the document entirely and check for the word "Administrator" in the list of reserved field names. Because the word does not exist in the list, the answer should be "no". We hypothesize that if we use a generative extractor, we might be able to better answer these types of questions. We are planning to investigate these types of questions and answers in our future works.

Table 5. Manual error analysis results on AWS Documentation dataset.

Limitations and Future Work
Our solution has a number of limitations. Below, we describe some of these and suggest directions for future work. Owing to the challenging nature of zero-shot open-book problems, we were only able to achieve 42% F1 and 39% EM on our test dataset. The performance of the solution proposed in this article is fair when tested against technical software documentation. However, it needs to be improved before finding use in real-world software products. Additionally, more testing and evaluation are needed to thoroughly assess the applicability of this solution in other domains (e.g., medical, financial, or legal). We are planning to experiment with and evaluate this solution in multiple domains and compare it to fine-tuned (not zero-shot) models in our future work.
Furthermore, the solution performs better if the answer can be extracted from a continuous block of text from the document. The performance drops if the answer is extracted from several different locations in a document. Moreover, all questions had a clear answer in the AWS documentation dataset, which is not always the case in the real world. As our proposed solution always returns an answer to any question, it fails to recognize if a question cannot be answered.
For future work, we plan to experiment with custom-built retrievers similar to DPR [34], as well as generative models, such as GPT-2 [35] and GPT-3 [5], with a wider variety of text in pretraining to improve the F1 and EM scores presented in this article.

Conclusions
In this paper, we presented a new solution for zero-shot open-book QA with a retriever-extractor architecture to answer natural language questions from an available set of documents. We showed that ready-to-use information-retrieval systems can be used to find the right document and pass it to extractors. We demonstrated a custom-built transformer model to extract the answer within a set of documents. We experimented with several BERT-based models to identify the best-performing model for the task. With this solution, we were able to achieve 42% F1 and 39% EM with no domain-specific labeled data. We hope this new dataset and solution help researchers create better solutions for zero-shot open-book use cases in similar real-world environments.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.