A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets

Machine Reading Comprehension (MRC) is a challenging NLP research field with wide real-world applications. The great progress of this field in recent years is mainly due to the emergence of large-scale datasets and deep learning. At present, many MRC models have already surpassed human performance on several datasets, despite the obvious gap between existing MRC models and genuine human-level reading comprehension. This shows the need to improve existing datasets, evaluation metrics, and models to move MRC models toward 'real' understanding. To address the lack of a comprehensive survey of existing MRC tasks, evaluation metrics, and datasets, herein: (1) we analyze 57 MRC tasks and datasets and propose a more precise classification method of MRC tasks with 4 different attributes; (2) we summarize 9 evaluation metrics of MRC tasks and (3) 7 attributes and 10 characteristics of MRC datasets; and (4) we discuss key open issues in MRC research and highlight future research directions. In addition, to help the community, we have collected, organized, and published our data on a companion website (https://mrc-datasets.github.io/) where MRC researchers can directly access each MRC dataset, its papers and baseline projects, and browse the leaderboards.


Introduction
In the long history of Natural Language Processing (NLP), teaching computers to read text and understand its meaning is a major research goal that has not been fully realized. In order to accomplish this goal, researchers have recently conducted machine reading comprehension (MRC) research in many aspects, aided by the emergence of big datasets, higher computing power, and deep learning techniques, which have boosted the whole field of NLP [111,54,53]. The concept of MRC comes from human understanding of text. The most common way to test whether a person fully understands a piece of text is to ask her/him to answer questions about the text. Just like human language tests, reading comprehension is a natural way to evaluate a computer's language understanding ability. In the NLP community, machine reading comprehension has received extensive attention in recent years [9,91,114,4,30]. The goal of a typical MRC task is to require a machine to read a (set of) text passage(s) and then answer questions about the passage(s), which is very challenging [26].
Machine reading comprehension could be widely applied in many NLP systems such as search engines and dialogue systems. For example, as shown in Figure 1, when we enter a question into the Bing search engine, Bing can sometimes directly return the correct answer by highlighting it in the context (if the question is simple enough). Moreover, if we open "Chat with Bing" on the Bing website, as shown in the right part of the browser in Figure 1, we can also ask questions such as "How large is the Pacific?", and the Bing chat bot will directly give the answer "63.78 million square miles". The same "Chat with Bing" is also available on Bing's app, as shown in the right part of Figure 1. It is clear that MRC can help improve the performance of search engines and dialogue systems, allowing users to quickly get the right answers to their questions and reducing the workload of customer service staff.
Machine reading comprehension is not newly proposed. As early as 1977, Lehnert et al. [52] had already built a question answering program called QUALM which was used by two story understanding systems. In 1999, Hirschman et al. [36] constructed a reading comprehension system with a corpus of 60 development and 60 test stories of 3rd to 6th grade material. The accuracy of the baseline system was between 30% and 40% on 11 sub-tasks. Most MRC systems in the same period were rule-based or statistical models [77,13]. However, due to the lack of high-quality MRC datasets, this research field was neglected for a long time [14]. In 2013, Richardson et al. [76] created the MCTest dataset [76], which contained 500 stories and 2,000 questions. Later, many researchers began to apply machine learning models to MCTest [76,99,78,60], even though the original baseline of MCTest is a rule-based model and the number of training samples in the dataset is not large. A turning point for this field came in 2015 [14]. In order to resolve these bottlenecks, Hermann et al. [33] defined a new dataset generation method that provides large-scale supervised reading comprehension datasets. They also developed a class of attention-based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure. Since 2015, with the emergence of various large-scale supervised datasets and neural network models, the field of machine reading comprehension has entered a period of rapid development. Figure 2 shows the number of research papers on MRC since 2013. As can be seen, the number of papers on MRC has been growing at an impressive rate.
In both computer vision and MRC research, benchmark datasets play a crucial role in speeding up the development of better neural models. In the past few years, we have witnessed an explosion of work that brings various MRC benchmark datasets [9,91,114,4,30]. Figure 3 (a) shows the cumulative number of MRC datasets from the beginning of 2014 to the beginning of 2020; it shows that the number of MRC datasets has increased exponentially in recent years. These novel datasets in turn inspired a large number of new neural MRC models, such as those shown in Figure 3 (b). Taking SQuAD 1.1 [73] as an example, many neural network models were created for it in recent years, such as BiDAF [83], ELMo [69], BERT [21], RoBERTa [57] and XLNet [110]. The performance of the state-of-the-art neural network models has already exceeded human performance on the related MRC benchmark datasets.
Despite the critical importance of MRC datasets, most existing MRC reviews have focused on MRC algorithms for improving system performance [29,71], performance comparisons [4], or general reviews with limited coverage of datasets [114]. In addition, there is also a need for a systematic categorization of task types. For example, MRC tasks are usually divided into four categories: cloze style, multiple choice, span prediction and free form [14,56,71]. But this classification method is not precise, because the same MRC task can belong to both the cloze style and the multiple choice style at the same time, such as the CBT [35] task in the Facebook bAbI project [104]. Moreover, most researchers focus on a few popular MRC datasets, while many of the rest are not widely known or studied by the community. To address these gaps, a comprehensive survey of existing MRC benchmark datasets, evaluation metrics and tasks is strongly needed.
At present, many neural MRC models have already surpassed human performance on several MRC datasets, but there is still a giant gap between existing MRC and real human comprehension [42]. This shows the need to improve existing MRC datasets in terms of both question and answer challenges and the related evaluation criteria. In order to build more challenging MRC datasets, we need to better understand existing MRC tasks, evaluation metrics and datasets.
Our contributions, as shown in Figure 4, include the following: (1) we analyzed 57 English MRC tasks and datasets and proposed a more precise classification standard of MRC tasks with 4 different attributes, each of which can be divided into several types; (2) we analyzed 9 evaluation metrics of MRC tasks; (3) we summarized 7 attributes and 10 characteristics of MRC datasets; and (4) we discussed some open issues for future research. In addition, we have prepared and released all resources on datasets and evaluation metrics as a companion website for easy access by the community.

Typical machine reading comprehension task
Typical machine reading comprehension can be formulated as a supervised learning problem. Given a collection of textual training examples $\{(p_i, q_i, a_i)\}_{i=1}^{n}$, where $p$ is a passage of text and $q$ is a question regarding the text $p$, the goal of the typical machine reading comprehension task is to learn a predictor $f$ which takes a passage $p$ and a corresponding question $q$ as inputs and gives the answer $a$ as output, which can be formulated as follows [14]:

$$a = f(p, q)$$

It is necessary that a majority of native speakers would agree that the question $q$ does regard the text $p$, and that the answer $a$ is a correct one which does not contain information irrelevant to the question.
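To make this formulation concrete, the following minimal Python sketch represents a training example as a (passage, question, answer) triple and treats the predictor f as a function from a passage and a question to an answer string. The class and type names are illustrative and not taken from any particular MRC library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MRCExample:
    passage: str   # p: a passage of text
    question: str  # q: a question regarding the passage
    answer: str    # a: the gold answer

# The predictor f maps a (passage, question) pair to a predicted answer string.
Predictor = Callable[[str, str], str]

example = MRCExample(
    passage=("In meteorology, precipitation is any product of the condensation "
             "of atmospheric water vapor that falls under gravity."),
    question="What causes precipitation to fall?",
    answer="gravity",
)
```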

Definition of Typical MRC Tasks
In this survey, machine reading comprehension is considered as a research field, which includes many specific tasks, such as multi-modal machine reading comprehension, textual machine reading comprehension, etc. Since most of the existing machine reading comprehension tasks are in the form of question answering, the textual QA-based machine reading comprehension task is considered to be the typical machine reading comprehension task. According to previous review papers on MRC [14,56], the definition of a typical MRC task is given in Table 1:

Discussion on MRC Tasks
In this section, we first compare multi-modal MRCs with textual MRCs, and then discuss the relationship between question answering tasks and machine reading comprehension tasks.

Multi-modal MRC vs. Textual MRC
Multi-modal MRC is a new challenging task that has received increasing attention from both the NLP and CV communities. Compared with existing MRC tasks, which are mostly textual, multi-modal MRC requires a deeper understanding of text together with visual information such as images and videos. When humans read, illustrations can help them understand the text. Experiments showed that children with higher mental imagery skills outperformed children with lower mental imagery skills on story comprehension after reading an experimental narrative [9]. These results emphasize the importance of mental imagery skills for explaining individual variability in reading development [9]. Therefore, if we want machines to acquire human-level reading comprehension ability, multi-modal machine reading comprehension is a promising research direction.
In fact, there are already many tasks and datasets in this field, such as the TQA [44], MovieQA [95], COMICS [40] and RecipeQA [106]. As seen in Table 2, TQA is a multi-modal MRC dataset that aims at answering multi-modal questions given a context of text, diagrams and images.

Machine Reading Comprehension vs. Question Answering
The relationship between question answering and machine reading comprehension is very close. Some researchers consider MRC as a specific kind of QA task [14,56]; compared with other QA tasks such as open-domain QA, it is characterized by the requirement that the computer answer questions according to a specified text. However, other researchers regard machine reading comprehension as one method for solving QA tasks. For example, in order to answer open-domain questions, Chen et al. [15] first adopted document retrieval to find the relevant articles from Wikipedia, and then used MRC to identify the answer spans in those articles. Similarly, Hu [39] regarded machine reading as one of four methods for solving QA tasks, the other three being rule-based methods, information retrieval methods and knowledge-based methods.

However, although the typical machine reading comprehension task is usually in the form of textual question answering, the forms of MRC tasks are diverse. Lucy Vanderwende [98] argued that machine reading could be defined as the automatic understanding of text: "One way in which human understanding of text has been gauged is to measure the ability to answer questions pertaining to the text. An alternative way of testing human understanding is to assess one's ability to ask sensible questions for a given text." In fact, there are already many benchmark datasets for evaluating such techniques. For example, ShARC [79] is a conversational MRC dataset. Unlike other conversational MRC datasets, when answering questions in ShARC, the machine needs to use background knowledge that is not in the context to get the correct answer. The first question in a ShARC conversation is usually underspecified and does not provide enough information to answer directly. Therefore, the machine needs to take the initiative to ask a follow-up question, and once it has obtained enough information, it then answers the first question. Another example is RecipeQA [106], a dataset for multi-modal comprehension of illustrated recipes. There are four sub-tasks in RecipeQA; one of them is the ordering task, which tests the ability of a model to find a correctly ordered sequence given a jumbled set of representative images of a recipe [106]. As in the other visual tasks, the context of this task consists of the titles and descriptions of a recipe. To successfully complete this task, the model needs to understand the temporal occurrence of a sequence of recipe steps and infer temporal relations between candidates, i.e. boiling the water first, putting in the spaghetti next, so that the ordered sequence of images aligns with the given recipe. In addition, MS MARCO [61] also includes ordering tasks.
In summary, although most machine reading comprehension tasks are in the form of question answering, this does not mean that machine reading comprehension tasks belong to question answering. In fact, as mentioned above, the forms of MRC tasks are diverse. Question answering also includes many tasks that do not require the system to read a specific context to get an answer, such as rule-based question answering systems and knowledge-based question answering (KBQA) systems. Figure 5 illustrates the relations between machine reading comprehension (MRC) tasks and question answering (QA) tasks. As shown in Figure 5, we regard general machine reading comprehension and question answering as two sub-fields of natural language processing, both of which contain various specific tasks, such as visual question answering (VQA) tasks and multi-modal machine reading comprehension tasks. Some of these tasks belong to both the natural language processing and computer vision research fields, such as the VQA task and the multi-modal reading comprehension task. Lastly, most of the existing MRC tasks are textual question answering tasks, so we regard this kind of machine reading comprehension task as the typical machine reading comprehension task, and its definition is shown in Table 1 above.

Classification of MRC Tasks
In order to have a better understanding of MRC tasks, in this section we first analyze existing classification methods of tasks and identify their potential limitations. After analyzing 57 MRC tasks and datasets, we propose a more precise classification method of MRC tasks which has 4 different attributes, each of which can be divided into several types. The statistics of the 57 MRC tasks are shown in the table in this section.

Existing Classification Methods of MRC tasks
In many research papers [14,56,71], MRC tasks are divided into four categories: cloze style, multiple choice, span prediction, and free-form answer. Their relationship is shown in Figure 6.

Cloze style

In a cloze style task, there are some placeholders in the question, and the MRC system needs to find the most suitable words or phrases to fill in these placeholders according to the context.

Multiple choice
In a multiple choice task, the MRC system needs to select a correct answer from a set of candidate answers according to the provided context.

Span prediction
In a span prediction task, the answer is a span of text in the context. That is, the MRC system needs to select the correct beginning and end of the answer text from the context.

Free-form answer
This kind of task allows the answer to be in any free-text form; the answer is not restricted to a single word or a span in the passage [14].
It should be pointed out that the classification method above has its limitations. Firstly, it is not very precise: with this classification method, a MRC task may belong to more than one of the above categories at the same time. For instance, as seen in Table 3, a sample in the "Who did What" task [63] is both in the form of "Cloze style" and "Multiple choice", and since the answer is a span of text in the context, it could also be classified as "Span prediction". Secondly, with the rapid development of MRC, a large number of novel MRC tasks have emerged in recent years. One example is multi-modal MRC, such as MovieQA [95], COMICS [40], TQA [44] and RecipeQA [106]. These multi-modal tasks are ignored by the existing classification method.
MRC tasks have also been classified according to the answer type [114,30], which includes:

Tasks with extractive answers
Tasks with extractive answers are similar to the cloze and span MRC tasks, which use text spans in the passages as the answers to the proposed questions, for example SQuAD 1.1, CBT, TriviaQA and WikiHop. However, the authors did not give a clear definition of tasks with extractive answers, and did not distinguish between the cloze and span MRC tasks [114].

Tasks with descriptive answers
Instead of text span answers extracted from the context, the answers in MRC tasks with descriptive answers are whole, stand-alone sentences, which exhibit more fluency and integrity [114].

Tasks with multiple-choice answers
To accomplish tasks with multiple-choice answers, the computer needs to choose the right answer from a candidate option set.
However, this classification method is still not very precise. We can see that the MRC task in Table 3 belongs to the tasks with extractive answers and the tasks with multiple-choice answers at the same time. Razieh Baradaran et al. [4] also provided a statistics table for each category over 28 MRC datasets according to their volume, data collection method, context type, language, etc. This is similar to some sub-sections in Section 4, but the statistics and descriptions in Section 4 are more comprehensive, covering 7 attributes and 10 characteristics of 47 MRC datasets.

A More Precise Classification Method
In this section, we propose a more precise classification method of MRC tasks. As shown in Figure 7, we summarize four different attributes of MRC tasks: the type of corpus, the type of questions, the type of answers, and the source of answers. Each of these attributes can be divided into several categories. We give a detailed description of each category with examples in the following sections.

Type of Corpus
According to whether the corpus contains pictures or other non-text information, the MRC task can be divided into two categories: multi-modal (combination of graphics and text) and textual.

Multi-modal
In multi-modal MRC tasks, besides textual information, multi-modal information such as pictures is also included in the context, questions or answers, as seen in Table 2 above.

Textual
Most MRC tasks belong to this category. Their context, questions and answers are all plain text, as seen in Table 4 below.
There is a certain similarity between multi-modal MRC tasks and VQA (visual question answering) tasks. However, multi-modal MRC tasks focus more on natural language understanding and their context includes more text that needs to be read, whereas VQA tasks usually do not have much textual context and provide the image directly.

Type of Questions
According to the type of question, a MRC task can be classified into one of three categories: natural style, cloze style, and synthetic style.

Natural style

Table 4: An illustrative textual MRC task. The question-answer pair and passage are taken from SQuAD 1.1 [73].

Passage: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called "showers".

Question: What causes precipitation to fall?

Answer: gravity
According to the type of corpus, natural questions can be divided into textual and multi-modal. A textual natural question is usually a natural interrogative or imperative sentence. It can be formulated as:

$$q = \{q_1, q_2, \ldots, q_m\}$$

In the equation above, $q$ denotes the question, where each $q_k$ is a word and $q_k \in V$, with $V$ denoting the vocabulary. An example of a textual natural question has been shown in Table 4, and an example of a multi-modal natural question is in Table 2 above.

Cloze style
According to the type of corpus, cloze questions can also be divided into textual and multi-modal. A textual cloze question is usually a sentence with a placeholder, and the MRC system is required to find a correct word or phrase to fill in the placeholder so that the sentence is complete. A textual cloze question can be formulated as:

$$q = \{q_1, \ldots, q_{k-1}, q_k, q_{k+1}, \ldots, q_m\}$$

In the above equation, $q$ denotes the question, where $q_k$ is a placeholder and every other $q_j$ is a word with $q_j \in V$, $V$ denoting the vocabulary. An example of a textual cloze question has been shown in Table 3.
A multi-modal cloze question is a natural sentence with visual information such as images, but some of these images are missing, and the MRC system is required to fill in the missing images. For example, a sample visual cloze question from the RecipeQA [106] dataset is shown in Table 5.

Synthetic style
The "synthetic question" is just a list of words and do not necessarily conform to normal grammatical rules, such as questions in Qangaroo [103], WikiReading [34], etc. Take the Qangaroo as an example, in the Qangaroo dataset, the question is replaced by a set of words. The "question" here is not a complete sentence that fully conforms to the natural language grammar, but a combination of words, as shown in Table 6.

Type of Answers
According to the type of answers, MRC tasks can be divided into two categories: multiple-choice form and natural form.

Multiple-choice answer
In this category, the dataset provides a set of options as candidate answers. It can be formulated as follows:

$$A = \{a_0, a_1, a_2, \ldots, a_{n-1}, a_n\}$$

where $a_n$ could be a word, phrase, sentence, or image. The MRC system is required to find the correct answer $a_k$ from $A$. Examples of textual multiple-choice answers have been shown in Table 3 and Table 6, and a multi-modal example has been shown in Table 5 above.

Natural answer

The answer is a natural word, phrase, sentence or image, but it does not have to be given in the form of multiple options. An example of a natural textual answer has been shown in Table 4 above; we did not find an example of a natural multi-modal answer, i.e., all the multi-modal MRC datasets collected in this survey contain only multiple-choice answers.

Source of Answers
According to different sources of answers, we divide the MRC tasks into two categories: span and free-form.

Span answer
In this type of MRC task, the answer to a question is a span or a word taken from the passage. The passage can be formulated as:

$$p = \{p_1, p_2, \ldots, p_n\}$$

where $p$ denotes the passage and each $p_i$ is a word with $p_i \in V$, $V$ denoting the vocabulary. The answer $a$ to the question must be a span of text or a word from the corresponding passage $p$, so the answer $a$ can be formulated as:

$$a = \{p_i, p_{i+1}, \ldots, p_j\}$$

where $0 \le i \le j \le n$. An example of a textual span answer is shown in Table 3 above. It should be noted that, in this paper, we do not provide an example of a multi-modal span answer, because such tasks already exist in the field of computer vision, such as semantic segmentation, object detection and instance segmentation.

Table 7: An illustrative free-form answer MRC question.

Question: How many more dollars was the Untitled (1981) painting sold for than the 12 million dollar estimation?

Answer: 4300000

Free-form answer

A free-form answer may be any phrase, word or even image, and it does not have to come from the context. An example of a multi-modal free-form answer is shown in Table 5, and an example of a textual free-form answer is shown in Table 7 above.

Statistics of MRC Tasks
In this section, we collected 57 different MRC tasks and categorized them according to the four attributes, as shown in Table 8. Based on Table 8, we made a statistical chart of the MRC task classification, as shown in Figure 8. We can see that for the type of corpus, textual tasks still account for a large proportion (89.47%). At present, the proportion of multi-modal reading comprehension tasks is still small, about 10.53%, which shows that the field of multi-modal reading comprehension still has many challenging problems for future research. In terms of question types, the most common type is the natural form of questions, followed by the cloze type and the synthetic type. In terms of answer types, the proportions of the natural type and the multiple-choice type are 52.63% and 47.37%, respectively. In terms of answer source, 29.82% of the answers are spans, and 70.18% of the answers are free-form.

Mixed Tasks
It should be pointed out that many MRC tasks are mixtures of the above types. For example, in PaperQA [66], the question is in cloze form, the answer is in multiple-choice form, and the context corpus contains images. To take another example, in the RecipeQA-Cloze task [106], the answer type is multiple choice and the question type is cloze style. While Figure 8 shows the proportions of different types of tasks, we cannot get the overall distribution of tasks from it. Therefore, we made a sunburst chart for machine reading comprehension tasks, as shown in Figure 9. The figure is divided into four layers: the central layer represents the type of corpus, the second layer the type of questions, the third layer the type of answers, and the outermost layer the source of answers. From Figure 9, we can see that the most common MRC tasks are textual free-form tasks with natural answers and natural questions. As seen in Figure 9, the proportion of multi-modal cloze tasks is the smallest. Moreover, at present there is no dataset for multi-modal tasks with natural answers, which shows that the field of multi-modal reading comprehension still has many challenges waiting for future research.

Form of Task vs. Content of Task
The discussion above is mainly about the form of MRC tasks. However, it should be noted that, besides the form of the MRC task, the content of the context/passage and the question also determines the type of the task. As shown in Table 9, in the Facebook bAbI dataset [104], there are many different types of MRC tasks depending on the content of the passages and questions. But because classifying tasks based on content is a very subjective matter without established standards, herein we mainly analyze the form of tasks rather than their content.

Overview of Evaluation Metrics
The most commonly used evaluation metric for MRC models is accuracy. However, in order to compare the performance of MRC models more comprehensively, the models should be evaluated with various evaluation metrics. In this section, we introduce the calculation methods of the evaluation metrics commonly used in machine reading comprehension: Accuracy, Exact Match, Precision, Recall, F1, ROUGE, BLEU, HEQ and Meteor. For multiple choice or cloze style tasks, Accuracy is usually used to evaluate MRC models. For span prediction tasks, Exact Match, Precision, Recall, and F1 are usually used as evaluation metrics. Many of the evaluation metrics for MRC tasks are derived from other research areas in NLP, such as machine translation and text summarization. Similar to machine translation tasks, the goal of a MRC task is also to generate some text and compare it with the gold answer, so the evaluation metrics of machine translation tasks can also be used for MRC tasks. In the following sections, we give detailed calculation methods for these evaluation metrics.

Accuracy
Accuracy represents the percentage of questions that a MRC system answers correctly. For example, suppose a MRC task contains $N$ questions, each question corresponds to one gold answer, the answers can be a word, a phrase or a sentence, and the number of questions that the system answers correctly is $M$. The accuracy is computed as follows:

$$\text{Accuracy} = \frac{M}{N}$$
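A minimal sketch of this computation (the function name and inputs are illustrative, not taken from any specific evaluation toolkit):

```python
def accuracy(predictions, gold_answers):
    """Accuracy = M / N: the fraction of questions answered exactly correctly."""
    assert len(predictions) == len(gold_answers)
    n = len(gold_answers)                                        # N: total number of questions
    m = sum(p == g for p, g in zip(predictions, gold_answers))   # M: correctly answered questions
    return m / n

print(accuracy(["gravity", "Paris"], ["gravity", "London"]))  # 0.5
```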

Exact Match
If the gold answer to a question is a sentence or a phrase, it is possible that some of the words in the system-generated answer belong to the gold answer while the other words do not. In this case, Exact Match represents the percentage of questions for which the system-generated answer exactly matches the gold answer, meaning every word is the same. Exact Match is often abbreviated as EM.
For example, if a MRC task contains $N$ questions, each question corresponds to one gold answer, the answers can be a word, a phrase or a sentence, and the number of questions that the system answers exactly correctly is $M$. Among the remaining $N - M$ answers, some may contain some of the ground truth answer words but not exactly match the ground truth answer. The Exact Match can then be calculated as follows:

$$\text{EM} = \frac{M}{N}$$

Therefore, for the span prediction task, Exact Match and Accuracy are exactly the same. But for multiple-choice tasks, Exact Match is usually not used because there is no situation where the answer includes only a portion of the gold answer. In addition, to make the evaluation more reliable, it is also common to collect multiple gold answers for each question, in which case the exact match score only requires the prediction to match any one of the gold answers [14].
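The sketch below illustrates SQuAD-style Exact Match under the common convention of normalizing answers (lowercasing, removing punctuation and articles) before comparison and matching against any of the gold answers; the exact normalization steps vary by dataset, so this should be read as an assumption rather than a universal definition.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and the articles a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any of the gold answers, else 0."""
    return int(any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers))

# EM for a whole task is the mean of these per-question scores.
print(exact_match("The gravity", ["gravity"]))  # 1
```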

Token-level Precision
The token-level precision represents the percentage of overlapping tokens between the gold answer and the predicted answer. Following the evaluation method in SQuAD [73,74], we treat the predicted answer and the gold answer as bags of tokens, while ignoring all punctuation marks and the articles "a", "an" and "the". In order to compute the token-level Precision, we first need to define the token-level true positive (TP), false positive (FP), true negative (TN), and false negative (FN), as shown in Figure 10. As seen in Figure 10, for a single question, the token-level true positives (TP) are the tokens shared between the predicted answer and the gold answer. The token-level false positives (FP) are the tokens which are in the predicted answer but not in the gold answer, while the false negatives (FN) are the tokens which are in the gold answer but not in the predicted answer. The token-level Precision for a single question is computed as follows:

$$Precision_{TS} = \frac{Num(TP_T)}{Num(TP_T) + Num(FP_T)}$$

where $Precision_{TS}$ denotes the token-level Precision for a single question, $Num(TP_T)$ denotes the number of token-level true positive (TP) tokens and $Num(FP_T)$ denotes the number of token-level false positive (FP) tokens. For example, suppose the gold answer is "a cat in the garden" and the predicted answer is "a dog in the garden". After ignoring the articles "a" and "the", the number of shared tokens between the predicted answer and the gold answer is 2, which is $Num(TP_T)$, and $Num(FP_T)$ is 1, so the token-level Precision for this answer is 2/3.
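A minimal bag-of-tokens sketch of this precision, reusing the hypothetical normalize_answer helper from the Exact Match sketch above:

```python
from collections import Counter

def token_precision(prediction, gold):
    """Token-level precision: shared tokens / tokens in the prediction (bag-of-tokens overlap)."""
    pred_tokens = normalize_answer(prediction).split()  # normalize_answer from the EM sketch above
    gold_tokens = normalize_answer(gold).split()
    num_tp = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())  # shared tokens
    return num_tp / len(pred_tokens) if pred_tokens else 0.0

# Worked example from the text: "dog in garden" vs. "cat in garden" share 2 of 3 tokens.
print(token_precision("a dog in the garden", "a cat in the garden"))  # 0.666...
```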

Question-level Precision
The question-level precision represents the average percentage of answer overlap (not token overlap) between all the gold answers and all the predicted answers in a task [108]. The question-level true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are shown in Figure 11. As seen in Figure 11, the question-level true positives (TP) are the answers shared between all predicted answers and all gold answers, in which one answer is treated as one entity no matter how many words it consists of. The question-level false positives (FP) are those predicted answers which do not belong to the set of gold answers, while the question-level false negatives (FN) are those gold answers which do not belong to the set of predicted answers. The question-level Precision for a task is computed as follows:

$$Precision_{Q} = \frac{Num(TP_Q)}{Num(TP_Q) + Num(FP_Q)}$$

where $Precision_{Q}$ denotes the question-level Precision for a task, $Num(TP_Q)$ denotes the number of question-level true positive (TP) answers and $Num(FP_Q)$ denotes the number of question-level false positive (FP) answers.

Token-level Recall
The token-level Recall represents the percentage of tokens in a gold answer that have been correctly predicted for a question. Following the definitions of the token-level true positive (TP), false positive (FP), and false negative (FN) above, the token-level Recall for a single question is computed as follows:

$$Recall_{TS} = \frac{Num(TP_T)}{Num(TP_T) + Num(FN_T)}$$

where $Recall_{TS}$ denotes the token-level Recall for a single question, $Num(TP_T)$ denotes the number of token-level true positive (TP) tokens and $Num(FN_T)$ denotes the number of token-level false negative (FN) tokens.
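The corresponding sketch only changes the denominator from predicted tokens to gold-answer tokens (again relying on the hypothetical normalize_answer helper and the Counter import from the sketches above):

```python
def token_recall(prediction, gold):
    """Token-level recall: shared tokens / tokens in the gold answer."""
    pred_tokens = normalize_answer(prediction).split()  # normalize_answer from the EM sketch above
    gold_tokens = normalize_answer(gold).split()
    num_tp = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    return num_tp / len(gold_tokens) if gold_tokens else 0.0
```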

Question-level Recall
The question-level Recall represents the percentage of the gold answers that have been correctly predicted in a task [108]. Following the definitions of the question-level true positive (TP), false positive (FP), and false negative (FN), the question-level Recall for a task is computed as follows:

$$Recall_{Q} = \frac{Num(TP_Q)}{Num(TP_Q) + Num(FN_Q)}$$

where $Recall_{Q}$ denotes the question-level Recall for a task, $Num(TP_Q)$ denotes the number of question-level true positive (TP) answers and $Num(FN_Q)$ denotes the number of question-level false negative (FN) answers.

F1
Token-level F1

Token-level F1 is a commonly used MRC evaluation metric. The equation of the token-level F1 for a single question is:

$$F1_{TS} = \frac{2 \times Precision_{TS} \times Recall_{TS}}{Precision_{TS} + Recall_{TS}}$$

where $F1_{TS}$ denotes the token-level F1 for a single question, $Precision_{TS}$ denotes the token-level Precision for a single question and $Recall_{TS}$ denotes the token-level Recall for a single question.

To make the evaluation more reliable, it is also common to collect multiple gold answers for each question [14]. Therefore, to get the average token-level F1, we first compute the maximum token-level F1 over all the gold answers of a question, and then average these maximum token-level F1 scores over all of the questions [14]. The equation of the average token-level F1 for a task is:

$$F1_{T} = \frac{\sum_{\text{questions}} Max(F1_{TS})}{Num(Questions)}$$

where $F1_{T}$ denotes the average token-level F1 for a task, $Max(F1_{TS})$ denotes the maximum token-level F1 over all the gold answers for a single question, and $Num(Questions)$ denotes the number of questions in the task.
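A sketch of the average token-level F1, building on the hypothetical token_precision and token_recall helpers above and taking the maximum over multiple gold answers per question:

```python
def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall for one prediction/gold pair."""
    p = token_precision(prediction, gold)
    r = token_recall(prediction, gold)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def average_token_f1(predictions, gold_answer_lists):
    """Average over questions of the best F1 against any of that question's gold answers."""
    scores = [max(token_f1(pred, gold) for gold in golds)
              for pred, golds in zip(predictions, gold_answer_lists)]
    return sum(scores) / len(scores)
```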

Question-level F1
The equation of the question-level F1 for a task is:

$$F1_{Q} = \frac{2 \times Precision_{Q} \times Recall_{Q}}{Precision_{Q} + Recall_{Q}}$$

where $F1_{Q}$ denotes the question-level F1, $Precision_{Q}$ denotes the question-level Precision for a task and $Recall_{Q}$ denotes the question-level Recall for a task.
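For answer-selection style tasks such as WikiQA, where whole answers rather than tokens are matched, the question-level metrics can be sketched as follows; treating the predicted and gold answers as sets is an assumption about how duplicates are handled:

```python
def question_level_metrics(predicted_answers, gold_answers):
    """Question-level precision, recall and F1, treating each answer as a single entity."""
    pred_set, gold_set = set(predicted_answers), set(gold_answers)
    tp = len(pred_set & gold_set)   # answers that are both predicted and gold
    fp = len(pred_set - gold_set)   # predicted answers that are not gold
    fn = len(gold_set - pred_set)   # gold answers that were not predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```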

ROUGE
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation, which was first proposed by Chin-Yew Lin [55]. In the original paper, ROUGE was used to evaluate the performance of text summarization systems. Currently, ROUGE is also used in the evaluation of MRC systems. ROUGE-N is an n-gram Recall between a candidate summary and a set of reference summaries [55].

According to the value of n, ROUGE is specifically divided into ROUGE-1, ROUGE-2, ROUGE-3 and so on. ROUGE-N is computed as follows:

$$\text{ROUGE-N} = \frac{\sum_{S \in RS} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in RS} \sum_{gram_n \in S} Count(gram_n)}$$

where $n$ is the length of the n-gram, $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in a reference summary and the candidate text generated by the algorithm, $Count(gram_n)$ is the number of n-grams in the reference summaries, and $RS$ is an abbreviation of Reference Summaries.
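A simplified sketch of ROUGE-N for a single candidate answer against multiple references, using clipped n-gram counts and no stemming or stopword handling (the helper names are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, references, n=2):
    """ROUGE-N: clipped n-gram matches divided by the total number of n-grams in the references."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    total_match, total_ref = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        total_match += sum((cand_counts & ref_counts).values())  # clipped co-occurring n-grams
        total_ref += sum(ref_counts.values())
    return total_match / total_ref if total_ref else 0.0
```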

BLEU
BLEU (Bilingual Evaluation Understudy) was proposed by Papineni et al. [65]. In the original paper, BLEU was used to evaluate the performance of machine translation systems. Currently, BLEU is also used in the performance evaluation of MRC systems. BLEU is computed by taking the geometric mean of the modified n-gram precisions and then multiplying the result by an exponential brevity penalty factor. Case folding is the only text normalization performed before computing the precision. First, the geometric average of the modified n-gram precisions $p_n$ is computed, using n-grams up to length $N$ and positive weights $w_n$ summing to one [65].

Next, let $c$ be the length of the candidate sentence and $r$ be the length of the effective reference corpus. The brevity penalty $BP$ is computed as follows [65]:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

Then:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
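A compact sketch of sentence-level BLEU with uniform weights, clipped n-gram precision and the brevity penalty, reusing the ngrams helper and Counter import from the ROUGE sketch above. Smoothing is omitted, so any zero n-gram precision drives the score to zero; in practice an established implementation would normally be used.

```python
import math

def modified_precision(cand_tokens, ref_tokens, n):
    """Clipped n-gram precision p_n of the candidate against a single reference."""
    cand = Counter(ngrams(cand_tokens, n))   # ngrams/Counter from the ROUGE sketch above
    ref = Counter(ngrams(ref_tokens, n))
    clipped = sum((cand & ref).values())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights w_n = 1 / max_n."""
    cand_tokens, ref_tokens = candidate.lower().split(), reference.lower().split()
    precisions = [modified_precision(cand_tokens, ref_tokens, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0   # no smoothing: any empty n-gram precision zeroes the score
    c, r = len(cand_tokens), len(ref_tokens)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```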

Meteor
Meteor was first proposed by Banerjee and Lavie [3] in order to evaluate machine translation systems. Unlike BLEU, which uses only Precision, Meteor uses a combination of Recall and Precision to evaluate the system. In addition, Meteor also includes features such as synonym matching. Besides Meteor, Denkowski and Lavie also proposed Meteor-next [19] and Meteor 1.3 [20], whose new features include improved text normalization, higher-precision paraphrase matching, and discrimination between content and function words. Currently, some MRC datasets use Meteor as one of their evaluation metrics, such as the NarrativeQA [49] dataset. The Meteor score for a given alignment is computed as follows:

$$\text{Meteor} = (1 - Penalty) \cdot F_{mean}$$

where $F_{mean}$ combines the Precision and the Recall via a harmonic mean [97] that places most of the weight on Recall:

$$F_{mean} = \frac{Precision \cdot Recall}{\alpha \cdot Precision + (1 - \alpha) \cdot Recall}$$

and $Penalty$ is a fragmentation penalty that accounts for differences and gaps in word order, calculated using the total number of matched words ($m$, averaged over hypothesis and reference) and the number of chunks ($ch$):

$$Penalty = \gamma \cdot \left(\frac{ch}{m}\right)^{\beta}$$

The parameters $\alpha$, $\beta$, and $\gamma$ are tuned to maximize correlation with human judgments [20]. It should be noted that the Precision and Recall in Meteor 1.3 are improved by text normalization; see the original paper of Denkowski and Lavie for the detailed calculation of Precision and Recall in Meteor 1.3 [20].

HEQ
HEQ stands for Human Equivalence Score, an MRC evaluation metric for conversational reading comprehension datasets such as QuAC [16]. For datasets in which questions may have multiple valid answers, the F1 score can be misleading, so HEQ was introduced. HEQ judges whether the output of the system is as good as that of an average human. For example, suppose a MRC task contains $N$ questions, and the number of questions for which the token-level F1 of the system's answer exceeds or reaches the human token-level F1 is $M$. The HEQ score is computed as follows [16]:

$$\text{HEQ} = \frac{M}{N}$$
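A minimal sketch, assuming the per-question token-level F1 scores of the system and of a human reference have already been computed (the function and argument names are illustrative):

```python
def heq(system_f1_scores, human_f1_scores):
    """HEQ = M / N: fraction of questions where the system's F1 reaches or exceeds the human F1."""
    assert len(system_f1_scores) == len(human_f1_scores)
    m = sum(s >= h for s, h in zip(system_f1_scores, human_f1_scores))
    return m / len(system_f1_scores)
```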

Statistics of Evaluation Metrics
In this section, we collate the evaluation metrics used by the 57 MRC tasks, as seen in Table 10. Figure 12 shows statistics on the usage of different evaluation metrics across the 57 MRC tasks collected in this paper. Among them, Accuracy is the most widely used evaluation metric, used by 61.40% of the MRC tasks collected in this paper. It is followed by F1 (36.84%) and Exact Match (22.81%). The remaining evaluation metrics are used less often, as shown in Figure 12. We also analyzed the relationship between the evaluation metrics and the task types. Figure 13 shows the usage of evaluation metrics for different types of tasks. Taking the "Accuracy" in Figure 13 (b) as an example, a total of 35 MRC tasks use "Accuracy" as an evaluation metric. Among them, 25 tasks have the "Multi-choice" type of answers, and the remaining 10 tasks have the "Natural" type of answers. It can be seen from Figure 13 (b) that tasks with "Multi-choice" answers prefer the "Accuracy" evaluation metric over other evaluation metrics. This is because it is impossible to calculate the EM, Precision, BLEU or F1 score of a typical "Multi-choice" question which has only one correct answer among the candidates. Among the "Multi-choice" tasks we collected, only the MultiRC [46] task does not use Accuracy, but F1 and Exact Match, as evaluation metrics; that is because there are multiple correct answers among the candidates of the MultiRC task. As can be seen from Figure 13 (a), tasks with "Cloze" questions prefer to use "Accuracy" as the evaluation metric rather than other evaluation metrics, because "Cloze" tasks tend to have "Multi-choice" answers. From Figure 13 (c), we can see that tasks with "Spans" answers and tasks with "Free-form" answers have no special preference in selecting evaluation metrics.

Benchmark Dataset
In this section, we analyze various attributes of 57 MRC benchmark datasets, including: dataset size, generation method, source of corpus, context type, availability of leaderboards and baselines, prerequisite skills, and citations of related papers.

The Size of Datasets
The recent success of machine reading comprehension is driven largely by both large-scale datasets and neural models [14]. The size of a dataset affects the generalization ability of the MRC model and determines whether the model is useful in the real world. Early MRC datasets tended to be small. With the continuous development of MRC datasets in recent years, newly created MRC datasets generally contain more than 10K questions. Here, we have counted the total number of questions in each MRC dataset along with the sizes of its training set, development set and test set, as well as the proportion of the training set in the total number of questions. The data are shown in Table 11, which is sorted by the question set size of the datasets. We also use the data in Table 11 to make a statistical chart with a logarithmic Y axis, as shown in Figure 14. We can see that WikiReading [34] is the dataset with the largest question set, with a total of 18.87M questions; BookTest [2] is ranked second, and ProPara [18] is the smallest, with only 488 questions. When it comes to the proportion of training sets, BookTest has the highest proportion (99.86%), while the ARC (challenge set) has the lowest (43.20%). The development set is generally slightly smaller than the test set. Because different MRC datasets contain different corpora, we also give details of the corpus used in each MRC dataset, including the size and unit of the corpus, as well as the sizes of the training, development and test sets. As seen in Table 12, the units of corpus in MRC datasets vary, e.g., paragraphs, documents, etc.

The Generation Method of Datasets
The generation methods of datasets can be roughly divided into several categories: crowdsourcing, expert, and automated. "Crowdsourcing" has evolved as a distributed problem-solving and business production model in recent years [112]. An example of a crowdsourcing website is Amazon Mechanical Turk. Today, the questions in many MRC datasets are posed by a distributed workforce on such crowdsourcing websites. The "expert" generation method means that the question and answer pairs in the dataset are generated by people with professional knowledge in certain fields. For example, the ARC dataset [17] contains 7,787 science questions used in US elementary and middle schools. The "automated" generation method means that question and answer pairs are automatically generated from a corpus, as in many cloze datasets.

The Source of Corpus
The source of corpus affects the readability and complexity of machine reading comprehension datasets. According to the source of corpus, the MRC datasets can be described as the following types: Exam Text, Wikipedia, News articles, Abstract of Scientific Paper, Crafted story, Technical documents, Text Book, Movie plots, Recipe, Government Websites, Search engine query logs, Hotel Comments, Narrative text, etc.

The Type of Context
The type of context can affect the training method of a machine reading comprehension model, which has produced many specialized models, such as multi-hop reading comprehension models and multi-document reading comprehension models. There are many types of context in MRC datasets, including: paragraph, multi-paragraph, document, multi-document, URL, and paragraphs with diagrams or images. In Table 13, we give details of the generation method, corpus source and context type of each machine reading comprehension dataset.

The Availability of Datasets, Leaderboards and Baselines
The release of MRC baseline projects and leaderboards can help researchers evaluate the performance of their models. In this section, we collect all the MRC dataset download links, leaderboards and baseline projects we could find. As shown in Table 14, the download links of all MRC datasets are available except for PaperQA [67]. Most datasets provide leaderboards and baseline projects; only about 19.3% of the datasets do not. We have published all the download links, leaderboards and baseline projects on our website. Figure 15 shows a statistical analysis of the dataset attributes listed in Table 13. As seen in Figure 15 (a), the most common way to generate datasets is "Crowdsourcing", which can produce question and answer pairs that require complex reasoning abilities. The second is the "Automated" method, which can help quickly create large-scale MRC datasets. The "Expert" generation method is the least used because it is usually expensive. When it comes to context type, as seen in Figure 15 (b), the main context type is "Paragraph", followed by "Document", "Paragraph with images", "Multi-Paragraph" and so on. Figure 15 (c) shows the sources of corpora, which are very diverse; among them, "Wikipedia" is the most common context source, but it only accounts for 19.30%. Figure 15 (d) illustrates the availability of leaderboards and baselines.

Prerequisite Skills
When humans read passages and answer questions, we need to master various prerequisite skills in order to answer them correctly. The analysis of these prerequisite skills may help us understand the intrinsic properties of MRC datasets. In Table 15, we quote the descriptions and examples of prerequisite skills proposed by Sugawara et al. [89]. They defined 10 kinds of prerequisite skills, including List/Enumeration, Mathematical operations, Coreference resolution, Logical reasoning, etc. By manually annotating questions in MCTest [76] and SQuAD 1.1 [73], they obtained the frequency of each prerequisite skill in the two MRC datasets, as seen in Table 15. However, the definition and classification of these prerequisite skills are often subjective and changeable. Many definitions have been proposed [89][90][91], but it is still hard to give a standard mathematical definition of them, as is the case for natural language understanding itself.

Citation Analysis
The number of citations of the paper in which a dataset was proposed reveals the dataset's impact to some extent. As shown in Table 16, we counted how many times each paper has been cited. We counted both the total number of citations and the monthly average number of citations since publication. Except for the two PaperQA datasets [37,67], the citation counts of all other papers were found in Google Scholar. In Table 16, the datasets are sorted by monthly average citations. As expected, the dataset with the highest monthly average citations is SQuAD 1.1 [73], followed by CNN/Daily Mail [32] and SQuAD 2.0 [74], which shows that these datasets are widely used as benchmarks. We also analyzed the relationship between total citations and monthly average citations. As seen in Figure 16, on the whole, there is a correlation between the monthly average citations and the total citations of an MRC dataset. For example, the top two datasets by total citations and by monthly average citations are the same: SQuAD 1.1 [73] and CNN/Daily Mail [32]. However, some papers with lower total citations have higher monthly citations; these papers have been published for a short time but have received a lot of attention from the community, such as SQuAD 2.0 [74]. In addition, some papers with higher total citations have relatively low monthly average citations, because these datasets have been published for a long time but are rarely used in recent years.

Overview
In recent years, various large-scale MRC datasets have been created. The growth of large-scale datasets has greatly promoted research progress in machine reading comprehension. We summarize the characteristics of each dataset in Table 17. In the following sections, we will describe each of them separately.

MRC with Unanswerable Questions
Existing MRC datasets often lack training samples for unanswerable questions, which weakens the robustness of MRC systems. As a result, when MRC models face unanswerable questions, they always try to give the most likely answer rather than refuse to answer, so no matter how the model answers, the answer must be wrong. In order to solve this problem, researchers have proposed several more challenging MRC datasets with unanswerable questions. Among the datasets collected by us, the datasets that contain unanswerable questions include SQuAD 2.0 [74], MS MARCO [61], Natural Questions [50] and NewsQA [96]. We give a detailed description of these datasets in Section 4.10.

Multi-hop Reading Comprehension
In most MRC datasets, the answer to a question can usually be found in a single paragraph or document. However, in real human reading comprehension, for example when reading a novel, we are very likely to draw answers from multiple paragraphs. Compared with single-passage MRC, multi-hop machine reading comprehension is more challenging and requires multi-hop searching and reasoning over multiple confusing passages or documents. In different papers, multi-hop MRC is named in different ways, such as multi-document machine reading comprehension [107], multi-paragraph machine reading comprehension [101], and multi-sentence machine reading comprehension [46]. Compared with single-paragraph MRC, multi-hop MRC is naturally suited to processing information scattered across unstructured sources.

Multi-modal Reading Comprehension
When humans read, they often do so in a multi-modal way. For example, in order to understand the information and answer the questions, we sometimes need to read both the text and the illustrations, and we need to use our brain to imagine, reconstruct, reason, calculate, analyze or compare. Currently, most of the existing machine reading comprehension datasets belong to plain textual machine reading comprehension, which has some limitations: some complex or precise concepts cannot be described or communicated via text alone. For example, if we need the computer to answer precise questions related to aircraft engine maintenance, we may have to input an image of the aircraft engine. Multi-modal machine reading comprehension is a dynamic interdisciplinary field with great application potential. Considering the heterogeneity of the data, multi-modal machine reading comprehension brings unique challenges to NLP researchers, because the model has to understand both texts and images. In recent years, due to the availability of large-scale internet data, many multi-modal MRC datasets have been created, such as TQA [44], RecipeQA [106], COMICS [40], and MovieQA [95].

Reading Comprehension Require Commonsense or World knowledge
Human language is complex. When answering questions, we often need to draw upon our commonsense or world knowledge. Moreover, in the evolution of human language, many conventional puns and polysemous words have been formed, and the use of the same words in different scenes also requires the computer to have a good command of the relevant commonsense or world knowledge. Conventional MRC tasks usually focus on answering questions about given passages, and in the existing machine reading comprehension datasets only a small proportion of questions need to be answered with commonsense knowledge. In order to build MRC models with commonsense or world knowledge, many Commonsense Reading Comprehension (CRC) datasets have been created, such as CommonSenseQA [94], ReCoRD [113] and OpenBookQA [58].

Complex Reasoning MRC
Reasoning is an innate ability of human beings, which is embodied in logical thinking, reading comprehension and other activities. Reasoning is also a key component of artificial intelligence and a fundamental goal of MRC. In recent years, reasoning has been an essential topic in the MRC community. We hope that an MRC system can not only read and learn the representation of the language, but also really understand the context and answer complex questions. In order to push towards complex reasoning MRC systems, many datasets have been generated, such as Facebook bAbI [104], DROP [23], RACE [51], and CLOTH [105].

Conversational Reading Comprehension
It is natural for human beings to exchange information through a series of conversations. In typical MRC tasks, different question and answer pairs are usually independent of each other. However, in real human communication, we often achieve efficient understanding of complex information through a series of interrelated conversations. Similarly, in human communication scenarios, we often ask questions on our own initiative to obtain the key information that helps us understand the situation. In this process, we need to have a deep understanding of the previous conversation in order to answer each other's questions correctly or ask meaningful new questions; therefore, historical conversation information also becomes a part of the context. In recent years, conversational machine reading comprehension (CMRC) has become a new research hotspot in the NLP community, and many related datasets have emerged, such as CoQA [75], QuAC [16], DREAM [92] and ShARC [79].

Domain-specific Datasets
In this paper, a domain-specific dataset refers to an MRC dataset whose context comes from a particular domain, such as science examinations, movies, or clinical reports. Therefore, the neural network models trained on those datasets can usually be directly applied to the corresponding field. For example, CliCR [93] is a cloze MRC dataset in the medical domain, with approximately 100,000 cloze questions about clinical case reports. SciQ [102] is a multiple-choice MRC dataset containing 13.7K crowdsourced science exam questions about physics, chemistry, biology and other subjects; the context and questions of SciQ are derived from scientific exam questions. In addition, domain-specific datasets also include ReviewQA [29], SciTail [47], WikiMovies [59] and PaperQA [66].

MRC with Paraphrased Paragraph
Paragraph paraphrasing refers to rewriting or rephrasing a paragraph using different words while still conveying the same message. An MRC dataset with paraphrased paragraphs has at least two versions of the context which express the same meaning while having little word overlap between them. The task of paraphrased MRC requires the computer to answer questions about these contexts, and in order to answer the questions correctly, the computer needs to really understand the true meaning of the different versions of the context. So far, we have only found DuoRC [80] and Who-did-What [63] to be datasets of this type.

Large-scale MRC Dataset
Early MRC datasets were usually not very large, such as QA4MRE, CuratedTREC [7] and MCTest [76]. With the emergence of large-scale datasets, MRC has been greatly promoted because training neural network models has become feasible.

MRC dataset for Open-Domain QA
Open-domain question answering was originally defined as finding answers in collections of unstructured documents [15]. With the progress of MRC, many machine reading comprehension datasets tend to be used to solve open-domain QA. The release of new MRC datasets for training and evaluation, such as MCTest [76], CuratedTREC [7], Quasar [22] and SearchQA [24], has greatly promoted open-domain QA in recent years.

Descriptions of each MRC dataset
In section 4.9, we introduced the characteristics of various machine reading comprehension datasets. In this section, we give a detailed description of the 47 MRC datasets collected in our survey, together with their download links, following the order of the datasets in Table 17.

WikiQA
The WikiQA [108] dataset contains question-answer pairs derived from real Bing query logs, together with links to Wikipedia passages which might contain the answers. The WikiQA dataset also contains questions that cannot actually be answered from the given passages, so the machine should detect these unanswerable questions. WikiQA was completed by crowd workers and contains 3,047 questions and 29,258 sentences, in which 1,473 sentences were marked as answer sentences for the questions [108]. The WikiQA dataset is available on https://www.microsoft.com/enus/download/details.aspx?id=52419.

SQuAD 2.0
SQuAD 2.0 [74] is the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines the data from the existing SQuAD 1.1 [73] with more than 50,000 unanswerable questions written by crowd workers. In order to acquire a good performance on SQuAD 2.0, the MRC model not only needs to answer questions when possible, but also needs to identify questions that have no correct answer in the context and refrain from answering them [74]. For existing models, SQuAD 2.0 is a challenging natural language understanding task: as mentioned in the authors' paper, a strong neural model that achieved 86% F1 on SQuAD 1.1 received only 66% F1 on SQuAD 2.0. Data for both SQuAD 1.1 and SQuAD 2.0 are available on https://rajpurkar.github.io/SQuAD-explorer/.
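To make this requirement concrete, the following minimal sketch (in Python, with function names of our own choosing) approximates how SQuAD-style answers are typically scored with exact match and token-overlap F1, treating an empty prediction as the correct response to an unanswerable question; it is a simplified illustration under these assumptions, not a reproduction of the official evaluation script.

import re
import string
from collections import Counter

def normalize(text):
    # Lower-case, drop punctuation and articles, collapse whitespace (as in SQuAD-style scoring).
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # For an unanswerable question the gold answer is empty: only abstaining (an empty prediction) is credited.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("", ""))            # 1.0: the model correctly abstains on an unanswerable question
print(token_f1("the 1876", "1876"))   # 1.0: articles are stripped before token overlap is computed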

Natural Questions
Natural Questions [50] is a MRC dataset with unanswerable questions. The samples in this dataset come from real anonymized questions issued to the Google search engine, and the answers are manually annotated by crowd workers. Each crowd worker is presented with a question and a related Wikipedia page and is required to mark a long answer (usually a paragraph) and a short answer (usually one or more entities) on the page, or to mark null if there is no correct answer. The Natural Questions dataset consists of 307,373 training samples with single annotations, 7,830 development samples with 5-way annotations, and 7,842 test examples with 5-way annotations [50]. The dataset can be downloaded at https://github.com/google-research-datasets/natural-questions, which also has a link to the leaderboard.

DuoRC
DuoRC [80] is a MRC dataset which contains 186,089 question-answer pairs generated from 7,680 pairs of movie plots. Each pair of movie plots reflects two versions of the same movie: one from Wikipedia and the other from IMDb, written by two different authors. In the process of building question-answer pairs, the authors required crowd workers to create questions from one version of the story and a different set of crowd workers to extract or synthesize answers from the other version. This is the unique feature of DuoRC: there is almost no vocabulary overlap between the two versions. Additionally, the narrative style of the paragraphs generated from the movie plots (compared to the typical descriptive paragraphs in existing datasets) indicates the need for complex reasoning over events across multiple sentences [61]. DuoRC is a challenging dataset, and the authors observed that the state-of-the-art model on SQuAD 1.1 [73] also performed poorly on DuoRC, with an F1 score of 37.42% compared with 86% on SQuAD 1.1. The dataset, paper and leaderboard of DuoRC can be obtained at https://duorc.github.io/.

Who-did-What
The Who-did-What [63] dataset contains more than 200,000 fill-in-the-gap (cloze) multiple-choice reading comprehension questions constructed from the LDC English Gigaword newswire corpus. Compared to other existing machine reading comprehension datasets such as CNN/Daily Mail [32], the Who-did-What dataset avoids using article summaries to create samples; instead, each sample is formed from two independent articles, where one article is given as the passage to be read and the other article on the same events is used to form the question. Second, the authors avoided anonymization: each choice is a person named entity. Third, the questions have been filtered to remove a fraction that are easily solved by simple baselines, while humans can still solve 84% of the questions [63]. The dataset and leaderboard of Who-did-What are available on https://tticnlp.github.io/who_did_what/index.html.

ARC
AI2 Reasoning Challenge (ARC) [17] is a MRC dataset and task to encourage AI research in question answering that requires deep reasoning. To finish the ARC task, the MRC model requires far more powerful knowledge and reasoning than previous challenges such as SQuAD [73,74] or SNLI [11]. The ARC dataset contains 7,787 elementary-level science questions in multiple-choice form. The dataset is divided into a Challenge Set and an Easy Set, where the Challenge Set only contains questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The ARC dataset contains only natural, elementary-level science questions (written for human exams) and is the largest collection of its kind. The authors tested several baselines on the Challenge Set, including state-of-the-art models from SQuAD and SNLI, and found that none of them were significantly better than a random baseline, reflecting the difficulty of the task. The authors also published the ARC corpus, a corpus of 14M science-related sentences relevant to this task, as well as implementations of three neural baseline models [17]. Information about the ARC dataset and leaderboards is available on http://data.allenai.org/arc/.

MCScript
MCScript [64] is a large-scale MRC dataset with narrative texts and questions that require reasoning using commonsense knowledge. The dataset focuses on narrative texts about everyday activities, and commonsense knowledge is required to answer multiple-choice questions based on these texts. The MCScript dataset also forms the basis of a shared task on commonsense and script knowledge organized at SemEval 2018 [64]. The official web page and CodaLab competition page of the SemEval 2018 Shared Task 11 are available on https://competitions.codalab.org/competitions/17184.

OpenBookQA
OpenBookQA [58] consists of about 6,000 elementary-level science questions in multiple-choice form (4,957 for training, 500 for validation, and 500 for testing). Answering the questions in OpenBookQA requires broad common knowledge, as well as a deeper understanding of both the topic (in the context of common knowledge) and the language it is expressed in [58]. The baseline models provided by the authors reach about 50% on this dataset, while many state-of-the-art pre-trained QA methods perform surprisingly even worse [58]. Dataset and leaderboard of OpenBookQA are available on https://leaderboard.allenai.org/open_book_qa/.

ReCoRD
ReCoRD [113] is a large-scale MRC dataset that requires deep commonsense reasoning. Experiments on ReCoRD show that the performance of state-of-the-art MRC models lags far behind human performance. ReCoRD thus represents a challenge for future research to bridge the gap between human and machine commonsense reading comprehension. The ReCoRD dataset contains more than 120,000 queries from over 70,000 news articles, and each query has been verified by crowd workers [113]. Since July 2019, SuperGLUE has included ReCoRD in its evaluation suite. ReCoRD is available on https://sheng-z.github.io/ReCoRD-explorer/.

CommonSenseQA
CommonSenseQA [94] is a MRC dataset that requires different types of commonsense knowledge to predict the correct answer. It contains 12,247 questions, split into training, validation and test sets. The authors performed two types of splits: a "Random split", which is the main evaluation split, and a "Question token split", where each of the three sets has disjoint question concepts [94]. In order to capture common sense beyond simple association, the authors of CommonSenseQA extracted multiple target concepts from ConceptNet 5.5 [87] that have the same semantic relationship to a single source concept. Crowd workers were then asked to author multiple-choice questions that mention the source concept and discriminate between the target concepts. This encouraged crowd workers to ask questions with complex semantics that often require prior knowledge [94]. The dataset and leaderboard of CommonSenseQA are available on https://www.taunlp.org/commonsenseqa.

WikiReading
WikiReading [34] is a large-scale machine reading comprehension dataset that contains 18 million instances. The dataset covers 4.7 million unique Wikipedia articles, which means that about 80% of the English Wikipedia is represented. In the WikiReading dataset, multiple instances can share the same document, with an average of 5.31 instances per article (median: 4, maximum: 879). The most common document categories (humans, categories, movies, albums, and human settlements) together account for 48.8% of documents and 9.1% of instances. The average and median document lengths are 489.2 and 203 words [34]. The WikiReading dataset is available on https://github.com/google-research-datasets/wiki-reading.

WikiMovies
WikiMovies [59] is a MRC dataset based on Wikipedia documents. To compare using knowledge bases (KBs), information extraction, or Wikipedia documents directly within a single framework, the authors built the WikiMovies dataset, which contains raw text together with preprocessed KBs. WikiMovies is part of Facebook's bAbI project; information about the bAbI project is available on https://research.fb.com/downloads/babi/, and the WikiMovies dataset is available on http://www.thespermwhale.com/jaseweston/babi/movieqa.tar.gz.

MovieQA
The MovieQA [95] dataset is a multi-modal machine reading comprehension dataset designed to evaluate automatic understanding of both pictures and texts. The dataset contains 14,944 multiple-choice questions from 408 movies, and the questions range from simpler "Who" did "What" to "Whom" questions to "Why" and "How" certain events occurred. The MovieQA dataset is unique because it contains multiple sources of information: video clips, plots, scripts, subtitles, and DVS [95]. Download links and evaluation benchmarks of the MovieQA dataset can be obtained for free from http://movieqa.cs.toronto.edu/home/.

COMICS
COMICS [40] is a multi-modal machine reading comprehension dataset, which is composed of more than 1.2 million comic panels (120 GB) and automatic text box transcriptions. In the COMICS task, the machine is required to read and understand the text and images in the comic panels at the same time. Besides the traditional textual cloze tasks, the authors also designed two novel MRC tasks (visual cloze, and character coherence) to test the model's ability to understand narratives and characters in a given context [40]. The dataset and baseline of COMICS are available on https://obj.umiacs.umd.edu/comics/index.html.

TQA
The TQA [44] (Textbook Question Answering) challenge encourages multi-modal machine comprehension (M3C) tasks. Compared with visual question answering (VQA) [1], the TQA task provides multi-modal contexts and question-answer pairs consisting of text and images. The TQA dataset is constructed from middle school science curricula, whose textual and diagrammatic content references fairly complex phenomena that occur in the world. Many questions require not only simple lookup, but also complex analysis of and reasoning over the multi-modal context. The TQA dataset consists of 1,076 lessons and 26,260 multi-modal questions [44]. Analysis shows that a high proportion of questions in the TQA dataset require complex parsing of text and diagrams as well as reasoning, which distinguishes the TQA dataset from previous machine comprehension and VQA datasets [1]. The TQA dataset and leaderboards are available on http://vuchallenge.org/tqa.html.

RecipeQA
RecipeQA [106] is a MRC dataset for multi-modal comprehension of recipes. It consists of about 20K instructional recipes with both text and images and more than 36K automatically generated question-answer pairs. A sample in RecipeQA contains a multi-modal context, such as headings, descriptions, or images. To find an answer, the model needs (i) a joint understanding of the pictures and text, (ii) to capture the temporal flow of events, and (iii) to understand procedural knowledge [106]. The dataset and leaderboard of RecipeQA are available on http://hucvl.github.io/recipeqa.

HotpotQA
HotpotQA [109] is a multi-hop MRC dataset with multiple paragraphs, containing 113k Wikipedia-based QA pairs. Different from other MRC datasets, in HotpotQA the model is required to perform complex reasoning across multiple paragraphs and to provide explanations for its answers. HotpotQA has four key features: (1) the questions require the machine to read and reason over multiple supporting documents in order to find the answer; (2) the questions are diverse and not constrained to any pre-existing knowledge base; (3) the authors provide sentence-level supporting facts required for reasoning; (4) the authors offer a new type of factoid comparison question to test the QA system's ability to extract relevant facts and perform the necessary comparisons [109]. Dataset and leaderboard of HotpotQA are publicly available on https://hotpotqa.github.io/.

NarrativeQA
NarrativeQA [49] is a multi-paragraph machine reading comprehension dataset and a set of tasks. In order to encourage progress on deeper comprehension of language, the authors designed the NarrativeQA dataset. Unlike other datasets in which questions can be solved by selecting answers using superficial information, in NarrativeQA the machine is required to answer questions about a story by reading the entire book or movie script. In order to answer questions successfully, the model needs to understand the underlying narrative rather than relying on shallow pattern matching or salience [49]. NarrativeQA is available on https://github.com/deepmind/narrativeqa.

Qangaroo
Qangaroo [103] is a multi-hop machine reading comprehension dataset. Most reading comprehension methods limit themselves to questions that can be answered using a single sentence, paragraph, or document [103]. Therefore, the authors of Qangaroo proposed a new task and dataset to encourage the development of models for text understanding across multiple documents and to study the limitations of existing methods. In the Qangaroo task, the model is required to seek and combine evidence, effectively performing multi-hop (also known as multi-step) inference [103]. The dataset, papers and leaderboard of Qangaroo are publicly available on http://qangaroo.cs.ucl.ac.uk/index.html.

MultiRC
MultiRC (Multi-Sentence Reading Comprehension) [46] is a MRC dataset in which questions can only be answered by considering information from multiple sentences. The purpose of creating this dataset is to encourage the research community to explore methods that go beyond sophisticated lexical matching. MultiRC consists of about 6,000 questions from more than 800 paragraphs across 7 different areas (primary science, news, travel guides, event stories, etc.) [46]. MultiRC is available on http://cogcomp.org/multirc/. Since May 2019, MultiRC has been part of SuperGLUE, so the authors no longer provide a leaderboard on the above website.

CNN/Daily Mail
In order to solve the problem of the lack of large-scale datasets, Hermann et al. [32] created the CNN/Daily Mail dataset, a large-scale cloze-style MRC dataset built from CNN and Daily Mail news articles. Each article comes with bullet-point summaries, and a question is formed by removing an entity from one of the bullet points; the model must read the article to recover the missing entity. Entities in the corpus are anonymized so that models cannot answer the questions from world knowledge alone [32].

MCTest
In the MCTest [76] dataset, the model is required to answer multiple-choice questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. The stories and questions of MCTest are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task [76]. The data in MCTest was gathered using Amazon Mechanical Turk. Since the stories are fictional, their content is very broad and not limited to a certain field; therefore, MRC models trained on MCTest are helpful for open-domain question answering research [76]. The MCTest dataset and leaderboards are available on https://mattr1.github.io/MCTest/.

CuratedTREC
The CuratedTREC [7] dataset is a curated version of the TREC corpus [62]. The Text REtrieval Conference (TREC) [62] was started in 1992 by the U.S. Department of Defense and the National Institute of Standards and Technology (NIST). Its purpose was to support research on information retrieval systems.

Quasar
Quasar [22] is a MRC dataset designed for open-domain question answering, in which a large background corpus [12] serves for extracting the answers to the questions. The Quasar dataset poses two related sub-tasks for factoid questions: (1) searching for relevant text segments containing the correct answers to the query, and (2) reading the retrieved passages to answer the questions [22]. The dataset and paper of Quasar are available on https://github.com/bdhingra/quasar.

SearchQA
SearchQA [24] is a MRC dataset built with a retrieval system. In order to answer the open-domain questions in SearchQA, the model needs to read text retrieved by a search engine, so it can also be regarded as a machine reading comprehension dataset. The question-answer pairs in the SearchQA dataset are all collected from J!Archive, and the context is retrieved from Google. SearchQA consists of more than 140k QA pairs, with an average of 49.6 retrieved snippets per pair. Each question-answer-context tuple in SearchQA comes with additional metadata, such as the URL of each snippet, which the authors believe will be a valuable resource for future research. The authors performed a manual evaluation on SearchQA and tested two baseline methods, one based on simple word selection and the other on deep learning [24]. The paper states that SearchQA can be obtained at https://github.com/nyu-dl/SearchQA.

SciQ
SciQ [102] is a domain-specific multiple-choice MRC dataset containing 13.7K crowdsourced science questions about physics, chemistry, biology, etc. The context and questions are derived from real 4th and 8th grade exam questions. The questions are in multiple-choice form, with an average of four choices per question. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided. In addition, the authors proposed a new method for generating domain-specific multiple-choice MRC datasets with crowd workers [102]. The SciQ dataset can be downloaded at http://data.allenai.org/sciq/.

CliCR
CliCR [93] is a cloze MRC dataset in the medical domain, containing approximately 100,000 cloze questions about clinical case reports. The authors applied several baseline and state-of-the-art neural models to the CliCR dataset and observed a performance gap (20% F1) between humans and the best neural model [93]. They also analyzed the skills required to answer the questions correctly and how the models' performance changes depending on the applicable skills; they found that reasoning with domain knowledge and object tracking are the most frequently needed skills, and that identifying missing information and spatio-temporal reasoning are the most difficult for machines [93]. The code of the baseline project is publicly available on https://github.com/clips/clicr, where the authors state that the CliCR dataset can be obtained by contacting them via email.

PaperQA (Hong et al.,2018)
PaperQA [37], created by Hong et al., is a MRC dataset containing more than 6,000 human-generated question-answer pairs about academic knowledge. To build PaperQA, crowd workers provided questions based on more than 1,000 abstracts of research papers on deep learning, together with answers that consist of text spans of the related abstracts. The authors collected PaperQA through a four-stage process to acquire QA pairs that require reasoning, and they proposed a semantic segmentation model to solve this task [37]. PaperQA is publicly available on http://bit.ly/PaperQA. A different dataset with the same name, PaperQA [66], was created to measure the machine's ability to understand professional-level scientific papers; it consists of over 80,000 cloze questions from research papers. The authors of PaperQA [66] performed fine-grained linguistic analysis and evaluation to compare PaperQA with conventional question answering (QA) tasks on general literature (e.g., books, news, and Wikipedia), and the results indicated that the PaperQA task is difficult, showing that there is ample room for future research [66]. According to the authors' paper, PaperQA had been published on http://dmis.korea.ac.kr/downloads?id=PaperQA, but when we visited this website, it was not available.

ReviewQA
ReviewQA [29] is a domain-specific MRC dataset about hotel reviews. ReviewQA contains over 500,000 natural questions and 100,000 hotel reviews. The authors hope to improve the relationship understanding ability of the machine reading comprehension model by constructing the ReviewQA dataset. Each question in ReviewQA is related to a set of relationship understanding capabilities that the model is expected to master [29]. The ReviewQA dataset, summary of the tasks and results of models are available on https://github.com/qgrail/ReviewQA/.

SciTail
SciTail [47] is a textual entailment dataset created from science questions and answers. Different from existing datasets, SciTail was created solely from natural sentences that already exist independently "in the wild" rather than sentences authored specifically for the entailment task [47]. The authors generated hypotheses from questions and the relevant answer options, and premises from related web sentences in a large corpus [47]. Baseline and leaderboard of SciTail are available at https://leaderboard.allenai.org/scitail/submissions/public. The SciTail dataset is available at http://data.allenai.org/scitail/.

DROP
DROP [23] is an English MRC dataset that requires Discrete Reasoning Over the content of Paragraphs. The DROP dataset contains 96k questions created by crowd workers. Unlike existing MRC tasks, in DROP the MRC model is required to resolve references in a question and perform discrete operations on them (such as addition, counting or sorting) [23]. These operations require a deeper understanding of paragraph content than was necessary for prior datasets [23]. The dataset of DROP can be downloaded at https://s3-uswest-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip. The leaderboard is available on https://leaderboard.allenai.org/drop.

Facebook CBT
The Children's Book Test (CBT) [35] is a MRC dataset that uses children's books as context. Each sample in the CBT dataset contains 21 consecutive sentences: the first 20 sentences form the context, and a word is deleted from the 21st sentence to form a cloze question. The MRC model is required to identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the question. Different from standard language-modeling tasks, CBT distinguishes the task of predicting syntactic function words from that of predicting lower-frequency words, which carry greater semantic content [35]. The CBT dataset is part of Facebook's bAbI project, which is available on https://research.fb.com/downloads/babi/. The Children's Book Test (CBT) dataset can be downloaded at http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz.
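To make this construction concrete, the sketch below (our own illustrative code, not the authors' generation script) assembles one CBT-style sample: the first 20 of 21 consecutive sentences form the context, one word of the 21st sentence is blanked out as the cloze question, and 10 candidate words drawn from a pool of same-type words seen in the context serve as answer options.

import random

def make_cbt_style_sample(sentences, answer_position, candidate_pool, n_candidates=10, seed=0):
    # sentences: 21 consecutive tokenized sentences; answer_position: index of the word
    # to blank out in the 21st sentence; candidate_pool: same-type words from the context.
    assert len(sentences) == 21
    context = sentences[:20]
    question = list(sentences[20])
    answer = question[answer_position]
    question[answer_position] = "XXXXX"   # the blank the model must fill
    rng = random.Random(seed)
    distractors = sorted(set(candidate_pool) - {answer})
    candidates = rng.sample(distractors, n_candidates - 1) + [answer]
    rng.shuffle(candidates)
    return {"context": context, "question": question, "candidates": candidates, "answer": answer}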

Google MC-AFP
Google MC-AFP [86] is a MRC dataset containing about 2 million examples, generated from the AFP portion of LDC's English Gigaword corpus [28]. The authors of MC-AFP also provided a new method for creating large-scale MRC datasets using paragraph vector models. In MC-AFP, the upper limit of accuracy achieved by human testers is approximately 91%. Among all models tested by the authors, their hybrid neural network architecture achieves the highest accuracy of 83.2%. The remaining gap to the human-performance ceiling leaves enough room for future model improvements [86]. Google MC-AFP is available on https://github.com/google/mcafp.

SQuAD 1.1
The Stanford Question Answering Dataset (SQuAD) [73] is a well-known machine reading comprehension dataset which contains more than 100,000 questions generated by crowd workers, in which the answer to each question is a segment of text from the related paragraph [73]. Since it was released in 2016, SQuAD 1.1 quickly became the most widely used MRC dataset, and it has now been updated to SQuAD 2.0 [74]. In the leaderboards of SQuAD 1.1 and SQuAD 2.0, we have witnessed the birth of a series of state-of-the-art neural models, such as BiDAF [83], BERT [21], RoBERTa [57] and XLNet [110]. The data and leaderboards of SQuAD 1.1 and SQuAD 2.0 are available on https://rajpurkar.github.io/SQuAD-explorer/.

RACE
RACE [51] is a MRC dataset collected from English exams for Chinese students. It contains approximately 28,000 articles and 100,000 questions provided by human experts (English teachers), covering a variety of carefully designed topics to test students' understanding and reasoning ability. Different from existing MRC datasets, the proportion of questions that require reasoning in RACE is much larger than in other MRC datasets, and there is a great gap between the performance of state-of-the-art models (43%) and the best human performance (95%) [51]. The authors hope that this new dataset can serve as a valuable resource for machine comprehension research and evaluation [51]. The dataset of RACE is available on http://www.cs.cmu.edu/~glai1/data/race/. The baseline project is available on https://github.com/qizhex/RACE_AR_baselines.

TriviaQA
TriviaQA [43] is a challenging MRC dataset which contains more than 650k question-answer pairs together with evidence documents. TriviaQA has many advantages over other existing MRC datasets: (1) relatively complex compositional questions; (2) considerable syntactic and lexical variability between the questions and the related passages; (3) more cross-sentence reasoning is required to answer the questions [43].

DREAM
DREAM [92] is a conversational multiple-choice MRC dataset in which questions are asked about multi-turn dialogues [92]. In the DREAM dataset, 84% of the answers are non-extractive, 85% of the questions require reasoning over more than one sentence, and 34% of the questions involve commonsense knowledge. DREAM's authors applied several neural models that use surface information in the text to DREAM and found that they could barely outperform rule-based methods. In addition, the authors also studied the effects of incorporating dialogue structure and different types of general world knowledge into several models on the DREAM dataset, and the experimental results demonstrated the effectiveness of dialogue structure and general world knowledge [92]. DREAM is available on https://dataset.org/dream/.

CoQA
CoQA [75] is a conversational MRC dataset that contains 127K questions with answers, collected from 8K dialogues in 7 different domains. Through an in-depth analysis of CoQA, the authors showed that conversational questions in CoQA have challenging phenomena that are not present in existing MRC datasets, such as coreference and pragmatic reasoning. The authors also evaluated a set of state-of-the-art conversational MRC models on CoQA. The best F1 score achieved by those models is 65.1%, while human performance is 88.8%, indicating that there is plenty of room for future improvement [75]. Dataset and leaderboard of CoQA can be found at https://stanfordnlp.github.io/coqa/.

QuAC
QuAC [16] is a conversational MRC dataset containing about 100K questions from 14K information-seeking QA dialogues. Each dialogue in QuAC involves two crowd workers: (1) one acts as a student who asks a series of questions to learn about a hidden Wikipedia passage, and (2) the other acts as a teacher who answers the questions by providing short excerpts from the passage. The QuAC dataset introduces challenges not present in existing MRC datasets: its questions are often more open-ended, unanswerable, or meaningful only within the dialogue context [16]. The authors also reported the performance of many state-of-the-art models on QuAC, and the best result was 20% (F1) below human performance, suggesting there is ample room for future research [16]. Dataset, baseline and leaderboard of QuAC can be found at http://quac.ai.

ShARC
ShARC [79] is a conversational MRC dataset. Unlike existing conversational MRC datasets, when answering questions in ShARC, the model needs to use background knowledge that is not in the context to arrive at the correct answer. The first question in a ShARC conversation is usually underspecified and does not provide enough information to be answered directly. Therefore, the model needs to take the initiative and ask follow-up questions, and only after it has gathered enough information does it answer the original question [79]. The dataset, paper and leaderboard of ShARC are available on https://sharc-data.github.io.

Open Issues
In recent years, great progress has been made in the field of MRC due to large-scale datasets and effective deep neural network approaches. However, there are still many issues remaining in this field. In this section, we describe these issues in the following aspects:

What needs to be improved?
Nowadays, neural machine reading models have exceeded human performance scores on many MRC datasets. However, the state-of-the-art models are still far from human-level language understanding. What needs to be improved in existing tasks and datasets? We believe that many important aspects have been overlooked and merit additional research. We list several such areas below:

Multi-modal MRC
A fundamental characteristic of human language understanding is multimodality. Psychologists have examined the role of mental imagery skills in story comprehension in fifth graders (10- to 12-year-olds); experiments showed that children with higher mental imagery skills outperformed children with lower mental imagery skills on story comprehension after reading an experimental narrative [9]. Our observation and experience of the world bring us a lot of common sense and world knowledge, and this multi-modal information is extremely important for acquiring such common sense and world knowledge. However, it is currently not clear how our brains store, encode, represent, and process knowledge, which is an important scientific problem in cognitive neuroscience, philosophy, psychology, artificial intelligence and other fields. At present, research in natural language processing mainly focuses on pure textual corpora, but in neuroscience, the research methods are very different. Since the 1990s, cognitive neuroscientists have found that knowledge retrieval can activate widely distributed regions of the cerebral cortex, including the sensory cortex and the motor cortex [45]. More and more cognitive neuroscientists believe that concepts are rooted in modality-specific representations [45]. This is usually called the Grounded Cognition Model [6,68] or the Embodied Cognition Model [45,82,27]. The key idea is that semantic knowledge does not reside in an abstract realm that is totally segregated from perception and action, but instead overlaps with those capacities to some degree [45,5,84]. In that case, can we really make computers understand human language solely by training neural networks on pure textual corpora? Nowadays, although there are already a few multi-modal MRC datasets, the related research is still insufficient. The number of current multi-modal MRC datasets is still small, and these datasets simply put pictures and texts together, lacking detailed annotations and internal connections. How to make better use of multi-modal information is an important research direction for the future.

Commonsense and World Knowledge
Commonsense and world knowledge are major bottlenecks in machine reading comprehension. Among the different kinds of commonsense and world knowledge, two types are considered fundamental for human reasoning and decision making: intuitive psychology and intuitive physics [88]. Although there are some MRC datasets about commonsense, such as CommonSenseQA [94], ReCoRD [113], DREAM [92] and OpenBookQA [58], this field is still at a very early stage. In these datasets, there is no strict division of commonsense types, nor research on commonsense acquisition methods informed by psychology. Understanding how such commonsense knowledge is acquired in the process of human growth may help to reveal the computational model of commonsense.
Observing the world is the first step for us to acquire commonsense and world knowledge. Consider, for example, "this book can't be put into a schoolbag, it's too small" and "this book can't be put into a schoolbag, it's too big". In these two sentences, human beings know from commonsense that the former "it" refers to the schoolbag and the latter "it" refers to the book, but this is not intuitive for computers. Human beings receive a great deal of multi-modal information in daily life, which forms commonsense. When the given information is insufficient, we can fill the gap by predicting, and correct prediction is a core function of our commonsense. In order to gain real understanding ability comparable to that of human beings, machine reading comprehension models need massive data that provide commonsense and world knowledge. Better algorithms are needed to build commonsense corpora, and we need to create multi-modal MRC datasets to help machines acquire commonsense and world knowledge.

Complex Reasoning
Many of the existing MRC datasets are relatively simple. In these datasets, the answers are short, usually a word or a phrase; many of the questions can be answered by understanding a single sentence in the context, and very few datasets require multi-sentence reasoning [14]. This shows that most of the samples in existing MRC datasets lack complex reasoning. In addition, researchers found that after input ablation, many of the answers in existing MRC datasets are still correct [91], which shows that many existing benchmark datasets do not really require the machine reading comprehension model to have reasoning skills. From this perspective, high-quality MRC datasets that require complex reasoning are needed to test the reasoning skills of MRC models.

Robustness
Robustness is one of the key desired properties of a MRC model. Jia and Liang [42] found that existing benchmark datasets are overly lenient towards models that rely on superficial cues [14,56]. They tested whether MRC systems can answer questions whose context contains distracting sentences. In their experiments, a distracting sentence containing words that overlap with the question was added at the end of the context. Such distracting sentences do not mislead human understanding, but the average scores of the sixteen models tested on SQuAD dropped significantly. This shows that these state-of-the-art MRC models still rely too much on superficial cues and that there is still a huge gap between MRC and human-level reading comprehension [42]. How to develop MRC datasets that avoid this situation, force MRC systems to be trained towards true language understanding, and test the robustness of MRC models is currently a big challenge.
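As a deliberately naive illustration of this idea (not the actual adversarial procedure of Jia and Liang, which perturbs the question and has the distractor verified by crowd workers), the sketch below appends a sentence that merely reuses the question's words around an invented wrong answer; all names and the fake answer here are hypothetical.

def add_naive_distractor(context, question, fake_answer="Jeff Dean"):
    # Reuse the question's longer words so the distractor has high lexical overlap
    # with the question while supporting a wrong answer.
    content_words = [w.strip("?.,").lower() for w in question.split() if len(w) > 3]
    distractor = " ".join(content_words + [fake_answer]) + "."
    return context + " " + distractor

context = "Tesla moved to the city of Prague in 1880."
question = "What city did Tesla move to in 1880?"
print(add_naive_distractor(context, question))
# A robust model should still answer "Prague"; a model that relies on word overlap may be misled.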

Interpretability
In existing MRC tasks, the model is only required to give the answer to the question directly, without explaining why it arrived at that answer, so it is very difficult to really understand how the model makes decisions [14,56]. Regardless of whether complete interpretability of these models is absolutely necessary, it is fair to say that a certain degree of insight into the internals of a model can greatly guide the design of neural network architectures in the future. In future MRC datasets, sub-tasks could be set up to let the model give the reasoning process, or the evidence used in reasoning.

Evaluation of the quality of MRC datasets?
There are many evaluation metrics for machine reading comprehension models, such as F1, EM, and accuracy. However, the MRC datasets themselves also need to be evaluated. How should we evaluate the quality of MRC datasets? One possible metric is readability. The classical measures of readability are based on crude approximations of syntactic complexity (using the average sentence length as a proxy) and lexical complexity (the average length in characters or syllables of words in a sentence). One of the most well-known measures along these lines is the Flesch-Kincaid readability index [48], which combines these two measures into a global score [8] (its commonly cited form is given at the end of this subsection). However, recent studies have shown that the readability of a MRC dataset is not directly related to question difficulty [90]: the experimental results suggest that as the text complexity of a dataset decreases, the performance of MRC models does not improve to the same extent, and the correlation is quite small [8]. Another possible metric is the frequency of the different prerequisite skills needed in a MRC dataset. Sugawara et al. defined 10 prerequisite skills [90], including object tracking, mathematical reasoning, coreference resolution, analogy, causal relations, etc. However, the definition of prerequisite skills is often arbitrary and changeable, and different definitions can be drawn from different perspectives [89][90][91]. Moreover, at present, the frequency of prerequisite skills is still counted manually, and there is no automated statistical method. In summary, how to evaluate the quality of MRC datasets is still an unsolved problem.
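For reference, the commonly cited grade-level form of the Flesch-Kincaid index combines exactly these two proxies, average sentence length in words and average word length in syllables; we reproduce it here only as an illustration of this family of readability measures:

\[
\mathrm{FKGL} = 0.39 \times \frac{\text{total words}}{\text{total sentences}} + 11.8 \times \frac{\text{total syllables}}{\text{total words}} - 15.59
\]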

Have we understood understanding?

What is understanding?
The word "understanding" has been used by human beings for thousands of years [38,41]. But, what is the exact meaning of understanding? What are the specific neural processes of understanding? Many researchers attempted to give definitions of understanding. For example, Hough and Gluck [38] conducted an extensive survey of literature about understanding. They summarized: "In an attempt to summarize the preceding review, we propose the following general definition for the process and outcome of understanding: The acquisition, organization, and appropriate use of knowledge to produce a response directed towards a goal, when that action is taken with awareness of its perceived purpose." But understanding is too natural and complex for us as it is difficult to define, especially from different perspectives such as philosophy, psychology, pedagogy, neuroscience, computer science, etc. In the field of NLP, we still lack a comprehensive definition of understanding of language and also lack of specific metrics to evaluate the real understanding capabilities of MRC models.

Understanding from the perspective of cognitive neuroscience
In recent years, great progress has been made in the cognitive neuroscience of language. Thanks to advanced neuroimaging technologies such as PET and fMRI, contemporary cognitive neuroscientists have been able to study and describe large-scale cortical networks related to language in various ways, and they have reported many interesting findings. Take the understanding of object nouns as an example: how are these object nouns represented in the brain? As David Kemmerer summarized in his book [45]: "From roughly the 1970s through the 1990s, the dominant theory of conceptual knowledge was the Amodal Symbolic Model. It emerged from earlier developments in logic, formal linguistics, and computer science, and its central claim was that concepts, including word meanings, consist entirely of abstract symbols that are represented and processed in an autonomous semantic system that is completely separate from the modality-specific systems for perception and action [25,85,70]. Since the 1990s, the Grounded Cognition Model has been attracting increasing interest. The key idea is that semantic knowledge does not reside in an abstract realm that is totally segregated from perception and action, but instead overlaps with those capacities to some degree. To return to the banana example mentioned above, understanding this object noun is assumed to involve activating modality-specific records in long-term memory that capture generalizations about how bananas look, how they taste, how they feel in one's hands, how they are manipulated, etc. This theory maintains that conceptual processing amounts to recapitulating modality-specific states, albeit in a manner that draws mainly on high-level rather than low-level components of the perceptual and motor systems [45]." In addition, a recent study [100] published in Cell reveals that the two hypotheses mentioned above are both right. The authors studied the brain basis of color knowledge in sighted individuals and in congenitally blind individuals whose color knowledge can only be obtained through language descriptions. Their experiments show that congenitally blind individuals can obtain knowledge representations similar to those of sighted people through language, without any sensory experience. More importantly, they also found that there are two different coding systems in the brains of sighted individuals: one is directly related to the senses, located in the visual color-processing brain area; the other is located on the dorsal side of the left anterior temporal lobe, the same brain area that in congenitally blind individuals stores knowledge obtained only through language [100]. According to their study, there are (at least) two forms of object knowledge representation in the human brain: sensory-derived and cognitively-derived knowledge, supported by different brain systems [100]. This also shows that human language is not only used to express symbols for communication, but also to encode conceptual knowledge.
So, can we get more effective MRC models by training on multi-modal corpora? Probably. Due to the complexity of the human brain, cognitive neuroscientists are still unable to fully understand the details of natural language understanding, but these cognitive neuroscience studies have brought a lot of inspiration to the NLP community. We could make full use of existing research results in cognitive neuroscience to design novel MRC systems.

Conclusions
We conducted a comprehensive survey of recent efforts on the tasks, evaluation metrics and benchmark datasets of machine reading comprehension (MRC). We discussed the definition and taxonomy of MRC tasks and proposed a new classification method for MRC tasks. We introduced the computation of different MRC evaluation metrics and analyzed their usage in each type of MRC task. We also introduced the attributes and characteristics of MRC datasets, with 47 MRC datasets described in detail. Finally, we discussed the open issues for future MRC research and argued that high-quality multi-modal MRC datasets and the research findings of cognitive neuroscience may help us find better ways to construct more challenging datasets and develop related MRC algorithms to achieve the ultimate goal of human-level machine reading comprehension.
To facilitate the MRC community, we have published the above data on the companion website (https://mrc-datasets.github.io/), from where MRC researchers could directly access the MRC datasets, papers, baseline projects and browse the leaderboards.