1. Introduction
Machine reading comprehension (MRC) [1,2,3,4] has attracted increasing interest in recent years. It evaluates how well a machine understands human language by asking the machine to answer questions according to a given passage. In particular, multi-choice machine reading comprehension [2] is one of the most difficult MRC tasks since, compared to span-selection tasks [1], it usually requires a range of natural language processing skills such as word matching, syntactic structure matching, logical reasoning and summarization. In other words, addressing multi-choice reading comprehension requires diverse information because of the abundant variety of questions, options and passages. Therefore, researchers have proposed a variety of methods to provide richer information for different reading comprehension samples. Among them, pre-trained language models such as BERT [5] and RoBERTa [6] have become an important trend.
A pre-trained language model usually uses several self-supervised pre-training tasks to train a deep transformer-based network [7] and learn rich information from large-scale corpora. This has brought great success to MRC, achieving SOTA performance in multiple tasks simply by fine-tuning. Jawahar et al. [8] conducted a series of experiments on BERT and found that its intermediate layers learn a rich hierarchy of linguistic information: BERT learns phrase-level and surface information in the low layers, syntactic features in the middle layers and semantic features at the top layers. In other words, BERT naturally contains different information at different layers of the network.
However, existing methods generally use only the top-layer representation of pre-trained models to address MRC tasks [3,9,10]. Therefore, while the pre-trained models encode different information at different layers, current methods cannot directly exploit this diverse information for the various questions and passages. How to adaptively select the required information in a pre-trained model for an MRC task remains an open problem.
An intuitive idea to address the above problem is to build multiple decision modules, each using the output of a different layer of the pre-trained model, and then to synthesize the multiple decisions into the final solution. In this way, the representations at intermediate layers can be exploited directly, which helps different samples utilize the required information from different layers. However, another problem arises. Since each decision module receives a similar supervisory signal, the output representations at those layers tend to become similar during fine-tuning. This damages the information diversity in the different layers of the original pre-trained model. An effective method should make use of the multi-layer representations without damaging this information diversity.
This paper therefore proposes a simple but effective multi-decision based transformer model with learning rate decaying to address the above problem. For a pre-trained model with L stacked transformer layers, we divide them evenly into N blocks, so that each block consists of L/N transformer layers. We then use the output of each block for decisions. For example, if L = 12 and N = 3, the representations of the 4th layer (the output of the 1st block), the 8th layer (the output of the 2nd block) and the 12th layer (the output of the 3rd block) are used. Each block answers the input question separately. The final answer is then chosen in one of two manners: “performance first” and “speed first”. When updating the parameters, in order to avoid damaging the information diversity, we weaken the influence of the supervisory signals on lower-layer parameters. This is achieved by a learning rate decaying method which gradually reduces the learning rate from the top to the bottom of the transformer stack. Experimental results demonstrate that our model is able to answer questions by extracting the required information from intermediate layers and to speed up the inference procedure while retaining considerable accuracy. The source code is available at https://github.com/bestbzw/Multi-Decision-Transformer.
To sum up, the main contributions of this paper are listed as follows.
We propose a simple but effective multi-decision based transformer model that adaptively selects information at different layers to answer different reading comprehension questions. To the best of our knowledge, this is the first work to explicitly make use of the information at different layers of pre-trained language models for MRC.
We propose a learning rate decaying method to maintain the information diversity in different layers of pre-trained models, preventing it from being damaged by the multiple similar supervisory signals during fine-tuning.
We conduct a detailed analysis to show which types of reading comprehension question can be addressed by each block. Moreover, the experimental results on five public datasets demonstrate that our model increases the inference speed without sacrificing accuracy.
2. Related Work
Pre-trained Language Model. Early research on pre-training neural representations of language mainly focused on pre-trained context-free word embeddings such as Word2Vec [11] and GloVe [12]. Since most Natural Language Processing (NLP) tasks are beyond the word level, researchers proposed pre-training sequential encoders, such as LSTMs, to obtain contextual word embeddings [13].
Recently, very deep pre-trained language models (PLMs) [5,6] have improved the performance on multiple natural language processing benchmarks [14], which is credited to their strong prior knowledge and deep network structure. OpenAI GPT [15] proposed the first transformer-based PLM, trained with a left-to-right (auto-regressive) language modeling objective. BERT [5] improves on GPT by using a Masked Language Model (MLM) pre-training objective to learn bidirectional encoder representations, and a Next Sentence Prediction (NSP) objective to learn the relation between sentence pairs. BERT encodes diverse information in different layers. Jawahar et al. [8] conducted a series of experiments to unpack the elements of English language structure learned by BERT, finding that BERT captures structural information about language: it learns phrase-level and surface information in the low layers, syntactic features in the middle layers and semantic features at the top layers. Moreover, Guarasci et al. [16] also find that the middle layers of BERT, rather than the top layer, perform best on syntactic tasks.
Following GPT and BERT, many transformer-based PLMs have been proposed. These models use more data [6] and parameters [17], modified word masking methods [6,18] and modified transformer structures [19,20] to significantly improve the SOTA performance on multiple NLP datasets.
Machine Reading Comprehension. Early Machine Reading Comprehension (MRC) systems use rule-based methods or feature engineering to extract answers from passages. Deep Read [21] is the first reading comprehension system. Given a story and a question, it uses sentence matching (bag-of-words) techniques and some linguistic processing methods, such as stemming and pronoun resolution, to select a sentence from the story as the answer to the question. Quarc [22] and Cqarc [23] are two rule-based reading comprehension systems which design hand-crafted heuristic rules to look for semantic clues and answer the given questions. These methods depend on human-defined features and are difficult to generalize to new datasets [24].
The rapid growth of MRC is largely thanks to the development of deep learning [25,26] and the availability of large-scale datasets [1,2]. The attentive reader [26] is the first neural MRC model based on LSTMs and the attention mechanism, and the same work also presents two cloze-style corpora.
The attentive reader has since been extended and enhanced in several ways. For the embedding module, BIDAF [27] introduces char-level embeddings into the model, which alleviates the out-of-vocabulary (OOV) problem. DrQA [28] improves the embedding representations of both question and passage with POS embeddings, NER embeddings and a binary exact-match feature. For the encoder, QANet [29] replaces the LSTM encoder with a convolutional network to accelerate the training and inference processes. To better fuse the question and passage, BIDAF [27] proposes a bi-directional attention flow mechanism and GA [30] presents a static multi-hop attention architecture, both of which obtain a fine-grained question-aware passage representation. DFN [31] proposes a sample-specific network architecture with a dynamic multi-strategy attention process, lending flexibility to adapt to various question types that require different comprehension skills. Hierarchical Attention Flow [32] models the interactions among passages, questions and candidate options to adequately leverage candidate options for the multi-choice reading comprehension task. Besides these, many works [27,29,33] use an additional encoder to capture the query-aware contextual representation of the passage after attention.
More recently, BERT-like models [5,6,19] have been applied to MRC tasks and some even surpass human performance on multiple datasets [1,2]. Most models take BERT as a strong encoder and propose task-specific modules on top of it. DCMN [9] and DUMA [34] add an interaction attention layer to model the relationships among passage, question and options. NumNet [35] and RAIN [36] introduce number relations into the BERT-based MRC network to address arithmetic questions. However, most of these studies treat the PLM as a black box and only use the output hidden states at the top layer [3,9,10,34], wasting the rich information at other layers. As far as we know, there is still no work that leverages the different levels of representation in a PLM to address MRC tasks.
Multi-Decision Network. Most previous work utilizes diverse information by building multiple decision modules. These modules are either located in the same layer of the neural network (parallel) [37], or located in different layers (serial) [31,38,39,40].
All these models are trained from a random initialization state, so they take great pains to differentiate the input representation of each decision module. This goal is usually achieved by assigning different training samples to each decision module. Some works [31,38,39] employ a sample-specific network architecture with reinforcement learning to spontaneously determine which decision module to train. This method is sometimes unstable, and each sample is assigned to only one decision module, which reduces the size of the training corpus for each module. Zhou et al. [40] propose a soft alternative with a weighted loss function to adjust the distribution of training samples over the different decision modules.
BERT-based multi-decision models are the work most similar to ours. They add classifiers to the output of each transformer layer, aiming to reduce the inference time of BERT on sentence classification tasks. FastBERT [41] adopts a self-distillation mechanism that distills the predicted probability distribution of the top-layer classifier (as a teacher) to the classifiers at the lower layers (as students). The students simply imitate the teacher and neglect learning the cases that only the lower layers could solve; this method damages the information diversity of the original BERT model. In contrast, DeeBERT [42] completely stops the gradient propagation from the lower-layer classifiers to the transformer network. This preserves the information diversity but sacrifices performance at the lower layers.
3. Models
The overall structure of our model is illustrated in Figure 1. It consists of two parts: a bidirectional transformer-based encoder and multiple decision modules. The encoder is composed of a stack of L identical transformer layers. We divide the encoder equally into N blocks from bottom to top, so that each block contains L/N stacked transformer layers (the output of the nth block is the output of the (nL/N)-th layer). A decision module is built for each block by making use of the output of that block.
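For illustration, the division of layers into blocks can be sketched as follows (a minimal Python sketch; the function name and the 1-indexed convention are ours, not the released code):

```python
# Illustrative sketch: with L transformer layers divided evenly into N blocks,
# the n-th block (1-indexed) ends at layer n * L / N, so its output is the
# output of that layer.

def block_output_layers(L: int, N: int) -> list[int]:
    """Return the layer index whose output each of the N blocks exposes."""
    assert L % N == 0, "L must be divisible by N for an even split"
    step = L // N
    return [n * step for n in range(1, N + 1)]

print(block_output_layers(12, 3))  # the BERT_base example: layers 4, 8, 12
```

With L = 12 and N = 3 this reproduces the 4th/8th/12th-layer choice described in the introduction.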
Our model can also be easily built upon an existing pre-trained model as its backbone. For example, a BERT_base model with 12 transformer layers can be used as a backbone, where the outputs of three layers (the 4th, 8th and final layers) are chosen for building separate decision modules. Details of the model architecture and the training methods are given in the rest of this section.
3.1. Model Architecture
The input of our model is a token sequence S. For different tasks, the input sequence and the decision modules are slightly different. We first introduce the basic multi-decision-based transformer architecture in this section and then instantiate it for specific tasks in Section 3.3.
3.1.1. Embedding
First, our model transforms the symbolic input sequence S into a distributed representation E by the embedding function. The representation E is the summation of the token embeddings, the segment embeddings and the position embeddings.
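The embedding step can be sketched as follows (a toy numpy sketch, assuming illustrative table shapes and names; these are not the paper's code):

```python
import numpy as np

# Minimal sketch of the embedding step: E is the elementwise sum of token,
# segment and position embeddings. The toy vocabulary size, maximum length
# and hidden size below are illustrative.
rng = np.random.default_rng(0)
vocab_size, max_len, d = 100, 16, 8
tok_table = rng.normal(size=(vocab_size, d))
seg_table = rng.normal(size=(2, d))        # segment A / segment B
pos_table = rng.normal(size=(max_len, d))

def embed(token_ids, segment_ids):
    """Sum the three embedding lookups position by position."""
    positions = np.arange(len(token_ids))
    return tok_table[token_ids] + seg_table[segment_ids] + pos_table[positions]

E = embed(np.array([1, 5, 7]), np.array([0, 0, 1]))
print(E.shape)  # (3, 8): sequence length x hidden size
```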
3.1.2. Backbone
Next, the embedding sequence is encoded by the L transformer layers sequentially. Each layer is composed of a multi-head self-attention sublayer and a position-wise, two-layer fully connected feed-forward sublayer. The input of a layer is the output of the previous layer. Let H^i be the output of the i-th layer; its input is then the output of the (i−1)-th layer. The i-th layer is therefore defined as

H^i = Transformer_i(H^{i−1}),

where H^i ∈ R^{T×d} is the output of the i-th layer, T is the length of the input sequence and d is the hidden size of the transformer. For the first layer (bottom layer), the input H^0 is the embedding representation E. The specific implementation of Transformer_i is described in [7].
We denote B^n = H^{nL/N} as the output representation of the nth block, with B^0 = E. To conveniently describe the model, the encoder for each block is defined as

B^n = Block_n(B^{n−1}; θ_n),

where the function Block_n is a stack of L/N transformer layers and θ_n denotes the parameters of those transformer layers.
Following the encoding output of each block is its decision module. The score of the candidate answers in the nth block is calculated by the decision module,

P^n = Decision_n(B^n; φ_n),

where P^n is the probability distribution over the candidate answer set C and φ_n is the parameters of the decision function Decision_n. The specific implementation of the decision module is expounded in Section 3.3.

The predicted answer of this block is then determined by the probability,

a^n = argmax_{c∈C} P^n(c),

where P^n(c) indicates the predicted probability of the candidate c.
3.1.3. Decision Module
With the above steps, each decision module outputs a probability distribution and an answer. When selecting the final answer from these intermediate prediction results, we apply two manners for performance improvement and inference acceleration, respectively.
Performance first. For the nth block, the probability of its predicted answer a^n is denoted as p^n = P^n(a^n). The answer whose p^n is highest across all blocks is selected as the final answer.
Speed first. If a decision module is confident enough about its predicted answer, it is not necessary to execute the following inference steps, which also reduces inference time. Inspired by [41], we use a normalized entropy as the uncertainty measure to decide whether to terminate the current inference. The uncertainty of the nth decision module is defined as

U^n = −(1/log|C|) Σ_{c∈C} P^n(c) log P^n(c),

where |C| is the number of candidate answers. During inference, if U^n < τ, then the inference is stopped and a^n is selected as the final answer; otherwise the model moves on to the next block. τ is a threshold that controls the inference speed. If the uncertainties of all decision modules are greater than the threshold, we adopt the “performance first” method to determine the final answer.
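The early-exit procedure can be sketched as follows (our reading of the normalized-entropy uncertainty; the threshold value and function names are illustrative):

```python
import numpy as np

# Sketch of the "speed first" early exit. The uncertainty is the entropy of
# the block's distribution normalized by log|C|, so it lies in [0, 1].

def uncertainty(p: np.ndarray) -> float:
    eps = 1e-12  # guard against log(0)
    return float(-(p * np.log(p + eps)).sum() / np.log(len(p)))

def speed_first(block_probs, threshold):
    """Return (block_index, answer), exiting at the first confident block."""
    for n, p in enumerate(block_probs):
        if uncertainty(p) < threshold:
            return n, int(np.argmax(p))  # confident enough: stop here
    # All blocks too uncertain: fall back to the highest-probability answer.
    confs = [float(p.max()) for p in block_probs]
    n = int(np.argmax(confs))
    return n, int(np.argmax(block_probs[n]))

probs = [np.array([0.40, 0.35, 0.25]),  # high entropy -> keep going
         np.array([0.90, 0.05, 0.05])]  # low entropy  -> exit here
print(speed_first(probs, threshold=0.5))
```

Raising the threshold makes early exits more frequent, trading accuracy for speed, which matches the FLOPs analysis in Section 4.4.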
3.2. Training with Learning Rate Decaying
3.2.1. Loss Function
There is no explicit annotation or rule to indicate which layer should be selected for a sample; therefore, we directly train each decision module on the whole dataset to avoid designing annotation rules.
For each decision module, the loss L^n is the negative log-likelihood of the ground-truth answer a*,

L^n = − Σ_{(S, a*) ∈ D} log P^n(a*),

where D is the training set. The final loss is the summation of the losses over all blocks: L = Σ_{n=1}^{N} L^n.
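The summed per-block loss can be sketched for a single example as follows (toy probabilities; not real model output):

```python
import numpy as np

# Sketch of the training loss for one example: each block contributes the
# negative log-likelihood of the gold answer under its own distribution,
# and the final loss is the sum over blocks.

def total_loss(block_probs, gold):
    """Sum of per-block negative log-likelihoods of the gold answer."""
    return float(sum(-np.log(p[gold]) for p in block_probs))

block_probs = [np.array([0.5, 0.5]),   # block 1 is unsure
               np.array([0.8, 0.2])]   # block 2 favors the gold answer
loss = total_loss(block_probs, gold=0)
print(round(loss, 4))  # -log(0.5) - log(0.8)
```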
3.2.2. Learning Rate Decaying
While the above method exploits the representations at different layers, the decision modules in the lower blocks have the same supervisory signal as the top block, causing the output representations of the blocks to become similar. This damages the information diversity of the original pre-trained language model. We demonstrate this phenomenon in Section 4.7 by comparing the information similarity between the general BERT model and a BERT model fine-tuned with multiple supervisory signals.
Interrupting the gradient propagation from the intermediate-block decision modules to the transformer network is an effective way to maintain the information diversity (equivalent to setting the learning rate to 0); however, it abandons fine-tuning the transformer to fit the lower-block decision modules, causing their performance to drop. We hope to find a balance between adjusting the lower-block parameters and maintaining the original rich information. An intuitive method is to let the learning rate of the lower blocks be smaller than that of the top block and greater than zero.
We propose a learning rate decaying method to update the parameters in our model. Let η be the initial learning rate for the parameters in the Nth block (top block). As shown in Figure 2, for the nth block of the transformer, the learning rate is set to η/α^{N−n}, where α is a positive number greater than 1, called the decay factor. Moreover, the learning rates for all decision modules are η, without any decay. In this way, we build a multi-decision-based transformer that utilizes the rich and diverse information in pre-trained models without damaging their information diversity.
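The per-block learning rate schedule can be sketched as follows (one plausible reading of the decaying rule, with our own notation; in a PyTorch implementation these rates would typically be assigned via per-parameter-group `lr` entries in the optimizer):

```python
# Sketch of learning rate decaying: the top (N-th) block keeps the initial
# rate eta, and each block below it is divided by the decay factor alpha once
# more, i.e. block n gets eta / alpha**(N - n). Names are ours.

def block_learning_rates(eta: float, alpha: float, N: int) -> list[float]:
    """Learning rate for blocks 1..N, smallest at the bottom."""
    return [eta / alpha ** (N - n) for n in range(1, N + 1)]

# With eta = 2e-5, alpha = 2 and N = 3 blocks:
print(block_learning_rates(2e-5, 2.0, 3))  # [5e-06, 1e-05, 2e-05]
```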
3.3. For Specific Tasks
In this section, we introduce the specific model implementation for multi-choice MRC tasks, namely RACE [2], Dream [43] and ReCO [3]. For each task, we simply plug the task-specific input and decision modules into our multi-decision-based transformer architecture.
RACE & Dream. In these tasks, each sample contains a text passage P, a question sentence Q and k candidate answer sequences. We concatenate each candidate answer with the corresponding question and passage to form one input sequence.
At each block, we extract the encoded “[CLS]” representations of all k sequences and concatenate them into a matrix M^n ∈ R^{k×d} to represent the k candidates. Then M^n is fed into the decision module, a Feed-Forward Network (FFN), to predict the probability distribution,

P^n = softmax(σ(M^n W_1 + b_1) W_2 + b_2),

where σ is the tanh function, W_1, W_2, b_1 and b_2 are trainable parameters, and the softmax normalizes over the k candidates.
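The FFN decision module can be sketched as follows (an illustrative numpy sketch; the weight shapes and the exact layout of the two layers are our assumption, not the paper's released code):

```python
import numpy as np

# Illustrative FFN decision module: the k encoded "[CLS]" vectors are scored
# by a two-layer network with a tanh nonlinearity, then the scores are
# softmax-normalized over the k candidates.
rng = np.random.default_rng(0)
k, d, h = 4, 8, 8  # candidates, hidden size, FFN width (toy values)
M = rng.normal(size=(k, d))          # one encoded [CLS] vector per candidate
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)), np.zeros(1)

scores = (np.tanh(M @ W1 + b1) @ W2 + b2).ravel()  # one score per candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()                               # softmax over candidates
print(probs.shape)  # (4,): a distribution over the k candidates
```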
ReCO. ReCO is also a multi-choice task with a passage P, a question Q and three candidate answers. Compared with RACE, the candidate answers in ReCO are shorter. Therefore, we directly concatenate all candidates with the question and passage as a single input sequence. At each block, the three candidate representations are passed through an FFN to obtain the predicted probability distribution.
4. Experiments
4.1. Implementation Details
We use the Adam [44] algorithm with a batch size of 48 for optimization, and the initial learning rate is set to 2 × 10^{−5}. We use a linear warmup for the first 10% of steps followed by a linear decay to 0 for all parameters. The number of training epochs is set to 20 with early stopping. The model is implemented with PyTorch, and the pre-trained models are from Huggingface’s transformers library [45]. To alleviate the problem of gradient explosion, the gradients are clipped to the range [−2.0, 2.0]. The decay factor α is selected from {2, 3, 5} according to performance on the development dataset. Finally, we conduct comprehensive experiments on BERT_base [5], RoBERTa_base, RoBERTa_large [6] and DUMA [34]. We run each experiment three times with different random seeds. For the “performance first” manner, we report the average accuracy and standard error, and for the “speed first” manner, we report the best results.
4.2. Datasets
We first evaluate our model on three public multi-choice MRC datasets.
RACE [2] is an MRC dataset collected from English exams for middle and high school students. The questions and candidates are generated by experts to evaluate human reading comprehension ability. It contains two subsets with 98,000 questions in total, and includes four types of answers.
ReCO [3] is a recently released large-scale Chinese MRC dataset on opinion. The passages and questions are collected from multiple resources. It contains 300,000 samples and includes three types of answers (yes/no/uncertain).
Dream [43] is an English dialogue-based multiple-choice MRC dataset. It is collected from examinations designed by human experts to evaluate the comprehension level of Chinese learners of English, where each passage is an in-depth multi-turn, multi-party dialogue. It contains 10,000 questions and 6000 dialogues.
Moreover, our multi-layer structure and learning rate decaying method are not only effective for MRC tasks; we therefore also evaluate our method on two sentence classification datasets for detailed analysis.
Ag.news is an English sentence classification dataset from [46]. It contains around 120,000 training samples.
For these classification tasks, we follow the same settings as described in [5] to construct the input sequence, and each decision module is a two-layer FFN. These models are trained with the method described in Section 3.2.
4.3. Evaluation Metrics
For both the MRC tasks and the classification tasks, we use accuracy (acc) as the evaluation criterion,

acc = M/N,

where M indicates the number of examples correctly answered by the model, and N denotes the total number of examples in the whole evaluation set.
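The metric is a simple ratio, sketched below with toy predictions and gold labels (illustrative values only):

```python
# Trivial sketch of the accuracy metric: the fraction of examples whose
# predicted answer matches the gold answer.

def accuracy(preds, golds):
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 3 of 4 correct -> 0.75
```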
4.4. Main Results
We first compare our “performance first” model with common BERT-like models on five test datasets. For easy comparison, three existing pre-trained models, BERT_base, RoBERTa_base and RoBERTa_large, are used as baselines and as the backbones of our models. Table 1 shows the experimental results of our model and the baselines on the three MRC datasets, and Table 2 shows the results on the two classification datasets. As we can see, on almost all datasets, our model outperforms the baselines by 0.1–2.7% accuracy. These results show that different questions can be handled by the representations in different blocks instead of the top block only; the representations in the lower blocks can be as effective as the top layer. We also compare our method with DUMA [34], a representative MRC model with a complex decision module (for a fair comparison, both our model and DUMA are implemented on the BERT_base model). Our model also works when combined with DUMA, indicating that the intermediate-layer information is useful for different decision modules. These results again show the effectiveness of our method.
We then evaluate the accuracy and speed of our “speed first” model. Floating-point operations (FLOPs) (https://github.com/Lyken17/pytorch-OpCounter (accessed on 13 January 2021)) are used to measure the computational complexity. Table 3 and Table 4 show the accuracy and FLOPs. Increasing the threshold speeds up the inference process, since inference is more likely to be terminated at lower layers. As the tables show, compared with the BERT_base model, our model not only accelerates the inference process but also achieves better accuracy on all datasets in most cases. For the MRC tasks, our model is faster than the BERT_base model by 1.02–1.12 times on RACE, 1.11–1.39 times on ReCO and 1.00–1.02 times on Dream. For the classification tasks, our model accelerates inference by 1.29–1.81 times on the Book Review dataset and 2.06–2.82 times on the Ag.news dataset. Moreover, with a properly chosen threshold, our speed first model also achieves accuracy comparable to our performance first model at a higher inference speed.
4.5. Information Type Analysis
In this section, we explicitly analyze which types of reading comprehension question can be addressed by each layer. Generally speaking, answering different questions requires different information. We collect the results predicted by a 3-block model on the ReCO development dataset and split them into three groups according to the source of the final answer (i.e., the layer from which it was selected). Then we randomly select 100 samples from each group and ask volunteers to label the information type required by each sample. The information types are mainly taken from ReCO [3] and the findings of Jawahar et al. [8]. ReCO presents seven information types, such as “Lexical Knowledge”, “Syntactic Knowledge” and “Specific Knowledge”; Jawahar et al. present four information types, such as “Syntactic Features” and “Semantic Features”. Considering that overly fine-grained categories are difficult for volunteers to distinguish, we define four information types (“Lexical Knowledge”, “Syntactic Knowledge”, “Semantic Knowledge” and “Specific Knowledge”) in this paper. In order to ensure that the labeled categories are accurate and not affected by other information, we randomly shuffle all samples and only provide the volunteers with the passages, questions and options.
Table 5 shows the frequency of each required information type per layer. As we can see, more than half of the samples in the first block require lexical knowledge such as synonym matching. In the second block, the model prefers answering questions that require syntactic knowledge such as sentence structure information. The third block is devoted to reasoning tasks such as logical reasoning and causal inference, which generally need deep semantic knowledge. These results show that the low and middle layers are good at solving MRC problems that need shallow information, while the top layer is skilled at deep reasoning tasks. Moreover, some questions require specific external knowledge, and the third block performs best on them. One possible reason is that the third block contains more background knowledge due to the deeper network.
Some examples requiring different information are shown in
Figure 3. The first item shows two passage-question cases requiring lexical knowledge. These two questions are easily answered by matching the similar words/phrases in the question and passage (“before meals” vs. “empty stomach” and “can purchase” vs. “purchase”). As we can see, our model tends to select the first block to answer this kind of question.
Questions requiring syntactic knowledge are shown in the second item. In this kind of problem, the syntactic structures of the question and the evidence are usually different (the “evidence” being the sentences in the passage most needed to answer the question). For instance, in the first example, the question uses the active voice while the evidence “Fever… be caused not by hernia” uses the passive voice. To answer this question, the model needs to find the correct subject–predicate–object tuples in the question and evidence.
The third item shows two examples that require semantic knowledge. This kind of example usually requires understanding the semantic information of the passage first and then answering the question with logical reasoning. Taking the first question in the third item as an example, the model first understands that “detention” is a worse outcome, then reasons that “the defendant must participate in the cross-examination after receiving a court summons”, and finally gives the answer “no”.
The last item shows cases requiring external knowledge, such as “fish is a kind of seafood”. There are no specific rules for this kind of problem; generally speaking, the more knowledge the model learns from the training corpus, the better its performance will be.
Moreover, we also show the predicted results of a common BERT baseline in Figure 3. It can be seen that our model outperforms the baseline on questions requiring semantic knowledge and external knowledge. This result may seem to contradict the above analysis, but it is reasonable: the baseline learns to answer all questions, requiring different kinds of knowledge, at the same layer, which leads the model to learn shortcuts (tricks that use partial evidence to produce answers to the expected comprehension challenges, e.g., co-reference resolution) [47]. Our method exploits a multi-decision structure and a learning rate decaying method that lets the top block concentrate on learning semantic and external knowledge, which explains why our approach is effective on these questions.
4.6. Ablation Study
To analyze the influence of the number of blocks and decay factors in a detailed manner, we conduct several comprehensive ablation experiments.
Table 6 shows the accuracy of our models (w/BERT_base) with different numbers of blocks on three datasets. The 1-block model uses only the top-layer representation of the BERT_base model without learning rate decaying, which is the same as the general BERT_base baseline. As we can see, the performance does not increase monotonically with the number of blocks. The model performs best when the number of blocks is 2 or 3, and as the number continues to increase, the accuracy falls. When the number of blocks increases to 12 (using the outputs of all layers), the performance is the worst; it is even worse than the 1-block model on the RACE dataset.
This is because the representations of adjacent layers are similar. As shown in Figure 4, for any two layers, the smaller the distance between them, the more similar their representations tend to be. In other words, for two layers at a close distance, the sets of samples they can correctly solve highly overlap. Simply increasing the number of blocks therefore not only fails to address more samples but also introduces additional noise. In addition, using too many blocks may cause the model to overfit. It is therefore very important to select the number of blocks properly. According to Table 6 and Figure 4, we consider setting the number of blocks to three to be universally effective.
Next, we study the effect of different decay factors in Figure 5. We can see that the model without learning rate decaying (α = 1) performs worst: its third-block (top-layer) accuracy and final accuracy are both lower than those of the baseline which only uses the top-layer representation. This shows that merely adding decision modules at different blocks not only fails to improve the final performance but also hurts the performance of the top layer. In contrast, all our models (α > 1) perform well, and the top-layer performance even grows slightly. This illustrates that our learning rate decaying method is very effective for maintaining the top-layer performance and improving the final accuracy.
Moreover, the performance of the first block (bottom block) falls sharply as the decay factor α increases. The most likely reason is that the learning rate of the first block is the smallest: when the final performance on the development dataset reaches its peak, the first block is still underfitting. As α increases, the learning rate of the first block falls further and the degree of underfitting rises, so the corresponding accuracy goes down. This performance drop does affect the final accuracy, but it does not mean that the first block is useless: when we remove the first-block decision module during testing, the accuracy decreases by 0.2%.
Convergence analysis of each layer. Due to the decaying learning rates in our model, the convergence speed of each decision module differs. Figure 6 shows the convergence of each block's accuracy. As the figure shows, the accuracy of the third block reaches its peak first, then the second block, and the first block converges last. In addition, the accuracies of the third and second blocks remain at a fixed level after the seventh epoch even though the bottom-block parameters are still updating to fit the bottom-block decision module. This is because the lower-block parameters update slowly, so the upper-block parameters can be adjusted in time to maintain performance. This performance maintenance ability is important for achieving good final accuracy.
4.7. More Analysis
Information diversity analysis. We calculate the similarity between different blocks to show that our model successfully maintains the information diversity of the pre-trained models. The overall similarity is the average cosine similarity between the vector representations of every two different blocks on the development datasets. Generally speaking, lower similarity indicates that the vectors contain more distinct information, that is, the model has higher information diversity. As shown in Figure 7, the model without learning rate decaying unsurprisingly has the highest similarity on all datasets. The similarity of our model is between those of the other two models on the ReCO dataset and lower than that of the baseline model on the RACE dataset, which indicates that our learning rate decaying method is indeed effective for maintaining information diversity. This experiment shows that the information in each block is different, which indicates that our model is able to address different reading comprehension samples in different blocks.