A Multiple-Choice Machine Reading Comprehension Model with Multi-Granularity Semantic Reasoning

Abstract: To address the weak semantic reasoning of models in multiple-choice Chinese machine reading comprehension (MRC), this paper proposes an MRC model that incorporates multi-granularity semantic reasoning. First, we encode articles, questions and candidates to extract global reasoning information. Second, we apply convolution kernels of several sizes, followed by max pooling, to the BERT-encoded articles, questions and candidates to extract local semantic reasoning information at different granularities. Finally, we fuse the global information with the local multi-granularity information and use the result for answer selection. By combining the learned multi-granularity semantic information for reasoning, the proposed model improves the reasoning ability of machine reading comprehension. Experiments show that the proposed model outperforms the benchmark model on the C³ dataset, which verifies its effectiveness in semantic reasoning.


Introduction
How to make computers understand human language is the main goal of the field of Natural Language Processing (NLP) and has been a long-standing challenge for artificial intelligence research. Machine Reading Comprehension (MRC) tasks are similar to human reading comprehension tests in which the computer needs to answer questions based on the content of a given text [1]. Compared with traditional NLP tasks, MRC draws on lexical, grammatical and syntactic analysis and also requires combining contextual feature representation with semantic reasoning, which makes it a very challenging NLP task.
In MRC tasks, deep learning models are often used to help machines learn and understand contextual content so that they can answer the corresponding questions correctly. If machines can perform reading comprehension tasks similarly to humans, and have reading comprehension capabilities similar to or better than those of the human brain, then they can be of great value in replacing traditional human reading comprehension tasks.
Among the several types of MRC tasks (cloze test, span extraction, multiple choice, etc.), this paper focuses on multiple-choice tasks. A multiple-choice MRC task differs from a span extraction task in that it provides not only the text and the questions but also a set of candidate answers, from which the machine must select the correct one by taking into account the semantic information of the text [2]. In contrast to the cloze test MRC task, where the answers are fixed words and phrases, the answers in the multiple-choice MRC task are complete, logically coherent sentences written by humans based on the content of the text. Typical English datasets of this type include MCTest [3], RACE [4] and MCScript [5], and a representative Chinese dataset is C³ [6]. A sample from the RACE dataset for a multiple-choice MRC task is shown in Figure 1. RACE is a representative benchmark dataset for multiple-choice reading comprehension, constructed from an English test bank for junior and senior high schools. The candidates in candidate answer set A in Figure 1 never appear in the passage, so the machine needs to fully understand the semantic information of the context and select the semantically closest candidate from the candidate set as the answer. Whereas the answers to cloze test and span extraction tasks must come from the given passage, the answers to multiple-choice tasks are not necessarily sequences in the text. They are manually rewritten and summarized based on the content of the text, and some even need to be inferred with the help of external knowledge. Questions and answer candidates for multiple-choice Chinese MRC are written by humans, which makes the content more flexible and the correct answer difficult to find by simple matching.
Through our analysis of the multiple-choice Chinese MRC task, we found three factors that make the task challenging: (1) scarce training data and a lack of external knowledge severely limit the accuracy of the model; (2) answer selection for many questions requires deep semantic interaction: repeated semantic interactions between articles, questions and candidates are crucial, but they are learned inadequately; and (3) MRC requires a high level of semantic reasoning, so answer selection must take into account not only the local information of the related passages but also the global information of the article.
Based on the characteristics and challenges mentioned above, we propose a deep neural network (DNN) based model. The main contributions of this paper can be summarized as follows: (1) We design a multi-granularity semantic information extractor and apply it to our proposed MRC model to enhance the comprehension of local semantic meanings, which proved beneficial to model performance in our experiments. (2) We investigate semantic interactional reasoning and leverage the attention mechanism to extract semantic perceptual information between articles, questions and candidates. By fusing multiple kinds of semantic interaction information, we further improve the performance of our multiple-choice MRC model. (3) We model the learning of global semantic information and local semantic information separately, and jointly construct a deep global and local semantic multiple-choice MRC model to achieve better deep semantic learning and reasoning over articles, questions and answers in multiple-choice MRC tasks.
In the first section of this paper, we introduce the background of our research and the latest research progress in related fields. In Section 2, we analyze existing work to illustrate the significance of the proposed idea. In Section 3, we define the task formally and describe the proposed model structure with multi-granularity semantic reasoning. Section 4 describes the dataset, evaluation metrics and settings used in our experiments. Our experimental results are presented and analyzed in detail in Section 5, and we summarize our work in the last section.

Related Work
For nearly half a century, research on MRC has gone through three stages of development: the early era of rule-based MRC, the era of machine-learning-based MRC and the era of neural networks that use deep learning to build MRC models.

Rule-Based MRC
When the MRC task was first proposed in the 1970s, most early approaches were limited by hand-coded scripts and rules, making them difficult to apply widely in real-world scenarios. In the late 20th century, Hirschman et al. [7] proposed an MRC dataset for development and testing that contained 120 reading materials for primary school students, together with a number of short question-answer pairs covering who, where, when, why and what questions. The model was not required to give an exact answer, only to find the sentence in the article that contains the answer. They also proposed the DEEP READ model for this dataset, which primarily uses a rule-based bag-of-words model. Charniak et al. [8] fused a rule-based bag-of-words model with an approach based on lexical and semantic similarity, ultimately achieving an accuracy of 30% to 40% on the reading comprehension task of locating the answer.

Machine-Learning-Based MRC
In 2013, Richardson et al. [9] proposed the MCTest dataset, on which the weighted distances between questions and answers were calculated to predict the correct answer. The release of this dataset rapidly advanced the development of machine learning models [10][11][12]. In 2015, Wang et al. proposed a max-margin learning framework based on a heuristic sliding window approach, which improved model accuracy on the MCTest dataset from 63% to around 70% by converting each question-answer pair into a textual entailment problem for the corresponding statement. Similar to Wang's model, most models at that time were based on a simple max-margin learning framework with rich linguistic features (such as syntactic dependencies, denotational disambiguation, semantics and word embeddings) fitted to passages, questions and answers. Compared with earlier rule-based MRC approaches, machine-learning-based MRC models performed well. However, they still have significant limitations in terms of performance, for two main reasons: (1) they rely mainly on existing language tools for feature extraction, such as dependency parsers and semantic role labelers, but these tools are trained on data from a single domain and generalize relatively poorly, so the features obtained for MCTest data contain a lot of noise; (2) the dataset is too small to support adequate training of machine learning models.

Deep-Learning-Based MRC
In 2015, Hermann et al. [13] presented CNN/Daily Mail, the first large-scale fill-in-the-blank MRC dataset (about 1.26 million training examples), and also proposed the ATTENTIVE READER neural network model for this dataset, which is based on an attention mechanism and achieves a 12.9% improvement in accuracy compared to traditional NLP models. This marks the beginning of the DNN era for MRC. In 2016, Rajpurkar et al. [14] proposed SQuAD, an English dataset for extractive MRC tasks, which is the first canonical dataset containing large-scale natural language question-answer pairs in the MRC research community. Relying on the SQuAD dataset, Wang et al. [15] proposed the Match-LSTM and Answer Pointer models, which use a bidirectional LSTM to encode questions and articles and a one-way attention mechanism to perform semantic matching between article and question. Yu et al. [16] proposed QANet, which uses multi-layer convolution and a self-attention mechanism in the encoding module to integrate local and global interactions of articles and questions and thereby improve model performance. Basafa et al. [17] use Longformer [18], a long-document transformer, to learn the abstract meaning of the context. These studies show that deep-learning-based MRC models have stronger text semantic representation and answer reasoning abilities in English than traditional machine learning models.
A number of large-scale datasets for different Chinese MRC tasks have also been proposed, such as ReCO [19] for Chinese reading comprehension, ChID [20] for the cloze-style task on Chinese idioms, CMRC2018 [21] for the extractive task, and C³ [6] for the multiple-choice task. The release of many high-quality MRC datasets has driven the development of deep neural MRC models. Knowledge-enhanced pretrained models, such as ERNIE 3.0 [22] and Kepler [23], integrate factual knowledge into pretrained language models (PLMs) to achieve better performance. Instead of striving for better objective evaluation scores, Cui et al. [24] try to improve explainability for MRC tasks on multiple-choice datasets. To the best of our knowledge, few studies have focused on semantic reasoning at different levels of granularity to address the Chinese MRC challenge. In this paper, we build on previous work to construct models with greater comprehension and generalization capabilities for natural language in the field of MRC.

Task Definition
The multiple-choice MRC task requires the model to select the appropriate answer from the candidate answers based on the given context. The answers are not limited to words or entities present in the context, which makes the answer format more flexible. Given a Document (denoted as D), a Question (denoted as Q) and the corresponding set of options (denoted as O), the goal of the model is to infer the correct answer from the set of candidate answers.
We can then define the task as follows: the relationship between the document $D = \{d_1, d_2, \ldots, d_m\}$ (m denotes the number of words in the article), the question $Q = \{q_1, q_2, \ldots, q_n\}$ (n denotes the number of words in the question) and the answers $O_i = \{o_1, o_2, \ldots, o_k\}$ (k denotes the number of words in the i-th candidate answer) is shown in Equation (1).
In this task, our model should find the best answer among all candidate answers by learning from the aforementioned relationship.
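As an illustrative assumption (not necessarily the exact form of Equation (1)), the selection objective for a task of this kind is commonly written as choosing the candidate with the highest conditional probability given the document and the question. The symbols $P$ and $\hat{O}$ are notation introduced only for this sketch:

```latex
\hat{O} = \operatorname*{arg\,max}_{O_i \in O} \; P(O_i \mid D, Q)
```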

Model
In this section, we construct a multiple-choice MRC model incorporating multi-granularity semantic reasoning. First, the articles, questions and candidates are fed into the BERT model, whose multi-layer transformer structure learns the semantic information they contain. Second, the feature vectors output by the final hidden layer of BERT become the input to a convolutional neural network (CNN) whose convolution kernels have windows of different sizes and therefore learn semantic paths of different lengths. Third, the multi-granularity semantic reasoning output is concatenated with the global feature information output at the CLS position of BERT, yielding both global reasoning information and local multi-granularity semantic reasoning information. Finally, the concatenated information is passed through a fully connected layer and a softmax function to complete answer selection. The model structure is shown in Figure 2. It is divided into six main layers from bottom to top: input layer, embedding layer, encoding layer, multi-granularity semantic reasoning layer, information fusion layer and answer prediction layer.

Input Layer
This layer mainly represents the inputs of the documents, questions and candidates. According to the input characteristics of the BERT model, the sequence of inputs of documents, questions and candidates is represented as shown in Equation (2).
where D denotes the set of token sequences of documents, Q denotes the set of token sequences of questions, and $O_i$ denotes the set of token sequences of the i-th candidate.
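A minimal sketch of how one input sequence per candidate could be assembled. The layout `[CLS] D [SEP] Q O_i [SEP]` is an assumption based on the usual BERT sentence-pair format and may differ from the exact form of Equation (2); `build_inputs` and `segment_ids` are hypothetical helper names introduced for illustration.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def build_inputs(document: List[str], question: List[str],
                 candidates: List[List[str]]) -> List[List[str]]:
    """Return one token sequence per candidate answer."""
    sequences = []
    for option in candidates:
        # Assumed layout: [CLS] document [SEP] question + option [SEP]
        sequences.append([CLS] + document + [SEP] + question + option + [SEP])
    return sequences


def segment_ids(sequence: List[str]) -> List[int]:
    """0 up to and including the first [SEP], 1 for everything after it."""
    first_sep = sequence.index(SEP)
    return [0 if pos <= first_sep else 1 for pos in range(len(sequence))]


# Toy example with two candidates of different lengths.
toy_sequences = build_inputs(["a", "b"], ["q"], [["x"], ["y", "z"]])
```

Each candidate produces its own sequence, so a question with several options yields several encoder inputs that share the same document and question tokens.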

Embedding Layer
The embeddings used in our model consist of three parts: token embeddings, segment embeddings and position embeddings. Token embeddings convert the input sequence S from the input layer into vectors of fixed dimension; segment embeddings distinguish the first and second parts of a sentence pair; position embeddings encode the position information of S. The formulae are shown in Equations (3)-(6).
where $\mathrm{Input}_S$ represents the overall output of the BERT embedding layer.
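For orientation, the standard BERT embedding layer sums the three embeddings element-wise. The following is a sketch under that assumption and is not necessarily identical to Equations (3)-(6); the names $E_{\mathrm{tok}}$, $E_{\mathrm{seg}}$ and $E_{\mathrm{pos}}$ are introduced here for illustration:

```latex
\begin{aligned}
E_{\mathrm{tok}} &= \mathrm{TokenEmbedding}(S) \\
E_{\mathrm{seg}} &= \mathrm{SegmentEmbedding}(S) \\
E_{\mathrm{pos}} &= \mathrm{PositionEmbedding}(S) \\
\mathrm{Input}_S &= E_{\mathrm{tok}} + E_{\mathrm{seg}} + E_{\mathrm{pos}}
\end{aligned}
```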

Encoding Layer
This layer encodes the sequence output by the embedding layer through the multi-layer transformer in BERT. The transformer layers are stacked sequentially: the output of the previous transformer layer is the input of the current one. The calculation is shown in Equations (7) and (8).
where $h_i$ denotes the output of the transformer's i-th layer and N is the number of layers of the transformer in BERT.
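The layer-by-layer dependency described above can be sketched as the following recursion. This is an assumed form consistent with the text, not necessarily the exact notation of Equations (7) and (8); $h_0$ is taken to be the output of the embedding layer:

```latex
\begin{aligned}
h_0 &= \mathrm{Input}_S \\
h_i &= \mathrm{Transformer}(h_{i-1}), \quad i = 1, 2, \ldots, N
\end{aligned}
```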

Multi-Granularity Semantic Reasoning Layer
This layer performs multi-granularity semantic reasoning on the vectors encoded by BERT using a CNN. It mimics human reading comprehension, in which the reader repeatedly focuses on the important semantic information in the preceding and following context while reasoning and then completes the answer selection. The CNN mainly contains a convolutional layer and a pooling layer. The convolution kernel window sizes used in our model are 2, 3 and 4, and the pooling method is max pooling. The calculation is shown in Equations (9)-(11).
where $T_2$, $T_3$ and $T_4$ denote the results of convolution with kernel window sizes 2, 3 and 4, respectively, each followed by max pooling.
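The convolution and max-pooling step can be illustrated with a small dependency-free sketch. It assumes a plain one-dimensional convolution along the token axis with no nonlinearity and toy all-ones kernels; the function names, the kernel values and the number of kernels per window are illustrative assumptions, not the paper's configuration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Kernel = Sequence[Vector]  # shape: window x hidden_dim


def conv_max_pool(hidden: Sequence[Vector], kernel: Kernel,
                  bias: float = 0.0) -> float:
    """Slide one kernel along the token axis and keep the largest response."""
    window = len(kernel)
    if len(hidden) < window:
        raise ValueError("sequence is shorter than the kernel window")
    responses = []
    for start in range(len(hidden) - window + 1):
        # Dot product between the kernel and `window` consecutive token vectors.
        response = bias
        for offset in range(window):
            token = hidden[start + offset]
            response += sum(t * w for t, w in zip(token, kernel[offset]))
        responses.append(response)
    return max(responses)  # max pooling over all window positions


def multi_granularity_features(
    hidden: Sequence[Vector], kernels: Dict[int, List[Kernel]]
) -> Dict[int, List[float]]:
    """Return T_w for every window size w: one pooled value per kernel."""
    return {
        window: [conv_max_pool(hidden, kernel) for kernel in bank]
        for window, bank in kernels.items()
    }


# Toy example: 4 tokens with hidden size 2, one all-ones kernel per window size.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
kernel_banks = {w: [[[1.0, 1.0]] * w] for w in (2, 3, 4)}
features = multi_granularity_features(hidden_states, kernel_banks)
```

A window of size 2 only sees adjacent token pairs, while a window of size 4 spans the whole toy sequence, which is the sense in which different window sizes capture different granularities.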

Information Fusion Layer
This layer fuses the output of the multi-granularity layer with the feature vector taken from the CLS position of BERT. The CLS vector represents the global feature information obtained through the BERT model, while the output of the multi-granularity layer represents the feature information obtained by reasoning at multiple local granularities. By fusing the two, the model learns more comprehensive information, which benefits the subsequent answer prediction. The calculation is shown in Equation (12).
where C denotes the feature vector output from the CLS position, and x is the result of information fusion operation.
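Assuming the fusion is a plain concatenation, as the model overview describes the two outputs as being spliced together, the operation can be sketched as follows (the bracket notation for concatenation is introduced here and may differ from Equation (12)):

```latex
x = [\, C \,;\, T_2 \,;\, T_3 \,;\, T_4 \,]
```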

Answer Prediction Layer
This layer focuses on the prediction of answers for multiple-choice MRC, and after the fully connected layer, the answer prediction is performed by the softmax function. The final output is calculated as shown in Equation (13).
where $W_x$ denotes the weight and b denotes the bias.
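A minimal sketch of the prediction step, assuming the common multiple-choice head in which each candidate's fused vector is mapped to one scalar score by the fully connected layer and the softmax is taken over the candidates. The exact form of Equation (13) may differ; `softmax` and `predict` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(fused: Sequence[Sequence[float]], weights: Sequence[float],
            bias: float) -> Tuple[int, List[float]]:
    """Score every candidate's fused vector and return (best index, probabilities)."""
    # Fully connected layer with a single output unit: score = w . x + b
    scores = [sum(w * x for w, x in zip(weights, vec)) + bias for vec in fused]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: three candidates with two-dimensional fused vectors.
best_index, probabilities = predict([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                    weights=[1.0, 2.0], bias=0.0)
```

Subtracting the maximum score before exponentiation leaves the probabilities unchanged and avoids overflow for large scores.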

Optimization
Our model uses the cross-entropy loss function, which is calculated by Equation (14):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log(\hat{y}_{ic}) \qquad (14)$$

where N is the total number of inputs and M is the number of categories. $y_{ic}$ is the expected output, which is 1 when category c is the actual category of sample i and 0 otherwise, and $\hat{y}_{ic}$ is the predicted probability that sample i belongs to category c.
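The loss can be sketched in a few lines, assuming it is averaged over the N inputs and that labels are given as class indices, so that only the term of the true category contributes for each sample. The function name and the clipping constant are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[Sequence[float]],
                  labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean cross-entropy; labels[i] is the index of the true category of sample i."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need one label per sample and at least one sample")
    total = 0.0
    for probs, label in zip(probabilities, labels):
        # With one-hot targets, only the true category's log-probability remains.
        total -= math.log(max(probs[label], eps))
    return total / len(labels)


# Toy example: two samples, two categories.
toy_loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The clipping by `eps` only guards against taking the logarithm of zero and does not change the result for well-behaved probabilities.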

Dataset
C³: We use the multiple-choice Chinese MRC dataset C³ and perform a statistical analysis of it. In 2019, researchers at Tencent AI Lab presented the first free-form multiple-choice Chinese MRC dataset, which contains 13,369 documents (in both formal and informal styles) collected from general-domain questions of the Chinese Proficiency Test and 19,577 multiple-choice MRC questions associated with these documents.
To evaluate the generalization ability of models across domains, the dataset contains two document types: dialogue documents and non-dialogue documents with mixed topics (e.g., stories, news reports, monologues, or advertisements). MRC tasks can accordingly be classified into two categories based on the document type: C³-Dialogue (C³_D) and C³-Mixed (C³_M). Within these two task types, each document corresponds to a variety of question types, such as fill-in-the-blank questions formed by removing spans or sentences from the text, closed-form questions that can be answered with minimal answers (e.g., yes or no), and free-form questions that require reasoning over multiple sentences of the text. Since 86.8% of the questions in this dataset require combining knowledge within the document with external knowledge (general world knowledge) to understand the given text, most questions in this dataset require rich external knowledge to assist the machine in answering.
There is a significant difference between C³_D and C³_M: most of the documents in C³_M are formal written texts, while the dialogue documents in C³_D contain a lot of spoken language, so C³_M has a larger vocabulary than the dialogue documents. The average document length in C³_M is 180.2, and the vocabulary size is 4120, while the average document length in C³_D is 76.3, and the vocabulary size is 2922. Because of its longer documents, C³_M may be better suited to assessing MRC on verbose texts.

Metrics
Multiple-choice MRC tasks generally use accuracy to measure the performance of the model. The accuracy rate indicates the number of samples that made the correct choice as a percentage of the total number of samples. A higher value of accuracy means that the model answered more questions correctly. Accuracy is calculated by Equation (15).
where I is an indicator function that determines whether the predicted value $\hat{y}_i$ and the actual value $y_i$ are equal. Its output is 1 if they are equal and 0 otherwise.
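As described, accuracy is the mean of the indicator over all samples, i.e. the number of correct choices divided by the total number of samples. A minimal sketch follows; the function name is an illustrative choice.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of samples whose predicted option equals the actual answer."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need equally long, non-empty prediction and label lists")
    # The indicator I contributes 1 for a match and 0 otherwise.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Toy example: three of four predictions match the labels.
toy_accuracy = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```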

Experimental Settings
In this paper, our model is built using the PyTorch deep learning framework. After many experiments, we set the parameters of our model to the values shown in Table 1. To prevent overfitting and excessive training time, the model is evaluated on the validation set after every training round. If the validation accuracy does not improve for two consecutive rounds, training is stopped, and the model from the round with the highest accuracy is used as the final model.
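The stopping rule described above can be sketched independently of any framework. Here `run_epoch` is a hypothetical callback that trains for one round and returns the validation accuracy; a real training loop would also save a checkpoint whenever the best accuracy improves.

```python
from typing import Callable, Tuple


def train_with_early_stopping(run_epoch: Callable[[int], float],
                              max_epochs: int,
                              patience: int = 2) -> Tuple[int, float]:
    """Stop after `patience` consecutive rounds without validation improvement.

    Returns the round with the highest validation accuracy and that accuracy.
    """
    best_epoch, best_acc, stale = 0, float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        acc = run_epoch(epoch)
        if acc > best_acc:
            # New best round: reset the counter (a checkpoint would be saved here).
            best_epoch, best_acc, stale = epoch, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc


# Toy example with made-up validation accuracies, not results from the paper.
toy_curve = [0.60, 0.65, 0.66, 0.655, 0.65, 0.70]
evaluated = []


def toy_epoch(epoch: int) -> float:
    evaluated.append(epoch)
    return toy_curve[epoch - 1]


toy_result = train_with_early_stopping(toy_epoch, max_epochs=len(toy_curve))
```

In the toy run, training stops after round 5 because rounds 4 and 5 do not improve on round 3, so the later round is never evaluated.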

Experimental Results
Most of the questions in the C³ dataset used in this paper require semantic reasoning to answer: some need reasoning over a single sentence, and others require multiple sentences to be considered together to find the appropriate answer. Therefore, this dataset is well suited to verifying the effectiveness of multi-granularity reasoning.
To illustrate the effectiveness of the proposed model, its test results are compared with those of several other models, and the detailed comparison is shown in Table 2. From Table 2, we can see that our proposed model achieves improvements of 0.634%, 0.838%, and 0.736% over the benchmark BERT model on C³_M-test, C³_D-test, and C³-test, respectively, which indicates that introducing the multi-granularity module improves on the benchmark BERT model. The results also suggest that the fused multi-granularity semantic reasoning method we propose can improve the reasoning ability of the model. By applying convolution and max pooling with convolution window sizes of 2, 3 and 4, our model extracts local multi-granularity feature information and then combines it with the global granularity feature information, which enables reasoning at multiple granularities.
We also recorded the validation results of each round to examine the relationship between the number of training rounds and model performance. The results are shown in Figures 3 and 4. From Figures 3 and 4, we can see that the model converges quickly during the first three rounds of training and becomes steady from the third round onward. The model achieves its best performance at the fifth round, which indicates that the proposed model converges well and can be trained in few rounds.

Ablation Studies
To verify the effectiveness of global granularity reasoning and local multi-granularity reasoning in our model, we design a local-only multi-granularity reasoning model that uses the BERT model with convolution and pooling but does not fuse the global granularity feature information from the CLS position. The experimental results are shown in Table 3. Table 3 shows that the model considering only local multi-granularity reasoning performs worse than the baseline BERT model by 1.52%, 0.961%, and 1.25% on C³_M-test, C³_D-test, and C³-test, respectively. This result indicates that considering only local granularity reasoning, without global information, decreases the performance of the model. This is consistent with the observation that humans who reason from only partial information when answering reading comprehension questions reach inaccurate conclusions. It further supports the design of our multiple-choice MRC model, which considers not only local multi-granularity reasoning but also global granularity reasoning.

Analysis
To further demonstrate the performance improvement of the proposed model, we randomly selected questions from the Chinese multiple-choice MRC dataset that require different types of reasoning to answer. We then ran both the benchmark BERT model and our proposed model on them, and the results are shown in Table 4. From Table 4, we can see that our fused multi-granularity MRC model improves considerably over the BERT model in semantic reasoning, implicative reasoning, and causal reasoning. These results indicate that the proposed method is reasonable and feasible.

Conclusions and Future Work
This paper focuses on a multiple-choice Chinese MRC model that incorporates multi-granularity semantic reasoning. By designing a fusion method for global features and local semantic reasoning outputs, we improve the performance of the model, which demonstrates the effectiveness of the proposed method. This research suggests that studying the patterns of human reading, thinking and learning is a valuable way to conduct research in the field of deep learning. A well-designed local-global semantic information interaction scheme can noticeably enhance model perception capabilities. Our study calls for the research community to go deeper into the utility of semantic meanings and to explore further how to build stronger MRC models.
In the future, we plan to focus on two main aspects. First, because of the deficiency discovered in the experiments on other types of reasoning, we will improve the model's ability to handle referential problems by adapting solutions used in coreference resolution tasks. Second, as the proposed model is less capable of processing excessively long contexts involving various external knowledge, we will leverage the latest knowledge-enhanced approaches to overcome this shortcoming and extend the proposed model to more challenging settings.

Data Availability Statement:
The model is trained on the multiple-choice Chinese MRC dataset C³ [25], which contains 13,369 documents (in both formal and informal styles) collected from general-domain questions of the Chinese Proficiency Test and 19,577 multiple-choice MRC questions associated with these documents. This dataset gives paragraphs of varying length and a number of questions, along with corresponding English translations. The average document length in C³_M is 180.2, and the vocabulary size is 4120, while the average document length in C³_D is 76.3, and the vocabulary size is 2922. Both types are used to assess the ability of MRC for verbose texts.