Document Re-Ranking Model for Machine-Reading and Comprehension

Abstract: Recently, the performance of machine-reading and comprehension (MRC) systems has been significantly enhanced. However, MRC systems require high-performance text retrieval models because text passages containing answer phrases should be prepared in advance. To improve the performance of the text retrieval models underlying MRC systems, we propose a re-ranking model, based on artificial neural networks, that is composed of a query encoder, a passage encoder, a phrase modeling layer, an attention layer, and a similarity network. The proposed model learns degrees of association between queries and text passages through dot products between the phrases that constitute questions and passages. In experiments with the MS-MARCO dataset, the proposed model demonstrated mean reciprocal ranks (MRRs) 0.8%p–13.2%p higher than most of the previous models, except for the models based on BERT (a pre-trained language model). Although the proposed model demonstrated lower MRRs than the BERT-based models, it was approximately 8 times lighter and 3.7 times faster than the BERT-based models.


Introduction
Machine-reading and comprehension (MRC) is a question answering task in which computers are required to understand contexts based on passages and answer related questions. With the rapid evolution of deep neural network techniques, the performance of MRC models has been substantially enhanced [1][2][3]. However, conventional MRC models have a deficiency in that text passages relevant to user queries (i.e., text passages containing phrases answering user queries) should be prepared in advance. Figure 1 illustrates an example in which an MRC model returns different answers according to the given passages. To overcome this problem, open-domain MRC models based on information retrieval (IR) have been proposed [4,5]. These models conventionally follow a two-stage process: passage retrieval based on an IR model and answer extraction based on an MRC model, as illustrated in Figure 2. In several cases, the performance of IR-based MRC models depends on that of underlying IR models that employ term frequency-inverse document frequency (TF-IDF) rankings. As illustrated in Figure 1, when the relevant passage (i.e., the upper document) is given, the MRC model returns the correct answer "April 1975", but when the irrelevant passage (i.e., the lower document) is given, it returns the incorrect answer "2010". Although recent IR models have demonstrated superior performance, the highly ranked documents often do not contain answers to the relevant questions. This leads to a decrease in answer recall in open-domain MRC. Therefore, certain models have been proposed for enhancing IR performance [6]. Similar to Lee et al. [6], we propose an artificial neural network (ANN) model that re-ranks retrieved documents to improve answer recall in MRC (i.e., to ensure that documents containing answers are ranked high).
The proposed model complements the underlying IR model by learning degrees of associations (i.e., possibilities for the documents to contain answer phrases) between queries and documents through a deep neural network.
The remainder of this paper is organized as follows. In Section 2, we briefly review earlier re-ranking models. In Section 3, we describe our model. In Section 4, we explain our experimental setup and report some of our experimental results. In Section 5, we provide the conclusions of our study.
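As a rough, hypothetical sketch (not the authors' implementation), the two-stage retrieve-then-rerank pipeline discussed above can be expressed as follows. The scoring functions are toy stand-ins: a real system would use BM25 for retrieval and the trained neural model for re-ranking.

```python
# Minimal sketch of the retrieve-then-rerank pipeline for open-domain MRC.
# The scorers below are illustrative stand-ins, not BM25 or the paper's model.

def retrieve(query, passages, top_k=3):
    """Stage 1: rank passages by a crude term-overlap score (BM25 stand-in)."""
    def overlap(p):
        return len(set(query.lower().split()) & set(p.lower().split()))
    return sorted(passages, key=overlap, reverse=True)[:top_k]

def rerank(query, passages, scorer):
    """Re-ranking step: reorder retrieved passages with a learned scorer."""
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)

passages = [
    "Microsoft was founded in April 1975 by Bill Gates and Paul Allen.",
    "The company moved its headquarters in 2010.",
    "Seattle is a city in Washington state.",
]
query = "when was Microsoft founded"

candidates = retrieve(query, passages)
# A hypothetical scorer that prefers answer-bearing passages.
reranked = rerank(query, candidates,
                  scorer=lambda q, p: "founded" in p.lower())
print(reranked[0])  # the answer-bearing passage should now rank first
```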

Previous Studies
The earlier IR models that employ TF-IDF rankings do not properly consider semantic information, such as homonyms, because they depend on token-matching methods. To resolve this problem, some IR models based on ANNs have been proposed. Xiong et al. [7] proposed a ranking model called KNRM (Kernel-based Neural Ranking Model), which generates a translation matrix from the similarities between queries and documents. In addition, KNRM uses a kernel pooling method to effectively summarize the translation matrix and to generate scores for ranking learning. Guo et al. [8] proposed a ranking model termed DRMM, based on cosine similarities between query vectors and document vectors in a latent vector space generated by a multilayer perceptron (MLP). DRMM demonstrated superior performance in certain retrieval tasks. However, Dai et al. [9] pointed out that DRMM returns inconsistent similarities depending on the lengths of the query and document vectors. To resolve this issue, Dai et al. proposed a cross-mapping function based on a convolutional neural network (CNN) with kernel pooling [7] that always returns fixed-length query and document vectors. To overcome the limitation that several ANN models cannot properly reflect term frequency and document frequency, Mitra et al. [10] proposed a joint model in which a local model based on conventional term frequencies and a distributed model based on distributed word representations are co-trained. Alaparthi et al. [11] proposed a ranking model based on a bi-LSTM with a co-attention mechanism between a query and a document. In addition to co-attention, we also use a self-attention mechanism over various word embeddings (e.g., word2vec [12], GloVe [13], fastText [14]).

Re-Ranking Model Based on Artificial Neural Network

Figure 3 illustrates the overall architecture of the proposed re-ranking model. As depicted, the proposed model consists of five parts: a query encoder, a passage encoder, a phrase modeling layer, an attention layer, and a similarity network. In this paper, the term "passage" refers to an indexing unit that is typically referred to as a document in IR.

In Figure 4, w_i is the ith word in a query or a passage, and E_g(w_i) is a pre-trained k-dimensional GloVe embedding [13] of w_i. E_p(w_i) is an l-dimensional position embedding of w_i that represents the word position in the query or passage, and E_IDF(w_i) is an l-dimensional inverse document frequency (IDF) embedding [15] that is set to a discrete value according to score intervals of 0.05. Further, E_op(w_i) is an l-dimensional position embedding of a word that overlaps between the query and the passage. For example, when the query "I go to school" and the passage "We should come back to school" are presented, E_op(w_4) of the overlapping word "school" in the query is set to 6, meaning that the overlapping word occurs in the 6th position in the passage (the opponent text). We empirically set the embedding sizes of E_p(w_i), E_op(w_i), and E_IDF(w_i) to the same dimension because we cannot distinguish them in terms of the quantity of information. All embeddings except the GloVe embedding are randomly initialized and fine-tuned during training. To simplify the equations, we rewrite E(w_i) for the ith input unit in a query and a passage as q_i and p_i, respectively.
The query and passage encoders convert the query vector Q = (q_1, q_2, ..., q_n) with n word vectors and the passage vector P = (p_1, p_2, ..., p_m) with m word vectors into the encoded vectors ↔Q and ↔P, respectively, which embed contextual information using bidirectional gated recurrent units (biGRUs) [16], as represented by Equation (1). In Equation (1), [h_i→, h_i←] is the concatenation of a forward hidden state h_i→ and a backward hidden state h_i←. The weights in the query encoder and the passage encoder are not shared. The encoded word vectors (the outputs of the query encoder and the passage encoder) are then input to the phrase modeling layer.
The phrase modeling layer generates phrase-level features based on word n-grams (from word unigrams to word trigrams) using CNNs [17]. Unlike conventional CNNs, the CNNs used in the phrase modeling layer do not have any pooling layers, as depicted in Figure 5. Using these CNNs, the encoded vectors ↔Q and ↔P are represented as three types of phrase vectors: unigram phrase vectors ↔Q^1 and ↔P^1, bigram phrase vectors ↔Q^2 and ↔P^2, and trigram phrase vectors ↔Q^3 and ↔P^3. To simplify the equations, we write the n-gram phrase vectors as ↔Q^n and ↔P^n. Thereafter, the n-gram phrase vectors are input to the attention layer.
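A minimal sketch of the phrase modeling layer described above, assuming random filter weights in place of trained CNN parameters (the dimensions are illustrative, not the paper's configuration). Because there is no pooling, one phrase vector is produced per n-gram position:

```python
import numpy as np

rng = np.random.default_rng(0)

def ngram_conv(encoded, n, num_filters=16):
    """Convolve width-n windows of encoded word vectors; no pooling layer,
    so one phrase vector is produced per n-gram position (cf. Figure 5)."""
    seq_len, dim = encoded.shape
    # Random filters stand in for trained CNN weights.
    filters = rng.standard_normal((num_filters, n * dim)) * 0.1
    windows = np.stack([encoded[i:i + n].ravel()
                        for i in range(seq_len - n + 1)])
    return np.tanh(windows @ filters.T)  # shape: (seq_len - n + 1, num_filters)

# 6 words, 8-dimensional biGRU outputs (toy values).
encoded_query = rng.standard_normal((6, 8))
phrases = {n: ngram_conv(encoded_query, n) for n in (1, 2, 3)}
for n, v in phrases.items():
    print(n, v.shape)  # unigrams (6, 16), bigrams (5, 16), trigrams (4, 16)
```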
The attention layer consists of two sublayers: a passage-query (P-Q) attention layer and an attentive pooling layer. In the P-Q attention layer, the proposed model calculates the degrees of association between the n-gram phrases in a query and those in a passage. The P-Q attention vector of an n-gram phrase vector, P^n_att, is calculated using the scaled dot product [18] expressed in Equation (2). In Equation (2), ↔P^n denotes an n-gram phrase vector of a passage, and Q^n denotes an n-gram phrase vector of a query mapped onto the d_Q^n dimension using a conventional pooling mechanism [19].
The proposed model then converts ↔Q^n and P^n_att into fixed-length vectors in the attentive pooling layer, as expressed in Equation (3). In Equation (3), w^n and b^n denote a weight matrix and a bias vector, respectively, and g^n_q and g^n_p are generated by feed-forward neural networks (FNNs); × denotes an element-wise product between two vectors. The normalized attention vectors Q^n and P^n for the n-gram phrase vectors are input to the similarity network.
To calculate the similarity between the normalized attention vector P^n and the normalized query vector Q^n, we adopt the similarity vector representation proposed in a report on sentence embedding by Conneau et al. [20], as expressed in Equation (4):

sim(Q^n, P^n) = [Q^n; P^n; Q^n - P^n; Q^n × P^n]    (4)

In Equation (4), [·; ·; ...] and × denote the concatenation of vectors and an element-wise product, respectively. The final logit function for the similarity calculation between a query and a passage is expressed in Equation (5). In Equation (5), [sim(Q^1, P^1), ...] denotes the concatenation of the n-gram similarity vectors, and w and b denote a weight matrix and a bias vector, respectively, used to represent similarity distributions through an FNN. Finally, the model is trained using the cross-entropy loss expressed in Equation (6).
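The attention and similarity computations can be sketched as follows. This is an illustrative NumPy reading, with an element-wise product standing in for the "×" operation, mean-pooling as a simple stand-in for attentive pooling, and random vectors in place of learned phrase representations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pq_attention(p_phrases, q_phrases):
    """Scaled dot-product attention (cf. Equation (2)): each passage
    phrase attends over all query phrases."""
    d = q_phrases.shape[1]
    scores = p_phrases @ q_phrases.T / np.sqrt(d)
    return softmax(scores, axis=1) @ q_phrases

def similarity_features(q_vec, p_vec):
    """Similarity representation (cf. Equation (4)): [Q; P; Q - P; Q * P],
    with an element-wise product standing in for the '×' in the paper."""
    return np.concatenate([q_vec, p_vec, q_vec - p_vec, q_vec * p_vec])

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 16))   # 5 query phrase vectors (toy values)
p = rng.standard_normal((7, 16))   # 7 passage phrase vectors (toy values)

p_att = pq_attention(p, q)         # (7, 16): query-aware passage phrases
# Mean-pooling here is a stand-in for the attentive pooling of Equation (3).
feats = similarity_features(q.mean(axis=0), p_att.mean(axis=0))
print(feats.shape)                 # (64,): fed to the final FNN logit
```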

Datasets and Experimental Settings
We trained and evaluated our model on the Microsoft Machine-Reading Comprehension (MS-MARCO) dataset [21]. The training set contains approximately 400 M tuples of queries and relevant and non-relevant passages. The development set contains approximately 6900 queries, each paired with the top 1000 passages retrieved with BM25 [22] from the MS-MARCO dataset. On average, each query has one relevant passage. We trained the model via negative sampling at a ratio of 1:5 over the 1000 passages corresponding to each query. To implement the proposed model, we adopted pre-trained GloVe embeddings with a dimensionality of 300. We set the hidden size of the GRU to 200. Optimization was performed with Adam [23] at a learning rate of 0.0001, and the learning rate was halved whenever the validation performance did not improve. The dropout rate was set to 0.2, and the mini-batch size was set to 256 sequences.
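The 1:5 negative sampling scheme can be sketched as follows (a hypothetical helper, not the authors' code): for each relevant passage, five non-relevant passages are drawn from the retrieved candidates.

```python
import random

def build_training_pairs(query, relevant, candidates, neg_per_pos=5, seed=0):
    """For each relevant passage, sample `neg_per_pos` non-relevant passages
    from the retrieved candidates (the 1:5 ratio used during training)."""
    rng = random.Random(seed)
    negatives = [p for p in candidates if p not in relevant]
    pairs = []
    for pos in relevant:
        pairs.append((query, pos, 1))  # label 1: relevant
        for neg in rng.sample(negatives, k=min(neg_per_pos, len(negatives))):
            pairs.append((query, neg, 0))  # label 0: non-relevant
    return pairs

candidates = [f"passage {i}" for i in range(10)]
pairs = build_training_pairs("a query", ["passage 3"], candidates)
print(len(pairs))  # 1 positive + 5 negatives = 6
```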
We used the mean reciprocal rank at 10 (MRR@10) [24], which is the MRR score computed over documents ranked in the top 10, because it is essential for relevant documents to be highly ranked for MRC models, as expressed in Equation (7):

MRR@10 = (1/n) Σ_{i=1}^{n} 1/r_i    (7)

In Equation (7), r_i is the rank of the first passage containing a correct answer for the ith query (with 1/r_i taken as 0 when no such passage appears in the top 10), and n is the number of queries.
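Equation (7) translates directly into code; the `mrr_at_10` helper below is an illustrative sketch:

```python
def mrr_at_10(first_relevant_ranks):
    """MRR@10 (Equation (7)): mean reciprocal rank over queries, counting
    only relevant passages ranked in the top 10 (reciprocal rank 0 otherwise).
    `None` means no relevant passage appeared in the top 10 for that query."""
    total = sum(1.0 / r for r in first_relevant_ranks
                if r is not None and r <= 10)
    return total / len(first_relevant_ranks)

# r_i for four queries: ranks 1, 2, miss, 10.
ranks = [1, 2, None, 10]
print(mrr_at_10(ranks))  # (1 + 0.5 + 0 + 0.1) / 4 = 0.4
```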

Experimental Results
The first experiment was conducted to evaluate the effectiveness of the additional input embeddings (i.e., position embedding, overlapped position embedding, and IDF embedding) and the phrase modeling layer by comparing the changes in performance, as presented in Table 1. In Table 1, "w/o additional input embeddings" refers to a modified model in which only word embeddings are used as input units. Similarly, "w/o phrase modeling layer" refers to a modification of our model in which the phrase modeling layer is excluded. As presented in Table 1, the additional input embeddings and the phrase modeling layer contribute to improvements in MRR@10 of 8%p and 0.8%p, respectively. In addition, to check whether the word n-gram features can effectively capture phrase information, we visualized the degrees of association between the n-gram phrases in a query and those in a passage (i.e., the P-Q attention scores in Equation (2)) through 2-dimensional heat maps, as shown in Figure 6. In Figure 6, n-gram phrases with higher attention scores are shown in darker blue. As illustrated in Figure 6, the unigram features produced strong associations between single words or short phrases, whereas the trigram features produced strong associations between longer phrases. This reveals that each n-gram feature contributes differently to capturing associations between a query and a passage.
Table 2 (excerpt): BERT-Base [25], MRR@10 0.347; BERT-Large [25], MRR@10 0.365.
The second experiment was conducted to compare the proposed model with the earlier models, as presented in Table 2. Referring to Table 2, BM25 is a traditional retrieval model termed Okapi BM25, and Duet v2 is a joint ANN model comprising a local model based on term frequencies and a distributed model based on word vectors. Conv-KNRM is an ANN model in which queries and documents are encoded using CNNs. Alaparthi et al. [11] refers to a bidirectional long short-term memory network model with a co-attention mechanism between query and passage representations. BERT-Base and BERT-Large are fine-tuned classification models based on the base and large models of BERT [25], which is a pre-trained language model with state-of-the-art performance in several downstream natural language processing tasks such as span prediction, sequence labeling, and text classification [26]. As presented in Table 2, the proposed model outperformed all the previous models except the BERT-based models (the proposed model, named "n-gram co-attention", can be found in the official rankings: https://microsoft.github.io/msmarco/). Table 3 presents the memory usage and response times of the proposed and BERT-based models. In Table 3, the response time is the average time per query spent ranking the top 1000 passages retrieved with BM25 [22]. To re-rank the passages retrieved for the 6900 queries in the development set, BERT-Base spent approximately 2.108 h, whereas the proposed model spent approximately 0.575 h. As presented in Table 3, although the proposed model demonstrated lower performance than the BERT models, it required significantly less memory (about 8.0 times less) and achieved a faster response time (about 3.7 times faster) than the latter. The ratio between the response times, 300/1100 ≈ 0.27, is larger than the ratio between the parameter counts, 3.5/110 ≈ 0.03.
This gap is caused by the difference between the neural network architectures: the recurrent neural network used in the proposed model must process input words sequentially, whereas the Transformer used in BERT-Base can process all input words in parallel. Based on these experimental results, we conclude that the proposed model may be more suitable for practical open-domain MRC systems that must respond to a large number of concurrent user queries.

Conclusions
We proposed an ANN-based model that re-ranks documents retrieved by a conventional IR model, BM25, to improve the performance of MRC models. The proposed model is composed of five subnetworks: a query encoder, a passage encoder, a phrase modeling layer, an attention layer, and a similarity network. By calculating phrase-level associations between queries and passages, the model effectively scores passages with respect to queries. In the experiments with the MS-MARCO dataset, the proposed model demonstrated MRRs 0.8%p-13.2%p higher than those of the previous models, except for the BERT-based models. Although the proposed model demonstrated lower MRRs than the BERT-based models, it was significantly more efficient than the latter in terms of memory usage and response time (approximately 8 times less memory and an approximately 3.7 times faster response time). We conclude that these efficiencies are very important engineering factors in the development of a practical MRC system for massive numbers of concurrent users.
Author Contributions: Conceptualization and methodology, H.K.; software, validation, formal analysis, investigation, resources, data curation, and writing-original draft preparation, Y.J.; writing-review and editing, visualization, supervision, project administration, and funding acquisition, H.K. All authors have read and agreed to the published version of the manuscript.
Funding: This paper was supported by Konkuk University in 2020.