SS-BERT: A Semantic Information Selecting Approach for Open-Domain Question Answering

Abstract: Open-Domain Question Answering (Open-Domain QA) aims to answer any factoid questions from users. Recent progress in Open-Domain QA adopts the "retriever-reader" structure, which has proven effective. Retriever methods are mainly categorized as sparse retrievers and dense retrievers. In recent work, the dense retriever showed a stronger semantic interpretation than the sparse retriever. When training a dual-encoder dense retriever for document retrieval and reranking, there are two challenges: negative selection and a lack of training data. In this study, we make three major contributions to this topic: negative selection by query generation, data augmentation from negatives, and a passage evaluation method. We show that the model performs better by focusing on false negatives and data augmentation in the Open-Domain QA passage rerank task. Our model outperforms other single dual-encoder rerankers over BERT-base and BM25 by 0.7 in MRR@10, achieving the highest Recall@50 and the maximum Recall@1000, which is restricted by the BM25 retrieval results.


Introduction
With the rapid development of the internet, people are facing the problem of obtaining the required response from a large amount of information. Search engines alleviate this issue; however, they can only provide a list of web page results rather than a direct answer. Compared with search engines, Question Answering (QA) systems are able to provide direct answers to satisfy users. The early question answering systems (e.g., [1,2]) are more like expert systems. As QA systems gradually develop, the knowledge source changes from an artificially encoded knowledge base to text collections. Restricted Domains Question Answering (RDQA [3]) aims to answer questions in restricted domains. Compared with RDQA, Open-Domain QA [4] aims to find the answers to any factoid questions using an unlimited knowledge base in any domain.
Due to the gradual maturity of associated techniques in natural language processing, deep learning has been used in almost every stage of Open-Domain QA. DrQA [5] is the first Open-Domain QA model to use Neural Machine Reading Comprehension (Neural MRC) in Open-Domain QA, resulting in a two-stage QA system which consists of a retriever and a reader. In a two-stage QA system, the retriever retrieves a list of passages from a large database, then the reader provides the final answer, the accuracy of which is decided not only by the reader itself but also by the performance of the retriever. Traditional sparse retrievers, e.g., Term Frequency-Inverse Document Frequency (TF-IDF) and Best Match 25 (BM25), are efficient thanks to an inverted index, but face difficulties (e.g., term mismatch [6]) in matching queries and passages. Recently, based on Pre-trained Language Models (PLMs), the dual-encoder has been widely used to learn the relations between queries and passages, and it outperforms sparse retrievers.
There are two major challenges when it comes to training a dual-encoder for document retrieval. The first is the hard negative selection problem. A hard negative is a negative that is more likely to be regarded as a positive by the model and is more valuable than simple negatives (there may be lots of overlapping tokens between the hard negative and the positive). Since most negatives in Open-Domain QA are not labeled, selecting hard negatives for the model is problematic. There are two main training approaches used to select hard negatives for dual-encoders: list-wise and pair-wise. In the list-wise approach, the dual-encoder chooses in-batch negatives, facing the limit of the memory of the GPU (or other devices). When using the pair-wise approach, the dual-encoder uses triples of query, positive, and negative as the training data, leading to issues related to the low quality of training data. Both methods face the hard negative selection problem.
The second is the training data deficiency problem. More training data usually improves the performance of Open-Domain QA models, but it is expensive to acquire labeled data. In recent work, cross-encoders have been used to build labels for unlabeled data, especially in large-scale datasets (e.g., MS-MARCO [7]; Natural Questions [8]), but the method faces two problems: the cross-encoder is too expensive, and the labeled data are limited since they only come from the original unlabeled passages (the total amount of training data is constant).
In this paper, we present several methods with which to alleviate these problems relating to dual-encoder training in Open-Domain QA systems. First, we change the dataset distribution to enable the dual-encoder to learn more about the difference between false negatives and negatives using a fine-tuned BART [9] model. Second, to acquire more training data, we use the BART model to generate queries. We choose the generated queries which have low similarity scores with the original queries as the new queries, the negatives as new positives, and the positives as new negatives. Third, we use the probability of generating the query from the negative as an evaluation of the negative.
Our contributions are as follows:
• We present a BERT-based semantic information selection method, named SS-BERT, to alleviate the hard negative selection problem and the training data deficiency problem;
• We prove that the dual-encoder performs better by focusing on hard negatives in the Open-Domain QA passage rerank task;
• The proposed methods outperform other passage rerankers on MRR@10 and Recall@50 with a single dual-encoder based on the BERT-based model and BM25 retriever.

Related Work
The retriever, the reranker, and the reader are the core modules of the two-stage Open-Domain QA. In this section, we introduce how retrievers and rerankers have developed recently and give a brief review of the sequence-to-sequence method, which is applied in our proposed model.

Retriever Methods
Retriever methods are mainly divided into sparse retrievers and dense retrievers. In recent work, dense retrievers (e.g., [10][11][12]) outperformed traditional sparse retrievers (e.g., TF-IDF and BM25). They usually represent queries and passages as vectors in a low-dimensional vector space (compared with the dictionary size) and calculate their similarity in terms of the dot product.
Retrieval models usually use dual encoders or cross-encoders, which are representation based and interaction based, respectively. Furthermore, the late interaction encoder is representation-interaction based (e.g., [13]). The three methods are shown in Figure 1.
Dual encoders use two encoders to represent queries and passages, respectively (e.g., [10]). Cross-encoders use one encoder to jointly represent queries and passages (e.g., [14][15][16]). The cross-encoder captures the semantic relationships between queries and passages more precisely than the dual encoder, but it requires much more computational resources than the dual encoder, and it is usually used as the passage reranker.
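The difference between representation-based and late-interaction scoring can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by some encoder; the function names are hypothetical, and the late-interaction score follows the MaxSim formulation of ColBERT [13]. A cross-encoder is not shown because it needs a joint forward pass over the concatenated query and passage.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_encoder_score(query_vec: Vector, passage_vec: Vector) -> float:
    """Representation based: one vector per text, compared by dot product."""
    return dot(query_vec, passage_vec)


def late_interaction_score(query_toks: List[Vector],
                           passage_toks: List[Vector]) -> float:
    """Late interaction (MaxSim): every query token keeps the score of its
    best-matching passage token, and the maxima are summed."""
    return sum(max(dot(q, p) for p in passage_toks) for q in query_toks)
```

Because passage vectors do not depend on the query in either function, they can be pre-computed offline, which is the source of the efficiency advantage over cross-encoders.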

Passage Reranker
Recently, a two-stage retriever structure has been proved efficient, which further divides the retriever into two modules: retriever and passage reranker. For example, Rocket QA [14] concatenates the query and the passage as the input to train the reranker, then uses the reranker to drop the retrieval results with low confidence scores and generate labeled data with which to train the cross-encoder. PAIR [16] follows Rocket QA and presents a method for measuring the similarity of passages to drop false negatives. Rocket QA v2 [15] presents dynamic list-wise distillation to jointly train the retriever and reranker. ColBERT [13] presents a light late interaction model over BERT [17] as a reranker.
Since the cross-encoder is expensive, in order to balance efficiency and accuracy, our method uses a dual-encoder as the reranker. We fine-tune a BART [9] model to improve the dual-encoder through negative selection, data augmentation, and passage evaluation, which are described in the Methodology and the Experimental Section.

Sequence-to-Sequence Method
Sequence-to-Sequence (Seq2Seq) is a method that generates a sequence from a given sequence. It was first presented in 2014 [18,19]. GPT [20] uses a diverse corpus of unlabeled text to train a generative pre-trained model, then fine-tunes the model on different tasks, e.g., text generation. BART [9] is one of the best seq2seq models for text generation, combining the advantages of BERT [17] and GPT [20]. BART uses a cascade structure of a bidirectional encoder and an auto-regressive decoder to allow arbitrary noises to avoid model dependence on sequence information. DocT5query [21] follows DocTquery [22]; it uses T5 [23] to generate queries from given documents. The queries can be answered with the documents, then the generated queries are added to the documents. Recent work [24] used BART-large (374M parameters, 12 layers in the encoder and decoder) to generate queries from English Wikipedia in an unsupervised, zero-shot retrieval task, and found that for some datasets, the model trained on the generated dataset outperformed the one trained on the original dataset. Our proposed model uses a fine-tuned BART-base model to generate queries from negatives, which is described in the Methodology and the Experimental Section.

Methodology
In this section, we present a light passage reranker, called SS-BERT, that selects semantic information (including negatives) via the BART [9] model on top of a BERT [17] dual encoder. SS-BERT has the advantage of being a light model which can be used as a module in another retriever framework.

Task Description
To choose passages to obtain the answer to a natural question from a large corpus, a two-stage retriever begins by using a retriever to choose a list of passages from the corpus. Since we present our model as a flexible module for the retriever, we continue to use BM25 as the retriever and keep the top k results of BM25; then we use a passage reranker to rerank the top k results.
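The retrieve-then-rerank flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rerank_top_k`, `toy_overlap_score`, and the input format (a BM25-ordered list of passage and score pairs) are assumptions made for the example, and the toy scorer merely stands in for the dual-encoder reranker.

```python
from typing import Callable, List, Tuple


def rerank_top_k(query: str,
                 bm25_ranked: List[Tuple[str, float]],
                 score_fn: Callable[[str, str], float],
                 k: int = 1000) -> List[str]:
    """Keep the top-k BM25 passages and reorder them by the reranker score.

    bm25_ranked must already be sorted by BM25 score, best first.
    Passages outside the top k can never be recovered by the reranker,
    which is why BM25 bounds Recall@k of the whole pipeline.
    """
    candidates = [passage for passage, _ in bm25_ranked[:k]]
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a reranker: number of shared words."""
    return float(len(set(query.split()) & set(passage.split())))
```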

Negative Selection by Query Generation
The QA system faces the problem of insufficient training data since it is expensive to obtain labeled data. We try to improve the quality of training data to alleviate the problem. In the retrieval task, when using the pairwise training method, training data are in the form of triples consisting of query, positive, and negative (denoted as T(q, p, n)). However, there are more negatives than positives, which means the quality of negatives may not be very high because they are actually unlabeled. Furthermore, hard negatives are not labeled; thus, the model would regard them as simple negatives. Table 1 shows a hard negative case in the MS-MARCO dataset; our proposed negative selection method tries to identify the hard negatives.
Table 1. Hard Negative Case. In this case, the keywords 'a master's degree' are mentioned many times both in the positive and the negative. The cosine similarity of the positive and the hard negative is 0.9, which is calculated as shown in Section 4.2. This kind of negative is called a hard negative; however, a hard negative is always regarded as a common negative, since it is too expensive to label all negatives.

Data Type: Text
hard negative: Normally you need to get a bachelor's degree before you get a master's degree. The bachelor's degree is the basic college degree. It can be completed in three years including summers, or in four years if you take summers off. The master's degree follows it. Any individuals may pursue a master's degree in a field unrelated to their bachelor's degree. My bachelor's degree is in psychology. However, my master's degree is in organizational management.
To alleviate the problem, we fine-tune the BART-base model with the MS-MARCO dataset (using the Huggingface Transformers library [25]) and use the model to generate queries from the negatives (denoted as generated queries q_G). Then, we compute the similarity score (denoted as sim(q, q_G)) between the queries (q) and the generated queries as the cosine similarity of their sentence embeddings (see Section 4.2). We set a threshold (denoted as t_G) and drop the negative if sim(q, q_G) is smaller than t_G. After all the negatives have been selected, we obtain the negatives set (denoted as n_H), which represents the part of the original data which are more similar to the positives and thus more likely to be hard negatives; the pipeline is shown in Figure 2. Then, we mix the triples T(q, p, n) and T(q, p, n_H) in a 1:1 ratio and obtain new triples T_hybrid, which have a data distribution that focuses more on the false negatives. The strategy is shown in Figure 3.
The method enables the dual encoder to learn more about the differences between false negatives and positives. As a result, it performs better, which suggests a novel data pre-processing method for the task (e.g., training a data reranker to change the data distribution and improve the dataset quality). The experimental results are shown in Section 4.3.
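The selection and mixing steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: `jaccard` is a toy stand-in for the sentence-BERT cosine score, the `generated` mapping from negative to generated query is hypothetical, and "1:1 ratio" is interpreted here as taking the same number of triples from each set.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (query, positive, negative)


def jaccard(a: str, b: str) -> float:
    """Toy similarity on word sets; stands in for the sentence-BERT score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_hard_negatives(triples: List[Triple],
                          generated: Dict[str, str],
                          sim: Callable[[str, str], float] = jaccard,
                          t_g: float = 0.3) -> List[Triple]:
    """Keep T(q, p, n_H): drop a negative when sim(q, q_G) < t_G."""
    return [(q, p, n) for q, p, n in triples if sim(q, generated[n]) >= t_g]


def mix_one_to_one(original: List[Triple],
                   selected: List[Triple]) -> List[Triple]:
    """Build T_hybrid with equal counts from both sets (an assumption)."""
    m = min(len(original), len(selected))
    return original[:m] + selected[:m]
```

In practice the combined list would be shuffled before training; the sketch keeps the order deterministic for readability.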

Data Augmentation From Negatives
We set a threshold (denoted as t_A). For each generated query q_G whose similarity score s_G with the original query is lower than t_A, we use the generated query, the negative, and the positive to build a new triple T(q_G, p_new, n_new). A low s_G means the positive is not similar to the negative, so the positive is actually a negative of the generated query. Compared with the original triple T(q, p, n), p is the same as n_new, and n is the same as p_new. We add the new triples T(q_G, p_new, n_new) into T_hybrid and the model performs better. The experimental results are shown in Section 4.3, and a case of data augmentation is shown in Section 4.4.
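The role swap in the new triples can be made concrete with a short sketch. As before, this is only an illustration: `jaccard` is a toy replacement for the sentence-BERT similarity, and the `generated` mapping and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (query, positive, negative)


def jaccard(a: str, b: str) -> float:
    """Toy similarity on word sets; stands in for the sentence-BERT score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def augment_from_negatives(triples: List[Triple],
                           generated: Dict[str, str],
                           sim: Callable[[str, str], float] = jaccard,
                           t_a: float = 0.1) -> List[Triple]:
    """Build T(q_G, p_new, n_new) when s_G = sim(q, q_G) is below t_A.

    The original negative becomes the new positive (it produced q_G),
    and the original positive becomes the new negative.
    """
    new_triples = []
    for q, p, n in triples:
        q_g = generated[n]
        if sim(q, q_g) < t_a:
            new_triples.append((q_g, n, p))
    return new_triples
```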

Passage Evaluation
In passage rerank, we set the query as the target and the passage as the input, denoting the probability of a BART model generating the target from the input as P_t. A high P_t means the passage is more likely positive. In Formula (2), n is the length of a query, tok_t is the t-th token of a query, and s_query is the score for BART to generate the query from the passage. A ratio (denoted as k_s) is used to limit the influence of the score s_query. In passage rerank, we denote the score of a passage as s_passage (s_passage is calculated as in ColBERT [13]). With our method, the passage score becomes s_passage + k_s · s_query (see Section 4.2). The experimental results are shown in Section 4.3.
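A minimal sketch of the re-scoring step follows. The exact form of Formula (2) is not recoverable from the text, so the length-normalised log-likelihood used for s_query here is an assumption; only the combination s_passage + k_s · s_query is taken from the description in Section 4.2. The function names and the default k_s = 0.1 (a value inside the reported range [0.05, 0.15]) are likewise illustrative.

```python
import math
from typing import Sequence


def query_generation_score(token_probs: Sequence[float]) -> float:
    """s_query as the mean log-probability of the n query tokens.

    token_probs[t] is the probability the generator assigns to tok_t given
    the passage and the previous tokens. Averaging over n is an assumption.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def rescored_passage(s_passage: float, s_query: float,
                     k_s: float = 0.1) -> float:
    """Combine the reranker score with the generation score, scaled by k_s."""
    return s_passage + k_s * s_query
```

A small k_s keeps the dual-encoder score dominant, so the generation score mostly breaks near-ties between passages.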

Experiments
In this section, we describe the experimental setting details, including how to make the generated query reusable for saving computing resources, evaluation metrics, experimental results, and case studies. This paper uses MS-MARCO [7] to evaluate the proposed model. MS-MARCO is one of the most popular Open-Domain QA datasets. In MS-MARCO, all queries are sampled from real anonymous user queries through Bing or Cortana. The passages are extracted from real web documents by the Bing search engine. The answers are human-generated from the context passages and are strongly encouraged to be in the form of a complete sentence.
The dataset has 502,939 queries in the training set; 6980 queries in the dev set; 6837 queries in the test set; 8,841,823 non-redundant passages and 39,769,172 triples (a triple consists of a query, a positive, and a negative; the negatives are from the 8,841,823 passages, which means each passage is used 4.5 times on average).Each query has an average of only 1.07 positives and 5.97 words; each passage has an average of 56.58 words.
In the MS-MARCO dataset, since the queries come from real anonymous users, only part of the queries contain keywords such as "what", "how", "where", etc. As mentioned in Section 3.3, our proposed method generates queries from negatives with a fine-tuned BART-base [9] model; the distribution of queries' keywords in the generated queries is compared to that in the MS-MARCO dataset in Table 2. The top two words in both datasets are "what" (42.2% and 43.8%) and "how" (15.3% and 19.5%). The third word in MS-MARCO is "where" (4.4%; 4.3% in the generated dataset), and the third word in the generated dataset is "which" (4.9%; 1.7% in MS-MARCO). The similar keyword distribution between the two datasets indicates the generated queries are reliable.
MRR calculates the averaged reciprocal of the rank at which the first positive passage is retrieved. When using MRR, we focus more on the EM (exact match) part than with R@k. When we use MRR@k to evaluate the model on N queries, it is the mean over the queries of 1/p_i, where p_i represents the rank of the true positive for the i-th query.
R@k calculates the proportion of queries for which the top k retrieved passages contain positives. We use R@50 and R@1000 to evaluate the reranker. When we use R@k to evaluate the model on N queries, it is the mean over the queries of TP_j, where TP_j = 1 when the j-th sample is a true positive and TP_j = 0 when the j-th sample is negative (one true positive at most).
Rouge. Rouge-N [26] calculates the n-gram recall between a candidate text and a reference text, and Rouge-L calculates the longest common sub-sequence between the candidate and the reference. We use Rouge-1, Rouge-2, and Rouge-L to evaluate the fine-tuned BART-base model mentioned in Section 3.2. Rouge-N is calculated as Formula (6), where M is the number of candidates and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate and the reference. Rouge-L is calculated as Formula (7), where β is a large constant, R_lcs is obtained by dividing LCS(R, C) (the longest common sub-sequence of the candidate and the reference) by the length of the reference, and C_lcs is obtained by dividing LCS(R, C) by the length of the candidate.
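The metrics above can be written out as a short sketch. The MRR@k and R@k functions follow the verbal definitions in the text; the Rouge functions use the standard forms from [26] (n-gram recall, and the LCS-based F-measure with weight β), which is an assumption about Formulas (6) and (7). All function names are illustrative, and ranks are assumed to be 1-based with `None` meaning that no positive was retrieved.

```python
from collections import Counter
from typing import List, Optional, Sequence


def mrr_at_k(ranks: Sequence[Optional[int]], k: int = 10) -> float:
    """Mean of 1/p_i over N queries; a query contributes 0 if p_i > k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Share of queries whose positive appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram matches divided by the reference n-gram count."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common sub-sequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str], beta: float) -> float:
    """F-measure of R_lcs and C_lcs; a large beta weights recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(reference, candidate)
    r_lcs, c_lcs = lcs / len(reference), lcs / len(candidate)
    denom = r_lcs + beta ** 2 * c_lcs
    return (1 + beta ** 2) * r_lcs * c_lcs / denom if denom else 0.0
```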

Implementation Details
We choose the MS-MARCO triples dataset, which has a format of query, positive passage, and negative passage. The triples are disordered and the negatives are repeated. We build labels between each triple and non-redundant negative (a non-redundant negative means one of the 8,841,823 passages mentioned in Section 4.1.1), then build labels between each negative and its generated query. Once we acquire all generated queries from the non-redundant negatives, we use the labels to distinguish the relationships among queries, positives, negatives, and generated queries when we try new methods or other hyper-parameters.

Passage Rerank
Fine-tune PLM. The Huggingface Transformers library [25] provides a method to fine-tune the BART [9] model on the CNN-Dailymail dataset [27,28] of paired news articles. We follow the framework, preprocess the MS-MARCO query-positive pairs to suit Huggingface Transformers, and fine-tune BART-base on these data. In this paper, for efficiency, we choose BART-base (6 layers in the encoder and decoder) to generate queries from negatives. The fine-tuning process costs 5 h on 4 NVIDIA Tesla V100 GPUs (with 16G RAM). The version of the transformers library is 4.17.0. The train:test:valid ratio is 8:1:1, dropout is 0.1, the number of epochs is 5, the batch size is 64, the number of optimization steps is 26,000, and the gradient accumulation step is 1. Cases of the generated queries are shown in Table 3, and the performance of the fine-tuned BART-base model is shown in Table 4, which indicates the degree of confidence of the generated queries.

Table 3. Query generation case. In this case, we choose three query-negative pairs to visually show the quality of the generated queries.

Generated query: How to apply for Schengen Visa
Negative: To apply for a Schengen Visa you should apply to the embassy of the first country you intend to visit. If you plan to visit Italy, France, and Spain (in that order) then you only need to apply to the Italian embassy and your visa is good for France and Spain too.

Generated query: What diseases do finches carry
Negative: However, in terms of diseases that can be transmitted to other birds, finches can carry pretty much any avian disease, including influenza viruses, Newcastle's disease virus, many different bacterial diseases and both internal and external parasites.

Generated query: What is the function of capillaries
Negative: Veins carry blood from the other parts of the body to the heart. They have valves to stop the backward flow of blood. Capillaries are only one cell thick in reality, and they help to diffuse substances from the blood to the cell through the cell membrane (which is selectively permeable).

Negative Selection. We first process the MS-MARCO triples (query, positive, and negative) train set and sort it. Then, we build maps from triples to non-redundant negatives and use the fine-tuned BART-base model to generate queries (denoted as q_G) from the non-redundant negatives. Then, we use sentence-BERT [29] to calculate the similarity score (denoted as s_G) of queries (denoted as q) and q_G by dot product (using cosine similarity as the metric). Then, we drop the negatives of which the s_G is lower than a threshold (denoted as t_G); the distribution of s_G is in Table 5. We set the threshold t_G as 0.3, which means we keep the top 50% of similar negatives for each positive, since these negatives are more likely to be hard negatives. Then, we mix the selected triples and the original triples in a 1:1 ratio. Finally, we train the dual encoder with the hybrid train set.
Table 5. The distribution of the similarity scores between queries and generated queries. The distribution is important evidence for the threshold selection in the negative selection method and the data augmentation method, which indicates the degree of confidence of the selected data. The similarity scores are calculated through sentence-BERT [29].

Data Augmentation. We set the threshold t_A as 0.1, which means we choose about 8 M generated queries to build new triples. Seen from the distribution of similarity scores, the triples are approximately the top 20% of credible generated data. The rationality of the chosen threshold is discussed in Section 4.3, and a case of the new triples is shown in Section 4.4.

Passage Evaluation. The score for generating the query from the passage is denoted as s_query; we add (k_s · s_query) to s_passage to re-evaluate the passage. The ratio k_s should not be too high: since s_passage and s_query are similar in size, a big k_s intuitively means we regard a model fed with about 500 k data as reliable as one fed with about 50 M data. When k_s is in [0.05, 0.15], the model performs best. The proposed method enhances the robustness of the model, thus improving its performance.

Compared Methods
We report the results of the following baselines: BM25 (official), K-NRM [30], Duet [31,32], fastText+ConvK-NRM [33], ColBERT [13], and COIL [34]. Duet and fastText+ConvK-NRM are representatives of neural matching models that have been tested in the MS-MARCO [7] passage rerank task. ColBERT and COIL are both BERT-based and have a single dual-encoder architecture. We do not show the result of BERT-large models, since the large model with more parameters and neural network layers surely improves the model performance. We do not compare the result with Multi-stage BERT [35], which has a tandem structure of a list-wise BERT and a pair-wise BERT, because our model can be seen as a pair-wise BERT of the architecture. Detailed descriptions of the baselines are given below:
• BM25 (official) is a traditional bag-of-words information retrieval method (a sparse retriever). The rerank task of the following compared methods is based on the BM25 retriever result, which limits the Recall@1000 to 0.814.
• K-NRM: soft-match of queries to documents is a weaker signal compared with an exact match. K-NRM focuses on soft-match feature extraction through kernels, uses a kernel-pooling technique to build word embeddings, then uses a translation matrix to model word-level similarities.
• Duet uses two deep neural networks: a local sub-model to match the term space of the queries and documents, and a distributed sub-model to match the learned latent space of the queries and documents.
• fastText+ConvK-NRM conducts a set of experiments on K-NRM, ConvK-NRM [36], and MatchPyramid [37], and presents a method that adopts sub-word token embeddings to alleviate the absence of low-frequency words in the word embeddings list.
• COIL gives the matching scores between queries and documents through overlapping query document tokens' contextualized representations. COIL-tok uses the exact match of tokens, and COIL-full uses CLS matching in addition.
• ColBERT introduces a late interaction architecture to model the similarity between queries and documents.
The results are in Table 6. Our SS-BERT outperforms other neural matching models and dual-encoders based on BERT-base and the BM25 retriever. All our proposed methods improve MRR@10 and achieve the maximum of Recall@1000, and SS-BERT(d) has a higher Recall@50 than the baselines. The result shows that changing the data distribution improves the model performance, since the method enables the dual-encoder to focus more on hard negatives. Data augmentation from negatives improves the model performance, since the method provides the dual-encoder with more reliable training data. Passage evaluation improves the model performance, since it enhances the robustness of the model.

Table 6. The results in the MS-MARCO passage rerank task. PLM means the pre-trained language model used in the dual-encoder. SS-BERT(t) means the model uses the negative selection method; SS-BERT(h) means the model uses the negative selection method and the data augmentation method; SS-BERT(d) means the model uses the negative selection method, the data augmentation method, and the passage evaluation method.
Columns: Methods, PLM, MRR@10, Recall@50, Recall@1000.

To justify the threshold t_A (t_A = 0.1) chosen in Section 4.2, we vary t_A. As shown in Table 5, the amount of new triples consisting of generated queries grows as t_A grows; however, a higher t_A means the generated queries have a lower degree of confidence.
Figure 4 shows how the model performs with different t_A. A higher t_A means the new triples have a lower degree of confidence; thus, the model performs worse when the threshold is too high, since it provides the model with too much "bad data". Figure 5 shows how many training steps the model needs to perform best with different t_A. A higher t_A means more training data; thus, the model training cost is much higher. In both experiments, the chosen values of t_A are [0.10, 0.16, 0.22, 0.28, 0.34, 0.40], which form an arithmetic sequence.

Case Study
We choose a case of the data augmentation, shown in Table 7. The original training triple consists of a query, a positive, and a negative, and the generated query is generated from the original negative. From the table, we see that the original negative contains the answer to the generated query while the original positive does not. Therefore, we acquire a new triple that regards the generated query as the new query, the original negative as the new positive, and the original positive as the new negative. In the table, the main overlapping tokens are in bold. We see that although the positive and negative have lots of overlapping tokens, our method finds that they are indeed different, which ensures the quality of the new triple.

Conclusions
Recent work focuses on the two-stage Open-Domain QA system, in which improvements in both the retriever and the reader can improve the performance of the system. Our work tries to improve the reranker module in a retriever-reranker structure.
In this paper, to alleviate two problems in Open-Domain QA, namely high-quality negative selection and insufficient high-quality training data, we propose SS-BERT, a semantic information selecting method for Open-Domain QA passage rerank. We propose three methods to re-evaluate labeled data and create new labeled data: negative selection by query generation, data augmentation from negatives, and passage evaluation. The experiments show that our methods are effective: among all the dual-encoder rerankers based on BERT-base and BM25, our proposed model performs best on MRR@10 (all the methods are effective) and Recall@50, and achieves the highest Recall@1000, which is limited by the BM25 retrieval result.
SS-BERT is a light model used for training a dual encoder; we use light pre-trained models (e.g., BERT-base, BART-base) to finish the dual-encoder training and query generation. We believe the absolute value of the model performance will significantly improve if we replace the base models with other state-of-the-art pre-trained models, which may cost much more computational resources. In the future, we plan to apply our method to train a cross-encoder and to other datasets, which requires much more computing resources (we have only two NVIDIA Tesla V100 GPUs with 16G RAM most of the time).

Figure 1. Retriever Methods. Among the methods, dual-encoder is more efficient, cross-encoder is more accurate, and late-interaction looks for a balance of efficiency and accuracy.

Table 1 (query and positive rows).
query: How long it takes to get a master's degree
positive: In most cases, a master's degree program takes two years to complete, although there are exceptions to the rule. If you'd like to know how long it would take to earn a master's degree, you should consider how much time you could devote to school and the specific type of program you will be enrolling in.

Figure 2. The pipeline of negative choices from the original dataset. (1) A BART model is fine-tuned with query-positive pairs; (2) the fine-tuned BART model generates queries from the negatives; (3) the similarity scores of queries and generated queries are calculated after being encoded by a BERT model; (4) the negatives of which the similarity scores are higher than the threshold are chosen.

Figure 3. Acquire the hybrid triples. The chosen triples are more likely to be hard negatives.

Figure 4. This figure shows the model performance with different thresholds t_A, which present different amounts of new triples. When the amount of new triples grows, the model performs worse since the data quality is worse.
March 3, 2023 submitted to Electronics

Figure 5. This figure shows the training steps for the model to perform best with different thresholds t_A, which present different amounts of new triples. The model costs much more computational resources when t_A is too big. Every 100,000 training steps cost about 17 h for two NVIDIA Tesla V100 GPUs with 16G RAM to train.

Table 2. The distribution of queries' keywords of MS-MARCO and SS-BERT generated queries.
Following previous work, we use Mean Reciprocal Rank (MRR) and Recall at Top k (R@k) to evaluate the performance of the reranker.

Table 4. The performance of the fine-tuned BART-base model.

Table 7. Data Augmentation Case. In this case, since there are lots of overlapping tokens ('interest', 'account', which are in bold type) between the original positive and negative, it is hard for the model to judge whether the positive and negative are similar through term frequency. We calculate the similarity between the generated query and the original query, and the similarity score is lower than 0.1 (using cosine similarity), so we regard the original positive as the negative of the generated query and build a new triple.

Data type: Text
original query: $10,000 at 5% interest term deposit how much interest
original positive (new negative): With simple interest, interest is only paid at the end of a specified term. A term deposit is an example of an account that will earn simple interest not compound interest. If you invested $10,000 at 5% per year, you would earn $2500 in simple interest after 5 years, $500 for each year. If you invested $10,000 at 5%, you would earn $2834 in compound interest after 5 years, giving you a total of $12,834.
original negative (new positive): A full offset means that 100% of the funds in your offset account will be deducted from what you owe on your home loan before interest is calculated. A partial offset gives you a reduced interest rate on the part of your home loan equal to the balance of your offset account and while your money is working hard to reduce the interest you pay, your offset account will also be every bit as an everyday transaction account. This means you pay less interest on your home loan.