Recent Advances in Dialogue Machine Translation

Recent years have seen a surge of interest in dialogue translation, which is a significant application task for machine translation (MT) technology. However, this has so far not been extensively explored due to its inherent characteristics including data limitation, discourse properties and personality traits. In this article, we give the first comprehensive review of dialogue MT, including well-defined problems (e.g., 4 perspectives), collected resources (e.g., 5 language pairs and 4 sub-domains), representative approaches (e.g., architecture, discourse phenomena and personality) and useful applications (e.g., hotel-booking chat system). After systematic investigation, we also build a state-of-the-art dialogue NMT system by leveraging a breadth of established approaches such as novel architectures, popular pre-training and advanced techniques. Encouragingly, we push the state-of-the-art performance up to 62.7 BLEU points on a commonly-used benchmark by using mBART pre-training. We hope that this survey paper could significantly promote the research in dialogue MT.


Introduction
Dialogue is a written or spoken conversational exchange between two or more people, expressing human emotions, moods, attitudes, and personality [1]. Nowadays there is a huge demand for cross-language dialogue communication between people, and advances in machine translation (MT) for improving communication (or translation) have been seen in recent years. MT is a sequence-to-sequence prediction task, which aims to find for a source language sentence the most probable target language sentence that shares the most similar meaning. Dialogue machine translation comprises a number of significant application domains such as audiovisual subtitles, meeting transcripts, instant messaging, and speech-to-speech interpretation. Although neural machine translation (NMT) [2][3][4] has achieved great progress in recent years, translating dialogues is still a challenging task due to its inherent characteristics such as data limitation, irregular expressions, discourse properties, and personality traits. To address these problems, a number of studies have sought to improve the quality of dialogue translation systems.
Although some researchers have explored ways to construct data for modeling dialogues [5][6][7][8][9], parallel data are still too scarce to build robust dialogue translation models. As a result, previous work has been hampered by a lack of dialogue-domain datasets [10][11][12]. In contrast to the translation of general domains (e.g., news), in which the text is carefully authored and well formatted, dialogue conversations are less planned, more informal, and often discourse-aware. One research direction investigates incorporating dialogue history into document-level NMT architectures [13,14], which aims to implicitly enhance the ability to model coherence and consistency. On the other hand, some research has explicitly modelled various discourse phenomena in dialogues, such as anaphora [15][16][17] and discourse connectives [18,19]. Furthermore, recent studies investigated the effects of inherent characteristics on translating dialogues, including speaker information [20], role preference [21] and topics [11].
In recent years, there has been growing interest in modeling dialogue machine translation. In this article, we aim to give a comprehensive survey of the recent advances in dialogue MT. First, we systematically define four critical problems in dialogue translation by reviewing a large number of related works. Second, we collect nearly all existing corpora for the dialogue translation task, covering 5 language pairs and 4 sub-domains. Third, we introduce three representative approaches, covering the architecture, discourse phenomenon and personality aspects, respectively. Last, we discuss an example of real-life applications, demonstrating the importance and feasibility of a dialogue translation system. Furthermore, we explore the potential of building a state-of-the-art dialogue translation system by leveraging a breadth of established approaches. Experiments are conducted on a task-oriented translation dataset that is widely used in previous studies (i.e., WMT20 English-German). Encouragingly, we push the SOTA performance up to 62.7 BLEU points on the benchmark by using the mBART pre-training method.
This paper describes highlights of recent advances in dialogue machine translation:
1. Previous works mainly exploited dialogue MT from the perspectives of coherence, consistency, and cohesion. Furthermore, recent studies began to pay more attention to the issue of personality, such as role preference.
2. Although there are some related corpora, the scarcity of training data remains one of the crucial issues, which severely hinders the further development of deep learning methods for real applications of dialogue translation.
3. Existing approaches can be categorized into three main strands. One research line is to exploit document-level NMT architectures, which can improve the consistency and coherence of translation output. The second one tries to deal with specific discourse phenomena such as anaphora, which can lead to better cohesion in translations. The third line aims to enhance the personality of dialogue MT systems by leveraging additional information labeled by humans. In future work, it is necessary to design an end-to-end model that can capture various characteristics of dialogues.
4. Through our empirical experiments, we gain some interesting findings: (1) data selection methods can significantly improve the baseline model, especially for small-scale data; (2) large batch learning works well, which makes sentence-level NMT models perform the best among different NMT models; (3) document-level contexts are not always useful for dialogue translation due to the limitation of data; (4) transferring general knowledge from pretrained models is helpful to dialogue MT.
The rest of this article is organized as follows: we first introduce the fundamental knowledge on NMT (including the models, frameworks, and evaluation metrics) and basic information on dialogue translation (including theory, definition and characteristics) in Section 2. Section 3 gives a comprehensive review of problems, resources, approaches, and real-life applications for the dialogue translation task. We explore building a state-of-the-art dialogue translation system by combining advanced techniques in Section 4. Finally, we summarize the content of this article in Section 5.

Preliminary
Without loss of generality, we provide the fundamental knowledge on machine translation and dialogue translation in this section.

Machine Translation
As an active research field in NLP, the task of MT is to translate texts from one language to another language. It is a challenging task for MT to generate high-quality translation, because computers need to thoroughly understand the text in the source language and have a good knowledge of the target language. In the last several decades, scientific research in the field of MT has experienced three main historical periods including Rule-based Machine Translation (RBMT) [22], Statistical Machine Translation (SMT) [23] and Neural Machine Translation (NMT) [24,25], and each of these models has significantly improved the performance of MT systems.

Statistical Machine Translation
Assume that a sentence pair $x = \{x_1, \dots, x_i, \dots, x_I\}$ and $y = \{y_1, \dots, y_j, \dots, y_J\}$ are in source and target side, respectively. $x_i$ is the $i$-th word of $x$ and $y_j$ is the $j$-th word of $y$. $I$ and $J$ are lengths of $x$ and $y$, which can be different. Based on Bayes decision theory, we can formulate SMT [26] as:

$$\hat{y} = \arg\max_{y} p(y|x) = \arg\max_{y} \frac{p(x|y)\, p(y)}{p(x)} = \arg\max_{y} p(x|y)\, p(y)$$

where $\hat{y}$ denotes the translation output with the highest translation probability. The translation problem is factored into $p(x|y)$ and $p(y)$, representing the inverse translation probability and language model probability respectively. The denominator $p(x)$ is ignored since it remains constant for a given source sentence $x$. The advantage of this decomposition is that we can learn separate probabilities in order to compute $\hat{y}$. Och and Ney [27] proposed a log-linear model, which incorporates different features containing information from the source and target sentences in the model, in addition to the language and translation models of the original noisy channel [28] approach. Figure 1a describes the architectures of phrase-based SMT [29], which consists of several components: (1) words within the parallel corpus are aligned and phrase pairs are then extracted based on word-alignment results [30]; (2) the translation model and the lexicalized reordering model can be learned using aligned phrases; (3) an n-gram language model can be built using a large number of monolingual sentences in the target language [31]; (4) these models are optimized under the log-linear framework in order to maximize the performance using a development set [32]; (5) with the optimized weight parameters of the features in the models, we can finally translate the test set and the evaluation score indicates the performance of the whole system.
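As a minimal sketch of the noisy-channel decision rule described above, the following Python snippet selects the candidate $y$ that maximizes $\log p(x|y) + \log p(y)$. The helper `noisy_channel_best` and all probabilities are hypothetical toy values chosen for illustration; a real SMT decoder searches a far larger hypothesis space and combines more features under the log-linear model.

```python
import math


def noisy_channel_best(candidates):
    """Return the candidate y maximizing log p(x|y) + log p(y).

    `candidates` maps each target sentence y to a pair
    (p_x_given_y, p_y). p(x) is omitted because it is constant in y.
    """
    def score(item):
        _, (p_x_given_y, p_y) = item
        return math.log(p_x_given_y) + math.log(p_y)

    best_y, _ = max(candidates.items(), key=score)
    return best_y


# Toy probabilities, assumed purely for illustration.
toy_candidates = {
    "i would like a room": (0.20, 0.010),
    "i want like room": (0.30, 0.001),
    "a room i would like": (0.20, 0.002),
}
best = noisy_channel_best(toy_candidates)
```

In this toy setting the second candidate has the highest translation-model probability, but the language model penalizes its disfluency, so the fluent first candidate wins.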

Neural Machine Translation
In recent years, NMT [2,24,25] has made significant progress towards constructing and utilizing a single large neural network to handle the entire translation task. A standard NMT model directly optimizes the conditional probability of a target sentence $y = y_1, \dots, y_J$ given its corresponding source sentence $x = x_1, \dots, x_I$:

$$P(y|x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta)$$

where $\theta$ is a set of model parameters and $y_{<j}$ denotes the partial translation. The probability $P(y|x; \theta)$ is defined on the neural network based encoder-decoder framework [25,33], where the encoder summarizes the source sentence into a sequence of representations $H = H_1, \dots, H_I$ with $H \in \mathbb{R}^{I \times d}$, and the decoder generates target words based on the representations. Typically, this framework can be implemented as a recurrent neural network (RNN) [2], convolutional neural network (CNN) [4], and Transformer [3]. The Transformer, shown in Figure 1, has emerged as the dominant NMT paradigm among the different models and is used as a sentence-level baseline in this work.
We use automatic evaluation metrics to evaluate the translation quality. BLEU [34] is the most commonly-used one, which is reference-based and computed over the entire test set. The output of BLEU is a score between 0 and 100%, indicating the similarity between the MT outputs and the reference translations. The higher the score is, the better the translation is. It is computed based on a modified n-gram precision:

$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} \frac{1}{N} \log \frac{|m_n \cap m_r|}{|m_n|} \right)$$

where $n$ represents the order of the n-grams compared between the translations and references. Typically, $n$ is from 1 to 4 (i.e., $N = 4$). $m_n$ and $m_r$ indicate the n-grams occurring in the MT outputs and the corresponding references respectively. $|m_n \cap m_r|$ is the number of n-grams occurring in both translations and references. $BP$ is the brevity penalty to penalize translations that are shorter than the references.
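To make the metric concrete, the snippet below is a minimal corpus-level BLEU sketch with uniform n-gram weights, clipped n-gram matches, and a brevity penalty. It assumes pre-tokenized input and exactly one reference per hypothesis, and the function names are our own; it is meant only to illustrate the computation, and an established evaluation tool should be preferred for reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Counter intersection keeps minimum counts, i.e. clipped matches.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # some n-gram order has no match at all
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: only outputs shorter than the reference are penalized.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_precision)


reference = "the hotel is near the station".split()
perfect = corpus_bleu([reference], [reference])
truncated = corpus_bleu(["the hotel is near".split()], [reference])
```

The truncated hypothesis matches every n-gram it produces, so its score is reduced only by the brevity penalty.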

Dialogue Translation
Dialogue is a written or spoken conversational exchange between two or more people, and a literary and theatrical form that depicts such an exchange. It is an essential component of social behaviour to express human emotions, moods, attitudes, and personality. In the context of dialogue modeling, we divide dialogue into two types: task-oriented and open-domain. Specifically, a task-oriented dialogue system makes users communicate in a task-based fashion: (1) help users achieve their specific goals; (2) focus on understanding users, tracking states, and generating subsequent actions; (3) minimize the number of turns (i.e., the fewer turns, the better). On the other hand, an open-domain dialogue system aims to establish long-term connections with users by satisfying the human need for communication, affection, and social belonging.
A typical scenario for such an application is translating dialogue texts, in particular the record of group chats or movie subtitles, which helps people of different languages understand cross-language chat and improve their comprehension capabilities. Although NMT has achieved great progress in recent years, translating conversational text is still an important and challenging application task. In contrast to the translation of common domains (e.g., newswire and biomedical), in which the text is carefully authored and well-formatted, dialogue conversations are less planned, more informal, and often context-aware. More specifically, few researchers have investigated how to improve the MT of conversational material by exploiting its internal structure. This lack of research on dialogue MT is surprising, since dialogue exhibits more cohesiveness than a single sentence and at least as much as textual discourse. In natural dialogues, speakers may make mistakes or use so-called irregular expressions. One of the most challenging problems that dialogue MT must deal with is translating irregular expressions in natural conversation, such as ungrammatical, incomplete, or ill-formed sentences. However, most existing machine translation systems reject utterances with irregular expressions. Furthermore, this task has so far not been extensively explored, largely due to the lack of publicly available datasets.

Overview of Dialogue Machine Translation
In this section, we make a survey of problems (in Section 3.1), resources (in Section 3.2), approaches (in Section 3.3), and real-life applications (in Section 3.4) for a dialogue machine translation task.

Dialogue Translation Issues
Dialogue machine translation differs from other translation tasks, e.g., news and biomedical, mainly because the conversations are bilingual, less planned, more informal, and often discourse-aware. Furthermore, such conversations are usually characterized by shorter and simpler sentences and contain more implicit information. According to the inherent characteristics of dialogue, we divide the issues of dialogue translation into four perspectives: coherence, consistency, cohesion, and personality. As shown in Figure 4, each perspective contains its sub-fields and related works. Note that most methods are used in general-domain translation, but can also be employed for the dialogue translation task.
Coherence is created referentially when different parts of a text refer to the same entities, and relationally, by means of coherence relations such as "Cause-Consequence" between different discourse segments [35]. Some researchers attempt to exploit the discourse trees (e.g., Rhetorical Structure Theory [36]) of the input texts to infer more coherent translations [37][38][39][40][41]. Another research line investigates the effects of specific phenomena, such as discourse connectives and relations, on MT [18,19,42]. Besides, document-level NMT architectures have been proposed to implicitly model information across sentences [13,14].
Consistency is another critical issue in dialogue MT, where a repeated term should keep the same translation throughout the whole text [43]. The underlying assumption is that the same concepts should be consistently referred to with the same words in a translation. To alleviate inconsistency problems, some researchers have investigated different approaches for MT and evaluation, which can be divided into different aspects, such as verb tense [44][45][46], entity/terminology [43,47], and sentiment [48]. Furthermore, document-level NMT can also improve translation consistency through cache-based methods [49,50], document-level decoding [51,52], and document-level architectures [13,53].
Cohesion is a surface property of the text that is realized by explicit clues. It occurs whenever "the interpretation of some element in the discourse is dependent on that of another" [54]. Some researchers have investigated approaches of incorporating anaphora/coreference information to improve the performance of MT [51,55]. Zero pronoun (ZP) is a more complex case of anaphora, where pronouns are often omitted when they can be pragmatically or grammatically inferable from intra- and inter-sentential contexts [56]. This severely harms MT systems since the translation of such missing pronouns cannot be normally reproduced, and several works have addressed this problem [15,16,57,58]. Lexical cohesion refers to the way related words are chosen to link elements of a text. Some studies have tried to model lexical cohesion for both MT and evaluation tasks [59,60].
Personality is the specific set of qualities and interests that make a person unique and unlike others. It is one of the major challenges in conversational systems, which aims to present a consistent personality [61]. Due to the lack of explicitly modeling such inherent characteristics (e.g., role preference), dialogue translation systems cannot obtain satisfactory results [12]. Therefore, recent studies have investigated different inherent characteristics of dialogue translation, including speaker identification [20,62], role preference [21,63], and topic [11,64].

Existing Data
Translating dialogue has so far not been extensively explored in prior MT research, largely due to the lack of publicly available data sets [12]. Prior related work has mostly focused on movie subtitles and European Parliament speeches. To alleviate this problem, the WMT2020 Shared Task (https://www.statmt.org/wmt20, accessed on 20 November 2021) created a corpus on task-oriented dialogue translation, namely BConTrasT [9].
Some work regarding bilingual subtitles as parallel corpora exists, but it lacks rich information between utterances [10,65–70]. Other work focuses on mining the internal structure in dialogue data from movie scripts. However, these are monolingual data, which cannot be used for MT [5][6][7][8]. In general, the fact is that bilingual subtitles are ideal resources to extract parallel sentence-level utterances, and movie scripts contain rich information such as dialogue boundaries and speaker tags. Recently, some works explored constructing parallel dialogue data with rich information [11,15,20].
The detailed corpora for the dialogue translation task are summarized as follows and in Table 1.
OpenSubtitle [71] consists of subtitles that were originally crawled from the movie subtitle website (http://www.opensubtitles.org, accessed on 20 November 2021). Bilingual subtitles are ideal resources to extract parallel utterances because a large amount of data is available. Most subtitle translations are simple and short, and they do not preserve the syntactic structures of their original sentences at all. Previous works on dialogue translation usually randomly select some episodes as the validation set, and the others as the test set. In total, it contains 62 language pairs, and researchers mainly exploited the commonly-cited French-English, Spanish-English and Russian-English pairs.
TVSub (https://github.com/longyuewangdcu/tvsub, accessed on 20 November 2021) consists of subtitles extracted from TV episodes rather than movies, in contrast to the OpenSubtitle corpus [15]. The dataset covers the Chinese-English language pair. Its source-side sentences are automatically annotated with zero pronouns by a heuristic algorithm [58] (the annotation recovers dropped pronouns with the correct pronoun words). Thus, it can be generally used to study dialogue translation as well as the zero anaphora phenomenon. More than two million sentence pairs were extracted from the subtitles of television episodes. The multiple references and zero pronoun labels in the validation and test sets were manually produced.
MVSub (http://longyuewang.com/corpora/resource.html, accessed on 20 November 2021) is extracted from a classic American TV series, namely Friends [11]. It contains speaker tags and scene boundaries, which are all manually annotated according to their corresponding screenplay scripts. Thus, it can be generally used to study dialogue translation as well as personality characteristics. The dataset contains 100 thousand Chinese-English sentence pairs, and validation and test sets are well designed.
IWSLT-DIALOG (http://iwslt2010.fbk.eu/node/33, accessed on 20 November 2021) is drawn from the Spoken Language Databases (SLDB) corpus, a collection of human-mediated cross-lingual dialogues in travel situations. In addition, parts of the BTEC corpus are also provided to the participants of the DIALOG Task [72]. The dataset contains very limited Chinese-English sentence pairs. The validation and test sets are not available, so researchers usually randomly select parts of the data for these purposes. Ref. [73] pointed out that NMT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings. The DIALOG data are difficult to translate given the variety of topics in quite small-scale training data.
BConTrasT (https://github.com/Unbabel/BConTrasT, accessed on 20 November 2021) is first provided by WMT 2020 Chat Translation Task, which is translated from English into German and is based on the monolingual Taskmaster-1 corpus [9]. The conversations (originally in English) were first automatically translated into German and then manually post-edited by human editors, who are native German speakers. Having the conversations in both languages allows us to simulate bilingual conversations in which one speaker, the customer, speaks in German and the other speaker, the agent, answers in English. The training, validation and test sets contain utterances in task-based dialogues with contextual information.
BMELD (https://github.com/XL2248/CPCC, accessed on 20 November 2021) is created based on the dialogue dataset in the MELD (originally in English) [74]. Ref. [20] first crawled the corresponding Chinese translations from a movie website, which were then manually post-edited according to the dialogue history by native Chinese speakers, who are postgraduate students majoring in English. Finally, they assume 50% of speakers as Chinese to keep data balance for Chinese-to-English translations and build the bilingual MELD (BMELD). MELD is a multimodal EmotionLines dialogue dataset, each utterance of which corresponds to a video, voice, and text, and is annotated with emotion and sentiment.
Europarl (https://www.statmt.org/europarl, accessed on 20 November 2021) is extracted from the proceedings of the European Parliament. Sentences are usually long and formal, as is typical of official proceedings. It contains 21 European language pairs [75].

Representative Approaches
As discussed in Section 3.1, there are different strands of research in the literature. One attempts to exploit the macroscopic structure of the input texts to infer better translations in terms of discourse properties, including cohesion, coherence, and consistency. Other work deals with specific linguistic phenomena that are governed by discourse-level processes, such as the generation of anaphoric pronouns and translation of discourse connectives. These strands are not isolated, but closely related to each other. For instance, document-level information can not only improve the overall performance of MT but also alleviate inconsistency problems at the same time. Furthermore, some researchers investigated effects of characteristics of dialogue on MT [11,20,21]. Instead of reviewing all existing approaches, we mainly introduce three representative ones: document-level architecture, discourse phenomena for dialogue MT, and translation with speaker information.
Architecture: Document-Level NMT
Document-level NMT aims to consider both the current sentence and its larger context in a unified model to improve translation performance, especially with respect to discourse properties. Figure 5 introduces a classic document-level NMT model, namely multi-encoder [13,53,55]. Taking [55] as an example, it employs $(N-1)$ layers of a context encoder to summarize the larger context from source-side previous sentences, and $(N-1)$ layers of a standard encoder to model the current sentence. At the last layer, they integrate the contextual information with the source representations using a gating mechanism. Finally, the combined document-level representations are fed into the NMT decoder to translate the current sentence. Given a source sentence $x_i$ to be translated, we can consider its $K$ previous sentences in the same document as source context $C = \{x_{i-K}, \dots, x_{i-1}\}$. The source encoder employs multi-head self-attention $\mathrm{ATT}(\cdot)$ to transform an input sentence $x_i$ into a sequence of representations $O$:

$$O^h = \mathrm{ATT}(Q^h, K^h)\, V^h$$

where $h$ is one of $H$ heads. $Q$, $K$ and $V$, respectively, represent queries, keys and values, which are calculated as:

$$Q, K, V = X W_Q,\; X W_K,\; X W_V$$

where $X$ denotes the input representations of $x_i$, $\{W_Q, W_K, W_V\} \in \mathbb{R}^{d \times d}$ are trainable parameters and $d$ indicates the hidden size. The context encoder employs the same networks as the source encoder to obtain the context output $\hat{O}$. Finally, the two encoder outputs $O$ and $\hat{O}$ are combined via a gated sum, as in:

$$\lambda = \sigma(W_\lambda [O, \hat{O}]), \qquad \tilde{O} = \lambda \odot O + (1 - \lambda) \odot \hat{O}$$

in which $\sigma(\cdot)$ is the logistic sigmoid function and $W_\lambda$ is the parameter. $\tilde{O}$ is the final document-level representation, which is further fed into the NMT decoder. Following [55], people usually share the parameters of context encoders and embedding with those of the standard NMT encoder.
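The gated combination of the source and context encoder outputs can be sketched in a few lines of plain Python. This is an illustrative sketch under our own assumptions: the gate is computed per position from the concatenated states with a single weight matrix `w_gate` of shape $d \times 2d$ and no bias term, and the function names are hypothetical. Actual implementations operate on batched tensors in a deep learning framework.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_sum(o, o_ctx, w_gate):
    """Combine one source state `o` and one context state `o_ctx` (length d).

    `w_gate` is a d x 2d matrix given as a list of rows (assumed shape).
    The gate is sigmoid(w_gate @ [o; o_ctx]) and the output is
    gate * o + (1 - gate) * o_ctx, element-wise.
    """
    concat = o + o_ctx  # list concatenation: [o; o_ctx]
    gate = [sigmoid(sum(w * c for w, c in zip(row, concat))) for row in w_gate]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gate, o, o_ctx)]


# With all-zero gate weights the gate is 0.5 everywhere,
# so the output is the plain average of the two states.
source_state = [1.0, 2.0]
context_state = [3.0, 4.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
combined = gated_sum(source_state, context_state, zero_weights)
```

The gate lets the model decide, dimension by dimension, how much context to mix into the current-sentence representation, which is useful when the dialogue history is only sometimes relevant.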

Discourse Phenomenon: Zero Pronoun Translation
Pronouns are frequently omitted in pro-drop languages (e.g., Chinese and Japanese), generally leading to significant challenges with respect to the production of complete translations. This problem is especially severe in informal genres, such as dialogues and conversation, where pronouns are more frequently omitted to make utterances more compact [76]. Ref. [58] proposed an automatic method to annotate ZPs by utilizing the parallel corpus of MT. The homologous data for both ZP prediction and translation leads to significant improvements in translation performance for both statistical [58] and neural MT models [15]. However, such approaches still require external ZP prediction models, which have a low accuracy of 66%. The numerous ZP prediction errors are propagated to translation models, which leads to new translation problems. Therefore, some works began to investigate end-to-end ZP translation models [15,16].
Taking reconstructor-based NMT [15] for example, the reconstructor reads a sequence of hidden states and the annotated source sentence, and outputs a reconstruction score. It employs an attention model to reconstruct the annotated source sentence $\hat{x} = \{\hat{x}_1, \hat{x}_2, \dots, \hat{x}_J\}$ word by word, which is conditioned on the input latent representations $v = \{v_1, v_2, \dots, v_T\}$. The reconstruction score is computed by Equation (9):

$$R(\hat{x}|v) = \prod_{j=1}^{J} R(\hat{x}_j \mid \hat{x}_{<j}, v) = \prod_{j=1}^{J} g_r(\hat{x}_{j-1}, \hat{s}_j, \hat{c}_j) \tag{9}$$

where $\hat{s}_j$ is the hidden state in the reconstructor, and computed by Equation (10):

$$\hat{s}_j = f_r(\hat{x}_{j-1}, \hat{s}_{j-1}, \hat{c}_j) \tag{10}$$

Here, $g_r(\cdot)$ and $f_r(\cdot)$ are, respectively, softmax and activation functions for the reconstructor. The context vector $\hat{c}_j$ is computed as a weighted sum of hidden states $v$, as in Equation (11):

$$\hat{c}_j = \sum_{t=1}^{T} \hat{\alpha}_{j,t} \cdot v_t \tag{11}$$

where the weight $\hat{\alpha}_{j,t}$ is calculated by an additional attention model. The parameters related to the attention model, $g_r(\cdot)$, and $f_r(\cdot)$, are independent of the standard NMT model. The labeled source words $\hat{x}$ share the same word embeddings with the NMT encoder. Finally, they augment the standard encoder-decoder-based NMT model with the introduced reconstructor, as shown in Figure 6. The standard encoder-decoder reads the source sentence $x$ and outputs its translation $y$ along with the likelihood score. We train both the encoder-decoder and the introduced reconstructors together in a single end-to-end process. The training objective can be revised as in Equation (12):

$$J(\theta, \gamma, \psi) = \arg\max_{\theta, \gamma, \psi} \sum_{n=1}^{N} \Big\{ \underbrace{\log P(y^n | x^n; \theta)}_{\text{likelihood}} + \lambda \underbrace{\log R_{enc}(\hat{x}^n | h^n; \theta, \gamma)}_{\text{enc-rec}} + \eta \underbrace{\log R_{dec}(\hat{x}^n | s^n; \theta, \psi)}_{\text{dec-rec}} \Big\} \tag{12}$$

where $\theta$ is the parameter matrix in the encoder-decoder, and $\gamma$ and $\psi$ are model parameters related to the encoder-side reconstructor ("enc-rec") and decoder-side reconstructor ("dec-rec"), respectively. $\lambda$ and $\eta$ are hyper-parameters that balance the preference between likelihood and reconstruction scores; $h$ and $s$ are encoder and decoder hidden states. The original training objective $P(\cdot)$ guides the standard NMT counterpart to provide better translations.
Furthermore, the auxiliary reconstruction objectives ($R_{enc}(\cdot)$ and $R_{dec}(\cdot)$) guide the related part of the parameter matrix $\theta$ to learn better latent representations, which are used to reconstruct the annotated source sentence. The parameters of the model are trained to maximize the likelihood and reconstruction scores of a set of training examples $\{[x^n, y^n]\}_{n=1}^{N}$. In testing, reconstruction can serve as a re-ranking technique to select a better translation from the k-best candidates generated by the decoder. Each translation candidate is assigned a likelihood score from the standard encoder-decoder, as well as reconstruction score(s) from the newly added reconstructor(s). As shown in Figure 7, given an input sentence, a two-phase scheme is used.
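The re-ranking step at test time can be illustrated with the following minimal sketch. It assumes that each k-best candidate already carries a log-likelihood and two log reconstruction scores, and that they are combined linearly with the same weights $\lambda$ and $\eta$ as in training; this combination, the function name `rerank` and the toy scores are our assumptions for illustration, not details confirmed by the description above.

```python
def rerank(candidates, lam=1.0, eta=1.0):
    """Return the candidate with the highest combined log-score.

    Each candidate is a dict with keys 'text', 'likelihood',
    'enc_rec' and 'dec_rec', all holding log-scores (assumed format).
    """
    def combined(candidate):
        return (candidate["likelihood"]
                + lam * candidate["enc_rec"]
                + eta * candidate["dec_rec"])

    return max(candidates, key=combined)


# Toy k-best list with invented scores.
k_best = [
    {"text": "like this cake ?", "likelihood": -1.0,
     "enc_rec": -3.0, "dec_rec": -3.0},
    {"text": "do you like this cake ?", "likelihood": -1.5,
     "enc_rec": -1.0, "dec_rec": -1.0},
]
chosen = rerank(k_best)
likelihood_only = rerank(k_best, lam=0.0, eta=0.0)
```

In this toy example the candidate that drops the pronoun has the better likelihood, but the candidate that restores it reconstructs the annotated source more easily and is therefore selected after re-ranking.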

Dialogue Personality: Speaker Information
As shown in Figure 8, Ref. [11] conducted a personalized MT experiment to explore the effects of speaker tags on dialogue MT. They first build a baseline MT engine using Moses [29] on a dataset extracted from the bilingual movie subtitle of Friends. They train a 5-gram language model using the SRI Language Toolkit [31] on the target side of the parallel corpus. Besides, they use GIZA++ [30] for word alignment and minimum error rate training [32] to optimize feature weights. Based on the hypothesis that different types of speakers may have specific speaking styles, they employ a language model adaptation method to boost the MT system. Instead of building an LM on the whole data, they split the data into two separate parts according to the speakers' sex and then build two separate LMs. As Moses supports multiple LM integration, they directly feed Moses two LMs.
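The idea of speaker-specific language models can be sketched as follows. This toy example replaces the 5-gram models of the original setup with add-one-smoothed unigram models, uses invented utterances, and does not involve Moses or the SRI toolkit; the function `train_unigram_lms` is a hypothetical helper that only illustrates splitting the target-side data by speaker group and training one LM per group.

```python
import math
from collections import Counter, defaultdict


def train_unigram_lms(utterances):
    """Build one add-one-smoothed unigram LM per speaker group.

    `utterances` is an iterable of (group, tokens) pairs.
    Returns {group: function(tokens) -> log-probability}.
    """
    counts = defaultdict(Counter)
    vocab = set()
    for group, tokens in utterances:
        counts[group].update(tokens)
        vocab.update(tokens)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def make_lm(group_counts):
        total = sum(group_counts.values())

        def logprob(tokens):
            return sum(
                math.log((group_counts[t] + 1) / (total + vocab_size))
                for t in tokens
            )

        return logprob

    return {group: make_lm(c) for group, c in counts.items()}


# Invented toy data; real experiments use the target side of the corpus.
toy_data = [
    ("female", "i will see you at the coffee house".split()),
    ("female", "see you there".split()),
    ("male", "dude where is my sandwich".split()),
    ("male", "dude come on".split()),
]
lms = train_unigram_lms(toy_data)
```

Each resulting model scores a hypothesis according to one group's word usage, which mirrors supplying two separate LM features to the decoder.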

Real-Life Applications
Dialogue translation can help real-life systems such as an online hotel-booking conversation system, which can efficiently and accurately assist customers and agents who speak different languages to reach an agreement in a hotel-booking dialogue. Ref. [77] showcases a semantics-enhanced task-oriented dialogue translation system with novel features: (1) task-oriented named entity (NE) definition and a hybrid strategy for NE recognition and translation; and (2) a novel grounded semantic method for dialogue understanding and task-order management.
In the hotel booking scenario, customers and agents speak different languages. For instance, assume that customers speak English and agents speak Chinese. Customers access the hotel website to request a conversation, and the agent accepts the customer's request to start the conversation. Figure 9 shows the detailed workflow of the hotel-booking translation system. The system first recognizes entities by inferring their specific types based on information such as contexts and speakers. Then, the recognized entities are represented as logical expressions or semantic templates using the grounded semantics module. Finally, candidate translations of semantically represented entities are marked up and fed into a unified bi-directional translation process.

Building Advanced Dialogue NMT Systems
Prior related work has investigated different inherent characteristics of dialogue translation [11,13–17,63]. In the meantime, a number of advanced techniques have been empirically validated for general-domain translation, which may be adapted to task-oriented translation scenarios, including data selection, back-translation, and large batch learning. Therefore, we explore a breadth of established approaches for building better dialogue translation systems. First, we mainly investigate three kinds of mainstream models: sentence-level NMT, document-level NMT, and non-autoregressive NMT models [3,13,78]. Technically, we adapt the most recent effective strategies to our models, including back translation [79], data selection [80], domain fine-tuning [81], and large batch learning [82]. To alleviate the low-resource problem, we employ large-scale pre-trained language models including monolingual BERT [83], bilingual XLM [84] and multilingual mBART [85], whose knowledge is transferred to translation models. Based on systematic comparisons, we combine the effective approaches to build two SOTA dialogue translation systems w/ and w/o pre-training, respectively.

Methodology
Sentence-level NMT Models. We choose the state-of-the-art Transformer network [3] as our model structure, which consists of an encoder with 6 layers and a decoder with 6 layers. For sentence-level NMT (SENT), we use two settings customized from the base and small configurations. We follow the base configuration to train the SENT-B model, where the dimensions of the word embeddings and the inner feed-forward layer are 512 and 2048, respectively. The parameters of the source and target word embeddings and the projection layer before the softmax are shared. The number of attention heads is 8. Due to data limitations, we also use the small configuration to build SENT-S models. The main differences from the base setting are that the inner feed-forward layer has dimension 1024 and the number of attention heads is 4. For all models, we empirically adopt large batch learning [82] (i.e., 4096 tokens × 8 GPUs vs. 16348 tokens × 4 GPUs) with a larger dropout of 0.3. The models are trained with the Adam optimizer [86] with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. We use the default learning rate schedule of [3] with an initial learning rate of $5 \times 10^{-4}$. Label smoothing [87] is adopted with a value of 0.1. We set the max learning rate to $7 \times 10^{-4}$, the warmup steps to 16 K and the total training steps to 70 K. All models are trained on NVIDIA V100 GPUs.
Document-level NMT Models. For document-level NMT (DOC), we re-implement the cross-sentence model [13] on top of TRANSFORMER-BASE. The additional encoder reads the N = 3 previous source sentences as history context, and the resulting representations are integrated into the standard NMT model to aid the translation of the current sentence. We follow Zhang et al. [88] in using two-stage training, where the context-agnostic model is trained in the first stage (70 K steps), and the context-aware parameters are then tuned in the second stage (40 K steps).
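To make the sentence-level settings concrete, the following is a minimal sketch of the two configurations and of a warmup schedule of the kind used in [3]. It is an illustration under stated assumptions, not the training code behind the reported results: the class and function names are hypothetical, the embedding dimension of SENT-S is assumed to stay at 512 (the text only lists the feed-forward size and head count as differences), and the schedule is assumed to warm up linearly from zero to the stated max learning rate before decaying with the inverse square root of the step. The stated initial learning rate of $5 \times 10^{-4}$ is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    name: str
    layers: int = 6        # encoder and decoder depth
    embed_dim: int = 512   # word-embedding dimension
    ffn_dim: int = 2048    # inner feed-forward dimension
    heads: int = 8         # attention heads
    dropout: float = 0.3
    label_smoothing: float = 0.1


SENT_B = TransformerConfig("SENT-B")
# Assumption: only the feed-forward size and the head count differ from base.
SENT_S = TransformerConfig("SENT-S", ffn_dim=1024, heads=4)


def inverse_sqrt_lr(step, peak_lr=7e-4, warmup_steps=16_000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step <= 0:
        return 0.0
    return peak_lr * min(step / warmup_steps, math.sqrt(warmup_steps / step))


# Tokens per update for the first large-batch setting in the text.
tokens_per_update = 4096 * 8

for step in (1_000, 16_000, 70_000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
print("tokens per update:", tokens_per_update)
```

With these assumed values, the learning rate peaks at step 16 K and has decayed to roughly half of its peak by the end of the 70 K training steps.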
Non-autoregressive Models. Different from autoregressive NMT models, which generate each target word conditioned on previously generated ones, non-autoregressive NMT (NAT) models break the autoregressive factorization and produce target words in parallel [89]: $p(y|x) = p_L(T|x; \theta) \prod_{t=1}^{T} p(y_t|x; \theta)$, where $T$ is the target length. Although NAT was proposed to speed up inference, we expect that it can alleviate sequential error accumulation and improve diversity in conversational translation. We employ the advanced Mask-Predict model [78] with a better training method [90]. More specifically, Mask-Predict uses the conditional masked LM [83] to iteratively generate the target sequence from the masked input. We follow its optimal settings and keep the number of iterations at 10 and the length beam at 5. We closely follow previous works in applying sequence-level knowledge distillation to NAT [91]. We train BIG Transformer models as the AT teachers and adopt a large batch strategy (i.e., 458 K tokens/batch) to optimize the performance. Traditionally, NAT models are trained for 300 K steps with a regular batch size (i.e., 128 K tokens/batch). In this work, we empirically adopt a large batch strategy (i.e., 480 K tokens/batch) to reduce the number of training steps for NAT (i.e., 70 K). Accordingly, the learning rate warms up to $1 \times 10^{-7}$ for 10 K steps and then decays for 60 K steps with the cosine schedule. For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction, and apply weight decay of 0.01 and label smoothing with $\epsilon = 0.1$. We use the Adam optimizer [86] to train our models. We follow common practice [78,92] and evaluate the performance on an ensemble of the top 5 checkpoints to avoid stochasticity.
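The iterative decoding of Mask-Predict can be sketched as follows. This is a simplified illustration, not the implementation of [78]: the `predict` callable stands in for a hypothetical conditional masked LM, the target length is assumed to be given (the length beam is not modeled), and the function names are chosen here for illustration. The linearly decaying number of re-masked tokens follows the usual description of the algorithm.

```python
MASK = "<mask>"


def num_to_mask(length, iteration, total_iterations=10):
    """Tokens to re-mask after `iteration` (1-based); decays linearly to zero."""
    return length * (total_iterations - iteration) // total_iterations


def mask_predict(length, predict, total_iterations=10):
    """Toy Mask-Predict loop.

    `predict(tokens, masked)` is a hypothetical model call that returns a
    predicted token and a confidence for every position.
    """
    tokens = [MASK] * length
    confidences = [0.0] * length
    masked = list(range(length))  # first pass: everything is masked
    for iteration in range(1, total_iterations + 1):
        new_tokens, new_conf = predict(tokens, masked)
        # Only masked positions are updated; the others keep their values.
        for i in masked:
            tokens[i], confidences[i] = new_tokens[i], new_conf[i]
        n = num_to_mask(length, iteration, total_iterations)
        if n == 0:
            break
        # Re-mask the n least confident positions for the next pass.
        masked = sorted(range(length), key=lambda i: confidences[i])[:n]
        for i in masked:
            tokens[i] = MASK
    return tokens


# With 10 iterations and a 20-token target, the number of masked tokens
# shrinks from 18 after the first pass to 0 after the last.
print([num_to_mask(20, t) for t in range(1, 11)])
```

Because all masked positions are predicted in one pass, each iteration costs a single forward call, which is where the inference speed-up over token-by-token autoregressive decoding comes from.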
Pre-Training for NMT. To transfer general knowledge to downstream tasks, we explore initializing (part of) the parameters of our models with different pre-trained models. In our preliminary experiments, we found that pre-training hardly improves general-domain NMT models, which usually have a large amount of parallel data. In contrast, pre-training helps considerably in low-resource scenarios such as dialogue translation. Furthermore, pre-training on large contiguous text corpora enables the model to capture long-range dialogue context information, which motivates us to systematically exploit various kinds of pre-trained models in terms of architectures and languages.
Ref. [93] shows that large-scale generative pre-training can be used to initialize document-level NMT by concatenating the current sentence and its context. Accordingly, we follow their work to build the BERT→DOC model. Besides, ref. [84] proposes directly training a novel cross-lingual pre-trained language model (XLM) to facilitate translation tasks. Accordingly, we apply the XLM pre-trained model to sentence-level NMT (XLM→SENT). More recently, ref. [85] proposes a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. We also explore mBART for sentence-level NMT (MBART→SENT).
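The concatenation of a sentence with its context, as used for BERT→DOC, can be sketched as below. The function name, the `[SEP]` separator and the default of three context sentences are assumptions made for illustration; the text does not specify the separator or the context size used with the pre-trained model.

```python
def concat_with_context(sentences, index, n_prev=3, sep="[SEP]"):
    """Join up to n_prev previous sentences and the current one into one input string."""
    context = sentences[max(0, index - n_prev):index]
    return f" {sep} ".join(context + [sentences[index]])


dialogue = [
    "Hello, how can I help you?",
    "I would like to book a room.",
    "For which dates?",
    "From Friday to Sunday.",
    "Let me check availability.",
]

# The first utterance has no context; later ones carry up to three
# preceding utterances in front of the current sentence.
for i in range(len(dialogue)):
    print(concat_with_context(dialogue, i))
```

Feeding such concatenated inputs to a model pre-trained on contiguous text is one simple way to expose it to dialogue context without changing the architecture.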

Experiments
Setup. All models are implemented on top of the open-source toolkit Fairseq [94]. Experiments are conducted on the task-oriented translation dataset WMT20 En-De (http://www.statmt.org/wmt20/chat-task.html, accessed on 20 November 2021), which consists of only 14 K sentence pairs. It contains utterances in task-based dialogues with contextual information, and we use the formats with and without context for the corresponding models. We use the official validation and test sets for a fair comparison with previous works. Table 2 shows the statistics of the WMT20 En-De data. We also use the large WMT20 news data (http://www.statmt.org/wmt20/translation-task.html, accessed on 20 November 2021), and select parts of it as pseudo-in-domain data. After preprocessing, we generate subwords via joint BPE [95] with 32 K merge operations. We evaluate the translation quality with BLEU [34].
Comparison of Advanced Models. Table 3 reports the translation performance of various NMT models with different fine-tuning strategies. As seen, all models are hungry for more in-domain data due to the data limitation problem (IN+OUT vs. IN). For sentence-level models, the "base + big batch" setting performs better than the "small" one (SENT-B vs. SENT-S). However, it is difficult for document-level models to outperform sentence-level ones (DOC vs. SENT). There are two main reasons: (1) large-scale training data with contextual information is lacking; (2) it is still unclear how context helps document translation [96,97]. An interesting finding is that the document-level model trained on pseudo contexts ("IN+OUT") improves over the baseline trained on only real contexts ("IN") by +5.47 BLEU points. For NAT models, we improve the vanilla NAT by +0.6 BLEU points, which is lower than those of autoregressive NMT models. For pre-training, we first investigate SENT→DOC. Unfortunately, it is still lower than pure sentence-level models. The performance of BERT→DOC is much better than that of pure document-level models (56.01 vs. 51.93), which confirms our hypothesis that contextual data is limited in this task. Furthermore, XLM→SENT obtains 59.61 BLEU points, which is close to that of SENT-B. Surprisingly, MBART→SENT with the CC25 pre-trained model achieves the best performance among all models (62.67 BLEU). Except for mBART, none of the pre-trained models beats the best sentence-level model. This suggests that (1) it is difficult to transfer general knowledge to downstream tasks; (2) multilingual knowledge may be useful in dialogue scenarios. Encouragingly, the best model with mBART pre-training pushes the state-of-the-art performance on the WMT20 English-German dataset up to 62.67 BLEU points.
Effects of Domain Fine-tuning. Each speaker and language direction involved in the conversation can be regarded as a different sub-domain. We conduct domain adaptation for different models to avoid the performance degradation caused by domain shift (Table 4). Specifically, we fine-tune the well-trained models without and with domain adaptation, denoted as "-Domain" and "+Domain", respectively, and evaluate them on the domain-combined and domain-split validation sets. As seen, domain adaptation helps considerably on the split validation sets ("AVE." 61.48), whereas evaluation on the combined validation set is biased towards models without domain adaptation. We attribute this interesting phenomenon to personality and will explore it in future work.

Conclusions
Dialogue MT is a relatively new but very important research topic for promoting MT in practical use. This paper gives the first comprehensive review of the problems, resources, and techniques developed mainly in the last several years. First, we systematically define four critical problems in dialogue translation by reviewing a large number of related works. Second, we collect nearly all existing corpora for the dialogue translation task, covering 5 language pairs and 4 sub-domains. Third, we introduce three representative approaches concerning architecture, discourse phenomena and personality, respectively. Last, we discuss an example of a real-life application, demonstrating the importance and feasibility of dialogue translation systems. Furthermore, we explore the potential of building a state-of-the-art dialogue translation system by leveraging a breadth of established approaches. Experiments are conducted on a task-oriented translation dataset that is widely used in previous studies (i.e., WMT20 English-German). Encouragingly, we push the SOTA performance up to 62.7 BLEU points on the benchmark by using mBART pre-training. We hope that this survey paper can significantly promote research in dialogue MT.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: