LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation

In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback: because it is deterministic, generating multiple segmentations is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation that improves BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.


1 Introduction

Motivation
Subword segmentation has been established as a standard preprocessing method in neural machine translation (NMT) [1,2]. In particular, byte-pair-encoding (BPE)/BPE-dropout [3,4] is the most successful compression-based subword segmentation. We propose another compression-based algorithm, denoted by LCP-dropout, that generates multiple subword segmentations for the same input, thus enabling data augmentation, especially for small training data.
In NMT, a set of training data is given to the learning algorithm, where each training example is a pair of sentences from the source and target languages. The learning algorithm first transforms each given sentence into a sequence of tokens. In many cases, the tokens correspond to words in the unigram language model. The extracted words are projected from a high-dimensional space consisting of all words to a low-dimensional vector space by word embedding [5], which enables us to easily handle distances and relationships between words and phrases. Word embedding has been shown to boost the performance of various tasks [6,7] in natural language processing. The space of word embedding is defined by a dictionary constructed from the training data, where each component of the dictionary is called a vocabulary. Embedding a word means representing it by a set of related vocabularies.
Constructing an appropriate dictionary, the problem we tackle in this study, is one of the most important tasks. Consider the simplest strategy, which uses the words themselves in the training data as the vocabularies. If a word does not exist in the current dictionary, it is called an unknown word, and the algorithm decides whether or not to register it in the dictionary. Using a sufficiently large dictionary can reduce the number of unknown words as much as desired; however, as a trade-off, overtraining is likely to occur, so the number of vocabularies is usually limited to 16k or 32k. Therefore, subword segmentation has been widely used to construct a small dictionary with high generalization performance [8][9][10][11][12].

Related works
Subword segmentation is a recursive decomposition of a word into substrings. For example, let the word 'study' be registered as a current vocabulary. By embedding the other words 'studied' and 'studying', we can learn that these three words are similar. However, each time a new word appears, the number of vocabularies grows monotonically.
On the other hand, when we focus on the common substrings of these words, we can obtain a decomposition, such as 'stud y', 'stud ied', and 'stud ying' with the explicit blank symbol ' '. Therefore, the idea of subword segmentation is not to register the word itself as a vocabulary but to register its subwords. In this case, 'study' and 'studied' are regarded as known words because they can be represented by combining subwords already registered. These subwords can also be reused as parts of other words (e.g., student and studied), which can suppress the growth of vocabulary size.
In the last decade, various approaches have been proposed along this line. SentencePiece [13] is a pioneering, high-performing study based on likelihood estimation over a unigram language model. Since maximum likelihood estimation requires quadratic time in the size of the training data and the length of the longest subword, a simpler subword segmentation [3] has been proposed based on BPE [14,15], which is known as one of the fastest data compression algorithms and therefore has many applications, especially in information retrieval [16,17]. BPE-based segmentation starts from the state where a sentence is regarded as a sequence of vocabularies, where the set of vocabularies is initially identical to the set of alphabet symbols (e.g., ASCII characters). BPE calculates the frequency of every bigram, merges all occurrences of the most frequent bigram, and registers the bigram as a new vocabulary. This process is repeated until the number of vocabularies reaches the limit. Thanks to the simplicity of frequency-based subword segmentation, BPE runs in time linear in the size of the input string. However, the frequency-based approach may generate inconsistent subwords for occurrences of the same substring. For example, 'impossible' and its substring 'possible' are possibly decomposed into undesirable subwords, such as 'po ss ib le' and 'i mp os si bl e', depending on the frequency of bigrams. Such merging disagreement can also be caused by misspellings of words or grammatical errors. BPE-dropout [4] is a robust subword segmentation proposed for this problem that ignores each merge with a certain probability. It has been confirmed that models trained with BPE-dropout achieve higher accuracy than those trained with the original BPE and SentencePiece on various languages.
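The frequency-based merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [3]: the function names `learn_bpe` and `merge_pair` are hypothetical, the stopping rule is a fixed number of merges rather than a vocabulary limit, and ties between equally frequent bigrams are broken lexicographically, which is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; return them with the segmented words."""
    # Every word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for bigram in zip(symbols, symbols[1:]):
                pairs[bigram] += freq
        if not pairs:
            break
        # Most frequent bigram; lexicographic tie-break (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = Counter({merge_pair(s, best): f for s, f in corpus.items()})
    return merges, corpus


merges, segmented = learn_bpe(["study", "studied", "studying"], num_merges=3)
print(merges)
print(sorted(segmented))
```

With these three words, the shared prefix is merged first, so all three end up sharing the subword 'stud', which matches the decomposition discussed in the text.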

Our contribution
We propose LCP-dropout: a novel compression-based subword segmentation employing the stochastic compression algorithm called locally-consistent parsing (LCP) [18,19] to address the shortcomings of BPE. Here, we outline the original LCP. Suppose we are given an input string and a set of vocabularies, where, similarly to BPE, the set of vocabularies is initially identical to the set of symbols appearing in the string. LCP randomly assigns a binary label to each vocabulary. Then, we obtain a binary string corresponding to the input string, where the bigram '10' works as a landmark. LCP merges any bigram in the input string corresponding to a landmark in the binary string, and adds the bigram to the set of vocabularies. The above process is repeated until the number of vocabularies reaches the limit.
By this random assignment, it is expected that any sufficiently long substring contains a landmark. Furthermore, we note that two different landmarks never overlap each other. Therefore, LCP can merge bigrams appropriately, avoiding the undesirable subword segmentation that occurs in BPE. Using this characteristic, LCP has been theoretically shown to achieve almost optimal compression [19]. The mechanism of LCP has also been applied mainly to information retrieval [18,20,21].
A notable feature of the stochastic algorithm is that LCP assigns a new label to each vocabulary for each execution. Owing to this randomness, the LCP-based subword segmentation is expected to generate different subword sequences representing the same input; thus, it is more robust than BPE/BPE-dropout. Moreover, these multiple subword sequences can be considered as data augmentation for small training data in NMT.
LCP-dropout consists of two strategies: landmarks obtained by random labeling of all vocabularies, and dropout of merging bigrams depending on their rank in the frequency table. Our algorithm requires no segmentation training in addition to counting by BPE and labeling by LCP, and it uses standard BPE/LCP at test time; it is therefore simple. With various language corpora including small datasets, we show that LCP-dropout outperforms the baseline algorithms: BPE/BPE-dropout/SentencePiece.

2 Background
We use the following notations throughout this paper. Let A be the set of alphabet symbols, including the blank symbol. A sequence S formed by symbols is called a string. S[i] and S[i, j] are the i-th symbol and the substring from S[i] to S[j] of S, respectively. We assume the meta symbol '−' not in A to explicitly delimit the subwords in S. For a string S over A ∪ {−}, a maximal substring of S containing no − is called a subword. For example, S = a − b − a − a − b contains the subwords in {a, b}, and S = a − b − a − ab contains the subwords in {a, b, ab}.
In subword segmentation, the algorithm first separates all the symbols in S by the meta symbol. When a trigram a − b is merged, the meta symbol is erased and the new subword ab is added to the vocabulary, i.e., ab is treated as a single vocabulary.
In the following, we describe previously proposed subword segmentation algorithms, called SentencePiece (Kudo [13]), BPE (Sennrich et al. [3]), and BPE-dropout (Provilkov et al. [4]), respectively. We assume that our task in NMT is to predict a target sentence T given a source sentence S, where these methods, including our approach, are not task-specific.

SentencePiece
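SentencePiece [13] can generate different segmentations for each execution. Here, we outline SentencePiece in the unigram language model. Given a set of vocabularies, V, a sentence T, and the probability p(x) of occurrence of x ∈ V, the probability of the partition x = (x_1, ..., x_n) of T = x_1 ⋯ x_n is

$$P(\mathbf{x}) = \prod_{i=1}^{n} p(x_i), \qquad x_i \in V, \qquad \sum_{x \in V} p(x) = 1.$$

The optimum partition x* for T is obtained by searching for the x that maximizes P(x) from all candidate partitions x ∈ S(T):

$$\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x} \in S(T)} P(\mathbf{x}).$$

Given a set of sentences, D, as training data for a language, the subword segmentation for D can be obtained through the maximum likelihood estimation of the following L with P(x) as a hidden variable by using the EM algorithm,

$$L = \sum_{s=1}^{|D|} \log P\left(X^{(s)}\right) = \sum_{s=1}^{|D|} \log \left( \sum_{\mathbf{x} \in S(X^{(s)})} P(\mathbf{x}) \right),$$

where X^(s) is the s-th sentence in D.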
SentencePiece was shown to achieve significant improvements over the method based on subword sequences. However, this method is rather complicated because it requires a unigram language model to predict the probability of subword occurrence, the EM algorithm to optimize the lexicon, and the Viterbi algorithm to create segmentation samples.
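To make the search for the most probable partition concrete, the following is a minimal Viterbi sketch over a unigram model. It is not SentencePiece's implementation: the function name `viterbi_segment` and the toy probability table are hypothetical, and the subword probabilities are simply given rather than estimated by EM.

```python
import math


def viterbi_segment(text, probs):
    """Return the most probable partition of `text` and its log-probability.

    `probs` maps each vocabulary entry to its unigram probability p(x).
    """
    n = len(text)
    max_len = max(map(len, probs))
    best = [-math.inf] * (n + 1)  # best[j]: best log-probability of text[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in probs and best[start] > -math.inf:
                score = best[start] + math.log(probs[piece])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


# Toy unigram model (hypothetical values that sum to 1).
toy_probs = {"a": 0.2, "b": 0.2, "ab": 0.5, "c": 0.1}
print(viterbi_segment("abab", toy_probs))
```

In this toy model the partition (ab, ab) has probability 0.25, which beats every partition that uses the single characters, so it is the one returned.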

BPE and BPE-dropout
BPE [14] is one of the practical implementations of Re-pair [15], which is known as the algorithm with the highest compression ratio. Re-pair counts the frequency of occurrence of all bigrams xy in the input string T. For the most frequent xy, it replaces all occurrences of xy in T such that T[i, i + 1] = xy with some unused character z. This process is repeated until there are no more frequent bigrams in T. The compressed T can be recursively decoded by the stored substitution rules z → xy.
Since the naive implementation of Re-pair requires O(|T|^2) time, a complex data structure is needed to achieve linear time. However, it is not practical for large-scale data because it consumes Ω(|T|) of space. Therefore, we usually split T = t_1 t_2 ⋯ t_m into substrings of constant length and process each t_i by the naive Re-pair without a special data structure; this variant is called BPE. Naturally, there is a trade-off between the size of the split and the compression ratio. BPE-based subword segmentation [3] (simply called BPE) determines the priority of bigrams according to their frequency and adds the merged bigrams as the vocabularies.
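A naive Re-pair loop, of the quadratic-time kind mentioned above, can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of [15]: the names `repair_compress` and `repair_decode` are hypothetical, fresh symbols are written as `<0>`, `<1>`, ..., bigram counts include overlapping occurrences, and a bigram is considered frequent when it is counted at least twice.

```python
from collections import Counter


def repair_compress(text):
    """Naive Re-pair: repeatedly replace the most frequent bigram by a new symbol."""
    seq, rules = list(text), {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent bigram; lexicographic tie-break (an assumption).
        pair, freq = min(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
        if freq < 2:
            break
        new_symbol = f"<{len(rules)}>"  # fresh symbol z for the rule z -> xy
        rules[new_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def repair_decode(seq, rules):
    """Recursively expand the substitution rules z -> xy."""
    def expand(symbol):
        if symbol in rules:
            x, y = rules[symbol]
            return expand(x) + expand(y)
        return symbol
    return "".join(expand(s) for s in seq)


compressed, rules = repair_compress("abababcabc")
print(compressed, rules)
print(repair_decode(compressed, rules))
```

Each pass recounts all bigrams from scratch, which is where the quadratic cost comes from; the linear-time variant avoids this with priority queues and occurrence lists.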
Since BPE is a deterministic algorithm, it splits a given T in only one way. Thus, it is not easy to generate multiple partitions as a stochastic approach does (e.g., [13]). Therefore, BPE-dropout [4], which ignores the merging process with a certain probability, has been proposed. In BPE-dropout, for the current T and the most frequent xy, for each occurrence i satisfying T[i, i + 1] = xy, merging xy is dropped with a certain small probability p (e.g., p = 0.1). This mechanism makes BPE-dropout probabilistic and generates a variety of splits. BPE-dropout has been reported to outperform SentencePiece in various languages. Additionally, BPE-based methods are faster and easier to implement than likelihood-based approaches.
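The per-occurrence dropout described above can be sketched as follows. The sketch follows the description in this section rather than the reference implementation of [4]: the function name `bpe_dropout_segment` is hypothetical, and the merge table is assumed to be an ordered list of bigrams that is applied in priority order.

```python
import random


def bpe_dropout_segment(word, merges, p=0.1, rng=random):
    """Apply an ordered merge table to `word`, skipping each merge with probability p."""
    symbols = list(word)
    for x, y in merges:
        out, i = [], 0
        while i < len(symbols):
            is_pair = (i + 1 < len(symbols)
                       and symbols[i] == x and symbols[i + 1] == y)
            # rng.random() lies in [0, 1): p = 0 always merges, p = 1 never does.
            if is_pair and rng.random() >= p:
                out.append(x + y)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


merge_table = [("s", "t"), ("st", "u"), ("stu", "d")]  # hypothetical merge table
print(bpe_dropout_segment("study", merge_table, p=0.0))
print(bpe_dropout_segment("study", merge_table, p=1.0))
print(bpe_dropout_segment("study", merge_table, p=0.5, rng=random.Random(0)))
```

Setting p = 0 reproduces the deterministic BPE split, p = 1 returns the word as single characters, and intermediate values yield different splits from run to run.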

LCP
Frequency-based compression algorithms (e.g., [14,15]) are known not to be optimum from a theoretical point of view. Optimum compression here refers to a polynomial-time algorithm whose output A(T) for the input T has a size |A(T)| that is bounded relative to the size of an optimum solution A*(T). Note that computing A*(T) is NP-hard [22].
Some strings force frequency-based merging into pathological behavior. Since such pathological merging cannot be prevented by frequency information alone, frequency-based algorithms cannot obtain asymptotically optimum compression [23]. Various linear-time and optimal compression algorithms have been proposed to remedy this drawback. LCP is one of the simplest optimum compression algorithms. The original LCP, like Re-pair, is a deterministic algorithm. Recently, the introduction of probability into LCP [19] has been proposed, and in this study, we focus on the probabilistic variant. The following is a brief description of the probabilistic LCP.
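We are given an input string T = a_1 a_2 ⋯ a_n of length n and a set of vocabularies, V. Here, V is initialized as the set of all characters appearing in T.

1. Randomly assign a label L(a) ∈ {0, 1} to each a ∈ V.
2. According to L(a), compute the binary sequence L(T) = L(a_1) L(a_2) ⋯ L(a_n).
3. Merge every bigram a_i a_{i+1} such that L(a_i a_{i+1}) = 10.
4. Set V = V ∪ {a_i a_{i+1}} and repeat the above process.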
The difference between LCP and BPE is that BPE merges bigrams with respect to frequencies, whereas LCP pays no attention to them. Instead, LCP merges based on the binary labels assigned randomly. The most important point is that any two occurrences of '10' never overlap. For example, when T contains a trigram abc, there is no possible assignment allowing (ab)c and a(bc) simultaneously. By this property, LCP can avoid the problem that frequently occurs in BPE. Although LCP theoretically guarantees almost optimum compression, as far as the authors know, this study is the first to apply LCP to machine translation.
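One round of the probabilistic LCP can be sketched as below. This is a minimal illustration, not the algorithm of [19] in full: the function name `lcp_round` is hypothetical, and labels are drawn uniformly from {0, 1} for every vocabulary, which is an assumption about the labeling distribution.

```python
import random


def lcp_round(symbols, vocab, rng=random):
    """One LCP round: label every vocabulary with 0/1 and merge each '10' landmark."""
    labels = {v: rng.randint(0, 1) for v in sorted(vocab)}
    out, new_vocab, i = [], set(), 0
    while i < len(symbols):
        if (i + 1 < len(symbols)
                and labels[symbols[i]] == 1 and labels[symbols[i + 1]] == 0):
            merged = symbols[i] + symbols[i + 1]
            out.append(merged)
            new_vocab.add(merged)
            i += 2  # safe: a landmark ends in 0, so the next one cannot start here
        else:
            out.append(symbols[i])
            i += 1
    return out, vocab | new_vocab, labels


symbols = list("ababcaacabcb")
merged, vocab, labels = lcp_round(symbols, set(symbols), rng=random.Random(0))
print(labels)
print(merged)
```

Because every landmark ends with a symbol labeled 0 and every landmark starts with a symbol labeled 1, a single left-to-right scan merges all landmarks without ever having to choose between (ab)c and a(bc).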
3 Our Approach: LCP-dropout

BPE-dropout allows diverse subword segmentation for BPE by ignoring bigram merging with a certain probability. However, since BPE is a deterministic algorithm, it is not trivial to generate various bigram candidates. In this study, we propose an algorithm that enables multiple subword segmentation for the same input by combining the theory of LCP with the original strategy of BPE.

Algorithm description
We define the notations used in our algorithm. Let A be an alphabet and ' ' be the explicit blank symbol not in A. A string w ∈ A* is called a word, and a string s ∈ (A ∪ {' '})* is called a sentence.
We also assume the meta symbol '−' not in A ∪ {' '}. With this symbol, a sentence x is extended to have all possible merges: let x̄ be the string of all symbols in x separated by −, e.g., x̄ = a − b − a − b − b for x = ababb. For a string y, if y is obtained by removing some occurrences of − in x̄, then y is said to be a subword segmentation of x.
After merging a − b (i.e., a − b is replaced by ab), the substring ab is treated as a single symbol. Thus, we extend the notion of bigram to vocabularies longer than a single symbol. For a string of the form s = α_1 − α_2 − ⋯ − α_n such that each α_i contains no −, each α_i − α_{i+1} is defined to be a bigram consisting of the vocabularies α_i and α_{i+1}.
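The notation above maps directly onto simple string operations. The following sketch assumes that the ASCII hyphen stands in for the meta symbol and that it never occurs in the input; the function names `fully_split`, `is_segmentation_of`, and `bigrams` are hypothetical.

```python
SEP = "-"  # ASCII stand-in for the meta symbol (assumed absent from the input)


def fully_split(x):
    """Return the fully separated form of x, with every symbol split by SEP."""
    return SEP.join(x)


def is_segmentation_of(y, x):
    """True if y is obtained from the fully separated x by deleting some SEPs."""
    pieces = y.split(SEP)
    return all(pieces) and "".join(pieces) == x


def bigrams(s):
    """Adjacent pairs of vocabularies (maximal SEP-free substrings) in s."""
    pieces = s.split(SEP)
    return list(zip(pieces, pieces[1:]))


print(fully_split("ababb"))
print(is_segmentation_of("ab-ab-b", "ababb"))
print(bigrams("ab-ab-b"))
```

After a merge, a bigram is a pair of vocabularies rather than a pair of characters, so `bigrams("ab-ab-b")` yields the pairs (ab, ab) and (ab, b).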

Example run
Table 1 presents an example of subword segmentation using LCP-dropout. Here, the input X consists of a single sentence ababcaacabcb. The hyperparameters are (v, ℓ, k) = (6, 5, 0.5). First, the set of vocabularies is initialized to V = {a, b, c}.

Framework of neural machine translation
Figure 1 shows the framework of our transformer-based machine translation model with LCP-dropout. Transformer is the most successful NMT model [25]. The model mainly consists of an Encoder and a Decoder. The Encoder converts the input sentence in the source language into a word embedding (Emb_i in Figure 1), taking into account the positional information of the characters. Here, the notion of word is extended to that of subword in this study. The subwords are obtained by our proposed method, LCP-dropout. Next, the correspondences in the input sentence are acquired as attention (Multi-Head Attention). Then, the normalization is done through a feed-forward network formed by a linear transformation, activation by the ReLU function, and another linear transformation. These processes are performed in N = 6 layers for the Encoder.
The Decoder receives the candidate sentence generated by the Encoder and the input sentence for the Decoder. Then, it acquires the correspondence between those sentences as attention (Multi-Head Attention). This process is also performed in N = 6 layers. Finally, the predicted probability of each label is calculated by a linear transformation and the softmax function.
4 Experimental Setup

Baseline algorithms
The baseline algorithms are SentencePiece [13] with the unigram language model and BPE/BPE-dropout [3,4]. SentencePiece takes the hyperparameters l and α, where l specifies how many best segmentations for each word are produced before sampling and α controls the smoothness of the sampling distribution. In our experiment, we use (l = 64, α = 0.1), which performed best on different data in the previous studies.

Table 1: Example of multiple subword segmentation using LCP-dropout for the single sentence 'x = ababcaacabcb' with the hyperparameters (v, ℓ, k) = (6, 5, 0.5), where the meta symbol − is omitted, and L is the label of each vocabulary assigned by LCP. The resulting subword segmentations are Y_1 = ab − abc − a − a − c − abc − b and Y_2 = ab − ab − ca − a − ca − b − c − b.

BPE-dropout takes the hyperparameter p, where during segmentation, at each step, some merges are randomly dropped with probability p. If p = 0, the segmentation is equal to that of the original BPE, and if p = 1, the algorithm outputs the input string itself. Thus, the value of p can be used to control the granularity of segmentation. In our experiment, we use p = 0 for the original BPE and p = 0.1 for BPE-dropout, the value with the best performance.

Data sets, preprocessing, and vocabulary size
We verify the performance of the proposed algorithm for a wide range of datasets with different sizes and languages. Table 2 summarizes the details of the datasets and hyperparameters. These data are used to compare the performance of LCP-dropout and the baselines (SentencePiece/BPE/BPE-dropout) with the appropriate hyperparameters and vocabulary sizes shown in [4].
Before subword segmentation, we preprocess all datasets with the standard Moses toolkit, where for Japanese and Chinese, subword segmentations are trained almost from raw sentences because these languages have no explicit word boundaries, and thus the Moses tokenizer does not work correctly. Based on recent research on the effect of vocabulary size on translation quality, the vocabulary size is modified according to the dataset size in our experiments (Table 2).
To verify the performance of the proposed algorithm for small training data, we use News Commentary v16, a subset of WMT14, as well as KFTT. In addition, we use a large training dataset in WMT14. The number of training steps is set to 200,000 for all data. In training, pairs of sentences of the source and target languages were batched together by approximate length.
As shown in Table 2, the batch size was standardized to approximately 3k for all datasets.

Model, optimizer, and evaluation
NMT was realized by the seq2seq model, which takes a sentence in the source language as input and outputs a corresponding sentence in the target language [24]. The transformer is an improvement of the seq2seq model and is the most successful NMT model [25]. In our experiments, we used OpenNMT-tf [26], a transformer-based NMT system, to compare LCP-dropout and the baseline algorithms. The parameters of OpenNMT-tf were set as in the experiment of BPE-dropout [4]. The batch size was set to 3072 for training and 32 for testing. We also use the regularization and optimization procedure described in BPE-dropout [4].
The quality of machine translation is quantitatively evaluated by the BLEU score, i.e., the similarity between the translation result and the reference translation. It is calculated using the following formula based on the number of matches in their n-grams. Let t_i and r_i (1 ≤ i ≤ m) be the i-th translation and reference sentences, respectively. In its standard form,

$$\mathrm{BLEU} = BP_{\mathrm{BLEU}} \cdot \exp\left( \frac{1}{N} \sum_{n=1}^{N} \log p_n \right), \qquad p_n = \frac{\sum_{i=1}^{m} \#\{\text{$n$-grams of } t_i \text{ matching } r_i\}}{\sum_{i=1}^{m} \#\{\text{$n$-grams of } t_i\}},$$

where N is a small constant (e.g., N = 4), and BP_BLEU is the brevity penalty applied when |t_i| < |r_i|, where BP_BLEU = 1 otherwise. In this study, we use SacreBLEU [27]. For Chinese, we add the option --tok zh to SacreBLEU. Meanwhile, we use character-based BLEU for Japanese.
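The n-gram matching behind BLEU can be illustrated with a small pure-Python sketch. It is not SacreBLEU and will not reproduce its scores: the function names `ngram_counts` and `corpus_bleu` are hypothetical, tokens are obtained by whitespace splitting, a single reference per sentence is assumed, no smoothing is applied, and the brevity penalty uses the usual corpus-level form exp(1 − r/c).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(translations, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale, one reference per sentence."""
    matches, totals = [0] * max_n, [0] * max_n
    t_len = r_len = 0
    for t, r in zip(translations, references):
        t_tok, r_tok = t.split(), r.split()
        t_len += len(t_tok)
        r_len += len(r_tok)
        for n in range(1, max_n + 1):
            t_ngrams, r_ngrams = ngram_counts(t_tok, n), ngram_counts(r_tok, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((t_ngrams & r_ngrams).values())
            totals[n - 1] += sum(t_ngrams.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if t_len >= r_len else math.exp(1.0 - r_len / t_len)
    return 100.0 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["a dog stood near a rug"], reference))
```

An exact match scores 100 and a translation with no shared words scores 0; real evaluations should rely on SacreBLEU so that tokenization and smoothing are standardized.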

Estimation of hyperparameters for LCP-dropout
First, we estimate suitable hyperparameters for LCP-dropout. Table 3 summarizes the effect of the hyperparameters (v, ℓ, k) on the proposed LCP-dropout. This table shows the details of multiple subword segmentation using LCP-dropout and the BLEU scores for the language pair of English (En) and German (De) from News Commentary v16. The threshold k controls the dropout rate and contributes to the multiplicity of the subword segmentation. The results show that k and ℓ affect the learning accuracy (BLEU). The best result is obtained when (k, ℓ) = (0.01, v/2). This can be explained by the results in Table 4, which shows the depth of the executed inner loop of LCP-dropout for randomly assigning {0, 1} to vocabularies; for ℓ = v/2, the value is the average taken before the outer loop terminates. The larger this value is, the more likely it is that longer subwords will be generated. However, unlike BPE-dropout, the value of k alone is not enough to generate multiple subwords. The proposed LCP-dropout guarantees diversity by initializing the subword segmentation with ℓ < v. Using this result, we fix (k, ℓ) = (0.01, v/2) as the hyperparameters of LCP-dropout.

Comparison with baselines
Table 5 summarizes the main results. We show BLEU scores for News Commentary v16 and KFTT: En and De are the same as in Table 3. In addition to these languages, we include French (Fr), Japanese (Ja), and Chinese (Zh). For each language, we show the average number of multiple subword sequences generated by LCP-dropout. For almost all datasets, LCP-dropout outperforms the baseline algorithms. For the hyperparameters of BPE-dropout and SentencePiece, we use the best ones reported in the previous study.
Table 5 also shows the effect of alphabet size on subword segmentation. In general, the Japanese (Ja) and Chinese (Zh) alphabets are very large, containing at least 2k alphabet symbols even if we limit them to those in common use. Therefore, the average length of words is small and subword segmentation is difficult. We confirmed that LCP-dropout has higher BLEU scores than the other methods for these languages.
Table 5 also presents the BLEU scores for a large corpus (WMT14) for the translation De → En. This experiment shows that LCP-dropout cannot outperform the baselines with the hyperparameters we set. We conjecture that this is because the ratio of the vocabulary size (v, ℓ) to the dropout rate k is not appropriate. As data to support this conjecture, it can be confirmed that the multiplicity in the large dataset is much smaller than that in the small corpora (Table 5). This is caused by the reduced repetitions of label assignments, as shown in Table 6 compared to Table 4. The results show that the depth of the inner loop is significantly reduced, which is why enough subword sequences cannot be generated.
Table 7 presents several translation results. The 'Reference' represents the correct translation for each case, and the BLEU score is obtained from the pair of the reference and each translation result. We also show the average word length for each reference sentence, indicated by 'ave./word'. These results show the characteristics of successful and unsuccessful translations by the two algorithms in relation to the length of words.
Considering subword segmentation as a parsing tree, LCP produces a balanced parsing tree, whereas the tree produced by BPE tends to be longer for a certain path. For example, for a substring abcd, LCP tends to generate subwords like ((ab)(cd)), while BPE generates them like (((ab)c)d). In this example, the average length of the former is shorter than that of the latter. This trend is supported by the experimental results in Table 7, which show the average length of all subwords generated by LCP/BPE-dropout for real datasets.
Figure 2 shows the distributions of sentence length for English. The sentence length denotes the number of tokens in a sentence. BPE-dropout is a well-known fine-grained segmentation approach. The figure shows that LCP-dropout produces more fine-grained segmentation than the other three segmentation approaches.
Therefore, LCP-dropout is considered to be superior in subword segmentation for languages consisting of short words. Table 5, which includes the translation results for Japanese and Chinese, also supports this characteristic.
6 Conclusions, Limitations, and Future Research

Conclusions and limitations
In this study, we proposed LCP-dropout as an extension of BPE-dropout [4] for multiple subword segmentation by applying a near-optimum compression algorithm. The proposed LCP-dropout can properly decompose strings without background knowledge of the source/target language by randomly assigning binary labels to vocabularies. This mechanism allows generating consistent multiple segmentations for the same string. As shown in the experimental results, LCP-dropout enables data augmentation for small datasets, where sufficient training data are unavailable, as for minor languages or limited fields.
Multiple segmentation can also be achieved by likelihood-based methods. After SentencePiece [13], various extensions have been proposed [28,29]. In contrast to these studies, our approach focuses on a simple linear-time compression algorithm. Our algorithm does not require any background knowledge of the language, unlike word replacement-based data augmentation [30,31], where some words in the source/target sentence are swapped with other words while preserving grammatical/semantic correctness.

Future research
The effectiveness of LCP-dropout was confirmed for almost all small corpora. Unfortunately, the optimal hyperparameter obtained in this study did not work well for a large corpus. Besides, the learning accuracy was found to be affected by the alphabet size of the language. Future research directions include an adaptive mechanism for determining the hyperparameters depending on training data and alphabet size.

Table 7: Examples of translated sentences by LCP-dropout (k = 0.01) and BPE-dropout (p = 0.1) with the reference translation for News Commentary v16. We show the average word length (ave./word) for each reference sentence as well as the average subword length (ave./subword) generated by the respective algorithms for the entire corpus. We also show the BLEU scores between the references and translated sentences as well as their standard deviations (SD).

Reference (ave./word = 5.00): 'Even if his victory remains unlikely, Bayrou must now be taken seriously.'
LCP-dropout (BLEU 84.5): 'While his victory remains unlikely, Bayrou must now be taken seriously.'
BPE-dropout (BLEU 30.8): 'Although his victory remains unlikely, he needs to take Bayrou seriously now.'

Reference (ave./word = 5.38): 'In addition, companies will be forced to restructure in order to cut costs and increase competitiveness.'
LCP-dropout (BLEU 12.4): 'In addition, restructuring will force firms to save costs and boost competitiveness.'
BPE-dropout (BLEU 66.8): 'In addition, businesses will be forced to restructure in order to save costs and increase competitiveness.'
In the experiments in this paper, we considered word-by-word subword decomposition. On the other hand, multi-words are known to violate the compositionality of language. Therefore, by considering multi-words as longer words and performing subword decomposition, LCP-dropout can be applied to language processing related to multi-words. In this study, subword segmentation was applied to machine translation. To improve the BLEU score, there are other approaches such as data augmentation [32]. Incorporating LCP-dropout with them is one interesting approach. In this paper, we handled several benchmark datasets with major languages. Recently, machine translation of low-resource languages has become an important task [33]. Applying LCP-dropout to this task is also important future work.
Although the proposed LCP-dropout is currently applied only to machine translation, we plan to apply our method to other linguistic tasks including sentiment analysis, parsing, and question answering in future studies.

Subroutine of LCP-dropout:
1: assign L : V → {0, 1} randomly
2: FREQ(k) ← the set of top-k frequent bigrams in Y of the form α − β with L(αβ) = '10'
3: merge all occurrences of α − β in Y for each α − β ∈ FREQ(k)
4: add all the vocabularies αβ to V

In the example run, the vocabularies are initialized to V = {a, b, c}; for each α ∈ V, a label L(α) ∈ {0, 1} is randomly assigned (depth 0). Next, all occurrences of 10 in L are found, and the corresponding bigrams are merged depending on their frequencies. Here, L(ab) = L(ac) = 10, but only a − b is a top-k bigram assigned 10, so a − b is merged to ab. The resulting string is shown at depth 1 over the new vocabularies V_1 = {a, b, c, ab}. This process is repeated while |V_m| < ℓ for the next m. The condition |V_2| = 5 terminates the inner loop of LCP-dropout, and then the subword segmentation Y_1 = ab − abc − a − a − c − abc − b is generated. Since |V(Y_1)| < v, the algorithm generates the next subword segmentation Y_2 for the same input. Finally, we obtain the multiple subword segmentations Y_1 = ab − abc − a − a − c − abc − b and Y_2 = ab − ab − ca − a − ca − b − c − b for the same input string.
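The subroutine and the surrounding loops can be sketched as follows. This is a reading of the description above under several loudly stated assumptions, not the authors' implementation: the names `lcp_dropout_inner` and `lcp_dropout` are hypothetical; the threshold k is interpreted as a fraction of the distinct bigram types ranked by frequency (at least one type is always kept); ties are broken lexicographically; the outer loop stops once the union of the vocabularies of all generated segmentations reaches v; and `max_rounds`/`max_runs` are safety caps that do not appear in the source.

```python
import random
from collections import Counter


def lcp_dropout_inner(text, ell, k, rng=random, max_rounds=1000):
    """Inner loop: label, keep top-k '10' bigrams, merge; repeat while |V| < ell."""
    symbols = list(text)
    vocab = set(symbols)
    for _ in range(max_rounds):
        if len(vocab) >= ell:
            break
        counts = Counter(zip(symbols, symbols[1:]))
        if not counts:
            break
        labels = {v: rng.randint(0, 1) for v in sorted(vocab)}
        ranked = sorted(counts, key=lambda pair: (-counts[pair], pair))
        top = ranked[:max(1, int(k * len(ranked)))]  # assumed meaning of "top-k"
        freq_k = {(a, b) for a, b in top if labels[a] == 1 and labels[b] == 0}
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) in freq_k:
                merged = symbols[i] + symbols[i + 1]
                out.append(merged)
                vocab.add(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, vocab


def lcp_dropout(text, v, ell, k, rng=random, max_runs=100):
    """Outer loop: collect segmentations until the total vocabulary reaches v."""
    segmentations, total_vocab = [], set()
    for _ in range(max_runs):
        if len(total_vocab) >= v:
            break
        segmentation, vocab = lcp_dropout_inner(text, ell, k, rng)
        segmentations.append(segmentation)
        total_vocab |= vocab
    return segmentations, total_vocab


segmentations, vocab = lcp_dropout("ababcaacabcb", v=6, ell=5, k=0.5,
                                   rng=random.Random(0))
for seg in segmentations:
    print("-".join(seg))
```

Each printed line is one subword segmentation of the same input. Since the labels are random, the concrete segmentations depend on the seed and need not coincide with Y_1 and Y_2 in the example.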

Figure 1: Framework of our neural machine translation model with LCP-dropout.

Figure 2: Distribution of sentence length. The number of tokens in each sentence by LCP-dropout tends to be larger than the others: BPE, BPE-dropout, and SentencePiece.

Table 2: Overview of the datasets and hyperparameters. The hyperparameter v (vocabulary size) is common to all algorithms (baselines and ours), and the others (ℓ and k) are specific to LCP-dropout.

Table 3: Experimental results of LCP-dropout for News Commentary v16 (Table 2) w.r.t. the specified hyperparameters, where the translation task is De → En. Bold indicates the best score.

Table 4: Depth of label assignment in LCP-dropout.

In Table 3, for each threshold k ∈ {0.01, 0.05, 0.1}, En and De indicate the number of multiple subword sequences generated for the corresponding language, respectively. The last two values are the BLEU scores for De → En with ℓ = v and ℓ = v/2 for v = 16k, respectively.

Table 5: Experimental results of LCP-dropout (denoted by LCP), BPE-dropout (denoted by BPE), and SentencePiece (denoted by SP) on various languages in Table 2 (small corpus: News Commentary v16 and KFTT, and large corpus: WMT14), where 'multiplicity' denotes the average number of sequences generated per input string. Bold indicates the best score.

Table 6: Depth of label assignment for large corpus.
Because LCP produces shorter subwords on average than BPE-dropout on real datasets (Table 7), when the vocabulary size is fixed, LCP tends not to generate subwords of appropriate length because it decomposes a longer word into excessively short subwords.