Article

Research on Traditional Mongolian-Chinese Neural Machine Translation Based on Dependency Syntactic Information and Transformer Model

Ren Qing-dao-er-ji, Kun Cheng and Rui Pang
1 School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
2 Information and Communications Division, Wuhai Power Supply Company, Inner Mongolia Power (Group) Co., Ltd., Wuhai 016000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(19), 10074; https://doi.org/10.3390/app121910074
Submission received: 5 September 2022 / Revised: 28 September 2022 / Accepted: 1 October 2022 / Published: 7 October 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Neural machine translation (NMT) is a data-driven machine translation approach that has proven its superiority on large corpora, but it still has much room for improvement when corpus resources are not abundant. This work aims to improve the translation quality of Traditional Mongolian-Chinese (MN-CH) translation. First, the baseline model is constructed based on the Transformer model, and then two different syntax-assisted learning units are added to the encoder and decoder. In this way, the encoder’s ability to learn Traditional Mongolian syntax is implicitly strengthened, and knowledge of Chinese dependency syntax is taken as prior knowledge to explicitly guide the decoder in learning Chinese syntax. Measured under two experimental conditions, the proposed model improved the average BLEU value by 6.706 (45.141 − 38.435) and 5.409 (41.930 − 36.521) over the baseline model. Analysis of the experimental results also revealed that the proposed model was still deficient in learning Chinese syntax, so the Primer-EZ method was introduced to ameliorate this problem, leading to faster convergence and better translation quality. The final improved model increased the average BLEU value by 9.113 (45.634 − 36.521) compared with the baseline model under the experimental conditions of N = 5 and epochs = 35. The experiments showed that both the proposed model architecture and the prior knowledge could effectively increase the BLEU value, and that the added syntax-assisted learning units not only corrected the initially uniform associations between words but also alleviated the long-distance dependency problem between words.

1. Introduction

NMT has been developed over the course of decades and has become the dominant approach in the field of machine translation. The origins of NMT can be traced back to the 1980s when researchers proposed neural network-based approaches to machine translation [1,2] and encoder-decoder architectures [3,4,5], but they did not attract much attention. With the successful application of deep learning and distributed word vectors in the field of natural language processing (NLP) [6,7,8,9], NMT began to show its potential.
Among the NMT methods, NMT based on an attention mechanism is particularly prominent. Bahdanau et al. [10] combined the attention mechanism with an NMT model based on an encoder-decoder architecture, where the decoder can use the output vectors from the full time step of the encoder and automatically adjust the weights with these vectors. Luong et al. [11] introduced the local attention mechanism based on the study by Bahdanau et al. The local attention mechanism focuses on a subset of the output vector for the full time step of the encoder to reduce the time overhead while ensuring the translation quality. The entry of large research institutions such as Google [12] and Facebook [13] has accelerated the development of NMT.
In 2017, Google Brain proposed the Transformer model [14], which is entirely based on the attention mechanism, and it has quickly become the benchmark model due to its excellent performance. Variants based on the Transformer model have proliferated and become important models for various tasks in the field of NLP [15,16,17]. In the self-attention mechanism, $QK^T$ represents the degree of association between any two words in the sequence, and the larger the dot product of two word vectors, the stronger the association between the corresponding two words.
However, the Transformer model is also subject to the same constraints as other NMT models: computer arithmetic power, model structure, and the amount of corpus resources. The level of computer hardware has reached a certain height, and some effective structures and tricks have also been proposed. The establishment of the corpus requires a lot of manual effort and time, which is a major problem that limits the translation quality of NMT. Therefore, NMT translation in low-resource corpus needs more research and attention.
Prior knowledge has been proven to improve the quality of machine translation [18,19,20,21,22], which is indeed a worthy research direction for small language machine translation. NMT assumes that syntactic knowledge can be learned automatically during training [11]. Studies have shown that this level of learning is still not enough to capture deep syntactic details, and additional syntactic information can improve the translation performance [23,24,25,26,27]. Syntactic structure arranges words into a meaningful sentence, and incorporating syntactic knowledge into the NMT model helps to eliminate ambiguity. Syntactic analysis is generally divided into constituent syntactic parsing and dependency syntactic parsing. Constituent syntactic parsing is used to determine the lexical components and phrase structure of a sentence, and dependency syntactic parsing is used to identify the dependency relations between words. Since the self-attention mechanism is concerned with associations between words, the dependency syntactic relations were chosen as the external linguistic knowledge in this paper.
In fact, there have been some attempts to improve the Transformer model’s attention mechanism. Yang et al. [28] proposed adding a learnable Gaussian bias G to the self-attention mechanism. The center location and window size of the Gaussian bias can be determined by the model autonomously, and the Gaussian bias is incorporated into the original attention distribution to form a new modified distribution, with which the attention mechanism can enhance its ability to learn local contextual information. Bugliarello et al. [29] proposed strengthening the self-attention mechanism by multiplying Q K T by a matrix of dependent syntactic information, calling the improved attention head PASCAL. Both articles aim to enhance the model’s ability to learn local information, which may weaken the model’s ability to capture long-range dependencies.
This research also aims to improve translation quality by improving the attention mechanism. We propose a translation model that learns syntactic knowledge on both the encoder and decoder sides. A syntax-assisted learning unit M is added to the encoder’s self-attention mechanism to implicitly enhance the encoder’s learning of Traditional Mongolian syntax, and a Masked Multi-Head Dependency Parsing Attention (MMHDPA) layer is added to the front end of the decoder, which explicitly enhances the decoder’s learning of Chinese syntax using the dependency syntax (DP) matrix extracted from the Chinese dependency parse. A CNN is also integrated into the MMHDPA layer, making it easier for the attention mechanism to determine the weighting relationships between words. The proposed model has the following features (the source code and experimental results are publicly available at https://github.com/CK-IMUT-501/applsci-12-10074, accessed on 9 September 2022):
  • The syntax-assisted learning unit M in the encoder balances the learning of local and global information and is used in each encoder sublayer, expanding the hypothesis space of the model;
  • The addition of the Chinese dependency syntax matrix (DP) to the decoder provides an intuitive, effective, and interpretable result;
  • The combination of CNNs and MMHDPA enhances translation performance with only a few additional parameters.

2. Related Works

2.1. Traditional Mongolian-Chinese Machine Translation

Language is part of culture and not only assumes the role of enhancing national unity, strengthening identity, and passing on culture, but it also conveys information about the customs, values, and social attitudes to the outside world. Traditional Mongolian, the official way of writing the Mongolian language in the Inner Mongolia Autonomous Region, is an agglutinative language, and its word formation and morphology are accomplished by linking different affixes to the roots and stems of words, which results in a complex grammatical form and a large vocabulary [30,31]. However, a high-quality MN-CH parallel corpus is still very scarce, which poses difficulties for machine translation.
After reviewing a large amount of research, the development trend of MN-CH machine translation is summarized as follows.
Before 2017, research on MN-CH machine translation focused on example-based machine translation (EBMT), rule-based machine translation (RBMT), and statistical machine translation (SMT), and the research methods during this period mainly included multi-strategy translation systems, SMT systems based on phrases or hierarchical phrases, and MT systems combining the characteristics of Mongolian words. In order to reduce the large number of word order errors in Chinese-Mongolian SMT, Wang et al. [32] proposed a Chinese sentence reordering method based on the Mongolian word order. To alleviate the data sparsity and linguistic morphological differences in MN-CH statistical machine translation, Yang et al. [33] proposed a method for constructing a lexeme-based translation model, with Mongolian lexemes as the central language. Wu et al. [34] proposed a multi-method fusion approach to MN-CH machine translation, where they added subword translation monolingual data to attention-based NMT and used SMT to guide the model to produce correct translations. Wu et al. [31] also proposed combining a template-based machine translation (TBMT) system with an SMT system to achieve better translation results.
In 2017, MN-CH machine translation gradually transitioned to NMT, with NMT methods dominating Mongolian-Chinese machine translation from 2018 to the present. During this period, MN-CH machine translation has presented the following characteristics: pure NMT research dominates and closely follows the international mainstream deep learning methods, and SMT methods are sometimes used as a supplement to NMT methods. Li et al. [35] argued that the Transformer model’s positional encoding cannot correctly express the deep logical relationships between document sentences, so they proposed a method to improve positional encoding. Additionally, they proposed a method to fuse inter-sentence relationship information to improve the translation quality. Ji et al. [36] improved the generative adversarial network (GAN) by adding value constraints and semantic enhancement and then used it to alleviate the unknown word (UNK) problem in NMT.
Nowadays, research on MN-CH NMT has absorbed advanced international research ideas and methods and has made promising progress, but the research directions remain scattered, and there is a lack of deep, systematic research themes with strong continuity between studies.

2.2. Transformer

The Transformer model is an encoder-decoder model, where both the encoder and decoder contain N sublayers, with a default N value of six. The model structure is shown in Figure 1. Transformer is built entirely on the attention mechanism, abandoning the recurrent neural network (RNN) and convolutional neural network (CNN) structures, which enables training parallelization and improves training efficiency. However, this also brings potential problems: (1) Transformer’s ability to learn the position information of sequences is not as good as an RNN’s, but position information is very important in NLP, so positional encoding is introduced to remedy this deficiency. (2) Transformer has an insufficient ability to extract local information: it observes the correlation between words from a global perspective, while RNNs and CNNs are more biased toward local information.
Transformer introduces two multi-head attention mechanisms to learn feature information from multiple spaces. One is the self-attention mechanism (Q, K, and V all come from the same input), which is used in both the encoder and decoder and is intended to allow the model to learn the correlation between different words in the input sentence itself, while the other is the cross-attention mechanism, which is only used in the decoder and is calculated in the same way as the self-attention mechanism, except that K and V come from the output of the encoder and Q comes from the input matrix of the decoder.
The formula for one of the heads in the multi-head self-attention mechanism is as follows:
$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
where $Q_i$, $K_i$, and $V_i$ are linear transformations of the input matrix that help to enlarge the expression space of the model, and $d_k$ is the feature dimension of $K_i$. Dividing $Q_i K_i^{T}$ element-wise by $\sqrt{d_k}$ reduces the variance of $Q_i K_i^{T}$, making the model more stable during gradient updates.
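For concreteness, one head of the formula above can be sketched in PyTorch as follows; this is a minimal illustration using the notation above, not the implementation used in the experiments:

```python
import math
import torch

def single_head_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k) -- linear projections of the input matrix
    d_k = K.size(-1)
    # pairwise association scores between words, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # attention distribution per word
    return torch.matmul(weights, V)           # weighted sum of the value vectors

Q = K = V = torch.randn(2, 10, 32)            # batch of 2, 10 words, d_k = 32
out = single_head_attention(Q, K, V)          # shape: (2, 10, 32)
```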

3. Baseline Model

The following three modifications are made to Transformer, and the modified model is used as the baseline model for this paper.

3.1. Activation Functions

In deep learning, there are many linear mapping operations such as matrix multiplication, and the role of the activation function is to add nonlinearity to the model.
ReLU [37,38], which is popular in deep learning, is the default activation function of Transformer. Compared with Sigmoid and Tanh, it is computationally simple and has no gradient saturation zone when the input is positive, but ReLU handles negative values too crudely and tends to produce neurons that are never updated during gradient updates. Leaky ReLU [39,40] improves on ReLU by handling negative values more gently, which alleviates the problem of silent neurons.
Trading off the computational cost against the proportion of silent neurons, Leaky ReLU was chosen to replace ReLU as the activation function.
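In PyTorch, the swap itself is a one-line change; a minimal sketch for illustration (the negative slope shown is PyTorch's default, not a value reported in this paper):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.5])
relu = nn.ReLU()                                # zeroes negative inputs entirely
leaky_relu = nn.LeakyReLU(negative_slope=0.01)  # keeps a small slope for negatives
print(relu(x))        # tensor([0.0000, 0.5000])
print(leaky_relu(x))  # tensor([-0.0200, 0.5000])
```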

3.2. Optimizer

Adam [41] is the default optimizer of Transformer, but Loshchilov and Hutter [42] found that mainstream machine learning frameworks have problems in their implementation of Adam with L2 regularization, so they improved it and proposed AdamW [42], which not only saves computation but also improves the training effect. AdamW was chosen as the optimizer in BERT [43]. Therefore, AdamW was also used as the optimizer for the baseline model in this paper.
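Switching from Adam to decoupled weight decay only changes the optimizer class; a sketch using the hyperparameter values reported later in Section 5.2 (the linear layer is a placeholder standing in for the full model):

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 256)   # placeholder module standing in for the NMT model
# AdamW applies weight decay directly to the weights instead of folding an L2
# term into the gradient, as the problematic Adam + L2 implementations do.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.02)
```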

3.3. Learning Rate Adjustment Strategy

Since the optimizer optimizes the model based on the loss function, the loss value may keep decreasing during training without a corresponding increase in accuracy. To cope with this situation, the learning rate adjustment strategy ReduceLROnPlateau was added to the baseline model to monitor the accuracy on the validation set and halve the learning rate if the accuracy stops increasing.
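A minimal sketch of this strategy in PyTorch, assuming a per-epoch validation accuracy is available; the patience value and the dummy metric are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(256, 256)   # placeholder standing in for the NMT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.02)
# mode="max": the monitored metric (validation accuracy) should be increasing;
# factor=0.5: halve the learning rate when the metric stops improving.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=1)

for epoch in range(40):
    val_acc = 0.50 + 0.001 * min(epoch, 10)  # stand-in for the real validation accuracy
    scheduler.step(val_acc)                  # halves the lr once val_acc plateaus
print(optimizer.param_groups[0]["lr"])       # learning rate after the plateau
```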

4. The Proposed Model

The self-attention mechanism defaults the distance between any two words in a sequence to one, which means that it initially treats the associations between words “equally” and must slowly distinguish “weak associations” from “strong associations” during training. $QK^T$ is a matrix that measures the degree of association between any two words in a sequence, and the magnitude of its element values reflects the strength of the association between words. Therefore, the learning effect of the self-attention mechanism can be enhanced by adjusting the $QK^T$ matrix.
Inspired by previous works on the improvement of the attention mechanism, this paper improves the baseline model’s encoder and decoder to better suit the MN-CH translation task. The improvements to the encoder and decoder are very similar in that they both add additional syntax-assisted learning units to enhance the ability of the attention mechanism to learn syntactic knowledge of the source language and target language. The difference is that the self-attention layer of each sublayer in the encoder adds a syntax-assisted learning unit M, which is a randomly initialized matrix that can update the gradient as the model is trained, while the improvement to the decoder is the addition of an MMHDPA layer incorporating syntactic knowledge of the target language after positional encoding. The structure of the improved model is shown in Figure 2.

4.1. Improvement of the Self-Attention Mechanism in the Encoder

The input matrix for each sublayer of the encoder is $X \in \mathbb{R}^{b \times l \times d_{model}}$, where $b$, $l$, and $d_{model}$ denote the batch size of the model, the maximum length of the sentence, and the dimension of the word embedding, respectively. The shape of $X$ is unchanged after the positional encoding information is added. Before the scaled dot-product attention is calculated, $X$ is multiplied by the weight matrices of four fully connected layers, $W_Q \in \mathbb{R}^{d_{model} \times d_{model}}$, $W_K \in \mathbb{R}^{d_{model} \times d_{model}}$, $W_V \in \mathbb{R}^{d_{model} \times d_{model}}$, and $W_M \in \mathbb{R}^{d_{model} \times (head\_size \times l)}$, yielding $Q \in \mathbb{R}^{b \times l \times d_{model}}$, $K \in \mathbb{R}^{b \times l \times d_{model}}$, $V \in \mathbb{R}^{b \times l \times d_{model}}$, and $M \in \mathbb{R}^{b \times l \times (head\_size \times l)}$, respectively. Then, $Q$, $K$, $V$, and $M$ are divided equally along the last dimension into $head\_size$ parts.
This yields $Q_i \in \mathbb{R}^{b \times l \times d_{model}/head\_size}$, $K_i \in \mathbb{R}^{b \times l \times d_{model}/head\_size}$, $V_i \in \mathbb{R}^{b \times l \times d_{model}/head\_size}$, and $M_i \in \mathbb{R}^{b \times l \times l}$, and the scaled dot-product attention for $head_i$ is calculated as follows:
$$\mathrm{Attention}(Q_i, K_i, V_i, M_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}} \odot M_i\right) V_i$$
where $Q_i K_i^{T} \in \mathbb{R}^{b \times l \times l}$ and $\odot$ denotes element-wise multiplication. The $M_i$ matrix assists the self-attention mechanism of the encoder sublayer in learning the correlations between words more fully.
The structure of the scaled dot product attention after adding the syntax-assisted learning unit M is shown in Figure 3a.
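A per-head sketch of the formula above, assuming $M_i$ has already been produced from the input via $W_M$ and the head split described earlier; padding masks and the multi-head merge are omitted:

```python
import math
import torch

def encoder_head_with_m(Q_i, K_i, V_i, M_i):
    # Q_i, K_i, V_i: (b, l, d_k); M_i: (b, l, l) learnable syntax-assisted unit
    d_k = K_i.size(-1)
    scores = torch.matmul(Q_i, K_i.transpose(-2, -1)) / math.sqrt(d_k)  # (b, l, l)
    scores = scores * M_i                 # element-wise modulation before softmax
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V_i)

b, l, d_k = 2, 10, 32
out = encoder_head_with_m(torch.randn(b, l, d_k), torch.randn(b, l, d_k),
                          torch.randn(b, l, d_k), torch.randn(b, l, l))
```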

4.2. Masked Multi-Head Dependency Parsing Attention (MMHDPA)

The multi-head attention after adding the DP matrix is called Masked Multi-Head Dependency Parsing Attention (MMHDPA), and MMHDPA is placed before the first sublayer of the decoder. In MMHDPA, the scaled dot product attention is calculated as follows:
$$\mathrm{Attention}(Q_i, K_i, V_i, DP) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}} + DP\right) V_i$$
where $DP$ denotes the Chinese dependency syntax matrix. The $DP$ matrix of each sentence is expanded to $l \times l$; this step belongs to the preprocessing stage, where the processed $DP$ matrices are saved to disk in batches in advance and then read during model training. The structure of MMHDPA is shown in Figure 3b.
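The corresponding per-head sketch for MMHDPA adds the dependency matrix as a bias before the softmax; this simplified illustration omits the padding and look-ahead masks implied by “Masked”:

```python
import math
import torch

def mmhdpa_head(Q_i, K_i, V_i, DP):
    # Q_i, K_i, V_i: (b, l, d_k); DP: (b, l, l) Chinese dependency syntax matrix
    d_k = K_i.size(-1)
    scores = torch.matmul(Q_i, K_i.transpose(-2, -1)) / math.sqrt(d_k)
    scores = scores + DP                  # additive bias toward dependent word pairs
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V_i)
```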

4.3. The Construction of the DP Matrix

The dependency syntactic relation is an asymmetrical relation in which the dominant word points to the subordinate word. Dependency syntactic analysis aims to transform the input sentence into a dependency tree in which any parent–child nodes are connected by dependency relations. Dependency syntactic relations can be represented in triadic form (relation, head, and dependent). Let the set of all word segmentations in a sentence be $V = \{w_1, w_2, \ldots, w_n\}$ and the set of all dependency relations be $R = \{(r, w_i, w_j) \mid i, j \in [1, n], i \neq j\}$, where $(r, w_i, w_j)$ denotes that $w_i$ governs $w_j$ through the relation $r$.
In a sentence, there is only one word that does not depend on other words, called the central word. The corresponding dependency relation of the central word is ROOT, and all other words are directly or indirectly subordinate to the central word.
The DP matrix is constructed on the basis of the results of the dependency syntactic analysis of Stanford NLP [44]. First, sentences are tokenized using Stanford CoreNLP, and then dependency syntactic parsing is performed. The result is a list of triples, and finally, dependencies are assigned to the single-word level.
The following is an example of how the DP matrix is constructed:
Input sentence: "我喜欢看书。你爱音乐。"
Stanford CoreNLP word segment results: ["我", "喜欢", "看书", "。", "你", "爱", "音乐", "。"]
Stanford CoreNLP dependency syntactic parsing results: [("ROOT", 0, 2), ("nsubj", 2, 1), ("ccomp", 2, 3), ("punct", 2, 4), ("ROOT", 0, 2), ("nsubj", 2, 1), ("dobj", 2, 3), ("punct", 2, 4)]. The dependency tree is shown in Figure 4.
A sentence may contain multiple clauses, and Stanford CoreNLP handles each clause separately, so the results need to be reordered and combined into a single sentence. The result after reordering was as follows: [("ROOT", 0, 2), ("nsubj", 2, 1), ("ccomp", 2, 3), ("punct", 2, 4), ("ROOT", 0, 6), ("nsubj", 6, 5), ("dobj", 6, 7), ("punct", 6, 8)].
When extracting the DP matrix from the dependency relations, the specific relation labels do not matter. It is only necessary to set the coordinates corresponding to the two single words between which a dependency exists to one (except for the ROOT relation).
The construction process of the matrix DP is shown in Figure 5. The DP matrix extracted from the example sentence is shown in Figure 6.
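The construction can be sketched as follows. This is a simplified illustration: it assumes the reordered (relation, head, dependent) triples have already been obtained, it works at the token level rather than propagating relations down to single characters, and whether both (i, j) and (j, i) are set to one follows Figure 6; the sketch sets both.

```python
import numpy as np

def build_dp_matrix(triples, max_len):
    """Build a max_len x max_len DP matrix from (relation, head, dependent)
    triples with 1-based word indices produced by dependency parsing."""
    dp = np.zeros((max_len, max_len), dtype=np.float32)
    for relation, head, dependent in triples:
        if relation == "ROOT":             # the central word depends on nothing
            continue
        dp[head - 1, dependent - 1] = 1.0  # mark the dependent word pair
        dp[dependent - 1, head - 1] = 1.0
    return dp

# reordered triples for the example sentence above
triples = [("ROOT", 0, 2), ("nsubj", 2, 1), ("ccomp", 2, 3), ("punct", 2, 4),
           ("ROOT", 0, 6), ("nsubj", 6, 5), ("dobj", 6, 7), ("punct", 6, 8)]
dp = build_dp_matrix(triples, max_len=8)
```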

5. Experiment

5.1. Word Segmentation

Zipf’s Law [45] states that there is an inverse relationship between the frequency of a word and its frequency ranking in a large corpus. In 2020, China’s State Language and Script Work Committee analyzed a news corpus of 888,448,937 Chinese characters containing 10,970 different Chinese words, of which 2247 high-frequency words accounted for 99% of the entire corpus (http://qrcode.cp.cn/qr_code.php?id=qsxb3615t4ycdvrsgi8w3lxcb1x7n0v9, accessed on 5 August 2022).
The NMT model is constrained by objective conditions during training, and its model scale should be controlled. Parameters directly related to the model scale are the word embedding dimension, sentence length, corpus lexicon size, etc. Li et al. [46] discussed whether word segmentation is necessary for deep learning of Chinese representation, finding that models based on char-level tokenization performed better than models based on word-level tokenization. Their conclusion was that word-based models are more susceptible to data sparsity and out-of-vocabulary words, resulting in models prone to overfitting and limiting the generalization ability of the model.
Building upon these considerations, this paper splits Chinese sentences into single characters, while Traditional Mongolian words are naturally separated by spaces. In addition, the corpus contains a large number of numbers and English words, which would bring a lot of noise to the model, so the numbers and English words are also split into individual symbols.
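A minimal sketch of this character-level splitting (illustrative only; the actual preprocessing scripts are in the repository linked above):

```python
def split_to_chars(sentence):
    """Split a Chinese sentence into single characters; digits and Latin
    letters are likewise broken into individual symbols."""
    return [ch for ch in sentence if not ch.isspace()]

print(split_to_chars("我喜欢看书。你爱音乐。"))
# ['我', '喜', '欢', '看', '书', '。', '你', '爱', '音', '乐', '。']
```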

5.2. Datasets and Settings

Datasets: The MN-CH parallel corpus used in this paper contained a wide range of proper nouns or phrases, literary works, computer-related terms, news reports, hiatus and slang, company or organization names, daily chats, etc. We randomly shuffled the corpus after word segmentation and saved sentences with a length of 8–100. The total number of sentences in the filtered MN-CH parallel corpus was 498,921, and the training set, validation set, and test set were divided by a ratio of approximately 0.96:0.02:0.02. The number of sentences in each dataset is shown in Table 1.
Settings: The model training was run under the following experimental conditions. The server had an Intel(R) Xeon(R) Gold 6130 CPU and two Nvidia Tesla P100 graphics cards, each with 12,198 MB of memory, and the operating system was Ubuntu 16.04.6. The main packages in the Python virtual environment for the experiments were Python 3.6.13, PyTorch 1.7 (CUDA 10.1), torchtext 0.6.0, scikit-learn 0.24.2, sacrebleu 2.0.0, pandas 1.1.5, numpy 1.19.2, matplotlib 3.2.2, and stanfordcorenlp 3.9.1.1.
The hyperparameters of the model were set as follows: word embedding dimension $d_{model} = 256$, feedforward fully connected layer dimension of 2048, $head\_size = 8$, $dropout\_rate = 0.2$, $batch\_size = 120$, and $max\_length = 100$; AdamW was set with an initial learning rate of 0.0001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$, and $\lambda = 0.02$.
To exclude randomness of the experimental results, all models were trained on the network parameters corresponding to five random seeds, and the average BLEU value was taken as the evaluation metric.
The Chinese DP matrix was only used during training. The validation set and test set did not provide any syntactic information, and the DP matrix extracted from the training set was about 18.3 GB.

5.3. Experimental Results and Discussion

In this paper, ablation experiments were conducted to compare the impact of each change, involving five models named Transformer_Base, Transformer_Enc_Imp, Transformer_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, and Transformer_Enc_Dec_Imp + Syn, where Enc_Imp, Dec_Imp, and Enc_Dec_Imp indicate encoder improvement, decoder improvement, and both improvements in the baseline model (Transformer_Base), respectively, and Syn indicates the knowledge of Chinese dependent syntax.

5.3.1. Experimental Results of the Six-Layer Models

In this section, the number of sublayers N for both the encoder and decoder was set to six and the number of training rounds (epochs) was 40. The number of parameters for each model is shown in Table 2.
The number of parameters for all 5 models was around 50 million, with Transformer_Enc_Dec_Imp and Transformer_Enc_Dec_Imp + Syn having the largest number of parameters, increasing by 1,521,440 compared with the baseline model (1,258,272 + 263,168 = 1,521,440). The number of parameters for Transformer_Enc_Imp and Transformer_Dec_Imp increased by 1,258,272 and 263,168, respectively, compared with the baseline model, accounting for 82.7% and 17.3% of the total increase in parameters, respectively.
The cumulative four-gram BLEU [47] score was used as the primary evaluation metric, computed with the corpus_bleu() function in sacrebleu with smooth_method = “none”. It is worth pointing out that the BLEU value can vary considerably depending on the choice of smoothing function, so multi-bleu-detok.perl was also used in the BLEU evaluation to verify the consistency of the results between the two tools. The BLEU-4 values for each model are shown in Table 3.
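For reference, a sketch of the sacrebleu scoring call used in this setup (the sentences are illustrative, not from the corpus; corpus_bleu() takes a list of hypothesis strings and a list of reference streams):

```python
import sacrebleu

hypotheses = ["今 天 天 气 很 好", "他 在 看 书"]       # model outputs (illustrative)
references = [["今 天 天 气 很 好", "他 正 在 看 书"]]  # one reference stream, same order

bleu = sacrebleu.corpus_bleu(hypotheses, references, smooth_method="none")
print(bleu.score)   # cumulative 4-gram BLEU without smoothing
```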
It can be seen that the average BLEU value of Transformer_Enc_Dec_Imp was 3.167 higher than that of the baseline model, which indicates that the improved structure was effective. The average BLEU of Transformer_Enc_Dec_Imp + Syn was 3.539 higher than that of Transformer_Enc_Dec_Imp, demonstrating that the addition of dependency syntactic information could improve the quality of translation.
However, two unexpected results were found:
  • Transformer_Enc_Dec_Imp + Syn did not have the highest average BLEU score, being 1.161 lower than that of Transformer_Enc_Imp;
  • Transformer_Dec_Imp + Syn performed the worst in the experiment and had lower BLEU values than the baseline model for each random seed, obtaining the two lowest scores in all the experimental results.
The analysis identified the two most likely causes of the above phenomena:
  • The increase in parameters was distributed very unevenly between the improved encoder and the improved decoder, so the decoder side did not have enough neural network units to fit the additional information introduced by the dependency syntax matrix;
  • The information extraction of the Chinese dependency syntax by MMHDPA was not sufficient, and the structure of MMHDPA should continue to be improved to better learn the target language’s grammar.

5.3.2. Experimental Results of the Five-Layer Models

Experiments were conducted under the conditions of N = 5 and epochs = 35 to verify whether the above interpretations were correct. The parameter counts for the five-layer models are shown in Table 4, and the experimental results for each model are shown in Table 5.
The values in parentheses in Table 5 indicate the changes relative to Table 3. The poorer results of the five-layer models were to be expected, as the total number of model parameters and training rounds were reduced. The exception is Transformer_Dec_Imp + Syn: the six-layer model achieved an abnormally low score with the network parameters corresponding to random seed 256, which pulled down its average BLEU value and made the five-layer model appear better; this should be regarded as an anomaly.
Both experiments yield the same ordering of the models by average BLEU value, in descending order: Transformer_Enc_Imp, Transformer_Enc_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, Transformer_Base, and Transformer_Dec_Imp + Syn.
The loss curves and accuracy curves of the five models also corroborate this conclusion. As can be seen in Figure 7a–d and Figure 8a,b, the five models have the same ranking in these curves, corresponding to the order of their average BLEU scores. However, in Figure 8c,d, the order of the Transformer_Enc_Dec_Imp + Syn and Transformer_Enc_Dec_Imp curves is reversed; in terms of the average BLEU score, Transformer_Enc_Dec_Imp + Syn still won slightly by 0.849, and the more plausible explanation is that Transformer_Enc_Dec_Imp + Syn had better generalization performance.
More importantly, the experiments with the five-layer model confirmed the conjectures made in Section 5.3.1, which pointed the way to the next improvements.

6. Integration of Primer-EZ Methods Based on Improved Models

By analyzing the experimental results in the previous section, it can be concluded that the decoder had an insufficient ability to learn dependency syntactic information. The self-attention mechanism is inferior to RNNs and CNNs in local feature extraction, but an RNN would hurt the parallel computing ability of Transformer and greatly prolong the training time. Therefore, a CNN was added to MMHDPA to enhance the model’s ability to learn dependency syntactic information.

6.1. Primer-EZ

Coincidentally, in September 2021, Google proposed a search framework designed to automatically search for efficient Transformer variants. They named the discovered architecture Primer [48], and experiments showed that Primer outperformed Transformer in most tasks given the same training time. We therefore build on their work and integrate Primer into our improved model.
There are two improvements in Primer that are generic and easily portable, and the Primer containing these two improvements is called Primer-EZ. The structure of Primer-EZ is shown in Figure 9.
In this paper, Primer-EZ was implemented in PyTorch 1.7. Since the activation function used in all models was Leaky ReLU, the square of Leaky ReLU was taken. Limited by the experimental hardware conditions, the CNN layer was only used after Q, K, and V in the MMHDPA layer, and Q, K, and V shared one convolutional layer, which increased the number of neurons by 512. In Transformer_Base and Transformer_Enc_Imp, only the activation function was modified to Squared Leaky ReLU, while in the remaining three models, both improvements were applied.
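A sketch of the two components as adapted here: a squared Leaky ReLU module and a single convolution over the sequence dimension shared by Q, K, and V. The kernel size and the depthwise grouping are assumptions (Primer itself uses 3 × 1 depthwise convolutions per head), and a causal, left-padded variant would be needed to preserve strictly autoregressive decoding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredLeakyReLU(nn.Module):
    """Leaky ReLU followed by squaring, mirroring Primer's squared ReLU."""
    def forward(self, x):
        return F.leaky_relu(x) ** 2

class SharedQKVConv(nn.Module):
    """One depthwise-style convolution along the sequence axis, applied to
    Q, K, and V with shared weights (kernel size is an assumption)."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

conv = SharedQKVConv(d_model=256)
q = k = v = torch.randn(2, 10, 256)
q, k, v = conv(q), conv(k), conv(v)          # the same layer is reused for Q, K, V
```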

6.2. Experimental Results and Discussion

The experimental conditions in this section are the same as in Section 5.3.2, where N = 5, epochs = 35, and the average BLEU value results are shown in Table 6. Table 7 shows a comparison of the results of all the experiments.
As we can see, the average BLEU values of Transformer_Base and Transformer_Enc_Imp were higher after adding Squared Leaky ReLU, indicating that Squared Leaky ReLU can speed up the convergence of the model and slightly improve the translation quality.
Transformer_Dec_Imp + Syn + EZ, Transformer_Enc_Dec_Imp + EZ, and Transformer_Enc_Dec_Imp + Syn + EZ achieved their highest average BLEU scores even though the number of model parameters and training rounds was reduced, indicating that the improved model with a fused CNN layer can significantly improve translation quality. The addition of the CNN layer, on the one hand, made up for the insufficient number of neurons and insufficient fitting ability on the original decoder side; on the other hand, passing Q, K, and V through the CNN layer enhanced the extraction of local features, making it easier for the MMHDPA layer to judge the strength of association between words.

6.3. Comparison with Other Methods

In the following, we compare our approach with other MN-CH machine translation methods in order to give the reader a general idea of the performance level of the methods proposed in this paper. The results of the comparison are shown in Table 8. Unfortunately, there is not yet an accepted SOTA model in the field of MN-CH machine translation, and the MN-CH datasets used in different papers are not uniform, so this comparison is only indicative and the task of low-resource language translation still has a long way to go. Hopefully, more researchers will participate in this field in the future.

7. Conclusions and Future Work

In order to improve the translation quality of MN-CH NMT on a low-resource corpus, we explored ways to enhance the syntactic learning ability of the attention mechanism and designed a translation model that learns syntactic knowledge at both ends. A syntax-assisted learning unit M was added to the self-attention layer of each sublayer of the encoder to implicitly enhance the encoder’s ability to capture syntactic information in Traditional Mongolian. MMHDPA was added before the first sublayer of the decoder, and Chinese dependency syntactic knowledge was added to the attention mechanism of MMHDPA to explicitly learn Chinese syntax.
The experiments show that Transformer_Enc_Dec_Imp and Transformer_Enc_Dec_Imp + Syn both had higher average BLEU scores than the baseline model under the same training conditions, and Transformer_Enc_Dec_Imp + Syn outperformed Transformer_Enc_Dec_Imp, which suggests the proposed improved model structure can effectively improve the translation quality even in the absence of external knowledge. The improvement was more obvious with the addition of external knowledge. Both the proposed model structure and external syntactic knowledge can improve the quality of translation.
The analysis revealed that the decoder side suffered from an insufficient fitting ability and that MMHDPA did not sufficiently learn Chinese dependency syntactic information. Therefore, the Primer-EZ method was incorporated into the existing model. In Transformer_Base and Transformer_Enc_Imp, only the activation functions were replaced with Squared Leaky ReLU. Transformer_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, and Transformer_Enc_Dec_Imp + Syn all used Squared Leaky ReLU as the activation function and also added a shared convolution layer after the Q, K, and V of the MMHDPA layer to enhance local feature capture.
The final improved model, Transformer_Enc_Dec_Imp + Syn + EZ (Both), showed a significant improvement of 9.113 (45.634 − 36.521 = 9.113) in the average BLEU value compared with the baseline model for experimental conditions of N = 5 and epochs = 35.
The conclusion can be drawn that Squared Leaky ReLU can accelerate the convergence speed of the model, and a CNN can significantly enhance the local feature capture of Chinese syntax. Both improvements can improve the quality of Mongolian-Chinese NMT translations.
Overall, our proposed method was validated in the MN-CH parallel corpus and achieved decent results.
For future work, firstly, other ways of constructing dependency syntactic matrices were not explored in this paper due to time constraints, and it is hoped that more constructions will be attempted. Secondly, an improved encoder self-attention mechanism could be used to perform grammar induction for low-resource languages such as Mongolian. Grammar induction refers to the model automatically learning grammatical structures from the corpus without any expert-annotated syntactic knowledge. One possible solution is for the self-attention mechanism to examine the relationships between word segments from a global perspective, strengthen those relationships, and use them to delineate the structure of sentences; the multiple encoder sublayers of Transformer could then be used to delineate the hierarchical structure of sentences.

Author Contributions

Conceptualization, R.Q.-d.-e.-j.; methodology, R.Q.-d.-e.-j. and K.C.; software, K.C.; validation, K.C.; formal analysis, K.C.; investigation, K.C.; resources, R.Q.-d.-e.-j.; data curation, K.C.; writing—original draft preparation, R.Q.-d.-e.-j., K.C. and R.P.; writing—review and editing, K.C. and R.P.; visualization, K.C.; supervision, R.Q.-d.-e.-j. and R.P.; project administration, R.Q.-d.-e.-j. and K.C.; funding acquisition, R.Q.-d.-e.-j. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61966027, 61966028, and 62141603, the Natural Science Foundation of Inner Mongolia Autonomous Region, grant numbers 2022MS06013 and 2020MS07006, and the Fundamental Research Funds for universities directly under the Inner Mongolia Autonomous Region: JY20220122.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is private, but we have made the full code and the trained model publicly available. The source code and experimental results are publicly available at: https://github.com/CK-IMUT-501/applsci-12-10074 (accessed on 9 September 2022).

Acknowledgments

We would like to thank the anonymous reviewers for their kind and constructive comments, and Elvis Saravia for providing a drawing template for the Transformer model (https://github.com/dair-ai/ml-visuals) (accessed on 9 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Neco, R.P.; Forcada, M.L. Asynchronous translations with recurrent neural nets. In Proceedings of the International Conference on Neural Networks (ICNN’97), Houston, TX, USA, 12 June 1997; pp. 2535–2540. [Google Scholar]
  2. Castano, A.; Casacuberta, F. A connectionist approach to machine translation. In Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, 22–25 September 1997; pp. 91–94. [Google Scholar]
  3. Forcada, M.L.; Ñeco, R.P. Recursive hetero-associative memories for translation. In Proceedings of the International Work-Conference on Artificial Neural Networks, Berlin, Germany, 4–6 June 1997; pp. 453–462. [Google Scholar]
  4. Pollack, J.B. Recursive distributed representations. Artif. Intell. 1990, 46, 142–149. [Google Scholar] [CrossRef]
  5. Chrisman, L. Learning recursive distributed representations for holistic computation. Connect. Sci. 1991, 3, 345–366. [Google Scholar] [CrossRef]
  6. Hinton, G.E. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, USA, 15–17 August 1986; pp. 1–12. [Google Scholar]
  7. Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 2003, 3, 1137–1155. [Google Scholar]
  8. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013; pp. 1–12. [Google Scholar]
  9. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Stateline, NV, USA, 5–10 December 2013; pp. 3111–3119. [Google Scholar]
  10. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7 May 2015. [Google Scholar]
  11. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing EMNLP, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar]
  12. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  13. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1243–1252. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 2017 Conference on Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  15. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar]
  16. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. arXiv 2021, arXiv:211106091. [Google Scholar]
  17. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. arXiv 2020, arXiv:2009.06732. [Google Scholar] [CrossRef]
  18. Susanto, R.H.; Chollampatt, S.; Tan, L. Lexically Constrained Neural Machine Translation with Levenshtein Transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020; pp. 3536–3543. [Google Scholar]
  19. Chen, K.; Wang, R.; Utiyama, M.; Sumita, E. Content word aware neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020; pp. 358–364. [Google Scholar]
  20. Yang, J.; Ma, S.; Zhang, D.; Li, Z.; Zhou, M. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020; pp. 5979–5989. [Google Scholar]
  21. Zheng, Z.; Huang, S.; Tu, Z.; Dai, X.; Chen, J. Dynamic Past and Future for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3 November 2019; pp. 931–941. [Google Scholar]
  22. Li, F.; Zhu, J.; Yan, H.; Zhang, Z. Grammatically Derived Factual Relation Augmented Neural Machine Translation. Appl. Sci. 2022, 12, 6518. [Google Scholar] [CrossRef]
  23. Shi, L.; Niu, C.; Zhou, M.; Gao, J. A DOM tree alignment model for mining parallel data from the web. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 489–496. [Google Scholar]
  24. Sennrich, R.; Haddow, B. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, 11–12 August 2016; pp. 83–91. [Google Scholar]
  25. Eriguchi, A.; Tsuruoka, Y.; Cho, K. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 72–78. [Google Scholar]
  26. Chen, H.; Huang, S.; Chiang, D.; Chen, J. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1936–1945. [Google Scholar]
  27. Chen, K.; Wang, R.; Utiyama, M.; Sumita, E.; Zhao, T. Syntax-directed attention for neural machine translation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 4792–4799. [Google Scholar]
  28. Yang, B.; Tu, Z.; Wong, D.F.; Meng, F.; Chao, L.S.; Zhang, T. Modeling Localness for Self-Attention Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4449–4458. [Google Scholar]
  29. Bugliarello, E.; Okazaki, N. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020; pp. 1618–1627. [Google Scholar]
  30. Wu, J.; Hou, H.X.; Monghjaya, M.; Bao, F.L.; Xie, C.J. Introduction of Traditional Mongolian-Chinese Machine Translation. In Proceedings of the 2015 International Conference on Electrical, Automation and Mechanical Engineering, Phuket, Thailand, 26–27 July 2015; pp. 357–360. [Google Scholar]
  31. Wu, J.; Hou, H.; Bao, F.; Jiang, Y. Template-Based Model for Mongolian-Chinese Machine Translation. J. Adv. Comput. Intell. Intell. Inform. 2016, 20, 893–901. [Google Scholar] [CrossRef]
  32. Wang, S.R.G.L.; Si, Q.T.; Nasun, U. The Research on Reordering Rule of Chinese-Mongolian Statistical Machine Translation. Adv. Mater. Res. 2011, 268, 2185–2190. [Google Scholar] [CrossRef]
  33. Yang, Z.; Li, M.; Chen, L.; Wei, L.; Wu, J.; Chen, S. Constructing morpheme-based translation model for Mongolian-Chinese SMT. In Proceedings of the 2015 International Conference on Asian Language Processing (IALP), Suzhou, China, 24 October 2015; pp. 25–28. [Google Scholar]
  34. Wu, J.; Hou, H.; Shen, Z.; Du, J.; Li, J. Adapting attention-based neural network to low-resource Mongolian-Chinese machine translation. In Proceedings of the Natural Language Understanding and Intelligent Applications, Kunming, China, 2–6 December 2016; pp. 470–480. [Google Scholar]
  35. Li, H.; Hou, H.; Wu, N.; Jia, X.; Chang, X. Semantically Constrained Document-Level Chinese-Mongolian Neural Machine Translation. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
  36. Ji, Y.; Hou, H.; Chen, J.; Wu, N. Adversarial training for unknown word problems in neural machine translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2019, 19, 1–12. [Google Scholar] [CrossRef] [Green Version]
  37. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  38. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  39. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30. [Google Scholar]
  40. Xu, J.; Li, Z.; Du, B.; Zhang, M.; Liu, J. Reluplex made more practical: Leaky ReLU. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7. [Google Scholar]
  41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  42. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. 2018. Available online: https://arxiv.org/abs/1711.05101v2 (accessed on 30 August 2022).
  43. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  44. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The stanford corenlp natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 22 June 2014; pp. 55–60. [Google Scholar]
  45. Zipf, G.K. Human Behaviour and the Principle of Least-Effort: An Introduction to Human Ecology; Martino Fine Books: Eastford, CT, USA, 1949. [Google Scholar]
  46. Li, X.; Meng, Y.; Sun, X.; Han, Q.; Yuan, A.; Li, J. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 3242–3252. [Google Scholar]
  47. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  48. So, D.; Mańke, W.; Liu, H.; Dai, Z.; Shazeer, N.; Le, Q.V. Primer: Searching for Efficient Transformers for Language Modeling. In Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Online, 6–14 December 2021; pp. 6010–6022. [Google Scholar]
Figure 1. Construction of the Transformer model.
Figure 2. The structure of the proposed model.
Figure 3. Improvements to the attention mechanism. (a) Improvement of self-attention mechanism in each encoder sublayer. (b) The structure of MMHDPA.
Figure 4. The dependency tree for the Chinese example sentence. At the bottom of the picture are the word segmentations of the example sentence; the abbreviation above each arrow represents the dependency relation between the two words.
Figure 5. Dependency syntactic parsing and processing of the example sentence. The figure shows the Chinese example sentence going through word segmentation, dependency syntactic parsing, clause reordering, and finally the assignment of dependency relations down to the single-word level.
Figure 6. Dependency syntax matrix. The figure shows the rules for the construction of the DP matrix: the coordinates corresponding to the two words for which there is a dependency are set to one.
Figure 7. Loss curves and accuracy curves of six-layer models on training set and validation set. (a) Training set loss curve. (b) Validation set loss curve. (c) Training set accuracy curve. (d) Validation set accuracy curve.
Figure 8. Loss curves and accuracy curves of five-layer models on training set and validation set. (a) Training set loss curve. (b) Validation set loss curve. (c) Training set accuracy curve. (d) Validation set accuracy curve.
Figure 9. The structure of the Primer-EZ. (a) Multi-DConv-Head Attention (MDHA). (b) Squared ReLU.
Table 1. Number of sentences in Traditional Mongolian-Chinese dataset.

Corpus            MN-CH
Training Set      479,163
Validation Set    9979
Test Set          9779
Table 2. The number of parameters for each six-layer model.

Model                            Number of Parameters
Transformer_Base                 48,787,909
Transformer_Enc_Imp              50,046,181
Transformer_Dec_Imp + Syn        49,051,077
Transformer_Enc_Dec_Imp          50,309,349
Transformer_Enc_Dec_Imp + Syn    50,309,349
Table 3. BLEU evaluation of six-layer models on different random seeds.

Model                            Seed 1    Seed 2    Seed 3    Seed 4    Seed 5    Avg. BLEU
Transformer_Base                 38.377    38.280    38.566    38.362    38.588    38.435
Transformer_Enc_Imp              46.158    45.661    46.994    46.161    46.535    46.302
Transformer_Dec_Imp + Syn        37.183    28.481    36.939    35.264    16.351    30.844
Transformer_Enc_Dec_Imp          41.337    44.774    35.002    45.556    41.341    41.602
Transformer_Enc_Dec_Imp + Syn    46.195    44.892    45.273    46.170    43.176    45.141
Table 4. The number of parameters for each five-layer model.

Model                            Number of Parameters
Transformer_Base                 45,894,085
Transformer_Enc_Imp              46,942,645
Transformer_Dec_Imp + Syn        46,157,253
Transformer_Enc_Dec_Imp          47,205,813
Transformer_Enc_Dec_Imp + Syn    47,205,813
Table 5. BLEU evaluation of five-layer models on different random seeds.

Model                            Seed 1    Seed 2    Seed 3    Seed 4    Seed 5    Avg. BLEU
Transformer_Base                 36.575    36.686    36.533    36.711    36.099    36.521 (↓1.914)
Transformer_Enc_Imp              44.049    43.758    44.765    44.797    44.610    44.396 (↓1.906)
Transformer_Dec_Imp + Syn        36.244    36.374    25.571    35.433    35.382    33.801 (↑2.957)
Transformer_Enc_Dec_Imp          41.068    42.649    36.973    43.258    32.996    39.389 (↓2.213)
Transformer_Enc_Dec_Imp + Syn    45.758    44.197    37.822    37.973    43.901    41.930 (↓3.211)
Table 6. BLEU evaluation of five-layer models with fused Primer-EZ on different random seeds.

Model                                            Seed 1    Seed 2    Seed 3    Seed 4    Seed 5    Avg. BLEU
Transformer_Base + EZ (Squared Leaky ReLU)       37.275    37.607    37.135    37.409    37.021    37.289
Transformer_Enc_Imp + EZ (Squared Leaky ReLU)    43.866    44.795    45.236    45.251    44.933    44.816
Transformer_Dec_Imp + Syn + EZ (Both)            37.556    37.183    37.622    37.283    36.102    37.149
Transformer_Enc_Dec_Imp + EZ (Both)              45.966    46.005    44.840    44.722    45.698    45.446
Transformer_Enc_Dec_Imp + Syn + EZ (Both)        45.569    46.081    45.597    45.264    45.658    45.634
Table 7. The Traditional Mongolian-Chinese translation results.

Model                                            Conditions            Avg. BLEU
Transformer_Base                                 N = 6, EPOCHS = 40    38.435
Transformer_Base                                 N = 5, EPOCHS = 35    36.521
Transformer_Base + EZ (Squared Leaky ReLU)       N = 5, EPOCHS = 35    37.289
Transformer_Enc_Imp                              N = 6, EPOCHS = 40    46.302
Transformer_Enc_Imp                              N = 5, EPOCHS = 35    44.396
Transformer_Enc_Imp + EZ (Squared Leaky ReLU)    N = 5, EPOCHS = 35    44.816
Transformer_Dec_Imp + Syn                        N = 6, EPOCHS = 40    30.844
Transformer_Dec_Imp + Syn                        N = 5, EPOCHS = 35    33.801
Transformer_Dec_Imp + Syn + EZ (Both)            N = 5, EPOCHS = 35    37.149
Transformer_Enc_Dec_Imp                          N = 6, EPOCHS = 40    41.602
Transformer_Enc_Dec_Imp                          N = 5, EPOCHS = 35    39.389
Transformer_Enc_Dec_Imp + EZ (Both)              N = 5, EPOCHS = 35    45.446
Transformer_Enc_Dec_Imp + Syn                    N = 6, EPOCHS = 40    45.141
Transformer_Enc_Dec_Imp + Syn                    N = 5, EPOCHS = 35    41.930
Transformer_Enc_Dec_Imp + Syn + EZ (Both)        N = 5, EPOCHS = 35    45.634
Table 8. Performance of different Traditional Mongolian-Chinese machine translation approaches on various datasets.

Model             BLEU      Training Set    Validation Set    Test Set    Max. Sentence Length
Wu et al. [31]    21.09     58,900          1000              1000        -
Wu et al. [34]    31.18     63,000          1000              1000        50
Li et al. [35]    19.4      421             52                52          -
Our Approach      45.634    479,163         9979              9779        100
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
