Research on Traditional Mongolian-Chinese Neural Machine Translation Based on Dependency Syntactic Information and Transformer Model

Abstract: Neural machine translation (NMT) is a data-driven machine translation approach that has proven its superiority on large corpora, but it still has much room for improvement when corpus resources are not abundant. This work aims to improve the quality of Traditional Mongolian-Chinese (MN-CH) translation. First, a baseline model is constructed based on the Transformer model, and then two different syntax-assisted learning units are added to the encoder and decoder. The encoder's ability to learn Traditional Mongolian syntax is implicitly strengthened, and knowledge of Chinese dependency syntax is taken as prior knowledge to explicitly guide the decoder in learning Chinese syntax. The average BLEU values measured under two experimental conditions showed that the proposed improved model outperformed the baseline model by 6.706 (45.141 − 38.435) and 5.409 (41.930 − 36.521). Analysis of the experimental results also revealed that the improved model was still deficient in learning Chinese syntax, so the Primer-EZ method was introduced to ameliorate this problem, leading to faster convergence and better translation quality. The final improved model achieved an average BLEU increase of 9.113 (45.634 − 36.521) over the baseline model under the experimental conditions N = 5 and epochs = 35. The experiments showed that both the proposed model architecture and the prior knowledge could effectively increase the BLEU value, and that the added syntax-assisted learning units not only corrected the initial word associations but also alleviated the long-term dependence between words.


Introduction
NMT has been developed over the course of decades and has become the dominant approach in the field of machine translation. The origins of NMT can be traced back to the 1980s, when researchers proposed neural network-based approaches to machine translation [1,2] and encoder-decoder architectures [3][4][5], but they did not attract much attention. With the successful application of deep learning and distributed word vectors in the field of natural language processing (NLP) [6][7][8][9], NMT began to show its potential.
Among NMT methods, NMT based on an attention mechanism is particularly prominent. Bahdanau et al. [10] combined the attention mechanism with an NMT model based on an encoder-decoder architecture, where the decoder can use the output vectors from the full time step of the encoder and automatically adjust the weights of these vectors. Luong et al. [11] introduced the local attention mechanism based on the study by Bahdanau et al. The local attention mechanism focuses on a subset of the encoder's full-time-step output vectors to reduce the time overhead while ensuring the translation quality. The entry of large research institutions such as Google [12] and Facebook [13] has accelerated the development of NMT.
In 2017, Google Brain proposed the Transformer model [14], which is entirely based on the attention mechanism, and it quickly became the benchmark model due to its excellent performance. Variants based on the Transformer model have proliferated and become important models for various tasks in the field of NLP [15][16][17]. In the self-attention mechanism, QK^T represents the degree of association between any two words in the sequence: the larger the dot product of two word vectors, the stronger the association between the corresponding two words.
However, the Transformer model is subject to the same constraints as other NMT models: computing power, model structure, and the amount of corpus resources. Computer hardware has reached a certain level, and some effective structures and tricks have been proposed, but establishing a corpus requires a great deal of manual effort and time, which is a major problem limiting the translation quality of NMT. Therefore, NMT in low-resource settings needs more research and attention.
Prior knowledge has been proven to improve the quality of machine translation [18][19][20][21][22], which makes it a worthy research direction for low-resource language machine translation. NMT assumes that syntactic knowledge can be learned automatically during training [11]. Studies have shown that this level of learning is still not enough to capture deep syntactic details and that additional syntactic information can improve translation performance [23][24][25][26][27]. Syntactic structure arranges words into a meaningful sentence, and incorporating syntactic knowledge into the NMT model helps to eliminate ambiguity. Syntactic analysis is generally divided into constituent syntactic parsing and dependency syntactic parsing. Constituent syntactic parsing determines the lexical components and phrase structure of a sentence, while dependency syntactic parsing identifies the dependency relations between words. Since the self-attention mechanism is concerned with associations between words, dependency syntactic relations were chosen as the external linguistic knowledge in this paper.
In fact, there have been some attempts to improve the Transformer model's attention mechanism. Yang et al. [28] proposed adding a learnable Gaussian bias G to the self-attention mechanism. The center location and window size of the Gaussian bias are determined by the model autonomously, and the bias is incorporated into the original attention distribution to form a new, modified distribution with which the attention mechanism can better learn local contextual information. Bugliarello et al. [29] proposed strengthening the self-attention mechanism by multiplying QK^T by a matrix of dependency syntactic information, calling the improved attention head PASCAL. Both works aim to enhance the model's ability to learn local information, which may weaken the model's ability to capture long-range dependencies.
This research also aims to improve translation quality by improving the attention mechanism. We propose a translation model that learns syntactic knowledge on both the encoder and decoder sides. A syntax-assisted learning unit M is added to the encoder's self-attention mechanism to implicitly enhance the encoder's learning of Traditional Mongolian syntax, and a Masked Multi-Head Dependency Parsing Attention (MMHDPA) layer is added to the front end of the decoder, which explicitly enhances the decoder's learning of Chinese syntax using the dependency syntax (DP) matrix extracted from Chinese dependency parses. A CNN is also integrated into the MMHDPA layer, making it easier for the attention mechanism to determine the weighting relationships between words. The proposed model has the following features. The source code and experimental results are publicly available at https://github.com/CK-IMUT-501/applsci-1927936 (accessed on 9 September 2022):

• The syntax-assisted learning unit M in the encoder balances local and global information learning and is used in each encoder sublayer, expanding the hypothesis space of the model;
• The addition of the Chinese dependency syntax matrix (DP) to the decoder provides an intuitive, effective, and interpretable result;
• The combination of CNNs and MMHDPA enhances translation performance with only a few additional parameters.

Traditional Mongolian-Chinese Machine Translation
Language is part of culture and not only assumes the role of enhancing national unity, strengthening identity, and passing on culture, but it also conveys information about customs, values, and social attitudes to the outside world. Traditional Mongolian, the official script of the Mongolian language in the Inner Mongolia Autonomous Region, is an agglutinative language; its word formation and morphology are accomplished by attaching different affixes to the roots and stems of words, which results in complex grammatical forms and a large vocabulary [30,31]. However, high-quality MN-CH parallel corpora are still very scarce, which poses difficulties for machine translation.
After reviewing a large amount of research, the development trend of MN-CH machine translation is summarized as follows.
Before 2017, research on MN-CH machine translation focused on example-based machine translation (EBMT), rule-based machine translation (RBMT), and statistical machine translation (SMT), and the research methods during this period mainly included multi-strategy translation systems, SMT systems based on phrases or hierarchical phrases, and MT systems combining the characteristics of Mongolian words. In order to reduce the large number of word order errors in Chinese-Mongolian SMT, Wang et al. [32] proposed a Chinese sentence reordering method based on Mongolian word order. To alleviate data sparsity and linguistic morphological differences in MN-CH statistical machine translation, Yang et al. [33] proposed a method for constructing a lexeme-based translation model, with Mongolian lexemes as the central language. Wu et al. [34] proposed a multi-method fusion approach to MN-CH machine translation, in which they added subword-translated monolingual data to attention-based NMT and used SMT to guide the model to produce correct translations. Wu et al. [31] also proposed combining a template-based machine translation (TBMT) system with an SMT system to achieve better translation results.
In 2017, MN-CH machine translation gradually transitioned to NMT, and NMT methods have dominated Mongolian-Chinese machine translation from 2018 to the present. During this period, MN-CH machine translation has presented the following characteristics: pure NMT research dominates and closely follows the international mainstream of deep learning methods, with SMT methods sometimes used as a supplement. Li et al. [35] argued that the Transformer model's positional encoding cannot correctly express the deep logical relationships between document sentences, so they proposed a method to improve positional encoding; additionally, they proposed a method to fuse inter-sentence relationship information to improve translation quality. Ji et al. [36] improved the generative adversarial network (GAN) by adding value constraints and semantic enhancement and then used it to address the unknown (UNK) word problem in NMT.
Nowadays, research on MN-CH NMT has absorbed advanced international research ideas and methods and has made promising progress, but the research directions are scattered, and there is a lack of in-depth, systematic research themes with strong continuity.

Transformer
The Transformer model is an encoder-decoder model in which both the encoder and decoder contain N sublayers, with a default N of six. The model structure is shown in Figure 1. Transformer is built entirely on the attention mechanism, abandoning the recurrent neural network (RNN) and convolutional neural network (CNN) structures, realizing training parallelization, and improving training efficiency. However, this also brings potential problems: (1) Transformer's ability to learn the position information of sequences is not as good as an RNN's, but position information is very important in NLP, so positional encoding is introduced to remedy this deficiency.
(2) Transformer has an insufficient ability to extract local information. Transformer observes the correlation between words from a global perspective, while RNNs and CNNs are more biased toward local information. Transformer introduces two multi-head attention mechanisms to learn feature information from multiple subspaces. One is the self-attention mechanism (Q, K, and V all come from the same input), which is used in both the encoder and decoder and allows the model to learn the correlation between different words in the input sentence itself. The other is the cross-attention mechanism, which is only used in the decoder and is calculated in the same way as the self-attention mechanism, except that K and V come from the output of the encoder and Q comes from the input matrix of the decoder.

The formula for one of the heads in the multi-head self-attention mechanism is as follows:

head_i = softmax(Q_i K_i^T / √d_k) V_i

where Q_i, K_i, and V_i are linear transformations of the input matrix, which help to enlarge the expression space of the model; d_k is the feature dimension of K_i; and dividing Q_i K_i^T element-wise by √d_k helps reduce the variance of Q_i K_i^T, making the model more stable during gradient updates.
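As a concrete illustration, the per-head computation can be sketched in a few lines of PyTorch (tensor shapes are illustrative, not the paper's actual configuration):

```python
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    # (l, l) matrix of pairwise association strengths between words
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V

out = scaled_dot_product_attention(
    torch.randn(2, 4, 8), torch.randn(2, 4, 8), torch.randn(2, 4, 8)
)
```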

Baseline Model
The following three modifications are made to Transformer, and the modified model is used as the baseline model for this paper.

Activation Functions
In deep learning, there are many linear mapping operations such as matrix multiplication, and the role of the activation function is to add nonlinearity to the model.
ReLU [37,38], the default activation function of Transformer, is popular in deep learning. Compared with Sigmoid and Tanh, it is computationally simple and has no gradient saturation zone when the input is positive, but ReLU handles negative values too bluntly and can leave neurons unable to be updated during gradient descent. Leaky ReLU [39,40] improves on ReLU by handling negative values more gently, alleviating the problem of silent neurons.
Trading off the computational cost against the proportion of silent neurons, Leaky ReLU was chosen as the activation function to replace ReLU.
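The difference in how the two functions treat negative inputs can be seen directly (PyTorch's default negative slope of 0.01 is used here; the paper does not state its slope value):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])

# ReLU zeroes negatives entirely, so their gradient is 0 and the neuron can go silent
relu_out = F.relu(x)         # tensor([0., 0., 3.])
# Leaky ReLU keeps a small slope on the negative side, so a gradient survives
leaky_out = F.leaky_relu(x)  # tensor([-0.0200, 0.0000, 3.0000])
```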

Optimizer
Adam [41] is the default optimizer of Transformer, but Loshchilov and Hutter [42] found that mainstream machine learning frameworks have problems in their implementation of Adam + L2 regularization, so they improved the implementation and proposed AdamW [42], which not only saves computation but also improves the training effect. AdamW was chosen as the optimizer in BERT [43]. Therefore, AdamW was also used as the optimizer for the baseline model in this paper.
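Swapping in AdamW is a one-line change in PyTorch; the hyperparameters below are illustrative only, not the values used in the paper's experiments:

```python
import torch

model = torch.nn.Linear(512, 512)
# AdamW decouples weight decay from the adaptive gradient update,
# avoiding the flawed Adam + L2 coupling that Loshchilov and Hutter identified
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```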

Learning Rate Adjustment Strategy
Since the optimizer optimizes the model based on the loss function, the model may be trained with decreasing loss values without a corresponding increase in accuracy. To cope with this situation, the learning rate adjustment strategy ReduceLROnPlateau (accessed on 5 August 2022) was added to the baseline model to monitor the accuracy on the validation set and halve the learning rate if the accuracy stopped increasing.
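A minimal sketch of this strategy in PyTorch follows; the patience value and the toy accuracy sequence are our assumptions for illustration, since the paper only specifies the halving factor and the monitored metric:

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# mode="max": the monitored metric is validation accuracy, so a "plateau"
# means accuracy stops rising; factor=0.5 halves the learning rate then
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=1
)

for val_acc in [0.40, 0.41, 0.41, 0.41]:  # accuracy plateaus after epoch 2
    scheduler.step(val_acc)               # called once per epoch
```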

The Proposed Model
The self-attention mechanism defaults the distance between any two words in a sequence to one, which means that it initially treats the associations between words "equally" and must slowly distinguish "weak associations" from "strong associations" during training. QK^T is a matrix that measures the degree of association between any two words in a sequence: the magnitude of its elements reflects the strength of the association between words. Therefore, the learning effect of the self-attention mechanism can be enhanced by adjusting the QK^T matrix.
Inspired by previous work on improving the attention mechanism, this paper improves the baseline model's encoder and decoder to better suit the MN-CH translation task. The improvements to the encoder and decoder are very similar: both add additional syntax-assisted learning units to enhance the attention mechanism's ability to learn the syntactic knowledge of the source and target languages. The difference is that the self-attention layer of each encoder sublayer adds a syntax-assisted learning unit M, a randomly initialized matrix whose values are updated by gradient descent as the model is trained, while the decoder adds an MMHDPA layer incorporating syntactic knowledge of the target language after positional encoding. The structure of the improved model is shown in Figure 2.

Improvement of the Self-Attention Mechanism in the Encoder
The input matrix for each sublayer of the encoder is X ∈ R^(b×l×d_model), where b, l, and d_model denote the batch size, the maximum sentence length, and the word embedding dimension, respectively. The shape of X is unchanged after positional encoding is added. Before the scaled dot product attention is computed, X is multiplied by the weight matrices of four fully connected layers, and the results are Q ∈ R^(b×l×d_model), K ∈ R^(b×l×d_model), V ∈ R^(b×l×d_model), and M ∈ R^(b×l×(head_size×l)), respectively. Then, Q, K, V, and M are divided equally along the last dimension according to head_size, giving Q_i, K_i, V_i ∈ R^(b×l×(d_model/head_size)) and M_i ∈ R^(b×l×l), and the scaled dot product attention for head i is calculated as follows:

head_i = softmax((Q_i K_i^T ⊙ M_i) / √d_k) V_i

where Q_i K_i^T ∈ R^(b×l×l) and ⊙ is the element-wise multiplication. The M_i matrix assists the self-attention mechanism of the encoder sublayer in learning the correlation between words more fully. The structure of the scaled dot product attention after adding the syntax-assisted learning unit M is shown in Figure 3a.
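A minimal PyTorch sketch of one encoder head with the unit M follows; the projection shapes match the description above, but the exact order of the element-wise product, scaling, and softmax is our reading of the text:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

b, l, d_model, head_size = 2, 6, 512, 8
d_k = d_model // head_size

X = torch.randn(b, l, d_model)             # sublayer input, after positional encoding
W_q = nn.Linear(d_model, d_k, bias=False)  # one head's slice of the Q projection
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)
W_m = nn.Linear(d_model, l, bias=False)    # one head's slice of the M projection

Q_i, K_i, V_i, M_i = W_q(X), W_k(X), W_v(X), W_m(X)  # M_i: (b, l, l), learned

# element-wise product of Q_i K_i^T with M_i, then scaling and softmax
scores = (Q_i @ K_i.transpose(-2, -1)) * M_i / math.sqrt(d_k)
head_i = F.softmax(scores, dim=-1) @ V_i             # (b, l, d_k)
```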

Masked Multi-Head Dependency Parsing Attention (MMHDPA)
The multi-head attention after adding the DP matrix is called Masked Multi-Head Dependency Parsing Attention (MMHDPA), and MMHDPA is placed before the first sublayer of the decoder. In MMHDPA, the scaled dot product attention is calculated as follows:

head_i = softmax((Q_i K_i^T ⊙ DP) / √d_k) V_i

where DP denotes the Chinese dependency syntax matrix and ⊙ is the element-wise multiplication. The DP matrix of each sentence is expanded to l × l; this work belongs to the preprocessing stage, where the processed DP matrices were saved to disk in batches in advance and then read during model training. The structure of MMHDPA is shown in Figure 3b.
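One MMHDPA head can be sketched as below; the element-wise use of DP mirrors the encoder's M unit and is our reading of the paper's description, and the random DP here merely stands in for a precomputed dependency matrix:

```python
import math

import torch
import torch.nn.functional as F

def mmhdpa_head(Q_i, K_i, V_i, DP):
    """One MMHDPA head. DP is the l x l Chinese dependency matrix
    (1 where a dependency links two characters, 0 elsewhere)."""
    d_k = K_i.size(-1)
    scores = (Q_i @ K_i.transpose(-2, -1)) * DP / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V_i

b, l, d_k = 2, 5, 64
DP = torch.randint(0, 2, (l, l)).float()  # stand-in for a preprocessed DP matrix
out = mmhdpa_head(torch.randn(b, l, d_k), torch.randn(b, l, d_k),
                  torch.randn(b, l, d_k), DP)
```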

The Construction of the DP Matrix
A dependency syntactic relation is an asymmetric relation in which the dominant word points to the subordinate word. Dependency syntactic analysis aims to transform the input sentence into a dependency tree in which any parent and child nodes are connected by dependency relations. Dependency relations can be represented in triple form (relation, head, dependent). Let the set of all word segments in a sentence be V = {w_1, w_2, ..., w_n} and the set of all dependency relations be R = {(r, w_i, w_j) | w_i, w_j ∈ V}, where r denotes the relation from head w_i to dependent w_j. In a sentence, there is only one word that does not depend on any other word, called the central word. The dependency relation of the central word is ROOT, and all other words are directly or indirectly subordinate to it.
The DP matrix is constructed from the results of the dependency syntactic analysis of Stanford NLP [44]. First, sentences are tokenized using Stanford CoreNLP, and then dependency syntactic parsing is performed. The result is a list of triples; finally, the dependencies are assigned down to the single-character level.
When extracting the DP matrix from the dependency relations, the specific dependency labels do not matter. It is only necessary to set to one the matrix entries at the coordinates of the two single characters linked by a dependency (except for the ROOT relation).
The construction process of the matrix DP is shown in Figure 5.The DP matrix extracted from the example sentence is shown in Figure 6.
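The extraction step above can be sketched as follows; the triple format, the 1-based word indices, and the symmetric marking of both (i, j) and (j, i) are our assumptions for illustration:

```python
def build_dp_matrix(triples, l):
    """Build an l x l dependency matrix from (relation, head, dependent)
    triples with 1-based indices; ROOT (head index 0) is skipped."""
    dp = [[0] * l for _ in range(l)]
    for rel, head, dep in triples:
        if rel == "ROOT" or head == 0:
            continue  # the central word depends on nothing
        i, j = head - 1, dep - 1
        dp[i][j] = dp[j][i] = 1  # the relation label itself is not used
    return dp

# hypothetical parse of a 3-token sentence: token 2 is the root,
# with token 1 as its subject and token 3 as its object
triples = [("ROOT", 0, 2), ("nsubj", 2, 1), ("obj", 2, 3)]
dp = build_dp_matrix(triples, l=4)  # padded to the maximum length l
```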

Word Segmentation
Zipf's Law [45] states that there is an inverse relationship between the frequency of a word and its frequency ranking in a large corpus. In 2020, China's State Language and Script Work Committee analyzed a news corpus of 888,448,937 Chinese characters containing 10,970 different Chinese characters, of which 2247 high-frequency characters accounted for 99% of the entire corpus (http://qrcode.cp.cn/qr_code.php?id=qsxb3615t4ycdvrsgi8w3lxcb1x7n0v9, accessed on 5 August 2022).
The NMT model is constrained by objective conditions during training, and its scale should be controlled. Parameters directly related to the model scale include the word embedding dimension, sentence length, and corpus vocabulary size. Li et al. [46] discussed whether word segmentation is necessary for deep learning of Chinese representation, finding that models based on char-level tokenization performed better than models based on word-level tokenization. Their conclusion was that word-based models are more susceptible to data sparsity and out-of-vocabulary words, making them prone to overfitting and limiting their generalization ability.
Building upon these considerations, this paper splits Chinese sentences into single characters, while Traditional Mongolian words are naturally separated by spaces. In addition, the corpus contains a large number of numerals and English words, which would introduce considerable noise into the model, so numerals and English words are also split into single characters.
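This character-level split is simple enough to sketch in a few lines; the helper name and the treatment of whitespace are our own simplification of the paper's preprocessing:

```python
def char_tokenize(sentence):
    """Split a Chinese sentence into single characters; digits and Latin
    letters are split the same way to reduce noise from numbers and
    English words."""
    return [ch for ch in sentence if not ch.isspace()]

tokens = char_tokenize("模型 2022")  # -> ['模', '型', '2', '0', '2', '2']
```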

Datasets:
The MN-CH parallel corpus used in this paper contained a wide range of proper nouns and phrases, literary works, computer-related terms, news reports, colloquialisms and slang, company and organization names, daily chats, etc. We randomly shuffled the corpus after word segmentation and kept sentences with a length of 8-100. The total number of sentences in the filtered MN-CH parallel corpus was 498,921, and the training, validation, and test sets were divided in a ratio of approximately 0.96:0.02:0.02. The number of sentences in each dataset is shown in Table 1.

Settings: The model training was run under the following experimental conditions. The CPU was an Intel(R) Xeon(R) Gold 6130, with two Nvidia Tesla P100 graphics cards, each with 12,198 MB of memory, and the server operating system was Ubuntu 16.04.6. The main packages in the Python virtual environment were Python 3.6.13, PyTorch 1.7 (CUDA 10.1), torchtext 0.6.0, scikit-learn 0.24.2, sacrebleu 2.0.0, pandas 1.1.5, numpy 1.19.2, matplotlib 3.2.2, and stanfordcorenlp 3.9.1.1.
To exclude randomness of the experimental results, all models were trained on the network parameters corresponding to five random seeds, and the average BLEU value was taken as the evaluation metric.
The Chinese DP matrix was only used during training.The validation set and test set did not provide any syntactic information, and the DP matrix extracted from the training set was about 18.3 GB.

Experimental Results and Discussion
In this paper, ablation experiments were conducted to compare the impact of each change, involving five models named Transformer_Base, Transformer_Enc_Imp, Transformer_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, and Transformer_Enc_Dec_Imp + Syn, where Enc_Imp, Dec_Imp, and Enc_Dec_Imp indicate encoder improvement, decoder improvement, and both improvements to the baseline model (Transformer_Base), respectively, and Syn indicates the knowledge of Chinese dependency syntax.

Experimental Results of the Six-Layer Models
In this section, the number of sublayers N for both the encoder and decoder was set to six and the number of training rounds (epochs) was 40; the number of parameters for each model is shown in Table 2. The number of parameters for all five models was around 50 million, with Transformer_Enc_Dec_Imp and Transformer_Enc_Dec_Imp + Syn having the largest number of parameters, an increase of 1,521,440 over the baseline model (1,258,272 + 263,168 = 1,521,440). The number of parameters for Transformer_Enc_Imp and Transformer_Dec_Imp increased by 1,258,272 and 263,168, respectively, compared with the baseline model, accounting for 82.7% and 17.3% of the total increase in parameters.
The cumulative four-gram BLEU [47] score was adopted as the primary evaluation metric, computed with the corpus_bleu() function in sacrebleu with smooth_method = "none". It is worth pointing out that the BLEU value can vary considerably depending on the choice of smoothing function, so multi-bleu-detok.perl was also used in the BLEU evaluation, and the consistency of the results between the two evaluation tools was verified to ensure the accuracy of the BLEU values. The BLEU-4 values for each model are shown in Table 3. It can be seen that the average BLEU value of Transformer_Enc_Dec_Imp was 2.706 higher than that of the baseline model, indicating that the improved structure was effective. The average BLEU of Transformer_Enc_Dec_Imp + Syn was 3.539 higher than that of Transformer_Enc_Dec_Imp, demonstrating that the addition of dependency syntactic information could improve the quality of translation.
However, two unexpected results were found: (1) Transformer_Enc_Dec_Imp + Syn did not have the highest average BLEU score, being 1.161 lower than Transformer_Enc_Imp; (2) Transformer_Dec_Imp + Syn performed the worst in the experiment, with lower BLEU values than the baseline model for every random seed, obtaining the two lowest scores in all the experimental results.
The analysis yielded the two most likely causes of the above phenomena: (1) the improved encoder and improved decoder contributed very unequal shares of the total increase in parameters, so there were not enough neural network units on the decoder side to fit the amount of information carried by the dependency syntax matrix; (2) MMHDPA did not sufficiently extract the Chinese dependency syntax information, and the structure of MMHDPA should be further improved to better learn the target language's grammar.

Experimental Results of the Five-Layer Models
Experiments were conducted under the conditions N = 5 and epochs = 35 to verify whether the above interpretations were correct. The parameter counts for the five-layer models are shown in Table 4, and the experimental results for each model are shown in Table 5. The contents in parentheses in Table 5 indicate the changes relative to Table 3. The poorer experimental results for the five-layer models were to be expected, as the total number of model parameters and training rounds were reduced. The six-layer Transformer_Dec_Imp + Syn achieved an abnormally low score with the network parameters corresponding to random seed 256, pulling down its average BLEU value and making the five-layer Transformer_Dec_Imp + Syn appear to perform better; this should be regarded as an anomaly.
The same conclusion can be drawn from these two experiments, in descending order of the average BLEU value: Transformer_Enc_Imp, Transformer_Enc_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, Transformer_Base, and Transformer_Dec_Imp + Syn.
The loss curves and accuracy curves of the five models also corroborated the above conclusions. As can be seen in Figure 7a-d and Figure 8a,b, the five models had the same order of merit in these four curves, corresponding to the order of the average BLEU scores achieved by the models. However, in Figure 8c,d, the order of the Transformer_Enc_Dec_Imp + Syn and Transformer_Enc_Dec_Imp curves was reversed; in terms of the average BLEU score, Transformer_Enc_Dec_Imp + Syn still won slightly, by 0.849, and the more plausible explanation is that Transformer_Enc_Dec_Imp + Syn had better generalization performance.
More importantly, the experiments with the five-layer models confirmed the conjectures made in the previous section.

Integration of Primer-EZ Methods Based on Improved Models
By analyzing the experimental results in the previous section, it can be concluded that the decoder had an insufficient ability to learn dependency syntactic information. The self-attention mechanism is inferior to the RNN and CNN in local feature extraction, but an RNN would hurt Transformer's parallel computing ability and greatly prolong the training time. Therefore, a CNN was added to MMHDPA to enhance the model's ability to learn dependency syntactic information.

Primer-EZ
Coincidentally, in September 2021, Google proposed a search framework designed to automatically discover efficient Transformer variants. They named the searched architecture Primer [48], and experiments showed that Primer outperformed Transformer on most tasks given the same training time. We build on their work and integrate Primer into our improved model.
There are two improvements in Primer that are generic and easily portable, and the Primer containing these two improvements is called Primer-EZ.The structure of Primer-EZ is shown in Figure 9.
In this paper, Primer-EZ was implemented in PyTorch 1.7. Since the activation function used in all models was Leaky ReLU, its square (Squared Leaky ReLU) was taken. Limited by the experimental hardware conditions, the CNN layer was only used after Q, K, and V in the MMHDPA layer, and Q, K, and V shared one convolutional layer, increasing the number of neurons by only 512. In Transformer_Base and Transformer_Enc_Imp, only the activation function was modified to Squared Leaky ReLU, while in the remaining three models, both improvements were applied.
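The two Primer-EZ ingredients as adapted here can be sketched as follows; the depth-wise kernel size and the exact placement of the shared convolution are our assumptions, since the paper does not state them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squared_leaky_relu(x, slope=0.01):
    """Primer's squared activation, applied here to Leaky ReLU."""
    return F.leaky_relu(x, slope) ** 2

# One depth-wise 1D convolution over the sequence axis, shared by Q, K, and V
# to save hardware resources as described above
b, l, d = 2, 6, 512
conv = nn.Conv1d(d, d, kernel_size=3, padding=1, groups=d)

def local_mix(x):  # x: (b, l, d) -> (b, l, d)
    return conv(x.transpose(1, 2)).transpose(1, 2)

Q = local_mix(torch.randn(b, l, d))  # likewise for K and V
```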
As we can see, the average BLEU values of Transformer_Base and Transformer_Enc_Imp were higher after adding Squared Leaky ReLU, indicating that Squared Leaky ReLU can speed up the convergence of the model and slightly improve the translation quality.
Transformer_Dec_Imp + Syn + EZ, Transformer_Enc_Dec_Imp + EZ, and Transformer_Enc_Dec_Imp + Syn + EZ achieved the highest average BLEU scores even though the number of model parameters and training rounds was reduced, indicating that the improved model with a fused CNN layer can significantly improve translation quality. The addition of a CNN layer, on the one hand, made up for the insufficient number of neurons and insufficient fitting ability on the original decoder side; on the other hand, passing Q, K, and V through the CNN layer enhanced the extraction of local features, making it easier for the MMHDPA layer to judge the strength of association between words.

Comparison with Other Methods
In the following, we compare our approach with other MN-CH machine translation methods in order to give the reader a general idea of the performance level of the methods proposed in this paper. The results of the comparison are shown in Table 8. Unfortunately, there is not yet an accepted SOTA model in the field of MN-CH machine translation, and the MN-CH datasets used in the literature are not uniform, so low-resource language translation still has a long way to go. Hopefully, more researchers will join the field in the future.

Conclusions and Future Work
In order to improve the translation quality of MN-CH NMT on a low-resource corpus, we explored ways to enhance the syntactic learning ability of the attention mechanism. We designed a translation model that learns syntactic knowledge at both ends. A syntax-assisted learning unit M was added to the self-attention layer of each sublayer of the encoder to implicitly enhance the encoder's ability to capture syntactic information in Traditional Mongolian. MMHDPA was added before the first sublayer of the decoder, and Chinese dependency syntactic knowledge was added to the attention mechanism of MMHDPA to explicitly learn Chinese syntax.
The experiments show that Transformer_Enc_Dec_Imp and Transformer_Enc_Dec_Imp + Syn both had higher average BLEU scores than the baseline model under the same training conditions, and that Transformer_Enc_Dec_Imp + Syn outperformed Transformer_Enc_Dec_Imp, which suggests that the proposed model structure can effectively improve translation quality even in the absence of external knowledge. The improvement was more obvious with the addition of external knowledge: both the proposed model structure and external syntactic knowledge can improve the quality of translation.
The analysis revealed that the decoder side suffered from an insufficient fitting ability and that MMHDPA did not sufficiently learn the Chinese dependency syntactic information. Therefore, the Primer-EZ method was incorporated into the existing models. In Transformer_Base and Transformer_Enc_Imp, only the activation functions were replaced with Squared Leaky ReLU. Transformer_Dec_Imp + Syn, Transformer_Enc_Dec_Imp, and Transformer_Enc_Dec_Imp + Syn all used Squared Leaky ReLU as the activation function and also added a shared convolution layer after the Q, K, and V of the MMHDPA layer to enhance local feature capture.
The final improved model, Transformer_Enc_Dec_Imp + Syn + EZ (Both), showed a significant improvement of 9.113 (45.634 − 36.521) in the average BLEU value compared with the baseline model under experimental conditions of N = 5 and epochs = 35.
The conclusion can be drawn that Squared Leaky ReLU accelerates the convergence of the model, and that a CNN significantly enhances the capture of local Chinese syntactic features. Both improvements raise the quality of Mongolian-Chinese NMT translations.
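The two Primer-EZ components can be illustrated compactly. The sketch below is an assumption-laden reconstruction, not the paper's code: the leak slope of 0.01, the causal padding, and the single kernel shared across all channels are illustrative choices inferred from the description of a "shared convolution layer" after the Q, K, and V projections.

```python
import numpy as np

def squared_leaky_relu(x, slope=0.01):
    """Squared Leaky ReLU: a Leaky ReLU followed by an element-wise
    square (the slope value here is an assumption)."""
    y = np.where(x > 0.0, x, slope * x)
    return y * y

def shared_seq_conv(x, kernel):
    """Causal 1-D convolution over the sequence dimension of a
    (seq_len, d_model) array, with one kernel shared across all
    channels, sketching the shared convolution applied after the
    Q, K, and V projections to capture local features."""
    seq_len, d = x.shape
    k = kernel.shape[0]
    padded = np.vstack([np.zeros((k - 1, d)), x])  # left (causal) padding
    out = np.zeros_like(x)
    for t in range(seq_len):
        out[t] = (padded[t:t + k] * kernel[:, None]).sum(axis=0)
    return out
```

Squaring the activation sharpens the gradient for large positive inputs, which is consistent with the faster convergence observed in the experiments.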
Overall, our proposed method was validated in the MN-CH parallel corpus and achieved decent results.
For future work, firstly, other ways of constructing dependency-syntax matrices were not explored in this paper due to time constraints, and we hope that more constructions will be attempted. Secondly, the improved encoder self-attention mechanism could be used for grammar induction tasks in small languages such as Mongolian. Grammar induction refers to a model automatically learning grammatical structures from the corpus without any expert-annotated syntactic knowledge. One possible approach is for the self-attention mechanism to examine the relationships between tokens from a global perspective, strengthening those relationships and using them to delineate sentence structure; the Transformer's multiple encoder sublayers could then be used to delineate the hierarchical structure of sentences.

Figure 1. Construction of the Transformer model.

Figure 2. The structure of the proposed model.

Figure 3. Improvements to the attention mechanism. (a) Improvement of the self-attention mechanism in each encoder sublayer. (b) The structure of MMHDPA.

Figure 4. The dependency tree for the Chinese example sentence. At the bottom of the picture are the word segmentations of the example sentence; the abbreviation above each arrow denotes the dependency relation between two words.

Figure 5. Dependency syntactic parsing and processing of the example sentence. The figure shows the Chinese example sentence going through word segmentation, dependency syntax analysis, clause reordering, and finally the assignment of dependency relations at the single-word level.

Figure 6. Dependency syntax matrix. The figure shows the rule for constructing the DP matrix: the coordinates corresponding to any two words linked by a dependency relation are set to one.
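The construction rule described for Figure 6 can be expressed in a few lines. This is an illustrative sketch: the function name, index convention (0-based word positions), and the example edge list are assumptions, not taken from the paper.

```python
import numpy as np

def build_dep_matrix(n_words, edges):
    """Construct the symmetric 0/1 dependency-syntax matrix of
    Figure 6: entry (i, j) is set to one when words i and j are
    linked by a dependency relation."""
    m = np.zeros((n_words, n_words), dtype=int)
    for head, dependent in edges:
        m[head, dependent] = 1
        m[dependent, head] = 1  # the relation is recorded symmetrically
    return m
```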

Figure 7. Loss curves and accuracy curves of the six-layer models on the training and validation sets. (a) Training set loss curve. (b) Validation set loss curve. (c) Training set accuracy curve. (d) Validation set accuracy curve.

Figure 8. Loss curves and accuracy curves of the five-layer models on the training and validation sets. (a) Training set loss curve. (b) Validation set loss curve. (c) Training set accuracy curve. (d) Validation set accuracy curve.

Table 1. Number of sentences in the Traditional Mongolian-Chinese dataset.

Table 2. The number of parameters for each six-layer model.

Table 3. BLEU evaluation of the six-layer models on different random seeds.

Table 4. The number of parameters for each five-layer model.

Table 5. BLEU evaluation of the five-layer models on different random seeds.

Table 8. Performance of different Traditional Mongolian-Chinese machine translation approaches on various datasets.