Article

Dynamic Multi-Granularity Translation System: DAG-Structured Multi-Granularity Representation and Self-Attention

1
School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
2
School of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA
3
Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, USA
*
Author to whom correspondence should be addressed.
Systems 2024, 12(10), 420; https://doi.org/10.3390/systems12100420
Submission received: 6 August 2024 / Revised: 29 September 2024 / Accepted: 2 October 2024 / Published: 9 October 2024
(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)

Abstract

In neural machine translation (NMT), the sophistication of word embeddings plays a pivotal role in the model’s ability to render accurate and contextually relevant translations. However, conventional models with single granularity of word segmentation cannot fully embed complex languages like Chinese, where the granularity of segmentation significantly impacts understanding and translation fidelity. Addressing these challenges, our study introduces the Dynamic Multi-Granularity Translation System (DMGTS), an innovative approach that enhances the Transformer model by incorporating multi-granularity position encoding and multi-granularity self-attention mechanisms. Leveraging a Directed Acyclic Graph (DAG), the DMGTS utilizes four levels of word segmentation for multi-granularity position encoding. Dynamic word embeddings are also introduced to enhance the lexical representation by incorporating multi-granularity features. Multi-granularity self-attention mechanisms are applied to replace the conventional self-attention layers. We evaluate the DMGTS on multiple datasets, where our system demonstrates marked improvements. Notably, it achieves significant enhancements in translation quality, evidenced by increases of 1.16 and 1.55 in Bilingual Evaluation Understudy (BLEU) scores over traditional static embedding methods. These results underscore the efficacy of the DMGTS in refining NMT performance.

1. Introduction

In recent years, machine translation has emerged as a prominent research domain within the field of Natural Language Processing (NLP). Thanks to advances in computer science, technology, and hardware resources, machine translation has made great strides. Neural machine translation (NMT) methods based on deep learning have superseded statistical machine translation methods and become the prevailing approach in the field. As NMT has evolved, the fundamental encoder–decoder framework has remained unchanged, while the backbone network has progressed from the recurrent neural network (RNN) to the Convolutional Neural Network (CNN), and ultimately to the Transformer model.
In the early stages, RNNs were widely used as the backbone network. Lei et al. demonstrated that using independent recurrent neural networks can greatly enhance the performance of machine translation models [1,2]. Building on the former, Sutskever et al. at Google proposed an RNN–RNN model [3], which later became the common sequence-to-sequence model. Baliyan et al. combined Long Short-Term Memory (LSTM) with RNNs to complement the sentiment analysis capability with language translation [4]. In this model, knowledge-based context vectors were integrated to facilitate the mapping of multilingual vocabulary, while an RNN was employed to ensure good results. Wang et al. designed a bidirectional Gated Recurrent Unit (GRU) model for English translation analysis based on the RNN model, which made full use of word vectors in the construction of language sequences [5]. Later, recurrent neural networks were gradually replaced by other backbone networks. Kalchbrenner and Blunsom proposed applying CNNs in a recurrent continuous translation model, which improves the ability to learn the mapping between continuous representations of implicit links between phrases and sentences during the translation process [6]. Wang et al. proposed a generative adversarial network-based neural machine translation model that uses adversarial thinking to consider the order of emotional directions and make the translation results more humanized [7].
The field of machine translation witnessed a groundbreaking shift with the advent of the Transformer model, ushering in a new era by integrating self-attention, thereby enhancing the efficacy and precision of the translation process. Hu et al. [8] introduced a Transformer-based NMT model designed specifically for important information fusion. Notably, this model demonstrated exceptional performance in effectively translating long sentences. To address the fact that traditional attention mechanisms could not fully utilize the hidden information of target words, Li et al. proposed a novel enhanced attention mechanism that incorporates hidden details from target words into both RNN-based and self-attention-based translation models. Theoretical analysis demonstrated the facilitative role of the hidden details in improving translation prediction [9]. To improve word representations and translation performance, Wang et al. incorporated part-of-speech sequence information [10]. While this approach improved translation performance, it remains unclear whether the artificially added information enhances the model’s encoding ability or merely serves as noise to bolster robustness, thus lacking interpretability.
Transformer models are still the most widely used backbone networks for NMT systems. We list some key reasons for the prevalence of the Transformer architecture in contemporary NMT frameworks:
  • Efficiency and Scalability: Transformer models are characterized by a streamlined architecture that necessitates fewer parameters, thereby reducing the computational power required.
  • Enhanced Computational Throughput: The Transformer architecture’s ability to process inputs in parallel marks a significant departure from the sequential processing inherent to recurrent neural network (RNN)-based NMT systems. This parallelization facilitates a remarkable improvement in computational efficiency, allowing for the utilization of larger datasets within the same computing constraints.
  • Superior Language Representation: The Transformer model is able to overcome the limitations associated with encoding long-distance dependencies—a notable challenge for RNNs. Through the Self-Attention (SA) mechanism, Transformers can flexibly and efficiently capture relationships between any elements in the input sequence, regardless of their positional distance.
However, the Transformer model has its limitations. Although multi-head semantic analysis can capture the inter-word relationship well, it ignores the inter-word structural information to some extent. Meanwhile, in recent years, a series of Large Language Models (LLMs) based on the Transformer architecture, such as OpenAI’s GPT and Google’s BERT, have been applied more and more widely in the field of machine translation [11]. To improve the model’s ability to learn from context, Li et al. proposed a demonstration-aware Large Translation Model (LTM) based on mixed presentation types [12]. It determines the presentation type of a training sample by randomly selecting either sentence pairs in the training set as sentence-level presentations or continuous context text as document-level presentations. Zhu et al. proposed an approach that enables LLMs to achieve robust translation with in-context learning (ICL) [13]. This method adopts a multi-view strategy and considers both sentence-level and word-level information to capture the relationship between words and sentences effectively, so as to select presentations that avoid noise.
Besides various backbone models being used in NMT models, there are different input language granularity and text characterization methods explored in NMT. Table 1 summarizes some word embedding methods of NMT systems.
As shown in Table 1, NMT systems combine language granularities and text characterization methods in different ways. However, all of these text characterizations use static word vectors. Among them, the Random method used by Vaswani et al. [16] randomly initializes an $N \times d_{model}$ word embedding matrix and trains it jointly with the model. After training, each row of the word embedding matrix corresponds to a fixed word, so the resulting representations are essentially static word vectors. Static word vectors are generally obtained by pre-training on a corpus to learn distributed representations of words. The notable characteristic of this approach is that pre-trained word vectors can be reused across diverse downstream NLP tasks without additional training, which improves efficiency. Nevertheless, real language use involves polysemy, where a single word may carry multiple meanings and grammatical roles. Static word vectors cannot represent such differences effectively, which leads to semantic and grammatical deviations.
Moreover, in terms of language granularity, whether it is character-based, word-based, or subword-based, most of the representative NMT systems have a single-granularity input. This can limit the encoder’s representation capacity and the model’s performance, as the ability to accurately interpret the input sentence directly affects the NMT system’s overall effectiveness. In particular, East Asian languages like Chinese do not have spaces as a natural division between words, which is different from other alphabet-based languages. Therefore, when dealing with NLP tasks based on the Chinese corpus, the Chinese corpus must be segmented first. Using different word segmentation tools will result in different granularity of word segmentation results, resulting in performance differences. Morishita et al. [17] used a hierarchical network to fuse multiple subword granularities as input. Su et al. [18] employed a word lattice to combine different levels of word granularity for improved Chinese–English translation.
To address the challenges of word representations of Transformers in Chinese, we introduce the Dynamic Multi-Granularity Translation System (DMGTS), a novel Transformer model modification that incorporates multi-granularity position encoding and self-attention mechanisms. Specifically, four different word segmentation methods are applied to the inputs to obtain four levels of granularities. A Directed Acyclic Graph (DAG) is utilized to transform multi-granularity inputs into position encodings, and ELMo [19] is employed to generate dynamic word embeddings. These dynamic embeddings are fed into a modified Transformer where multi-granularity self-attention mechanisms replace conventional self-attention layers. Our extensive evaluation of the DMGTS on benchmark datasets reveals substantial improvements over methods with single-granularity input and static word embedding.

2. Dynamic Multi-Granularity Translation System

In this section, we introduce the DMGTS, which modifies the word embedding of the traditional Transformer, and describe its multi-granularity position encoding and multi-granularity self-attention mechanisms. The architecture of the DMGTS is shown in Figure 1. The DMGTS comprises four main components: pre-trained multi-granularity word segmentation, dynamic word embedding, multi-granularity relative position encoding, and an encoder–decoder in which the multi-granularity self-attention mechanism is introduced.
Specifically, the DMGTS processes the text inputs as follows:
  • The text input sequence $S = \{s_1, \ldots, s_n\}$ is segmented into multiple granularities, including character-level granularity and three other granularities. The details of multi-granularity word segmentation are given later. These granularities are then modeled using a Directed Acyclic Graph (DAG) fusion approach.
  • We use the DAG to transform the multi-granularity representation of the input $S = \{s_1, \ldots, s_n\}$ into the granularity sequence $X = \{x_1, \ldots, x_k\}$ and the corresponding position representations $P^{head} = \{p_1^{head}, \ldots, p_k^{head}\}$ and $P^{tail} = \{p_1^{tail}, \ldots, p_k^{tail}\}$.
  • The input $X = \{x_1, \ldots, x_k\}$ is passed through the pre-trained ELMo model to obtain the dynamic vectors $X_{ELMo} = \{x_{1,ELMo}, \ldots, x_{k,ELMo}\}$.
  • We convert the position representation sequences $P^{head} = \{p_1^{head}, \ldots, p_k^{head}\}$ and $P^{tail} = \{p_1^{tail}, \ldots, p_k^{tail}\}$ into a multi-granularity relative position representation $(d_{ij}^{(hh)}, d_{ij}^{(ht)}, d_{ij}^{(th)}, d_{ij}^{(tt)})$, and further convert it into a relative position encoding $R_{ij}$ using trigonometric functions [20,21,22].
  • The dynamic text feature vector $X_{ELMo} = \{x_{1,ELMo}, \ldots, x_{k,ELMo}\}$ is fed into the model encoder. The multi-granularity self-attention layer integrates the positional encoding $R_{ij}$ obtained in the previous step into the calculation of the attention weights. Specifically, the model considers the current position and the relative distance from the attended position when computing the attention. We then obtain a new text feature vector and feed it to the position feedforward layer. Subsequently, we pass the results through the remaining encoder layers from bottom to top.
  • The encoder–decoder attention layer uses the output of the last encoder layer as its input. Following word embedding and positional encoding, the decoder receives the translated outcome and passes it through the self-attention layer and the position feedforward layer across its six layers.
  • The text feature vector generated by the decoder is fed into the output layer to produce the output.
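The first two steps above, flattening several segmentations of one sentence into a single token sequence with head and tail positions, can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_lattice`, the token ordering, the merging of identical spans, and the toy segmentations are hypothetical and are not taken from the authors' implementation.

```python
def build_lattice(sentence, segmentations):
    """Flatten characters plus several word segmentations into one lattice.

    Returns (tokens, head, tail), where head[k] and tail[k] are the indices
    of the first and last character covered by tokens[k].
    """
    tokens, head, tail = [], [], []
    seen = set()

    def add(token, start):
        end = start + len(token) - 1
        key = (token, start, end)
        if key not in seen:  # assumption: identical spans are stored once
            seen.add(key)
            tokens.append(token)
            head.append(start)
            tail.append(end)

    # Character-level granularity: every character is its own node.
    for i, ch in enumerate(sentence):
        add(ch, i)

    # Word-level granularities: walk each segmentation left to right.
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("segmentation does not cover the sentence")
        pos = 0
        for w in words:
            add(w, pos)
            pos += len(w)
    return tokens, head, tail


# Toy input: three hypothetical segmentations of the same sentence.
sentence = "南京市长江大桥"
segmentations = [
    ["南京市", "长江大桥"],
    ["南京", "市", "长江", "大桥"],
    ["南京市", "长江", "大桥"],
]
tokens, head, tail = build_lattice(sentence, segmentations)
for t, h, tl in zip(tokens, head, tail):
    print(t, h, tl)
```

Storing one entry per distinct span keeps the sequence $X$ short when segmenters agree; whether the original system merges such duplicates is not stated in the text.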

2.1. Pre-Trained Multi-Granularity Word Segmentation

Similar to the work of Su et al. [18], we use the open-source toolbox of Stanford University to train three word segmentation tools on datasets that follow the MSR, PKU, and CTB segmentation standards, respectively. We then apply these three tools to the machine translation corpus to obtain different word granularities.
At the beginning of the training process, the data are partitioned into three distinct sets: the training set, the validation set, and the test set, with allocation proportions of 70%, 10%, and 20%, respectively. Cross-validation is employed using a five-fold strategy during training. To assess the word segmentation performance, the F1 value (F1), recall rate (Recall), and precision rate (Precision) are computed using Equation (1) upon completion of the training process.
$$
\begin{aligned}
Precision &= \frac{\text{number of correct words}}{\text{number of predicted words}} \\
Recall &= \frac{\text{number of correct words}}{\text{number of standard words}} \\
F1 &= \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
\end{aligned}
\tag{1}
$$
Table 2 and Figure 2 display the performance metrics of the three word segmentation tools obtained through training.
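As a small worked example of Equation (1), the sketch below scores one predicted segmentation against a reference. It assumes, as is common practice but not stated in the text, that a predicted word counts as correct only when both of its boundaries match a word in the reference; the helper names `to_spans` and `segmentation_prf` are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_prf(predicted, gold):
    """Precision, recall and F1 of a predicted segmentation (Equation (1))."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)  # words with both boundaries right
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["南京市", "长江", "大桥"]      # toy reference segmentation
pred = ["南京", "市", "长江", "大桥"]  # toy predicted segmentation
p, r, f1 = segmentation_prf(pred, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here two of the four predicted words match the three reference words, so precision is 2/4, recall is 2/3, and F1 is 4/7.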

2.2. Dynamic Word Embedding

In the original Transformer model, the word embedding module uses random initialization: it randomly initializes an embedding matrix of size $N \times d_{model}$, where $N$ is the vocabulary size and $d_{model}$ is the model dimension, and keeps this matrix trainable. After the training is completed, each word in the vocabulary corresponds to a fixed vector representation, which is a static word vector. Instead, we explored building a Transformer model using different word embedding modules to evaluate the influence of dynamic and static word embeddings.
Word2vec and GloVe are chosen as the representative static embedding methods. To facilitate the Transformer model’s input, a fully connected neural network layer is cascaded to transform the vector into a 512-dimension one. The word2vec and GloVe models are pre-trained in advance using the WMT Chinese and English datasets and the NIST Chinese and English datasets. The ELMo model is selected as the dynamic word vector method. Due to the high pre-training cost of the ELMo model, we select the open source ELMo model pre-trained on the large-scale Chinese corpus: Chinese Gigaword.
Table 3 provides a comprehensive summary of the statistical information for the three word embedding modules used in our experiments, highlighting their distinct characteristics and pre-training datasets. By comparing these methods, we aim to gain deeper insights into how static and dynamic embeddings influence the overall performance of the Transformer model in machine translation tasks.
Whether the word embedding module uses static or dynamic word vectors, it is fine-tuned on our specific datasets with a small learning rate. In our experiments, a learning rate of $5 \times 10^{-5}$ worked best.

2.3. Multi-Granularity Position Encoding

2.3.1. Relative Position Encoding

In the development of the Transformer model, there are two major methods for position encoding: absolute position encoding and relative position encoding.
  • Absolute Position Encoding
Since the Transformer model processes inputs in parallel, it cannot infer the position information of the input sequence from the processing order, as RNNs do. Therefore, the original Transformer model [16] used a trigonometric (sinusoidal) position encoding, which is essentially an absolute position encoding that encodes the position representation as a vector using the sin and cos functions. The absolute position encoding obtained with trigonometric functions can reflect the relative distance between input words to a certain extent but cannot represent the direction. Therefore, existing studies using the Transformer rarely rely on absolute position encoding alone [21,22,23].
  • Relative Position Encoding
Figure 3a illustrates that the original SA mechanism cannot represent the temporal relationship between words. Therefore, some researchers [24] proposed the Relative Position Representation (RPR). The absolute position representation is reflected in the sum of the position code and the input word embedding, and the relative position representation involves adding a set of trainable embeddings to the SA mechanism, so that the current position is compared with the received word in the calculation [25,26].
After adding the relative position representation as shown by the numbers in the figure, two instances of “I” in different positions will have different output encodings. Figure 3b shows the output encoding process of the first “I” and Figure 3c shows the output encoding process of the second “I”. When the Transformer calculates the attention between “I” and “therefore”, for the first “I”, because “therefore” is the second word to the right of the first “I”, the model will use the information in the first; and for the second “I”, since “therefore” is the first word on the left relative to the second “I”, the model will use the first word.
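The two offsets described above can be reproduced with a few lines of code. The sentence used here is an assumption chosen to match the description (two occurrences of "I" with "therefore" between them); Figure 3 itself is not reproduced in this text.

```python
words = ["I", "think", "therefore", "I", "am"]  # assumed example sentence
target = words.index("therefore")

# Signed offset of "therefore" as seen from each occurrence of "I":
# positive means to the right, negative means to the left.
offsets = {i: target - i for i, w in enumerate(words) if w == "I"}
print(offsets)
```

The first "I" sees "therefore" two positions to its right and the second sees it one position to its left, so a relative position representation indexed by these offsets gives the two occurrences different encodings.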

2.3.2. Multi-Granularity Relative Position

The original Transformer model used a single position representation sequence for absolute position encoding. Similarly, the relative position representation was based on aligning the input sequence with its corresponding positional sequence. However, when processing multi-granularity feature input, two positional representation sequences are required, making it difficult to use the existing relative position representation.
To address this issue, based on relative position encoding, we propose a new Multi-granularity Position Encoding (MGPE) approach, which combines the existing relative position representation with multi-granularity feature input. This method allows for more precise encoding of relative positions and enhances overall performance. Inspired by graph-based relative position encoding [27], we utilize a DAG to handle the multi-granularity features. Figure 4 shows the position representation of the multi-granularity feature using a DAG, where the two positions are the head and the tail node positions.
Specifically, we propose to use the position of the head node (head) and the position of the tail node (tail) to calculate the relative position of words from four different granularities. These four relative distance matrices are then concatenated to form a comprehensive multi-granularity representation. To further refine the representation, we apply a nonlinear transformation to the concatenated matrix, producing the final Multi-Granularity Position Encoding (MGPE). This approach enables the model to effectively capture positional relationships across different granularities, enhancing the precision of position encoding in multi-granularity NMT systems. The detailed formula for this method is provided in the following section.
For any word $x_k$, we use $head[k]$ and $tail[k]$ to represent its position. The relative position encoding vector $R_{ij}$ between $x_i$ and $x_j$ can be calculated under the multi-granularity feature. First, we calculate the relative distance between $x_i$ and $x_j$ from four perspectives, as shown in Equations (2a)–(2d). Four relative distance matrices $D^{(hh)}$, $D^{(ht)}$, $D^{(th)}$, and $D^{(tt)}$ are obtained, as shown in Figure 5.
$$D_{ij}^{(hh)} = head[i] - head[j] \tag{2a}$$
$$D_{ij}^{(ht)} = head[i] - tail[j] \tag{2b}$$
$$D_{ij}^{(th)} = tail[i] - head[j] \tag{2c}$$
$$D_{ij}^{(tt)} = tail[i] - tail[j] \tag{2d}$$
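A minimal sketch of Equations (2a)–(2d), assuming each distance is the signed difference between the corresponding head or tail indices. The function name `relative_distances` and the toy lattice are hypothetical.

```python
def relative_distances(head, tail):
    """Return the four signed distance matrices D(hh), D(ht), D(th), D(tt)."""
    n = len(head)
    d_hh = [[head[i] - head[j] for j in range(n)] for i in range(n)]
    d_ht = [[head[i] - tail[j] for j in range(n)] for i in range(n)]
    d_th = [[tail[i] - head[j] for j in range(n)] for i in range(n)]
    d_tt = [[tail[i] - tail[j] for j in range(n)] for i in range(n)]
    return d_hh, d_ht, d_th, d_tt


# Toy lattice: two characters and the word that spans both of them.
tokens = ["长", "江", "长江"]
head = [0, 1, 0]
tail = [0, 1, 1]
d_hh, d_ht, d_th, d_tt = relative_distances(head, tail)
for name, m in (("hh", d_hh), ("ht", d_ht), ("th", d_th), ("tt", d_tt)):
    print(name, m)
```

Taken together, the four values for a pair show whether one token contains, overlaps, or is disjoint from the other, which a single distance cannot express.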
To obtain the position encoding vector $R_{ij}$ between $x_i$ and $x_j$, the encodings of the four relative distances are first concatenated and then passed through a nonlinear transformation, as described in Equation (3).
$$R_{ij} = \mathrm{ReLU}\left(W_{train}\left(P_{D_{ij}^{(hh)}} \oplus P_{D_{ij}^{(ht)}} \oplus P_{D_{ij}^{(th)}} \oplus P_{D_{ij}^{(tt)}}\right)\right) \tag{3}$$
where $W_{train}$ is a trainable parameter, $\oplus$ denotes concatenation, and $P_d$ can be calculated by the absolute position encoding of Equations (4) and (5).
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \tag{4}$$
$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \tag{5}$$
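Equations (3)–(5) can be sketched for a single token pair as follows. The dimension `d = 8`, the random stand-in for $W_{train}$, and the assumption that $W_{train}$ maps the concatenated $4d$-dimensional vector back to $d$ dimensions are illustrative choices that are not specified in the text.

```python
import math
import random


def sinusoidal(pos, d):
    """Equations (4)-(5): d-dimensional encoding of a (possibly negative) offset."""
    vec = []
    for k in range(d):
        # Dimension k = 2i or 2i + 1 uses the exponent 2i / d.
        angle = pos / (10000 ** ((k - k % 2) / d))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def fuse(d_hh, d_ht, d_th, d_tt, weight, d):
    """Equation (3) for one pair (i, j): ReLU(W [P_hh ; P_ht ; P_th ; P_tt])."""
    concat = []
    for dist in (d_hh, d_ht, d_th, d_tt):
        concat.extend(sinusoidal(dist, d))
    return [max(0.0, sum(w * x for w, x in zip(row, concat))) for row in weight]


d = 8
random.seed(0)
# Random stand-in for the trainable matrix W_train (shape d x 4d, assumed).
weight = [[random.uniform(-0.1, 0.1) for _ in range(4 * d)] for _ in range(d)]
r_ij = fuse(-1, -2, 0, -1, weight, d)
print(len(r_ij), min(r_ij) >= 0.0)
```

In a trained model the weight matrix would be learned jointly with the rest of the network rather than sampled at random.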

2.4. Multi-Granularity Self-Attention

In this section, we introduce the multi-granularity self-attention mechanism which replaces the conventional self-attention mechanism. We first illustrate the encoder–decoder structure of our DMGTS. The DMGTS consists of a six-layer encoder, an output layer, and a six-layer decoder. Multi-granularity self-attention and a position feed-forward layer compose the encoder. Similarly, multi-head attention, position feed-forward, and encoder–decoder attention compose the decoder. Finally, a linear transformation and Softmax layer compose the output layer.
Based on Shaw et al.’s work [28], we integrate the multi-granularity relative position encoding $R_{ij}$, discussed in the previous section, into the original model. As a result, the SA mechanism is transformed into a multi-granularity self-attention mechanism (MGSA).
Equation (6) shows the general form of the SA mechanism with absolute position encoding, in which $p_i$ and $p_j$ represent the absolute position encoding, $a_{i,j}$ denotes the self-attention weight, and $o_i$ represents the final output of the self-attention layer.
$$
\begin{aligned}
q_i &= (x_i + p_i) W_Q \\
k_j &= (x_j + p_j) W_K \\
v_j &= (x_j + p_j) W_V \\
a_{i,j} &= \mathrm{softmax}\left(q_i k_j^{T}\right) \\
o_i &= \sum_j a_{i,j} v_j
\end{aligned}
\tag{6}
$$
The expanded result of $q_i k_j^{T}$ is shown in Equation (7).
$$q_i k_j^{T} = (x_i + p_i) W_Q W_K^{T} (x_j + p_j)^{T} \tag{7}$$
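For reference, multiplying out Equation (7) gives four terms. This is a routine algebraic expansion added for clarity and is not an equation from the original article.

```latex
\begin{aligned}
q_i k_j^{T}
  &= x_i W_Q W_K^{T} x_j^{T}   % content-content
   + x_i W_Q W_K^{T} p_j^{T}   % content-position
   + p_i W_Q W_K^{T} x_j^{T}   % position-content
   + p_i W_Q W_K^{T} p_j^{T}   % position-position
\end{aligned}
```

The last two terms are the ones that contain $p_i W_Q$. Dropping them leaves $x_i W_Q (x_j W_K + p_j W_K)^{T}$, in which the key-side position term $p_j W_K$ is the quantity that a relative position vector can replace.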
To introduce relative position information that is trained jointly with the model, we remove the term $p_i W_Q$ in Equation (6), replace the key-side term $p_j W_K$ with the relative position vector $R_{i,j}^{K}$, and replace the value-side term $p_j W_V$ with the relative position vector $R_{i,j}^{V}$. With these substitutions, we obtain the SA mechanism with relative position encoding given in Equations (8) and (9):
$$a_{i,j} = \mathrm{softmax}\left(x_i W_Q \left(x_j W_K + R_{i,j}^{K}\right)^{T}\right) \tag{8}$$
$$o_i = \sum_j a_{i,j}\left(x_j W_V + R_{i,j}^{V}\right) \tag{9}$$
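Equations (8) and (9) can be sketched for a single attention head in plain Python. The sketch follows the equations as written (no $1/\sqrt{d}$ scaling, one head); the function names, the identity projection matrices, and the toy inputs are assumptions made for illustration.

```python
import math


def matvec(x, w):
    """Row vector x (length m) times matrix w (m x n)."""
    return [sum(x[a] * w[a][b] for a in range(len(x))) for b in range(len(w[0]))]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(x, w_q, w_k, w_v, r_k, r_v):
    """Equations (8)-(9); r_k[i][j] and r_v[i][j] are vectors of the head size."""
    n = len(x)
    q = [matvec(xi, w_q) for xi in x]
    k = [matvec(xi, w_k) for xi in x]
    v = [matvec(xi, w_v) for xi in x]
    d = len(v[0])
    outputs = []
    for i in range(n):
        # Equation (8): the key of token j is shifted by R^K[i][j].
        scores = [
            sum(qa * (ka + ra) for qa, ka, ra in zip(q[i], k[j], r_k[i][j]))
            for j in range(n)
        ]
        a = softmax(scores)
        # Equation (9): the value of token j is shifted by R^V[i][j].
        outputs.append(
            [sum(a[j] * (v[j][c] + r_v[i][j][c]) for j in range(n)) for c in range(d)]
        )
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
zero_r = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
out = relative_attention(x, identity, identity, identity, zero_r, zero_r)
print(out)
```

With both relative position factors set to zero the function reduces to ordinary unscaled dot-product attention without position information, which gives a convenient sanity check.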
$R_{i,j}^{K}$ and $R_{i,j}^{V}$ are two relative position factors that can be trained and learned. These factors are closely related to the multi-granularity relative position representation. Specifically, the connection between the multi-granularity relative position code and representation can be expressed mathematically as shown in Equations (10) and (11):
$$R_{i,j}^{K} = W_{K}^{train}\, PE\left[\mathrm{clip}\left(i - j,\, p_{min},\, p_{max}\right)\right] \tag{10}$$
$$R_{i,j}^{V} = W_{V}^{train}\, PE\left[\mathrm{clip}\left(i - j,\, p_{min},\, p_{max}\right)\right] \tag{11}$$
where $PE$ represents the absolute position encoding, generally using the trigonometric encoding form of Equations (4) and (5), and the function $\mathrm{clip}(\cdot)$ restricts its output to a certain window range. For example, when the window size is 3, $p_{min}$ is 0 and $p_{max}$ is 6. Therefore, $\mathrm{clip}(i - j, 0, 6)$ means to take the value of $i - j$ and limit it to the range of 0 to 6. $W_{K}^{train}$ and $W_{V}^{train}$ are trainable, corresponding to the relative position weights between words in the key vector and the value vector, respectively.
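The clipping step can be illustrated in a few lines. This sketch assumes that the clipped quantity is the signed offset $i - j$ and uses the bounds from the example above; the sequence length and query position are hypothetical.

```python
def clip(value, p_min, p_max):
    """Limit value to the closed interval [p_min, p_max]."""
    return max(p_min, min(p_max, value))


p_min, p_max = 0, 6  # bounds from the example in the text
n = 9                # hypothetical sequence length
i = 8                # query position
indices = [clip(i - j, p_min, p_max) for j in range(n)]
print(indices)
```

All offsets beyond the window share the boundary index, so the number of distinct position encodings stays fixed regardless of sentence length.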
The multi-granularity SA mechanism is based on Equations (8) and (9) for calculating $R_{i,j}^{K}$ and $R_{i,j}^{V}$. To derive the corresponding multi-granularity form, Equation (4) is used. The calculation method is then given in Equations (12) and (13).
$$R_{i,j}^{K} = \mathrm{ReLU}\left(W_{r}^{K}\left(PE_{D_{ij}^{(hh)}} \oplus PE_{D_{ij}^{(ht)}} \oplus PE_{D_{ij}^{(th)}} \oplus PE_{D_{ij}^{(tt)}}\right)\right) \tag{12}$$
$$R_{i,j}^{V} = \mathrm{ReLU}\left(W_{r}^{V}\left(PE_{D_{ij}^{(hh)}} \oplus PE_{D_{ij}^{(ht)}} \oplus PE_{D_{ij}^{(th)}} \oplus PE_{D_{ij}^{(tt)}}\right)\right) \tag{13}$$

3. Data Processing

The experiments in this study were carried out on the WMT2019Zh-En translation dataset [29] and the NIST Chinese–English dataset [30]. The training dataset adopts the whole parallel corpus officially published by the WMT2019Zh-En translation, including 10M sentence pairs. The NIST Chinese–English dataset utilizes the entire 1.25 million parallel corpus as the training set, the NIST translation task general test data MT05 as the validation set, and MT04 and MT06 as the test sets.
We preprocess the training data as follows:
  • Filter out sentence pairs whose sequence length exceeds 50 characters/words.
  • Exclude sentence pairs with a Chinese-to-English sentence length ratio exceeding 1.5.
  • Employ the Moses tool [31] to clean duplicate data, HTML tags, and third-party language words in the original corpus.
  • Use the Chinese word segmentation tool to perform multi-granularity word segmentation on the source language sentence (Chinese) to generate a dictionary and set the upper limit of the dictionary size to be 50 k.
  • Obtain the head position representation and tail position representation of Chinese characters and words after word segmentation and store them in a dictionary data structure.
  • Tokenize the English sentence, use the byte pair encoding BPE to segment, and train the subword dictionary [15] with a dictionary size of 50 k.
  • After completing the word segmentation process, incorporate the <bos> (start of sentence) and <eos> (end of sentence) markers to both the Chinese and English sentence pairs.
After data preprocessing, the WMT2019Zh-En Chinese–English translation dataset contains 8.9 M sentence pairs, and the NIST Chinese–English dataset contains 1.2 M sentence pairs. The specifications of the processed datasets are shown in Table 4 and Table 5.
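The length filter, the ratio filter, and the sentence markers from the list above can be sketched as follows. The sketch assumes that lengths are counted in tokens on each side and that the ratio is Chinese length divided by English length; the function names and toy sentence pairs are hypothetical, and the Moses cleaning, segmentation, and BPE steps are not reproduced here.

```python
MAX_LEN = 50     # maximum sequence length from the text
MAX_RATIO = 1.5  # maximum Chinese-to-English length ratio from the text


def keep_pair(zh_tokens, en_tokens):
    """Return True if a sentence pair passes the length and ratio filters."""
    if not zh_tokens or not en_tokens:
        return False
    if len(zh_tokens) > MAX_LEN or len(en_tokens) > MAX_LEN:
        return False
    return len(zh_tokens) / len(en_tokens) <= MAX_RATIO


def add_markers(tokens):
    """Wrap a tokenized sentence in start and end markers."""
    return ["<bos>"] + tokens + ["<eos>"]


pairs = [
    (["我", "爱", "你"], ["I", "love", "you"]),  # kept
    (["这"] * 6, ["this"] * 3),                  # ratio 2.0, dropped
    (["字"] * 51, ["w"] * 40),                   # too long, dropped
]
kept = [(add_markers(zh), add_markers(en)) for zh, en in pairs if keep_pair(zh, en)]
print(len(kept), kept[0])
```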

4. Experiments

Our DMGTS was built using the Fairseq toolbox [32], and the machine learning models were developed with PyTorch 2.0.1 on a Linux server. The hardware was a server equipped with 32 GB of memory, an RTX 2080 Ti graphics card, and an Intel i7-9700K CPU. In this study, the control variable method was employed to conduct comparative experiments and verify the impact of each module in the proposed DMGTS.
In our experimental framework, the Bilingual Evaluation Understudy (BLEU) score serves as the universal metric for assessing the quality of the translations produced by our models. BLEU, a widely accepted measure in the field of machine translation, quantifies the correspondence between a machine’s output and that of a human translator, focusing on the precision of n-gram matches across the two texts. This metric enables a quantitative evaluation of the translation’s fidelity and fluency, providing a standardized way to compare the effectiveness of different model configurations and input granularities.
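To make the n-gram matching concrete, the sketch below computes a simplified sentence-level BLEU with one reference, clipped n-gram precision, and a brevity penalty. It is an illustration only: reported BLEU scores are normally computed at corpus level with a standard evaluation script, and the text does not specify which script was used. The function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped matches: an n-gram is credited at most as often as in the reference.
        matched = sum(min(count, ref[g]) for g, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(sentence_bleu(candidate, reference, max_n=2))
```

In this toy case the unigram precision is 5/6 and the bigram precision is 3/5, so the score with `max_n=2` is the geometric mean of the two; with the default `max_n=4` the score is zero because no 4-gram matches, which is why practical implementations apply smoothing or aggregate counts over a whole corpus.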

4.1. Multi-Granularity Feature Experiment

The goal of the multi-granularity feature experiment is to evaluate how varying levels of granularity in the input affect the performance of translation tasks, specifically comparing multi-granularity inputs against single-granularity inputs. To ensure the reliability of the results, it is crucial to maintain consistency across all other variables in the experiment as much as possible.
Therefore, we adopted the traditional Transformer model as the backbone and integrated different single-granularity word segmentation methods with it. Considering the source language of the Chinese corpus, single-granularity inputs are categorized into four distinct types: character-level granularity and three separate word-level granularities. Specifically, $Transformer_{cha}$, $Transformer_{msr}$, $Transformer_{pku}$, and $Transformer_{ctb}$ denote the Transformer model with character, MSR, PKU, and CTB segmentations, respectively. In this experiment, we employed random initialization for the word embedding of our DMGTS with multi-granularity input and denoted this variant as $DMGTS_{rdm}$.
Table 6 and Table 7 show the experimental results of multi-granularity input compared to single-granularity input on WMT2019Zh-En dataset and NIST Chinese–English dataset.
In Table 6, we observe that the $DMGTS_{rdm}$ model, which incorporates multi-granularity input, outperforms all other models on both the validation set (newstest2018) and the test set (newstest2019). The BLEU scores show a clear advantage for the DMGTS model, with scores of 25.47 on the validation set and 30.37 on the test set. This is a significant improvement over the single-granularity input models. Similarly, Table 7 shows the improvement of $DMGTS_{rdm}$ over single-granularity models on BLEU scores. Notably, on the MT06 set, the DMGTS achieves a BLEU score of 41.27, surpassing the single-granularity models by at least 1.87 points.
Across both datasets, the $DMGTS_{rdm}$ model consistently outperforms the baseline Transformer models that use single-granularity inputs. This indicates that the $DMGTS_{rdm}$ model’s approach to integrating multi-granularity information into the Transformer’s architecture not only addresses the limitations of single granularity but also capitalizes on the rich linguistic information present at multiple levels of language structure.
Among the methods with single-granularity inputs, character-level input outperforms the word-level alternatives. Additionally, the models trained on MSR, PKU, and CTB word segmentations achieve closely matched performance.

4.2. Ablation Study on Relative Position Factors

To study the impact of the relative position factors $R_{i,j}^{K}$ and $R_{i,j}^{V}$ on the model performance, we tested different combinations of $(R_{i,j}^{K}, R_{i,j}^{V})$ with $R_{i,j}^{K} \in \{0, 1\}$ and $R_{i,j}^{V} \in \{0, 1\}$. Experiments were conducted on both datasets with our $DMGTS_{rdm}$ model.
Table 8 presents the impact of relative position factors $R_{i,j}^{K}$ and $R_{i,j}^{V}$ on translation quality as measured by BLEU scores. As shown in Table 8, if both $R_{i,j}^{K}$ and $R_{i,j}^{V}$ are eliminated, the BLEU value on the WMT dataset is only 17.32, and the BLEU value on the NIST dataset is only 22.67. When any relative position factor is added, the BLEU value corresponding to the model is greatly improved. The inclusion of both positional expressions yields the highest BLEU scores of 30.37 for WMT and 41.25 for NIST. However, the marginal gains when both factors are included compared to just one are relatively slight. This suggests that each factor independently possesses a strong capability to encapsulate relative positional information effectively.

4.3. Ablation Study on Word Embeddings

To evaluate the impact of different word embedding methods on the performance of machine translation models, we conducted an ablation study on both datasets with the DMGTS as the framework and with various word embedding modules. The compared methods are $\mathrm{DMGTS}_{rdm}$, $\mathrm{DMGTS}_{w2v}$, $\mathrm{DMGTS}_{glv}$, and $\mathrm{DMGTS}$, which employ random initialization, word2vec, GloVe, and ELMo for their word embeddings, respectively. The first three use static word vectors, and the last one uses dynamic word vectors.
Table 9 shows the BLEU scores with different word embedding modules. The ELMo dynamic word embedding module yields a notable improvement in translation quality, with BLEU scores of 31.53 for the WMT dataset and 42.61 for the NIST dataset. These scores are up to 1.16 and 1.55 points higher than those achieved using static word vector methods, which supports the premise that dynamic word embeddings contribute positively to model performance. Among the static embeddings, $\mathrm{DMGTS}_{w2v}$ marginally outperforms the others, with $\mathrm{DMGTS}_{rdm}$ and $\mathrm{DMGTS}_{glv}$ following closely; the differences among the three are minimal.
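The maximum and average gains of the ELMo-based model over the static embeddings follow directly from Table 9. The sketch below recomputes them; the variable and function names are illustrative only.

```python
# BLEU scores copied from Table 9 (newstest2019 and NIST ALL).
static = {
    "rdm": {"WMT": 30.37, "NIST": 41.25},
    "w2v": {"WMT": 30.49, "NIST": 41.34},
    "glv": {"WMT": 30.42, "NIST": 41.06},
}
elmo = {"WMT": 31.53, "NIST": 42.61}


def gains(dataset):
    """BLEU gain of the ELMo-based DMGTS over each static embedding variant."""
    return [round(elmo[dataset] - row[dataset], 2) for row in static.values()]


summary = {}
for dataset in ("WMT", "NIST"):
    g = gains(dataset)
    summary[dataset] = {
        "min": min(g),
        "max": max(g),
        "mean": round(sum(g) / len(g), 2),
    }
    print(f"{dataset}: min +{min(g):.2f}, max +{max(g):.2f}, "
          f"mean +{summary[dataset]['mean']:.2f}")
```

The maxima correspond to the "up to 1.16 and 1.55" figures, and the means correspond to the average gains of 1.10 and 1.39 reported in the Conclusions.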

5. Discussion

In addressing the limitations of current NMT systems, our research focuses on two problems: static word vectors and single-granularity inputs. Static word vectors assign each word one fixed representation and therefore often fail to capture context-dependent differences in meaning and grammatical function, leading to deviations that can reduce translation accuracy. Furthermore, the prevalent practice in NMT systems of relying on a single level of language granularity (character-based, word-based, or subword-based) restricts the encoder’s ability to represent input sentences efficiently and reliably. To bridge these research gaps, we introduced the DMGTS, an adaptation of the Transformer model that integrates multi-granularity position encoding and multi-granularity self-attention to accommodate multiple levels of language granularity.
Through experiments on two Chinese–English translation datasets, we demonstrate the importance of multi-granularity position encoding and dynamic word embedding for the translation quality of the DMGTS. Specifically, with multi-granularity position encoding, $\mathrm{DMGTS}_{rdm}$ achieved BLEU scores of 25.47 on the validation set (newstest2018) and 30.37 on the test set (newstest2019), ahead of every single-granularity baseline. On the MT06 set, it attained a BLEU score of 41.27, exceeding the single-granularity models by at least 0.97 points. Additionally, the ELMo dynamic word embedding module within the DMGTS further improves translation quality: BLEU rises to 31.53 on the WMT dataset and 42.61 on the NIST dataset, surpassing the static word vector methods by up to 1.16 and 1.55 points (1.10 and 1.39 points on average), respectively.
While our experiments validate the effectiveness of the proposed DMGTS in the field of neural machine translation, particularly for Chinese–English translation tasks, there are some areas for future work. One valuable direction is to extend the application of DMGTS to other translation tasks. The effectiveness of multi-granularity position encoding and self-attention mechanisms has been demonstrated within this language pair, but its applicability and performance in other translation tasks remain untested. Language pairs with different syntactic, grammatical, and semantic structures may present unique challenges that the current model configuration might not address effectively. Expanding the testing to include diverse language pairs, such as those involving non-Indo-European languages with varying levels of morphological complexity, would provide a more comprehensive understanding of the model’s versatility and areas for improvement.
Another direction worth exploring is to study multi-granularity position encoding in applications beyond translation. While our DMGTS showcases the potential of integrating multi-level linguistic information in translation tasks, the broader applicability of this concept across other NLP tasks has not been investigated. Tasks such as text summarization, question answering, and natural language inference could potentially benefit from the nuanced understanding and representation of language that multi-granularity approaches offer.
Furthermore, the current DMGTS implementation may encounter limitations related to computational efficiency and resource demands. The integration of multi-granularity inputs and dynamic embeddings, while beneficial for capturing linguistic nuances, increases the model’s complexity and the computational resources required for training and inference, compared to methods with single-granularity input and static word embedding.

6. Conclusions

In this paper, we developed the DMGTS, a novel NMT model that enhances translation accuracy by integrating multi-granularity features with dynamic word vectors. Through comprehensive ablation studies on both WMT and NIST Chinese–English datasets, we demonstrated the significant impact of multi-granularity input on improving translation performance. Additionally, our experiments underscored the critical role of relative position factors in the model’s effectiveness. By comparing the performance of dynamic word vector embedding, as realized through the ELMo model, against traditional static word vector embeddings like word2vec and GloVe, we showcased the superior capability of our model to handle the complexity of language, particularly in capturing the context-sensitive nature of words.
Our findings reveal that integrating multi-granularity features with dynamic word embeddings outperforms conventional static embedding methods, with average gains of 1.10 and 1.39 BLEU points on the WMT and NIST Chinese–English translation tasks, respectively. This result highlights the limitations of existing NMT models that rely on static embeddings and underlines the benefits of our approach, which leverages ELMo to enhance the encoder network with more context-aware, dynamic representations.

Author Contributions

Conceptualization, W.Z., L.Y., and B.Y.; methodology, S.L. (Shenrong Lv), J.T. and X.C.; software, S.L. (Shenrong Lv), J.T., and X.C.; validation, R.W., S.L. (Shenrong Lv) and J.T.; formal analysis, R.W., S.L. (Siyu Lu), and B.Y.; resources, L.Y. and B.Y.; data curation, R.W. and S.L. (Shenrong Lv); writing—original draft preparation, L.Y., S.L. (Siyu Lu), and S.L. (Shenrong Lv); writing—review and editing, S.L. (Siyu Lu), L.Y., and W.Z.; visualization, S.L. (Shenrong Lv) and X.C.; supervision, B.Y.; project administration, W.Z.; funding acquisition, B.Y. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

Supported by Sichuan Science and Technology Program (2023YFSY0026, 2023YFH0004).

Data Availability Statement

The original contributions presented in this study are publicly available. These data can be found here: http://www.statmt.org/wmt19/index.html (accessed on 3 August 2024).

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Lei, L.; Wang, H. Design and Analysis of English Intelligent Translation System Based on Internet of Things and Big Data Model. Comput. Intell. Neurosci. 2022, 16, 6788813.
  2. Chen, Y. Intelligent English Language Translation and Grammar Learning Based on Internet of Things Technology. ACM Trans. Asian Low-Resource Lang. Inf. Process. 2023, 9, 3588769.
  3. Sutskever, I. Sequence to Sequence Learning with Neural Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112.
  4. Baliyan, A.; Batra, A.; Singh, S.P. Multilingual Sentiment Analysis Using RNN-LSTM and Neural Machine Translation. In Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021.
  5. Wang, Q. Semantic Analysis Technology of English Translation Based on Deep Neural Network. Comput. Intell. Neurosci. 2022, 16, 1176943.
  6. Kalchbrenner, N.; Blunsom, P. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013.
  7. Wang, H. Short Sequence Chinese-English Machine Translation Based on Generative Adversarial Networks of Emotion. Comput. Intell. Neurosci. 2022, 16, 3385477.
  8. Hu, S.; Li, X.; Bai, J.; Lei, H.; Qian, W.; Hu, S.; Yang, S. Neural Machine Translation by Fusing Key Information of Text. CMC Comput. Mater. Contin. 2023, 74, 2803–2815.
  9. Li, X.; Liu, L.; Tu, Z.; Li, G.; Shi, S.; Meng, M.Q.H. Attending from Foresight: A Novel Attention Mechanism for Neural Machine Translation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2606–2616.
  10. Wang, D.; Liu, B.; Zhou, Y. Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning. Appl. Sci. 2022, 12, 11875.
  11. Zhu, W.; Liu, H.; Dong, Q.; Xu, J.; Huang, S.; Kong, L.; Chen, J.; Li, L. Multilingual machine translation with large language models: Empirical results and analysis. In Proceedings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024.
  12. Li, C.; Zhang, M.; Liu, X.; Li, Z.; Wong, D.; Zhang, M. Towards Demonstration-Aware Large Language Models for Machine Translation. In Proceedings of the Association for Computational Linguistics ACL 2024, Virtual Meeting, Bangkok, Thailand, 11–16 August 2024.
  13. Zhu, S.; Cui, M.; Xiong, D. Towards robust in-context learning for machine translation with large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024.
  14. Costa-Jussa, M.R.; Fonollosa, J.A. Character-based Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016.
  15. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016.
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  17. Morishita, M.; Suzuki, J.; Nagata, M. Improving Neural Machine Translation by Incorporating Hierarchical Subword Features. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018.
  18. Su, J.; Tan, Z.; Xiong, D.; Ji, R.; Shi, X.; Liu, Y. Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
  19. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018.
  20. Dahl, G.E.; Yu, D. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 2011, 20, 30–42.
  21. Qi, L.; Zhang, Y. Bidirectional Transformer with Absolute-Position Aware Relative Position Encoding for Encoding Sentences. Front. Comput. Sci. 2023, 17, 171301.
  22. Chen, T.; Zhou, L.; Wang, N.; Chen, X. Joint Entity and Relation Extraction with Position-Aware Attention and Relation Embedding. Appl. Soft Comput. 2022, 119, 108604.
  23. Pathan, A.F.; Prakash, C. Attention-Based Position-Aware Framework for Aspect-Based Opinion Mining Using Bidirectional Long Short-Term Memory. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8716–8726.
  24. Chen, P. PermuteFormer: Efficient Relative Position Encoding for Long Sequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021.
  25. Li, J. Application of Machine Learning Combined with Wireless Network in Design of Online Translation System. Wireless Commun. Mobile Comput. 2022, 12, 1266397.
  26. Yu, J.; Ma, X. English Translation Model Based on Intelligent Recognition and Deep Learning. Wireless Commun. Mobile Comput. 2022, 22, 3079775.
  27. Park, W.; Chang, W.G.; Lee, D.; Kim, J. GRPE: Relative Positional Encoding for Graph Transformer. In Proceedings of the ICLR 2022 Machine Learning for Drug Discovery, Online, 25–29 April 2022.
  28. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018.
  29. Wikimedia Foundation. ACL 2019 Fourth Conference on Machine Translation (WMT19), Shared Task: Machine Translation of News. Available online: http://www.statmt.org/wmt19/translation-task.html (accessed on 3 August 2024).
  30. NIST Multimodal Information Group. NIST 2008 Open Machine Translation (OpenMT) Evaluation, V1; Abacus Data Network. 2023. Available online: https://hdl.handle.net/11272.1/AB2/YEK10L (accessed on 3 August 2024).
  31. Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Herbst, E. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics: Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, 25–27 June 2007.
  32. Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Auli, M. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Minneapolis, MN, USA, 2–7 June 2019.
Figure 1. Architecture of the proposed DMGTS.
Figure 2. Performance comparison of different word segmentation methods.
Figure 3. Comparison of output encoding procedures with and without adding relative positions. (a) SA; (b) the output encoding of the first “I”; (c) the output encoding of the second “I”.
Figure 4. Multi-granularity position encoding using DAG.
Figure 5. Multi-granularity relative distance matrix.
Table 1. Word embedding comparison of NMT systems.
| Model | Language Granularity | Textualization | Static Vector? |
| --- | --- | --- | --- |
| Sutskever et al. [3] | Word | Bag-of-words model | Yes |
| Costa-Jussa et al. [14] | Character | CBOW | Yes |
| Sennrich et al. [15] | Subword | Skip-Gram | Yes |
| Vaswani et al. [16] | Character/subword | Random | Yes |
| Morishita et al. [17] | Subword | GRU | Yes |
Table 2. Performance comparison of different word segmentation methods.
| Word Segmentation Method | Precision | Recall | F1 |
| --- | --- | --- | --- |
| MSR | 95.88 | 94.94 | 95.41 |
| PKU | 96.55 | 95.04 | 96.48 |
| CTB | 96.37 | 94.11 | 95.22 |
Table 3. Comparisons of different word embedding methods.
| Word Vector Type | Name | Pre-Training Corpus | Word Vector Dimension | Number of Word Vectors |
| --- | --- | --- | --- | --- |
| Static | Random | - | 512 | 50k |
| Static | word2vec | WMT+NIST Chinese corpus | 300 | 50k |
| Static | GloVe | WMT+NIST Chinese corpus | 300 | 50k |
| Dynamic | ELMo | Chinese Gigaword V5 | 512 | - |
Table 4. Specifications of WMT2019Zh-En dataset.
| Type | Name | Size |
| --- | --- | --- |
| Parallel corpus, training set | WMT2019Zh-En | 8.9 M |
| Validation set | newstest2018 | 2 k |
| Test set | newstest2019 | 2 k |
Table 5. Specifications of NIST Chinese and English datasets.
| Type | Name | Size |
| --- | --- | --- |
| Parallel corpus, training set | NIST Zh-En | 1.2 M |
| Validation set | MT05 | 1082 |
| Test set | MT04 | 1788 |
| Test set | MT06 | 1664 |
| Test set | MT08 | 1357 |
Table 6. BLEU scores of models with various granularities on the WMT2019Zh-En dataset.
| Model | Input | Validation Set (newstest2018) | Test Set (newstest2019) |
| --- | --- | --- | --- |
| $\mathrm{Transformer}_{cha}$ | Character | 24.80 | 29.29 |
| $\mathrm{Transformer}_{msr}$ | MSR | 24.53 | 29.13 |
| $\mathrm{Transformer}_{pku}$ | PKU | 24.41 | 29.26 |
| $\mathrm{Transformer}_{ctb}$ | CTB | 24.67 | 29.09 |
| $\mathrm{DMGTS}_{rdm}$ | Multi-granularity input | 25.47 | 30.37 |
Table 7. BLEU scores of models with various granularities on the NIST Chinese–English dataset.
| Model | Input | Validation Set: MT05 | Test Set: MT04 | Test Set: MT06 | Test Set: MT08 | Test Set: ALL |
| --- | --- | --- | --- | --- | --- | --- |
| $\mathrm{Transformer}_{cha}$ | Character | 41.69 | 43.97 | 40.21 | 31.73 | 40.41 |
| $\mathrm{Transformer}_{msr}$ | MSR | 41.07 | 43.50 | 39.59 | 31.01 | 39.78 |
| $\mathrm{Transformer}_{pku}$ | PKU | 41.57 | 43.58 | 40.15 | 31.53 | 40.15 |
| $\mathrm{Transformer}_{ctb}$ | CTB | 41.77 | 43.49 | 40.30 | 31.67 | 40.31 |
| $\mathrm{DMGTS}_{rdm}$ | Multi-granularity input | 42.56 | 44.72 | 41.27 | 32.89 | 41.25 |
Table 8. BLEU scores with various combinations of relative position factors $(R_{i,j}^{K}, R_{i,j}^{V})$.
| $R_{i,j}^{K}$ | $R_{i,j}^{V}$ | WMT2019Zh-En | NIST Chinese–English Dataset, ALL (MT04+06+08) |
| --- | --- | --- | --- |
| 0 | 0 | 17.32 | 22.67 |
| 1 | 0 | 30.23 | 41.14 |
| 0 | 1 | 30.18 | 41.21 |
| 1 | 1 | 30.37 | 41.25 |
Table 9. BLEU scores with different word embedding modules.
| Word Embedding Module | WMT Chinese and English Dataset (newstest2019) | NIST Chinese–English Dataset, ALL (MT04+06+08) |
| --- | --- | --- |
| $\mathrm{DMGTS}_{rdm}$ | 30.37 | 41.25 |
| $\mathrm{DMGTS}_{w2v}$ | 30.49 | 41.34 |
| $\mathrm{DMGTS}_{glv}$ | 30.42 | 41.06 |
| $\mathrm{DMGTS}$ | 31.53 | 42.61 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lv, S.; Yang, B.; Wang, R.; Lu, S.; Tian, J.; Zheng, W.; Chen, X.; Yin, L. Dynamic Multi-Granularity Translation System: DAG-Structured Multi-Granularity Representation and Self-Attention. Systems 2024, 12, 420. https://doi.org/10.3390/systems12100420
