4.3. Experimental Parameter Settings
To obtain the best performance from the Seq2Seq+attention framework in this article, the hyperparameters were tuned as follows. The batch size was set to 640. The word vector dimensions of the encoder and decoder were both set to 300. The hidden layer sizes of the encoder and decoder were 128 and 356, respectively. The dropout rate of the encoder was 0.3, and that of the decoder was 0. The learning rate was set to 0.02 with dynamic decay. During sentence generation, the beam search width was set to 15. The Seq2Seq+attention network model was trained for 100 epochs, and the model with the lowest loss value was saved as the final model.
For the Transformer model in this article, both the encoder and decoder had four layers. The batch size was set to 512. The word vector dimensions of the encoder and decoder were both 512, as were the hidden layer sizes. The dropout rate for both the encoder and decoder was 0.1, and the learning rate was 0.0005. Both the encoder and decoder used eight heads in their multi-head attention mechanisms. Training lasted 60 epochs, with the loss computed on the validation set after each epoch, and the model with the lowest validation loss was selected as the final model.
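For clarity, the settings above can be collected into configuration objects. The following is a minimal sketch in Python; the dictionary names and layout are illustrative assumptions, not the authors' actual training code.

```python
# Hypothetical configuration dicts collecting the hyperparameters reported
# above; the names are illustrative, not taken from the authors' code.
seq2seq_attention_config = {
    "batch_size": 640,
    "embedding_dim": 300,         # word vector dimension (encoder and decoder)
    "encoder_hidden_size": 128,
    "decoder_hidden_size": 356,
    "encoder_dropout": 0.3,
    "decoder_dropout": 0.0,
    "learning_rate": 0.02,        # with dynamic decay during training
    "beam_width": 15,             # beam search width at generation time
    "epochs": 100,                # model with the lowest loss kept as final
}

transformer_config = {
    "num_encoder_layers": 4,
    "num_decoder_layers": 4,
    "batch_size": 512,
    "embedding_dim": 512,
    "hidden_size": 512,
    "dropout": 0.1,
    "learning_rate": 5e-4,
    "num_attention_heads": 8,
    "epochs": 60,                 # model with the lowest validation loss kept
}
```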
4.4. Evaluation Indicators
In order to ensure the objectivity of the results, this article evaluates the models using two metrics: the BLEU score and human evaluation. The essence of the BLEU score is to measure the similarity between two texts, so computing the score between the written language generated from the test set and the original written language reflects the performance of the model. The formula for BLEU is shown in Equation (6):

$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} W_n \log P_n \right) \tag{6}$$
In Equation (6), $P_n$ represents the modified n-gram precision, i.e., the proportion of the candidate sentence's n-grams that also appear in the reference sentence, with $n$ up to $N = 4$; thus, only the precision of n-grams up to length 4 is measured. $W_n$ represents the weight of each n-gram order, typically uniform ($W_n = 1/N$). $BP$ is a brevity penalty that counteracts BLEU's bias towards short translations. When the predicted sentence is longer than the reference sentence, $BP$ is set to 1; otherwise, it depends on the length ratio between the predicted and reference sentences, as shown in Equation (7):

$$BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases} \tag{7}$$

where $c$ is the length of the predicted sentence and $r$ is the length of the reference sentence.
The final BLEU score is calculated as the average BLEU score over all sentences in the test set. This article calculates BLEU scores for both word-grams and character-grams. However, BLEU is only a substitute for human evaluation: it does not account for grammatical accuracy, and it ignores synonyms and similar expressions. It is therefore necessary to add human evaluation as a further criterion for assessing the model.
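As a concrete illustration of Equations (6) and (7), the following is a minimal sentence-level BLEU sketch in Python. It assumes uniform weights $W_n = 1/4$ and pre-tokenized input, and it is not the authors' evaluation code; in practice a library implementation such as NLTK's sentence_bleu would typically be used.

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU per Equations (6) and (7).

    candidate, reference: lists of tokens (characters for character-level
    BLEU, words for word-level BLEU). Uniform weights W_n = 1/max_n.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped count of candidate n-grams also found in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        p_n = overlap / total
        if p_n == 0:  # avoid log(0); real implementations apply smoothing
            return 0.0
        log_precisions.append(math.log(p_n))

    # Brevity penalty, Equation (7): c = candidate length, r = reference length.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Character-level BLEU: treat each Chinese character as a token.
score = sentence_bleu(list("这件事做得很好"), list("这件事情做得很好"))
```

The corpus-level score is then the average of these per-sentence scores over the test set; passing characters as tokens yields character-level BLEU, and passing words yields word-level BLEU.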
The manual evaluation method used in this article takes the form of a questionnaire survey. Each questionnaire was composed of 500 non-colloquial sentence pairs randomly selected with replacement from the test set. The questionnaires were distributed to 20 graduate students at our university, who rated the effectiveness of de-colloquialization for each sentence on a ten-point scale. The scores were averaged within each questionnaire, and the average across all questionnaires determined the final de-colloquialization score for the model.
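A minimal sketch of this sampling-and-averaging procedure is given below, assuming each rater's scores are stored as a list; the data layout and function names are hypothetical.

```python
import random

# Hypothetical layout: test_pairs is a list of (spoken, written) sentence pairs.
def build_questionnaire(test_pairs, size=500):
    """Randomly sample sentence pairs with replacement for one questionnaire."""
    return [random.choice(test_pairs) for _ in range(size)]

def final_de_colloquialization_score(score_sheets):
    """score_sheets: one list of ten-point ratings per rater's questionnaire
    (20 raters in this article). Average within each questionnaire, then
    average across questionnaires."""
    per_questionnaire = [sum(sheet) / len(sheet) for sheet in score_sheets]
    return sum(per_questionnaire) / len(per_questionnaire)
```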
4.5. Results and Analysis
This article employs two prominent machine translation models to investigate whether the principles of machine translation can be applied to the task of de-colloquialization. The Seq2Seq+attention model failed to fit the de-colloquialization dataset well: even after 100 epochs of training, the loss value remained as high as 5.77. To verify whether this phenomenon was attributable to the size of the dataset, the model was further trained on a corpus of 1 million parallel sentence pairs. However, even after 100 epochs, the loss value only reached 5.15, as shown in Figure 4, indicating a limited improvement.
In contrast, the Transformer model converged effectively within only 60 epochs, as shown in Figure 5.
The comparison of experimental results is presented in
Table 6.
Table 6 displays detailed metrics for the training of both the Seq2Seq+attention and Transformer models: the amount of data used for training, the final converged loss value, the character-level and word-level BLEU scores (computed by matching n-grams against the reference answers), the human evaluation score, and the training time. The first row reports the BLEU scores computed directly between the 49,000 paired spoken and written texts of the test set; these serve as baseline indicators of the similarity between spoken and written texts, against which the BLEU scores of model-transformed sentences are compared. For sentences transformed by the Transformer model, both the character-level and word-level BLEU scores exceed this baseline, indicating that the transformed sentences are more closely aligned with written language. Seq2Seq denotes a pure translation model without an attention mechanism. The Transformer (base) model has four attention heads in its multi-head attention mechanism and a hidden layer dimension of 256, whereas the Transformer model has eight attention heads and a hidden layer dimension of 512.
As the Seq2Seq+attention model did not converge to an ideal state, it was excluded from the model evaluation. It is notable that the Transformer model outperforms the Seq2Seq+attention model in terms of convergence speed and training duration.
Regarding the task of de-colloquialization, since both the source and target languages are Chinese, the difference in grammar and semantics between the two is relatively small. In this setting, the multi-head attention mechanism can effectively consolidate features across multiple representation subspaces, whereas Seq2Seq+attention, which attends in only a single subspace, may struggle to learn useful features and may forget relevant ones. The Seq2Seq+attention model may therefore converge poorly on datasets where the distance between the source and target languages is small.
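For reference, the following is a minimal PyTorch sketch of multi-head attention with the eight heads and 512-dimensional model size used here. It illustrates how features are attended to in several subspaces in parallel, in contrast to the single attention distribution of Seq2Seq+attention; it is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: each of the h heads attends in its
    own d_model/h-dimensional subspace, and the results are concatenated."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_head = d_model // num_heads
        self.num_heads = num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B, T, _ = query.shape
        # Project and split into heads: (B, num_heads, seq_len, d_head).
        def split(x):
            return x.view(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(query)), split(self.k_proj(key)), split(self.v_proj(value))
        # Scaled dot-product attention, computed per head in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        # Concatenate the heads and mix them with the output projection.
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```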
The calculated BLEU scores are very high because the source and target languages are already very similar. For de-colloquializing Chinese, although there are many changes in word order and synonym replacements, these do not have a significant impact on the BLEU calculation.
The Transformer model also achieved a good score on the human evaluation indicator. De-colloquialization is a relatively subjective task, and individuals may perceive differently whether a sentence is colloquial, resulting in varying scores. The de-colloquialization effects of the Transformer model on different types of colloquial sentences are shown in
Table 7,
Table 8,
Table 9,
Table 10 and
Table 11. To facilitate readers' comprehension, English translations of the Chinese text have been provided. Note that, owing to differences between English and Chinese colloquial usage, the translations of colloquial expressions may not be entirely literal, as they prioritize conveying the original meaning of the Chinese text. The English translations are offered solely to assist readers and should be used as a reference only.
This example demonstrates the model’s capability in de-colloquializing nouns. In the given sentence, “谢谢” and “爸妈” are colloquial terms in Chinese, where “谢谢” is used as a colloquial verb and “爸妈” is a colloquial noun. After applying the de-colloquialization model, “谢谢” is replaced with “感谢”, and “爸妈” is replaced with “父母”.
This example showcases the model’s ability to de-colloquialize verbs. In the given sentence, “干” is a colloquial verb commonly used in Chinese. After applying the de-colloquialization model, “干” is replaced with “做”, which is a more formal equivalent.
This example demonstrates the model’s ability to de-colloquialize conjunctions. In the given sentence, “不光 … 连 …” is a colloquial expression commonly used in Chinese to indicate a progressive relationship. After applying the de-colloquialization model, “不光 … 连 …” is replaced with “不仅 …而且 …”, which is a more formal expression to convey a progressive relationship.
This example showcases the model’s ability to de-colloquialize modal particles, which occur only in colloquial language. In the given sentence, both “呢” and “吧” are modal particles used in Chinese to emphasize the emotional tone of an expression. After applying the de-colloquialization model, the model not only removes the modal particles “呢” and “吧” but also modifies the sentence to retain its original meaning: for example, “可是” is changed to “但”, and “已经” is inserted after “好像”.
This example highlights the model’s ability to de-colloquialize in terms of grammar. In the original sentence, the conjunction “不仅” lacks a fixed collocation, and the phrase “两个距离的不动的理由” is unclear and lacks coherence. After applying the de-colloquialization model, the model completely restructures and summarizes the content, resulting in “两个距离的不动点的存在性定理”, which means “the existence theorem of two fixed points of distances”. The model removes the conjunction “不仅” and uses “并” to indicate a progressive relationship.