Enhancing Chinese Dialogue Generation with Word–Phrase Fusion Embedding and Sparse SoftMax Optimization

Lv, Shenrong; Lu, Siyu; Wang, Ruiyang; Yin, Lirong; Yin, Zhengtong; A. AlQahtani, Salman; Tian, Jiawei; Zheng, Wenfeng

doi:10.3390/systems12120516


by Shenrong Lv ¹, Siyu Lu ¹, Ruiyang Wang ¹, Lirong Yin ², Zhengtong Yin ³, Salman A. AlQahtani ⁴, Jiawei Tian ⁵ and Wenfeng Zheng ¹,⁴,*

¹ School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China

² Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, USA

³ College of Resource and Environment Engineering, Guizhou University, Guiyang 550025, China

⁴ Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia

⁵ Department of Computer Science and Engineering, Major in Bio Artificial Intelligence, Hanyang University, Ansansi 15577, Republic of Korea

* Author to whom correspondence should be addressed.

Systems 2024, 12(12), 516; https://doi.org/10.3390/systems12120516

Submission received: 18 August 2024 / Revised: 18 November 2024 / Accepted: 19 November 2024 / Published: 24 November 2024

(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)


Abstract:

Chinese dialogue generation faces multiple challenges, such as semantic understanding, information matching, and response fluency. Generative dialogue systems for Chinese conversation are difficult to construct because of the flexible word order, the strong effect of word replacement on semantics, and the complex implicit context. Existing methods still have limitations in addressing these issues. To tackle these problems, this paper proposes an improved Chinese dialogue generation model based on the transformer architecture. The model uses a multi-layer transformer decoder as the backbone and introduces two key techniques, namely incorporating pre-trained language model word embeddings and optimizing the loss with a sparse Softmax. For word-embedding fusion, we combine the word vectors from the pre-trained model with character-based embeddings to enhance the semantic information of word representations. The sparse Softmax optimization mitigates overfitting by introducing a sparsity regularization term. Experimental results on the Chinese short text conversation (STC) dataset demonstrate that our proposed model significantly outperforms the baseline models on automatic evaluation metrics, such as BLEU and Distinct, with an average improvement of 3.5 percentage points. Human evaluations also validate the superiority of our model in generating fluent and relevant responses. This work provides new insights and solutions for building more intelligent and human-like Chinese dialogue systems.

Keywords:

generative dialogue system; natural language processing; transformer; attention mechanism; word embedding

1. Introduction

The rapid advancement of artificial intelligence has fueled the development of intelligent dialogue systems, which aim to engage in natural and coherent conversations with humans. These systems have a wide range of applications, such as customer support, virtual assistance, and personalized recommendations. Based on their underlying mechanisms, dialogue systems can be broadly categorized into two types, namely retrieval-based methods, which search for relevant responses from a predefined database [1,2,3,4], and generation-based methods, which produce new responses dynamically based on the input context [5,6,7].

However, building effective dialogue systems is still a challenging task, especially for languages like Chinese, which exhibits unique characteristics such as rich morphology, ambiguous word boundaries, and complex semantic structures [8,9]. Existing approaches to Chinese dialogue generation often struggle to capture deep semantics and generate contextually relevant responses. Retrieval-based methods [10,11] or deep learning approaches [12,13] rely on predefined response templates and can hardly adapt to unseen queries. On the other hand, generation-based methods, including sequence-to-sequence (Seq2Seq) models and their variants [14,15,16], tend to generate generic or irrelevant responses that lack diversity and informativeness. In addition, recurrent Seq2Seq models cannot parallelize their computation across time steps during training. This motivated studies on applying the transformer, which models sequences with the self-attention mechanism. Self-attention computes the similarity between all positions of the encoded input sequence at once, which enables parallel computation and significantly boosts the model’s training efficiency [17,18,19].

In recent years, transformer architecture [20] has revolutionized the field of natural language processing (NLP), achieving advanced performance in tasks such as machine translation, text summarization, and dialogue generation [21,22]. A transformer shows its high flexibility in the development of multi-round generative dialogue systems and is worth further application. In daily communication, the information contained in the dialogue is only a part of all of the transmitted information, and other information comes from a series of situational information points, such as the identity of the dialogue character, the weather and location at the time, and some common sense and rules. The success of the transformer can be attributed to its self-attention mechanism, which enables the model to capture extended long-range dependencies and learn more comprehensive representations of the input text. In the context of dialogue generation, numerous studies have investigated the use of Transformer-based models, demonstrating their superiority over traditional RNN-based models [23,24]. Moreover, pre-trained language models like BERT [25] and GPT [26] have demonstrated remarkable success across a broad spectrum of NLP tasks. These models are trained on large-scale unlabeled text data and can learn rich linguistic knowledge and contextual representations. By tailoring pre-trained models for specific downstream tasks, significant improvements have been achieved in various applications, including dialogue generation [27,28].

Applying these models to Chinese nonetheless raises specific difficulties. Zhou et al. [29] used a monolingual language model to build a subword vocabulary that includes Chinese characters and Chinese words and then created an open-domain Chinese dialogue system called EVA based on the transformer architecture. Li et al. [30], focusing on the Chinese online medical treatment scenario, proposed a multi-granularity transformer (MGT) model based on word-level segmentation and, combining it with the transformer model, introduced a generation model for extracting medical terms and their states from Chinese medical conversations. Unlike English, where words are naturally divided by spaces, Chinese text is written as an unbroken sequence of characters with no obvious word boundaries. This poses a significant challenge for Chinese natural language processing tasks, including dialogue generation, as it requires an additional step of Chinese word segmentation (CWS) [31] to identify and split the text into meaningful word units. Thus, in addition to the model itself, both of the above studies also address the design of the word segmentation scheme.

The choice of word segmentation and embedding methods is crucial for training Chinese language models. Due to the limitations of current CWS techniques, most Chinese language models resort to character-level encoding, treating single characters as the basic unit for embedding and computation. While characters are the smallest formal unit in Chinese, from a lexical and grammatical perspective, it is the words and phrases that carry complete semantic meanings. Individual characters, on the other hand, often have broad or ambiguous meanings, and their specific sense can only be determined when combined into words or phrases. Moreover, the vast majority of Chinese words consist of multiple characters. Therefore, from an end-to-end perspective, it is more desirable to disambiguate the input sentences before feeding them into the model, rather than processing the sentences directly as sequences of characters.

Word segmentation is pivotal in Chinese natural language processing. Considerable research has been dedicated to developing CWS methods, both in the earlier era of statistical language modeling and in the current era of neural network-based approaches. Chinese segmentation mainly includes dictionary-based segmentation methods, statistic-based segmentation methods, and understanding-based segmentation methods [32]. Statistic-based segmentation methods utilize a series of statistical models, including the N-gram model [33], the hidden Markov model (HMM) [34], the maximum entropy (ME) model, and the conditional random field (CRF) [35]. These models use statistical patterns to predict word boundaries, providing varying degrees of context understanding. Understanding-based segmentation methods are mainly based on a deep understanding of the semantics or context of the text, rather than relying solely on dictionaries or statistical models. However, both of these methods have high computational complexity, require a large amount of labeled data for training, and adapt poorly to non-standard or colloquial texts.

Therefore, dictionary-based word segmentation is more common in practical applications, and the key components of this method include a word segmentation dictionary, a scanning order of the input text, and a matching principle [36]. This approach is relatively simple to implement and does not require complex model training or large amounts of annotated data. Even so, character-based models remain prevalent in practical applications, for several reasons. In particular, character-based methods are simple and effective, while word-based Chinese language models suffer from some fundamental limitations [37]. For instance, word-based models are more susceptible to data sparsity and out-of-vocabulary (OOV) issues, leading to overfitting. When building a vocabulary for NLP tasks, whether predefined, corpus-specific, or custom-made, there will inevitably be low-frequency words that are absent from the training data but appear during testing. This data sparsity problem limits the learning capacity of the model, especially for domain-specific terms [38]. An illustrative example can be drawn from the Chinese Treebank (CTB) dataset [39], a widely used benchmark for Chinese NLP tasks. After applying Jieba, a popular open-source CWS tool, to the CTB dataset, a segmented corpus with over 600,000 word instances and a vocabulary size of more than 50,000 is obtained. Among the words in the vocabulary, 24,881 (49.5%) appear only once, accounting for merely 4% of the total word count in the dataset. If we consider words with a frequency of less than or equal to four, the number reaches 39,006 (77.6%), constituting just 10% of the entire dataset. Although the resulting vocabulary is extensive, the majority of words occur rarely, leading to a sparse representation. Increasing the vocabulary size introduces more parameters, but because of data sparsity it may cause overfitting on rare words. The presence of OOV words further hinders the language model’s learning ability.

To address the limitations of existing approaches and enhance the performance of Chinese dialogue generation, we propose an improved method that integrates pre-trained word-embedding matrices with character-level word vectors. Our approach builds upon the transformer-based dialogue generation model and introduces two key innovations.

First, we incorporate the rich semantic information learned from a large-scale pre-trained Chinese BERT model (NEZHA) [40] into the character-level word embeddings. This integration effectively alleviates the data sparsity issue common in Chinese word segmentation while preserving the semantic richness of pre-trained word representations. By maintaining the flexibility of character-level processing, our method reduces the dependency on large-scale annotated data and enables better handling of out-of-vocabulary words, which are particularly challenging in Chinese dialogue systems.

Second, we employ a sparse Softmax optimization technique to enhance the model’s generation capabilities. This technical improvement provides more diverse and informative response generation while reducing computational complexity during inference. The introduction of sparsity regularization not only enhances model robustness but also improves the handling of long-tail vocabulary items, which is crucial for maintaining the naturalness of the generated dialogue.

We conduct extensive experiments on a large-scale Chinese short text conversation (STC) [41] dataset to assess the effectiveness of our proposed method. The results demonstrate that our approach significantly outperforms state-of-the-art methods in both automatic evaluation metrics and human assessments, with BLEU and Distinct scores showing an average increase of 3.5 percentage points. The experimental results reveal enhanced semantic coherence in the generated responses and a better preservation of context-relevant information, leading to more natural and human-like dialogue generation.

Through a detailed analysis of the model’s performance, we provide insights into both the strengths and limitations of our approach, contributing to the advancement of Chinese natural language processing by addressing fundamental challenges in word segmentation and semantic representation. Our work offers promising directions for developing more engaging and human-centered Chinese dialogue systems, with potential applications across various domains of human–machine interaction.

2. Methods

2.1. Transformer-Based Dialogue System

In this dialogue system research, we build upon the transformer-based Chinese generative dialogue system proposed by Zheng et al. [42]. The system employs a deep architecture with 12 layers of transformer decoders, using individual characters as the basic tokenization unit and equipped with 12 multi-head attention mechanisms, a vocabulary size of 13,088, and character vector dimensions of 384. Its core innovation lies in achieving a balance between bidirectional perception and unidirectional generation through an incomplete masking mechanism. Specifically, during dialogue generation, the question portion can perceive contextual information bidirectionally, while the response portion can only generate autoregressively in one direction. Additionally, the system introduces relative position encoding in self-attention calculations to replace traditional absolute position encoding, effectively enhancing the model’s ability to handle long-distance dependencies.
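The incomplete masking idea can be made concrete with a small attention mask. The following is a minimal sketch under stated assumptions, not the implementation of [42]: the function name `incomplete_mask` and the 0/1 matrix convention are illustrative choices, and the layout follows the common pattern in which every position may attend to the whole question while response positions attend only to the response tokens produced so far.

```python
from typing import List


def incomplete_mask(num_question: int, num_response: int) -> List[List[int]]:
    """Build an n x n matrix where mask[i][j] == 1 means position i may attend to j."""
    n = num_question + num_response
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_question:
                # The question is visible to every position (bidirectional part).
                mask[i][j] = 1
            elif i >= num_question and j <= i:
                # A response position sees itself and earlier response positions only.
                mask[i][j] = 1
    return mask


# Toy example: 3 question tokens followed by 2 response tokens.
mask = incomplete_mask(num_question=3, num_response=2)
for row in mask:
    print(row)
```

In this toy case the three question rows attend to the question only, while the two response rows form a lower-triangular block, which is what allows autoregressive generation of the reply.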

The multi-turn dialogue system in this study is designed as an end-to-end structure based on the Transformer framework. The input natural language sentences are first mapped into a continuous vector space through the encoder, and then, the decoder gradually generates the response sequences. Figure 1 illustrates the basic methods and the process applied in our research.

During training, the model processes context information as the input, organized into multiple segments, with each segment corresponding to a turn in the dialogue. The model is trained through an auto-regressive approach, where it predicts the next word based on the words it has already generated. This generation process continues until an end token is reached.

2.2. Embedding of Word-Phrase Fusion

In Chinese word segmentation techniques, accurately identifying word boundaries remains a challenging task. Incorrectly segmented words can lead to errors in downstream natural language processing tasks, including dialogue generation.

To tackle this problem, we introduced a new method called word–phrase fusion embedding. This method combines the strengths of both character-level and word-level representations to improve the model’s language understanding capabilities. By fusing the pre-trained word embeddings obtained from word-segmented text with the character embeddings learned during training, our approach not only preserves the semantic information brought by the word-level model but also retains the diversity and flexibility of character-level representations.

The detailed process is as follows.

The input sequence is transformed from a one-hot encoding into a character vector compatible with the transformer model by initializing it with a random character-embedding matrix, $W_{char}$. After feeding the one-hot encoding into the character-embedding network, we obtain $N \times D_{char}$ character embeddings $X_{CE}$, as shown in Equation (1):

$X_{CE} = X W_{char}$

(1)

where $D_{char}$ represents the embedding dimension of the character vectors. Meanwhile, $W_{char}$ is the character-embedding matrix with dimensions $V \times D_{char}$, with $V$ being the size of the character vocabulary.

We obtain the pre-trained word-embedding matrix $W_{word}$ based on a pre-trained language model. The size of this matrix is $W \times D_{word}$, where $W$ is the size of the vocabulary in the pre-trained model and $D_{word}$ is the embedding dimension of the word vectors. The word embeddings are obtained through the word-embedding matrix, as shown in Equation (2).

$X_{WE} = X W_{word}$

(2)

The word embeddings $X_{WE}$ are then transformed through a transformation matrix $A$ to match the dimension of the character embeddings, as shown in Equation (3):

$X_{WE}^{'} = X_{WE} \times A$

(3)

where the dimension of $A$ is $D_{word} \times D_{char}$. After aligning the number of word embeddings with the number of character vectors, they are summed to obtain the final character–word vectors $X_{i,E}$, which is shown in Equation (4).

$X_{i,E} = X_{WE}^{'} + X_{i,CE}, \quad i = 1, \dots, \mathrm{Count}(X_{WE})$

(4)

The $\mathrm{Count}$ function represents the number of characters in the word vectors, indicating that the obtained character vectors are added to the transformed word vectors. The overall procedure is illustrated in Figure 2.

Taking Figure 3 as an example, assume that the input text is Y = (篮,网,总,冠,军). Preprocessing first divides the text into the characters “篮”, “网”, “总”, “冠”, and “军”, each represented by a one-hot encoding vector. These vectors are multiplied by the character-embedding matrix to give a matrix of character vectors. The text Y is then segmented into the two words “篮网” and “总冠军”, and the word vectors for these two words are extracted through the pre-trained weight matrix. Since the number of character vectors and word vectors differs, an alignment operation is performed: the word vector for “篮网” is repeated twice, and the word vector for “总冠军” is repeated three times. The resulting matrix is then mapped to the same dimension as the character vectors by the transformation matrix.
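The fusion steps of Equations (1) to (4) can be sketched for this example in a few lines of NumPy. This is only an illustration under stated assumptions: the toy vocabularies, the dimensions `D_CHAR` and `D_WORD`, the random matrices, and the helper name `fuse` are all hypothetical, the segmentation is supplied by hand instead of by a segmentation tool, and row lookup is used as the equivalent of multiplying a one-hot matrix by an embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies (hypothetical; real vocabularies are far larger).
char_vocab = {"篮": 0, "网": 1, "总": 2, "冠": 3, "军": 4}
word_vocab = {"篮网": 0, "总冠军": 1}
D_CHAR, D_WORD = 8, 12  # illustrative dimensions only

W_char = rng.normal(size=(len(char_vocab), D_CHAR))  # trainable, Eq. (1)
W_word = rng.normal(size=(len(word_vocab), D_WORD))  # stands in for frozen pre-trained vectors, Eq. (2)
A = rng.normal(size=(D_WORD, D_CHAR))                # trainable projection, Eq. (3)


def fuse(words):
    """Return one fused vector per character of a pre-segmented sentence."""
    chars = [c for w in words for c in w]
    x_ce = W_char[[char_vocab[c] for c in chars]]   # (N, D_CHAR) character embeddings
    x_we = W_word[[word_vocab[w] for w in words]]   # (num_words, D_WORD) word embeddings
    # Alignment: repeat each word vector once per character it covers.
    x_we_aligned = np.repeat(x_we, [len(w) for w in words], axis=0)  # (N, D_WORD)
    return x_we_aligned @ A + x_ce                  # project, then sum as in Eq. (4)


X_E = fuse(["篮网", "总冠军"])
print(X_E.shape)  # (5, 8)
```

Characters that belong to the same word share the same projected word vector, so after subtracting the character embeddings the rows for “篮” and “网” are identical, as are those for “总”, “冠”, and “军”.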

During the training process, the parameter matrix $W_{word}$ of the pre-trained language model’s word vectors is kept fixed and does not participate in the training. Instead, only the transformation matrix $A$ and the weight matrix $W_{char}$ for the character vectors are iteratively updated. This approach preserves the semantic information from the pre-trained word vector model based on tokenization while also maintaining the diversity inherent in training character vectors.

This study uses the NEZHA pre-trained model, which incorporates word segmentation characteristics [40]. NEZHA serves as the pre-trained language model and provides the pre-trained word vector weight matrix. The main reason for choosing NEZHA is that it first uses a word segmentation tool to obtain Chinese vocabulary information and then learns word boundary information through whole word masking.

2.3. Sparse Softmax

To better integrate word vectors from pre-trained language models with character vectors for dialogue tasks, this section introduces the sparse Softmax mechanism [43]. Sparsifying the Softmax operation enhances its interpretability and performance. This study presents a simplified version of this mechanism.

The original and sparse versions of the Softmax calculation are given in Equations (5) and (6), respectively.

$p_{i} = \frac{e^{x_{i}}}{\sum_{j = 1}^{n} e^{x_{j}}}$

(5)

$p_{i} = \begin{cases} \frac{e^{x_{i}}}{\sum_{j \in \Omega_{k}} e^{x_{j}}}, & i \in \Omega_{k} \\ 0, & i \notin \Omega_{k} \end{cases}$

(6)

The original and sparse versions of the cross-entropy calculation are given by Equations (7) and (8), respectively.

$\log \left( \sum_{j = 1}^{n} e^{x_{j}} \right) - x_{t}$

(7)

$\log \left( \sum_{j \in \Omega_{k}} e^{x_{j}} \right) - x_{t}$

(8)

Here, $\Omega_{k}$ represents the set of indices corresponding to the top $k$ elements in the sequence $(x_{1}, x_{2}, \dots, x_{n})$ sorted in descending order; $t$ denotes the target class.

The reason for designing the sparse Softmax mechanism is to avoid the issue of overfitting. By only retaining the top $k$ results and setting the others to zero (with $k$ being a manually selected hyperparameter, set to 10 in this study), the cross-entropy calculation is modified to operate only on the top $k$ categories.
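Equations (6) and (8) translate almost directly into code. The sketch below is a plain-Python illustration and not the authors' implementation: the function names and the toy logits are assumptions, and the loss follows Equation (8) literally, taking the log-sum-exp over the top $k$ logits and subtracting the target logit.

```python
import math


def sparse_softmax(logits, k):
    """Eq. (6): softmax over the k largest logits; every other entry is 0."""
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top_k)  # subtract the max for numerical stability
    z = sum(math.exp(logits[i] - m) for i in top_k)
    probs = [0.0] * len(logits)
    for i in top_k:
        probs[i] = math.exp(logits[i] - m) / z
    return probs


def sparse_cross_entropy(logits, target, k):
    """Eq. (8): log-sum-exp over the top-k logits minus the target logit."""
    top = sorted(logits, reverse=True)[:k]
    m = top[0]
    return m + math.log(sum(math.exp(x - m) for x in top)) - logits[target]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy scores for a 5-class problem
p = sparse_softmax(logits, k=2)
loss = sparse_cross_entropy(logits, target=0, k=2)
full_loss = sparse_cross_entropy(logits, target=0, k=len(logits))  # k = n recovers Eq. (7)
print(p, loss, full_loss)
```

With $k$ equal to the number of classes the two functions reduce to the ordinary Softmax and cross-entropy, so the sparse loss is never larger than the dense one for the same logits.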

The reason why the sparse Softmax helps mitigate overfitting is as follows. Let $s_{i}$ denote the logits (written $x_{i}$ above), and let $s_{max}$ and $s_{min}$ be the largest and smallest logit. If the classification is successful, $s_{max} = s_{t}$, indicating that the target class has the highest score. At this point, the original cross-entropy satisfies the inequality in Equation (9).

$\log \left( \sum_{i = 1}^{n} e^{s_{i}} \right) - s_{max} = \log \left( 1 + \sum_{i \neq t} e^{s_{i} - s_{max}} \right) \geq \log \left( 1 + (n - 1) e^{s_{min} - s_{max}} \right)$

(9)

Assuming the current cross-entropy value is $\varepsilon$, rearranging Equation (9) gives Equation (10):

$s_{max} - s_{min} \geq \log (n - 1) - \log (e^{\varepsilon} - 1)$

(10)

When $n$ is large, such a significant margin is unnecessary for classification tasks. Therefore, this study expects that the logit of the target class should be slightly larger than all non-target classes without exceeding $\log (n - 1)$ significantly. Conventional cross-entropy often leads to overfitting due to excessive learning, while the truncated operation in sparse cross-entropy avoids this problem.
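A quick numeric check shows how large the margin in Equation (10) becomes for a realistic vocabulary. The sketch below is illustrative only: the loss value $\varepsilon = \ln 2$ is an assumed example chosen so that the second term vanishes, $n = 13{,}088$ is the character vocabulary size from Section 2.1, and applying the same bound with $n$ replaced by $k = 10$ is our reading of how truncation to the top $k$ logits affects the bound, not a statement from the original derivation.

```python
import math


def required_margin(n, eps):
    """Lower bound on s_max - s_min implied by Eq. (10) for n classes and loss eps."""
    return math.log(n - 1) - math.log(math.exp(eps) - 1)


eps = math.log(2)  # assumed loss value, used only for illustration
full_margin = required_margin(13088, eps)  # dense Softmax over the whole vocabulary
sparse_margin = required_margin(10, eps)   # same bound restricted to the top k = 10 logits
print(round(full_margin, 2), round(sparse_margin, 2))
```

Under these assumptions the dense loss requires a gap of roughly 9.5 between the largest and smallest logit, while the truncated version requires only about 2.2 within the top ten, which is consistent with the argument that the sparse loss stops pushing logits apart once the target is clearly ahead.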

3. Experiments

3.1. Dataset

This study conducts experiments using a large-scale Chinese short text conversation (STC) dataset [41]. The STC corpus is designed for predicting subsequent replies within a multi-turn dialogue context. It consists of 4.4 million Chinese dialogues. Following the collection of a substantial amount of dialogue text, the raw data underwent a cleaning process that involved removing trivial and meaningless words, filtering out potential advertising content, and ensuring that each post had an average of 20 distinct replies, each with semantic variations. Details of the STC dataset are provided in Table 1.

This study primarily utilized the test dataset presented in Table 1, characterized by replies that have sparse role-specific content and lack the extensive manual annotations typical of the labeled dataset. In contrast to the labeled dataset, where annotators deliberately steer conversations toward role-relevant information, the test dataset more accurately reflects real-world dialogue scenarios in which role-related content may not always be prominent or easily identifiable. For example, while a labeled dataset may contain explicit indicators of roles, such as “customer” or “agent”, the test dataset includes more spontaneous exchanges that can occur in natural conversations. As a result, the test dataset provides a more realistic evaluation of dialogue models in typical human–machine interactions, highlighting their ability to understand and respond appropriately in varied contexts.

3.2. Experimental Setting

The experiments use TensorFlow 2.1.0, Keras 2.3.1, and the open-source BERT4Keras 0.9.8 framework, with GPU-accelerated training. Table 2 shows the experimental environment.

To evaluate the effectiveness of the proposed character–word fusion embedding in the generative dialogue system, we introduce the following comparative baseline models.

For the baseline, the model is built on BERT pre-training for dialogue tasks, using characters as the smallest granularity for training. It is the transformer-based generative dialogue system that follows the methodology outlined in [42], on which this study builds.

For char-word, the model is based on NEZHA pre-training for dialogue tasks, using words as the smallest granularity for training. In this dialogue system, the character–word fusion embedding is applied, and both the Softmax and cross-entropy computations use the sparse mechanism.

Table 3 summarizes the configurations and specific parameters of the baseline and the proposed char-word models used to validate the effectiveness of the character–word fusion embedding in the generative dialogue system.

For the baseline model, the default Softmax and cross-entropy functions are used, while for the char-word model, the sparse variants of these functions are implemented. Both models utilize the Adam optimization algorithm with the same learning rate and weight decay rate, ensuring a fair comparison.

3.3. Evaluation Standards

To assess whether the character–word fusion-embedding method improves dialogue system performance, we use several evaluation metrics. Specifically, the experiments employ the Rouge-L, Rouge-1, and Rouge-2 metrics to compare against baseline models. Additionally, cosine similarity methods, such as greedy matching and embedding average, are used to measure the direct similarity between generated and target sentences. These word vector-based metrics enable a thorough comparison with baseline models and are introduced in this section.

Rouge (recall-oriented understudy for gisting evaluation) [44] is a recall-based similarity measurement approach. Rouge-L is based on the longest common subsequence (LCS) between the generated text and the reference text, from which LCS-based precision, recall, and F-measure statistics are computed. Rouge-1 compares the number of overlapping unigrams (single words) between the generated text and the reference text, based on 1-gram co-occurrence statistics. Rouge-2 compares the number of 2-gram overlaps between the generated text and the reference text, based on 2-gram co-occurrence statistics.
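The three Rouge variants can be sketched compactly in plain Python. This is a simplified illustration, not the evaluation script used in the experiments: tokenizing by character, using the recall form for Rouge-N, weighting precision and recall equally in Rouge-L (`beta=1.0`), and the two example sentences are all assumptions made for the sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, candidate, n):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    ref, cand = Counter(ngrams(reference, n)), Counter(ngrams(candidate, n))
    total = sum(ref.values())
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f(reference, candidate, beta=1.0):
    """F-measure built from LCS-based recall and precision."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(reference), lcs / len(candidate)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


reference = list("篮网夺得总冠军")  # hypothetical reference reply, split into characters
candidate = list("篮网拿到总冠军")  # hypothetical generated reply
r1 = rouge_n_recall(reference, candidate, 1)
r2 = rouge_n_recall(reference, candidate, 2)
rl = rouge_l_f(reference, candidate)
print(r1, r2, rl)
```

For this pair, five of the seven reference characters and three of the six reference bigrams are matched, and the longest common subsequence has length five.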

Greedy matching [45] matches words by computing cosine similarity between individual words in the reference and generated sentences. These similarity scores are averaged to measure the overall alignment between the generated sentence and the target sentence. The relevant calculations are provided in Equations (11) and (12).

$G (r, \hat{r}) = \frac{\sum_{w \in r} \max_{\hat{w} \in \hat{r}} \operatorname{cos\_sim} (e_{w}, e_{\hat{w}})}{| r |}$

(11)

$GM (r, \hat{r}) = \frac{G (r, \hat{r}) + G (\hat{r}, r)}{2}$

(12)

The similarity calculations in the above formula are all based on word vectors. This method mainly focuses on the most similar words between two sentences.
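Greedy matching as defined in Equations (11) and (12) can be sketched as follows. The two-dimensional word vectors in `emb` are hypothetical values chosen to make the arithmetic easy to follow, and the function names are illustrative; in practice the vectors would come from a trained embedding model.

```python
import math


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_one_way(src, dst, emb):
    """Eq. (11): average over src tokens of the best cosine match found in dst."""
    return sum(max(cos_sim(emb[w], emb[v]) for v in dst) for w in src) / len(src)


def greedy_matching(ref, hyp, emb):
    """Eq. (12): symmetric average of the two one-way scores."""
    return (greedy_one_way(ref, hyp, emb) + greedy_one_way(hyp, ref, emb)) / 2


# Hypothetical 2-d word vectors, for illustration only.
emb = {"好": [1.0, 0.0], "棒": [0.8, 0.6], "天气": [0.0, 1.0]}
ref, hyp = ["天气", "好"], ["天气", "棒"]
score = greedy_matching(ref, hyp, emb)
print(score)
```

Here “天气” matches itself with similarity 1 and “好” is best matched by “棒” with similarity 0.8, so each direction scores 0.9 and the symmetric score is 0.9 as well.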

The embedding average [46] calculates the average of the word vectors for both the generated and reference sentences, followed by computing the cosine similarity between these averaged vectors. The relevant calculation is presented in Equation (13).

$\bar{e} = \frac{\sum_{w \in r} e_{w}}{\left\| \sum_{w^{'} \in r} e_{w^{'}} \right\|}$

(13)

The embedding average method involves calculating the mean of the word vectors for all words in a sentence $r$, denoted as $\bar{e}$. Equation (13) scales the summed vectors to unit length, which leaves the cosine similarity unchanged. This is conducted separately for the generated sentence and the reference answer sentence. Then, the cosine similarity between these averaged sentence vectors is computed to compare the two sentences.
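A minimal sketch of the embedding average metric follows. Because cosine similarity ignores vector length, taking the mean, the sum, or the length-normalized sum of the word vectors gives the same score; the sketch uses the length-normalized sum. The word vectors in `emb` and the function names are hypothetical and serve only to illustrate the computation.

```python
import math


def sentence_vector(tokens, emb):
    """Sum the word vectors of a sentence and scale the result to unit length."""
    dim = len(next(iter(emb.values())))
    total = [sum(emb[w][d] for w in tokens) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total


def embedding_average(ref, hyp, emb):
    """Cosine similarity of the two sentence vectors (a dot product of unit vectors)."""
    u, v = sentence_vector(ref, emb), sentence_vector(hyp, emb)
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 2-d word vectors, for illustration only.
emb = {"好": [1.0, 0.0], "棒": [0.8, 0.6], "天气": [0.0, 1.0]}
ref, hyp = ["天气", "好"], ["天气", "棒"]
score = embedding_average(ref, hyp, emb)
print(score)
```

Unlike greedy matching, this metric compares whole-sentence vectors, so word order and one-to-one word alignment play no role in the score.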

By employing these evaluation metrics, the experiments aim to determine if the character–word fusion-embedding method improves the dialogue system’s performance compared to the baseline models.

4. Results

4.1. Evaluation Index Results

The results for each evaluation metric, as presented in Table 4, were obtained through the experiment.

In Table 4, the optimized char-word model outperforms the baseline model across recall, greedy matching, and embedding average metrics under various N-gram conditions, indicating a significant enhancement in its text-processing capabilities. Specifically, the high recall rate reflects the model’s effectiveness in identifying and extracting key information, allowing it to capture more relevant contextual details. The improvement in greedy matching demonstrates the model’s efficiency in quick matching, making it suitable for real-time processing applications. Furthermore, the increase in the embedding average indicates progress in semantic understanding, as the model can better grasp the deeper meanings of a text. Collectively, these results illustrate that the optimized char-word model exhibits greater flexibility and accuracy in tasks involving information extraction, matching, and comprehension, highlighting its broad application potential.

Figure 4 presents the comparison results of the baseline model and the word–character fusion-embedding method without sparse Softmax and cross-entropy. Figure 5 shows the comparison results for different values of the parameter $k$ in the char-word model using the word–character fusion-embedding method. Additionally, incorporating sparse Softmax and cross-entropy calculations results in approximately a 2% increase in score. It is evident that selecting $k$ values of 10 and 50 yields better results than those obtained without applying the method.

The experimental results clearly show that the char-word model, which employs the proposed character–word fusion-embedding method, consistently outperforms the baseline model across various metrics. Additionally, the inclusion of sparse Softmax and cross-entropy calculations improves the model’s performance by approximately 2%. The experiments indicate that using $k$ values of 10 and 50 both yield better results than not using sparse Softmax, with negligible differences between them. Given that $k = 10$ requires less computation, it will be used in subsequent experiments.

4.2. Self-Attention Visualization Analysis

To further validate the effectiveness of the character–word fusion embedding, a visualization analysis of the self-attention similarity-matching process was conducted. For ease of verification, this section presents a portion of the attention-matching heatmap using the character–word fusion-embedding method, as shown in Figure 6.

Figure 6a illustrates the attention similarity matching results using the character–word fusion-embedding method for the phrases “利息费用 (interest money)” and “利息 (interest).” Although there is some similarity between the two, the attention similarity matching is not clearly defined. The attention to “费用 (money/fee)” is confused, leading to a less distinct mapping. Figure 6b shows the results when using characters as the smallest granularity along with additional word-embedding integration. The attention matching is significantly improved. Identical characters in both sentences are more easily mapped to each other, and the properties of words are better captured.

By analyzing these heatmaps, it is evident that the proposed character–word fusion embedding enhances the model’s ability to accurately capture and represent the semantics of sentences, thereby confirming its effectiveness.
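The kind of matrix plotted in such a heatmap can be sketched as follows. This is a minimal illustration of scaled dot-product attention weights between two character sequences, assuming identity query/key projections and hand-picked toy embeddings; the `TOY` table and function names are hypothetical, and the trained model's learned projections and embeddings are not reproduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights: row i is how query i attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    scores = [[sum(q * k for q, k in zip(qv, kv)) / scale for kv in keys] for qv in queries]
    return [softmax(row) for row in scores]


# Hypothetical toy embeddings, one per character (not taken from the trained model).
TOY = {
    "利": [4.0, 0.0, 0.0, 0.0],
    "息": [0.0, 4.0, 0.0, 0.0],
    "费": [0.0, 0.0, 4.0, 0.0],
    "用": [0.0, 0.0, 0.0, 4.0],
}


def heatmap(query_text, key_text):
    """Attention weights of each character in query_text over the characters of key_text."""
    return attention_map([TOY[c] for c in query_text], [TOY[c] for c in key_text])


if __name__ == "__main__":
    for ch, row in zip("利息", heatmap("利息", "利息费用")):
        print(ch, [round(w, 3) for w in row])
```

With these toy vectors each character attends almost entirely to its identical counterpart, which is the clean mapping the text attributes to Figure 6b; less separable embeddings would spread the weights out, as described for Figure 6a.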

5. Discussion

The word–phrase fusion-embedding method proposed in this study demonstrates a marked improvement over the baseline model across the evaluation metrics. This improvement can be attributed to the effective combination of pre-trained word embeddings and character-level information, which allows the model to capture the rich semantics of words as well as the flexibility of character representations.
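As a rough illustration of this combination, the sketch below builds one fused vector per character by concatenating the character's embedding with the embedding of the word that contains it, following the concatenation described in the abstract. The lookup tables, dimensions, and the `fuse` function are hypothetical toy stand-ins; how the authors segment text, handle out-of-vocabulary words, or map the fused vector back to the model's hidden size is not shown in this section, so none of that is reproduced.

```python
# Hypothetical toy lookup tables; real tables would come from the pre-trained model.
CHAR_TABLE = {"利": [0.1, 0.2], "息": [0.3, 0.4], "费": [0.5, 0.6], "用": [0.7, 0.8]}
WORD_TABLE = {"利息": [1.0, 0.0], "费用": [0.0, 1.0]}
WORD_DIM = 2


def fuse(words):
    """Return one fused vector per character: [character embedding ; embedding of its word].

    `words` is an already segmented sentence. Words missing from WORD_TABLE fall back to a
    zero vector, which is an assumption made only for this sketch.
    """
    fused = []
    for word in words:
        word_vec = WORD_TABLE.get(word, [0.0] * WORD_DIM)
        for ch in word:
            fused.append(CHAR_TABLE[ch] + word_vec)  # list concatenation
    return fused


if __name__ == "__main__":
    for ch, vec in zip("利息费用", fuse(["利息", "费用"])):
        print(ch, vec)
```

Characters that belong to the same word share the word half of their vectors, so the sequence stays at character granularity while still carrying word-level information.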

The introduction of sparse Softmax and cross-entropy calculations further enhanced the model’s performance by approximately 2%, suggesting that sparse regularization can effectively mitigate overfitting issues and promote more diverse and informative response generation in specific scenarios. Notably, sparse Softmax is primarily suitable for the fine-tuning phase of pre-trained models. In this context, where the model has already acquired substantial knowledge through large-scale pre-training, sparse Softmax can prevent overfitting to the fine-tuning data. However, when training a model from scratch, using sparse Softmax might lead to performance degradation, as only $k$ categories are learned each time, potentially causing underfitting. This observation highlights the importance of selecting appropriate optimization methods for different training strategies.
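The point that only $k$ categories take part in each update can be made concrete with a small sketch. The code below implements one common top-$k$ formulation of sparse Softmax cross-entropy, in which the normalizer runs over the $k$ largest logits plus the target class; this is an assumption for illustration, and the exact formulation the authors use (Section 2.3) may differ in its details. Function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def dense_cross_entropy(logits, target):
    """Standard Softmax cross-entropy: every class contributes to the normalizer."""
    return logsumexp(logits) - logits[target]


def sparse_cross_entropy(logits, target, k):
    """Cross-entropy normalized over the k largest logits plus the target class.

    Classes outside this set receive zero probability and therefore no gradient.
    """
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = set(top_k) | {target}
    return logsumexp([logits[i] for i in kept]) - logits[target]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print("dense :", round(dense_cross_entropy(logits, 1), 4))
    print("k = 2 :", round(sparse_cross_entropy(logits, 1, 2), 4))
    print("k = 5 :", round(sparse_cross_entropy(logits, 1, 5), 4))
```

When $k$ is at least the vocabulary size, the sparse loss reduces to the dense one, which matches the $k = \infty$ rows of Table 4. For smaller $k$ the low-scoring classes drop out of the loss, which is harmless once a pre-trained model already ranks them low but leaves them unlearned when training from scratch.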

In terms of computational efficiency, choosing $k = 10$ for sparse Softmax strikes a good balance between performance improvement and computational cost. However, further research is needed to determine the optimal $k$ values and their applicability to different types of dialogue tasks and dataset sizes.

Attention-matching heatmap analysis reveals that our method substantially enhances the model’s ability to capture semantic relationships between words and phrases, contributing to the generation of more coherent and contextually relevant responses. Even so, challenges remain in handling rare words and domain-specific terminology. Future research could focus on strategies to bolster the model’s performance with low-frequency vocabulary and explore domain adaptation techniques for specialized dialogue contexts.

Furthermore, our proposed approach shows promising potential for practical applications in real-world scenarios. In customer service domains, where handling domain-specific terminology while maintaining a natural conversation flow is crucial, the word–phrase fusion-embedding method shows particular advantages. The improved semantic understanding enables more accurate interpretation of technical terms while preserving conversational fluency, as evidenced by the enhanced Rouge and embedding average scores in our experiments.
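For readers less familiar with the two metrics mentioned here, the sketch below shows a simplified version of each: a character-level Rouge-L computed as the F1 of recall and precision based on the longest common subsequence, and an embedding average score computed as the cosine similarity between the mean vectors of two sequences. The equal weighting of recall and precision, the character-level tokenization, and the function names are assumptions for illustration; the evaluation scripts used in the experiments are not shown in this section and may differ.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Character-level Rouge-L as the F1 of LCS-based recall and precision."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(reference), lcs / len(candidate)
    return 2 * recall * precision / (recall + precision)


def embedding_average(ref_vectors, cand_vectors):
    """Cosine similarity between the mean vectors of two sequences of embeddings."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    u, v = mean(ref_vectors), mean(cand_vectors)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(rouge_l_f1("利息费用", "利息"))
    print(embedding_average([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]))
```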

The method also shows promise in educational applications, particularly in intelligent tutoring systems. The enhanced semantic coherence and context preservation capabilities make it well-suited for maintaining consistent, pedagogically sound dialogues. The ability to handle both common vocabulary and specialized terms effectively is particularly valuable in educational contexts, where precise communication is essential.

In scenarios requiring real-time response generation, such as live customer support systems, our sparse Softmax optimization technique offers practical benefits through reduced computational complexity without sacrificing response quality. This balance between efficiency and performance makes the system particularly suitable for large-scale deployment in time-sensitive applications. Additionally, the improved handling of low-frequency vocabulary items suggests potential applications in specialized professional domains, where accurate processing of domain-specific terminology is critical. However, optimal performance in these applications may require domain-specific fine-tuning and careful consideration of the sparsity parameter $k$. Future work could explore adaptive methods for automatically adjusting these parameters based on specific application requirements and user feedback.

6. Conclusions

This study explores the impact and limitations of Chinese word segmentation on natural language processing tasks, emphasizing the significance of phrase-based segmentation for Chinese generative dialogue systems. To tackle the unique challenges presented by the characteristics of the Chinese language, this paper introduces a word–phrase fusion-embedding method. This approach effectively combines the strengths of pre-trained word embeddings and character-level representations, enhancing the model’s capacity to grasp semantics. The proposed sparse Softmax optimization technique improves the diversity and informativeness of the generated responses while mitigating overfitting in pre-trained model fine-tuning scenarios.

The experimental results on the STC dataset show significant enhancements over baseline models, with BLEU and Distinct scores increasing by an average of 3.5 percentage points. A visualization analysis further confirms the model’s enhanced ability to capture semantics, providing insights into the effectiveness of the word–phrase fusion-embedding method. These advancements contribute to the development of more intelligent and human-like Chinese dialogue systems.

Further research could explore more sophisticated fusion techniques for word and character embeddings, investigate methods to handle rare and domain-specific vocabulary, and extend this approach to other languages with similar linguistic challenges.

Author Contributions

Conceptualization, W.Z., L.Y. and Z.Y.; methodology, S.L. (Shenrong Lv), L.Y. and Z.Y.; software, J.T., S.L. (Shenrong Lv), R.W. and S.L. (Siyu Lu); validation, R.W. and S.L. (Shenrong Lv); formal analysis, J.T., R.W. and S.L. (Siyu Lu); resources, Z.Y. and S.A.A.; data curation, R.W. and S.L. (Shenrong Lv); writing—original draft preparation, J.T., L.Y., S.L. (Siyu Lu) and S.L. (Shenrong Lv); writing—review and editing, S.L. (Siyu Lu), L.Y. and W.Z.; visualization, S.L. (Shenrong Lv) and Z.Y.; supervision, W.Z. and S.A.A.; project administration, W.Z.; funding acquisition, W.Z. and S.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

Supported by the Sichuan Science and Technology Program (2023YFSY0026, 2023YFH0004).

Data Availability Statement

The dataset used in this study is publicly available from ‘A Dataset for Research on Short-Text Conversation’ [41]. We did not generate any new datasets as part of this research. The original dataset was presented in the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP). For access to this dataset, interested researchers should refer to the original publication or contact the authors of the original paper.

Acknowledgments

The authors are grateful for the support of Research Supporting Project Number (RSPD2025R585), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Song, T.; Chen, N.; Jiang, J.; Zhu, Z.; Zou, Y. Improving Retrieval-Based Dialogue System Via Syntax-Informed Attention. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023.
2. Jung, W.; Shim, K. Dual Supervision Framework for Relation Extraction with Distant Supervision and Human Annotation. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020.
3. Tao, C.; Feng, J.; Yan, R.; Wu, W.; Jiang, D. A Survey on Response Selection for Retrieval-based Dialogues. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021.
4. Hua, K.; Feng, Z.; Tao, C.; Yan, R.; Zhang, L. Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-Based Dialogue Systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020.
5. Lan, T.; Mao, X.-L.; Wei, W.; Gao, X.; Huang, H. PONE: A Novel Automatic Evaluation Metric for Open-domain Generative Dialogue Systems. ACM Trans. Inf. Syst. 2020, 39, 1–37.
6. Firdaus, M.; Thangavelu, N.; Ekbal, A.; Bhattacharyya, P. I Enjoy Writing and Playing, Do You?: A Personalized and Emotion Grounded Dialogue Agent Using Generative Adversarial Network. IEEE Trans. Affect. Comput. 2022, 14, 2127–2138.
7. Yao, L.; Zhang, Y.; Feng, Y.; Zhao, D.; Yan, R. Towards Implicit Content-Introducing for Generative Short-Text Conversation Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017.
8. Pai, K.-C.; Kuo, B.-C.; Liao, C.-H.; Liu, Y.-M. An application of Chinese dialogue-based intelligent tutoring system in remedial instruction for mathematics learning. Educ. Psychol. 2021, 41, 137–152.
9. Zhang, Z.; Takanobu, R.; Zhu, Q.; Huang, M.; Zhu, X. Recent advances and challenges in task-oriented dialog systems. Sci. China Technol. Sci. 2020, 63, 2011–2027.
10. Liu, X.; Wang, S.; Lu, S.; Yin, Z.; Li, X.; Yin, L.; Tian, J.; Zheng, W. Adapting Feature Selection Algorithms for the Classification of Chinese Texts. Systems 2023, 11, 483.
11. Jung, W.; Shim, K. T-REX: A Topic-Aware Relation Extraction Model. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020.
12. Ni, J.; Young, T.; Pandelea, V.; Xue, F.; Cambria, E. Recent advances in deep learning based dialogue systems: A systematic survey. Artif. Intell. Rev. 2023, 56, 3055–3155.
13. Liao, K.; Zhong, C.; Chen, W.; Liu, Q.; Peng, B.; Huang, X. Task-oriented dialogue system for automatic disease diagnosis via hierarchical reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
14. Han, Z.; Zhang, Z. Multi-turn Dialogue System Based on Improved Seq2Seq Model. In Proceedings of the 2020 International Conference on Communications, Information System and Computer Engineering (CISCE), Kuala Lumpur, Malaysia, 3–5 July 2020.
15. Ma, Z.; Du, B.; Shen, J.; Yang, R.; Wan, J. An encoding mechanism for seq2seq based multi-turn sentimental dialogue generation model. Procedia Comput. Sci. 2020, 174, 412–418.
16. He, Q.; Liu, W.; Cai, Z. B&Anet: Combining bidirectional LSTM and self-attention for end-to-end learning of task-oriented dialogue system. Speech Commun. 2020, 125, 15–23.
17. Yan, M.; Lou, X.; Chan, C.A.; Wang, Y.; Jiang, W. A semantic and emotion-based dual latent variable generation model for a dialogue system. CAAI Trans. Intell. Technol. 2023, 8, 319–330.
18. Shang, W.; Zhu, S.; Xiao, D. Research on human-computer dialogue based on improved Seq2seq model. In Proceedings of the 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall), Xi’an, China, 13–15 October 2021.
19. He, W.; Yang, M.; Yan, R.; Li, C.; Shen, Y.; Xu, R. Amalgamating knowledge from two teachers for task-oriented dialogue system with adversarial training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020.
20. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, Online, 16–20 November 2020.
21. Zandie, R.; Mahoor, M.H. Emptransfo: A multi-head transformer architecture for creating empathetic dialog systems. In Proceedings of the Thirty-Third International FLAIRS Conference (FLAIRS-33), North Miami Beach, FL, USA, 17–20 May 2020.
22. Zhao, Y.; Zhang, J.; Zong, C. Transformer: A general framework from machine translation to others. Mach. Intell. Res. 2023, 20, 514–538.
23. Zhao, X.; Wang, L.; He, R.; Yang, T.; Chang, J.; Wang, R. Multiple knowledge syncretic transformer for natural dialogue generation. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020.
24. Varshney, D.; Ekbal, A.; Nagaraja, G.P.; Tiwari, M.; Gopinath, A.A.M.; Bhattacharyya, P. Natural language generation using transformer network in an open-domain setting. In Proceedings of the Natural Language Processing and Information Systems: 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, 24–26 June 2020.
25. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019.
26. Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W. GPT (generative pre-trained transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access 2024, 12, 54608–54649.
27. Yang, Y.; Li, Y.; Quan, X. Ubar: Towards fully end-to-end task-oriented dialog system with gpt-2. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
28. Zhao, H.; Lu, J.; Cao, J. A short text conversation generation model combining BERT and context attention mechanism. Int. J. Comput. Sci. Eng. 2020, 23, 136–144.
29. Zhou, H.; Ke, P.; Zhang, Z.; Gu, Y.; Zheng, Y.; Zheng, C.; Tang, J. Eva: An open-domain chinese dialogue system with large-scale generative pre-training. arXiv 2021, arXiv:2108.01547.
30. Li, M.; Xiang, L.; Kang, X.; Zhao, Y.; Zhou, Y.; Zong, C. Medical term and status generation from chinese clinical dialogue with multi-granularity transformer. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3362–3374.
31. Lin, T.; Chonghui, G.; Jingfeng, C. Review of Chinese word segmentation studies. Data Anal. Knowl. Discov. 2020, 4, 1–17.
32. Du, G. Research advanced in Chinese word segmentation methods and challenges. Appl. Comput. Eng. 2024, 37, 16–22.
33. Novak, J.R.; Minematsu, N.; Hirose, K. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Nat. Lang. Eng. 2016, 22, 907–938.
34. Mor, B.; Garhwal, S.; Kumar, A. A systematic review of hidden Markov models and their applications. Arch. Comput. Methods Eng. 2021, 28, 1429–1448.
35. Yuan, H.; Ji, S. Structpool: Structured graph pooling via conditional random fields. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
36. Li, P.; Luo, A.; Liu, J.; Wang, Y.; Zhu, J.; Deng, Y.; Zhang, J. Bidirectional gated recurrent unit neural network for Chinese address element segmentation. ISPRS Int. J. Geo-Inf. 2020, 9, 635.
37. Cheng, J.; Liu, J.; Xu, X.; Xia, D.; Liu, L.; Sheng, V.S. A review of Chinese named entity recognition. KSII Trans. Internet Inf. Syst. 2021, 15, 2012–2029.
38. Choe, J.; Noh, K.; Kim, N.; Ahn, S.; Jung, W. Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023.
39. The Segmentation Guidelines for the Penn Chinese Treebank (3.0). Available online: https://hanlp.hankcs.com/docs/annotations/tok/ctb.html (accessed on 6 April 2024).
40. Wei, J.; Ren, X.; Li, X.; Huang, W.; Liao, Y.; Wang, Y.; Lin, J.; Jiang, X.; Chen, X.; Liu, Q. Nezha: Neural contextualized representation for chinese language understanding. arXiv 2019, arXiv:1909.00204.
41. Wang, H.; Lu, Z.; Li, H.; Chen, E. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013.
42. Zheng, W.; Gong, G.; Tian, J.; Lu, S.; Wang, R.; Yin, Z.; Li, X.; Yin, L. Design of a modified transformer architecture based on relative position coding. Int. J. Comput. Intell. Syst. 2023, 16, 168.
43. Laha, A.; Chemmengath, S.A.; Agrawal, P.; Khapra, M.; Sankaranarayanan, K.; Ramaswamy, H.G. On controllable sparse alternatives to softmax. In Proceedings of the Thirty-Second Annual Conference on Neural Information Processing Systems (NIPS), Montréal, QC, Canada, 2–8 December 2018.
44. Batra, P.; Chaudhary, S.; Bhatt, K.; Varshney, S.; Verma, S. A review: Abstractive text summarization techniques using NLP. In Proceedings of the 2020 International Conference on Advances in Computing, Communication & Materials (ICACCM), Dehradun, India, 21–22 August 2020.
45. Jangabylova, A.; Krassovitskiy, A.; Mussabayev, R.; Ualiyeva, I. Greedy Texts Similarity Mapping. Computation 2022, 10, 200.
46. Bayot, R.; Gonçalves, T. Multilingual author profiling using word embedding averages and SVMs. In Proceedings of the 2016 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA), Chengdu, China, 15–17 December 2016.

Figure 1. Transformer-based generative dialogue system.

Figure 2. Workflow of the word–phrase fusion embedding.

Figure 3. An example of the word–phrase fusion embedding.

Figure 4. Evaluation results. (a) Evaluation results based on recall rate; (b) evaluation results of greedy matching and embedding average.

Figure 5. Evaluation results for different values of the parameter k under the char-word dialogue model.

Figure 6. Attention-matching heatmap examples. (a) Attention-matching heatmap based on tokenization; (b) attention-matching heatmap based on character–word fusion embedding.

Table 1. STC dataset.

Dialogue Dataset Type	Dialogue Unit	Data Volume
Train	posts	219,905
	responses	4,308,211
	pairs	4,435,959
Test	test posts	110
Label	post	225
	responses	6017
	labeled pairs	6017
Fine-tuning (SMT-based)	posts	2925
	responses	3000
	pairs	3000

Table 2. Experimental environment.

Item	Specification
Processor	Intel(R) Core(TM) i7-9800X CPU @ 3.80 GHz
Memory	64 GB
Graphics card	NVIDIA GeForce RTX 2080 Ti
Operating system	Ubuntu 18.04.3 LTS
Development environment	PyCharm + Anaconda
Open-source framework	TensorFlow 2.1.0, Keras 2.3.1, and bert4keras 0.9.8

Table 3. Specific experimental parameters.

Configuration	Baseline	Char-Word
Pre-training Model	BERT	NEZHA
Vocabulary Size	13,088	13,088
Embedding Dimension	384	384
Maximum Text Length	256	256
Transformer layers	12	12
Softmax and Cross Entropy	Default	Sparsity
Optimization Algorithm	Adam	Adam
Learning Rate	$2 \times 10^{-5}$	$2 \times 10^{-5}$
Weight Decay Rate	0.01	0.01
Batch Size	16	16

Table 4. Experimental results of each evaluation index.

Model	k	Rouge-L	Rouge-1	Rouge-2	Greedy Matching	Embedding Average
Baseline	$\infty$	64.51	66.75	56.73	65.89	78.94
Char-word	$\infty$	65.38	67.52	57.83	66.06	83.38
Char-word	10	66.85	68.83	58.51	66.34	84.12
Char-word	50	66.65	68.81	58.67	66.24	83.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
