Abstract
GPT (Generative Pre-trained Transformer) is a generative language model that demonstrates outstanding performance in text generation. Generally, the attention mechanism of the transformer model behaves similarly to a copy distribution. However, because GPT lacks a dedicated encoder, it is difficult to ensure that the input is retained during generation. We propose a model that emphasizes the copy mechanism in GPT: we generate masks for the input words to initialize the copy distribution and explicitly encourage copying through training. To demonstrate the effectiveness of our approach, we conducted experiments on restoring ellipsis and anaphora in dialogue. In a single domain, we achieved 0.4319 (BLEU), 0.6408 (Rouge-L), 0.9040 (simCSE), and 0.9070 (BERTScore), while in multi-domain settings we obtained 0.4611 (BLEU), 0.6379 (Rouge-L), 0.8902 (simCSE), and 0.8999 (BERTScore). Additionally, we evaluated the copy mechanism on out-of-domain data, with excellent results. We anticipate that applying the copy mechanism to GPT will be useful for deploying language models in constrained situations.
1. Introduction
GPT (Generative Pre-trained Transformer) has achieved great success in the field of natural language processing. This model demonstrates exceptional performance in various text generation tasks, producing innovative results in several applications such as machine translation, text summarization, and question answering. The success of GPT has become a significant milestone showcasing the potential of large language models (LLMs).
However, in environments where LLMs cannot be used, sLLMs (small LLMs) with a GPT architecture often struggle to achieve the desired results with zero-shot learning alone []. We introduce a copying mechanism to GPT, a method that preserves keywords found in the input even when training data are limited. To the best of our knowledge, encoder–decoder transformer models like T5 [] have achieved excellent results in applications requiring specific contexts by using a copying mechanism, but the mechanism has not previously been applied to GPT.
We validated the capabilities of the copying mechanism for GPT through tasks involving ellipsis and anaphora resolution in conversation. Ellipsis and anaphora processing are among the challenging problems in natural language processing. In human conversations, as shown in Table 1, information is often omitted, or substitute expressions are used, based on already shared context, to avoid repeating previous dialogue content. In Table 1, User1 uses an anaphor (‘그걸 (it)’) at the end of the conversation. While humans can recover the omitted information through contextual clues, it is very difficult for machines to interpret such phenomena appropriately [].

Table 1.
The bold parts indicate the portions that need to be resolved. In the example below, standard GPT omits the word LLM that needs to be resolved.
Previous studies have addressed ellipsis and anaphora resolution in ways similar to coreference resolution [] or machine reading comprehension []. However, in a question-and-answer environment with a mixture of various domains, it is necessary to reference the content from the same domain as the sentences where ellipsis and anaphora occur, making simple coreference resolution insufficient. Particularly in spoken dialogue, two additional factors must be considered. The first is processing time. Among existing methods, the mention pair approach [] generates all possible candidates for classification, which requires a significant amount of time. The second is contextual consideration. It is essential to reference only the appropriate contextual content in dialogue to handle ellipsis and anaphora.
In this paper, we conducted experiments on ellipsis and anaphora resolution through prompt modification based on whether the history and current information share the same context in a conversational environment with context switching, as well as through a copying mechanism for GPT. We demonstrated that the copying mechanism in GPT shows significant performance improvements not only in the training domain but also in domains outside of training.
The contributions of this paper are as follows:
- We improved the performance of ellipsis and anaphora resolution by introducing a copying mechanism to the GPT structure.
- We proposed a method for prompt modification in conversational environments where context switching occurs.
- We demonstrated that the proposed method shows significant performance improvements not only in the training domain but also in domains outside of training (https://github.com/cwnu-airlab/CGM, accessed on 14 December 2024).
2. Related Work
2.1. Conversational Question Answering
Question answering (QA) systems provide information in response to questions over various sources, including, but not limited to, structured and unstructured data. They can be viewed as a form of conversational artificial intelligence, and, with advances in technology, a research topic has emerged focused on conversational question answering (conversational QA), which involves understanding multiple dialogue contexts to generate responses []. In conversational QA, accurate answers must be generated by retrieving or extracting information based not only on the current user query but also on the previous dialogue history []. A key task in this process is appropriately selecting the dialogue history that the QA model should reference to understand the question. Previous studies have adopted dynamic history selection methods based on attention mechanisms [] and reinforcement learning []. This paper demonstrates the effectiveness of a copying mechanism based on the GPT architecture for one of the issues that arise in conversational QA: ellipsis and anaphora resolution.
2.2. Ellipsis and Anaphora Resolution
Ellipsis and anaphora are phenomena that frequently occur in multi-turn conversations, but they are also considered challenging problems to resolve.
Ellipsis can be regarded as a type of anaphora; however, unlike anaphora, there is no indicator in the sentence marking the omitted parts, so the omitted meanings must be identified from prior information alone, as they take no special surface form []. Previous studies have adopted a structure that integrates a Seq2Seq architecture with a copying mechanism to address ellipsis resolution in conversation []. Existing Seq2Seq models often suffer from problems such as focusing excessively on certain pieces of information, repetitively generating the same words, or encountering Out-Of-Vocabulary (OOV) issues.
Anaphora refers to different expressions that denote the same entity; it is resolved using methods similar to coreference resolution, which resolves any phrases or clauses that refer to the same physical entity. Recently, research has addressed anaphora and coreference resolution using transformer-based models like BERT and SpanBERT [], which implicitly integrate common sense and contextual information through rich embeddings []. Such models are advantageous for establishing connections between substitute expressions, as they can understand the bidirectional context of a given sentence. However, since these methods classify by comparing all possible candidates, they require significant classification time.
2.3. Copy Mechanism
The copying mechanism is a method that copies specific parts of the input sequence directly to the output sequence, primarily applied to Seq2Seq models for tasks such as machine translation [] and document summarization []. The copying mechanism determines whether to generate new tokens or directly copy tokens from the encoder’s input sequence by using the attention mechanism during the decoding process. This approach combines the ability of language models to generate new words with the advantages of the copying mechanism to overcome OOV, effectively addressing various problems encountered in traditional Seq2Seq models [].
In this paper, we constructed a generative model for sentence restoration by applying the Pointer Generator [], which utilizes the copying mechanism, to the GPT architecture. The Pointer Generator decides whether to copy input tokens using the context vector generated from the encoder’s attention distribution and the switch. Since GPT consists solely of a decoder structure, there are several differences when compared with the actual functioning of the Pointer Generator:
- In the Pointer Generator, Bahdanau attention [] is used to determine which tokens in the encoder’s input sequence receive higher weights, which then informs the calculation of the copy distribution. In this paper, we instead used 1 for tokens present in the input and 0 for tokens not present in the input.
- When creating the switch, the Pointer Generator utilizes the hidden states of both the encoder and decoder, along with the decoder’s input. In contrast, we generated $p_{gen}$ using only the decoder’s hidden states. In the GPT architecture, there is no encoder; instead, the final output hidden state of the decoder is assumed to contain information from all previous tokens, and the current decoder’s hidden state is employed.
Through this structure, we activated the copying mechanism to add weights to the tokens included in the input sequence, thereby increasing the probability of selecting the weighted tokens from the input sequence.
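As a minimal illustration of the Mask construction described above (our own sketch in PyTorch, not the released implementation), the 0/1 vector over the vocabulary can be built directly from the prompt’s token ids:

```python
import torch

def build_copy_mask(input_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Vocabulary-sized 0/1 Mask: 1 for every token id that occurs in the
    input prompt, 0 for all other tokens. This replaces the Pointer
    Generator's attention-based copy distribution in the GPT setting."""
    mask = torch.zeros(vocab_size)
    mask[input_ids.unique()] = 1.0
    return mask
```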
3. The Proposed Method
3.1. Problem Definition
To resolve ellipsis and anaphora using the copying mechanism, information about the parts where ellipsis and anaphora occur is necessary. This paper aims to resolve the current utterance (prompt text), P, in which ellipsis and anaphora have occurred, to obtain the resolved utterance. We check whether the two given inputs, the resolved previous utterances, R, and the current utterance, P, belong to the same domain, and, depending on the result, we input either R and P together or only P to the model. We set the length of R to two utterances. Additionally, we assume that all ellipsis and anaphora occurring before the current utterance have already been resolved.
3.2. Domain Classification
As shown in Figure 1, if the inputs R and P given to the model belong to the same domain, the predicted label is set to 1, and, if they belong to different domains, it is set to 0. The classification result is multiplied by the reference, R, so that R contributes to the prompt only when the two inputs share a domain. We trained the classifier using a corpus that combines IT-domain and Finance-domain data. The model determines whether the previous utterance belongs to the same domain as the current utterance.
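A minimal sketch of this same-domain check, assuming a binary sequence-classification model fine-tuned on the combined IT/Finance corpus; the checkpoint name "domain-clf" is a placeholder, not a released model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder for a binary classifier fine-tuned on the combined corpus
# (1 = same domain, 0 = different domain).
tokenizer = AutoTokenizer.from_pretrained("domain-clf")
model = AutoModelForSequenceClassification.from_pretrained("domain-clf", num_labels=2)
model.eval()

def same_domain(previous: str, current: str) -> int:
    """Classify whether the previous and current utterances share a domain."""
    inputs = tokenizer(previous, current, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```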

Figure 1.
Copy mechanism architecture using Mask.
3.3. Baseline
We used the Ko-GPT-Trinity (https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5, accessed on 14 December 2024) generative model, which has a GPT [] structure and was fine-tuned for the task of resolving ellipsis and anaphora, as our baseline model. The baseline model’s input is as follows:

$$\mathit{input} = R \; ; \; P \quad (1)$$

In this setup, this prompt is used as the input for the generative model, with the references concatenated in front of P based on the result of the domain classifier. If the inputs belong to the same domain, the previous information, R, is concatenated. However, if they are from different domains, the previous information should not be referenced, so no reference information is included. The elements are separated by the delimiter ‘;’ to specify the parts that need to be resolved.
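A small sketch of this prompt assembly, following our reading of Equation (1) and the ‘@@’ placeholder described for the baseline in Section 4.2; the function name and signature are ours:

```python
def build_prompt(resolved_history: list[str], current: str, same_domain: bool) -> str:
    """Assemble the model input: reference R and current utterance P joined
    by the ';' delimiter. When no reference should be included (different
    domains), the placeholder '@@' marks the empty reference slot."""
    reference = " ".join(resolved_history) if same_domain else "@@"
    return f"{reference};{current}"
```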
3.4. Copy-Mechanism GPT with Mask
We conducted resolution experiments using a model called CGM (Copy-Mechanism GPT with Mask), which applies the copying mechanism to the generative model. The architecture is shown in Figure 1. Like the baseline, we use prompts as in Equation (1) based on the classifier’s results, but we adjust the prediction probabilities of the generative model through operations using the Mask and $p_{gen}$ values, and then determine the final probability, $P_{final}$. The Mask is a one-dimensional vector created by marking tokens used in the input prompt as 1 and all others as 0; it is then multiplied by $h_t$, which is the GPT’s last hidden value at the current time step, t. $p_{gen}$ is a value obtained by passing the Mask and $h_t$ through a linear layer and then applying a sigmoid function, acting as a switch that determines how much weight to give to tokens used in the input. The formula for generating $p_{gen}$ is as follows:

$$p_{gen} = \sigma\left( W \left( \mathrm{Mask} \odot h_t \right) + b \right)$$
The $p_{gen}$ value is multiplied by $P_{vocab}$, the prediction probability of the underlying generative model, GPT; since $p_{gen}$ lies in (0, 1), this multiplication decreases the prediction probability. Additionally, the value obtained by subtracting $p_{gen}$ from 1 is multiplied by the Mask, indicating the weight that the input tokens marked in the Mask will receive. Through this process, the Mask and the predicted probabilities are combined, allowing the tokens used in the provided prompt to receive additional weight. The final predicted probability can be expressed as follows:

$$P_{final}(w_i) = p_{gen} \cdot P_{vocab}(w_i) + (1 - p_{gen}) \cdot \mathrm{Mask}_i$$

Here, $W$ and $b$ are weight vectors, and $\mathrm{Mask}_i$ equals 1 if the i-th token is present in the input, and 0 otherwise.
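The following PyTorch sketch shows one way to wire the switch and the final distribution. Since the Mask is vocabulary-sized while $h_t$ is hidden-sized, we assume the Mask is applied to the vocabulary projection of $h_t$ before the linear gate; this wiring, and all parameter names, are our illustrative reading rather than the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class CopySwitch(nn.Module):
    """Illustrative switch: blends the LM distribution P_vocab with the 0/1
    copy Mask using p_gen, following the equations above."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.gate = nn.Linear(vocab_size, 1)  # linear layer feeding the sigmoid

    def forward(self, h_t: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        logits = self.lm_head(h_t)               # project h_t to vocabulary size
        p_vocab = torch.softmax(logits, dim=-1)
        # Elementwise product of the Mask with the projected hidden state,
        # then linear layer + sigmoid -> scalar switch p_gen in (0, 1).
        p_gen = torch.sigmoid(self.gate(mask * logits))
        # Final scores: generate with weight p_gen, copy with weight 1 - p_gen.
        # (Renormalize if a strict probability distribution is required.)
        return p_gen * p_vocab + (1.0 - p_gen) * mask
```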
3.5. Training
We trained the model using the standard cross-entropy loss with a batch size of 8. We used a pre-trained GPT-2 model and fine-tuned it on our training data. For experiments with the classifier, we additionally trained the classifier model separately.
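For concreteness, a condensed version of this fine-tuning loop might look as follows with HuggingFace Transformers; the learning rate shown is illustrative (only the batch size is fixed above), and the data loader is assumed to yield tokenized prompt–target batches:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skt/ko-gpt-trinity-1.2B-v0.5")
model = AutoModelForCausalLM.from_pretrained("skt/ko-gpt-trinity-1.2B-v0.5")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative value

def train_epoch(loader):
    """One epoch of teacher-forced fine-tuning with cross-entropy loss.
    `loader` is assumed to yield dicts of input_ids / attention_mask tensors
    built from prompt + resolved-target pairs (batch size 8 in the paper)."""
    model.train()
    for batch in loader:
        # For causal LMs, passing labels=input_ids computes the shifted
        # next-token cross-entropy internally.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```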
4. Experimental Results
4.1. Experimental Corpus
In our experiments, we use dialogue data about financial products and IT contract products. Table 2 shows the data used in the experiments (since these data were provided by Korea Telecom and cannot be fully disclosed, we have published a portion of the data on GitHub (https://github.com/cwnu-airlab/CGM, accessed on 14 December 2024)). Additionally, to verify the generality of our proposed method, we conducted experiments using the data published in []. Table 3 shows the proportion of ellipsis and anaphora, and the number of dialogues in the experimental dataset.

Table 2.
Statistics of the experimental dataset.

Table 3.
The number of ellipsis and anaphora occurrences in the experimental dataset.
4.2. Comparing Models
We implemented and compared the following models to verify whether the model preserves the given keywords effectively in environments with limited training data. The detailed model structures are shown in Figure 2.

Figure 2.
Comparing models.
- GPT (H): a general GPT-2 structure without a copying mechanism, using a prompt that sequentially concatenates the unresolved previous utterances, H, and the current utterance, P, without any delimiters.
- GPT (R): a general GPT-2 structure without a copying mechanism, using a prompt that concatenates the resolved previous utterances, R, and the current utterance, P, in sequence without any delimiters.
- Baseline: a version of GPT (R) that inserts the special character ‘@@’ when there is no previous information and uses the delimiter ‘;’ between each element.
- PG-GPT: a model that applies the Pointer Generator structure to GPT, using a Mask based on the input tokens to copy the information instead of a context vector based on attention distribution.
- CGM: a structure within PG-GPT that does not use the Mask when generating $p_{gen}$.
- CGM + HardPrompt A: a modified model within the CGM structure that changes the prompt structure. Prompt A is ‘Reference: R Prompt: P Generate:’.
- CGM + HardPrompt B: prompt B excludes the ‘Generate:’ part from CGM + HardPrompt A.
- CGM + 0/1 Mask only: a structure within CGM that does not use the last hidden state when generating the Mask.
- CGM + SepTwice: a model that modifies the prompt structure within CGM, using the delimiter twice before the current utterance, P, in the prompt.
4.3. Experimental Results and Analysis
4.3.1. Evaluation Metrics
In our experiments, we used BLEU [], Rouge-L [], simCSE [], and BERTScore []. The BLEU score evaluates fluency by calculating n-gram precision between candidate and reference sentences, with a brevity penalty (BP):

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

where $p_n$ is the n-gram precision and $w_n$ is the weight for each n-gram precision. The brevity penalty penalizes short translations:

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where c is the length of the candidate sentence and r is the length of the reference sentence. Rouge-L evaluates the longest common subsequence (LCS) between a candidate sequence and a reference sequence. Given a reference sequence X of length m and a candidate sequence Y of length n, Rouge-L is calculated as:

$$R_{lcs} = \frac{\mathrm{LCS}(X, Y)}{m}, \quad P_{lcs} = \frac{\mathrm{LCS}(X, Y)}{n}, \quad F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where LCS(X, Y) is the length of the longest common subsequence between X and Y. As another metric to evaluate the model’s restoration capability, we calculate F1 through exact matching between the reference sequence and the candidate sequence. We evaluated how well the restoration target was reconstructed using BLEU, Rouge-L, and F1, which indicate how accurately words match, and assessed how well the intent expressed in the user’s current utterance was maintained using simCSE and BERTScore, which represent semantic similarity between two sentences. simCSE computes the cosine similarity between sentence embeddings to measure semantic similarity:

$$\mathrm{sim}(x, y) = \frac{\mathbf{e}_x \cdot \mathbf{e}_y}{\lVert \mathbf{e}_x \rVert \, \lVert \mathbf{e}_y \rVert}$$

where $\mathbf{e}_x$ and $\mathbf{e}_y$ are the sentence embeddings of x and y. BERTScore computes token-level cosine similarity between BERT embeddings of two sentences, with importance weighting:

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \mathbf{x}_i^{\top} \mathbf{y}_j, \quad P_{\mathrm{BERT}} = \frac{1}{|y|} \sum_{y_j \in y} \max_{x_i \in x} \mathbf{x}_i^{\top} \mathbf{y}_j, \quad F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$

where x and y are the reference and candidate sentences, respectively, and $\mathbf{x}_i$, $\mathbf{y}_j$ are their token embeddings.
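To make the Rouge-L computation concrete, here is a small self-contained sketch of the LCS dynamic program and the resulting F-measure (with β = 1 for simplicity; the definition above allows other β values):

```python
def lcs_length(x: list[str], y: list[str]) -> int:
    """Dynamic-programming length of the longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(reference: str, candidate: str) -> float:
    """Rouge-L F-measure over whitespace tokens (beta = 1 sketch)."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(x), lcs / len(y)
    return 2 * recall * precision / (recall + precision)
```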
4.3.2. IT Domain Utterance Restoration Experiments
Table 4 shows the performance results. We conducted experiments in a single domain to verify the restoration effect of the proposed method. We compared the performance by changing each element.

Table 4.
Experimental result of utterance restoration in a single domain. Bold represents the best performance in each evaluation indicator.
Using the resolved information, R, we achieved higher performance compared with using the simple history, H, and adding special characters to delimit each sentence improved performance across all metrics. Additionally, the BERTScore and simCSE scores confirmed that the resolved sentences effectively preserved the original intent expressed by the user. In particular, the CGM-based models that applied the copying mechanism achieved higher performance in BLEU and Rouge-L than the baseline, as they operate by directly copying the input tokens into the output. In the restaurant reservation data as well, although the proposed method showed slightly lower performance in terms of sentence fluency and intent preservation, it demonstrated high performance in Rouge-L and F1, which indicate restoration capability.
Table 5 presents examples of each model’s results. In the current utterance, P, “But what if there’s an accident because of the robot?”, the virtual product ‘AI delivery robot’ was substituted with ‘robot’, yet it can be observed that ‘AI delivery robot’ was effectively restored by referencing the previous history, R.

Table 5.
Examples of sentence restoration by each model. Bold marks the keyword that needs to be recovered in each model’s result.
Table 6 shows the resolution performance for each utterance type. In most performance metrics, the CGM-based models applying the copy mechanism recorded high performance. For the anaphora and ellipsis types, which require accurately copying the necessary keywords from previous turns, CGM showed relatively high performance by copying even detailed keywords, as shown in Table 7, unlike the other CGM-based models, which only copy higher-level concept keywords. While the other models simply connected the product name and the current question, P, and failed to replace ‘the first product’ with the product name, CGM was able to find ‘the first product’ in the previous answer and correctly restore it. Furthermore, as shown in Table 8, CGM can convey the exact meaning of the user’s utterance by copying only ‘card payments’ and excluding other unnecessary words. On the other hand, for the ‘Others’ type, CGM recorded lower performance compared with the other models. This is because the current copy mechanism structure for GPT cannot assign concentrated weights to specific parts of the given input information, so it fails to find the correct parts to copy. Table 9 shows an example of this. Compared with the other models, which correctly restored the product name without altering the user’s intent, CGM additionally restored ‘service customer’, resulting in errors that altered the user’s intent. In the restaurant reservation data as well, similar to the IT domain, the CGM-based models showed generally high performance in the Rouge-L and F1 metrics.

Table 6.
Experimental results by utterance type. Bold represents the best performance in each evaluation indicator.

Table 7.
Examples of anaphora resolution. Bold marks the keyword that needs to be recovered in each model’s result.

Table 8.
Examples of ellipsis type restoration.

Table 9.
Examples of other type restoration. Due to the limitations of the Mask, the current information is not emphasized, leading to the incorrect copying of previous information.
4.3.3. Detecting Domain
We also conducted experiments in a dialogue environment where domain shifts occur to verify the effectiveness of the copy mechanism on out-of-domain data.
For the four CGM-based models that showed relatively high performance in single-domain environments, we examined their performance with and without the use of a domain classifier. The results are shown in Table 10 and demonstrate that using the domain classifier improves performance in BLEU and Rouge-L, indicating more accurate word matching across all models.

Table 10.
Sentence restoration experiment results based on the inclusion of domain classification.
5. Conclusions
We propose a copying mechanism structure for GPT-based sLLMs and validate its ability to preserve keywords found in the input by applying it to the tasks of resolving ellipsis and anaphora in conversation. To adapt the existing Pointer Generator structure, which is based on an encoder–decoder architecture, to the GPT structure, we adopted a Mask structure based on input tokens. Additionally, we conducted various experiments on ellipsis and anaphora resolution across different structures beyond the existing Pointer Generator structure, demonstrating that the Pointer Generator can be successfully introduced into the GPT architecture. Notably, through experiments on ellipsis and anaphora resolution in conversations with domain transitions, we showed that adopting the Pointer Generator structure enables effective resolution of ellipsis and anaphora, even for domains outside the training data. By implementing the copying mechanism using the Mask in the GPT architecture, we hope to facilitate the use of language models in restricted situations, such as with limited training data. However, the current structure has limitations in that it cannot specify which parts of the input information play an important role in generation, and, since it is based on tokenizer token IDs, it cannot handle OOV occurrences. We leave this for future research.
Author Contributions
Data curation, J.-W.C. (Ji-Won Cho); formal analysis, J.-W.C. (Ji-Won Cho), J.O. and J.-W.C. (Jeong-Won Cha); methodology, J.-W.C. (Ji-Won Cho), J.O. and J.-W.C. (Jeong-Won Cha); project administration, J.-W.C. (Jeong-Won Cha); software, J.-W.C. (Ji-Won Cho); supervision, J.-W.C. (Jeong-Won Cha); writing—original draft, J.-W.C. (Ji-Won Cho) and J.O.; writing—review and editing, J.O. and J.-W.C. (Jeong-Won Cha). All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00320, RS-2022-II220320, Artificial intelligence research about cross-modal dialogue modeling for one-on-one multi-modal interactions).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Xu, S.; Li, H.; Yuan, P.; Wu, Y.; He, X.; Zhou, B. Self-Attention Guided Copy Mechanism for Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1355–1362. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.M.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 140:1–140:67. [Google Scholar]
- González, J.L.V.; Rodríguez, A.F. Importance of Pronominal Anaphora Resolution in Question Answering Systems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, 3–6 October 2000. [Google Scholar]
- Aralikatte, R.; Lamm, M.; Hardt, D.; Søgaard, A. Ellipsis and Coreference Resolution as Question Answering. arXiv 2019, arXiv:1908.11141. [Google Scholar]
- Aralikatte, R.; Lamm, M.; Hardt, D.; Søgaard, A. Ellipsis Resolution as Question Answering: An Evaluation. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Kiev, Ukraine, 21–23 April 2021. [Google Scholar]
- Park, C.; Choi, K.H.; Lee, C.; Lim, S. Korean Coreference Resolution with Guided Mention Pair Model Using Deep Learning. ETRI J. 2016, 38, 1207–1217. [Google Scholar] [CrossRef]
- Zaib, M.; Zhang, W.E.; Sheng, Q.Z.; Mahmood, A.; Zhang, Y. Conversational question answering: A survey. Knowl. Inf. Syst. 2021, 64, 3151–3195. [Google Scholar] [CrossRef]
- Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.T.; Choi, Y.; Liang, P.; Zettlemoyer, L. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2174–2184. [Google Scholar] [CrossRef]
- Qu, C.; Yang, L.; Qiu, M.; Zhang, Y.; Chen, C.; Croft, W.B.; Iyyer, M. Attentive History Selection for Conversational Question Answering. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019. [Google Scholar]
- Qiu, M.; Huang, X.; Chen, C.; Ji, F.; Qu, C.; Wei, W.; Huang, J.; Zhang, Y. Reinforced History Backtracking for Conversational Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021. [Google Scholar]
- Mutal, J.; Gerlach, J.; Bouillon, P.; Spechbach, H. Ellipsis Translation for a Medical Speech to Speech Translation System. In Proceedings of the European Association for Machine Translation Conferences/Workshops, Lisbon, Portugal, 4–6 May 2020. [Google Scholar]
- Quan, J.; Xiong, D.; Webber, B.L.; Hu, C. GECOR: An End-to-End Generative Ellipsis and Co-reference Resolution Model for Task-Oriented Dialogue. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Liu, R.; Mao, R.; Luu, A.T.; Cambria, E. A brief survey on recent advances in coreference resolution. Artif. Intell. Rev. 2023, 56, 14439–14481. [Google Scholar] [CrossRef]
- Gülçehre, C.; Ahn, S.; Nallapati, R.; Zhou, B.; Bengio, Y. Pointing the Unknown Words. arXiv 2016, arXiv:1603.08148. [Google Scholar]
- Gu, J.; Lu, Z.; Li, H.; Li, V.O.K. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. arXiv 2016, arXiv:1603.06393. [Google Scholar]
- Al-Sabahi, K.; Yang, K. Supervised Copy Mechanism for Grammatical Error Correction. IEEE Access 2023, 11, 72374–72383. [Google Scholar] [CrossRef]
- See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. arXiv 2017, arXiv:1704.04368. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar] [CrossRef]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).