3.1. Data Augmentation
Since metaphors in the original sentences used for metaphor recognition are often highly conventionalized, they fail to explicitly reveal the semantic conflict between the contextual meaning of a metaphorical word and its original or literal meaning. For this reason, we adopt summary generation as the core form of data augmentation, as it directly targets the key semantic property required for metaphor recognition, namely the actual meaning of the target word in context. Unlike surface-level data augmentation techniques, abstractive summarization preserves the global event structure while filtering out peripheral lexical details, thereby recovering the true meaning of the target word within its context. By generating a concise summary of the original sentence and concatenating it with the original input, the summary serves as a form of semantic anchoring, enabling the SPV layer to focus more on the conflict between the conventional semantics of the target word and its context, and allowing the MIP layer to concentrate more on the conflict between conventional semantics and contextual semantics. Other data augmentation techniques can enrich the context, but the sentences they generate still retain the metaphorical usage and lack this semantic anchoring effect, and thus fail to deepen the semantic conflict. For example, synonym replacement and back-translation merely rephrase the sentence at the surface level and cannot explicitly reveal the true contextual meaning of the target word, as illustrated in
Figure 3.
We choose T5-small as the summarization model for data augmentation based on a balanced consideration of effectiveness and efficiency. In terms of summarization quality, T5-small is a pretrained text-to-text transformation model that has been shown to generate coherent and semantically faithful summaries despite its relatively small scale. For the purpose of data augmentation, the goal is not to produce linguistically sophisticated or highly abstractive summaries, but rather to generate stable and semantically aligned paraphrases that capture the core event meaning, which is desirable for maintaining label consistency in metaphor recognition. From an efficiency perspective, T5-small has clear advantages over larger variants such as T5-base or T5-large. Its smaller number of parameters results in substantially lower computational cost, faster inference speed, and reduced memory consumption. This is particularly important because summary-based augmentation is applied to the entire training corpus and is performed during offline preprocessing. Using larger summarization models would significantly increase preprocessing time and resource consumption without yielding commensurate improvements in downstream metaphor recognition performance.
Figure 4 illustrates the workflow of generating summaries with T5-small and concatenating them with the original sentences. The figure provides a complete data augmentation example, including the specific sentence–summary pair, the prompt, and the concatenation strategy. It is important to note that after concatenation, the original contextual information of the sentence is fully preserved. The role of the summary is not to introduce additional context, but to provide a clearer and more explicit form of semantic anchoring. To ensure summary quality, we measure the semantic similarity between the original sentence and its generated summary, and retain only summaries with a cosine similarity of at least 0.6. If the similarity score falls below this threshold, the original sentence is used as the summary for concatenation. Finally, we randomly sample 1000 generated summaries, whose average length is 17.8 words. Taking the VUA Verb_tr dataset (for details about the dataset, see Section 4.1), which has the longest average sentence length among the evaluated datasets, as an example, the original sentences have an average length of 25 words, while the concatenated sentence–summary pairs have an average length of 42.8 words. After BERT WordPiece tokenization, this corresponds to approximately 47–56 tokens, which is well below the maximum input length of 512 tokens. Therefore, input length constraints do not pose a concern for the proposed data augmentation strategy. In addition, we generate augmented data at a 1:1 ratio, meaning that each sample in the training or test set yields exactly one augmented counterpart.
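As a concrete illustration of this pipeline, the sketch below generates a summary with T5-small, applies the 0.6 cosine-similarity gate, and concatenates the result with the original sentence. It assumes the HuggingFace transformers and sentence-transformers libraries; the similarity encoder (all-MiniLM-L6-v2), the "summarize:" prompt string, and the plain-space concatenation are illustrative stand-ins for the exact setup shown in Figure 4.

```python
# Sketch of summary-based augmentation: generate a summary with T5-small,
# keep it only if it is semantically close enough to the source sentence,
# and concatenate it with the original input.
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

t5_tok = T5Tokenizer.from_pretrained("t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
sim_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative similarity encoder

def augment(sentence: str, threshold: float = 0.6) -> str:
    # T5 uses a task prefix; "summarize: " is the standard summarization prompt.
    inputs = t5_tok("summarize: " + sentence, return_tensors="pt", truncation=True)
    out_ids = t5.generate(**inputs, max_length=48, num_beams=4)
    summary = t5_tok.decode(out_ids[0], skip_special_tokens=True)

    # Quality gate: cosine similarity between sentence and summary must reach the
    # threshold, otherwise fall back to the original sentence as its own summary.
    emb = sim_model.encode([sentence, summary], convert_to_tensor=True)
    if util.cos_sim(emb[0], emb[1]).item() < threshold:
        summary = sentence

    # 1:1 augmentation: every sample is paired with exactly one concatenated version.
    return sentence + " " + summary
```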
3.2. Semantic Matching Based on SPV and MIP
Semantic matching aims to measure the similarity between two given texts, with interaction-based models and representation-based models being the two main approaches. In this paper, we employ these two semantic matching paradigms to implement SPV and MIP, respectively. Unlike MisNet [
5], we choose RoBERTa [
16] as the encoder for both semantic matching models. Compared to BERT [
15], RoBERTa enhances the pre-training process and improves robustness.
Interaction-based SPV model: For interaction-based models (IM), two texts are concatenated as input, allowing every token in the input to fully interact with all other tokens [
19,
20]. The SPV mechanism emphasizes the inconsistency between the target word and its surrounding context, which can be measured through the semantic similarity between them. By concatenating the contextual information, the semantic discrepancy between the target word and its context is further amplified. As illustrated in
Figure 5a, we adopt an interaction-based model to implement SPV, since the target word and its context originate from the same sequence and are naturally concatenated. In our approach, they are treated as two textual components to be matched. Consequently, within RoBERTa, they can fully interact through multi-head self-attention [
21]. Finally, we extract the contextualized embedding of the target word, $v_t$, and the contextual representation, $v_c$, to compute their semantic similarity.
Representation-based model for MIP: For representation-based models (RM), the two input texts are encoded independently by separate encoders, such that no interaction occurs between them during encoding [
22,
23]. The MIP mechanism aims to determine whether a target word has a more basic meaning. Accordingly, we compute the semantic similarity between the target word in the given sentence and its basic meaning representation. As shown in
Figure 5b, we model MIP using an RM framework, where the sentence containing the target word and its basic usage are encoded by two independent encoders to avoid unnecessary interaction. This design allows the model to better capture the contextual meaning of the target word and its basic meaning. As shown in
Figure 5c, after incorporating gloss information, we further compare the semantic similarity between the target word and its gloss, thereby strengthening the modeling of semantic incongruity. Finally, we obtain the contextual target embedding $v_t$, the basic meaning embedding $v_b$, and the gloss embedding $v_g$, and compute their respective semantic similarities.
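The distinction between the two matching schemes can be made concrete with a short sketch. The snippet below is a minimal illustration rather than the exact implementation: it assumes the HuggingFace transformers API with roberta-base checkpoints, and the example sentence, the basic-usage string, and the use of two separately instantiated encoders are illustrative choices.

```python
# Minimal contrast of the two matching set-ups with RoBERTa.
from transformers import RobertaTokenizer, RobertaModel

tok = RobertaTokenizer.from_pretrained("roberta-base")
left_encoder = RobertaModel.from_pretrained("roberta-base")    # sentence (+ summary) side
right_encoder = RobertaModel.from_pretrained("roberta-base")   # basic usage / gloss side

# Interaction-based SPV: target word and context sit in one sequence, so every
# token attends to every other token within a single encoder pass.
spv_inputs = tok("This project is absorbing all my money. The project uses up the money.",
                 return_tensors="pt")
spv_hidden = left_encoder(**spv_inputs).last_hidden_state       # shape (1, n, d)

# Representation-based MIP: the sentence and the basic usage are encoded by two
# independent encoders, so no attention flows between the texts during encoding.
mip_left = left_encoder(**tok("This project is absorbing all my money.",
                              return_tensors="pt")).last_hidden_state
mip_right = right_encoder(**tok("absorb: take in, suck or soak up",
                                return_tensors="pt")).last_hidden_state
```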
3.3. EGSNet Architecture
Combining MIP and SPV: The use of SPV for metaphor detection relies on identifying semantic incongruity between a target word and its surrounding context. However, a conventionally metaphorical target often does not occur in a conflicting context, because its metaphorical sense is itself conventionalized and therefore fits naturally with the surrounding words. In such cases, SPV may become ineffective [
24,
25]. To address this issue, we introduce a data augmentation strategy that generates a summary and appends it to the original sentence. The summary partially restores the literal meaning of the target word, thereby amplifying the semantic discrepancy between the metaphorical usage and its context.
MIP, on the other hand, relies on comparing the basic meaning of a target word with its contextualized meaning. MisNet [
5] employs the basic usage of a word to represent its literal meaning in order to handle novel metaphors. However, for conventional metaphors, the usage of the word often does not differ significantly from its basic meaning. Therefore, a more abstract and conceptual representation of the target word is required to better distinguish it from its contextual usage. To address this limitation, we extend MisNet by incorporating gloss information, which provides a more abstract and conceptual description than basic usage, enabling the model to better capture conventional metaphors.
Finally, we integrate both MIP and SPV to achieve more effective metaphor detection. As illustrated in
Figure 2, EGSNet adopts a Siamese architecture to combine MIP and SPV. The left branch encodes the input sentence together with its generated summary, while the right branch encodes the target word along with its part-of-speech tag, basic usage, and gloss information. MIP is implemented across the two encoders, whereas SPV operates within the left encoder.
The input of the left encoder: The input to the left RoBERTa encoder is the processed given sentence concatenated with its summary:
$$S_l = \texttt{[CLS]}\;\text{sentence}\;\texttt{[SEP]}\;\text{summary}\;\texttt{[SEP]},$$
where [CLS] and [SEP] are the two special tokens of RoBERTa.
The input of the right encoder: The input for the right encoder is constructed by concatenating the target word, its Part-of-Speech (POS) tag, its basic usage and its gloss. The basic usage and gloss (representing the direct or most fundamental meaning of the target word) are retrieved from WordNet using the retrieval method we have established, as shown in
Figure 6. Formally, the input sequence $S_r$ is defined as:
$$S_r = \texttt{[CLS]}\;\text{target word}\;\texttt{[SEP]}\;\text{POS tag}\;\texttt{[SEP]}\;\text{basic usage}\;\texttt{[SEP]}\;\text{gloss}\;\texttt{[SEP]}.$$
In instances where the basic usage and gloss cannot be successfully retrieved, the sequence is reduced to utilizing only the target word and its POS tag.
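As a concrete illustration of this construction, the sketch below assembles the right-encoder input string from WordNet via NLTK. The lookup heuristic (taking the first synset matching the POS tag, its first example sentence as the basic usage, and its definition as the gloss) and the `</s>` separator are simplifying assumptions made for illustration; the actual retrieval procedure is the one shown in Figure 6.

```python
# Illustrative right-encoder input: target word + POS + basic usage + gloss.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

POS_MAP = {"VERB": wn.VERB, "NOUN": wn.NOUN, "ADJ": wn.ADJ, "ADV": wn.ADV}

def build_right_input(target: str, pos: str) -> str:
    parts = [target, pos]
    synsets = wn.synsets(target, pos=POS_MAP.get(pos))
    if synsets:
        sense = synsets[0]                      # simplification: most frequent sense
        if sense.examples():
            parts.append(sense.examples()[0])   # basic usage
        parts.append(sense.definition())        # gloss
    # If basic usage and gloss cannot be retrieved, only the target word
    # and its POS tag remain in the sequence.
    return " </s> ".join(parts)                 # separator choice is illustrative

# e.g. build_right_input("absorb", "VERB")
```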
Input Features: Distinct components of the input sequence exert varying influences on metaphor detection. While the Self-Attention mechanism in BERT enhances semantic representations for input tokens [
21], it may not be sufficient to fully distinguish the functional roles of heterogeneous input parts. To address this limitation, we introduce input-type feature embeddings into the input layer of both the left and right encoders. We design six specific features, which are embedded into fixed-length vectors (see the sketch after this list):
POS Feature: Represents the POS tag of the target word. This feature is exclusive to the right encoder input.
Target Feature: Represents the target word itself. This feature is shared and identical across both the left and right inputs.
Augment Feature: Marks the augmented data (the generated summary); it appears only in the left encoder input.
Gloss Feature: Marks the gloss tokens; it appears only in the right encoder input.
Local Feature: Following [
8,
9], we define the clause containing the target word as the local context. For simplicity, clauses are separated by punctuation such as commas, periods, exclamation marks, and question marks. Since a basic usage is typically brief, we treat the entire basic usage as a local feature.
Global Feature: Encompasses all tokens that are not categorized as POS tags, target words, summary, gloss, or local features.
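The sketch below shows one way such input-type features could be realized: each token receives one of the six feature ids, which is embedded and added to the token and positional embeddings. The id numbering, the embedding dimension, and the clause-splitting helper are illustrative assumptions, not the exact implementation.

```python
# Sketch of input-type feature embeddings: each token receives one of six
# feature ids, whose embedding is added to the token + positional embeddings.
import re
import torch
import torch.nn as nn

# Illustrative id assignment (the concrete numbering in the paper may differ).
FEAT = {"global": 0, "local": 1, "target": 2, "augment": 3, "gloss": 4, "pos": 5}

feature_embedding = nn.Embedding(num_embeddings=len(FEAT), embedding_dim=768)

def local_clause(sentence: str, target: str) -> str:
    # Clauses are split on commas, periods, exclamation and question marks;
    # the clause containing the target word is the local context.
    for clause in re.split(r"[,.!?]", sentence):
        if target in clause:
            return clause
    return sentence

# Left-input feature ids for a toy 6-token example:
# [global, local, target, local, augment, augment]
left_feature_ids = torch.tensor([[0, 1, 2, 1, 3, 3]])
left_feature_vecs = feature_embedding(left_feature_ids)   # (1, 6, 768)
```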
Following tokenization via the Byte-Pair Encoding (BPE) algorithm [
26], the left input $S_l$ is segmented into $n$ tokens, while the right input $S_r$ comprises $m$ tokens. The final input representation for each encoder is the sum of the token embeddings, positional embeddings, and the aforementioned feature embeddings. We utilize RoBERTa to generate contextualized representations:
$$H_l = \mathrm{RoBERTa}_{\mathrm{left}}(S_l) \in \mathbb{R}^{n \times d}, \qquad H_r = \mathrm{RoBERTa}_{\mathrm{right}}(S_r) \in \mathbb{R}^{m \times d},$$
where $H_l$ and $H_r$ denote the embedding matrices of $S_l$ and $S_r$, respectively, with $d$ representing the hidden dimension size of RoBERTa.
From $H_l$, we derive the contextual meaning of the target word, denoted as $v_t$. If the target word is fragmented into $k$ sub-tokens by BPE starting at index $u$, $v_t$ is calculated by averaging these token embeddings:
$$v_t = \frac{1}{k}\sum_{i=u}^{u+k-1} H_{l,i},$$
where $u$ indicates the starting position of the target word in the left input.
Similarly, based on $H_r$, we obtain the basic meaning of the target word, denoted as $v_b$. Note that it is difficult to determine the exact position of the target word within the basic usage, as it may not be presented in its original form, and the basic usage is not always precise. However, we do not need to know the precise position of the target word in the basic usage. We simply anchor the target word at the first position of the right encoder (as shown in the input section of the right encoder in Figure 2), because the Transformer encoder applies a self-attention mechanism, allowing the target word in $S_r$ to automatically focus on relevant parts of the basic usage and gloss [21]. So, $v_b$ is calculated by averaging these token embeddings:
$$v_b = \frac{1}{k}\sum_{i=1}^{k} H_{r,i},$$
where, in the input to the right encoder, the starting position of the target word is fixed at 1.
If the gloss is decomposed into $p$ sub-tokens by BPE, with a starting position of $o$, then $v_g$ is calculated by averaging these token embeddings:
$$v_g = \frac{1}{p}\sum_{i=o}^{o+p-1} H_{r,i},$$
where $o$ indicates the starting position of the gloss in the right input.
For the left input, we compute the average of the entire embedding matrix $H_l$ to obtain the global context embedding:
$$v_c = \frac{1}{n}\sum_{i=1}^{n} H_{l,i}.$$
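For concreteness, these pooling steps can be sketched as follows; the tensor shapes and index values are dummy placeholders, and the variable names mirror the symbols in the reconstructed equations above.

```python
# Sketch of the pooling steps: average the sub-token embeddings of the target
# word, of the target anchored at position 1 in the right input, and of the
# gloss, and mean-pool the whole left sequence for the global context vector.
import torch

def span_mean(H: torch.Tensor, start: int, length: int) -> torch.Tensor:
    # H: (seq_len, d); average rows start .. start + length - 1
    return H[start:start + length].mean(dim=0)

H_l = torch.randn(42, 768)   # left-encoder outputs (n x d), dummy values
H_r = torch.randn(30, 768)   # right-encoder outputs (m x d), dummy values

u, k = 5, 2                  # target word starts at index u, split into k sub-tokens
o, p = 12, 9                 # gloss starts at index o, split into p sub-tokens

v_t = span_mean(H_l, u, k)   # contextual target meaning
v_b = span_mean(H_r, 1, k)   # basic meaning: target anchored at position 1
v_g = span_mean(H_r, o, p)   # gloss embedding
v_c = H_l.mean(dim=0)        # global context embedding
```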
The MIP layer contrasts the basic meaning vector $v_b$ with the contextual target meaning vector $v_t$. The gloss layer contrasts the gloss vector $v_g$ with the contextual target meaning vector $v_t$. We employ a linear transformation to implement the MIP interaction:
$$h_{\mathrm{MIP}} = W_{\mathrm{MIP}}\, f(v_t, v_b) + b_{\mathrm{MIP}},$$
$$h_{\mathrm{gloss}} = W_{\mathrm{gloss}}\, f(v_t, v_g) + b_{\mathrm{gloss}},$$
$$f(a, b) = [\,a \,;\; b \,;\; |a - b| \,;\; a * b\,],$$
where $f(\cdot)$ denotes a readout function. Here, $|\cdot|$ represents the absolute difference, $;$ denotes concatenation, and $*$ signifies the Hadamard product. These operations are combined to extract multi-faceted representations. $W_{\mathrm{MIP}}$ and $b_{\mathrm{MIP}}$ correspond to the weight matrix and bias of the MIP layer, while $W_{\mathrm{gloss}}$ and $b_{\mathrm{gloss}}$ correspond to the weight matrix and bias of the gloss layer.
Similarly, we perform the SPV operation on the context vector $v_c$ and the contextual target meaning vector $v_t$:
$$h_{\mathrm{SPV}} = W_{\mathrm{SPV}}\, f(v_t, v_c) + b_{\mathrm{SPV}},$$
where $W_{\mathrm{SPV}}$ and $b_{\mathrm{SPV}}$ are the weight and bias parameters of the SPV layer.
Given the significance of Part-of-Speech (POS) information in metaphor detection, we explicitly extract the POS vector $v_{\mathrm{pos}}$ from the right encoder. Finally, we integrate the information from $h_{\mathrm{MIP}}$, $h_{\mathrm{gloss}}$, SPV, and POS to determine the metaphoricity of the target word:
$$\hat{y} = \mathrm{softmax}\big(W\,[\,h_{\mathrm{MIP}} \,;\; h_{\mathrm{gloss}} \,;\; h_{\mathrm{SPV}} \,;\; v_{\mathrm{pos}}\,] + b\big),$$
where $W$ and $b$ represent the weight and bias, respectively, $\mathrm{softmax}(\cdot)$ denotes the softmax function, and $\hat{y}$ indicates the predicted label distribution.
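A compact PyTorch sketch of these layers is given below. It implements the readout $f(a,b) = [a; b; |a-b|; a*b]$ and the final softmax over the concatenated features; the hidden size, number of labels, and module names are illustrative choices rather than the exact EGSNet configuration.

```python
# Sketch of the MIP, gloss, and SPV layers plus the final classifier,
# following the equations above; layer sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def readout(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # f(a, b) = [a ; b ; |a - b| ; a * b]
    return torch.cat([a, b, (a - b).abs(), a * b], dim=-1)

class MetaphorHead(nn.Module):
    def __init__(self, d: int = 768, hidden: int = 256, num_labels: int = 2):
        super().__init__()
        self.mip = nn.Linear(4 * d, hidden)    # contrasts v_t with v_b
        self.gloss = nn.Linear(4 * d, hidden)  # contrasts v_t with v_g
        self.spv = nn.Linear(4 * d, hidden)    # contrasts v_t with v_c
        self.cls = nn.Linear(3 * hidden + d, num_labels)

    def forward(self, v_t, v_b, v_g, v_c, v_pos):
        h_mip = self.mip(readout(v_t, v_b))
        h_gloss = self.gloss(readout(v_t, v_g))
        h_spv = self.spv(readout(v_t, v_c))
        logits = self.cls(torch.cat([h_mip, h_gloss, h_spv, v_pos], dim=-1))
        return F.softmax(logits, dim=-1)       # predicted label distribution
```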