Article

Applying Additional Auxiliary Context Using Large Language Model for Metaphor Detection

by
Takuya Hayashi
1,*,† and
Minoru Sasaki
2,†
1
Major in Computer and Information Sciences, Graduate School of Science and Engineering, Ibaraki University, 4-12-1, Nakanarusawa, Hitachi 316-8511, Ibaraki, Japan
2
Department of Computer and Information Sciences, College of Engineering, Ibaraki University, 4-12-1, Nakanarusawa, Hitachi 316-8511, Ibaraki, Japan
*
Author to whom correspondence should be addressed.
†
These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(9), 218; https://doi.org/10.3390/bdcc9090218
Submission received: 6 June 2025 / Revised: 24 July 2025 / Accepted: 13 August 2025 / Published: 25 August 2025

Abstract

Metaphor detection is challenging in natural language processing (NLP) because it requires recognizing nuanced semantic shifts beyond literal meaning, and conventional models often falter when contextual cues are limited. We propose a method to enhance metaphor detection by augmenting input sentences with auxiliary context generated by ChatGPT. In our approach, ChatGPT produces semantically relevant sentences that are inserted before, after, or on both sides of a target sentence, allowing us to analyze the impact of context position and length on classification. Experiments on three benchmark datasets (MOH-X, VUA_All, VUA_Verb) show that this context-enriched input consistently outperforms the no-context baseline across accuracy, precision, recall, and F1-score, with the MOH-X dataset achieving the largest F1 gain. These improvements are statistically significant based on two-tailed t-tests. Our findings demonstrate that generative models can effectively enrich context for metaphor understanding, highlighting context placement and quantity as critical factors. Finally, we outline future directions, including advanced prompt engineering, optimizing context lengths, and extending this approach to multilingual metaphor detection.

1. Introduction

Metaphors are pervasive in natural language and play a vital role in expressing abstract concepts through more concrete or familiar domains. As Lakoff and Johnson [1] argue, metaphors are not merely rhetorical devices but are fundamental to human cognition, serving as a mapping mechanism between source and target conceptual domains. This makes metaphor detection a key challenge in natural language understanding (NLU).
Recent advancements in large-scale language models (LLMs), such as BERT and GPT, have significantly improved various NLP tasks by capturing rich contextual and semantic relationships. However, metaphor detection remains difficult for these models, particularly in short or ambiguous sentences. This difficulty stems from the fact that metaphors involve a shift in meaning that often cannot be resolved by lexical similarity or syntactic features alone.
Traditional approaches to metaphor detection have included rule-based systems, feature-engineered classifiers, and more recently, neural architectures like RNNs and Transformer-based models. Among these, MisNet has proven effective by integrating linguistic rules with contextual embeddings. Nevertheless, these models still rely on static inputs and often fail when contextual cues are insufficient.
To address this limitation, our study explores whether metaphor detection accuracy can be improved by dynamically enriching the input with auxiliary context generated by an LLM. Specifically, we propose a method that uses ChatGPT to generate semantically coherent sentences that are inserted before and/or after a given target sentence. This augmented input is then fed into a metaphor classification model (based on MisNet) to evaluate performance changes.
We hypothesize that both the amount and position of contextual information significantly influence metaphor classification, particularly in cases where semantic cues are limited. We test this hypothesis empirically on three benchmark datasets: MOH-X, VUA_All, and VUA_Verb.
Through quantitative evaluation and statistical analysis, we demonstrate that our context-enriched inputs consistently improve metaphor detection metrics. This suggests that metaphor understanding can benefit from generative augmentation strategies that provide interpretive scaffolding, and that LLMs can play a dual role: not only as classifiers but also as semantic context generators. Our findings provide new directions for metaphor-aware NLP modeling and LLM prompting techniques.

1.1. Research Background

In recent years, the field of artificial intelligence (AI) has experienced rapid and continuous progress. A major milestone was achieved in 2018, when Google introduced BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that significantly outperformed previous models across a range of NLP tasks. This advancement catalyzed the development of various large-scale language models (LLMs), including OpenAI’s GPT series, which have received global attention since the release of GPT-3 and GPT-4. These models have demonstrated impressive fluency and contextual awareness, transforming not only research but also practical applications in dialogue systems, summarization, and text generation.
Within this fast-evolving landscape, metaphor detection has emerged as a prominent and challenging research area in natural language understanding. Metaphors are pervasive in everyday language and are essential for expressing abstract or complex ideas through more familiar or concrete terms. As conceptualized by Lakoff and Johnson [1], metaphors involve a mapping between a “source” domain (concrete) and a “target” domain (abstract), allowing people to understand one concept in terms of another. This mechanism makes metaphors powerful but also difficult to detect computationally, as they often defy literal interpretation.
The following are illustrative examples of metaphorical expressions:
1.
“He absorbed the knowledge or beliefs of his tribe.”
The verb absorb does not refer to physical intake, as in absorbing water, but instead metaphorically denotes the internalization of abstract content such as knowledge or ideology. The implied meaning is: “He took in the knowledge or beliefs of his tribe.”
2.
“His political ideas color his lectures.”
The term color here does not refer to visual pigmentation but metaphorically suggests that his political views influence or shape the tone and content of his lectures. The intended meaning is: “His political ideas influence the nature of his lectures.”
As shown in these examples, metaphorical language often involves a discrepancy between literal word meaning and intended semantic interpretation. This makes metaphor detection a task requiring high-level semantic understanding, contextual inference, and sometimes even cultural or pragmatic knowledge.
Several computational approaches have been developed to address this task. Early methods relied on manually crafted linguistic features such as word concreteness, abstractness, frequency, or part-of-speech, which were input into traditional machine learning classifiers. These approaches were later outperformed by neural models such as RNNs and CNNs, which could learn features from data automatically. However, these models typically employed static word embeddings (e.g., Word2Vec, GloVe), which lack the ability to capture context-specific meaning variations crucial for metaphor detection. Moreover, RNNs face limitations in parallelization and long-distance dependency modeling.
Transformer-based models like BERT and RoBERTa introduced attention mechanisms and contextualized embeddings that significantly improved metaphor detection accuracy. These models can better capture semantic shifts by dynamically adjusting word representations based on surrounding context. In parallel, recent studies have incorporated external resources such as lexical definitions and conceptual knowledge from dictionaries to enrich word representations.
A representative example is the MisNet model proposed by Zhang and Liu in 2022  [2]. MisNet integrates linguistic rules—specifically the Metaphor Identification Procedure (MIP) and Selectional Preference Violation (SPV)—with contextual embeddings derived from Transformers. It processes the input sentence alongside the target word’s dictionary definition, usage patterns, and grammatical information to determine the metaphorical nature of the expression. Despite its effectiveness, MisNet exhibits limitations when applied to sentences with minimal context, such as short utterances, greetings, or sentences composed solely of pronouns (e.g., “That’s heavy.” or “He did it.”).
These shortcomings highlight a critical issue: even the most advanced models can misclassify metaphors in the absence of sufficient contextual cues. This motivates our investigation into whether dynamically generated context—produced by LLMs like ChatGPT—can serve as an auxiliary input to enhance metaphor detection, particularly in challenging cases with limited context.

1.2. Research Objectives

Despite the progress made by Transformer-based models such as BERT and RoBERTa in metaphor detection, these models continue to exhibit weaknesses in handling sentences with limited or ambiguous context. In such cases, the lack of surrounding linguistic or semantic cues often leads to misclassification of metaphorical expressions. This is particularly problematic in domains such as education, dialogue systems, and automated text analysis, where short utterances or pronoun-heavy constructions are common.
The primary objective of this study is to investigate whether dynamically generated auxiliary context—produced by a large language model (LLM)—can enhance the accuracy of metaphor detection. Specifically, we explore the use of ChatGPT to generate semantically coherent sentences that can be appended before and/or after a target sentence. These augmented inputs are then used to train or evaluate a metaphor classification model based on MisNet, which is particularly suitable for this study due to its architecture that integrates linguistic rules with contextual embeddings, allowing us to evaluate the impact of enriched context effectively.
We hypothesize that both the amount and the position of added context significantly influence detection performance. For example, placing explanatory context before the target sentence may prime the model for interpretation, whereas placing it after may act as confirmatory evidence. This leads us to design a series of experiments across three widely used benchmark datasets—MOH-X, VUA_All, and VUA_Verb—to empirically evaluate the effects of auxiliary context in metaphor classification tasks.
Through these experiments, we aim to address the following research questions:
  • Can auxiliary context generated by ChatGPT improve metaphor detection accuracy?
  • Does the position (before vs. after) of generated context affect performance?
  • How does the quantity of added context influence classification outcomes?
By answering these questions, we aim to contribute to a better understanding of how generative LLMs can be used not only for classification but also for context enrichment. Our findings could pave the way for more robust metaphor-aware NLP systems, especially in scenarios where input text is minimal or semantically sparse.
An early version of the present study was presented at the 18th International Conference on E-Service and Knowledge Management (ESKM 2024) [3].

2. Related Work

2.1. Metaphor Detection with Pretrained Encoders

Early neural approaches drew on feature engineering and RNNs; transformer-based encoders subsequently delivered large gains by capturing contextual semantics. MisNet [2] operationalizes two linguistic heuristics—MIP [4] and SPV [5]—in a Siamese architecture, using dictionary glosses to approximate a word’s basic sense. MelBERT [6] extends this line via metaphor-aware late interaction over BERT/RoBERTa representations guided by identification theories. FrameBERT [7] incorporates FrameNet embeddings to represent conceptual frames, improving interpretability and performance across MOH-X, VUA, and TroFi.
While these models enrich representations with static external knowledge such as FrameNet or WordNet, our work explores the dynamic generation of contextual sentences tailored to each specific input, directly addressing the limitation of insufficient context.

2.2. Data Scarcity and Auxiliary Signals

Limited labeled data motivates transfer and auxiliary-task learning. Zhang and Liu [8] introduce adversarial multi-task learning (AdMul) to transfer from basic sense discrimination built from WSD corpora. Jia and Li [9] enhance metaphor detection with soft labels distilled from a teacher model plus prompt-based target-word prediction, achieving SOTA results. ContrastWSD [10] explicitly contrasts contextual vs. basic senses using WSD signals. Recent work has also explored curriculum-style augmentation for metaphor detection, where examples are introduced in increasing levels of complexity to better align with LLM learning dynamics [11]. Our work is complementary: instead of new supervision, we generate disambiguating context on demand.

2.3. LLMs for Figurative Language and In-Context Augmentation

Recent studies explore LLMs for figurative understanding. LaiDA [12] combines linguistics-aware retrieval with LLM data augmentation. Work on idiom/figurative QA [13,14,15] shows that conversational LLMs still trail human performance, especially in context-heavy scenarios—motivating explicit context scaffolding. We extend this perspective by programmatically synthesizing short auxiliary sentences conditioned on gold metaphor labels.

2.4. Paraphrasing and Web-Derived Context

Paraphrase generation is an effective augmentation strategy across NLP tasks [16]. Although not designed specifically for metaphor detection, paraphrastic variants can diversify surface forms and expose alternative literal paraphrases, which may sharpen metaphor classifiers.

3. Methods

We follow the overall pipeline of our earlier work but expand the documentation of the LLM interface, prompt templates, and dataset transformation steps.

3.1. Generation Environment

The hardware and software environment used to generate auxiliary context is summarized in Table 1.

3.2. ChatGPT Configuration

Table 2 records the API, model snapshot, and parameterization used for auxiliary-context generation. Unless noted, OpenAI default values were used; temperature (1.0) and top-p (1.0) are the documented defaults at the time of data collection. No explicit top-k sampling control is exposed in the OpenAI API; nucleus (top-p) sampling was left at default.
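For illustration, a minimal sketch of this generation call is given below, assuming the official openai Python client (v1.x); the model identifier and prompt string are placeholders rather than the exact values recorded in Table 2.

```python
# Minimal sketch of the auxiliary-context generation call.
# Assumes the official `openai` Python client (v1.x); the model name and
# prompt text are placeholders standing in for the values in Tables 2 and 3.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_auxiliary_sentence(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Request one short auxiliary sentence with the documented default sampling settings."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # OpenAI default, as noted above
        top_p=1.0,        # nucleus (top-p) sampling left at its default
    )
    return response.choices[0].message.content.strip()
```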

3.3. Prompt Templates and Worked Examples

We designed two prompt families. Prompt 1 (minimal instruction) encodes the gold label via the token not; Prompt 2 supplements Prompt 1 with exemplars of metaphorical vs. literal uses of the same target word to reduce ambiguity. Both templates parameterize word budget N and insertion position (precede vs. follow). Complete templates and fully instantiated examples are shown in Table 3.
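The sketch below illustrates, under assumed wording, how such a template can be parameterized; the function name build_prompt and the phrasing are ours for illustration, and the authoritative templates are those in Table 3.

```python
# Illustrative parameterization of the two prompt families. The exact wording
# used in the experiments is given in Table 3; this sketch only shows how the
# word budget N, the gold label (encoded via "not"), the insertion position,
# and the optional exemplars of Prompt 2 are combined into one instruction.
def build_prompt(sentence: str, target: str, is_metaphor: bool,
                 n_words: int, position: str, exemplars: str = "") -> str:
    negation = "" if is_metaphor else "not "
    placement = "before" if position == "prior" else "after"
    prompt = (
        f"Write one sentence of about {n_words} words to be placed {placement} "
        f"the following sentence, in which the word '{target}' is {negation}used "
        f"metaphorically.\nSentence: {sentence}"
    )
    # Prompt 2 prepends exemplars of metaphorical vs. literal uses of the target word
    return (exemplars + "\n" + prompt) if exemplars else prompt
```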

3.4. Data Integration

Auxiliary sentences are concatenated with the original target sentence before it, after it, or on both sides (prior + following). Concatenation preserves punctuation and spacing to avoid token-boundary artifacts. For each instance, the resulting string replaces the sentence field in the MisNet CSV schema; all other columns (target position, POS, gloss, etc.) remain unchanged. When prior context is added, the offsets in target_position are shifted by the number of auxiliary tokens.
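A minimal sketch of this integration step is shown below; the column names follow the MisNet CSV schema mentioned above, while whitespace tokenization and the helper name integrate are simplifying assumptions.

```python
# Sketch of the integration step: concatenate auxiliary context with the target
# sentence and shift the target index when prior context is added. Column names
# follow the MisNet CSV schema; whitespace tokenization is an assumption.
def integrate(row: dict, aux_prior: str = "", aux_following: str = "") -> dict:
    parts = [p for p in (aux_prior, row["sentence"], aux_following) if p]
    new_row = dict(row)
    new_row["sentence"] = " ".join(parts)
    if aux_prior:
        # re-index the target offset by the number of prepended auxiliary tokens
        new_row["target_position"] = row["target_position"] + len(aux_prior.split())
    return new_row
```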

3.5. Implementation Notes

Scripts used to generate context, merge CSVs, and re-index targets are released at the project repository (see Data Availability).

4. Experiment

This section documents the environments, datasets, training regimen, and evaluation protocols in detail to support reproducibility.

4.1. Training Environment

MisNet was trained and evaluated using the environment detailed in Table 4.

4.2. MisNet Fine-Tuning Hyperparameters

The hyperparameters used for fine-tuning MisNet are listed in Table 5.

4.3. Datasets

4.3.1. Original Dataset

This section describes the structure and details of the datasets used in the experiments. Three publicly available datasets were employed: MOH-X [17], VUA_All [18], and VUA_Verb [18]. In addition, Table 6 presents the number of sentences and words in each dataset, the proportion of metaphors, and the average sentence length.
MOH-X
MOH-X is a verb-focused dataset compiled from WordNet, containing both literal and metaphorical uses of verbs. It was originally created by Mohammad, Shutova, and Turney (2016) [17] to study metaphor as a medium for conveying emotion, with annotations on both metaphoricality and affective meaning.
MOH-X consists of 647 instances, of which 48.69% are labeled as metaphorical. The sentences are notably short, with an average length of only 8.0 tokens. This high metaphor density and minimal sentence context make MOH-X a particularly challenging testbed for metaphor detection, especially under conditions of contextual ambiguity. These characteristics make it an ideal benchmark for evaluating the impact of auxiliary context generation.
VUA_All
The VUA dataset (Vrije Universiteit Amsterdam Metaphor Corpus) was created by VU University Amsterdam using fragments sampled from four genres of the British National Corpus: academic, news, conversation, and fiction. VUA_All includes part-of-speech (POS) tagging for every word in every sentence, annotated using the MIPVU procedure with high inter-annotator agreement ( κ > 0.8 ). The types of POS tags are shown in Table 7.
The training split of VUA_All contains 12,123 sentences (72,611 tokens), and the test set consists of 4081 sentences (22,196 tokens). Overall, approximately 18% of the tokens are labeled as metaphorical. The average sentence length is 18.4 for training and 18.6 for test. Due to its genre diversity and high annotation quality, VUA_All serves as a robust benchmark for assessing model generalizability across realistic, varied language contexts.
VUA_Verb
VUA_Verb is a filtered subset of VUA_All, consisting exclusively of sentences where the target word being evaluated is a verb. This specialized setup is particularly important for metaphor detection, as verbs often form the semantic core of a sentence and are more likely to involve figurative language.
In the VUA_Verb test set, 29.98% of target verbs are labeled as metaphorical, with an average sentence length of 18.6 tokens. Compared to VUA_All, this subset highlights verb-specific challenges such as polysemy and context-dependence. Comparing results between VUA_All and VUA_Verb can reveal whether metaphor detection models benefit from context depending on the grammatical category of the target word.

4.3.2. Additional Figurative Resources (Not Used in Main Experiments)

To contextualize scope, we surveyed additional figurative-language corpora: TroFi (verb tropes), FIG-QA [13,14], Multi-Figurative (idiom, metaphor, sarcasm) [19], and the "It's not Rocket Science" narrative idiom benchmark [15]. These resources support future cross-phenomenon generalization studies (see Section 11).

4.4. Generation Procedure of Dataset with Additional Context

The procedure for generating a context-enriched dataset is illustrated in Figure 1. First, one of the original datasets—MOH-X, VUA_All, or VUA_Verb—is loaded, and prompts are constructed for each target sentence. These prompts are then input to ChatGPT, which generates short auxiliary sentences. Depending on the configuration, the generated auxiliary sentence is inserted before (prior), after (following), or both before and after the target sentence. The resulting data is then output in CSV format compatible with MisNet, a metaphor detection model.
The specific processing steps for each data instance are shown in the three flowcharts. The top row of each figure represents the input data. The variable names enclosed in parentheses correspond to those defined in the prompt templates listed in Table 3 in Section 3.3. Figure 2 compares the two main strategies for contextual augmentation: (a) the “prior” strategy adds auxiliary context before the target sentence, and (b) the “following” strategy adds it after.
In addition, Figure 3 illustrates the “prior + following” configuration, where auxiliary context is added to both sides of the target sentence.
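The sketch below condenses this pipeline into a single loop, reusing the illustrative helpers sketched in Section 3; the column names (target_word, label) and file handling are assumptions, and the released scripts (see Data Availability Statement) remain the authoritative implementation.

```python
# Condensed sketch of the pipeline in Figure 1, reusing the helpers sketched
# in Section 3 (build_prompt, generate_auxiliary_sentence, integrate).
# Column names and file paths are placeholders, not the exact released schema.
import pandas as pd

def augment_dataset(in_csv: str, out_csv: str, n_words: int, position: str) -> None:
    """Load a dataset, generate auxiliary context per instance, and write a MisNet-style CSV."""
    df = pd.read_csv(in_csv)
    rows = []
    for _, row in df.iterrows():
        prompt = build_prompt(row["sentence"], row["target_word"],
                              bool(row["label"]), n_words, position)
        aux = generate_auxiliary_sentence(prompt)
        if position == "prior":
            rows.append(integrate(row.to_dict(), aux_prior=aux))
        elif position == "following":
            rows.append(integrate(row.to_dict(), aux_following=aux))
        else:  # "prior + following": one generated sentence on each side
            aux_after = generate_auxiliary_sentence(prompt)
            rows.append(integrate(row.to_dict(), aux_prior=aux, aux_following=aux_after))
    pd.DataFrame(rows).to_csv(out_csv, index=False)
```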

5. Results

This section presents the results of the experiments conducted as described in the previous chapter. The evaluation is based on three main configurations: the original dataset, datasets enhanced with 5-word auxiliary sentences, and datasets enhanced with 10-word auxiliary sentences. Auxiliary sentences were generated using two prompt types (Prompt 1 and Prompt 2). For each configuration, we report Accuracy (Acc), Precision (Prec), Recall (Rec), and F1-score (F1).
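For reference, the four metrics follow their standard definitions, as in the scikit-learn sketch below, where label 1 denotes a metaphorical instance.

```python
# The four reported metrics computed with scikit-learn for reference;
# y_true and y_pred are binary labels (1 = metaphorical, 0 = literal).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    return {
        "Acc":  accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred),
        "Rec":  recall_score(y_true, y_pred),
        "F1":   f1_score(y_true, y_pred),
    }
```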

5.1. Comparison of Evaluation Metrics

This section presents a comparative analysis of model performance under various auxiliary context configurations across three datasets: MOH-X, VUA_All, and VUA_Verb. The evaluation metrics include Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 score. The highest score in each metric is underlined for emphasis.

5.1.1. Effect of Prior Context

We evaluated the impact of adding prior context on metaphor detection across three datasets: MOH-X, VUA_All, and VUA_Verb. The results, summarized in Table 8, Table 9 and Table 10, indicate that incorporating preceding sentences generally improves model performance across multiple evaluation metrics.
In the MOH-X dataset (Table 8), all contextual augmentation settings led to performance gains over the original baseline. Notably, P2_5words achieved the highest accuracy (0.8547) and precision (0.8619), while P2_10words recorded the best recall (0.8653) and F1 score (0.8498). These improvements suggest that injecting even a small amount of prior context enhances the model’s ability to generalize, particularly in few-shot or idiomatic examples.
For the VUA_All dataset (Table 9), validation results show that P1_10words achieved the highest recall (0.7823), while P2_5words provided the best accuracy (0.9419) and F1 score (0.7596). In terms of test performance, P2_10words yielded the best precision (0.7695), and P1_10words slightly outperformed others in recall (0.7603). These results demonstrate that both the amount and type of prior context can influence performance, with slightly longer or rule-informed contexts providing marginal gains.
In the VUA_Verb dataset (Table 10), which focuses specifically on metaphorical verbs, similar trends were observed. On the validation set, P1_5words achieved the best accuracy (0.8138) and F1 score (0.6945), while the original data retained the highest recall (0.7601). On the test set, P2_5words yielded the highest recall (0.7426), and P2_10words achieved the best precision (0.6887). These findings suggest that while prior context may slightly reduce sensitivity (recall) in some configurations, it can significantly boost precision and overall balance (F1), particularly when fine-tuned for verbs.
Overall, the addition of prior context—especially configurations like P1_10words and P2_5words—proved beneficial across datasets. The gains were most evident in precision and F1 score, indicating that contextual signals help disambiguate literal versus metaphorical usage, especially in challenging or ambiguous cases.

5.1.2. Effect of Following Context

Adding context after the target sentence generally resulted in increased recall, especially when the context length was extended to 10 words. However, this improvement in recall often came at the cost of a slight reduction in precision, suggesting a trade-off between sensitivity and specificity.
In the MOH-X dataset (Table 11), the addition of 5-word following context using P1_5words achieved the highest accuracy (0.8469), precision (0.8455), and F1 score (0.8429). On the other hand, P1_10words recorded the best recall (0.8716), albeit with a drop in precision, which lowered its overall F1. These results imply that while extended following context helps the model capture more metaphorical instances (higher recall), it can sometimes introduce noise that slightly reduces precision.
In the VUA_All dataset (Table 12), we observed a similar pattern. P1_10words yielded the best performance on the validation set in terms of accuracy (0.9522), precision (0.8134), and F1 score (0.7878), although recall was slightly below the original. On the test set, P2_10words achieved the highest accuracy (0.9480) and precision (0.8288), while P1_5words produced the best F1 score (0.7787). These results suggest that while following context helps boost model robustness, optimal context length and type may vary depending on evaluation criteria.
For the VUA_Verb dataset (Table 13), which focuses on verb metaphors, P1_10words reached the highest accuracy (0.8718) and F1 score (0.7569) on the validation set. P2_5words was the most balanced configuration on the test set, achieving top recall (0.7414) and F1 score (0.7350). Notably, P2_10words had the best test precision (0.8077), again confirming the trade-off: more context increases detection power but may also increase the risk of false positives.
In summary, adding following context improves recall and F1 score in many cases, particularly when 10 words are used. However, this gain often comes with a drop in precision, indicating that while the model becomes better at catching metaphors, it may do so with less certainty.

5.1.3. Effect of Combined Context (Prior + Following)

Combining both prior and following context yielded the highest overall performance across datasets. However, the optimal prompt and context length varied by dataset. As shown in Table 14, Table 15 and Table 16, the combined context setting consistently improved accuracy and F1 scores compared with the original baseline across MOH-X, VUA_All, and VUA_Verb.

5.1.4. Discussion

The results clearly demonstrate that incorporating auxiliary context—whether prior, following, or both—significantly enhances metaphor detection performance across all datasets. Several key patterns emerged from the experiments:
  • For MOH-X, the highest F1 score (0.8493) was achieved using Prompt 1 with 10-word combined context, while the highest accuracy (0.8529) and precision (0.8601) were obtained using Prompt 2 with 5-word combined context. This indicates that short, rule-based prior context is effective for precise metaphor classification, while longer, pattern-based context improves general recall.
  • In VUA_All, Prompt 1 with 5-word combined context consistently outperformed other settings, achieving the highest validation accuracy (0.9517) and test F1 score (0.7839). Prompt 1 with 10 words yielded the highest validation precision (0.8122). These results suggest that both moderate-length and prompt style play important roles in balancing recall and precision.
  • For VUA_Verb, Prompt 1 with 10-word combined context achieved the best validation accuracy (0.8759) and F1 score (0.7633), while Prompt 2 with 5-word combined context yielded the highest test recall (0.7274) and F1 score (0.7339). This indicates that different prompt designs may be better suited to different evaluation criteria, particularly for metaphorical verbs.
These findings validate the hypothesis that contextual information is essential in metaphor detection. Moreover, they highlight the value of generative models like ChatGPT in enriching inputs with semantically relevant context, thereby boosting task-specific NLP performance through dynamic prompt engineering.
We further compared our model to several representative metaphor detection baselines on the MOH-X dataset. As shown in Table 17, our best configuration (Prompt 1 with 10-word combined context) achieved a strong F1 score of 0.8493 and the highest Recall of 0.8870 among all models. This indicates our model’s capability to capture a broader range of metaphorical expressions without omissions.
In terms of F1 score, Zhang and Liu’s model (2023) [8] achieved the best overall performance (F1 = 0.880, Accuracy = 0.894), followed by Lin et al. (2021) [20] with F1 = 0.852 and Accuracy = 0.857. Our model narrows the performance gap with these state-of-the-art models to approximately 0.03 points in F1.
Compared to strong baselines such as MelBERT [6] (F1 = 0.842) and MisNet [2] (F1 = 0.834), our model demonstrates superior performance. The gap is even wider against MrBERT [21] (F1 = 0.816) and FrameBERT [7] (F1 = 0.827). Notably, the improvement over earlier BERT-based models such as Le et al. (2020) [22] (F1 = 0.796) highlights our model’s continued progress in metaphor detection.
Our model also maintains a strong balance between Precision (0.8205) and Recall (0.8870), effectively mitigating false positives while achieving high coverage. This results in consistent detection performance, avoiding the skewed tendencies of high-precision or high-recall-focused models.
Importantly, unlike many prior approaches, our method achieves this performance without relying on explicit syntactic knowledge, external resources, or handcrafted modules. By leveraging only prompt-based contextualization, we showcase a high degree of generalizability and flexibility—making our approach a promising and lightweight direction for future metaphor detection systems.
Table 17. Comparison of Metaphor Detection Models on MOH-X (Full Metrics). Metrics for Lin et al. [20], Zhang and Liu [8], and Le et al. [22] are sourced from [23].
Model | Accuracy | Precision | Recall | F1
MelBERT [6] | 0.847 | 0.850 | 0.8350 | 0.842
FrameBERT [7] | 0.8286 | 0.819 | 0.8445 | 0.827
MrBERT [21] | 0.820 | 0.813 | 0.8150 | 0.816
MisNet [2] | 0.836 | 0.842 | 0.8400 | 0.834
Ours (P1_10 Combined) | 0.8453 | 0.8205 | 0.8870 | 0.8493
Lin et al., 2021 [20] | 0.857 | 0.866 | 0.847 | 0.852
Zhang and Liu, 2023 [8] | 0.894 | 0.882 | 0.879 | 0.880
Le et al., 2020 [22] | 0.789 | 0.788 | 0.805 | 0.796

5.2. Evaluation of Superiority Using t-Test

To statistically assess the effectiveness of adding auxiliary context, we conducted two-tailed t-tests. For each dataset, predicted probability scores from MisNet (as defined in Figure 4) were averaged per sample. A mean value below 0.5 was interpreted as class 0 (literal), and 0.5 or above as class 1 (metaphorical).
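A minimal sketch of this procedure is given below, assuming SciPy's paired two-tailed t-test over per-sample probabilities from the original and context-augmented runs; pairing over identical test instances is an assumption of this illustration.

```python
# Sketch of the significance test: per-sample mean MisNet probabilities from
# the original and the context-augmented runs are compared with a two-tailed
# paired t-test (SciPy). Pairing over identical test instances is assumed.
import numpy as np
from scipy.stats import ttest_rel

def compare_runs(probs_original: np.ndarray, probs_augmented: np.ndarray):
    """Two-tailed paired t-test over per-sample mean probabilities from two runs."""
    # thresholding at 0.5 yields the hard class decisions described above
    preds_original = (probs_original >= 0.5).astype(int)
    preds_augmented = (probs_augmented >= 0.5).astype(int)
    t_stat, p_value = ttest_rel(probs_original, probs_augmented)  # two-sided by default
    return preds_original, preds_augmented, t_stat, p_value
```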

5.2.1. Prior Context

Table 18, Table 19, Table 20, Table 21 and Table 22 show p-values for comparisons between original and context-augmented data, where auxiliary sentences were added before the target sentence.
In the MOH-X dataset (Table 18), statistically significant differences at the 10% level were observed in two conditions: (1) Prompt 1 with 10 words, and (2) Prompt 2 with 10 words. Most other configurations yielded p-values ranging from 0.2 to 0.9, indicating no significant difference.
In the VUA_All dataset, both the validation set (Table 19) and the test set (Table 20) showed statistically significant differences. Specifically, Prompt 1 with 10 words and Prompt 2 with both 5-word and 10-word additions resulted in significance at the 5% level or better.
In the VUA_Verb dataset, the validation results (Table 21) and test results (Table 22) show that nearly all prompt conditions demonstrated significance below the 5% level, with many achieving p-values under 1%, highlighting the dataset’s high sensitivity to contextual augmentation.

5.2.2. Following Context

As shown in Table 23, no condition in the MOH-X dataset reached statistical significance at the 5% level. The lowest p-value was 0.0768 (Prompt 2 with 10 words), marginally significant at the 10% level. This suggests MOH-X is less influenced by following context alone.
In the validation set of VUA_All (Table 24), the original data exhibited extremely low p-values (1.03 × 10⁻²²), with Prompt 1 (10 words) and Prompt 2 (5 words) also reaching statistical significance at the 5–10% levels.
The test set of VUA_All (Table 25) similarly showed extremely low p-values in the original data (3.36 × 10⁻¹³). Prompt 1 with 5 words reached the 1% level, while 10-word additions yielded mixed results.
VUA_Verb’s validation set (Table 26) showed moderate to strong significance in multiple conditions. The original data had a p-value of 0.00091, with Prompt 1 (10 words) and Prompt 2 (5 words) also showing significance.
In the test set of VUA_Verb (Table 27), the original data ( p = 0.00045 ) and Prompt 1 with 5 words ( p = 0.0048 ) met the 1% threshold, while Prompt 2 (5 words) and Prompt 1 (10 words) reached the 5% level.

5.2.3. Prior and Following Context

Table 28 presents results for the MOH-X dataset, where no configuration reached the 5% significance threshold. The lowest p-value was 0.0567 for the original data, which is marginally significant at the 10% level.
In VUA_All (Validation) (Table 29), all conditions, including the original data, yielded statistically significant results at the 1% level. Additionally, Prompt 1 (10 words) and Prompt 2 (10 words) approached or met the 5% threshold.
The VUA_All (Test) set (Table 30) also showed extremely low p-values for the original condition (3.18 × 10⁻³¹). Prompt 1 with 5 words was marginally significant (p = 0.0504), while Prompt 2 with 5 words demonstrated stronger significance (p = 0.00095).
In the VUA_Verb (Validation) set (Table 31), the original data reached the 10% level ( p = 0.0750 ). Prompt 1 (10 words) showed 5% level significance, and Prompt 2 (10 words) also reached significance.
Finally, for VUA_Verb (Test) (Table 32), both the original data ( p = 0.0003 ) and multiple prompt configurations (e.g., Prompt 1 with 5 words: p = 0.0050 ) showed strong statistical significance under the 1% threshold.

5.2.4. Discussion

The t-test results reveal several important patterns. First, the MOH-X dataset showed clear sensitivity to prior context—especially under 10-word prompts—but remained largely unaffected by following context. This suggests that short, metaphor-rich expressions benefit more from preceding contextual cues than subsequent ones.
Second, VUA_All and VUA_Verb consistently exhibited statistical significance across nearly all conditions. This indicates that even relatively long and diverse sentences in these datasets gain interpretive clarity from auxiliary context—regardless of position.
Third, the strongest statistical significance was often found in configurations combining both prior and following context, though their advantage over prior-only configurations was not always dramatic. This implies that while dual-context strategies offer robustness, well-crafted prior context alone may be sufficient in many cases.
Finally, the effectiveness of context addition appears to be dataset-dependent. MOH-X favors semantic priming through earlier context, whereas VUA_Verb responds broadly to both directions. These differences underscore the importance of tailoring context augmentation strategies to dataset characteristics, metaphor types, and sentence structure.

6. Error and Failure-Case Analysis

While auxiliary context generated by ChatGPT significantly improves metaphor detection in many cases, it also introduces specific failure modes. We conducted a manual analysis of 300 instances sampled per dataset (MOH-X, VUA_All, and VUA_Verb) from the augmented outputs to categorize and quantify these error types. Table 33 summarizes their distributions and illustrative patterns.

Insights and Implications

Among the most frequent failure types was semantic drift, particularly prevalent in VUA_All (18.0%), where generated context diverged into unrelated topics. Pronoun coreference failure was also notably frequent in VUA_All and VUA_Verb due to abstract pronouns lacking antecedents. Although polarity flips and named entity over-extension occurred less frequently, even minor misalignments in metaphor labels impacted precision.
We recommend future work consider stricter lexical constraints in prompts or post-hoc filtering to reduce such errors. Appendix A includes additional examples across all datasets.

7. Analysis: Why Limited Gains on VUA_All/VUA_Verb

Although auxiliary context improved MOH-X markedly, gains on VUA_All and VUA_Verb were modest. We analyze four interacting factors:
  • Sentence length and information sufficiency: Many VUA sentences already contain rich discourse context; short synthetic additions yield diminishing returns.
  • POS diversity and non-verb targets (VUA_All only): Non-verb metaphors may not benefit from short preceding clauses tuned to verbs.
  • Metaphor sparsity: Low positive rate (11%) weakens the supervised signal; auxiliary sentences may bias class priors.
  • Domain shift and register mismatch: LLM-generated English tends toward contemporary, news-like style; VUA derives from educational broadcasts and transcribed speech, creating lexical mismatch.
We empirically probe (1)–(3) in Appendix B with stratified re-analysis by length, POS, and label.

8. Semantic Fidelity of Generated Context

We empirically evaluated whether the auxiliary sentences generated by ChatGPT adhered to the intended metaphorical or literal usage of target expressions. To do so, we analyzed a combined sample of 8300 instances drawn from the VUA_All training and validation sets, categorized by their original metaphor labels.
Specifically, we verified whether the generated auxiliary sentence contained the target word and matched the expected metaphor label. The fidelity criterion was binary: a sentence was judged faithful if the target word was present and its use conformed to the original label. Errors included substitution with synonyms, omission of the target word, and metaphor-label contradiction (e.g., polarity flip; see Section 6).
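The lexical half of this criterion can be automated as in the sketch below (label conformity was judged manually); lowercased token matching is a simplifying assumption and does not cover all inflected forms.

```python
# Sketch of the lexical half of the fidelity criterion: the target word must
# appear in the generated sentence. Label conformity (metaphorical vs. literal
# use) was judged manually; simple lowercased token matching is an assumption.
import re

def contains_target(generated: str, target_word: str) -> bool:
    """Lexical check: is the target word present verbatim in the generated sentence?"""
    tokens = re.findall(r"[a-z']+", generated.lower())
    return target_word.lower() in tokens
```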
The results are summarized in Table 34.
Overall, literal instances exhibited high semantic fidelity (94.2%), while metaphorical cases showed slightly lower adherence (84.0%). This gap aligns with prior qualitative observations that metaphor prompts are more prone to generation failures, such as omitted or misused metaphor targets.
These findings confirm that while ChatGPT can produce label-consistent auxiliary context in most cases, additional controls (e.g., lexical constraints or prompt tuning) may be necessary to reduce subtle mismatches in figurative tasks.

9. Cost and Resource Analysis

This section reports the computational and monetary cost of dataset augmentation and model training.

9.1. API Usage and Estimated Cost

Table 35 summarizes the number of instances, average tokens per request, total token volume, estimated cost, and wall-clock processing time for each dataset. Token counts include both prompts and completions. Costs are estimated using contemporaneous OpenAI pricing: $0.001 per 1K prompt tokens and $0.002 per 1K completion tokens.
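The arithmetic behind these estimates is straightforward, as in the sketch below; the token counts in the example are placeholders, not figures from Table 35.

```python
# Arithmetic behind the cost estimates: prompt and completion tokens are
# priced separately at the rates quoted above.
PROMPT_PRICE_PER_1K = 0.001      # USD per 1K prompt tokens
COMPLETION_PRICE_PER_1K = 0.002  # USD per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost for a given prompt/completion token volume."""
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
        + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

# Illustrative figures only (not taken from Table 35):
# 60M prompt tokens and 10M completion tokens -> 60.0 + 20.0 = 80.0 USD
print(f"${estimate_cost(60_000_000, 10_000_000):.2f}")
```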

9.2. Processing Latency Statistics

To assess runtime behavior, we recorded processing latency per instance for each prompt configuration over the VUA_All dataset. Table 36 and Table 37 report mean, standard deviation, and min/max values (in seconds) for the train and validation splits, respectively.
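These summary statistics follow the usual descriptive definitions, as in the sketch below, assuming per-instance latencies recorded in seconds.

```python
# Minimal sketch of the latency summary: mean, standard deviation, min, and
# max per prompt configuration, assuming latencies are recorded in seconds.
import statistics

def summarize_latencies(latencies: list[float]) -> dict:
    return {
        "mean": statistics.mean(latencies),
        "std":  statistics.stdev(latencies),
        "min":  min(latencies),
        "max":  max(latencies),
    }
```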

9.3. Discussion

Latency distributions reveal that most completions were returned in under one second, with average latencies between 0.5 and 0.8 s per instance. However, specific prompt types (notably 1p-10b) exhibit long-tail delays, occasionally exceeding 10 s. These outliers are likely caused by transient API congestion or internal model variance.
Overall, our findings suggest that large-scale dataset augmentation using commercial LLM APIs is practical and affordable. With batching and parallelization, over 200 K examples were processed within a single day, at a cost below $175.

10. General Discussion

This section synthesizes findings from the main experiments, ablation studies, and error analyses, to evaluate how auxiliary context affects metaphor detection across MOH-X, VUA_All, and VUA_Verb.

10.1. Impact of Auxiliary Context Position

  • Prior Context
Prepending auxiliary sentences consistently improved performance. MOH-X showed the largest gain (F1 = 0.8493, Precision = 0.8601 under Prompt 2 with 10 words). Two-tailed t-tests confirmed statistical significance at the 0.1% level. This supports the hypothesis that semantically rich prior context is especially helpful for disambiguating short metaphorical expressions.
  • Following Context
Appending context mainly improved recall (e.g., Recall = 0.8716 on MOH-X). The VUA datasets showed more moderate effects. These findings suggest that metaphor comprehension may rely more heavily on antecedent clues than on following context.
  • Combined Context
Combining prior and following sentences produced the most balanced and strongest results across all datasets. These significant improvements (with medium-to-large effect sizes) suggest that adopting adaptive context configurations would be beneficial.
Further analysis of auxiliary context length (5 vs. 10 tokens) is presented in Appendix C. Results show that while 10-token contexts generally yielded higher recall, 5-token contexts offered comparable performance with reduced generation cost and length. This supports the feasibility of shorter context windows in resource-constrained settings.
Additional empirical comparisons between Prompt 1 and Prompt 2, including their interactions with context position, are presented in Appendix D.

10.2. Dataset-Specific Observations

  • MOH-X: Short, isolated sentences with high metaphor density benefited most from context augmentation, reinforcing our core hypothesis.
  • VUA_All: Rich discourse context and genre diversity limited the marginal gains of LLM-generated additions.
  • VUA_Verb: Verb-centered focus retained context sensitivity, suggesting syntactic function modulates context utility.

10.3. Limitations of Prompt-Based Generation

Manual inspection of 900 samples revealed common failure modes:
  • Semantic drift (12–18%): Off-topic completions (e.g., “quantum computers”).
  • Polarity flips (4–6%): Violated metaphor/literal label.
  • Redundancy (3–9%): Minimal variation from original sentence.
  • Coreference failure (up to 11%): Unresolvable pronouns like “she.”
  • Over-long entities (2–5%): Truncated named entities.
Despite these, label fidelity was above 90% for literal prompts and around 82% for metaphor, with inter-annotator agreement κ = 0.74 .

10.4. Prompt Design Tradeoffs

Prompt 2, which includes exemplar conditioning, improved precision (+1.1 to +1.8 points) but increased generation length by 18%. This highlights the tradeoff between generation quality and computational cost.

11. Conclusions and Future Work

This study proposed a context-enrichment approach for metaphor detection by prepending, appending, or surrounding input sentences with auxiliary context generated by ChatGPT. The method was evaluated on three benchmark datasets—MOH-X, VUA_All, and VUA_Verb—under three configurations: prior context, following context, and combined context.
The addition of auxiliary sentences led to consistent improvements in accuracy, precision, recall, and F1 across all datasets. The most significant gains were observed in MOH-X, a dataset composed of short, metaphor-dense sentences, where prior context achieved an F1 score of 0.8493 and precision of 0.8601. Enriching the input with either preceding or following context also yielded measurable improvements on VUA_All and VUA_Verb, particularly in recall and F1.
Two-tailed paired t-tests confirmed the statistical significance of these gains. In particular, the inclusion of 10-word auxiliary context yielded strong effects at the p < 0.001 or p < 0.01 levels across several configurations. These findings underscore not only the importance of contextual information, but also the role of its positioning in improving metaphor comprehension.

Future Directions

Building on these findings, we identify several promising avenues for future research:
  • Context Token Optimization: Dynamically allocate token budgets depending on sentence ambiguity or metaphor density, minimizing redundancy while maximizing effectiveness.
  • Adaptive Context Configuration: Learn where and when (prior, following, or both) to insert auxiliary context based on syntactic roles, metaphor types, or discourse structure.
  • Label-Aware Paraphrase Augmentation: Enhance diversity and supervision by combining auxiliary context with metaphorical and literal paraphrases [16].
  • Cross-Figurative Transfer: Investigate unified training frameworks for metaphor, idiom, simile, and related figurative phenomena via joint corpora [13,14,15,19].
  • Multilingual Generalization: Extend auxiliary context strategies to non-English datasets (e.g., Japanese, Chinese) to evaluate linguistic and cultural variability in metaphor expression.
  • Cost-Sensitive Generation: Reduce inference cost by leveraging instruction-tuned or distilled open-source LLMs [12], enabling scalable metaphor-aware augmentation.
  • Constrained Decoding: Ensure target-word presence during generation by applying lexical constraints at decoding time.
  • Few-Shot Prompt Design: Explore the integration of few-shot exemplars in prompts to improve alignment with metaphor detection objectives.
  • Style Adaptation: Improve domain consistency by adapting generation styles to match the linguistic characteristics of specific corpora.
Overall, this work demonstrates the potential of LLM-assisted auxiliary context generation as a flexible and semantically aligned strategy for enhancing metaphor detection. Future studies that tailor generation strategies to task-specific structures and computational constraints may further improve both interpretability and performance.

Author Contributions

T.H. constructed the dataset and conducted the experiments. M.S. performed the analysis and supervised the project. Both authors contributed to the conceptualization and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant number 25K15242.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The supplementary contextual information and the extended dataset used in this study are publicly available on GitHub. The data can be accessed via https://github.com/TakuyaHayashi3204/BDCC-Added_Dataset.git (accessed on 24 July 2025).

Acknowledgments

The authors would like to thank those who provided support that is not covered under author contributions or funding. This includes administrative and technical assistance, or donations in kind, such as materials used in experiments.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AI  Artificial Intelligence
NLP  Natural Language Processing
LLM  Large Language Model
GPT  Generative Pre-trained Transformer
ChatGPT  Chat Generative Pre-trained Transformer
BERT  Bidirectional Encoder Representations from Transformers
RoBERTa  A Robustly Optimized BERT Pretraining Approach
RNN  Recurrent Neural Network
POS  Part of Speech
MOH-X  Extension of the MOH dataset (by Mohammad et al.) for metaphor detection
VUA  VU Amsterdam Metaphor Corpus
MIP  Metaphor Identification Procedure
SPV  Selectional Preference Violation
MisNet  Metaphor Identification from a Siamese Network
P1  Prompt Type 1 (direct generation)
P2  Prompt Type 2 (example-based generation)
Acc  Accuracy
Prec  Precision
Rec  Recall
F1  F1 Score
GPU  Graphics Processing Unit
API  Application Programming Interface

Appendix A. Additional Failure Examples

In addition to the aggregate error statistics discussed in Section 6, Table A1 presents concrete examples of auxiliary sentence generation failures across the MOH-X, VUA_All, and VUA_Verb datasets. These instances illustrate characteristic weaknesses such as semantic drift, target-word omission, and misalignment with the metaphor label. Each row shows the original sentence, the target word, the gold label, and the generated auxiliary sentence.
Table A1. Examples of failure cases in auxiliary context generation. "Metaphor" indicates figurative usage, and "Literal" indicates non-figurative usage.
Dataset | Original Sentence | Target Word | Label | Generated Auxiliary Sentence
MOH-X | He swallowed his words. | swallowed | Metaphor | Despite his certainty, doubt engulfed him and made him silent.
MOH-X | Acknowledge the deed. | acknowledge | Literal | She refused to acknowledge the deed of kindness done for her.
MOH-X | The vessel was shipwrecked. | shipwrecked | Literal | The storm caused extensive damage, leaving the vessel stranded.
VUA_All | It's hard to communicate a disease. | communicate | Literal | His words carried the contagion of doubt, spreading fast.
VUA_Verb | Come on sweetheart! | come | Metaphor | Whispered winds beckon, come on sweetheart!
VUA_Verb | Just trying to | try | Literal | Just trying to learn something new.
While some generated sentences are fluent and semantically rich, others misrepresent the intended usage of the target word or stray from the lexical or contextual constraints defined in the prompt. As detailed in Section 6, these artifacts can reduce classifier performance by introducing noise, ambiguity, or misleading cues. Filtering heuristics or post-generation validation may help mitigate such failures in future systems.

Appendix B. VUA Stratified Analyses

To investigate why auxiliary context yielded limited gains on the VUA_All dataset, we performed stratified analyses based on POS category and sentence length.

Appendix B.1. POS Category Analysis

Table A2 reports metaphor prevalence across parts of speech (POS) in VUA_All. Verbs, as expected, exhibit the highest frequency of metaphoric usage, with 2907 metaphoric instances out of 18,529 total verb tokens (≈15.7%). Nouns and adjectives also showed considerable metaphor density. In contrast, PUNCT, X, and SYM categories had negligible or no metaphors. This confirms that verbs and content words are more informative targets for metaphor detection.
Table A2. Metaphor prevalence by POS category in VUA_All dataset.
POS Tag | Literal | Metaphor
ADJ | 6308 | 896
ADP | 7433 | 2159
ADV | 5728 | 569
CCONJ | 2779 | 156
DET | 6714 | 838
INTJ | 745 | 49
NOUN | 13,705 | 2080
NUM | 975 | 61
PART | 2439 | 293
PRON | 7112 | 454
PROPN | 3392 | 262
PUNCT | 5081 | 3
SYM | 3 | 0
VERB | 15,622 | 2907
X | 8 | 0
Table A3. Evaluation results by POS category (rounded to 3 decimal places).
POS | Accuracy | Precision | Recall | F1-Score
ADP | 0.898 | 0.822 | 0.803 | 0.806
DET | 0.954 | 0.780 | 0.824 | 0.798
VERB | 0.902 | 0.721 | 0.743 | 0.730
NOUN | 0.909 | 0.716 | 0.623 | 0.662
ADV | 0.948 | 0.690 | 0.600 | 0.637
PART | 0.922 | 0.600 | 0.650 | 0.619
ADJ | 0.903 | 0.626 | 0.579 | 0.596
PUNCT | 0.982 | 0.702 | 0.593 | 0.567
PROPN | 0.968 | 0.455 | 0.272 | 0.318
INTJ | 0.988 | 0.107 | 0.054 | 0.071
PRON | 0.988 | 0.036 | 0.023 | 0.011
NUM | 0.984 | 0.000 | 0.036 | 0.001
CCONJ | 0.981 | 0.000 | 0.000 | 0.000
SYM | 0.964 | 0.000 | 0.000 | 0.000
X | 0.990 | 0.000 | 0.000 | 0.000
Table A4. Combined evaluation results for Prior-test, Following-test, and Both-test (Prior and Following) input conditions across POS categories (rounded to 3 significant digits).
Test Type | Verb (Acc. / Pre. / Rec. / F1) | Adj. (Acc. / Pre. / Rec. / F1) | Adv. (Acc. / Pre. / Rec. / F1) | Noun (Acc. / Pre. / Rec. / F1)
Prior 1p5words | 0.911 / 0.756 / 0.787 / 0.772 | 0.912 / 0.701 / 0.625 / 0.661 | 0.966 / 0.814 / 0.680 / 0.741 | 0.921 / 0.782 / 0.654 / 0.712
Prior 1p10words | 0.910 / 0.749 / 0.793 / 0.770 | 0.913 / 0.694 / 0.654 / 0.674 | 0.963 / 0.778 / 0.676 / 0.724 | 0.921 / 0.769 / 0.675 / 0.719
Prior 2p5words | 0.909 / 0.752 / 0.786 / 0.768 | 0.917 / 0.713 / 0.651 / 0.680 | 0.965 / 0.798 / 0.680 / 0.735 | 0.918 / 0.758 / 0.668 / 0.711
Prior 2p10words | 0.911 / 0.748 / 0.802 / 0.774 | 0.914 / 0.698 / 0.649 / 0.672 | 0.964 / 0.804 / 0.656 / 0.722 | 0.920 / 0.770 / 0.664 / 0.713
Following 1p5words | 0.884 / 0.694 / 0.720 / 0.706 | 0.844 / 0.483 / 0.465 / 0.470 | 0.912 / 0.502 / 0.465 / 0.474 | 0.860 / 0.569 / 0.523 / 0.542
Following 1p10words | 0.911 / 0.778 / 0.745 / 0.761 | 0.916 / 0.725 / 0.614 / 0.665 | 0.965 / 0.814 / 0.664 / 0.731 | 0.917 / 0.778 / 0.628 / 0.695
Following 2p5words | 0.910 / 0.771 / 0.750 / 0.760 | 0.915 / 0.724 / 0.603 / 0.658 | 0.965 / 0.825 / 0.656 / 0.731 | 0.919 / 0.784 / 0.637 / 0.703
Following 2p10words | 0.908 / 0.769 / 0.743 / 0.756 | 0.912 / 0.708 / 0.606 / 0.653 | 0.966 / 0.820 / 0.672 / 0.739 | 0.920 / 0.781 / 0.650 / 0.710
Both 1p5words | 0.912 / 0.770 / 0.772 / 0.771 | 0.914 / 0.715 / 0.617 / 0.663 | 0.964 / 0.805 / 0.660 / 0.725 | 0.920 / 0.791 / 0.634 / 0.704
Both 1p10words | 0.912 / 0.774 / 0.765 / 0.769 | 0.915 / 0.707 / 0.638 / 0.671 | 0.963 / 0.796 / 0.656 / 0.719 | 0.917 / 0.764 / 0.647 / 0.701
Both 2p10words | 0.912 / 0.777 / 0.757 / 0.767 | 0.914 / 0.716 / 0.612 / 0.660 | 0.967 / 0.839 / 0.664 / 0.741 | 0.921 / 0.795 / 0.639 / 0.708
Both 2p5words | 0.912 / 0.777 / 0.761 / 0.769 | 0.915 / 0.716 / 0.621 / 0.665 | 0.964 / 0.815 / 0.652 / 0.724 | 0.919 / 0.783 / 0.641 / 0.705

Appendix B.2. Sentence Length Analysis

We also analyzed metaphor prevalence across different sentence length categories (Table A5). In short sentences (e.g., under 10 tokens), metaphor rates were very low (225 out of 3167), indicating that brevity limits the interpretive space for metaphor recognition. These results highlight the interaction between syntactic richness and the utility of auxiliary context. We recommend tailoring augmentation strategies based on sentence complexity and POS distribution.
Table A5. Metaphor prevalence by sentence length category in VUA_All dataset.
Length Category | Literal | Metaphor | Total | Prevalence (%)
Short | 2942 | 225 | 3167 | 7.1%

Appendix C. Effect of Context Length (5 Words vs. 10 Words)

We further investigated the impact of context length by comparing auxiliary sentences generated with either 5 words or 10 words. Table A6 summarizes the F1 scores across datasets and prompt types.
The results reveal that:
  • Across most settings, 5-word contexts achieve comparable or even superior performance to 10-word ones.
  • This trend is particularly notable on VUA_All and VUA_Verb, where 5-word contexts yielded higher F1 scores despite being shorter and less expensive to generate.
  • On MOH-X, 10-word prompts sometimes offer slight advantages (e.g., higher recall), but the difference in F1 is marginal.
These findings suggest that shorter contexts (5 words) can provide a good balance between performance and generation efficiency, making them a practical choice for large-scale applications.
Table A6. F1 scores by context length (5 vs. 10 words) across datasets and prompts.
Dataset | Context | Prompt | Length | F1 Score
MOH-X | Prior | P1 | 5 words | 0.8310
MOH-X | Prior | P1 | 10 words | 0.8359
MOH-X | Prior + Following | P1 | 5 words | 0.8444
MOH-X | Prior + Following | P1 | 10 words | 0.8493
VUA_All | Prior | P1 | 5 words | 0.7572
VUA_All | Prior | P1 | 10 words | 0.7571
VUA_All | Prior + Following | P1 | 5 words | 0.7871
VUA_All | Prior + Following | P1 | 10 words | 0.7819
VUA_Verb | Prior | P1 | 5 words | 0.6945
VUA_Verb | Prior | P1 | 10 words | 0.6849
VUA_Verb | Prior + Following | P1 | 5 words | 0.7532
VUA_Verb | Prior + Following | P1 | 10 words | 0.7633

Appendix D. Comparison of Prompt Strategies (P1 vs. P2)

We compared two prompt strategies used to generate auxiliary context: Prompt 1 (P1) represents a zero-shot setup with a simple instruction, whereas Prompt 2 (P2) is a few-shot prompt that includes examples to guide generation.
Table A7 shows the F1 scores across three datasets (MOH-X, VUA_All, VUA_Verb) under different context settings (Prior, Prior+Following). As the results indicate, the effectiveness of each prompt depends on dataset characteristics and context configuration.
  • On MOH-X, which consists of short and semantically ambiguous examples, P2 (few-shot) consistently outperforms P1 across all settings.
  • On VUA_All and VUA_Verb, P1 (zero-shot) shows more stable and often superior performance, especially under Prior+Following context.
  • These trends suggest that few-shot prompting is more effective in tightly scoped tasks with high ambiguity (e.g., MOH-X), while zero-shot prompting generalizes better in longer, syntactically diverse input (e.g., VUA datasets).
Table A7. F1 scores by prompt type (P1: zero-shot, P2: few-shot) across datasets and context settings.
Dataset | Context | Prompt | F1 Score
MOH-X | Prior | P1 (zero-shot) | 0.8310
MOH-X | Prior | P2 (few-shot) | 0.8487
MOH-X | Prior + Following | P1 (zero-shot) | 0.8444
MOH-X | Prior + Following | P2 (few-shot) | 0.8460
VUA_All | Prior | P1 (zero-shot) | 0.7572
VUA_All | Prior | P2 (few-shot) | 0.7596
VUA_All | Prior + Following | P1 (zero-shot) | 0.7871
VUA_All | Prior + Following | P2 (few-shot) | 0.7837
VUA_Verb | Prior | P1 (zero-shot) | 0.6945
VUA_Verb | Prior | P2 (few-shot) | 0.6965
VUA_Verb | Prior + Following | P1 (zero-shot) | 0.7633
VUA_Verb | Prior + Following | P2 (few-shot) | 0.7556
Overall, the findings underscore the importance of tailoring prompt strategies to the nature of the task. While few-shot prompting can help in precision-critical scenarios, zero-shot prompting often provides a more generalizable and cost-effective solution.

References

  1. Lakoff, G.; Johnson, M. Metaphors We Live By, with a New Afterword; University of Chicago Press: Chicago, IL, USA, 2003. [Google Scholar]
  2. Zhang, S.; Liu, Y. Metaphor Detection via Linguistics Enhanced Siamese Network. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4149–4159. Available online: https://aclanthology.org/2022.coling-1.364/ (accessed on 23 July 2025).
  3. Hayashi, T.; Sasaki, M. Metaphor Detection with Additional Auxiliary Context. In Proceedings of the 2024 16th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI 2024), Takamatsu, Japan, 6–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 121–126. [Google Scholar] [CrossRef]
  4. Pragglejaz Group. MIP: A Method for Identifying Metaphorically Used Words in Discourse. Metaphor. Symb. 2007, 22, 1–39. [Google Scholar] [CrossRef]
  5. Wilks, Y. Making Preferences More Active. Artif. Intell. 1978, 11, 197–223. [Google Scholar] [CrossRef]
  6. Choi, M.; Lee, S.; Choi, E.; Park, H.; Lee, J.; Lee, D.; Lee, J. MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1763–1773. [Google Scholar]
  7. Li, Y.; Wang, S.; Lin, C.; Guerin, F.; Barrault, L. FrameBERT: Conceptual Metaphor Detection with Frame Embedding Learning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), Dubrovnik, Croatia, 2–6 May 2023; pp. 1558–1563. [Google Scholar] [CrossRef]
  8. Zhang, S.; Liu, Y. Adversarial Multi-task Learning for End-to-end Metaphor Detection. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 1483–1497. [Google Scholar] [CrossRef]
  9. Jia, K.; Li, R. Enhancing Metaphor Detection through Soft Labels and Target Word Prediction. arXiv 2024, arXiv:2403.18253. Available online: https://arxiv.org/abs/2403.18253 (accessed on 9 February 2025). [CrossRef]
  10. Elzohbi, M.; Zhao, R. ContrastWSD: Enhancing Metaphor Detection with Word Sense Disambiguation Following the Metaphor Identification Procedure. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; ELRA and ICCL: Torino, Italy, 2024; pp. 3907–3915. Available online: https://aclanthology.org/2024.lrec-main.346/ (accessed on 23 July 2025).
  11. Jia, K.; Wu, Y.; Liu, M.; Li, R. Curriculum-style Data Augmentation for LLM-based Metaphor Detection. arXiv 2024, arXiv:2412.02956. Available online: https://arxiv.org/abs/2412.02956 (accessed on 23 July 2025).
  12. Liu, H.; He, C.; Meng, F.; Niu, C.; Jia, Y. LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification. arXiv 2024, arXiv:2408.05404. Available online: https://arxiv.org/abs/2408.05404 (accessed on 23 July 2025).
  13. De Luca Fornaciari, F.; Altuna, B.; González-Dios, I.; Melero, M. A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models. In Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024), Mexico City, Mexico, 21 June 2024; Ghosh, D., Muresan, S., Feldman, A., Chakrabarty, T., Liu, E., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 35–44. [Google Scholar] [CrossRef]
  14. Liu, E.; Cui, C.; Zheng, K.; Neubig, G. Testing the Ability of Language Models to Interpret Figurative Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2022), Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4437–4452. [Google Scholar] [CrossRef]
  15. Chakrabarty, T.; Choi, Y.; Shwartz, V. It’s Not Rocket Science: Interpreting Figurative Language in Narratives. Trans. Assoc. Comput. Linguist. 2022, 10, 589–606. [Google Scholar] [CrossRef]
  16. Bollegala, D.; Shutova, E. Metaphor Interpretation Using Paraphrases Extracted from the Web. PLoS ONE 2013, 8, e74304. [Google Scholar] [CrossRef] [PubMed]
  17. Mohammad, S.; Shutova, E.; Turney, P. Metaphor as a Medium for Emotion: An Empirical Study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016), Berlin, Germany, 11–12 August 2016; pp. 23–33. [Google Scholar]
  18. Steen, G.J.; Dorst, A.G.; Herrmann, J.B.; Kaal, A.A.; Krennmayr, T.; Pasma, T. A Method for Linguistic Metaphor Identification: From MIP to MIPVU; John Benjamins Publishing: Amsterdam, The Netherlands, 2010. [Google Scholar]
  19. Oh, S.; Huang, X.; Pink, M.; Hahn, M.; Demberg, V. A Tug-of-war between an Idiom’s Figurative and Literal Meanings in Large Language Models. arXiv 2025, arXiv:2506.01723. Available online: https://arxiv.org/abs/2506.01723 (accessed on 23 July 2025).
  20. Lin, Z.; Ma, Q.; Yan, J.; Chen, J. CATE: A Contrastive Pre-trained Model for Metaphor Detection with Semi-supervised Learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3888–3898. Available online: https://aclanthology.org/2021.emnlp-main.316/ (accessed on 23 July 2025).
  21. Song, W.; Zhou, S.; Fu, R.; Liu, T.; Liu, L. Verb Metaphor Detection via Contextual Relation Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4240–4251. [Google Scholar]
  22. Le, D.; Thai, M.; Nguyen, T. Multi-task Learning for Metaphor Detection with Graph Convolutional Neural Networks and Word Sense Disambiguation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8139–8146. [Google Scholar] [CrossRef]
  23. Yang, C.; Li, Z.; Liu, Z.; Huang, Q. Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review. arXiv 2023, arXiv:2308.04306. Available online: https://arxiv.org/abs/2308.04306 (accessed on 23 July 2025).
Figure 1. Flowchart for Creating a Context-Enriched Dataset.
Figure 2. Flowcharts for two auxiliary context strategies: (a) prior context and (b) following context.
Figure 3. Flowchart: Auxiliary Context (Prior + Following).
Figure 4. Example dataset from MisNet (from VUA_All).
Table 1. Computational environment used for auxiliary sentence generation.
Component | Specification
CPU | Intel Core i7-10510U @ 1.80–2.30 GHz
RAM | 8 GB (7.77 GB usable)
GPU | Intel UHD Graphics (unused)
OS | Windows 11 Home 22H2
Python | 3.8.8
OpenAI SDK | 1.2.3
pandas | 2.0.1
Date(s) run | 2025-01-15 to 2025-01-20 (JST)
Table 2. API configuration used for auxiliary sentence generation.
Parameter | Value
Model | gpt-3.5-turbo
Temperature | 1.0
Top-p | 1.0
Top-k | Not applicable (OpenAI API)
Max tokens (response) | 64 (sufficient for 10-word outputs)
Stop sequences | Newline, EOS token
Seed | Not fixed (non-deterministic)
API library | openai v1.2.3 (Python)
Date range of generation | 2025-01-15 to 2025-01-20 (JST)
Hardware | See Table 1
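For reference, the configuration in Tables 1 and 2 corresponds to a request of the following form. This is a minimal sketch using the openai v1.x Python SDK listed in Table 2; the single-user-message layout is an assumption, and the prompt string itself is assembled from the templates in Table 3.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_auxiliary_sentence(prompt: str) -> str:
    """Request one auxiliary sentence with the generation settings from Table 2."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,   # Table 2
        top_p=1.0,         # Table 2
        max_tokens=64,     # sufficient for 10-word outputs (Table 2)
        stop=["\n"],       # newline stop sequence (Table 2)
    )
    return response.choices[0].message.content.strip()
```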
Table 3. Complete prompt templates and instantiations used to elicit auxiliary context from ChatGPT. Variables: N (target word budget), pos ∈ {precede, follow}, s (target sentence), w (target word), m ∈ {ε, not} encodes the metaphor label, where ε denotes the empty string. Examples are shown for both metaphorical and literal cases. Line breaks are added here for readability; actual API calls used single strings.
Prompt 1 Template:
Generate a N-word sentence that pos “s” and in which ‘w’ in “s” is m used as a metaphor.
Prompt 2 Template:
‘w’ in “Ex1” is used as a metaphor. ‘w’ in “Ex2” is not used as a metaphor. Generate a N-word sentence that pos “s” and in which ‘w’ in “s” is m used as a metaphor.
Metaphor case example (N = 5, pos = precede, w = sharp, s = “His words were sharp.”):
Prompt 1: Generate a 5-word sentence that precedes “His words were sharp.” and in which ‘sharp’ in “His words were sharp.” is used as a metaphor.
Prompt 2: ‘sharp’ in “Her criticism cut deep.” is used as a metaphor. ‘sharp’ in “The knife is sharp.” is not used as a metaphor. Generate a 5-word sentence that precedes “His words were sharp.” and in which ‘sharp’ in “His words were sharp.” is used as a metaphor.
Sample output: Their debate sliced like blades.
Literal case example (N = 5, pos = follow, w = absorb, s = “He will absorb water.”):
Prompt 1: Generate a 5-word sentence that follows “He will absorb water.” and in which ‘absorb’ in “He will absorb water.” is not used as a metaphor.
Prompt 2: ‘absorb’ in “The sponge absorbed juice.” is not used as a metaphor. ‘absorb’ in “She absorbed the culture.” is used as a metaphor. Generate a 5-word sentence that follows “He will absorb water.” and in which ‘absorb’ in “He will absorb water.” is not used as a metaphor.
Sample output: The towel soaked it quickly.
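To make the template variables concrete, the sketch below assembles Prompt 1 and Prompt 2 strings from (N, pos, s, w, m) as laid out in Table 3. The function names and argument layout are illustrative assumptions rather than the authors' implementation; the few-shot example pair (Ex1/Ex2) is supplied by the caller.

```python
def build_prompt_1(n: int, pos: str, s: str, w: str, is_metaphor: bool) -> str:
    """Zero-shot template (Prompt 1) from Table 3; pos is 'precedes' or 'follows'."""
    m = "" if is_metaphor else "not "
    return (f'Generate a {n}-word sentence that {pos} "{s}" '
            f'and in which \'{w}\' in "{s}" is {m}used as a metaphor.')


def build_prompt_2(n: int, pos: str, s: str, w: str, is_metaphor: bool,
                   ex_metaphor: str, ex_literal: str) -> str:
    """Few-shot template (Prompt 2): two label-contrasting demonstrations + Prompt 1."""
    met_demo = f'\'{w}\' in "{ex_metaphor}" is used as a metaphor.'
    lit_demo = f'\'{w}\' in "{ex_literal}" is not used as a metaphor.'
    # In the Table 3 instantiations, the demonstration matching the target label comes first.
    first, second = (met_demo, lit_demo) if is_metaphor else (lit_demo, met_demo)
    return f"{first} {second} " + build_prompt_1(n, pos, s, w, is_metaphor)


# Reproduces the metaphor-case Prompt 1 instantiation shown in Table 3.
print(build_prompt_1(5, "precedes", "His words were sharp.", "sharp", True))
```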
Table 4. Training/evaluation environment.
Component | Specification
CPU | AMD Ryzen Threadripper PRO 3955WX (16C/32T)
RAM | 256 GB DDR4-3200
GPU | NVIDIA RTX A6000 (48 GB)
OS | Ubuntu 20.04 LTS
Python | 3.11.3
PyTorch | 2.0.1
Transformers | 4.29.2
scikit-learn | 1.2.2
Table 5. Fine-tuning hyperparameters for MisNet across different datasets.
Hyperparameter | MOH-X | VUA_All | VUA_Verb
Learning rate (lr) | 3 × 10^-5 | 3 × 10^-5 | 3 × 10^-5
Training epochs | 15 | 15 | 15
Warm-up epochs | 2 | 2 | 2
Batch size (train) | 16 | 64 | 64
Batch size (validation) | 32 | 64 | 32
Class weight | 1 | 5 | 4
first_last_avg | False | True | True
use_pos | False | True | False
max_left_len | 25 | 135 | 140
max_right_len | 70 | 90 | 60
Dropout rate | 0.2 | 0.2 | 0.2
Embedding dimension | 768 | 768 | 768
Number of classes | 2 | 2 | 2
Number of attention heads | 12 | 12 | 12
PLM | roberta-base | roberta-base | roberta-base
use_context | True | True | True
use_eg_sent | True | True | True
cat_method | cat_abs_dot | cat_abs_dot | cat_abs_dot
Table 6. Dataset Statistics (adapted from [2]). #Sent. = number of sentences, #Target = number of target words, %Met. = proportion of metaphors, Avg. Len = average sentence length.
Dataset | #Sent. | #Target | %Met. | Avg. Len
VUA_All (train) | 6323 | 116,622 | 11.19 | 18.4
VUA_All (validation) | 1550 | 38,628 | 11.62 | 24.9
VUA_All (test) | 2694 | 50,175 | 12.44 | 18.6
VUA_Verb (train) | 7479 | 15,516 | 27.90 | 20.2
VUA_Verb (validation) | 1541 | 1724 | 26.91 | 25.0
VUA_Verb (test) | 2694 | 5873 | 29.98 | 18.6
MOH-X | 647 | 647 | 48.69 | 8.0
Table 7. POS Tag List (Source: https://qiita.com/kei_0324/items/400f639b2f185b39a0cf (accessed on 24 July 2025)).
Label | Meaning | Examples
ADJ | Adjective | big, green, incomprehensible
ADP | Adposition | in, to, during
ADV | Adverb | very, well, exactly
CCONJ | Coordinating Conjunction | and, or, but
DET | Determiner | the, a, an, my, your, one, ten
INTJ | Interjection | ouch, bravo, hello
NOUN | Noun | girl, air, beauty
NUM | Numeral | 0, 3.14, one, MMXIV
PART | Particle | ’s, not
PRON | Pronoun | I, you, they, who, everybody
PROPN | Proper noun | Mary, NATO, HBO
PUNCT | Punctuation | ., (), :
SYM | Symbol | %, ©, +, =
VERB | Verb | run, ate, eating
X | Other | Out-of-vocabulary words
Table 8. Evaluation Metrics on MOH-X (Prior Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8372 | 0.8302 | 0.8393 | 0.8310
P1_10words | 0.8423 | 0.8405 | 0.8378 | 0.8359
P2_5words | 0.8547 | 0.8619 | 0.8392 | 0.8487
P2_10words | 0.8500 | 0.8432 | 0.8653 | 0.8498
Table 9. Evaluation Metrics on VUA_All (Validation and Test, Prior Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9417 | 0.7438 | 0.7755 | 0.7571 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9411 | 0.7444 | 0.7763 | 0.7572 | 0.9399 | 0.7680 | 0.7547 | 0.7592
P1_10words | 0.9409 | 0.7384 | 0.7823 | 0.7571 | 0.9403 | 0.7680 | 0.7603 | 0.7615
P2_5words | 0.9419 | 0.7439 | 0.7810 | 0.7596 | 0.9401 | 0.7668 | 0.7591 | 0.7605
P2_10words | 0.9413 | 0.7451 | 0.7808 | 0.7595 | 0.9401 | 0.7695 | 0.7568 | 0.7606
Table 10. Evaluation Metrics on VUA_Verb (Validation and Test, Prior Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8138 | 0.6643 | 0.7523 | 0.6945 | 0.8004 | 0.6832 | 0.7275 | 0.6924
P1_10words | 0.8077 | 0.6587 | 0.7423 | 0.6849 | 0.7957 | 0.6785 | 0.7210 | 0.6859
P2_5words | 0.8103 | 0.6509 | 0.7595 | 0.6917 | 0.8004 | 0.6735 | 0.7426 | 0.6965
P2_10words | 0.8133 | 0.6667 | 0.7479 | 0.6929 | 0.8001 | 0.6887 | 0.7157 | 0.6885
Table 11. Evaluation Metrics on MOH-X (Following Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8469 | 0.8455 | 0.8508 | 0.8429
P1_10words | 0.8315 | 0.8052 | 0.8716 | 0.8335
P2_5words | 0.8376 | 0.8314 | 0.8453 | 0.8340
P2_10words | 0.8333 | 0.8254 | 0.8487 | 0.8333
Table 12. Evaluation Metrics on VUA_All (Validation and Test, Following Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9417 | 0.7438 | 0.7755 | 0.7571 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9511 | 0.8123 | 0.7531 | 0.7815 | 0.9480 | 0.8271 | 0.7357 | 0.7787
P1_10words | 0.9522 | 0.8134 | 0.7638 | 0.7878 | 0.9476 | 0.8231 | 0.7370 | 0.7777
P2_5words | 0.9511 | 0.8060 | 0.7629 | 0.7838 | 0.9474 | 0.8225 | 0.7357 | 0.7767
P2_10words | 0.9511 | 0.8097 | 0.7575 | 0.7827 | 0.9480 | 0.8288 | 0.7338 | 0.7784
Table 13. Evaluation Metrics on VUA_Verb (Validation and Test, Following Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8648 | 0.7705 | 0.7091 | 0.7385 | 0.8442 | 0.7677 | 0.6888 | 0.7261
P1_10words | 0.8718 | 0.7730 | 0.7414 | 0.7569 | 0.8359 | 0.7332 | 0.7115 | 0.7222
P2_5words | 0.8561 | 0.7288 | 0.7414 | 0.7350 | 0.8561 | 0.7288 | 0.7414 | 0.7350
P2_10words | 0.8683 | 0.7926 | 0.6918 | 0.7388 | 0.8423 | 0.8077 | 0.6224 | 0.7030
Table 14. Evaluation Metrics on MOH-X (Prior and Following Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8511 | 0.8549 | 0.8398 | 0.8444
P1_10words | 0.8453 | 0.8205 | 0.8870 | 0.8493
P2_5words | 0.8529 | 0.8601 | 0.8357 | 0.8460
P2_10words | 0.8386 | 0.8248 | 0.8586 | 0.8385
Table 15. Evaluation Metrics on VUA_All (Validation and Test, Prior and Following Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9409 | 0.7719 | 0.7551 | 0.7613 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9517 | 0.8069 | 0.7682 | 0.7871 | 0.9489 | 0.8265 | 0.7455 | 0.7839
P1_10words | 0.9511 | 0.8122 | 0.7537 | 0.7819 | 0.9477 | 0.8271 | 0.7327 | 0.7770
P2_5words | 0.9507 | 0.7998 | 0.7682 | 0.7837 | 0.9479 | 0.8219 | 0.7423 | 0.7801
P2_10words | 0.9512 | 0.8115 | 0.7553 | 0.7824 | 0.9485 | 0.8304 | 0.7370 | 0.7809
Table 16. Evaluation Metrics on VUA_Verb (Validation and Test, Prior and Following Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8666 | 0.7500 | 0.7565 | 0.7532 | 0.8401 | 0.7485 | 0.7030 | 0.7250
P1_10words | 0.8759 | 0.7841 | 0.7435 | 0.7633 | 0.8445 | 0.7657 | 0.6939 | 0.7280
P2_5words | 0.8672 | 0.7484 | 0.7629 | 0.7556 | 0.8418 | 0.7405 | 0.7274 | 0.7339
P2_10words | 0.8625 | 0.7506 | 0.7328 | 0.7415 | 0.8369 | 0.7394 | 0.7041 | 0.7213
Table 18. Two-tailed t-test results (P(T ≤ t)) for prior context on MOH-X. Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.8696 | 0.0687 † | 0.2485 | 0.0613 †
P1_5words | – | 0.7731 | 0.6477 | 0.6748
P1_10words | – | – | 0.2173 | 0.7683
P2_5words | – | – | – | 0.8696
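The pairwise comparisons reported in Tables 18–32 are standard two-tailed t-tests. The sketch below shows one way to obtain P(T ≤ t) two-tail with scipy.stats, assuming paired scores (e.g., per-run F1 values) for the two systems being compared; the exact pairing unit is an assumption, and the input numbers are purely illustrative, not values from this study.

```python
from scipy import stats

def two_tailed_paired_t(scores_a, scores_b):
    """Two-tailed paired t-test; returns the t statistic and the two-tailed p-value."""
    result = stats.ttest_rel(scores_a, scores_b)  # two-sided by default
    return result.statistic, result.pvalue

# Hypothetical per-run F1 scores, for illustration only.
baseline  = [0.827, 0.824, 0.830, 0.826, 0.829]
augmented = [0.836, 0.833, 0.840, 0.835, 0.838]
t_stat, p_value = two_tailed_paired_t(baseline, augmented)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```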
Table 19. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_All (Validation Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.77 × 10^-15 § | 1.03 × 10^-22 § | 4.41 × 10^-16 § | 2.19 × 10^-16 §
P1_5words | – | 0.0386 ‡ | 1.0000 | 0.7897
P1_10words | – | – | 0.0378 ‡ | 0.0700 †
P2_5words | – | – | – | 0.7879
Table 20. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_All (Test Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.36 × 10^-13 § | 6.13 × 10^-23 § | 3.69 × 10^-17 § | 2.07 × 10^-15 §
P1_5words | – | 0.0071 § | 0.2083 | 0.4253
P1_10words | – | – | 0.1779 | 0.0643 †
P2_5words | – | – | – | 0.6544
Table 21. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_Verb (Validation Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.00091 § | 0.0272 ‡ | 0.1980 | 0.0136 ‡
P1_5words | – | 0.2637 | 0.0416 ‡ | 0.4424
P1_10words | – | – | 0.3456 | 0.7631
P2_5words | – | – | – | 0.1798
Table 22. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_Verb (Test Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 5.26 × 10^-5 § | 0.2173 | 0.1036 | 0.00045 §
P1_5words | – | 0.0048 § | 0.0136 ‡ | 0.6896
P1_10words | – | – | 0.7273 | 0.0213 ‡
P2_5words | – | – | – | 0.0395 ‡
Table 23. Two-tailed t-test results (P(T ≤ t)) on MOH-X (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.1318 | 0.3549 | 0.3177 | 0.0768 †
P1_5words | – | 0.5168 | 0.6398 | 0.7633
P1_10words | – | – | 0.8789 | 0.3177
P2_5words | – | – | – | 0.4565
Table 24. Two-tailed t-test results (P(T ≤ t)) on VUA_All Validation Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.77 × 10^-15 § | 1.03 × 10^-22 § | 4.41 × 10^-16 § | 2.19 × 10^-16 §
P1_5words | – | 0.0386 ‡ | 1.0000 | 0.7897
P1_10words | – | – | 0.0378 ‡ | 0.0700 †
P2_5words | – | – | – | 0.7879
Table 25. Two-tailed t-test results (P(T ≤ t)) on VUA_All Test Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.36 × 10^-13 § | 6.13 × 10^-23 § | 3.69 × 10^-17 § | 2.07 × 10^-15 §
P1_5words | – | 0.0071 § | 0.2083 | 0.4253
P1_10words | – | – | 0.1779 | 0.0643 †
P2_5words | – | – | – | 0.6544
Table 26. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Validation Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.00091 § | 0.0272 ‡ | 0.1980 | 0.0136 ‡
P1_5words | – | 0.2637 | 0.0416 ‡ | 0.4424
P1_10words | – | – | 0.3456 | 0.7631
P2_5words | – | – | – | 0.1798
Table 27. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Test Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 5.26 × 10^-5 § | 0.2173 | 0.1036 | 0.00045 §
P1_5words | – | 0.0048 § | 0.0136 ‡ | 0.6896
P1_10words | – | – | 0.7273 | 0.0213 ‡
P2_5words | – | – | – | 0.0395 ‡
Table 28. Two-tailed t-test results (P(T ≤ t)) on MOH-X (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.2766 | 0.0567 † | 0.2485 | 0.4146
P1_5words | – | 0.3308 | 0.8733 | 0.7240
P1_10words | – | – | 0.2892 | 0.2061
P2_5words | – | – | – | 0.7521
Table 29. Two-tailed t-test results (P(T ≤ t)) on VUA_All Validation Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.32 × 10^-16 § | 8.35 × 10^-13 § | 2.21 × 10^-11 § | 1.72 × 10^-19 §
P1_5words | – | 0.2650 | 0.1155 | 0.4032
P1_10words | – | – | 0.6795 | 0.0509 †
P2_5words | – | – | – | 0.0209 ‡
Table 30. Two-tailed t-test results (P(T ≤ t)) on VUA_All Test Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.18 × 10^-31 § | 1.28 × 10^-22 § | 3.37 × 10^-36 § | 1.48 × 10^-43 §
P1_5words | – | 0.0504 † | 0.1953 | 0.0098 §
P1_10words | – | – | 0.0010 § | 5.51 × 10^-6 §
P2_5words | – | – | – | 0.1828
Table 31. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Validation Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.4128 | 0.3589 | 0.1176 | 0.0483 ‡
P1_5words | – | 0.0750 † | 0.0153 ‡ | 0.4369
P1_10words | – | – | 0.4915 | 0.3321
P2_5words | – | – | – | 0.0990 †
Table 32. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Test Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.3920 | 0.0003 § | 3.38 × 10^-7 § | 0.6121
P1_5words | – | 0.0050 § | 8.94 × 10^-6 § | 0.1630
P1_10words | – | – | 0.1603 | 3.65 × 10^-5 §
P2_5words | – | – | – | 3.40 × 10^-8 §
Table 33. Representative failure modes observed in ChatGPT-generated auxiliary context. Manual counts (n = 300 per dataset).
Error Type | MOH-X | VUA_All | VUA_Verb | Illustration
Semantic drift (topic mismatch) | 12.0% | 18.0% | 15.0% | Original: He swallowed his words. Aux: Despite his certainty, doubt engulfed him and made him silent.
Polarity flip (label mismatch) | 4.0% | 6.0% | 5.0% | Original (literal): Acknowledge the deed. Aux: She refused to acknowledge the deed of kindness done for her.
Redundancy/near-duplicate | 9.0% | 3.0% | 5.0% | Original: Just trying to. Aux: Just trying to learn something new.
Pronoun coref failure | 0.0% | 11.0% | 8.0% | Original: Come on sweetheart! Aux: Whispered winds beckon, come on sweetheart!
Over-long named entities | 2.0% | 5.0% | 4.0% | Original: The vessel was shipwrecked. Aux: The storm caused extensive damage, leaving the vessel stranded.
Table 34. Semantic fidelity of generated context across metaphor labels (VUA_All train + validation).
Label | Instances | Faithful | Fidelity Rate
Literal | 6469 | 6096 | 94.24%
Metaphor | 1831 | 1538 | 83.99%
Total | 8300 | 7634 | 91.97%
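For clarity, the fidelity rate in Table 34 is simply the share of generated contexts judged faithful to their assigned label:

\[
\text{Fidelity Rate} = \frac{\text{Faithful}}{\text{Instances}} \times 100\%,
\qquad \text{e.g., literal: } \frac{6096}{6469} \approx 94.2\%.
\]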
Table 35. Token volume, cost, and processing time for each dataset. Instances are not summed in the Total row due to dataset partitioning.
Dataset | Instances | Mean Tokens/Request | Total Tokens | Estimated Cost (USD) | Wall-Clock (h)
MOH-X | 647 | 42 | 27,174 | 0.54 | 0.1
VUA_All | 205,425 | 38 | 7.8 × 10^6 | 155.8 | 11.2
VUA_Verb | 23,113 | 39 | 9.0 × 10^5 | 18.0 | 1.6
Total | 229,185 | – | 8.7 × 10^6 | 174.3 | 12.9
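The cost column in Table 35 is consistent with a flat rate of roughly USD 0.02 per 1K tokens (for example, 27,174 tokens × USD 0.00002 per token ≈ USD 0.54). The sketch below reproduces that arithmetic; the rate constant is inferred from the table itself, not quoted from official pricing.

```python
# Rough cost check for Table 35. The per-token rate is inferred from the table
# (about USD 0.02 per 1K tokens); actual API pricing may differ.
USD_PER_TOKEN = 0.02 / 1000

token_totals = {"MOH-X": 27_174, "VUA_All": 7.8e6, "VUA_Verb": 9.0e5}
for dataset, tokens in token_totals.items():
    print(f"{dataset}: ~USD {tokens * USD_PER_TOKEN:.2f}")
# MOH-X: ~USD 0.54, VUA_All: ~USD 156.00, VUA_Verb: ~USD 18.00 (cf. 0.54 / 155.8 / 18.0 in Table 35)
```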
Table 36. Latency statistics (seconds) per prompt type in VUA_All training set (N = 2833).
Metric | 1p | 1p-5b | 1p-10b | 2p5b | 2p10b | 1p-5f5b | 1p-10f10b | 2p5f5b | 2p10f10b
Mean | 0.789 | 0.520 | 1.566 | 0.509 | 0.566 | 0.505 | 0.612 | 0.469 | 0.535
Std Dev | 0.507 | 0.183 | 2.009 | 0.136 | 0.214 | 0.168 | 0.253 | 0.132 | 0.174
Min | 0.329 | 0.314 | 0.398 | 0.365 | 0.385 | 0.362 | 0.388 | 0.356 | 0.375
Max | 4.960 | 1.224 | 16.123 | 1.174 | 1.290 | 1.154 | 1.537 | 1.234 | 1.499
Table 37. Latency statistics (seconds) per prompt type in VUA_All validation set (N = 334).
Metric | 1p | 1p-5b | 1p-10b | 2p5b | 2p10b | 1p-5f5b | 1p-10f10b | 2p5f5b | 2p10f10b
Mean | 0.671 | 0.502 | 0.591 | 0.558 | 0.622 | 0.510 | 0.555 | 0.581 | 0.513
Std Dev | 0.379 | 0.361 | 0.156 | 0.228 | 0.417 | 0.321 | 0.378 | 0.395 | 0.143
Min | 0.387 | 0.291 | 0.412 | 0.364 | 0.422 | 0.352 | 0.340 | 0.308 | 0.330
Max | 2.293 | 0.911 | 1.414 | 1.104 | 1.335 | 1.141 | 1.391 | 1.373 | 1.456
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
