Article

Applying Additional Auxiliary Context Using Large Language Model for Metaphor Detection

by
Takuya Hayashi
1,*,† and
Minoru Sasaki
2,†
1
Major in Computer and Information Sciences, Graduate School of Science and Engineering, Ibaraki University, 4-12-1, Nakanarusawa, Hitachi 316-8511, Ibaraki, Japan
2
Department of Computer and Information Sciences, College of Engineering, Ibaraki University, 4-12-1, Nakanarusawa, Hitachi 316-8511, Ibaraki, Japan
*
Author to whom correspondence should be addressed.
†
These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(9), 218; https://doi.org/10.3390/bdcc9090218
Submission received: 6 June 2025 / Revised: 24 July 2025 / Accepted: 13 August 2025 / Published: 25 August 2025

Abstract

Metaphor detection is challenging in natural language processing (NLP) because it requires recognizing nuanced semantic shifts beyond literal meaning, and conventional models often falter when contextual cues are limited. We propose a method to enhance metaphor detection by augmenting input sentences with auxiliary context generated by ChatGPT. In our approach, ChatGPT produces semantically relevant sentences that are inserted before, after, or on both sides of a target sentence, allowing us to analyze the impact of context position and length on classification. Experiments on three benchmark datasets (MOH-X, VUA_All, VUA_Verb) show that this context-enriched input consistently outperforms the no-context baseline across accuracy, precision, recall, and F1-score, with the MOH-X dataset achieving the largest F1 gain. These improvements are statistically significant based on two-tailed t-tests. Our findings demonstrate that generative models can effectively enrich context for metaphor understanding, highlighting context placement and quantity as critical factors. Finally, we outline future directions, including advanced prompt engineering, optimizing context lengths, and extending this approach to multilingual metaphor detection.

1. Introduction

Metaphors are pervasive in natural language and play a vital role in expressing abstract concepts through more concrete or familiar domains. As Lakoff and Johnson [1] argue, metaphors are not merely rhetorical devices but are fundamental to human cognition, serving as a mapping mechanism between source and target conceptual domains. This makes metaphor detection a key challenge in natural language understanding (NLU).
Recent advancements in large-scale language models (LLMs), such as BERT and GPT, have significantly improved various NLP tasks by capturing rich contextual and semantic relationships. However, metaphor detection remains difficult for these models, particularly in short or ambiguous sentences. This difficulty stems from the fact that metaphors involve a shift in meaning that often cannot be resolved by lexical similarity or syntactic features alone.
Traditional approaches to metaphor detection have included rule-based systems, feature-engineered classifiers, and more recently, neural architectures like RNNs and Transformer-based models. Among these, MisNet has proven effective by integrating linguistic rules with contextual embeddings. Nevertheless, these models still rely on static inputs and often fail when contextual cues are insufficient.
To address this limitation, our study explores whether metaphor detection accuracy can be improved by dynamically enriching the input with auxiliary context generated by an LLM. Specifically, we propose a method that uses ChatGPT to generate semantically coherent sentences that are inserted before and/or after a given target sentence. This augmented input is then fed into a metaphor classification model (based on MisNet) to evaluate performance changes.
We hypothesize that both the amount and position of contextual information significantly influence metaphor classification, particularly in cases where semantic cues are limited. We test this hypothesis empirically on three benchmark datasets: MOH-X, VUA_All, and VUA_Verb.
Through quantitative evaluation and statistical analysis, we demonstrate that our context-enriched inputs consistently improve metaphor detection metrics. This suggests that metaphor understanding can benefit from generative augmentation strategies that provide interpretive scaffolding, and that LLMs can play a dual role: not only as classifiers but also as semantic context generators. Our findings provide new directions for metaphor-aware NLP modeling and LLM prompting techniques.

1.1. Research Background

In recent years, the field of artificial intelligence (AI) has experienced rapid and continuous progress. A major milestone was achieved in 2018, when Google introduced BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that significantly outperformed previous models across a range of NLP tasks. This advancement catalyzed the development of various large-scale language models (LLMs), including OpenAI’s GPT series, which have received global attention since the release of GPT-3 and GPT-4. These models have demonstrated impressive fluency and contextual awareness, transforming not only research but also practical applications in dialogue systems, summarization, and text generation.
Within this fast-evolving landscape, metaphor detection has emerged as a prominent and challenging research area in natural language understanding. Metaphors are pervasive in everyday language and are essential for expressing abstract or complex ideas through more familiar or concrete terms. As conceptualized by Lakoff and Johnson [1], metaphors involve a mapping between a “source” domain (concrete) and a “target” domain (abstract), allowing people to understand one concept in terms of another. This mechanism makes metaphors powerful but also difficult to detect computationally, as they often defy literal interpretation.
The following are illustrative examples of metaphorical expressions:
1.
“He absorbed the knowledge or beliefs of his tribe.”
The verb absorb does not refer to physical intake, as in absorbing water, but instead metaphorically denotes the internalization of abstract content such as knowledge or ideology. The implied meaning is: “He took in the knowledge or beliefs of his tribe.”
2.
“His political ideas color his lectures.”
The term color here does not refer to visual pigmentation but metaphorically suggests that his political views influence or shape the tone and content of his lectures. The intended meaning is: “His political ideas influence the nature of his lectures.”
As shown in these examples, metaphorical language often involves a discrepancy between literal word meaning and intended semantic interpretation. This makes metaphor detection a task requiring high-level semantic understanding, contextual inference, and sometimes even cultural or pragmatic knowledge.
Several computational approaches have been developed to address this task. Early methods relied on manually crafted linguistic features such as word concreteness, abstractness, frequency, or part-of-speech, which were input into traditional machine learning classifiers. These approaches were later outperformed by neural models such as RNNs and CNNs, which could learn features from data automatically. However, these models typically employed static word embeddings (e.g., Word2Vec, GloVe), which lack the ability to capture context-specific meaning variations crucial for metaphor detection. Moreover, RNNs face limitations in parallelization and long-distance dependency modeling.
Transformer-based models like BERT and RoBERTa introduced attention mechanisms and contextualized embeddings that significantly improved metaphor detection accuracy. These models can better capture semantic shifts by dynamically adjusting word representations based on surrounding context. In parallel, recent studies have incorporated external resources such as lexical definitions and conceptual knowledge from dictionaries to enrich word representations.
A representative example is the MisNet model proposed by Zhang and Liu in 2022  [2]. MisNet integrates linguistic rules—specifically the Metaphor Identification Procedure (MIP) and Selectional Preference Violation (SPV)—with contextual embeddings derived from Transformers. It processes the input sentence alongside the target word’s dictionary definition, usage patterns, and grammatical information to determine the metaphorical nature of the expression. Despite its effectiveness, MisNet exhibits limitations when applied to sentences with minimal context, such as short utterances, greetings, or sentences composed solely of pronouns (e.g., “That’s heavy.” or “He did it.”).
These shortcomings highlight a critical issue: even the most advanced models can misclassify metaphors in the absence of sufficient contextual cues. This motivates our investigation into whether dynamically generated context—produced by LLMs like ChatGPT—can serve as an auxiliary input to enhance metaphor detection, particularly in challenging cases with limited context.

1.2. Research Objectives

Despite the progress made by Transformer-based models such as BERT and RoBERTa in metaphor detection, these models continue to exhibit weaknesses in handling sentences with limited or ambiguous context. In such cases, the lack of surrounding linguistic or semantic cues often leads to misclassification of metaphorical expressions. This is particularly problematic in domains such as education, dialogue systems, and automated text analysis, where short utterances or pronoun-heavy constructions are common.
The primary objective of this study is to investigate whether dynamically generated auxiliary context—produced by a large language model (LLM)—can enhance the accuracy of metaphor detection. Specifically, we explore the use of ChatGPT to generate semantically coherent sentences that can be appended before and/or after a target sentence. These augmented inputs are then used to train or evaluate a metaphor classification model based on MisNet, which is particularly suitable for this study due to its architecture that integrates linguistic rules with contextual embeddings, allowing us to evaluate the impact of enriched context effectively.
We hypothesize that both the amount and the position of added context significantly influence detection performance. For example, placing explanatory context before the target sentence may prime the model for interpretation, whereas placing it after may act as confirmatory evidence. This leads us to design a series of experiments across three widely used benchmark datasets—MOH-X, VUA_All, and VUA_Verb—to empirically evaluate the effects of auxiliary context in metaphor classification tasks.
Through these experiments, we aim to address the following research questions:
  • Can auxiliary context generated by ChatGPT improve metaphor detection accuracy?
  • Does the position (before vs. after) of generated context affect performance?
  • How does the quantity of added context influence classification outcomes?
By answering these questions, we aim to contribute to a better understanding of how generative LLMs can be used not only for classification but also for context enrichment. Our findings could pave the way for more robust metaphor-aware NLP systems, especially in scenarios where input text is minimal or semantically sparse.
An early version of the present study was presented at the 18th International Conference on E-Service and Knowledge Management (ESKM 2024) [3].

2. Related Work

2.1. Metaphor Detection with Pretrained Encoders

Early neural approaches drew on feature engineering and RNNs; transformer-based encoders subsequently delivered large gains by capturing contextual semantics. MisNet [2] operationalizes two linguistic heuristics—MIP [4] and SPV [5]—in a Siamese architecture, using dictionary glosses to approximate a word’s basic sense. MelBERT [6] extends this line via metaphor-aware late interaction over BERT/RoBERTa representations guided by identification theories. FrameBERT [7] incorporates FrameNet embeddings to represent conceptual frames, improving interpretability and performance across MOH-X, VUA, and TroFi.
While these models enrich representations with static external knowledge such as FrameNet or WordNet, our work explores the dynamic generation of contextual sentences tailored to each specific input, directly addressing the limitation of insufficient context.

2.2. Data Scarcity and Auxiliary Signals

Limited labeled data motivates transfer and auxiliary-task learning. Zhang and Liu [8] introduce adversarial multi-task learning (AdMul) to transfer from basic sense discrimination built from WSD corpora. Jia and Li [9] enhance metaphor detection with soft labels distilled from a teacher model plus prompt-based target-word prediction, achieving SOTA results. ContrastWSD [10] explicitly contrasts contextual vs. basic senses using WSD signals. Recent work has also explored curriculum-style augmentation for metaphor detection, where examples are introduced in increasing levels of complexity to better align with LLM learning dynamics [11]. Our work is complementary: instead of new supervision, we generate disambiguating context on demand.

2.3. LLMs for Figurative Language and In-Context Augmentation

Recent studies explore LLMs for figurative understanding. LaiDA [12] combines linguistics-aware retrieval with LLM data augmentation. Work on idiom/figurative QA [13,14,15] shows that conversational LLMs still trail human performance, especially in context-heavy scenarios—motivating explicit context scaffolding. We extend this perspective by programmatically synthesizing short auxiliary sentences conditioned on gold metaphor labels.

2.4. Paraphrasing and Web-Derived Context

Paraphrase generation is an effective augmentation strategy across NLP tasks [16]. Although not designed specifically for metaphor detection, paraphrastic variants can diversify surface forms and expose alternative literal paraphrases, which may sharpen metaphor classifiers.

3. Methods

We follow the overall pipeline of our earlier work but expand the documentation of the LLM interface, prompt templates, and dataset transformation steps.

3.1. Generation Environment

The hardware and software environment used to generate auxiliary context is summarized in Table 1.

3.2. ChatGPT Configuration

Table 2 records the API, model snapshot, and parameterization used for auxiliary-context generation. Unless noted, OpenAI default values were used; temperature (1.0) and top-p (1.0) are the documented defaults at the time of data collection. No explicit top-k sampling control is exposed in the OpenAI API; nucleus (top-p) sampling was left at default.
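For illustration, a minimal sketch of this generation call is given below, assuming the official openai Python client (v1.x); the model identifier and prompt string are placeholders rather than the exact values recorded in Table 2.

```python
# Minimal sketch of the auxiliary-context generation call.
# Assumes the official `openai` Python client (v1.x); the model name and
# prompt text are placeholders standing in for the values in Tables 2 and 3.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_auxiliary_sentence(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Request one short auxiliary sentence with the documented default sampling settings."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # OpenAI default, as noted above
        top_p=1.0,        # nucleus (top-p) sampling left at its default
    )
    return response.choices[0].message.content.strip()
```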

3.3. Prompt Templates and Worked Examples

We designed two prompt families. Prompt 1 (minimal instruction) encodes the gold label via the token not; Prompt 2 supplements Prompt 1 with exemplars of metaphorical vs. literal uses of the same target word to reduce ambiguity. Both templates parameterize word budget N and insertion position (precede vs. follow). Complete templates and fully instantiated examples are shown in Table 3.
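The sketch below illustrates, under assumed wording, how such a template can be parameterized; the function name build_prompt and the phrasing are ours for illustration, and the authoritative templates are those in Table 3.

```python
# Illustrative parameterization of the two prompt families. The exact wording
# used in the experiments is given in Table 3; this sketch only shows how the
# word budget N, the gold label (encoded via "not"), the insertion position,
# and the optional exemplars of Prompt 2 are combined into one instruction.
def build_prompt(sentence: str, target: str, is_metaphor: bool,
                 n_words: int, position: str, exemplars: str = "") -> str:
    negation = "" if is_metaphor else "not "
    placement = "before" if position == "prior" else "after"
    prompt = (
        f"Write one sentence of about {n_words} words to be placed {placement} "
        f"the following sentence, in which the word '{target}' is {negation}used "
        f"metaphorically.\nSentence: {sentence}"
    )
    # Prompt 2 prepends exemplars of metaphorical vs. literal uses of the target word
    return (exemplars + "\n" + prompt) if exemplars else prompt
```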

3.4. Data Integration

Auxiliary sentences are concatenated with the original target sentence before it, after it, or on both sides (prior + following). Concatenation preserves punctuation and spacing to avoid token-boundary artifacts. For each instance, the resulting string replaces the sentence field in the MisNet CSV schema; all other columns (target position, POS, gloss, etc.) remain unchanged. When prior context is added, the offsets in target_position are shifted by the number of auxiliary tokens.
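A minimal sketch of this integration step is shown below; the column names follow the MisNet CSV schema mentioned above, while whitespace tokenization and the helper name integrate are simplifying assumptions.

```python
# Sketch of the integration step: concatenate auxiliary context with the target
# sentence and shift the target index when prior context is added. Column names
# follow the MisNet CSV schema; whitespace tokenization is an assumption.
def integrate(row: dict, aux_prior: str = "", aux_following: str = "") -> dict:
    parts = [p for p in (aux_prior, row["sentence"], aux_following) if p]
    new_row = dict(row)
    new_row["sentence"] = " ".join(parts)
    if aux_prior:
        # re-index the target offset by the number of prepended auxiliary tokens
        new_row["target_position"] = row["target_position"] + len(aux_prior.split())
    return new_row
```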

3.5. Implementation Notes

Scripts used to generate context, merge CSVs, and re-index targets are released at the project repository (see Data Availability).

4. Experiment

This section documents the environments, datasets, training regimen, and evaluation protocols in detail to support reproducibility.

4.1. Training Environment

MisNet was trained and evaluated using the environment detailed in Table 4.

4.2. MisNet Fine-Tuning Hyperparameters

The hyperparameters used for fine-tuning MisNet are listed in Table 5.

4.3. Datasets

4.3.1. Original Dataset

This section describes the structure and details of the datasets used in the experiments. Three publicly available datasets were employed: MOH-X [17], VUA_All [18], and VUA_Verb [18]. In addition, Table 6 presents the number of sentences and words in each dataset, the proportion of metaphors, and the average sentence length.
MOH-X
MOH-X is a verb-focused dataset compiled from WordNet, containing both literal and metaphorical uses of verbs. It was originally created by Mohammad, Shutova, and Turney (2016) [17] to study metaphor as a medium for conveying emotion, with annotations on both metaphoricality and affective meaning.
MOH-X consists of 647 instances, of which 48.69% are labeled as metaphorical. The sentences are notably short, with an average length of only 8.0 tokens. This high metaphor density and minimal sentence context make MOH-X a particularly challenging testbed for metaphor detection, especially under conditions of contextual ambiguity. These characteristics make it an ideal benchmark for evaluating the impact of auxiliary context generation.
VUA_All
The VUA dataset (Vrije Universiteit Amsterdam Metaphor Corpus) was created by VU University Amsterdam using fragments sampled from four genres of the British National Corpus: academic, news, conversation, and fiction. VUA_All includes part-of-speech (POS) tagging for every word in every sentence, annotated using the MIPVU procedure with high inter-annotator agreement ( κ > 0.8 ). The types of POS tags are shown in Table 7.
The training split of VUA_All contains 12,123 sentences (72,611 tokens), and the test set consists of 4081 sentences (22,196 tokens). Overall, approximately 18% of the tokens are labeled as metaphorical. The average sentence length is 18.4 for training and 18.6 for test. Due to its genre diversity and high annotation quality, VUA_All serves as a robust benchmark for assessing model generalizability across realistic, varied language contexts.
VUA_Verb
VUA_Verb is a filtered subset of VUA_All, consisting exclusively of sentences where the target word being evaluated is a verb. This specialized setup is particularly important for metaphor detection, as verbs often form the semantic core of a sentence and are more likely to involve figurative language.
In the VUA_Verb test set, 29.98% of target verbs are labeled as metaphorical, with an average sentence length of 18.6 tokens. Compared to VUA_All, this subset highlights verb-specific challenges such as polysemy and context-dependence. Comparing results between VUA_All and VUA_Verb can reveal whether metaphor detection models benefit from context depending on the grammatical category of the target word.

4.3.2. Additional Figurative Resources (Not Used in Main Experiments)

To contextualize scope, we surveyed additional figurative-language corpora: TroFi (verb tropes), FIG-QA [13,14], Multi-Figurative (idiom, metaphor, sarcasm) [19], and the "It's not Rocket Science" narrative idiom benchmark [15]. These resources support future cross-phenomenon generalization studies (see Section 11).

4.4. Generation Procedure of Dataset with Additional Context

The procedure for generating a context-enriched dataset is illustrated in Figure 1. First, one of the original datasets—MOH-X, VUA_All, or VUA_Verb—is loaded, and prompts are constructed for each target sentence. These prompts are then input to ChatGPT, which generates short auxiliary sentences. Depending on the configuration, the generated auxiliary sentence is inserted before (prior), after (following), or both before and after the target sentence. The resulting data is then output in CSV format compatible with MisNet, a metaphor detection model.
The specific processing steps for each data instance are shown in the three flowcharts. The top row of each figure represents the input data. The variable names enclosed in parentheses correspond to those defined in the prompt templates listed in Table 3 in Section 3.3. Figure 2 compares the two main strategies for contextual augmentation: (a) the “prior” strategy adds auxiliary context before the target sentence, and (b) the “following” strategy adds it after.
In addition, Figure 3 illustrates the “prior + following” configuration, where auxiliary context is added to both sides of the target sentence.
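The sketch below condenses this pipeline into a single loop, reusing the illustrative helpers sketched in Section 3; the column names (target_word, label) and file handling are assumptions, and the released scripts (see Data Availability Statement) remain the authoritative implementation.

```python
# Condensed sketch of the pipeline in Figure 1, reusing the helpers sketched
# in Section 3 (build_prompt, generate_auxiliary_sentence, integrate).
# Column names and file paths are placeholders, not the exact released schema.
import pandas as pd

def augment_dataset(in_csv: str, out_csv: str, n_words: int, position: str) -> None:
    """Load a dataset, generate auxiliary context per instance, and write a MisNet-style CSV."""
    df = pd.read_csv(in_csv)
    rows = []
    for _, row in df.iterrows():
        prompt = build_prompt(row["sentence"], row["target_word"],
                              bool(row["label"]), n_words, position)
        aux = generate_auxiliary_sentence(prompt)
        if position == "prior":
            rows.append(integrate(row.to_dict(), aux_prior=aux))
        elif position == "following":
            rows.append(integrate(row.to_dict(), aux_following=aux))
        else:  # "prior + following": one generated sentence on each side
            aux_after = generate_auxiliary_sentence(prompt)
            rows.append(integrate(row.to_dict(), aux_prior=aux, aux_following=aux_after))
    pd.DataFrame(rows).to_csv(out_csv, index=False)
```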

5. Results

This section presents the results of the experiments conducted as described in the previous chapter. The evaluation is based on three main configurations: the original dataset, datasets enhanced with 5-word auxiliary sentences, and datasets enhanced with 10-word auxiliary sentences. Auxiliary sentences were generated using two prompt types (Prompt 1 and Prompt 2). For each configuration, we report Accuracy (Acc), Precision (Prec), Recall (Rec), and F1-score (F1).
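For reference, the four metrics follow their standard definitions, as in the scikit-learn sketch below, where label 1 denotes a metaphorical instance.

```python
# The four reported metrics computed with scikit-learn for reference;
# y_true and y_pred are binary labels (1 = metaphorical, 0 = literal).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    return {
        "Acc":  accuracy_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred),
        "Rec":  recall_score(y_true, y_pred),
        "F1":   f1_score(y_true, y_pred),
    }
```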

5.1. Comparison of Evaluation Metrics

This section presents a comparative analysis of model performance under various auxiliary context configurations across three datasets: MOH-X, VUA_All, and VUA_Verb. The evaluation metrics include Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 score. The highest score in each metric is underlined for emphasis.

5.1.1. Effect of Prior Context

We evaluated the impact of adding prior context on metaphor detection across three datasets: MOH-X, VUA_All, and VUA_Verb. The results, summarized in Table 8, Table 9 and Table 10, indicate that incorporating preceding sentences generally improves model performance across multiple evaluation metrics.
In the MOH-X dataset (Table 8), all contextual augmentation settings led to performance gains over the original baseline. Notably, P2_5words achieved the highest accuracy (0.8547) and precision (0.8619), while P2_10words recorded the best recall (0.8653) and F1 score (0.8498). These improvements suggest that injecting even a small amount of prior context enhances the model’s ability to generalize, particularly in few-shot or idiomatic examples.
For the VUA_All dataset (Table 9), validation results show that P1_10words achieved the highest recall (0.7823), while P2_5words provided the best accuracy (0.9419) and F1 score (0.7596). In terms of test performance, P2_10words yielded the best precision (0.7695), and P1_10words slightly outperformed others in recall (0.7603). These results demonstrate that both the amount and type of prior context can influence performance, with slightly longer or rule-informed contexts providing marginal gains.
In the VUA_Verb dataset (Table 10), which focuses specifically on metaphorical verbs, similar trends were observed. On the validation set, P1_5words achieved the best accuracy (0.8138) and F1 score (0.6945), while the original data retained the highest recall (0.7601). On the test set, P2_5words yielded the highest recall (0.7426), and P2_10words achieved the best precision (0.6887). These findings suggest that while prior context may slightly reduce sensitivity (recall) in some configurations, it can significantly boost precision and overall balance (F1), particularly when fine-tuned for verbs.
Overall, the addition of prior context—especially configurations like P1_10words and P2_5words—proved beneficial across datasets. The gains were most evident in precision and F1 score, indicating that contextual signals help disambiguate literal versus metaphorical usage, especially in challenging or ambiguous cases.

5.1.2. Effect of Following Context

Adding context after the target sentence generally resulted in increased recall, especially when the context length was extended to 10 words. However, this improvement in recall often came at the cost of a slight reduction in precision, suggesting a trade-off between sensitivity and specificity.
In the MOH-X dataset (Table 11), the addition of 5-word following context using P1_5words achieved the highest accuracy (0.8469), precision (0.8455), and F1 score (0.8429). On the other hand, P1_10words recorded the best recall (0.8716), albeit with a drop in precision, which lowered its overall F1. These results imply that while extended following context helps the model capture more metaphorical instances (higher recall), it can sometimes introduce noise that slightly reduces precision.
In the VUA_All dataset (Table 12), we observed a similar pattern. P1_10words yielded the best performance on the validation set in terms of accuracy (0.9522), precision (0.8134), and F1 score (0.7878), although recall was slightly below the original. On the test set, P2_10words achieved the highest accuracy (0.9480) and precision (0.8288), while P1_5words produced the best F1 score (0.7787). These results suggest that while following context helps boost model robustness, optimal context length and type may vary depending on evaluation criteria.
For the VUA_Verb dataset (Table 13), which focuses on verb metaphors, P1_10words reached the highest accuracy (0.8718) and F1 score (0.7569) on the validation set. P2_5words was the most balanced configuration on the test set, achieving top recall (0.7414) and F1 score (0.7350). Notably, P2_10words had the best test precision (0.8077), again confirming the trade-off: more context increases detection power but may also increase the risk of false positives.
In summary, adding following context improves recall and F1 score in many cases, particularly when 10 words are used. However, this gain often comes with a drop in precision, indicating that while the model becomes better at catching metaphors, it may do so with less certainty.

5.1.3. Effect of Combined Context (Prior + Following)

Combining both prior and following context yielded the highest overall performance across datasets. However, the optimal prompt and context length varied by dataset. As shown in Table 14, Table 15 and Table 16, the combined context setting consistently improved accuracy and F1 scores compared with the original baseline across MOH-X, VUA_All, and VUA_Verb.

5.1.4. Discussion

The results clearly demonstrate that incorporating auxiliary context—whether prior, following, or both—significantly enhances metaphor detection performance across all datasets. Several key patterns emerged from the experiments:
  • For MOH-X, the highest F1 score (0.8493) was achieved using Prompt 1 with 10-word combined context, while the highest accuracy (0.8529) and precision (0.8601) were obtained using Prompt 2 with 5-word combined context. This indicates that short, rule-based prior context is effective for precise metaphor classification, while longer, pattern-based context improves general recall.
  • In VUA_All, Prompt 1 with 5-word combined context consistently outperformed other settings, achieving the highest validation accuracy (0.9517) and test F1 score (0.7839). Prompt 1 with 10 words yielded the highest validation precision (0.8122). These results suggest that both moderate-length and prompt style play important roles in balancing recall and precision.
  • For VUA_Verb, Prompt 1 with 10-word combined context achieved the best validation accuracy (0.8759) and F1 score (0.7633), while Prompt 2 with 5-word combined context yielded the highest test recall (0.7274) and F1 score (0.7339). This indicates that different prompt designs may be better suited to different evaluation criteria, particularly for metaphorical verbs.
These findings validate the hypothesis that contextual information is essential in metaphor detection. Moreover, they highlight the value of generative models like ChatGPT in enriching inputs with semantically relevant context, thereby boosting task-specific NLP performance through dynamic prompt engineering.
We further compared our model to several representative metaphor detection baselines on the MOH-X dataset. As shown in Table 17, our best configuration (Prompt 1 with 10-word combined context) achieved a strong F1 score of 0.8493 and the highest Recall of 0.8870 among all models. This indicates our model’s capability to capture a broader range of metaphorical expressions without omissions.
In terms of F1 score, Zhang and Liu’s model (2023) [8] achieved the best overall performance (F1 = 0.880, Accuracy = 0.894), followed by Lin et al. (2021) [20] with F1 = 0.852 and Accuracy = 0.857. Our model narrows the performance gap with these state-of-the-art models to approximately 0.03 points in F1.
Compared to strong baselines such as MelBERT [6] (F1 = 0.842) and MisNet [2] (F1 = 0.834), our model demonstrates superior performance. The gap is even wider against MrBERT [21] (F1 = 0.816) and FrameBERT [7] (F1 = 0.827). Notably, the improvement over earlier BERT-based models such as Le et al. (2020) [22] (F1 = 0.796) highlights our model’s continued progress in metaphor detection.
Our model also maintains a strong balance between Precision (0.8205) and Recall (0.8870), effectively mitigating false positives while achieving high coverage. This results in consistent detection performance, avoiding the skewed tendencies of high-precision or high-recall-focused models.
Importantly, unlike many prior approaches, our method achieves this performance without relying on explicit syntactic knowledge, external resources, or handcrafted modules. By leveraging only prompt-based contextualization, we showcase a high degree of generalizability and flexibility—making our approach a promising and lightweight direction for future metaphor detection systems.
Table 17. Comparison of Metaphor Detection Models on MOH-X (Full Metrics). Metrics for Lin et al. [20], Zhang and Liu [8], and Le et al. [22] are sourced from [23].
Model | Accuracy | Precision | Recall | F1
MelBERT [6] | 0.847 | 0.850 | 0.8350 | 0.842
FrameBERT [7] | 0.8286 | 0.819 | 0.8445 | 0.827
MrBERT [21] | 0.820 | 0.813 | 0.8150 | 0.816
MisNet [2] | 0.836 | 0.842 | 0.8400 | 0.834
Ours (P1_10 Combined) | 0.8453 | 0.8205 | 0.8870 | 0.8493
Lin et al., 2021 [20] | 0.857 | 0.866 | 0.847 | 0.852
Zhang and Liu, 2023 [8] | 0.894 | 0.882 | 0.879 | 0.880
Le et al., 2020 [22] | 0.789 | 0.788 | 0.805 | 0.796

5.2. Evaluation of Superiority Using t-Test

To statistically assess the effectiveness of adding auxiliary context, we conducted two-tailed t-tests. For each dataset, predicted probability scores from MisNet (as defined in Figure 4) were averaged per sample. A mean value below 0.5 was interpreted as class 0 (literal), and 0.5 or above as class 1 (metaphorical).
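A minimal sketch of this procedure is given below, assuming SciPy's paired two-tailed t-test over per-sample probabilities from the original and context-augmented runs; pairing over identical test instances is an assumption of this illustration.

```python
# Sketch of the significance test: per-sample mean MisNet probabilities from
# the original and the context-augmented runs are compared with a two-tailed
# paired t-test (SciPy). Pairing over identical test instances is assumed.
import numpy as np
from scipy.stats import ttest_rel

def compare_runs(probs_original: np.ndarray, probs_augmented: np.ndarray):
    """Two-tailed paired t-test over per-sample mean probabilities from two runs."""
    # thresholding at 0.5 yields the hard class decisions described above
    preds_original = (probs_original >= 0.5).astype(int)
    preds_augmented = (probs_augmented >= 0.5).astype(int)
    t_stat, p_value = ttest_rel(probs_original, probs_augmented)  # two-sided by default
    return preds_original, preds_augmented, t_stat, p_value
```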

5.2.1. Prior Context

Table 18, Table 19, Table 20, Table 21 and Table 22 show p-values for comparisons between original and context-augmented data, where auxiliary sentences were added before the target sentence.
In the MOH-X dataset (Table 18), statistically significant differences at the 10% level were observed in two conditions: (1) Prompt 1 with 10 words, and (2) Prompt 2 with 10 words. Most other configurations yielded p-values ranging from 0.2 to 0.9, indicating no significant difference.
In the VUA_All dataset, both the validation set (Table 19) and the test set (Table 20) showed statistically significant differences. Specifically, Prompt 1 with 10 words and Prompt 2 with both 5-word and 10-word additions resulted in significance at the 5% level or better.
In the VUA_Verb dataset, the validation results (Table 21) and test results (Table 22) show that nearly all prompt conditions demonstrated significance below the 5% level, with many achieving p-values under 1%, highlighting the dataset’s high sensitivity to contextual augmentation.

5.2.2. Following Context

As shown in Table 23, no condition in the MOH-X dataset reached statistical significance at the 5% level. The lowest p-value was 0.0768 (Prompt 2 with 10 words), marginally significant at the 10% level. This suggests MOH-X is less influenced by following context alone.
In the validation set of VUA_All (Table 24), the original data exhibited extremely low p-values (1.03 × 10⁻²²), with Prompt 1 (10 words) and Prompt 2 (5 words) also reaching statistical significance at the 5–10% levels.
The test set of VUA_All (Table 25) similarly showed extremely low p-values in the original data (3.36 × 10⁻¹³). Prompt 1 with 5 words reached the 1% level, while 10-word additions yielded mixed results.
VUA_Verb’s validation set (Table 26) showed moderate to strong significance in multiple conditions. The original data had a p-value of 0.00091, with Prompt 1 (10 words) and Prompt 2 (5 words) also showing significance.
In the test set of VUA_Verb (Table 27), the original data ( p = 0.00045 ) and Prompt 1 with 5 words ( p = 0.0048 ) met the 1% threshold, while Prompt 2 (5 words) and Prompt 1 (10 words) reached the 5% level.

5.2.3. Prior and Following Context

Table 28 presents results for the MOH-X dataset, where no configuration reached the 5% significance threshold. The lowest p-value was 0.0567 for the original data, which is marginally significant at the 10% level.
In VUA_All (Validation) (Table 29), all conditions, including the original data, yielded statistically significant results at the 1% level. Additionally, Prompt 1 (10 words) and Prompt 2 (10 words) approached or met the 5% threshold.
The VUA_All (Test) set (Table 30) also showed extremely low p-values for the original condition (3.18 × 10⁻³¹). Prompt 1 with 5 words was marginally significant (p = 0.0504), while Prompt 2 with 5 words demonstrated stronger significance (p = 0.00095).
In the VUA_Verb (Validation) set (Table 31), the original data reached the 10% level ( p = 0.0750 ). Prompt 1 (10 words) showed 5% level significance, and Prompt 2 (10 words) also reached significance.
Finally, for VUA_Verb (Test) (Table 32), both the original data ( p = 0.0003 ) and multiple prompt configurations (e.g., Prompt 1 with 5 words: p = 0.0050 ) showed strong statistical significance under the 1% threshold.

5.2.4. Discussion

The t-test results reveal several important patterns. First, the MOH-X dataset showed clear sensitivity to prior context—especially under 10-word prompts—but remained largely unaffected by following context. This suggests that short, metaphor-rich expressions benefit more from preceding contextual cues than subsequent ones.
Second, VUA_All and VUA_Verb consistently exhibited statistical significance across nearly all conditions. This indicates that even relatively long and diverse sentences in these datasets gain interpretive clarity from auxiliary context—regardless of position.
Third, the strongest statistical significance was often found in configurations combining both prior and following context, though their advantage over prior-only configurations was not always dramatic. This implies that while dual-context strategies offer robustness, well-crafted prior context alone may be sufficient in many cases.
Finally, the effectiveness of context addition appears to be dataset-dependent. MOH-X favors semantic priming through earlier context, whereas VUA_Verb responds broadly to both directions. These differences underscore the importance of tailoring context augmentation strategies to dataset characteristics, metaphor types, and sentence structure.

6. Error and Failure-Case Analysis

While auxiliary context generated by ChatGPT significantly improves metaphor detection in many cases, it also introduces specific failure modes. We conducted a manual analysis of 300 instances sampled per dataset (MOH-X, VUA_All, and VUA_Verb) from the augmented outputs to categorize and quantify these error types. Table 33 summarizes their distributions and illustrative patterns.

Insights and Implications

Among the most frequent failure types was semantic drift, particularly prevalent in VUA_All (18.0%), where generated context diverged into unrelated topics. Pronoun coreference failure was also notably frequent in VUA_All and VUA_Verb due to abstract pronouns lacking antecedents. Although polarity flips and named entity over-extension occurred less frequently, even minor misalignments in metaphor labels impacted precision.
We recommend future work consider stricter lexical constraints in prompts or post-hoc filtering to reduce such errors. Appendix A includes additional examples across all datasets.

7. Analysis: Why Limited Gains on VUA_All/VUA_Verb

Although auxiliary context improved MOH-X markedly, gains on VUA_All and VUA_Verb were modest. We analyze four interacting factors:
  • Sentence length and information sufficiency: Many VUA sentences already contain rich discourse context; short synthetic additions yield diminishing returns.
  • POS diversity and non-verb targets (VUA_All only): Non-verb metaphors may not benefit from short preceding clauses tuned to verbs.
  • Metaphor sparsity: Low positive rate (11%) weakens the supervised signal; auxiliary sentences may bias class priors.
  • Domain shift and register mismatch: LLM-generated English tends toward contemporary, news-like style; VUA derives from educational broadcasts and transcribed speech, creating lexical mismatch.
We empirically probe (1)–(3) in Appendix B with stratified re-analysis by length, POS, and label.

8. Semantic Fidelity of Generated Context

We empirically evaluated whether the auxiliary sentences generated by ChatGPT adhered to the intended metaphorical or literal usage of target expressions. To do so, we analyzed a combined sample of 8300 instances drawn from the VUA_All training and validation sets, categorized by their original metaphor labels.
Specifically, we verified whether the generated auxiliary sentence contained the target word and matched the expected metaphor label. The fidelity criterion was binary: a sentence was judged faithful if the target word was present and its use conformed to the original label. Errors included substitution with synonyms, omission of the target word, and metaphor-label contradiction (e.g., polarity flip; see Section 6).
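The lexical half of this criterion can be automated as in the sketch below (label conformity was judged manually); lowercased token matching is a simplifying assumption and does not cover all inflected forms.

```python
# Sketch of the lexical half of the fidelity criterion: the target word must
# appear in the generated sentence. Label conformity (metaphorical vs. literal
# use) was judged manually; simple lowercased token matching is an assumption.
import re

def contains_target(generated: str, target_word: str) -> bool:
    """Lexical check: is the target word present verbatim in the generated sentence?"""
    tokens = re.findall(r"[a-z']+", generated.lower())
    return target_word.lower() in tokens
```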
The results are summarized in Table 34.
Overall, literal instances exhibited high semantic fidelity (94.2%), while metaphorical cases showed slightly lower adherence (84.0%). This gap aligns with prior qualitative observations that metaphor prompts are more prone to generation failures, such as omitted or misused metaphor targets.
These findings confirm that while ChatGPT can produce label-consistent auxiliary context in most cases, additional controls (e.g., lexical constraints or prompt tuning) may be necessary to reduce subtle mismatches in figurative tasks.

9. Cost and Resource Analysis

This section reports the computational and monetary cost of dataset augmentation and model training.

9.1. API Usage and Estimated Cost

Table 35 summarizes the number of instances, average tokens per request, total token volume, estimated cost, and wall-clock processing time for each dataset. Token counts include both prompts and completions. Costs are estimated using contemporaneous OpenAI pricing: $0.001 per 1K prompt tokens and $0.002 per 1K completion tokens.
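The arithmetic behind these estimates is straightforward, as in the sketch below; the token counts in the example are placeholders, not figures from Table 35.

```python
# Arithmetic behind the cost estimates: prompt and completion tokens are
# priced separately at the rates quoted above.
PROMPT_PRICE_PER_1K = 0.001      # USD per 1K prompt tokens
COMPLETION_PRICE_PER_1K = 0.002  # USD per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost for a given prompt/completion token volume."""
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
        + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

# Illustrative figures only (not taken from Table 35):
# 60M prompt tokens and 10M completion tokens -> 60.0 + 20.0 = 80.0 USD
print(f"${estimate_cost(60_000_000, 10_000_000):.2f}")
```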

9.2. Processing Latency Statistics

To assess runtime behavior, we recorded processing latency per instance for each prompt configuration over the VUA_All dataset. Table 36 and Table 37 report mean, standard deviation, and min/max values (in seconds) for the train and validation splits, respectively.
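These summary statistics follow the usual descriptive definitions, as in the sketch below, assuming per-instance latencies recorded in seconds.

```python
# Minimal sketch of the latency summary: mean, standard deviation, min, and
# max per prompt configuration, assuming latencies are recorded in seconds.
import statistics

def summarize_latencies(latencies: list[float]) -> dict:
    return {
        "mean": statistics.mean(latencies),
        "std":  statistics.stdev(latencies),
        "min":  min(latencies),
        "max":  max(latencies),
    }
```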

9.3. Discussion

Latency distributions reveal that most completions were returned in under one second, with average latencies between 0.5 and 0.8 s per instance. However, specific prompt types (notably 1p-10b) exhibit long-tail delays, occasionally exceeding 10 s. These outliers are likely caused by transient API congestion or internal model variance.
Overall, our findings suggest that large-scale dataset augmentation using commercial LLM APIs is practical and affordable. With batching and parallelization, over 200 K examples were processed within a single day, at a cost below $175.

10. General Discussion

This section synthesizes findings from the main experiments, ablation studies, and error analyses, to evaluate how auxiliary context affects metaphor detection across MOH-X, VUA_All, and VUA_Verb.

10.1. Impact of Auxiliary Context Position

  • Prior Context
Prepending auxiliary sentences consistently improved performance. MOH-X showed the largest gain (F1 = 0.8493, Precision = 0.8601 under Prompt 2 with 10 words). Two-tailed t-tests confirmed statistical significance at the 0.1% level. This supports the hypothesis that semantically rich prior context is especially helpful for disambiguating short metaphorical expressions.
  • Following Context
Appending context mainly improved recall (e.g., Recall = 0.8716 on MOH-X). The VUA datasets showed more moderate effects. These findings suggest that metaphor comprehension may rely more heavily on antecedent clues than on following context.
  • Combined Context
Combining prior and following sentences produced the most balanced and strongest results across all datasets. These significant improvements (with medium-to-large effect sizes) suggest that adopting adaptive context configurations would be beneficial.
Further analysis of auxiliary context length (5 vs. 10 tokens) is presented in Appendix C. Results show that while 10-token contexts generally yielded higher recall, 5-token contexts offered comparable performance with reduced generation cost and length. This supports the feasibility of shorter context windows in resource-constrained settings.
Additional empirical comparisons between Prompt 1 and Prompt 2, including their interactions with context position, are presented in Appendix D.

10.2. Dataset-Specific Observations

  • MOH-X: Short, isolated sentences with high metaphor density benefited most from context augmentation, reinforcing our core hypothesis.
  • VUA_All: Rich discourse context and genre diversity limited the marginal gains of LLM-generated additions.
  • VUA_Verb: Verb-centered focus retained context sensitivity, suggesting syntactic function modulates context utility.

10.3. Limitations of Prompt-Based Generation

Manual inspection of 900 samples revealed common failure modes:
  • Semantic drift (12–18%): Off-topic completions (e.g., “quantum computers”).
  • Polarity flips (4–6%): Violated metaphor/literal label.
  • Redundancy (3–9%): Minimal variation from original sentence.
  • Coreference failure (up to 11%): Unresolvable pronouns like “she.”
  • Over-long entities (2–5%): Truncated named entities.
Despite these, label fidelity was above 90% for literal prompts and around 82% for metaphor, with inter-annotator agreement κ = 0.74 .

10.4. Prompt Design Tradeoffs

Prompt 2, which includes exemplar conditioning, improved precision (+1.1 to +1.8 points) but increased generation length by 18%. This highlights the tradeoff between generation quality and computational cost.

11. Conclusions and Future Work

This study proposed a context-enrichment approach for metaphor detection by prepending, appending, or surrounding input sentences with auxiliary context generated by ChatGPT. The method was evaluated on three benchmark datasets—MOH-X, VUA_All, and VUA_Verb—under three configurations: prior context, following context, and combined context.
The addition of auxiliary sentences led to consistent improvements in accuracy, precision, recall, and F1 across all datasets. The most significant gains were observed in MOH-X, a dataset composed of short, metaphor-dense sentences, where prior context achieved an F1 score of 0.8493 and precision of 0.8601. Enriching the input with either preceding or following context also yielded measurable improvements on VUA_All and VUA_Verb, particularly in recall and F1.
Two-tailed paired t-tests confirmed the statistical significance of these gains. In particular, the inclusion of 10-word auxiliary context yielded strong effects at the p < 0.001 or p < 0.01 levels across several configurations. These findings underscore not only the importance of contextual information, but also the role of its positioning in improving metaphor comprehension.

Future Directions

Building on these findings, we identify several promising avenues for future research:
  • Context Token Optimization: Dynamically allocate token budgets depending on sentence ambiguity or metaphor density, minimizing redundancy while maximizing effectiveness.
  • Adaptive Context Configuration: Learn where and when (prior, following, or both) to insert auxiliary context based on syntactic roles, metaphor types, or discourse structure.
  • Label-Aware Paraphrase Augmentation: Enhance diversity and supervision by combining auxiliary context with metaphorical and literal paraphrases [16].
  • Cross-Figurative Transfer: Investigate unified training frameworks for metaphor, idiom, simile, and related figurative phenomena via joint corpora [13,14,15,19].
  • Multilingual Generalization: Extend auxiliary context strategies to non-English datasets (e.g., Japanese, Chinese) to evaluate linguistic and cultural variability in metaphor expression.
  • Cost-Sensitive Generation: Reduce inference cost by leveraging instruction-tuned or distilled open-source LLMs [12], enabling scalable metaphor-aware augmentation.
  • Constrained Decoding: Ensure target-word presence during generation by applying lexical constraints at decoding time.
  • Few-Shot Prompt Design: Explore the integration of few-shot exemplars in prompts to improve alignment with metaphor detection objectives.
  • Style Adaptation: Improve domain consistency by adapting generation styles to match the linguistic characteristics of specific corpora.
Overall, this work demonstrates the potential of LLM-assisted auxiliary context generation as a flexible and semantically aligned strategy for enhancing metaphor detection. Future studies that tailor generation strategies to task-specific structures and computational constraints may further improve both interpretability and performance.

Author Contributions

T.H. constructed the dataset and conducted the experiments. M.S. performed the analysis and supervised the project. Both authors contributed to the conceptualization and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant number 25K15242.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The supplementary contextual information and the extended dataset used in this study are publicly available on GitHub. The data can be accessed via https://github.com/TakuyaHayashi3204/BDCC-Added_Dataset.git (accessed on 24 July 2025).

Acknowledgments

The authors would like to thank those who provided support that is not covered under author contributions or funding. This includes administrative and technical assistance, or donations in kind, such as materials used in experiments.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AI  Artificial Intelligence
NLP  Natural Language Processing
LLM  Large Language Model
GPT  Generative Pre-trained Transformer
ChatGPT  Chat Generative Pre-trained Transformer
BERT  Bidirectional Encoder Representations from Transformers
RoBERTa  A Robustly Optimized BERT Pretraining Approach
RNN  Recurrent Neural Network
POS  Part of Speech
MOH-X  Extension of the MOH dataset (by Mohammad et al.) for metaphor detection
VUA  VU Amsterdam Metaphor Corpus
MIP  Metaphor Identification Procedure
SPV  Selectional Preference Violation
MisNet  Metaphor Identification from a Siamese Network
P1  Prompt Type 1 (direct generation)
P2  Prompt Type 2 (example-based generation)
Acc  Accuracy
Prec  Precision
Rec  Recall
F1  F1 Score
GPU  Graphics Processing Unit
API  Application Programming Interface

Appendix A. Additional Failure Examples

In addition to the aggregate error statistics discussed in Section 6, Table A1 presents concrete examples of auxiliary sentence generation failures across the MOH-X, VUA_All, and VUA_Verb datasets. These instances illustrate characteristic weaknesses such as semantic drift, target-word omission, and misalignment with the metaphor label. Each row shows the original sentence, the target word, the gold label, and the generated auxiliary sentence.
Table A1. Examples of failure cases in auxiliary context generation. "Metaphor" indicates figurative usage, and "Literal" indicates non-figurative usage.
Dataset | Original Sentence | Target Word | Label | Generated Auxiliary Sentence
MOH-X | He swallowed his words. | swallowed | Metaphor | Despite his certainty, doubt engulfed him and made him silent.
MOH-X | Acknowledge the deed. | acknowledge | Literal | She refused to acknowledge the deed of kindness done for her.
MOH-X | The vessel was shipwrecked. | shipwrecked | Literal | The storm caused extensive damage, leaving the vessel stranded.
VUA_All | It's hard to communicate a disease. | communicate | Literal | His words carried the contagion of doubt, spreading fast.
VUA_Verb | Come on sweetheart! | come | Metaphor | Whispered winds beckon, come on sweetheart!
VUA_Verb | Just trying to | try | Literal | Just trying to learn something new.
While some generated sentences are fluent and semantically rich, others misrepresent the intended usage of the target word or stray from the lexical or contextual constraints defined in the prompt. As detailed in Section 6, these artifacts can reduce classifier performance by introducing noise, ambiguity, or misleading cues. Filtering heuristics or post-generation validation may help mitigate such failures in future systems.

Appendix B. VUA Stratified Analyses

To investigate why auxiliary context yielded limited gains on the VUA_All dataset, we performed stratified analyses based on POS category and sentence length.

Appendix B.1. POS Category Analysis

Table A2 reports metaphor prevalence across parts of speech (POS) in VUA_All. Verbs, as expected, exhibit the highest frequency of metaphoric usage, with 2907 metaphoric instances out of 18,529 total verb tokens (≈15.7%). Nouns and adjectives also showed considerable metaphor density. In contrast, PUNCT, X, and SYM categories had negligible or no metaphors. This confirms that verbs and content words are more informative targets for metaphor detection.
Table A2. Metaphor prevalence by POS category in VUA_All dataset.
POS Tag | Literal | Metaphor
ADJ | 6308 | 896
ADP | 7433 | 2159
ADV | 5728 | 569
CCONJ | 2779 | 156
DET | 6714 | 838
INTJ | 745 | 49
NOUN | 13,705 | 2080
NUM | 975 | 61
PART | 2439 | 293
PRON | 7112 | 454
PROPN | 3392 | 262
PUNCT | 5081 | 3
SYM | 3 | 0
VERB | 15,622 | 2907
X | 8 | 0
Table A3. Evaluation results by POS category (rounded to 3 decimal places).
POS | Accuracy | Precision | Recall | F1-Score
ADP | 0.898 | 0.822 | 0.803 | 0.806
DET | 0.954 | 0.780 | 0.824 | 0.798
VERB | 0.902 | 0.721 | 0.743 | 0.730
NOUN | 0.909 | 0.716 | 0.623 | 0.662
ADV | 0.948 | 0.690 | 0.600 | 0.637
PART | 0.922 | 0.600 | 0.650 | 0.619
ADJ | 0.903 | 0.626 | 0.579 | 0.596
PUNCT | 0.982 | 0.702 | 0.593 | 0.567
PROPN | 0.968 | 0.455 | 0.272 | 0.318
INTJ | 0.988 | 0.107 | 0.054 | 0.071
PRON | 0.988 | 0.036 | 0.023 | 0.011
NUM | 0.984 | 0.000 | 0.036 | 0.001
CCONJ | 0.981 | 0.000 | 0.000 | 0.000
SYM | 0.964 | 0.000 | 0.000 | 0.000
X | 0.990 | 0.000 | 0.000 | 0.000
Table A4. Combined evaluation results for Prior-test, Following-test, and Both-test (Prior and Following) input conditions across POS categories (rounded to 3 significant digits).
Test Type | Verb (Acc. / Pre. / Rec. / F1) | Adj. (Acc. / Pre. / Rec. / F1) | Adv. (Acc. / Pre. / Rec. / F1) | Noun (Acc. / Pre. / Rec. / F1)
Prior 1p5words | 0.911 / 0.756 / 0.787 / 0.772 | 0.912 / 0.701 / 0.625 / 0.661 | 0.966 / 0.814 / 0.680 / 0.741 | 0.921 / 0.782 / 0.654 / 0.712
Prior 1p10words | 0.910 / 0.749 / 0.793 / 0.770 | 0.913 / 0.694 / 0.654 / 0.674 | 0.963 / 0.778 / 0.676 / 0.724 | 0.921 / 0.769 / 0.675 / 0.719
Prior 2p5words | 0.909 / 0.752 / 0.786 / 0.768 | 0.917 / 0.713 / 0.651 / 0.680 | 0.965 / 0.798 / 0.680 / 0.735 | 0.918 / 0.758 / 0.668 / 0.711
Prior 2p10words | 0.911 / 0.748 / 0.802 / 0.774 | 0.914 / 0.698 / 0.649 / 0.672 | 0.964 / 0.804 / 0.656 / 0.722 | 0.920 / 0.770 / 0.664 / 0.713
Following 1p5words | 0.884 / 0.694 / 0.720 / 0.706 | 0.844 / 0.483 / 0.465 / 0.470 | 0.912 / 0.502 / 0.465 / 0.474 | 0.860 / 0.569 / 0.523 / 0.542
Following 1p10words | 0.911 / 0.778 / 0.745 / 0.761 | 0.916 / 0.725 / 0.614 / 0.665 | 0.965 / 0.814 / 0.664 / 0.731 | 0.917 / 0.778 / 0.628 / 0.695
Following 2p5words | 0.910 / 0.771 / 0.750 / 0.760 | 0.915 / 0.724 / 0.603 / 0.658 | 0.965 / 0.825 / 0.656 / 0.731 | 0.919 / 0.784 / 0.637 / 0.703
Following 2p10words | 0.908 / 0.769 / 0.743 / 0.756 | 0.912 / 0.708 / 0.606 / 0.653 | 0.966 / 0.820 / 0.672 / 0.739 | 0.920 / 0.781 / 0.650 / 0.710
Both 1p5words | 0.912 / 0.770 / 0.772 / 0.771 | 0.914 / 0.715 / 0.617 / 0.663 | 0.964 / 0.805 / 0.660 / 0.725 | 0.920 / 0.791 / 0.634 / 0.704
Both 1p10words | 0.912 / 0.774 / 0.765 / 0.769 | 0.915 / 0.707 / 0.638 / 0.671 | 0.963 / 0.796 / 0.656 / 0.719 | 0.917 / 0.764 / 0.647 / 0.701
Both 2p10words | 0.912 / 0.777 / 0.757 / 0.767 | 0.914 / 0.716 / 0.612 / 0.660 | 0.967 / 0.839 / 0.664 / 0.741 | 0.921 / 0.795 / 0.639 / 0.708
Both 2p5words | 0.912 / 0.777 / 0.761 / 0.769 | 0.915 / 0.716 / 0.621 / 0.665 | 0.964 / 0.815 / 0.652 / 0.724 | 0.919 / 0.783 / 0.641 / 0.705

Appendix B.2. Sentence Length Analysis

We also analyzed metaphor prevalence across different sentence length categories (Table A5). In short sentences (e.g., under 10 tokens), metaphor rates were very low (225 out of 3167), indicating that brevity limits the interpretive space for metaphor recognition. These results highlight the interaction between syntactic richness and the utility of auxiliary context. We recommend tailoring augmentation strategies based on sentence complexity and POS distribution.
Table A5. Metaphor prevalence by sentence length category in VUA_All dataset.
Length Category | Literal | Metaphor | Total | Prevalence (%)
Short | 2942 | 225 | 3167 | 7.1%

Appendix C. Effect of Context Length (5 Words vs. 10 Words)

We further investigated the impact of context length by comparing auxiliary sentences generated with either 5 words or 10 words. Table A6 summarizes the F1 scores across datasets and prompt types.
The results reveal that:
  • Across most settings, 5-word contexts achieve comparable or even superior performance to 10-word ones.
  • This trend is particularly notable on VUA_All and VUA_Verb, where 5-word contexts yielded higher F1 scores despite being shorter and less expensive to generate.
  • On MOH-X, 10-word prompts sometimes offer slight advantages (e.g., higher recall), but the difference in F1 is marginal.
These findings suggest that shorter contexts (5 words) can provide a good balance between performance and generation efficiency, making them a practical choice for large-scale applications.
Table A6. F1 scores by context length (5 vs. 10 words) across datasets and prompts.
Dataset | Context | Prompt | Length | F1 Score
MOH-X | Prior | P1 | 5 words | 0.8310
MOH-X | Prior | P1 | 10 words | 0.8359
MOH-X | Prior + Following | P1 | 5 words | 0.8444
MOH-X | Prior + Following | P1 | 10 words | 0.8493
VUA_All | Prior | P1 | 5 words | 0.7572
VUA_All | Prior | P1 | 10 words | 0.7571
VUA_All | Prior + Following | P1 | 5 words | 0.7871
VUA_All | Prior + Following | P1 | 10 words | 0.7819
VUA_Verb | Prior | P1 | 5 words | 0.6945
VUA_Verb | Prior | P1 | 10 words | 0.6849
VUA_Verb | Prior + Following | P1 | 5 words | 0.7532
VUA_Verb | Prior + Following | P1 | 10 words | 0.7633

Appendix D. Comparison of Prompt Strategies (P1 vs. P2)

We compared two prompt strategies used to generate auxiliary context: Prompt 1 (P1) represents a zero-shot setup with a simple instruction, whereas Prompt 2 (P2) is a few-shot prompt that includes examples to guide generation.
Table A7 shows the F1 scores across three datasets (MOH-X, VUA_All, VUA_Verb) under different context settings (Prior, Prior+Following). As the results indicate, the effectiveness of each prompt depends on dataset characteristics and context configuration.
  • On MOH-X, which consists of short and semantically ambiguous examples, P2 (few-shot) consistently outperforms P1 across all settings.
  • On VUA_All and VUA_Verb, P1 (zero-shot) shows more stable and often superior performance, especially under Prior+Following context.
  • These trends suggest that few-shot prompting is more effective in tightly scoped tasks with high ambiguity (e.g., MOH-X), while zero-shot prompting generalizes better in longer, syntactically diverse input (e.g., VUA datasets).
Table A7. F1 scores by prompt type (P1: zero-shot, P2: few-shot) across datasets and context settings.
Dataset | Context | Prompt | F1 Score
MOH-X | Prior | P1 (zero-shot) | 0.8310
MOH-X | Prior | P2 (few-shot) | 0.8487
MOH-X | Prior + Following | P1 (zero-shot) | 0.8444
MOH-X | Prior + Following | P2 (few-shot) | 0.8460
VUA_All | Prior | P1 (zero-shot) | 0.7572
VUA_All | Prior | P2 (few-shot) | 0.7596
VUA_All | Prior + Following | P1 (zero-shot) | 0.7871
VUA_All | Prior + Following | P2 (few-shot) | 0.7837
VUA_Verb | Prior | P1 (zero-shot) | 0.6945
VUA_Verb | Prior | P2 (few-shot) | 0.6965
VUA_Verb | Prior + Following | P1 (zero-shot) | 0.7633
VUA_Verb | Prior + Following | P2 (few-shot) | 0.7556
Overall, the findings underscore the importance of tailoring prompt strategies to the nature of the task. While few-shot prompting can help in precision-critical scenarios, zero-shot prompting often provides a more generalizable and cost-effective solution.

References

  1. Lakoff, G.; Johnson, M. Metaphors We Live By, with a New Afterword; University of Chicago Press: Chicago, IL, USA, 2003. [Google Scholar]
  2. Zhang, S.; Liu, Y. Metaphor Detection via Linguistics Enhanced Siamese Network. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4149–4159. Available online: https://aclanthology.org/2022.coling-1.364/ (accessed on 23 July 2025).
  3. Hayashi, T.; Sasaki, M. Metaphor Detection with Additional Auxiliary Context. In Proceedings of the 2024 16th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI 2024), Takamatsu, Japan, 6–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 121–126. [Google Scholar] [CrossRef]
  4. Pragglejaz Group. MIP: A Method for Identifying Metaphorically Used Words in Discourse. Metaphor. Symb. 2007, 22, 1–39. [Google Scholar] [CrossRef]
  5. Wilks, Y. Making Preferences More Active. Artif. Intell. 1978, 11, 197–223. [Google Scholar] [CrossRef]
  6. Choi, M.; Lee, S.; Choi, E.; Park, H.; Lee, J.; Lee, D.; Lee, J. MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1763–1773. [Google Scholar]
  7. Li, Y.; Wang, S.; Lin, C.; Guerin, F.; Barrault, L. FrameBERT: Conceptual Metaphor Detection with Frame Embedding Learning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), Dubrovnik, Croatia, 2–6 May 2023; pp. 1558–1563. [Google Scholar] [CrossRef]
  8. Zhang, S.; Liu, Y. Adversarial Multi-task Learning for End-to-end Metaphor Detection. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 1483–1497. [Google Scholar] [CrossRef]
  9. Jia, K.; Li, R. Enhancing Metaphor Detection through Soft Labels and Target Word Prediction. arXiv 2024, arXiv:2403.18253. Available online: https://arxiv.org/abs/2403.18253 (accessed on 9 February 2025). [CrossRef]
  10. Elzohbi, M.; Zhao, R. ContrastWSD: Enhancing Metaphor Detection with Word Sense Disambiguation Following the Metaphor Identification Procedure. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; ELRA and ICCL: Torino, Italy, 2024; pp. 3907–3915. Available online: https://aclanthology.org/2024.lrec-main.346/ (accessed on 23 July 2025).
  11. Jia, K.; Wu, Y.; Liu, M.; Li, R. Curriculum-style Data Augmentation for LLM-based Metaphor Detection. arXiv 2024, arXiv:2412.02956. Available online: https://arxiv.org/abs/2412.02956 (accessed on 23 July 2025).
  12. Liu, H.; He, C.; Meng, F.; Niu, C.; Jia, Y. LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification. arXiv 2024, arXiv:2408.05404. Available online: https://arxiv.org/abs/2408.05404 (accessed on 23 July 2025).
  13. De Luca Fornaciari, F.; Altuna, B.; González-Dios, I.; Melero, M. A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models. In Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024), Mexico City, Mexico, 21 June 2024; Ghosh, D., Muresan, S., Feldman, A., Chakrabarty, T., Liu, E., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 35–44. [Google Scholar] [CrossRef]
  14. Liu, E.; Cui, C.; Zheng, K.; Neubig, G. Testing the Ability of Language Models to Interpret Figurative Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2022), Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4437–4452. [Google Scholar] [CrossRef]
  15. Chakrabarty, T.; Choi, Y.; Shwartz, V. It’s Not Rocket Science: Interpreting Figurative Language in Narratives. Trans. Assoc. Comput. Linguist. 2022, 10, 589–606. [Google Scholar] [CrossRef]
  16. Bollegala, D.; Shutova, E. Metaphor Interpretation Using Paraphrases Extracted from the Web. PLoS ONE 2013, 8, e74304. [Google Scholar] [CrossRef] [PubMed]
  17. Mohammad, S.; Shutova, E.; Turney, P. Metaphor as a Medium for Emotion: An Empirical Study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016), Berlin, Germany, 11–12 August 2016; pp. 23–33. [Google Scholar]
  18. Steen, G.J.; Dorst, A.G.; Herrmann, J.B.; Kaal, A.A.; Krennmayr, T.; Pasma, T. A Method for Linguistic Metaphor Identification: From MIP to MIPVU; John Benjamins Publishing: Amsterdam, The Netherlands, 2010. [Google Scholar]
  19. Oh, S.; Huang, X.; Pink, M.; Hahn, M.; Demberg, V. A Tug-of-war between an Idiom’s Figurative and Literal Meanings in Large Language Models. arXiv 2025, arXiv:2506.01723. Available online: https://arxiv.org/abs/2506.01723 (accessed on 23 July 2025).
  20. Lin, Z.; Ma, Q.; Yan, J.; Chen, J. CATE: A Contrastive Pre-trained Model for Metaphor Detection with Semi-supervised Learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3888–3898. Available online: https://aclanthology.org/2021.emnlp-main.316/ (accessed on 23 July 2025).
  21. Song, W.; Zhou, S.; Fu, R.; Liu, T.; Liu, L. Verb Metaphor Detection via Contextual Relation Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4240–4251. [Google Scholar]
  22. Le, D.; Thai, M.; Nguyen, T. Multi-task Learning for Metaphor Detection with Graph Convolutional Neural Networks and Word Sense Disambiguation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8139–8146. [Google Scholar] [CrossRef]
  23. Yang, C.; Li, Z.; Liu, Z.; Huang, Q. Deep Learning-Based Knowledge Injection for Metaphor Detection: A Comprehensive Review. arXiv 2023, arXiv:2308.04306. Available online: https://arxiv.org/abs/2308.04306 (accessed on 23 July 2025).
Figure 1. Flowchart for Creating a Context-Enriched Dataset.
Figure 2. Flowcharts for two auxiliary context strategies: (a) prior context and (b) following context.
Figure 3. Flowchart: Auxiliary Context (Prior + Following).
Figure 4. Example dataset from MisNet (from VUA_All).
Table 1. Computational environment used for auxiliary sentence generation.
Component | Specification
CPU | Intel Core i7-10510U @ 1.80–2.30 GHz
RAM | 8 GB (7.77 GB usable)
GPU | Intel UHD Graphics (unused)
OS | Windows 11 Home 22H2
Python | 3.8.8
OpenAI SDK | 1.2.3
pandas | 2.0.1
Date(s) run | 2025-01-15 to 2025-01-20 (JST)
Table 2. API configuration used for auxiliary sentence generation.
Parameter | Value
Model | gpt-3.5-turbo
Temperature | 1.0
Top-p | 1.0
Top-k | Not applicable (OpenAI API)
Max tokens (response) | 64 (sufficient for 10-word outputs)
Stop sequences | Newline, EOS token
Seed | Not fixed (non-deterministic)
API library | openai v1.2.3 (Python)
Date range of generation | 2025-01-15 to 2025-01-20 (JST)
Hardware | See Table 1
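For reference, the configuration in Tables 1 and 2 corresponds to a request of the following form. This is a minimal sketch using the openai v1.x Python SDK listed in Table 2; the single-user-message layout is an assumption, and the prompt string itself is assembled from the templates in Table 3.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_auxiliary_sentence(prompt: str) -> str:
    """Request one auxiliary sentence with the generation settings from Table 2."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,   # Table 2
        top_p=1.0,         # Table 2
        max_tokens=64,     # sufficient for 10-word outputs (Table 2)
        stop=["\n"],       # newline stop sequence (Table 2)
    )
    return response.choices[0].message.content.strip()
```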
Table 3. Complete prompt templates and instantiations used to elicit auxiliary context from ChatGPT. Variables: N (target word budget), pos ∈ {precede, follow}, s (target sentence), w (target word), m ∈ {ε, not} encodes the metaphor label, where ε denotes the empty string. Examples are shown for both metaphorical and literal cases. Line breaks are added here for readability; actual API calls used single strings.
Prompt 1 Template:
Generate a N-word sentence that pos “s” and in which ‘w’ in “s” is m used as a metaphor.
Prompt 2 Template:
‘w’ in “Ex1” is used as a metaphor. ‘w’ in “Ex2” is not used as a metaphor. Generate a N-word sentence that pos “s” and in which ‘w’ in “s” is m used as a metaphor.
Metaphor case example (N = 5, pos = precede, w = sharp, s = “His words were sharp.”):
Prompt 1: Generate a 5-word sentence that precedes “His words were sharp.” and in which ‘sharp’ in “His words were sharp.” is used as a metaphor.
Prompt 2: ‘sharp’ in “Her criticism cut deep.” is used as a metaphor. ‘sharp’ in “The knife is sharp.” is not used as a metaphor. Generate a 5-word sentence that precedes “His words were sharp.” and in which ‘sharp’ in “His words were sharp.” is used as a metaphor.
Sample output: Their debate sliced like blades.
Literal case example (N = 5, pos = follow, w = absorb, s = “He will absorb water.”):
Prompt 1: Generate a 5-word sentence that follows “He will absorb water.” and in which ‘absorb’ in “He will absorb water.” is not used as a metaphor.
Prompt 2: ‘absorb’ in “The sponge absorbed juice.” is not used as a metaphor. ‘absorb’ in “She absorbed the culture.” is used as a metaphor. Generate a 5-word sentence that follows “He will absorb water.” and in which ‘absorb’ in “He will absorb water.” is not used as a metaphor.
Sample output: The towel soaked it quickly.
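To make the template variables concrete, the sketch below assembles Prompt 1 and Prompt 2 strings from (N, pos, s, w, m) as laid out in Table 3. The function names and argument layout are illustrative assumptions rather than the authors' implementation; the few-shot example pair (Ex1/Ex2) is supplied by the caller.

```python
def build_prompt_1(n: int, pos: str, s: str, w: str, is_metaphor: bool) -> str:
    """Zero-shot template (Prompt 1) from Table 3; pos is 'precedes' or 'follows'."""
    m = "" if is_metaphor else "not "
    return (f'Generate a {n}-word sentence that {pos} "{s}" '
            f'and in which \'{w}\' in "{s}" is {m}used as a metaphor.')


def build_prompt_2(n: int, pos: str, s: str, w: str, is_metaphor: bool,
                   ex_metaphor: str, ex_literal: str) -> str:
    """Few-shot template (Prompt 2): two label-contrasting demonstrations + Prompt 1."""
    met_demo = f'\'{w}\' in "{ex_metaphor}" is used as a metaphor.'
    lit_demo = f'\'{w}\' in "{ex_literal}" is not used as a metaphor.'
    # In the Table 3 instantiations, the demonstration matching the target label comes first.
    first, second = (met_demo, lit_demo) if is_metaphor else (lit_demo, met_demo)
    return f"{first} {second} " + build_prompt_1(n, pos, s, w, is_metaphor)


# Reproduces the metaphor-case Prompt 1 instantiation shown in Table 3.
print(build_prompt_1(5, "precedes", "His words were sharp.", "sharp", True))
```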
Table 4. Training/evaluation environment.
Component | Specification
CPU | AMD Ryzen Threadripper PRO 3955WX (16C/32T)
RAM | 256 GB DDR4-3200
GPU | NVIDIA RTX A6000 (48 GB)
OS | Ubuntu 20.04 LTS
Python | 3.11.3
PyTorch | 2.0.1
Transformers | 4.29.2
scikit-learn | 1.2.2
Table 5. Fine-tuning hyperparameters for MisNet across different datasets.
Hyperparameter | MOH-X | VUA_All | VUA_Verb
Learning rate (lr) | 3 × 10^-5 | 3 × 10^-5 | 3 × 10^-5
Training epochs | 15 | 15 | 15
Warm-up epochs | 2 | 2 | 2
Batch size (train) | 16 | 64 | 64
Batch size (validation) | 32 | 64 | 32
Class weight | 1 | 5 | 4
first_last_avg | False | True | True
use_pos | False | True | False
max_left_len | 25 | 135 | 140
max_right_len | 70 | 90 | 60
Dropout rate | 0.2 | 0.2 | 0.2
Embedding dimension | 768 | 768 | 768
Number of classes | 2 | 2 | 2
Number of attention heads | 12 | 12 | 12
PLM | roberta-base | roberta-base | roberta-base
use_context | True | True | True
use_eg_sent | True | True | True
cat_method | cat_abs_dot | cat_abs_dot | cat_abs_dot
Table 6. Dataset Statistics (adapted from [2]). #Sent. = number of sentences, #Target = number of target words, %Met. = proportion of metaphors, Avg. Len = average sentence length.
Dataset | #Sent. | #Target | %Met. | Avg. Len
VUA_All (train) | 6323 | 116,622 | 11.19 | 18.4
VUA_All (validation) | 1550 | 38,628 | 11.62 | 24.9
VUA_All (test) | 2694 | 50,175 | 12.44 | 18.6
VUA_Verb (train) | 7479 | 15,516 | 27.90 | 20.2
VUA_Verb (validation) | 1541 | 1724 | 26.91 | 25.0
VUA_Verb (test) | 2694 | 5873 | 29.98 | 18.6
MOH-X | 647 | 647 | 48.69 | 8.0
Table 7. POS Tag List (Source: https://qiita.com/kei_0324/items/400f639b2f185b39a0cf (accessed on 24 July 2025)).
Label | Meaning | Examples
ADJ | Adjective | big, green, incomprehensible
ADP | Adposition | in, to, during
ADV | Adverb | very, well, exactly
CCONJ | Coordinating Conjunction | and, or, but
DET | Determiner | the, a, an, my, your, one, ten
INTJ | Interjection | ouch, bravo, hello
NOUN | Noun | girl, air, beauty
NUM | Numeral | 0, 3.14, one, MMXIV
PART | Particle | ’s, not
PRON | Pronoun | I, you, they, who, everybody
PROPN | Proper noun | Mary, NATO, HBO
PUNCT | Punctuation | ., (), :
SYM | Symbol | %, ©, +, =
VERB | Verb | run, ate, eating
X | Other | Out-of-vocabulary words
Table 8. Evaluation Metrics on MOH-X (Prior Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8372 | 0.8302 | 0.8393 | 0.8310
P1_10words | 0.8423 | 0.8405 | 0.8378 | 0.8359
P2_5words | 0.8547 | 0.8619 | 0.8392 | 0.8487
P2_10words | 0.8500 | 0.8432 | 0.8653 | 0.8498
Table 9. Evaluation Metrics on VUA_All (Validation and Test, Prior Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9417 | 0.7438 | 0.7755 | 0.7571 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9411 | 0.7444 | 0.7763 | 0.7572 | 0.9399 | 0.7680 | 0.7547 | 0.7592
P1_10words | 0.9409 | 0.7384 | 0.7823 | 0.7571 | 0.9403 | 0.7680 | 0.7603 | 0.7615
P2_5words | 0.9419 | 0.7439 | 0.7810 | 0.7596 | 0.9401 | 0.7668 | 0.7591 | 0.7605
P2_10words | 0.9413 | 0.7451 | 0.7808 | 0.7595 | 0.9401 | 0.7695 | 0.7568 | 0.7606
Table 10. Evaluation Metrics on VUA_Verb (Validation and Test, Prior Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8138 | 0.6643 | 0.7523 | 0.6945 | 0.8004 | 0.6832 | 0.7275 | 0.6924
P1_10words | 0.8077 | 0.6587 | 0.7423 | 0.6849 | 0.7957 | 0.6785 | 0.7210 | 0.6859
P2_5words | 0.8103 | 0.6509 | 0.7595 | 0.6917 | 0.8004 | 0.6735 | 0.7426 | 0.6965
P2_10words | 0.8133 | 0.6667 | 0.7479 | 0.6929 | 0.8001 | 0.6887 | 0.7157 | 0.6885
Table 11. Evaluation Metrics on MOH-X (Following Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8469 | 0.8455 | 0.8508 | 0.8429
P1_10words | 0.8315 | 0.8052 | 0.8716 | 0.8335
P2_5words | 0.8376 | 0.8314 | 0.8453 | 0.8340
P2_10words | 0.8333 | 0.8254 | 0.8487 | 0.8333
Table 12. Evaluation Metrics on VUA_All (Validation and Test, Following Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9417 | 0.7438 | 0.7755 | 0.7571 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9511 | 0.8123 | 0.7531 | 0.7815 | 0.9480 | 0.8271 | 0.7357 | 0.7787
P1_10words | 0.9522 | 0.8134 | 0.7638 | 0.7878 | 0.9476 | 0.8231 | 0.7370 | 0.7777
P2_5words | 0.9511 | 0.8060 | 0.7629 | 0.7838 | 0.9474 | 0.8225 | 0.7357 | 0.7767
P2_10words | 0.9511 | 0.8097 | 0.7575 | 0.7827 | 0.9480 | 0.8288 | 0.7338 | 0.7784
Table 13. Evaluation Metrics on VUA_Verb (Validation and Test, Following Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8648 | 0.7705 | 0.7091 | 0.7385 | 0.8442 | 0.7677 | 0.6888 | 0.7261
P1_10words | 0.8718 | 0.7730 | 0.7414 | 0.7569 | 0.8359 | 0.7332 | 0.7115 | 0.7222
P2_5words | 0.8561 | 0.7288 | 0.7414 | 0.7350 | 0.8561 | 0.7288 | 0.7414 | 0.7350
P2_10words | 0.8683 | 0.7926 | 0.6918 | 0.7388 | 0.8423 | 0.8077 | 0.6224 | 0.7030
Table 14. Evaluation Metrics on MOH-X (Prior and Following Context). Underlined values indicate the best performance in each metric.
MOH-X | Acc | Prec | Rec | F1
Original | 0.8286 | 0.8190 | 0.8445 | 0.8275
P1_5words | 0.8511 | 0.8549 | 0.8398 | 0.8444
P1_10words | 0.8453 | 0.8205 | 0.8870 | 0.8493
P2_5words | 0.8529 | 0.8601 | 0.8357 | 0.8460
P2_10words | 0.8386 | 0.8248 | 0.8586 | 0.8385
Table 15. Evaluation Metrics on VUA_All (Validation and Test, Prior and Following Context). Underlined values indicate the best performance in each metric.
VUA_All | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.9409 | 0.7719 | 0.7551 | 0.7613 | 0.9409 | 0.7719 | 0.7551 | 0.7613
P1_5words | 0.9517 | 0.8069 | 0.7682 | 0.7871 | 0.9489 | 0.8265 | 0.7455 | 0.7839
P1_10words | 0.9511 | 0.8122 | 0.7537 | 0.7819 | 0.9477 | 0.8271 | 0.7327 | 0.7770
P2_5words | 0.9507 | 0.7998 | 0.7682 | 0.7837 | 0.9479 | 0.8219 | 0.7423 | 0.7801
P2_10words | 0.9512 | 0.8115 | 0.7553 | 0.7824 | 0.9485 | 0.8304 | 0.7370 | 0.7809
Table 16. Evaluation Metrics on VUA_Verb (Validation and Test, Prior and Following Context). Underlined values indicate the best performance in each metric.
VUA_Verb | Val Acc | Val Prec | Val Rec | Val F1 | Test Acc | Test Prec | Test Rec | Test F1
Original | 0.8112 | 0.6533 | 0.7601 | 0.6932 | 0.8021 | 0.6799 | 0.7374 | 0.6972
P1_5words | 0.8666 | 0.7500 | 0.7565 | 0.7532 | 0.8401 | 0.7485 | 0.7030 | 0.7250
P1_10words | 0.8759 | 0.7841 | 0.7435 | 0.7633 | 0.8445 | 0.7657 | 0.6939 | 0.7280
P2_5words | 0.8672 | 0.7484 | 0.7629 | 0.7556 | 0.8418 | 0.7405 | 0.7274 | 0.7339
P2_10words | 0.8625 | 0.7506 | 0.7328 | 0.7415 | 0.8369 | 0.7394 | 0.7041 | 0.7213
Table 18. Two-tailed t-test results (P(T ≤ t)) for prior context on MOH-X. Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.8696 | 0.0687 † | 0.2485 | 0.0613 †
P1_5words | – | 0.7731 | 0.6477 | 0.6748
P1_10words | – | – | 0.2173 | 0.7683
P2_5words | – | – | – | 0.8696
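The pairwise comparisons reported in Tables 18–32 are standard two-tailed t-tests. The sketch below shows one way to obtain P(T ≤ t) two-tail with scipy.stats, assuming paired scores (e.g., per-run F1 values) for the two systems being compared; the exact pairing unit is an assumption, and the input numbers are purely illustrative, not values from this study.

```python
from scipy import stats

def two_tailed_paired_t(scores_a, scores_b):
    """Two-tailed paired t-test; returns the t statistic and the two-tailed p-value."""
    result = stats.ttest_rel(scores_a, scores_b)  # two-sided by default
    return result.statistic, result.pvalue

# Hypothetical per-run F1 scores, for illustration only.
baseline  = [0.827, 0.824, 0.830, 0.826, 0.829]
augmented = [0.836, 0.833, 0.840, 0.835, 0.838]
t_stat, p_value = two_tailed_paired_t(baseline, augmented)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```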
Table 19. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_All (Validation Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.77 × 10^-15 § | 1.03 × 10^-22 § | 4.41 × 10^-16 § | 2.19 × 10^-16 §
P1_5words | – | 0.0386 ‡ | 1.0000 | 0.7897
P1_10words | – | – | 0.0378 ‡ | 0.0700 †
P2_5words | – | – | – | 0.7879
Table 20. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_All (Test Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.36 × 10^-13 § | 6.13 × 10^-23 § | 3.69 × 10^-17 § | 2.07 × 10^-15 §
P1_5words | – | 0.0071 § | 0.2083 | 0.4253
P1_10words | – | – | 0.1779 | 0.0643 †
P2_5words | – | – | – | 0.6544
Table 21. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_Verb (Validation Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.00091 § | 0.0272 ‡ | 0.1980 | 0.0136 ‡
P1_5words | – | 0.2637 | 0.0416 ‡ | 0.4424
P1_10words | – | – | 0.3456 | 0.7631
P2_5words | – | – | – | 0.1798
Table 22. Two-tailed t-test results (P(T ≤ t)) for prior context on VUA_Verb (Test Set). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates that symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 5.26 × 10^-5 § | 0.2173 | 0.1036 | 0.00045 §
P1_5words | – | 0.0048 § | 0.0136 ‡ | 0.6896
P1_10words | – | – | 0.7273 | 0.0213 ‡
P2_5words | – | – | – | 0.0395 ‡
Table 23. Two-tailed t-test results (P(T ≤ t)) on MOH-X (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.1318 | 0.3549 | 0.3177 | 0.0768 †
P1_5words | – | 0.5168 | 0.6398 | 0.7633
P1_10words | – | – | 0.8789 | 0.3177
P2_5words | – | – | – | 0.4565
Table 24. Two-tailed t-test results (P(T ≤ t)) on VUA_All Validation Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.77 × 10^-15 § | 1.03 × 10^-22 § | 4.41 × 10^-16 § | 2.19 × 10^-16 §
P1_5words | – | 0.0386 ‡ | 1.0000 | 0.7897
P1_10words | – | – | 0.0378 ‡ | 0.0700 †
P2_5words | – | – | – | 0.7879
Table 25. Two-tailed t-test results (P(T ≤ t)) on VUA_All Test Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.36 × 10^-13 § | 6.13 × 10^-23 § | 3.69 × 10^-17 § | 2.07 × 10^-15 §
P1_5words | – | 0.0071 § | 0.2083 | 0.4253
P1_10words | – | – | 0.1779 | 0.0643 †
P2_5words | – | – | – | 0.6544
Table 26. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Validation Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.00091 § | 0.0272 ‡ | 0.1980 | 0.0136 ‡
P1_5words | – | 0.2637 | 0.0416 ‡ | 0.4424
P1_10words | – | – | 0.3456 | 0.7631
P2_5words | – | – | – | 0.1798
Table 27. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Test Set (Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 5.26 × 10^-5 § | 0.2173 | 0.1036 | 0.00045 §
P1_5words | – | 0.0048 § | 0.0136 ‡ | 0.6896
P1_10words | – | – | 0.7273 | 0.0213 ‡
P2_5words | – | – | – | 0.0395 ‡
Table 28. Two-tailed t-test results (P(T ≤ t)) on MOH-X (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
MOH-X | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.2766 | 0.0567 † | 0.2485 | 0.4146
P1_5words | – | 0.3308 | 0.8733 | 0.7240
P1_10words | – | – | 0.2892 | 0.2061
P2_5words | – | – | – | 0.7521
Table 29. Two-tailed t-test results (P(T ≤ t)) on VUA_All Validation Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 1.32 × 10^-16 § | 8.35 × 10^-13 § | 2.21 × 10^-11 § | 1.72 × 10^-19 §
P1_5words | – | 0.2650 | 0.1155 | 0.4032
P1_10words | – | – | 0.6795 | 0.0509 †
P2_5words | – | – | – | 0.0209 ‡
Table 30. Two-tailed t-test results (P(T ≤ t)) on VUA_All Test Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_All (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 3.18 × 10^-31 § | 1.28 × 10^-22 § | 3.37 × 10^-36 § | 1.48 × 10^-43 §
P1_5words | – | 0.0504 † | 0.1953 | 0.0098 §
P1_10words | – | – | 0.0010 § | 5.51 × 10^-6 §
P2_5words | – | – | – | 0.1828
Table 31. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Validation Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Validation) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.4128 | 0.3589 | 0.1176 | 0.0483 ‡
P1_5words | – | 0.0750 † | 0.0153 ‡ | 0.4369
P1_10words | – | – | 0.4915 | 0.3321
P2_5words | – | – | – | 0.0990 †
Table 32. Two-tailed t-test results (P(T ≤ t)) on VUA_Verb Test Set (Prior and Following Context). Significance levels: † p ≤ 0.10, ‡ p ≤ 0.05, § p ≤ 0.01. A dash (–) indicates symmetric comparisons were omitted.
VUA_Verb (Test) | P1_5w | P1_10w | P2_5w | P2_10w
Original_data | 0.3920 | 0.0003 § | 3.38 × 10^-7 § | 0.6121
P1_5words | – | 0.0050 § | 8.94 × 10^-6 § | 0.1630
P1_10words | – | – | 0.1603 | 3.65 × 10^-5 §
P2_5words | – | – | – | 3.40 × 10^-8 §
Table 33. Representative failure modes observed in ChatGPT-generated auxiliary context. Manual counts (n = 300 per dataset).
Error Type | MOH-X | VUA_All | VUA_Verb | Illustration
Semantic drift (topic mismatch) | 12.0% | 18.0% | 15.0% | Original: He swallowed his words. Aux: Despite his certainty, doubt engulfed him and made him silent.
Polarity flip (label mismatch) | 4.0% | 6.0% | 5.0% | Original (literal): Acknowledge the deed. Aux: She refused to acknowledge the deed of kindness done for her.
Redundancy/near-duplicate | 9.0% | 3.0% | 5.0% | Original: Just trying to. Aux: Just trying to learn something new.
Pronoun coref failure | 0.0% | 11.0% | 8.0% | Original: Come on sweetheart! Aux: Whispered winds beckon, come on sweetheart!
Over-long named entities | 2.0% | 5.0% | 4.0% | Original: The vessel was shipwrecked. Aux: The storm caused extensive damage, leaving the vessel stranded.
Table 34. Semantic fidelity of generated context across metaphor labels (VUA_All train + validation).
Label | Instances | Faithful | Fidelity Rate
Literal | 6469 | 6096 | 94.24%
Metaphor | 1831 | 1538 | 83.99%
Total | 8300 | 7634 | 91.97%
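For clarity, the fidelity rate in Table 34 is simply the share of generated contexts judged faithful to their assigned label:

\[
\text{Fidelity Rate} = \frac{\text{Faithful}}{\text{Instances}} \times 100\%,
\qquad \text{e.g., literal: } \frac{6096}{6469} \approx 94.2\%.
\]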
Table 35. Token volume, cost, and processing time for each dataset. Instances are not summed in the Total row due to dataset partitioning.
Dataset | Instances | Mean Tokens/Request | Total Tokens | Estimated Cost (USD) | Wall-Clock (h)
MOH-X | 647 | 42 | 27,174 | 0.54 | 0.1
VUA_All | 205,425 | 38 | 7.8 × 10^6 | 155.8 | 11.2
VUA_Verb | 23,113 | 39 | 9.0 × 10^5 | 18.0 | 1.6
Total | 229,185 | – | 8.7 × 10^6 | 174.3 | 12.9
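The cost column in Table 35 is consistent with a flat rate of roughly USD 0.02 per 1K tokens (for example, 27,174 tokens × USD 0.00002 per token ≈ USD 0.54). The sketch below reproduces that arithmetic; the rate constant is inferred from the table itself, not quoted from official pricing.

```python
# Rough cost check for Table 35. The per-token rate is inferred from the table
# (about USD 0.02 per 1K tokens); actual API pricing may differ.
USD_PER_TOKEN = 0.02 / 1000

token_totals = {"MOH-X": 27_174, "VUA_All": 7.8e6, "VUA_Verb": 9.0e5}
for dataset, tokens in token_totals.items():
    print(f"{dataset}: ~USD {tokens * USD_PER_TOKEN:.2f}")
# MOH-X: ~USD 0.54, VUA_All: ~USD 156.00, VUA_Verb: ~USD 18.00 (cf. 0.54 / 155.8 / 18.0 in Table 35)
```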
Table 36. Latency statistics (seconds) per prompt type in VUA_All training set (N = 2833).
Metric | 1p | 1p-5b | 1p-10b | 2p5b | 2p10b | 1p-5f5b | 1p-10f10b | 2p5f5b | 2p10f10b
Mean | 0.789 | 0.520 | 1.566 | 0.509 | 0.566 | 0.505 | 0.612 | 0.469 | 0.535
Std Dev | 0.507 | 0.183 | 2.009 | 0.136 | 0.214 | 0.168 | 0.253 | 0.132 | 0.174
Min | 0.329 | 0.314 | 0.398 | 0.365 | 0.385 | 0.362 | 0.388 | 0.356 | 0.375
Max | 4.960 | 1.224 | 16.123 | 1.174 | 1.290 | 1.154 | 1.537 | 1.234 | 1.499
Table 37. Latency statistics (seconds) per prompt type in VUA_All validation set (N = 334).
Metric | 1p | 1p-5b | 1p-10b | 2p5b | 2p10b | 1p-5f5b | 1p-10f10b | 2p5f5b | 2p10f10b
Mean | 0.671 | 0.502 | 0.591 | 0.558 | 0.622 | 0.510 | 0.555 | 0.581 | 0.513
Std Dev | 0.379 | 0.361 | 0.156 | 0.228 | 0.417 | 0.321 | 0.378 | 0.395 | 0.143
Min | 0.387 | 0.291 | 0.412 | 0.364 | 0.422 | 0.352 | 0.340 | 0.308 | 0.330
Max | 2.293 | 0.911 | 1.414 | 1.104 | 1.335 | 1.141 | 1.391 | 1.373 | 1.456
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
