Definition-Anchored Unsupervised Word Sense Induction Using LLM-Generated Glosses

Yoshikawa, Shota; Sasaki, Minoru

doi:10.3390/app16083797


by Shota Yoshikawa and Minoru Sasaki *

Major in Computer and Information Sciences, Graduate School of Science and Engineering, Ibaraki University, Mito 310-8512, Japan

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3797; https://doi.org/10.3390/app16083797

Submission received: 17 March 2026 / Revised: 8 April 2026 / Accepted: 8 April 2026 / Published: 13 April 2026

(This article belongs to the Special Issue The Advanced Trends in Natural Language Processing)


Abstract

Word sense induction (WSI) aims to automatically discover the different senses of a word from contextual usage without predefined sense inventories. However, existing distributional clustering methods often suffer from dominant-sense bias and struggle to correctly identify minority senses. In this paper, we propose a definition-anchored reclassification framework for WSI that leverages large language models (LLMs) to generate explicit sense descriptions and refine cluster assignments. Unlike purely distributional approaches, our method explicitly introduces LLM-generated semantic definitions as anchors in the induction process. Because it shifts the decision process from geometric clustering to definition-based semantic matching, the method improves instance-level alignment, although this introduces a trade-off with global structural consistency. Experiments on the SemEval-2010 and SemEval-2013 datasets demonstrate that the proposed method consistently outperforms traditional clustering baselines and existing WSI systems across both structural metrics (NMI and V-measure) and instance-level metrics (F-B³ and Fuzzy-F-B³). In particular, our approach effectively mitigates dominant-sense bias and improves the recovery of minority senses by preserving them as distinct clusters while correctly assigning their instances. These results suggest that explicit semantic representations generated by LLMs provide a promising direction for addressing long-standing challenges in unsupervised word sense induction.

Keywords: word sense induction; large language models; clustering; semantic definitions

1. Introduction

Word Sense Disambiguation (WSD) is a fundamental task in natural language processing that supports applications such as machine translation and information extraction. However, supervised WSD approaches depend on fixed sense inventories derived from lexical resources. This limits their coverage of novel words, emerging usages, and domain-specific terminology. This limitation is often referred to as the knowledge bottleneck [1].

Word sense induction (WSI) avoids this dependence by discovering senses directly from contextual usage, without a predefined sense inventory. Recent WSI methods primarily employ contextualized embeddings from pre-trained language models (PLMs) such as BERT [2] because they encode fine-grained contextual variation [3,4]. Although these approaches improve on earlier methods, they remain sensitive to skewed sense distributions. In real text, sense frequencies follow a long-tail distribution in which dominant senses account for most instances [5]. Under these conditions, minority senses occupy sparse regions of the embedding space and are often absorbed into dense clusters during unsupervised clustering [6]. Because existing methods rely exclusively on the geometry of contextual representations, they have limited ability to mitigate this frequency-driven bias, which results in the systematic under-detection of rare senses [7].

We propose a definition-driven unsupervised WSI framework that introduces explicit semantic constraints to complement implicit embedding-based signals. Our key insight is that large language models (LLMs) can generate explicit sense definitions that characterize semantic boundaries independently of instance frequency. In contrast, contextual embeddings primarily reflect distributional patterns. The proposed method operates in three stages: (1) initial clustering of contextualized embeddings to obtain preliminary sense groups, (2) extraction of discriminative features from each cluster using a linear SVM, and (3) prompting an LLM to generate sense definitions conditioned on these features, which then serve as semantic anchors for reassigning instances. This refinement loop enables minority senses to be stabilized around explicit definitions rather than being subsumed by dominant clusters.

We evaluate our approach on standard WSI benchmarks and investigate the following research questions:

RQ1 (Effectiveness): Does incorporating LLM-generated definitions as classification anchors improve WSI performance, particularly for minority senses that are prone to conflation in embedding space?
RQ2 (Feature Contribution): How do discriminative features (SVM-derived keywords) and topical features (PMI-based co-occurrences) contribute to the quality of generated definitions and downstream sense classification?
RQ3 (Robustness): To what extent does the proposed method maintain performance under varying degrees of sense distribution imbalance, compared to purely embedding-based baselines?

Our contributions are as follows:

(RQ1) We propose a definition-driven refinement framework for WSI. The method feeds unsupervised clustering results into an LLM to generate sense definitions, which are then used to refine cluster boundaries.
(RQ2) We analyze the feature requirements for definition generation, showing that explicit discriminative features extracted via SVM are essential for generating accurate and distinctive sense descriptions.
(RQ3) We demonstrate that the proposed method improves robustness under skewed sense distributions, particularly enhancing minority sense detection while maintaining competitive overall performance.

2. Related Work

2.1. Word Sense Induction via Clustering of Contextualized Representations

WSI is commonly formulated as an unsupervised clustering problem over contextual representations of target words. Earlier approaches relied on sparse lexical features and traditional clustering algorithms such as k-means or hierarchical clustering [8,9]. Recent studies have improved WSI by adopting contextualized embeddings from neural language models [2,3], which capture fine-grained semantic variation across usages [10,11].

A representative example is the method of Amrami and Goldberg [4], which combines contextual embeddings from a neural biLM with symmetric dependency patterns to induce sense clusters, achieving strong performance on standard benchmarks. Subsequent work further refined this substitution-based approach [12]. Despite differences in model architectures or feature augmentation, most modern WSI methods share a core assumption: sense distinctions are recoverable from the geometry of the contextual embedding space using distance-based or density-based clustering [13]. Our work departs from this purely geometric approach by introducing an additional signal for cluster assignment: linguistic features represented explicitly as sense definitions generated by an LLM.

2.2. Limitations of Existing WSI Under Skewed Sense Distributions

A fundamental challenge in WSI is the highly imbalanced distribution of word senses in natural language. The frequency of word senses typically follows Zipf’s law [14], where the most frequent senses account for the majority of occurrences, while low-frequency senses appear rarely [1,15]. This imbalance causes difficulties for clustering-based methods, which tend to prefer high-density regions in the embedding space.

As a result, instances of minority senses are often absorbed into clusters corresponding to dominant senses, leading to degraded recall for rare meanings [16,17]. This phenomenon persists with expressive contextualized representations and is observed in strong baselines such as neural embedding-based WSI [10] and nonparametric clustering methods like HDP [18,19,20]. While some studies acknowledge dominant sense bias, most studies remain within the same geometric framework and do not fundamentally address the reliance on frequency-driven spatial dominance.

2.3. Gloss-Aware and Definition-Based Sense Modeling

Glosses and definitions have been widely used in sense modeling. Traditional methods exploit overlaps between context words and dictionary glosses [21,22], whereas more recent approaches embed glosses in continuous vector spaces and compare them with contextual representations [23,24]. These techniques have proven effective mainly in supervised or weakly supervised WSD settings [6,25,26,27].

In the context of WSI, methods that rely on glosses or definitions are less commonly explored, primarily because WSI operates without predefined sense inventories [1,28]. Existing approaches typically treat glosses as static external resources rather than dynamically derived semantic representations [29,30].

While prior work has explored gloss-based representations and LLM-generated descriptions, these are mainly used for interpretation or in supervised settings. In contrast, our approach integrates dynamically generated definitions as operational components within the WSI pipeline, using them to refine cluster assignments. This highlights a key distinction from prior work, where definitions are treated as auxiliary explanations rather than as active elements in the induction process.

2.4. LLM-Assisted Semantic Interpretation in WSI

Recently, LLMs have been increasingly utilized for semantic interpretation and explanatory tasks [31]. Within the context of WSI, previous studies have primarily leveraged LLMs to perform subsequent analysis or labeling of induced clusters by exploiting their generative capabilities to produce interpretable, human-readable descriptions. Recent comprehensive evaluations have demonstrated that LLMs, including ChatGPT (e.g., gpt-3.5-turbo), exhibit varying degrees of competence across different NLP tasks, including word sense disambiguation [32,33]. The technique of few-shot prompting has also been used to generate sense summaries from example sentences [34].

However, in existing studies, LLMs primarily serve as interpretive tools rather than as components that influence clustering decisions. Even when specific models such as GPT [35] or Gemini [36] are employed, the generated descriptions are not incorporated back into the induction pipeline. In contrast, our approach integrates LLM-generated definitions directly into the refinement process. These definitions act as semantic anchors that actively reshape sense assignments, rather than merely providing explanatory descriptions.

3. Proposed Method

To mitigate the majority-sense bias that arises when classifying example sentences of a target word into senses, we propose a novel word sense induction framework that integrates WSI with large language models (LLMs). Our method consists of three stages: (1) initial clustering using WSI-NS, (2) generation of sense definitions with an LLM, and (3) reclassification of test instances based on the generated definitions. Figure 1 provides an overview of the proposed framework, including the initial clustering, LLM-based definition generation, and definition-guided reclassification stages.

Formal Problem Definition.

Let $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ denote a set of instances, where each $x_{i}$ represents a contextual occurrence of a target word.

An initial clustering method produces a set of induced sense clusters:

$$C = \{C_{1}, C_{2}, \dots, C_{K}\}, \quad \bigcup_{k = 1}^{K} C_{k} = X$$

For each cluster $C_{k}$, we extract two types of keyword sets:

$$S_{k} = f_{svm}(C_{k}), \quad P_{k} = f_{pmi}(C_{k})$$

where $f_{svm}$ and $f_{pmi}$ denote keyword extraction functions based on SVM weights and PMI scores, respectively.

We also sample a set of representative instances:

$$E_{k} \subseteq C_{k}$$

Using these inputs, we generate a semantic anchor (definition) for each cluster:

$$D_{k} = \mathrm{LLM}(S_{k}, P_{k}, E_{k})$$

Finally, for a given test instance $x$, the predicted sense label is determined by:

$$\hat{y}(x) = \arg\max_{k \in \{1, \dots, K\}} \mathrm{Match}(x, D_{k})$$

where $\mathrm{Match}(x, D_{k})$ is implemented via prompting an LLM to select the most semantically appropriate definition given the context $x$.

3.1. Initial Clustering

Given a set of instances containing occurrences of a target word in context, we first apply an existing WSI method to obtain preliminary sense clusters. We adopt WSI-NS [4] as our base clustering method due to its strong performance on standard benchmarks.

WSI-NS represents each instance by combining contextualized embeddings from ELMo [3] with TF-IDF vectors of substitute words obtained via symmetric patterns. Agglomerative clustering is then applied to partition instances into sense groups. Following the original implementation, we employ soft clustering to allow instances near cluster boundaries to belong to multiple clusters probabilistically.

The output of this stage is a set of $K$ clusters $\{C_{1}, C_{2}, \dots, C_{K}\}$, where each cluster $C_{k}$ contains instances that are hypothesized to share a common word sense. These clusters serve as the input to the subsequent definition generation stage. We select WSI-NS as the base clustering method because it effectively combines contextual embeddings with substitution-based features, which has been shown to produce strong and stable sense clusters in prior work. In addition, its soft clustering mechanism allows instances near cluster boundaries to have probabilistic memberships, which is particularly beneficial for subsequent definition generation, as it provides more diverse and representative inputs for each sense.

3.2. Sense Definition Generation with LLMs

Unlike purely distributional clustering, which relies on the geometric density of embeddings, the proposed method introduces explicit semantic constraints through natural language definitions. These definitions provide a global semantic reference that is less sensitive to frequency imbalance. As a result, minority senses, which may be sparsely distributed in the embedding space, can still be identified through their conceptual coherence rather than their spatial density. This shifts the decision boundary from local geometry to higher-level semantic alignment, thereby mitigating dominant-sense bias.

We define semantic anchors as explicit natural language definitions that provide high-level semantic constraints for cluster assignment. These anchors serve as interpretable semantic references that guide the reclassification of instances.

Specifically, each semantic anchor corresponds to a single induced sense cluster and summarizes its core semantic characteristics in a concise form.

For each induced sense cluster, we use an LLM to generate a natural language definition describing the meaning of that cluster. This step aims to mitigate the difficulty of identifying minority senses caused by skewed sense distributions and to provide explicit semantic cues for the subsequent classification stage. In our experiments, we evaluate multiple LLMs (GPT-5, GPT-4o, and Gemini-2.5-Flash) for definition generation and classification.

Inputs to the LLM.

We extract discriminative and salient keywords from each cluster to guide definition generation. We train a one-vs-rest linear SVM with L1 regularization and balanced class weights, and select the top 10 keywords ranked by absolute SVM weight. In addition, we compute a Pointwise Mutual Information (PMI)-based salience score defined as the ratio between the cluster-specific feature probability and the global feature probability, and select the top 10 keywords per cluster accordingly.

SVM keywords emphasize inter-cluster discrimination. In contrast, PMI keywords capture salient conceptual features within each cluster. These signals are complementary and particularly helpful for identifying minority senses. The number of keywords is fixed across clusters to avoid bias toward larger clusters.

Each cluster is represented by (i) SVM keywords, (ii) PMI keywords, and (iii) representative example sentences obtained from the initial clustering. For each cluster, we randomly sample up to 20 example sentences to capture diverse contextual realizations of the sense. This strategy mitigates the risk of overfitting the definition to a single highly frequent or central instance.

Based on these inputs, the LLM infers the shared concept underlying the cluster and generates a concise one-sentence sense definition.

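
The PMI-based salience score described above (cluster-specific feature probability divided by global feature probability) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `pmi_keywords`, whitespace-style pre-tokenized input, the use of raw token counts as features, and alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter


def pmi_keywords(clusters, top_k=10):
    """Rank tokens per cluster by P(token | cluster) / P(token).

    clusters: dict mapping a cluster id to a list of tokenized sentences
    (each sentence is a list of str). Hypothetical input format.
    """
    # Token counts inside each cluster.
    cluster_counts = {
        cid: Counter(tok for sent in sents for tok in sent)
        for cid, sents in clusters.items()
    }
    # Token counts over all clusters together.
    global_counts = Counter()
    for counts in cluster_counts.values():
        global_counts.update(counts)
    global_total = sum(global_counts.values())

    keywords = {}
    for cid, counts in cluster_counts.items():
        cluster_total = sum(counts.values())
        scores = {}
        for tok, c in counts.items():
            p_cluster = c / cluster_total
            p_global = global_counts[tok] / global_total
            scores[tok] = p_cluster / p_global  # salience ratio
        # Highest ratio first; ties broken alphabetically (an assumption).
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cid] = ranked[:top_k]
    return keywords


toy_clusters = {
    "bank.1": [["river", "bank", "water"], ["bank", "water", "fish"]],
    "bank.2": [["bank", "money", "loan"], ["bank", "loan", "interest"],
               ["bank", "money", "account"]],
}
toy_pmi = pmi_keywords(toy_clusters, top_k=3)
```

In this toy input the target word "bank" occurs in both clusters, so its ratio is the lowest in each cluster and it falls outside the top three, which is the behavior one would want from a salience score.
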
Prompt for Definition Generation.

The prompt is structured as follows:

Sense: <sense ID>
SVM keywords: <keyword list>
PMI keywords: <keyword list>
Representative examples: <sentences>
Generate a concise, one-sentence English definition that best describes the main meaning of this cluster.
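
As an illustration, the prompt template above could be filled in programmatically along the following lines. The helper name `build_definition_prompt`, the comma-separated keyword formatting, and the one-example-per-line layout are assumptions for this sketch; the paper specifies only the field labels and the final instruction.

```python
def build_definition_prompt(sense_id, svm_keywords, pmi_keywords, examples):
    """Fill the definition-generation template for one cluster.

    Formatting details (", " separators, "- " bullets) are illustrative
    choices, not taken from the paper.
    """
    lines = [
        f"Sense: {sense_id}",
        "SVM keywords: " + ", ".join(svm_keywords),
        "PMI keywords: " + ", ".join(pmi_keywords),
        "Representative examples:",
    ]
    lines += [f"- {sentence}" for sentence in examples]
    lines.append(
        "Generate a concise, one-sentence English definition that best "
        "describes the main meaning of this cluster."
    )
    return "\n".join(lines)


definition_prompt = build_definition_prompt(
    "bank.1",
    ["river", "water"],
    ["fish", "shore"],
    ["She sat on the river bank.", "The bank of the stream flooded."],
)
```
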

Figure 2 and Figure 3 show examples of cluster inputs and the resulting generated definitions.

This formulation enables a direct incorporation of semantic information into the clustering-refinement process, bridging the gap between distributional representations and conceptual meaning. To improve reproducibility and technical clarity, we explicitly provide a formal algorithmic specification of the proposed method in Algorithm 1.

Algorithm 1 Definition-Anchored WSI: Overall pipeline including clustering, definition generation, and reclassification

Require: Instance set $X$

Ensure: Predicted sense labels $\hat{y}(x)$

1: Obtain clusters $C \leftarrow \{C_{1}, \dots, C_{K}\}$ using WSI-NS
2: for each cluster $C_{k} \in C$ do
3:     $S_{k} \leftarrow f_{svm}(C_{k})$ {SVM keywords}
4:     $P_{k} \leftarrow f_{pmi}(C_{k})$ {PMI keywords}
5:     $E_{k} \leftarrow \mathrm{Sample}(C_{k})$ {sample representative instances}
6:     $D_{k} \leftarrow \mathrm{LLM}(S_{k}, P_{k}, E_{k})$
7: end for
8: for each test instance $x \in X$ do
9:     $\hat{y}(x) \leftarrow \arg\max_{k \in \{1, \dots, K\}} \mathrm{Match}(x, D_{k})$
10: end for
11: return $\hat{y}(x)$
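
The control flow of Algorithm 1 can be sketched in Python with every component passed in as a callable. This is a structural illustration only: the function and parameter names are hypothetical, the initial WSI-NS clustering is assumed to have been run already, and the `toy_*` functions are simple word-overlap stand-ins for the SVM/PMI extractors and the LLM calls, used so that the sketch runs without external services.

```python
from typing import Callable, Dict, List

Cluster = List[str]  # a cluster is a list of context sentences


def definition_anchored_wsi(
    clusters: Dict[int, Cluster],
    test_instances: List[str],
    f_svm: Callable[[int, Dict[int, Cluster]], List[str]],
    f_pmi: Callable[[int, Dict[int, Cluster]], List[str]],
    sample: Callable[[Cluster], Cluster],
    llm_define: Callable[[List[str], List[str], Cluster], str],
    match: Callable[[str, str], float],
) -> Dict[str, int]:
    """Steps 2-11 of Algorithm 1, given clusters from step 1."""
    definitions = {}
    for k in clusters:
        s_k = f_svm(k, clusters)               # SVM keywords
        p_k = f_pmi(k, clusters)               # PMI keywords
        e_k = sample(clusters[k])              # representative instances
        definitions[k] = llm_define(s_k, p_k, e_k)
    # Assign each test instance to the best-matching definition.
    return {
        x: max(definitions, key=lambda k: match(x, definitions[k]))
        for x in test_instances
    }


# Toy stand-ins (NOT the paper's components).
def toy_keywords(k, clusters):
    own = {w for s in clusters[k] for w in s.lower().split()}
    others = {w for j, c in clusters.items() if j != k
              for s in c for w in s.lower().split()}
    return sorted(own - others)


def toy_define(s_k, p_k, e_k):
    return " ".join(s_k)


def toy_match(x, definition):
    return len(set(x.lower().split()) & set(definition.split()))


toy_clusters = {
    1: ["she sat on the river bank", "the bank of the river flooded"],
    2: ["the bank approved the loan", "he opened a bank account"],
}
toy_labels = definition_anchored_wsi(
    toy_clusters,
    ["fish swim near the river", "the loan was repaid"],
    toy_keywords, toy_keywords, lambda c: c[:20], toy_define, toy_match,
)
```

Injecting the components as callables keeps the loop structure identical whether `match` is a toy overlap score, as here, or an LLM prompt that selects a definition, as in the paper.
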

3.3. Definition-Guided Reclassification

After generating sense definitions for each cluster, we leverage the abstract semantic information encoded in these definitions to re-evaluate instance assignments. The initial clustering relies on local features and statistical co-occurrence patterns. As a result, semantically appropriate but distributionally rare instances may be misassigned. By introducing high-level semantic representations in the form of definitions, our method re-compares instances at a conceptual level and corrects such misallocations.

For each unseen test instance, we provide the LLM with (i) the set of generated sense definitions and (ii) the test sentence containing the target word. The model is instructed to select the sense whose definition best matches the contextual meaning of the instance. This allows even minority senses to be selected when their semantic descriptions are sufficiently clear.

The prompt used for classification is as follows:

You are classifying example sentences for the lemma <lemma_pos>.
Here are the possible senses:
1. <definition for sense 1>
2. <definition for sense 2>
3. <definition for sense 3>
…
For each example, respond ONLY with the number of the most appropriate sense.

The LLM outputs the index of the selected sense for each instance. We compare these predictions with the gold labels provided by SemEval to compute classification performance.
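
A minimal sketch of how the classification prompt might be assembled and how the numeric reply might be parsed is given below. The helper names, the `Example:` line used to append the test sentence, and the policy of returning `None` for unparseable or out-of-range replies are assumptions for illustration; the paper does not describe how malformed outputs are handled.

```python
import re


def build_classification_prompt(lemma_pos, definitions, sentence):
    """Fill the classification template for one test sentence."""
    lines = [
        f"You are classifying example sentences for the lemma {lemma_pos}.",
        "Here are the possible senses:",
    ]
    lines += [f"{i}. {d}" for i, d in enumerate(definitions, start=1)]
    lines.append(
        "For each example, respond ONLY with the number of the most "
        "appropriate sense."
    )
    lines.append(f"Example: {sentence}")  # layout of this line is assumed
    return "\n".join(lines)


def parse_sense_index(response, num_senses):
    """Return the 1-based sense index in the reply, or None if invalid."""
    found = re.search(r"\d+", response)
    if found is None:
        return None
    index = int(found.group())
    return index if 1 <= index <= num_senses else None


classification_prompt = build_classification_prompt(
    "bank.n",
    ["The land alongside a river.", "A financial institution."],
    "He deposited the check at the bank.",
)
```
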

This definition-guided reclassification step is particularly effective when minority senses are underrepresented or when instances lie near cluster boundaries.

4. Experiments

4.1. Datasets

We evaluate our proposed method on two standard benchmarks for word sense induction: SemEval-2010 Task 14 [37] and SemEval-2013 Task 13 [28].

Each XML file contains <instance> tags representing individual occurrences of a target word, context fields describing the surrounding text, an id attribute uniquely identifying each occurrence, and a pos attribute indicating the part of speech (noun or verb).

SemEval-2010 Task 14 focuses on unsupervised word sense induction and disambiguation for a predefined set of 100 target words (50 nouns and 50 verbs). The training set contains 879,807 instances collected from the web, and the testing set consists of 8915 instances drawn from OntoNotes. The average number of gold-standard senses per target word is 3.79, with 4.46 for nouns and 3.12 for verbs. No sense labels are provided for training, reflecting the fully unsupervised nature of the task; gold-standard sense labels are available only for evaluation. The dataset includes a balanced selection of polysemous words exhibiting varying degrees of sense granularity and skewed sense distributions, making it suitable for analyzing the robustness of WSI methods under majority-sense bias.

SemEval-2013 Task 13 targets 50 lemmas (20 nouns, 20 verbs, and 10 adjectives) and comprises 4664 test instances drawn from the Open American National Corpus (OANC) across multiple genres including journal articles, telephone conversations, fiction, and technical writing. Unlike the 2010 task, this task adopts a graded sense annotation scheme in which each instance may be labeled with multiple WordNet 3.1 senses weighted by their applicability, with an average of 1.12 senses per instance. For sense induction, the ukWaC corpus was provided as the training resource instead of target-specific training data. The task exhibits more fine-grained and diverse sense inventories with stronger long-tail distributions across senses, and emphasizes evaluating how well induced sense clusters correspond to gold-standard senses across different domains and usage patterns. As in the 2010 task, all clustering instances are unlabeled, and supervision is restricted to evaluation only.

For both datasets, we follow the official training/test splits provided by the organizers.

4.2. Evaluation Metrics

We evaluate clustering quality using both cluster-level and instance-level metrics to obtain a comprehensive assessment of sense induction performance.

Normalized Mutual Information (NMI).

NMI measures the mutual dependence between the predicted clusters and the gold-standard sense labels. It evaluates the overall structural alignment between two partitions while being invariant to label permutations.

V-measure.

V-measure is the harmonic mean of homogeneity and completeness. Homogeneity assesses whether each induced cluster contains only instances of a single gold sense, while completeness evaluates whether all instances of a gold sense are assigned to the same cluster. This metric captures the trade-off between over-clustering and under-clustering.
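
For concreteness, V-measure can be computed from label entropies as in the sketch below, using the standard definitions of homogeneity, 1 - H(gold | pred) / H(gold), and completeness, 1 - H(pred | gold) / H(pred). This is a hard-clustering illustration with hypothetical function names; it is not the official SemEval scorer and does not handle graded or fuzzy assignments.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two parallel label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(gold, pred):
    h_gold, h_pred = _entropy(gold), _entropy(pred)
    # By convention, a zero-entropy labeling is perfectly homogeneous/complete.
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

With two equally frequent gold senses, a single-cluster prediction is fully complete but has zero homogeneity, so its V-measure is zero; this is why structural metrics penalize one-cluster solutions that instance-level metrics may reward.
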

Paired F-Score.

Paired F-Score evaluates the alignment between induced sense clusters and gold-standard senses by measuring pairwise agreement between instance assignments. It captures how consistently pairs of instances that belong to the same gold sense are grouped together in the predicted clustering, and vice versa.
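
The pairwise computation can be sketched as follows for hard assignments: precision is the fraction of same-cluster pairs that also share a gold sense, and recall is the fraction of same-gold-sense pairs that share a cluster. The function name is hypothetical and the sketch is not the official scorer.

```python
from itertools import combinations


def paired_f_score(gold, pred):
    """Paired F-score over all unordered instance pairs."""
    indices = range(len(gold))
    gold_pairs = {(i, j) for i, j in combinations(indices, 2) if gold[i] == gold[j]}
    pred_pairs = {(i, j) for i, j in combinations(indices, 2) if pred[i] == pred[j]}
    common = len(gold_pairs & pred_pairs)
    if common == 0:
        return 0.0
    precision = common / len(pred_pairs)
    recall = common / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold labels a, a, b, b and a single predicted cluster, 2 of the 6 predicted pairs are correct (precision 1/3) and both gold pairs are recovered (recall 1), giving a paired F-score of 0.5.
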

F-B³.

F-B³ computes precision and recall at the instance level, measuring how well individual instances are assigned to the correct sense cluster. Unlike global metrics such as NMI, F-B³ is sensitive to instance-wise misassignments.
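
A sketch of the standard B-cubed computation for hard assignments follows: for each instance, precision is the share of its predicted cluster that has the same gold sense, and recall is the share of its gold sense that lies in the same predicted cluster; both are averaged over instances. The function name is hypothetical, and the fuzzy variant used for graded annotations is not covered here.

```python
def b_cubed_f(gold, pred):
    """Harmonic mean of B-cubed precision and recall (hard clustering)."""
    n = len(gold)
    if n == 0:
        return 0.0
    precision = recall = 0.0
    for i in range(n):
        same_pred = [j for j in range(n) if pred[j] == pred[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        # Instances sharing both the predicted cluster and the gold sense of i.
        correct = sum(1 for j in same_pred if gold[j] == gold[i])
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    precision /= n
    recall /= n
    return 2 * precision * recall / (precision + recall)
```

On gold labels a, a, b, b with a single predicted cluster, each instance has precision 0.5 and recall 1, so F-B³ is 2/3. The same prediction has a V-measure of zero, which illustrates how instance-level and structural metrics can diverge.
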

Fuzzy-F-B³.

Since our method produces soft cluster memberships in the initial stage, we additionally report Fuzzy-F-B³, which accounts for probabilistic cluster assignments and provides a more faithful evaluation of soft clustering behavior.

We report the average performance over five independent runs with different random seeds, together with standard deviations. This ensures robustness and reduces sensitivity to stochastic variations in clustering initialization and LLM outputs.

4.3. Baselines

We compare our method against both traditional clustering approaches and recent word sense induction methods to ensure a comprehensive evaluation.

k-means.

As a simple and widely used clustering algorithm, k-means serves as a basic centroid-based baseline. We apply k-means to the same contextual representations used in the initial clustering stage.

We fix the number of clusters to $k = 4$ across all experiments. This setting is independent of the gold sense inventory. Gold sense annotations are used only for evaluation.

HDP.

The Hierarchical Dirichlet Process (HDP) is a non-parametric Bayesian clustering method that automatically determines the number of clusters. It provides a strong unsupervised baseline for sense induction without requiring prior knowledge of the number of senses.

WSI-NS.

We include WSI-NS as our primary baseline, since it represents the underlying clustering framework used in the initial stage of our method.

Our method does not aim to replace WSI-NS, but to extend it with a definition-guided refinement stage. Therefore, this comparison is not intended as a direct competition between identical clustering methods, but rather as an evaluation of how definition-based reclassification improves performance on top of an existing clustering framework.

This design enables a fair comparison by isolating the contribution of semantic definitions to instance-level alignment.

1cpl.

We report results for the one-cluster-per-lemma (1cpl) baseline, which assigns all instances of a lemma to a single cluster. Despite its simplicity, this baseline can perform competitively on skewed datasets and thus serves as a strong lower-bound reference [38].

Discussion on LLM-only Clustering.

We do not include an LLM-only clustering baseline for two reasons. First, large language models may have been exposed to benchmark datasets during pretraining, which introduces potential data leakage concerns. Second, LLM-only clustering lacks a consistent and reproducible mechanism for grouping instances across different runs, making fair comparison difficult.

Therefore, we focus on evaluating the effect of integrating LLM-generated definitions into a standard WSI pipeline, where clustering remains grounded in explicit and reproducible criteria.

Our goal is not to replace clustering with LLMs, but to enhance it with explicit semantic information.

Additional Baselines from Recent Work.

To provide a broader comparison, we also refer to recent WSI results reported in prior work [38]. Specifically, we include results reported by Mosolova et al. [38] for several clustering-based and LLM-based WSI approaches on the SemEval-2010 and SemEval-2013 datasets. These results are not directly reproduced in our experimental setup, but are included as reference baselines for contextual comparison.

Their results show that traditional clustering-based methods (e.g., PolyLM and LSDP) achieve competitive performance, while direct LLM-based approaches tend to underperform and exhibit high variance.

These reported results suggest that purely distributional clustering methods are limited by their reliance on embedding geometry, while direct LLM-based clustering suffers from instability and inconsistent outputs.

Our method takes a middle ground between these paradigms. It does not use LLMs for direct clustering; instead, it integrates LLM-generated definitions into a clustering-refinement pipeline, where they act as semantic anchors. This design leverages the semantic strengths of LLMs while avoiding the instability observed in LLM-only approaches, and it enables more stable and interpretable refinement of cluster assignments.

Because these reference results are taken from a separate study, they are intended for contextual comparison rather than strict head-to-head evaluation.

4.4. Implementation Details

Clustering Configuration.

We adopt WSI-NS as the initial clustering framework. The maximum number of clusters is set to 8 for agglomerative clustering to accommodate fine-grained sense distinctions. For k-means and HDP, the maximum number of clusters is capped at 4, as preliminary experiments indicated that this setting yielded the most stable and competitive performance across datasets. All clustering experiments are repeated five times with different random seeds, and we report the mean performance with standard deviations.

Keyword Extraction.

For each cluster, we extract 10 discriminative keywords based on the absolute weights of a one-vs-rest linear SVM with L1 regularization and balanced class weights. In addition, we compute a PMI-based salience score defined as the ratio between the cluster-specific feature probability and the global feature probability and select the top 10 PMI keywords per cluster. The number of keywords is fixed across clusters to prevent bias toward larger clusters and to ensure comparability across senses.

Representative Examples.

For each cluster, we randomly sample up to 20 example sentences to capture diverse contextual realizations of the sense. Random sampling mitigates overfitting to highly frequent or central instances. Random seeds are fixed to ensure reproducibility.
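
Seeded sampling of up to 20 examples per cluster can be sketched with the standard library as follows. The function name and the choice of a per-call `random.Random(seed)` generator are assumptions; the paper states only that seeds are fixed.

```python
import random


def sample_examples(cluster_sentences, max_examples=20, seed=0):
    """Return up to max_examples sentences, sampled without replacement."""
    sentences = list(cluster_sentences)
    if len(sentences) <= max_examples:
        return sentences  # small clusters are used in full
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(sentences, max_examples)


toy_pool = [f"sentence {i}" for i in range(50)]
toy_sample = sample_examples(toy_pool, seed=13)
```
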

We observe that the standard deviation across runs is small, indicating that the impact of sampling variability is negligible and that the proposed method is robust to variations in sampled instances.

LLM Settings.

We evaluate three large language models: GPT-5, GPT-4o, and Gemini-2.5-Flash. All models are used for both definition generation and definition-guided reclassification. Unless otherwise specified, the temperature parameter is set to 0.0 to reduce non-determinism in LLM outputs. No additional system prompts or chain-of-thought reasoning are employed during inference.

All prompts are fixed across experiments to ensure consistency. Each experiment is repeated five times with different random seeds to account for variability in clustering and sampling.

Computation Environment and Cost.

All experiments are conducted in a Linux environment with GPU acceleration for clustering. LLM inference is performed via API access.

4.5. Data Partitioning

To evaluate robustness under varying data availability conditions, we construct multiple splits for each target lemma.

Specifically, we vary the ratio of clustering instances to classification instances from 1:9 to 9:1 in increments of one instance, as well as a balanced 10:10 setting where applicable. This allows us to simulate both low-resource and high-resource scenarios, which is particularly important for assessing minority-sense recovery.

All splits are performed at the lemma level, meaning that instances belonging to the same target lemma are partitioned into clustering instances and classification instances without mixing across different lemmas. This ensures that evaluation reflects within-lemma sense discrimination rather than cross-lemma generalization.
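
A lemma-level split of this kind can be sketched as below. The sketch assumes that the ratios are proportional splits 1:9, 2:8, ..., 9:1 of each lemma's instances and that instances are shuffled with a fixed seed before cutting; these details, the function names, and the rounding rule are assumptions, and the 10:10 setting is not covered.

```python
import random


def split_lemma_instances(instance_ids, cluster_parts, classify_parts, seed=0):
    """Split one lemma's instances into (clustering, classification) sets."""
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * cluster_parts / (cluster_parts + classify_parts))
    return ids[:cut], ids[cut:]


def all_ratio_splits(instances_by_lemma, seed=0):
    """Build the 1:9 ... 9:1 splits, keeping every lemma separate."""
    splits = {}
    for cluster_parts in range(1, 10):
        classify_parts = 10 - cluster_parts
        splits[(cluster_parts, classify_parts)] = {
            lemma: split_lemma_instances(ids, cluster_parts, classify_parts, seed)
            for lemma, ids in instances_by_lemma.items()
        }
    return splits


toy_instances = {"bank.n": [f"bank.n.{i}" for i in range(100)]}
toy_splits = all_ratio_splits(toy_instances, seed=1)
```

Splitting inside each lemma, as here, guarantees that no instance of one lemma ends up in the clustering or classification set of another.
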

For each split configuration, experiments are repeated five times with different random seeds, and we report the average performance with standard deviations.

These settings simulate realistic variations in data availability, allowing us to systematically evaluate the robustness of the proposed method. This is particularly important for minority-sense detection, where data scarcity is a critical challenge.

As shown in Figure 4, performance generally improves as the proportion of classification instances increases.

4.6. Main Results

The main results of our method are presented below in comparison with baseline approaches.

Our method consistently achieves the best or competitive performance in terms of instance-level metrics across both datasets.

On SemEval-2010, GPT-5 achieves the highest F-B³ score, indicating strong instance-level alignment with gold sense labels. On SemEval-2013, Gemini-2.5-Flash achieves the best performance under Fuzzy-F-B³, reflecting its ability to capture overlapping sense assignments.

In contrast, traditional clustering methods such as k-means and the HDP perform worse, in the case of k-means by a wide margin, suggesting that purely distributional approaches struggle to capture fine-grained semantic distinctions.

These results demonstrate the effectiveness of incorporating LLM-generated definitions into the WSI pipeline.

As shown in Figure 5, our method achieves strong instance-level performance while maintaining competitive results across different datasets.

4.7. Ablation Study

To analyze the impact of different prompt components, we compare the following four input configurations:

SVM keywords + PMI keywords + instances
SVM keywords + instances
PMI keywords + instances
instances only

Here, instances denote representative example sentences sampled from each cluster.

We use GPT-4o for all ablation settings and keep all other parameters fixed. This allows us to isolate the contribution of discriminative features (SVM), statistical co-occurrence signals (PMI), and raw contextual examples.

Results are averaged over five independent runs.
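As a rough illustration of how the four input configurations differ, the following Python sketch assembles the prompt input for one cluster. The prompt wording, the `PromptConfig` class, and the `build_prompt_input` function are hypothetical; the actual prompt template used in the experiments is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptConfig:
    name: str
    use_svm: bool
    use_pmi: bool


# The four ablation settings listed above; instances are always included.
ABLATION_CONFIGS = [
    PromptConfig("svm+pmi+instances", use_svm=True, use_pmi=True),
    PromptConfig("svm+instances", use_svm=True, use_pmi=False),
    PromptConfig("pmi+instances", use_svm=False, use_pmi=True),
    PromptConfig("instances-only", use_svm=False, use_pmi=False),
]


def build_prompt_input(config, lemma, svm_keywords, pmi_keywords, instances):
    """Assemble the cluster description passed to the LLM (hypothetical wording)."""
    lines = [f"Target word: {lemma}"]
    if config.use_svm:
        lines.append("Discriminative keywords (SVM): " + ", ".join(svm_keywords))
    if config.use_pmi:
        lines.append("Co-occurrence keywords (PMI): " + ", ".join(pmi_keywords))
    lines.append("Example sentences:")
    lines.extend(f"- {sentence}" for sentence in instances)
    return "\n".join(lines)
```

Because only the two keyword flags change between settings, any difference in downstream scores can be attributed to the presence or absence of the corresponding keyword list.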

5. Results

5.1. Comparison Across LLMs

We compare three large language models used in the definition-based classification stage: GPT-5, GPT-4o, and Gemini-2.5-Flash.

Table 1, Table 2 and Table 3 show that Gemini-2.5-Flash achieves the highest Fuzzy-F-B³ score on SemEval-2013, slightly outperforming GPT-5 at the 9/1 split (63.68 vs. 63.07). This indicates that Gemini-2.5-Flash is particularly effective at recovering fuzzy sense overlaps and handling graded membership in ambiguous cases.

In contrast, GPT-5 achieves the highest overall performance on SemEval-2010, reaching an F-B³ score of 80.02 at the 9/1 split, compared to 78.17 for Gemini-2.5-Flash. This suggests that GPT-5 provides more stable and precise sense alignment when sharper sense boundaries are required.

GPT-4o consistently underperforms both GPT-5 and Gemini-2.5-Flash across all splits and metrics. Although its performance improves as the proportion of data allocated to the classification stage increases, its peak scores remain substantially lower (Fuzzy-F-B³ = 57.71 on SemEval-2013 and F-B³ = 73.45 on SemEval-2010).

Overall, these results indicate that Gemini-2.5-Flash is more effective in modeling soft, overlapping sense distributions, while GPT-5 excels at producing sharper, more discriminative sense boundaries.

5.2. Comparison with Baselines

As shown in Figure 5, our method achieves strong instance-level performance, while differences across datasets highlight the impact of evaluation metrics.

We compare our method with k-means, Hierarchical Dirichlet Process (HDP), the one-cluster-per-lemma baseline (1cpl), and WSI-NS in our own experiments. To broaden the comparison, we also include reference results reported by Mosolova et al. [38] for several clustering-based and LLM-based WSI approaches evaluated on the SemEval-2010 and SemEval-2013 datasets. Results are reported in Table 4. For fair comparison, all methods reported in Table 4 are evaluated on the identical 9/1 clustering/classification split. The same set of classification (test) instances is used across all models to ensure comparability.

On SemEval-2013, Gemini-2.5-Flash and GPT-5 achieve the highest Fuzzy-F-B³ scores (63.68 and 63.07, respectively), clearly outperforming all unsupervised clustering baselines. The one-cluster baseline (1cpl) achieves a relatively high Fuzzy-F-B³ score (58.07); however, because it assigns all instances to a single cluster, its NMI is 0. k-means performs poorly (Fuzzy-F-B³ = 31.05), suggesting that fixed-k clustering fails to capture the semantic structure of word senses. WSI-NS achieves the highest NMI (17.93) but lower Fuzzy-F-B³ (54.81), indicating over-fragmentation without alignment to gold senses.

On SemEval-2010, GPT-5 achieves the best overall performance with an F-B³ score of 80.02, followed by Gemini-2.5-Flash (78.17) and GPT-4o (73.45). While the 1cpl baseline achieves a high F-B³ score of 61.96, it does not perform any sense distinction and assigns all instances to a single cluster. As a result, both its NMI and V-M are 0. WSI-NS shows very high NMI and V-M values (82.56) but substantially lower F-B³ (55.21), suggesting that it produces structurally coherent but poorly aligned clusters.

We further analyze why our definition-guided refinement can reduce NMI/V-measure while improving instance-level F-B³. First, the initial WSI-NS clustering is allowed to produce up to eight clusters, which may yield a partition that is structurally consistent but relatively fine-grained. During reclassification, the LLM may implicitly prefer a smaller set of salient senses and thus merge several fine-grained clusters into fewer labels. Such cluster merging can substantially change the global partition structure, leading to lower NMI/V-measure even if many individual assignments become more plausible.

Second, the LLM may exhibit a majority-sense bias during definition-based matching: if the majority-sense definition is broader or more semantically prototypical, borderline instances from minority clusters can be absorbed into the majority label. This phenomenon would also reduce structural agreement with the original clustering and the gold partition while potentially improving local consistency for high-frequency senses.

Third, the quality and specificity of generated definitions may limit how faithfully the LLM can preserve the original cluster structure. If definitions are underspecified or overlap across senses, the reclassification step becomes less capable of maintaining the fine-grained distinctions present in the initial clustering, again decreasing NMI/V-measure. Overall, these factors suggest that definition-guided refinement trades off global structural fidelity for instance-level alignment, which is consistent with our goal of improving minority-sense recovery.

Overall, definition-based refinement with large language models substantially improves alignment with gold sense labels while preserving meaningful sense distinctions. These results demonstrate that incorporating explicit semantic representations can effectively address dominant-sense bias and improve minority-sense recovery in unsupervised word sense induction.

Analysis of 1cpl Baseline Behavior.

In SemEval-2010, the 1cpl baseline assigns all instances to a single cluster. Although this trivial solution yields a relatively high F-B³ score (61.96), both NMI and V-measure are 0 because no actual sense discrimination is performed.

This discrepancy can be explained by the skewed sense distribution of the dataset. Since many target words exhibit a dominant majority sense, assigning all instances to a single cluster results in high instance-level overlap for the majority class. Because F-B³ is computed at the instance level and rewards local precision and recall, it can remain relatively high when the majority sense dominates the dataset.

In contrast, NMI and V-measure evaluate global structural agreement between predicted clusters and gold partitions. A single-cluster solution contains no information about sense distinctions, leading to zero mutual information with the gold partition.
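The contrast between the two metric families can be reproduced on a toy example. The sketch below implements B-cubed F-score and NMI in plain Python under two assumptions: hard (non-fuzzy) assignments, and NMI normalized by the arithmetic mean of the two entropies. The official SemEval scorers may differ in detail, and the toy labels are illustrative only.

```python
import math
from collections import Counter


def b_cubed_f(gold, pred):
    """B-cubed F-score: per-instance precision and recall, averaged, then combined."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    precision = sum(joint[(g, p)] / pred_sizes[p] for g, p in zip(gold, pred)) / n
    recall = sum(joint[(g, p)] / gold_sizes[g] for g, p in zip(gold, pred)) / n
    return 2 * precision * recall / (precision + recall)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(gold, pred):
    """Mutual information divided by the mean entropy (0.0 if both entropies are 0)."""
    n = len(gold)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    mi = sum(
        c / n * math.log(c * n / (gold_sizes[g] * pred_sizes[p]))
        for (g, p), c in Counter(zip(gold, pred)).items()
    )
    denom = (entropy(gold) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 0.0


# Toy lemma with an 80/20 sense skew, clustered the 1cpl way (one cluster).
gold = ["major"] * 8 + ["minor"] * 2
one_cluster = ["c0"] * 10
print(b_cubed_f(gold, one_cluster), nmi(gold, one_cluster))
```

On this toy lemma the single-cluster solution has B-cubed recall 1 and precision 0.68, giving an F-score of about 0.81, while its NMI is exactly 0: the same pattern as the 1cpl baseline, at a smaller scale.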

This result shows how strongly majority-sense skew influences instance-level metrics: F-B³ alone may overestimate performance under extreme imbalance. Unlike the trivial single-cluster solution, our method attains a higher F-B³ (80.02 with GPT-5) while preserving non-zero mutual information with the gold partition (NMI = 59.34), indicating that it captures genuine sense distinctions rather than merely reflecting the skewed distribution.

5.3. Ablation on Prompt Inputs

We evaluate the effect of different prompt inputs used for sense definition generation. Specifically, we compare four configurations: no keywords (instances only), SVM keywords only, PMI keywords only, and the combination of SVM and PMI keywords.

All ablation experiments are conducted using the full dataset (100% of instances for both clustering and classification), without applying the train–test split configurations used in Section 4.5. This design removes variability due to data partitioning and allows us to directly evaluate the contribution of each prompt component.

Table 5 shows that combining SVM and PMI keywords consistently yields the best overall performance. On SemEval-2013, the full configuration achieves the highest Fuzzy-F-B³ score (61.03) and the highest Fuzzy-NMI (21.57), clearly outperforming all reduced variants. This indicates that the combination of discriminative features (SVM) and distributional co-occurrence information (PMI) provides complementary signals that improve sense separation.

Using only PMI keywords improves Fuzzy-F-B³ compared to the no-keyword baseline (59.46 vs. 58.47), while SVM-only shows a similar but slightly weaker effect (59.23). This suggests that each keyword type alone yields only a modest gain in fuzzy sense recovery, with co-occurrence statistics contributing marginally more than discriminative features.

On SemEval-2010, the performance differences are smaller. The combined configuration still achieves the highest F-B³ score (52.87), although PMI-only achieves the highest V-M and NMI values. This suggests that PMI-based features may better preserve global cluster structure, while the combination with SVM improves alignment with gold sense labels.

Overall, these results indicate that SVM and PMI provide complementary information: PMI captures distributional similarity across instances, while SVM highlights cluster-specific discriminative cues. Their combination yields the most robust and balanced performance across datasets and evaluation metrics.
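To make the PMI signal concrete, the sketch below ranks words per cluster by pointwise mutual information, PMI(w, c) = log [p(w, c) / (p(w) p(c))], with probabilities estimated from instance-level counts. This is the textbook definition and is assumed here for illustration; the exact estimator, tokenization, and thresholds of the original pipeline may differ, and `pmi_keywords`, `top_k`, and `min_count` are hypothetical names.

```python
import math
from collections import Counter


def pmi_keywords(cluster_tokens, top_k=5, min_count=2):
    """Rank words per cluster by PMI.

    cluster_tokens maps a cluster id to a list of tokenized instances.
    Counts are taken at the instance level (a word counts once per instance).
    """
    word_cluster = Counter()   # instances in cluster c that contain word w
    word_total = Counter()     # instances (any cluster) that contain word w
    cluster_total = Counter()  # instances in cluster c
    for cid, instances in cluster_tokens.items():
        for tokens in instances:
            cluster_total[cid] += 1
            for w in set(tokens):
                word_cluster[(w, cid)] += 1
                word_total[w] += 1
    n = sum(cluster_total.values())

    scored = {cid: [] for cid in cluster_tokens}
    for (w, cid), c in word_cluster.items():
        if c < min_count:  # skip rare words, whose PMI estimate is unstable
            continue
        pmi = math.log(c * n / (word_total[w] * cluster_total[cid]))
        scored[cid].append((pmi, w))
    # Highest PMI first; ties broken alphabetically for determinism.
    return {
        cid: [w for _, w in sorted(items, key=lambda t: (-t[0], t[1]))[:top_k]]
        for cid, items in scored.items()
    }
```

A word that occurs equally often in every cluster, such as the target word itself, receives a PMI of 0 and is ranked below words concentrated in a single cluster, which is the behavior a distributional keyword signal is expected to show.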

6. Discussion

6.1. Effect of Definition-Based Refinement

The results demonstrate that definition-based refinement using large language models substantially improves alignment with gold sense labels compared to purely unsupervised clustering methods. In both datasets, LLM-based approaches outperform k-means, HDP, and WSI-NS in terms of F-B³ or Fuzzy-F-B³, indicating that explicit semantic descriptions provide information that is not captured by distributional similarity alone.

This suggests that LLM-generated definitions act as semantic anchors that stabilize sense boundaries, particularly in cases where embedding-based clustering fails due to sense imbalance.

6.2. Model-Specific Behavior

Gemini-2.5-Flash achieves the highest Fuzzy-F-B³ score on SemEval-2013, suggesting that it is particularly effective at modeling overlapping or graded sense membership. In contrast, GPT-5 achieves the highest F-B³ on SemEval-2010, indicating that it is better suited for producing sharper and more discriminative sense boundaries. Although GPT-5 is slightly below Gemini-2.5-Flash on SemEval-2013, it still achieves a high Fuzzy-F-B³ score, indicating that both models are capable of leveraging definition-based refinement effectively.

These differences may stem from variations in model behavior and how semantic information is represented internally. For example, Gemini-2.5-Flash may capture more graded or overlapping semantic distinctions, while GPT-5 may produce more discrete and separable representations.

This interpretation is consistent with the qualitative differences observed in the generated definitions, where Gemini-2.5-Flash tends to merge multiple senses within a single definition, while GPT-5 produces more clearly separated and discriminative definitions.

However, these observations are preliminary, and further investigation is required to determine whether these differences reflect intrinsic model properties or interactions with prompt design and data splits.

Performance on SemEval-2013.

We observe that our method achieves relatively lower scores on SemEval-2013 compared to some reference baselines, particularly on fuzzy evaluation metrics.

One possible explanation is that our method relies on hard cluster assignment during the reclassification stage, where each instance is assigned to a single sense. In contrast, SemEval-2013 includes evaluation metrics such as Fuzzy-F-B³, which reward soft or overlapping sense assignments.

As a result, methods that implicitly capture graded or overlapping sense membership may achieve higher scores under these metrics. In contrast, our approach focuses on producing discrete and interpretable sense distinctions, which may lead to lower performance on fuzzy metrics despite maintaining clear semantic boundaries.

This observation highlights a trade-off between interpretability and flexibility in sense modeling, and suggests that incorporating soft assignment mechanisms could be a promising direction for future work.

This is also consistent with the observation that Gemini-2.5-Flash, which tends to produce more overlapping or merged definitions, achieves higher scores on fuzzy evaluation metrics.

6.3. Qualitative Comparison of Generated Definitions

To further analyze the quality of generated definitions, we compare outputs from different LLMs for the same target word under the 9/1 clustering/classification split.

For the lemma late.j, which contains both (i) a “deceased” sense and (ii) an “end of period” sense, the generated definitions are as follows:

GPT-5:

(1) “refers to someone who has recently passed away”

(2) “refers to events occurring towards the end of a period”.

GPT-4o:

(1) “someone who has passed away, often highlighting their accomplishments”

(2) “the latter part of a specified time period”.

Gemini-2.5-Flash:

(1) “relating to a person who is deceased or a time that is past”

(2) “occurring near the end of a specified period or a person’s life”.

We observe that GPT-5 produces more clearly separated and discriminative definitions for each sense. In contrast, GPT-4o tends to generate more verbose and less specific descriptions. Gemini-2.5-Flash partially merges multiple senses within a single definition, which may explain its strong performance in fuzzy evaluation metrics.

These qualitative differences are consistent with the quantitative results, where GPT-5 achieves higher instance-level alignment, while Gemini-2.5-Flash performs better in fuzzy sense modeling.

We use a fixed prompt template across all experiments to ensure consistency. While prompt design may affect LLM outputs, our results under this template are stable: the standard deviation across runs is small. Because the prompt itself is not varied, this stability reflects robustness to random seeds rather than to prompt wording; a systematic analysis of prompt sensitivity is left for future work.

We qualitatively inspect generated definitions and confirm that they capture meaningful and distinguishable semantic concepts across senses. These examples demonstrate that the generated definitions are interpretable and semantically coherent. A full human evaluation is left for future work.

6.4. Interpretation of Baseline Behavior

The strong F-B³ performance of the 1cpl baseline highlights the substantial impact of skewed sense distributions on evaluation metrics. Because the SemEval datasets are dominated by a majority sense, collapsing all instances into a single cluster can yield deceptively high scores. However, since no sense distinction is performed, both NMI and V-M are zero.

WSI-NS, in contrast, achieves very high NMI and V-M values but relatively low F-B³, suggesting that it produces structurally coherent clusters that are poorly aligned with gold sense labels. This indicates that no single metric is sufficient to evaluate WSI systems and that multiple complementary measures are required.

To provide a comprehensive evaluation, we report multiple complementary metrics, including NMI, V-measure, and F-B³. While F-B³ captures instance-level alignment, NMI and V-measure reflect global cluster structure. Our conclusions are therefore based on consistent trends across these metrics rather than on any single measure, which is particularly important under skewed sense distributions, where instance-level metrics alone may be misleading.

Compared with the recent reference results reported by Mosolova et al. [38], our method shows a different trade-off. Their strongest clustering-based baselines, such as PolyLM and LSDP, remain competitive on standard WSI benchmarks, while direct LLM-based approaches tend to be weaker and more variable.

In contrast, our framework does not use LLMs for direct clustering; instead, it uses LLM-generated definitions to refine an existing clustering structure.

This suggests that LLMs are more effective as semantic refinement modules than as standalone clustering engines in WSI.

6.5. Statistical Significance Analysis

We conduct paired t-tests on the F-B³ scores for the 9/1 split on both SemEval-2010 and SemEval-2013, comparing GPT-5 and Gemini-2.5-Flash.

The differences are not statistically significant (p > 0.05), so the ranking between the two models should be regarded as tentative, although the direction of the difference is consistent across runs.

Given the small number of runs, statistical power is limited, and further large-scale evaluation is left for future work.
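For reference, the paired t-test over five runs can be sketched as follows. The helper name `paired_t_statistic` is hypothetical, and the constant 2.776 is the standard two-sided critical value of the t distribution for 4 degrees of freedom at α = 0.05; no scores from the paper are reproduced here.

```python
import math
from statistics import mean, stdev

# Two-sided critical value for alpha = 0.05 with df = 4 (i.e., five paired runs).
T_CRIT_DF4 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences between two systems' per-run scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def is_significant(scores_a, scores_b, critical_value=T_CRIT_DF4):
    """True if |t| exceeds the critical value (valid for five runs by default)."""
    return abs(paired_t_statistic(scores_a, scores_b)) > critical_value
```

With only five paired runs the threshold on |t| is high, so a mean difference has to be large relative to its run-to-run variation before it is declared significant, which is consistent with the limited statistical power noted above.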

6.6. Role of Prompt Inputs

The ablation study shows that SVM and PMI features provide complementary information. PMI captures global distributional similarity, while SVM keywords emphasize cluster-specific discriminative cues. Their combination yields the most robust performance, suggesting that both types of information contribute to generating useful sense definitions.

Using instances alone yields the lowest Fuzzy-F-B³ on SemEval-2013 (58.47), suggesting that surface-level similarity among example sentences is insufficient for stable sense induction.

6.7. Minority-Sense Clustering

We analyze the results in Table 4 from the perspective of minority-sense clustering. The 1cpl baseline achieves a relatively high F-B³ score (61.96) but has zero NMI and V-M, indicating that all instances are collapsed into a single cluster and that minority senses are completely absorbed into the dominant one.

k-means exhibits non-zero NMI (24.86) but extremely low Fuzzy-F-B³ (31.05), suggesting that it fails to recover instances of minority senses. The HDP achieves higher Fuzzy-F-B³ (58.13) but very low NMI (8.98), indicating that minority senses are not preserved as independent clusters.

WSI-NS achieves very high NMI and V-M (82.56), indicating that it captures the overall sense structure well. However, its relatively low F-B³ (55.21) suggests poor alignment with the gold sense labels, particularly for minority-sense instances.

In contrast, GPT-5 and Gemini-2.5-Flash achieve high values for both NMI and F-B³ (e.g., GPT-5: NMI = 59.34, F-B³ = 80.02; Gemini: NMI = 54.51, F-B³ = 78.17), indicating that minority senses are preserved as distinct clusters while their instances are also correctly recovered.

Overall, effective minority-sense clustering requires both structural preservation, as reflected by NMI, and instance-level alignment, as reflected by F-B³. Our reclassification based on LLM-generated definitions is effective because it achieves both at once: it gives up part of the structural agreement of the initial WSI-NS clustering (NMI = 82.56) in exchange for substantially higher instance-level alignment.

6.8. Implications

Similar challenges of handling distributional shifts without relying on predefined baselines have also been studied in other domains, such as structural health monitoring [39].

Overall, the findings suggest that combining unsupervised clustering with explicit semantic representation via LLM-generated definitions offers a promising direction for addressing long-standing challenges in WSI, particularly dominant-sense bias and minority-sense recovery.

7. Conclusions

In this paper, we proposed a definition-based reclassification framework for word sense induction that integrates LLM-generated sense definitions into the clustering process to balance structural preservation and instance-level alignment.

Experimental results on SemEval-2010 and SemEval-2013 show that our approach outperforms traditional clustering methods and strong baselines on instance-level metrics (F-B³ and Fuzzy-F-B³) while retaining non-trivial structural agreement (NMI and V-M), although WSI-NS attains higher structural scores on SemEval-2010. In particular, our method effectively mitigates dominant-sense bias and improves the recovery of minority senses by preserving them as distinct clusters while correctly assigning their instances.

Through comparisons across models and ablation studies on input components, we further demonstrated that both distributional similarity and discriminative information are essential for effective definition generation and reclassification.

These findings suggest that combining unsupervised clustering with explicit semantic representations generated by large language models provides a promising direction for addressing long-standing challenges in word sense induction and related semantic modeling tasks.

Author Contributions

Conceptualization, S.Y. and M.S.; methodology, S.Y.; software, S.Y.; validation, S.Y.; writing—original draft preparation, S.Y.; writing—review and editing, M.S.; supervision, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Japan Society for the Promotion of Science (JSPS), Grant Number 25K15242.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: SemEval-2010 Task 14 and SemEval-2013 Task 13 datasets.

Acknowledgments

This work utilized AI-based tools (e.g., GPT-4o/GPT-5) for language refinement and editing. All technical content and interpretations were developed and verified by the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Navigli, R. Word Sense Disambiguation: A Survey. ACM Comput. Surv. 2009, 41, 10. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2227–2237. [Google Scholar]
Amrami, A.; Goldberg, Y. Word Sense Induction with Neural biLM and Symmetric Patterns. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 4860–4867. [Google Scholar]
Kilgarriff, A. How Dominant is the Commonest Sense of a Word? In Proceedings of the 5th International Conference on Text, Speech and Dialogue, Brno, Czech Republic, 8–11 September 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 103–111. [Google Scholar]
Blevins, T.; Zettlemoyer, L. Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1006–1017. [Google Scholar]
Su, Y.; Zhang, H.; Song, Y.; Zhang, T. Rare and Zero-shot Word Sense Disambiguation using Z-Reweighting. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 4713–4723. [Google Scholar]
Schütze, H. Automatic Word Sense Discrimination. Comput. Linguist. 1998, 24, 97–123. [Google Scholar]
Pantel, P.; Lin, D. Discovering Word Senses from Text. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM: New York, NY, USA, 2002; pp. 613–619. [Google Scholar]
Wiedemann, G.; Remus, S.; Chawla, A.; Biemann, C. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), Erlangen, Germany, 9–11 October 2019; German Society for Computational Linguistics & Language Technology: Erlangen, Germany, 2019; pp. 161–170. [Google Scholar]
Hadiwinoto, C.; Ng, H.T.; Gan, W.C. Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 5297–5306. [Google Scholar]
Amrami, A.; Goldberg, Y. Towards Better Substitution-based Word Sense Induction. arXiv 2019, arXiv:1905.12598. [Google Scholar] [CrossRef]
Alagić, D.; Šnajder, J.; Padó, S. Leveraging Lexical Substitutes for Unsupervised Word Sense Induction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Washington, DC, USA, 2018; pp. 5004–5011. [Google Scholar]
Zipf, G.K. Human Behavior and the Principle of Least Effort; Addison-Wesley Press: Boston, MA, USA, 1949. [Google Scholar]
McCarthy, D.; Koeling, R.; Weeds, J.; Carroll, J. Finding Predominant Word Senses in Untagged Text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 279–286. [Google Scholar]
Lau, J.H.; Cook, P.; McCarthy, D.; Newman, D.; Baldwin, T. Word Sense Induction for Novel Sense Detection. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 591–601. [Google Scholar]
Mancini, M.; Camacho-Collados, J.; Iacobacci, I.; Navigli, R. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada, 3–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 100–111. [Google Scholar]
Teh, Y.W.; Jordan, M.I.; Beal, M.J.; Blei, D.M. Hierarchical Dirichlet Processes. J. Am. Stat. Assoc. 2006, 101, 1566–1581. [Google Scholar] [CrossRef]
Brody, S.; Lapata, M. Bayesian Word Sense Induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March–3 April 2009; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 103–111. [Google Scholar]
Yao, X.; Van Durme, B. Nonparametric Bayesian Word Sense Induction. In Proceedings of the TextGraphs-6 Workshop on Graph-based Methods for Natural Language Processing, Portland, OR, USA, 23 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 10–14. [Google Scholar]
Lesk, M. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of the 5th Annual International Conference on Systems Documentation (SIGDOC ’86), Champaign, IL, USA, 13–15 October 1986; ACM: New York, NY, USA, 1986; pp. 24–26. [Google Scholar]
Banerjee, S.; Pedersen, T. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002), Mexico City, Mexico, 17–23 February 2002; Springer: Berlin/Heidelberg, Germany, 2002; pp. 136–145. [Google Scholar]
Luo, F.; Liu, T.; Xia, Q.; Chang, B.; Sui, Z. Incorporating Glosses into Neural Word Sense Disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, VIC, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2473–2482. [Google Scholar]
Huang, L.; Sun, C.; Qiu, X.; Huang, X. GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3509–3514. [Google Scholar]
Raganato, A.; Camacho-Collados, J.; Navigli, R. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 99–110. [Google Scholar]
Loureiro, D.; Jorge, A.M. Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5682–5691.
Bevilacqua, M.; Navigli, R. Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2854–2864.
Jurgens, D.; Klapaftis, I. SemEval-2013 Task 13: Word Sense Induction for Graded and Non-Graded Senses. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-2013), Atlanta, GA, USA, 14–15 June 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 290–299.
Chen, X.; Liu, Z.; Sun, M. A Unified Model for Word Sense Representation and Disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1025–1035.
Rothe, S.; Schütze, H. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 1793–1803.
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
Qin, C.; Zhang, A.; Zhang, Z.; Chen, J.; Yasunaga, M.; Yang, D. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1339–1384.
Kocoń, J.; Cichecki, I.; Kaszyca, O.; Kochanek, M.; Szydło, D.; Baran, J.; Bielaniewicz, J.; Gruza, M.; Janz, A.; Kanclerz, K.; et al. ChatGPT: Jack of All Trades, Master of None. Inf. Fusion 2023, 99, 101861.
Hanna, M.; Mareček, D. Analyzing BERT’s Knowledge of Hypernymy via Prompting. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Punta Cana, Dominican Republic, 11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 275–282.
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
Gemini Team; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.B.; Yu, J.; Sorber, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
Manandhar, S.; Klapaftis, I.P.; Dligach, D.; Pradhan, S.S. SemEval-2010 Task 14: Word Sense Induction & Disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-2010), Uppsala, Sweden, 15–16 July 2010; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 63–68.
Mosolova, A.; Loureiro, D.; Glavaš, G. Large Language Models Struggle to Outperform One Cluster per Lemma in Word Sense Induction. In Findings of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2025.
Liu, W.; Hu, J.; Lv, F.; Tang, Z. A new method for long-term temperature compensation of structural health monitoring by ultrasonic guided wave. Measurement 2025, 252, 117310.

Figure 1. Overview of the proposed method. The pipeline consists of three stages: (1) initial clustering using WSI-NS (Word Sense Induction with neural biLM and Symmetric Patterns), which produces preliminary sense clusters; (2) LLM-based definition generation, where each cluster is converted into a semantic anchor represented as a natural language definition; and (3) definition-guided reclassification, where each test instance is assigned to the most appropriate sense based on these definitions. Arrows indicate the flow of data between components.
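
The data flow in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`reclassify`, `toy_definition`, `toy_choice`) are hypothetical, stage 1 is represented only by its output (a mapping from cluster id to example sentences), and the two LLM calls are replaced by toy stand-ins (frequent-word "definitions" and token overlap) so that the sketch runs without any external service.

```python
from collections import Counter
from typing import Callable, Dict, List


def reclassify(
    clusters: Dict[int, List[str]],
    test_instances: List[str],
    generate_definition: Callable[[List[str]], str],
    choose_sense: Callable[[str, Dict[int, str]], int],
) -> List[int]:
    """Stages 2 and 3 of the pipeline, given the stage-1 clusters."""
    # Stage 2: convert every preliminary cluster into a textual anchor.
    definitions = {cid: generate_definition(sents) for cid, sents in clusters.items()}
    # Stage 3: assign each test instance to the best-matching anchor.
    return [choose_sense(sentence, definitions) for sentence in test_instances]


# --- Toy stand-ins (ASSUMPTIONS; an LLM would be called here instead) ---
def toy_definition(sentences: List[str]) -> str:
    """Pretend definition: the five most frequent words of the cluster."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return " ".join(word for word, _ in counts.most_common(5))


def toy_choice(sentence: str, definitions: Dict[int, str]) -> int:
    """Pretend classifier: pick the definition with the largest token overlap."""
    tokens = set(sentence.lower().split())
    return max(definitions, key=lambda cid: len(tokens & set(definitions[cid].split())))


# Stage-1 output for the target word "bank" (hand-made example data).
clusters = {
    0: ["the river bank was muddy", "fishing on the river bank"],
    1: ["the bank approved the loan", "the bank raised interest on the loan"],
}
tests = ["she deposited money for a loan", "the muddy river flooded"]
assigned = reclassify(clusters, tests, toy_definition, toy_choice)
print(assigned)
```

Passing the two stages as callables keeps the sketch independent of any particular LLM or prompt format; swapping the stand-ins for real model calls would not change `reclassify` itself.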

Figure 2. An example of the initial clustering output provided to the LLM, including cluster-specific keywords and representative example sentences used for sense definition generation.

Figure 3. An example of sense definitions generated by the LLM.

Figure 4. Performance across different split settings. Left: SemEval-2010 (F-B³). Right: SemEval-2013 (Fuzzy-F-B³). Error bars indicate standard deviation.

Figure 5. Comparison of instance-level performance. Left: SemEval-2010 (F-B³). Right: SemEval-2013 (Fuzzy-F-B³).

Table 1. GPT-5 results averaged on SemEval-2013 and SemEval-2010. Bold values indicate the best performance in each column.

Model Clustering/Classify	SemEval-2013 Fuzzy-NMI	SemEval-2013 Fuzzy-F-B³	SemEval-2010 V-M	SemEval-2010 Paired F-S	SemEval-2010 NMI	SemEval-2010 F-B³
1/9	20.09 [±3.19]	60.27 [±2.17]	41.52 [±1.14]	60.85 [±1.79]	41.52 [±1.14]	66.06 [±1.32]
2/8	14.11 [±0.90]	54.50 [±0.48]	44.33 [±3.11]	62.23 [±2.58]	44.33 [±3.11]	67.17 [±1.81]
3/7	15.08 [±1.80]	55.17 [±1.45]	44.56 [±1.69]	63.65 [±0.79]	44.56 [±1.69]	68.13 [±0.74]
4/6	14.65 [±1.47]	55.13 [±0.67]	46.45 [±1.40]	64.52 [±1.30]	46.45 [±1.40]	69.34 [±1.16]
5/5	14.60 [±1.06]	56.57 [±1.10]	47.52 [±2.83]	64.22 [±1.35]	47.52 [±2.83]	70.26 [±1.10]
6/4	14.47 [±1.27]	56.37 [±1.27]	49.57 [±3.01]	64.59 [±2.00]	49.57 [±3.01]	71.85 [±1.54]
7/3	17.22 [±2.57]	58.71 [±2.21]	52.00 [±2.99]	64.14 [±2.44]	52.00 [±2.99]	73.30 [±1.48]
8/2	17.83 [±2.54]	58.73 [±2.48]	55.96 [±1.04]	62.76 [±2.21]	55.96 [±1.04]	76.39 [±0.87]
9/1	16.59 [±2.68]	63.07 [±2.01]	59.34 [±4.55]	54.71 [±1.99]	59.34 [±4.55]	80.02 [±1.87]

Table 2. GPT-4o results averaged on SemEval-2013 and SemEval-2010. Bold values indicate the best performance in each column.

Model Clustering/Classify	SemEval-2013 Fuzzy-NMI	SemEval-2013 Fuzzy-F-B³	SemEval-2010 V-M	SemEval-2010 Paired F-S	SemEval-2010 NMI	SemEval-2010 F-B³
1/9	15.05 [±1.24]	55.72 [±1.42]	15.57 [±0.31]	39.27 [±1.48]	15.57 [±0.31]	46.89 [±1.03]
2/8	12.16 [±1.18]	50.81 [±0.94]	16.01 [±1.17]	42.76 [±0.97]	16.01 [±1.17]	48.95 [±0.45]
3/7	12.91 [±2.31]	51.15 [±1.72]	17.39 [±0.97]	45.10 [±1.24]	17.39 [±0.97]	51.19 [±0.85]
4/6	12.05 [±1.29]	50.87 [±1.65]	20.10 [±0.80]	45.37 [±0.72]	20.10 [±0.80]	52.22 [±0.89]
5/5	12.09 [±1.79]	51.60 [±1.76]	24.29 [±0.82]	46.44 [±0.67]	24.29 [±0.82]	54.54 [±0.23]
6/4	11.46 [±0.98]	51.50 [±1.25]	28.67 [±2.60]	48.18 [±1.08]	28.67 [±2.60]	57.84 [±1.22]
7/3	13.49 [±2.32]	53.11 [±0.79]	33.53 [±2.02]	47.96 [±1.71]	33.53 [±2.02]	60.90 [±1.33]
8/2	13.83 [±2.23]	53.70 [±1.95]	41.96 [±2.95]	49.02 [±1.89]	41.96 [±2.95]	67.43 [±0.94]
9/1	13.08 [±1.79]	57.71 [±0.67]	50.54 [±6.47]	41.76 [±2.38]	50.54 [±6.47]	73.45 [±2.30]

Table 3. Gemini-2.5-Flash results averaged on SemEval-2013 and SemEval-2010. Bold values indicate the best performance in each column.

Model Clustering/Classify	SemEval-2013 Fuzzy-NMI	SemEval-2013 Fuzzy-F-B³	SemEval-2010 V-M	SemEval-2010 Paired F-S	SemEval-2010 NMI	SemEval-2010 F-B³
1/9	19.92 [±2.13]	59.81 [±1.32]	33.30 [±1.81]	55.71 [±1.72]	33.30 [±1.81]	60.82 [±1.21]
2/8	15.21 [±0.74]	55.00 [±1.06]	36.98 [±1.70]	59.70 [±1.78]	36.98 [±1.70]	63.90 [±1.23]
3/7	16.38 [±1.50]	55.87 [±0.82]	37.77 [±1.64]	62.13 [±1.56]	37.77 [±1.64]	65.97 [±1.46]
4/6	15.46 [±1.71]	56.61 [±2.16]	40.56 [±1.73]	62.15 [±1.62]	40.56 [±1.73]	66.46 [±1.25]
5/5	14.65 [±1.28]	56.61 [±2.28]	42.76 [±1.30]	62.92 [±1.10]	42.76 [±1.30]	68.30 [±1.02]
6/4	16.64 [±2.54]	57.82 [±2.36]	44.21 [±3.12]	62.40 [±1.70]	44.21 [±3.12]	69.33 [±1.34]
7/3	16.71 [±2.61]	57.99 [±1.79]	47.87 [±1.85]	63.03 [±1.62]	47.87 [±1.85]	71.36 [±1.22]
8/2	17.97 [±1.44]	58.52 [±1.04]	53.22 [±1.65]	62.19 [±0.63]	53.22 [±1.65]	75.92 [±0.59]
9/1	17.81 [±2.58]	63.68 [±2.26]	54.51 [±3.86]	52.37 [±2.60]	54.51 [±3.86]	78.17 [±1.46]

Table 4. Comparison on SemEval-2013 and SemEval-2010. The upper block reports our experimental results, while the lower block shows reference results reported by Mosolova et al. [38]. Bold values indicate the best performance in each column. NA indicates that the corresponding result was not reported in the referenced study.

Model Train/Classify	SemEval-2013 Fuzzy-NMI	SemEval-2013 Fuzzy-F-B³	SemEval-2010 V-M	SemEval-2010 Paired F-S	SemEval-2010 NMI	SemEval-2010 F-B³
Our experiments
k-means	16.94	31.05	24.86	47.04	24.86	52.92
HDP	8.98	58.13	11.89	54.51	11.89	58.13
1cpl	0	58.07	0	60.68	0	61.96
GPT-5 (9/1)	16.59 [±2.68]	63.07 [±2.01]	59.34 [±4.55]	54.71 [±1.99]	59.34 [±4.55]	80.02 [±1.87]
GPT-4o (9/1)	13.08 [±1.79]	57.71 [±0.67]	50.54 [±6.47]	41.76 [±2.38]	50.54 [±6.47]	73.45 [±2.30]
Gemini-2.5-Flash (9/1)	17.81 [±2.58]	63.68 [±2.26]	54.51 [±3.86]	52.37 [±2.60]	54.51 [±3.86]	78.17 [±1.46]
WSI-NS	18.11 [±0.32]	54.96 [±0.37]	12.58 [±0.34]	61.81 [±2.33]	82.30 [±0.41]	55.02 [±0.29]
Reported in Mosolova et al. [38]
PolyLM-large	23.7	66.7	43.6	67.5	6.2	49.2
PolyLM-base	23.0	65.4	41.8	66.4	6.2	49.1
LSDP	21.1 [±0.6]	64.1 [±0.5]	38.9 [±1.0]	70.7 [±0.4]	4.6 [±0.1]	52.8 [±0.2]
GPT-4o (reported)	16.9 [±0.5]	58.6 [±1.6]	36.3 [±2.0]	63.9 [±2.0]	7.1 [±0.3]	47.7 [±1.9]
1cpl (reported)	0.0	61.23	0.0	63.5	0.0	64.1
1cpex (reported)	6.9	NA	31.7	0	19.5	8.0
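
The 1cpl rows in Table 4 pair an NMI of 0 with a comparatively high F-B³. The following sketch shows why this combination arises when every instance of a word is placed in a single cluster. It is an illustrative, simplified computation for one target word with hard labels; the function names are hypothetical, the NMI normalization (arithmetic mean of the two entropies) is an assumption, and the official SemEval scorers and their fuzzy variants may differ in detail.

```python
import math
from collections import Counter


def bcubed_f(pred, gold):
    """B-cubed F-score for hard cluster labels `pred` against `gold`."""
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        same_pred = [j for j in range(n) if pred[j] == pred[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        # Items sharing both the predicted cluster and the gold sense of item i.
        both = sum(1 for j in same_pred if gold[j] == gold[i])
        precision += both / len(same_pred)
        recall += both / len(same_gold)
    precision /= n
    recall /= n
    return 2 * precision * recall / (precision + recall)


def nmi(pred, gold):
    """Mutual information normalized by the mean entropy (assumed convention)."""
    n = len(pred)

    def entropy(labels):
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    cp, cg = Counter(pred), Counter(gold)
    joint = Counter(zip(pred, gold))
    mi = sum(c / n * math.log(c * n / (cp[p] * cg[g])) for (p, g), c in joint.items())
    denom = (entropy(pred) + entropy(gold)) / 2
    return mi / denom if denom > 0 else 0.0


# A skewed word: 8 instances of a dominant sense, 2 of a minority sense.
gold = ["a"] * 8 + ["b"] * 2
one_cluster = [0] * 10  # the one-cluster-per-lemma (1cpl) baseline

print(round(bcubed_f(one_cluster, gold), 4), nmi(one_cluster, gold))
```

With a single cluster, B-cubed recall is 1 and precision is dominated by the majority sense, so the F-score stays high, while the clustering carries no information about the gold senses and the mutual information is exactly 0. This is why instance-level and structural metrics need to be read together for skewed sense distributions.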

Table 5. Comparison of results on SemEval-2013 and SemEval-2010.

Model Train/Classify	SemEval-2013 Fuzzy-NMI	SemEval-2013 Fuzzy-F-B³	SemEval-2010 V-M	SemEval-2010 Paired F-S	SemEval-2010 NMI	SemEval-2010 F-B³
none	17.87 [±1.11]	58.47 [±0.67]	14.54 [±0.69]	47.11 [±1.33]	14.54 [±0.69]	51.63 [±1.01]
svm	18.95 [±2.49]	59.23 [±1.62]	13.82 [±1.40]	47.56 [±1.55]	13.82 [±1.40]	52.01 [±1.22]
pmi	17.88 [±1.11]	59.46 [±0.67]	14.92 [±2.17]	47.37 [±0.95]	14.92 [±2.17]	51.92 [±0.92]
svm and pmi	21.57 [±1.08]	61.03 [±0.71]	13.81 [±0.61]	47.43 [±0.88]	13.81 [±0.61]	52.87 [±0.73]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yoshikawa, S.; Sasaki, M. Definition-Anchored Unsupervised Word Sense Induction Using LLM-Generated Glosses. Appl. Sci. 2026, 16, 3797. https://doi.org/10.3390/app16083797

