Article

Validating the Effectiveness of Fine-Tuning for Semantic Classification of Japanese Katakana Words: An Analysis of Frequency and Polysemy Effects on Accuracy

Department of Computer and Information Sciences, Faculty of Engineering, Ibaraki University, Ibaraki 316-8511, Japan
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(3), 67; https://doi.org/10.3390/bdcc10030067
Submission received: 22 January 2026 / Revised: 13 February 2026 / Accepted: 25 February 2026 / Published: 26 February 2026
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

Abstract

In semantic classification of katakana words using large language models and pre-trained language models, semantic divergences from original English meanings, such as those found in Wasei-Eigo (Japanese-made English), and the inherent sense ambiguity of katakana words may affect model accuracy. To analyze the impact of these loanword semantic characteristics on classification accuracy, we created a large-scale dataset from the Balanced Corpus of Contemporary Written Japanese. We extracted 403,819 sentences covering 230 katakana words defined in dictionaries and suitable for word sense disambiguation tasks, and used the gpt-4.1-mini model to predict the meaning of the target words based on their context to create annotation data. We then fine-tuned the pre-trained language model DeBERTa V3 with this data. We compared baseline and fine-tuned model accuracy, dividing the data into four quadrants based on frequency and polysemy to conduct statistical analysis and explore strategies for improving accuracy. We also tested the hypothesis that high-frequency, low-polysemy words would achieve the highest accuracy, while low-frequency, high-polysemy words would achieve the lowest. As a result, the fine-tuned model showed an average accuracy improvement of approximately 53% over the baseline model. As hypothesized, high-frequency, low-polysemy words achieved the highest accuracy (93.93%), while low-frequency, high-polysemy words achieved the lowest (81.14%). Our analysis quantitatively revealed that both frequency and polysemy contributed to accuracy improvement, but that polysemy had a greater impact on accuracy than frequency.

1. Introduction

1.1. Research Background

Research on Large Language Models (LLMs) and pre-trained language models (PLMs) has been actively conducted in the field of natural language processing, with widespread applications expected across diverse domains due to the development of generative AI technologies. Among these domains, Natural Language Processing (NLP) tasks are anticipated to make significant contributions, and research applying these models to various downstream tasks has progressed, including Word Sense Disambiguation (WSD) [1]. WSD is the task of correctly interpreting words whose meanings differ depending on context. By addressing this task, language models can more accurately determine the meaning of target words in sentences, enabling more precise text generation and human-like conversations, thereby advancing the field of NLP. However, Japanese WSD tasks face unique challenges. One factor is that many generative AIs based on LLMs and PLMs are trained primarily on English data, with the proportion of Japanese data in training sets being relatively low. Consequently, Japanese learning is considered insufficient compared to English, making it difficult to interpret word meanings according to sentence intent and context. Another factor is the existence of katakana words in Japanese, which have origins not found in other languages. Katakana words include loanwords imported from English sources and Wasei-Eigo (Japanese-made English), which are interpreted and adopted uniquely in Japan. These often differ from their original meanings, making correct semantic classification in context highly challenging. Due to these factors, Japanese WSD tasks, particularly for katakana words, pose unique challenges compared to those in other languages.

1.2. Related Work

1.2.1. Word Sense Disambiguation: State of the Art

WSD, the task of determining the correct meaning of ambiguous words in context, has traditionally followed two main approaches: supervised learning methods and knowledge-based methods. Supervised methods classify ambiguous words using models trained on corpora annotated by humans with gold-standard senses. Recent approaches leverage pre-trained language models, such as BERT [2] and RoBERTa [3], to embed words and sentences for sense prediction [4,5]. More recent work has explored advanced architectures such as Poly-encoders combined with BERT for improved disambiguation [6]. Knowledge-based methods, on the other hand, rely on external resources such as dictionaries and ontologies. Some approaches vectorize word definitions to learn sense relationships [7], while others use synonym relationships to derive meaningful sense vectors [8]. More recently, attempts have been made to use LLM-based generative AI for WSD, including research examining the WSD capabilities of GPT-4 [9], evaluation of LLM performance in multilingual WSD [10], and exploitation of LLMs’ WSD capabilities for machine translation [11]. While these studies show promising performance, they have not yet reached state-of-the-art levels.
In recent years, LLMs such as LLM-jp [12], Stability AI Japanese Stable LM [13], and Swallow [14] have demonstrated excellent performance in Japanese processing, particularly excelling in In-context Learning (ICL) capabilities. These models possess the versatility to handle diverse tasks by simply including a few examples in prompts. However, when WSD is targeted, task-specific recognition capabilities are more important than general generative capabilities, so learning task-specific representations directly from training data is more efficient than relying on the flexibility of ICL. PLMs such as BERT, RoBERTa, and DeBERTa V3 enable efficient fine-tuning in standard GPU environments, making hyperparameter exploration and multiple experimental iterations practical, which provides an advantage in exploring task-specific representation acquisition strategies. The DeBERTa V3 architecture effectively captures context and positional information through its disentangled attention mechanism, which is beneficial for WSD tasks where word meaning interpretation varies with context. It achieves performance comparable to LLMs in benchmarks such as GLUE and SuperGLUE. Japanese PLMs such as ku-nlp/deberta-v3-base-japanese [15] are also available. Additionally, medium-scale models facilitate overfitting control and possess appropriate model capacity relative to training data volume, demonstrating high affinity with fine-tuning. For these reasons, this study adopts DeBERTa V3 as the base model, balancing computational efficiency with accuracy and enabling task-specific optimization that leverages large amounts of training data.

1.2.2. Lexical Characteristics and Semantic Change of Japanese Katakana Words

Katakana words constitute a unique lexical stratum in Japanese vocabulary, primarily formed through borrowing from foreign languages, particularly English. During the assimilation process, these loanwords undergo semantic transformations including “narrowing”, “broadening”, and “shift”. Japanese katakana words are no exception to this pattern. In particular, Wasei-Eigo (Japanese-made English words) that have acquired meanings or usages different from their original English counterparts include those that share the same form as English but have different meanings. For instance, the Japanese word manshon denotes a relatively large apartment, whereas the English “mansion” signifies a luxurious estate, exemplifying semantic broadening during assimilation. Similarly, winker refers to a vehicle’s turn signal in Japanese, where it is recognized as standard terminology, yet this usage does not exist in English, which uses “turn signal” instead. This illustrates semantic shift during the adaptation process. Jinnai defines Wasei-Eigo as “vocabulary created within Japanese using English (or other foreign language) materials that appears English-like but does not function as English,” highlighting its inherent ambiguity [16]. These examples align with this definition. From a computational linguistic perspective, Takamura quantitatively measured semantic divergence in Japanese loanwords by calculating distances between distributed representation vectors of source words and borrowed words, analyzing the degree of semantic change [17]. However, their research focused primarily on “divergence from the source word” as a single analytical dimension, without sufficiently examining the polysemy acquired post-borrowing or context-dependent semantic diversification within Japanese. The issues of semantic change and polysemy in katakana words remain an unexplored domain.
With the advancement of NLP in Japanese, systematically analyzing katakana polysemy and constructing WSD models has become fully feasible. This study aims to extend the approach of semantic divergence by focusing on the polysemy that katakana words acquire within the Japanese linguistic system, attempting to quantitatively elucidate katakana-specific semantic ambiguity through fine-tuning with DeBERTa V3 as the foundation model.

1.2.3. Statistical Characteristics of Words and WSD Performance

The statistical characteristics of words, particularly frequency and polysemy, are known to significantly influence the performance of NLP tasks. In natural language, word frequency distributions exhibit extreme skewness, with a small number of high-frequency words accounting for the majority of occurrences, while the vast majority of words appear only at low frequencies. This frequency bias carries important implications for machine learning-based NLP tasks. In WSD tasks, high-frequency words have abundant training examples in the corpus, enabling models to sufficiently learn their contextual patterns. Conversely, low-frequency words have limited training examples, making generalization performance prone to degradation. Polysemy also constitutes a critical factor determining the difficulty of WSD tasks. Polysemy can be quantified information-theoretically through entropy. A word exhibits higher entropy when it possesses multiple senses that are distributed relatively uniformly across usage contexts. In the SemEval-2010 Japanese WSD Task [18], 50 target words were evaluated and SVM-based methods achieved the highest accuracy of approximately 76.4%. A negative correlation between the number of word senses and accuracy was reported, with quantitative evidence demonstrating that words with more sense options become more difficult to disambiguate correctly. This finding provides essential insights for analyzing model accuracy and output tendencies in this study.

1.2.4. LLM-Based Data Construction and Annotation Reliability

In recent years, automatic annotation methods using LLMs have gained attention as efficient means for data construction in NLP. Traditionally, creating high-quality annotated data required manual information assignment by domain experts, imposing substantial burdens in both time and cost. Against this backdrop, research has been conducted on data annotation using LLMs, examining LLM-based annotation data generation, evaluation of generated data, and practical applications [19]. Their analysis demonstrates that LLMs can annotate diverse data types, and when combined with appropriate prompt design and evaluation strategies, can achieve performance comparable to or exceeding human annotation. However, automatic annotation by LLMs presents the challenge of incorporating noise (inconsistent data). Zhang proposed LLMaAA (LLM as Active Annotators) [20]. By extracting in-context examples from few-shot samples and assigning learnable weights to training samples, they enhanced robustness against noise. Their experiments achieved performance surpassing supervised models with only hundreds of annotation examples. Furthermore, the benchmark “NoisywikiHow,” which includes noisy labels from the real world, provided an environment closer to actual annotation errors by designing multiple types of label noise aligned with human cognitive processes rather than artificially synthesized noise [21]. These studies demonstrate that models trained under real-world label noise exhibit superior generalization performance, suggesting that noise in annotations does not necessarily impede learning but can be effectively utilized through appropriate learning strategies and more realistic settings. The approach employed in this study, LLM-based annotation combined with fine-tuning of BERT-oriented models, represents an efficient and practical method for data construction.
Specifically, we utilize the advanced language understanding capabilities of GPT-series LLMs to automatically generate sense annotations for large quantities of katakana words, then fine-tune DeBERTa V3 using this data to construct high-accuracy task-specific models. This approach offers the advantage of generating higher-quality pseudo-labels than conventional methods through LLM contextual understanding capabilities. By fine-tuning BERT-oriented models with LLM-generated data, inference costs can be substantially reduced while achieving task-specific high accuracy. Furthermore, efficient fine-tuning becomes feasible in standard GPU environments, with the practical benefit of enabling iterative parameter exploration and multiple experiments. This approach demonstrates effectiveness in balancing practicality and accuracy, forming the foundation supporting the novelty and practical value of this research.

1.3. Research Objectives and Contributions

Considering the research background and related studies described above, we concluded that undertaking a WSD task focused on katakana words, with an emphasis on frequency and polysemy, holds great promise for contributing significantly to the field of NLP in Japanese. Therefore, this paper aims to improve semantic classification accuracy for Japanese katakana words through fine-tuning PLMs on Japanese-specific data, while analyzing classification patterns. We constructed a dataset from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) [22] provided by the National Institute for Japanese Language and Linguistics (NINJAL), and fine-tuned DeBERTa V3 [23]. DeBERTa V3 improves upon BERT by adopting a disentangled attention mechanism and an enhanced masked decoder, achieving high performance across various NLP tasks, including those requiring detailed semantic understanding. Among these, the ku-nlp/deberta-v3-base-japanese model [15] is pre-trained on Japanese data, enabling long-sentence processing that accounts for Japanese contextual dependencies. Using this model as a baseline, we compare and evaluate its performance against the fine-tuned model. We also analyze characteristics such as word frequency and polysemy to discuss challenges and potential improvements in katakana WSD. In particular, polysemy can pose unique difficulties in semantic classification of katakana words, as many loanwords have context-dependent meanings. As an example of such polysemous words, the Japanese word chip carries meanings used in fields such as food, electronics, services, and entertainment. We believe that analyzing the characteristics of polysemous words with such semantically distinct meanings can contribute to the WSD task for katakana words.
While a study exists that quantifies polysemy in Japanese and evaluates it as an indicator [18], to the best of our knowledge, no prior research has systematically analyzed WSD from both frequency and polysemy perspectives while focusing exclusively on katakana words. This study addresses this gap by combining these two analytical dimensions for katakana word WSD.

2. Methods

To improve the baseline model performance using a large-scale balanced corpus containing katakana words, we constructed a fine-tuning dataset from the BCCWJ, provided by NINJAL. Note that, under the terms of the BCCWJ license agreement, we may not publish any data that could be used to reconstruct all or part of the corpus. Accordingly, this paper does not include example sentences extracted from the corpus or other information that would directly reveal its internal content. The detailed procedure is described below.

2.1. Extraction

The data format adopted for extraction was M-XML (Morphological-based XML) in the BCCWJ. We utilized M-XML data because its XML format, which integrates the two sample types (fixed-length and variable-length) on a character basis, simplifies the handling of linguistic structure information. The BCCWJ was selected because it is a balanced corpus covering contemporary written Japanese from diverse genres, including books, magazines, newspapers, and web texts. The corpus’s diversity ensures that the extracted katakana words reflect actual usage patterns across multiple fields. Therefore, it is expected to contribute significantly to solving the WSD task, where word sense ambiguity arises depending on the context. The BCCWJ employs two levels of word segmentation, Short Unit Words (SUW) and Long Unit Words (LUW); the content within sentence tags is further classified by LUW and SUW tags. In this study, we distinguished word information using SUW tags. The reason for choosing SUW over LUW is that SUW provides the minimum morpheme units necessary for the WSD task. While LUW groups multiple morphemes into larger semantic units, SUW separates individual words, allowing the specific word sense in context to be analyzed without interference from surrounding morphemes. We determined that SUW’s ability to preserve more granular information and contribute to more detailed trend and statistical analysis justified its use for extracting the necessary information in WSD tasks. All sentences within SUW tags whose wType attribute (which indicates a word’s origin type) holds the value “foreign word” were extracted. This yielded 1,566,913 sentences containing 180,664 words. At this stage, the data still included expressions other than katakana words, such as foreign-language expressions. Therefore, the following steps were applied to extract only katakana words.
Next, to exclude neologisms and other non-standard words, we used the morphological analyzer MeCab [24] with the IPADIC dictionary. MeCab with IPADIC was chosen because it is widely adopted in Japanese NLP research and comprehensively covers standard vocabulary, which is useful for our goal of extracting only established katakana words. We then retained only words listed in the dictionary from the aforementioned 180,664 words, resulting in 12,161 words. Furthermore, proper nouns were excluded: as names used to identify specific entities, they lack contextual sense ambiguity, making them unsuitable for the WSD task. Similarly, interjections, whose word senses are not context-dependent, were excluded because they lack the ambiguity that is the subject of WSD. Words for which only one example sentence was available were also excluded, as sufficient training data could not be secured for each such word, preventing an accurate performance evaluation of the fine-tuned model. As a result of these exclusions, a final set of 6925 words was obtained.
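The katakana-only filtering at this stage can be illustrated with a simple character-range check. This is only a sketch under stated assumptions: the study itself relies on MeCab with the IPADIC dictionary for dictionary membership, and the exact character ranges below (the main katakana block plus the prolonged sound mark) are our own illustrative choice.

```python
import re

# Katakana block (U+30A1-U+30FA) plus the prolonged sound mark (U+30FC).
# Illustrative only: dictionary membership (MeCab + IPADIC) and the
# proper-noun / interjection exclusions described in the text are not
# reproduced here.
KATAKANA_RE = re.compile(r"^[\u30A1-\u30FA\u30FC]+$")

def is_katakana_word(token: str) -> bool:
    """Return True if the token consists solely of katakana characters."""
    return bool(KATAKANA_RE.match(token))
```

A token such as マンション (manshon) passes this check, while mixed or Latin-script tokens do not.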
Finally, to focus only on words registered in actual dictionaries, we consulted the Digital Daijisen dictionary [25] via the Weblio online dictionary service and extracted only words with two or more defined word senses. This is fundamental to WSD, as the WSD task inherently requires selecting from multiple possible meanings. Words with only one defined sense fall outside the scope of this study because they present no ambiguity to be resolved. Specifically, we adopted a method of checking the sense definitions by targeting the HTML element whose class is SGKDJ (where definitions derived from Digital Daijisen are listed) and excluding those with only one sense definition. This approach allowed for a structural confirmation of the dictionary-defined word senses. The resulting 801,899 sentences, which contain 1639 words, were used to construct the fine-tuning dataset. At this stage, each sentence containing a target word is classified and saved (for example, if the target word is “target”, it is saved in a file containing only sentences that include “target”). Table 1 summarizes the extraction process and the number of words and sentences retained at each step.
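The structural check against the SGKDJ class can be sketched with Python's standard `html.parser`. The HTML fragment used for testing is hypothetical and far simpler than the actual Weblio markup, and the additional filtering of sense strings by their leading characters (numbers, Chinese numerals, alphabets, symbols) is omitted.

```python
from html.parser import HTMLParser

class SenseExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is 'SGKDJ'.

    A minimal sketch: the real Digital Daijisen markup on Weblio is more
    complex, and the leading-character filtering described in the text is
    not reproduced here.
    """
    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth while inside a SGKDJ element
        self.senses = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("class") == "SGKDJ":
            self._depth = 1
            self.senses.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.senses[-1] += data.strip()

def count_senses(html: str) -> int:
    """Number of SGKDJ sense definitions found in the page fragment."""
    parser = SenseExtractor()
    parser.feed(html)
    return len(parser.senses)
```

A word would then be kept only when `count_senses(...)` returns two or more, matching the exclusion rule described above.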

2.2. Annotation

Using the 1639 words extracted in Section 2.1 as target words, we employed the OpenAI API (https://platform.openai.com/docs/, accessed on 26 July 2025) provided by OpenAI Inc. (San Francisco, CA, USA) to predict the meaning of target words in each sentence. A recent study has shown that LLMs can effectively generate annotation data from example sentences for WSD tasks [19]. To validate whether these predictions could properly serve as annotation data, we randomly sampled 1000 sentences from the 801,899 sentences across the 1639 words and manually assigned word senses. During this process, sentences were excluded from sampling if they contained words whose meanings were not included in the word sense definitions extracted from the Digital Daijisen dictionary, if the words were proper nouns, or if the sentences were so short that they provided almost no context, making the intended sense impossible to discern. Note that the meaning options for annotation were those extracted from the Digital Daijisen dictionary. To compare predictions across multiple models, we first prepared the sense data. Since the word senses with SGKDJ values in Weblio are based on definitions from Digital Daijisen, we scraped these senses for each target word. The scraping conditions were set to extract senses beginning with numbers, Chinese numerals, alphabets, or symbols. We then compared human annotations with prediction results from five GPT models, the latest models released as of July 2025, when the experiment was conducted. The results are shown in Table 2. The Input and Output costs are based on OpenAI API pricing.
As shown in Table 2, gpt-4.1 achieved the highest agreement rate. However, gpt-4.1-mini achieved only 4.2% lower accuracy than gpt-4.1 at approximately one-fifth of the cost, and 2.2% lower accuracy than gpt-4o at approximately one-sixth of the cost. Based on these results, we selected gpt-4.1-mini as the annotation model, judging that it maintains acceptable accuracy while significantly reducing costs. The 80.30% agreement rate between gpt-4.1-mini predictions and human annotations indicates the presence of approximately 19.70% potentially noisy labels. However, we intentionally retained all annotations without manual correction or filtering. This decision is based on recent research on learning with noisy labels, as discussed in Section 1.2.4. Wu demonstrated that models trained on real-world noisy label data exhibit superior generalization performance compared to models trained on synthetic noisy label data, suggesting that noisy labels do not necessarily hinder learning. Furthermore, manually reviewing and correcting all or part of the 801,899 sentences would be extremely time-consuming, undermining the practical advantages of LLM-based automatic annotation. For these reasons, despite the presence of noisy labels, we expected the abundance of training examples in our dataset, particularly for high-frequency words, to enable learning correct semantic patterns through statistical regularities.
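The agreement rate used in this comparison is simply the fraction of sampled sentences on which the model's predicted sense matches the human annotation; a minimal sketch:

```python
def agreement_rate(human_labels, model_labels):
    """Fraction of sentences where the model's predicted sense number
    matches the human annotation (cf. the 80.30% figure reported for
    gpt-4.1-mini on the 1000-sentence validation sample)."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label sequences must be the same length")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)
```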
Using this sense data and files containing example sentences for each target word, we created a program for API usage. The prompt format actually used in the program is shown in Figure 1.
The choices shown in Figure 1 represent the word senses of each target word, and since each choice is assigned a number, we asked the model to output this number in its response. During API usage, based on the validation results in Table 2, we executed predictions using gpt-4.1-mini, which offered the best cost-effectiveness. The obtained prediction results were used as annotation data, which was subsequently utilized to create the data for later fine-tuning.
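The structure of such a prompt (target word, context sentence, numbered sense choices, answer-as-number instruction) can be sketched as follows. The exact wording used in Figure 1 is not reproduced here, so the phrasing below is an illustrative reconstruction, not the prompt actually sent to the API.

```python
def build_wsd_prompt(target_word: str, sentence: str, senses: list) -> str:
    """Assemble a sense-selection prompt with numbered choices.

    Illustrative only: the real prompt in Figure 1 may differ in wording
    and language; only the overall structure is modeled here.
    """
    lines = [
        f"Target word: {target_word}",
        f"Sentence: {sentence}",
        "Choices:",
    ]
    lines += [f"{i}. {sense}" for i, sense in enumerate(senses, start=1)]
    lines.append("Answer with the number of the sense that fits the context.")
    return "\n".join(lines)
```

The model's single-number reply can then be parsed directly into an annotation label for the sentence.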

2.3. Fine-Tuning

2.3.1. Data Characterization

Before conducting fine-tuning, we examined the annotation data (1639 words). First, we tokenized each word using the tokenizer and extracted words that were not split into subwords. The tokenizer utilizes tiktoken, a BPE (Byte Pair Encoding) tokenizer provided by OpenAI, configured with the “cl100k_base” encoding. The extracted words suggest they appeared with sufficient frequency during the training of the model underlying the tokenizer. In other words, these words are likely to be treated as generally important terms, and their semantic representations can be efficiently learned as complete units, making them suitable for the WSD task in this study. For these reasons, careful word selection using the tokenizer is crucial. As a result, 940 words and 504,688 sentences were extracted. Furthermore, for each target word, we calculated the frequency as a value indicating how many sentences containing the target word exist in the BCCWJ. For the initial set of 940 words extracted through tokenization, the statistical analysis yielded a mean of 536.90, a median of 169.00, a standard deviation of 1168.16, a minimum of 2, and a maximum of 12,791. The values in this statistical summary indicate numbers of sentences. The mean is greater than the median, and the difference between the median and the maximum value is very large, suggesting that the frequency distribution is heavily skewed to the right and might closely follow Zipf’s law [26]. Further examination of the frequency distribution revealed that the top 24.5% of words by frequency (230 words) accounted for 80.0% of all sentences (403,819/504,688). Consequently, it was determined that the data followed a Pareto distribution consistent with Zipf’s law, allowing the application of the Pareto principle.
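The Pareto-style coverage check described above (the top 24.5% of words covering 80.0% of sentences) can be sketched as a small helper; the frequency table used for testing is hypothetical.

```python
def pareto_coverage(freqs, word_share):
    """Return (number of top words, fraction of sentences covered) when
    keeping the top `word_share` fraction of words by frequency.

    Mirrors the check in the text, where the top 24.5% of the 940 words
    covered 80.0% of the 504,688 sentences. The frequencies passed in
    are assumed to be per-word sentence counts.
    """
    ranked = sorted(freqs.values(), reverse=True)
    k = round(len(ranked) * word_share)
    covered = sum(ranked[:k])
    return k, covered / sum(ranked)
```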
Therefore, we selected these 230 words with 403,819 sentences for fine-tuning, ensuring that each word had sufficient training examples for robust data splitting. Furthermore, using the annotation data obtained in Section 2.2, we calculated the frequency of each word’s semantic distribution and the word entropy. Frequency is an indicator that focuses on the number of example sentences to examine the occurrence tendencies and distribution of katakana words within the BCCWJ. We introduced it because it allows for intuitive information acquisition. On the other hand, entropy is a measure representing the uncertainty or unpredictability of information. Defining a word sense as the information contained within a word, the number of senses represents the information content of the word. Thus, the ambiguity of word senses can be quantified by the word’s entropy. Higher entropy indicates a more evenly distributed sense distribution and greater prediction difficulty. Therefore, we introduced entropy as a necessary metric for the WSD task, where resolving such ambiguity and uncertainty is crucial. In this study, we calculated the Shannon entropy [27]:
H = -\sum_i p(i) \log_2 p(i).
We used this as the entropy value for the semantic distribution of each word. Here, p(i) represents the probability that the i-th semantic meaning is selected. For example, if option i is selected 100% of the time (when p(i) = 1), the entropy becomes 0. Conversely, if the probability of selecting option i is the same for all options (p(i) approaches 0 as the number of options increases), this is equivalent to selecting options completely at random, resulting in the maximum possible entropy. Therefore, values closer to 0 indicate less ambiguity, while values closer to the maximum indicate greater ambiguity. Note that the maximum entropy equals \log_2 n, where n is the number of options: it is 1 when there are two equally probable options and exceeds 1 as the number of options increases beyond two.
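Under these definitions, the per-word entropy can be computed directly from the annotated sense labels; a minimal sketch:

```python
import math
from collections import Counter

def sense_entropy(sense_labels):
    """Shannon entropy H = -sum_i p(i) * log2 p(i) of a word's sense
    distribution, computed from its annotated sense labels.
    0 means a single dominant sense; log2(n) is the maximum for n
    equally likely senses."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, a word whose annotations split evenly over two senses has entropy 1.0, while a word always annotated with the same sense has entropy 0.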
Statistical analysis of the frequency distribution of 230 words yielded the following values: mean 1755.73, median 1057.50, standard deviation 1889.85, minimum 493, and maximum 12,791. Calculating entropy yielded an average of 0.9768, a median of 0.8968, a standard deviation of 0.5619, a minimum of 0.0257, and a maximum of 2.5900.
The frequency distribution exhibited strong positive skewness (skewness = 3.30), and the distance from the median to the maximum value (11,733.50) was approximately 10 times the interquartile range (IQR = 1194.75), confirming that the distribution is heavily influenced by a small number of extremely high-frequency words. Here, we introduce the concept of breakdown point, which indicates the maximum proportion of outliers that an estimator can tolerate before it becomes unreliable [28]. The median has a breakdown point of 50%, meaning it remains stable even when nearly half of the data points are outliers, which is the theoretical maximum for location estimators. In contrast, the mean has a breakdown point of 0%, as even a single extreme outlier can distort it arbitrarily. We compared three measures of central tendency as potential thresholds: the arithmetic mean (1755.73), the geometric mean (1270.41), and the median (1057.50). Using the arithmetic mean as the threshold would classify only 68 words (29.6%) as high frequency and 162 words (70.4%) as low frequency. The geometric mean, which is theoretically appropriate for right-skewed distributions following Zipf’s law, would yield 97 words (42.2%) as high frequency and 133 words (57.8%) as low frequency. Only the median provides an equal split of 115 words (50.0%) in each category. Considering these results together with the breakdown point perspective, we concluded that the median is the most appropriate measure for this study.
Furthermore, since the primary objective of this study is to compare WSD performance across quadrants, balanced group sizes ensure that performance estimates for each quadrant are based on data of comparable quality. Additionally, when constructing the four quadrants using both frequency and entropy thresholds, the geometric mean resulted in highly unbalanced quadrant sizes (ranging from 38 to 76 words, with a maximum-to-minimum ratio of 2.00), whereas the median produced well-balanced quadrants (ranging from 54 to 61 words, with a ratio of 1.13). Therefore, for these reasons, the median was adopted as the threshold.
To verify the robustness of this choice, we examined whether the main conclusions would change if the arithmetic mean or geometric mean were used as the threshold instead of the median. We re-partitioned the 230 words into four quadrants under each threshold and recalculated the mean accuracy per quadrant. The results are shown in Table 3.
As shown in Table 3, the accuracy ranking across quadrants (Q2 > Q4 > Q1 > Q3) was identical under all three thresholds. Furthermore, the accuracy gap between high and low polysemy quadrants ranged from 10.1% to 10.9%, whereas the gap between high and low frequency quadrants ranged from 1.1% to 2.4%, indicating that the polysemy effect consistently exceeded the frequency effect. The slight differences in absolute accuracy values across thresholds are attributable to changes in group composition rather than threshold quality. For instance, the geometric mean yields a lower entropy threshold than the median (0.783 compared to 0.897), which classifies more words as low polysemy and is likely to raise the average accuracy. However, these differences reflect selection effects rather than genuine performance improvements. Combined with the balanced group sizes (ratio of 1.13) and the maximum breakdown point discussed above, we consider the median to be the most appropriate threshold for this study, while confirming that the main conclusions are robust to this methodological choice.
We defined high frequency (HF) and low frequency (LF) relative to our dataset using the median frequency value of 1057.50 as the threshold. It is important to note that even our “low-frequency” words have a minimum of 493 occurrences, which is relatively high from a general linguistic perspective. The term “low-frequency” here does not refer to absolute frequency, but rather indicates words that are relatively low-frequency within the dataset created for this study (230 words). Similarly, we defined high polysemy (HP) and low polysemy (LP) using the median entropy value of 0.8968 as the threshold. Classifying the 230 words based on these two axes resulted in four quadrants, as shown in Table 4.
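For illustration, the median-based quadrant assignment described above can be sketched in Python as follows. This is a minimal reconstruction, not the authors' code; the function name and toy inputs are our own, and tie handling at the median (assigned to the "high" side here) is an assumption the paper does not specify. The study's actual thresholds were a median frequency of 1057.50 and a median entropy of 0.8968.

```python
from statistics import median

def assign_quadrants(words):
    """words: dict mapping word -> (frequency, entropy).
    Returns a dict mapping word -> quadrant label (Q1..Q4),
    using the per-dataset medians as thresholds."""
    freqs = [f for f, _ in words.values()]
    ents = [e for _, e in words.values()]
    f_med, e_med = median(freqs), median(ents)
    quadrants = {}
    for w, (f, e) in words.items():
        # Values exactly at the median are treated as "high" here (assumption).
        if f >= f_med and e >= e_med:
            q = "Q1"  # high frequency, high polysemy
        elif f >= f_med:
            q = "Q2"  # high frequency, low polysemy
        elif e >= e_med:
            q = "Q3"  # low frequency, high polysemy
        else:
            q = "Q4"  # low frequency, low polysemy
        quadrants[w] = q
    return quadrants
```

Swapping `median` for `statistics.mean` or a geometric mean reproduces the alternative partitions examined in the robustness check above.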
Considering the relationships among the four quadrants shown in Table 4, we can predict that words with high polysemy and low frequency (Q3) will have the lowest accuracy, while conversely, words with low polysemy and high frequency (Q2) will have the highest accuracy. This is because higher polysemy means that word senses are captured more ambiguously, while higher frequency suggests that the word is used in many contexts and is generally an important word. To verify this hypothesis, we fine-tune the DeBERTa V3 (ku-nlp/deberta-v3-base-japanese) model for each quadrant and measure its accuracy.

2.3.2. Training Procedure

First, using the 230 words extracted in Section 2.3.1, we created training, development, and test sets by randomly splitting the data at an 8:1:1 ratio. The split was stratified by target word, using the example sentences for each word, the validation labels from the annotation data, and the choice labels based on word definitions. Stratified splitting is a technique that preserves the proportion of each class (here, each word) across subsets. Unlike purely random splitting, where some words might be entirely absent from certain sets, stratified splitting ensures that the data for each word is distributed across the training, development, and test sets at the specified ratio, thereby enabling fair evaluation across all target words. As shown in Section 2.3.1, even the least frequent of the 230 words has 493 example sentences, which is sufficient for an 8:1:1 split (at least approximately 394 training, 49 development, and 49 test instances per word). To compare the accuracy of the baseline and fine-tuned models on average, we performed the split five times, using random seed values of 42, 43, 44, 45, and 46 for reproducibility.
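The per-word stratified split can be sketched as follows. This is an illustrative, stdlib-only reconstruction under our own assumptions (function name, tuple layout); the authors may have used a library routine such as scikit-learn's stratified splitter instead.

```python
import random
from collections import defaultdict

def stratified_split(examples, seed, ratios=(0.8, 0.1, 0.1)):
    """examples: list of (word, sentence, label) tuples.
    Shuffles and splits the examples of each word independently, so every
    word appears in train/dev/test at approximately the given 8:1:1 ratio."""
    by_word = defaultdict(list)
    for ex in examples:
        by_word[ex[0]].append(ex)
    rng = random.Random(seed)  # e.g., one of the seeds 42..46
    train, dev, test = [], [], []
    for word, exs in by_word.items():
        rng.shuffle(exs)
        n = len(exs)
        n_train = int(n * ratios[0])
        n_dev = int(n * ratios[1])
        train += exs[:n_train]
        dev += exs[n_train:n_train + n_dev]
        test += exs[n_train + n_dev:]
    return train, dev, test
```

With 493 examples for the least frequent word, this scheme yields roughly 394/49/50 instances per word, matching the counts stated above.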
During this process, some example sentences were skipped because valid labels were not assigned during annotation. There were 148 such instances in Q2, while none existed in Q1, Q3, or Q4. Of these, 147 instances were related to the word “pan”, where the predicted labels were “None” or values that did not correspond to any of the choices. The remaining case was also determined to have no corresponding meaning among the options. For “pan”, the meaning of “pan” as a food item was not extracted during scraping from Weblio. This occurred because the definitions of “pan” in the Digital Daijisen dictionary were diverse and each was used in highly independent contexts, failing to meet the scraping criteria. Specifically, the HTML structure of the dictionary entry for “pan” did not conform to the standard format assumed by the scraping conditions, preventing the extraction of all sense definitions. The remaining instance also likely failed to meet the scraping criteria. However, since the skip rate was only 0.096% of the total (148 out of 153,713; total number of example sentences in Q2), we considered its impact on data analysis and accuracy negligible. We therefore excluded the 148 cases and proceeded with fine-tuning using the remaining data. We formatted the created training data, development data, and test data into a format suitable for easy loading during training, generating JSONL files in prompt format. The actual format is shown in Figure 2.
As shown in Figure 2, we designed this as a word sense classification task where the model generates responses based on prompts that provide options and ask the model to select the correct word sense from those options for target words within a sentence. To obtain values for each evaluation metric (accuracy, recall, precision, F1 score), we provided model-generated reference labels as the evaluation labels. For the execution environment, we fine-tuned the ku-nlp/deberta-v3-base-japanese model using the Transformers library (version 4.30.2) and PyTorch 2.0.1. Training was performed with a learning rate of 2 × 10−5, maximum sequence length of 256, batch size of 12, 3 epochs, and the AdamW optimizer. Training the fine-tuned models across four quadrants for each random seed required approximately 27.5 h.
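A configuration sketch of this fine-tuning setup (Transformers 4.30.x, PyTorch 2.0.x) is shown below. This is a hypothetical reconstruction, not the authors' code: framing the prompt as a sequence-classification input is our assumption, and `num_senses`, `train_ds`, and `dev_ds` are placeholders for the label space and tokenized datasets built from the JSONL files in Figure 2.

```python
# Hypothetical sketch of the reported training configuration; not the
# authors' code. num_senses / train_ds / dev_ds are placeholders.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "ku-nlp/deberta-v3-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_senses)

def tokenize(batch):
    # The prompt text carries the sentence, target word, and sense choices.
    return tokenizer(batch["prompt"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="deberta-katakana-wsd",
    learning_rate=2e-5,              # as reported above
    per_device_train_batch_size=12,  # batch size 12
    num_train_epochs=3,              # 3 epochs
    evaluation_strategy="epoch",     # validation loss once per epoch
)                                    # AdamW is the Trainer's default optimizer

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds,
                  tokenizer=tokenizer)
trainer.train()
```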

3. Results

3.1. Performance Evaluation

We conducted fine-tuning following the procedure described above and evaluated the model performance on each quadrant. Table 5 shows the baseline model performance, and Table 6 shows the fine-tuned model performance on the test set for each quadrant, both averaged over five random seeds. We report Accuracy, Precision, Recall, and F1 score as evaluation metrics.
As shown in Table 5, in the baseline model, Q4 (low frequency, low polysemy) achieved the highest values across all evaluation metrics. In contrast, Q2, which also has low polysemy, scored approximately 11–13% lower than Q4. Furthermore, the low-polysemy quadrants Q2 and Q4 exceeded the high-polysemy quadrants Q1 and Q3 by approximately 4% or more across all evaluation metrics. These results indicate that words with lower polysemy achieve higher accuracy in the baseline model (DeBERTa V3). Additionally, among low-polysemy words, low-frequency words (Q4) tended to exhibit higher accuracy than high-frequency words (Q2).
Meanwhile, as confirmed by the results shown in Table 6, the quadrant with high frequency and low polysemy (Q2) yielded the highest accuracy of approximately 93% across all evaluation metrics. Conversely, the quadrant with low frequency and high polysemy (Q3) recorded the lowest values across all evaluation metrics. These results were consistent with our hypothesis mentioned in Section 2.3.1. Furthermore, focusing solely on polysemy, the low-polysemy quadrants (Q2, Q4) showed approximately 8–12% higher accuracy than the high-polysemy quadrants (Q1, Q3). This clearly demonstrates the significant impact of polysemy on performance.
To evaluate the training dynamics and assess potential overfitting, we examined the training and validation loss curves during the 3-epoch training process. Figure 3 shows the loss curves for all four quadrants.
As shown in Figure 3, the training loss decreased consistently throughout training across all quadrants. The validation loss was evaluated at the end of each epoch, providing three data points per quadrant. While this limited number of evaluation points constrains detailed analysis of overfitting dynamics, the observable trends indicate that the validation loss plateaued or increased slightly while the training loss continued to decrease, particularly in Q1, suggesting mild overfitting between epochs 2 and 3. However, the achieved accuracy remained high and stable across the five random seeds, as evidenced by the small standard deviations in Table 6, so the impact of overfitting is considered minimal. As future work, implementing early stopping based on validation loss, especially when training on datasets built from highly polysemous words, could mitigate this overfitting tendency.

3.2. Significance Verification

To assess the statistical significance of these differences, we conducted bootstrap analysis with 10,000 resamples on the accuracy of the fine-tuned model for each quadrant [29]. The results are shown in Table 7.
Since none of the 95% confidence intervals contain zero, all pairwise comparisons between quadrants showed statistically significant differences. This quantitatively demonstrates that polysemy has a larger effect on accuracy (approximately 10% difference between high and low polysemy) than frequency (approximately 2% difference between high and low frequency).
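A percentile-bootstrap comparison of this kind can be sketched as follows. This is a minimal stdlib illustration under our own assumptions (per-example 0/1 correctness vectors, independent resampling per quadrant); the exact resampling scheme used in the paper may differ.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_resamples=10_000, seed=0, alpha=0.05):
    """correct_a, correct_b: lists of 0/1 per-example correctness for two
    quadrants. Returns a (lower, upper) percentile confidence interval for
    the accuracy difference (a - b); a CI excluding zero indicates a
    statistically significant difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each quadrant's test examples with replacement.
        sa = [rng.choice(correct_a) for _ in range(len(correct_a))]
        sb = [rng.choice(correct_b) for _ in range(len(correct_b))]
        diffs.append(sum(sa) / len(sa) - sum(sb) / len(sb))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Running this for each quadrant pair reproduces the structure of the pairwise comparisons reported in Table 7.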

3.3. Comparison with BERT Models

To verify whether the findings of this study can be generalized across different PLMs, we conducted similar experiments using BERT (cl-tohoku/bert-base-japanese) [30] as the baseline model. Table 8 shows the performance of the baseline model, and Table 9 shows the performance of the fine-tuned model.
The BERT model exhibited a pattern consistent with DeBERTa V3. The baseline model achieved the highest accuracy on Q4 (low frequency, low polysemy); Q2, also in the low-polysemy half, scored approximately 4–8% lower than Q4. The low-polysemy quadrants (Q2, Q4) demonstrated accuracy at least 7% higher than the high-polysemy quadrants (Q1, Q3) across all evaluation metrics. In contrast, the fine-tuned model achieved the highest accuracy in Q2 (high frequency, low polysemy) and the lowest in Q3 (low frequency, high polysemy), with the low-polysemy quadrants (Q2, Q4) again showing approximately 8–12% higher accuracy than the high-polysemy quadrants (Q1, Q3). This result is largely consistent with the trend observed for DeBERTa V3. Furthermore, comparing accuracy across models, the fine-tuned DeBERTa V3 model showed slightly higher values than BERT across all metrics in all quadrants (approximately 0.02–1.1%). However, the improvement through fine-tuning was comparable: DeBERTa V3 achieved an average improvement of approximately 53%, and BERT approximately 51%.
Additionally, examining the differences in metric values for the baseline models revealed that BERT had higher values across all metrics in all quadrants except Q4 (approximately 2–4%). In Q4, DeBERTa V3 showed higher accuracy (approximately 0.45%), but this difference was not significant, and it cannot be clearly stated that DeBERTa V3 is more accurate for Q4. These results suggest that the impact of frequency and polysemy on the accuracy of the WSD task is roughly consistent across different PLM architectures.

4. Discussion

4.1. Effects of Frequency and Polysemy

4.1.1. Fine-Tuning Model

As shown in the results, the baseline model achieved the highest accuracy for low-frequency, low-polysemy words (Q4), while the fine-tuned model achieved the highest accuracy for high-frequency, low-polysemy words (Q2). First, we discuss the fine-tuned model from the perspectives of frequency and polysemy. Regarding frequency, as suggested by the hypothesis in Section 2.3.1, high-frequency words are likely important terms used in many contexts. Appearing in many contexts means these words occur frequently during training, allowing the model to sufficiently acquire their semantic representations; hence the fine-tuned model achieved its highest accuracy on Q2. However, despite its high frequency, Q1 showed approximately 10% lower accuracy than Q2, which we attribute to polysemy. Regarding polysemy, higher polysemy can be expected to increase ambiguity. Table 5 and Table 6 show that high-polysemy words (Q1, Q3) have lower accuracy than low-polysemy words (Q2, Q4), confirming higher predictive ambiguity. This likely stems from contextually appropriate meanings being dispersed across numerous senses, blurring the correspondence between context and meaning. In other words, the finer and more numerous the senses a target word possesses, the more fragmented the contextual patterns associated with each sense become, increasing the instances where different senses are used in similar contexts. Consequently, the model struggles to identify the appropriate sense from context, reducing accuracy. Therefore, the reason Q1 showed significantly lower accuracy than Q2 despite its high frequency lies in its high polysemy, which made correct semantic classification difficult.
In other words, while high-frequency words contribute to accuracy improvement in semantic classification, low polysemy is even more crucial and contributes more significantly to accuracy enhancement.

4.1.2. Baseline Model

Next, we discuss the baseline model. Unlike the fine-tuned model, the baseline model achieved the highest accuracy on Q4 rather than Q2. Summarizing by polysemy level, the low-polysemy quadrants Q2 and Q4 showed higher accuracy than the high-polysemy quadrants Q1 and Q3. This is thought to be due to the significant influence of polysemy mentioned earlier, where higher polysemy leads to lower accuracy. As previously stated, high-frequency words are used in many contexts. This suggests that because they appear in various example sentences, predictions may become dispersed in an untrained state. Conversely, low-frequency words have fewer samples before training and are used only in specific contexts, so their patterns may be somewhat clearer. From this, in the baseline model, high-frequency words are used in diverse contexts, which is thought to increase sense ambiguity in an untrained state. Indeed, comparing Q2 and Q4 with low polysemy, the low-frequency Q4 showed higher accuracy. Similarly, comparing Q1 and Q3 with high polysemy, the low-frequency Q3 showed higher accuracy. This supports the hypothesis that low-frequency words have clearer contextual patterns. In summary, in the baseline model, high polysemy is a cause of accuracy decline, and high frequency also makes prediction difficult due to contextual diversity in an untrained state, leading to accuracy decline. Thus, the effect of frequency on accuracy is reversed between the baseline model and the fine-tuned model. This suggests that fine-tuning enables the utilization of abundant training samples for high-frequency words, allowing contextual diversity to work advantageously. This indicates that for accurate semantic classification of high-frequency words in the katakana WSD task, additional training such as fine-tuning is highly important.

4.1.3. Limitations of Frequency as a Metric

The frequency in this study is defined as the number of sentences containing the target word in the BCCWJ and does not directly represent the contextual diversity of word usage. Even for high-frequency words, if their occurrences are concentrated in a limited range of genres, their impact on WSD tasks could differ from that of words used with comparable frequency across diverse contexts. As shown in Table 7, one possible reason for the relatively small frequency effect (approximately 2.3%) observed in the bootstrap analysis compared to the polysemy effect (approximately 10.5%) may be this influence. Specifically, simple occurrence counts cannot distinguish between concentrated and diverse usage, potentially attenuating the observed effect of frequency on WSD accuracy. Because the BCCWJ is a balanced corpus covering multiple genres, high-frequency words generally tend to appear in varied contexts, but this does not necessarily hold for every word. This suggests a limitation in that simple occurrence counts cannot fully capture frequency as a metric.

4.2. Analysis in Each Quadrant

We further analyze the sense count distribution, frequency, and entropy for each quadrant in detail. Table 10 shows the statistical information on frequency and entropy for Q1 through Q4. Similarly, Table 11 shows the statistical information on sense count distribution for Q1 through Q4.
From Table 10, we can see that high-frequency words (Q1, Q2) have approximately 4 times larger mean values and approximately 14 times larger standard deviation values compared to low-frequency words (Q3, Q4). As also shown in Section 2.3.1, the maximum frequency value is 12,791, which is much larger than the high-frequency word mean values of 2726.1 and 2846.5. This indicates that extremely high-frequency words are included among the high-frequency words. From Table 5 and Table 6, we can confirm that the accuracy improvement is larger for high-frequency words Q1 and Q2 compared to low-frequency words Q3 and Q4 (average improvement of 58.0% for high-frequency words compared to 48.7% for low-frequency words). We speculate that high-frequency words enabled sufficient learning during fine-tuning due to their abundant sample sizes, resulting in larger accuracy improvements. Examining entropy, we find that the high-polysemy quadrants Q1 and Q3 also had larger standard deviations compared to Q2 and Q4. Q3 had the largest value (0.4534), indicating the most heterogeneous word characteristics within the quadrant. The lowest accuracy of Q3 may be attributed to this heterogeneity lowering the overall accuracy. Additionally, examining the mean and median values, we find that high-polysemy words Q1 and Q3 have more than twice the values of low-polysemy words Q2 and Q4. This indicates a large difference in polysemy between quadrants, which likely amplified the effect of polysemy on accuracy.
Next, examining Table 11, Q4 had the smallest standard deviation in sense count (0.721), approximately one-third of Q3’s value (2.210), which was the largest. Compared to Q2 (1.290), which also has low polysemy, there is a difference of approximately 0.6; this is considered one factor in Q4 showing the highest accuracy in the baseline model. Q4 had the smallest variance in sense count, with most words having 2–3 senses. This homogeneity, together with the limited number of choices, made prediction easier even for the baseline model, supporting the earlier discussion that low-frequency words were easier to predict in the untrained state.

4.3. Data Balance

We discuss the findings for each evaluation metric. For the baseline model, Precision was higher than Accuracy and Recall across all quadrants Q1 through Q4. This indicates that while the answers the model did select often matched the reference answers, a high proportion of evaluation samples was missed. In other words, the model tends to favor certain easier-to-predict senses, suggesting an imbalance in its prediction distribution. This supports the earlier discussion of why low-frequency words showed higher accuracy than high-frequency words in the baseline model. For the fine-tuned model, the four metrics are nearly identical, indicating that the model predicts almost evenly across all word senses. In fact, we observed substantial accuracy improvements of approximately 45% to 61% over the baseline model across all quadrants. This demonstrates that the dataset used in this study is balanced and of high quality, contributing to the improvement in prediction balance. Our finding that improved data balance significantly contributed to accuracy enhancement is reinforced by [31], who demonstrated that imbalanced training examples can cause incorrect sense predictions, particularly in non-English languages. This emphasizes the importance of training with balanced data.
Furthermore, while the number of words in each quadrant is roughly equal, the number of example sentences associated with each word varies significantly. Therefore, to investigate whether this imbalance affects model performance, we conducted additional experiments applying a word-level weighted loss function. Specifically, we followed the same experimental procedure as the fine-tuning described in Section 2.3.2, applying commonly used “inverse frequency weighting” when calculating the cross-entropy loss at each step in fine-tuning. For each target word, we calculated weights based on the inverse of the frequency using the following formula:
w_i = N / (K × n_i)
Here, N is the total frequency of the target quadrant, K is the number of target words in the quadrant, and n_i is the frequency of word i. For example, if a word in Q1 has a frequency of 1000, then N is the total frequency of Q1, K is 61, and n_i is 1000. Through this weighting, words with fewer usage examples (lower frequency) receive larger weights on their cross-entropy loss terms, so that all words are learned more equally.
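The weight computation can be sketched as follows (an illustrative stdlib snippet with names of our own choosing; the step that multiplies each example's cross-entropy term by its word's weight inside the training loop is omitted).

```python
def inverse_frequency_weights(freqs):
    """freqs: dict mapping word -> n_i (number of training sentences).
    Returns word -> w_i = N / (K * n_i), where N is the total frequency of
    the quadrant and K the number of words, matching the formula above."""
    N = sum(freqs.values())
    K = len(freqs)
    return {w: N / (K * n) for w, n in freqs.items()}
```

A useful sanity check on this scheme: the weighted sentence counts sum back to N (since n_i × w_i = N/K for every word), so the weighting rebalances words without changing the overall loss scale.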
Table 12 shows the performance on the test set for each quadrant of the fine-tuned model with the weighted loss function applied. Comparing these results with the original fine-tuned model shown in Table 6, the differences were mostly small, with accuracy changes in each quadrant ranging from −0.57% to +0.40%. Q3 showed a slight improvement of +0.40%, while other quadrants showed slight decreases.
These results indicate that word-level frequency imbalance within each quadrant had little impact on model performance. Therefore, this suggests that the accuracy differences observed between quadrants are largely due to the effects of frequency and polysemy, rather than the bias in frequency distribution among words.

4.4. Comparative Verification Between BERT-Oriented Models

To examine whether the semantic prediction patterns observed in Section 3 are specific to DeBERTa V3 or generalizable across PLMs with different architectures, we compared and analyzed the results with BERT as presented in Section 3.3. The fine-tuned DeBERTa V3 model outperformed BERT across all quadrants, with accuracy improvements ranging from 0.02% to 1.11%. Notably, the largest improvement was observed in Q1, where DeBERTa V3 achieved 83.43% accuracy compared to BERT’s 82.32%. This suggests that the Disentangled Attention Mechanism of DeBERTa V3, which encodes content and positional information separately, may be particularly effective for WSD of highly polysemous words, where contextual understanding is crucial. However, it should be noted that the performance differences between the two models were relatively small across all quadrants.
Furthermore, both models exhibited identical patterns regarding the effects of frequency and polysemy. Specifically, the fine-tuned models achieved the highest accuracy in Q2 and the lowest in Q3, while the baseline models achieved the highest accuracy in Q4. These patterns were observed in both models. Although BERT and DeBERTa V3 have different architectures, the consistent observation of these patterns across both models suggests that the influence of frequency and polysemy on the katakana WSD task is a task-inherent characteristic rather than a model-specific phenomenon.
Interestingly, in the baseline models, BERT showed approximately 1.6–2.7% higher accuracy than DeBERTa V3 in the high-polysemy quadrants Q1 and Q3. This result is likely attributable in part to differences in the models’ pre-training methods. BERT employs MLM (Masked Language Modeling), which masks portions of the input text and trains the model to infer the masked words from context. Through this process, the model learns to deeply understand overall sentence structure and the relationships between words, which is highly compatible with WSD tasks that require determining which word meaning is appropriate in a given context. In contrast, DeBERTa V3 employs RTD (Replaced Token Detection), which replaces target tokens with alternatives and performs binary classification to determine whether each token is original or replaced. While this method teaches the model to judge overall sentence coherence, it does not directly predict what meaning a word should have at a given position, so its objective is less directly aligned with WSD. This difference in pre-training methods may explain why the DeBERTa V3 baseline had not acquired representations as directly applicable to context-dependent semantic classification of polysemous words as BERT.
However, when comparing the fine-tuned models, DeBERTa V3 outperformed BERT across all quadrants and all metrics. This phenomenon suggests that even though RTD-based pre-training does not have direct compatibility with WSD tasks, the separate representation of context and positional information through the Disentangled Attention Mechanism can effectively adapt through fine-tuning on task-specific data. In other words, for katakana WSD tasks, fine-tuning may become more efficient through the Disentangled Attention Mechanism.
In addition to the above comparison between BERT-oriented models, we discuss the relationship with existing Japanese WSD methods in a broader context. XL-WSD [32] is a cross-lingual WSD evaluation framework covering 18 languages. It provides a unified sense inventory based on BabelNet [33], a large-scale multilingual lexical knowledge base. In this benchmark, XLM-R Large achieved the highest score for Japanese with F1 = 61.87. However, the sense inventory of BabelNet is built upon WordNet [34], which differs from the Digital Daijisen-based sense definitions used in this study. Furthermore, XL-WSD targets general vocabulary including nouns, verbs, and adjectives, whereas this study focuses exclusively on katakana words. For these reasons, direct accuracy comparison is not feasible. Among related work, [17] quantified semantic change in loanwords using distributed representations. Ref. [35] analyzed the cross-lingual characteristics of the XL-WSD dataset for Japanese. However, neither study systematically addressed katakana polysemy as a WSD task. To the best of our knowledge, no prior study has specifically targeted katakana word WSD. This study is therefore positioned as the first systematic benchmark for katakana WSD.

4.5. Sensitivity Analysis of Annotation Noise

In Section 2.2, although approximately 19.70% noise is present, we created the annotation data and conducted fine-tuning as is, based on the practical considerations of automatic annotation and prior findings that noise does not necessarily hinder learning. However, the human-annotated samples used for model selection comprise only 1000 instances, which is extremely limited compared to the approximately 800,000 example sentences. The question of how sensitive the evaluation is to potential errors in these human-annotated labels therefore remains.
In this section, we quantitatively examine, through sensitivity analysis, how potential noise contained in the reference labels affects the evaluation metrics. Specifically, for the 1000 human-annotated samples, we simulated annotation noise by randomly altering a portion of the assigned labels. Labels were selected at rates of 0%, 5%, 10%, 15%, and 20% out of the 1000 samples, and the selected labels were converted to different labels to create pseudo error-prone labels, under which model performance was evaluated. For each noise level, 100 trials were conducted, and the mean accuracy was recorded. In addition to the gpt-4.1-mini model selected in Section 2.2, the same experiments were conducted for other comparison models. The results are shown in Table 13.
As shown in Table 13, model accuracy decreased almost linearly as the noise level increased. Even a small amount of noise (5–10%) led to a substantial drop in accuracy (approximately 4–8%), comparable to or larger than the performance differences among the models. This indicates that the evaluation metrics are highly sensitive to the quality of the evaluation labels. However, the relative performance differences among the models remained stable: at every noise level, the ranking of model accuracy was preserved, indicating that model selection is robust to potential annotation errors even when the reference labels are imperfect. These findings suggest that while automatically generated annotation labels can greatly influence the absolute values of evaluation metrics, under the conditions of this study a certain degree of robustness is maintained with respect to relative comparisons among models.
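The label-perturbation procedure used in this sensitivity analysis can be sketched as follows. This is our own minimal reconstruction with illustrative function names; it perturbs a fraction of the reference labels and averages accuracy over repeated trials, as described above.

```python
import random

def perturb_labels(labels, label_set, noise_rate, rng):
    """Randomly replace noise_rate of the labels with a different label
    drawn uniformly from label_set (simulated annotation noise)."""
    labels = list(labels)
    n_noisy = int(len(labels) * noise_rate)
    for i in rng.sample(range(len(labels)), n_noisy):
        labels[i] = rng.choice([l for l in label_set if l != labels[i]])
    return labels

def accuracy_under_noise(preds, gold, label_set, noise_rate, trials=100, seed=0):
    """Mean accuracy of fixed predictions against repeatedly perturbed
    reference labels (100 trials per noise level in the paper)."""
    rng = random.Random(seed)
    accs = []
    for _ in range(trials):
        noisy = perturb_labels(gold, label_set, noise_rate, rng)
        accs.append(sum(p == g for p, g in zip(preds, noisy)) / len(gold))
    return sum(accs) / len(accs)
```

Evaluating each candidate model's predictions under the same perturbed labels makes the model ranking at each noise level directly comparable, as in Table 13.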
In addition to the evaluation metrics discussed above, noise in the annotation data may also affect entropy values. This is because entropy is calculated from the probability distribution p(i) derived from automatically annotated data. To examine this effect, we conducted a sensitivity analysis on the annotation labels of the 230 target words. For each word, we randomly altered a portion of the assigned labels at rates of 5%, 10%, 15%, and 20% and recalculated the entropy. We then measured the Spearman rank correlation coefficient [36] between the original and perturbed entropy rankings. This coefficient evaluates whether the relative ordering of words by entropy is preserved regardless of changes in absolute values. For each noise level, 100 trials were conducted and the mean values were recorded. The results are shown in Table 14.
As shown in Table 14, the Spearman rank correlation remained relatively high. At 5% noise, r = 0.988, indicating that the relative entropy ranking is highly stable. Even at 20%, which is close to the estimated noise level of 19.7% in this study, r = 0.8705. This indicates that the overall ranking structure is preserved. Combined with the model ranking robustness demonstrated in Table 13, these results suggest that although annotation noise affects the absolute values of entropy, entropy remains a reliable relative ranking indicator.
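The entropy recalculation at the heart of this analysis can be sketched as follows. The logarithm base is our assumption (base 2 is consistent with the reported median of roughly 0.897 for words with mostly 2–3 senses), and the correlation step could be computed with scipy.stats.spearmanr.

```python
import math
from collections import Counter

def sense_entropy(labels):
    """Shannon entropy of the sense distribution p(i) estimated from the
    annotated labels of one word. Base 2 is an assumption; the paper does
    not state the base explicitly."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Perturbing a fraction of each word's labels (as in the 5–20% conditions above), recomputing this value per word, and comparing the original and perturbed rankings with a Spearman rank correlation yields the values reported in Table 14.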
The preceding analyses examined the impact of noise on evaluation labels. Next, to examine the impact of noise contained in the training data on model learning, we conducted a noise injection experiment on the training data. Specifically, we randomly injected additional noise into the labels of the training data created in Section 2.3.2 for each quadrant. For each sample, at specified rates (+5%, +10%, +15%, +20%), the label was uniformly randomly replaced with a different valid sense label for that word. The labels of the development and test data were kept unchanged. Under each condition, DeBERTa V3 was fine-tuned with the same hyperparameters as described in Section 2.3.2, and accuracy was measured on the original test data. To confirm the trend of noise impact, this experiment was conducted with three random seeds (42, 43, 44), which is sufficient given the small standard deviations observed in the main experiments (Table 6). The mean and standard deviation were recorded. The results are shown in Table 15.
As shown in Table 15, accuracy generally decreased as noise increased across all quadrants. However, the magnitude of degradation was gradual; even with +20% additional noise, the maximum accuracy decrease was 0.0279 (Q3). To analyze the effect of frequency and polysemy on noise robustness, we compared the accuracy decrease between 0% and +20% noise for each quadrant. The high-frequency quadrants (Q1: 0.8323 → 0.8151, −0.0172; Q2: 0.9388 → 0.9255, −0.0133) exhibited smaller accuracy decreases compared to the low-frequency quadrants (Q3: 0.8155 → 0.7876, −0.0279; Q4: 0.9126 → 0.8883, −0.0243), demonstrating greater robustness to noise, which suggests that larger amounts of training data may mitigate the impact of noise. Furthermore, the low-polysemy quadrants (Q2: −0.0133, Q4: −0.0243) showed smaller accuracy decreases compared to the high-polysemy quadrants (Q1: −0.0172, Q3: −0.0279), indicating greater robustness to noise. Additionally, the accuracy ranking among quadrants (Q2 > Q4 > Q1 > Q3) was consistently preserved across all noise conditions.
Combined with the sensitivity analyses on evaluation labels presented in Table 13 and Table 14, these results confirm from both the training and evaluation perspectives that the noise contained in this study’s data may affect the absolute values of evaluation metrics, but does not substantially impair model learning, and the validity of relative comparisons among quadrants and models is maintained.

4.6. Impact Analysis of Extremely High-Frequency Words

As shown in Table 10, the frequency distribution of high-frequency word groups (Q1, Q2) exhibits large standard deviations, with some words exceeding 10,000 occurrences. These extremely high-frequency words may disproportionately influence the overall results. To investigate this possibility, we conducted additional experiments excluding the top 10 highest-frequency words. The excluded 10 words had frequencies ranging from 5707 to 12,791. Compared to the overall 230-word dataset statistics reported in Section 2.3.1 (mean: 1755.73, median: 1057.50), these words exhibited frequencies 5.4 to 12.1 times the median value and 3.3 to 7.3 times the mean value, representing notable deviations from the central tendency of the distribution. We excluded these 10 words from the 230-word dataset and performed fine-tuning under identical conditions using the remaining 220 words (Q1: 57 words, Q2: 48 words). This exclusion resulted in a substantial reduction of training data: 23.9% for Q1 (from 166,139 to 126,511 sentences) and 30.9% for Q2 (from 153,566 to 106,108 sentences). Table 16 presents the results.
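The exclusion criterion above can be illustrated with a short sketch; the word-to-frequency mapping and the returned tuple layout are assumptions for illustration, not the paper's actual data structures.

```python
import statistics

def extreme_frequency_words(freqs, top_k=10):
    """Rank words by corpus frequency and report the top_k words
    together with their deviation from the distribution's central
    tendency, as multiples of the median and mean frequency."""
    mean = statistics.mean(freqs.values())
    median = statistics.median(freqs.values())
    top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # (word, frequency, frequency/median, frequency/mean)
    return [(w, f, round(f / median, 1), round(f / mean, 1)) for w, f in top]
```

Applied to the 230-word frequency list, this would flag the 10 words with frequencies 5707–12,791 (5.4–12.1× the median, 3.3–7.3× the mean) reported above.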
Comparing these results with those for the original 230-word dataset in Table 6, excluding the extremely high-frequency words decreased accuracy by 0.80% for Q1 and 0.27% for Q2. Neither difference is substantial, indicating that extremely high-frequency words do not disproportionately influence the overall results. Despite a considerable reduction in training data of approximately 24–31%, the accuracy decrease remained minimal, demonstrating that the model maintains robust performance even without these words. If anything, the slight decrease in accuracy upon excluding them suggests that they contribute positively to learning by providing sufficient training samples for acquiring appropriate semantic representations. These findings indicate that extremely high-frequency words introduce neither uniquely biased semantic interpretations nor undue diversity into the language model, and exert little disproportionate influence on the learning results. They also support the view that the observed effects of frequency and polysemy on accuracy are general trends across the entire dataset, rather than phenomena driven by a small number of extremely high-frequency words.

4.7. Impact of Semantic Change on WSD Performance

This study suggested that the semantic divergence of Wasei-Eigo from their original English meanings may influence performance on the katakana-word WSD task. In this subsection, we select several representative Wasei-Eigo from the 230 target words and analyze each case in detail. As discussed in Section 1.2.2, through the processes of semantic “narrowing”, “broadening”, and “shift”, Wasei-Eigo acquire usages that diverge from their original English meanings to a degree unusual even among katakana words. Accordingly, we select one representative word for each type of semantic change and examine its impact on WSD performance.
First, we examine semantic “broadening”. Word-level analysis shows that the accuracy for top is 45.62%, which is 37.81 percentage points below the average of its quadrant Q1 (83.43%). In English, “top” primarily refers to the uppermost part or surface of an object and its spatial position. In Japanese, however, the interpretation of “position” has been extended to include meanings such as a leadership status or the highest rank in a hierarchy. Furthermore, additional senses clearly separated from the original meaning, such as upper-body clothing (toppusu), have emerged, resulting in a substantial expansion of the semantic range. This expansion increases the number of possible senses, which in turn leads to markedly lower accuracy than the quadrant average.
Next, we examine semantic “narrowing”. The word tension achieves an accuracy of 95.33%, exceeding the average of its quadrant Q4 (91.58%) by 3.75 percentage points. In English, “tension” denotes mental or physical strain. In Japanese, however, its usage has been narrowed to express emotional excitement or high energy, as in expressions such as high tenshon. This restriction of meaning reduces the number of possible senses, making classification easier and thereby improving WSD performance.
Finally, we examine semantic “shift”. The word muffler shows an accuracy of 96.25%, which is 15.11 percentage points above the average of its quadrant Q3 (81.14%). In English, “muffler” mainly refers to an automobile exhaust silencer, whereas in Japanese it primarily denotes a scarf worn around the neck. Because the original sense has been replaced by a different meaning, the semantic boundaries in Japanese contexts are clear, which likely contributes to the high classification accuracy.
These findings demonstrate a strong relationship between types of semantic change and WSD performance. Words that undergo semantic “broadening” during the borrowing process tend to acquire additional senses (e.g., “upper-body clothing” in the case of “top”), thereby increasing polysemy and consequently lowering classification accuracy. In contrast, semantic “narrowing” constrains the semantic range of loanwords, reduces contextual variability, and can improve WSD performance. Semantic “shift”, which results in the acquisition of meanings unique to Japanese, may also contribute to improved performance in multilingual WSD settings.

4.8. External Validity of the Filtered Vocabulary

The filtering process described in Section 2.1 substantially narrowed the target vocabulary from 180,664 to 1639 words. We analyzed the characteristics of the words excluded at each stage to examine how this filtering may affect the generalizability of our findings.
At Step 2 in Table 1, morphological analysis using MeCab with the IPADIC dictionary reduced the vocabulary to 12,161 words by retaining only words registered in the dictionary. At Step 3 in Table 1, the exclusion of unregistered neologisms, proper nouns, interjections, and words with only one example sentence further reduced the count to 6925 words. Analysis of the 5236 words excluded at this step revealed that proper nouns accounted for the majority (4042 words, 77.2%). Proper nouns denote specific entities and lack contextual sense ambiguity, making them inherently unsuitable for WSD tasks. The next largest categories were sa-hen connection nouns (367 words, 7.0%) and adjective stems (143 words, 2.7%), which were excluded because their part-of-speech subcategories in the IPADIC dictionary did not match the “general” classification criterion, though some of these words may possess polysemy. Adverbs (152 words, 2.9%) and interjections (35 words, 0.7%) were excluded based on their respective part-of-speech characteristics. The remaining 481 words (approximately 9.2%) had only one example sentence, providing insufficient data for training and evaluation.
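The Step-3 part-of-speech filtering can be approximated as below. The helper operates on an IPADIC-style comma-separated feature string (POS, POS subcategory, …); the function name and exact field handling are an illustrative reconstruction, not the paper's actual script.

```python
def keep_for_wsd(surface, feature):
    """Decide whether a MeCab/IPADIC analysis keeps a katakana word as a
    WSD candidate, mirroring the Step-3 criteria: retain general nouns
    and drop proper nouns, sa-hen connection nouns, adjective stems,
    adverbs, and interjections."""
    fields = feature.split(",")
    pos = fields[0]
    sub = fields[1] if len(fields) > 1 else "*"
    if pos in ("副詞", "感動詞"):       # adverbs, interjections
        return False
    if pos == "名詞" and sub == "一般":  # general nouns only
        return True
    # 固有名詞 (proper noun), サ変接続 (sa-hen), 形容動詞語幹 (adj. stem), etc.
    return False
```

In practice, the feature string would come from a MeCab tagger configured with the IPADIC dictionary; words passing this check would then still need two or more Digital Daijisen senses (Step 4) to remain in the vocabulary.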
At Step 4 in Table 1, 5286 words were excluded based on the Digital Daijisen dictionary criterion of having two or more defined senses. The excluded words were almost entirely common nouns (97.9%) that were monosemous in the dictionary, meaning they present no ambiguity to be resolved in WSD.
The above analysis indicates that the majority of words excluded through the filtering process were proper nouns or monosemous words that fall outside the scope of WSD tasks, and no systematic part-of-speech bias was introduced. However, it should be noted that some words potentially possessing polysemy, such as sa-hen connection nouns and adjective stems, were excluded based on the part-of-speech classification criterion. Considering these results, the findings of this study are considered generalizable within the scope of established polysemous words with multiple dictionary-defined senses, and this premise is expected to hold similarly when applying the methodology to other languages or corpora.

5. Conclusions

In this study, we evaluated the performance of semantic classification tasks for katakana words by fine-tuning DeBERTa V3, using 230 katakana words ultimately extracted from the BCCWJ and 403,819 sentences containing these target words, which were annotated using the OpenAI API (gpt-4o-mini, gpt-4o, gpt-4.1-nano, gpt-4.1-mini, gpt-4.1). The main findings can be summarized in the following five points.
The first concerns the effectiveness of fine-tuning. Compared to the baseline model, the fine-tuned model achieved an average accuracy improvement of approximately 53% across all quadrants. We concluded that high frequency enabled sufficient learning during fine-tuning, resulting in larger accuracy improvements; indeed, the high-frequency quadrants showed approximately 9% greater improvement. The same pattern was observed with BERT, indicating that the effectiveness of fine-tuning for katakana WSD generalizes across BERT-based models.
The second concerns the impact of polysemy. In the fine-tuned model, only an approximately 2% difference was observed between the low-frequency and high-frequency quadrants, whereas low-polysemy words showed approximately 8–12% higher accuracy than high-polysemy words. Bootstrap statistical analysis confirmed significant differences in all pairwise comparisons between quadrants. Taking these quadrant-level analyses into account, we concluded that polysemy has a greater impact on accuracy than frequency.
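The bootstrap comparison underlying this finding can be sketched as follows, assuming per-sample 0/1 correctness indicators for two quadrants; this is a generic percentile-bootstrap implementation, not the paper's exact analysis code.

```python
import random

def bootstrap_diff_ci(acc_a, acc_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the difference in
    mean accuracy between two groups of 0/1 correctness indicators."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # resample each group with replacement, then take the mean difference
        ra = [rng.choice(acc_a) for _ in acc_a]
        rb = [rng.choice(acc_b) for _ in acc_b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A pairwise difference is judged significant when the resulting interval excludes zero, as in the Q1–Q2 and Q3–Q4 comparisons of Table 7.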
The third concerns the reversal of the frequency effect. While low-frequency words achieved higher accuracy in the baseline model, high-frequency words achieved higher accuracy after fine-tuning. We attribute this to the fact that, without task-specific training, high-frequency words appear in diverse contexts, which increases sense ambiguity; by the second finding, this higher effective polysemy depresses accuracy. The improvement after fine-tuning indicates that the abundant samples of high-frequency words were effectively utilized, turning contextual diversity into an advantage. From this, we concluded that additional training such as fine-tuning is highly effective for accurate semantic classification of high-frequency words in katakana WSD tasks.
The fourth concerns the impact of semantic change in Wasei-Eigo on WSD performance. Word-level analysis revealed that semantic “broadening” during the borrowing process increases polysemy and lowers classification accuracy, whereas “narrowing” and “shift” constrain the semantic range and contribute to higher accuracy. This indicates that the type of semantic change Wasei-Eigo undergoes significantly influences WSD performance.
The fifth concerns the robustness to annotation noise. Sensitivity analyses confirmed that although annotation noise affects the absolute values of evaluation metrics, the relative performance rankings among quadrants and models are consistently preserved. Furthermore, high-frequency and low-polysemy words exhibited greater robustness to noise in training data, reinforcing the reliability of the above findings under the automatic annotation conditions employed in this study.
Future challenges include highly reliable comparative validation through human annotation, identification and analysis of accuracy factors beyond frequency and polysemy, and validation on more extensive data. Regarding scalability, the methodology combining LLM-based annotation with fine-tuning is applicable to other languages given appropriate multilingual models. While the API costs remain feasible for larger corpora, the balance between cost and accuracy requires consideration when scaling. The forthcoming BCCWJ2 will also enable validation with a larger Japanese corpus. In summary, this study demonstrates the importance of fine-tuning for Japanese katakana WSD tasks; quantitatively reveals the impact of frequency and polysemy on accuracy, showing that polysemy contributes more than frequency; highlights the influence of semantic change processes in Wasei-Eigo on classification performance; and confirms the robustness of these findings under automatic annotation conditions.
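For scaling estimates, the per-1M-token prices reported in Table 2 allow a rough cost projection for annotating a larger corpus; the per-sentence token counts in the example are placeholders that must be measured for the target corpus and prompt.

```python
# Per-1M-token prices ($) from Table 2: (input, output)
PRICES = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4o":       (2.50, 10.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4o-mini":  (0.15, 0.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def annotation_cost(model, n_sentences, in_tokens_per_sentence, out_tokens_per_sentence):
    """Rough annotation-cost estimate ($) for labeling n_sentences,
    given average prompt and completion lengths in tokens."""
    price_in, price_out = PRICES[model]
    total_in = n_sentences * in_tokens_per_sentence
    total_out = n_sentences * out_tokens_per_sentence
    return (total_in * price_in + total_out * price_out) / 1_000_000
```

For example, annotating one million sentences with gpt-4.1-mini at 500 prompt tokens and 10 completion tokens per sentence would cost on the order of a few hundred dollars, illustrating the cost–accuracy trade-off discussed above.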

Author Contributions

Conceptualization, K.K. and M.S.; methodology, K.K. and M.S.; validation, K.K.; formal analysis, K.K.; investigation, K.K.; writing—original draft preparation, K.K.; writing—review and editing, K.K. and M.S.; visualization, K.K.; supervision, M.S.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS grant numbers 22K12161 and 25K15242.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data used to create the dataset in this study is the Balanced Corpus of Contemporary Written Japanese (BCCWJ) provided by the National Institute for Japanese Language and Linguistics (NINJAL). According to the license agreement for the “Balanced Corpus of Contemporary Written Japanese,” the inclusion of all or part of the corpus content, or any information that could enable its reconstruction, in published research outcomes is prohibited. Therefore, we refrain from sharing target words, example sentences, and other information from our dataset that could potentially enable the reconstruction of BCCWJ content.

Acknowledgments

During the preparation of this study, the authors used Claude Code (claude-sonnet-4-5-20250929) and the OpenAI API (gpt-4.1-mini) for the preliminary creation of scripts for extracting the necessary data from BCCWJ, the preliminary creation of scripts for analyzing the obtained data, the collection of predicted values for annotation data creation, and translation tasks during manuscript preparation. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BCCWJ       Balanced Corpus of Contemporary Written Japanese
WSD         Word Sense Disambiguation
NLP         Natural Language Processing
DeBERTa V3  Decoding-enhanced BERT with Disentangled Attention Version 3
NINJAL      National Institute for Japanese Language and Linguistics
BERT        Bidirectional Encoder Representations from Transformers
LLM         Large Language Model
PLM         Pre-trained Language Model
MLM         Masked Language Modeling
RTD         Replaced Token Detection

Figure 1. Example of the prompt used for automatic word sense annotation with GPT-4.1-mini.
Figure 2. JSONL data format for fine-tuning. Each line contains the prompt text, model-generated reference label, target word, original sentence, and label text.
Figure 3. Training and validation loss curves for all quadrants (Q1–Q4) over 3 epochs. Blue lines represent training loss (recorded every 100 steps), and orange lines represent validation loss (evaluated at the end of each epoch).
Table 1. Summary of the extraction process from BCCWJ.
Step  Filtering Criterion                      Words    Sentences
1     Extract SUW with wType = "foreign word"  180,664  1,566,913
2     Filtering by MeCab (IPADIC dictionary)   12,161
3     Filtering unsuitable words for WSD       6925
4     Filtering by Digital Daijisen            1639     801,899
Table 2. Agreement rates between human annotation and GPT model predictions on 1000 samples.
Model         Concordance  Input ($/1M) 1  Output ($/1M) 2
gpt-4.1       84.50%       2.00            8.00
gpt-4o        82.50%       2.50            10.00
gpt-4.1-mini  80.30%       0.40            1.60
gpt-4o-mini   78.50%       0.15            0.60
gpt-4.1-nano  62.20%       0.10            0.40
1,2 The cost ($) per 1 million tokens.
Table 3. Quadrant-level mean accuracy under different threshold choices. Numbers in parentheses indicate the word count per quadrant.
Threshold    Q1 (HF, HP)  Q2 (HF, LP)  Q3 (LF, HP)  Q4 (LF, LP)  Ratio 1
Median       0.8357 (61)  0.9365 (54)  0.8113 (54)  0.9143 (61)  1.13
Arith. mean  0.8217 (32)  0.9264 (36)  0.8070 (63)  0.9156 (99)  3.09
Geom. mean   0.8370 (59)  0.9457 (38)  0.8254 (76)  0.9313 (57)  2.00
1 Maximum-to-minimum ratio of word counts across the four quadrants; values closer to 1.0 indicate more balanced quadrant sizes.
Table 4. Distribution of 230 target words by frequency and polysemy. The four quadrants are assigned Q1, Q2, Q3, and Q4.
                High-Polysemy          Low-Polysemy
High-frequency  [Q1] 61 words (26.5%)  [Q2] 54 words (23.5%)
Low-frequency   [Q3] 54 words (23.5%)  [Q4] 61 words (26.5%)
Table 5. Baseline model performance on test set for each quadrant (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.2806 ± 0.0150  0.3157 ± 0.0301  0.2806 ± 0.0150  0.2881 ± 0.0204
Q2 (HF, LP)  0.3326 ± 0.0322  0.3723 ± 0.0388  0.3326 ± 0.0322  0.3470 ± 0.0353
Q3 (LF, HP)  0.2901 ± 0.0190  0.2962 ± 0.0234  0.2901 ± 0.0190  0.2895 ± 0.0203
Q4 (LF, LP)  0.4624 ± 0.0455  0.4896 ± 0.0389  0.4624 ± 0.0455  0.4666 ± 0.0443
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Table 6. Fine-tuned model performance on test set for each quadrant (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.8343 ± 0.0035  0.8334 ± 0.0037  0.8343 ± 0.0035  0.8336 ± 0.0036
Q2 (HF, LP)  0.9393 ± 0.0013  0.9394 ± 0.0014  0.9393 ± 0.0013  0.9393 ± 0.0013
Q3 (LF, HP)  0.8114 ± 0.0098  0.8206 ± 0.0124  0.8114 ± 0.0098  0.8128 ± 0.0114
Q4 (LF, LP)  0.9158 ± 0.0048  0.9162 ± 0.0047  0.9158 ± 0.0048  0.9159 ± 0.0048
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Table 7. Bootstrap analysis results for pairwise comparisons between quadrants (10,000 resamples).
Comparison                      Mean Difference 1  95% CI 2
Q1–Q2 (Polysemy effect at HF)   −10.50%            [−10.83, −10.19]
Q3–Q4 (Polysemy effect at LF)   −10.44%            [−11.41, −9.51]
Q1–Q3 (Frequency effect at HP)  +2.29%             [+1.40, +3.20]
Q2–Q4 (Frequency effect at LP)  +2.35%             [+1.88, +2.74]
1 Pairwise mean accuracy differences between quadrants. 2 95% bootstrap percentile confidence interval for the difference.
Table 8. BERT baseline model performance on test set for each quadrant (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.3073 ± 0.0160  0.3452 ± 0.0127  0.3073 ± 0.0160  0.3130 ± 0.0155
Q2 (HF, LP)  0.3775 ± 0.0239  0.4261 ± 0.0577  0.3775 ± 0.0239  0.3923 ± 0.0314
Q3 (LF, HP)  0.3060 ± 0.0174  0.3086 ± 0.0125  0.3060 ± 0.0174  0.3034 ± 0.0144
Q4 (LF, LP)  0.4579 ± 0.0380  0.4729 ± 0.0309  0.4579 ± 0.0380  0.4594 ± 0.0359
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Table 9. BERT fine-tuned model performance on test set for each quadrant (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.8232 ± 0.0037  0.8223 ± 0.0042  0.8232 ± 0.0037  0.8224 ± 0.0038
Q2 (HF, LP)  0.9356 ± 0.0019  0.9356 ± 0.0019  0.9356 ± 0.0019  0.9356 ± 0.0019
Q3 (LF, HP)  0.8112 ± 0.0057  0.8138 ± 0.0043  0.8112 ± 0.0057  0.8104 ± 0.0049
Q4 (LF, LP)  0.9088 ± 0.0053  0.9090 ± 0.0051  0.9088 ± 0.0053  0.9088 ± 0.0052
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Table 10. Statistical summary of frequency and entropy for each quadrant.
    F. Mean 1  F. Median  F. Std  E. Mean 2  E. Median  E. Std
Q1  2726.11    1920       2232.3  1.3647     1.3115     0.4062
Q2  2846.5     1940       2227.9  0.5092     0.4920     0.2437
Q3  745.78     772        160.59  1.4711     1.4977     0.4534
Q4  713.85     684        157.23  0.5650     0.5806     0.2219
1 Frequency, 2 Entropy
Table 11. Statistical summary of sense count for each quadrant.
    Words  Mean   Median  Std
Q1  61     4.361  4       1.932
Q2  54     2.815  2       1.290
Q3  54     4.278  4       2.210
Q4  61     2.525  2       0.721
Table 12. Word-weighted fine-tuned model performance on test set for each quadrant (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.8286 ± 0.0017  0.8269 ± 0.0017  0.8286 ± 0.0017  0.8276 ± 0.0017
Q2 (HF, LP)  0.9384 ± 0.0015  0.9385 ± 0.0012  0.9384 ± 0.0015  0.9383 ± 0.0015
Q3 (LF, HP)  0.8154 ± 0.0043  0.8155 ± 0.0061  0.8154 ± 0.0043  0.8136 ± 0.0049
Q4 (LF, LP)  0.9132 ± 0.0027  0.9139 ± 0.0027  0.9132 ± 0.0027  0.9134 ± 0.0027
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Table 13. GPT-based model accuracy mean under different levels of annotation noise in 100 trials.
Noise  4.1     4o      4.1-Mini  4o-Mini  4.1-Nano
0%     0.8450  0.8250  0.8030    0.7850   0.6220
5%     0.8031  0.7838  0.7630    0.7455   0.5912
10%    0.7604  0.7426  0.7227    0.7066   0.5600
15%    0.7185  0.7014  0.6830    0.6677   0.5296
20%    0.6762  0.6605  0.6424    0.6285   0.4979
Table 14. Spearman rank correlation of entropy rankings under annotation noise (mean over 100 trials).
Noise  Spearman r
0%     1.0000
5%     0.9877
10%    0.9581
15%    0.9110
20%    0.8705
Table 15. Fine-tuned model accuracy under different levels of training data noise injection per quadrant (mean ± std over 3 seeds).
Added Noise  Q1 (HF, HP)      Q2 (HF, LP)      Q3 (LF, HP)      Q4 (LF, LP)
0%           0.8323 ± 0.0034  0.9388 ± 0.0017  0.8155 ± 0.0049  0.9126 ± 0.0028
+5%          0.8285 ± 0.0021  0.9364 ± 0.0015  0.8160 ± 0.0070  0.9088 ± 0.0038
+10%         0.8244 ± 0.0014  0.9330 ± 0.0015  0.8020 ± 0.0018  0.9064 ± 0.0033
+15%         0.8191 ± 0.0051  0.9330 ± 0.0009  0.7835 ± 0.0056  0.8972 ± 0.0049
+20%         0.8151 ± 0.0027  0.9255 ± 0.0007  0.7876 ± 0.0049  0.8883 ± 0.0061
Table 16. Fine-tuned model performance after excluding extremely high-frequency words (mean ± std over 5 seeds).
Quadrant     Accuracy         Precision        Recall           F1
Q1 (HF, HP)  0.8263 ± 0.0030  0.8269 ± 0.0030  0.8263 ± 0.0030  0.8260 ± 0.0030
Q2 (HF, LP)  0.9366 ± 0.0033  0.9363 ± 0.0033  0.9366 ± 0.0033  0.9363 ± 0.0032
HF: High Frequency, LF: Low Frequency, HP: High Polysemy, LP: Low Polysemy
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kodaki, K.; Sasaki, M. Validating the Effectiveness of Fine-Tuning for Semantic Classification of Japanese Katakana Words: An Analysis of Frequency and Polysemy Effects on Accuracy. Big Data Cogn. Comput. 2026, 10, 67. https://doi.org/10.3390/bdcc10030067

