2.3.1. Data Characterization
Before conducting fine-tuning, we examined the annotation data (1639 words). First, we tokenized each word and extracted the words that were not split into subwords. The tokenizer is tiktoken, a BPE (Byte Pair Encoding) tokenizer provided by OpenAI, configured to use the “cl100k_base” encoding. Words that survive this filter are likely to have appeared with sufficient frequency in the data on which the model underlying the tokenizer was trained. In other words, these words are likely to be treated as generally important terms whose semantic representations can be learned efficiently as complete units, making them suitable for the WSD task in this study. For these reasons, careful selection of words using the tokenizer is crucial. As a result, 940 words and 504,688 sentences were extracted. Furthermore, for each target word, we calculated its frequency, defined as the number of sentences in the BCCWJ that contain the target word. For the initial set of 940 words extracted through tokenization, statistical analysis of these sentence counts yielded a mean of 536.90, a median of 169.00, a standard deviation of 1168.16, a minimum of 2, and a maximum of 12,791. Because the mean exceeds the median and the gap between the median and the maximum is very large, the frequency distribution is heavily skewed to the right, suggesting that it might closely follow Zipf’s law [26]. Further examination of the frequency distribution revealed that the top 24.5% of words by frequency (230 words) accounted for 80.0% of all sentences (403,819/504,688). Consequently, we determined that the data follow a Pareto distribution consistent with Zipf’s law, allowing the application of the Pareto principle. Therefore, we selected these 230 words with 403,819 sentences for fine-tuning, ensuring that each word had sufficient training examples for robust data splitting.
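The tokenizer-based filtering described at the start of this subsection can be illustrated with a short script. The following is a minimal sketch under our reading of the procedure, not the authors' exact code; the input file words.txt is a hypothetical placeholder for the annotated word list.

```python
# Minimal sketch (assumption, not the authors' exact script): keep only words
# that the cl100k_base encoding represents as a single token.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def is_single_token(word: str) -> bool:
    """True if the word is not split into subwords by cl100k_base."""
    return len(enc.encode(word)) == 1

# "words.txt" is a hypothetical file holding one annotated word per line.
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

kept = [w for w in words if is_single_token(w)]
print(f"{len(kept)} of {len(words)} words survive the single-token filter")
```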
Section 2.2, we calculated, for each word, the frequency and the entropy of its sense distribution. Frequency focuses on the number of example sentences; we introduced it because it offers an intuitive view of the occurrence tendencies and distribution of katakana words within the BCCWJ. Entropy, on the other hand, is a measure of the uncertainty or unpredictability of information. If a word sense is regarded as the information contained within a word, the number of senses represents the word’s information content, and the ambiguity of its senses can therefore be quantified by its entropy. Higher entropy indicates a more evenly distributed sense distribution and greater prediction difficulty. We therefore introduced entropy as a necessary metric for the WSD task, in which resolving such ambiguity and uncertainty is crucial. In this study, we calculated the Shannon entropy [27]:

$$H = -\sum_{i=1}^{n} p(i) \log_2 p(i)$$

We used this as the entropy value for the semantic distribution of each word. Here, p(i) represents the probability that the i-th sense (option) is selected and n is the number of options. For example, if option i is selected 100% of the time (p(i) = 1), the entropy is 0. Conversely, if every option is equally probable (each p(i) approaches 0 as the number of options increases), sense selection is equivalent to choosing completely at random, and the entropy reaches its maximum. Therefore, values closer to 0 indicate less ambiguity, while values closer to the maximum indicate greater ambiguity. Note that the maximum entropy equals $\log_2 n$, where n is the number of options: it is 1 when there are two equally probable options and exceeds 1 as the number of options increases beyond two.
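As a concrete illustration, the following minimal sketch computes this entropy from a word's annotated sense counts; the sense_counts dictionary and its labels are hypothetical.

```python
# Minimal sketch: Shannon entropy H = -sum p(i) * log2 p(i) over a word's
# annotated sense distribution (sense labels here are hypothetical).
import math

def sense_entropy(sense_counts: dict) -> float:
    total = sum(sense_counts.values())
    probs = [c / total for c in sense_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Two equally likely senses give the two-option maximum of 1 bit,
# while one dominant sense drives the entropy toward 0.
print(sense_entropy({"sense_a": 50, "sense_b": 50}))  # 1.0
print(sense_entropy({"sense_a": 99, "sense_b": 1}))   # ~0.08
```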
Statistical analysis of the frequency distribution of 230 words yielded the following values: mean 1755.73, median 1057.50, standard deviation 1889.85, minimum 493, and maximum 12,791. Calculating entropy yielded an average of 0.9768, a median of 0.8968, a standard deviation of 0.5619, a minimum of 0.0257, and a maximum of 2.5900.
The frequency distribution exhibited strong positive skewness (skewness = 3.30), and the distance from the median to the maximum value (11,733.50) was approximately 10 times the interquartile range (IQR = 1194.75), confirming that the distribution is heavily influenced by a small number of extremely high-frequency words. Here, we introduce the concept of breakdown point, which indicates the maximum proportion of outliers that an estimator can tolerate before it becomes unreliable [
28]. The median has a breakdown point of 50%, meaning it remains stable even when nearly half of the data points are outliers, which is the theoretical maximum for location estimators. In contrast, the mean has a breakdown point of 0%, as even a single extreme outlier can distort it arbitrarily. We compared three measures of central tendency as potential thresholds: the arithmetic mean (1755.73), the geometric mean (1270.41), and the median (1057.50). Using the arithmetic mean as the threshold would classify only 68 words (29.6%) as high frequency and 162 words (70.4%) as low frequency. The geometric mean, which is theoretically appropriate for right-skewed distributions following Zipf’s law, would yield 97 words (42.2%) as high frequency and 133 words (57.8%) as low frequency. Only the median provides an equal split of 115 words (50.0%) in each category. Considering these results together with the breakdown point perspective, we concluded that the median is the most appropriate measure for this study.
Furthermore, since the primary objective of this study is to compare WSD performance across quadrants, balanced group sizes ensure that the performance estimate for each quadrant is based on a comparable amount of data. Additionally, when constructing the four quadrants using both the frequency and entropy thresholds, the geometric mean resulted in highly unbalanced quadrant sizes (ranging from 38 to 76 words, a maximum-to-minimum ratio of 2.00), whereas the median produced well-balanced quadrants (ranging from 54 to 61 words, a ratio of 1.13). For these reasons, the median was adopted as the threshold.
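The comparison of candidate thresholds can be reproduced along the following lines. This is a minimal sketch in which a synthetic right-skewed sample stands in for the real per-word sentence counts, so the printed values are illustrative only.

```python
# Minimal sketch of the threshold comparison; `freqs` is a synthetic
# right-skewed stand-in for the per-word sentence counts of the 230 words.
import numpy as np

rng = np.random.default_rng(0)
freqs = rng.lognormal(mean=7.0, sigma=1.0, size=230)

candidates = {
    "arithmetic mean": float(freqs.mean()),
    "geometric mean": float(np.exp(np.log(freqs).mean())),
    "median": float(np.median(freqs)),
}

for name, threshold in candidates.items():
    high = int((freqs >= threshold).sum())
    print(f"{name:15s} threshold={threshold:9.1f} high={high:3d} low={230 - high:3d}")
```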
To verify the robustness of this choice, we examined whether the main conclusions would change if the arithmetic mean or geometric mean were used as the threshold instead of the median. We re-partitioned the 230 words into four quadrants under each threshold and recalculated the mean accuracy per quadrant. The results are shown in
Table 3.
As shown in
Table 3, the accuracy ranking across quadrants (Q2 > Q4 > Q1 > Q3) was identical under all three thresholds. Furthermore, the accuracy gap between high and low polysemy quadrants ranged from 10.1% to 10.9%, whereas the gap between high and low frequency quadrants ranged from 1.1% to 2.4%, indicating that the polysemy effect consistently exceeded the frequency effect. The slight differences in absolute accuracy values across thresholds are attributable to changes in group composition rather than threshold quality. For instance, the geometric mean yields a lower entropy threshold than the median (0.783 compared to 0.897), which classifies more words as low polysemy and is likely to raise the average accuracy. However, these differences reflect selection effects rather than genuine performance improvements. Combined with the balanced group sizes (ratio of 1.13) and the maximum breakdown point discussed above, we consider the median to be the most appropriate threshold for this study, while confirming that the main conclusions are robust to this methodological choice.
We defined high frequency (HF) and low frequency (LF) relative to our dataset using the median frequency value of 1057.50 as the threshold. It is important to note that even our “low-frequency” words have a minimum of 493 occurrences, which is relatively high from a general linguistic perspective. The term “low-frequency” here does not refer to absolute frequency, but rather indicates words that are relatively low-frequency within the dataset created for this study (230 words). Similarly, we defined high polysemy (HP) and low polysemy (LP) using the median entropy value of 0.8968 as the threshold. Classifying the 230 words based on these two axes resulted in four quadrants, as shown in
Table 4.
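For illustration, a word can be placed on the two axes with a rule such as the following minimal sketch. The treatment of values exactly equal to a threshold is an assumption, and the mapping from axis labels to quadrant numbers should follow Table 4 (the text identifies high frequency/low polysemy as Q2 and low frequency/high polysemy as Q3).

```python
# Minimal sketch of the two-axis classification; thresholds are the medians
# reported above, and ties are counted as "high" by assumption.
FREQ_THRESHOLD = 1057.50     # median frequency over the 230 words
ENTROPY_THRESHOLD = 0.8968   # median sense entropy over the 230 words

def classify(freq: float, entropy: float) -> tuple:
    """Return (frequency label, polysemy label), e.g. ("HF", "LP")."""
    freq_label = "HF" if freq >= FREQ_THRESHOLD else "LF"
    poly_label = "HP" if entropy >= ENTROPY_THRESHOLD else "LP"
    return freq_label, poly_label

print(classify(1200.0, 0.40))  # ("HF", "LP") -> Q2 in Table 4
print(classify(600.0, 1.50))   # ("LF", "HP") -> Q3 in Table 4
```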
Considering the relationships among the four quadrants shown in
Table 4, we can predict that words with high polysemy and low frequency (Q3) will have the lowest accuracy, while conversely, words with low polysemy and high frequency (Q2) will have the highest accuracy. This is because higher polysemy means that a word’s senses are captured more ambiguously, while higher frequency suggests that the word is used in many contexts and is generally an important word. To verify this hypothesis, we fine-tune the DeBERTa V3 (ku-nlp/deberta-v3-base-japanese) model for each quadrant and measure its accuracy.
2.3.2. Training Procedure
First, using the 230 words extracted in
Section 2.3.1, we created training, development, and test sets by randomly splitting the data at an 8:1:1 ratio. The split was stratified, using the example sentences for each word, the validation labels from the annotation data, and the choice labels based on the word definitions. A stratified split is a data splitting technique that maintains the proportion of each class (in this case, each word). Unlike purely random splitting, in which some words might be entirely absent from certain sets, a stratified split ensures that the data for each word are distributed across the training, development, and test sets in the specified ratio, thereby enabling fair evaluation across all target words. As shown in
Section 2.3.1, even the least frequent word among the 230 words has 493 example sentences, which is sufficient for splitting at an 8:1:1 ratio (resulting in at least approximately 394 training data points, 49 development data points, and 49 test data points per word). To compare the accuracy of the baseline model and the fine-tuned model on average, we performed the split five times. For reproducibility, we specified random seed values of 42, 43, 44, 45, and 46.
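A minimal sketch of this splitting procedure, using scikit-learn's train_test_split, is shown below; the per-example dictionaries and the small synthetic dataset are illustrative, and the authors' actual implementation may differ.

```python
# Minimal sketch of a stratified 8:1:1 split repeated over the five seeds.
from sklearn.model_selection import train_test_split

def stratified_split(examples, seed):
    labels = [ex["word"] for ex in examples]
    # First carve off 20% for development + test, preserving per-word proportions.
    train, devtest = train_test_split(
        examples, test_size=0.2, random_state=seed, stratify=labels)
    devtest_labels = [ex["word"] for ex in devtest]
    # Then split that 20% in half to obtain the 1:1 development/test portions.
    dev, test = train_test_split(
        devtest, test_size=0.5, random_state=seed, stratify=devtest_labels)
    return train, dev, test

# Synthetic stand-in: three hypothetical words with 20 sentences each.
examples = [{"word": w, "sentence": f"{w} example {i}"}
            for w in ("word_a", "word_b", "word_c") for i in range(20)]
splits = {seed: stratified_split(examples, seed) for seed in (42, 43, 44, 45, 46)}
```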
During this process, some example sentences were skipped because valid labels were not assigned during annotation. There were 148 such instances in Q2, while none existed in Q1, Q3, or Q4. Of these, 147 instances were related to the word “
pan”, where the predicted labels were “None” or values that did not correspond to any of the choices; in the one remaining case, none of the options matched the intended meaning either. For “pan”, the food-item sense was not extracted during scraping from Weblio. This occurred because the definitions of “pan” in the Digital Daijisen dictionary were diverse and each was used in highly independent contexts, so the entry failed to meet the scraping criteria. Specifically, the HTML structure of the dictionary entry for “pan” did not conform to the standard format assumed by the scraping conditions, preventing the extraction of all of its sense definitions. The remaining instance also likely failed to meet the scraping criteria. However, since the skip rate was only 0.096% (148 of the 153,713 example sentences in Q2), we considered its impact on the data analysis and accuracy negligible. We therefore excluded the 148 cases and proceeded with fine-tuning using the remaining data. We formatted the resulting training, development, and test data into JSONL files in prompt format, a format that is easy to load during training. The actual format is shown in
Figure 2.
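As a hypothetical illustration of the JSONL serialization only (the field names "prompt" and "label" are placeholders and do not reproduce the actual record layout in Figure 2):

```python
# Hypothetical sketch: write one split to a JSONL file, one record per line.
import json

def write_jsonl(records, path):
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Japanese text readable in the output file.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Placeholder record; the real prompt text and label scheme follow Figure 2.
write_jsonl([{"prompt": "文中の対象語の語義を選択肢から選んでください。…", "label": 2}],
            "train.jsonl")
```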
As shown in
Figure 2, we designed this as a word sense classification task: the prompt presents the options and asks the model to select the correct word sense for the target word within a sentence, and the model generates its response accordingly. To obtain values for each evaluation metric (accuracy, recall, precision, and F1 score), we provided model-generated reference labels as the evaluation labels. For the execution environment, we fine-tuned the ku-nlp/deberta-v3-base-japanese model using the Transformers library (version 4.30.2) and PyTorch 2.0.1. Training was performed with a learning rate of 2 × 10⁻⁵, a maximum sequence length of 256, a batch size of 12, 3 epochs, and the AdamW optimizer. Training the fine-tuned models across the four quadrants for each random seed required approximately 27.5 h.
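The training configuration can be summarized in code as follows. This is a minimal sketch under the stated hyperparameters; the sequence-classification head, the JSONL field names ("prompt", "label"), the file paths, and the number of answer options are assumptions rather than the authors' exact pipeline.

```python
# Minimal configuration sketch for one quadrant and one seed (assumptions noted above).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "ku-nlp/deberta-v3-base-japanese"
NUM_SENSE_CHOICES = 8  # hypothetical upper bound on the number of options

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_SENSE_CHOICES)

# "train.jsonl" / "dev.jsonl" are placeholder paths for the prepared splits.
data = load_dataset("json", data_files={"train": "train.jsonl",
                                        "validation": "dev.jsonl"})

def tokenize(batch):
    # Truncate prompts to the 256-token maximum sequence length used in training.
    return tokenizer(batch["prompt"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="wsd-deberta",           # one run per quadrant and random seed
    learning_rate=2e-5,
    per_device_train_batch_size=12,
    num_train_epochs=3,
    optim="adamw_torch",                # AdamW optimizer
    evaluation_strategy="epoch",
    seed=42,                            # one of the five reported seeds (42-46)
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"],
                  eval_dataset=data["validation"],
                  tokenizer=tokenizer)
trainer.train()
```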