3.2.1. Static Reduction
According to PVI theory [18], we conducted difficulty analysis and static reduction experiments on the Chinese NLI datasets OCNLI, CMNLI, and CINLI. The theory indicates that a high PVI means the model can easily extract information from the input that is strongly associated with the label. Such instances may contain annotation artifacts (such as high-frequency words or fixed patterns) or shallow patterns, leading the model to achieve high accuracy through “shortcut learning” rather than deep semantic inference. Removing these instances can therefore encourage the model to learn from low-PVI instances that require more complex inference, enhancing generalization ability and reducing reliance on artifacts.
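For reference, PVI quantifies how much easier the gold label becomes to predict when the model sees the input rather than an empty input (a brief recap of the standard formulation in [18]; g and g′ below correspond to the standard-input and empty-input models used in our experiments):

```latex
% PVI of an instance (x, y): g is the model finetuned on standard inputs,
% g' the model finetuned on empty inputs \varnothing.
\mathrm{PVI}(x \rightarrow y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)
% The dataset-level V-usable information is the mean PVI over all instances;
% high-PVI instances are the "easy" ones targeted by static reduction.
\hat{I}_{\mathcal{V}}(X \rightarrow Y) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{PVI}(x_i \rightarrow y_i)
```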
In the experiments, the Chinese-BERT-wwm model was used to calculate the PVI of the training set, and high-PVI instances were removed in descending order of PVI at ratios of 10%, 20%, …, 90%, constructing training subsets containing 90%, 80%, …, 10% of the original data. A series of experiments was conducted; we focus on analyzing the accuracy changes of the classification model at different reduction ratios in Table 2, while Table 3, Table 4 and Table 5 record the accuracy results of the different models on each dataset, where SIM denotes the standard input model, EIM the empty input model, and CM the classification model. As the reduction ratio of high-PVI instances increases, the accuracy of the classification models on the three datasets generally declines, but the rate and extent of the decline vary across datasets, revealing the moderating effect of task type on data redundancy.
Figure 3 shows this trend.
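A minimal sketch of this reduction pipeline in Python (hypothetical names; `sim_probs` and `eim_probs` are assumed to be per-instance label probabilities precomputed with the finetuned standard-input and empty-input models):

```python
import numpy as np

def compute_pvi(sim_probs, eim_probs, labels):
    """Per-instance PVI: log2 p_SIM(y|x) - log2 p_EIM(y|empty).

    sim_probs, eim_probs: (n, num_classes) arrays of label probabilities
    from the standard-input and empty-input models; labels: (n,) gold ids.
    """
    idx = np.arange(len(labels))
    return np.log2(sim_probs[idx, labels]) - np.log2(eim_probs[idx, labels])

def static_reduction(train_set, pvi, ratio):
    """Drop the top `ratio` share of high-PVI (easy) instances."""
    order = np.argsort(-pvi)                # descending PVI: easiest first
    keep = order[int(ratio * len(order)):]  # remove the top `ratio` share
    return [train_set[i] for i in keep]

# Training subsets with 90%, 80%, ..., 10% of the original data:
# subsets = {r: static_reduction(train_set, pvi, r) for r in np.arange(0.1, 1.0, 0.1)}
```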
OCNLI: As easy instances are removed, the model’s test-set accuracy gradually decreases from 69.59% on the full training set to 22.97% (marked in red font in Table 3), with performance loss increasing roughly linearly with the proportion of the training set removed. When removing 10–20% of high-PVI instances, accuracy decreases only slightly (69.59%→68.85%→66.20%), indicating that model performance depends only weakly on a small number of high-PVI instances. At this stage, the reduced dataset can save training resources while keeping model performance within an acceptable range: after removing 10% of the data, training time decreases while accuracy drops by only 0.74 percentage points, meeting practical requirements for balancing efficiency and effectiveness. When 50% of the high-PVI instances are removed, accuracy drops to 49.37% (marked in blue font in Table 3), a decrease of 19.48 percentage points compared to removing 10%. This indicates that high-PVI instances still contain key generalizable information for the task, and excessive removal disrupts the model’s ability to learn fundamental semantic patterns. The reason might be that not all high-PVI instances correspond to artifacts; some high PVI values may arise from genuine strong correlations between input and labels (e.g., the logical relationship between “raining→wet ground” and the “entailment” label), and removing these instances leads to information loss. Additionally, low-PVI instances contain complex inference patterns but may also include labeling noise or semantic ambiguity. Excessive removal of high-PVI instances alters the data distribution, directly increasing task difficulty beyond the model’s processing capacity and resulting in performance collapse.
Table 3. Accuracy (%) comparison between different reduction ratios (from 0 to 0.9) in OCNLI.
| OCNLI | Base | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| SIM | 89.29 | 83.45 | 83.08 | 78.20 | 66.56 | 64.04 | 57.70 | 57.52 | 63.68 | 72.11 |
| EIM | 34.12 | 37.09 | 42.57 | 46.85 | 45.94 | 45.53 | 31.98 | 44.95 | 45.27 | 45.94 |
| CM | 69.59 | 68.85 | 66.20 | 62.60 | 54.24 | 49.37 | 41.28 | 34.42 | 26.80 | 22.97 |
Experiments demonstrate that high-PVI instances are irreplaceable for training the OCNLI model when the reduction ratio ≥ 0.1, for the following reasons:
Loss of fundamental features: High-PVI instances typically contain strong association patterns between labels and inputs (e.g., the mapping of negation words like “不 (No)” to contradiction-class labels), which serve as the foundation for learning basic inference rules; removing these patterns deprives the model of that foundation.
Increased exposure to noise: Potential labeling errors or semantic ambiguity in low-PVI instances (e.g., ambiguous instances labeled as “neutral”) are amplified during training, disrupting the model’s optimization direction [38]. The removal of high-PVI instances disrupts the stable state of the original data distribution, allowing noise to dominate the training data and leading the model to converge to local optima. This result validates the core tenet of V-information theory: the difficulty of a dataset is a dynamic function of model capability and data distribution. Removing high-PVI instances alters the data distribution and thereby changes the task difficulty.
OCNLI is a low-structured task, necessitating the retention of more high-PVI instances to maintain basic inference capabilities. When reducing the data, attention must be paid to a safe, low reduction ratio. Removing 10–20% of high-PVI instances results in only a slight decrease in test accuracy (a 2–3 percentage point drop), so a reduction ratio of around 10% is recommended. Removing a small number of high-PVI instances can eliminate some redundant artifacts (e.g., overly obvious syntactic templates), prompting the model to learn more generalizable features. However, the reduction ratio must be strictly limited (<20%) and a conservative reduction strategy adopted: beyond 20%, the combined effect of fundamental feature loss and increased noise exposure accelerates the performance decline.
CMNLI: Without considering the balance of the dataset, as the reduction ratio increases, the model’s test-set accuracy gradually decreases, from 79.99% with the complete training set to 17.27% after removing 90% of easy instances (marked in red font in Table 4), indicating that removing many easy instances negatively impacts model performance. When 10% of high-PVI instances are removed, accuracy is 79.94%; at 20%, 79.23%; and at 30%, 79.03% (marked in blue font in Table 4). This is similar to the results on the OCNLI dataset, suggesting that model performance depends only weakly on a small number of high-PVI instances; at this point, trimming the dataset can save training resources to some extent while keeping model performance within an acceptable range. However, when more than 50% of the high-PVI instances are removed, accuracy drops significantly: at 50% removal it is 61.99% (marked in green font in Table 4), 17.95 percentage points lower than at 10% removal. This may be because excessive removal causes the loss of basic features, making it difficult for the model to learn the semantic patterns effectively, while the noise in low-PVI instances is amplified, affecting the model’s optimization direction.
Table 4. Accuracy (%) comparison between different reduction ratios (from 0 to 0.9) in CMNLI.
| CMNLI | Base | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| SIM | 88.58 | 87.06 | 84.66 | 82.01 | 74.52 | 52.75 | 48.00 | 40.74 | 51.65 | 64.70 |
| EIM | 33.34 | 36.36 | 36.43 | 36.22 | 36.93 | 37.97 | 38.90 | 39.93 | 40.77 | 40.71 |
| CM | 79.99 | 79.94 | 79.23 | 79.03 | 76.94 | 61.99 | 34.94 | 37.30 | 23.29 | 17.27 |
CINLI: Using the same static reduction method, the top 10%, 20%, …, 90% of high-PVI instances were removed in descending order of PVI to construct training subsets. The experimental results show that even after removing 40% of the high-PVI instances, the model’s accuracy remained high at 87.13% (marked in red font in Table 5). This performance degradation is much slower than on OCNLI, revealing the regulatory effect of task type on data redundancy.
Table 5. Accuracy (%) comparison between different reduction ratios (from 0 to 0.9) in CINLI.
| CINLI | Base | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| SIM | 97.32 | 97.03 | 96.31 | 95.28 | 92.32 | 88.17 | 86.07 | 80.07 | 59.93 | 59.93 |
| EIM | 29.07 | 37.61 | 42.32 | 47.93 | 55.92 | 64.82 | 56.47 | 46.04 | 28.53 | 36.34 |
| CM | 91.14 | 91.31 | 90.76 | 89.04 | 87.13 | 77.70 | 79.05 | 75.53 | 48.70 | 48.70 |
The stability of CINLI stems from its intrinsic characteristics:
Structured semantics: The fixed meaning of idioms allows the model to perform generalization inference using a small number of keywords (e.g., “剑 (sword)” in “刻舟求剑”, which literally means “to carve a mark on a boat to find a lost sword”; or “蛇 (snake)” and “足 (foot)” in “画蛇添足”, which means “to draw a snake and add feet to it”), reducing reliance on data volume and eliminating the need to learn complex contextual correlations. This differs from the causal chain inference in OCNLI, where complex logical inference also requires more task-specific parameter updates. Additionally, the semantic boundaries of idioms are clear, resulting in higher compactness of data distribution and a more concentrated PVI distribution of training instances (low redundancy in high-PVI instances). Even after removal, the remaining instances still cover core semantic patterns.
Pre-training compensation: BERT-wwm has already encoded the general semantics of idioms [28], reducing sensitivity to the training instances. The idiom inference task in CINLI is highly compatible with BERT’s masked language modeling objective, as both rely on local semantic correlations. Through large-scale pre-training corpora, the model has learned distributed representations of idioms, so finetuning only needs to align the label space rather than construct semantic mappings from scratch. Therefore, even after removing some instances, the model can still leverage prior knowledge for generalization inference. This phenomenon aligns with the discussion of the task–distribution coupling effect: task difficulty is determined jointly by data distribution attributes (e.g., degree of semantic structuring) and the model’s prior knowledge.
CINLI corresponds to highly structured tasks, where data reduction is highly feasible: redundant high-PVI instances can be removed preferentially, saving resources without affecting performance. For such tasks, an aggressive reduction strategy can be adopted, removing approximately 30% of high-PVI instances.
Class Balance: In the process of reducing the dataset, we found that as more easy instances were removed, greater class imbalance was introduced into the remaining training subset. Therefore, we artificially enforced proportional reduction within each category and explored the impact of class balance on model training. The experimental results (see Appendix A) indicate that after applying balanced reduction, the distribution bias caused by removing high-PVI instances was mitigated to some extent. Under this balanced constraint, the accuracy of the trained empty input model (EIM) consistently remained close to the random chance of a three-class classification (about 33%), which aligns with our assumption about balanced reduction. This suggests that the balanced constraint effectively weakens the impact of label distribution bias but does not alter the information-theoretic nature of the empty input model. The empty input model’s limited use of input information and the stability of its performance further highlight the standard input model’s ability to exploit input information for prediction. This also indirectly confirms that the performance decline of the standard input model after removing high-PVI instances is not because the model itself becomes completely ineffective, but because it loses effective access to key input information.
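A minimal sketch of the balanced variant (assuming each instance carries a `label` field and the PVI scores from the sketch above): the same share of highest-PVI instances is removed within each class, so the remaining subset preserves the original label distribution.

```python
from collections import defaultdict

def balanced_reduction(train_set, pvi, ratio):
    """Remove the top `ratio` share of high-PVI instances per class,
    keeping the label distribution of the subset unchanged."""
    by_label = defaultdict(list)
    for i, ex in enumerate(train_set):
        by_label[ex["label"]].append(i)
    kept = []
    for idxs in by_label.values():
        idxs = sorted(idxs, key=lambda i: pvi[i], reverse=True)  # easiest first
        kept.extend(idxs[int(ratio * len(idxs)):])               # drop top share
    return [train_set[i] for i in sorted(kept)]
```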
Noise: To verify the robustness and generalization ability of the method, we introduced noise into the OCNLI dataset to simulate more realistic application scenarios. We randomly replaced instances in the OCNLI training set with low-quality text at a replacement ratio of 0.1. These low-quality texts were generated by rewriting the original sentences, incorporating features such as synonym substitution, misspellings, punctuation noise, internet slang, meaningless phrase insertion, and sentence splitting and merging. As depicted in Figure 4, the model’s performance trend on the noisy dataset closely mirrors that on the class-imbalanced and class-balanced datasets. Consistent with the earlier conclusion, a PVI-guided reduction of around 10% of the data has little impact on the model’s performance. As expected, performance on the noisy dataset remains almost uniformly below that on the other two dataset variants (imbalanced and balanced).
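The corruption step can be sketched as follows (a simplified illustration with hypothetical field names; the actual rewrites also included synonym substitution, internet slang, and sentence splitting and merging, omitted here for brevity):

```python
import random

def add_typo(text, rng):
    """Crude misspelling: duplicate one randomly chosen character."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]

def add_punct_noise(text, rng):
    """Insert a stray punctuation mark at a random position."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice("，。！？…") + text[i:]

def corrupt_dataset(train_set, ratio=0.1, seed=0):
    """Replace a `ratio` share of premises with low-quality rewrites."""
    rng = random.Random(seed)
    noisy = [dict(ex) for ex in train_set]
    for i in rng.sample(range(len(noisy)), int(ratio * len(noisy))):
        perturb = rng.choice([add_typo, add_punct_noise])
        noisy[i]["premise"] = perturb(noisy[i]["premise"], rng)
    return noisy
```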
Random: We also included a random baseline as a control, randomly deleting instances from the training data without considering any scores or specific features of the data points. We observed that reducing the easy instances did not yield a performance gain over the random baseline. A similar phenomenon appears in Rabiraj’s research [39], which designed a pruning strategy for sexism detection using three influence scores, including PVI. As the reduction ratio of simple instances increased, the training subset contained more difficult instances; the difficulty of the dataset therefore rose, and the performance gap between the random baseline and the PVI-based reduction baselines gradually widened. We speculate that training solely on a difficult subset could cause the model to over-emphasize learning from edge cases and ambiguous instances, potentially overfitting to these specific hard instances and thus generalizing poorly on the broader test set. The random baseline, in which some easy instances are inevitably retained by random deletion, might implicitly benefit from this broader representation. Removing a large number of simple instances can disrupt the difficulty distribution of the dataset. We therefore recommend reducing the dataset within a moderate range, removing simpler instances only to the extent that the overall difficulty structure remains largely intact. To further improve on the baseline performance under simple-instance reduction, one could incorporate a more sophisticated weighting mechanism for the remaining difficult instances during training. In subsequent work, we will consider combining PVI-based difficulty measurement with diversity measurement (e.g., EL2N [40], VoG [41], TracIn [42]) as an improvement strategy for selecting the instances to retain. This would ensure that the model learns not only from challenging instances but also from a diverse set of representative instances across the data spectrum.
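For completeness, the random baseline is a uniform deletion that ignores PVI entirely (a sketch paralleling `static_reduction` above):

```python
import random

def random_reduction(train_set, ratio, seed=0):
    """Control baseline: delete a `ratio` share of instances uniformly
    at random, without consulting PVI or any other score."""
    rng = random.Random(seed)
    kept = rng.sample(range(len(train_set)), int((1 - ratio) * len(train_set)))
    return [train_set[i] for i in sorted(kept)]
```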
3.2.2. Progressive Learning
In this section, the experiments primarily focus on the OCNLI and CINLI datasets, aiming to investigate the effectiveness of the progressive learning strategy. The selection of these two datasets is based on the following considerations: OCNLI is highly representative of Chinese natural language inference and effectively evaluates the model’s baseline performance and generalization capabilities; CINLI, with its distinctive text-pair construction and inference task design, facilitates an in-depth examination of the model’s inference accuracy and stability. In comparison, CMNLI, given its large scale and translation-generated origin, exhibits limitations such as semantic bias and cultural differences, which may introduce confounding factors. Therefore, under constrained experimental resources, prioritizing the OCNLI and CINLI datasets yields more reliable and persuasive experimental results.
Following Algorithm 3, the training set is sorted by PVI (from easiest to hardest), and Qwen3-0.6B (available at https://huggingface.co/Qwen/Qwen3-0.6B, accessed on 15 June 2025) is used as the base model for training. Initially, PVI values are computed for all instances in the training set to establish their difficulty ranking. Training then commences with the simplest instances and gradually incorporates more difficult ones by selecting growing subsets of the sorted training data. After each progressive training stage on a subset, the trained model is evaluated on a fixed held-out test set, recording accuracy, precision, recall, and F1 score to assess performance evolution. The experimental results demonstrate that training on the PVI-sorted dataset enhances model performance. Since micro-averaging is used to calculate recall in this multi-class task and the three categories in the dataset are relatively evenly distributed, the recall values are close to accuracy.
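A minimal sketch of this progressive loop (the `finetune` and `evaluate` routines are caller-supplied placeholders, and the stage boundaries are illustrative rather than the exact schedule of Algorithm 3):

```python
import numpy as np

def progressive_training(model, train_set, pvi, test_set, finetune, evaluate,
                         stages=(0.25, 0.5, 0.75, 1.0)):
    """Easy-to-hard curriculum over PVI-sorted data.

    finetune(model, subset) and evaluate(model, test_set) are hypothetical
    training/evaluation helpers; `stages` gives the growing fraction of the
    sorted training data used at each stage.
    """
    order = np.argsort(-pvi)  # descending PVI: easiest instances first
    history = []
    for frac in stages:
        subset = [train_set[i] for i in order[:int(frac * len(order))]]
        model = finetune(model, subset)            # continue training on prefix
        history.append(evaluate(model, test_set))  # accuracy / P / R / F1 per stage
    return model, history
```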
Table 6 presents the experimental results on the OCNLI dataset. By sorting the training set by PVI from easiest to hardest, the model’s accuracy improves by approximately 0.81 percentage points relative to the baseline (69.76% − 68.95% = 0.81%), and the F1 score rises from a baseline of 69.08% to 69.91%, an increase of about 0.83 percentage points (marked in bold font in Table 6). This indicates a positive impact of PVI sorting on model performance. Even with a 10% reduction in training data, model performance remains high, reflecting the effectiveness of the sorting and reduction strategies.
Table 7 presents the mean and standard deviation of the model’s performance over three runs on the OCNLI dataset.
Figure 5 visually compares the model’s performance under different processing methods. Comparing “Base” (green bar) and “Sort” (blue bar) clearly shows that after PVI sorting, the model improves in accuracy, precision, recall, and F1 score. While “Sort & Reducing 10%” (orange bar) performs slightly below “Sort” on all metrics, it still maintains precision close to that of “Base,” consistent with the data analysis in Table 6, further confirming that even with a reduced data volume, the model can still exhibit strong performance.
Table 8 presents the experimental results on the CINLI dataset, showing that the model’s performance also improves after data processing. Accuracy increases from the baseline of 91.7852% to 91.8676%, and the F1 score rises from the baseline of 91.7861% to 91.8651% (marked in bold font in Table 8). At a reduction ratio of 0.3 (i.e., removing 30% of the data volume), the model still maintains an accuracy of 90.42% and an F1 score of 90.38%, further validating that the progressive learning strategy can effectively reduce the demand for training data while preserving model performance.
Figure 6 compares the model’s performance on the CINLI dataset under different processing methods. Similarly to the analysis of OCNLI, Figure 6 clearly illustrates the improvements of “Sort” (blue bar) over “Base” (green bar) on all performance metrics, although the magnitude of the improvement is relatively small. It is noteworthy that the performance of “Sort & Reducing 30%” (orange bar) declines in accuracy, precision, recall, and F1 score, but remains above 90%.
Table 9 presents the mean and standard deviation of the model’s performance over three runs on the CINLI dataset.
We speculate that this easy-to-difficult progressive learning strategy (a form of curriculum learning) enables the model to prioritize information-rich but low-difficulty instances during the early stages of training, rapidly constructing foundational feature representations and pattern recognition capabilities. The model is then gradually exposed to more complex instances, helping it progressively master more abstract and fine-grained knowledge. This reasonable difficulty distribution optimizes the “quality” and utilization efficiency of the training set, avoiding interference from large numbers of difficult or noisy instances in the early stages, and thus promotes faster convergence and higher final performance. From the perspective of model optimization, a reasonable difficulty distribution can guide the gradient descent process toward better local minima or, at the very least, achieve more robust parameter initialization early in training, laying a solid foundation for subsequent learning.