Article
Peer-Review Record

Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Electronics 2025, 14(15), 3092; https://doi.org/10.3390/electronics14153092
by Fei Chen * and Wenchi Zhou
Reviewer 1: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 15 June 2025 / Revised: 29 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025
(This article belongs to the Special Issue Data Retrieval and Data Mining)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript presents a data-centric strategy for large-scale dataset reduction using Pointwise V-Information (PVI). The authors aim to improve training efficiency and model performance by identifying and removing low-utility (high-PVI) instances, and by progressively introducing training examples from easy to hard. The study demonstrates empirical results on multiple Chinese Natural Language Inference (NLI) datasets and adapts a framework initially proposed for English-language tasks.

While the paper is well-written and methodologically sound, several concerns—particularly regarding novelty, evaluation depth, and practical utility—must be addressed before the manuscript is suitable for publication.

Major Comments

The core contribution—the use of PVI for instance filtering and progressive training—is largely based on prior work (Ethayarajh et al., 2022). While the application to Chinese NLP tasks adds some novelty, the manuscript should better clarify what is fundamentally new in the approach versus an application of existing theory.

  • Recommendation: Clarify the novel aspects of your method, particularly in comparison to prior work on instance difficulty metrics and curriculum learning. A comparative table with existing approaches would help.

The paper evaluates only against full-dataset baselines or simple ablations (e.g., reduction ratios), but omits comparison with standard difficulty-based sampling or data selection methods (e.g., entropy, loss-based pruning, gradient-based sampling).

  • Recommendation: Include baselines such as:
    • Loss-based data pruning.
    • Entropy-based uncertainty sampling.
    • Random data selection (to control for PVI impact).
    • Active learning-inspired methods.

This is necessary to fairly demonstrate the effectiveness and advantage of using PVI in practice.

The PVI computation requires fine-tuning models with null inputs, which is computationally expensive, especially at scale. However, the paper lacks a discussion of training/runtime cost versus performance benefit.

  • Recommendation: Include a runtime analysis (e.g., time/memory saved during training vs. time spent computing PVI) to justify the strategy’s practical value in real-world large-scale NLP applications.
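
For context on this cost, the PVI of an instance (following the definition in Ethayarajh et al., 2022) reduces to a comparison of two probability estimates: one from a model g fine-tuned on (input, label) pairs, and one from a null model g′ fine-tuned on (∅, label) pairs. The sketch below assumes those probabilities have already been obtained; the numeric values are illustrative only, and the point is that the dominant cost is the two fine-tuning runs, not the arithmetic itself.

```python
import math

def pvi(p_y_given_x: float, p_y_given_null: float) -> float:
    """Pointwise V-information of an instance (x, y), per Ethayarajh et al. (2022):
    PVI(x -> y) = -log2 g'[null](y) + log2 g[x](y),
    where g is fine-tuned with inputs and g' with inputs replaced by a null token."""
    return -math.log2(p_y_given_null) + math.log2(p_y_given_x)

# Easy instance: the input makes the gold label much more probable than
# the label prior alone, so PVI is high (positive).
easy = pvi(p_y_given_x=0.95, p_y_given_null=0.34)

# Hard (or mislabeled) instance: the input lowers the gold-label
# probability relative to the prior, so PVI is negative.
hard = pvi(p_y_given_x=0.20, p_y_given_null=0.34)
```

A runtime analysis would then weigh the two fine-tuning passes needed to obtain these probabilities against the training time saved on the reduced dataset.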

 

The reported performance improvements are relatively small (e.g., +0.8% accuracy). No standard deviations or significance tests are provided, which limits the confidence in claims of improvement.

  • Recommendation: Report mean ± standard deviation across multiple runs. Where possible, include statistical significance testing (e.g., t-tests or confidence intervals).

While PVI is a promising metric, the paper seems to equate high-PVI instances with low utility. This may be oversimplified. High-PVI instances could contain foundational patterns necessary for generalization (as your own results sometimes show).

  • Recommendation: Distinguish between “instance difficulty” and “instance usefulness” more clearly in the text. Consider adding qualitative examples in which removing high-PVI instances caused a performance drop because core features were lost.

Minor Comments

  • The writing is mostly clear, but could benefit from more concise summaries, particularly in the abstract and introduction.
  • Figures should include error bars to indicate variance across runs.
  • Please verify and correct any possible labeling issues in your test set examples (Appendix B), as these could bias results or mislead readers.
  • Include more discussion on how your findings generalize to non-NLI tasks or non-text modalities.

 

Author Response

Dear Reviewer,

    Please see the attachment.

Kind regards,

The authors.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors of the study propose a data reduction strategy based on Pointwise V-Information (PVI) to improve the efficiency and performance of model training in data-centric AI. The basic idea is to quantify instance difficulty and filter out less informative examples. The proposed two-fold approach (static filtering and progressive learning) is well-motivated and convincingly validated.

One of the paper’s key strengths lies in its thorough experimental evaluation. The authors show that discarding 10–30% of the data results in only minimal accuracy loss. In some cases, using progressive learning, even performance improvement is reported. The experimental setup is clearly described, and the fact that the source code is publicly available further improves reproducibility and transparency. Also, applying the PVI framework to diverse Chinese NLP tasks is a great advancement in cross-lingual data reduction research.

 

Despite these strengths, some potential drawbacks should be mentioned:

  1. A detailed exploration of other approaches used for data reduction tasks is missing. A comparison of these approaches with the proposed one should also be provided.
  2. What is the computational complexity of the proposed approach? Can the additional data-processing steps and effort be justified? How does it compare with alternative approaches?
  3. The proposed framework relies heavily on the accuracy of PVI as a measure of instance difficulty. If the PVI score does not reliably capture the true "usefulness" of a data point (e.g., due to bias in the model used to estimate it), important or edge-case data may be discarded. Provide more details on how future approaches involving multi-faceted selection criteria and synthetic data generation could improve the proposed framework.
  4. Experimental results are limited to standard benchmark datasets. Evaluation on noisy real-life datasets would provide stronger evidence for practical adoption.
  5. The proposed approach shows promising results for specific classification tasks. The authors should comment on applying the proposed framework to other tasks and domains.
  6. Add more discussion of why the proposed approach may not be suitable for other data formats.
  7. Figure 1 is barely readable.

Author Response

Dear Reviewer,

    Please see the attachment.

Kind regards,

The authors.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes a large-scale data reduction strategy based on PVI, demonstrating that removing 10%-30% of low-quality data minimally impacts model performance, while progressive learning methods further enhance convergence speed and accuracy. The approach successfully extends from English to Chinese NLP tasks, validating cross-lingual applicability.

  1. Data reduction—or should it be called Data Distillation? There’s already extensive literature on Data Distillation. Carefully review relevant papers to clarify the terminology.

  2. This paper claims to challenge the mainstream Scaling Law perspective, but the experiments are incomplete. The analysis primarily relies on black-box performance metrics (e.g., accuracy) to infer the effectiveness of the data selection mechanism, lacking a sufficient theoretical explanation for why PVI can accurately identify "high-information" data instances.

  3. The core argument aligns with Microsoft’s Phi-series large models, which also emphasize the superiority of high-quality data. Thus, comparisons and analyses with Phi-series papers are necessary. Current large model research trends involve scaling up; however, this study fails to validate its approach on larger pre-trained models (e.g., 7B/14B parameters), limiting its novelty as it remains confined to BERT and Qwen3-0.6B—models with relatively small scales.

  4. If the primary value of data reduction lies in addressing "data quality prioritization" and "resource efficiency," is it necessary to conduct experiments on resource-rich languages like Chinese and English? With sufficient computational resources to process massive datasets, what is the practical significance of this research?

  5. The study exclusively uses Chinese NLI tasks (OCNLI/CMNLI/CINLI) and neglects other critical NLP tasks (e.g., text classification, sequence labeling, machine translation) or multimodal scenarios.

  6. PVI calculation depends on fine-tuning model parameters (Formulas 1–4), yet the analysis does not explore how model architecture (e.g., BERT layer depth, attention mechanisms) influences PVI values.

  7. NLI tasks inherently exhibit low sensitivity to data redundancy due to their clear logical structures. Thus, the conclusions may not generalize to high-redundancy tasks (e.g., sentiment analysis, question-answering systems).

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

Dear Reviewer,

    Please see the attachment.

Kind regards,

The authors.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The paper proposes an effective data reduction strategy for NLP based on quantifying instance difficulty through Pointwise V-Information (PVI). Although the proposal presents an interesting idea and considerable ambition in its cross-lingual extension of the method, the manuscript has several critical issues, reported below, that need to be addressed:

The treatment of the concept of "dataset difficulty" suffers from terminological and semantic confusion: for example, line 54 states that "dataset difficulty is a concept which describes the data quality", an ambiguous statement that conflates two distinct notions, the computational difficulty of an instance and its epistemic quality.

The definition of PVI provided between lines 68–77, although technically correct, is introduced as a "lack of usable model information", a formulation that presupposes a clear distinction between accessible information and useful information; the paper needs to state this distinction explicitly, in a well-founded and rigorously formal way.

In Sections 2.1 and 2.2, which illustrate the system architecture and the reduction algorithms, it is not clear how the models 𝑔 and 𝑔′ of the prediction family 𝒱 were selected. In particular, line 171 states that "𝑔′ and 𝑔 are the models selected from the prediction family 𝒱", but a definition of the selection criterion, and a formal discussion of how this choice affects the results, are missing.
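
For reference, in the framework of Ethayarajh et al. (2022) these models are not arbitrary: 𝑔′ and 𝑔 are the (approximate) minimizers of the predictive 𝒱-entropies, obtained in practice by fine-tuning on (∅, y) and (x, y) pairs respectively. A sketch of the standard definitions:

```latex
% Predictive V-entropies; g' and g are their (approximate) minimizers in V
H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[ -\log_2 f[\varnothing](Y) \right],
\qquad
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[ -\log_2 f[X](Y) \right]

% Pointwise V-information of an instance (x, y)
\mathrm{PVI}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)
```

A formal statement along these lines, together with an analysis of how sensitive PVI is to imperfect minimization within 𝒱, would address this concern.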

Regarding Algorithm 2 (lines 221–243) and Algorithm 3 (lines 244–259), a systematic evaluation of their behavior under strong data noise or on unbalanced datasets is missing.


It is repeatedly stated that high-PVI instances are “redundant” or “not very informative” (lines 72–77, 309–311), yet it is also observed that their removal significantly degrades model performance in some configurations (lines 335–346). This contradiction should be addressed as a need to balance difficulty against informativeness, rather than avoided through the conceptual simplifications currently presented.


The expository style is overall clear but at times excessively didactic and redundant; in particular, some conceptual reiterations add little to a rigorous scientific synthesis. For example, the duplicated statements about the effectiveness of PVI as a “model-aware” measure (lines 132–135) should be reformulated more soberly and precisely.

Regarding the purely technical language, formal definitions are missing for some fundamental concepts (e.g., "usable information", "redundant easy instance"), which should be restated with greater formal rigor.

Author Response

Dear Reviewer,

    Please see the attachment.

Kind regards,

The authors.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

After reading the new version of the study provided by the authors, I can confirm that the authors provided satisfactory answers to all my previous comments.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors' response merely added a few formulas, avoided key experimental validations, promised future work, and lacked substantive improvements. It remains at the level of explanatory supplements rather than substantive enhancements, failing to meet the journal's requirements for theoretical innovation.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

All the major concerns I highlighted have been substantially addressed.
Only comment 7 has been partially addressed because some definitions remain informal. 
A light language edit would help make the text more concise and precise, especially in parts with dense theoretical explanations.
I suggest accepting the paper after minor revisions.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have carefully addressed the questions raised and added experiments to validate the effectiveness of the proposed method. However, there are still grammar/spelling issues (e.g., “know as,” “it is become”). The axes and legends in Figures 3 and 7 are difficult to read; please enlarge the font and label the units. Numerical values in Tables 2/3/4/5 (e.g., 0.6199) should use a consistent number of decimal places. Several references are improperly formatted (e.g., the repeated phrase “Proceedings of the Proceedings”). If cross-lingual generality is claimed, please include at least one small-scale experiment on a non-Chinese task or explicitly state this as future work.

The manuscript states that ‘We observed that reducing the easy instances didn't lead to a gain in performance over the random baseline.’ Please provide an analysis of the reasons and propose improvement strategies; otherwise, this weakens the claimed effectiveness of your method.

If these issues are resolved, I support acceptance.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
