Abstract
Instance selection is a critical preprocessing technique for enhancing data quality and improving machine learning model efficiency. However, existing algorithms for regression tasks face a fundamental trade-off: non-heuristic methods offer high precision but suffer from sequential dependencies that hinder parallelization, while heuristic methods support parallelization but often yield coarse-grained results susceptible to local optima. To address these challenges, we propose CRDISA, a novel distributed instance selection algorithm driven by a formalized cognitive reasoning logic. Unlike traditional approaches that evaluate subsets, CRDISA transforms each instance into an independent “Instance Expert” capable of reasoning about the global data distribution through a unique difference knowledge base. For regression tasks with continuous outputs, we introduce a soft partitioning strategy to define adaptive error boundaries and a bidirectional voting mechanism to robustly identify high-quality instances. Because this fine-grained reasoning implies high computational complexity, we implement CRDISA on Apache Spark using an optimized broadcast mechanism. This architecture provides linear scalability in wall-clock time, enabling large-scale processing without sacrificing theoretical rigor. Experiments on 22 datasets show that CRDISA achieves an average compression rate of 31.7% while maintaining predictive accuracy comparable to or better than state-of-the-art methods, indicating that it balances selection granularity and distributed efficiency more effectively than existing approaches.