  • Article
  • Open Access

15 January 2026

A Distributed Instance Selection Algorithm Based on Cognitive Reasoning for Regression Tasks

School of Electronic Information, Central South University, Changsha 410004, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Big Data Driven Machine Learning and Deep Learning

Abstract

Instance selection is a critical preprocessing technique for enhancing data quality and improving the efficiency of machine learning models. However, existing algorithms for regression tasks face a fundamental trade-off: non-heuristic methods offer high precision but suffer from sequential dependencies that hinder parallelization, while heuristic methods parallelize well but often yield coarse-grained results that are susceptible to local optima. To address these challenges, we propose CRDISA, a novel distributed instance selection algorithm driven by a formalized cognitive reasoning logic. Unlike traditional approaches that evaluate subsets, CRDISA transforms each instance into an independent “Instance Expert” that reasons about the global data distribution through its own difference knowledge base. For regression tasks with continuous outputs, we introduce a soft partitioning strategy that defines adaptive error boundaries and a bidirectional voting mechanism that robustly identifies high-quality instances. Although this fine-grained reasoning implies high computational complexity, we implement CRDISA on Apache Spark using an optimized broadcast mechanism. This architecture achieves linear scalability in wall-clock time without sacrificing theoretical rigor. Experiments on 22 datasets show that CRDISA achieves an average compression rate of 31.7% while maintaining predictive accuracy (R² = 0.681) comparable to or better than that of state-of-the-art methods, demonstrating its advantage in balancing selection granularity and distributed efficiency.
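The abstract does not give the formal definitions behind CRDISA, so the following is only a toy sketch of the general idea of per-instance evaluation, not the authors' algorithm. Everything concrete in it is an assumption made for illustration: the function name `select_instances`, the parameters `k`, `alpha`, and `min_votes`, the use of the median target gap as an adaptive error boundary, and the rule that a vote counts only when the gap lies within both instances' boundaries.

```python
import math
from statistics import median


def select_instances(X, y, k=3, alpha=1.5, min_votes=1):
    """Toy per-instance selection for regression (illustrative assumptions only)."""
    n = len(X)
    neighbours = []  # hypothetical stand-in for a per-instance knowledge base
    tolerance = []   # hypothetical adaptive error boundary per instance

    for i in range(n):
        # Each instance looks at every other instance independently.
        by_distance = sorted(
            (math.dist(X[i], X[j]), j) for j in range(n) if j != i
        )
        knn = [j for _, j in by_distance[:k]]
        neighbours.append(knn)
        # Assumed boundary: scaled median target gap to the k nearest
        # neighbours, which a single outlying neighbour cannot inflate.
        gaps = [abs(y[i] - y[j]) for j in knn]
        tolerance.append(alpha * median(gaps))

    votes = [0] * n
    for i in range(n):
        for j in neighbours[i]:
            gap = abs(y[i] - y[j])
            # Assumed two-way rule: the gap must fit inside the boundary
            # of both the voter and the candidate.
            if gap <= tolerance[i] and gap <= tolerance[j]:
                votes[j] += 1

    # Keep the indices of instances that gathered enough votes.
    return [i for i in range(n) if votes[i] >= min_votes]


if __name__ == "__main__":
    # Six points on the line y = x plus one point with an outlying target.
    X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [2.5]]
    y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
    print(select_instances(X, y))
```

Because each pass of the first loop depends only on instance `i` and a read-only copy of the data, work of this shape could in principle be mapped over partitions with the dataset shared as a broadcast variable. Whether that matches the paper's Spark implementation is not stated in the abstract.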
