A Distributed Instance Selection Algorithm Based on Cognitive Reasoning for Regression Tasks

Linzi Yin; Wendi Cai; Zhanqi Li; Xiaochao Hou

doi:10.3390/app16020913


School of Electronic Information, Central South University, Changsha 410004, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(2), 913; https://doi.org/10.3390/app16020913

This article belongs to the Special Issue Big Data Driven Machine Learning and Deep Learning


Abstract

Instance selection is a critical preprocessing technique for enhancing data quality and improving the efficiency of machine learning models. However, existing algorithms for regression tasks face a fundamental trade-off: non-heuristic methods offer high precision but suffer from sequential dependencies that hinder parallelization, while heuristic methods parallelize well but often yield coarse-grained results that are susceptible to local optima. To address these challenges, we propose CRDISA, a novel distributed instance selection algorithm driven by a formalized cognitive reasoning logic. Unlike traditional approaches that evaluate subsets, CRDISA transforms each instance into an independent “Instance Expert” capable of reasoning about the global data distribution through a unique difference knowledge base. For regression tasks with continuous outputs, we introduce a soft partitioning strategy to define adaptive error boundaries and a bidirectional voting mechanism to robustly identify high-quality instances. Although this fine-grained reasoning implies high computational complexity, we implement CRDISA on Apache Spark using an optimized broadcast mechanism. This architecture scales linearly in wall-clock time, enabling large-scale processing without sacrificing theoretical rigor. Experiments on 22 datasets show that CRDISA achieves an average compression rate of 31.7% while maintaining predictive accuracy ($R^{2} = 0.681$) comparable to or better than state-of-the-art methods, demonstrating its advantage in balancing selection granularity and distributed efficiency.

Keywords:

Apache Spark; cognitive reasoning; distributed computing; instance selection; regression tasks
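The abstract reports predictive accuracy as a coefficient of determination, R². For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition, R² = 1 − SS_res / SS_tot. The function name `r2_score` and the sample values are illustrative assumptions and are not taken from the article; the article does not state how its R² was computed beyond reporting the value.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Standard coefficient of determination: 1 - SS_res / SS_tot.

    Illustrative helper only (hypothetical name, not the article's code).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)

    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, purely for illustration.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
score = r2_score(targets, predictions)
```

A value of 1 means the predictions match the targets exactly, and a value of 0 means the model does no better than predicting the mean target. In an instance selection setting, this score would typically be computed on held-out data after training a regressor on the selected subset.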
