

Open Access, Feature Paper, Article

Windowing as a Sub-Sampling Method for Distributed Data Mining

1 Centro de Investigación en Inteligencia Artificial, Universidad Veracruzana, Sebastián Camacho No 5, Xalapa, Veracruz, México 91000, Mexico
2 Facultad de Estadística e Informática, Universidad Veracruzana, Av. Xalapa s/n, Xalapa, Veracruz, México 91000, Mexico
3 Departament d’Informàtica, Universitat de València, Avinguda de la Universitat, s/n, Burjassot-València, 46100 València, Spain
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2020, 25(3), 39; https://doi.org/10.3390/mca25030039
Received: 31 May 2020 / Revised: 27 June 2020 / Accepted: 29 June 2020 / Published: 30 June 2020
(This article belongs to the Special Issue New Trends in Computational Intelligence and Applications)
Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer examples were used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. To this end, the behavior of windowing is first generalized beyond decision trees by corroborating the observed negative correlation with inductive algorithms of different natures. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1, and compared to those obtained with traditional methods: random, balanced, and stratified sampling. It is shown that the aggressive sampling performed by windowing, up to 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to improve both this understanding and the performance of the method, such as studying the evolution of the windows over time.
Keywords: sub-sampling; windowing; distributed data mining
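The abstract describes windowing only at a high level. As a rough illustration, the classic scheme can be sketched as follows: train on a small window, test the model on the remaining examples, move the misclassified ones into the window, and repeat until none remain. This is a minimal sketch under stated assumptions, not the procedure evaluated in the article: the function names (`windowing`, `fit_stump`, `accuracy`), the parameters (`initial_size`, `max_rounds`, `seed`), the one-dimensional toy data, and the threshold learner standing in for ID3/C4.5 are all hypothetical.

```python
import random


def fit_stump(examples):
    """Hypothetical stand-in learner (not ID3/C4.5): a single-threshold rule
    on 1-D inputs, chosen to minimise errors on the given examples."""
    labels = sorted({y for _, y in examples})
    best_rule, best_errors = None, len(examples) + 1
    for t in sorted({x for x, _ in examples}):
        for low in labels:
            for high in labels:
                errors = sum((low if x <= t else high) != y for x, y in examples)
                if errors < best_errors:
                    best_rule, best_errors = (t, low, high), errors
    t, low, high = best_rule
    return lambda x: low if x <= t else high


def accuracy(model, examples):
    """Fraction of examples the model classifies correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def windowing(examples, fit, initial_size=4, max_rounds=20, seed=0):
    """Grow a training window with the counter-examples of the current model."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    window, rest = pool[:initial_size], pool[initial_size:]
    model = fit(window)
    for _ in range(max_rounds):
        missed = [(x, y) for x, y in rest if model(x) != y]
        if not missed:
            break  # the model is consistent with every example outside the window
        rest = [(x, y) for x, y in rest if model(x) == y]
        window.extend(missed)
        model = fit(window)
    return model, window


# Toy, noise-free data (an assumption): the label is 1 exactly when x >= 100.
data = [(x, int(x >= 100)) for x in range(200)]
model, window = windowing(data, fit_stump)
print(f"window: {len(window)} of {len(data)} examples, "
      f"accuracy on all data: {accuracy(model, data):.2f}")
```

On this toy problem the loop stops once the model agrees with every example left outside the window, so the final model is induced from a subset of the data. How small that subset is depends on the learner, the data, and the initial draw.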
MDPI and ACS Style

Martínez-Galicia, D.; Guerra-Hernández, A.; Cruz-Ramírez, N.; Limón, X.; Grimaldo, F. Windowing as a Sub-Sampling Method for Distributed Data Mining. Math. Comput. Appl. 2020, 25, 39.

