Open Access Article
Symmetry 2018, 10(7), 248; https://doi.org/10.3390/sym10070248

From Theory to Practice: A Data Quality Framework for Classification Tasks

1 Grupo de Ingeniería Telemática, Universidad del Cauca, Campus Tulcán, 190002 Popayán, Colombia
2 Departamento de Informática, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, 28911 Leganés, Spain
3 Grupo de Ingeniería Telemática, Universidad del Cauca, Campus Tulcán, 190002 Popayán, Colombia
These authors contributed equally to this work.
* Author to whom correspondence should be addressed.
Received: 26 April 2018 / Revised: 29 May 2018 / Accepted: 29 May 2018 / Published: 1 July 2018

Abstract

Data preprocessing is an essential step in knowledge discovery projects. Experts estimate that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. Several authors consider data cleaning one of the most cumbersome and critical of these tasks. Failure to ensure high data quality in the preprocessing stage significantly reduces the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that guides the user on how to deal with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests the proper data cleaning approaches. We present two case studies with real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms that the authors of PAM and OD used in their classification tasks. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). Overall, 84% of the results achieved by the models trained on the datasets cleaned by DQF4CT are better than those of the models of the datasets' authors.
Keywords: DQF4CT; data quality issue; classification task; conceptual framework; data cleaning ontology
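
The comparison summarized in the abstract (same algorithm, original data versus cleaned data) can be sketched as a simple tally. The following is a minimal illustration, not the authors' evaluation code: the function name `share_improved`, the dataset and classifier labels, and all score values are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, Tuple

# A score table keyed by (dataset, classifier). Which metric the scores
# represent (accuracy, F-measure, ...) is left open here as an assumption.
Scores = Dict[Tuple[str, str], float]


def share_improved(baseline: Scores, cleaned: Scores) -> float:
    """Fraction of shared (dataset, classifier) pairs where the model trained
    on cleaned data scores strictly higher than the baseline model."""
    shared = baseline.keys() & cleaned.keys()
    if not shared:
        raise ValueError("no (dataset, classifier) pairs in common")
    wins = sum(1 for key in shared if cleaned[key] > baseline[key])
    return wins / len(shared)


# Hypothetical toy values for illustration only (NOT taken from the paper).
toy_baseline: Scores = {
    ("toy_a", "tree"): 0.80,
    ("toy_a", "knn"): 0.75,
    ("toy_b", "tree"): 0.90,
    ("toy_b", "knn"): 0.85,
}
toy_cleaned: Scores = {
    ("toy_a", "tree"): 0.85,  # improved
    ("toy_a", "knn"): 0.75,   # tie, not counted as improved
    ("toy_b", "tree"): 0.92,  # improved
    ("toy_b", "knn"): 0.80,   # worse
}

print(share_improved(toy_baseline, toy_cleaned))
```

Ties are counted as "not better" in this sketch; whether the paper treats ties the same way is not stated in the abstract.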

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Corrales, D.C.; Ledezma, A.; Corrales, J.C. From Theory to Practice: A Data Quality Framework for Classification Tasks. Symmetry 2018, 10, 248.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Symmetry, EISSN 2073-8994, published by MDPI AG, Basel, Switzerland.