From Theory to Practice: A Data Quality Framework for Classiﬁcation Tasks

Data preprocessing is an essential step in knowledge discovery projects. Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process, and several authors consider data cleaning one of the most cumbersome and critical of these tasks. Failure to ensure high data quality in the preprocessing stage significantly reduces the accuracy of any data analytics project. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that provides the user with guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. We present two case studies on real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same classification algorithms used by the authors of PAM and OD. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). In 84% of the cases, the models trained on the datasets cleaned by DQF4CT achieved better results than the models trained on the datasets as prepared by their authors.


Introduction
The digital information era is an inevitable trend. Recently, advances in Information Technologies (Telecommunications, smartphone applications, Internet of Things, etc.) have generated a deluge of digital data [1,2].
The IT divisions of enterprises are now centered on taking advantage of this significant amount of data to extract useful knowledge and support decision-making [3,4]. These benefits facilitate the growth of an organization's locations, strategies, and customer base. Decision-makers can use the more readily available data to maximize customer satisfaction and profits and to predict potential opportunities and risks. In all cases, to achieve this, data quality (DQ) must be guaranteed. Data quality is directly related to the perceived or established purposes of the data: high-quality data meet expectations to a greater extent than low-quality data [5].
Thus, to guarantee data quality (DQ) before extracting knowledge from data, a preprocessing phase must be considered. Experts affirm that the preprocessing phase takes between 50% and 70% of the total time of the knowledge discovery process [6].
Several data mining tools based on methodologies such as Knowledge Discovery in Databases (KDD) [7] or the Cross Industry Standard Process for Data Mining (CRISP-DM) [8] offer algorithms for data preprocessing: graphical environments such as the Waikato Environment for Knowledge Analysis (WEKA) [9], RapidMiner [10], and KNIME [11], and mathematical scripting tools such as MATLAB [12], R [13], and Octave [14]. However, these data mining tools do not offer a standard guided process for data cleaning [15].
To tackle the aforementioned problems, we propose the Data Quality Framework for Classification Tasks (DQF4CT), composed of (i) a conceptual framework to provide the user with guidance on how to deal with data quality issues in classification tasks and (ii) an ontology that represents the knowledge in data cleaning and suggests suitable data cleaning approaches. The rest of the paper is organized as follows: Section 2 discusses several definitions of DQ frameworks and ontologies. The related works are presented in Section 3. DQF4CT is explained in Section 4; Section 5 presents the results, and Section 6 provides conclusions and future work.

Background
In this section, we briefly review the concepts that were employed for building DQF4CT.

Data Quality Framework
Data are representations of the perception of the real world and the basis of information and digital knowledge [16]. In this context, in the data quality field, two factors define this perception and the consumers' needs: how well the data meet the expectations of data consumers [17] and how well they represent the objects, events, and concepts of the real world. To measure whether data meet expectations or are "fit for use", both aspects need to be defined through metrics such as consistency, completeness, etc. [5].
For ensuring data quality in Data Management Systems (DMS), we need to consider two relevant aspects: the actual quality of the data source and the expected quality by the users [18].
Data Quality Frameworks (DQF) are used for assessing, analyzing, and cleaning poor-quality data in DMS. These DMS are relevant because they drive tasks or processes within an organization that affect profitability [19,20]. The structure of a DQF can go beyond the individual elements of quality assessment; a DQF must provide a general scheme to analyze and solve data quality problems [21].
A number of works built data quality frameworks for relational databases. The work of [25] focuses on assessing integrity constraints. In the case of [18], the authors proposed an extension of the Common Warehouse Metamodel that stores cleansing methods for eliminating duplicates, handling inconsistencies, and managing imprecise data, missing data, and data freshness. In [26], the authors offer a data cleansing process: data transformation, duplicate elimination, and data fusion. DQ2S is a framework for data profiling [27]. The authors in [28] built a framework for the management of an enterprise data warehouse based on an object-oriented data quality model, using the dimensions: relevance, consistency, currency, usability, correctness, and completeness. The work in [29] proposes a big data preprocessing quality framework that deals with data quality issues such as data type, data format, and domain.
Data quality frameworks have also been built for health systems. For instance, the authors of [30] proposed a DQ framework for matching records from multiple sources of electronic medical record data. The authors in [31] proposed a framework for cloud-based health care systems, whose aim is to gather electronic health records from different sources. The work of [32] proposed a framework of procedures for data quality assurance in medical registries; it addresses data quality problems such as illegible handwriting, incompleteness, and unsuitable data formats.
Other researchers designed conceptual data quality frameworks. The work presented in [36] develops a framework as a basis for organizational databases, considering domain information such as operation and assurance costs and personnel management. The authors of [33] monitored the content of an e-government meta-data repository using syntactic, semantic, and pragmatic data quality metrics. Similarly, the authors of [34] designed a framework for government data based on three DQ issues: missing values, lack of meta-data, and timeliness. The authors of [35] identified relationships among four data quality dimensions: consistency, timeliness, accuracy, and completeness; a qualitative approach was conducted through 37 surveys, and factor analysis and the Cronbach-alpha test were applied to interpret the results. A data quality framework for managing resources in an Enterprise Service Bus (ESB) is built in [38]. The framework measures the quality of data coming from different sensors and selects the most suitable among all available data sources with respect to the data quality metrics: accuracy, trueness, completeness, timeliness, and consistency.
Finally, the authors in [15] built a conceptual framework based on data quality issues mentioned in data mining methodologies such as CRISP-DM [8], SEMMA [39], KDD [7] and the Data Science Process [40]. Subsequently, the same authors [37] designed a data cleaning process in regression models.
We observed a large diversity of data quality frameworks in the literature, designed mainly for health systems, data warehouses, relational databases, and enterprise service buses; however, these related works do not concentrate on classification models. In addition, these works lack:
• A user-oriented process to address DQ issues: high dimensionality, imbalanced classes, outliers, duplicate instances, mislabeled instances, and missing values.
• Recommendations of suitable data cleaning algorithms/approaches to address data quality issues.

Data Cleaning Ontologies
Similarly to the literature review presented in Section 3.1, we found data cleaning ontologies for relational databases, health systems, and other application domains. Table 2 shows the data quality issues addressed by the ontologies.
Different data cleaning ontologies were found in the literature. The authors of [41] designed an ontology that selects data cleaning algorithms with respect to the user's goal; the selected algorithm is applied to the database based on the results of queries on the ontology. The authors of [42] designed a model to represent data cleaning operations, enabling their reuse in different databases; the model is composed of an orthogonal cleaning ontology and domain ontologies. Rule mining for automatic ontology-based data cleaning is proposed in [43]. It consists of checking tuples for correctness: when invalid tuples are detected, they are modified using valid tuples stored in the ontology. The work in [44] presents a method for dealing with semantic heterogeneity, i.e., the difference in terminologies across distinct data sources, during the process of data cleaning. It relies on linguistic knowledge provided by a domain ontology to generate correspondence assertions between tuples; these assertions are used during the integration of the data. Other data cleaning ontologies have been used to support health systems. In [45], the health care data quality literature was mined for the important terms used to describe the ontology, and four high-level data quality dimensions were defined: Correctness, Consistency, Completeness, and Currency. In [46], an ontology for patient clinical records was built to assess uniqueness, existence, and consistency; it relies on a domain ontology to analyze relations such as the fact that a doctor cannot treat himself as a patient.
In other domains, data cleaning ontologies have also been used for the construction of reservoir models [48], the selection of features in datasets related to cancer [47], and the preparation of genotype-phenotype relationships in a familial hypercholesterolemia dataset [49].
The related works presented above apply data cleaning ontologies to databases without focusing on the data quality issues of classification tasks, while the remaining research is focused on solving data quality issues in a specific domain. Thus, in Section 4.2, we propose an ontology for data cleaning in classification tasks.

Data Quality Framework for Classification Tasks
In this section, we describe the proposed data quality framework (DQF4CT). Our approach is defined by (i) a conceptual framework to provide the user with guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. Below, we describe each of these components.

Conceptual Framework
We built the conceptual framework to address poor-quality data in the classification tasks of data mining and machine learning projects from epistemological and methodological perspectives, taking into account the philosophy, definitions, and procedures proposed by [50]. The epistemological concept defines "how things are" and "how things work" in an assumed reality, and the methodological concept exposes the process of building the conceptual framework and assessing what it can tell us about the "real" world [51]. Next, each phase of the procedure is explained.
To construct the conceptual framework, which addresses poor-quality data in classification tasks, we adapted the methodology of [50], following the next steps:

Mapping the Selected Data Sources
The primary goal of this step is to identify the data quality issues present in classification tasks. This process includes reviewing several types of sources, such as research papers, standards, and methodologies. From data mining, we found four relevant methodologies. Table 3 shows the data quality issues found in these data mining methodologies: high dimensionality, duplicate instances, outliers, missing values, and noise. Furthermore, we conducted a literature review [52] with the aim of finding DQ issues in classification tasks, analyzing four digital libraries: IEEE Xplore, Science Direct, Springer Link, and Google Scholar. Table 4 shows the papers found by data quality issue and digital library. Besides the data quality issues previously found (Table 3), in the literature review (Table 4) we identified new data quality issues: inconsistency, redundancy (which refers to high dimensionality and duplicate instances), amount of data (imbalanced class), heterogeneity, and timeliness. Below, we present the definitions of the DQ issues found for classification tasks:
• Noise: defined by [53] as errors contained in the data. Datasets with a large amount of noise can have a detrimental impact on the success of a classification task, e.g., reducing the predictive ability of a classifier [54].

• Missing values: refers to missing values of an attribute, typically due to faults in the data collection process, e.g., data transfer problems, sensor faults, incompleteness in surveys, etc. [55].
• Outliers: an observation distant from the other observations [56]. An outlier can be an inconsistent value or abnormal behavior of the measured variable [57,58].
• High dimensionality: refers to datasets with a large number of variables. The variables can be categorized as relevant, irrelevant, or redundant [59]. In the presence of a large number of features, a learning model tends to overfit, degrading its performance [60].
• Imbalanced class: a dataset exhibits an unequal distribution between its classes [61]. When a dataset is imbalanced, the approximation of the misclassification rate used in learning systems can decrease the accuracy and the quality of learning [62].
• Inconsistency: refers to duplicate instances with different class labels [63].
• Redundancy: in classification tasks, this refers to duplicate records [63].
• Amount of data: the total data available for training a classifier [63]; this DQ issue is highly related to high dimensionality. Large datasets with a high number of features can exhibit high dimensionality, while small datasets can yield inaccurate models.
• Heterogeneity: data incompatibility of a variable, which occurs when data from different sources are joined [64].
• Timeliness: the degree to which data represent the real world at a required point in time [5,65,66].

Identifying and Categorizing Components
In this phase, we organize and categorize the DQ issues with respect to their meaning. Thus, we concluded the following:
• Inconsistency, Redundancy, and Timeliness were renamed as Mislabelled class, Duplicate instances, and Data obsolescence, respectively.
• According to the definition of Noise as "irrelevant or meaningless data", we considered Missing values, Outliers, High dimensionality, Imbalanced class, Mislabelled class, and Duplicate instances as kinds of Noise.
• We redefined Amount of data as a lack of information due to a poor data collection process.
• Amount of data, Heterogeneity, and Data obsolescence are issues of the data collection process. Therefore, these data quality issues were classified in a new category called Provenance, defined by the Oxford English Dictionary as the fact of coming from some particular source or quarter; origin or derivation.
Figure 1 presents the classification of data quality issues. The conceptual framework is focused on solving Noise problems in the data.

Integrating Components
In this step, we present the data cleaning tasks that address the Noise issues (Table 5), and we subsequently analyze how to integrate them. The data cleaning tasks are explained next:
• Imputation: fills missing data with synthetic values. Different approaches are defined for imputing missing values: (i) Deletion: removes all instances with missing values [67]; (ii) Hot deck: missing data are filled with values from the same dataset [68]; (iii) Imputation based on missing attributes: computes a new value from measures of central tendency such as the median, mode, mean, etc., and uses it to fill the missing data; (iv) Imputation based on non-missing attributes: a classification or regression model is built from the available data to fill the missing values [69].
• Outlier detection: selects candidate outliers using algorithms for high-dimensional spaces, such as the Angle-Based Outlier Degree (ABOD), or density-based algorithms, such as the Local Outlier Factor (LOF) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [70-72].
• Dimensionality reduction: selects a subset of relevant attributes to represent the dataset [73] based on attribute importance [59,74]. Three dimensionality reduction approaches are defined: (i) Filter: computes correlation coefficients between the features and the class, then selects the features with the highest correlation [74]; (ii) Wrapper: builds models with all combinations of features and selects the subset of features based on model performance [75]; (iii) Embedded: incorporates feature selection into the training process, reducing the computation time that wrapper methods spend reclassifying different subsets [76,77].
• Balanced classes: distributes instances equitably per class, through two approaches: (i) oversampling: interpolates instances between two examples from the minority class [78]; (ii) undersampling: eliminates instances from the majority class [79].
• Label correction: identifies instances with the same attribute values; if their classes differ, the label is corrected or the instance is removed [80].
• Remove duplicate instances: deletes duplicate records from the dataset [81].
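As an illustration of the filter approach described above, the following sketch ranks features by their absolute Pearson correlation with the class and keeps the top-k. This is a minimal example, assuming pandas; the function name `filter_select` and the toy dataset are our own.

```python
import numpy as np
import pandas as pd

def filter_select(df, target, k=2):
    """Filter approach: rank features by absolute Pearson correlation
    with the class and keep the k highest-ranked ones."""
    feats = [c for c in df.columns if c != target]
    scores = {c: abs(df[c].corr(df[target])) for c in feats}
    return sorted(feats, key=lambda c: scores[c], reverse=True)[:k]

# Toy dataset: x1 tracks the class, x3 is anti-correlated, x2 is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
df = pd.DataFrame({
    "x1": y + rng.normal(0, 0.1, 100),
    "x2": rng.normal(0, 1, 100),
    "x3": -y + rng.normal(0, 0.1, 100),
    "y": y,
})
print(filter_select(df, "y", k=2))  # x1 and x3 are selected
```

Note that the absolute value keeps strongly anti-correlated features such as x3, which are just as informative for a classifier as positively correlated ones.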
Once the data cleaning tasks are defined, we integrate them as depicted in Figure 2. Thus, a user of the conceptual framework follows these steps:
A. Check for missing values in the dataset.
B. If missing values occur, apply the imputation task. The new data must be analyzed, because the imputation methods can create outliers.
C. Subsequently, apply an outlier detection algorithm with the aim of finding candidate outliers present in the raw dataset or generated in the previous step.
D. Apply the label correction algorithm to look for mislabelled instances present in the raw dataset or generated by the imputation methods.
E. Verify whether the dataset is imbalanced. The Imbalance Ratio (IR) is used to measure the distribution of a binary class:

IR = C+ / C-,

where C+ represents the size of the majority class and C- the size of the minority class. A dataset with IR = 1 is perfectly balanced, while datasets with a higher IR are more imbalanced [82].
In case the class has more than two labels, the Normalized Entropy is used [83]. This measure indicates the degree of uniformity of the distribution of class labels. The entropy of the class is denoted by

H(class) = - Σ_{i=1}^{n} q_i log2(q_i),

where q_i = p(class = x_i) is the probability that the class assumes the ith value x_i, for i = 1, ..., n.
Assuming that each label of the class has the same probability of appearing, the theoretical maximum value for the entropy of the class is log2(n). Thus, the normalized entropy can be computed as:

H_norm(class) = H(class) / log2(n) = - (1 / log2(n)) Σ_{i=1}^{n} q_i log2(q_i).

The class is balanced when the normalized entropy is close to 1.
F. If the dataset is imbalanced, use an algorithm for balancing the classes: oversampling creates synthetic instances in the minority class, while undersampling removes instances from the majority class.
G. Remove duplicate instances present in the raw dataset or generated by the previous data cleaning tasks.
H. Finally, apply the dimensionality reduction algorithms to reduce the dimensionality of the dataset.
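The two imbalance measures used in step E can be sketched as follows; the function names are ours, and the formulas follow the definitions above (IR as the majority-to-minority size ratio, and entropy normalized by log2 of the number of labels):

```python
import math
from collections import Counter

def imbalance_ratio(labels):
    """IR = |C+| / |C-|: majority class size over minority class size."""
    counts = Counter(labels).most_common()
    return counts[0][1] / counts[-1][1]

def normalized_entropy(labels):
    """H(class) / log2(n): equals 1 for a perfectly uniform label
    distribution, and approaches 0 as one label dominates."""
    counts = Counter(labels)
    n, total = len(counts), len(labels)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(n)

print(imbalance_ratio([0] * 90 + [1] * 10))         # 9.0
print(normalized_entropy(["a"] * 50 + ["b"] * 50))  # 1.0
```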
Our conceptual framework guides the user in solving data quality issues in classification tasks. In Section 4.2, we propose an ontology for representing knowledge in data cleaning and recommending suitable data cleaning approaches.

Validating the Conceptual Framework
We evaluated DQF4CT on UCI datasets [84]. The datasets cleaned by DQF4CT were used to train the classifiers proposed by the authors of the UCI datasets. Subsequently, we compared the Precision and Area Under the Curve of the classifiers built by the datasets' authors against those of the classifiers trained on the datasets processed by DQF4CT. Additionally, we show the data cleaning process of DQF4CT on the following datasets:
• the dataset for prediction of occupancy in an office room [85],
• the dataset for physical activity monitoring [86].
In Section 5, we will explain in detail the validation of the conceptual framework.

Data Cleaning Ontology
In this subsection, we describe the Data Cleaning Ontology (DCO), which represents the knowledge in data cleaning and the rules to solve the data quality issues. From the analysis presented by [87], which compares six methodologies for building ontologies through the criteria of level of detail and associated software application, we selected METHONTOLOGY [88] as the methodology to create DCO. METHONTOLOGY defines five phases: glossary of terms, concept taxonomies, ad hoc binary relation diagrams, concept dictionary, and rules. Next, we describe the way DCO was created following these phases.

Build Glossary of Terms
In this step, we identify the set of terms to be included in the Data Cleaning Ontology (their natural language definitions, synonyms, and acronyms). Table 6 presents the main classes considered. We defined 23 subclasses of the classes given in Table 6. The taxonomies of the classes are shown below.

Build Concept Taxonomies
This task involves building concept taxonomies from a glossary of terms. We defined two general taxonomies from Attribute and Data Cleaning Task classes.
The class Attribute has two subclasses, as shown in Figure 3: Numeric (an attribute with continuous values) and Nominal (an attribute with discrete values). Each subclass of Data Cleaning Task has several techniques to solve the identified problem. According to Section 4.1.4, the methods used by the subclasses of the Data Cleaning Task taxonomy are presented hereunder.

• Imputation is resolved through the approaches Imputation Based on Non-Missing Attributes, Deletion, Hot Deck Imputation, and Imputation Based on Missing Attributes. Figure 5 shows these approaches as subclasses of the Imputation task.
• Outliers Detection takes into account techniques based on Density or on High-Dimensional spaces (see Figure 6).
• Figure 7 shows the approaches to Balanced Classes: Over Sampling and Under Sampling.
• Label Correction is addressed in two ways, through approaches based on a Threshold or on Classification algorithms (Figure 8).
• Approaches such as Embedded, Filter, Projection, and Wrapper are used for Dimensionality Reduction (Figure 9).
• Up to now, we have not found a classification of techniques for Removing Duplicate Instances.

Build Ad Hoc Binary Relation Diagrams
In this task, we establish ad hoc relationships between concepts of the taxonomies.

Build Concept Dictionary
This task builds a description of the classes, including their instances and their attributes.
The Dataset class presents information related to the number of classes, instances, and attributes, the imbalance ratio of the classes, the percentage of missing values (mv_PP), the instances with a percentage of missing values greater than or equal to 65% (ins_mv_65PP), and whether the dataset is ordered. The attributes model dependency and fast assessment describe the user preferences for building a classifier. Dataset contains 35 instances; these instances correspond to the UCI datasets [84] selected to evaluate the conceptual framework.
The Nominal class has the number of discrete values or labels of an attribute, while the Numeric class has statistical information: mean, median, minimum, maximum, standard deviation, 1st_quartile, and 3rd_quartile. Their instances correspond to the attributes of each dataset, from ds1_att_i to ds9_att_i, where i is the ith attribute of the dataset.
The Data Quality Issue class is composed of the instances: missing values, outliers, imbalanced class, mislabeled class, duplicate instances, and high-dimensional spaces. Table 7 shows the instances of the Imputation approaches. Thus, Deletion is represented by the algorithms list wise deletion and pair wise deletion; Hot Deck Imputation by last observation carried forward; Imputation Based on Missing Attributes by the measures mean, median, and mode; and Imputation Based on Non-Missing Attributes by linear, logistic, and Bayesian regression models.
Table 7. Concept dictionary of: Imputation, Deletion, Hot Deck Imputation, Imputation Based on Missing Attributes, and Imputation Based on Non-Missing Attributes.

Class Name: Instances
Imputation: -
Deletion: list wise deletion, pair wise deletion
Hot Deck Imputation: last observation carried forward
Imputation Based on Missing Attributes: mean, median, mode
Imputation Based on Non-Missing Attributes: Bayesian linear regression, linear regression, logistic regression

Table 8 gathers the algorithms for Outliers Detection. Density-based spatial clustering of applications with noise (dbscan), local outlier factor, and ordering points to identify the clustering structure (optics) are algorithms based on Density. In High-Dimensional spaces, algorithms such as angle based outlier degree, grid based subspace outlier, and sub space outlier degree are used.

Class Name: Instances
Outliers Detection: -
Density: dbscan, local outlier factor, optics
High Dimensional: angle based outlier degree, grid based subspace outlier, sub space outlier degree
Removing of Duplicate Instances: -

Table 9 encompasses the instances of the Balanced Classes and Label Correction approaches. Random over sampling and smote are algorithms of Over Sampling, while condensed nearest neighbor rule, edited nearest neighbor rule, neighborhood cleaning rule, one side selection, random under sampling, and tomek link belong to the Under Sampling approach. In Label Correction, Classification algorithms such as c4.5, k nearest neighbor, and support vector machine, and Threshold algorithms such as entropy conditional distribution and least complex correct hypothesis, are commonly used. Table 10 contains the Filter, Projection, and Wrapper algorithms for Dimensionality Reduction. Measures such as the chi-squared test, gain ratio, information gain, Pearson correlation, and Spearman correlation belong to the Filter approach. Principal component analysis is an algorithm based on Projection, while sequential backward elimination and sequential forward selection are algorithms based on the Wrapper approach.
We used the Semantic Web Rule Language (SWRL) to create the rules of the Data Cleaning Ontology. SWRL is a proposal to combine the Web Ontology Language (OWL) and RuleML. The rules are expressed in terms of OWL concepts (classes, attributes, instances) and saved as part of the ontology. SWRL includes a high-level abstract syntax for Horn-like rules [89]. The rules have the form antecedent =⇒ consequent, where the antecedent and consequent are conjunctions of atoms a_1 ∧ ... ∧ a_n and functions f_1(?a_1, ?a_2) ∧ ... ∧ f_n(?a_n). Variables are represented by a question mark (e.g., ?a_1).
We built thirty rules for addressing the data quality issues. The rules were constructed based on literature reviews of data cleaning tasks [15,52,90-94]. The most representative rules are explained below. For example, the rules for selecting a suitable data cleaning approach for dimensionality reduction were built based on the work proposed by [93]. These authors proposed three scenarios for the use of dimensionality reduction approaches, as shown in Table 11.
Table 11. Scenarios for the use of dimensionality reduction approaches.

Scenario: Method
The data analyst has defined the learning algorithm to use in the classification task and works with high computational resources: Wrapper.
The data analyst has defined the learning algorithm to use in the classification task, but the computational resources are limited: Embedded.
The data analyst has not defined the learning algorithm to use in the classification task and works with low computational resources: Filter.

Thus, the rules for Dimensionality Reduction are defined based on two criteria [59]: whether the learning algorithm to use in the classification task has been defined (learning algorithm defined = Yes: 1 or No: 0), and the computational resources available for processing the dimensionality reduction algorithms (computational resources = High: 2, Limited: 1, or Low: 0); the Wrapper rule, for example, fires when the learning algorithm is defined and the computational resources are high. In the case of the Imputation task, the Deletion rules were defined based on our experience and knowledge. For example, Deletion is applied on a Dataset when the instances with missing values greater than or equal to 65% (ins_mv_65PP) represent at most 10% of the instances of the dataset:

Dataset(?a) ∧ datasetHasDQIssue(?a, missingValues) ∧ Deletion(?b) ∧ ins_mv_65PP(?a, ?c) ∧ swrlb:greaterThan(?c, 0) ∧ swrlb:lessThanOrEqual(?c, 10) =⇒ datasetUsesDCAlgorithm(?a, ?b).
In addition, the Deletion approach is applied on an Attribute when its missing values are greater than 50% (att_mv_50PP).
The proportion of missing data is directly related to the quality of statistical inferences. There is no established cutoff in the literature regarding an acceptable percentage of missing data in a dataset for valid statistical inferences [95]. The thresholds of the Imputation rules were defined based on our knowledge and the assumption of [96], which asserts that a missing rate of 5% or less is inconsequential.
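Outside the ontology, the two Deletion rules above can be sketched as a plain pandas routine; the function name and the pandas encoding are ours, while the thresholds (instances with at least 65% missing values are dropped only when they are at most 10% of the dataset; attributes with more than 50% missing values are dropped) are the ones stated in the rules:

```python
import numpy as np
import pandas as pd

def deletion_rules(df, row_mv_threshold=0.65, row_share_limit=0.10,
                   col_mv_threshold=0.50):
    """Sketch of the Deletion rules: drop instances whose fraction of
    missing values is >= row_mv_threshold, but only when such instances
    are at most row_share_limit of the dataset; then drop attributes
    whose fraction of missing values exceeds col_mv_threshold."""
    row_mv = df.isna().mean(axis=1)
    bad_rows = row_mv >= row_mv_threshold
    if 0 < bad_rows.mean() <= row_share_limit:
        df = df.loc[~bad_rows]
    col_mv = df.isna().mean()
    return df.loc[:, col_mv <= col_mv_threshold]

# Toy dataset: row 0 is entirely missing, column "c" is all missing.
df = pd.DataFrame({"a": np.arange(10.0), "b": np.ones(10), "c": [np.nan] * 10})
df.loc[0, ["a", "b"]] = np.nan
cleaned = deletion_rules(df)
print(cleaned.shape)  # (9, 2): row 0 and column "c" are removed
```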

Evaluation
This section shows the evaluation of DQF4CT. The basic idea is to apply DQF4CT to datasets used by research works whose primary goal is a classification task. We took the datasets of [86] (physical activity monitoring) and [85] (prediction of occupancy in an office room) and applied DQF4CT to them (i.e., to the datasets without preprocessing). Subsequently, the datasets cleaned by DQF4CT were used to train the classifiers proposed by [85,86]. Finally, we compare the accuracy of the models created by the datasets' authors against that of the models trained on the datasets processed by DQF4CT. This section is organized as follows: in Section 5.1, we present the experimental datasets; in Section 5.2, we process the datasets using DQF4CT; Section 5.3 exposes the accuracy achieved by the classifiers (those trained on the original datasets versus those trained on the datasets cleaned by DQF4CT).

Physical Activity Monitoring
We used nine datasets for physical activity monitoring [86], each representing one subject. The entire dataset contains 54 attributes and 2,871,916 instances related to sensor measurements (located at the chest, hand, and ankle). The class has 12 labels: walking, running, rope jumping, vacuum cleaning, ironing, standing, sitting, nordic walking, lying, cycling, and ascending and descending stairs. Table 12 shows the instances by subject.

Occupancy Detection of an Office Room
The authors in [85] proposed a dataset for the prediction of occupancy in an office room using six variables: temperature, humidity, light, humidity ratio, CO2, and the class occupancy status (0 for non-occupied, 1 for occupied). Three datasets were used, one for training (8143 instances) and two for testing the models (Test 1: 2665 instances and Test 2: 9752 instances).

Evaluation Process
We applied DQF4CT to each real dataset. First, the conceptual framework is applied. Subsequently, the Data Cleaning Ontology is used in each cleaning task of the conceptual framework to select the suitable approach. The selection of the data cleaning algorithm is based on expert knowledge. Figure 11 presents the process suggested by our approach for cleaning the physical activity monitoring (PAM) dataset. The process is explained in detail below.
Figure 11. Data cleaning process for the physical activity monitoring dataset.
Imputation: First, we observed how the missing values are distributed in the dataset. Figure 12 illustrates the frequencies of the missing data patterns: the magenta color shows missing values, the blue color shows non-missing data, and each row represents a missing data pattern. For example, the first row (bottom up) indicates that heart_rate has 0.9% missing values while the remaining attributes have data. In the sixth row, the attributes temp hand, X3D accel hand, scale hand, resolution hand, X3D accel hand 2, scale hand 2, resolution hand 2, X3D giro hand 1, X3D giro hand 2, X3D giro hand 3, X3D magno hand 1, X3D magno hand 2, X3D magno hand 3, orienta hand 1, orienta hand 2, orienta hand 3, and orienta hand 4 have 0.004% missing values, while the remaining attributes have data. The missing data for each subject are shown in Table 13. The datasets have around 1.83% to 2.10% missing values; heart_rate has the highest proportion of missing data (greater than 90%). Accordingly, the Data Cleaning Ontology suggests:

Physical Activity Monitoring
• Use Deletion approach to remove heart_rate attribute and 34 instances. We used List Wise Deletion.

•
Use Imputation Based on Non-Missing Attributes on the dataset. We imputed each subject dataset with Linear and Bayesian regression.
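As a minimal sketch of Imputation Based on Non-Missing Attributes, the fragment below fills a missing reading of one attribute from a correlated, fully observed attribute via simple least-squares regression. The attribute names and values are illustrative, not taken from the PAM dataset, and the Linear and Bayesian regression imputation used in the paper is richer than this single-predictor fit:

```python
def linear_impute(x, y):
    """Fill None entries of y using a y ~ a*x + b least-squares fit
    computed on the complete (x, y) pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mx = sum(xi for xi, _ in pairs) / n
    my = sum(yi for _, yi in pairs) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    a = sxy / sxx
    b = my - a * mx
    return [yi if yi is not None else a * xi + b for xi, yi in zip(x, y)]

# Hypothetical sensor columns: impute the missing accelerometer reading
# from the (correlated) hand-temperature attribute.
temp_hand = [20.0, 21.0, 22.0, 23.0, 24.0]
accel_hand = [10.0, 12.0, None, 16.0, 18.0]
print(linear_impute(temp_hand, accel_hand))  # → [10.0, 12.0, 14.0, 16.0, 18.0]
```

In practice, each attribute with missing values would be regressed on the full set of non-missing attributes rather than a single predictor.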
Outliers Detection: once the values are imputed, the outlier detection task is applied to find erroneous imputations. The Data Cleaning Ontology recommends using Density algorithms; we used the Local Outlier Factor (LOF). The lower and upper limits are calculated from Tukey fences [97]: potential outliers are values below Q1 − 1.5(Q3 − Q1) (lower limit) or above Q3 + 1.5(Q3 − Q1) (upper limit), where Q1 and Q3 are the first and third quartiles. Table 14 shows the potential outliers for each subject, i.e., the instances with a Local Outlier Factor below the lower limit or above the upper limit. In Figure 13, the whiskers of the box plots represent the Tukey fences of the Local Outlier Factor. The candidate outliers presented in Table 14 were removed, since they are considered noise introduced by the imputation task.
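The Tukey-fence rule applied to the LOF scores can be sketched as follows; the scores below are made up, and a real run would use the LOF values computed per subject:

```python
def tukey_fences(scores, k=1.5):
    """Return (lower, upper) Tukey fences Q1 - k*IQR and Q3 + k*IQR,
    with quartiles computed by linear interpolation."""
    s = sorted(scores)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical LOF scores: values near 1 are inliers, 3.2 stands out.
lof = [1.0, 1.02, 0.98, 1.05, 1.01, 3.2, 0.99, 1.03]
lower, upper = tukey_fences(lof)
outliers = [v for v in lof if v < lower or v > upper]
print(outliers)  # → [3.2]
```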

Label Correction:
To correct the labels of the classes, we used Contradictory Instances Detection. The dataset has no contradictory instances.
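Contradictory Instances Detection reduces to flagging feature vectors that appear with more than one class label. A minimal sketch with toy instances:

```python
from collections import defaultdict

def contradictory_instances(rows):
    """rows: iterable of (features_tuple, label). Returns the feature
    vectors that occur with more than one distinct label."""
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[features].add(label)
    return [f for f, labels in labels_by_features.items() if len(labels) > 1]

data = [
    ((1.0, 2.0), "walking"),
    ((1.0, 2.0), "running"),  # same features, different label
    ((3.0, 4.0), "walking"),
]
print(contradictory_instances(data))  # → [(1.0, 2.0)]
```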

Balanced Classes:
According to Figure 11, we applied the balanced-classes task to each subject. The Data Cleaning Ontology recommends the oversampling approach; we used the Synthetic Minority Over-sampling Technique (Smote). Since the dataset has 12 classes, we first identified the majority class and the minority classes, and then applied Smote to each minority class whose imbalance ratio satisfied 2 < IR < 10. Figure 14 shows the instance distribution per class for all subjects. Purple bars represent the imbalanced dataset, and blue bars the balanced dataset using Smote.
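The IR check and the oversampling step can be sketched as below. This is a deliberately simplified 1-nearest-neighbour variant of Smote (the standard algorithm interpolates toward one of the k nearest minority neighbours), and the class data are made up:

```python
import random

def imbalance_ratio(majority, minority):
    """IR: size of the majority class over the size of the minority class."""
    return len(majority) / len(minority)

def smote_1nn(minority, n_new, seed=0):
    """Generate n_new synthetic minority points, each interpolated between
    a random base point and its nearest minority-class neighbour."""
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min((p for p in minority if p is not base),
                        key=lambda p: sqdist(base, p))
        gap = rng.random()  # random position along the segment
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

majority = [(0.0, 0.0)] * 9
minority = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
ir = imbalance_ratio(majority, minority)
if 2 < ir < 10:  # oversample only in this imbalance band
    minority = minority + smote_1nn(minority, n_new=6)
print(ir, len(minority))  # → 3.0 9
```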

Remove Duplicate Instances:
To detect duplicate instances, we used Standard Duplicate Elimination. In this case, the dataset has no duplicate instances.
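Standard Duplicate Elimination amounts to keeping the first occurrence of each exact instance; a sketch with toy (features, label) instances:

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each exact (features, label) instance."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [
    ((23.1, 27.2, 426.0), 1),
    ((23.1, 27.2, 426.0), 1),  # exact duplicate, dropped
    ((20.9, 25.0, 0.0), 0),
]
print(len(drop_duplicates(rows)))  # → 2
```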

Dimensionality Reduction:
We merged the nine subject datasets into one dataset and then applied the dimensionality reduction task. The Data Cleaning Ontology suggests the Filter approach, which is considered faster and has a low computational cost [93]. The absolute values of the pairwise correlations are considered: if two attributes are highly correlated, the filter algorithm looks at the mean absolute correlation of each attribute and removes the variable with the largest mean absolute correlation [94]. In this case, we used the Pearson correlation; the algorithm finds weights of continuous attributes based on their correlation with the class. Figure 15 presents the Top-15 attributes with the highest correlation.

Figure 16 presents the process suggested by our approach for cleaning the occupancy detection dataset. The process is explained in detail below.
Figure 16. Data cleaning process for the occupancy detection dataset.
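Returning to the dimensionality reduction step above, the pairwise-correlation filter can be sketched as follows. The attribute names, values, and the 0.9 threshold are illustrative assumptions; of each highly correlated pair, the attribute with the larger mean absolute correlation is dropped, as described:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_filter(columns, threshold=0.9):
    """columns: dict name -> list of values. For each pair with
    |correlation| > threshold, drop the attribute whose mean absolute
    correlation with the other attributes is larger."""
    names = list(columns)
    corr = {a: {b: abs(pearson(columns[a], columns[b]))
                for b in names if b != a} for a in names}
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr[a][b] > threshold:
                mean_a = sum(corr[a].values()) / len(corr[a])
                mean_b = sum(corr[b].values()) / len(corr[b])
                dropped.add(a if mean_a >= mean_b else b)
    return [n for n in names if n not in dropped]

cols = {
    "accel_x":  [1.0, 2.0, 3.0, 4.0],
    "accel_x2": [2.0, 4.1, 6.0, 8.2],   # near-copy of accel_x
    "temp":     [20.0, 19.0, 21.0, 20.5],
}
print(correlation_filter(cols))  # → ['accel_x2', 'temp']
```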

Outliers detection:
In the first step, we applied the outlier detection task. The Data Cleaning Ontology recommends using Density algorithms; we again used the Local Outlier Factor (LOF). The algorithm found 872 potential outliers based on Tukey fences: we considered as potential outliers the instances with a LOF below 0.808 (lower fence) or above 1.297 (upper fence). After removing the potential outliers, 1600 instances indicate that the room is occupied (Yes), and 5671 that it is non-occupied (No).

Label Correction:
To correct the labels of the classes, we used Contradictory Instances Detection. The dataset has no contradictory instances.

Balanced Classes:
The Data Cleaning Ontology recommends an oversampling approach because the imbalance ratio of the classes is 3.7, i.e., IR < 10 (rule 4.10). We used the Synthetic Minority Over-sampling Technique (Smote); 4000 instances were added to the minority class. Figure 17 shows the instance distribution per class. Purple bars represent the imbalanced dataset, and blue bars the balanced dataset using Smote.

Remove Duplicate Instances:
We used Standard Duplicate Elimination again. We removed 812 duplicate instances (809 non-occupied and three occupied).

Results
In this section, we compare the accuracy of the classifiers trained with the datasets produced by [85,86] versus the same classifiers but trained with the datasets processed by DQF4CT.

Physical Activity Monitoring
The authors of the Physical Activity Monitoring (PAM) dataset [86] used the following classifiers from the Weka toolkit: decision tree (C4.5), Boosting-C4.5 decision tree, Bagging-C4.5 decision tree, Naive Bayes, and K-nearest neighbor. We used the same experimental configuration proposed by the authors [86], based on standard 9-fold cross-validation. Table 15 shows the accuracy for the Physical Activity Monitoring (PAM) dataset.
In standard 9-fold cross-validation (Table 15), our conceptual framework obtained better accuracy for the models: decision tree (99.30), Boosted (99.99), Bagging (99.60), and K-nearest neighbor (99.97). Meanwhile, the best result for Naive Bayes (94.19) was obtained by [86] (physical activity monitoring). From the results obtained by our approach with Naive Bayes, we think that many attributes of the analyzed dataset represent similar information (e.g., two accelerometers on a wrist with three axes in two scales = 12 attributes). Moreover, as [98] points out, Naive Bayes has a systemic problem when the features are not independent.
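The k-fold evaluation protocol used above can be sketched as follows. The "classifier" here is a trivial majority-vote stand-in for C4.5, Naive Bayes, etc., and the data are made up:

```python
from collections import Counter

def k_fold_accuracy(X, y, k, fit_predict):
    """Average accuracy over k folds. fit_predict(train_X, train_y, test_X)
    must return predicted labels for test_X."""
    n = len(X)
    accs = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # simple interleaved folds
        tr_X = [X[i] for i in range(n) if i not in test_idx]
        tr_y = [y[i] for i in range(n) if i not in test_idx]
        te_X = [X[i] for i in sorted(test_idx)]
        te_y = [y[i] for i in sorted(test_idx)]
        pred = fit_predict(tr_X, tr_y, te_X)
        accs.append(sum(p == t for p, t in zip(pred, te_y)) / len(te_y))
    return sum(accs) / k

def majority_vote(tr_X, tr_y, te_X):
    """Predict the most frequent training label for every test instance."""
    label = Counter(tr_y).most_common(1)[0][0]
    return [label] * len(te_X)

X = [[float(i)] for i in range(18)]
y = ["walking"] * 12 + ["running"] * 6
print(round(k_fold_accuracy(X, y, k=3, fit_predict=majority_vote), 3))  # → 0.667
```

In the experiments, the same protocol is run with the real classifiers, and the accuracies of the authors' datasets and the DQF4CT-cleaned datasets are compared fold by fold.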

Occupancy Detection of an Office Room
The authors in [85] used the classifiers Random Forest (RF), Gradient Boosting Machines (GBM), Linear Discriminant Analysis (LDA), and Classification and Regression Trees (CART) with the caret package available in R [94]. For these classifiers, we used the same experimental configuration proposed by the authors [85]. Table 16 presents the accuracies of the mentioned models with 10-fold cross-validation (once our approach was applied), together with occupancy detection using the original attributes and the preprocessed attributes. For Test 1, once our conceptual framework was applied to the training data, the accuracies of the RF and GBM models are 0.63 and 0.98 percentage points below the best results of occupancy detection with preprocessed attributes. For the CART model, our conceptual framework has the best accuracy (97.75), while the three approaches obtain the same accuracy for LDA (97.90).
For Test 2, our conceptual framework reaches higher accuracies for the RF (99.25), GBM (96.68), and CART (98.70) models. The highest accuracy for the LDA model (99.33) is obtained by [85] with preprocessed attributes.
The best results obtained by occupancy detection with preprocessed attributes (RF and GBM in Test 1 and LDA in Test 2) may be due to two new attributes included by the authors: the number of seconds from midnight for each day, and the week status (weekend or weekday).

Comparative Study
To assess the performance of the Data Quality Framework for Classification Tasks, DQF4CT was validated with real datasets from the UCI Repository of Machine Learning Databases [84]. Tables 17 and 18 show the precision and Area Under the Curve (AUC) of the classifiers produced by the authors of the UCI datasets, compared with the same classifiers trained with the datasets processed by DQF4CT. The values underlined in Tables 17 and 18 correspond to the highest precisions and the best AUCs. Once the UCI datasets were cleaned by DQF4CT, 84% of the models achieved the highest precisions and the best AUCs compared to the models proposed by the datasets' authors. The remaining 16% correspond to the authors' models for "Anuran families calls", "Anuran species calls", "Portuguese bank telemarketing", and "Phishing detection". In the case of "Anuran families calls" and "Anuran species calls", the precision differences of the MLP generated by the authors with respect to the MLP built with the datasets processed by DQF4CT are 1.4% and 0.1%, while the precision difference for "Portuguese bank telemarketing" is 0.3%. For "Phishing detection", the Area Under the Curve of the CART model of the dataset authors covers 6.2% more than that of the CART model of DQF4CT.
DQF4CT offers a general data cleaning solution for several domains, while the preparation performed by the datasets' authors relies on previous domain knowledge and an ad hoc data cleaning process. Even so, DQF4CT in most cases matches or exceeds the results reported by the datasets' authors.

Conclusions
In this work, we proposed a framework to address DQ issues in classification tasks. DQF4CT is composed of: (i) a conceptual framework to provide the user with guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. DQF4CT supports inexperienced users both in the detection of data quality problems and in the recommendation of a suitable data cleaning approach. Additionally, DQF4CT achieves good-quality datasets without relying on domain knowledge, reaching or surpassing the accuracy results in classification tasks (84% of the models generated from datasets cleaned by DQF4CT achieve the highest precisions and the best AUCs compared to the models proposed by the datasets' authors). However, the results obtained by the LDA model in "Occupancy detection of an office room" [85], where the authors preprocessed the attributes, suggest that domain knowledge could further improve the results obtained.
We propose as future work:
• Build an integrated data quality framework for several knowledge discovery tasks, such as regression [37], clustering, and association rules. The integrated data quality framework would consider the Big Data paradigm [29] and hence huge datasets. Deletion of redundancies will play a key role in decreasing the computational complexity of the Big Data models.

• Use ontologies of several domains to improve the performance of the data cleaning algorithms, e.g., the dimensionality reduction task. We could use the ontology of cancer diagnosis developed by [47] to select a subset of relevant features in datasets related to cancer.
• Create a case-based reasoning (CBR) system for the recommendation of suitable data cleaning algorithms based on past experiences. The case representation would be based on annotations of samples, also called dataset meta-features (e.g., mean absolute skewness, mean absolute kurtosis, mutual information, etc. [119]). The meta-features gather knowledge about datasets in order to provide automatic selection, recommendation, or support for a future task [120]; in this case, the recommendation of data cleaning algorithms.
Author Contributions: This paper is the result of the PhD thesis: "Framework for Data Quality in Knowledge Discovery Tasks" by David Camilo Corrales with the support of his supervisors Agapito Ledezma and Juan Carlos Corrales.
Acknowledgments: The authors are grateful to the Control Learning Systems Optimization Group (CAOS) of the Carlos III University of Madrid and the Telematics Engineering Group (GIT) of the University of Cauca for the technical support. In addition, the authors are grateful to COLCIENCIAS for the PhD scholarship granted to David Camilo Corrales. This work has also been supported by:
• Project: "Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca", Convocatoria 03-2018 Publicación de artículos en revistas de alto impacto.
• Project: "Alternativas Innovadoras de Agricultura Inteligente para sistemas productivos agrícolas del departamento del Cauca soportado en entornos de IoT - ID 4633", financed by Convocatoria 04C-2018 "Banco de Proyectos Conjuntos UEES-Sostenibilidad" of the project "Red de formación de talento humano para la innovación social y productiva en el Departamento del Cauca InnovAcción Cauca".
• Spanish Ministry of Economy, Industry and Competitiveness (Projects TRA2015-63708-R and TRA2016-78886-C3-1-R).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
DQ      Data Quality
DQF4CT  Data Quality Framework for Classification Tasks
DCO     Data Cleaning Ontology