Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation

In banks, governments, and Internet companies, inconsistent data problems often arise when various information systems collect, process, and update data, due to human or equipment causes. Inconsistent data makes it impossible to obtain correct information from the data and reduces its availability. Such problems can be fatal in data-intensive enterprises and cause huge economic losses. Moreover, it is very difficult to clean inconsistent data in databases, especially data constrained by conditional functional dependencies with built-in predicates (CFDPs), because such data tends to contain more candidate repair values. To address the incomplete detection and difficult reparation of inconsistent data violating CFDPs in databases, we propose a dependency lifting algorithm (DLA) based on the maximum dependency set (MDS) and a reparation algorithm (C-Repair) that integrates the minimum cost idea and attribute correlation. In detection, we find recessive dependencies in the original dependency set to obtain the MDS and improve the original algorithm by dynamic domain adjustment, which extends its applicability to continuous attributes and improves the detection accuracy. In reparation, we first set up a priority queue (PQ) of elements to be repaired, based on the minimum cost idea, to select a candidate element; then we treat the corresponding conflict-free instance (Inv) as the training set to learn the correlation among attributes and compute the weighted distance (WDis) between the tuple of the candidate element and the other tuples in Inv according to that correlation; lastly, we perform reparation based on the WDis, re-computing the PQ after each reparation round to improve efficiency and marking repaired elements with a label, flag, to ensure convergence.
In contrast experiments, we compare the DLA with a CFDPs-based algorithm, and C-Repair with cost-based and interpolation-based algorithms, on a simulated instance and a real instance. The experimental results show that the DLA and C-Repair have better detection and repair ability, at a higher time cost.


Introduction
With the development of social informatization, data storage, data analysis, and aided decision-making relying on various information systems have come to occupy a very important position in the information society. In the Internet era, the data scale has expanded unceasingly due to increasing data requirements and the constant shortening of the data acquisition and updating cycle. How to solve the data quality problems that accompany big data is an urgent problem for government departments, enterprises, and institutions.
In the field of data quality, data consistency refers to the degree to which a given data set satisfies constraints, or the consistency with which the same thing is expressed across multi-source data. We hold the opinion in this paper that it is more credible to repair inconsistent data elements using the existing values in data sets under unsupervised circumstances [14] (pp. 547-554). The main idea is as follows: for a given data set and CFDPs, the maximum dependency set (MDS) is obtained by finding the recessive dependencies contained in the CFDPs, and is used to detect and locate the inconsistent elements. Dynamic domain adjustment is proposed to overcome the original algorithm's shortcomings on continuous attributes [15] (pp. 1-18). Then, the priority queue (PQ) of candidate repair values is established based on the located inconsistent elements, the data to be repaired is selected according to the minimum cost idea, and the correlation among attributes is computed through symmetric uncertainty (SU) from information theory. At last, the improved KNN algorithm is used to repair the inconsistent data. This algorithm integrates the minimum cost idea and the correlation among attributes, and can be applied to inconsistent data that violates CFDPs.

Purpose and Structure of This Paper
The purpose of this paper is to propose a detection and reparation algorithm for data that violates the CFDPs in data sets. The main contributions and innovations are: (1) We propose a heuristic algorithm for inconsistent data detection and reparation, which can repair inconsistent data violating the CFDPs in data sets under unsupervised circumstances; (2) for inconsistent data detection, the maximum dependency set is used: by finding recessive dependencies in the original dependency set, we improve the detection accuracy and extend the algorithm to continuous attributes; and (3) for inconsistent data reparation, we use unsupervised machine learning to learn the correlation among attributes in data sets, and integrate the minimum cost idea and information theory, which makes the repair results most relevant to the initial values with minimum repair times.
The rest of the paper is structured as follows: In Section 2, after formally describing inconsistency in databases, we propose an inconsistent data cleaning framework and give an example for readers to understand and reproduce. Then, we design the detection and reparation algorithms and analyze their convergence and complexity. In Section 3, to verify the effectiveness of our algorithms, we compare them with other algorithms across different data scales and inconsistency proportions, and then analyze the experimental results.
In Section 4, we analyze the advantages and disadvantages of two algorithms in detail, explain the reason why our algorithms perform better, and put forward the subsequent improvement direction. At last, in Section 5, we summarize the contributions of this paper.

Materials and Methods
In this section, we first formally describe inconsistency in databases, then propose an inconsistent data cleaning framework and design the detection and reparation algorithms according to it. In detection, we use the MDS to improve the detection accuracy and extend the algorithm to continuous attributes. In reparation, we first set up the PQ based on the minimum cost idea, then learn the correlation among attributes in the corresponding conflict-free data instance using an unsupervised machine learning method. Finally, we perform reparation according to the learned correlation. Moreover, the algorithm does not require manual intervention in the repair process and can be applied to special cases violating CFDPs.

Problem Description
Data consistency means the degree to which elements in a data instance, I, satisfy a dependency set, D. Different from the "equality constraints" of CFDs [16] (pp. 864-868), the CFDPs are special forms of dependencies containing predicates [4] (pp. 3274-3288). Intuitively, CFDPs extend the "equality constraints" of CFDs to "predicate constraints". For any tuple, t_i, t_i ∈ R, a CFDP takes a form such as "t_i[X] > a → t_i[Y] > b", and this dependency type is also common in data sets. Compared with the "equality constraints" of CFDs, "predicate constraints" are more difficult to detect and repair because they often admit more candidate values.
In semantics, the X and Y attributes of a tuple, t_i in I (t_i ∈ I), violate a given dependency, cfdP, when the X attribute of t_i satisfies the LHS (left-hand side) of the cfdP but the Y attribute violates the RHS, expressed as t_i[X] ⊨ cfdP and t_i[Y] ⊭ cfdP. When describing the inconsistent data element of the X attribute, we still express it as t_i[X] ⊭ cfdP, because the location of erroneous data is often random.
For the inconsistent data elements detected and located, the following quadruple is used to describe them in this paper:

(t_i, t_i[A_m], t_i[A_n], cfdP_k)    (1)

where t_i is the unique identification of a tuple in the data instance; t_i[A_m] and t_i[A_n] represent the m-th and n-th attributes of t_i, which violate the given dependency; and cfdP_k means the k-th dependency in the dependency set. The location problem of inconsistent data can be solved through Expression (1), which also facilitates the subsequent reparation work.
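As a concrete illustration, the quadruple of Expression (1) can be modeled as a small record type. This is a sketch for readers, not code from the paper; the type and field names are our own.

```python
from collections import namedtuple

# Illustrative encoding of the quadruple in Expression (1): the tuple id,
# the two violating attributes, and the name of the violated dependency.
Violation = namedtuple("Violation", ["tuple_id", "attr_m", "attr_n", "dependency"])

v = Violation("t2", "MSy", "DM", "cfdP2")
print(v)  # Violation(tuple_id='t2', attr_m='MSy', attr_n='DM', dependency='cfdP2')
```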
The following relation schema is given to describe the inconsistency problem in a data set: staff (id, name, age, excellent_staff, work_seniority, work_place, marry_status, monthly_salary, department). The meanings, value types, and abbreviations of each attribute in the relation schema are shown in Table 1. For convenience, subsequent attribute descriptions are all represented by these abbreviations; for example, ES and MSs indicate excellent_staff and marry_status, respectively, and their values are T and F.
Selecting a subset of tuples from the data set, a data instance of the staff relation schema is shown in Table 2. The following possible dependencies are given to describe the inconsistent data problem in Table 2:

cfdP1: t[ES] = T → t[WS] ≥ 3
cfdP2: t[DM] = manager → t[MSy] = 6300
cfdP3: t[WS] > 1 → t[AGE] ≥ 19

cfdP1 indicates that the seniority of an excellent staff member should be at least three years; cfdP2 means the basic salary of a manager is 6300 RMB; and cfdP3 shows that the age of employees whose seniority exceeds one year should be no less than 19 years old. These three dependencies are partial dependencies that may exist in the data instance according to the staff characteristics of a company, and are used to constrain the data in Table 2.
Based on the given cfdPs, the inconsistent data elements in Table 2 can be located and expressed by the quadruple in Expression (1). For example, the MSy and DM values of tuple 2 violate cfdP2, and the ES and WS values of tuple 3 violate cfdP1, which can be expressed as (t2, MSy, DM, cfdP2) and (t3, ES, WS, cfdP1), respectively. In fact, this method, which detects inconsistent data directly through the given cfdPs, is flawed. We explain these flaws in Section 2.2.1 and use the maximum dependency set for detection.
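The direct detection just described can be sketched as follows. The sample rows and the exact predicate forms of the three dependencies are assumptions paraphrased from the description above, and the output quadruples order the attributes LHS-then-RHS.

```python
# Minimal sketch of direct CFDP violation checking on the staff example.
# Rows and predicates are illustrative assumptions, not the paper's data.
staff = [
    {"id": "t2", "DM": "manager", "MSy": 5800, "ES": "F", "WS": 4, "AGE": 30},
    {"id": "t3", "ES": "T", "WS": 1, "AGE": 18, "DM": "clerk", "MSy": 3000},
]

cfdps = [
    # (name, LHS predicate, RHS predicate, LHS attr, RHS attr)
    ("cfdP1", lambda t: t["ES"] == "T", lambda t: t["WS"] >= 3, "ES", "WS"),
    ("cfdP2", lambda t: t["DM"] == "manager", lambda t: t["MSy"] == 6300, "DM", "MSy"),
    ("cfdP3", lambda t: t["WS"] > 1, lambda t: t["AGE"] >= 19, "WS", "AGE"),
]

def detect(rows, deps):
    """Return quadruples (tuple_id, lhs_attr, rhs_attr, dep_name) for violations."""
    out = []
    for t in rows:
        for name, lhs, rhs, a_m, a_n in deps:
            if lhs(t) and not rhs(t):  # LHS satisfied but RHS violated
                out.append((t["id"], a_m, a_n, name))
    return out

print(detect(staff, cfdps))
# [('t2', 'DM', 'MSy', 'cfdP2'), ('t3', 'ES', 'WS', 'cfdP1')]
```

Note that t3's AGE value slips through: cfdP3's LHS (WS > 1) is not satisfied, so direct checking never tests AGE, which is exactly the gap the MDS closes.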

Inconsistent Data Cleaning Framework
In this section, we propose a framework for inconsistent data detection and reparation. The framework takes a data instance, I, and a dependency set, D, as input, and outputs the repaired result, I', detecting and repairing the inconsistent data in I. It can be divided into two sub-modules, detection and reparation, as shown in Figure 1. In the framework, the detection module is the foundation of the repair module: the detected inconsistent data-elements set (IDS) is taken as input to the repair process, the IDS is recalculated after every reparation round, and the repair result is obtained once the IDS is empty.

In the detection module, referring to the method of obtaining the MDS proposed in [15] (pp. 1-18), we propose dynamic domain adjustment, which improves the original algorithm by setting pointers of the numerical change direction and extends the algorithm to continuous attributes. As a result, the ability to detect inconsistent data is improved. In the reparation module, based on the located inconsistent data, we first delete the tuples with inconsistent elements to get a conflict-free data instance, Inv, as the training set; at this point, there is no inconsistent data in Inv. Then, we compute the correlation among attributes in Inv by the symmetric uncertainty method from information theory, and select a candidate repair data element according to the established PQ. At last, the improved KNN algorithm is used to perform reparation.

Inconsistent Data Detection Module
The inconsistent data detection module detects inconsistent elements in a data instance by obtaining the MDS of the given cfdPs, and uses the quadruple in Expression (1) to represent and locate them. In Section 2.1, we mentioned that the algorithm that detects inconsistent data directly through the given cfdPs is flawed. We now describe these faults in detail.
Taking the staff data instance in Table 2 and the three cfdPs as an example, the inconsistent data elements in staff can be represented as (t2, MSy, DM, cfdP2) and (t3, ES, WS, cfdP1) through the three dependencies, cfdP1, cfdP2, and cfdP3. In fact, by analyzing cfdP1 and cfdP3, it is easy to derive a new cfdP, called a recessive CFDP (RCFDP):

rcfdP1: t[ES] = T → t[AGE] ≥ 19

The rcfdP1 is a new dependency obtained from cfdP1 and cfdP3, which does not exist in the original dependency set. According to rcfdP1, we find that the AGE and ES attributes of tuple 3 are also inconsistent, expressed as (t3, AGE, ES, rcfdP1); this inconsistent data element cannot be detected by the original dependency set. At this point, cfdP1, cfdP2, cfdP3, and rcfdP1 constitute the MDS of the original dependency set. In this paper, we propose a dependency lifting algorithm (DLA) based on the MDS, which discovers the rcfdPs contained in the cfdPs to obtain the MDS, and performs detection with the acquired MDS. Generally, it can be divided into three sub-stages: acquiring related dependencies, acquiring the MDS, and location and representation of the inconsistent data.
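The chaining that yields rcfdP1 can be illustrated with a small interval-containment check: cfdP1 forces WS into [3, ∞), which lies inside cfdP3's premise (1, ∞), so the two dependencies compose. The interval encoding and the `implies` helper are our own illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of deriving a recessive dependency by chaining two CFDPs.
# cfdP1: ES = T -> WS >= 3; cfdP3: WS > 1 -> AGE >= 19. Since [3, inf) is
# contained in (1, inf), chaining yields rcfdP1: ES = T -> AGE >= 19.
INF = float("inf")

def implies(range_a, range_b):
    """True if every value in range_a also lies in range_b.
    A range is (low, low_inclusive, high); upper bounds are treated as open."""
    (lo_a, inc_a, hi_a), (lo_b, inc_b, hi_b) = range_a, range_b
    lower_ok = lo_a > lo_b or (lo_a == lo_b and (inc_b or not inc_a))
    return lower_ok and hi_a <= hi_b

ws_from_cfdp1 = (3, True, INF)   # WS >= 3, guaranteed by cfdP1's RHS
ws_in_cfdp3   = (1, False, INF)  # WS > 1, required by cfdP3's LHS

if implies(ws_from_cfdp1, ws_in_cfdp3):
    print("rcfdP1: ES = T -> AGE >= 19")
```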

Acquiring Related Dependencies
The process of obtaining rcfdP1 from cfdP1, cfdP2, and cfdP3 in Table 2 is used as an example. The DLA first selects one attribute from the attribute set (Attr(R)) as the start attribute (start_attr), and then computes the rcfdPs from all cfdPs related to the start_attr. The algorithm ends when all attributes in the data instance, I, have been traversed once.
Paper [15] (pp. 1-18) proposed a method to find rcfdPs from cfdPs, which defined explicit constraint dependencies through domain knowledge and obtained the closed set mathematically [17] (pp. 69-71). This algorithm performs set operations by enumerating attribute values, which works well for discrete data whose attribute values are finite. However, it has two defects in practice: (1) because the algorithm needs to enumerate all attribute values that do not satisfy the dependencies, it consumes considerable time and space in data instances with a large number of dependencies, wasting memory; and (2) since the variance step of continuous attribute values cannot be measured, the algorithm is not suitable for data instances with continuous attributes. We propose a method that replaces the enumeration process and extends the original algorithm to continuous attributes by setting forward pointers (L) and backward pointers (U).
Selecting the ES attribute as the start_attr, the dependencies related to the start_attr are cfdP1 and cfdP3. Noting that vio(cfdPi) is the value space that does not satisfy cfdPi, the value spaces that do not satisfy cfdP1 and cfdP3 are as follows:

vio(cfdP1) = {ES = T ∧ WS < 3}, vio(cfdP3) = {WS > 1 ∧ AGE < 19}

Acquiring the MDS

Acquiring the MDS from the given cfdPs is the core of the DLA. For data tuples with N attributes, T(T1, T2, T3, ..., TN), and an initial dependency set, D, the mathematical form of obtaining an rcfdP after selecting a start attribute, Ti, is as follows:

vio(rcfdP) = (∪_{d∈D} A_i^d) ∧ ∧_{j≠i} (∩_{d∈D} A_j^d)    (2)

In Expression (2), A_j is the value space of the j-th attribute, and A_j^d is the value space of the j-th attribute that does not satisfy the dependency, d.
Selecting the WS attribute as the start_attr, and taking vio(cfdP1) and vio(cfdP3) as input, the computing process of obtaining rcfdP1 is as follows: for the start attribute, WS, the union [0, 3) ∪ (1, +∞) covers the whole domain, so WS is unconstrained in the result; for the non-start attributes, the intersections keep ES = T and AGE < 19. The "∪" and "∩" operations on the ES, WS, and AGE attributes involve merging the different pointers, which is similar to combining intervals on a number axis; the WS attribute is shown as an example in Figure 2. In this way, a new dependency, rcfdP1, is obtained. The MDS is obtained after all attributes in Attr(R) have been traversed once. Then, we judge the data elements in the data instance through the MDS and finally obtain the inconsistent data.

Location and Representation of Inconsistent Data
For the inconsistent data detected by the MDS, we use the quadruple in Expression (1) to express and locate them. The inconsistent data-elements set (IDS) of the staff instance in Table 2 can be expressed as follows: IDS: {(t2, MSy, DM, cfdP2), (t3, ES, WS, cfdP1), (t3, AGE, ES, rcfdP1)}.

Inconsistent Data Reparation Module
The inconsistent data reparation module takes the repair times of the data instance as the repair cost. It first computes the count of violated dependencies for every inconsistent data element to get the priority queue (PQ). Then, taking the corresponding conflict-free data instance, Inv, as the training set, we learn the correlation among attributes in Inv by the symmetric uncertainty method from information theory. At last, we select the candidate repair data element from the PQ and perform reparation based on the improved KNN algorithm. The whole module takes a data instance, I, the MDS, and the IDS as input, and can also be divided into three sub-stages: the candidate repair data priority queue, attribute correlation computing, and attribute value reparation.

Candidate Repair Data Priority Queue
We choose the repair times of the data instance as the repair cost in this paper, establish the PQ based on the IDS, and select the first element of the PQ to repair. Taking VioCount(t_i, A) as the violation count of attribute A in tuple t_i, for the n dependencies in the MDS, VioCount(t_i, A) can be expressed as follows:

VioCount(t_i, A) = Σ_{j=1}^{n} [A ∈ cfdP_j], (t_i, A) ∈ IDS    (4)

where A ∈ cfdP_j means the attribute, A, of the tuple, t_i, is contained in the dependency, cfdP_j; and (t_i, A) ∈ IDS means the attributes of tuples are all selected from the IDS, ensuring the selected data elements are all inconsistent.
Taking the MDS and IDS in Section 2.2.1 and the staff instance in Table 2 as an example again, all the VioCount(t_i, A) values can be computed. Then, we establish the PQ according to VioCount(t_i, A) and select the first element in the PQ to be repaired: (t3, ES).
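The PQ construction of this sub-stage can be sketched with a counter and a heap. The IDS entries mirror the running example, while the data-structure choices (`Counter`, `heapq`) are our own.

```python
import heapq
from collections import Counter

# Sketch of building the candidate-repair priority queue from the IDS.
# In the spirit of Expression (4), each inconsistent element (ti, A) is
# scored by how many MDS dependencies it appears in; the highest-scoring
# element is repaired first.
ids = [
    ("t2", "MSy", "DM", "cfdP2"),
    ("t3", "ES", "WS", "cfdP1"),
    ("t3", "AGE", "ES", "rcfdP1"),
]

vio_count = Counter()
for tid, a_m, a_n, dep in ids:
    vio_count[(tid, a_m)] += 1
    vio_count[(tid, a_n)] += 1

# heapq is a min-heap, so push negated counts to pop the largest first.
pq = [(-c, elem) for elem, c in vio_count.items()]
heapq.heapify(pq)
count, candidate = heapq.heappop(pq)
print(candidate, -count)  # ('t3', 'ES') 2
```

(t3, ES) surfaces first because it participates in both cfdP1 and rcfdP1, matching the candidate chosen in the running example.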

Attribute Correlation Computing
When computing the attribute correlation, we need to first delete inconsistent tuples in I to get the corresponding conflict-free data instance, I nv , and learn the correlation from the training set, I nv . For example, the I nv of the staff instance in Table 2 does not contain tuples, t 2 and t 3 .
In real data sets, there may be a certain correlation among attributes, so it is more advantageous to repair data sets considering this correlation in an unsupervised environment. Generally speaking, in information theory, correlation is computed through the information gain (IG) [18] (pp. 1-8) or the symmetrical uncertainty (SU) [19] (pp. 429-438); however, the disadvantage of the IG is that it tends to favor attributes with many distinct values and must be standardized to ensure comparability. Therefore, the SU method is chosen for computation in Inv.
In information theory, the uncertainty of a variable, X, can be measured by its information entropy, H(X), which increases with the uncertainty of the variable. The mathematical definition is as follows:

H(X) = -Σ_x p(x) log2 p(x)    (5)

where p(x) means the probability that the variable, X, takes the value x. The conditional entropy, H(X|Y), represents the uncertainty of the variable, X, when Y is determined:

H(X|Y = y) = -Σ_x p(x|y) log2 p(x|y)    (6)

H(X|Y) = -Σ_y p(y) Σ_x p(x|y) log2 p(x|y)    (7)
In Expression (7), p(x|y) means the probability that the variable, X, takes the value x when the variable Y is y. In this case, the IG can be expressed as:

IG(X|Y) = H(X) - H(X|Y)    (8)

To eliminate the influence of variable units and value ranges, the SU method is used to normalize the IG:

SU(X, Y) = 2 × IG(X|Y) / (H(X) + H(Y))    (9)

Taking the staff instance in Table 2 as an example, the Inv can be obtained, and the correlation between the ES and AGE attributes is computed as 0.31. Similarly, the correlation between the ES and the other attributes, ID, NM, WS, WP, MSs, MSy, and DM, is 0, 0, 0.43, 0.11, 0.05, 0, and 0.43, respectively. It must be noted that, because there are only eight tuples in Inv, the correlation among attributes learned from Inv only represents the correlation in Table 2 and may not be representative of the whole industry.
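A minimal sketch of the entropy, conditional entropy, and SU computations of Expressions (5)-(9); the toy columns are illustrative and not the staff data.

```python
import math
from collections import Counter

# Entropy H(X), conditional entropy H(X|Y), and symmetric uncertainty
# SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), as in Expressions (5)-(9).
def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def cond_entropy(xs, ys):
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += cy / n * entropy(sub)  # weight H(X|Y=y) by p(y)
    return h

def su(xs, ys):
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    ig = hx - cond_entropy(xs, ys)  # Expression (8)
    return 2 * ig / (hx + hy)       # Expression (9)

x = ["T", "T", "F", "F"]
print(su(x, ["a", "a", "b", "b"]))  # 1.0: Y fully determines X
print(su(x, ["a", "b", "a", "b"]))  # 0.0: Y carries no information about X
```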

Attribute Values Reparation
In the stage of attribute value reparation, we choose the first element in the PQ as the candidate repair data element, and compute the distance between tuples according to the improved KNN algorithm and the correlation. The improved KNN algorithm takes the SU among attributes as the weight to get the weighted distance (WDis) between tuples, where SU(X, Y) means the correlation between the X and Y attributes, X is the attribute of the candidate repair data element, and Y is another attribute in the data instance. The RDis(t_i[Y], t_j[Y]) means the relative distance between tuples, t_i and t_j, on the attribute, Y. Because attributes have different types (numeric, text, and boolean), they are not directly comparable through the Euclidean distance or the edit distance. Therefore, we choose the relative distance (RDis) to measure the distance between attribute values of tuples in this paper, which makes different types of attributes comparable. For numerical attributes, the ratio of their Euclidean distance to the larger value is computed as the RDis; for others, the ratio of their edit distance to the longer string is computed as the RDis:

RDis(t_i[Y], t_j[Y]) = EucD(t_i[Y], t_j[Y]) / max(t_i[Y], t_j[Y]) for numerical attributes, and EditD(t_i[Y], t_j[Y]) / max(|t_i[Y]|, |t_j[Y]|) otherwise    (11)
where EucD and EditD represent the Euclidean distance and the edit distance, respectively. Thanks to the relative distance, the distances between attribute values in different tuples are mapped into the interval [0,1], which makes them comparable.
Taking tuples t3 and t1 in Table 2 as an example, the candidate repair data element selected from the PQ is (t3, ES), and the correlation between the ES and the other attributes {ID, NM, AGE, WS, WP, MSs, MSy, DM} is SU = {0, 0, 0.31, 0.43, 0.11, 0.05, 0, 0.43}, respectively. In this case, the correlate set of attribute ES is {AGE, WS, WP, MSs, DM}, and the RDis between t3 and t1 on the correlate set is RDis = {0.28, 0.875, 0.125, 1, 0.89}, based on the method in Expression (11). At last, we obtain the WDis between tuples t3 and t1 according to Expression (10): WDis(t3, t1) = 0.954. The WDis between t3 and the other tuples {t1, t4, t5, t6, t7, t8, t9, t10} in Inv is computed in the same way. Selecting n(n + 1)/2, (n ≥ 2), nearest tuples as class tuples, we finish a reparation round with the most frequent value of the candidate repair attribute among the class tuples. Because there are only 10 tuples in Table 2, the class tuples are {t7, t8, t9} with n = 2, and the ES attribute values of the class tuples are {T, F, F}, respectively. Therefore, the reparation value of (t3, ES) is "F".
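The round just described can be sketched as follows. The exact form of Expression (10) is not reproduced here, so a plain SU-weighted sum is used as one plausible variant; t1's attribute values and the DM strings are assumptions chosen only to match the relative distances 0.28 and 0.875 quoted above.

```python
# Hedged sketch of the relative distance RDis of Expression (11) and a
# SU-weighted tuple distance in the spirit of Expression (10).
def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def rdis(a, b):
    """Relative distance in [0, 1] for numeric or string attribute values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        m = max(abs(a), abs(b))
        return abs(a - b) / m if m else 0.0
    return edit_distance(str(a), str(b)) / max(len(str(a)), len(str(b)), 1)

def wdis(t_i, t_j, su_weights):
    """One plausible SU-weighted distance over the correlated attributes."""
    return sum(w * rdis(t_i[a], t_j[a]) for a, w in su_weights.items())

su_weights = {"AGE": 0.31, "WS": 0.43, "DM": 0.43}  # subset, for illustration
t3 = {"AGE": 18, "WS": 1, "DM": "clerk"}            # assumed values
t1 = {"AGE": 25, "WS": 8, "DM": "manager"}          # assumed values
print(round(wdis(t3, t1, su_weights), 3))
```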
After the above repair process, the (t3, ES) element is consistent, violating neither cfdP1 nor rcfdP1. Then, we consider a dynamic process that re-computes the PQ and continues to repair the other inconsistent data elements. In this way, the computational efficiency of the reparation algorithm is improved effectively, because one round of reparation can make multiple inconsistent data elements consistent or produce new inconsistent data elements. However, in special cases, a candidate repair data element in the PQ may still be inconsistent after a round of reparation, and the algorithm would fall into an endless loop. To keep the algorithm convergent, we use a label, flag, to mark the repaired data elements, which ensures every data element is repaired at most once and is then removed from the PQ. The concrete implementation is shown in Section 2.4.
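The dynamic repair loop with the flag label can be sketched as follows; `detect_ids`, `build_pq`, and `repair_one` are hypothetical stand-ins for the paper's sub-steps, and the demo data is a toy.

```python
# Sketch of the C-Repair outer loop: repair the head of the PQ, re-detect,
# rebuild the PQ, and keep a `flag` set so each element is repaired at most
# once, which guarantees termination even if a repair fails to resolve it.
def c_repair_loop(instance, detect_ids, build_pq, repair_one):
    flag = set()  # elements already repaired once
    while True:
        ids = detect_ids(instance)
        pq = [e for e in build_pq(ids) if e not in flag]
        if not pq:
            break  # no repairable inconsistent elements left
        elem = pq[0]
        repair_one(instance, elem)
        flag.add(elem)
    return instance

# Toy demo: two "inconsistent elements", each resolved by one repair.
data = {"bad": {"a", "b"}}
c_repair_loop(
    data,
    detect_ids=lambda d: sorted(d["bad"]),
    build_pq=lambda ids: ids,
    repair_one=lambda d, e: d["bad"].discard(e),
)
print(data)  # {'bad': set()}
```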

Inconsistent Data Detection Algorithm
According to the inconsistent data cleaning framework in Section 2.2, we propose a dependency lifting algorithm (DLA) to detect and locate the inconsistent data in data instances. The main idea of the DLA is to find the rcfdPs from the given cfdPs to obtain the MDS, and then detect the inconsistent data based on the MDS. At last, the detected inconsistent data is expressed by the quadruple in Expression (1).

Design of Detection Algorithm
To find rcfdPs from the cfdPs, we first get all attributes (allattr) and attribute types in the data instance, choose one attribute as the start_attr, and obtain the dependencies, cfdPs, related to the start_attr. Then, after judging the attribute types, we normalize the cfdPs using pointers that measure the direction of value change (the forward pointer, L, and the backward pointer, U), and obtain the value space, vio(cfdP), that does not satisfy each cfdP. At last, we perform the set operations and pointer merging on the vio(cfdP) to get an rcfdP. If the resulting rcfdP is a dependency not in the original dependency set, we add it to the MDS. The DLA ends when all attributes in the data instance have been traversed once.
The DLA flow is shown in Algorithm 1. Lines L1 to L5 get all attributes from the data instance, select a start attribute (start_attr), and obtain the related dependencies, cfdPs; lines L6 to L12 obtain the vio(cfdP) from the cfdPs; lines L13 to L19 perform the union operation for the start_attr and the intersection operation for the non-start_attr; lines L20 to L24 obtain the MDS by adding the new dependencies to the cfdPs after all attributes have been traversed once; and lines L25 to L29 detect the inconsistent data based on the MDS and express them with the quadruple in Expression (1).
In Algorithm 1, the function "union" in L15 and "intersection" in L17 perform the union operation for the start_attr and the intersection operation for the non-start_attr, respectively, both of which include the pointer-merging process. Different from the method for discrete data over finite sets proposed in paper [15] (pp. 1-18), the DLA performs operations similar to merging multiple intervals on a number axis, which also suits continuous attributes. The implementations of the "union" and "intersection" functions are similar; both need to merge the various states of the forward pointer, L, and the backward pointer, U. We take the "union" function as an example, as shown in Algorithm 2, where lines L5 to L15 enumerate the possible cases of the pointer-merging process.
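The pointer-merging idea behind "union" and "intersection" can be sketched as interval operations; the (L, U) tuple encoding is an illustrative assumption, and open/closed endpoints are elided for brevity.

```python
# Hedged sketch of pointer merging: a value space is an interval with a
# forward pointer L (lower bound) and backward pointer U (upper bound),
# merged like intervals on a number axis.
INF = float("inf")

def union(iv_a, iv_b):
    """Merge two intervals (L, U); returns a list of intervals."""
    (la, ua), (lb, ub) = sorted([iv_a, iv_b])
    if lb <= ua:  # overlapping or touching: one merged interval
        return [(la, max(ua, ub))]
    return [(la, ua), (lb, ub)]  # disjoint: keep both

def intersection(iv_a, iv_b):
    (la, ua), (lb, ub) = iv_a, iv_b
    lo, hi = max(la, lb), min(ua, ub)
    return [(lo, hi)] if lo <= hi else []

# vio(cfdP1) constrains WS roughly to [0, 3) and vio(cfdP3) to (1, inf):
print(union((0, 3), (1, INF)))         # [(0, inf)]  -> WS unconstrained
print(intersection((0, 3), (1, INF)))  # [(1, 3)]
```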

The Convergence and Complexity Analysis of DLA
Since we consider the dynamic process of acquiring the PQ in Section 2.2.2, it is necessary to detect and locate the inconsistent data elements in the data instance many times; the convergence and complexity of detection by the MDS will therefore be analyzed in Section 2.4.2. For a given dependency set, computing the MDS is the main cost of detecting inconsistent data (setting aside the detection by the MDS itself), and the DLA itself is a loop that continuously obtains rcfdPs. It is therefore essential to analyze the convergence and computational complexity of the algorithm.
(1) Convergence analysis. The convergence of the DLA means the termination of the algorithm, which mainly indicates that the process of computing the MDS from the given cfdPs must terminate and yield a new dependency set. The DLA is convergent for the cfdPs, and the proof is given below.

Proof 1. Suppose there are n dependencies (cfdP1, cfdP2, ..., cfdPn) in a data instance, I. Every computation of the MDS needs to select an attribute, A_i, from the attribute set, A, as the start attribute, and computes the related dependencies to generate m rcfdPs. The worst case is that all n + m related cfdPs (initially m = 0) of the selected start attribute, A_i, are involved in the computation, resulting in at most [(n + m) × (n + m + 1)]/2 new dependencies. Because the generation of new dependencies mainly involves set operations and pointer merging (similar to merging intervals on a number axis), whose computational complexities are O(1) and O(n), respectively, the process does not fall into an endless loop, and the algorithm converges for one computation of the MDS. As a result, the DLA also converges after N traversals.
(2) Complexity analysis. The time complexity of the DLA depends mainly on two parts: the traversal of all attributes and the dynamic domain adjustment.
The traversal of all attributes needs to select a start attribute, A_i, from the attribute set, A, which can be completed in one pass, so its time complexity is O(N).
The complexity of dynamic domain adjustment can be subdivided into two parts: the generation of the vio(cfdP)s and the pointer-merging process. The generation of the vio(cfdP)s operates on the n + m related cfdPs and can be completed in one loop, so its time complexity is O(n + m). The pointer-merging process (i.e., the "union" and "intersection" functions) completes three steps, the start_attr judgment and the location identification of the forward pointer, L, and the backward pointer, U, in one loop, so its time complexity is also O(n + m). In the worst case, the number of newly generated dependencies is m = n(n + 1)/2; therefore, the time complexity of dynamic domain adjustment is O(n + m) = O(n^2). Accordingly, the time complexity of acquiring the MDS is O(N·n^2). The DLA uses the MDS to detect inconsistent data; compared with the traditional cfdP-based algorithm, the time complexity of the DLA is higher because of the new recessive dependencies, and the extra cost is determined by the number of rcfdPs and the size of the data instance. The time complexity of detecting inconsistent data elements by the MDS in data instances will be analyzed in Section 2.4.2.

Inconsistent Data Reparation Algorithm
According to the inconsistent data cleaning framework in Section 2.2, we propose the C-Repair algorithm to repair the inconsistent data detected, which integrates the minimum cost idea and attribute correlation in a data instance, I. To ensure the convergence of the algorithm, we consider a dynamic process to obtain the PQ and set a label, flag, to mark the repaired data elements. Finally, we get the repaired data instance, I'.

Design of Reparation Algorithm
The repair time of a data instance, I, is selected as the repair cost in the C-Repair algorithm. We first sort the inconsistent data elements in the IDS by the minimum cost idea to obtain the PQ and choose the first element in the PQ as the candidate repair data. Then, we obtain the training set, Inv, according to the IDS, compute the attribute correlation, SU, and compute the WDis between the tuple of the candidate repair data and the other tuples in Inv. Finally, we select n(n + 1)/2 (n ≥ 2) class tuples with the smallest WDis and perform the reparation based on these class tuples. The C-Repair algorithm flow is shown in Algorithm 3.
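Expressions (10) and (11), which define the WDis, are not reproduced in this excerpt. As an assumption for illustration only, the sketch below uses one plausible form, a correlation-weighted distance where the weights stand in for the learned SU values:

```java
// Correlation-weighted distance between two tuples: numeric attributes use a
// squared difference, other attributes a 0/1 mismatch, each scaled by a weight
// standing in for the learned SU correlation of that attribute.
// This is an assumed form; the paper's Expressions (10)-(11) define the real WDis.
class WeightedDistance {
    static double wdis(Object[] t1, Object[] t2, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            double d;
            if (t1[i] instanceof Number && t2[i] instanceof Number) {
                d = ((Number) t1[i]).doubleValue() - ((Number) t2[i]).doubleValue();
                d = d * d;                               // squared numeric difference
            } else {
                d = t1[i].equals(t2[i]) ? 0.0 : 1.0;     // overlap distance for categorical values
            }
            sum += w[i] * d;
        }
        return Math.sqrt(sum);
    }
}
```

Tuples with small WDis to the candidate's tuple would then serve as the class tuples from which the repair value is drawn.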
In Algorithm 3, lines 2 to 8 compute the PQ from the IDS using Expression (4) and select a candidate repair data element, β; lines 9 to 14 treat the Inv as a training set and learn the correlation, SU, between the attribute of β and the other attributes using Expression (9); lines 15 to 21 obtain the WDis between the tuple of β and the other tuples in Inv using Expressions (10) and (11); lines 22 to 24 select class tuples according to the WDis and repair the data element, β; and lines 25 to 28 re-compute the IDS and PQ after a round of reparation to ensure convergence, and set the label, flag, to ensure every inconsistent data element is repaired at most once.
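The candidate-selection step in lines 2 to 8 of Algorithm 3 can be sketched as follows; the Cell fields and the descending-VioCount ordering are illustrative assumptions (Expression (4) defines the actual ordering):

```java
import java.util.*;

// Order unrepaired inconsistent elements by descending violation count, following
// the minimum cost idea: repairing the most-violating element first tends to
// resolve the most conflicts per repair. Field names are illustrative.
class RepairQueue {
    static class Cell {
        final int row, col, vioCount;
        boolean flag;                                    // marks a cell as already repaired
        Cell(int row, int col, int vioCount) { this.row = row; this.col = col; this.vioCount = vioCount; }
    }
    static Cell nextCandidate(Collection<Cell> ids) {
        PriorityQueue<Cell> pq = new PriorityQueue<>((a, b) -> b.vioCount - a.vioCount);
        for (Cell c : ids) if (!c.flag) pq.add(c);       // skip cells repaired in earlier rounds
        return pq.poll();                                // null when nothing is left to repair
    }
}
```

After each reparation round the IDS is re-computed, so this queue is rebuilt rather than updated in place, matching lines 25 to 28.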

The Convergence and Complexity Analysis of C-Repair
The C-Repair algorithm takes the IDS and MDS, which are the outputs of the DLA, and a data instance, I, as the input, and the output is the repaired data instance, I'. Similar to the DLA, the C-Repair algorithm is also a continuous loop process that gradually repairs the inconsistent data; therefore, it is necessary to consider the convergence and complexity of the algorithm.
(1) Convergence analysis The convergence of the C-Repair algorithm means that the algorithm terminates and obtains a stable repair result, I', after multiple reparation rounds. It can be proved that the C-Repair algorithm is convergent for a data instance, I, with limited tuples.
Proof 2. The C-Repair algorithm is convergent for a data instance, I, with limited tuples.
Suppose there are M tuples (t1, t2, t3, . . . , tM) and N attributes (A1, A2, A3, . . . , AN) in a data instance, I, and m tuples (t1, t2, t3, . . . , tm, m ≤ M) in the corresponding conflict-free data instance, Inv. The C-Repair algorithm is a loop process that repairs the inconsistent data elements based on the IDS. For one reparation round, we first select a candidate data element from the M × N elements according to the IDS and PQ, and then compute the correlation between the attribute of the candidate repair element and the other N − 1 attributes in Inv. Finally, we obtain the WDis among the tuples based on the correlation and repair the inconsistent data. Computing the PQ and SU in a data instance with limited tuples is convergent, so one round of reparation is convergent. For multiple reparations, the algorithm ensures that every data element is repaired at most once, because we use a label, flag, to mark the repaired elements. In the worst case, all M × N data elements in I are inconsistent; even then, the algorithm is still convergent because every reparation round converges, though the number of loops increases. In summary, the C-Repair algorithm is convergent for a data instance, I, with limited tuples.
(2) Complexity analysis The C-Repair algorithm needs to select an element as the candidate repair element from the sorted IDS (the PQ) and re-compute the IDS after each reparation round; when there are no elements left in the IDS, the algorithm ends. Suppose there are n CFDPs (cfdP1, cfdP2, cfdP3, . . . , cfdPn) in the MDS and t data elements in the IDS. For one reparation round, the complexity of the algorithm consists of four sequential parts: Obtaining the PQ, computing the attribute correlation, repairing the inconsistent data elements, and re-computing the IDS.
To obtain the PQ, we need to compute the violation count of dependencies, VioCount, for every element in the IDS, and computing the VioCount requires a traversal of every dependency in the MDS, so the time complexity is O(tn). When computing the correlation, SU(X, Y), between attributes, X and Y, we first traverse the m tuples in Inv to obtain the probabilities, p(x) and p(y), of the attributes, X and Y, and compute the corresponding information entropies, H(X) and H(Y); the time complexity of this step is O(m). Then, we traverse every value of the attribute, Y, to obtain the conditional probability, p(x|y), and compute the correlation, SU(X, Y), according to Expression (9). In summary, for one reparation round, the time complexity can be expressed as O(tn + (N − 1)m² + (N − 1)m + Mn). In the worst case, all elements in the data instance, I, are inconsistent, that is, t_max = MN; in this case, the time complexity of one round is O(MNn + (N − 1)m² + (N − 1)m + Mn). For multiple reparations, the maximum number of repair rounds is also t_max = MN, so the overall time complexity of the C-Repair algorithm, given in Expression (12), is at most MN times that of a single round. The time complexity in Expression (12) is obtained in the worst case where all elements in I are inconsistent; in an actual data instance, the amount of inconsistent data is often small, so the complexity is much smaller than Expression (12).
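The SU computation above can be made concrete. Assuming Expression (9) is the standard symmetric uncertainty, SU(X, Y) = 2[H(X) + H(Y) − H(X, Y)]/[H(X) + H(Y)], a sketch for two categorical columns:

```java
import java.util.*;

// Symmetric uncertainty between two categorical columns:
// SU(X,Y) = 2 * (H(X) + H(Y) - H(X,Y)) / (H(X) + H(Y)).
// Assumes the paper's Expression (9) is this standard entropy-based definition.
class SymmetricUncertainty {
    static double entropy(Map<String, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * Math.log(p) / Math.log(2);          // entropy in bits
        }
        return h;
    }
    static double su(String[] x, String[] y) {
        int n = x.length;
        Map<String, Integer> cx = new HashMap<>(), cy = new HashMap<>(), cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cx.merge(x[i], 1, Integer::sum);
            cy.merge(y[i], 1, Integer::sum);
            cxy.merge(x[i] + "\u0000" + y[i], 1, Integer::sum); // joint distribution key
        }
        double hx = entropy(cx, n), hy = entropy(cy, n), hxy = entropy(cxy, n);
        if (hx + hy == 0) return 1.0;                    // both columns constant
        return 2.0 * (hx + hy - hxy) / (hx + hy);
    }
}
```

SU is 1 for perfectly correlated columns and 0 for independent ones, which is what makes it usable as a per-attribute weight in the WDis.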

Results
The experiment is divided into two parts, detection and reparation, which verify the validity of the DLA and the C-Repair algorithm, respectively. Because it is difficult to obtain an actual data instance together with its contained dependencies, we use a simulated data instance for the DLA, while the C-Repair algorithm is verified on both simulated and actual data instances.

Experimental Environment
The experiments were run on a Core i7 2.5 GHz processor with 8 GB of memory under a 64-bit Windows 10 operating system (Hasee, Shenzhen, China). The algorithms are written in Java and run on the Eclipse platform.

Experimental Data Instances
In the detection experiment, we use a simulated data instance (I) to verify the validity of the DLA. First, we specify several dependencies and generate an instance, I, that satisfies the given dependencies with 1000 consistent data tuples; the staff relation schema in Section 2.1 is used for I, and the attribute descriptions are shown in Table 1. There are 9 attributes (some of them continuous), 1000 data tuples, and 9000 data elements in I, so we can test the applicability of the DLA to continuous attributes.
In the reparation experiment, we use both the instance I and the house price forecast instance from the Kaggle website (House Prices: Advanced Regression Techniques, HPART) to verify the validity of the C-Repair; the scale of the HPART instance is 1460 × 81 (1460 tuples and 81 attributes). In an unsupervised environment, we randomly generate inconsistent elements in HPART and compare the repaired results with the original truth values to verify the effectiveness.
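The paper does not specify its error-injection mechanism; one reproducible way to corrupt a given proportion of cells is sketched below (the marker-appending corruption rule is purely illustrative):

```java
import java.util.*;

// Corrupt exactly floor(p * M * N) distinct cells of a table, reproducibly.
// The corruption rule (appending a marker) is illustrative only.
class ErrorInjector {
    static int corrupt(String[][] table, double p, long seed) {
        int m = table.length, n = table[0].length;
        int k = (int) Math.floor(p * m * n);
        List<Integer> cells = new ArrayList<>();
        for (int i = 0; i < m * n; i++) cells.add(i);
        Collections.shuffle(cells, new Random(seed));    // seeded for repeatable experiments
        for (int i = 0; i < k; i++) {
            int r = cells.get(i) / n, c = cells.get(i) % n;
            table[r][c] = table[r][c] + "#ERR";          // placeholder corruption
        }
        return k;                                        // number of corrupted cells
    }
}
```

Shuffling the flattened cell indices guarantees the corrupted cells are distinct, so the injected proportion is exact rather than approximate.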
To reduce the contingency of the experimental results, all reported figures are the averages of ten runs of each algorithm.

Experimental CFDPs
In the experiment, we modified some elements in I to obtain the inconsistent data instance, Ic, based on the given dependencies, and then, detected and repaired the inconsistent data elements in Ic using the DLA and C-Repair algorithm, respectively. By setting multiple inconsistent elements and data tuples in I, the efficiency of different algorithms was compared.
According to the attributes' description in Table 1, we specified 10 dependencies, which should be satisfied in Ic.

Experimental Results of the DLA
For inconsistent data detection, the universal method is an algorithm based on the CFDPs, which detects the inconsistent elements in data instances according to the given CFDPs. We compared the DLA with the CFDPs based algorithm in the experiment, and analyzed the accuracy and time-cost of the two algorithms.

Acquiring the MDS
The DLA first computed the recessive dependencies contained in the given dependencies to get the MDS, and detected and located inconsistent elements based on the MDS. According to the DLA flow in Algorithm 1, we obtained six recessive dependencies.

Evaluation Indexes
In the inconsistent detection part, we selected two indexes, detection accuracy and time-cost, to compare the DLA with the CFDPs based algorithm. The detection accuracy index indicates the detection ability of the algorithms for inconsistent elements, and the time-cost index means the time it takes the algorithms to detect inconsistent data.
(1) Detection Accuracy In the experiment, we chose three sub-indexes, precision, recall, and F-measure, to measure the detection accuracy of inconsistent data in data instances. Since the precision and recall indexes are conflicting in nature, we used the F-measure, the harmonic mean of precision and recall, as a comprehensive measure [20] (pp. 1-20). The three sub-indexes are computed as follows: precision = |sum_vio ∩ sum_real|/|sum_vio|, recall = |sum_vio ∩ sum_real|/|sum_real|, and F-measure = 2 × precision × recall/(precision + recall), where sum_vio and sum_real represent the detected inconsistent elements and the actual inconsistent elements in a data instance, respectively, and sum_vio ∩ sum_real means the inconsistent elements correctly detected. (2) Time-Cost The time-cost index (Tc) means the time taken to detect the inconsistent elements in a data instance. In this paper, the time-cost of the two algorithms is measured by the system running time.
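Treating sum_vio and sum_real as sets of detected and actual inconsistent elements, the three sub-indexes reduce to simple set arithmetic:

```java
import java.util.*;

// Precision, recall, and F-measure over sets of detected (sum_vio) and
// actual (sum_real) inconsistent elements.
class DetectionAccuracy {
    static double[] evaluate(Set<String> detected, Set<String> actual) {
        Set<String> hit = new HashSet<>(detected);
        hit.retainAll(actual);                            // sum_vio ∩ sum_real
        double precision = detected.isEmpty() ? 0 : (double) hit.size() / detected.size();
        double recall = actual.isEmpty() ? 0 : (double) hit.size() / actual.size();
        double f = (precision + recall == 0) ? 0
                 : 2 * precision * recall / (precision + recall); // harmonic mean
        return new double[]{precision, recall, f};
    }
}
```

Here an "element" would be identified by its tuple and attribute, e.g., a string key such as "t2.salary".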

Analysis of the Experimental Results
There were 1000 data tuples, 9 attributes, and 9000 data elements in the original data instance, I, with no missing elements. We obtained the detection accuracy of the DLA and the CFDPs based algorithm by setting multiple inconsistent element proportions (proportion), and the results are shown in Figure 3.

We can see from Figure 3 that the two algorithms show little fluctuation in detection accuracy across the different proportions, and the DLA always performs better than the CFDPs based algorithm. However, for the data instance, I, although the DLA performs well on all detection accuracy indexes, it is not perfect on any of the precision, recall, and F-measure indexes. We attribute this to errors in the LHS attributes of the dependencies.

Next, we analyzed the detection time-cost (Tc) of the two algorithms by setting multiple inconsistent data proportions (proportion) and data tuples (data-tuple) in I. The results of the system time required by the two algorithms to detect the inconsistent elements are shown in Figure 4.

In the experiment, the Tc was recorded as the average time of 10 runs. As can be seen from Figure 4a, with fixed data tuples, different inconsistent element proportions have small effects on either the DLA or the CFDPs based algorithm, while Figure 4b shows a sharp increase in the Tc of both algorithms as the data tuples increase with a fixed proportion. In summary, the Tc of the DLA is always higher than that of the CFDPs based algorithm, mainly because the DLA needs to obtain all the rCFDPs contained in the given CFDPs. In this case, the dependencies in the MDS often outnumber the given CFDPs, resulting in a longer detection time for the DLA.
According to the results in Figures 3 and 4, the DLA has a higher time-cost than the CFDPs based algorithm, but it has clear advantages in detection accuracy. We regard this as a trade-off between time and accuracy; the gap in detection accuracy and time-cost between the two algorithms is influenced by the number of data tuples and the number of dependencies.

Experimental Results of the C-Repair
In view of inconsistent elements' reparation in data instances, we compare the C-Repair with both cost-based and interpolation-based algorithms in the data instance, I, and HPART, and analyze the difference among algorithms in multiple data tuples (data-tuple) and inconsistent elements (proportion).

Evaluation Indexes
In the experiment, we select three indexes, error-rate, time-cost, and validity, to evaluate the C-Repair against the cost-based algorithm on the data instance, I, and three indexes, validity, satisfaction, and time-cost, to evaluate the C-Repair against the interpolation-based algorithm on the HPART instance. Because the C-Repair uses an improved K-NN algorithm combining attribute correlation and repair cost, which in essence selects the most relevant elements in the corresponding conflict-free data instance to replace the original elements, there is no guarantee that the data elements obtained by each reparation will be conflict-free. The error-rate index measures the amount of inconsistent data remaining after reparation; the time-cost index measures the running time of the different reparation algorithms; the validity index measures the change in the amount of inconsistent elements before and after reparation; and the satisfaction index measures how closely the repaired results match the initial truth values. The specific computing methods are as follows.
(1) Error-Rate Because the repair results cannot always be conflict-free when the attribute correlation is considered, we introduced the error-rate index to measure the inconsistent elements remaining in the repaired data instance, I', which can be computed as follows: where ∑I' sum_vio means the amount of inconsistent elements remaining in the repaired instance, I', and ∑I sum_vio means the amount of inconsistent elements in the original instance, I.
(2) Time-Cost Similar to the time-cost index in Section 3.2, the time-cost (T c ) of the reparation algorithm is also described by the system time of the algorithms.

(3) Validity
In the experiment, the validity index is described by the ratio of the change in the amount of inconsistent elements before and after reparation to the total amount of inconsistent elements in the data instance, I, which can be computed as follows: where ∑I sum_vio means the amount of inconsistent elements in the original instance, and ∑I' sum_vio means the amount of inconsistent elements remaining in the repaired instance.
(4) Satisfaction In inconsistent data reparation, different algorithms often produce different repaired values; although all of them may satisfy the given dependencies, there is a gap between them and the truth values. The satisfaction index measures this gap, where Dis(repaired, truth) means the distance between the repaired values and the corresponding truth values, measured by the Euclidean distance for numeric attributes and the edit distance for other types, and Max(repaired, truth) indicates the larger of the repaired and truth values, measured by numeric size for numeric attributes and string length for other types.
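The repair indexes can be computed directly; the satisfaction form 1 − Dis/Max below is an assumption consistent with the stated definitions of Dis and Max, not the paper's exact expression:

```java
// Repair evaluation indexes. error-rate = remaining / original inconsistencies;
// validity = (original - remaining) / original. The satisfaction form
// 1 - Dis/Max is an assumption consistent with the Dis and Max definitions.
class RepairIndexes {
    static double errorRate(int originalVio, int remainingVio) {
        return (double) remainingVio / originalVio;
    }
    static double validity(int originalVio, int remainingVio) {
        return (double) (originalVio - remainingVio) / originalVio;
    }
    static double satisfactionNumeric(double repaired, double truth) {
        double dis = Math.abs(repaired - truth);              // Euclidean distance in one dimension
        double max = Math.max(Math.abs(repaired), Math.abs(truth));
        return max == 0 ? 1.0 : 1.0 - dis / max;
    }
}
```

By construction error-rate and validity sum to 1, which is why the cost-based comparison reports them together.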

Analysis of the Experimental Results
In the experiment, we first compared the C-Repair algorithm with the cost-based algorithm by setting multiple proportions and data tuples in the data instance, I, and observed the repair capabilities of the two algorithms.
For the data instance, I, with 1000 tuples and 9 attributes, the results of the error-rate, time-cost, and validity indexes obtained by setting multiple proportions are shown in Figure 5a-c, respectively.

To analyze the effect of data tuples on the time-cost (Tc) and repair times (repair-times) of the C-Repair algorithm, we set multiple tuple counts in the experiment with a fixed inconsistent elements amount (proportion = 0.03); the results are shown in Figure 5d.
As we can see from Figure 5a-c, although each reparation round of the C-Repair algorithm cannot guarantee a conflict-free result, few inconsistent data elements remained after reparation, and in many runs the repaired data instance, I', was completely conflict-free. We attribute this to the re-computation of the PQ after every reparation round. In Figure 5a-c, the C-Repair algorithm outperforms the cost-based algorithm on both the error-rate and validity indexes, but its time-cost is relatively large.
In Figure 5d, we changed the data tuples to observe the Tc and repair-times of the C-Repair algorithm with the fixed proportion. It is not difficult to see that the Tc increased sharply as the data tuples increased, but the repair times were basically stable for the fixed proportion.
For the HPART instance, we compared the C-Repair with the interpolation-based algorithm by setting multiple proportions to observe the repair capabilities of two algorithms.
We set the inconsistent proportions to 0.01%, 0.02%, 0.03%, 0.04%, and 0.05% in the HPART, and the results of the validity, satisfaction, and time-cost indexes are shown in Figure 6a-c, respectively. From the results in Figure 6a,b, we can see that the validity and satisfaction of the C-Repair algorithm are always higher than those of the interpolation-based algorithm as the inconsistent proportion changes in the HPART instance. In other words, the C-Repair ensures better results than the interpolation-based algorithm, leaving less inconsistent data and producing repaired results closer to the initial truth values. From the results in Figure 6c, the time-cost of both algorithms increases dramatically, and that of the C-Repair is always higher.

In the experiment, there are two reasons why the C-Repair always has a higher time-cost: (1) learning the correlation from the corresponding Inv is expensive when there are many data tuples; and (2) we need to compute the WDis between each tuple in Inv and the candidate element to select the class tuples, which is also costly. For big data, improvements are needed to handle data sets at a huge scale, and dividing the original data set into blocks may be a good idea; however, how to divide the blocks so as to improve time efficiency while guaranteeing the repair ability is worth further study.
Combining the results of Figures 5 and 6, we find that, although the C-Repair performs better than the interpolation-based algorithm on the validity and satisfaction indexes in the HPART instance, it is hard to obtain results as good as those on the simulated instance, I. This is because, unlike in the simulated instance, I, where the repaired results only need to satisfy the MDS to achieve our repair purpose, the truth value of every inconsistent element in the HPART instance is unique, so a perfect repair result in HPART must be exactly the same as the initial truth value.

Discussion
For the inconsistent data cleaning in data instances, we propose the DLA and the C-Repair to perform detection and reparation, respectively. From the results in Section 3.2, the DLA has a higher detection accuracy and completeness than the traditional dependency-based detection algorithms (CFDs and CFDPs based), so we consider it feasible to detect inconsistent elements by finding the MDS from the given dependency sets. However, the DLA cannot perform perfectly on either the precision or the recall index; that is, it cannot detect all inconsistent elements in data instances. We believe this may be due to errors on the LHS attributes of the given dependencies, which make such inconsistencies undetectable by the MDS; this is also a limitation of the DLA. From the results in Section 3.3, the C-Repair algorithm performs better than the cost-based and interpolation-based algorithms on the error-rate, validity, and satisfaction indexes in both the simulated data instance, I, and HPART. We therefore consider it feasible to perform reparation based on attribute correlation in an unsupervised environment.
The DLA performs better in inconsistent data detection because it can find recessive dependencies that enlarge the dependency set, so the detection accuracy and completeness are improved accordingly. When repairing, there is always a certain correlation among attributes in actual data sets, as the data do not occur at random, so we can obtain better results by learning that correlation with the symmetric uncertainty method and using it to guide the repair. Moreover, the C-Repair is an unsupervised repair algorithm that requires no manual intervention, which is a further feature and advantage of the algorithm.
Meanwhile, we observed two interesting phenomena in the experiment: (1) the C-Repair algorithm considers the correlation among attributes and selects the most relevant elements from the corresponding conflict-free instance as repair values, instead of directly modifying values to satisfy the dependencies; therefore, a small number of inconsistent elements may remain in the repaired instance, I'. Nevertheless, in the experiments on the simulated instance we obtained completely consistent results many times. (2) The C-Repair performs better on the simulated instance, I, than on the HPART, which may be because every inconsistent element in the HPART instance has a unique truth value, whereas the repaired results only need to satisfy the MDS in instance, I, which lowers the requirement.
However, both the DLA and the C-Repair pay a higher time-cost for their improved detection accuracy and repair ability, especially the C-Repair, so reducing the time complexity of the two algorithms while maintaining their detection and reparation ability is a problem worth considering. Subsequent research will explore three aspects: (1) because dependencies in data instances are not always specified in advance, we may improve the applicability of the DLA through data mining techniques (i.e., association rule mining [21] (pp. 54-69) and frequent pattern mining [22] (pp. 104-114)) that obtain dependencies automatically; (2) exploring whether the time complexity of the two algorithms can be reduced while keeping their detection and reparation ability, so that they can be applied to big data, and improving the robustness of the DLA to LHS attribute errors; and (3) investigating whether, in an unsupervised environment, the C-Repair algorithm, which integrates the minimum cost idea and attribute correlation, can be applied to other data quality problems, such as missing data filling [23] (pp. 157-166) and inaccurate data correction [24] (pp. 230-234).

Conclusions
The main contributions of this paper are a data cleaning framework for inconsistent elements and two algorithms, the DLA for detection and the C-Repair for reparation. From the experimental results, the DLA improves the detection accuracy and completeness compared with the dependency-based algorithm, while the C-Repair uses unsupervised machine learning to obtain the correlation among attributes and obtains the most relevant repair results with minimal repair times. Compared with the cost-based and interpolation-based algorithms, the C-Repair performs better on the error-rate, validity, and satisfaction indexes and requires no manual intervention, so it has applicability in unsupervised cleaning and also provides a method that other data quality problems can reference.