Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation
Abstract
:1. Introduction
1.1. Inconsistent Data Cleaning Status
1.2. Purpose and Structure of This Paper
- (1)
- We propose a heuristic algorithm for inconsistent data detection and reparation, which can repair inconsistent data violating the CFD^{P}s in data sets under unsupervised circumstances;
- (2)
- For inconsistent data detection, the maximum dependency set is used to detect them. By finding recessive dependencies from original dependency sets, we improve the detection accuracy and apply the algorithm to continuous attributes; and
- (3)
- For inconsistent data reparation, we use unsupervised machine learning to learn the correlation among attributes in data sets, and integrate the minimum cost idea and information theory to repair, which makes the repair results most relevant to the initial values with minimum repair times.
2. Materials and Methods
2.1. Problem Description
staff (id, name, age, excellent_staff, work_seniority, work_place, marry_status, monthly_salary, department)
2.2. Inconsistent Data Cleaning Framework
2.2.1. Inconsistent Data Detection Module
Acquiring Related Dependencies
Acquiring the MDS
Location and Representation of Inconsistent Data
2.2.2. Inconsistent Data Reparation Module
Candidate Repair Data Priority Queue
Attribute Correlation Computing
Attribute Values Reparation
2.3. Inconsistent Data Detection Algorithm
2.3.1. Design of Detection Algorithm
Algorithm 1. The dependency lifting algorithm (DLA) flow. |
Input: data instance I, cfd^{p} Output: IDS |
01. get allattr from I; 02. //loop all attributes 03. for each $attr\in allattr$ do 04. start_attr $\leftarrow $ attr; 05. cfd^{p} $\leftarrow $ realted cfd^{p}s; 06. //get vio(cfd^{p}) of every cfd^{p} 07. for each cfd^{p} do 08. //normalization of every cfd^{p} by setting pointers. 09. nor_cfd^{p} $\leftarrow $ normalization (cfd^{p}); 10. //reverse right part of nor_cfd^{p} to get vio(cfd^{p}) 11. vio(cfd^{p}) $\leftarrow $ reverse RHS of nor_cfd^{p}; 12. end for; 13. for each $\beta \in $ viocfd^{p} do 14. if ($\beta \in allattr$ && $\beta $==start_attr) 15. result $\leftarrow $ union(); 16. else if ($\beta \in allattr$ && $\beta $!=start_attr) 17. result $\leftarrow $ intersection(); 18. end if; 19. end for; 20. MDS $\leftarrow $ MDS + cfd^{p}; 21. if (!MDS.contains(result) 22. MDS $\leftarrow $ MDS + result; 23. end if; 24. end for; 25. //detect inconsistent data by MDS from I 26. ins_data.detect();27. //express inconsistent data by quadruple in formula (2) 28. IDS.express(ins_data); 29. return IDS; |
Algorithm 2. The implementation of the function union. |
Input: viocfd^{p} Output: result of pointer-merging |
01. //$\beta $ is start_attr in viocfd^{p} 02. for each $\beta \in $ viocfd^{p} do 03. buffer $\leftarrow $ every $\beta .value$ in viocfd^{p}; 04. end for; 05. if (buffer only L, Not U) 06. choose max; 07. result $\leftarrow $ max+”L” 08. else if (buffer only U, Not L) 09. choose min; 10. result $\leftarrow $ min+”U”; 11. else if (buffer exists L and U) 12. judge the location L and U; 13. do set operation in axis; 14. get result; 15. end if; 16. return result; |
2.3.2. The Convergence and Complexity Analysis of DLA
2.4. Inconsistent Data Reparation Algorithm
2.4.1. Design of Reparation Algorithm
Algorithm 3. The C-Repair algorithm flow. |
Input: data instance I, MDS, IDS Output: repaired result I’ |
01. for IDS $\text{}\ne \varnothing $ do 02. for (${t}_{i},A$)$\in $ IDS do 03. //calculate the count by Formula (4) 04. α $\text{}\leftarrow VioCount\left({t}_{i},A\right)$; 05. PQ $\leftarrow $ descending order of α; 06. end for; 07. // β is the data element firstly repaired 08. β $\leftarrow $ first element of PQ; 09. $\text{}{I}_{nv}\leftarrow $ conflict-free data tuples in I; 10. //each attribute r.attr in ${I}_{nv}$ 11. for r.attr $\in {I}_{nv}.attr$ do 12. //calculate the correlation by Formula (9) 13. SU(β.attr, r.attr); 14. end for; 15. //each data tuple $t.{t}_{j}$ in ${I}_{nv}$ 16. for $t.{t}_{j}\in {I}_{nv}.{t}_{j}$ do 17. //calculate the relative distance by Formula (11) 18. RDis($\mathsf{\beta}.{t}_{i},t.{t}_{j}$); 19. //calculate the weighted distance by Formula (10) 20. WDis($\mathsf{\beta}.{t}_{i},t.{t}_{j}$); 21. end for; 22. buffer $\leftarrow $ ascending order of WDis; 23. class_tuple $\leftarrow $ choose n(n+1)/2 tuples of buffer; 24. I’ $\leftarrow $ repair I by class_tuple; 25. flag $\leftarrow $ β; 26. IDS $\leftarrow $ recalculate IDS by MDS; 27. IDS $\leftarrow $ IDS – flag; 28. end for; 29. return I’; |
2.4.2. The Convergence and Complexity Analysis of C-Repair
3. Results
3.1. Experimental Configuration
3.1.1. Experimental Environment
3.1.2. Experimental Data Instances
3.1.3. Experimental CFD^{P}s
3.2. Experimental Results of the DLA
3.2.1. Acquiring the MDS
3.2.2. Evaluation Indexes
3.2.3. Analysis of the Experimental Results
3.3. Experimental Results of the C-Repair
3.3.1. Evaluation Indexes
3.3.2. Analysis of the Experimental Results
4. Discussion
- (1)
- (2)
- exploring whether we can reduce the time complexity of two algorithms while keeping the detection and reparation ability, which means they can be applied to big data and improves the ability of DLA algorithm to LHS attribute errors; and
- (3)
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Fan, W. Data Quality: From Theory to Practice. ACM SIGMOD Rec. 2015, 44, 7–18. [Google Scholar] [CrossRef]
- Tu, S.; Huang, M. Scalable Functional Dependencies Discovery from Big Data. In Proceedings of the IEEE Second International Conference on Multimedia Big Data, Taipei, Taiwan, 20–22 April 2016; pp. 426–431. [Google Scholar]
- Zhou, J.; Diao, X.; Cao, J.; Pan, Z. An Optimization Strategy for CFDMiner: An Algorithm of Discovering Constant Conditional Functional Dependencies. IEICE Trans. Inf. Syst. 2016, E99-D, 537–540. [Google Scholar] [CrossRef]
- Ma, S.; Duan, L.; Fan, W.; Hu, C.; Chen, W. Extending Conditional Dependencies with Built-in Predicates. IEEE Trans. Knowl. Data Eng. 2015, 27, 3274–3288. [Google Scholar] [CrossRef]
- Liu, X.-L.; Li, J.-Z. Consistent Estimation of Query Result in Inconsistent Data. Chin. J. Comput. 2015, 38, 1727–1738. [Google Scholar]
- Parisi, F.; Grant., J. On repairing and querying inconsistent probabilistic spatio-temporal databases. Int. J. Approx. Reason. 2017, 84, 41–74. [Google Scholar] [CrossRef]
- Bohannon, P.; Fan, W.; Flaster, M.; Rastogi, R. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14–16 June 2005; pp. 143–154. [Google Scholar]
- Brisaboa, N.R.; Rodríguez, M.A.; Seco, D.; Troncoso, R.A. Rank-based strategies for cleaning inconsistent spatial databases. Int. J. Geogr. Inf. Sci. 2015, 29, 280–304. [Google Scholar] [CrossRef]
- Bai, L.; Shao, Z.; Lin, Z.; Cheng, S. Fixing inconsistencies of fuzzy spatiotemporal XML data. Appl. Intell. 2017, 1–19. [Google Scholar] [CrossRef]
- Liu, B.; C, M.; Zhou, X. Study on Data Repair and Consistency Query Processing. Comput. Sci. 2016, 43, 232–236, 241. [Google Scholar]
- Arieli, O.; Zamansky, A. A Graded Approach to Database Repair by Context-Aware Distance Semantics; Elsevier North-Holland, Inc.: Amsterdam, The Netherlands, 2016. [Google Scholar] [CrossRef]
- Xu, Y.; Li, Z.; Chen, Q.; Zhong, P. Repairing Inconsistent Relational Data Based on Possible Word Model. J. Softw. 2016, 27, 1685–1699. [Google Scholar]
- Liu, B.; Liu, H. Data consistency repair method for enterprise information integration. Comput. Integr. Manuf. Syst. 2013, 10, 2664–2669. [Google Scholar]
- Kim, J.; Jang, G.J.; Lee, M. Investigation of the Efficiency of Unsupervised Learning for Multi-task Classification in Convolutional Neural Network. In Neural Information Processing; Springer International Publishing: New York, NY, USA, 2016; pp. 547–554. ISSN 0302-9743. [Google Scholar]
- Scholtus, S. A generalized Fellegi-Holt paradigm for automatic error localization. Surv. Methodol. 2016, 42, 1–18. [Google Scholar]
- Zhang, C.; Diao, Y. Conditional functional dependency discovery and data repair based on decision tree. In Proceedings of the International Conference on Fuzzy Systems and Knowledge Discovery, Changsha, China, 13–15 August 2016; pp. 864–868. [Google Scholar]
- Cao, J.; Diao, X. Introduction to Data Quality; National Defense Industry Press: Beijing, China, 2017; pp. 69–71. ISBN 9787118114058. [Google Scholar]
- Le, H.T.; Urruty, T.; Gbèhounou, S.; Lecellier, F.; Martinet, J.; Fernandez-Maloigne, C. Improving retrieval framework using information gain models. Signal Image Video Process. 2017, 11, 1–8. [Google Scholar] [CrossRef]
- Ye, M.; Gao, L.; Wu, C.; Wan, C. Informative Gene Selection Method Based on Symmetric Uncertainly and SVM Recursive Feature Elimination. Pattern Recognit. Artif. Intell. 2017, 30, 429–438. [Google Scholar]
- Ahmad, H.A.; Wang, H. An effective weighted rule-based method for entity resolution. Distrib. Parallel Databases 2018, 36, 593–612. [Google Scholar] [CrossRef]
- Martin, D.; Rosete, A.; Alcala-Fdez, J.; Herrera, F. A New Multiobjective Evolutionary Algorithm for Mining a Reduced Set of Interesting Positive and Negative Quantitative Association Rules. IEEE Trans. Evol. Comput. 2014, 18, 54–69. [Google Scholar] [CrossRef]
- Zhang, X.J.; Wang, M.; Meng, X.F. An Accurate Method for Mining top-k Frequent Pattern Under Differential Privacy. J. Comput. Res. Dev. 2014, 104–114. [Google Scholar]
- Pedersen, A.B.; Mikkelsen, E.M.; Croninfenton, D.; Kristensen, N.R.; Pham, T.M.; Pedersen, L.; Petersen, I. Missing data and multiple imputation in clinical epidemiological research. Clin. Epidemiol. 2017, 9, 157–166. [Google Scholar] [CrossRef] [PubMed]
- Diao, Y.; Liu, K.Y.; Meng, X.; Ye, X.; He, K. A Big Data Online Cleaning Algorithm Based on Dynamic Outlier Detection. In Proceedings of the International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, China, 12–14 October 2015; pp. 230–234. [Google Scholar]
Attributes | Meanings | Value Types | Abbreviations |
---|---|---|---|
id | staff number | numeric | ID |
name | staff name | text | NM |
age | staff age | numeric | AGE |
excellent_staff | excellent staff status | boolean | ES |
work_seniority | years of work | numeric | WS |
work_place | branch office | text | WP |
marry_status | marry status | boolean | MSs |
monthly_salary | monthly salary | numeric | MSy |
department | work department | text | DM |
ID | NM | AGE | ES | WS | WP | MSs | MSy | DM |
---|---|---|---|---|---|---|---|---|
1 | Kitty | 25 | T | 4 | branch#1 | T | 3825 | security |
2 | Ming | 20 | F | 2 | branch#7 | F | 4718 | manager |
3 | Wong | 18 | T | 0.5 | branch#5 | F | 3264 | personnel |
4 | Lity | 36 | F | 3.5 | branch#1 | F | 7560 | manager |
5 | John | 41 | T | 5 | branch#3 | T | 5300 | communication |
6 | Li Lei | 25 | F | 4 | branch#4 | T | 3750 | sale |
7 | Lucy | 36 | T | 3 | branch#2 | F | 4103 | market |
8 | Xie Feng | 26 | F | 2 | branch#3 | T | 3520 | sale |
9 | Jack | 19 | F | 1.5 | branch#2 | F | 5030 | culture |
10 | Kai | 41 | T | 5 | branch#1 | T | 7263 | manager |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, P.; Dai, C.; Wang, W. Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation. Symmetry 2018, 10, 516. https://doi.org/10.3390/sym10100516
Li P, Dai C, Wang W. Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation. Symmetry. 2018; 10(10):516. https://doi.org/10.3390/sym10100516
Chicago/Turabian StyleLi, Pei, Chaofan Dai, and Wenqian Wang. 2018. "Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation" Symmetry 10, no. 10: 516. https://doi.org/10.3390/sym10100516